Machine Learning in Non-Stationary Environments
Adaptive Computation and Machine Learning
Thomas Dietterich, Editor
Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns,
Associate Editors
A complete list of the books published in this series can be found at the back
of the book.
MACHINE LEARNING IN NON-STATIONARY ENVIRONMENTS
Introduction to Covariate Shift Adaptation
Masashi Sugiyama and Motoaki Kawanabe
The MIT Press
Cambridge, Massachusetts
London, England
© 2012 Massachusetts Institute of Technology
All rights reserved. No part of this book may be reproduced in any form by any electronic
or mechanical means (including photocopying, recording, or information storage and retrieval)
without permission in writing from the publisher.
For information about special quantity discounts, please email
special_sales@mitpress.mit.edu
This book was set in Syntax and Times Roman by Newgen.
Printed and bound in the United States of America.
Library of Congress Cataloging-in-Publication Data
Sugiyama, Masashi, 1974-
Machine learning in non-stationary environments : introduction to covariate shift adaptation /
Masashi Sugiyama and Motoaki Kawanabe.
p. cm. - (Adaptive computation and machine learning series)
Includes bibliographical references and index.
ISBN 978-0-262-01709-1 (hardcover: alk. paper)
1. Machine learning. I. Kawanabe, Motoaki. II. Title.
Q325.5.S845 2012
006.3'1-dc23
2011032824
10 9 8 7 6 5 4 3 2 1
Contents
Foreword
Preface

I INTRODUCTION
1 Introduction and Problem Formulation
1.1 Machine Learning under Covariate Shift
1.2 Quick Tour of Covariate Shift Adaptation
1.3 Problem Formulation
1.3.1 Function Learning from Examples
1.3.2 Loss Functions
1.3.3 Generalization Error
1.3.4 Covariate Shift
1.3.5 Models for Function Learning
1.3.6 Specification of Models
1.4 Structure of This Book
1.4.1 Part II: Learning under Covariate Shift
1.4.2 Part III: Learning Causing Covariate Shift

II LEARNING UNDER COVARIATE SHIFT

2 Function Approximation
2.1 Importance-Weighting Techniques for Covariate Shift Adaptation
2.1.1 Importance-Weighted ERM
2.1.2 Adaptive IWERM
2.1.3 Regularized IWERM
2.2 Examples of Importance-Weighted Regression Methods
2.2.1 Squared Loss: Least-Squares Regression
2.2.2 Absolute Loss: Least-Absolute Regression
2.2.3 Huber Loss: Huber Regression
2.2.4 Deadzone-Linear Loss: Support Vector Regression
2.3 Examples of Importance-Weighted Classification Methods
2.3.1 Squared Loss: Fisher Discriminant Analysis
2.3.2 Logistic Loss: Logistic Regression Classifier
2.3.3 Hinge Loss: Support Vector Machine
2.3.4 Exponential Loss: Boosting
2.4 Numerical Examples
2.4.1 Regression
2.4.2 Classification
2.5 Summary and Discussion
3 Model Selection
3.1 Importance-Weighted Akaike Information Criterion
3.2 Importance-Weighted Subspace Information Criterion
3.2.1 Input Dependence vs. Input Independence in Generalization Error Analysis
3.2.2 Approximately Correct Models
3.2.3 Input-Dependent Analysis of Generalization Error
3.3 Importance-Weighted Cross-Validation
3.4 Numerical Examples
3.4.1 Regression
3.4.2 Classification
3.5 Summary and Discussion
4 Importance Estimation
4.1 Kernel Density Estimation
4.2 Kernel Mean Matching
4.3 Logistic Regression
4.4 Kullback-Leibler Importance Estimation Procedure
4.4.1 Algorithm
4.4.2 Model Selection by Cross-Validation
4.4.3 Basis Function Design
4.5 Least-Squares Importance Fitting
4.5.1 Algorithm
4.5.2 Basis Function Design and Model Selection
4.5.3 Regularization Path Tracking
4.6 Unconstrained Least-Squares Importance Fitting
4.6.1 Algorithm
4.6.2 Analytic Computation of Leave-One-Out Cross-Validation
4.7 Numerical Examples
4.7.1 Setting
4.7.2 Importance Estimation by KLIEP
4.7.3 Covariate Shift Adaptation by IWLS and IWCV
4.8 Experimental Comparison
4.9 Summary
5 Direct Density-Ratio Estimation with Dimensionality Reduction
5.1 Density Difference in Hetero-Distributional Subspace
5.2 Characterization of Hetero-Distributional Subspace
5.3 Identifying Hetero-Distributional Subspace
5.3.1 Basic Idea
5.3.2 Fisher Discriminant Analysis
5.3.3 Local Fisher Discriminant Analysis
5.4 Using LFDA for Finding Hetero-Distributional Subspace
5.5 Density-Ratio Estimation in the Hetero-Distributional Subspace
5.6 Numerical Examples
5.6.1 Illustrative Example
5.6.2 Performance Comparison Using Artificial Data Sets
5.7 Summary
6 Relation to Sample Selection Bias
6.1 Heckman's Sample Selection Model
6.2 Distributional Change and Sample Selection Bias
6.3 The Two-Step Algorithm
6.4 Relation to Covariate Shift Approach
7 Applications of Covariate Shift Adaptation
7.1 Brain-Computer Interface
7.1.1 Background
7.1.2 Experimental Setup
7.1.3 Experimental Results
7.2 Speaker Identification
7.2.1 Background
7.2.2 Formulation
7.2.3 Experimental Results
7.3 Natural Language Processing
7.3.1 Formulation
7.3.2 Experimental Results
7.4 Perceived Age Prediction from Face Images
7.4.1 Background
7.4.2 Formulation
7.4.3 Incorporating Characteristics of Human Age Perception
7.4.4 Experimental Results
7.5 Human Activity Recognition from Accelerometric Data
7.5.1 Background
7.5.2 Importance-Weighted Least-Squares Probabilistic Classifier
7.5.3 Experimental Results
7.6 Sample Reuse in Reinforcement Learning
7.6.1 Markov Decision Problems
7.6.2 Policy Iteration
7.6.3 Value Function Approximation
7.6.4 Sample Reuse by Covariate Shift Adaptation
7.6.5 On-Policy vs. Off-Policy
7.6.6 Importance Weighting in Value Function Approximation
7.6.7 Automatic Selection of the Flattening Parameter
7.6.8 Sample Reuse Policy Iteration
7.6.9 Robot Control Experiments
III LEARNING CAUSING COVARIATE SHIFT
8 Active Learning
8.1 Preliminaries
8.1.1 Setup
8.1.2 Decomposition of Generalization Error
8.1.3 Basic Strategy of Active Learning
8.2 Population-Based Active Learning Methods
8.2.1 Classical Method of Active Learning for Correct Models
8.2.2 Limitations of Classical Approach and Countermeasures
8.2.3 Input-Independent Variance-Only Method
8.2.4 Input-Dependent Variance-Only Method
8.2.5 Input-Independent Bias-and-Variance Approach
8.3 Numerical Examples of Population-Based Active Learning Methods
8.3.1 Setup
8.3.2 Accuracy of Generalization Error Estimation
8.3.3 Obtained Generalization Error
8.4 Pool-Based Active Learning Methods
8.4.1 Classical Active Learning Method for Correct Models and Its Limitations
8.4.2 Input-Independent Variance-Only Method
8.4.3 Input-Dependent Variance-Only Method
8.4.4 Input-Independent Bias-and-Variance Approach
8.5 Numerical Examples of Pool-Based Active Learning Methods
8.6 Summary and Discussion
9 Active Learning with Model Selection
9.1 Direct Approach and the Active Learning/Model Selection Dilemma
9.2 Sequential Approach
9.3 Batch Approach
9.4 Ensemble Active Learning
9.5 Numerical Examples
9.5.1 Setting
9.5.2 Analysis of Batch Approach
9.5.3 Analysis of Sequential Approach
9.5.4 Comparison of Obtained Generalization Error
9.6 Summary and Discussion
10 Applications of Active Learning
10.1 Design of Efficient Exploration Strategies in Reinforcement Learning
10.1.1 Efficient Exploration with Active Learning
10.1.2 Reinforcement Learning Revisited
10.1.3 Decomposition of Generalization Error
10.1.4 Estimating Generalization Error for Active Learning
10.1.5 Designing Sampling Policies
10.1.6 Active Learning in Policy Iteration
10.1.7 Robot Control Experiments
10.2 Wafer Alignment in Semiconductor Exposure Apparatus
IV CONCLUSIONS
11 Conclusions and Future Prospects
11.1 Conclusions
11.2 Future Prospects

Appendix: List of Symbols and Abbreviations
Bibliography
Index
Foreword
Modern machine learning faces a number of grand challenges. The ever-growing World Wide Web, high-throughput methods in genomics, and modern imaging methods in brain science, to name just a few, pose ever larger problems where learning methods need to scale and to increase their efficiency, and algorithms need to become able to deal with million-dimensional inputs at terabytes of data. At the same time, it becomes more and more important to efficiently and robustly model highly complex problems that are structured (e.g., a grammar underlies the data) and exhibit nonlinear behavior. In addition, data from the real world are typically non-stationary, so there is a need to compensate for the non-stationary aspects of the data in order to map the problem back to stationarity. Finally, while machine learning and modern statistics generate a vast number of algorithms that tackle the above challenges, it becomes increasingly important for the practitioner not only to predict and generalize well on unseen data but also to explain the nonlinear predictive learning machine, that is, to harvest the prediction capability for making inferences about the world that will contribute to a better understanding of the sciences.
The present book contributes to one aspect of the above-mentioned grand challenges: namely, it addresses the world of non-stationary data. Classically, learning always assumes that the underlying probability distribution of the data from which inference is made stays the same. In other words, it is understood that there is no change in distribution between the sample from which we learn and the novel (unseen) out-of-sample data. In many practical settings this assumption is incorrect, and thus standard prediction will likely be suboptimal. The present book very successfully assembles the state-of-the-art research results on learning in non-stationary environments, with a focus on the covariate shift model, and embeds this body of work into the general literature from machine learning (semisupervised learning, online learning, transductive learning, domain adaptation) and statistics (sample selection bias). It will be an excellent starting point for future research in machine learning, statistics, and engineering that strives for truly autonomous learning machines that are able to learn under non-stationarity.
Klaus-Robert Müller
Machine Learning Laboratory, Computer Science Department
Technische Universität Berlin, Germany
Preface
In the twenty-first century, the theory and practical algorithms of machine learning have been studied extensively, alongside a rapid growth of computing power and the spread of the Internet. These machine learning methods are usually based on the presupposition that the data generation mechanism does not change over time. However, modern real-world applications of machine learning such as image recognition, natural language processing, speech recognition, robot control, bioinformatics, computational chemistry, and brain signal analysis often violate this important presupposition, raising a challenge in the machine learning and statistics communities.
To cope with this non-stationarity problem, various approaches have been
investigated in machine learning and related research fields. They are called
by various names, such as covariate shift adaptation, sample selection bias,
semisupervised learning, transfer learning, and domain adaptation. In this
book, we consistently use the term covariate shift adaptation, and cover issues
including theory and algorithms of function approximation, model selection,
active learning, and real-world applications.
We were motivated to write the present book when we held a seminar on machine learning in a non-stationary environment at the Mathematisches Forschungsinstitut Oberwolfach (MFO) in Germany in 2008, together with Prof. Dr. Klaus-Robert Müller and Mr. Paul von Bünau of the Technische Universität Berlin. We thank them for their constant support and their encouragement to finish this book.
Most of this book is based on the journal and conference papers we have published since 2005. We acknowledge all the collaborators for their fruitful discussions: Takayuki Akiyama, Hirotaka Hachiya, Shohei Hido, Tsuyoshi Ide, Yasuyuki Ihara, Takafumi Kanamori, Hisashi Kashima, Matthias Krauledat, Shin-ichi Nakajima, Hidemitsu Ogawa, Jun Sese, Taiji Suzuki, Ichiro Takeuchi, Yuta Tsuboi, Kazuya Ueki, and Makoto Yamada. Finally, we thank the Ministry of Education, Culture, Sports, Science and Technology in Japan, the Alexander von Humboldt Foundation in Germany, the Okawa Foundation, the Microsoft Institute for Japanese Academic Research Collaboration's Collaborative Research Project, the IBM Faculty Award, the Mathematisches Forschungsinstitut Oberwolfach Research-in-Pairs Program, the Asian Office of Aerospace Research and Development, the Support Center for Advanced Telecommunications Technology Research Foundation, the Japan Society for the Promotion of Science, the Funding Program for World-Leading Innovative R&D on Science and Technology, the Federal Ministry of Economics and Technology, Germany, and the IST Program of the European Community under the PASCAL2 Network of Excellence for their financial support.
Picture taken at the Mathematisches Forschungsinstitut Oberwolfach (MFO) in October 2008. From right to left: Masashi Sugiyama, Prof. Dr. Klaus-Robert Müller, Motoaki Kawanabe, and Mr. Paul von Bünau. Photo courtesy of the Archives of the Mathematisches Forschungsinstitut Oberwolfach.
I INTRODUCTION
1 Introduction and Problem Formulation
In this chapter, we provide an introduction to covariate shift adaptation toward
machine learning in a non-stationary environment.
1.1 Machine Learning under Covariate Shift
Machine learning is an interdisciplinary field of science and engineering that studies mathematical foundations and practical applications of systems that learn. Depending on the type of learning, paradigms of machine learning can be categorized into three types:
• Supervised learning  The goal of supervised learning is to infer an underlying input-output relation based on input-output samples. Once the underlying relation can be successfully learned, output values for unseen input points can be predicted. Thus, the learning machine can generalize to unexperienced situations. Studies of supervised learning are aimed at letting the learning machine acquire the best generalization performance from a small number of training samples. The supervised learning problem can be formulated as a function approximation problem from samples.
• Unsupervised learning In contrast to supervised learning, output values
are not provided as training samples in unsupervised learning. The general
goal of unsupervised learning is to extract valuable information hidden behind
data. However, its specific goal depends heavily on the situation, and the
unsupervised learning problem is sometimes not mathematically well-defined.
Data clustering aimed at grouping similar data is a typical example. In data
clustering, how to measure the similarity between data samples needs to
be predetermined, but there is no objective criterion that can quantitatively
evaluate the validity of the affinity measure; often it is merely subjectively
determined.
• Reinforcement learning  The goal of reinforcement learning is to acquire a policy function (a mapping from a state to an action) of a computer agent. The policy function is an input-output relation, so the goal of reinforcement learning is the same as that of supervised learning. However, unlike supervised learning, the output data cannot be observed directly. Therefore, the policy function needs to be learned without supervisors. However, in contrast to unsupervised learning, rewards are provided as training samples for an agent's action. Based on the reward information, reinforcement learning tries to learn the policy function in such a way that the sum of rewards the agent will receive in the future is maximized.
The purpose of this book is to provide a comprehensive overview of theory,
algorithms, and applications of supervised learning under the situation called
covariate shift.
When developing methods of supervised learning, it is commonly assumed that samples used as a training set and data points used for testing the generalization performance¹ follow the same probability distribution (e.g., [195, 20, 193, 42, 74, 141]). However, this common assumption is not fulfilled in recent real-world applications of machine learning such as robot control, brain signal analysis, and bioinformatics. Thus, there is a strong need for theories and algorithms of supervised learning under such a changing environment. However, if there is no connection between training data and test data, nothing about the test data can be learned from training samples. This means that a reasonable assumption is necessary for relating training samples to test data.
Covariate shift is one of the assumptions in supervised learning. The situation where the training input points and test input points follow different probability distributions, but the conditional distributions of output values given input points are unchanged, is called covariate shift [145]. This means that the target function we want to learn is unchanged between the training phase and the test phase, but the distributions of input points are different for training and test data.
A situation of supervised learning where input-only samples are available in addition to input-output samples is called semisupervised learning [30]. The covariate shift adaptation techniques covered in this book fall into the category of semisupervised learning, since input-only samples drawn from the test distribution are utilized for improving the generalization performance under covariate shift.
1. Such test points are not available during the training phase; they are given in the future after
training has been completed.
1.2 Quick Tour of Covariate Shift Adaptation
Before going into the technical detail, in this section we briefly describe the core idea of covariate shift adaptation, using an illustrative example. To this end, let us consider a regression problem of learning a function f(x) from its samples {(x_i^tr, y_i^tr)}_{i=1}^{n_tr}. Once a good approximation function f̂(x) is obtained, we can predict the output value y^te at an unseen test input point x^te by means of f̂(x^te).
Let us consider a covariate shift situation where the training and test input points follow different probability distributions, but the learning target function f(x) is common to both training and test samples. In the toy regression example illustrated in figure 1.1a, training samples are located on the left-hand side of the graph and test samples are distributed on the right-hand side. Thus, this is an extrapolation problem where the test samples are located outside the training region. Note that the test samples are not given to us in the training phase; they are plotted in the graph only for illustration purposes. The probability densities of the training and test input points, p_tr(x) and p_te(x), are plotted in figure 1.1b.

Figure 1.1
A regression example with covariate shift. (a) The learning target function f(x) (the solid line), training samples (o), and test samples (×). (b) Probability density functions of training and test input points and their ratio. (c) Learned function f̂(x) (the dashed line) obtained by ordinary least squares. (d) Learned function f̂(x) (the dashed-dotted line) obtained by importance-weighted least squares. Note that the test samples are not used for function learning.
Let us consider straight-line function fitting by the method of least squares:

  min_θ Σ_{i=1}^{n_tr} ( f̂(x_i^tr; θ) − y_i^tr )²,

where

  f̂(x; θ) = θ₁ + θ₂x.

This ordinary least squares gives a function that goes through the training samples well, as illustrated in figure 1.1c. However, the function learned by least squares is not useful for predicting the output values of the test samples located on the right-hand side of the graph.
Intuitively, training samples that are far from the test region (say, training samples with x < 1 in figure 1.1a) are less informative for predicting the output values of the test samples located on the right-hand side of the graph. This suggests ignoring such less informative training samples and learning only from the training samples that are close to the test region (say, training samples with x > 1.2 in figure 1.1a). The key idea of covariate shift adaptation is to (softly) choose informative training samples in a systematic way, by considering the importance of each training sample in the prediction of test output values. More specifically, we use the ratio of the test and training input densities (see figure 1.1b),

  p_te(x_i^tr) / p_tr(x_i^tr),

as a weight for the i-th training sample in the least-squares fitting:

  min_θ Σ_{i=1}^{n_tr} [ p_te(x_i^tr) / p_tr(x_i^tr) ] ( f̂(x_i^tr; θ) − y_i^tr )².

Then we can obtain a function that extrapolates well to the test samples (see figure 1.1d). Note that the test samples are not used for obtaining this function. In this example, the training samples located on the left-hand side of the graph (say, x < 1.2) have almost zero importance (see figure 1.1b). Thus, these samples are essentially ignored in the above importance-weighted least-squares method, and informative samples in the middle of the graph are automatically selected by importance weighting.
As illustrated above, importance weights play an essential role in covariate
shift adaptation. Below, the problem of covariate shift adaptation is formulated
more formally.
1.3 Problem Formulation
In this section, we formulate the supervised learning problem, which includes regression and classification. We pay particular attention to covariate shift and model misspecification; these two issues play central roles in the following chapters.
1.3.1 Function Learning from Examples
Let us consider the supervised learning problem of estimating an unknown input-output dependency from training samples. Let

  {(x_i^tr, y_i^tr)}_{i=1}^{n_tr}

be the training samples, where the training input point

  x_i^tr ∈ X ⊂ ℝ^d

is an independent and identically distributed (i.i.d.) sample following a probability distribution P_tr(x) with density p_tr(x):

  {x_i^tr}_{i=1}^{n_tr} ~ i.i.d. P_tr(x).

The training output value

  y_i^tr ∈ Y ⊂ ℝ, i = 1, 2, ..., n_tr,

follows a conditional probability distribution P(y|x) with conditional density p(y|x). P(y|x) may be regarded as the superposition of the true output f(x) and noise ε:

  y = f(x) + ε.
We assume that the noise ε has mean 0 and variance σ². Then the function f(x) coincides with the conditional mean of y given x.

The above formulation is summarized in figure 1.2.
1.3.2 Loss Functions

Let loss(x, y, ŷ) be the loss function which measures the discrepancy between the true output value y at an input point x and its estimate ŷ. In regression scenarios, where Y is continuous, the squared loss is often used:

  loss(x, y, ŷ) = (ŷ − y)².

On the other hand, in binary classification scenarios, where Y = {+1, −1}, the following 0/1-loss is a typical choice, since it corresponds to the misclassification rate:

  loss(x, y, ŷ) = 0 if sgn(ŷ) = y, and 1 otherwise,

where sgn(ŷ) denotes the sign of ŷ:

  sgn(ŷ) := +1 if ŷ > 0, 0 if ŷ = 0, and −1 if ŷ < 0.

Although the above loss functions are independent of x, the loss can generally depend on x [141].
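As a concrete reference, the two losses above can be written directly in Python (a small sketch; the function names are ours):

```python
import numpy as np

def squared_loss(x, y, y_hat):
    # Squared loss: (y_hat - y)^2; x is unused here, but kept in the
    # signature because a loss may in general depend on x
    return (y_hat - y) ** 2

def zero_one_loss(x, y, y_hat):
    # 0/1-loss for labels y in {+1, -1}: 1 iff sgn(y_hat) != y
    return 0.0 if np.sign(y_hat) == y else 1.0
```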
Figure 1.2
Framework of supervised learning.
1.3.3 Generalization Error

Let us consider a test sample (x^te, y^te), which is not given in the training phase but will be given in the test phase. x^te ∈ X is a test input point following a test distribution P_te(x) with density p_te(x), and y^te ∈ Y is a test output value following the conditional distribution P(y|x = x^te) with conditional density p(y|x = x^te). Note that the conditional distribution is common to both training and test samples. The test error expected over all test samples (or the generalization error) is expressed as

  𝔼_{x^te} 𝔼_{y^te} [ loss(x^te, y^te, f̂(x^te)) ],

where 𝔼_{x^te} denotes the expectation over x^te drawn from P_te(x) and 𝔼_{y^te} denotes the expectation over y^te drawn from P(y|x = x^te). The goal of supervised learning is to determine the value of the parameter θ so that the generalization error is minimized, that is, so that output values for unseen test input points can be accurately estimated in terms of the expected loss.
1.3.4 Covariate Shift

In standard supervised learning theories (e.g., [195, 20, 193, 42, 74, 141, 21]), the test input distribution P_te(x) is assumed to agree with the training input distribution P_tr(x). However, in this book we consider the situation under covariate shift [145], that is, the test input distribution P_te(x) and the training input distribution P_tr(x) are generally different:

  P_tr(x) ≠ P_te(x).

Under covariate shift, most of the standard machine learning techniques do not work properly due to the differing distributions. The main goal of this book is to provide machine learning methods that can mitigate the influence of covariate shift.

In the following chapters, we assume that the ratio of the test to training input densities is bounded, that is,

  p_te(x) / p_tr(x) < ∞ for all x ∈ X.

This means that the support of the test input distribution must be contained in that of the training input distribution. The above ratio is called the importance [51], and it plays a central role in covariate shift adaptation.
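Since the true importance is rarely known, it must be estimated from input-only samples; chapter 4 is devoted to this, and one of the methods treated there (section 4.3) reduces density-ratio estimation to discriminating training inputs from test inputs with a probabilistic classifier. The sketch below illustrates that reduction in plain NumPy; the Gaussian densities, the quadratic feature map, and the gradient-descent settings are all arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(1)

# Input-only samples; the densities are our own toy choices
x_tr = rng.normal(0.0, 1.0, size=500)   # training inputs ~ p_tr
x_te = rng.normal(1.0, 0.5, size=500)   # test inputs ~ p_te

# Label training inputs 0 and test inputs 1, then fit a probabilistic
# classifier; Bayes' rule gives
#   p_te(x)/p_tr(x) = (n_tr/n_te) * P(test|x) / P(train|x)
X = np.concatenate([x_tr, x_te])
t = np.concatenate([np.zeros_like(x_tr), np.ones_like(x_te)])

# Logistic regression on features (1, x, x^2) via batch gradient descent;
# the quadratic feature can represent a ratio of two Gaussians exactly
Phi = np.stack([np.ones_like(X), X, X ** 2], axis=1)
w = np.zeros(3)
for _ in range(20000):
    p = 1.0 / (1.0 + np.exp(-Phi @ w))
    w -= 0.1 * Phi.T @ (p - t) / len(t)

def importance(x):
    log_odds = np.array([1.0, x, x ** 2]) @ w   # log P(test|x)/P(train|x)
    return (len(x_tr) / len(x_te)) * np.exp(log_odds)
```

In this toy setup the estimated importance is large where the test density dominates (around x = 1) and nearly zero far to the left, exactly the weighting behavior used in section 1.2.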
1.3.5 Models for Function Learning

Let us employ a parameterized function f̂(x; θ) for estimating the output value y, where

  θ := (θ₁, θ₂, ..., θ_b)ᵀ ∈ Θ.

Here, ᵀ denotes the transpose of a vector or a matrix, and Θ denotes the domain of the parameter θ.
1.3.5.1 Linear-in-Input Model  The simplest choice of parametric model would be the linear-in-input model:

  f̂(x; θ) = θ₁x^(1) + θ₂x^(2) + ··· + θ_d x^(d) + θ_{d+1},  (1.1)

where

  x = (x^(1), x^(2), ..., x^(d))ᵀ.

This model is linear in both the input variable x and the parameter θ, and the number b of parameters is d + 1, where d is the dimensionality of x. The linear-in-input model can represent only a linear input-output relation, so its expressibility is limited. However, since the effect of each input variable x^(k) can be read directly from the parameter θ_k, it has high interpretability. For this reason, this simple model is still often used in many practical data analysis tasks such as natural language processing, bioinformatics, and computational chemistry.
1.3.5.2 Linear-in-Parameter Model  A slight extension of the linear-in-input model is the linear-in-parameter model:

  f̂(x; θ) = Σ_{ℓ=1}^{b} θ_ℓ φ_ℓ(x),  (1.2)

where {φ_ℓ(x)}_{ℓ=1}^{b} are fixed, linearly independent functions. This model is linear in the parameter θ, and we often refer to it as the linear model. Popular choices of basis functions include polynomials and trigonometric polynomials.

When the input dimensionality is d = 1, the polynomial basis functions are given by

  {φ_ℓ(x)}_{ℓ=1}^{b} = {1, x, x², ..., x^c},

where b = c + 1. The trigonometric polynomial basis functions are given by

  {φ_ℓ(x)}_{ℓ=1}^{b} = {1, sin x, cos x, sin 2x, cos 2x, ..., sin cx, cos cx},

where b = 2c + 1.
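For instance, with the polynomial basis and d = 1, fitting a linear-in-parameter model amounts to ordinary least squares on the design matrix Φ with entries Φ_{iℓ} = φ_ℓ(x_i); the target function and sample size below are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(2)

c = 3                                             # polynomial order, so b = c + 1
x = rng.uniform(-1, 1, size=100)
y = np.cos(2 * x) + 0.05 * rng.normal(size=100)   # arbitrary smooth target

# Design matrix for the basis functions {1, x, x^2, x^3}
Phi = np.stack([x ** e for e in range(c + 1)], axis=1)
theta = np.linalg.lstsq(Phi, y, rcond=None)[0]    # b = c + 1 parameters

def f_hat(x_new):
    return sum(theta[e] * x_new ** e for e in range(c + 1))
```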
For multidimensional cases, basis functions are often built by combining one-dimensional basis functions. Popular choices include the additive model and the multiplicative model. The additive model is given by

  f̂(x; θ) = Σ_{k=1}^{d} Σ_{ℓ=1}^{c} θ_{k,ℓ} φ_ℓ(x^(k)).

Thus, a one-dimensional model for each dimension is combined with the others in an additive manner (figure 1.3a). The number of parameters in the additive model is

  b = cd.

The multiplicative model is given by

  f̂(x; θ) = Σ_{ℓ₁,ℓ₂,...,ℓ_d=1}^{c} θ_{ℓ₁,ℓ₂,...,ℓ_d} Π_{k=1}^{d} φ_{ℓ_k}(x^(k)).

Thus, a one-dimensional model for each dimension is combined with the others in a multiplicative manner (figure 1.3b). The number of parameters in the multiplicative model is

  b = c^d.
Figure 1.3
Examples of an additive model f(x) = (x^(1))² − x^(2) and of a multiplicative model f(x) = −x^(1)x^(2) + x^(1)(x^(2))².
In general, the multiplicative model can represent more complex functions than the additive model (see figure 1.3). However, the multiplicative model contains exponentially many parameters with respect to the input dimensionality d; such a phenomenon is often referred to as the curse of dimensionality [12]. Thus, the multiplicative model is not tractable in high-dimensional problems. On the other hand, the number of parameters in the additive model increases only linearly with respect to the input dimensionality d, which is preferable in high-dimensional cases [71].
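The parameter counts above can be made concrete with a small sketch that builds both feature maps from a one-dimensional polynomial basis (all names and the choice c = 4, d = 3 are ours):

```python
import numpy as np
from itertools import product

c, d = 4, 3  # c one-dimensional basis functions, d input dimensions

def phi(e, u):
    # One-dimensional polynomial basis: phi_e(u) = u**(e-1), e = 1..c
    return u ** (e - 1)

def additive_features(x):
    # One parameter theta_{k,l} per (dimension, basis) pair: b = c*d
    return np.array([phi(e, x[k])
                     for k in range(d) for e in range(1, c + 1)])

def multiplicative_features(x):
    # One parameter per index tuple (l_1, ..., l_d): b = c**d,
    # which grows exponentially with d (curse of dimensionality)
    return np.array([np.prod([phi(es[k], x[k]) for k in range(d)])
                     for es in product(range(1, c + 1), repeat=d)])
```

For c = 4 and d = 3 the additive map needs only 12 parameters, while the multiplicative map needs 64; at d = 10 the counts would be 40 versus about a million.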
1.3.5.3 Kernel Model  The number of parameters in the linear-in-parameter model is related to the input dimensionality d. Another means of determining the number of parameters is to relate it to the number of training samples, n_tr. The kernel model follows this idea, and is defined by

  f̂(x; θ) = Σ_{ℓ=1}^{n_tr} θ_ℓ K(x, x_ℓ^tr),

where K(·, ·) is a kernel function. The Gaussian kernel would be a typical choice (see figure 1.4):

  K(x, x′) = exp( −‖x − x′‖² / (2h²) ),  (1.3)

where h (> 0) controls the width of the Gaussian function.

Figure 1.4
Gaussian functions (equation 1.3) centered at the origin with width h.

In the kernel model, the number b of parameters is set to n_tr, which is independent of the input dimensionality d. For this reason, the kernel model is often preferred in high-dimensional problems. The kernel model is still linear in parameters, so it is a kind of linear-in-parameter model; indeed, letting b = n_tr and φ_ℓ(x) = K(x, x_ℓ^tr) in the linear-in-parameter model (equation 1.2) yields the kernel model. Thus, many learning algorithms explained in this book could be applied to both models in the same way.
However, when we discuss convergence properties of the learned function f̂(x; θ̂) as the number of training samples is increased to infinity, the kernel model should be treated differently from the linear-in-parameter model, because the number of parameters increases as the number of training samples grows. In such a case, standard asymptotic analysis tools such as the Cramér-Rao paradigm are not applicable. For this reason, statisticians place the linear-in-parameter model and the kernel model in different classes: the linear-in-parameter model is categorized as a parametric model, whereas the kernel model is categorized as a nonparametric model. Analysis of the asymptotic behavior of nonparametric models is generally more difficult than that of parametric models, and highly sophisticated mathematical tools are needed (see, e.g., [191, 192, 69]).

A practical compromise would be to use a fixed number of kernel functions, that is, for fixed b,

  f̂(x; θ) = Σ_{ℓ=1}^{b} θ_ℓ K(x, c_ℓ),

where the template points {c_ℓ}_{ℓ=1}^{b} are, for example, chosen randomly from the domain or from the training input points {x_i^tr}_{i=1}^{n_tr} without replacement.
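The kernel model is easy to realize in code. The following sketch fits a Gaussian kernel model by regularized least squares, with one kernel centered at each training input (b = n_tr); the target function, the bandwidth h = 1, and the tiny ridge term for numerical stability are all choices of ours, not prescriptions of the book:

```python
import numpy as np

rng = np.random.default_rng(3)

# Training data from an arbitrary smooth target
x_tr = rng.uniform(-3, 3, size=50)
y_tr = np.sin(x_tr) + 0.1 * rng.normal(size=50)

h = 1.0  # width of the Gaussian kernel (equation 1.3)

def K(x, xp):
    return np.exp(-(x - xp) ** 2 / (2 * h ** 2))

# Kernel model f_hat(x; theta) = sum_l theta_l K(x, x_l^tr), fit by
# least squares with a small ridge term so the system is well conditioned
Kmat = K(x_tr[:, None], x_tr[None, :])
theta = np.linalg.solve(Kmat.T @ Kmat + 1e-3 * np.eye(len(x_tr)),
                        Kmat.T @ y_tr)

def f_hat(x):
    return K(x, x_tr) @ theta
```

Replacing x_tr in the centers with a fixed set {c_ℓ} of b template points gives the practical compromise described above.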
1.3.6 Specification of Models

A model f̂(x; θ) is said to be correctly specified if there exists a parameter θ* such that

  f̂(x; θ*) = f(x).

Otherwise, the model is said to be misspecified. In practice, the model used for learning is likely to be misspecified to a greater or lesser extent, since we do not generally have strong enough prior knowledge to specify the model correctly. Thus, it is important to consider misspecified models when developing machine learning algorithms.

On the other hand, it is meaningless to discuss properties of learning algorithms if the model is totally misspecified; for example, approximating a highly nonlinearly fluctuating function by a straight line does not provide meaningful prediction (figure 1.5). Thus, we effectively consider the situation where the model at hand is not correctly specified but is approximately correct.
Figure 1.5: Approximating a highly nonlinear function f(x) by the linear-in-input model f̂(x), which is totally misspecified.
This approximate correctness plays an important role when designing model
selection algorithms (chapter 3) and active learning algorithms (chapter 8).
1.4 Structure of This Book
This book covers issues related to covariate shift problems, from fundamental
learning algorithms to state-of-the-art applications.
Figure 1.6 summarizes the structure of the chapters.
1.4.1 Part II: Learning under Covariate Shift
In part II, topics on learning under covariate shift are covered.
In chapter 2, function learning methods under covariate shift are introduced.
Ordinary empirical risk minimization learning is not consistent under covariate
shift for misspecified models, and this inconsistency issue can be resolved
by considering importance-weighted loss functions. Here, various importance-
weighted empirical risk minimization methods are introduced, including least
squares and Huber's method for regression, and Fisher discriminant analysis,
logistic regression, support vector machines, and boosting for classification.
Their adaptive and regularized variants are also introduced. The numerical
behavior of these importance-weighted learning methods is illustrated through
experiments.
In chapter 3, the problem of model selection is addressed. Success of
machine learning techniques depends heavily on the choice of hyperparameters
such as basis functions, the kernel bandwidth, the regularization
parameter, and the importance-flattening parameter. Thus, model selection is one of
the most fundamental and crucial topics in machine learning. Standard model
selection schemes such as the Akaike information criterion, cross-validation,
and the subspace information criterion have their own theoretical justification
Figure 1.6: Structure of this book. Part I (Introduction) leads into Part II, Learning Under Covariate Shift (chapter 2, Function Approximation; chapter 3, Model Selection; chapter 4, Importance Estimation; chapter 5, Direct Density-Ratio Estimation with Dimensionality Reduction; chapter 6, Relation to Sample Selection Bias; chapter 7, Applications of Covariate Shift Adaptation), and Part III, Learning Causing Covariate Shift (chapter 8, Active Learning), followed by the conclusions (chapter 11, Conclusions and Future Prospects).
in terms of their unbiasedness as estimators of the generalization error. However, such
theoretical guarantees are no longer valid under covariate shift. In this chapter,
modified variants of these criteria based on importance-weighting techniques are
introduced, and the modified methods are shown to remain properly unbiased even
under covariate shift. The usefulness of these modified model selection criteria
is illustrated through numerical experiments.
In chapter 4, the problem of importance estimation is addressed. As shown
in the preceding chapters, importance-weighting techniques play essential
roles in covariate shift adaptation. However, the importance values are usually
unknown a priori, so they must be estimated from data samples. In this chapter,
importance estimation methods are introduced, including importance estima­
tion via kernel density estimation, the kernel mean matching method, a logistic
regression approach, the Kullback-Leibler importance estimation procedure,
and the least-squares importance fitting methods. The latter methods allow
one to estimate the importance weights without going through density
estimation. Since density estimation is known to be difficult, the direct
importance estimation approaches would be more accurate and preferable in practice.
The numerical behavior of direct importance estimation methods is illustrated
through experiments. Characteristics of importance estimation methods are
also discussed.
In chapter 5, a dimensionality reduction scheme for density-ratio estimation,
called direct density-ratio estimation with dimensionality reduction (D3,
pronounced "D-cube"), is introduced. The basic idea of D3 is to find a
low-dimensional subspace in which training and test densities are significantly
different, and estimate the density ratio only in this subspace. A supervised
dimensionality reduction technique called local Fisher discriminant analysis
(LFDA) is employed for identifying such a subspace. The usefulness of the D3
approach is illustrated through numerical experiments.
In chapter 6, the covariate shift approach is compared with related formula­
tions called sample selection bias. Studies of correcting sample selection bias
were initiated by Heckman [77,76], who received the Nobel Prize in economics
for this achievement in 2000. We give a comprehensive review of Heckman's
correction model, and discuss its relation to covariate shift adaptation.
In chapter 7, state-of-the-art applications of covariate shift adaptation tech­
niques to various real-world problems are described. This chapter includes
non-stationarity adaptation in brain-computer interfaces, speaker identifica­
tion through change in voice quality, domain adaptation in natural language
processing, age prediction from face images under changing illumination con­
ditions, user adaptation in human activity recognition, and efficient sample
reuse in autonomous robot control.
1.4.2 Part III: Learning Causing Covariate Shift
In part III, we discuss the situation where covariate shift is intentionally caused
by users in order to improve generalization ability.
In chapter 8, the problem of active learning is addressed. The goal of active
learning is to find the most "informative" training input points so that learning
can be successfully achieved from only a small number of training samples.
Active learning is particularly useful when the cost of data sampling is
expensive. In the active learning scenario, covariate shift (a mismatch of training
and test input distributions) occurs naturally, since the training input
distribution is designed by users while the test input distribution is determined
by the environment. Thus, covariate shift is inevitable in active learning. In
this chapter, active learning methods for regression are introduced in light of
covariate shift. Their mutual relation and numerical examples are also shown.
Furthermore, these active learning methods are extended to pool-based
scenarios, where a set of input-only samples is provided in advance and users
choose promising input points from the pool at which to observe output values.
In chapter 9, the problem of active learning with model selection is
addressed. As explained in the previous chapters, model selection and active
learning are two important challenges for successful learning. A natural desire
is to perform model selection and active learning at the same time, that is,
we want to choose the best model and the best training input points. How­
ever, this is actually a chicken-and-egg problem since training input samples
should have been fixed for performing model selection and models should
have been fixed for performing active learning. In this chapter, several compro­
mise approaches, such as the sequential approach, the batch approach, and the
ensemble approach, are discussed. Then, through numerical examples, limita­
tions of the sequential and batch approaches are pointed out, and the usefulness
of the ensemble active learning approach is demonstrated.
In chapter 10, applications of active learning techniques to real-world prob­
lems are shown. This chapter includes efficient exploration for autonomous
robot control and efficient sensor design in semiconductor wafer alignment.
II
LEARNING UNDER COVARIATE SHIFT
2 Function Approximation
In this chapter, we introduce learning methods that can cope with covariate
shift.
We employ a parameterized function f̂(x; θ) for approximating a target
function f(x) from training samples {(x_i^tr, y_i^tr)}_{i=1}^{n_tr} (see section 1.3.5).
A standard method to learn the parameter θ would be empirical risk
minimization (ERM) (e.g., [193, 141]):

\[ \widehat{\theta}_{\mathrm{ERM}} := \mathop{\mathrm{argmin}}_{\theta} \left[ \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \mathrm{loss}\bigl(x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}}, \widehat{f}(x_i^{\mathrm{tr}}; \theta)\bigr) \right], \]
where loss(x, y, ŷ) is a loss function (see section 1.3.2). If p_tr(x) = p_te(x),
θ̂_ERM is known to be consistent¹ [145]. Under covariate shift, where p_tr(x) ≠
p_te(x), the situation differs: ERM still gives a consistent estimator if
the model is correctly specified, but it is no longer consistent if the model is
misspecified [145]:

\[ \mathop{\mathrm{plim}}_{n_{\mathrm{tr}} \to \infty} \widehat{\theta}_{\mathrm{ERM}} \neq \theta^*, \]

where "plim" denotes convergence in probability, and θ* is the optimal
parameter in the model:

\[ \theta^* := \mathop{\mathrm{argmin}}_{\theta} \mathrm{Gen}. \]
1. For correctly specified models, an estimator is said to be consistent if it converges to the true
parameter in probability. For misspecified models, we use the term "consistency" for convergence
to the optimal parameter in the model (i.e., the optimal approximation to the learning target
function in the model under the generalization error Gen).
Gen is the generalization error, defined as

\[ \mathrm{Gen} := \mathbb{E}_{x^{\mathrm{te}}} \mathbb{E}_{y^{\mathrm{te}}} \Bigl[ \mathrm{loss}\bigl(x^{\mathrm{te}}, y^{\mathrm{te}}, \widehat{f}(x^{\mathrm{te}}; \widehat{\theta})\bigr) \Bigr], \]

where E_{x^te} denotes the expectation over x^te drawn from the test input
distribution p_te(x), and E_{y^te} denotes the expectation over y^te drawn from the
conditional distribution p(y | x = x^te).
This chapter is devoted to introducing various techniques of covariate shift
adaptation in function learning.
2.1 Importance-Weighting Techniques for Covariate Shift Adaptation
In this section, we show how the inconsistency of ERM can be overcome.
2.1.1 Importance-Weighted ERM
The failure of the ERM method comes from the fact that the training input
distribution is different from the test input distribution. Importance sampling
(e.g., [51]) is a standard technique to compensate for the difference of
distributions. The following identity shows the essential idea of importance sampling.
For a function g,

\[ \mathbb{E}_{x^{\mathrm{te}}}\bigl[ g(x^{\mathrm{te}}) \bigr] = \mathbb{E}_{x^{\mathrm{tr}}}\left[ \frac{p_{\mathrm{te}}(x^{\mathrm{tr}})}{p_{\mathrm{tr}}(x^{\mathrm{tr}})} \, g(x^{\mathrm{tr}}) \right], \]

where E_{x^tr} and E_{x^te} denote the expectation over x drawn from p_tr(x) and p_te(x),
respectively. The quantity

\[ w(x) := \frac{p_{\mathrm{te}}(x)}{p_{\mathrm{tr}}(x)} \]

is called the importance. The above identity shows that the expectation of a
function g over x^te can be computed by the importance-weighted expectation
of the function over x^tr. Thus, the difference of distributions can be
systematically adjusted by importance weighting. This is the key equation that plays a
central role in covariate shift adaptation throughout this book.
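The identity above is easy to verify by Monte Carlo. The sketch below uses Gaussian training and test input densities chosen purely for illustration (they are not from the text), and checks that the importance-weighted training-sample average matches the plain test-sample average.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tr(x):  # training input density: N(0, 1)
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def p_te(x):  # test input density: N(0.5, 1)
    return np.exp(-(x - 0.5)**2 / 2) / np.sqrt(2 * np.pi)

g = lambda x: x**2  # an arbitrary test function

x_tr = rng.normal(0.0, 1.0, 200000)   # samples from p_tr
x_te = rng.normal(0.5, 1.0, 200000)   # samples from p_te

w = p_te(x_tr) / p_tr(x_tr)           # importance weights w(x) = p_te(x)/p_tr(x)
lhs = np.mean(g(x_te))                 # E_{x ~ p_te}[g(x)]
rhs = np.mean(w * g(x_tr))             # E_{x ~ p_tr}[w(x) g(x)]
print(abs(lhs - rhs))                  # small: the two estimates agree
```

Both averages converge to E[x²] under N(0.5, 1), which is 0.5² + 1 = 1.25.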
Under covariate shift, importance-weighted ERM (IWERM),

\[ \widehat{\theta}_{\mathrm{IWERM}} := \mathop{\mathrm{argmin}}_{\theta} \left[ \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \, \mathrm{loss}\bigl(x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}}, \widehat{f}(x_i^{\mathrm{tr}}; \theta)\bigr) \right], \]

is shown to be consistent even for misspecified models [145], that is, it satisfies

\[ \mathop{\mathrm{plim}}_{n_{\mathrm{tr}} \to \infty} \widehat{\theta}_{\mathrm{IWERM}} = \theta^*. \]
2.1.2 Adaptive IWERM
As shown above, IWERM gives a consistent estimator. However, it can also
produce an unstable estimator, and therefore IWERM may not be the best
possible method for finite samples [145]; in practice, a slightly stabilized variant
of IWERM would be preferable, that is, one obtained by slightly "flattening"
the importance weight in IWERM. We call this variant adaptive IWERM
(AIWERM):

\[ \widehat{\theta}_{\gamma} := \mathop{\mathrm{argmin}}_{\theta} \left[ \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\gamma} \mathrm{loss}\bigl(x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}}, \widehat{f}(x_i^{\mathrm{tr}}; \theta)\bigr) \right], \tag{2.1} \]

where γ (0 ≤ γ ≤ 1) is called the flattening parameter.
The flattening parameter controls the stability and consistency of the estimator:
γ = 0 corresponds to ordinary ERM (the uniform weight, which yields
a stable but inconsistent estimator), and γ = 1 corresponds to IWERM (the
importance weight, which yields a consistent but unstable estimator). An
intermediate value of γ would provide the optimal control of the trade-off between
stability and consistency (which is also known as the bias-variance trade-off).
A good choice of γ would roughly depend on the number n_tr of training
samples. When n_tr is large, bias usually dominates variance, and thus a smaller-bias
estimator obtained with a large γ is preferable. On the other hand, when n_tr is
small, variance generally dominates bias, and hence a smaller-variance estimator
obtained with a small γ is appropriate. However, a good choice of γ may also
depend on many unknown factors, such as the learning target function and the
noise level. Thus, the flattening parameter γ should be determined carefully
by a reliable model selection method (see chapter 3).
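Numerically, flattening simply raises each importance value to the power γ, pulling the weights toward uniformity; a tiny illustration with made-up importance values:

```python
import numpy as np

# gamma = 0 recovers uniform weights (ERM); gamma = 1 recovers the full
# importance weights (IWERM). The weight values are made up for illustration.
w = np.array([0.2, 1.0, 5.0])   # hypothetical importance values p_te/p_tr
for gamma in (0.0, 0.5, 1.0):
    print(gamma, np.round(w ** gamma, 2))
```

Note how γ = 0.5 already shrinks the extreme weights 0.2 and 5.0 toward 1, trading some consistency for stability.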
2.1.3 Regularized IWERM
Instead of flattening the importance weight, we may add a regularizer to the
empirical risk term. We call this regularized IWERM:
\[ \widehat{\theta}_{\lambda} := \mathop{\mathrm{argmin}}_{\theta} \left[ \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \, \mathrm{loss}\bigl(x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}}, \widehat{f}(x_i^{\mathrm{tr}}; \theta)\bigr) + \lambda R(\theta) \right], \tag{2.2} \]

where R(θ) is a regularization function, and λ (≥ 0) is the regularization
parameter that controls the strength of regularization. For some C(λ) (≥ 0),
the solution of equation 2.2 can also be obtained by solving the following
constrained optimization problem:

\[ \widehat{\theta}_{\lambda} = \mathop{\mathrm{argmin}}_{\theta} \left[ \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \, \mathrm{loss}\bigl(x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}}, \widehat{f}(x_i^{\mathrm{tr}}; \theta)\bigr) \right] \quad \text{subject to } R(\theta) \le C(\lambda). \]
A typical choice of the regularization function R(θ) is the squared ℓ₂-norm
(see figure 2.1):

\[ R(\theta) = \sum_{\ell=1}^{b} \theta_\ell^2. \]

This is differentiable and convex, so it is often convenient for devising
computationally efficient optimization algorithms. The feasible region (where
the constraint R(θ) ≤ C(λ) is satisfied) is illustrated in figure 2.2a.
Another useful choice is the ℓ₁-norm (see figure 2.1):

\[ R(\theta) = \sum_{\ell=1}^{b} |\theta_\ell|. \]

Figure 2.1: Regularization functions: the squared regularizer θ² and the absolute regularizer |θ|.
Figure 2.2: Feasible regions by regularization: (a) squared regularization; (b) absolute regularization.
This is not differentiable, but it is still convex. It is known that the absolute
regularizer induces a sparse solution, that is, the parameters {θ_ℓ}_{ℓ=1}^b tend to become
zero [200, 183, 31]. When the solution θ̂ is sparse, output values f̂(x; θ̂) may
be computed efficiently, which is a useful property if the number of parameters
is very large. Furthermore, when the linear-in-input model (see section 1.3.5)

\[ \widehat{f}(x; \theta) = \sum_{k=1}^{d} \theta_k x^{(k)} \tag{2.3} \]

is used, making a solution sparse corresponds to choosing a subset of input
variables {x^(k)}_{k=1}^d that are responsible for predicting output values y. This is
highly useful when each input variable has some meaning (type of experiment,
etc.), and we want to interpret the "reason" for the prediction. Such a technique
can be applied in bioinformatics, natural language processing, and computational
chemistry. The feasible region of the absolute regularizer is illustrated in
figure 2.2b.
The reason why the solution becomes sparse under absolute regularization may
be intuitively understood from figure 2.3, where the squared loss is adopted.
The feasible region of the absolute regularizer has "corners" on the axes, and thus
the solution tends to lie on one of the corners, which is sparse. On the other
hand, such sparseness is not obtained when the squared regularizer is used.
2.2 Examples of Importance-Weighted Regression Methods
The above importance-weighting idea is very general, and can be applied to
various learning algorithms. In this section, we provide examples of regression
methods including least squares and robust regression. Classification methods
will be covered in section 2.3.
Figure 2.3: Sparseness brought about by absolute regularization: (a) squared regularization; (b) absolute regularization.

Figure 2.4: Loss functions for regression, as functions of the residual ŷ − y: squared loss, absolute loss, Huber loss, and deadzone-linear loss.
2.2.1 Squared Loss: Least-Squares Regression
Least squares (LS) is one of the most fundamental regression techniques in
statistics and machine learning. The adaptive importance-weighting method
for the squared loss, called adaptive importance-weighted LS (AIWLS), is
given as follows (see figure 2.4):

\[ \widehat{\theta}_{\gamma} := \mathop{\mathrm{argmin}}_{\theta} \left[ \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\gamma} \bigl( \widehat{f}(x_i^{\mathrm{tr}}; \theta) - y_i^{\mathrm{tr}} \bigr)^2 \right], \tag{2.4} \]
where 0 ≤ γ ≤ 1. Let us employ the linear-in-parameter model (see
section 1.3.5.2) for learning:

\[ \widehat{f}(x; \theta) = \sum_{\ell=1}^{b} \theta_\ell \varphi_\ell(x), \tag{2.5} \]

where {φ_ℓ(x)}_{ℓ=1}^b are fixed, linearly independent functions. Then the above
minimizer θ̂_γ is given analytically, as follows.
Let X^tr be the design matrix, that is, the n_tr × b matrix with (i, ℓ)-th element

\[ X^{\mathrm{tr}}_{i,\ell} := \varphi_\ell(x_i^{\mathrm{tr}}). \]

Then we have

\[ \widehat{f}(x_i^{\mathrm{tr}}; \theta) = [X^{\mathrm{tr}} \theta]_i, \]

and thus equation 2.4 is expressed in matrix form as

\[ \widehat{\theta}_{\gamma} = \mathop{\mathrm{argmin}}_{\theta} \left[ \frac{1}{n_{\mathrm{tr}}} (X^{\mathrm{tr}} \theta - y^{\mathrm{tr}})^\top W_{\gamma} (X^{\mathrm{tr}} \theta - y^{\mathrm{tr}}) \right], \]

where W_γ is the diagonal matrix with i-th diagonal element

\[ [W_{\gamma}]_{i,i} := \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\gamma}, \]

and

\[ y^{\mathrm{tr}} := (y_1^{\mathrm{tr}}, y_2^{\mathrm{tr}}, \ldots, y_{n_{\mathrm{tr}}}^{\mathrm{tr}})^\top. \]

Taking the derivative with respect to θ and equating it to zero yields

\[ X^{\mathrm{tr}\top} W_{\gamma} X^{\mathrm{tr}} \theta = X^{\mathrm{tr}\top} W_{\gamma} y^{\mathrm{tr}}. \]

Let L_γ be the learning matrix given by

\[ L_{\gamma} := (X^{\mathrm{tr}\top} W_{\gamma} X^{\mathrm{tr}})^{-1} X^{\mathrm{tr}\top} W_{\gamma}, \]

where we assume that the inverse of X^tr⊤ W_γ X^tr exists. Then θ̂_γ is given by

\[ \widehat{\theta}_{\gamma} = L_{\gamma} y^{\mathrm{tr}}. \tag{2.6} \]
A MATLAB® implementation of adaptive importance-weighted LS is
available from http://sugiyama-www.cs.titech.ac.jp/~sugi/software/IWLS/.
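Equation 2.6 is a one-liner in any linear algebra library. The following sketch (not the authors' MATLAB code) computes θ̂_γ on synthetic data; the importance values are hypothetical stand-ins, since in practice they would have to be estimated (chapter 4).

```python
import numpy as np

rng = np.random.default_rng(1)
n_tr, b = 100, 3
x_tr = rng.normal(0.0, 1.0, n_tr)
y_tr = np.sin(x_tr) + 0.1 * rng.normal(size=n_tr)

# design matrix: X[i, l] = phi_l(x_i); a polynomial basis is an illustrative choice
X = np.stack([x_tr ** l for l in range(b)], axis=1)

w = np.exp(x_tr - 0.25)     # hypothetical importance values p_te/p_tr
gamma = 0.5                 # flattening parameter

W = np.diag(w ** gamma)     # [W_gamma]_ii = (p_te(x_i)/p_tr(x_i))^gamma
L = np.linalg.solve(X.T @ W @ X, X.T @ W)   # learning matrix L_gamma
theta = L @ y_tr                            # theta_gamma = L_gamma y_tr (eq. 2.6)
print(theta.shape)   # (3,)
```

Setting γ = 0 reduces W to the identity, and the formula collapses to ordinary least squares.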
The above analytic solution is easy to implement and useful for theoretical
analysis. However, when the number of parameters is very large,
computing the solution by means of equation 2.6 may not be tractable, since the
matrix X^tr⊤ W_γ X^tr that we need to invert is very high-dimensional.² Another
way to obtain the solution is gradient descent: the parameter θ is updated
so that the squared-error term is reduced, and this procedure is repeated until
convergence (see figure 2.5):

\[ \theta_\ell \longleftarrow \theta_\ell - \varepsilon \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\gamma} \left( \sum_{\ell'=1}^{b} \theta_{\ell'} \varphi_{\ell'}(x_i^{\mathrm{tr}}) - y_i^{\mathrm{tr}} \right) \varphi_\ell(x_i^{\mathrm{tr}}) \quad \text{for all } \ell, \]

where ε (> 0) is a learning-rate parameter, and the second term
corresponds to the gradient of the objective function (equation 2.4). If ε is
large, the solution goes down the slope very fast (see figure 2.5a); however,
it can overshoot the bottom of the objective function and fluctuate around the
Figure 2.5: Schematic illustration of gradient descent: (a) when ε is large; (b) when ε is small.

2. In practice, we may solve the linear equation

\[ X^{\mathrm{tr}\top} W_{\gamma} X^{\mathrm{tr}} \widehat{\theta}_{\gamma} = X^{\mathrm{tr}\top} W_{\gamma} y^{\mathrm{tr}} \]

for computing the solution. This would be slightly more computationally efficient than computing
the solution by means of equation 2.6. However, solving this equation would still be intractable
when the number of parameters is very large.
bottom. On the other hand, if ε is small, the speed of going down the slope
is slow, but the solution converges stably to the bottom (see figure 2.5b). A
suitable scheme would be to start from a large ε to go down the slope quickly,
and then gradually decrease the value of ε so that the solution properly converges
to the bottom of the objective function. However, determining an appropriate
schedule for ε is highly problem-dependent, and it is not easy to choose ε
appropriately in practice.
Note that in general, the gradient method is guaranteed only to be able to
find one of the local optima, whereas in the case of LS for linear-in-parameter
models (equation 2.5), we can always find the globally optimal solution thanks
to the convexity of the objective function (see, e.g., [27]).
When the number n_tr of training samples is very large, computing the gradient
as above is rather time-consuming. In such a case, the following stochastic
gradient method [7] is computationally more efficient: for a randomly chosen
sample index i ∈ {1, 2, …, n_tr} in each iteration, repeat the following
single-sample update until convergence:

\[ \theta_\ell \longleftarrow \theta_\ell - \varepsilon \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\gamma} \left( \sum_{\ell'=1}^{b} \theta_{\ell'} \varphi_{\ell'}(x_i^{\mathrm{tr}}) - y_i^{\mathrm{tr}} \right) \varphi_\ell(x_i^{\mathrm{tr}}) \quad \text{for all } \ell. \]

Convergence of the stochastic gradient method is guaranteed in the probabilistic
sense.
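The single-sample update can be transcribed directly. This sketch uses synthetic data and, for simplicity, uniform importance weights (no covariate shift); the learning rate is held fixed rather than scheduled.

```python
import numpy as np

rng = np.random.default_rng(2)
n_tr, b = 200, 2
x_tr = rng.uniform(-1, 1, n_tr)
y_tr = 2.0 * x_tr + 0.05 * rng.normal(size=n_tr)   # true function: f(x) = 2x

phi = lambda x: np.array([1.0, x])     # basis functions phi_1, phi_2
w = np.ones(n_tr)                      # importance weights (uniform here)
gamma, eps = 1.0, 0.05                 # flattening parameter, learning rate

theta = np.zeros(b)
for t in range(20000):
    i = rng.integers(n_tr)             # randomly chosen sample index
    ph = phi(x_tr[i])
    resid = theta @ ph - y_tr[i]       # single-sample residual
    theta -= eps * (w[i] ** gamma) * resid * ph   # update all theta_l at once
print(theta)   # close to [0, 2]
```

Because the objective is convex in θ for the linear-in-parameter model, the iterate drifts to a neighborhood of the global least-squares solution.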
The AIWLS method can easily be extended to kernel models (see
section 1.3.5.3) by letting b = n_tr and φ_ℓ(x) = K(x, x_ℓ^tr), where K(·, ·) is a
kernel function:

\[ \widehat{f}(x; \theta) = \sum_{\ell=1}^{n_{\mathrm{tr}}} \theta_\ell K(x, x_\ell^{\mathrm{tr}}). \]

The Gaussian kernel is a popular choice:

\[ K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2h^2} \right), \]

where h > 0 controls the width of the Gaussian function. In this case, the design
matrix X^tr becomes the kernel Gram matrix K^tr, that is, K^tr is the n_tr × n_tr
matrix with (i, ℓ)-th element

\[ K^{\mathrm{tr}}_{i,\ell} := K(x_i^{\mathrm{tr}}, x_\ell^{\mathrm{tr}}). \]

Then the learned parameter θ̂_γ can be obtained by means of equation 2.6 with
the learning matrix L_γ given as

\[ L_{\gamma} = (K^{\mathrm{tr}} W_{\gamma} K^{\mathrm{tr}})^{-1} K^{\mathrm{tr}} W_{\gamma}, \]

using the fact that the kernel matrix K^tr is symmetric. The (stochastic) gradient
descent method is similarly available by replacing b and φ_ℓ(x) with n_tr and
K(x, x_ℓ^tr), respectively, and is still guaranteed to converge to the globally
optimal solution.
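The kernelized closed form is the same computation with the Gram matrix as design matrix. In the sketch below the data and weights are synthetic, and a tiny jitter term is added before inversion for numerical stability; the jitter is an implementation detail, not part of the text's equation.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
x_tr = np.sort(rng.uniform(-3, 3, n))
y_tr = np.sin(x_tr) + 0.1 * rng.normal(size=n)

h = 0.5                                                # Gaussian width
K = np.exp(-(x_tr[:, None] - x_tr[None, :]) ** 2 / (2 * h ** 2))  # Gram matrix K^tr
w = np.ones(n)                                         # importance weights (uniform here)
W = np.diag(w)                                         # W_gamma
A = K @ W @ K + 1e-8 * np.eye(n)                       # jitter for stability only
theta = np.linalg.solve(A, K @ W @ y_tr)               # theta_gamma via eq. 2.6 with X^tr = K^tr

f_hat = K @ theta                                      # fitted values at the training points
print(theta.shape, f_hat.shape)
```

Note that the number of parameters equals n_tr here, which is exactly the nonparametric regime discussed in section 1.3.5.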
2.2.2 Absolute Loss: Least-Absolute Regression
The LS method often suffers from excessive sensitivity to outliers (i.e.,
irregular values) and less reliability. Here, we introduce an alternative to LS
based on the least-absolute (LA) method, which we refer to as adaptive
importance-weighted least-absolute regression (AIWLAR): instead of
the squared loss, the absolute loss is used (see figure 2.4):

\[ \widehat{\theta}_{\gamma} := \mathop{\mathrm{argmin}}_{\theta} \left[ \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\gamma} \bigl| \widehat{f}(x_i^{\mathrm{tr}}; \theta) - y_i^{\mathrm{tr}} \bigr| \right]. \tag{2.7} \]

The LS regression method actually estimates the conditional mean of the output
y given input x. This may be intuitively understood from the fact that
minimization under the squared loss amounts to obtaining the mean of samples
{z_i}_{i=1}^n:

\[ \mathop{\mathrm{argmin}}_{z} \sum_{i=1}^{n} (z - z_i)^2 = \frac{1}{n} \sum_{i=1}^{n} z_i. \]
If one of the values in the set {z_i}_{i=1}^n is extremely large or small due to, for
instance, some measurement error, the mean will be strongly affected by that
outlier sample. Thus, all the values {z_i}_{i=1}^n are responsible for the mean, and
therefore even a single outlier observation can significantly damage the learned
function.
On the other hand, the LA regression method actually estimates the conditional
median of the output y given input x. Indeed, minimization under the
absolute loss amounts to obtaining the median:

\[ \mathop{\mathrm{argmin}}_{z} \sum_{i=1}^{2n+1} |z - z_i| = z_{n+1}, \]

where z_1 ≤ z_2 ≤ ⋯ ≤ z_{2n+1}. The median is not influenced by the magnitude
of the values {z_i}, but only by their order. Thus, as long as the order is
kept unchanged, the median is not affected by outliers; in fact, the median is
known to be the most robust estimator in light of breakdown-point analysis
[83, 138].
The minimization problem (equation 2.7) looks cumbersome due to the
absolute value operator, which is non-differentiable. However, the following
mathematical trick mitigates this issue [27]:

\[ |x| = \min_{b} b \quad \text{subject to } -b \le x \le b. \]

Then the minimization problem (equation 2.7) is reduced to the following
optimization problem:

\[ \min_{\theta, \{b_i\}_{i=1}^{n_{\mathrm{tr}}}} \; \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\gamma} b_i \quad \text{subject to } -b_i \le \widehat{f}(x_i^{\mathrm{tr}}; \theta) - y_i^{\mathrm{tr}} \le b_i \ \ \forall i. \]

If the linear-in-parameter model (equation 2.5) is used for learning, the above
optimization problem is reduced to a linear program [27], which can be solved
efficiently by standard optimization software.
The number of constraints is n_tr in the above linear program. When n_tr is
large, we may employ sophisticated optimization techniques such as column
generation, which considers an increasing set of active constraints [37], to
solve the linear programming problem efficiently. Alternatively, an approximate
solution can be obtained by gradient descent or (quasi-)Newton methods if the
absolute loss is approximated by a smooth loss (see section 2.2.3).
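The linear program above maps directly onto a generic LP solver. This sketch uses `scipy.optimize.linprog` with the variable vector z = (θ, b_1, …, b_n); the data, the outlier, and the uniform weights are all synthetic choices for illustration.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
n, d = 60, 2
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])   # linear-in-parameter design
y = X @ np.array([1.0, -2.0]) + 0.01 * rng.normal(size=n)
y[0] = 50.0            # one gross outlier; the median-like LA fit ignores it

w = np.ones(n)         # importance weights (p_te/p_tr)^gamma; uniform here
# minimize sum_i w_i b_i over z = (theta, b_1..b_n)
c = np.concatenate([np.zeros(d), w])
A_ub = np.block([[ X, -np.eye(n)],     #  X theta - b <=  y
                 [-X, -np.eye(n)]])    # -X theta - b <= -y
b_ub = np.concatenate([y, -y])
bounds = [(None, None)] * d + [(0, None)] * n   # theta free, b_i >= 0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
theta = res.x[:d]
print(theta)   # close to [1, -2] despite the outlier
```

Running the same data through the LS formula would instead drag the fit badly toward the outlier, which is the point of the comparison.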
2.2.3 Huber Loss: Huber Regression
LA regression is useful for suppressing the influence of outliers. However,
when the training output noise is Gaussian, the LA method is not statistically
efficient, that is, it tends to have a large variance when there are no outliers.
A popular alternative is the Huber loss [83], which bridges the LS and LA
methods. The adaptive importance-weighting method for the Huber loss, called
adaptive importance-weighted Huber regression (AIWHR), is as follows:

\[ \widehat{\theta}_{\gamma} := \mathop{\mathrm{argmin}}_{\theta} \left[ \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\gamma} \rho_{\tau}\bigl( \widehat{f}(x_i^{\mathrm{tr}}; \theta) - y_i^{\mathrm{tr}} \bigr) \right], \]

where τ (≥ 0) is the robustness parameter and ρ_τ is the Huber loss, defined as
follows (see figure 2.4):

\[ \rho_{\tau}(y) = \begin{cases} \frac{1}{2} y^2 & \text{if } |y| \le \tau, \\[2pt] \tau |y| - \frac{1}{2}\tau^2 & \text{if } |y| > \tau. \end{cases} \]

Thus, the squared loss is applied to "good" samples with small fitting error, and
the absolute loss is applied to "bad" samples with large fitting error. Note that
the Huber loss is a convex function, and therefore the unique global solution
exists.
The Huber loss function is rather intricate, but for the linear-in-parameter
model (equation 2.5), the solution can be obtained by solving the following
convex quadratic programming problem [109]:

\[ \min_{\theta, \{u_i\}, \{v_i\}} \; \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\gamma} \left( \frac{1}{2} u_i^2 + \tau v_i \right) \quad \text{subject to } -v_i \le \sum_{\ell=1}^{b} \theta_\ell \varphi_\ell(x_i^{\mathrm{tr}}) - y_i^{\mathrm{tr}} - u_i \le v_i \ \text{ for all } i. \]
Another way to obtain the solution is gradient descent (notice that the Huber
loss is once differentiable):

\[ \theta_\ell \longleftarrow \theta_\ell - \varepsilon \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\gamma} \Delta\rho_{\tau}\bigl( \widehat{f}(x_i^{\mathrm{tr}}; \theta) - y_i^{\mathrm{tr}} \bigr) \varphi_\ell(x_i^{\mathrm{tr}}) \quad \text{for all } \ell, \]

where ε (> 0) is a learning-rate parameter, and Δρ_τ is the derivative of ρ_τ,
given by

\[ \Delta\rho_{\tau}(y) = \begin{cases} y & \text{if } |y| \le \tau, \\ \tau & \text{if } y > \tau, \\ -\tau & \text{if } y < -\tau. \end{cases} \]

Its stochastic version is the corresponding single-sample update,
where the sample index i ∈ {1, 2, …, n_tr} is randomly chosen in each iteration,
and the above gradient descent process is repeated until convergence.
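The Huber loss ρ_τ and its derivative Δρ_τ are short to transcribe; the sketch below assumes the standard Huber form (quadratic inside ±τ, linear outside), matching the derivative given above.

```python
import numpy as np

def rho(y, tau):
    # Huber loss: y^2/2 if |y| <= tau, tau*|y| - tau^2/2 otherwise
    return np.where(np.abs(y) <= tau,
                    0.5 * y ** 2,
                    tau * np.abs(y) - 0.5 * tau ** 2)

def drho(y, tau):
    # derivative: y if |y| <= tau, tau if y > tau, -tau if y < -tau
    return np.clip(y, -tau, tau)

vals = rho(np.array([0.5, 3.0]), 1.0)   # 0.125 and 2.5
print(vals)
```

The two branches agree at |y| = τ (both give τ²/2), which is what makes the loss once differentiable and the gradient update above well defined.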
2.2.4 Deadzone-Linear Loss: Support Vector Regression
Another variant of the absolute loss is the deadzone-linear loss (see figure 2.4):

\[ \widehat{\theta}_{\gamma} := \mathop{\mathrm{argmin}}_{\theta} \left[ \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\gamma} \bigl| \widehat{f}(x_i^{\mathrm{tr}}; \theta) - y_i^{\mathrm{tr}} \bigr|_{\epsilon} \right], \]

where | · |_ε is the deadzone-linear loss, defined by

\[ |x|_{\epsilon} := \begin{cases} 0 & \text{if } |x| \le \epsilon, \\ |x| - \epsilon & \text{if } |x| > \epsilon. \end{cases} \]
That is, if the magnitude of the residual |f̂(x_i^tr; θ) − y_i^tr| is less than ε, no
error is assessed. This loss is also called the ε-insensitive loss and is used
in support vector regression [193]. We refer to this method as adaptive
importance-weighted support vector regression (AIWSVR).
When ε = 0, the deadzone-linear loss reduces to the absolute loss (see
section 2.2.2), so the deadzone-linear loss and the absolute loss are related
to one another. However, the effect of the deadzone-linear loss is quite
different from that of the absolute loss when ε > 0: the influence of "good"
samples (with small residual) is deemphasized in the deadzone-linear loss,
while the absolute loss tends to suppress the influence of "bad" samples (with
large residual) compared with the squared loss.
The solution θ̂_γ can be obtained by solving the following optimization
problem [27]:

\[ \min_{\theta, \{b_i\}_{i=1}^{n_{\mathrm{tr}}}} \; \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\gamma} b_i \quad \text{subject to } -b_i - \epsilon \le \widehat{f}(x_i^{\mathrm{tr}}; \theta) - y_i^{\mathrm{tr}} \le b_i + \epsilon, \quad b_i \ge 0, \ \forall i. \]

If the linear-in-parameter model (equation 2.5) is used for learning, the above
optimization problem is reduced to a linear program [27], which can be solved
efficiently by standard optimization software.
Support vector regression was shown to be equivalent to minimizing the
conditional value-at-risk (CVaR) of the absolute residuals [180]. The CVaR
corresponds to the mean of the error over a set of "bad" samples (see figure 2.6),
and is a popular risk measure in finance [137].
More specifically, let us consider the cumulative distribution of the absolute
residuals |f̂(x^tr; θ) − y^tr| over all training samples {(x_i^tr, y_i^tr)}_{i=1}^{n_tr}:

\[ \Phi(\alpha \mid \theta) := \mathrm{Prob}_{\mathrm{tr}}\Bigl( \bigl| \widehat{f}(x^{\mathrm{tr}}; \theta) - y^{\mathrm{tr}} \bigr| \le \alpha \Bigr), \]

where Prob_tr denotes the probability over training samples {(x_i^tr, y_i^tr)}_{i=1}^{n_tr}. For
β ∈ [0, 1), let α_β(θ) be the 100β-percentile of the distribution of absolute
residuals:

\[ \alpha_{\beta}(\theta) := \mathop{\mathrm{argmin}}_{\alpha} \alpha \quad \text{subject to } \Phi(\alpha \mid \theta) \ge \beta. \]

Thus, only the fraction (1 − β) of the absolute residuals |f̂(x_i^tr; θ) − y_i^tr|
exceeds the threshold α_β(θ). α_β(θ) is referred to as the value-at-risk (VaR).
Let us consider the β-tail distribution of the absolute residuals:

\[ \Phi_{\beta}(\alpha \mid \theta) := \begin{cases} 0 & \text{if } \alpha < \alpha_{\beta}(\theta), \\[2pt] \dfrac{\Phi(\alpha \mid \theta) - \beta}{1 - \beta} & \text{if } \alpha \ge \alpha_{\beta}(\theta). \end{cases} \]
Figure 2.6: The conditional value-at-risk (CVaR): the mean of the absolute residuals beyond the 100β-percentile α_β.
Let φ_β(θ) be the mean of the β-tail distribution of the absolute residuals:

\[ \phi_{\beta}(\theta) := \mathbb{E}_{\Phi_{\beta}}[\alpha], \]

where E_{Φ_β} denotes the expectation over the distribution Φ_β. φ_β(θ) is called
the CVaR. By definition, the CVaR of the absolute residuals reduces to the
mean absolute residual if β = 0, and it converges to the worst absolute residual
as β tends to 1. Thus, the CVaR smoothly bridges the LA approach and the
Chebyshev approximation (a.k.a. minimax) method. CVaR is also referred to
as the expected shortfall.
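Empirically, the VaR is a sample quantile and the CVaR is the mean of the residuals beyond it. The residuals below are synthetic (absolute values of standard normal draws), used only to illustrate the two quantities.

```python
import numpy as np

rng = np.random.default_rng(4)
resid = np.abs(rng.normal(size=10000))   # absolute residuals |f(x) - y|
beta = 0.9

alpha_beta = np.quantile(resid, beta)        # VaR: the 100*beta percentile
cvar = resid[resid >= alpha_beta].mean()     # CVaR: mean of the beta-tail
print(alpha_beta < cvar)                     # the CVaR always dominates the VaR
```

At β = 0 this CVaR estimate is just the mean absolute residual, and as β approaches 1 it approaches the single worst residual, matching the bridging behavior described above.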
2.3 Examples of Importance-Weighted Classification Methods
In this section, we provide examples of importance-weighted classification
methods, including Fisher discriminant analysis [50, 57], logistic regression
[74], support vector machines [26, 193, 141], and boosting [52, 28, 53]. For
simplicity, we focus on the binary classification case where Y = {+1, −1}. Let n_tr^+
and n_tr^− be the numbers of training samples in classes +1 and −1, respectively.
In the classification setup, the following 0/1-loss is typically used as the
error metric, since it corresponds to the misclassification rate:

\[ \mathrm{loss}(x, y, \widehat{y}) = \begin{cases} 0 & \text{if } \mathrm{sgn}(\widehat{y}) = \mathrm{sgn}(y), \\ 1 & \text{otherwise}, \end{cases} \]
where y is the true output value at an input point x, ŷ is an estimate of y, and
sgn(ŷ) denotes the sign of ŷ:

\[ \mathrm{sgn}(y) := \begin{cases} +1 & \text{if } y > 0, \\ 0 & \text{if } y = 0, \\ -1 & \text{if } y < 0. \end{cases} \]

This means that, in classification scenarios, the sign of ŷ is important, and the
magnitude of ŷ does not affect the misclassification error.
The above 0/1-loss can be equivalently expressed as

\[ \mathrm{loss}(x, y, \widehat{y}) = \begin{cases} 0 & \text{if } \mathrm{sgn}(\widehat{y} y) = 1, \\ 1 & \text{otherwise}. \end{cases} \]
Figure 2.7: Loss functions for classification, as functions of the margin ŷy: squared loss, logistic loss, hinge loss, exponential loss, and 0/1-loss. Here y is the true output value at an input point x and ŷ is an estimate of y.
For this reason, the loss function in classification is often expressed in terms
of ŷy, which is called the margin. The profile of the 0/1-loss is illustrated as a
function of ŷy in figure 2.7.
Minimizing the 0/1-loss is the ultimate goal of classification. However, since
the 0/1-loss is not a convex function, optimization under it is hard. To cope
with this problem, alternative convex loss functions have been proposed for
classification scenarios (figure 2.7). In this section, we review classification
methods with such convex loss functions.
2.3.1 Squared Loss: Fisher Discriminant Analysis
Fisher discriminant analysis (FDA) is one of the classical classification
methods [50]. In FDA, the input samples are first projected onto a one-dimensional
subspace (i.e., a line), and then the projected samples are linearly separated into
two classes by thresholding; for a multiclass extension, see, for example, [57].
Let μ, μ⁺, and μ⁻ be the means of {x_i^tr}_{i=1}^{n_tr}, {x_i^tr | y_i^tr = +1},
and {x_i^tr | y_i^tr = −1}, respectively:

\[ \mu := \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} x_i^{\mathrm{tr}}, \qquad \mu^{+} := \frac{1}{n_{\mathrm{tr}}^{+}} \sum_{i : y_i^{\mathrm{tr}} = +1} x_i^{\mathrm{tr}}, \qquad \mu^{-} := \frac{1}{n_{\mathrm{tr}}^{-}} \sum_{i : y_i^{\mathrm{tr}} = -1} x_i^{\mathrm{tr}}, \]

where Σ_{i: y_i^tr = +1} denotes the summation over indices i such that y_i^tr = +1. Let S^b
and S^w be the between-class scatter matrix and the within-class scatter matrix,
respectively, defined as

\[ S^{b} := n_{\mathrm{tr}}^{+} (\mu^{+} - \mu)(\mu^{+} - \mu)^\top + n_{\mathrm{tr}}^{-} (\mu^{-} - \mu)(\mu^{-} - \mu)^\top, \]
\[ S^{w} := \sum_{i : y_i^{\mathrm{tr}} = +1} (x_i^{\mathrm{tr}} - \mu^{+})(x_i^{\mathrm{tr}} - \mu^{+})^\top + \sum_{i : y_i^{\mathrm{tr}} = -1} (x_i^{\mathrm{tr}} - \mu^{-})(x_i^{\mathrm{tr}} - \mu^{-})^\top. \]

The FDA projection direction φ_FDA ∈ ℝ^d is defined as

\[ \phi_{\mathrm{FDA}} := \mathop{\mathrm{argmax}}_{\phi} \frac{\phi^\top S^{b} \phi}{\phi^\top S^{w} \phi}. \]

That is, FDA seeks a projection direction φ with large between-class scatter
and small within-class scatter after projection.
The above ratio is called the Rayleigh quotient. A strong advantage of the
Rayleigh quotient formulation is that a globally optimal solution can be computed
analytically even though the objective function is not convex. Indeed,
the FDA projection direction is given by $\varphi_{\mathrm{FDA}} = \varphi_{\max}$,
where $\varphi_{\max}$ is the generalized eigenvector associated with the largest generalized
eigenvalue of the following generalized eigenvalue problem:

$$S^b \varphi = \lambda S^w \varphi.$$

Thanks to the analytic solution, the FDA projection direction can be computed
efficiently. Finally, the projected samples are classified by thresholding.
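The generalized eigenvalue route to the FDA direction is straightforward to sketch in numpy (our code, not the book's; function and variable names are ours):

```python
import numpy as np

def fda_direction(X, y):
    """FDA projection direction for labels y in {-1, +1}: the leading
    generalized eigenvector of S_b phi = lambda S_w phi."""
    mu = X.mean(axis=0)
    Xp, Xn = X[y == +1], X[y == -1]
    mup, mun = Xp.mean(axis=0), Xn.mean(axis=0)
    # Between-class scatter, weighted by the class sample sizes as in the text.
    Sb = (len(Xp) * np.outer(mup - mu, mup - mu)
          + len(Xn) * np.outer(mun - mu, mun - mu))
    # Within-class scatter.
    Sw = (Xp - mup).T @ (Xp - mup) + (Xn - mun).T @ (Xn - mun)
    # Reduce the generalized problem to an ordinary eigenproblem for Sw^{-1} Sb.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    phi = np.real(evecs[:, np.argmax(np.real(evals))])
    return phi / np.linalg.norm(phi)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([3.0, 1.0], 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)
phi = fda_direction(X, y)
```

For two classes the between-class scatter has rank one, so the computed direction is parallel to the classical closed form $(S^w)^{-1}(\mu^+ - \mu^-)$.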
The FDA solution can also be obtained in the LS regression framework
where $\mathcal{Y} = \mathbb{R}$. Suppose the training output values $\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ are

$$y_i^{\mathrm{tr}} \propto \begin{cases} 1/n_{\mathrm{tr}}^+ & \text{if } x_i^{\mathrm{tr}} \text{ belongs to class } +1,\\ -1/n_{\mathrm{tr}}^- & \text{if } x_i^{\mathrm{tr}} \text{ belongs to class } -1. \end{cases}$$
We use the linear-in-input model (equation 2.3) for learning, and the classification
result $\hat{y}^{\mathrm{te}}$ of a test sample $x^{\mathrm{te}}$ is obtained by the sign of the output of the
learned function:

$$\hat{y}^{\mathrm{te}} = \mathrm{sgn}\left( \hat{f}(x^{\mathrm{te}}; \hat{\theta}) \right).$$

In this setting, if the parameter $\theta$ is learned by the LS method, this classification
method is essentially equivalent to FDA [42, 21].
Under covariate shift, the adaptive importance-weighting idea can be
employed in FDA, which we call adaptive importance-weighted FDA
(AIWFDA):
$$\hat{\theta}_\gamma = \operatorname*{argmin}_{\theta} \left[ \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\!\gamma} \left( \hat{f}(x_i^{\mathrm{tr}}; \theta) - y_i^{\mathrm{tr}} \right)^2 \right].$$
The solution is given analytically in exactly the same way as the regression
case (see section 2.2.1).
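Since AIWFDA reduces to a flattened importance-weighted LS problem, its solution follows from the weighted normal equations. A hedged numpy sketch (function names are ours; the density ratios $w_i = p_{\mathrm{te}}(x_i^{\mathrm{tr}})/p_{\mathrm{tr}}(x_i^{\mathrm{tr}})$ are assumed to be given):

```python
import numpy as np

def aiw_least_squares(Phi, y, w, gamma):
    """Solve argmin_theta sum_i w_i**gamma * (Phi_i . theta - y_i)^2
    via the weighted normal equations; gamma = 0 gives OLS, gamma = 1 IWLS."""
    Wg = w ** gamma
    A = Phi.T @ (Wg[:, None] * Phi)
    b = Phi.T @ (Wg * y)
    return np.linalg.solve(A, b)

def classify(Phi, theta):
    # AIWFDA prediction: the sign of the learned linear function.
    return np.sign(Phi @ theta)
```

Setting $\gamma = 0$ recovers the plain LS (and hence FDA-equivalent) solution.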
As explained in the beginning of this section, the margin $y_i^{\mathrm{tr}} \hat{f}(x_i^{\mathrm{tr}}; \theta)$ plays
an important role in classification. If $y_i^{\mathrm{tr}} = \pm 1$, the squared error used above
can be expressed in terms of the margin as

$$\left( \hat{f}(x_i^{\mathrm{tr}}; \theta) - y_i^{\mathrm{tr}} \right)^2 = (y_i^{\mathrm{tr}})^2 \left( \frac{\hat{f}(x_i^{\mathrm{tr}}; \theta)}{y_i^{\mathrm{tr}}} - 1 \right)^2 = \left( 1 - y_i^{\mathrm{tr}} \hat{f}(x_i^{\mathrm{tr}}; \theta) \right)^2,$$

where we used the facts that $(y_i^{\mathrm{tr}})^2 = 1$ and $1/y_i^{\mathrm{tr}} = y_i^{\mathrm{tr}}$. This expression of the
squared loss is illustrated in figure 2.7, showing that the above squared loss is a
convex upper bound of the 0/1-loss. Therefore, minimizing the error under the
squared loss corresponds to minimizing an upper bound of the 0/1-loss error,
although the bound is rather loose.
2.3.2 Logistic Loss: Logistic Regression Classifier
Logistic regression (LR), despite sounding like a regression method, is a classifier
that gives a confidence value (the class-posterior probability) for the
classification results [74].
The LR classifier employs a parametric model of the following form for
expressing the class-posterior probability $p(y \mid x)$:

$$\hat{p}(y \mid x) = \frac{1}{1 + \exp\left( -y \hat{f}(x; \theta) \right)}.$$
The parameter $\theta$ is usually learned by maximum likelihood estimation (MLE).
Since the negative log-likelihood can be regarded as the empirical error, the
adaptive importance-weighting idea can be employed in LR, which we call
adaptive importance-weighted LR (AIWLR):

$$\hat{\theta}_\gamma = \operatorname*{argmin}_{\theta} \left[ \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\!\gamma} \log\left( 1 + \exp\left( -y_i^{\mathrm{tr}} \hat{f}(x_i^{\mathrm{tr}}; \theta) \right) \right) \right].$$
The profile of the above loss function, which is called the logistic loss, is
illustrated in figure 2.7 as a function of the margin $y \hat{f}(x)$.
Since the above objective function is convex when a linear-in-parameter
model (equation 2.5) is used, the globally optimal solution can be obtained by standard
nonlinear optimization methods such as the gradient descent method, the conjugate
gradient method, Newton's method, and quasi-Newton methods [117].
The gradient descent update rule is given by

$$\theta_\ell \longleftarrow \theta_\ell + \varepsilon \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\!\gamma} \frac{y_i^{\mathrm{tr}} \varphi_\ell(x_i^{\mathrm{tr}})}{1 + \exp\left( y_i^{\mathrm{tr}} \hat{f}(x_i^{\mathrm{tr}}; \theta) \right)} \quad \text{for all } \ell,$$

where $\varepsilon$ ($> 0$) is a learning-rate parameter.
A C-language implementation of importance-weighted kernel logistic
regression is available from the Web page http://sugiyama-www.cs.titech.ac.jp/~yamada/iwklr.html.
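A minimal numpy sketch of AIWLR fitted by batch gradient descent (our code, not the book's implementation; a linear-in-parameter model and known density-ratio weights $w$ are assumed, and the objective is scaled by $1/n_{\mathrm{tr}}$ for a stable step size):

```python
import numpy as np

def aiwlr_fit(Phi, y, w, gamma, lr=0.2, n_iter=3000):
    """AIWLR by batch gradient descent on the flattened importance-weighted
    logistic loss (1/n) * sum_i w_i**gamma * log(1 + exp(-y_i * f_i))."""
    Wg = w ** gamma
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        m = y * (Phi @ theta)  # margins y_i * f(x_i; theta)
        # Gradient: -(1/n) * sum_i w_i^gamma * y_i * phi(x_i) / (1 + exp(m_i))
        grad = -(Phi.T @ (Wg * y / (1.0 + np.exp(m)))) / len(y)
        theta -= lr * grad
    return theta

def posterior(Phi, theta):
    # Learned model of p(y = +1 | x).
    return 1.0 / (1.0 + np.exp(-(Phi @ theta)))
```

Since the weighted logistic loss is convex, plain gradient descent with a small enough step size converges to the global optimum.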
2.3.3 Hinge Loss: Support Vector Machine
The support vector machine (SVM) [26, 193, 141] is a popular classification
technique that finds a separating hyperplane with maximum margin.
Although the original SVM was derived within the framework of the Vapnik-Chervonenkis
theory and the maximum margin principle, the SVM learning
criterion can be equivalently expressed in the following form³ [47]:

$$\hat{\theta} = \operatorname*{argmin}_{\theta} \left[ \sum_{i=1}^{n_{\mathrm{tr}}} \max\left( 0,\, 1 - y_i^{\mathrm{tr}} \hat{f}(x_i^{\mathrm{tr}}; \theta) \right) \right].$$

This implies that SVM is actually similar to FDA and LR; only the loss function
is different. The profile of the above loss function, which is called the
hinge loss, is illustrated in figure 2.7. As shown in the graph, the hinge loss is
a convex upper bound of the 0/1-loss and is sharper than the squared loss.
3. For simplicity, we have omitted the regularization term.
Adaptive importance weighting can be applied to SVM, which we call
adaptive importance-weighted SVM (AIWSVM):
$$\hat{\theta}_\gamma = \operatorname*{argmin}_{\theta} \left[ \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\!\gamma} \max\left( 0,\, 1 - y_i^{\mathrm{tr}} \hat{f}(x_i^{\mathrm{tr}}; \theta) \right) \right].$$
The support vector classifier was shown to minimize the conditional value­
at-risk (CVaR) of the margin [181] (see section 2.2.4 for the definition of
CVaR).
2.3.4 Exponential Loss: Boosting
Boosting is an iterative learning algorithm that produces a convex combination
of base classifiers [52]. Although boosting has its origin in the framework of
probably approximately correct (PAC) learning [190], it can be regarded as a
stagewise optimization of the exponential loss [28,53]:
The profile of the exponential loss is illustrated in figure 2.7. The exponential
loss is a convex upper bound of the Oil-loss that is looser than the hinge loss.
The adaptive importance-weighting idea can be applied to boosting, a
process we call adaptive importance-weighted boosting (AIWB):
$$\hat{\theta}_\gamma = \operatorname*{argmin}_{\theta} \left[ \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\!\gamma} \exp\left( -y_i^{\mathrm{tr}} \hat{f}(x_i^{\mathrm{tr}}; \theta) \right) \right].$$
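AIWFDA, AIWLR, AIWSVM, and AIWB all share one pattern: minimize an importance-weighted sum of a margin loss. A hedged numpy sketch of this shared pattern with pluggable (sub)gradients (our code; regularization is omitted for simplicity, and the weights $w$ are assumed given):

```python
import numpy as np

# Margin-based losses and their (sub)gradients with respect to the margin m.
LOSSES = {
    "hinge":       (lambda m: np.maximum(0.0, 1.0 - m),
                    lambda m: np.where(m < 1.0, -1.0, 0.0)),
    "exponential": (lambda m: np.exp(-m),
                    lambda m: -np.exp(-m)),
}

def aiw_erm(Phi, y, w, gamma, loss="hinge", lr=0.05, n_iter=3000):
    """Adaptive importance-weighted ERM:
    argmin_theta (1/n) * sum_i w_i**gamma * loss(y_i * f(x_i; theta))."""
    _, dloss = LOSSES[loss]
    Wg = w ** gamma
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        m = y * (Phi @ theta)
        # Chain rule: d loss(m_i)/d theta = loss'(m_i) * y_i * phi(x_i)
        grad = Phi.T @ (Wg * dloss(m) * y) / len(y)
        theta -= lr * grad
    return theta
```

With `loss="hinge"` this is a (sub)gradient version of AIWSVM; with `loss="exponential"` it is a batch stand-in for AIWB's stagewise optimization.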
2.4 Numerical Examples
In this section we illustrate how ERM, IWERM, and AIWERM behave, using
toy regression and classification problems.
2.4.1 Regression
We assume that the conditional distribution $p(y \mid x)$ has mean $f(x)$ and variance
$\sigma^2$; that is, output values contain independent additive noise. Let the
learning target function $f(x)$ be the sinc function:

$$f(x) = \mathrm{sinc}(x) := \begin{cases} 1 & \text{if } x = 0,\\[4pt] \dfrac{\sin(\pi x)}{\pi x} & \text{otherwise.} \end{cases}$$
Let the training and test input densities be

$$p_{\mathrm{tr}}(x) = N\!\left( x; 1, (1/2)^2 \right), \qquad p_{\mathrm{te}}(x) = N\!\left( x; 2, (1/4)^2 \right),$$

where $N(x; \mu, \sigma^2)$ denotes the Gaussian density with mean $\mu$ and variance $\sigma^2$.
The profiles of the densities are plotted in figure 2.8a. Since the training input
points are distributed on the left-hand side of the input domain and the test
input points are distributed on the right-hand side, we are considering a (weak)
extrapolation problem. We create the training output values $\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ as

$$y_i^{\mathrm{tr}} = f(x_i^{\mathrm{tr}}) + \epsilon_i^{\mathrm{tr}},$$

where $\{\epsilon_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ are i.i.d. additive noise with mean zero and variance $\sigma^2$.
We let the number of training samples be $n_{\mathrm{tr}} = 150$, and we use a linear-in-input
model (see section 1.3.5.1) for function learning:

$$\hat{f}(x; \theta) = \theta_1 x + \theta_2.$$
If ordinary least squares (OLS, which is an ERM method with the squared
loss) is used for fitting the straight-line model, we have a good approxima­
tion of the left-hand side of the sinc function (see figure 2.8b). However,
this is not an appropriate function for estimating the test output values (x in
the figure). Thus, OLS results in a large test error. Figure 2.8d depicts the
learned function obtained by importance-weighted LS (IWLS). IWLS gives
a better function for estimating the test output values than OLS, although
it is rather unstable. Figure 2.8c depicts a learned function obtained by
AIWLS with $\gamma = 0.5$ (see section 2.2.1), which yields better estimation of
the test output values than IWLS (AIWLS with $\gamma = 1$) and OLS (AIWLS
with $\gamma = 0$).
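This toy experiment is easy to re-create. A sketch under stated assumptions (our code, not the authors'; the true density ratio is used as the importance weight, and the noise standard deviation, 0.1 here, is our choice):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

rng = np.random.default_rng(5)
x_tr = rng.normal(1.0, 0.5, size=150)              # p_tr = N(1, (1/2)^2)
y_tr = np.sinc(x_tr) + 0.1 * rng.normal(size=150)  # sinc target + noise
x_te = rng.normal(2.0, 0.25, size=1000)            # p_te = N(2, (1/4)^2)

# True importance ratio p_te(x)/p_tr(x) at the training inputs.
w = gauss_pdf(x_tr, 2.0, 0.25) / gauss_pdf(x_tr, 1.0, 0.5)

def aiwls(gamma):
    # Straight-line model f(x; theta) = theta_1 * x + theta_2, weighted LS fit.
    Phi = np.column_stack([x_tr, np.ones_like(x_tr)])
    Wg = w ** gamma
    return np.linalg.solve(Phi.T @ (Wg[:, None] * Phi), Phi.T @ (Wg * y_tr))

test_err = {g: float(np.mean((np.column_stack([x_te, np.ones_like(x_te)]) @ aiwls(g)
                              - np.sinc(x_te)) ** 2)) for g in (0.0, 0.5, 1.0)}
# OLS (gamma = 0) extrapolates poorly; weighting should shrink the test error.
```

Note that `np.sinc` is exactly the normalized sinc $\sin(\pi x)/(\pi x)$ used in the text.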
2.4.2 Classification
Through the above regression examples, we found that importance weight­
ing tends to improve the prediction performance in regression scenarios.
Here, we apply the importance-weighting technique to a toy classification
problem.
Figure 2.8
An illustrative regression example with covariate shift. (a) The probability density functions of the training and test input points and their ratio. (b) Function learned by OLS (AIWLS with $\gamma = 0$). (c) Function learned by AIWLS with $\gamma = 0.5$. (d) Function learned by IWLS (AIWLS with $\gamma = 1$). In (b)-(d), the solid line is the learning target function $f(x)$, the dashed line is a learned function $\hat{f}(x)$, $\circ$ are training samples, and $\times$ are test samples. Note that the test samples are not used for function learning.
Let us consider a binary classification problem in the two-dimensional input
space ($d = 2$). We define the class-posterior probabilities, given input $x$, by

$$p(y = +1 \mid x) = \frac{1 + \tanh\left( x^{(1)} + \min(0, x^{(2)}) \right)}{2}, \tag{2.8}$$

where $x = (x^{(1)}, x^{(2)})^\top$ and

$$p(y = -1 \mid x) = 1 - p(y = +1 \mid x).$$

The optimal decision boundary, that is, the set of all $x$ such that

$$p(y = +1 \mid x) = p(y = -1 \mid x),$$

is illustrated in figure 2.9a.
Let the training and test input densities $p_{\mathrm{tr}}(x)$ and $p_{\mathrm{te}}(x)$ each be an equal-weight mixture of two multivariate Gaussian densities (equations 2.9 and 2.10), where $N(x; \mu, \Sigma)$ denotes the multivariate Gaussian density with mean $\mu$ and
covariance matrix $\Sigma$. This setup implies that we are considering a (weak)
extrapolation problem. Contours of the training and test input densities are
illustrated in figure 2.9a.
Let the number of training samples be $n_{\mathrm{tr}} = 500$. We create training input
points $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ following $p_{\mathrm{tr}}(x)$ and training output labels $\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ following
$p(y \mid x = x_i^{\mathrm{tr}})$. Similarly, let the number of test samples be $n_{\mathrm{te}} = 500$. We create
$n_{\mathrm{te}}$ test input points $\{x_i^{\mathrm{te}}\}_{i=1}^{n_{\mathrm{te}}}$ following $p_{\mathrm{te}}(x)$ and test output labels $\{y_i^{\mathrm{te}}\}_{i=1}^{n_{\mathrm{te}}}$
following $p(y \mid x = x_i^{\mathrm{te}})$.

We use the linear-in-input model for function learning:

$$\hat{f}(x; \theta) = \theta_1 x^{(1)} + \theta_2 x^{(2)} + \theta_3,$$
and determine the parameter $\theta$ by AIWFDA (see section 2.3.1).

Figure 2.9b depicts an example of realizations of training and test samples,
as well as decision boundaries obtained by AIWFDA with $\gamma = 0$, $0.5$, and $1$. In this
particular realization, $\gamma = 0.5$ or $1$ works better than $\gamma = 0$.
Figure 2.9
An illustrative classification example with covariate shift. (a) Optimal decision boundary (the thick solid line) and contours of the training and test input densities (thin solid lines). (b) Optimal decision boundary (solid line) and learned boundaries (dashed lines); $\circ$ and $\times$ denote the positive and negative training samples, while $\square$ and $+$ denote the positive and negative test samples. Note that the test samples are not given in the training phase; they are plotted in the figure for illustration purposes.
2.5 Summary and Discussion
Most of the standard machine learning methods assume that the data generat­
ing mechanism does not change over time (e.g., [195, 193, 42, 74, 141, 21]).
However, this fundamental prerequisite is often violated in many practical
problems, such as off-policy reinforcement learning [174, 66, 67, 4], spam fil­
tering [18], speech recognition [204], audio tagging [198], natural language
processing [186], bioinformatics [10, 25], face recognition [188], and brain­
computer interfacing [201,160, 102]. When training and test distributions are
different, ordinary estimators are biased and therefore good generalization
performance may not be obtained.
If the training and test distributions have nothing in common, we may not be
able to learn anything about the test distribution from training samples. Thus,
we need a reasonable assumption that links the training and test distributions.
In this chapter, we focused on a specific type of distribution change called
covariate shift [145], where the input distribution changes but the conditional
distribution of outputs given inputs does not change; extrapolation would be
a typical example of covariate shift.
We have seen that the use of importance weights contributes to reduc­
ing the bias caused by covariate shift and allows us to obtain consistent
estimators even under covariate shift (section 2.1.1). However, a naive use
of the importance weights does not necessarily produce reliable solutions
since importance-weighted estimators tend to have large variance. This is
because training samples which are not "typical" in the test distribution are
downweighted, and thus the effective number of training samples becomes
smaller. This is the price we have to pay for bias reduction. In order to mit­
igate this problem, we introduced stabilization techniques in sections 2.1.2
and 2.1.3: flattening the importance weights and regularizing the solution. The
importance-weighting techniques have a wide range of applicability, and any
learning methods can be adjusted in a systematic manner as long as they are
based on the empirical error (or the log-likelihood). We have shown exam­
ples of such learning methods in regression (section 2.2) and in classification
(section 2.3). Numerical results illustrating the behavior of these methods were
presented in section 2.4.
The introduction of stabilizers such as the flattening parameter and the
regularization parameter raises another important issue to be addressed: how
to optimally determine these trade-off parameters. This is a model selection
problem and will be discussed in detail in chapter 3.
3
Model Selection
As shown in chapter 2, adaptive importance-weighted learning methods are
promising in covariate shift scenarios, given that the flattening parameter $\gamma$
is chosen appropriately. Although $\gamma = 0.5$ worked well for both the regression
and the classification scenarios in the numerical examples in section 2.4, $\gamma =
0.5$ is not always the best choice; a good value of $\gamma$ may depend on the learning
target function, the models, the noise level in the training samples, and so on.
Therefore, model selection needs to be appropriately carried out for enhancing
the generalization capability under covariate shift.
The goal of model selection is to determine the model (e.g., basis functions,
the flattening parameter $\gamma$, and the regularization parameter $\lambda$) so that the generalization
error is minimized [1, 108, 2, 182, 142, 135, 35, 3, 136, 144, 195, 45,
118, 96, 85, 193, 165, 161, 159]. The true generalization error is not accessible
since it contains the unknown learning target function. Thus, some generalization
error estimator needs to be used instead. However, standard generalization
error estimators such as cross-validation are heavily biased under covariate
shift, and thus are no longer reliable. In this chapter, we describe generalization
error estimators that possess proper unbiasedness even under covariate
shift.
3.1 Importance-Weighted Akaike Information Criterion
In density estimation problems, the Akaike information criterion (AIC) [2]
is an asymptotically unbiased estimator of the Kullback-Leibler divergence [97]
from the true density to an estimated density, up to a constant term. AIC
can be employed in supervised learning if the supervised learning problem
is formulated as the problem of estimating the conditional density of output
values, given input points. However, the asymptotic unbiasedness of AIC
no longer holds under covariate shift; a variant of AIC, which we refer
to as importance-weighted AIC (IWAIC), is instead asymptotically unbiased
[145]:
$$\mathrm{IWAIC} := \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})}\, \mathrm{loss}\!\left( x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}}, \hat{f}(x_i^{\mathrm{tr}}; \hat{\theta}) \right) + \frac{1}{n_{\mathrm{tr}}} \mathrm{tr}\!\left( \widehat{F} \widehat{G}^{-1} \right), \tag{3.1}$$

where $\hat{\theta}$ is a learned parameter vector, and $\widehat{F}$ and $\widehat{G}$ are the matrices with the
$(\ell, \ell')$-th elements

$$\widehat{F}_{\ell,\ell'} := \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \right)^{\!2} \frac{\partial\, \mathrm{loss}(x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}}, \hat{f}(x_i^{\mathrm{tr}}; \hat{\theta}))}{\partial \theta_\ell}\, \frac{\partial\, \mathrm{loss}(x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}}, \hat{f}(x_i^{\mathrm{tr}}; \hat{\theta}))}{\partial \theta_{\ell'}},$$

$$\widehat{G}_{\ell,\ell'} := \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})}\, \frac{\partial^2\, \mathrm{loss}(x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}}, \hat{f}(x_i^{\mathrm{tr}}; \hat{\theta}))}{\partial \theta_\ell\, \partial \theta_{\ell'}}.$$

When $p_{\mathrm{tr}}(x) = p_{\mathrm{te}}(x)$, IWAIC is reduced to

$$\frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \mathrm{loss}\!\left( x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}}, \hat{f}(x_i^{\mathrm{tr}}; \hat{\theta}) \right) + \frac{1}{n_{\mathrm{tr}}} \mathrm{tr}\!\left( \widehat{F}' \widehat{G}'^{-1} \right),$$

where $\widehat{F}'$ and $\widehat{G}'$ are the matrices with the $(\ell, \ell')$-th elements

$$\widehat{F}'_{\ell,\ell'} := \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \frac{\partial\, \mathrm{loss}(x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}}, \hat{f}(x_i^{\mathrm{tr}}; \hat{\theta}))}{\partial \theta_\ell}\, \frac{\partial\, \mathrm{loss}(x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}}, \hat{f}(x_i^{\mathrm{tr}}; \hat{\theta}))}{\partial \theta_{\ell'}},$$

$$\widehat{G}'_{\ell,\ell'} := \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \frac{\partial^2\, \mathrm{loss}(x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}}, \hat{f}(x_i^{\mathrm{tr}}; \hat{\theta}))}{\partial \theta_\ell\, \partial \theta_{\ell'}}.$$
This is called the Takeuchi information criterion (TIC) [182]. Furthermore,
when the model $\hat{f}(x; \theta)$ is correctly specified, $\widehat{F}'$ agrees with $\widehat{G}'$. Then TIC is
reduced to

$$\mathrm{AIC} := \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \mathrm{loss}\!\left( x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}}, \hat{f}(x_i^{\mathrm{tr}}; \hat{\theta}) \right) + \frac{\dim(\hat{\theta})}{n_{\mathrm{tr}}},$$
where $\dim(\hat{\theta})$ denotes the dimensionality of the parameter vector $\hat{\theta}$. This is the
original Akaike information criterion [2].

Thus, IWAIC can be regarded as a natural extension of AIC. Note that in
the derivation of IWAIC (and also of TIC and AIC), proper regularity conditions
are assumed (see, e.g., [197]). This excludes the use of nonsmooth loss
functions such as the 0/1-loss, or nonidentifiable models such as multilayer
perceptrons and Gaussian mixture models [196].
Let us consider the following Gaussian linear regression scenario:

• The conditional density $p(y \mid x)$ is Gaussian with mean $f(x)$ and
variance $\sigma^2$:

$$p(y \mid x) = N(y; f(x), \sigma^2),$$

where $N(y; \mu, \sigma^2)$ denotes the Gaussian density with mean $\mu$ and
variance $\sigma^2$.

• The linear-in-parameter model (see section 1.3.5.2) is used:

$$\hat{f}(x; \theta) = \sum_{\ell=1}^{b} \theta_\ell \varphi_\ell(x),$$

where $b$ is the number of parameters and $\{\varphi_\ell(x)\}_{\ell=1}^{b}$ are fixed, linearly
independent functions.

• The parameter is learned by a linear learning method; that is, the learned
parameter $\hat{\theta}$ is given by

$$\hat{\theta} = L y^{\mathrm{tr}},$$

where $L$ is a $b \times n_{\mathrm{tr}}$ learning matrix that is independent of the training output
noise contained in $y^{\mathrm{tr}}$, and

$$y^{\mathrm{tr}} := \left( y_1^{\mathrm{tr}}, y_2^{\mathrm{tr}}, \ldots, y_{n_{\mathrm{tr}}}^{\mathrm{tr}} \right)^\top.$$

• The generalization error is defined as

$$\mathrm{Gen} := \mathop{\mathbb{E}}_{x^{\mathrm{te}}} \mathop{\mathbb{E}}_{y^{\mathrm{te}}} \left[ \left( \hat{f}(x^{\mathrm{te}}; \hat{\theta}) - y^{\mathrm{te}} \right)^2 \right];$$

that is, the squared loss is used.
To make the following discussion simple, we subtract constant terms from
the generalization error:

$$\mathrm{Gen}' := \mathrm{Gen} - C' - \sigma^2, \tag{3.2}$$

where $C'$ is defined as

$$C' := \mathop{\mathbb{E}}_{x^{\mathrm{te}}} \left[ \left( f(x^{\mathrm{te}}) \right)^2 \right].$$

Note that $C'$ is independent of the learned function.
Under this setup, the generalization error estimator based on IWAIC is given
as follows [145]:

$$\widehat{\mathrm{Gen}}_{\mathrm{IWAIC}} = \langle \widehat{U} L y^{\mathrm{tr}},\, L y^{\mathrm{tr}} \rangle - 2 \langle \widehat{U} L y^{\mathrm{tr}},\, L_I y^{\mathrm{tr}} \rangle + 2\, \mathrm{tr}\!\left( \widehat{U} L_I Q L_I^\top \right),$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product, $L_I := (X^{\mathrm{tr}\top} W X^{\mathrm{tr}})^{-1} X^{\mathrm{tr}\top} W$ is the importance-weighted least-squares learning matrix, $W$ is the diagonal matrix with the $i$-th diagonal element $p_{\mathrm{te}}(x_i^{\mathrm{tr}})/p_{\mathrm{tr}}(x_i^{\mathrm{tr}})$, $X^{\mathrm{tr}}$ is the $n_{\mathrm{tr}} \times b$ design matrix with the $(i, \ell)$-th element $\varphi_\ell(x_i^{\mathrm{tr}})$,

$$\widehat{U} := \frac{1}{n_{\mathrm{tr}}} X^{\mathrm{tr}\top} W X^{\mathrm{tr}},$$

and $Q$ is the diagonal matrix with the $i$-th diagonal element

$$Q_{i,i} := \left( y_i^{\mathrm{tr}} - \hat{f}(x_i^{\mathrm{tr}}; \hat{\theta}) \right)^2.$$
IWAIC is asymptotically unbiased even under covariate shift; more precisely,
IWAIC satisfies

$$\mathop{\mathbb{E}}_{\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \mathop{\mathbb{E}}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \left[ \widehat{\mathrm{Gen}}_{\mathrm{IWAIC}} \right] = \mathop{\mathbb{E}}_{\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \mathop{\mathbb{E}}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \left[ \mathrm{Gen}' \right] + o\!\left( n_{\mathrm{tr}}^{-1} \right), \tag{3.3}$$

where $\mathbb{E}_{\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}}$ denotes the expectations over $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ drawn i.i.d. from
$p_{\mathrm{tr}}(x)$, and $\mathbb{E}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}}$ denotes the expectations over $\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$, each drawn from
$p(y \mid x = x_i^{\mathrm{tr}})$.
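As a concrete sketch of equation 3.1, the following numpy code computes IWAIC for a linear model under the Gaussian squared-loss setup (our implementation and sign conventions; here $\mathrm{loss}(x, y, \hat{f}) = (\hat{f} - y)^2 / (2\sigma^2)$ plays the role of the negative log-likelihood, and the noise variance is assumed known):

```python
import numpy as np

def iwaic(Phi, y, w, theta, sigma):
    """IWAIC (equation 3.1) for a linear-in-parameter model with Gaussian
    squared loss, loss = (f - y)^2 / (2 sigma^2); sigma is assumed known."""
    n = len(y)
    r = Phi @ theta - y                          # residuals f(x_i) - y_i
    loss = r ** 2 / (2.0 * sigma ** 2)
    grad = (r / sigma ** 2)[:, None] * Phi       # per-sample gradient of loss
    F = (grad * (w ** 2)[:, None]).T @ grad / n  # note the squared ratio in F
    G = (Phi * (w / sigma ** 2)[:, None]).T @ Phi / n
    return float(np.mean(w * loss) + np.trace(F @ np.linalg.inv(G)) / n)
```

With $w \equiv 1$ this computes TIC, and for a correctly specified model the penalty $\mathrm{tr}(\widehat{F}\widehat{G}^{-1})$ approaches $\dim(\theta)$, recovering AIC.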
3.2 Importance-Weighted Subspace Information Criterion
IWAIC has nice theoretical properties such as asymptotic lack of bias (equation
3.3). However, there are two issues for possible improvement: input
independence and model specification.
3.2.1 Input Dependence vs. Input Independence in Generalization Error Analysis
The first issue for improving IWAIC is the way lack of bias is evaluated. In
IWAIC, unbiasedness is evaluated in terms of the expectations over both training
input points and training output values (see equation 3.3). However, in
practice we are given only a single training set $\{(x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}})\}_{i=1}^{n_{\mathrm{tr}}}$, and ideally we
want to predict the single-trial generalization error, that is, the generalization
error for a single realization of the training set at hand. From this viewpoint,
we do not want to average out the random variables; we want to plug the real­
ization of the random variables into the generalization error and evaluate the
realized value of the generalization error.
However, we may not be able to avoid taking the expectation over the training
output values $\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ (i.e., the expectation over the training output noise),
since the training output noise is inaccessible. In contrast, the location of the
training input points $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ is accessible. Therefore, it would be advantageous
to predict the generalization error without taking the expectation over the
training input points, that is, to predict the conditional expectation of the generalization
error, given training input points. Below, we refer to estimating the
conditional expectation of the generalization error as input-dependent analysis
of the generalization error. On the other hand, estimating the full expectation
of the generalization error is referred to as input-independent analysis of the
generalization error.
In order to illustrate a possible advantage of the input-dependent approach,
let us consider a simple model selection scenario where we have only one
training sample $(x, y)$ (see figure 3.1). The solid curves in figure 3.1a depict
$G_{M_1}(y \mid x)$, the generalization error for a model $M_1$ as a function of the (noisy)
training output value $y$, given a training input point $x$. The three solid curves
correspond to the cases where the realization of the training input point $x$ is $x'$,
$x''$, and $x'''$, respectively. The value of the generalization error for the model $M_1$
in the input-independent approach is depicted by the dash-dotted line, where
the expectation is taken over both the training input point $x$ and the training
output value $y$ (this corresponds to the mean of the values of the three
solid curves). The values of the generalization error in the input-dependent
approach are depicted by the dotted lines, where the expectation is taken over
only the training output value $y$, conditioned on $x = x'$, $x''$, and $x'''$, respectively
(this corresponds to the mean of the values of each solid curve). The
graph in figure 3.1b depicts the generalization errors for a model $M_2$ in the
same manner.
Figure 3.1
Schematic illustrations of the input-dependent and input-independent approaches to generalization error estimation. (a) Generalization error $G_{M_1}(y \mid x)$ for model $M_1$. (b) Generalization error $G_{M_2}(y \mid x)$ for model $M_2$.
In the input-independent framework, the model $M_1$ is judged to be better
than $M_2$, regardless of the realization of the training input point, because the
dash-dotted line in figure 3.1a is lower than that in figure 3.1b. However, $M_2$ is
actually better than $M_1$ if $x''$ is realized as $x$. In the input-dependent framework,
the goodness of the model is adaptively evaluated, depending on the realization
of the training input point $x$. This illustrates the possibility that input-dependent
analysis of the generalization error allows one to choose a better model than
input-independent analysis.
3.2.2 Approximately Correct Models
The second issue for improving IWAIC is model misspecification. IWAIC has
the same asymptotic accuracy for all models (see equation 3.3). However,
in practice it may not be so difficult to distinguish good models from com­
pletely useless (i.e., heavily misspecified) models since the magnitude of the
generalization error is significantly different. This means that we are essen­
tially interested in choosing a very good model from a set of reasonably good
models. In this scenario, if a generalization error estimator is more accurate
for better models (in other words, approximately correct models), the model
selection performance will be further improved.
In order to formalize the concept of "approximately correct models," let us
focus on the following setup:
• The squared loss (see section 1.3.2) is used:

$$\mathrm{loss}(x, y, \hat{y}) = (\hat{y} - y)^2.$$
• Noise in training and test output values is assumed to be i.i.d. with mean
zero and variance $\sigma^2$. Then the generalization error is expressed as follows
(see section 1.3.3):

$$\begin{aligned}
\mathrm{Gen} &:= \mathop{\mathbb{E}}_{x^{\mathrm{te}}} \mathop{\mathbb{E}}_{y^{\mathrm{te}}} \left[ \left( \hat{f}(x^{\mathrm{te}}; \hat{\theta}) - y^{\mathrm{te}} \right)^2 \right] \\
&= \mathop{\mathbb{E}}_{x^{\mathrm{te}}} \mathop{\mathbb{E}}_{y^{\mathrm{te}}} \left[ \left( \hat{f}(x^{\mathrm{te}}; \hat{\theta}) - f(x^{\mathrm{te}}) + f(x^{\mathrm{te}}) - y^{\mathrm{te}} \right)^2 \right] \\
&= \mathop{\mathbb{E}}_{x^{\mathrm{te}}} \left[ \left( \hat{f}(x^{\mathrm{te}}; \hat{\theta}) - f(x^{\mathrm{te}}) \right)^2 \right] + \mathop{\mathbb{E}}_{x^{\mathrm{te}}} \mathop{\mathbb{E}}_{y^{\mathrm{te}}} \left[ \left( f(x^{\mathrm{te}}) - y^{\mathrm{te}} \right)^2 \right] \\
&\qquad + 2 \mathop{\mathbb{E}}_{x^{\mathrm{te}}} \left[ \left( \hat{f}(x^{\mathrm{te}}; \hat{\theta}) - f(x^{\mathrm{te}}) \right) \mathop{\mathbb{E}}_{y^{\mathrm{te}}} \left[ f(x^{\mathrm{te}}) - y^{\mathrm{te}} \right] \right] \\
&= \mathop{\mathbb{E}}_{x^{\mathrm{te}}} \left[ \left( \hat{f}(x^{\mathrm{te}}; \hat{\theta}) - f(x^{\mathrm{te}}) \right)^2 \right] + \sigma^2,
\end{aligned}$$
where $\mathbb{E}_{x^{\mathrm{te}}}$ denotes the expectation over $x^{\mathrm{te}}$ drawn from $p_{\mathrm{te}}(x)$, and $\mathbb{E}_{y^{\mathrm{te}}}$ denotes
the expectation over $y^{\mathrm{te}}$ drawn from $p(y \mid x = x^{\mathrm{te}})$.

• A linear-in-parameter model (see section 1.3.5.2) is used:

$$\hat{f}(x; \theta) = \sum_{\ell=1}^{b} \theta_\ell \varphi_\ell(x),$$

where $b$ is the number of parameters and $\{\varphi_\ell(x)\}_{\ell=1}^{b}$ are fixed, linearly
independent functions.
Let $\theta^*$ be the optimal parameter under the above-defined generalization
error:

$$\theta^* := \operatorname*{argmin}_{\theta} \mathrm{Gen}.$$

Then the learning target function $f(x)$ can be decomposed as

$$f(x) = \hat{f}(x; \theta^*) + \delta r(x), \tag{3.4}$$

where $\delta r(x)$ is the residual. $r(x)$ is orthogonal to the basis functions $\{\varphi_\ell(x)\}_{\ell=1}^{b}$
under $p_{\mathrm{te}}(x)$:

$$\mathop{\mathbb{E}}_{x^{\mathrm{te}}} \left[ r(x^{\mathrm{te}}) \varphi_\ell(x^{\mathrm{te}}) \right] = 0 \quad \text{for } \ell = 1, 2, \ldots, b.$$

The function $r(x)$ governs the nature of the model error, and $\delta$ is the possible
magnitude of this error. In order to separate these two factors, we impose the
following normalization condition on $r(x)$:

$$\mathop{\mathbb{E}}_{x^{\mathrm{te}}} \left[ \left( r(x^{\mathrm{te}}) \right)^2 \right] = 1.$$

The above decomposition is illustrated in figure 3.2. An "approximately correct
model" refers to a model with small $\delta$ (but $\delta \neq 0$). Rather informally, a
good model should have a small model error $\delta$.
3.2.3 Input-Dependent Analysis of Generalization Error
Based on input-dependent analysis (section 3.2.1) and approximately cor­
rect models (section 3.2.2), a generalization error estimator called the sub­
space information criterion (SIC) [165, 161] has been developed for linear
regression, and has been extended to be able to cope with covariate shift
[162]. Here we review the derivation of this criterion, which we refer to
as importance-weighted SIC (IWSIC). More specifically, we first derive the
basic form of IWSIC, "pre-IWSIC," in equation 3.8. Then versions of IWSIC
for various learning methods are derived, including linear learning (equation
3.11) and affine learning (equation 3.11), smooth nonlinear learning
(equation 3.17), and general nonlinear learning (equation 3.18).

Figure 3.2
Orthogonal decomposition of $f(x)$ into $\hat{f}(x; \theta^*)$ in the span of $\{\varphi_\ell(x)\}_{\ell=1}^{b}$ and the residual $\delta r(x)$. An approximately correct model is a model with small model error $\delta$.
3.2.3.1 Preliminaries Since $\hat{f}(x; \theta)$ and $r(x)$ are orthogonal to one another
(see figure 3.2), $\mathrm{Gen}'$ defined by equation 3.2 can be written as

$$\begin{aligned}
\mathrm{Gen}' &= \mathop{\mathbb{E}}_{x^{\mathrm{te}}} \left[ \hat{f}(x^{\mathrm{te}}; \hat{\theta})^2 \right] - 2 \mathop{\mathbb{E}}_{x^{\mathrm{te}}} \left[ \hat{f}(x^{\mathrm{te}}; \hat{\theta}) f(x^{\mathrm{te}}) \right] \\
&= \mathop{\mathbb{E}}_{x^{\mathrm{te}}} \left[ \hat{f}(x^{\mathrm{te}}; \hat{\theta})^2 \right] - 2 \mathop{\mathbb{E}}_{x^{\mathrm{te}}} \left[ \hat{f}(x^{\mathrm{te}}; \hat{\theta}) \left( \hat{f}(x^{\mathrm{te}}; \theta^*) + \delta r(x^{\mathrm{te}}) \right) \right] \\
&= \langle U \hat{\theta}, \hat{\theta} \rangle - 2 \langle U \hat{\theta}, \theta^* \rangle,
\end{aligned} \tag{3.5}$$

where $U$ is the $b \times b$ matrix with the $(\ell, \ell')$-th element

$$U_{\ell,\ell'} := \mathop{\mathbb{E}}_{x^{\mathrm{te}}} \left[ \varphi_\ell(x^{\mathrm{te}}) \varphi_{\ell'}(x^{\mathrm{te}}) \right].$$

A basic concept of IWSIC is to replace the unknown $\theta^*$ in equation
3.5 with the importance-weighted least-squares (IWLS) estimator $\hat{\theta}_I$ (see
section 2.2.1):

$$\hat{\theta}_I := L_I y^{\mathrm{tr}}, \qquad L_I := \left( X^{\mathrm{tr}\top} W_1 X^{\mathrm{tr}} \right)^{-1} X^{\mathrm{tr}\top} W_1, \tag{3.6}$$

where

$$y^{\mathrm{tr}} := \left( y_1^{\mathrm{tr}}, y_2^{\mathrm{tr}}, \ldots, y_{n_{\mathrm{tr}}}^{\mathrm{tr}} \right)^\top,$$
$X^{\mathrm{tr}}$ is the $n_{\mathrm{tr}} \times b$ design matrix with the $(i, \ell)$-th element

$$X^{\mathrm{tr}}_{i,\ell} := \varphi_\ell(x_i^{\mathrm{tr}}),$$

and $W_1$ is the diagonal matrix with the $i$-th diagonal element

$$[W_1]_{i,i} := \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})}.$$
However, simply replacing $\theta^*$ with $\hat{\theta}_I$ causes a bias, since the sample $y^{\mathrm{tr}}$ is also
used for obtaining the target parameter $\hat{\theta}$. The bias is expressed as

$$\mathop{\mathbb{E}}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \left[ \langle U \hat{\theta}, \hat{\theta}_I \rangle - \langle U \hat{\theta}, \theta^* \rangle \right],$$

where $\mathbb{E}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}}$ denotes the expectations over $\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$, each of which is drawn
from $p(y \mid x = x_i^{\mathrm{tr}})$. Below, we study the behavior of this bias.
3.2.3.2 Approximate Bias Correction and pre-IWSIC The output values
$\{\hat{f}(x_i^{\mathrm{tr}}; \theta^*)\}_{i=1}^{n_{\mathrm{tr}}}$ can be expressed as

$$\left( \hat{f}(x_1^{\mathrm{tr}}; \theta^*), \ldots, \hat{f}(x_{n_{\mathrm{tr}}}^{\mathrm{tr}}; \theta^*) \right)^\top = X^{\mathrm{tr}} \theta^*.$$

Let $r^{\mathrm{tr}}$ be the $n_{\mathrm{tr}}$-dimensional vector defined by

$$r^{\mathrm{tr}} := \left( r(x_1^{\mathrm{tr}}), r(x_2^{\mathrm{tr}}), \ldots, r(x_{n_{\mathrm{tr}}}^{\mathrm{tr}}) \right)^\top,$$

where $r(x)$ is the residual function (see figure 3.2). Then the training output
vector $y^{\mathrm{tr}}$ can be expressed as

$$y^{\mathrm{tr}} = X^{\mathrm{tr}} \theta^* + \delta r^{\mathrm{tr}} + \epsilon^{\mathrm{tr}},$$

where $\delta$ is the model error (see figure 3.2), and

$$\epsilon^{\mathrm{tr}} := \left( \epsilon_1^{\mathrm{tr}}, \epsilon_2^{\mathrm{tr}}, \ldots, \epsilon_{n_{\mathrm{tr}}}^{\mathrm{tr}} \right)^\top$$

is the vector of training output noise. Due to the i.i.d. noise assumption, $\epsilon^{\mathrm{tr}}$
satisfies

$$\mathop{\mathbb{E}}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \left[ \epsilon^{\mathrm{tr}} \right] = 0_{n_{\mathrm{tr}}}, \qquad \mathop{\mathbb{E}}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \left[ \epsilon^{\mathrm{tr}} \epsilon^{\mathrm{tr}\top} \right] = \sigma^2 I_{n_{\mathrm{tr}}},$$

where $0_{n_{\mathrm{tr}}}$ denotes the $n_{\mathrm{tr}}$-dimensional vector with all zeros and $I_{n_{\mathrm{tr}}}$ is the $n_{\mathrm{tr}}$-dimensional
identity matrix.

For a parameter vector $\hat{\theta}$ learned from $y^{\mathrm{tr}}$, the bias caused by replacing $\theta^*$
with $\hat{\theta}_I$ can be expressed as

$$\mathop{\mathbb{E}}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \left[ \langle U \hat{\theta}, \hat{\theta}_I \rangle - \langle U \hat{\theta}, \theta^* \rangle \right] = \mathop{\mathbb{E}}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \left[ \delta \langle U \hat{\theta}, L_I r^{\mathrm{tr}} \rangle \right] + \mathop{\mathbb{E}}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \left[ \langle U \hat{\theta}, L_I \epsilon^{\mathrm{tr}} \rangle \right], \tag{3.7}$$

where $\mathbb{E}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}}$ denotes the expectations over $\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$, each of which is drawn
from $p(y \mid x = x_i^{\mathrm{tr}})$. In the above derivation, we used the fact that $L_I X^{\mathrm{tr}} = I_b$,
which follows from equation 3.6. Based on the above expression, we define
"pre-IWSIC" as follows:

$$\widehat{\mathrm{Gen}}_{\mathrm{pre\text{-}IWSIC}} := \langle U \hat{\theta}, \hat{\theta} \rangle - 2 \langle U \hat{\theta}, \hat{\theta}_I \rangle + 2 \mathop{\mathbb{E}}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \left[ \langle U \hat{\theta}, L_I \epsilon^{\mathrm{tr}} \rangle \right]; \tag{3.8}$$

that is, $\mathbb{E}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} [\delta \langle U \hat{\theta}, L_I r^{\mathrm{tr}} \rangle]$ in equation 3.7 was ignored. Below, we show
that pre-IWSIC is asymptotically unbiased even under covariate shift. More
precisely, it satisfies

$$\mathop{\mathbb{E}}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \left[ \widehat{\mathrm{Gen}}_{\mathrm{pre\text{-}IWSIC}} \right] = \mathop{\mathbb{E}}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \left[ \mathrm{Gen}' \right] + \delta\, \mathcal{O}_p\!\left( n_{\mathrm{tr}}^{-1/2} \right), \tag{3.9}$$

where $\mathcal{O}_p$ denotes the asymptotic order in probability with respect to the
distribution of $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$.
[Proof of equation 3.9:] $L_I r^{\mathrm{tr}}$ is expressed as

$$L_I r^{\mathrm{tr}} = \left( \frac{1}{n_{\mathrm{tr}}} X^{\mathrm{tr}\top} W_1 X^{\mathrm{tr}} \right)^{-1} \frac{1}{n_{\mathrm{tr}}} X^{\mathrm{tr}\top} W_1 r^{\mathrm{tr}}.$$

Then the law of large numbers [131] asserts that

$$\operatorname*{plim}_{n_{\mathrm{tr}} \to \infty} \left[ \frac{1}{n_{\mathrm{tr}}} X^{\mathrm{tr}\top} W_1 X^{\mathrm{tr}} \right]_{\ell,\ell'} = \int \frac{p_{\mathrm{te}}(x)}{p_{\mathrm{tr}}(x)} \varphi_\ell(x) \varphi_{\ell'}(x)\, p_{\mathrm{tr}}(x)\, \mathrm{d}x = \int p_{\mathrm{te}}(x) \varphi_\ell(x) \varphi_{\ell'}(x)\, \mathrm{d}x < \infty,$$

where "plim" denotes convergence in probability. This implies that

$$\left( \frac{1}{n_{\mathrm{tr}}} X^{\mathrm{tr}\top} W_1 X^{\mathrm{tr}} \right)^{-1} = \mathcal{O}_p(1).$$

On the other hand, when $n_{\mathrm{tr}}$ is large, the central limit theorem [131] asserts that

$$\left[ \frac{1}{n_{\mathrm{tr}}} X^{\mathrm{tr}\top} W_1 r^{\mathrm{tr}} \right]_\ell = \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \frac{p_{\mathrm{te}}(x_i^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_i^{\mathrm{tr}})} \varphi_\ell(x_i^{\mathrm{tr}}) r(x_i^{\mathrm{tr}}) = \mathcal{O}_p\!\left( n_{\mathrm{tr}}^{-1/2} \right),$$

where we use the fact that $\varphi_\ell(x)$ and $r(x)$ are orthogonal to one another under
$p_{\mathrm{te}}(x)$ (see figure 3.2). Then we have

$$L_I r^{\mathrm{tr}} = \mathcal{O}_p\!\left( n_{\mathrm{tr}}^{-1/2} \right).$$

This implies that when $\hat{\theta}$ is convergent,

$$\delta \langle U \hat{\theta}, L_I r^{\mathrm{tr}} \rangle = \delta\, \mathcal{O}_p\!\left( n_{\mathrm{tr}}^{-1/2} \right),$$

and therefore we have equation 3.9.
Note that IWSIC's asymptotic lack of bias shown above is in terms of the
expectation only over the training output values $\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$; training input points
$\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ are fixed (i.e., input-dependent analysis; cf. input-independent analysis,
explained in section 3.1). The asymptotic order of the bias of pre-IWSIC is
proportional to the model error $\delta$, implying that pre-IWSIC is more accurate
for "better" models; if the model is correct (i.e., $\delta = 0$), pre-IWSIC is exactly
unbiased with finite samples.
Below, for types of parameter learning including linear, affine, smooth nonlinear,
and general nonlinear, we show how to approximate the third term of
pre-IWSIC,

$$\mathop{\mathbb{E}}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \left[ \langle U \hat{\theta}, L_I \epsilon^{\mathrm{tr}} \rangle \right],$$

from the training samples $\{(x_i^{\mathrm{tr}}, y_i^{\mathrm{tr}})\}_{i=1}^{n_{\mathrm{tr}}}$. In the above equation, $\mathbb{E}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}}$ essentially
means the expectation over the training output noise $\{\epsilon_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$. Below, we stick
to using $\mathbb{E}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}}$ for expressing the expectation over $\{\epsilon_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$.
3.2.3.3 Linear Learning First, let us consider linear learning; that is, the
learned parameter $\hat{\theta}$ is given by

$$\hat{\theta} = L y^{\mathrm{tr}}, \tag{3.10}$$

where $L$ is a $b \times n_{\mathrm{tr}}$ learning matrix which is independent of the training output
noise. Linear learning includes adaptive importance-weighted least squares (see
section 2.2.1) and importance-weighted least squares with the squared regularizer
(see section 2.1.3).

For linear learning (equation 3.10), the independence of $L$ from the noise yields

$$\mathop{\mathbb{E}}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \left[ \langle U \hat{\theta}, L_I \epsilon^{\mathrm{tr}} \rangle \right] = \mathop{\mathbb{E}}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \left[ \langle U L \left( X^{\mathrm{tr}} \theta^* + \delta r^{\mathrm{tr}} + \epsilon^{\mathrm{tr}} \right), L_I \epsilon^{\mathrm{tr}} \rangle \right] = \sigma^2\, \mathrm{tr}\!\left( U L L_I^\top \right),$$

where $\sigma^2$ is the unknown noise variance. We approximate the noise variance
$\sigma^2$ by

$$\hat{\sigma}^2 := \frac{\left\| y^{\mathrm{tr}} - X^{\mathrm{tr}} L_O y^{\mathrm{tr}} \right\|^2}{n_{\mathrm{tr}} - b},$$

where $L_O$ is the learning matrix of the ordinary least-squares estimator:

$$L_O := \left( X^{\mathrm{tr}\top} X^{\mathrm{tr}} \right)^{-1} X^{\mathrm{tr}\top}.$$

Then IWSIC for linear learning is given by

$$\widehat{\mathrm{Gen}}_{\mathrm{IWSIC}} := \langle U \hat{\theta}, \hat{\theta} \rangle - 2 \langle U \hat{\theta}, \hat{\theta}_I \rangle + 2 \hat{\sigma}^2\, \mathrm{tr}\!\left( U L L_I^\top \right). \tag{3.11}$$
When the model is correctly specified (i.e., $\delta = 0$), $\hat{\sigma}^2$ is an unbiased estimator
of $\sigma^2$ with finite samples. However, for misspecified models (i.e., $\delta \neq 0$), it
is biased even asymptotically:

$$\operatorname*{plim}_{n_{\mathrm{tr}} \to \infty} \hat{\sigma}^2 \neq \sigma^2.$$

On the other hand, $\mathrm{tr}(U L L_I^\top) = \mathcal{O}_p(n_{\mathrm{tr}}^{-1})$ yields

$$2 \hat{\sigma}^2\, \mathrm{tr}\!\left( U L L_I^\top \right) = 2 \sigma^2\, \mathrm{tr}\!\left( U L L_I^\top \right) + \mathcal{O}_p\!\left( n_{\mathrm{tr}}^{-1} \right),$$

implying that even for misspecified models, IWSIC still satisfies

$$\mathop{\mathbb{E}}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \left[ \widehat{\mathrm{Gen}}_{\mathrm{IWSIC}} \right] = \mathop{\mathbb{E}}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \left[ \mathrm{Gen}' \right] + \delta\, \mathcal{O}_p\!\left( n_{\mathrm{tr}}^{-1/2} \right). \tag{3.12}$$
In section 8.2.4, we introduce another input-dependent estimator of the generalization
error under the active learning setup; it is interesting to note that
the constant terms ignored above (see the definition of $\mathrm{Gen}'$ in equation 3.2)
and the constant terms ignored in active learning (equation 8.2) are different.
The appearance of IWSIC above rather resembles IWAIC under linear
regression scenarios:

$$\widehat{\mathrm{Gen}}_{\mathrm{IWAIC}} = \langle \widehat{U} \hat{\theta}, \hat{\theta} \rangle - 2 \langle \widehat{U} \hat{\theta}, \hat{\theta}_I \rangle + 2\, \mathrm{tr}\!\left( \widehat{U} L_I Q L_I^\top \right),$$

where

$$\widehat{U} := \frac{1}{n_{\mathrm{tr}}} X^{\mathrm{tr}\top} W_1 X^{\mathrm{tr}},$$

and $Q$ is the diagonal matrix with the $i$-th diagonal element

$$Q_{i,i} := \left( y_i^{\mathrm{tr}} - \hat{f}(x_i^{\mathrm{tr}}; \hat{\theta}) \right)^2.$$

Asymptotic unbiasedness of IWAIC in the input-dependent framework is
given by

$$\mathop{\mathbb{E}}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \left[ \widehat{\mathrm{Gen}}_{\mathrm{IWAIC}} \right] = \mathop{\mathbb{E}}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \left[ \mathrm{Gen}' \right] + \mathcal{O}_p\!\left( n_{\mathrm{tr}}^{-1/2} \right);$$

that is, the model error $\delta$ does not affect the speed of convergence. This difference
is significant when we are comparing the performance of "good" models
(with small $\delta$).
3.2.3.4 Affine Learning Next, we consider a simple nonlinear learning
method called affine learning. That is, for a $b \times n_{\mathrm{tr}}$ matrix $L$ and a $b$-dimensional
vector $c$, both of which are independent of the noise $\epsilon^{\mathrm{tr}}$, the
learned parameter vector $\hat{\theta}$ is given by

$$\hat{\theta} = L y^{\mathrm{tr}} + c. \tag{3.13}$$

This includes additive regularization learning [121] with tuning parameter
vector

$$\lambda := \left( \lambda_1, \lambda_2, \ldots, \lambda_{n_{\mathrm{tr}}} \right)^\top,$$

for which the learned parameter vector $\hat{\theta}$ is given by
equation 3.13 with

$$L = \left( X^{\mathrm{tr}\top} X^{\mathrm{tr}} + I_b \right)^{-1} X^{\mathrm{tr}\top}, \qquad c = \left( X^{\mathrm{tr}\top} X^{\mathrm{tr}} + I_b \right)^{-1} X^{\mathrm{tr}\top} \lambda,$$

where $I_b$ denotes the $b$-dimensional identity matrix.
For affine learning (equation 3.13), it holds that

$$\mathop{\mathbb{E}}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \left[ \langle U \hat{\theta}, L_I \epsilon^{\mathrm{tr}} \rangle \right] = \mathop{\mathbb{E}}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}} \left[ \langle U (L y^{\mathrm{tr}} + c), L_I \epsilon^{\mathrm{tr}} \rangle \right] = \sigma^2\, \mathrm{tr}\!\left( U L L_I^\top \right).$$

Thus, we can still use linear IWSIC (equation 3.11) for affine learning without
sacrificing its unbiasedness.
3.2.3.5 Smooth Nonlinear Learning Let us consider a smooth nonlinear learning method. That is, using an almost differentiable [148] operator $L$, $\widehat{\theta}$ is given by

$$\widehat{\theta} = L(y^{\rm tr}). \qquad (3.14)$$

This includes Huber's robust regression (see section 2.2.3), where $\tau\ (\ge 0)$ is a tuning parameter and

$$\rho_\tau(y) := \begin{cases} \dfrac{1}{2}y^{2} & \text{if } |y| \le \tau, \\[4pt] \tau|y| - \dfrac{1}{2}\tau^{2} & \text{if } |y| > \tau. \end{cases}$$

Note that $\rho_\tau(y)$ is twice almost differentiable, which yields a once almost differentiable operator $L$.
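The two branches of $\rho_\tau$ can be written down directly; the following sketch (function and variable names are ours, in Python rather than the authors' MATLAB) illustrates the piecewise definition:

```python
import numpy as np

def huber_rho(y, tau):
    """Huber's rho: quadratic for |y| <= tau, linear beyond (as defined above)."""
    y = np.asarray(y, dtype=float)
    return np.where(np.abs(y) <= tau,
                    0.5 * y ** 2,                    # quadratic branch
                    tau * np.abs(y) - 0.5 * tau ** 2)  # linear branch
```

For example, `huber_rho(1.0, 2.0)` gives 0.5 (quadratic branch) and `huber_rho(3.0, 2.0)` gives 4.0 (linear branch); the two branches join continuously and with continuous first derivative at $|y| = \tau$.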
Let $V$ be the $n_{\rm tr} \times n_{\rm tr}$ matrix with the $(i, i')$-th element

$$V_{i,i'} := \nabla_i \big[\widehat{L}^{\top}\widehat{U}L\big]_{i'}(y^{\rm tr}), \qquad (3.15)$$

where $\nabla_i$ is the partial derivative operator with respect to the $i$-th element of the input $y^{\rm tr}$, and $[\widehat{L}^{\top}\widehat{U}L]_{i'}(y^{\rm tr})$ denotes the $i'$-th element of the output of the vector-valued function $[\widehat{L}^{\top}\widehat{U}L](y^{\rm tr})$.

Suppose that the noise $\epsilon^{\rm tr}$ is Gaussian. Then, for smooth nonlinear learning (equation 3.14), we have

$$\mathbb{E}_{\{y_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}}\big[\langle \widehat{U}\widehat{\theta}, \widehat{L}\epsilon^{\rm tr}\rangle\big] = \sigma^{2}\,\mathrm{tr}(V). \qquad (3.16)$$
This result follows from Stein's identity [148]: for an $n_{\rm tr}$-dimensional i.i.d. Gaussian vector $\epsilon^{\rm tr}$ with mean $0_{n_{\rm tr}}$ and covariance matrix $\sigma^{2} I_{n_{\rm tr}}$, and for any almost differentiable function $h(\cdot): \mathbb{R}^{n_{\rm tr}} \to \mathbb{R}$, it holds that

$$\mathbb{E}\big[\epsilon_{i'}^{\rm tr}\, h(\epsilon^{\rm tr})\big] = \sigma^{2}\, \mathbb{E}\big[\nabla_{i'} h(\epsilon^{\rm tr})\big],$$

where $\epsilon_{i'}^{\rm tr}$ is the $i'$-th element of $\epsilon^{\rm tr}$. If we let

$$h(\epsilon^{\rm tr}) = \big[\widehat{L}^{\top}\widehat{U}L\big]_{i'}(y^{\rm tr}),$$

$h$ is almost differentiable since $L(y^{\rm tr})$ is almost differentiable. Then an elementwise application of Stein's identity to the vector-valued function $[\widehat{L}^{\top}\widehat{U}L](y^{\rm tr})$ establishes equation 3.16.
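Stein's identity is easy to verify numerically. The following sketch checks the one-dimensional case $\mathbb{E}[\epsilon\, h(\epsilon)] = \sigma^2 \mathbb{E}[h'(\epsilon)]$ by Monte Carlo; the choice of test function $h = \tanh$ and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.5
eps = rng.normal(0.0, sigma, size=1_000_000)  # i.i.d. Gaussian noise, mean 0

h = np.tanh                                   # a smooth (hence almost differentiable) test function
dh = lambda e: 1.0 - np.tanh(e) ** 2          # its derivative

lhs = np.mean(eps * h(eps))                   # Monte Carlo estimate of E[eps * h(eps)]
rhs = sigma ** 2 * np.mean(dh(eps))           # Monte Carlo estimate of sigma^2 * E[h'(eps)]
# the two estimates agree up to sampling error
```

The two sides match to within the Monte Carlo error of the sample averages, as the identity predicts.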
Based on this, we define the importance-weighted linearly approximated SIC (IWLASIC) as

$$\widehat{\mathrm{Gen}}_{\mathrm{IWLASIC}} = \langle \widehat{U}\widehat{\theta}, \widehat{\theta}\rangle - 2\langle \widehat{U}\widehat{\theta}, \widehat{L}y^{\rm tr}\rangle + 2\sigma^{2}\,\mathrm{tr}(V), \qquad (3.17)$$

which still maintains the same unbiasedness as in the linear case under the Gaussian noise assumption. It is easy to confirm that equation 3.17 is reduced to the original form (equation 3.11) when $\widehat{\theta}$ is obtained by linear learning (equation 3.10). Thus, IWLASIC may be regarded as a natural extension of the original IWSIC.

Note that IWLASIC (equation 3.17) is defined with the true noise variance $\sigma^{2}$; if $\sigma^{2}$ in IWLASIC is replaced by an estimator $\widehat{\sigma}^{2}$, unbiasedness of IWLASIC can no longer be guaranteed. On the other hand, IWSIC for linear and affine learning (see equation 3.11) is defined with an unbiased noise variance estimator $\widehat{\sigma}^{2}$, but it is still guaranteed to be unbiased.
3.2.3.6 General Nonlinear Learning Finally, let us consider general nonlinear learning methods. That is, for a general nonlinear operator $L$, the learned parameter vector $\widehat{\theta}$ is given by

$$\widehat{\theta} = L(y^{\rm tr}).$$

This includes importance-weighted least squares with the absolute regularizer (see section 2.1.3).
1. Obtain the learned parameter vector $\widehat{\theta}$, using the training samples $\{(x_i^{\rm tr}, y_i^{\rm tr})\}_{i=1}^{n_{\rm tr}}$ as usual.
2. Estimate the noise by $\{\widehat{\epsilon}_i^{\rm tr} \mid \widehat{\epsilon}_i^{\rm tr} = y_i^{\rm tr} - \widehat{f}(x_i^{\rm tr}; \widehat{\theta})\}_{i=1}^{n_{\rm tr}}$.
3. Create bootstrap noise samples $\{\widetilde{\epsilon}_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}$ by sampling with replacement from $\{\widehat{\epsilon}_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}$.
4. Obtain the learned parameter vector $\widetilde{\theta}$ using the bootstrap samples $\{(x_i^{\rm tr}, \widetilde{y}_i^{\rm tr}) \mid \widetilde{y}_i^{\rm tr} = \widehat{f}(x_i^{\rm tr}; \widehat{\theta}) + \widetilde{\epsilon}_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}$.
5. Calculate $\langle \widehat{U}\widetilde{\theta}, \widehat{L}\widetilde{\epsilon}^{\rm tr}\rangle$.
6. Repeat steps 3 to 5 a number of times and output the mean of $\langle \widehat{U}\widetilde{\theta}, \widehat{L}\widetilde{\epsilon}^{\rm tr}\rangle$.

Figure 3.3
Bootstrap procedure in IWBASIC.
For general nonlinear learning, we estimate the third term $\mathbb{E}_{\{y_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}}[\langle \widehat{U}\widehat{\theta}, \widehat{L}\epsilon^{\rm tr}\rangle]$ in pre-IWSIC (equation 3.8) using the bootstrap method [43, 45]:

$$\mathbb{E}_{\{y_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}}\big[\langle \widehat{U}\widehat{\theta}, \widehat{L}\epsilon^{\rm tr}\rangle\big] \approx \widetilde{\mathbb{E}}\big[\langle \widehat{U}\widetilde{\theta}, \widehat{L}\widetilde{\epsilon}^{\rm tr}\rangle\big],$$

where $\widetilde{\mathbb{E}}$ denotes the expectation over the bootstrap replication, and $\widetilde{\theta}$ and $\widetilde{\epsilon}^{\rm tr}$ correspond to the learned parameter vector $\widehat{\theta}$ and the training output noise vector $\widehat{\epsilon}^{\rm tr}$ estimated from the bootstrap samples, respectively. More specifically, we compute $\widetilde{\mathbb{E}}[\langle \widehat{U}\widetilde{\theta}, \widehat{L}\widetilde{\epsilon}^{\rm tr}\rangle]$ by bootstrapping residuals, as described in figure 3.3.

Based on this bootstrap procedure, the importance-weighted bootstrap-approximated SIC (IWBASIC) is defined as

(3.18)
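The residual-bootstrap loop of figure 3.3 can be sketched as follows for a linear-in-parameter model $f(x;\theta) = x^\top\theta$. This is an illustrative Python sketch under our own naming: `fit` stands for the learning operator $L$, and `U` and `L_hat` stand for the matrices $\widehat{U}$ and $\widehat{L}$ of the text:

```python
import numpy as np

def bootstrap_third_term(X, y, fit, U, L_hat, n_boot=300, seed=0):
    """Approximate E[<U theta~, L^ eps~>] by bootstrapping residuals (figure 3.3)."""
    rng = np.random.default_rng(seed)
    theta = fit(X, y)                       # step 1: learn from the original samples
    resid = y - X @ theta                   # step 2: estimate the noise
    vals = []
    for _ in range(n_boot):
        eps_b = rng.choice(resid, size=len(resid), replace=True)   # step 3
        theta_b = fit(X, X @ theta + eps_b)                        # step 4
        vals.append((U @ theta_b) @ (L_hat @ eps_b))               # step 5
    return float(np.mean(vals))             # step 6: average over bootstrap runs
```

For linear learning this bootstrap average mirrors the analytic identity $\mathbb{E}[\langle \widehat{U}\widehat{\theta}, \widehat{L}\epsilon^{\rm tr}\rangle] = \sigma^2\,\mathrm{tr}(\widehat{U}\widehat{L}\widehat{L}^\top)$, which is why the bootstrap is only needed in the general nonlinear case.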
3.3 Importance-Weighted Cross-Validation
IWAIC and IWSIC do not accept the 0/1-loss (see section 1.3.2). Thus, they cannot be employed for estimating the misclassification rate in classification scenarios. In this section, we describe a more general model selection method that can be applied to an arbitrary loss function, including the 0/1-loss. Below, we consider the generalization error in the following general form:

$$\mathrm{Gen} := \mathbb{E}_{x^{\rm te}, y^{\rm te}}\Big[\mathrm{loss}\big(x^{\rm te}, y^{\rm te}, \widehat{f}(x^{\rm te})\big)\Big].$$
One of the popular techniques for estimating the generalization error for arbitrary loss functions is cross-validation (CV) [150, 195]. CV has been shown to give an almost unbiased estimate of the generalization error with finite samples [105, 141]. However, such unbiased estimation is no longer possible under covariate shift. To cope with this problem, a variant of CV called importance-weighted CV (IWCV) has been proposed [160]. Let us randomly divide the training set $\mathcal{Z} = \{(x_i^{\rm tr}, y_i^{\rm tr})\}_{i=1}^{n_{\rm tr}}$ into $k$ disjoint nonempty subsets $\{\mathcal{Z}_i\}_{i=1}^{k}$ of (approximately) the same size. Let $\widehat{f}_{\mathcal{Z}_i}(x)$ be a function learned from $\{\mathcal{Z}_{i'}\}_{i'\neq i}$ (i.e., without $\mathcal{Z}_i$). Then the $k$-fold IWCV (kIWCV) estimate of the generalization error Gen is given by

$$\widehat{\mathrm{Gen}}_{k\mathrm{IWCV}} = \frac{1}{k}\sum_{i=1}^{k}\frac{1}{|\mathcal{Z}_i|}\sum_{(x,y)\in \mathcal{Z}_i}\frac{p_{\rm te}(x)}{p_{\rm tr}(x)}\,\mathrm{loss}\big(x, y, \widehat{f}_{\mathcal{Z}_i}(x)\big),$$

where $|\mathcal{Z}_i|$ is the number of samples in the subset $\mathcal{Z}_i$ (see figure 3.4).

When $k = n_{\rm tr}$, kIWCV is called IW leave-one-out CV (IWLOOCV):

$$\widehat{\mathrm{Gen}}_{\mathrm{IWLOOCV}} = \frac{1}{n_{\rm tr}}\sum_{i=1}^{n_{\rm tr}}\frac{p_{\rm te}(x_i^{\rm tr})}{p_{\rm tr}(x_i^{\rm tr})}\,\mathrm{loss}\big(x_i^{\rm tr}, y_i^{\rm tr}, \widehat{f}_i(x_i^{\rm tr})\big),$$
where $\widehat{f}_i(x)$ is a function learned from all samples except $(x_i^{\rm tr}, y_i^{\rm tr})$. It has been proved that IWLOOCV gives an almost unbiased estimate of the generalization error even under covariate shift [160]. More precisely, IWLOOCV for $n_{\rm tr}$ training samples gives an unbiased estimate of the generalization error for
Figure 3.4
Cross-validation.
$n_{\rm tr} - 1$ training samples:

$$\mathbb{E}_{\{x_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}}\, \mathbb{E}_{\{y_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}}\big[\widehat{\mathrm{Gen}}_{\mathrm{IWLOOCV}}\big] = \mathbb{E}_{\{x_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}}\, \mathbb{E}_{\{y_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}}\big[\mathrm{Gen}_{n_{\rm tr}-1}\big],$$

where $\mathbb{E}_{\{x_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}}$ denotes the expectation over $\{x_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}$ drawn i.i.d. from $p_{\rm tr}(x)$, $\mathbb{E}_{\{y_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}}$ denotes the expectation over $\{y_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}$, each of which is drawn from $p(y \mid x = x_i^{\rm tr})$, and $\mathrm{Gen}_{n_{\rm tr}-1}$ denotes the generalization error for $n_{\rm tr} - 1$ training samples. A similar proof is possible for kIWCV, but the bias is slightly larger [74].
Almost unbiasedness of IWCV holds for any loss function, any model, and any parameter learning method; even nonidentifiable models [196] or nonparametric learning methods (e.g., [141]) are allowed. Thus, IWCV is very flexible and useful in practical model selection under covariate shift.

Unbiasedness of IWCV shown above is within the input-independent framework (see section 3.2.1); in the input-dependent framework, unbiasedness of IWCV holds up to an asymptotic order that is the same as in IWAIC and IWSIC. However, since this order is independent of the model error $\delta$, IWSIC would be more accurate in regression scenarios with good models.
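The kIWCV estimator above can be sketched in a few lines of Python (an illustrative sketch under our own naming, not the authors' implementation): `fit` returns a learned prediction function, and `w` holds the importance values $p_{\rm te}(x_i)/p_{\rm tr}(x_i)$ at the training points:

```python
import numpy as np

def kfold_iwcv(x, y, w, fit, loss, k=10, seed=0):
    """k-fold importance-weighted CV estimate of the generalization error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(x)), fold)
        f = fit(x[train], y[train])          # learn without the held-out subset
        # importance-weighted average loss on the held-out subset
        scores.append(np.mean(w[fold] * loss(y[fold], f(x[fold]))))
    return float(np.mean(scores))
```

Setting `k = len(x)` gives IWLOOCV; with all weights equal to 1, the estimator reduces to ordinary $k$-fold CV.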
3.4 Numerical Examples
Here we illustrate how IWAIC, IWSIC, and IWCV behave.
3.4.1 Regression
Let us continue the one-dimensional regression simulation of section 2.4.1. As shown in figure 2.8 in section 2.4.1, adaptive importance-weighted least squares (AIWLS) with flattening parameter $\gamma = 0.5$ appears to work well for that particular realization. However, the best value of $\gamma$ depends on the realization of samples. In order to investigate this issue systematically, let us repeat the simulation 1000 times with different random seeds. That is, in each run, $\{(x_i^{\rm tr}, \epsilon_i^{\rm tr})\}_{i=1}^{n_{\rm tr}}$ are randomly drawn, and the scores of tenfold IWCV, IWSIC, IWAIC, and tenfold CV are calculated for $\gamma = 0, 0.1, 0.2, \ldots, 1$. The means and standard deviations of the generalization error Gen and its estimate by each method are depicted as functions of $\gamma$ in figure 3.5.
Figure 3.5
[Five panels: the true generalization error Gen (top) and its estimates $\widehat{\mathrm{Gen}}_{\mathrm{IWCV}}$, $\widehat{\mathrm{Gen}}_{\mathrm{IWSIC}}$, $\widehat{\mathrm{Gen}}_{\mathrm{IWAIC}}$, and $\widehat{\mathrm{Gen}}_{\mathrm{CV}}$, each plotted against the flattening parameter $\gamma \in [0, 1]$.]
Generalization error and its estimates as functions of the flattening parameter $\gamma$ in adaptive importance-weighted least squares (AIWLS) for the regression examples in figure 2.8. Dashed curves in the bottom four graphs depict the true generalization error for clear comparison. Note that the vertical scale of CV is different from the others, since it takes a wider range of values. Also note that IWAIC and IWSIC are estimators of the generalization error up to some constants (see equations 3.3 and 3.12). For clear comparison, we included those ignored constants in the plots, which does not essentially change the result.
Note that IWAIC and IWSIC are estimators of the generalization error up to some constants (see equations 3.3 and 3.12). For clear comparison, we included those ignored constants in the plots, which does not essentially change the result. The graphs show that IWCV, IWSIC, and IWAIC give reasonably good unbiased estimates of the generalization error, while CV is heavily biased. The variance of IWCV is slightly larger than those of IWSIC and IWAIC, which would be the price we have to pay in compensation for generality (as discussed in section 3.3, IWCV is a much more general method than IWSIC and IWAIC). Fortunately, as shown below, this rather large variance appears not to affect the model selection performance so much.
Next we investigate the model selection performance: the flattening parameter $\gamma$ (see section 2.1.2) is chosen from $\{0, 0.1, 0.2, \ldots, 1\}$ so that the score of each method is minimized. The mean and standard deviation of the generalization error Gen of the learned function obtained by each method over 1000 runs are described in the top row of table 3.1. This shows that IWCV, IWSIC, and IWAIC give significantly smaller generalization errors than CV under the t-test [78] at the significance level of 5 percent. IWCV, IWSIC, and IWAIC are comparable to each other. For reference, the generalization error when the flattening parameter $\gamma$ is chosen optimally (i.e., for each trial, $\gamma$ is chosen so that the true generalization error is minimized) is described as OPT in the table. The result shows that the generalization error values of IWCV, IWSIC, and IWAIC are rather close to the optimal value.
The bottom row of table 3.1 describes the results when the polynomial model of order 2 is used for learning. This shows that IWCV and IWSIC still work well, and outperform IWAIC and CV. When the second-order polynomial model is used, the target function is almost realizable in the test region (see figure 2.8). Therefore, IWSIC tends to be more accurate (see section 3.2.3).
Table 3.1
The mean and standard deviation of the generalization error Gen obtained by each method for the toy regression data set

(n_tr, k)    IWCV             IWSIC            IWAIC            CV               OPT
(150, 1)     °0.077 ± 0.020   °0.077 ± 0.023   °0.076 ± 0.019   0.356 ± 0.086    0.069 ± 0.011
(100, 2)     °0.104 ± 0.113   °0.101 ± 0.103   0.113 ± 0.135    0.110 ± 0.135    0.072 ± 0.014

The best method and comparable ones by the t-test at the 5 percent significance level are indicated by °. For reference, the generalization error obtained with the optimal $\gamma$ (i.e., the minimum generalization error) is described as OPT. $n_{\rm tr}$ is the number of training samples, while $k$ is the order of the polynomial regression model. $(n_{\rm tr}, k) = (150, 1)$ and $(100, 2)$ roughly correspond to "misspecified and large sample size" and "approximately correct and small sample size," respectively.
The good performance of IWCV is maintained thanks to its almost unbiasedness. Ordinary CV is not extremely poor, due to the fact that the model is almost realizable, but it is still inferior to IWCV and IWSIC. IWAIC tends to work rather poorly since $n_{\rm tr} = 100$ is relatively small compared with the high complexity of the second-order model, and hence the asymptotic results are not valid (see sections 3.1 and 3.2.3).
The above simulation results illustrate that IWCV performs quite well
in regression under covariate shift; its performance is comparable to that
of IWSIC, which is a generalization error estimator specialized for linear
regression with linear parameter learning.
3.4.2 Classification
Let us continue the toy classification simulation in section 2.4.2. IWSIC and IWAIC cannot be applied to classification problems because they do not accept the 0/1-loss. For this reason, we compare only IWCV and ordinary CV here.
In figure 2.9b in section 2.4.2, adaptive importance-weighted Fisher discriminant analysis (AIWFDA) with a middle/large flattening parameter $\gamma$ appears to work well for that particular realization. Here, we investigate the choice of the flattening parameter value by IWCV and CV more extensively. Figure 3.6 depicts the means and standard deviations of the generalization error Gen (which corresponds to the misclassification rate) and its estimate by each method over 1000 runs, as functions of the flattening parameter $\gamma$ in AIWFDA. The graphs clearly show that IWCV gives much better estimates of the generalization error than CV does.
Next we investigate the model selection performance: the flattening parameter $\gamma$ is chosen from $\{0, 0.1, 0.2, \ldots, 1\}$ so that the score of each model selection criterion is minimized. The mean and standard deviation of the generalization error Gen of the learned function obtained by each method over 1000 runs are described in table 3.2. The table shows that IWCV gives significantly smaller test errors than CV does.
Table 3.2
The mean and standard deviation of the generalization error Gen (i.e., the misclassification rate) obtained by each method for the toy classification data set

IWCV             CV               OPT
°0.108 ± 0.027   0.131 ± 0.029    0.091 ± 0.009

The best method and comparable ones by the t-test at the 5 percent significance level are indicated by °. For reference, the generalization error obtained with the optimal $\gamma$ (i.e., the minimum generalization error) is described as OPT.
Figure 3.6
[Three panels: the true generalization error Gen (top) and its estimates $\widehat{\mathrm{Gen}}_{\mathrm{IWCV}}$ and $\widehat{\mathrm{Gen}}_{\mathrm{CV}}$, each plotted against the flattening parameter $\gamma \in [0, 1]$.]
The generalization error Gen (i.e., the misclassification rate) and its estimates as functions of the tuning parameter $\gamma$ in adaptive importance-weighted Fisher discriminant analysis (AIWFDA) for the toy classification examples in figure 2.9. Dashed curves in the bottom two graphs depict the true generalization error in the top graph for clear comparison.
This simulation result illustrates that IWCV is also useful in classification under covariate shift.
3.5 Summary and Discussion
In this chapter, we have addressed the model selection problem under the
covariate shift paradigm--training input points and test input points are drawn
from different distributions (i.e., Ptrain (x) 'I Ptest (x)), but the functional rela­
tion remains unchanged (i.e., Ptrain (yIx) =
Ptest (y Ix)). Under covariate shift,
standard model selection schemes such as the Akaike information criterion
(AIC), the subspace information criterion (SIC), and cross-validation (CV)
are heavily biased and do not work as desired. On the other hand, their
importance-weighted counterparts, IWAIC (section 3.1), IWSIC (section 3.2),
and IWCV (section 3.3) have been shown to possess proper unbiasedness.
Through simulations (section 3.4), the importance-weighted model selection
criteria were shown to be useful for improving the generalization performance
under covariate shift.
Although the importance-weighted model selection criteria were shown to work well, they tend to have larger variance, and therefore the model selection performance can be unstable. Investigating the effect of this larger variance on model selection performance will be an important direction to pursue, for example, following the line of [151] and [6].

In the experiments in section 3.4, we assumed that the importance weights are known. However, this may not be the case in practice. We will discuss how the importance weights can be accurately estimated from data in chapter 4.
4 Importance Estimation
In chapters 2 and 3, we have seen that the importance weight

$$w(x) = \frac{p_{\rm te}(x)}{p_{\rm tr}(x)}$$

can be used for asymptotically canceling the bias caused by covariate shift.
However, the importance weight is unknown in practice, and needs to be
estimated from data. In this chapter, we give a comprehensive overview of
importance estimation methods.
The setup in this chapter is that, in addition to the i.i.d. training input samples

$$\{x_i^{\rm tr}\}_{i=1}^{n_{\rm tr}} \overset{\text{i.i.d.}}{\sim} p_{\rm tr}(x),$$

we are given i.i.d. test input samples

$$\{x_j^{\rm te}\}_{j=1}^{n_{\rm te}} \overset{\text{i.i.d.}}{\sim} p_{\rm te}(x).$$

Although this setup is similar to semisupervised learning [30], our attention is directed to covariate shift adaptation.

The goal of importance estimation is to estimate the importance function $w(x)$ (or the importance values at the training input points, $\{w(x_i^{\rm tr})\}_{i=1}^{n_{\rm tr}}$) from $\{x_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}$ and $\{x_j^{\rm te}\}_{j=1}^{n_{\rm te}}$.
4.1 Kernel Density Estimation
Kernel density estimation (KDE) is a nonparametric technique to estimate a probability density function $p(x)$ from its i.i.d. samples $\{x_k\}_{k=1}^{n}$. For the Gaussian kernel

$$K_\sigma(x, x') = \exp\left(-\frac{\|x - x'\|^{2}}{2\sigma^{2}}\right), \qquad (4.1)$$

KDE is expressed as

$$\widehat{p}(x) = \frac{1}{n(2\pi\sigma^{2})^{d/2}}\sum_{k=1}^{n} K_\sigma(x, x_k),$$

where $d$ is the dimensionality of $x$.
The performance of KDE depends on the choice of the kernel width $\sigma$. It can be optimized by cross-validation (CV) as follows [69]. First, divide the samples $\{x_k\}_{k=1}^{n}$ into $k$ disjoint subsets $\{\mathcal{X}_r\}_{r=1}^{k}$ of (approximately) the same size. Then obtain a density estimate $\widehat{p}_{\mathcal{X}_r}(x)$ from $\{\mathcal{X}_{r'}\}_{r'\neq r}$ (i.e., without $\mathcal{X}_r$), and compute its log-likelihood for $\mathcal{X}_r$:

$$\frac{1}{|\mathcal{X}_r|}\sum_{x\in \mathcal{X}_r}\log \widehat{p}_{\mathcal{X}_r}(x),$$

where $|\mathcal{X}_r|$ denotes the number of elements in the set $\mathcal{X}_r$. Repeat this procedure for $r = 1, 2, \ldots, k$, and choose the value of $\sigma$ such that the average of the above holdout log-likelihood over all $r$ is maximized. Note that the average holdout log-likelihood is an almost unbiased estimate of the Kullback-Leibler divergence from $p(x)$ to $\widehat{p}(x)$, up to some irrelevant constant.
KDE can be used for importance estimation by first obtaining density estimators $\widehat{p}_{\rm tr}(x)$ and $\widehat{p}_{\rm te}(x)$ separately from $\{x_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}$ and $\{x_j^{\rm te}\}_{j=1}^{n_{\rm te}}$, respectively, and then estimating the importance by

$$\widehat{w}(x) = \frac{\widehat{p}_{\rm te}(x)}{\widehat{p}_{\rm tr}(x)}.$$
However, a potential limitation of this naive approach is that KDE suffers from the curse of dimensionality [193, 69]; that is, the number of samples needed to maintain the same approximation quality grows exponentially as the dimension of the input space increases. This is critical when the number of available samples is limited. Therefore, the KDE-based approach may not be reliable in high-dimensional problems.
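The naive KDE-based approach is short enough to sketch directly. The following is an illustrative Python version (names are ours; in practice the widths `sigma_tr` and `sigma_te` would be chosen by the likelihood CV described above):

```python
import numpy as np

def kde(x_query, samples, sigma):
    """Gaussian KDE using the kernel of equation 4.1, with Gaussian normalization."""
    d = samples.shape[1]
    sq = ((x_query[:, None, :] - samples[None, :, :]) ** 2).sum(axis=2)
    k = np.exp(-sq / (2 * sigma ** 2))            # kernel values, shape (n_query, n)
    return k.mean(axis=1) / (2 * np.pi * sigma ** 2) ** (d / 2)

def kde_importance(x, x_tr, x_te, sigma_tr, sigma_te):
    """Naive importance estimate: ratio of two separate density estimates."""
    return kde(x, x_te, sigma_te) / kde(x, x_tr, sigma_tr)
```

This is exactly the two-step estimator whose high-dimensional fragility motivates the direct methods of the following sections.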
In the following sections, we consider directly estimating the importance $w(x)$ without going through density estimation of $p_{\rm tr}(x)$ and $p_{\rm te}(x)$. An intuitive advantage of this direct estimation approach is that knowing the densities $p_{\rm tr}(x)$ and $p_{\rm te}(x)$ implies knowing the importance $w(x)$, but not vice versa: the importance $w(x)$ cannot be uniquely decomposed into $p_{\rm tr}(x)$ and $p_{\rm te}(x)$. Thus, estimating the importance $w(x)$ could be substantially simpler than estimating the densities $p_{\rm tr}(x)$ and $p_{\rm te}(x)$.
The Russian mathematician Vladimir Vapnik, who developed one of the most successful classification algorithms, the support vector machine, advocated the following principle [193]: one should not solve more difficult intermediate problems when solving a target problem. The support vector machine follows this principle by directly learning the decision boundary that is sufficient for pattern recognition, instead of solving the more general, and thus more difficult, problem of estimating the data generation probability.

The idea of direct importance estimation also follows Vapnik's principle, since one can avoid solving the substantially more difficult problem of estimating the densities $p_{\rm tr}(x)$ and $p_{\rm te}(x)$.
4.2 Kernel Mean Matching

Kernel mean matching (KMM) allows one to directly obtain an estimate of the importance values without going through density estimation [82]. The basic idea of KMM is to find $w(x)$ such that the mean discrepancy between nonlinearly transformed samples drawn from $p_{\rm tr}(x)$ and $p_{\rm te}(x)$ is minimized in a universal reproducing kernel Hilbert space (RKHS) [149]. The Gaussian kernel (equation 4.1) is an example of kernels that induce a universal RKHS, and it has been shown that the solution of the following optimization problem agrees with the true importance values:

$$\min_{w}\ \left\| \mathbb{E}_{x^{\rm tr}}\big[w(x^{\rm tr})K_\sigma(x^{\rm tr},\cdot)\big] - \mathbb{E}_{x^{\rm te}}\big[K_\sigma(x^{\rm te},\cdot)\big] \right\|_{\mathcal{H}}^{2} \qquad (4.2)$$

subject to $\mathbb{E}_{x^{\rm tr}}[w(x^{\rm tr})] = 1$ and $w(x) \ge 0$,

where $\|\cdot\|_{\mathcal{H}}$ denotes the norm in the Gaussian RKHS $\mathcal{H}$, $K_\sigma(x, x')$ is the Gaussian kernel (equation 4.1), and $\mathbb{E}_{x^{\rm tr}}$ and $\mathbb{E}_{x^{\rm te}}$ denote the expectations over $x^{\rm tr}$ and $x^{\rm te}$ drawn from $p_{\rm tr}(x)$ and $p_{\rm te}(x)$, respectively. Note that for each fixed $x$, $K_\sigma(x, \cdot)$ is a function belonging to the RKHS $\mathcal{H}$.
An empirical version of the above problem is reduced to the following quadratic program (QP):

$$\min_{\{w_i\}_{i=1}^{n_{\rm tr}}}\ \left[\frac{1}{2}\sum_{i,i'=1}^{n_{\rm tr}} w_i w_{i'} K_\sigma(x_i^{\rm tr}, x_{i'}^{\rm tr}) - \sum_{i=1}^{n_{\rm tr}} w_i \kappa_i\right]$$

subject to $\displaystyle\left|\frac{1}{n_{\rm tr}}\sum_{i=1}^{n_{\rm tr}} w_i - 1\right| \le \epsilon$ and $0 \le w_1, w_2, \ldots, w_{n_{\rm tr}} \le B$,

where

$$\kappa_i := \frac{n_{\rm tr}}{n_{\rm te}}\sum_{j=1}^{n_{\rm te}} K_\sigma(x_i^{\rm tr}, x_j^{\rm te}).$$

$B\ (\ge 0)$ and $\epsilon\ (\ge 0)$ are tuning parameters that control the regularization effects. The solution $\{\widehat{w}_i\}_{i=1}^{n_{\rm tr}}$ is an estimate of the importance at $\{x_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}$.
Since KMM does not involve density estimation, it is expected to work well even in high-dimensional cases. However, its performance depends on the choice of the tuning parameters $B$, $\epsilon$, and $\sigma$, and they cannot simply be optimized by, for instance, CV, since estimates of the importance are available only at $\{x_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}$. As shown in [90], an inductive variant of KMM (i.e., one in which the entire importance function is estimated) exists. This allows one to optimize $B$ and $\epsilon$ by CV over the objective function (equation 4.2). However, the Gaussian kernel width $\sigma$ may not be appropriately determined by CV, since changing the value of $\sigma$ means changing the RKHS; the objective function (equation 4.2) is defined using the RKHS norm, and thus the objective values for different norms are not comparable.

A popular heuristic is to use the median distance between samples as the Gaussian width $\sigma$ [141, 147]. However, there seems to be no strong justification for this heuristic. For the choice of $\epsilon$, a theoretical result given in [82] can be used as guidance. However, it is still hard to determine the best value of $\epsilon$ in practice.
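A small-scale sketch of the KMM quadratic program, solved here with a general-purpose SLSQP solver rather than a dedicated QP solver (function names and the default values of $B$ and $\epsilon$ are our choices, not prescriptions from the text):

```python
import numpy as np
from scipy.optimize import minimize

def gauss_gram(A, C, sigma):
    """Gaussian kernel matrix between rows of A and rows of C."""
    sq = ((A[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

def kmm_weights(x_tr, x_te, sigma=1.0, B=100.0, eps=0.1):
    """Solve the empirical KMM QP for the importance values at the training points."""
    n_tr, n_te = len(x_tr), len(x_te)
    K = gauss_gram(x_tr, x_tr, sigma) + 1e-8 * np.eye(n_tr)  # PSD jitter
    kappa = (n_tr / n_te) * gauss_gram(x_tr, x_te, sigma).sum(axis=1)
    obj = lambda w: 0.5 * w @ K @ w - kappa @ w
    jac = lambda w: K @ w - kappa
    cons = [  # |mean(w) - 1| <= eps, written as two smooth inequalities
        {"type": "ineq", "fun": lambda w: eps - (w.mean() - 1.0)},
        {"type": "ineq", "fun": lambda w: eps + (w.mean() - 1.0)},
    ]
    res = minimize(obj, np.ones(n_tr), jac=jac, method="SLSQP",
                   bounds=[(0.0, B)] * n_tr, constraints=cons,
                   options={"maxiter": 500})
    return res.x
```

Note that the output is only the weight values $\{\widehat{w}_i\}$ at the training points; no density is estimated anywhere, which is precisely KMM's selling point (and the reason CV over held-out points is not directly available).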
4.3 Logistic Regression

Another approach to directly estimating the importance is to use a probabilistic classifier. Let us assign a selector variable $\eta = -1$ to samples drawn from $p_{\rm tr}(x)$ and $\eta = 1$ to samples drawn from $p_{\rm te}(x)$. That is, the two densities are written as

$$p_{\rm tr}(x) = p(x \mid \eta = -1), \qquad p_{\rm te}(x) = p(x \mid \eta = 1).$$
Note that $\eta$ is regarded as a random variable.

An application of Bayes's theorem shows that the importance can be expressed in terms of $\eta$ as follows [128, 32, 17]:

$$w(x) = \frac{p(\eta = -1)}{p(\eta = 1)} \cdot \frac{p(\eta = 1 \mid x)}{p(\eta = -1 \mid x)}.$$

The ratio $p(\eta = -1)/p(\eta = 1)$ may be easily estimated by the ratio of the numbers of samples:

$$\frac{p(\eta = -1)}{p(\eta = 1)} \approx \frac{n_{\rm tr}}{n_{\rm te}}.$$

The conditional probability $p(\eta \mid x)$ can be approximated by discriminating $\{x_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}$ and $\{x_j^{\rm te}\}_{j=1}^{n_{\rm te}}$ using a logistic regression (LR) classifier, where $\eta$ plays the role of a class variable. Below we briefly explain the LR method.
The LR classifier employs a parametric model of the following form for expressing the conditional probability $p(\eta \mid x)$:

$$\widehat{p}(\eta \mid x) = \frac{1}{1 + \exp\left(-\eta \sum_{\ell=1}^{m} \zeta_\ell \phi_\ell(x)\right)},$$

where $m$ is the number of basis functions and $\{\phi_\ell(x)\}_{\ell=1}^{m}$ are fixed basis functions. The parameter $\zeta$ is learned so that the negative regularized log-likelihood is minimized:

$$\widehat{\zeta} = \underset{\zeta}{\operatorname{argmin}} \left[\sum_{i=1}^{n_{\rm tr}} \log\left(1 + \exp\left(\sum_{\ell=1}^{m}\zeta_\ell\phi_\ell(x_i^{\rm tr})\right)\right) + \sum_{j=1}^{n_{\rm te}} \log\left(1 + \exp\left(-\sum_{\ell=1}^{m}\zeta_\ell\phi_\ell(x_j^{\rm te})\right)\right) + \lambda \zeta^{\top}\zeta\right].$$

Since the above objective function is convex, the global optimal solution can be obtained by standard nonlinear optimization methods such as the gradient method and the (quasi-)Newton method [74, 117]. Then the importance estimator is given by

$$\widehat{w}(x) = \frac{n_{\rm tr}}{n_{\rm te}} \exp\left(\sum_{\ell=1}^{m}\widehat{\zeta}_\ell\phi_\ell(x)\right). \qquad (4.3)$$

This model is often called the log-linear model.
An advantage of the LR method is that model selection (i.e., the choice of the basis functions $\{\phi_\ell(x)\}_{\ell=1}^{m}$ as well as the regularization parameter $\lambda$) is possible by standard CV, since the learning problem involved above is a standard supervised classification problem.
When multiclass LR classifiers are used, importance values among mul­
tiple densities can be estimated simultaneously [16]. However, training LR
classifiers is rather time-consuming. A computationally efficient alternative to
LR, called the least-squares probabilistic classifier (LSPC), has been proposed
[155]. LSPC would be useful in large-scale density-ratio estimation.
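The train-versus-test classification trick is easy to try with an off-the-shelf classifier. The following sketch uses scikit-learn's logistic regression in place of the explicit basis-function model above (the library choice is ours; any regularized LR implementation would serve):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lr_importance(x_tr, x_te, C=1.0):
    """Estimate w(x) = p_te(x)/p_tr(x) at the training points via a train-vs-test classifier."""
    X = np.vstack([x_tr, x_te])
    eta = np.hstack([-np.ones(len(x_tr)), np.ones(len(x_te))])  # selector variable
    clf = LogisticRegression(C=C).fit(X, eta)
    p = clf.predict_proba(x_tr)        # columns ordered by class label: eta=-1, eta=+1
    return (len(x_tr) / len(x_te)) * p[:, 1] / p[:, 0]
```

Because this is an ordinary supervised classification problem, the regularization strength (and any feature expansion of the inputs) can be tuned by standard CV, exactly as noted in the text.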
4.4 Kullback-Leibler Importance Estimation Procedure
The Kullback-Leibler importance estimation procedure (KLIEP) [164] directly gives an estimate of the importance function, without going through density estimation, by matching the two distributions in terms of the Kullback-Leibler divergence [97].
4.4.1 Algorithm
Let us model the importance weight $w(x)$ by means of the following linear-in-parameter model (see section 1.3.5.2):

$$\widehat{w}(x) = \sum_{\ell=1}^{t} \alpha_\ell \varphi_\ell(x), \qquad (4.4)$$

where $t$ is the number of parameters,

$$\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_t)^{\top}$$

are parameters to be learned from data samples, $\top$ denotes the transpose, and $\{\varphi_\ell(x)\}_{\ell=1}^{t}$ are basis functions such that

$$\varphi_\ell(x) \ge 0 \quad \text{for all } x \in \mathcal{D} \text{ and } \ell = 1, 2, \ldots, t.$$

Note that $t$ and $\{\varphi_\ell(x)\}_{\ell=1}^{t}$ can depend on the samples $\{x_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}$ and $\{x_j^{\rm te}\}_{j=1}^{n_{\rm te}}$, so kernel models are also allowed. Later, we explain how the basis functions $\{\varphi_\ell(x)\}_{\ell=1}^{t}$ are designed in practice.

An estimate of the density $p_{\rm te}(x)$ is given by using the model $\widehat{w}(x)$ as

$$\widehat{p}_{\rm te}(x) = \widehat{w}(x)\, p_{\rm tr}(x).$$
In KLIEP, the parameters $\alpha$ are determined so that the Kullback-Leibler divergence from $p_{\rm te}(x)$ to $\widehat{p}_{\rm te}(x)$ is minimized:

$$\mathrm{KL}\big[p_{\rm te}(x)\,\big\|\,\widehat{p}_{\rm te}(x)\big] = \mathbb{E}_{x^{\rm te}}\left[\log \frac{p_{\rm te}(x^{\rm te})}{p_{\rm tr}(x^{\rm te})}\right] - \mathbb{E}_{x^{\rm te}}\big[\log \widehat{w}(x^{\rm te})\big],$$

where $\mathbb{E}_{x^{\rm te}}$ denotes the expectation over $x^{\rm te}$ drawn from $p_{\rm te}(x)$. The first term is a constant, so it can safely be ignored. We define the negative of the second term as $\mathrm{KL}'$:

$$\mathrm{KL}' := \mathbb{E}_{x^{\rm te}}\big[\log \widehat{w}(x^{\rm te})\big]. \qquad (4.5)$$

Since $\widehat{p}_{\rm te}(x)\ (= \widehat{w}(x)\, p_{\rm tr}(x))$ is a probability density function, it should satisfy

$$1 = \int_{\mathcal{D}} \widehat{p}_{\rm te}(x)\,\mathrm{d}x = \int_{\mathcal{D}} \widehat{w}(x)\, p_{\rm tr}(x)\,\mathrm{d}x = \mathbb{E}_{x^{\rm tr}}\big[\widehat{w}(x^{\rm tr})\big]. \qquad (4.6)$$

Consequently, the KLIEP optimization problem is given by replacing the expectations in equations 4.5 and 4.6 with empirical averages:

$$\max_{\{\alpha_\ell\}_{\ell=1}^{t}}\ \left[\sum_{j=1}^{n_{\rm te}} \log\left(\sum_{\ell=1}^{t} \alpha_\ell \varphi_\ell(x_j^{\rm te})\right)\right] \quad \text{subject to} \quad \frac{1}{n_{\rm tr}}\sum_{i=1}^{n_{\rm tr}} \sum_{\ell=1}^{t} \alpha_\ell \varphi_\ell(x_i^{\rm tr}) = 1 \ \text{ and } \ \alpha_1, \alpha_2, \ldots, \alpha_t \ge 0.$$

This is a convex optimization problem, and the global solution, which tends to be sparse [27], can be obtained, for example, simply by performing gradient ascent and feasibility satisfaction iteratively. A pseudo code is summarized in figure 4.1.
Properties of KLIEP-type algorithms have been theoretically investigated in [128, 32, 171, 120]. In particular, the following facts are known regarding the convergence properties:

• When a fixed set of basis functions (i.e., a parametric model) is used for importance estimation, KLIEP converges to the optimal parameter in the model with convergence rate $O_p(n^{-1/2})$ under $n = n_{\rm tr} = n_{\rm te}$, where $O_p$ denotes the
80
I t
· - { ()}t {tr}ntr d
{te}ntenpu . m - 'Pg x g=l, xi i=l' an Xj j=l
Output: w(x)
4 Importance Estimation
Aj,g+-- 'Pg(xje) for j = 1,2,...,nte and C = 1,2, . . . ,t;
bg+-- n�r L:�,� 'Pg(x}r) for C = 1,2,. . . ,t;
Initialize a (> Ot) and E (0 < E « 1);
Repeat until convergence
end
a+-- a + EAT(lnte'/Aa); % Gradient ascent
a+-- a + (1 - bTa)b/(bTb); % Constraint satisfaction
a+-- max(Ot, a); % Constraint satisfaction
a+-- a/(bTa); % Constraint satisfaction
w(x)+-- L:�=1 ag<pg(x);
Figure 4.1
Pseudo code of KLIEP. 0, denotes the t-dimensional vector with all zeros, and In,e denotes the n",­
dimensional vector with all ones. .I indicates the elementwise division, and T denotes the transpose.
Inequalities and the "max" operation for vectors are applied elementwise.
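The pseudo code of figure 4.1 translates almost line by line into NumPy. The following is an illustrative Python sketch (the authors distribute a MATLAB implementation), using Gaussian basis functions centered at the test points as in section 4.4.3; the learning rate and iteration count are our choices:

```python
import numpy as np

def kliep(x_tr, x_te, sigma=0.5, lr=1e-4, n_iter=5000):
    """KLIEP with Gaussian kernels centered at the test points (figure 4.1)."""
    def phi(x, c):  # basis functions: Gaussian kernels centered at rows of c
        sq = ((x[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq / (2 * sigma ** 2))

    A = phi(x_te, x_te)               # A[j, l] = phi_l(x_j^te)
    b = phi(x_tr, x_te).mean(axis=0)  # b[l] = (1/n_tr) sum_i phi_l(x_i^tr)
    alpha = np.ones(A.shape[1])
    for _ in range(n_iter):
        alpha = alpha + lr * A.T @ (1.0 / (A @ alpha))   # gradient ascent
        alpha = alpha + (1.0 - b @ alpha) * b / (b @ b)  # constraint satisfaction
        alpha = np.maximum(0.0, alpha)                   # constraint satisfaction
        alpha = alpha / (b @ alpha)                      # constraint satisfaction
    return lambda x: phi(x, x_te) @ alpha                # the estimated w(x)
```

By construction, the final normalization enforces $\frac{1}{n_{\rm tr}}\sum_i \widehat{w}(x_i^{\rm tr}) = 1$, the empirical version of equation 4.6, at every iteration.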
asymptotic order in probability. This is the optimal convergence rate in the parametric setup. Furthermore, KLIEP has asymptotic normality around the optimal solution.

• When a nonparametric model (e.g., kernel basis functions centered at the test samples; see section 4.4.3) is used for importance estimation, KLIEP converges to the optimal solution with a convergence rate slightly slower than $O_p(n^{-1/2})$. This is the optimal convergence rate in the minimax sense.
Note that the importance model of KLIEP is the linear-in-parameter model (equation 4.4), while that of LR is the log-linear model (equation 4.3). A variant of KLIEP for log-linear models has been studied in [185, 120], which is computationally more efficient when the number of test samples is large. The KLIEP idea can also be applied to Gaussian mixture models [202] and probabilistic principal-component-analyzer mixture models [205].
4.4.2 Model Selection by Cross-Validation
The performance of KLIEP depends on the choice of the basis functions $\{\varphi_\ell(x)\}_{\ell=1}^{t}$. Here we explain how they can be appropriately chosen from data samples.

Since KLIEP is based on the maximization of $\mathrm{KL}'$ (see equation 4.5), it would be natural to select the model such that $\mathrm{KL}'$ is maximized. The expectation over $p_{\rm te}(x)$ involved in $\mathrm{KL}'$ can be numerically approximated by cross-validation (CV) as follows. First, divide the test samples $\{x_j^{\rm te}\}_{j=1}^{n_{\rm te}}$ into
$k$ disjoint subsets $\{\mathcal{X}_r^{\rm te}\}_{r=1}^{k}$ of (approximately) the same size. Then obtain an importance estimate $\widehat{w}_{\mathcal{X}_r^{\rm te}}(x)$ from $\{\mathcal{X}_{r'}^{\rm te}\}_{r'\neq r}$ (i.e., without $\mathcal{X}_r^{\rm te}$), and approximate $\mathrm{KL}'$ using $\mathcal{X}_r^{\rm te}$ as

$$\widehat{\mathrm{KL}}{}'_r := \frac{1}{|\mathcal{X}_r^{\rm te}|}\sum_{x\in\mathcal{X}_r^{\rm te}} \log \widehat{w}_{\mathcal{X}_r^{\rm te}}(x).$$

This procedure is repeated for $r = 1, 2, \ldots, k$, and the average $\widehat{\mathrm{KL}}{}'$ is used as an estimate of $\mathrm{KL}'$:

$$\widehat{\mathrm{KL}}{}' := \frac{1}{k}\sum_{r=1}^{k} \widehat{\mathrm{KL}}{}'_r. \qquad (4.7)$$

For model selection, we compute $\widehat{\mathrm{KL}}{}'$ for all model candidates (the basis functions $\{\varphi_\ell(x)\}_{\ell=1}^{t}$ in the current setting), and choose the one that maximizes $\widehat{\mathrm{KL}}{}'$. A pseudo code of the CV procedure is summarized in figure 4.2.
One of the potential limitations of CV in general is that it is not reliable for small samples, since data splitting by CV further reduces the sample size. On the other hand, in our CV procedure the data splitting is performed only over the test input samples $\{x_j^{\rm te}\}_{j=1}^{n_{\rm te}}$, not over the training samples. Therefore, even when the number of training samples is small, our CV procedure does not suffer from the small sample problem as long as a large number of test input samples are available.
Input: $\mathcal{M} = \{m \mid m = \{\varphi_\ell(x)\}_{\ell=1}^{t}\}$, $\{x_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}$, and $\{x_j^{\rm te}\}_{j=1}^{n_{\rm te}}$
Output: $\widehat{w}(x)$

Split $\{x_j^{\rm te}\}_{j=1}^{n_{\rm te}}$ into $k$ disjoint subsets $\{\mathcal{X}_r^{\rm te}\}_{r=1}^{k}$;
for each model $m \in \mathcal{M}$
    for each split $r = 1, 2, \ldots, k$
        $\widehat{w}_{\mathcal{X}_r^{\rm te}}(x) \leftarrow \mathrm{KLIEP}(m, \{x_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}, \{\mathcal{X}_{r'}^{\rm te}\}_{r'\neq r})$;
        $\widehat{\mathrm{KL}}{}'_r(m) \leftarrow \frac{1}{|\mathcal{X}_r^{\rm te}|}\sum_{x\in\mathcal{X}_r^{\rm te}}\log \widehat{w}_{\mathcal{X}_r^{\rm te}}(x)$;
    end
    $\widehat{\mathrm{KL}}{}'(m) \leftarrow \frac{1}{k}\sum_{r=1}^{k}\widehat{\mathrm{KL}}{}'_r(m)$;
end
$\widehat{m} \leftarrow \operatorname{argmax}_{m\in\mathcal{M}} \widehat{\mathrm{KL}}{}'(m)$;
$\widehat{w}(x) \leftarrow \mathrm{KLIEP}(\widehat{m}, \{x_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}, \{x_j^{\rm te}\}_{j=1}^{n_{\rm te}})$;

Figure 4.2
Pseudo code of CV-based model selection for KLIEP.
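The CV loop of figure 4.2 can be sketched generically; in the following illustrative Python sketch (names are ours), `fit_w` stands for any KLIEP-style learner that takes training and test inputs and returns an estimated importance function:

```python
import numpy as np

def kliep_cv_score(fit_w, x_tr, x_te, k=5, seed=0):
    """Average held-out log w over k splits of the TEST samples (equation 4.7)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x_te)), k)
    scores = []
    for fold in folds:
        rest = np.setdiff1d(np.arange(len(x_te)), fold)
        w = fit_w(x_tr, x_te[rest])                  # learn without the held-out subset
        scores.append(np.mean(np.log(w(x_te[fold]))))
    return float(np.mean(scores))                    # larger is better (estimates KL')
```

The model candidate (e.g., the Gaussian width $\sigma$) with the largest score is selected; note that only the test samples are split, so the procedure remains usable when training samples are scarce.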
4.4.3 Basis Function Design
A good model may be chosen by the above CV procedure, given that a set of promising model candidates is prepared. As model candidates, we use a Gaussian kernel model centered at the test input points $\{x_j^{\rm te}\}_{j=1}^{n_{\rm te}}$. That is,

$$\widehat{w}(x) = \sum_{\ell=1}^{n_{\rm te}} \alpha_\ell K_\sigma(x, x_\ell^{\rm te}),$$

where $K_\sigma(x, x')$ is the Gaussian kernel with width $\sigma$:

$$K_\sigma(x, x') := \exp\left(-\frac{\|x - x'\|^{2}}{2\sigma^{2}}\right).$$
Our reason for choosing the test input points $\{x_j^{\rm te}\}_{j=1}^{n_{\rm te}}$ as the Gaussian centers, not the training input points $\{x_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}$, is as follows. By definition, the importance $w(x)$ tends to take large values if the training input density $p_{\rm tr}(x)$ is small and the test input density $p_{\rm te}(x)$ is large; conversely, $w(x)$ tends to be small (i.e., close to zero) if $p_{\rm tr}(x)$ is large and $p_{\rm te}(x)$ is small. When a nonnegative function is approximated by a Gaussian kernel model, many kernels may be needed in the region where the output of the target function is large; on the other hand, only a small number of kernels will be enough in the region where the output of the target function is close to zero (see figure 4.3). Following this heuristic, we decided to allocate many kernels at high test input density regions, which can be achieved by setting the Gaussian centers at the test input points $\{x_j^{\rm te}\}_{j=1}^{n_{\rm te}}$.

Alternatively, we may locate $(n_{\rm tr} + n_{\rm te})$ Gaussian kernels at both $\{x_i^{\rm tr}\}_{i=1}^{n_{\rm tr}}$ and $\{x_j^{\rm te}\}_{j=1}^{n_{\rm te}}$. However, this seems not to further improve the performance, but slightly increases the computational cost. When $n_{\rm te}$ is very large, just using all the test input points $\{x_j^{\rm te}\}_{j=1}^{n_{\rm te}}$ as Gaussian centers is already computationally
Figure 4.3
Heuristic of Gaussian center allocation. Many kernels may be needed in the region where the
output of the target function is large, and only a small number of kernels will be enough in the
region where the output of the target function is close to zero.
rather demanding. To ease this problem, a subset of {x_j^te}_{j=1}^{n_te} may in practice be used as Gaussian centers for computational efficiency. That is,

ŵ(x) = Σ_{ℓ=1}^{t} α_ℓ K_σ(x, c_ℓ),  (4.8)

where c_ℓ is a template point randomly chosen from {x_j^te}_{j=1}^{n_te} and t (≤ n_te) is a prefixed number.
A MATLAB® implementation of the entire KLIEP algorithm is available from http://sugiyama-www.cs.titech.ac.jp/~sugi/software/KLIEP/.
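To make the procedure above concrete, here is a minimal NumPy sketch of KLIEP with the Gaussian kernel model of section 4.4.3: it maximizes the average test log-importance subject to the training-side normalization constraint and α ≥ 0 via simple projected gradient ascent. This is an illustrative stand-in, not the reference implementation; the function names, step size, and iteration count are arbitrary choices.

```python
import numpy as np

def gauss_kernel(x, c, sigma):
    # Pairwise Gaussian kernel values between 1-D points x and centers c.
    d2 = (x[:, None] - c[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def kliep(x_tr, x_te, centers, sigma, n_iter=500, lr=1e-3):
    # Projected gradient ascent on the KLIEP objective:
    #   maximize (1/n_te) sum_j log w_hat(x_te_j)
    #   subject to (1/n_tr) sum_i w_hat(x_tr_i) = 1 and alpha >= 0.
    Phi_te = gauss_kernel(x_te, centers, sigma)   # n_te x t
    Phi_tr = gauss_kernel(x_tr, centers, sigma)   # n_tr x t
    b = Phi_tr.mean(axis=0)                       # normalization vector
    alpha = np.ones(len(centers))
    alpha /= b @ alpha                            # satisfy the constraint
    for _ in range(n_iter):
        grad = (Phi_te / (Phi_te @ alpha)[:, None]).mean(axis=0)
        alpha = np.maximum(0.0, alpha + lr * grad)  # ascent step + nonnegativity
        alpha /= b @ alpha                          # re-project onto the constraint
    return lambda x: gauss_kernel(x, centers, sigma) @ alpha

rng = np.random.default_rng(0)
x_tr = rng.normal(1.0, 0.5, 200)     # p_tr = N(1, (1/2)^2), as in section 4.7
x_te = rng.normal(2.0, 0.25, 1000)   # p_te = N(2, (1/4)^2)
w_hat = kliep(x_tr, x_te, centers=x_te[:100], sigma=0.4)
print(np.mean(w_hat(x_tr)))          # = 1 up to rounding, by the constraint
```

By construction, the estimated weights are large in the high test density region around x = 2 and close to zero where only training data lie.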
4.5 Least-Squares Importance Fitting
KLIEP employed the Kullback-Leibler divergence for measuring the discrepancy between two densities. Least-squares importance fitting (LSIF) [88] uses the squared loss for importance function fitting.
4.5.1 Algorithm
The importance w(x) is again modeled by the linear-in-parameter model (equation 4.4). The parameters {α_ℓ}_{ℓ=1}^{t} in the model ŵ(x) are determined so that the following squared error J is minimized:

J(α) := (1/2) E_{x^tr} [ (ŵ(x^tr) − w(x^tr))² ]
      = (1/2) E_{x^tr} [ ŵ²(x^tr) ] − E_{x^te} [ ŵ(x^te) ] + (1/2) E_{x^tr} [ w²(x^tr) ],

where the last term is a constant, and therefore can be safely ignored. Let us denote the first two terms by J':

J'(α) := J(α) − (1/2) E_{x^tr} [ w²(x^tr) ]
       = (1/2) E_{x^tr} [ ŵ²(x^tr) ] − E_{x^te} [ ŵ(x^te) ].

Approximating the expectations in J' by empirical averages, we obtain

Ĵ'(α) := (1/(2 n_tr)) Σ_{i=1}^{n_tr} ŵ²(x_i^tr) − (1/n_te) Σ_{j=1}^{n_te} ŵ(x_j^te)
       = (1/2) α^T Ĥ α − ĥ^T α,
where Ĥ is the t × t matrix with the (ℓ, ℓ')-th element

Ĥ_{ℓ,ℓ'} := (1/n_tr) Σ_{i=1}^{n_tr} φ_ℓ(x_i^tr) φ_{ℓ'}(x_i^tr),  (4.9)

and ĥ is the t-dimensional vector with the ℓ-th element

ĥ_ℓ := (1/n_te) Σ_{j=1}^{n_te} φ_ℓ(x_j^te).  (4.10)
Taking into account the nonnegativity of the importance function w(x), the optimization problem is formulated as follows:

min_α [ (1/2) α^T Ĥ α − ĥ^T α + λ 1_t^T α ]
subject to α ≥ 0_t,  (4.11)

where 1_t and 0_t are the t-dimensional vectors with all ones and all zeros, respectively. The vector inequality α ≥ 0_t is applied in the elementwise manner, that is,

α_ℓ ≥ 0 for ℓ = 1, 2, ..., t.
In equation 4.11, a penalty term λ 1_t^T α is included for regularization purposes, where λ (≥ 0) is a regularization parameter. Equation 4.11 is a convex quadratic programming problem, and therefore the unique global optimal solution can be computed efficiently by a standard optimization package.
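As a concrete sketch of equation 4.11, the following NumPy snippet solves the nonnegatively constrained QP with plain projected gradient descent, standing in for the standard optimization package mentioned above. The Ĥ and ĥ below are built from random features purely for illustration, and the step size and iteration count are arbitrary choices.

```python
import numpy as np

def lsif_qp(H, h, lam, n_iter=5000, lr=0.05):
    # Projected gradient descent for
    #   min_a 0.5 a'Ha - h'a + lam*1'a  subject to  a >= 0  (equation 4.11).
    a = np.zeros(len(h))
    for _ in range(n_iter):
        a = np.maximum(0.0, a - lr * (H @ a - h + lam))  # step + projection
    return a

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 5))
H = Phi.T @ Phi / 50 + 1e-3 * np.eye(5)  # positive definite, like equation 4.9
h = rng.uniform(0.1, 1.0, 5)             # like equation 4.10
a = lsif_qp(H, h, lam=0.1)
print(a)                                 # elementwise nonnegative
```

Since the problem is convex and the zero vector is feasible, the converged solution never has a larger objective value than the origin.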
4.5.2 Basis Function Design and Model Selection
Basis functions may be designed in the same way as for KLIEP, that is, Gaussian basis functions centered at (a subset of) the test input points {x_j^te}_{j=1}^{n_te} (see section 4.4).

Model selection of the Gaussian width σ and the regularization parameter λ is possible by CV: First, {x_i^tr}_{i=1}^{n_tr} and {x_j^te}_{j=1}^{n_te} are divided into k disjoint subsets
{X_r^tr}_{r=1}^{k} and {X_r^te}_{r=1}^{k}, respectively. Then an importance estimate ŵ_r(x) is obtained using {X_{r'}^tr}_{r'≠r} and {X_{r'}^te}_{r'≠r} (i.e., without X_r^tr and X_r^te), and the cost J' is approximated using the holdout samples X_r^tr and X_r^te as

Ĵ'_r := (1/(2|X_r^tr|)) Σ_{x∈X_r^tr} ŵ_r²(x) − (1/|X_r^te|) Σ_{x∈X_r^te} ŵ_r(x).

This procedure is repeated for r = 1, 2, ..., k, and the average Ĵ' is used as an estimate of J':

Ĵ' := (1/k) Σ_{r=1}^{k} Ĵ'_r.

For LSIF, an information criterion has also been derived [88], which is an asymptotically unbiased estimator of the error criterion J'.
4.5.3 Regularization Path Tracking
The LSIF solution α̂ is shown to be piecewise linear with respect to the regularization parameter λ (see figure 4.4). Therefore, the regularization path (i.e., the solutions for all λ) can be computed efficiently based on the parametric optimization technique [14, 44, 70].

A basic idea of regularization path tracking is to check the violation of the Karush-Kuhn-Tucker (KKT) conditions [27], which are necessary and sufficient conditions for optimality of convex programs, when the regularization parameter λ is changed. A pseudo code of the regularization path tracking algorithm for LSIF is described in figure 4.5.
Figure 4.4
Regularization path tracking of LSIF. The solution α̂(λ) is shown to be piecewise linear in the parameter space as a function of λ. Starting from λ = ∞, the trajectory of the solution is traced as λ is decreased to zero. When λ ≥ λ_0 for some λ_0 ≥ 0, the solution stays at the origin 0_t. When λ gets smaller than λ_0, the solution departs from the origin. As λ is further decreased, for some λ_1 such that 0 ≤ λ_1 ≤ λ_0, the solution goes straight to α̂(λ_1) with a constant "speed." Then the solution path changes direction and, for some λ_2 such that 0 ≤ λ_2 ≤ λ_1, the solution is headed straight for α̂(λ_2) with a constant speed as λ is further decreased. This process is repeated until λ reaches zero.
Input: Ĥ and ĥ  % see equations 4.9 and 4.10 for the definitions
Output: entire regularization path α̂(λ) for λ ≥ 0

τ ← 0;
k ← argmax_i { ĥ_i | i = 1, 2, ..., t };
λ_τ ← ĥ_k;
A ← {1, 2, ..., t} \ {k};
α̂(λ_τ) ← 0_t;  % vector with all zeros
While λ_τ > 0
    E ← O_{|A|×t};  % matrix with all zeros
    For i = 1, 2, ..., |A|
        E_{i, j_i} ← 1;  % A = {j_1, j_2, ..., j_|A| | j_1 < j_2 < ... < j_|A|}
    end
    Ĝ ← ( Ĥ  −E^T ; −E  O_{|A|×|A|} );  % block matrix
    u ← Ĝ^{-1} ( ĥ ; 0_|A| );
    v ← Ĝ^{-1} ( 1_t ; 0_|A| );
    If v ≤ 0_{t+|A|}  % the final interval
        λ_{τ+1} ← 0;
        α̂(λ_{τ+1}) ← (u_1, u_2, ..., u_t)^T;
    else  % an intermediate interval
        k ← argmax_i { u_i/v_i | v_i > 0, i = 1, 2, ..., t + |A| };
        λ_{τ+1} ← max{0, u_k/v_k};
        α̂(λ_{τ+1}) ← (u_1, u_2, ..., u_t)^T − λ_{τ+1} (v_1, v_2, ..., v_t)^T;
        If 1 ≤ k ≤ t
            A ← A ∪ {k};
        else
            A ← A \ {j_{k−t}};
        end
    end
    τ ← τ + 1;
end
Figure 4.5
Pseudo code for computing the entire regularization path of LSIF. The computation of Ĝ^{-1} is sometimes unstable. For stabilization purposes, small positive diagonals may be added to Ĥ.
The pseudo code shows that a quadratic programming solver is no longer needed for obtaining the LSIF solution; just computing matrix inverses is enough. This contributes highly to saving computation time. Furthermore, the regularization path algorithm is computationally very efficient when the solution is sparse (that is, most of the elements are zero), since the number of change points tends to be small for sparse solutions.
An R implementation of the entire LSIF algorithm is available from http://www.math.cm.is.nagoya-u.ac.jp/~kanamori/software/LSIF/.
4.6 Unconstrained Least-Squares Importance Fitting
LSIF combined with regularization path tracking is computationally very efficient. However, it sometimes suffers from a numerical problem, and therefore is not reliable in practice. To cope with this problem, an approximation method called unconstrained LSIF (uLSIF) has been introduced [88].
4.6.1 Algorithm
The approximation idea is very simple: the nonnegativity constraint in the optimization problem (equation 4.11) is dropped. This results in the following unconstrained optimization problem:

min_α [ (1/2) α^T Ĥ α − ĥ^T α + (λ/2) α^T α ].  (4.12)

In the above, a quadratic regularization term λ α^T α / 2 is included instead of the linear one λ 1_t^T α, since the linear penalty term does not work as a regularizer without the nonnegativity constraint. Equation 4.12 is an unconstrained convex quadratic program, and the solution can be analytically computed as

α̃ = (Ĥ + λ I_t)^{-1} ĥ,

where I_t is the t-dimensional identity matrix.

Since the nonnegativity constraint α ≥ 0_t is dropped, some of the learned parameters can be negative. To compensate for this approximation error, the solution is modified as

α̂ = max(0_t, α̃),  (4.13)
where the "max" operation for a pair of vectors is applied in the elementwise manner. The error caused by ignoring the nonnegativity constraint and the above rounding-up operation is theoretically investigated in [88].

An advantage of the above unconstrained formulation is that the solution can be computed just by solving a system of linear equations. Therefore, the computation is fast and stable. In addition, uLSIF has been shown to be superior in terms of condition numbers [90].
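The uLSIF solution of equations 4.12 and 4.13 is just one linear solve followed by elementwise rounding up. A minimal NumPy sketch, where Ĥ and ĥ are illustrative stand-ins for the kernel quantities of equations 4.9 and 4.10:

```python
import numpy as np

def ulsif_solution(H, h, lam):
    # alpha_tilde = (H + lam*I)^{-1} h   (equation 4.12),
    # alpha_hat   = max(0, alpha_tilde)  (equation 4.13).
    t = len(h)
    alpha_tilde = np.linalg.solve(H + lam * np.eye(t), h)
    return np.maximum(0.0, alpha_tilde)

rng = np.random.default_rng(1)
Phi = rng.normal(size=(100, 8))
H = Phi.T @ Phi / 100        # illustrative positive semidefinite matrix
h = rng.uniform(0.0, 1.0, 8)
alpha = ulsif_solution(H, h, lam=0.5)
print(alpha)                 # elementwise nonnegative
```

No iterative optimization is involved, which is exactly why uLSIF is fast and numerically stable.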
4.6.2 Analytic Computation of Leave-One-Out Cross-Validation
Another notable advantage of uLSIF is that the score of leave-one-out CV (LOOCV) can be computed analytically. Thanks to this property, the computational complexity for performing LOOCV is of the same order as that of computing a single solution, as explained below.
In the current setting, two sets of samples {x_i^tr}_{i=1}^{n_tr} and {x_j^te}_{j=1}^{n_te} are given, which generally are of different size. To explain the idea in a simple manner, we assume that n_tr < n_te, and that x_i^tr and x_i^te (i = 1, 2, ..., n_tr) are held out at the same time; {x_j^te}_{j=n_tr+1}^{n_te} are always used for importance estimation.

Let ŵ_i(x) be an estimate of the importance function obtained without x_i^tr and x_i^te. Then the LOOCV score is expressed as

LOOCV := (1/n_tr) Σ_{i=1}^{n_tr} [ (1/2) (ŵ_i(x_i^tr))² − ŵ_i(x_i^te) ].  (4.14)
Our approach to efficiently computing the LOOCV score is to use the Sherman-Woodbury-Morrison formula [64] for computing matrix inverses. For an invertible square matrix A and vectors ξ and η such that η^T A^{-1} ξ ≠ −1, the Sherman-Woodbury-Morrison formula states that

(A + ξ η^T)^{-1} = A^{-1} − (A^{-1} ξ η^T A^{-1}) / (1 + η^T A^{-1} ξ).
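The Sherman-Woodbury-Morrison identity can be checked numerically; a small NumPy sketch, where the matrix and vectors are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)) + 4 * np.eye(4)  # well-conditioned invertible matrix
xi = rng.normal(size=4)
eta = rng.normal(size=4)
Ainv = np.linalg.inv(A)

# Left side: direct inverse of the rank-one update.
lhs = np.linalg.inv(A + np.outer(xi, eta))
# Right side: Sherman-Woodbury-Morrison correction of A^{-1}.
rhs = Ainv - (Ainv @ np.outer(xi, eta) @ Ainv) / (1 + eta @ Ainv @ xi)
print(np.allclose(lhs, rhs))  # the identity holds whenever eta' A^{-1} xi != -1
```

The point for uLSIF is that the right side only reuses A^{-1}, so each leave-one-out solution costs a rank-one correction rather than a fresh matrix inversion.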
A pseudo code of uLSIF with LOOCV-based model selection is summarized
in figure 4.6.
MATLAB® and R implementations of the entire uLSIF algorithm are available from http://sugiyama-www.cs.titech.ac.jp/~sugi/software/uLSIF/ and http://www.math.cm.is.nagoya-u.ac.jp/~kanamori/software/LSIF/, respectively.
4.7 Numerical Examples
In this section, we illustrate the behavior of the KLIEP method, and how it can
be applied to covariate shift adaptation.
Input: {x_i^tr}_{i=1}^{n_tr} and {x_j^te}_{j=1}^{n_te}
Output: ŵ(x)

t ← min(100, n_te);  n ← min(n_tr, n_te);
Randomly choose t centers {c_ℓ}_{ℓ=1}^{t} from {x_j^te}_{j=1}^{n_te} without replacement;
For each candidate of Gaussian width σ
    Ĥ_{ℓ,ℓ'} ← (1/n_tr) Σ_{i=1}^{n_tr} exp( −(‖x_i^tr − c_ℓ‖² + ‖x_i^tr − c_{ℓ'}‖²)/(2σ²) ) for ℓ, ℓ' = 1, 2, ..., t;
    ĥ_ℓ ← (1/n_te) Σ_{j=1}^{n_te} exp( −‖x_j^te − c_ℓ‖²/(2σ²) ) for ℓ = 1, 2, ..., t;
    X^tr_{i,ℓ} ← exp( −‖x_i^tr − c_ℓ‖²/(2σ²) ) for i = 1, 2, ..., n and ℓ = 1, 2, ..., t;
    X^te_{i,ℓ} ← exp( −‖x_i^te − c_ℓ‖²/(2σ²) ) for i = 1, 2, ..., n and ℓ = 1, 2, ..., t;
    For each candidate of regularization parameter λ
        B̂ ← Ĥ + (λ(n_tr − 1)/n_tr) I_t;
        B_0 ← B̂^{-1} ĥ 1_n^T + B̂^{-1} X^tr⊤ diag( (ĥ^T B̂^{-1} X^tr⊤) / (n_tr 1_n^T − 1_t^T (X^tr⊤ ∗ B̂^{-1} X^tr⊤)) );
        B_1 ← B̂^{-1} X^te⊤ + B̂^{-1} X^tr⊤ diag( (1_t^T (X^te⊤ ∗ B̂^{-1} X^tr⊤)) / (n_tr 1_n^T − 1_t^T (X^tr⊤ ∗ B̂^{-1} X^tr⊤)) );
        B_2 ← max( O_{t×n}, (n_tr/(n_te (n_tr − 1))) (n_te B_0 − B_1) );
        LOOCV(σ, λ) ← ‖(X^tr ∗ B_2^T) 1_t‖²/(2n) − 1_n^T (X^te ∗ B_2^T) 1_t / n;
    end
end
(σ̂, λ̂) ← argmin_{(σ, λ)} LOOCV(σ, λ);
Ĥ_{ℓ,ℓ'} ← (1/n_tr) Σ_{i=1}^{n_tr} exp( −(‖x_i^tr − c_ℓ‖² + ‖x_i^tr − c_{ℓ'}‖²)/(2σ̂²) ) for ℓ, ℓ' = 1, 2, ..., t;
ĥ_ℓ ← (1/n_te) Σ_{j=1}^{n_te} exp( −‖x_j^te − c_ℓ‖²/(2σ̂²) ) for ℓ = 1, 2, ..., t;
α̂ ← max( 0_t, (Ĥ + λ̂ I_t)^{-1} ĥ );
ŵ(x) ← Σ_{ℓ=1}^{t} α̂_ℓ exp( −‖x − c_ℓ‖²/(2σ̂²) );
Figure 4.6
Pseudo code of uLSIF with LOOCV. B ∗ B' denotes the elementwise multiplication of matrices B and B' of the same size. For n-dimensional vectors b and b', diag(b/b') denotes the n × n diagonal matrix with i-th diagonal element b_i/b'_i.
4.7.1 Setting
Let us consider a one-dimensional toy regression problem of learning the following function:

f(x) = sinc(x) := { 1              if x = 0,
                    sin(πx)/(πx)   otherwise.
Let the training and test input densities be

p_tr(x) = N(x; 1, (1/2)²),
p_te(x) = N(x; 2, (1/4)²),

where N(x; μ, σ²) denotes the Gaussian density with mean μ and variance σ².
We create the training output values {y_i^tr}_{i=1}^{n_tr} by

y_i^tr = f(x_i^tr) + ε_i^tr,

where the i.i.d. noise {ε_i^tr}_{i=1}^{n_tr} has density N(ε; 0, (1/4)²). Test output values {y_j^te}_{j=1}^{n_te} are generated in the same way. Let the number of training samples be n_tr = 200 and the number of test samples be n_te = 1000. The goal is to obtain a function f̂(x) such that the generalization error is minimized.

This setting implies that we are considering a (weak) extrapolation problem (see figure 4.7, where only 100 test samples are plotted for clear visibility).
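The toy data of this setting can be generated as follows; a NumPy sketch, where the random seed is an arbitrary choice and np.sinc implements the normalized sinc function used above:

```python
import numpy as np

def sinc(x):
    # f(x) = sin(pi x)/(pi x), with f(0) = 1; NumPy's sinc is exactly this.
    return np.sinc(x)

rng = np.random.default_rng(0)
n_tr, n_te = 200, 1000
x_tr = rng.normal(1.0, 1/2, n_tr)              # p_tr = N(1, (1/2)^2)
x_te = rng.normal(2.0, 1/4, n_te)              # p_te = N(2, (1/4)^2)
y_tr = sinc(x_tr) + rng.normal(0, 1/4, n_tr)   # additive N(0, (1/4)^2) noise
y_te = sinc(x_te) + rng.normal(0, 1/4, n_te)
```

Since the test inputs concentrate around x = 2 while the training inputs concentrate around x = 1, a model fitted to the training data is evaluated mostly outside its data-rich region, which is the (weak) extrapolation just described.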
4.7.2 Importance Estimation by KLIEP
First, we illustrate the behavior of KLIEP in importance estimation, where we use only the input points {x_i^tr}_{i=1}^{n_tr} and {x_j^te}_{j=1}^{n_te}.

Figure 4.8 depicts the true importance and its estimates by KLIEP; the Gaussian kernel model with b = 100 is used, and three Gaussian widths σ = 0.02, 0.2, 0.8 are tested. The graphs show that the performance of KLIEP is highly dependent on the Gaussian width; the estimated importance function ŵ(x) fluctuates highly when σ is small, and is overly smoothed when σ is large. When σ is chosen appropriately, KLIEP seems to work reasonably well for this example.
Figure 4.9 depicts the values of the true KL' (see equation 4.5) and its estimate by fivefold CV (see equation 4.7); the means, the 25th percentiles, and the 75th percentiles over 100 trials are plotted as functions of the Gaussian width σ. This shows that CV gives a very good estimate of KL', which results in an appropriate choice of σ.
Figure 4.7
Illustrative example: (a) training input density p_tr(x) and test input density p_te(x); (b) target function f(x), training samples {(x_i^tr, y_i^tr)}_{i=1}^{n_tr}, and test samples {(x_j^te, y_j^te)}_{j=1}^{n_te}.

Figure 4.8
Results of importance estimation by KLIEP for Gaussian widths (a) σ = 0.02, (b) σ = 0.2, and (c) σ = 0.8. w(x) is the true importance function, and ŵ(x) is its estimate obtained by KLIEP.
Figure 4.9
Model selection curve for KLIEP. KL' is the true score of an estimated importance (see equation 4.5), and KL̂'_CV is its estimate by fivefold CV (see equation 4.7).
4.7.3 Covariate Shift Adaptation by IWLS and IWCV
Next, we illustrate how the estimated importance is used for covariate shift adaptation. Here we use {(x_i^tr, y_i^tr)}_{i=1}^{n_tr} and {x_j^te}_{j=1}^{n_te} for learning; the test output values {y_j^te}_{j=1}^{n_te} are used only for evaluating the generalization performance. We use the following polynomial regression model:

f̂(x; θ) := Σ_{ℓ=0}^{t} θ_ℓ x^ℓ,  (4.15)
where t is the order of the polynomial. The parameter vector θ is learned by importance-weighted least squares (IWLS):

θ̂_IWLS := argmin_θ [ Σ_{i=1}^{n_tr} ŵ(x_i^tr) (f̂(x_i^tr; θ) − y_i^tr)² ].

IWLS is asymptotically unbiased when the true importance w(x_i^tr) is used as the weights. On the other hand, ordinary LS is not asymptotically unbiased under covariate shift, given that the model f̂(x; θ) is not correctly specified (see section 2.1.1). For the linear regression model (equation 4.15), the minimizer θ̂_IWLS is given analytically by
θ̂_IWLS = (X^tr⊤ Ŵ^tr X^tr)^{-1} X^tr⊤ Ŵ^tr y^tr,

where

X^tr_{i,ℓ} := (x_i^tr)^{ℓ−1},
Ŵ^tr := diag( ŵ(x_1^tr), ŵ(x_2^tr), ..., ŵ(x_{n_tr}^tr) ),
y^tr := (y_1^tr, y_2^tr, ..., y_{n_tr}^tr)^T.

diag(a, b, ..., c) denotes the diagonal matrix with diagonal elements a, b, ..., c.
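The closed form above can be sketched directly in NumPy (the helper name is illustrative; the importance weights ŵ(x_i^tr) are assumed to be given). With all weights equal, IWLS reduces to ordinary least squares, which the example uses as a sanity check:

```python
import numpy as np

def iwls_polyfit(x, y, w, order):
    # Importance-weighted LS for f(x; theta) = sum_l theta_l x^l:
    #   theta = (X' W X)^{-1} X' W y.
    X = np.vander(x, order + 1, increasing=True)  # columns 1, x, ..., x^order
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 0.01, 50)   # nearly linear data
theta = iwls_polyfit(x, y, np.ones(50), order=1)
print(theta)                                  # close to [2, 3]
```

In the covariate shift setting, w would instead hold the estimated importance values ŵ(x_i^tr), so training points lying in the high test density region dominate the fit.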
We choose the order t of the polynomial based on importance-weighted CV (IWCV; see section 3.3). More specifically, we first divide the training samples {z_i^tr | z_i^tr = (x_i^tr, y_i^tr)}_{i=1}^{n_tr} into k disjoint subsets {Z_r^tr}_{r=1}^{k}. Then we learn a function f̂_r(x) from {Z_{r'}^tr}_{r'≠r} (i.e., without Z_r^tr) by IWLS, and compute its mean test error for the remaining samples Z_r^tr:

Ĝen_r := (1/|Z_r^tr|) Σ_{(x,y)∈Z_r^tr} ŵ(x) (f̂_r(x) − y)².

This procedure is repeated for r = 1, 2, ..., k, and the average Ĝen is used as an estimate of Gen:

Ĝen := (1/k) Σ_{r=1}^{k} Ĝen_r.  (4.16)

For model selection, we compute Ĝen for all model candidates (the order t ∈ {1, 2, 3} of the polynomial in the current setting), and choose the one that minimizes Ĝen. We set the number of folds in IWCV to k = 5. IWCV is shown to be almost unbiased when the true importance w(x_i^tr) is used as the weights, while ordinary CV for misspecified models is highly biased due to covariate shift (see section 3.3).
Figure 4.10 depicts the functions learned by IWLS with different orders of polynomials. The results show that for all cases, the learned functions reasonably go through the test samples (note that the test output points are not used for obtaining the learned functions). Figure 4.11a depicts the true generalization error of IWLS and its estimate by IWCV; the means, the 25th percentiles, and the 75th percentiles over 100 runs are plotted as functions of the order of polynomials. This shows that IWCV roughly grasps the trend of the true generalization error. For comparison purposes, we include the results of ordinary LS and ordinary CV in figures 4.10 and 4.11. Figure 4.10 shows that the functions obtained by ordinary LS nicely go through the training samples, but not
Figure 4.10
Learned functions obtained by IWLS and LS, denoted by f̂_IWLS(x) and f̂_LS(x), respectively: (a) polynomial of order 1; (b) polynomial of order 2; (c) polynomial of order 3.
through the test samples. Figure 4.11 shows that the scores of ordinary CV tend to be biased, implying that model selection by ordinary CV is not reliable.

Finally, we compare the generalization errors obtained by IWLS/LS and IWCV/CV, which are summarized in figure 4.12 as box plots. This shows that IWLS+IWCV tends to outperform the other methods, illustrating the usefulness of the covariate shift adaptation method.
4.8 Experimental Comparison
In this section, we compare the accuracy and computational efficiency of density-ratio estimation methods.

Let the dimension of the domain be d, and

p_tr(x) = N(x; (0, 0, ..., 0)^T, I_d),
p_te(x) = N(x; (1, 0, ..., 0)^T, I_d),
Figure 4.11
Model selection curves for IWLS/LS and IWCV/CV: (a) IWLS; (b) LS. Gen denotes the true generalization error of a learned function; Ĝen_IWCV and Ĝen_CV denote its estimates by fivefold IWCV and fivefold CV, respectively (see equation 4.16).
Figure 4.12
Box plots of generalization errors obtained by IWLS+IWCV, IWLS+CV, LS+IWCV, and LS+CV.
where I_d denotes the d × d identity matrix and N(x; μ, Σ) denotes the multidimensional Gaussian density with mean μ and covariance matrix Σ. The task is to estimate the importance at the training points:

w_i = w(x_i^tr) = p_te(x_i^tr) / p_tr(x_i^tr)  for i = 1, 2, ..., n_tr.
We compare the following methods:

• KDE(CV) The Gaussian kernel (equation 4.1) is used, where the kernel widths of the training and test densities are separately optimized based on fivefold CV.

• KMM(med) The performance of KMM is dependent on B, ε, and σ. We set B = 1000 and ε = (√n_tr − 1)/√n_tr, following the original KMM paper [82], and the Gaussian width σ is set to the median distance between samples within the training set and the test set [141, 147].
• LR(CV) The Gaussian kernel model (equation 4.8) is used. The kernel width σ and the regularization parameter λ are chosen based on fivefold CV.

• KLIEP(CV) The Gaussian kernel model (equation 4.8) is used. The kernel width σ is selected based on fivefold CV.

• uLSIF(CV) The Gaussian kernel model (equation 4.8) is used. The kernel width σ and the regularization parameter λ are determined based on LOOCV.
All the methods are implemented using the MATLAB® environment, where
the CPLEX® optimizer is used for solving quadratic programs in KMM and
the LIBLINEAR implementation is used for LR [103].
We set the number of test points to n_te = 1000, and considered the following setups for the number n_tr of training samples and the input dimensionality d:

(a) n_tr is fixed to n_tr = 100, and d is changed as d = 1, 2, ..., 20;
(b) d is fixed to d = 10, and n_tr is changed as n_tr = 50, 60, ..., 150.
We ran the experiments 100 times for each d, each n_tr, and each method, and evaluated the quality of the importance estimates {ŵ_i}_{i=1}^{n_tr} by the normalized mean-squared error (NMSE):

NMSE := (1/n_tr) Σ_{i=1}^{n_tr} ( ŵ_i / Σ_{i'=1}^{n_tr} ŵ_{i'} − w_i / Σ_{i'=1}^{n_tr} w_{i'} )².
For the purpose of covariate shift adaptation, the global scale of the importance values is not important. Thus, the above NMSE, evaluating only the relative magnitudes among {ŵ_i}_{i=1}^{n_tr}, is a suitable error metric for the current experiments.
NMSEs averaged over 100 trials (a) as a function of the input dimensionality d and (b) as a function of the training sample size n_tr are plotted in log scale in figure 4.13. Error bars are omitted for clear visibility; instead, the best method in terms of the mean error and comparable methods based on the t-test at the significance level 1 percent are indicated by ○, and the methods with significant difference from the best methods are indicated by ×.
Figure 4.13a shows that the error of KDE(CV) sharply increases as the input dimensionality grows, while LR, KLIEP, and uLSIF tend to give much smaller errors than KDE. This would be an advantage of directly estimating the importance without going through density estimation. KMM tends to perform poorly, which is caused by an inappropriate choice of the Gaussian kernel width. On the other hand, model selection in LR, KLIEP, and uLSIF seems to work quite well. Figure 4.13b shows that the errors of all methods tend to decrease as the number of training samples grows. Again, LR, KLIEP, and uLSIF tend to give much smaller errors than KDE and KMM.
Next, we investigate the computation time. Each method has a different model selection strategy: KMM does not involve CV; KDE and KLIEP involve CV over the kernel width; and LR and uLSIF involve CV over both the kernel width and the regularization parameter. Thus, a naive comparison of the total computation time is not so meaningful. For this reason, we first investigate the computation time of each importance estimation method after the model parameters have been determined.

The average CPU computation time over 100 trials is summarized in figure 4.14. Figure 4.14a shows that the computation times of KDE, KLIEP, and uLSIF are almost independent of the input dimensionality, while those of KMM and LR are rather dependent on the input dimensionality. Note that LR for d ≤ 3 is slow due to a convergence problem of the LIBLINEAR package. uLSIF is one of the fastest methods. Figure 4.14b shows that the computation times of LR, KLIEP, and uLSIF are nearly independent of the number of training samples, while those of KDE and KMM sharply increase as the number of training samples increases.
Both LR and uLSIF have high accuracy, and their computation times after model selection are comparable. Finally, we compare the entire computation times of LR and uLSIF including CV, which are summarized in figure 4.15. The Gaussian width σ and the regularization parameter λ are chosen over the 9 × 9 grid for both LR and uLSIF; therefore, the comparison of the entire computation time is fair. Figures 4.15a and 4.15b show that uLSIF is approximately five times faster than LR.

Overall, uLSIF is shown to be comparable to the best existing method (LR) in terms of accuracy, but is computationally more efficient than LR.
Figure 4.13
NMSEs averaged over 100 trials in log scale for the artificial data set: (a) when the input dimensionality is changed; (b) when the training sample size is changed. The methods compared are KDE(CV), KMM(med), LR(CV), KLIEP(CV), and uLSIF(CV). Error bars are omitted for clear visibility. Instead, the best method in terms of the mean error and comparable methods based on the t-test at the significance level 1 percent are indicated by ○; the methods with significant difference from the best methods are indicated by ×.
Figure 4.14
Average computation time (after model selection) over 100 trials for the artificial data set: (a) when the input dimensionality is changed; (b) when the training sample size is changed.
Figure 4.15
Average computation time over 100 trials for the artificial data set, including model selection of the Gaussian width σ and the regularization parameter λ over the 9 × 9 grid: (a) when the input dimensionality is changed; (b) when the training sample size is changed.
4.9 Summary
In this chapter, we have introduced importance estimation methods that avoid solving the substantially more difficult task of density estimation. Table 4.1 summarizes the properties of the importance estimation methods.
Kernel density estimation (KDE; section 4.1) is computationally very efficient since no optimization is involved, and model selection is possible by cross-validation (CV). However, KDE may suffer from the curse of dimensionality due to the difficulty of density estimation in high dimensions.
Kernel mean matching (KMM; section 4.2) may potentially work better by directly estimating the importance. However, since objective model selection methods are missing for KMM, model parameters such as the Gaussian width need to be determined by hand. This is highly unreliable unless we have strong prior knowledge. Furthermore, the computation of KMM is rather demanding since a quadratic programming problem has to be solved.
Logistic regression (LR; section 4.3) and the Kullback-Leibler importance estimation procedure (KLIEP; section 4.4) also do not involve density estimation. However, in contrast to KMM, LR and KLIEP are equipped with CV for model selection, which is a significant advantage over KMM. Nevertheless, LR and KLIEP are computationally rather expensive since nonlinear optimization problems have to be solved.
Least-squares importance fitting (LSIF; section 4.5) is qualitatively similar to LR and KLIEP: it avoids density estimation, model selection is possible, and nonlinear optimization is involved. LSIF is more advantageous than LR and KLIEP in that it is equipped with a regularization path tracking algorithm. Thanks to this, model selection for LSIF is computationally much more efficient than for LR and KLIEP. However, the regularization path tracking algorithm tends to be numerically unstable.
Table 4.1
Importance estimation methods

Method   Density Estimation   Model Selection   Optimization       Out-of-Sample Prediction
KDE      Necessary            Available         Analytic           Possible
KMM      Not necessary        Not available     Convex QP          Not possible
LR       Not necessary        Available         Convex nonlinear   Possible
KLIEP    Not necessary        Available         Convex nonlinear   Possible
LSIF     Not necessary        Available         Convex QP          Possible
uLSIF    Not necessary        Available         Analytic           Possible

QP, quadratic program.
Unconstrained LSIF (uLSIF; section 4.6) exhibits the good properties of the other methods (e.g., no density estimation is involved and a built-in model selection method is available). In addition to these properties, the solution of uLSIF can be computed analytically by solving a system of linear equations. Therefore, uLSIF is computationally very efficient and numerically stable. Furthermore, thanks to the availability of the closed-form solution, the LOOCV score of uLSIF can be computed analytically without repeating holdout loops, which highly contributes to reducing the computation time in the model selection phase.

Consequently, uLSIF is a preferable method for importance estimation.
5 Direct Density-Ratio Estimation with Dimensionality
Reduction
As shown in chapter 4, various methods have been developed for directly estimating the density ratio without going through density estimation. However, even these methods can perform rather poorly when the dimensionality of the data domain is high. In this chapter, a dimensionality reduction scheme for density-ratio estimation, called direct density-ratio estimation with dimensionality reduction (D³; pronounced as "D-cube") [158], is introduced.
5.1 Density Difference in Hetero-Distributional Subspace
The basic assumption behind D³ is that the densities p_tr(x) and p_te(x) are different not in the entire space, but only in some subspace. This assumption can be mathematically formulated with the following linear mixing model.

Let {u_i^tr}_{i=1}^{n_tr} be i.i.d. samples drawn from an m-dimensional distribution with density¹ p_tr(u), where m is in {1, 2, ..., d}. We assume p_tr(u) > 0 for all u. Let {u_j^te}_{j=1}^{n_te} be i.i.d. samples drawn from another m-dimensional distribution with density p_te(u). Let {v_i^tr}_{i=1}^{n_tr} and {v_j^te}_{j=1}^{n_te} be i.i.d. samples drawn from a (d − m)-dimensional distribution with density p(v). We assume p(v) > 0 for all v. Let A be a d × m matrix and B be a d × (d − m) matrix such that the column vectors of A and B span the entire space. Based on these quantities, we consider the case where the samples {x_i^tr}_{i=1}^{n_tr} and {x_j^te}_{j=1}^{n_te} are generated as

x_i^tr = A u_i^tr + B v_i^tr,
x_j^te = A u_j^te + B v_j^te.

1. With abuse of notation, we use p_tr(u) and p_te(u), which are different from p_tr(x) and p_te(x), for simplicity.
Thus, p_tr(x) and p_te(x) are expressed as

p_tr(x) = c p_tr(u) p(v),
p_te(x) = c p_te(u) p(v),

where c is the Jacobian between the observation x and (u, v). We call R(A) and R(B) the hetero-distributional subspace and the homo-distributional subspace, respectively, where R(·) denotes the range of a matrix. Note that R(A) and R(B) are not generally orthogonal to one another (see figure 5.1).

Under the above decomposability assumption with independence of u and v, the density ratio of p_te(x) and p_tr(x) can be simplified as

w(x) = p_te(x)/p_tr(x) = (c p_te(u) p(v)) / (c p_tr(u) p(v)) = p_te(u)/p_tr(u) = w(u).  (5.1)

This means that the density ratio does not have to be estimated in the entire d-dimensional space, but only in the hetero-distributional subspace of dimension m (≤ d). Now we want to extract the hetero-distributional components u_i^tr and u_j^te from the original high-dimensional samples x_i^tr and x_j^te. This allows us to estimate the density ratio only in R(A) via equation 5.1. As illustrated in figure 5.1, the oblique projection of x_i^tr and x_j^te onto R(A) along R(B) gives u_i^tr and u_j^te.
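The oblique projection can be illustrated numerically for d = 2 and m = 1. A hedged NumPy sketch, where the particular A and B are arbitrary illustrative choices: stacking the dual bases as [U; V] = [A B]^{-1} gives UA = I_m, UB = O, VA = O, VB = I_{d−m}, and the matrix AU then projects onto R(A) along R(B), recovering the hetero-distributional component u.

```python
import numpy as np

# Hetero-distributional subspace example with d = 2, m = 1.
A = np.array([[1.0], [0.0]])   # d x m, spans R(A)
B = np.array([[1.0], [2.0]])   # d x (d-m), spans R(B); not orthogonal to A

# Dual bases: [U; V] = [A B]^{-1} satisfies UA = I, UB = 0, VA = 0, VB = I.
UV = np.linalg.inv(np.hstack([A, B]))
U, V = UV[:1, :], UV[1:, :]

# A sample generated as x = A u + B v; the oblique projection P = A U
# eliminates the homo-distributional component and U x recovers u.
u, v = 0.7, -1.3
x = A @ np.array([u]) + B @ np.array([v])
P = A @ U                      # oblique projection onto R(A) along R(B)
print(U @ x)                   # recovers u = 0.7
print(np.allclose(P @ P, P))   # idempotent, as a projection must be
```

Because A and B are not orthogonal here, the ordinary orthogonal projection onto R(A) would mix in part of the v-component; only the oblique projection along R(B) separates the two cleanly.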
5.2 Characterization of Hetero-Distributional Subspace
Let us denote the oblique projection matrix onto R(A) along R(B) by P_{R(A),R(B)}. In order to characterize the oblique projection matrix P_{R(A),R(B)}, let us consider matrices U and V whose rows consist of dual bases for the column vectors of A and B, respectively. More specifically, U is an m × d matrix and V is a (d − m) × d matrix such that they are bi-orthogonal to one another:

U B = O_{m×(d−m)},
V A = O_{(d−m)×m},

where O_{m×m'} denotes the m × m' matrix with all zeros. Thus, R(B) and R(U^T) are orthogonal to one another, and R(A) and R(V^T) are orthogonal to one another, where ^T denotes the transpose. When R(A) and R(B) are orthogonal to one another, R(U^T) agrees with R(A) and R(V^T) agrees with R(B); however, in general they are different, as illustrated in figure 5.1.
Figure 5.1
(a) p_tr(x) ∝ p_tr(u) p(v). (b) p_te(x) ∝ p_te(u) p(v).
A schematic picture of the hetero-distributional subspace for d = 2 and m = 1. Let A ∝ (1, 0)^T and B ∝ (1, 2)^T. Then U ∝ (2, −1) and V ∝ (0, 1). R(A) and R(B) are called the hetero-distributional subspace and the homo-distributional subspace, respectively. If a data point x is projected onto R(A) along R(B), the homo-distributional component v can be eliminated and the hetero-distributional component u can be extracted.
The relation between A and B and the relation between U and V can be characterized in terms of the covariance matrix Σ (of either p_tr(x) or p_te(x)) as

A^T Σ^{−1} B = O_{m×(d−m)},   (5.2)
U Σ V^T = O_{m×(d−m)}.   (5.3)
These orthogonality relations in terms of Σ follow from the statistical independence between the components in R(A) and R(B); more specifically, equation 5.2 follows from the fact that the sphering operation (transforming samples x by x ← Σ^{−1/2} x in advance) orthogonalizes independent components u and v [84], and equation 5.3 is its dual expression [92]. After sphering, the covariance matrix becomes the identity, and consequently all the discussions become simpler. However, estimating the covariance matrix from samples can be erroneous in high-dimensional problems, and taking its inverse further magnifies the estimation error. For this reason, we decided to deal directly with nonorthogonal A and B below.

For normalization purposes, we further assume that

UA = I_m,
VB = I_{d−m},
where I_m denotes the m-dimensional identity matrix. Then the oblique projection matrices P_{R(A),R(B)} and P_{R(B),R(A)} can be expressed as

P_{R(A),R(B)} = AU,
P_{R(B),R(A)} = BV,

which can be confirmed by the facts that P_{R(A),R(B)}² = P_{R(A),R(B)} (idempotence), that the null space of P_{R(A),R(B)} is R(B), and that the range of P_{R(A),R(B)} is R(A); the same is true for P_{R(B),R(A)}. The above expressions of P_{R(A),R(B)} and P_{R(B),R(A)} imply that U expresses projected images in an m-dimensional coordinate system within R(A), while V expresses projected images in a (d − m)-dimensional coordinate system within R(B). We call U and V the hetero-distributional mapping and the homo-distributional mapping, respectively.
Now u_i^tr, u_j^te, v_i^tr, and v_j^te are expressed as

u_i^tr = U x_i^tr,  v_i^tr = V x_i^tr,
u_j^te = U x_j^te,  v_j^te = V x_j^te.

Thus, if the hetero-distributional mapping U is estimated, estimation of the density ratio w(x) can be carried out in a low-dimensional hetero-distributional subspace via equation 5.1.
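These constructions can be checked concretely. The sketch below builds U and V from given A and B by inverting the concatenated matrix [A B] (so that UA = I_m, UB = O, VA = O, VB = I_{d−m} hold by construction) and verifies the projector properties on the figure 5.1 example; the route via a full matrix inverse is our own illustrative choice, not necessarily how one would compute this at scale.

```python
import numpy as np

def dual_bases(A, B):
    """Given A (d x m) and B (d x (d-m)) with [A B] invertible, return the
    bi-orthogonal mappings U (m x d) and V ((d-m) x d): the rows of inv([A B])
    satisfy U A = I_m, U B = O, V A = O, V B = I_{d-m}."""
    m = A.shape[1]
    W = np.linalg.inv(np.hstack([A, B]))
    return W[:m, :], W[m:, :]

# Example from figure 5.1: A along (1, 0)^T, B along (1, 2)^T.
A = np.array([[1.0], [0.0]])
B = np.array([[1.0], [2.0]])
U, V = dual_bases(A, B)            # here U is proportional to (2, -1), V to (0, 1)
P = A @ U                          # oblique projector onto R(A) along R(B)
assert np.allclose(P @ P, P)       # idempotence
assert np.allclose(P @ B, 0.0)     # R(B) is the null space
assert np.allclose(P @ A, A)       # P acts as the identity on R(A)
```

Applying U to a sample x then extracts exactly the hetero-distributional coordinate u.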
The above framework is called the direct density-ratio estimation with
dimensionality reduction (D3) [158]. For the time being, we assume that the
dimension m of the hetero-distributional subspace is known; we show how m
is estimated from data in section 5.5.
5.3 Identifying Hetero-Distributional Subspace by Supervised Dimensionality Reduction
In this section, we explain how the hetero-distributional subspace is estimated.
5.3.1 Basic Idea
In order to estimate the hetero-distributional subspace, we need a criterion that reflects the degree of distributional difference in a subspace. A key observation in this context is that the existence of a distributional difference can be checked by investigating whether samples from the two distributions can be separated from one another. That is, if samples of one distribution can be distinguished from samples of the other distribution, one may conclude that the two distributions are different; otherwise, the distributions may be similar. We employ this idea for finding the hetero-distributional subspace.
Let us denote the samples projected onto the hetero-distributional subspace by

{u_i^tr | u_i^tr = U x_i^tr}_{i=1}^{n_tr} and {u_j^te | u_j^te = U x_j^te}_{j=1}^{n_te}.

Then our goal is to find the matrix U such that {u_i^tr}_{i=1}^{n_tr} and {u_j^te}_{j=1}^{n_te} are maximally separated from one another. For that purpose, we may use any supervised dimensionality reduction method.
Among supervised dimensionality reduction methods (e.g., [72, 73, 56, 63, 62]), we decided to use local Fisher discriminant analysis (LFDA; [154]), which is an extension of classical Fisher discriminant analysis (FDA; [50]) to multimodally distributed data. LFDA has useful properties in practice: there is no limitation on the dimension of the reduced subspace (FDA is limited to one-dimensional projection for two-class problems [57]); it works well even when data have multimodal structure such as separate clusters; it is robust against outliers; its solution can be computed analytically using eigenvalue decomposition, as stably and efficiently as the original FDA; and its experimental performance has been shown to be better than that of other supervised learning methods.
Below, we briefly review the technical details of LFDA, showing how to use it for the hetero-distributional subspace search. To simplify the notation, we consider a set of binary-labeled training samples

{(x_k, y_k) | x_k ∈ R^d, y_k ∈ {+1, −1}}_{k=1}^{n},

and reduce the dimensionality of x_k using an m × d transformation matrix T, as T x_k. Effectively, the training samples {(x_k, y_k)}_{k=1}^{n} correspond to the following setup: for n = n_tr + n_te,

{x_k}_{k=1}^{n} = {x_i^tr}_{i=1}^{n_tr} ∪ {x_j^te}_{j=1}^{n_te},

with y_k = +1 if x_k ∈ {x_i^tr}_{i=1}^{n_tr} and y_k = −1 if x_k ∈ {x_j^te}_{j=1}^{n_te}.
5.3.2 Fisher Discriminant Analysis
Since LFDA is an extension of FDA [50], we first briefly review the original
FDA (see also section 2.3.1).
Let n_+ and n_− be the numbers of samples in class +1 and class −1, respectively. Let μ, μ_+, and μ_− be the means of {x_k}_{k=1}^{n}, {x_k | y_k = +1}_{k=1}^{n}, and {x_k | y_k = −1}_{k=1}^{n}, respectively:

μ = (1/n) Σ_{k=1}^{n} x_k,  μ_+ = (1/n_+) Σ_{k: y_k=+1} x_k,  μ_− = (1/n_−) Σ_{k: y_k=−1} x_k.

Let S^b and S^w be the between-class scatter matrix and the within-class scatter matrix, respectively, defined as

S^b := n_+ (μ_+ − μ)(μ_+ − μ)^T + n_− (μ_− − μ)(μ_− − μ)^T,
S^w := Σ_{k: y_k=+1} (x_k − μ_+)(x_k − μ_+)^T + Σ_{k: y_k=−1} (x_k − μ_−)(x_k − μ_−)^T.
The FDA transformation matrix T_FDA is defined as

T_FDA := argmax_{T ∈ R^{m×d}} [tr(T S^b T^T (T S^w T^T)^{−1})].

That is, FDA seeks a transformation matrix T with large between-class scatter and small within-class scatter in the embedding space R^m. In the above formulation, we implicitly assume that S^w is full-rank and T has rank m, so that the inverse of T S^w T^T exists.
Let {t_l}_{l=1}^{d} be the generalized eigenvectors associated with the generalized eigenvalues {η_l}_{l=1}^{d} of the following generalized eigenvalue problem [57]:

S^b t = η S^w t.

We assume that the generalized eigenvalues are sorted as

η_1 ≥ η_2 ≥ ... ≥ η_d.

Then a solution T_FDA is analytically given as follows (e.g., [42]):

T_FDA = (t_1 | t_2 | ... | t_m)^T.
FDA works very well if the samples in each class are Gaussian with a common covariance structure. However, it tends to give undesired results if the samples in a class form several separate clusters or there are outliers. Furthermore, the between-class scatter matrix S^b is known to have rank 1 in the current setup (see, e.g., [57]), implying that we can obtain only one meaningful feature t_1 through the FDA criterion; the remaining features {t_l}_{l=2}^{d} found by FDA are arbitrary in the null space of S^b. This is an essential limitation of FDA for dimensionality reduction.
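The FDA solution above can be computed with a generalized symmetric eigensolver. The following is a minimal sketch of our own (the small ridge on S^w, added to guard against singularity, is our assumption, not part of the original formulation); it returns the leading generalized eigenvectors as the rows of T_FDA:

```python
import numpy as np
from scipy.linalg import eigh

def fda(X, y, m, ridge=1e-8):
    """Rows of the returned matrix are the leading generalized eigenvectors of
    S^b t = eta S^w t, i.e., the FDA transformation matrix T_FDA."""
    Xp, Xm = X[y == +1], X[y == -1]
    mu, mu_p, mu_m = X.mean(axis=0), Xp.mean(axis=0), Xm.mean(axis=0)
    Sb = len(Xp) * np.outer(mu_p - mu, mu_p - mu) \
       + len(Xm) * np.outer(mu_m - mu, mu_m - mu)
    Sw = (Xp - mu_p).T @ (Xp - mu_p) + (Xm - mu_m).T @ (Xm - mu_m)
    Sw += ridge * np.eye(X.shape[1])   # guard against a singular S^w
    eta, t = eigh(Sb, Sw)              # ascending generalized eigenvalues
    return t[:, ::-1][:, :m].T         # keep the m largest

# Two well-separated Gaussian classes differing only along the first axis:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([+2, 0, 0], 1, (100, 3)),
               rng.normal([-2, 0, 0], 1, (100, 3))])
y = np.r_[np.ones(100), -np.ones(100)]
T = fda(X, y, m=1)                     # the single meaningful FDA direction
```

Since S^b has rank 1 in this two-class setup, only the first returned row is meaningful, exactly as discussed above.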
5.3.3 Local Fisher Discriminant Analysis
In order to overcome the weaknesses of FDA explained above, LFDA has been
introduced [154]. Here, we explain the main idea of LFDA briefly.
The scatter matrices S^b and S^w in the original FDA can be expressed in the pairwise form as follows [154]:

S^b = (1/2) Σ_{k,k'=1}^{n} W^b_{k,k'} (x_k − x_{k'})(x_k − x_{k'})^T,
S^w = (1/2) Σ_{k,k'=1}^{n} W^w_{k,k'} (x_k − x_{k'})(x_k − x_{k'})^T,

where

W^b_{k,k'} := 1/n − 1/n_+  if y_k = y_{k'} = +1,
              1/n − 1/n_−  if y_k = y_{k'} = −1,
              1/n          if y_k ≠ y_{k'},

W^w_{k,k'} := 1/n_+  if y_k = y_{k'} = +1,
              1/n_−  if y_k = y_{k'} = −1,
              0      if y_k ≠ y_{k'}.

Note that (1/n − 1/n_+) and (1/n − 1/n_−) included in the definition of W^b are negative values.
Based on the above pairwise expression, let us define the local between-class scatter matrix S^lb and the local within-class scatter matrix S^lw as

S^lb := (1/2) Σ_{k,k'=1}^{n} W^lb_{k,k'} (x_k − x_{k'})(x_k − x_{k'})^T,
S^lw := (1/2) Σ_{k,k'=1}^{n} W^lw_{k,k'} (x_k − x_{k'})(x_k − x_{k'})^T,

where

W^lb_{k,k'} := A_{k,k'} (1/n − 1/n_+)  if y_k = y_{k'} = +1,
               A_{k,k'} (1/n − 1/n_−)  if y_k = y_{k'} = −1,
               1/n                     if y_k ≠ y_{k'},

W^lw_{k,k'} := A_{k,k'} / n_+  if y_k = y_{k'} = +1,
               A_{k,k'} / n_−  if y_k = y_{k'} = −1,
               0               if y_k ≠ y_{k'}.
A_{k,k'} is the affinity value between x_k and x_{k'}, defined, for example, based on the local scaling heuristic [208]:

A_{k,k'} = exp(−‖x_k − x_{k'}‖² / (τ_k τ_{k'})),

where τ_k is the local scaling factor around x_k, defined by

τ_k = ‖x_k − x_k^(K)‖,

and x_k^(K) denotes the K-th nearest neighbor of x_k. A heuristic choice of K = 7 was shown to be useful through extensive simulations [208, 154]. Note that the local scaling factors are computed in a classwise manner in LFDA (see the pseudo code of LFDA in figure 5.2).
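The local scaling affinity can be computed in a few lines. The sketch below is our own vectorized rendering (O(n²) memory, fine for small n), taking the K-th nearest-neighbor distance of each point as its scaling factor:

```python
import numpy as np

def local_scaling_affinity(X, K=7):
    """Affinity A[k, k'] = exp(-||x_k - x_k'||^2 / (tau_k tau_k')), where tau_k
    is the distance from x_k to its K-th nearest neighbor."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    tau = np.sort(D, axis=1)[:, K]     # column 0 is the self-distance 0
    return np.exp(-D ** 2 / np.outer(tau, tau))

rng = np.random.default_rng(0)
A = local_scaling_affinity(rng.normal(size=(20, 3)))
```

In LFDA itself this computation is done separately within each class, as noted above.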
Based on the local scatter matrices S^lb and S^lw, the LFDA transformation matrix T_LFDA is defined as

T_LFDA := argmax_{T ∈ R^{m×d}} [tr(T S^lb T^T (T S^lw T^T)^{−1})].
Input: two sets of samples {x_i^tr}_{i=1}^{n_tr} and {x_j^te}_{j=1}^{n_te} on R^d,
       and the dimensionality of the embedding space m (1 ≤ m ≤ d)
Output: m × d transformation matrix T_LFDA

x̃_i^tr ← 7th nearest neighbor of x_i^tr among {x_{i'}^tr}_{i'=1}^{n_tr}, for i = 1, 2, ..., n_tr;
x̃_j^te ← 7th nearest neighbor of x_j^te among {x_{j'}^te}_{j'=1}^{n_te}, for j = 1, 2, ..., n_te;
τ_i^tr ← ‖x_i^tr − x̃_i^tr‖, for i = 1, 2, ..., n_tr;
τ_j^te ← ‖x_j^te − x̃_j^te‖, for j = 1, 2, ..., n_te;
A^tr_{i,i'} ← exp(−‖x_i^tr − x_{i'}^tr‖² / (τ_i^tr τ_{i'}^tr)), for i, i' = 1, 2, ..., n_tr;
A^te_{j,j'} ← exp(−‖x_j^te − x_{j'}^te‖² / (τ_j^te τ_{j'}^te)), for j, j' = 1, 2, ..., n_te;
X^tr ← (x_1^tr | x_2^tr | ... | x_{n_tr}^tr);
X^te ← (x_1^te | x_2^te | ... | x_{n_te}^te);
G^tr ← X^tr diag(A^tr 1_{n_tr}) X^tr^T − X^tr A^tr X^tr^T;
G^te ← X^te diag(A^te 1_{n_te}) X^te^T − X^te A^te X^te^T;
S^lw ← (1/n_tr) G^tr + (1/n_te) G^te;
n ← n_tr + n_te;
S^lb ← (1/n − 1/n_tr) G^tr + (1/n − 1/n_te) G^te + (n_te/n) X^tr X^tr^T + (n_tr/n) X^te X^te^T
       − (1/n) X^tr 1_{n_tr} (X^te 1_{n_te})^T − (1/n) X^te 1_{n_te} (X^tr 1_{n_tr})^T;
{γ_l, ψ_l}_{l=1}^{d} ← generalized eigenvalues and eigenvectors of S^lb ψ = γ S^lw ψ;  % γ_1 ≥ γ_2 ≥ ... ≥ γ_d
{ψ̂_l}_{l=1}^{m} ← orthonormal basis of {ψ_l}_{l=1}^{m};  % span({ψ̂_l}_{l=1}^{m'}) = span({ψ_l}_{l=1}^{m'}) for m' = 1, 2, ..., m
T_LFDA ← (ψ̂_1 | ψ̂_2 | ... | ψ̂_m)^T;

Figure 5.2
Pseudo code of LFDA. 1_n denotes the n-dimensional vector with all ones, and diag(b) denotes the diagonal matrix with diagonal elements specified by a vector b.
Recalling that (1/n − 1/n_+) and (1/n − 1/n_−) included in the definition of W^lb are negative values, the definitions of S^lb and S^lw imply that LFDA seeks a transformation matrix T such that nearby data pairs in the same class are made close and data pairs in different classes are made apart; far-apart data pairs in the same class are not forced to be close.

By the localization effect brought about by the introduction of the affinity matrix, LFDA can overcome the weakness of the original FDA against clustered data and outliers. When A_{k,k'} = 1 for all k, k' (i.e., no locality), S^lw and S^lb are reduced to S^w and S^b. Thus, LFDA can be regarded as a natural localized variant of FDA. The between-class scatter matrix S^b in the original FDA had only rank 1, whereas its local counterpart S^lb in LFDA usually has full rank with no multiplicity in eigenvalues (given n ≥ d). Therefore, LFDA can be applied to dimensionality reduction into spaces of any dimension, which is a significant advantage over the original FDA.
A solution T_LFDA can be computed in the same way as the original FDA. Namely, the LFDA solution is given as

T_LFDA = (t_1 | t_2 | ... | t_m)^T,

where {t_l}_{l=1}^{d} are the generalized eigenvectors associated with the generalized eigenvalues η_1 ≥ η_2 ≥ ... ≥ η_d of the following generalized eigenvalue problem:

S^lb t = η S^lw t.   (5.4)

Since the LFDA solution can be computed in the same way as the original FDA solution, LFDA is computationally as efficient as the original FDA. A pseudo code of LFDA is summarized in figure 5.2.
A MATLAB® implementation of LFDA is available from http://sugiyama-www.cs.titech.ac.jp/~sugi/software/LFDA/.
5.4 Using LFDA for Finding Hetero-Distributional Subspace
Finally, we show how to obtain an estimate of the transformation matrix U needed in the density-ratio estimation procedure (see section 5.3) from the LFDA transformation matrix T_LFDA.
First, an orthonormal basis {t̂_l}_{l=1}^{m} of the LFDA subspace is computed from the generalized eigenvectors {t_l}_{l=1}^{m} so that the span of {t̂_l}_{l=1}^{m'} agrees with the span of {t_l}_{l=1}^{m'} for all m' (1 ≤ m' ≤ m). This can be carried out in a straightforward way, for instance, by Gram-Schmidt orthonormalization (see, e.g., [5]). Then an estimate Û is given as

Û = (t̂_1 | t̂_2 | ... | t̂_m)^T,

and the samples are transformed as

û_i^tr := Û x_i^tr, for i = 1, 2, ..., n_tr,
û_j^te := Û x_j^te, for j = 1, 2, ..., n_te.
The above expression of Û implies another useful advantage of LFDA. In density-ratio estimation, one needs the LFDA solution for each reduced dimensionality m = 1, 2, ..., d (see section 5.5). However, we do not actually have to compute the LFDA solution for each m, but only to solve the generalized eigenvalue problem (equation 5.4) once for m = d and compute the orthonormal basis {t̂_l}_{l=1}^{d}; the solution for m < d can be obtained by simply taking the first m basis vectors {t̂_l}_{l=1}^{m}.
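Gram-Schmidt orthonormalization with this nested-span property is exactly what a QR decomposition provides. The sketch below is our own phrasing of the trick: orthonormalize all d eigenvectors once, and the first m rows then serve as Û for every reduced dimensionality.

```python
import numpy as np

def nested_orthonormal_rows(T_cols):
    """Columns of T_cols are the generalized eigenvectors t_1, ..., t_d.
    Returns orthonormal vectors (as rows) such that the span of the first m'
    of them matches span(t_1, ..., t_m') for every m'."""
    Q, R = np.linalg.qr(T_cols)
    Q = Q * np.sign(np.diag(R))        # match the classical Gram-Schmidt signs
    return Q.T                         # row l is the l-th orthonormal vector

rng = np.random.default_rng(1)
T_cols = rng.normal(size=(5, 5))       # stand-in for the d = 5 eigenvectors
U_hat = nested_orthonormal_rows(T_cols)
U_2 = U_hat[:2, :]                     # the m = 2 solution comes for free
```

One eigendecomposition plus one QR thus covers all candidate dimensionalities in the model selection loop of section 5.5.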
5.5 Density-Ratio Estimation in the Hetero-Distributional Subspace
Given that the hetero-distributional subspace has been successfully identified by the above procedure, the next step is to estimate the density ratio within the subspace. Since the direct importance estimator unconstrained least-squares importance fitting (uLSIF; [88]) explained in section 4.6 was shown to be accurate and computationally very efficient, it would be advantageous to combine LFDA with uLSIF.
So far, we have explained how the dimensionality reduction idea can be incorporated into density-ratio estimation when the dimension m of the hetero-distributional subspace is known in advance. Here we address how the dimension m is estimated from samples, which results in a practical procedure.

For dimensionality selection, the CV score of the uLSIF algorithm can be utilized. In particular, uLSIF allows one to compute the leave-one-out CV (LOOCV) score analytically (see section 4.6.2):

LOOCV = (1/ñ) Σ_{i=1}^{ñ} [ (1/2) (ŵ_i(u_i^tr))² − ŵ_i(u_i^te) ],

where ñ = min(n_tr, n_te) and ŵ_i(u) is a density-ratio estimate obtained without u_i^tr and u_i^te. Thus, the above LOOCV score is computed as a function of m, and the one that minimizes the LOOCV score is chosen.
The pseudo code of the entire algorithm is summarized in figure 5.3.
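To make the procedure concrete, here is a minimal sketch of the uLSIF estimator itself: Gaussian basis functions centered on test points, a ridge-regularized linear system, and negative coefficients rounded up to zero. The function name and defaults are our own choices, and the analytic LOOCV score is omitted for brevity; this is an illustration, not the book's exact implementation.

```python
import numpy as np

def ulsif_fit(u_tr, u_te, centers, sigma, lam):
    """uLSIF: model w(u) = sum_l alpha_l exp(-||u - c_l||^2 / (2 sigma^2)) and
    solve (H + lam I) alpha = h, where H averages k(u) k(u)^T over training
    points and h averages k(u) over test points."""
    def design(U):
        d2 = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    K_tr, K_te = design(u_tr), design(u_te)
    H = K_tr.T @ K_tr / len(u_tr)
    h = K_te.mean(axis=0)
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    alpha = np.maximum(alpha, 0.0)          # uLSIF's non-negativity rounding
    return lambda U: design(np.atleast_2d(U)) @ alpha

rng = np.random.default_rng(0)
u_tr = rng.normal(0.0, 1.0, size=(500, 1))  # here p_tr = p_te, so w should be ~1
u_te = rng.normal(0.0, 1.0, size=(500, 1))
w_hat = ulsif_fit(u_tr, u_te, centers=u_te[:50], sigma=1.0, lam=0.1)
```

Running this fit inside the (m, σ, λ) loop of figure 5.3 yields the full D3 procedure.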
5.6 Numerical Examples
In this section, we illustrate how the D3 algorithm behaves.
5.6.1 Illustrative Example
Let the input domain be R² (i.e., d = 2), and let the denominator and numerator densities p_tr(x) and p_te(x) be Gaussian densities that differ only in the first input element (see figure 5.4).
Input: two sets of samples {x_i^tr}_{i=1}^{n_tr} and {x_j^te}_{j=1}^{n_te} on R^d
Output: density-ratio estimate ŵ(x)

Obtain the orthonormal basis {ψ̂_l}_{l=1}^{d} using LFDA with {x_i^tr}_{i=1}^{n_tr} and {x_j^te}_{j=1}^{n_te};
For each reduced dimension m = 1, 2, ..., d
    Form the projection matrix: U_m = (ψ̂_1 | ψ̂_2 | ... | ψ̂_m)^T;
    Project the samples: {u_{i,m}^tr | u_{i,m}^tr = U_m x_i^tr}_{i=1}^{n_tr} and {u_{j,m}^te | u_{j,m}^te = U_m x_j^te}_{j=1}^{n_te};
    For each candidate of the Gaussian width σ
        For each candidate of the regularization parameter λ
            Compute the LOOCV score LOOCV(m, σ, λ) using {u_{i,m}^tr}_{i=1}^{n_tr} and {u_{j,m}^te}_{j=1}^{n_te};
        end
    end
end
Choose the best model: (m̂, σ̂, λ̂) ← argmin_{(m,σ,λ)} LOOCV(m, σ, λ);
Estimate the density ratio from {u_{i,m̂}^tr}_{i=1}^{n_tr} and {u_{j,m̂}^te}_{j=1}^{n_te} using uLSIF with (σ̂, λ̂);

Figure 5.3
Pseudo code of direct density-ratio estimation with dimensionality reduction (D3).
Here, N(x; μ, Σ) denotes the multivariate Gaussian density with mean μ and covariance matrix Σ. The profiles of the above densities and their ratio are illustrated in figures 5.4 and 5.7a, respectively. We sample n_tr = 100 points from p_tr(x) and n_te = 100 points from p_te(x); the samples are illustrated in figure 5.5. In this data set, the distributions are different only in the one-dimensional subspace spanned by (1, 0)^T; that is, the true dimensionality of the hetero-distributional subspace is m = 1. The true hetero-distributional subspace is depicted by the solid line in figure 5.5.

The dotted line in figure 5.5 depicts the hetero-distributional subspace estimated by LFDA with reduced dimensionality 1; when the reduced dimensionality is 2, LFDA gives the entire space. This shows that for reduced dimensionality 1, LFDA gives a very good estimate of the true hetero-distributional subspace.

Next, we choose the reduced dimensionality m as well as the Gaussian width σ and the regularization parameter λ in uLSIF. Figure 5.6 depicts the LOOCV
Figure 5.4
(a) Profile of p_tr(x). (b) Profile of p_te(x).
Two-dimensional toy data set.

Figure 5.5
Samples and the hetero-distributional subspace of the two-dimensional toy data set. The LFDA estimate of the hetero-distributional subspace is spanned by (1.00, 0.01)^T, which is very close to the true hetero-distributional subspace spanned by (1, 0)^T.
score of uLSIF, showing that a model with reduced dimensionality m̂ = 1 is the minimizer.
Finally, the density ratio is estimated by uLSIF. Figure 5.7 depicts the true
density ratio, its estimate by uLSIF without dimensionality reduction, and its
estimate by uLSIF with dimensionality reduction by LFDA. For uLSIF without
Figure 5.6
LOOCV score of uLSIF for the two-dimensional toy data set.

Figure 5.7
(a) True density ratio. (b) Density-ratio estimation without dimensionality reduction (NMSE = 1.52 × 10^{−5}). (c) Density-ratio estimation with dimensionality reduction (NMSE = 0.89 × 10^{−5}).
True and estimated density-ratio functions. x̃(1) and x̃(2) in (c) denote the LFDA solution, that is, x̃(1) = 1.00x(1) + 0.01x(2) and x̃(2) = −0.01x(1) + 1.00x(2), respectively.
dimensionality reduction, (σ̂, λ̂) = (1, 10^{−0.5}) is chosen by LOOCV (see figure 5.6 with m = 2). This shows that when dimensionality reduction is not performed, the independence between the density ratio w(x) and the second input element x(2) (figure 5.7a) is not incorporated, and the estimated density ratio has a Gaussian-tail structure along x(2) (figure 5.7b). On the other hand, when dimensionality reduction is carried out, the independence between the density-ratio function w(x) and the second input element x(2) can be successfully captured, and consequently a more accurate estimator is obtained (figure 5.7c).
The accuracy of an estimated density ratio is measured by the normalized mean-squared error (NMSE):

NMSE = (1/n_te) Σ_{j=1}^{n_te} ( ŵ(x_j^te) / Σ_{j'=1}^{n_te} ŵ(x_{j'}^te) − w(x_j^te) / Σ_{j'=1}^{n_te} w(x_{j'}^te) )².   (5.5)

By dimensionality reduction, NMSE is reduced from 1.52 × 10^{−5} to 0.89 × 10^{−5}. Thus, we gain a 41.5 percent reduction in NMSE.
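A direct transcription of this error measure, under the convention above that both the estimate and the true ratio are normalized over the test points:

```python
import numpy as np

def nmse(w_hat, w_true):
    """Normalized mean-squared error between a density-ratio estimate and the
    true ratio, both evaluated on the test points and normalized to sum to one."""
    p = w_hat / w_hat.sum()
    q = w_true / w_true.sum()
    return np.mean((p - q) ** 2)
```

Because of the normalization, NMSE ignores the overall scale of the estimate: rescaling ŵ by any positive constant leaves the score unchanged.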
5.6.2 Performance Comparison Using Artificial Data Sets
Here, we investigate the performance of the D3 algorithm using six artificial data sets. The input domain of the data sets is d-dimensional (d ≥ 2), and the true dimensionality of the hetero-distributional subspace is m = 1 or 2. The homo-distributional component of the data sets is the (d − m)-dimensional Gaussian distribution with mean zero and covariance identity. The hetero-distributional component of each data set is given as follows:
(a) Data set 1 (shifting, m = 1): the test distribution is a mean-shifted version of the training distribution.

(b) Data set 2 (shrinking, m = 2):

u^tr ~ N([0; 0], [1, 0; 0, 1]),
u^te ~ N([0; 0], [1/4, 0; 0, 1/4]).

(c) Data set 3 (magnifying, m = 2):

u^tr ~ N([0; 0], [1/4, 0; 0, 1/4]),
u^te ~ N([0; 0], [1, 0; 0, 1]).

(d) Data set 4 (rotating, m = 2): the test distribution is a rotated version of an anisotropic training Gaussian.

(e) Data set 5 (one-dimensional splitting, m = 1): the training distribution is a single Gaussian, while the test distribution is an equal-weight mixture of two Gaussian components with shrunken covariance.

(f) Data set 6 (two-dimensional splitting, m = 2): the training distribution is a single Gaussian, while the test distribution is an equal-weight mixture of four Gaussian components with shrunken covariance.
The number of samples is set to n_tr = 200 and n_te = 1000 for all the data sets. Examples of realized samples are illustrated in figure 5.8. For each dimensionality d = 2, 3, ..., 10, the density ratio is estimated using uLSIF with/without dimensionality reduction. This experiment is repeated 100 times for each d with different random seeds.
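As an illustration of the experimental setup, the following sketch generates data in the style of data set 2 (shrinking), embedding an m = 2 hetero-distributional component in the first two coordinates; placing the subspace on the coordinate axes is our own simplification (in general the hetero-distributional subspace need not be axis-aligned).

```python
import numpy as np

def make_shrinking_dataset(d=5, n_tr=200, n_te=1000, seed=0):
    """Data in the style of data set 2: the first m = 2 coordinates shrink at
    test time (covariance I -> I/4); the remaining d - m homo-distributional
    coordinates are standard normal under both distributions."""
    rng = np.random.default_rng(seed)
    m = 2
    u_tr = rng.normal(0.0, 1.0, size=(n_tr, m))   # u^tr ~ N(0, I_2)
    u_te = rng.normal(0.0, 0.5, size=(n_te, m))   # u^te ~ N(0, I_2 / 4)
    v_tr = rng.normal(0.0, 1.0, size=(n_tr, d - m))
    v_te = rng.normal(0.0, 1.0, size=(n_te, d - m))
    return np.hstack([u_tr, v_tr]), np.hstack([u_te, v_te])

x_tr, x_te = make_shrinking_dataset()
```

The sample sizes follow the text (n_tr = 200, n_te = 1000); the choice d = 5 is arbitrary within the range d = 2, ..., 10 used in the experiments.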
Figure 5.9 depicts the choice of the dimensionality of the hetero-distributional subspace by LOOCV for each d. This shows that for data sets 1, 2, 4, and
Figure 5.8
(a) Data set 1 (m = 1). (b) Data set 2 (m = 2). (c) Data set 3 (m = 2). (d) Data set 4 (m = 2). (e) Data set 5 (m = 1). (f) Data set 6 (m = 2).
Artificial data sets.

Figure 5.9
(a) Data set 1 (m = 1). (b) Data set 2 (m = 2). (c) Data set 3 (m = 2). (d) Data set 4 (m = 2). (e) Data set 5 (m = 1). (f) Data set 6 (m = 2).
Dimension choice of the hetero-distributional subspace by LOOCV over 100 runs.
5, dimensionality choice by LOOCV works well. For data set 3, m̂ = 1 is always chosen, although the true dimensionality is m = 2. For data set 6, the dimensionality choice is rather unstable, but it still works reasonably well.
Figure 5.10 depicts the value of NMSE (see equation 5.5) averaged over 100 trials. For each d, the t-test (see, e.g., [78]) at the significance level 5 percent is performed, and the best method as well as the comparable method in terms of mean NMSE are indicated by '×'. (In other words, the method without the symbol '×' is significantly worse than the other method.) This shows that the mean NMSE of the baseline method (no dimensionality reduction) tends to grow rapidly as the dimensionality d increases. On the other hand, the increase of the mean NMSE of the D3 algorithm is much smaller than that of the baseline method. Consequently, the mean NMSE of D3 is much smaller than that of the baseline method when the input dimensionality d is large. The difference in mean NMSE is statistically significant for data sets 1, 2, 5, and 6.
The above experiments show that the dimensionality reduction scheme is
useful in high-dimensional density-ratio estimation.
5.7 Summary
The direct importance estimators explained in chapter 4 tend to perform better than naively taking the ratio of two density estimators. However, in high-dimensional problems, there is still room for further improvement in terms of estimation accuracy. In this chapter, we have explained how dimensionality reduction can be incorporated into direct importance estimation. The basic idea was to perform importance estimation only in a subspace where the two densities are significantly different. We called this framework direct density-ratio estimation with dimensionality reduction (D3).
Finding such a subspace, which we called the hetero-distributional subspace, is a key challenge in this context. We explained that the hetero-distributional subspace can be identified by finding a subspace in which training and test samples are maximally separated. We showed that the accuracy of the unconstrained least-squares importance fitting (uLSIF; [88]) algorithm can be improved by combining uLSIF with a supervised dimensionality reduction method called local Fisher discriminant analysis (LFDA; [154]).
We chose LFDA because it was shown to be superior in terms of both accuracy and computational efficiency. On the other hand, supervised dimensionality reduction is one of the most active research topics, and better methods will be developed in the future. The framework of D3 allows one to use any supervised dimensionality reduction method for importance estimation. Thus, if better methods of supervised dimensionality reduction are developed in the
Figure 5.10
(a) Data set 1. (b) Data set 2. (c) Data set 3. (d) Data set 4. (e) Data set 5. (f) Data set 6. Legend: "No dimensionality reduction" versus "D3".
Mean NMSE of the estimated density-ratio functions over 100 runs. For each d, the t-test at the significance level 5 percent is performed, and the best method as well as the comparable method in terms of mean NMSE are indicated by '×'.
future, the new dimensionality reduction methods can be incorporated into the direct importance estimation framework.

The D3 formulation explained in this chapter assumes that the components inside and outside the hetero-distributional subspace are statistically independent. A possible generalization of this framework is to weaken this condition, for example, by following the line of [173].

We focused on a linear hetero-distributional subspace. Another possible generalization of D3 would be to consider a nonlinear hetero-distributional manifold. Using a kernelized version of LFDA [154] is one possibility. Another possibility is to use a mixture of probabilistic principal component analyzers in importance estimation [205].
6
Relation to Sample Selection Bias
One of the most famous works on learning under a changing environment is Heckman's method for coping with sample selection bias [76, 77]. Sample selection bias has been proposed and extensively studied in econometrics and sociology, and Heckman received the Nobel Prize in economics in 2000.

Sample selection bias refers to the situation where the training data set consists of nonrandomly selected (i.e., biased) samples. Data samples collected through Internet surveys typically suffer from sample selection bias: samples corresponding to those who do not have access to the Internet are completely missing. Since conservative people, such as the elderly, are underrepresented in Internet surveys, rather progressive conclusions tend to be drawn.
Heckman introduced a sample selection model in his seminal papers [76, 77], which characterized the selection bias based on parametric assumptions. He then proposed a two-step procedure for correcting sample selection bias. In this chapter, we briefly explain Heckman's model and his two-step procedure, and discuss their relation to the covariate shift approach.
6.1 Heckman's Sample Selection Model
In order to deal with sample selection bias, Heckman considered a linear regression model specifying the selection process, in addition to the regression model that characterizes the target behavioral relationship for the entire population. The sign of the latent response variable of the selection model determines whether each sample is observed or not. The key assumption in this model is that the error terms of the two regression models are correlated with one another, causing nonrandom sample selection (see figure 6.1).
Figure 6.1
A numerical example of sample selection bias. This data set was generated according to Heckman's model. The circles denote observed samples, and the crosses are missing samples. Because of the large positive correlation between the error terms (ρ = 0.9), we can see that only the samples with higher responses y are observed when the covariate x is small. Thus, the OLS estimator (solid line) computed from the observed samples is significantly biased from the true populational regressor (dotted line).
Let Y_1^* be the target quantity on which we want to make inference. Heckman assumed a linear regression model

Y_1^* = x^T β + U_1,   (6.1)

where x is a covariate and U_1 is an error term. However, Y_1^* is not always observed. Samples are selected depending on another random variable Y_2^*; the variable Y_1^* is observable only if Y_2^* > 0. Let Y be the observed response variable, that is,

Y = Y_1^* if Y_2^* > 0, and Y = missing value otherwise.   (6.2)

The variable Y_2^* obeys another linear regression model,

Y_2^* = z^T γ + U_2,   (6.3)
where z is a covariate and U_2 is an error term. Note that the covariates x in equation 6.1 and z in equation 6.3 can contain common components.

In Heckman's papers [76, 77], the error terms U_1 and U_2 are assumed to be jointly bivariate-normally distributed with mean zero and covariance matrix Σ, conditioned on x and z. Since only the sign of the latent variable Y_2^* matters later, we can fix the variance of U_2 at 1 without loss of generality. We denote the variance of U_1 by σ² and the correlation between U_1 and U_2 by ρ, so that

Σ = [σ², ρσ; ρσ, 1].
If there is no correlation between U_1 and U_2 (ρ = 0), the selection process (equation 6.3) is independent of the regression model (equation 6.1) of interest. This situation is called missing at random (see [104]). On the other hand, if the error terms are correlated (ρ ≠ 0), the selection process makes the distribution of Y different from that of Y_1^*. For example, the expectation of the observation Y has some bias from that of Y_1^* (i.e., x^T β).
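This bias is easy to reproduce in a small simulation, in the spirit of figure 6.1. The parameter values below (β, γ, ρ = 0.9, σ) are illustrative choices of ours, not taken from the book:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, sigma = 50_000, 0.9, 1.0
x = rng.uniform(0.0, 1.0, n)
# Correlated Gaussian errors: Var(U1) = sigma^2, Var(U2) = 1, Corr(U1, U2) = rho.
e = rng.multivariate_normal([0.0, 0.0],
                            [[sigma**2, rho*sigma], [rho*sigma, 1.0]], n)
y1_star = 1.0 + 3.0 * x + e[:, 0]    # outcome equation (6.1), beta = (1, 3)
y2_star = -1.0 + 2.0 * x + e[:, 1]   # selection equation (6.3), gamma = (-1, 2)
observed = y2_star > 0               # Y is seen only when Y2* > 0

# OLS on the observed subsample is biased away from the population coefficients.
X = np.column_stack([np.ones(observed.sum()), x[observed]])
beta_ols = np.linalg.lstsq(X, y1_star[observed], rcond=None)[0]
print(beta_ols)  # slope noticeably below 3: selection favors high errors at small x
```

With ρ > 0, only high-U_1 samples survive selection where the selection probability is low (small x), which flattens the fitted line, exactly as in figure 6.1.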
Figure 6.2 illustrates the sample selection bias for different selection probabilities when the correlation ρ is nonzero. If the mean of Y_2^*, namely z^T γ, is large in the positive direction, most of the samples are observed and the selection bias

Figure 6.2
(a) z^T γ ≫ 0. (b) z^T γ ≪ 0.
Sample selection biases for different selection probabilities in Heckman's model (when the correlation ρ is nonzero). The symbol ○ denotes the mean of Y_1^* for all samples, while the symbol × denotes the mean of Y_1^* only for Y_2^* > 0. (a) If z^T γ is large in the positive direction, most of the samples are observed and the selection bias is small. (b) If z^T γ is large in the negative direction, only a small portion of the samples can be observed and the sample selection bias becomes large.

is small (see figure 6.2a). On the other hand, if z^T γ is large in the negative direction, only a small portion of the samples can be observed and the sample selection bias becomes large (see figure 6.2b).
As illustrated in figure 6.3, the bias becomes positive in positively correlated cases (ρ > 0) and negative in negatively correlated cases (ρ < 0). Indeed, the conditional expectation of the observation Y, given x and z, can be expressed as

E[Y | x, z] = E[Y_1^* | x, z, Y_2^* > 0]
            = x^T β + E[U_1 | U_2 > −z^T γ],   (6.4)

where the second term is the selection bias from the conditional expectation E[Y_1^* | x] = x^T β of the latent variable Y_1^*. Heckman calculated the second term in equation 6.4 explicitly, as explained below.
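For jointly Gaussian errors, the second term of equation 6.4 has the well-known closed form ρσ φ(z^T γ)/Φ(z^T γ) (the inverse Mills ratio scaled by ρσ). The sketch below checks this identity by Monte Carlo, with illustrative parameter values of our own:

```python
import numpy as np
from scipy.stats import norm

rho, sigma, zg = 0.7, 2.0, -0.5           # zg stands in for z^T gamma
closed_form = rho * sigma * norm.pdf(zg) / norm.cdf(zg)

rng = np.random.default_rng(0)
u = rng.multivariate_normal([0.0, 0.0],
                            [[sigma**2, rho*sigma], [rho*sigma, 1.0]], 1_000_000)
mc = u[u[:, 1] > -zg, 0].mean()           # E[U1 | U2 > -z^T gamma]
print(closed_form, mc)                     # the two agree closely
```

The identity follows from the decomposition U_1 = ρσ U_2 + (independent noise), so that E[U_1 | U_2 > c] = ρσ E[U_2 | U_2 > c], the truncated-Gaussian mean.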
Figure 6.3
(a) No correlation (ρ = 0). (b) Positive correlation (ρ > 0). (c) Negative correlation (ρ < 0).
Sample selection bias corresponding to no, positive, and negative correlation ρ. When ρ ≠ 0, the conditional mean of the observed area (Y_2^* ≥ 0) differs from the unconditional average x^T β. In Heckman's model, this gap (which is the sample selection bias) can be calculated explicitly.
6.2 Distributional Change and Sample Selection Bias
As defined in equation 6.1, the conditional density of the latent variable Y_1^*, given x, is Gaussian:

p^*(y | x; β, σ) = (1/σ) φ((y − x^T β)/σ),   (6.5)

where φ denotes the density function of the standard Gaussian distribution N(0, 1). Thanks to the Gaussian assumption on the error terms U_1 and U_2, we can also characterize the conditional distribution of the observation Y explicitly.
Let D be an indicator variable taking the value 1 if Y is observed, and 0 if not, that is,

D = I(Y2* > 0),

where I(·) is the indicator function. The assumption is that in the selection process (equation 6.3), we cannot observe the value of Y2*; we can observe only D (i.e., the sign of Y2*). The conditional probability of D = 1 can be expressed as

P(D = 1 | x, z; γ) = P(Y2* > 0 | x, z; γ) = Φ(z^T γ),    (6.6)
where Φ is the cumulative distribution function of the standard Gaussian (i.e., the error function). This binary-response distribution is called the probit model in statistics.
By the definition of the indicator D and the selection process (equation 6.3), the conditional distribution of the observation Y given D = 1, x, and z is rewritten as

P_D(y | x, z) = P(Y ≤ y | D = 1, x, z)
             = P(Y ≤ y, D = 1 | x, z) / P(D = 1 | x, z)
             = P(Y1* ≤ y, Y2* > 0 | x, z) / P(D = 1 | x, z).    (6.7)

From equation 6.6, the denominator of equation 6.7 is expressed by the error function Φ.
Figure 6.4
In order to compute the distribution function of the observable Y in equation 6.7, the shaded area
of the bivariate Gaussian distribution should be integrated.
In order to calculate the numerator of equation 6.7, we need to integrate the correlated Gaussian distribution over the region illustrated in figure 6.4 for each fixed y. To this end, the error term U1 is decomposed into two terms:

U1 = σ√(1 − ρ²) Ū1 + ρσ U2,

where Ū1 is independent of U2. Although the resulting formula still contains a single integral, by taking the partial derivative with respect to y, the conditional density p_D can be expressed using the density φ of the Gaussian and the error function Φ as shown below:
p_D(y | x, z; β, γ, ρ, σ) = ∂/∂y P_D(y | x, z)
 = 1/(σ Φ(z^T γ)) · φ((y − x^T β)/σ) · Φ( (ρ(y − x^T β)/σ + z^T γ) / √(1 − ρ²) ).    (6.8)

Note that if the selection process (equation 6.3) is independent of Y1* (ρ = 0), this conditional density reduces to that of Y1* (see equation 6.5). The detailed derivation can be found in [19].
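As a numerical sanity check on equation 6.8, the density below (a direct transcription, evaluated at arbitrary illustrative parameter values) should integrate to one and collapse to the plain Gaussian of equation 6.5 when ρ = 0:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def p_obs(y, xb, zg, rho, sigma):
    """Conditional density p_D(y | x, z) of equation 6.8 (xb = x^T beta, zg = z^T gamma)."""
    u = (y - xb) / sigma
    return (norm.pdf(u) / sigma
            * norm.cdf((rho * u + zg) / np.sqrt(1 - rho**2))
            / norm.cdf(zg))

# integrates to one over the real line (numerically, over a wide interval)
total, _ = quad(lambda y: p_obs(y, 0.0, -1.0, 0.5, 1.0), -12, 12)
print(total)
```

With ρ = 0 the Φ factor in the numerator cancels Φ(z^T γ) in the denominator exactly, recovering (1/σ)φ((y − x^T β)/σ).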
Figure 6.5 shows the conditional density function (equation 6.8) for different sample selection thresholds when ρ = 0.5 and ρ = −0.5. Here, we took the standard Gaussian N(0, 1) for the distribution of Y1*, that is, the other parameters are set to x^T β = 0 and σ = 1. The smaller the mean z^T γ of Y2* is, the larger the shift is from the unconditional distribution. In other words, we suffer
Figure 6.5
Conditional density functions for ρ = 0.5 and ρ = −0.5 (x^T β = 0 and σ = 1), denoted by the solid lines. The selection thresholds are set to z^T γ = −3, −2, −1, 0, 1 from closest to farthest. The dashed line denotes the Gaussian density when ρ = 0.
from larger selection bias when the threshold is relatively higher, and therefore only a small portion of the entire population is observed (see figure 6.2).
The conditional expectation and variance of Y can be calculated from the moment generating function. As in the previous calculation, the resulting formula contains a single integral, but its derivative at the origin can be written explicitly with the Gaussian density φ and the error function Φ as follows:

E[Y | D = 1, x, z] = x^T β + ρσ λ(−z^T γ),
V[Y | D = 1, x, z] = σ² {1 − ρ² λ(−z^T γ) [λ(−z^T γ) + z^T γ]}.    (6.9)
The function λ(t) = φ(t)/Φ(−t) is often called the inverse Mills ratio, and is also known as the hazard ratio in survival data analysis. As plotted in figure 6.6, it is a monotonically increasing function of t, and can be approximated by a linear function for a wide range of its argument.
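The inverse Mills ratio is straightforward to evaluate numerically; a minimal sketch:

```python
import numpy as np
from scipy.stats import norm

def inverse_mills(t):
    """lambda(t) = phi(t) / Phi(-t): the inverse Mills ratio (hazard ratio)."""
    return norm.pdf(t) / norm.cdf(-t)

t = np.linspace(-3.0, 3.0, 601)
lam = inverse_mills(t)
# lam is monotonically increasing, and roughly linear (slope approaching 1) for large t
```

For very large t, computing the ratio via norm.logpdf(t) − norm.logsf(t) and exponentiating avoids floating-point underflow in the tail.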
6.3 The Two-Step Algorithm
Based on the theoretical development in the previous section, Heckman [76, 77] treated sample selection bias as an ordinary model specification error or "omitted variable" bias. More specifically, he reformulated the linear regression
model (equation 6.1) as

Y = E[Y | D = 1, x, z] + V
  = x^T β + ρσ λ(−z^T γ) + V,    (6.10)

where V has mean 0 conditioned on being observable, that is, E[V | D = 1, x, z] = 0.

Figure 6.6
Hazard ratio λ(t).
If the hazard ratio term were available or, equivalently, the parameter γ of the selection process (equation 6.3) were known, it would become just a linear regression problem with the extended covariates [x, λ(−z^T γ)], and then ordinary least squares (OLS) would give a consistent estimator of the regression parameters β and ρσ. However, we need to estimate the parameter γ from the data as well.
In [77], Heckman considers a situation where the covariate z in the selection process can be observed in any case, that is, regardless of whether Y1* is observed or not. Then, it is possible to estimate the parameter γ from the input-output pairs of (z, D) by probit analysis [114], which is the first step of Heckman's algorithm. More specifically, the parameter γ is determined as

γ̂ := argmax_γ [ Σ_{i=1}^{n0} log Φ(z_i^T γ) + Σ_{i=n0+1}^{n} log {1 − Φ(z_i^T γ)} ].    (6.11)
Input: observed samples {x_i, z_i, y_i}_{i=1}^{n0} and missing samples {z_i}_{i=n0+1}^{n}
Output: parameter estimates (β̂, γ̂, σ̂, ρ̂)

1. Fit the selection model (i.e., estimate γ) by probit analysis (equation 6.11) from all input-output pairs {z_i, 1}_{i=1}^{n0} and {z_i, 0}_{i=n0+1}^{n}.
2. Calculate the estimates of the hazard ratio by plugging in γ̂ for the observed samples:
   λ̂_i = λ(−z_i^T γ̂),   i = 1, ..., n0.
3. Estimate the parameters β and ρσ in the modified linear regression model (equation 6.10) by ordinary least squares (OLS) based on the extended covariates {x_i, λ̂_i}_{i=1}^{n0} and the observed responses {y_i}_{i=1}^{n0}. The estimators β̂ and (ρσ)^ are consistent estimators of β and ρσ, respectively.
4. Estimate the parameters σ² and ρ from the residuals, using the conditional variance (equation 6.9); in particular,
   ρ̂ = (ρσ)^ / σ̂.

Figure 6.7
Pseudo code of Heckman's two-step algorithm. Step 4 is not necessary if we are interested only in prediction.
Since the probit estimator γ̂ is consistent, the plug-in hazard ratio λ(−z^T γ̂) can be used as the extra covariate in the second OLS step. This is the core idea of Heckman's two-step algorithm. A pseudo code of Heckman's two-step procedure is summarized in figure 6.7.
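The two-step procedure of figure 6.7 can be sketched in a few lines of Python (a simplified illustration, not the book's code; the probit step uses a generic quasi-Newton optimizer, and step 4 is omitted since it is unnecessary for prediction):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def heckman_two_step(x_obs, y_obs, z_obs, z_mis):
    """x_obs, y_obs: covariates/responses of the n0 observed samples;
    z_obs, z_mis: selection covariates of observed and missing samples."""
    # Step 1: probit fit of the selection model (equation 6.11)
    z_all = np.vstack([z_obs, z_mis])
    d = np.r_[np.ones(len(z_obs)), np.zeros(len(z_mis))]
    def probit_nll(g):
        p = np.clip(norm.cdf(z_all @ g), 1e-12, 1 - 1e-12)
        return -np.sum(d * np.log(p) + (1 - d) * np.log(1 - p))
    gamma = minimize(probit_nll, np.zeros(z_all.shape[1])).x
    # Step 2: plug-in hazard ratios lambda(-z_i^T gamma) for the observed samples
    zg = z_obs @ gamma
    mills = norm.pdf(zg) / norm.cdf(zg)
    # Step 3: OLS on the extended covariates [x, lambda]; last coefficient ~ rho*sigma
    X = np.column_stack([x_obs, mills])
    coef, *_ = np.linalg.lstsq(X, y_obs, rcond=None)
    return gamma, coef[:-1], coef[-1]   # (gamma_hat, beta_hat, (rho*sigma)_hat)
```

On data simulated from the model, β̂ and the ρσ coefficient recover the true values; note that z here should contain a predictor absent from x, in line with the collinearity caveat discussed in the text.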
Heckman considered his estimator to be useful for providing good initial values for more efficient methods or for exploratory analysis. Nevertheless, it has become a standard way to obtain final results for the sample selection model. However, there are several criticisms of Heckman's two-step algorithm [126], such as statistical inefficiency, collinearity, and the restrictive Gaussian-noise assumption.
First, Heckman's estimator is statistically less efficient. Because the modified regression problem (equation 6.10) has heterogeneous variance, as in equation 6.9, generalized least squares (GLS) [41] may be used as an alternative to OLS; GLS minimizes the squared error weighted according to the inverse of the noise variance at each sample point. However, GLS is still not efficient; an efficient estimator can be obtained by the maximum likelihood estimator (MLE), which maximizes the following log-likelihood function:
log L(β, γ, ρ, σ) = Σ_{i=1}^{n0} log p_D(y_i | x_i, z_i; β, γ, ρ, σ) + Σ_{i=1}^{n0} log Φ(z_i^T γ)
                  + Σ_{i=n0+1}^{n} log {1 − Φ(z_i^T γ)},
where we assume that we are given n0 input-output pairs {x_i, z_i, y_i}_{i=1}^{n0} for observed samples and (n − n0) input-only samples (i.e., covariates) {z_i}_{i=n0+1}^{n} for missing samples. For computing the MLE solution, gradient ascent or quasi-Newton iterations are usually employed, and Heckman's two-step estimator can be used as an initial value since it is computationally very efficient and widely available in standard software toolboxes.
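The log-likelihood above is easy to code directly; a sketch follows (the tanh/exp reparameterization of (ρ, σ) is one convenient choice for unconstrained optimization, not part of the original formulation). The sum of log p_D(y_i | x_i, z_i) and log Φ(z_i^T γ) is computed jointly, since the Φ(z^T γ) factor cancels between the two terms:

```python
import numpy as np
from scipy.stats import norm

def neg_log_likelihood(params, x_obs, y_obs, z_obs, z_mis):
    """Negative log-likelihood of the sample selection model.
    params = (beta, gamma, atanh(rho), log(sigma)), concatenated."""
    k, m = x_obs.shape[1], z_obs.shape[1]
    beta, gamma = params[:k], params[k:k + m]
    rho, sigma = np.tanh(params[-2]), np.exp(params[-1])
    u = (y_obs - x_obs @ beta) / sigma
    zg = z_obs @ gamma
    # log p_D(y|x,z) + log Phi(z^T gamma): the Phi(z^T gamma) factors cancel
    ll_obs = (norm.logpdf(u) - np.log(sigma)
              + norm.logcdf((rho * u + zg) / np.sqrt(1 - rho**2)))
    ll_mis = norm.logcdf(-(z_mis @ gamma))   # log{1 - Phi(z^T gamma)}
    return -(ll_obs.sum() + ll_mis.sum())
```

Minimizing this with a quasi-Newton routine (e.g., scipy.optimize.minimize), initialized at the two-step estimate, yields the MLE.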
The second criticism is that collinearity problems (two variables being highly correlated with one another) can occur rather frequently. For example, if all the covariates in z are included in x, the extra covariate λ(−z^T γ) in the second step can be collinear with the other covariates in x, because the hazard ratio λ(t) is an approximately linear function over a wide range of its argument. In other words, in order to make Heckman's algorithm work in practice, we need variables in z which are good predictors of Y2* and do not appear in x. Unfortunately, it is often very difficult to find such variables in practice.
Finally, Heckman's procedure relies heavily on the Gaussian assumptions for the error terms in equations 6.1 and 6.3. Instead, some authors have proposed semiparametric or nonparametric procedures with milder distributional assumptions. See [119, 126] for details.
6.4 Relation to Covariate Shift Approach
In this section, we discuss similarities and differences between Heckman's approach and the covariate shift approach. For ease of comparison, we consider the case where the covariates z and x in the sample selection model coincide. Note that the distribution of the entire population in Heckman's formulation
corresponds to the test distribution in the covariate shift formulation, and that of observed samples corresponds to the training distribution:

    Covariate shift        Heckman's model
    p_te(x, y)   ⟺   p(x, y)
    p_tr(x, y)   ⟺   p(x, y | D = 1).
Although Heckman did not specify the probability distribution for the covariates x, we will proceed as if they were generated from a probability distribution.
One of the major differences is that in Heckman's model, the conditional distribution p(y | x, D = 1) of observed samples changes from its populational counterpart p(y | x) in addition to a change in covariate distributions. In this sense, the sample selection model deals with more general distribution changes; it is reduced to covariate shift (i.e., the conditional distribution does not change) only if the selection process is independent of the behavioral model (ρ = 0, and hence no selection bias). On the other hand, in Heckman's model the distributional change, including the selection bias, is computed under a very strong assumption of linear regression models with bivariate Gaussian error terms. When this assumption is not fulfilled, there is no guarantee that the selection bias can be captured reasonably. In machine learning applications, we rarely expect that statistical models at hand are correctly specified and noise is purely Gaussian. Thus, Heckman's model is too restrictive to be used in real-world machine learning applications.
Another major difference between Heckman's approach and the covariate shift approach is the way the sample selection bias is reduced. As explained in chapters 2 and 3, covariate shift adaptation is carried out via importance weighting, which substantially increases the estimation variance. On the other hand, Heckman's correction compensates directly for the selection bias in order to make the estimator consistent, which does not increase the estimation variance. However, the price we have to pay for the "free" bias reduction is that Heckman's procedure is not at all robust against model misspecification. The bias caused by model misspecification can be corrected by post-hoc bias subtraction, but this is a hard task in general (e.g., bootstrap bias estimates are not accurate; see [45]). On the other hand, in covariate shift adaptation, importance weighting is guaranteed to asymptotically minimize the bias for misspecified models.¹ Since the increase of variance caused by importance
1. Note that when the model is correctly specified, no importance weighting is needed (see
chapter 2).
weighting can be controlled by regularization with model selection, the covariate shift approach would be more useful for machine learning applications where correctly specified models are not available.
Finally, in covariate shift adaptation, the importance weight w(x) = p_te(x)/p_tr(x) is estimated in a nonparametric way over the multidimensional input space (see chapter 4). If we compute the importance weight between the
joint distributions² of (x, y) for Heckman's model, we get

w(x, y) = p(x, y) / p(x, y | D = 1)
        ∝ p(y | x) / [ p(y | x, D = 1) P(D = 1 | x) ]
        = {Φ( (ρ(y − x^T β)/σ + x^T γ) / √(1 − ρ²) )}^{-1}.

Due to the linear and Gaussian assumptions, this importance weight has a fixed functional form specified by the error function Φ, and depends only on the one-dimensional projection x^T(γ − ρβ/σ) of the input and the response variable
y. Especially if the selection process is independent of the behavioral model (ρ = 0, when only the covariate density changes), the importance weight is reduced to

w(x, y) ∝ {Φ(x^T γ)}^{-1}.

In this respect, the covariate shift approach covers more flexible changes of distributions than Heckman's sample selection model.
2. Such a joint importance weight can be used for multitask learning [16].
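Up to a normalization constant that does not affect relative weighting, the joint importance weight of Heckman's model can be evaluated as follows (a sketch with scalar inputs xb = x^T β and xg = x^T γ):

```python
import numpy as np
from scipy.stats import norm

def joint_importance_weight(y, xb, xg, rho, sigma):
    """w(x, y) up to a constant:
    1 / Phi((rho*(y - x^T beta)/sigma + x^T gamma) / sqrt(1 - rho^2))."""
    return 1.0 / norm.cdf((rho * (y - xb) / sigma + xg) / np.sqrt(1 - rho**2))
```

When ρ = 0 this reduces to 1/Φ(x^T γ), a function of the covariates alone, i.e., an (unnormalized) covariate shift weight.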
7 Applications of Covariate Shift Adaptation
In this chapter, we show applications of covariate shift adaptation techniques to real-world problems: the brain-computer interface in section 7.1, speaker identification in section 7.2, natural language processing in section 7.3, face-based age prediction in section 7.4, and human activity recognition from accelerometric data in section 7.5. In section 7.6, covariate shift adaptation techniques are employed for efficient sample reuse in the framework of reinforcement learning.
7.1 Brain-Computer Interface
In this section, importance-weighting methods are applied to brain-computer interfaces (BCIs), which have attracted a great deal of attention in biomedical engineering and machine learning [160, 102].
7.1.1 Background
A BCI system allows direct communication from human to machine [201, 40]. Cerebral electric activity is recorded via the electroencephalogram (EEG): electrodes attached to the scalp measure the electric signals of the brain. These signals are amplified and transmitted to the computer, which translates them into device control commands. The crucial requirement for the successful functioning of a BCI is that the electric activity on the scalp surface already reflects, for instance, motor intentions, such as the neural correlate of preparation for hand or foot movements. A BCI system based on motor imagery can detect the motor-related EEG changes and use this information, for example, to perform a choice between two alternatives: the detection of the preparation to move the left hand leads to the choice of the first control command, whereas the right hand intention would lead to the second command. By this means, it is possible to operate devices which are connected to the computer (see figure 7.1).
Figure 7.1
Illustration of a BCI system: imagined left or right hand movements are translated into device control commands.
For classification of appropriately preprocessed EEG signals [130, 122, 101], Fisher discriminant analysis (FDA) [50] has been shown to work well [201, 39, 8]. On the other hand, strong non-stationarity effects have often been observed in brain signals between training and test sessions [194, 116, 143], which can be regarded as an example of covariate shift. This indicates that employing importance-weighting methods could further improve the BCI recognition accuracy.
Here, adaptive importance-weighted FDA (AIWFDA; see section 2.3.1) is employed for coping with the non-stationarity. AIWFDA is tested on 14 data sets obtained from 5 different subjects (see table 7.1 for specification), where the task is binary classification of EEG signals.
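As a rough illustration of the idea (not the book's exact AIWFDA formulation, which is given in section 2.3.1), a binary Fisher discriminant direction with importance weights flattened by an exponent λ can be computed as follows; λ = 0 recovers plain FDA and λ = 1 gives full importance weighting:

```python
import numpy as np

def iwfda_direction(X, y, w, lam=1.0):
    """Binary Fisher discriminant with flattened importance weights w**lam.
    X: (n, d) inputs, y: labels in {0, 1}, w: importance weights."""
    v = w ** lam
    d = X.shape[1]
    means, S_w = [], np.zeros((d, d))
    for c in (0, 1):
        Xc, vc = X[y == c], v[y == c]
        mu = (vc[:, None] * Xc).sum(axis=0) / vc.sum()
        R = Xc - mu
        S_w += (vc[:, None] * R).T @ R / vc.sum()   # weighted within-class scatter
        means.append(mu)
    # FDA direction: inverse within-class scatter times the mean difference
    return np.linalg.solve(S_w + 1e-8 * np.eye(d), means[1] - means[0])
```

The flattening exponent trades off bias against variance, exactly the role of the parameter λ tuned in the experiments below.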
7.1.2 Experimental Setup
Here, how the data samples are gathered and preprocessed in the BCI experiments is briefly described. Further details are given in [23, 22].
The data in this study were recorded in a series of online BCI experiments in which event-related desynchronization [122] was used to discriminate between various mental states. During imagined hand or foot movement, the spectral power in the frequency band between 8 Hz and 35 Hz is known to decrease in the EEG signals of the corresponding motor cortices. The data acquired from 128 EEG channels at a rate of 1000 Hz were downsampled to 100 Hz and band-pass filtered to specifically selected frequency bands. The
Table 7.1
Specification of BCI data

Subject  Session ID  Dim.  # of Training Samples  # of Unlabeled Samples  # of Test Samples
1        1           3     280                    112                     112
1        2           3     280                    120                     120
1        3           3     280                    35                      35
2        1           3     280                    113                     112
2        2           3     280                    112                     112
2        3           3     280                    35                      35
3        1           3     280                    91                      91
3        2           3     280                    112                     112
3        3           3     280                    30                      30
4        1           6     280                    112                     112
4        2           6     280                    126                     126
4        3           6     280                    35                      35
5        1           2     280                    112                     112
5        2           2     280                    112                     112
common spatial patterns (CSP) algorithm [130], a spatial filter that maximizes
the band power in one class while minimizing it for the other class, was applied
and the data were projected onto three to six channels. The features were
finally extracted by calculating the log-variance over a window one second
long.
The experiments consisted of an initial training period and three test periods. During the training period, the letters (L), (R), or (F) were displayed on the screen to instruct the subjects to imagine left hand, right hand, or foot movements. Then a classifier was trained on the two classes with the best discriminability. This classifier was then used in the test periods. In each test period, a cursor could be controlled horizontally by the subjects, using the classifier output. One of the targets on the left and right sides of the computer screen was highlighted to indicate that the subjects should then try to select this target with the cursor. The experiments were carried out with five subjects. Note that the third test period for the fifth subject is missing. So there was a total of 14 test sets (see table 7.1 for specification).
In the evaluation below, the misclassification rates of the classifiers are
reported. Note that the classifier output was not directly translated into the
position of a cursor on the monitor, but underwent some postprocessing steps
(averaging, scaling, and biasing). Therefore, a wrong classification did not
necessarily result in the wrong target being selected. Since the time window
chosen for this evaluation is at the beginning of the trial, a better classification
accuracy on this data will lead to a shortened trial length, and therefore to a
higher bit rate (see [23] for details).
Training samples and unlabeled/test samples were gathered in different recording sessions, so the non-stationarity in brain signals may have changed the distributions. On the other hand, the unlabeled samples and test samples were gathered in the same recording session; more precisely, the unlabeled samples were gathered in the first half of the session and the test samples (with labels) were collected in the latter half. Therefore, unlabeled samples may have contained some information on the test input distribution. However, the input distributions of unlabeled and test samples are not necessarily identical, since the non-stationarity in brain signals can cause a small change in distributions even within the same session. Thus, this setting realistically reflects the classifier update in online BCI systems.
The densities p_tr(x) and p_te(x) were estimated by maximum likelihood fitting of a multidimensional Gaussian density with full covariance matrix: p_tr(x) was estimated using training samples, and p_te(x) was estimated using unlabeled samples.
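This density estimation step amounts to two Gaussian maximum likelihood fits followed by a pointwise ratio; a sketch (using the biased ML covariance estimate):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_importance_weight(x_tr, x_unl):
    """Return w(x) = p_te(x)/p_tr(x) with full-covariance Gaussian ML fits:
    p_tr from training inputs, p_te from unlabeled inputs."""
    p_tr = multivariate_normal(x_tr.mean(axis=0), np.cov(x_tr, rowvar=False, bias=True))
    p_te = multivariate_normal(x_unl.mean(axis=0), np.cov(x_unl, rowvar=False, bias=True))
    return lambda x: p_te.pdf(x) / p_tr.pdf(x)
```

The resulting weights w(x_i^tr) at the training points are what importance-weighted learning and model selection operate on.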
7.1.3 Experimental Results
Table 7.2 describes the misclassification rates of test samples by FDA (corresponding to AIWFDA with flattening parameter λ = 0), AIWFDA with λ chosen based on tenfold IWCV or tenfold CV, and AIWFDA with optimal λ (i.e., for each case, λ is determined so that the misclassification rate is minimized). The value of the flattening parameter λ was selected from {0, 0.1, 0.2, ..., 1.0}; the chosen flattening parameter values are also shown in the table. Table 7.2 also contains the Kullback-Leibler (KL) divergence [97] from the estimated training input distribution to the estimated test input distribution. Since we wanted to have an accurate estimate of the KL divergence, we used test samples for estimating the test input distribution when computing the KL divergence (only unlabeled samples were used when the test input distribution is estimated for AIWFDA and IWCV). The KL values may be interpreted as the level of covariate shift.
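Since both input densities are modeled as Gaussians here, the KL divergence in the table has a closed form; a sketch (KL from N(mu0, S0) to N(mu1, S1)):

```python
import numpy as np

def kl_gaussians(mu0, S0, mu1, S1):
    """KL(N(mu0, S0) || N(mu1, S1)) in closed form."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + logdet1 - logdet0)
```

Plugging in the fitted training and test means/covariances gives the KL column of table 7.2.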
First, FDA is compared with OPT (AIWFDA with optimal λ). The table shows that OPT outperforms FDA in 8 out of 14 cases. This implies that the non-stationarity in the brain signals can be modeled well by covariate shift, which motivates us to employ AIWFDA in BCI. Within each subject, it can be
Table 7.2
Misclassification rates for BCI data

Subject  Trial  OPT           FDA        IWCV           CV          KL
1        1      * 8.7 (0.5)    9.3 (0)   − 10.0 (0.9)   10.0 (0.9)  0.76
1        2      * 6.2 (0.3)    8.8 (0)     8.8 (0)       8.8 (0)    1.11
1        3        4.3 (0)      4.3 (0)     4.3 (0)       4.3 (0)    0.69
2        1       40.0 (0)     40.0 (0)  ◦ 40.0 (0)      41.3 (0.7)  0.97
2        2      * 38.7 (0.1)  39.3 (0)  +◦ 38.7 (0.2)   39.3 (0)    1.05
2        3       25.5 (0)     25.5 (0)    25.5 (0)      25.5 (0)    0.43
3        1      * 34.4 (0.2)  36.9 (0)  + 34.4 (0.2)    34.4 (0.2)  2.63
3        2      * 18.0 (0.4)  21.3 (0)  + 19.3 (0.6)    19.3 (0.9)  2.88
3        3      * 15.0 (0.6)  22.5 (0)  + 17.5 (0.3)    17.5 (0.4)  1.25
4        1      * 20.0 (0.2)  21.3 (0)    21.3 (0)      21.3 (0)    9.23
4        2        2.4 (0)      2.4 (0)     2.4 (0)       2.4 (0)    5.58
4        3        6.4 (0)      6.4 (0)     6.4 (0)       6.4 (0)    1.83
5        1       21.3 (0)     21.3 (0)    21.3 (0)      21.3 (0)    0.79
5        2      * 13.3 (0.5)  15.3 (0)  +◦ 14.0 (0.1)   15.3 (0)    2.01

All values are in percent. IWCV or CV refers to AIWFDA with λ chosen by tenfold IWCV or tenfold CV. OPT refers to AIWFDA with optimal λ. Values of the chosen λ are in parentheses (FDA corresponds to λ = 0). * indicates the case where OPT is better than FDA. + is the case where IWCV outperforms FDA, and − is the opposite case where FDA outperforms IWCV. ◦ denotes the case where IWCV outperforms CV. KL refers to the Kullback-Leibler divergence between (estimated) training and test input distributions.
observed that OPT outperforms FDA when the KL divergence is large, but they
are comparable to one another when the KL divergence is small. This agrees
well with the theory that AIWFDA corrects the bias caused by covariate shift,
and AIWFDA is reduced to plain FDA in the absence of covariate shift. Next,
IWCV (applied to AIWFDA) is compared with FDA. IWCV outperforms FDA
in five cases, whereas the opposite case occurs only once. Therefore, IWCV
combined with AIWFDA certainly contributes to improving the classification
accuracy in BCI. The table also shows that within each subject, IWCV tends to
outperform FDA when the KL divergence is large. Finally, IWCV is compared
with CV (applied to AIWFDA). IWCV outperforms CV in three cases, and the
opposite case does not occur. This substantiates the effectiveness of IWCV as
a model selection criterion in BCI. IWCV tends to outperform CV when the
KL divergence is large within each subject.
The above results showed that non-stationarity in brain signals can be successfully compensated for by IWCV combined with AIWFDA, which certainly contributes to improving the recognition accuracy of BCI.
7.2 Speaker Identification
In this section, we describe an application of covariate shift adaptation
techniques to speaker identification [204].
7.2.1 Background
Speaker identification methods are widely used in real-world situations such as
controlling access to information service systems, speaker detection in speech
dialogue, and speaker indexing problems with large audio archives [133].
Recently, the speaker identification and indexing problems have attracted a
great deal of attention.
Popular methods of text-independent speaker identification are based on the Gaussian mixture model [132] or kernel methods such as the support vector machine [29, 110]. In these supervised learning methods, it is implicitly assumed that training and test data follow the same probability distribution. However, since speech features vary over time due to session-dependent variation, changes in the recording environment, and physical conditions/emotions, the training and test distributions are not necessarily the same in practice. In a paper by Furui [59], the influence of the session-dependent variation of voice quality in speaker identification problems is investigated, and the identification performance is shown to decrease significantly over three months; the major cause of the performance degradation is the characteristic variation of the voice source.
To alleviate the influence of session-dependent variation, it is popular to use several sessions of speaker utterance samples [111, 113] or to use cepstral mean normalization [58]. However, gathering several sessions of speaker utterance data and assigning the speaker ID to the collected data are expensive in both time and cost, and therefore not realistic in practice. Moreover, it is not possible to perfectly remove the session-dependent variation by cepstral mean normalization alone.
A more practical and effective setup is semisupervised learning [30], where unlabeled samples are additionally given from the testing environment. In semisupervised learning, it is required that the probability distributions of training and test data are related to each other in some sense; otherwise, we may not be able to learn anything about the test probability distribution from the training samples. Below, the semisupervised speaker identification problem is formulated as a covariate shift adaptation problem [204].
7.2.2 Formulation
Here, the speaker identification problem is formulated.
An utterance feature X pronounced by a speaker is expressed as a set of N mel-frequency cepstrum coefficient (MFCC) vectors [129] of d dimensions:

X = [x_1, x_2, ..., x_N] ∈ R^{d×N}.

For training, we are given n_tr labeled utterance samples {(X_i^tr, y_i^tr)}_{i=1}^{n_tr}, where y_i^tr ∈ {1, 2, ..., C} denotes the index of the speaker who pronounces X_i^tr.
The goal of speaker identification is to predict the speaker index of a test utterance sample X^te based on the training samples. The speaker index c of a test sample X^te is predicted based on the Bayes decision rule:

ĉ = argmax_c p̂(y = c | X^te).

For approximating the class-posterior probability, the following logistic model p̂(y = c | X) is used:

p̂(y = c | X) = exp{f_c(X)} / Σ_{c'=1}^{C} exp{f_{c'}(X)},

where f_c(X) is a discriminant function corresponding to speaker c.
The following kernel model is used as the discriminant function [113]:

f_c(X) = Σ_{ℓ=1}^{n_tr} θ_{c,ℓ} K(X, X_ℓ^tr),   c = 1, 2, ..., C,

where {θ_{c,ℓ}}_{ℓ=1}^{n_tr} are parameters corresponding to speaker c and K(X, X') is a kernel function. The sequence kernel [110] is used as the kernel function here because it allows us to handle features of different size; for two utterance samples X = [x_1, x_2, ..., x_N] ∈ R^{d×N} and X' = [x'_1, x'_2, ..., x'_{N'}] ∈ R^{d×N'} (generally N ≠ N'), the sequence kernel is defined as

K(X, X') = (1/(N N')) Σ_{i=1}^{N} Σ_{i'=1}^{N'} k(x_i, x'_{i'}),
where k(x, x') is a vectorial kernel. We use the Gaussian kernel here:

k(x, x') = exp( −‖x − x'‖² / (2σ²) ).

The parameters {θ_{c,ℓ}} are learned so that the following negative penalized log-likelihood is minimized:

min_θ [ − Σ_{i=1}^{n_tr} log p̂(y_i^tr | X_i^tr) + λ Σ_{c=1}^{C} Σ_{ℓ=1}^{n_tr} θ_{c,ℓ}² ],

where λ (≥ 0) is the regularization parameter.
In practical speaker identification tasks, speech features are not stationary due to time-dependent voice variation, changes in the recording environment, and physical conditions/emotions. Thus, the training and test feature distributions are not generally the same. Here, we deal with such changing environments via the covariate shift model, and employ importance-weighted logistic regression (IWLR; see section 2.3.2) with the sequence kernel. The tuning parameters in kernel logistic regression are chosen by importance-weighted cross-validation (IWCV; see section 3.3), where the importance weights are learned by the Kullback-Leibler importance estimation procedure (KLIEP; see section 4.4) with the sequence kernel.
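The sequence kernel above, with the Gaussian kernel as the vectorial kernel, can be sketched as follows (assuming the 1/(NN') averaging normalization used here):

```python
import numpy as np

def sequence_kernel(X, Xp, sigma=1.0):
    """Mean pairwise Gaussian kernel between the columns (frame vectors)
    of X (d x N) and Xp (d x N'); N and N' may differ."""
    # squared Euclidean distances between all column pairs
    sq = (np.sum(X**2, axis=0)[:, None] + np.sum(Xp**2, axis=0)[None, :]
          - 2.0 * X.T @ Xp)
    return np.mean(np.exp(-sq / (2.0 * sigma**2)))
```

Because the kernel averages over all frame pairs, utterances of different lengths are directly comparable, which is why it suits variable-length MFCC feature sets.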
7.2.3 Experimental Results
Training and test samples were collected from ten male speakers, and two types of experiments were conducted: text-dependent and text-independent speaker identification. In text-dependent speaker identification, the training and test sentences were common to all speakers. In text-independent speaker identification, the training sentences were common to all speakers, but the test sentences were different from the training sentences.
The NIT data set [112] was used here. Each speaker uttered several Japanese sentences for text-dependent and text-independent speaker identification evaluation. The following three sentences were used as training and test samples in the text-dependent speaker identification experiments (Japanese sentences written using the Hepburn system of romanization):
• seno takasawa hyakunanajusseNchi hodode mega ookiku yaya futotteiru,
• oogoeo dashisugite kasuregoeni natte shimau,
• tashizaN hikizaNwa dekinakutemo eha kakeru.
In the text-independent speaker identification experiments, the same three
sentences were used as training samples:
• seno takasawa hyakunanajusseNchi hodode mega ookiku yaya futotteiru,
• oogoeo dashisugite kasuregoeni natte shimau,
• tashizaN hikizaNwa dekinakutemo eha kakeru.
The following five sentences were used as test samples:
• tobujiyuuwo eru kotowajiNruino yume datta,
• hajimete ruuburubijutsukaNe haittanowa juuyoneNmaeno kotoda,
• jibuNno jitsuryokuwa jibuNga ichibaN yoku shitteiru hazuda,
• koremade shouneNyakyuu mamasaN bareenado chiikisupootsuo sasae
shimiNni micchakushite kitanowamusuuno boraNtiadatta,
• giNzakeno tamagoo yunyuushite fukasase kaichuude sodateru youshokumo
hajimatteiru.
The utterance samples for training were recorded in 1990/12, and the utterance samples for testing were recorded in 1991/3, 1991/6, and 1991/9, respectively. Since the recording times of the training and test utterance samples are different, voice quality variation is expected to be included. Thus, this speaker identification problem is a challenging task.
The total duration of the training sentences is about 9 seconds. The durations
of the test sentences for text-dependent and text-independent speaker identifi­
cation are 9 seconds and 24 seconds, respectively. There are approximately ten
vowels in the sentences for every 1.5 seconds.
The input utterance is sampled at 16 kHz. A feature vector consists of 26 components: 12 mel-frequency cepstrum coefficients, the normalized log energy, and their first derivatives. Feature vectors are derived every 10 milliseconds over 25.6-millisecond Hamming-windowed speech segments, and cepstral mean normalization is applied to the features to remove channel effects. Each utterance is divided into 300-millisecond disjoint segments, each of which corresponds to a set of features of size 26 × 30. Thus, the training set is given as

{(X_i^tr, y_i^tr)}_{i=1}^{n_tr}

for text-independent and text-dependent speaker identification evaluation. For text-independent speaker identification, the sets of test samples for 1991/3, 1991/6, and 1991/9 are given as
X^{te1} = {X_i^{te1}}_{i=1}^{907},   X^{te2} = {X_i^{te2}}_{i=1}^{919},   and   X^{te3} = {X_i^{te3}}_{i=1}^{906}.

For text-dependent speaker identification, the sets of test data are given as

X^{te1} = {X_i^{te1}}_{i=1}^{407},   X^{te2} = {X_i^{te2}}_{i=1}^{407},   and   X^{te3} = {X_i^{te3}}_{i=1}^{412}.

In the above, we assume that samples are sorted according to time, with the index i = 1 being the first sample.
The speaker identification rate was computed every 1.5 seconds, 3.0 seconds, and 4.5 seconds, and the speaker was identified based on the average posterior probability

(1/m) Σ_{i'=1}^{m} p̂(y = c | X_{i−i'+1}^{te}),

where m = 5, 10, and 15 for 1.5 seconds, 3.0 seconds, and 4.5 seconds, respectively.
Here, the product-of-Gaussian model (PoG) [81], plain logistic regression
(LR) with the sequence kernel, and importance-weighted logistic regression
(IWLR) with the sequence kernel were tested, and the speaker identification
rates were compared on the data taken in 1991/3, 1991/6, and 1991/9. For
PoG and LR training, we used the 1990/12 data set (inputs X^tr and their labels).
For PoG training, the means, diagonal covariance matrices, and mixing
coefficients were initialized by the results of k-means clustering on all
training sentences for all speakers; these parameters were then estimated via
the EM algorithm [38, 21] for each speaker. The number of components, b, was
determined by fivefold cross-validation. In the test phase of PoG, the
probability for speaker c was computed as
p_c(X^te) = ∏_{i=1}^{m} ∏_{j=1}^{N} q_c( x_{m-i+1, j}^{te} ),

where X_i^{te} = [x_{i,1}^{te}, x_{i,2}^{te}, ..., x_{i,N}^{te}] and q_c(x) was a PoG for
speaker c. In the current setup, N = 30 and m = 5, 10, and 15 (corresponding to
1.5 seconds, 3.0 seconds, and 4.5 seconds, respectively).
For IWLR training, the unlabeled samples X^{te1}, X^{te2}, and X^{te3}, in addition to
the training inputs X^tr and their labels (i.e., semisupervised learning), were used.
We first estimated the importance weight from the training and test data set pairs
(X^tr, X^{te1}), (X^tr, X^{te2}), or (X^tr, X^{te3}) by KLIEP with fivefold CV (see section 4.4),
and we used fivefold IWCV (see section 3.3) to choose the Gaussian kernel
width σ and the regularization parameter λ.
In practice, the k-fold CV and k-fold IWCV scores can be strongly affected
by the way the data samples are split into k disjoint subsets (we used k = 5).
This phenomenon is due to the non-i.i.d. nature of the mel-frequency cepstrum
coefficient features, which differs from the theory. To obtain reliable
experimental results, the CV procedure was repeated 50 times with different
random data splits, and the highest score was used for model selection.
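This repeated-split model selection can be sketched as follows. Here `score_fn` is a hypothetical callback that trains and scores one fold; the loop simply re-partitions the data with different random splits and keeps the highest mean score:

```python
import numpy as np

def repeated_cv_score(score_fn, n_samples, k=5, n_repeats=50, seed=0):
    """Repeat k-fold CV with different random splits and return the highest
    mean score, as a guard against split-sensitive (non-i.i.d.) features.

    score_fn(train_idx, val_idx) -> validation score for one fold
    (a hypothetical user-supplied callback).
    """
    rng = np.random.default_rng(seed)
    best = -np.inf
    for _ in range(n_repeats):
        perm = rng.permutation(n_samples)
        folds = np.array_split(perm, k)          # k disjoint subsets
        fold_scores = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            fold_scores.append(score_fn(train_idx, val_idx))
        best = max(best, float(np.mean(fold_scores)))
    return best

# toy check: a constant scorer gives back its constant
print(repeated_cv_score(lambda tr, va: 1.0, n_samples=100, n_repeats=3))  # 1.0
```

In the experiment this outer loop would wrap the whole KLIEP + IWCV model selection for each hyperparameter candidate.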
Table 7.3 shows the text-independent speaker identification rates in percent
for 1991/3, 1991/6, and 1991/9. IWLR refers to IWLR with σ and λ chosen by
fivefold IWCV, LR refers to LR with σ and λ chosen by fivefold CV, and PoG
refers to PoG with the number of mixtures b chosen by fivefold CV. The chosen
values of these hyperparameters are given in parentheses. DoDC (degree of
distribution change) refers to the standard deviation of the estimated
importance weights {w(X_i^tr)}_{i=1}^{n_tr}; the smaller the standard deviation
is, the "flatter" the importance weights are. Flat importance weights imply that
Table 7.3
Correct classification rates for text-independent speaker identification

1991/3 (DoDC = 0.34)   IWLR (σ = 1.4, λ = 10^-4)   LR (σ = 1.0, λ = 10^-2)   PoG (b = 16)
1.5 sec.               91.0                         88.2                      89.7
3.0 sec.               95.0                         92.9                      94.4
4.5 sec.               97.7                         96.1                      94.6

1991/6 (DoDC = 0.37)   IWLR (σ = 1.3, λ = 10^-4)   LR (σ = 1.0, λ = 10^-2)   PoG (b = 16)
1.5 sec.               91.0                         87.7                      90.2
3.0 sec.               95.3                         91.1                      94.0
4.5 sec.               97.4                         93.4                      96.1

1991/9 (DoDC = 0.35)   IWLR (σ = 1.2, λ = 10^-4)   LR (σ = 1.0, λ = 10^-2)   PoG (b = 16)
1.5 sec.               94.8                         91.7                      92.1
3.0 sec.               97.9                         96.3                      95.0
4.5 sec.               98.8                         98.3                      95.8

All values are in percent. IWLR refers to IWLR with (σ, λ) chosen by fivefold IWCV, LR refers
to LR with (σ, λ) chosen by fivefold CV, and PoG refers to PoG with the number of components b
chosen by fivefold CV. The chosen values of these hyperparameters are given in parentheses.
DoDC refers to the standard deviation of the estimated importance weights {w(X_i^tr)}_{i=1}^{n_tr},
which roughly indicates the degree of distribution change.
there is no significant distribution change between the training and test phases.
Thus, the standard deviation of estimated importance weights may be regarded
as a rough indicator of the degree of distribution change.
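Computing this indicator is a one-liner; the sketch below assumes the importance weights have already been estimated (e.g., by KLIEP):

```python
import numpy as np

def degree_of_distribution_change(weights):
    """DoDC: standard deviation of the estimated importance weights.

    A small value means the weights are nearly flat, i.e., little
    distribution change between the training and test phases.
    """
    return float(np.std(np.asarray(weights, dtype=float)))

flat = np.ones(100)                        # no distribution change
shifted = np.exp(np.linspace(-1, 1, 100))  # strongly varying weights
print(degree_of_distribution_change(flat))     # 0.0
print(degree_of_distribution_change(shifted))  # clearly larger
```

This matches the DoDC values reported in tables 7.3 and 7.4: around 0.35 for the text-independent sessions and only 0.05 for the text-dependent ones.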
As can be seen from the table, IWLR+IWCV outperforms PoG+CV and
LR+CV for all sessions. This result implies that importance weighting is
useful in coping with the influence of non-stationarity in practical speaker
identification, such as utterance variation, changes in the recording
environment, and changes in physical condition or emotion.
Table 7.4 summarizes the text-dependent speaker identification rates in
percent for 1991/3, 1991/6, and 1991/9, showing that IWLR+IWCV and LR+CV
slightly outperform PoG and are highly comparable to each other. The result
that IWLR+IWCV and LR+CV are comparable in this experiment is a reasonable
consequence, since the standard deviation of the estimated importance weights
is very small in all three cases, implying that there is no significant
distribution change and therefore no adaptation is necessary. This result indicates that
Table 7.4
Correct classification rates for text-dependent speaker identification

1991/3 (DoDC = 0.05)   IWLR (σ = 1.2, λ = 10^-4)   LR (σ = 1.0, λ = 10^-2)   PoG (b = 16)
1.5 sec.               100.0                        98.9                      96.8
3.0 sec.               100.0                        100.0                     97.7
4.5 sec.               100.0                        100.0                     97.9

1991/6 (DoDC = 0.05)   IWLR (σ = 1.2, λ = 10^-4)   LR (σ = 1.0, λ = 10^-2)   PoG (b = 16)
1.5 sec.               97.5                         96.2                      97.8
3.0 sec.               97.5                         97.2                      98.1
4.5 sec.               98.9                         97.4                      98.3

1991/9 (DoDC = 0.05)   IWLR (σ = 1.2, λ = 10^-4)   LR (σ = 1.0, λ = 10^-2)   PoG (b = 16)
1.5 sec.               100.0                        100.0                     98.2
3.0 sec.               100.0                        100.0                     98.4
4.5 sec.               100.0                        100.0                     98.5

All values are in percent. IWLR refers to IWLR with (σ, λ) chosen by fivefold IWCV, LR refers
to LR with (σ, λ) chosen by fivefold CV, and PoG refers to PoG with the number of components b
chosen by fivefold CV. The chosen values of these hyperparameters are given in parentheses.
DoDC refers to the standard deviation of the estimated importance weights {w(X_i^tr)}_{i=1}^{n_tr},
which roughly indicates the degree of distribution change.
the proposed method does not degrade the accuracy when there is no significant
distribution change.
Overall, the importance-weighting method tends to improve the perfor­
mance when a significant distribution change exists. It also tends to maintain
the good performance of the baseline method when no distribution change
exists. Thus, the importance-weighting method is a promising approach to
handling session-dependent variation in practical speaker identification.
7.3 Natural Language Processing
In this section, importance-weighting methods are applied to a domain adapta­
tion task in natural language processing (NLP).
7.3.1 Formulation
A standard way to train an NLP system is to use data collected from the target
domain in which the system is operated. In practice, though, collecting data
from the target domain is often costly, so it is desirable to (re)use data
obtained from other domains. However, due to differences in vocabulary and
writing style, naive use of data obtained from other domains results in
serious performance degradation. For this reason, domain adaptation is one of
the most important challenges in NLP [24, 86, 36].
Here, domain adaptation experiments are conducted for a Japanese word
segmentation task. It is not trivial to detect word boundaries in
nonsegmented languages such as Japanese or Chinese. In the word segmentation
task, X represents a sequence of features at the character boundaries of a
sentence, and y is a sequence of the corresponding labels, which specify
whether the current position is a word boundary. It is reasonable to treat
the domain adaptation task of word segmentation as a covariate shift
adaptation problem, since the word segmentation policy, p(y|X), rarely
changes across domains in the same language, but the distribution of
characters, p(X), tends to vary across domains.
The goal of the experiments here is to adapt a word segmentation system
from a daily conversation domain to a medical domain [187]. One
characteristic of NLP tasks is their high dimensionality: the total number
of distinct features is about d = 300,000 in this data set. n_tr = 13,000 labeled
(i.e., word-segmented) sentences from the source domain and n_te = 53,834
unlabeled (i.e., unsegmented) sentences from the target domain are used for
learning. In addition, 1000 labeled sentences from the target domain are used
to evaluate the domain adaptation performance.
As a word segmentation model, conditional random fields (CRFs) are used,
which generalize the logistic regression models (see section 2.3.2) to
structured prediction [98]. A CRF models the conditional probability
p(y|X; θ) of an output structure y given an input X as

p(y|X; θ) = exp( θᵀ φ(X, y) ) / Σ_{y'} exp( θᵀ φ(X, y') ),

where φ(X, y) is a basis function mapping (X, y) to a b-dimensional feature
vector,¹ and b is the dimension of the parameter vector θ. Although the
conventional CRF learning algorithm minimizes the regularized negative
log-likelihood, an importance-weighted CRF (IWCRF) training algorithm is used
here for covariate shift adaptation:

min_θ [ − Σ_{i=1}^{n_tr} w(X_i^tr) log p(y_i^tr | X_i^tr; θ) + λ θᵀθ ].
The Kullback-Leibler importance estimation procedure (KLIEP; see
section 4.4) for log-linear models is used to estimate the importance weight
w(X):

ŵ(X) = exp( αᵀ t(X) ) / ( (1/n_tr) Σ_{i=1}^{n_tr} exp( αᵀ t(X_i^tr) ) ),

where the basis function t(X) is set to the average value of the features for
CRFs over a sentence.
Since KLIEP for log-linear models is computationally more efficient than
KLIEP for linear models when the number of test samples is large (see
section 4.4), we chose the log-linear model here.
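A minimal sketch of evaluating the fitted log-linear weight model (α is assumed to have been estimated by KLIEP already, and `t_train` holds the sentence-level basis vectors; all names are illustrative):

```python
import numpy as np

def loglinear_importance(alpha, t_x, t_train):
    """Importance weight under the log-linear KLIEP model.

    alpha   : (b,) parameter vector (assumed already fitted by KLIEP)
    t_x     : (b,) basis vector t(X) for the sentence of interest
    t_train : (n_tr, b) basis vectors of the training sentences

    Implements w(X) = exp(alpha^T t(X)) / ((1/n_tr) sum_i exp(alpha^T t(X_i^tr))),
    so the weights average to one over the training set.
    """
    alpha = np.asarray(alpha, dtype=float)
    num = np.exp(t_x @ alpha)
    denom = np.mean(np.exp(t_train @ alpha))
    return float(num / denom)

rng = np.random.default_rng(0)
t_train = rng.normal(size=(200, 3))
alpha = np.array([0.2, -0.1, 0.05])
w = np.array([loglinear_importance(alpha, t, t_train) for t in t_train])
print(round(w.mean(), 6))  # training weights average to 1 by construction
```

The normalization by the training-sample mean is what makes the log-linear model cheap at test time: the denominator is computed once, regardless of how many test sentences must be weighted.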
1. Its definition is omitted here. See [187] for details.
7.3.2 Experimental Results
The performance is evaluated by the F-measure,

F = (2 × R × P) / (R + P),

where R and P denote recall and precision, defined by

R = (# of correct words / # of words in test data) × 100,

P = (# of correct words / # of words in system output) × 100.
The hyperparameter of IWCRF is optimized based on an importance-weighted
F-measure (IWF) computed on separate validation data, in which the number of
correct words is weighted according to the importance of the sentence that
these words belong to:

IWF(D) = ( 2 × IWR(D) × IWP(D) ) / ( IWR(D) + IWP(D) )

for the validation set D, where

IWR(D) = ( Σ_{(X,y)∈D} w(X) × (# of correct words) ) / ( Σ_{(X,y)∈D} w(X) × (# of words in y) ) × 100,

IWP(D) = ( Σ_{(X,y)∈D} w(X) × (# of correct words) ) / ( Σ_{(X,y)∈D} w(X) × (# of words in ŷ) ) × 100,

and

ŷ = (ŷ_1, ŷ_2, ..., ŷ_T)ᵀ = argmax_y p(y|X; θ̂).
Ten percent of the training data is used as the validation set D.
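The IWF computation can be sketched as follows, with each sentence summarized by hypothetical per-sentence counts (importance weight, correct words, reference words, system-output words); with unit weights it reduces to the ordinary F-measure:

```python
def importance_weighted_f(samples):
    """Importance-weighted F-measure over a validation set.

    samples: list of (w, n_correct, n_ref, n_sys) per sentence, where w is
    the sentence's importance weight, n_correct the number of correctly
    segmented words, n_ref the number of words in the reference, and n_sys
    the number of words in the system output (a simplified sketch of IWR/IWP).
    """
    num = sum(w * c for w, c, _, _ in samples)
    iwr = 100.0 * num / sum(w * r for w, _, r, _ in samples)  # weighted recall
    iwp = 100.0 * num / sum(w * s for w, _, _, s in samples)  # weighted precision
    return 2 * iwr * iwp / (iwr + iwp)

# with unit weights this reduces to the ordinary F-measure
data = [(1.0, 8, 10, 9), (1.0, 5, 6, 7)]
print(round(importance_weighted_f(data), 2))  # 81.25
```

Sentences that look like the target domain (large w) thus dominate the validation score, which is what makes the hyperparameter choice adapt to the target domain without target labels.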
The performances of CRF, IWCRF, and CRF' (a CRF trained with an additional
1000 labeled manual word segmentation samples from the target domain [187])
are compared. For importance estimation, log-linear KLIEP and the logistic
regression (LR) approach (see section 4.3) are tested, and fivefold
cross-validation is used to find the optimal Gaussian width σ in the LR model.
Table 7.5 summarizes the performance, showing that IWCRF+KLIEP significantly
outperforms CRF. CRF', which uses additional labeled samples in the target
domain and is thus considerably more expensive, tends to perform better
Table 7.5
Word segmentation performance in the target domain

             F      R      P
CRF          92.30  90.58  94.08
IWCRF+KLIEP  94.46  94.32  94.59
IWCRF+LR     93.68  94.30  93.07
CRF'         94.43  93.49  95.39

CRF' indicates the performance of a CRF trained with an additional 1000 manual word
segmentation samples from the target domain.
than CRF, as expected. A notable fact is that IWCRF+KLIEP, which does not
require expensive labeled samples from the target domain, is on par with CRF'.
Empirically, the covariate shift adaptation technique seems to improve the
coverage (R) in the target domain. Compared with LR, KLIEP seems to perform
better in this experiment. Since it is easy to obtain a large amount of
unlabeled text data in NLP tasks, domain adaptation by importance weighting
is a promising approach in NLP.
7.4 Perceived Age Prediction from Face Images
In this section, covariate shift adaptation techniques are applied to perceived
age prediction from face images [188].
7.4.1 Background
Recently, demographic analysis in public places such as shopping malls and
stations has been attracting a great deal of attention. Such information is use­
ful for various purposes such as designing effective marketing strategies and
targeting advertisements based on prospective customers' genders and ages.
For this reason, a number of approaches have been explored for age estimation
from face images [61, 54, 65]. Several databases are now publicly available
[49,123,134].
The accuracy of age prediction systems is significantly influenced by the
type of camera, the camera calibration, and lighting variations. The publicly
available databases were collected mainly in semicontrolled environments
such as a studio with appropriate illumination. In real-world environments,
however, lighting conditions vary considerably: strong sunlight may be cast
on one side of the face, or there may not be enough light. For this reason,
training and test data tend to have different distributions. Here, covariate
shift adaptation techniques are employed to alleviate the influence of
changes in lighting conditions.
7.4.2 Formulation
Let us consider a regression problem of estimating the age y of a subject x
(x corresponds to a face-feature vector). Here, age estimation is performed
not on the basis of subjects' real ages, but of their perceived ages. Thus,
the "true" age of a subject is defined as the average of the perceived ages
reported by those who observed the subject's face images (rounded to the
nearest integer).
Suppose that training samples {(x_i^tr, y_i^tr)}_{i=1}^{n_tr} are given. We use the
following kernel model for age estimation (see section 1.3.5.3):

f(x; θ) = Σ_{ℓ=1}^{n_tr} θ_ℓ K(x, x_ℓ^tr),

where θ = (θ_1, θ_2, ..., θ_{n_tr})ᵀ are parameters and K(·, ·) is a kernel
function. Here the Gaussian kernel is used:

K(x, x') = exp( −‖x − x'‖² / (2σ²) ),
where σ (> 0) denotes the Gaussian width. Under the covariate shift
formulation, the parameter θ may be learned by regularized importance-weighted
least squares (see section 2.2.1):

min_θ [ (1/n_tr) Σ_{i=1}^{n_tr} ( p_te(x_i^tr) / p_tr(x_i^tr) ) ( f(x_i^tr; θ) − y_i^tr )² + λ Σ_{ℓ=1}^{n_tr} θ_ℓ² ],

where λ (≥ 0) is the regularization parameter.
7.4.3 Incorporating Characteristics of Human Age Perception
Human age perception is known to have heterogeneous characteristics. For
example, it is rare to misjudge the age of a 5-year-old child as 15 years, but
the age of a 35-year-old person is often misjudged as 45 years. A paper by
Ueki et al. [189] quantified this phenomenon by carrying out a large-scale
questionnaire survey: each of 72 volunteers was asked to give age labels y to
approximately 1000 face images.
Figure 7.2 depicts the relation between subjects' perceived ages and their
standard deviation. The standard deviation is approximately 2 (years) when
the true age is less than 15. It increases beyond 6 as the true age increases
from 15 to 35, and then decreases to around 5 as the true age increases from
35 to 70. This graph shows that the perceived age deviation
Figure 7.2
The relation between subjects' perceived ages (horizontal axis) and their standard
deviation (vertical axis).
tends to be small in younger age brackets and large in older age groups. This
agrees well with our intuition concerning the human growth process.
In order to match the characteristics of age prediction systems to those of
human age perception, the goodness-of-fit term in the training criterion is
weighted according to the inverse variance of the perceived age:

min_θ [ (1/n_tr) Σ_{i=1}^{n_tr} ( p_te(x_i^tr) / ( p_tr(x_i^tr) ρ²(y_i^tr) ) ) ( f(x_i^tr; θ) − y_i^tr )² + λ Σ_{ℓ=1}^{n_tr} θ_ℓ² ],

where ρ(y) is the standard deviation of the perceived age at age y (i.e., the
values in figure 7.2). The solution θ̂ is given analytically by

θ̂ = ( K^tr W K^tr + λ I_{n_tr} )⁻¹ K^tr W y^tr,

where K^tr is the n_tr × n_tr matrix with the (i, i')-th element

K^tr_{i,i'} = K(x_i^tr, x_{i'}^tr),

and W is the diagonal matrix with the i-th diagonal element

W_{i,i} = p_te(x_i^tr) / ( p_tr(x_i^tr) ρ²(y_i^tr) ),
I_{n_tr} is the n_tr-dimensional identity matrix, and

y^tr = ( y_1^tr, y_2^tr, ..., y_{n_tr}^tr )ᵀ.
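A minimal numpy sketch of this closed-form solution. The true densities p_te and p_tr are unknown in practice, so `weights` stands in for the estimated combined weight p_te/p_tr · 1/ρ²; all names here are illustrative:

```python
import numpy as np

def gaussian_kernel_matrix(X1, X2, sigma):
    """K[i, j] = exp(-||X1[i] - X2[j]||^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

def iw_regularized_ls(X_tr, y_tr, weights, sigma, lam):
    """Closed-form importance- and age-weighted ridge fit:
    theta = (K W K + lam I)^{-1} K W y, with W = diag(weights).
    `weights` would combine p_te/p_tr and 1/rho^2(y) in the book's setup;
    here it is just any nonnegative per-sample weight (a sketch).
    """
    K = gaussian_kernel_matrix(X_tr, X_tr, sigma)
    W = np.diag(weights)
    n = len(y_tr)
    theta = np.linalg.solve(K @ W @ K + lam * np.eye(n), K @ W @ y_tr)
    return theta, K

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = X[:, 0] + 0.1 * rng.normal(size=30)
theta, K = iw_regularized_ls(X, y, np.ones(30), sigma=1.0, lam=1e-3)
pred = K @ theta  # fitted values on the training inputs
# the fit should track y much more closely than predicting the mean
```

Because the criterion is quadratic in θ, no iterative optimization is needed; one linear solve of size n_tr suffices.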
7.4.4 Experimental Results
The face images recorded under 17 different lighting conditions were used in
the experiments: for instance, the average illuminance is approximately
1000 lux from above and 500 lux from the front in the standard lighting
condition; 250 lux from above and 125 lux from the front in the dark setting;
and 190 lux from the subject's right and 750 lux from the left in another
setting (see figure 7.3). Images were recorded as movies with the camera at a
negative elevation angle of 15 degrees. The number of subjects was
approximately 500 (250 of each gender). A face detector was used to localize
the eye pupils, and the image was then rescaled to 64 × 64 pixels. The number
of face images in each environment was about 2500 (5 face images × 500
subjects). As preprocessing, a feature extractor based on convolutional
neural networks [184] was used to extract 100-dimensional features from the
64 × 64 face images. Learning for male and female data was performed
separately, assuming that gender classification had been carried out
correctly in advance.
The 250 subjects of each gender were split into a training set (200 subjects)
and a test set (50 subjects). For the test samples {(x_j^te, y_j^te)}_{j=1}^{n_te}
corresponding to the environment with strong light from a side, the following
age-weighted mean squared error (AWMSE) was calculated as a performance
measure for a learned parameter θ̂:

AWMSE = (1/n_te) Σ_{j=1}^{n_te} ( 1 / ρ²(y_j^te) ) ( y_j^te − f(x_j^te; θ̂) )².    (7.1)

Figure 7.3
Examples of face images under different lighting conditions. Left: standard lighting;
middle: dark; right: strong light from the subject's right-hand side.
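Equation 7.1 can be computed directly; the sketch below assumes the per-age standard deviations ρ(y) from figure 7.2 are available as an array:

```python
import numpy as np

def awmse(y_true, y_pred, perceived_std):
    """Age-weighted mean squared error (equation 7.1): errors are scaled by
    the inverse variance of the perceived age, so mistakes in age brackets
    that humans judge precisely count for more.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    s = np.asarray(perceived_std, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2 / s ** 2))

# a 2-year error counts more for a child (std ~2) than for a 35-year-old (std ~6)
print(awmse([10.0], [12.0], [2.0]))  # (2^2)/(2^2) = 1.0
print(awmse([35.0], [37.0], [6.0]))  # (2^2)/(6^2) ≈ 0.111
```

The same weighting appears in the training criterion, so the model is optimized for the measure by which it is evaluated.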
The training set and the test set were shuffled five times in such a way that
each subject was selected as a test sample once. The final performance was
evaluated based on the average AWMSE over the five trials.
The performances of the following three methods were compared:
• IW Training samples were taken from all 17 lighting conditions. The
importance weights were estimated by the Kullback-Leibler importance
estimation procedure (KLIEP; see section 4.4), using the training samples and
additional unlabeled test samples; the Gaussian width in KLIEP was determined
by twofold likelihood cross-validation. The importance weights estimated by
KLIEP were averaged over the samples in each lighting condition, and these
average importance weights were used in the training of the regression
models; this has the effect of smoothing the importance weights. The Gaussian
width σ and the regularization parameter λ were determined by fourfold
importance-weighted cross-validation (IWCV; see section 3.3) over AWMSE; that
is, the training set was further divided into a training part (150 subjects)
and a validation part (50 subjects).
• NIW Training samples were taken from all 17 lighting conditions. No
importance weights were incorporated in the training criterion (the age
weights were included). The Gaussian width σ and the regularization parameter
λ were determined by fourfold cross-validation over AWMSE.
• NIW' Only training samples taken under the standard lighting condition were
used. The other setups were the same as for NIW.
Table 7.6 summarizes the experimental results, showing that for both male
and female data, IW is better than NIW, and NIW is better than NIW'.
This illustrates that the covariate shift adaptation techniques are useful for
alleviating the influence of changes in lighting conditions.
Table 7.6
The age prediction performance measured by AWMSE

       Male   Female
IW     2.54   3.90
NIW    2.64   4.40
NIW'   2.83   6.51

See equation 7.1.
7.5 Human Activity Recognition from Accelerometric Data
In this section, covariate shift adaptation techniques are applied to
accelerometer-based human activity recognition [68].
7.5.1 Background
Human activity recognition from accelerometric data (e.g., obtained by a
smartphone) has been gathering a great deal of attention recently
[11, 15, 75], since it can be used for purposes such as remote health care
[60, 152, 100] and worker behavior monitoring [207]. To construct a good
classifier for activity recognition, users are required to prepare
accelerometric data with activity labels for types of actions such as
walking, running, and bicycle riding. However, since gathering labeled data
is costly, this initial data collection phase prevents new users from using
the activity recognition system. Thus, overcoming this new-user problem is an
important challenge in making human activity recognition systems useful in
practice.
Since unlabeled data are relatively easy to gather, we can typically use
labeled data obtained from existing users and unlabeled data obtained from
a new user for developing the new user's activity classifier. Such a situation is
commonly called semisupervised learning, and various learning methods that
utilize unlabeled samples have been proposed [30]. However, semisupervised
learning methods tend to perform poorly if the unlabeled test data have a
significantly different distribution from the labeled training data.
Unfortunately, this is a typical situation in human activity recognition,
since motion patterns (and thus the distributions of motion data) depend
heavily on the user.
In this section, we apply an importance-weighted variant of a probabilistic
classification algorithm called the least-squares probabilistic classifier
(LSPC) [155] to real-world human activity recognition.
7.5.2 Importance-Weighted Least-Squares Probabilistic Classifier
Here, we describe a covariate shift adaptation method called the
importance-weighted least-squares probabilistic classifier (IWLSPC) [68].
Let us consider the problem of classifying an accelerometric sample x (∈ ℝ^d)
into an activity class y (∈ {1, 2, ..., c}), where d is the input
dimensionality and c denotes the number of classes. IWLSPC estimates the
class-posterior probability p(y|x) from training input-output samples
{(x_i^tr, y_i^tr)}_{i=1}^{n_tr} and test input-only samples {x_j^te}_{j=1}^{n_te} under
covariate shift. In the context of human activity recognition, the labeled
training samples {(x_i^tr, y_i^tr)}_{i=1}^{n_tr} correspond to the data obtained
from existing users, and the unlabeled test input samples {x_j^te}_{j=1}^{n_te}
correspond to the data obtained from a new user.
Let us model the class-posterior probability p(y|x) by

p(y|x; θ_y) = Σ_{ℓ=1}^{n_te} θ_{y,ℓ} K(x, x_ℓ^te),

where θ_y = (θ_{y,1}, θ_{y,2}, ..., θ_{y,n_te})ᵀ is the parameter vector and
K(x, x') is a kernel function. We focus on the Gaussian kernel here:

K(x, x') = exp( −‖x − x'‖² / (2σ²) ),

where σ denotes the Gaussian kernel width. We determine the parameter θ_y so
that the following squared error J_y is minimized:

J_y(θ_y) = (1/2) ∫ ( p(y|x; θ_y) − p(y|x) )² p_te(x) dx
         = (1/2) ∫ p(y|x; θ_y)² p_te(x) dx − ∫ p(y|x; θ_y) p(y|x) p_te(x) dx + C
         = (1/2) θ_yᵀ Q θ_y − q_yᵀ θ_y + C,

where C is a constant independent of the parameter θ_y, Q is the n_te × n_te
matrix with elements

Q_{ℓ,ℓ'} := ∫ K(x, x_ℓ^te) K(x, x_{ℓ'}^te) p_te(x) dx,

and q_y = (q_{y,1}, ..., q_{y,n_te})ᵀ is the n_te-dimensional vector with elements

q_{y,ℓ} := ∫ K(x, x_ℓ^te) p(y|x) p_te(x) dx.
Here, we approximate Q and q_y using the adaptive importance sampling
technique (see section 2.1.2) as follows. First, using the importance weight
defined as

w(x) := p_te(x) / p_tr(x),

we express Q and q_y in terms of the training distribution as

Q_{ℓ,ℓ'} = ∫ K(x, x_ℓ^te) K(x, x_{ℓ'}^te) p_tr(x) w(x) dx,

q_{y,ℓ} = ∫ K(x, x_ℓ^te) p(y|x) p_tr(x) w(x) dx
        = p(y) ∫ K(x, x_ℓ^te) p_tr(x|y) w(x) dx,

where p_tr(x|y) denotes the training input density for class y.
Then, based on the above expressions, Q and q_y are approximated using the
training samples {(x_i^tr, y_i^tr)}_{i=1}^{n_tr} as follows:²

Q̂_{ℓ,ℓ'} = (1/n_tr) Σ_{i=1}^{n_tr} K(x_i^tr, x_ℓ^te) K(x_i^tr, x_{ℓ'}^te) w(x_i^tr)^ν,

q̂_{y,ℓ} = ( p̂(y) / n_tr^{(y)} ) Σ_{i : y_i^tr = y} K(x_i^tr, x_ℓ^te) w(x_i^tr)^ν,

where the class-prior probability p(y) is approximated by p̂(y) = n_tr^{(y)}/n_tr,
and n_tr^{(y)} denotes the number of training samples with label y. Also,
ν (0 ≤ ν ≤ 1) denotes the flattening parameter, which controls the
bias-variance trade-off in importance sampling (see section 2.1.2).
Consequently, we arrive at the following optimization problem:

min_{θ_y} [ (1/2) θ_yᵀ Q̂ θ_y − q̂_yᵀ θ_y + (λ/2) θ_yᵀ θ_y ],

where (λ/2) θ_yᵀ θ_y is a regularization term to avoid overfitting and λ (≥ 0)
denotes the regularization parameter. The IWLSPC solution is then given
analytically as

θ̂_y = ( Q̂ + λ I_{n_te} )⁻¹ q̂_y,

where I_{n_te} denotes the n_te-dimensional identity matrix. Since the
class-posterior probability is nonnegative by definition, we modify the
solution as
2. When ν = 1, Q̂ may be approximated directly using the test input samples {x_j^te}_{j=1}^{n_te} as Q̂_{ℓ,ℓ'} = (1/n_te) Σ_{j=1}^{n_te} K(x_j^te, x_ℓ^te) K(x_j^te, x_{ℓ'}^te).
follows [206]:

p̂(y|x) = (1/Z) max( 0, Σ_{ℓ=1}^{n_te} θ̂_{y,ℓ} K(x, x_ℓ^te) )

if Z := Σ_{y'=1}^{c} max( 0, Σ_{ℓ=1}^{n_te} θ̂_{y',ℓ} K(x, x_ℓ^te) ) > 0; otherwise,
p̂(y|x) = 1/c, where c denotes the number of classes.
The learned class-posterior probability p̂(y|x) allows us to predict the class
label of a new sample x, with confidence p̂(ŷ|x), as

ŷ = argmax_y p̂(y|x).
In practice, the importance weight w(x) can be estimated by the methods
described in chapter 4, and tuning parameters such as the regularization
parameter λ, the Gaussian kernel width σ, and the flattening parameter ν may
be chosen by importance-weighted cross-validation (IWCV; see section 3.3).
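Putting the pieces together, here is a compact numpy sketch of IWLSPC. The estimated importance weights are taken as given, and the hyperparameter values are illustrative, not the ones an actual IWCV run would choose:

```python
import numpy as np

def gauss_kernel(A, B, sigma):
    """K[i, j] = exp(-||A[i] - B[j]||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

def iwlspc_fit_predict(X_tr, y_tr, X_te, w_tr, sigma=1.0, lam=0.1, nu=1.0):
    """Minimal IWLSPC sketch: for each class y, solve
    theta_y = (Q_hat + lam I)^{-1} q_hat_y with Q_hat, q_hat_y estimated from
    importance-weighted training samples; posteriors are clipped at zero and
    renormalized. Class labels may be arbitrary (np.unique ordering is used).
    """
    n_tr, n_te = len(X_tr), len(X_te)
    Phi = gauss_kernel(X_tr, X_te, sigma)          # (n_tr, n_te): K(x_i^tr, x_l^te)
    wv = w_tr ** nu                                # flattened importance weights
    Q = (Phi * wv[:, None]).T @ Phi / n_tr         # Q_hat
    classes = np.unique(y_tr)
    Theta = []
    for y in classes:
        m = (y_tr == y)
        prior = m.mean()                           # n_tr^(y) / n_tr
        q_y = prior * (Phi[m] * wv[m, None]).mean(axis=0)  # q_hat_y
        Theta.append(np.linalg.solve(Q + lam * np.eye(n_te), q_y))
    act = np.maximum(0.0, gauss_kernel(X_te, X_te, sigma) @ np.array(Theta).T)
    Z = act.sum(axis=1, keepdims=True)
    post = np.where(Z > 0, act / np.where(Z > 0, Z, 1.0), 1.0 / len(classes))
    return classes[np.argmax(post, axis=1)], post

# two well-separated 1-D clusters, uniform importance weights
rng = np.random.default_rng(0)
X_tr = np.vstack([rng.normal(-2, 1, (40, 1)), rng.normal(2, 1, (40, 1))])
y_tr = np.array([0] * 40 + [1] * 40)
X_te = np.array([[-2.0], [2.0], [-1.5], [1.5]])
pred, post = iwlspc_fit_predict(X_tr, y_tr, X_te, np.ones(80))
print(pred)  # [0 1 0 1]
```

Note that each class requires only one linear solve of size n_te, which is the source of IWLSPC's speed advantage over IWKLR discussed next.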
An importance-weighted variant of kernel logistic regression (IWKLR; see
section 2.3.2) can also be used for estimating the class-posterior
probability under covariate shift [204]. However, training a large-scale
IWKLR model is computationally challenging, since it requires numerical
optimization of the all-class parameter of dimension c × n_te. In contrast,
IWLSPC optimizes the classwise parameter θ_y of dimension n_te separately for
each of the c classes in analytic form.
7.5.3 Experimental Results
Here, we apply IWLSPC to real-world human activity recognition.
We use three-axis accelerometric data collected by an iPod Touch.³ In the data
collection procedure, subjects were asked to perform a specific task such as
walking, running, or bicycle riding. The duration of each task was arbitrary,
and the sampling rate was 20 Hz with small variations. An example of
three-axis accelerometric data for walking is plotted in figure 7.4.
To extract features from the accelerometric data, each data stream was
segmented in a sliding-window manner, with a window width of five seconds and
a sliding step of one second. Depending on the subject, the position and
orientation of the iPod Touch was arbitrary: held in the hand or kept in a
pocket or a bag. For this reason, we decided to take the ℓ₂-norm of the
three-dimensional acceleration vector at each time step, and computed the
following
3. The data set is available from http://alkan.mns.kyutech.ac.jp/web/data.html.
Figure 7.4
Example of three-axis accelerometric data for walking.
five orientation-invariant features from each window: mean, standard
deviation, fluctuation of amplitude, average energy, and frequency-domain
entropy [11, 15].
Let us consider a situation in which two new users want to use the activity
recognition system. Since they do not want to label their accelerometric
data, only unlabeled test samples are available from the new users. Labeled
data obtained from 20 existing users are available for training the new
users' classifiers. Each existing user has at most 100 labeled samples per
action.
We compared the performance of the following six classification methods.
• LapRLS⁴+IWCV A semisupervised learning method called Laplacian regularized
least squares (LapRLS) [30] with Gaussian kernels as basis functions. The
hyperparameters are selected by IWCV.
4. Laplacian regularized least squares (LapRLS) is a standard semisupervised learning method
that tries to impose smoothness over a nonlinear data manifold [30]. Let us consider a binary
classification problem where y ∈ {+1, −1}. LapRLS uses a kernel model for class prediction:

k(x; θ) := Σ_{ℓ=1}^{n_te} θ_ℓ K(x, x_ℓ^te).
• LapRLS+CV LapRLS with hyperparameters chosen by ordinary CV (i.e., no
importance weighting).
• IWKLR+IWCV An adaptive and regularized variant of IWKLR (see
section 2.3.2) with Gaussian kernels as basis functions. The hyperparameters
are selected by IWCV. A MATLAB implementation of a limited-memory BFGS
(Broyden-Fletcher-Goldfarb-Shanno) quasi-Newton method included in the
minFunc package [140] was used for optimization.
• KLR+CV IWKLR+IWCV without importance weighting.
• IWLSPC+IWCV The probabilistic classification method described in
section 7.5.2. The hyperparameters are chosen by IWCV.
• LSPC+CV IWLSPC+IWCV without importance weighting.
The uLSIF method (see section 4.6) was used for importance estimation.
For stabilization purposes, estimated importance weights were averaged in a
userwise manner in IWCV.
The experiments were repeated 50 times with different sample choices.
Table 7.7 shows the experimental results for each new user (denoted u1 and
u2) in three binary classification tasks: walk vs. run, walk vs. riding a
bicycle, and walk vs. taking a train. The table shows that IWKLR+IWCV and
IWLSPC+IWCV compare favorably with the other methods in terms of
classification accuracy. Table 7.8 shows the computation time for training
The parameter θ is determined as

min_θ [ (1/n_tr) Σ_{i=1}^{n_tr} ( k(x_i^tr; θ) − y_i^tr )² + λ θᵀθ + η Σ_{i,i'=1}^{n} L_{i,i'} k(x_i; θ) k(x_{i'}; θ) ],

where the first term is the goodness of fit, the second term is the ℓ₂-regularizer to avoid
overfitting, and the third term is the Laplacian regularizer to impose smoothness over the data
manifold. Here n := n_tr + n_te, L := D − W is an n × n graph Laplacian matrix, W is an affinity
matrix defined by

W_{i,i'} := exp( −‖x_i − x_{i'}‖² / (2τ²) ),

(x_1, ..., x_n) := (x_1^tr, ..., x_{n_tr}^tr, x_1^te, ..., x_{n_te}^te), τ is an affinity-controlling
parameter, and D is the diagonal matrix given by D_{i,i} := Σ_{i'=1}^{n} W_{i,i'}.
The solution of LapRLS can be computed analytically since the optimization problem is an
unconstrained quadratic program. However, covariate shift is not taken into account in LapRLS,
and thus it will not perform well if the training and test distributions are significantly different.
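The graph-Laplacian construction in this footnote can be sketched as follows (the value of τ is illustrative):

```python
import numpy as np

def graph_laplacian(X_all, tau):
    """Build the LapRLS graph Laplacian L = D - W over all inputs
    (training samples followed by test samples), with the Gaussian affinity
    W[i, i'] = exp(-||x_i - x_{i'}||^2 / (2 tau^2)) and D the diagonal matrix
    of row sums of W.
    """
    d2 = ((X_all[:, None, :] - X_all[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2 * tau ** 2))
    D = np.diag(W.sum(axis=1))
    return D - W

X = np.vstack([np.zeros((3, 2)), np.ones((2, 2))])  # n_tr = 3, n_te = 2
L = graph_laplacian(X, tau=1.0)
print(L.shape)  # (5, 5)
# rows of a graph Laplacian sum to zero, and L is symmetric
```

With L in hand, the quadratic objective above can be minimized by a single linear solve, which is what "analytically computable" means here.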
Table 7.7
Mean misclassification rates [%] and standard deviations (in parentheses), averaged over
50 trials for each new user (u1 and u2) in human activity recognition

Walk vs.      LapRLS+CV   LapRLS+IWCV  KLR+CV      IWKLR+IWCV  LSPC+CV     IWLSPC+IWCV
Run (u1)      21.2 (4.7)  11.4 (4.0)   15.0 (6.6)  9.0 (0.5)   13.3 (3.9)  9.0 (0.4)
Bicycle (u1)  9.9 (1.1)   12.6 (1.3)   9.5 (0.7)   8.7 (4.6)   9.7 (0.7)   8.7 (5.0)
Train (u1)    2.2 (0.2)   2.2 (0.2)    1.4 (0.4)   1.2 (1.5)   1.5 (0.4)   1.1 (1.5)
Run (u2)      24.7 (5.0)  20.6 (9.7)   25.6 (1.3)  22.4 (6.1)  25.6 (0.8)  21.9 (5.9)
Bicycle (u2)  13.0 (1.5)  14.0 (2.1)   11.1 (1.8)  11.0 (1.7)  10.9 (1.8)  10.4 (1.7)
Train (u2)    3.9 (1.3)   3.7 (1.1)    3.6 (0.6)   2.9 (0.7)   3.5 (0.5)   3.1 (0.5)

A number in boldface indicates that the method is the best or comparable to the best in terms
of the mean misclassification rate according to a t-test at the 5 percent significance level.
Table 7.8
Mean computation time (sec) and standard deviations (in parentheses), averaged over 50 trials
for each new user (u1 and u2) in human activity recognition

Walk vs.      LapRLS+CV   LapRLS+IWCV  KLR+CV       IWKLR+IWCV    LSPC+CV    IWLSPC+IWCV
Run (u1)      14.1 (0.7)  14.5 (0.8)   86.8 (16.2)  78.8 (23.2)   7.3 (1.1)  6.6 (1.3)
Bicycle (u1)  38.8 (4.8)  52.8 (8.1)   38.8 (4.8)   52.8 (8.1)    4.2 (0.8)  3.7 (0.8)
Train (u1)    5.5 (0.6)   5.4 (0.6)    19.8 (7.3)   30.9 (6.0)    3.9 (0.8)  4.0 (0.8)
Run (u2)      12.6 (2.1)  12.1 (2.2)   70.1 (12.9)  128.5 (51.7)  8.2 (1.3)  7.8 (1.5)
Bicycle (u2)  16.8 (7.0)  27.2 (5.6)   16.8 (7.0)   27.2 (5.6)    3.7 (0.8)  3.1 (0.9)
Train (u2)    5.6 (0.7)   5.6 (0.6)    24.9 (10.8)  29.4 (10.3)   4.1 (0.8)  3.9 (0.8)

A number in boldface indicates that the method is the best or comparable to the best in terms
of the mean computation time according to a t-test at the 5 percent significance level.
classifiers, showing that the LSPC-based methods are computationally much
more efficient than the KLR-based methods.
Figure 7.5 depicts the mean misclassification rate for various coverage levels, where the coverage is the ratio of test samples used for evaluating the misclassification rate. For example, a coverage of 0.8 means that the 80 percent of test samples with the highest confidence levels (obtained from an estimated class-posterior probability) are used for evaluating the misclassification rate. This represents a realistic situation where predictions with low confidence are rejected; if a prediction is rejected, the prediction obtained in the previous time step is inherited, since an action usually continues for a certain duration.
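The coverage-based evaluation described above can be sketched as follows. This is a minimal illustration with hypothetical names, assuming the confidence of a prediction is measured by the maximum estimated class-posterior probability:

```python
import numpy as np

def misclassification_at_coverage(posteriors, y_true, coverage):
    """Misclassification rate on the most confident fraction (= coverage)
    of test samples; less confident predictions are rejected.

    posteriors : (n_test, n_classes) estimated class-posterior probabilities
    y_true     : (n_test,) true labels
    coverage   : fraction of test samples to keep (e.g., 0.8)
    """
    confidence = posteriors.max(axis=1)      # confidence of each prediction
    y_pred = posteriors.argmax(axis=1)       # predicted labels
    n_keep = int(np.ceil(coverage * len(y_true)))
    keep = np.argsort(-confidence)[:n_keep]  # most confident samples first
    return np.mean(y_pred[keep] != y_true[keep])

# toy example
post = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.55, 0.45]])
labels = np.array([0, 1, 1, 0])
print(misclassification_at_coverage(post, labels, 0.5))  # -> 0.0
```

At coverage 1.0 no prediction is rejected, and the ordinary misclassification rate is recovered.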
[Figure 7.5 appears here: six panels plotting misclassification rate against coverage, (a) walk vs. run (u1), (b) walk vs. run (u2), (c) walk vs. bicycle (u1), (d) walk vs. bicycle (u2), (e) walk vs. train (u1), (f) walk vs. train (u2).]

Figure 7.5
Misclassification rate as a function of coverage for each new user (specified by u1 and u2) in human activity recognition.
The graphs show that for most coverage levels, IWLSPC+IWCV outperforms LSPC+CV. The misclassification rates of IWLSPC+IWCV and LSPC+CV tend to decrease as the coverage level decreases, implying that the confidence estimation by (IW)LSPC is reliable, since erroneous predictions can be successfully rejected.
7.6 Sample Reuse in Reinforcement Learning
Reinforcement learning (RL) [174] is a framework that allows a robot agent to act optimally in an unknown environment through interaction. Because of its usefulness and generality, reinforcement learning has been gathering a great deal of attention in the machine learning, artificial intelligence, and robotics communities. In this section, we show that covariate shift adaptation techniques can be successfully employed in the reinforcement learning scenario [66].
7.6.1 Markov Decision Problems
Let us consider a Markov decision problem (MDP) specified by

$$(\mathcal{S}, \mathcal{A}, p_{\mathrm{I}}, p_{\mathrm{T}}, R, \gamma),$$

where

• $\mathcal{S}$ is a set of states,
• $\mathcal{A}$ is a set of actions,
• $p_{\mathrm{I}}(s) \in [0, 1]$ is the initial-state probability density,
• $p_{\mathrm{T}}(s'|s, a) \in [0, 1]$ is the transition probability density from state $s$ to state $s'$ when action $a$ is taken,
• $R(s, a, s') \in \mathbb{R}$ is a reward for transition from $s$ to $s'$ by taking action $a$,
• $\gamma \in (0, 1]$ is the discount factor for future rewards.
Let $\pi(a|s) \in [0, 1]$ be a stochastic policy of the agent, which is the conditional probability density of taking action $a$ given state $s$. The state-action value function $Q^{\pi}(s, a) \in \mathbb{R}$ for policy $\pi$ is the expected discounted sum of rewards the agent will receive when taking action $a$ in state $s$ and following policy $\pi$ thereafter. That is,

$$Q^{\pi}(s, a) := \mathbb{E}_{\pi, p_{\mathrm{T}}}\left[\sum_{n=1}^{\infty} \gamma^{n-1} R(s_n, a_n, s_{n+1}) \,\Big|\, s_1 = s,\ a_1 = a\right],$$

where $\mathbb{E}_{\pi, p_{\mathrm{T}}}$ denotes the expectation over $\{s_n, a_n\}_{n=1}^{\infty}$ following $\pi(a_n|s_n)$ and $p_{\mathrm{T}}(s_{n+1}|s_n, a_n)$.
The goal of reinforcement learning is to obtain the policy which maximizes the sum of future rewards. The optimal policy can be expressed as follows:⁵

$$\pi^*(a|s) = \delta\bigl(a - \mathop{\mathrm{argmax}}_{a'}\, Q^*(s, a')\bigr),$$

where $\delta(\cdot)$ is the Dirac delta function and $Q^*(s, a)$ is the optimal state-action value function defined by

$$Q^*(s, a) := \max_{\pi} Q^{\pi}(s, a).$$

$Q^{\pi}(s, a)$ can be expressed in the following recurrent form, called the Bellman equation [174]:

$$Q^{\pi}(s, a) = R(s, a) + \gamma\, \mathbb{E}_{p_{\mathrm{T}}(s'|s,a)}\, \mathbb{E}_{\pi(a'|s')}\bigl[Q^{\pi}(s', a')\bigr], \quad \forall s \in \mathcal{S},\ \forall a \in \mathcal{A}, \qquad (7.2)$$

where $R(s, a)$ is the expected reward when the agent takes action $a$ in state $s$:

$$R(s, a) := \mathbb{E}_{p_{\mathrm{T}}(s'|s,a)}\bigl[R(s, a, s')\bigr].$$

$\mathbb{E}_{p_{\mathrm{T}}(s'|s,a)}$ denotes the conditional expectation of $s'$ over $p_{\mathrm{T}}(s'|s, a)$, given $s$ and $a$. $\mathbb{E}_{\pi(a'|s')}$ denotes the conditional expectation of $a'$ over $\pi(a'|s')$, given $s'$.
7.6.2 Policy Iteration
The computation of the value function $Q^{\pi}(s, a)$ is called policy evaluation. With $Q^{\pi}(s, a)$, one can find a better policy $\pi'(a|s)$ by means of

$$\pi'(a|s) = \delta\bigl(a - \mathop{\mathrm{argmax}}_{a'}\, Q^{\pi}(s, a')\bigr).$$

This is called (greedy) policy improvement. It is known that repeating policy evaluation and policy improvement results in the optimal policy $\pi^*(a|s)$ [174]. This entire process is called policy iteration:

$$\pi_1 \xrightarrow{E} Q^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} Q^{\pi_2} \xrightarrow{I} \pi_3 \xrightarrow{E} \cdots \xrightarrow{I} \pi^*,$$
5. We assume that given state $s$, there is only one action maximizing the optimal value function $Q^*(s, a)$.
where $\pi_1$ is an initial policy. $E$ and $I$ indicate the policy evaluation and policy improvement steps, respectively. For technical reasons, we assume that all policies are strictly positive (i.e., all actions have nonzero probability densities). In order to guarantee this, explorative policy improvement strategies such as the Gibbs policy and the $\varepsilon$-greedy policy are used here. In the case of the Gibbs policy,

$$\pi'(a|s) = \frac{\exp\bigl(Q^{\pi}(s, a)/\tau\bigr)}{\int_{\mathcal{A}} \exp\bigl(Q^{\pi}(s, a')/\tau\bigr)\, \mathrm{d}a'}, \qquad (7.3)$$

where $\tau$ is a positive parameter which determines the randomness of the new policy $\pi'$. In the case of the $\varepsilon$-greedy policy,

$$\pi'(a|s) = \begin{cases} 1 - \varepsilon + \varepsilon/|\mathcal{A}| & \text{if } a = a^*, \\ \varepsilon/|\mathcal{A}| & \text{otherwise}, \end{cases} \qquad (7.4)$$

where

$$a^* = \mathop{\mathrm{argmax}}_{a}\, Q^{\pi}(s, a),$$

and $\varepsilon \in (0, 1]$ determines how stochastic the new policy $\pi'$ is.
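For a discrete action set, the integral in the Gibbs policy becomes a sum, and both strategies can be sketched as follows (a minimal illustration with hypothetical function names):

```python
import numpy as np

def gibbs_policy(q_values, tau=1.0):
    """Gibbs (softmax) policy improvement over a discrete action set:
    pi'(a|s) proportional to exp(Q(s,a)/tau)."""
    z = q_values / tau
    z = z - z.max()          # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def epsilon_greedy_policy(q_values, eps=0.1):
    """Epsilon-greedy policy improvement: the greedy action gets
    probability 1 - eps + eps/|A|, every other action eps/|A|."""
    n = len(q_values)
    p = np.full(n, eps / n)
    p[np.argmax(q_values)] += 1.0 - eps
    return p

q = np.array([1.0, 2.0, 0.5])     # Q(s, .) for three actions
print(gibbs_policy(q, tau=1.0))   # stochastic, favors action 1
print(epsilon_greedy_policy(q, 0.1))
```

Smaller τ (or smaller ε) makes the improved policy closer to the greedy one.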
7.6.3 Value Function Approximation
Although policy iteration is guaranteed to produce the optimal policy, it is often computationally intractable, since the number of state-action pairs $|\mathcal{S}| \times |\mathcal{A}|$ is very large; $|\mathcal{S}|$ or $|\mathcal{A}|$ can even become infinite when the state space or action space is continuous. To overcome this problem, the state-action value function $Q^{\pi}(s, a)$ is approximated using the following linear model [174, 124, 99]:

$$\widehat{Q}^{\pi}(s, a; \boldsymbol{\theta}) := \sum_{b=1}^{B} \theta_b\, \phi_b(s, a) = \boldsymbol{\theta}^{\top} \boldsymbol{\phi}(s, a),$$

where

$$\boldsymbol{\phi}(s, a) = \bigl(\phi_1(s, a), \phi_2(s, a), \ldots, \phi_B(s, a)\bigr)^{\top}$$

are the fixed basis functions, $\top$ denotes the transpose, $B$ is the number of basis functions, and

$$\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_B)^{\top}$$

are model parameters. Note that $B$ is usually chosen to be much smaller than $|\mathcal{S}| \times |\mathcal{A}|$.
For $N$-step transitions, the ideal way to learn the parameters $\boldsymbol{\theta}$ is to minimize the approximation error of the state-action value function $Q^{\pi}(s, a)$:

$$\min_{\boldsymbol{\theta}}\ \mathbb{E}_{p_{\mathrm{I}}, \pi, p_{\mathrm{T}}}\left[\frac{1}{N} \sum_{n=1}^{N} \bigl(\widehat{Q}^{\pi}(s_n, a_n; \boldsymbol{\theta}) - Q^{\pi}(s_n, a_n)\bigr)^2\right],$$

where $\mathbb{E}_{p_{\mathrm{I}}, \pi, p_{\mathrm{T}}}$ denotes the expectation over $\{s_n, a_n\}_{n=1}^{N}$ following the initial-state probability density $p_{\mathrm{I}}(s_1)$, the policy $\pi(a_n|s_n)$, and the transition probability density $p_{\mathrm{T}}(s_{n+1}|s_n, a_n)$.

A fundamental problem with the above formulation is that the target function $Q^{\pi}(s, a)$ cannot be observed directly. To cope with this problem, we attempt to minimize the square of the Bellman residual instead [99]:

$$\boldsymbol{\theta}^* := \mathop{\mathrm{argmin}}_{\boldsymbol{\theta}}\, G, \qquad (7.5)$$

$$G := \mathbb{E}_{p_{\mathrm{I}}, \pi, p_{\mathrm{T}}}\left[\frac{1}{N} \sum_{n=1}^{N} g(s_n, a_n; \boldsymbol{\theta})\right], \qquad (7.6)$$

$$g(s, a; \boldsymbol{\theta}) := \Bigl(\widehat{Q}^{\pi}(s, a; \boldsymbol{\theta}) - R(s, a) - \gamma\, \mathbb{E}_{p_{\mathrm{T}}(s'|s,a)}\, \mathbb{E}_{\pi(a'|s')}\bigl[\widehat{Q}^{\pi}(s', a'; \boldsymbol{\theta})\bigr]\Bigr)^2,$$

where $g(s, a; \boldsymbol{\theta})$ is the approximation error for one step $(s, a)$ derived from the Bellman equation⁶ (equation 7.2).
7.6.4 Sample Reuse by Covariate Shift Adaptation

In policy iteration, the optimal policy is obtained by iteratively performing policy evaluation and improvement steps [174, 13]. When policies are updated, many popular policy iteration methods require the user to gather new samples following the updated policy, and the new samples are used for value function approximation. However, this approach is inefficient, particularly when
6. Note that $g(s, a; \boldsymbol{\theta})$ with a reward observation $r$ instead of the expected reward $R(s, a)$ corresponds to the square of the temporal difference (TD) error:

$$g_{\mathrm{TD}}(s, a, r; \boldsymbol{\theta}) := \Bigl(\widehat{Q}^{\pi}(s, a; \boldsymbol{\theta}) - r - \gamma\, \mathbb{E}_{p_{\mathrm{T}}(s'|s,a)}\, \mathbb{E}_{\pi(a'|s')}\bigl[\widehat{Q}^{\pi}(s', a'; \boldsymbol{\theta})\bigr]\Bigr)^2.$$

Although we use the Bellman residual for measuring the approximation error, it can easily be replaced with the TD error.
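As the footnote notes, replacing the expected reward with an observed one gives the squared TD error. A minimal sketch for a linear model follows; all names are our own, and the inner expectations are replaced by a single sampled next state-action pair:

```python
import numpy as np

def squared_td_error(theta, gamma, transitions):
    """Average squared TD error for a linear model Q(s,a) = theta^T phi(s,a).

    transitions : list of (phi_sa, r, phi_next) tuples, where phi_sa and
    phi_next are feature vectors of the current and sampled next
    state-action pairs, and r is the observed reward.
    """
    total = 0.0
    for phi_sa, r, phi_next in transitions:
        # TD residual: Q(s,a) - r - gamma * Q(s',a')
        td = theta @ phi_sa - r - gamma * (theta @ phi_next)
        total += td ** 2
    return total / len(transitions)

theta = np.array([1.0, 0.5])
trans = [(np.array([1.0, 0.0]), 1.0, np.array([0.0, 0.0]))]
print(squared_td_error(theta, 0.9, trans))  # -> 0.0 (residual vanishes here)
```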
the sampling cost is high, and it will be more cost-efficient if one can reuse the data collected in the past. A situation where the sampling policy (a policy used for gathering data samples) and the current policy are different is called off-policy reinforcement learning [174].

In the off-policy setup, simply employing a standard policy iteration method such as least-squares policy iteration [99] does not lead to the optimal policy, as the sampling policy can introduce bias into value function approximation. This distribution mismatch problem can be eased by the use of importance-weighting techniques, which cancel the bias asymptotically. However, the approximation error is not necessarily small when the bias is reduced by importance weighting; the variance of estimators also needs to be taken into account, since the approximation error is the sum of the squared bias and the variance. Due to large variance, naive importance-weighting techniques used in reinforcement learning tend to be unstable [174, 125].
To overcome the instability problem, an adaptive importance-weighting technique is useful. As shown in section 2.1.2, the adaptive importance-weighted estimator smoothly bridges the ordinary estimator and the importance-weighted estimator, allowing one to control the trade-off between bias and variance. Thus, given that the trade-off parameter is determined carefully, the optimal performance may be achieved in terms of both bias and variance. However, the optimal value of the trade-off parameter depends heavily on the data samples and policies, and therefore using a predetermined parameter value may not always be effective in practice.

For optimally choosing the value of the trade-off parameter, importance-weighted cross-validation [160] (see section 3.3) enables one to estimate the approximation error of value functions in an almost unbiased manner, even under off-policy situations. Thus, one can adaptively choose the trade-off parameter based on the data samples at hand.
7.6.5 On-Policy vs. Off-Policy

Suppose that a data set consisting of $M$ episodes of $N$ steps is available. The agent initially starts from a randomly selected state $s_1$ following the initial-state probability density $p_{\mathrm{I}}(s)$ and chooses an action based on a sampling policy $\widetilde{\pi}(a_n|s_n)$. Then the agent makes a transition following $p_{\mathrm{T}}(s_{n+1}|s_n, a_n)$ and receives a reward $r_n$ $(= R(s_n, a_n, s_{n+1}))$. This is repeated for $N$ steps; thus, the training data $\mathcal{D}^{\widetilde{\pi}}$ are expressed as

$$\mathcal{D}^{\widetilde{\pi}} := \{d_m^{\widetilde{\pi}}\}_{m=1}^{M},$$

where each episodic sample $d_m^{\widetilde{\pi}}$ consists of a set of quadruple elements as

$$d_m^{\widetilde{\pi}} := \bigl\{(s_{m,n}^{\widetilde{\pi}}, a_{m,n}^{\widetilde{\pi}}, r_{m,n}^{\widetilde{\pi}}, s_{m,n+1}^{\widetilde{\pi}})\bigr\}_{n=1}^{N}.$$
Two types of policies which have different purposes are used here: the sampling policy $\widetilde{\pi}(a|s)$ for collecting data samples, and the current policy $\pi(a|s)$ for computing the value function $Q^{\pi}$. When $\widetilde{\pi}(a|s)$ is equal to $\pi(a|s)$ (the situation called on-policy), just replacing the expectation contained in the error $G$ defined in equation 7.6 by a sample average gives a consistent estimator (i.e., the estimated parameter converges to the optimal value as the number of episodes $M$ goes to infinity):

$$\widehat{\boldsymbol{\theta}}_{\mathrm{NIW}} := \mathop{\mathrm{argmin}}_{\boldsymbol{\theta}}\, \widehat{G}_{\mathrm{NIW}}, \qquad \widehat{G}_{\mathrm{NIW}} := \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \widehat{g}_{m,n}, \qquad \widehat{g}_{m,n} := \widehat{g}\bigl(s_{m,n}^{\widetilde{\pi}}, a_{m,n}^{\widetilde{\pi}}; \boldsymbol{\theta}, \mathcal{D}^{\widetilde{\pi}}\bigr),$$

$$\widehat{g}(s, a; \boldsymbol{\theta}, \mathcal{D}) := \Biggl(\widehat{Q}^{\pi}(s, a; \boldsymbol{\theta}) - \frac{1}{|\mathcal{D}_{(s,a)}|} \sum_{r \in \mathcal{D}_{(s,a)}} r - \frac{\gamma}{|\mathcal{D}_{(s,a)}|} \sum_{s' \in \mathcal{D}_{(s,a)}} \mathbb{E}_{\pi(a'|s')}\bigl[\widehat{Q}^{\pi}(s', a'; \boldsymbol{\theta})\bigr]\Biggr)^2,$$

where $\mathcal{D}_{(s,a)}$ is the set of quadruple elements containing state $s$ and action $a$ in the training data $\mathcal{D}$, and $\sum_{r \in \mathcal{D}_{(s,a)}}$ and $\sum_{s' \in \mathcal{D}_{(s,a)}}$ denote the summation over $r$ and $s'$ in the set $\mathcal{D}_{(s,a)}$, respectively. Note that NIW stands for "no importance weight."
However, in reality, $\widetilde{\pi}(a|s)$ is usually different from $\pi(a|s)$, since the current policy is updated in policy iteration. The situation where $\widetilde{\pi}(a|s)$ is different from $\pi(a|s)$ is called off-policy. In the off-policy setup, $\widehat{\boldsymbol{\theta}}_{\mathrm{NIW}}$ is no longer consistent. This inconsistency can be avoided by gathering new samples; that is, when the current policy is updated, new samples are gathered following the updated policy, and the new samples are used for policy evaluation. However, when the data sampling cost is high, this is not cost-efficient; it would be more practical if one could reuse the previously gathered samples.
7.6.6 Importance Weighting in Value Function Approximation

For coping with the off-policy situation, we would like to use importance-weighting techniques. However, this is not straightforward, since our training samples of state $s$ and action $a$ are not i.i.d. due to the sequential nature of MDPs. Below, standard importance-weighting techniques in the context of MDPs are reviewed.
7.6.6.1 Episodic Importance Weighting The method of episodic importance weighting (EIW) [174] utilizes the independence between episodes:

$$p(d, d') = p(d)\, p(d').$$

Based on the independence between episodes, the error $G$ defined by equation 7.6 can be rewritten as

$$G = \mathbb{E}_{p_{\widetilde{\pi}}}\left[\frac{1}{N} \sum_{n=1}^{N} g(s_n, a_n; \boldsymbol{\theta})\, w_N\right],$$

where

$$w_N := \frac{p_{\pi}(d)}{p_{\widetilde{\pi}}(d)}.$$

$p_{\pi}(d)$ and $p_{\widetilde{\pi}}(d)$ are the probability densities of observing episodic data $d$ under policy $\pi$ and policy $\widetilde{\pi}$, respectively:

$$p_{\pi}(d) := p_{\mathrm{I}}(s_1) \prod_{n=1}^{N} \pi(a_n|s_n)\, p_{\mathrm{T}}(s_{n+1}|s_n, a_n),$$

$$p_{\widetilde{\pi}}(d) := p_{\mathrm{I}}(s_1) \prod_{n=1}^{N} \widetilde{\pi}(a_n|s_n)\, p_{\mathrm{T}}(s_{n+1}|s_n, a_n).$$

Note that the importance weights can be computed without explicitly knowing $p_{\mathrm{I}}$ and $p_{\mathrm{T}}$, since they are canceled out:

$$w_N = \prod_{n=1}^{N} \frac{\pi(a_n|s_n)}{\widetilde{\pi}(a_n|s_n)}.$$

Thus, in the off-policy reinforcement learning scenario, importance estimation (chapter 4) is not necessary. Using the training data $\mathcal{D}^{\widetilde{\pi}}$, one can construct a consistent estimator of $G$ as
$$\widehat{G}_{\mathrm{EIW}} := \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \widehat{g}_{m,n}\, \widehat{w}_{m,N}, \qquad (7.7)$$

where

$$\widehat{w}_{m,N} := \prod_{n'=1}^{N} \frac{\pi(a_{m,n'}^{\widetilde{\pi}} \,|\, s_{m,n'}^{\widetilde{\pi}})}{\widetilde{\pi}(a_{m,n'}^{\widetilde{\pi}} \,|\, s_{m,n'}^{\widetilde{\pi}})}.$$

Based on this, the parameter $\boldsymbol{\theta}$ is estimated by

$$\widehat{\boldsymbol{\theta}}_{\mathrm{EIW}} := \mathop{\mathrm{argmin}}_{\boldsymbol{\theta}}\, \widehat{G}_{\mathrm{EIW}}.$$
7.6.6.2 Per-Decision Importance Weighting A more sophisticated importance-weighting technique, called the per-decision importance weighting (PIW) method, was proposed in [125]. A crucial observation in PIW is that the error at the $n$-th step does not depend on the samples after the $n$-th step; that is, the error $G$ can be rewritten as

$$G = \mathbb{E}_{p_{\widetilde{\pi}}}\left[\frac{1}{N} \sum_{n=1}^{N} g(s_n, a_n; \boldsymbol{\theta})\, w_n\right], \qquad w_n := \prod_{n'=1}^{n} \frac{\pi(a_{n'}|s_{n'})}{\widetilde{\pi}(a_{n'}|s_{n'})}.$$

Using the training data $\mathcal{D}^{\widetilde{\pi}}$, one can construct a consistent estimator as follows (cf. equation 7.7):

$$\widehat{G}_{\mathrm{PIW}} := \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \widehat{g}_{m,n}\, \widehat{w}_{m,n}. \qquad (7.8)$$

$\widehat{w}_{m,n}$ in equation 7.8 contains only the relevant terms up to the $n$-th step, while $\widehat{w}_{m,N}$ in equation 7.7 includes all the terms up to the end of the episode.

Based on this, the parameter $\boldsymbol{\theta}$ is estimated by

$$\widehat{\boldsymbol{\theta}}_{\mathrm{PIW}} := \mathop{\mathrm{argmin}}_{\boldsymbol{\theta}}\, \widehat{G}_{\mathrm{PIW}}.$$
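For a single episode, both weighting schemes reduce to running products of policy ratios: the per-decision weights are the cumulative products, and the episodic weight is the final entry. A minimal sketch (hypothetical names), assuming the probabilities of the chosen actions under both policies are given:

```python
import numpy as np

def importance_weights(pi_probs, pi_tilde_probs):
    """Per-decision importance weights for one episode of N steps:
    w_n = prod_{n' <= n} pi(a_{n'}|s_{n'}) / pi~(a_{n'}|s_{n'}).
    The episodic weight w_N (EIW) is simply the last entry."""
    ratios = np.asarray(pi_probs) / np.asarray(pi_tilde_probs)
    return np.cumprod(ratios)

pi = [0.9, 0.8, 0.7]     # current-policy probabilities of the chosen actions
pi_t = [0.5, 0.5, 0.5]   # sampling-policy probabilities of the same actions
w = importance_weights(pi, pi_t)
print(w)       # per-decision weights w_1, w_2, w_3
print(w[-1])   # episodic weight w_N
```

Note how the product form lets the initial-state and transition densities cancel out, exactly as in the derivation above.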
7.6.6.3 Adaptive Per-Decision Importance Weighting The importance-weighted estimator $\widehat{\boldsymbol{\theta}}_{\mathrm{PIW}}$ (and also $\widehat{\boldsymbol{\theta}}_{\mathrm{EIW}}$) is guaranteed to be consistent. However, neither is efficient in the statistical sense [145]; that is, they do not have the smallest admissible variance. For this reason, $\widehat{\boldsymbol{\theta}}_{\mathrm{PIW}}$ can have large variance
in finite sample cases, and therefore learning with PIW can be unstable in practice. Below, an adaptive importance-weighting method (see section 2.1.2) is employed for enhancing stability.

In order to improve the estimation accuracy, it is important to control the trade-off between consistency and efficiency (or, similarly, bias and variance) based on the training data. Here, the flattening parameter $\nu$ $(\in [0, 1])$ is introduced to control the trade-off by slightly "flattening" the importance weights [145, 160]:

$$\widehat{G}_{\mathrm{AIW}} := \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \widehat{g}_{m,n}\, \widehat{w}_{m,n}^{\,\nu}, \qquad (7.9)$$

where AIW stands for "adaptive PIW." Based on this, the parameter $\boldsymbol{\theta}$ is estimated as follows:

$$\widehat{\boldsymbol{\theta}}_{\mathrm{AIW}} := \mathop{\mathrm{argmin}}_{\boldsymbol{\theta}}\, \widehat{G}_{\mathrm{AIW}}.$$

When $\nu = 0$, $\widehat{\boldsymbol{\theta}}_{\mathrm{AIW}}$ is reduced to the ordinary estimator $\widehat{\boldsymbol{\theta}}_{\mathrm{NIW}}$; therefore, it has large bias but relatively small variance. On the other hand, when $\nu = 1$, $\widehat{\boldsymbol{\theta}}_{\mathrm{AIW}}$ is reduced to the importance-weighted estimator $\widehat{\boldsymbol{\theta}}_{\mathrm{PIW}}$; therefore, it has small bias but relatively large variance. In practice, an intermediate $\nu$ would yield the best performance.

The solution $\widehat{\boldsymbol{\theta}}_{\mathrm{AIW}}$ can be computed analytically as follows [66]. Since $\widehat{g}(s, a; \boldsymbol{\theta}, \mathcal{D}) = \bigl(\widehat{\boldsymbol{\psi}}(s, a; \mathcal{D})^{\top} \boldsymbol{\theta} - \bar{r}(s, a; \mathcal{D})\bigr)^2$, where $\bar{r}(s, a; \mathcal{D}) := \frac{1}{|\mathcal{D}_{(s,a)}|} \sum_{r \in \mathcal{D}_{(s,a)}} r$, minimizing $\widehat{G}_{\mathrm{AIW}}$ is a weighted least-squares problem, and

$$\widehat{\boldsymbol{\theta}}_{\mathrm{AIW}} = \Biggl(\sum_{m=1}^{M} \sum_{n=1}^{N} \widehat{w}_{m,n}^{\,\nu}\, \widehat{\boldsymbol{\psi}}_{m,n} \widehat{\boldsymbol{\psi}}_{m,n}^{\top}\Biggr)^{-1} \sum_{m=1}^{M} \sum_{n=1}^{N} \widehat{w}_{m,n}^{\,\nu}\, \widehat{\boldsymbol{\psi}}_{m,n}\, \bar{r}_{m,n},$$

where $\widehat{\boldsymbol{\psi}}(s, a; \mathcal{D})$ is the $B$-dimensional column vector defined by

$$\widehat{\boldsymbol{\psi}}(s, a; \mathcal{D}) := \boldsymbol{\phi}(s, a) - \frac{\gamma}{|\mathcal{D}_{(s,a)}|} \sum_{s' \in \mathcal{D}_{(s,a)}} \mathbb{E}_{\pi(a'|s')}\bigl[\boldsymbol{\phi}(s', a')\bigr], \qquad (7.10)$$

$\widehat{\boldsymbol{\psi}}_{m,n} := \widehat{\boldsymbol{\psi}}(s_{m,n}^{\widetilde{\pi}}, a_{m,n}^{\widetilde{\pi}}; \mathcal{D}^{\widetilde{\pi}})$, and $\bar{r}_{m,n} := \bar{r}(s_{m,n}^{\widetilde{\pi}}, a_{m,n}^{\widetilde{\pi}}; \mathcal{D}^{\widetilde{\pi}})$. This implies that the cost of computing $\widehat{\boldsymbol{\theta}}_{\mathrm{AIW}}$ is essentially the same as that for $\widehat{\boldsymbol{\theta}}_{\mathrm{NIW}}$ and $\widehat{\boldsymbol{\theta}}_{\mathrm{PIW}}$.
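Because the objective is quadratic in θ, the AIW solution is a flattened-importance-weighted least-squares fit. The following sketch assumes the residual feature vectors ψ̂, the averaged rewards, and the per-decision weights have already been precomputed (all names are our own):

```python
import numpy as np

def aiw_solution(psi, r, w, nu):
    """Flattened-importance-weighted least squares:
    minimize sum_i w_i^nu * (psi_i^T theta - r_i)^2.

    psi : (n, B) matrix whose rows are residual feature vectors psi(s,a;D)
    r   : (n,)  averaged observed rewards for each visited (s,a)
    w   : (n,)  per-decision importance weights
    nu  : flattening parameter in [0, 1]
    """
    wv = w ** nu
    A = psi.T @ (wv[:, None] * psi)  # sum_i w_i^nu psi_i psi_i^T
    b = psi.T @ (wv * r)             # sum_i w_i^nu psi_i r_i
    return np.linalg.solve(A, b)

# sanity check on synthetic, noiseless data
rng = np.random.default_rng(0)
psi = rng.normal(size=(50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
r = psi @ theta_true
w = rng.uniform(0.5, 2.0, size=50)
print(aiw_solution(psi, r, w, nu=0.5))
```

With noiseless targets the true parameter is recovered for any ν; in the noisy case, ν trades off the bias and variance as described above.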
7.6.7 Automatic Selection of the Flattening Parameter

As shown above, the performance of AIW depends on the choice of the flattening parameter $\nu$. Ideally, $\nu$ is set so that the approximation error $G$ is minimized, but the true $G$ is inaccessible in practice. To cope with this problem, the approximation error $G$ may be replaced by its estimator obtained using IWCV [160] (see section 3.3). Below we explain how IWCV can be applied to the selection of the flattening parameter $\nu$ in the context of value function approximation.

Let us divide a training data set $\mathcal{D}^{\widetilde{\pi}}$ containing $M$ episodes into $K$ subsets $\{\mathcal{D}_k^{\widetilde{\pi}}\}_{k=1}^{K}$ of approximately the same size ($K = 5$ was used in the experiments). For simplicity, assume that $M$ is divisible by $K$. Let $\widehat{\boldsymbol{\theta}}_{\mathrm{AIW}}^{k}$ be the parameter learned from $\{\mathcal{D}_{k'}^{\widetilde{\pi}}\}_{k' \neq k}$ with AIW (see equation 7.9). Then the approximation error is estimated by

$$\widehat{G}_{\mathrm{IWCV}} = \frac{1}{K} \sum_{k=1}^{K} \widehat{G}_{\mathrm{IWCV}}^{k},$$

where $\widehat{G}_{\mathrm{IWCV}}^{k}$ is the importance-weighted error computed from the held-out subset $\mathcal{D}_k^{\widetilde{\pi}}$:

$$\widehat{G}_{\mathrm{IWCV}}^{k} := \frac{K}{MN} \sum_{d_m \in \mathcal{D}_k^{\widetilde{\pi}}} \sum_{n=1}^{N} \widehat{g}_{m,n}\, \widehat{w}_{m,n},$$

with $\widehat{g}_{m,n}$ evaluated at $\widehat{\boldsymbol{\theta}}_{\mathrm{AIW}}^{k}$.
The approximation error is estimated by the above $K$-fold IWCV method for all candidate models (in the current setting, a candidate model corresponds to a different value of the flattening parameter $\nu$). Then the model candidate that minimizes the estimated error is selected:

$$\widehat{\nu}_{\mathrm{IWCV}} = \mathop{\mathrm{argmin}}_{\nu}\, \widehat{G}_{\mathrm{IWCV}}.$$
In general, the use of IWCV is computationally rather expensive, since $\widehat{\boldsymbol{\theta}}_{\mathrm{AIW}}^{k}$ and $\widehat{G}_{\mathrm{IWCV}}^{k}$ need to be computed many times. For example, when performing fivefold IWCV for 11 candidates of the flattening parameter $\nu \in \{0.0, 0.1, \ldots, 0.9, 1.0\}$, $\widehat{\boldsymbol{\theta}}_{\mathrm{AIW}}^{k}$ and $\widehat{G}_{\mathrm{IWCV}}^{k}$ need to be computed 55 times. However, this would be acceptable in practice, for two reasons. First, sensible model selection via IWCV allows one to obtain a much better solution with a small number of samples; thus, in total, the computation time may not grow that much. Second, cross-validation is suitable for parallel computing, since error estimation for different flattening parameters and different folds is independent. For instance, when performing fivefold IWCV for 11 candidates of the flattening parameter, one can compute $\widehat{G}_{\mathrm{IWCV}}$ for all candidates at once in parallel, using 55 CPUs; this would be highly realistic in the current computing environment. If a simulated problem is solved, the storage of all sequences can be more costly than resampling. However, for real-world applications, it is essential to reuse data, and IWCV will be one of the more promising approaches.
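The selection procedure can be sketched as a generic K-fold loop; here `fit` and `iw_error` are hypothetical placeholders for AIW learning (equation 7.9) and held-out importance-weighted error estimation:

```python
import numpy as np

def select_flattening_parameter(episodes, fit, iw_error, candidates, K=5):
    """K-fold IWCV over candidate flattening parameters.

    episodes   : list of M episodes
    fit        : fit(train_episodes, nu) -> theta
    iw_error   : iw_error(theta, val_episodes) -> importance-weighted error
    candidates : candidate values of nu, e.g. [0.0, 0.1, ..., 1.0]
    """
    M = len(episodes)
    folds = np.array_split(np.arange(M), K)
    scores = []
    for nu in candidates:
        err = 0.0
        for k in range(K):
            val_idx = set(folds[k].tolist())
            val = [episodes[i] for i in val_idx]
            train = [episodes[i] for i in range(M) if i not in val_idx]
            theta = fit(train, nu)           # learn on K-1 folds
            err += iw_error(theta, val)      # weighted error on held-out fold
        scores.append(err / K)
    return candidates[int(np.argmin(scores))]

# toy check with dummy fit/error functions: the error is minimized at nu = 0.5
best = select_flattening_parameter(
    list(range(10)),
    fit=lambda train, nu: nu,
    iw_error=lambda theta, val: (theta - 0.5) ** 2,
    candidates=[0.0, 0.25, 0.5, 0.75, 1.0],
)
print(best)  # -> 0.5
```

As noted in the text, the inner loop over folds and candidates is embarrassingly parallel.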
7.6.8 Sample Reuse Policy Iteration

So far, the AIW+IWCV method has been considered only in the context of policy evaluation. Here, this method is extended to the full policy iteration setup.

Let us denote the policy at the $l$-th iteration by $\pi_l$ and the maximum number of iterations by $L$. In general policy iteration methods, new data samples $\mathcal{D}^{\pi_l}$ are collected following the new policy $\pi_l$ during the policy evaluation step. Thus, previously collected data samples $\{\mathcal{D}^{\pi_1}, \mathcal{D}^{\pi_2}, \ldots, \mathcal{D}^{\pi_{l-1}}\}$ are not used:

$$\pi_1 \xrightarrow{E:\{\mathcal{D}^{\pi_1}\}} \widehat{Q}^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E:\{\mathcal{D}^{\pi_2}\}} \widehat{Q}^{\pi_2} \xrightarrow{I} \pi_3 \xrightarrow{E:\{\mathcal{D}^{\pi_3}\}} \cdots \xrightarrow{I} \pi_L,$$

where $E:\{\mathcal{D}\}$ indicates policy evaluation using the data sample $\mathcal{D}$. It would be more cost-efficient if one could reuse all previously collected data samples to perform policy evaluation with a growing data set as

$$\pi_1 \xrightarrow{E:\{\mathcal{D}^{\pi_1}\}} \widehat{Q}^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E:\{\mathcal{D}^{\pi_1}, \mathcal{D}^{\pi_2}\}} \widehat{Q}^{\pi_2} \xrightarrow{I} \pi_3 \xrightarrow{E:\{\mathcal{D}^{\pi_1}, \mathcal{D}^{\pi_2}, \mathcal{D}^{\pi_3}\}} \cdots \xrightarrow{I} \pi_L.$$

Reusing previously collected data samples turns this into an off-policy scenario, because the previous policies and the current policy are different unless the current policy has converged to the optimal one. Here, the AIW+IWCV method is applied to policy iteration. For this purpose, the definition of $\widehat{G}_{\mathrm{AIW}}$ is extended so that multiple sampling policies $\{\pi_1, \pi_2, \ldots, \pi_l\}$ are taken into account:

$$\widehat{\boldsymbol{\theta}}_{\mathrm{AIW}}^{\,l} := \mathop{\mathrm{argmin}}_{\boldsymbol{\theta}}\, \widehat{G}_{\mathrm{AIW}}^{\,l},$$

$$\widehat{G}_{\mathrm{AIW}}^{\,l} := \frac{1}{lMN} \sum_{l'=1}^{l} \sum_{m=1}^{M} \sum_{n=1}^{N} \widehat{g}\bigl(s_{m,n}^{\pi_{l'}}, a_{m,n}^{\pi_{l'}}; \boldsymbol{\theta}, \{\mathcal{D}^{\pi_{l'}}\}_{l'=1}^{l}\bigr) \left(\frac{\prod_{n'=1}^{n} \pi_l(a_{m,n'}^{\pi_{l'}} \,|\, s_{m,n'}^{\pi_{l'}})}{\prod_{n'=1}^{n} \pi_{l'}(a_{m,n'}^{\pi_{l'}} \,|\, s_{m,n'}^{\pi_{l'}})}\right)^{\nu_l}, \qquad (7.11)$$

where $\widehat{G}_{\mathrm{AIW}}^{\,l}$ is the approximation error estimated at the $l$-th policy evaluation using AIW. The flattening parameter $\nu_l$ is chosen based on IWCV before performing policy evaluation. This method is called sample reuse policy iteration (SRPI) [66].
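The SRPI loop can be sketched as a skeleton; `collect`, `evaluate`, and `improve` are hypothetical placeholders for the components described above (data collection, AIW policy evaluation with IWCV-chosen ν, and Gibbs policy improvement):

```python
def sample_reuse_policy_iteration(init_policy, collect, evaluate, improve, L=10):
    """Skeleton of sample reuse policy iteration (SRPI).

    collect(policy)          -> new data set D^{pi_l}
    evaluate(policies, data) -> Q estimate reusing all collected data
    improve(Q)               -> improved policy
    """
    policy = init_policy
    datasets, policies = [], []
    for _ in range(L):
        policies.append(policy)
        datasets.append(collect(policy))  # growing data set, never discarded
        Q = evaluate(policies, datasets)  # off-policy evaluation on all data
        policy = improve(Q)
    return policy
```

The key difference from ordinary policy iteration is that `evaluate` always receives the full history of data sets and sampling policies, matching equation 7.11.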
7.6.9 Robot Control Experiments

Here, the performance of the SRPI method is evaluated in a robot control task of an upward swing of an inverted pendulum.

We consider the task of an upward swing of an inverted pendulum illustrated in figure 7.6a, consisting of a rod hinged at the top of a cart. The goal of the task is to swing the rod up by moving the cart. We have three actions: applying positive force $+50$ [kg·m/s²] to the cart to move it right, negative force $-50$ to move it left, and zero force just to coast. That is, the action space $\mathcal{A}$ is discrete and described by

$$\mathcal{A} = \{50, -50, 0\}\ [\mathrm{kg \cdot m/s^2}].$$

Note that the force itself is not strong enough to swing the rod up, so the cart needs to be moved back and forth several times. The state space $\mathcal{S}$ is continuous and consists of the angle $\varphi$ [rad] $(\in [0, 2\pi])$ and the angular velocity $\dot{\varphi}$ [rad/s] $(\in [-\pi, \pi])$; thus, a state $s$ is described by a two-dimensional vector $s = (\varphi, \dot{\varphi})^{\top}$.
Figure 7.6b shows the parameter setting used in the simulation. The angle $\varphi$ and the angular velocity $\dot{\varphi}$ are updated as follows:

$$\varphi_{t+1} = \varphi_t + \dot{\varphi}_{t+1}\, \Delta t, \qquad \dot{\varphi}_{t+1} = \dot{\varphi}_t + \Delta t\, \frac{9.8 \sin(\varphi_t) - \alpha w d\, \dot{\varphi}_t^2 \sin(2\varphi_t)/2 + \alpha \cos(\varphi_t)\, a_t}{4d/3 - \alpha w d \cos^2(\varphi_t)},$$
[Figure 7.6 appears here: (a) an illustration of the inverted pendulum; (b) the parameter setting.]

Figure 7.6
Illustration of the inverted pendulum task and parameters used in the simulation.

Parameter                 | Value
Mass of the cart, W       | 8 [kg]
Mass of the rod, w        | 2 [kg]
Length of the rod, d      | 0.5 [m]
Simulation time step, Δt  | 0.1 [s]
where $\alpha = 1/(W + w)$ and $a_t$ is the action in $\mathcal{A}$ chosen at time $t$. The reward function $R(s, a, s')$ is defined as

$$R(s, a, s') = \cos(\varphi_{s'}),$$

where $\varphi_{s'}$ denotes the angle $\varphi$ of state $s'$.

Forty-eight Gaussian kernels with standard deviation $\sigma = \pi$ are used as basis functions, and the kernel centers are distributed over the following grid points:

$$\{0,\ 2\pi/3,\ 4\pi/3,\ 2\pi\} \times \{-3\pi,\ -\pi,\ \pi,\ 3\pi\}.$$

That is, the basis functions $\{\phi_{16(i-1)+j}(s, a)\}$ are set to

$$\phi_{16(i-1)+j}(s, a) = I(a = a_i)\, \exp\!\left(-\frac{\|s - c_j\|^2}{2\sigma^2}\right)$$

for $i = 1, 2, 3$ and $j = 1, 2, \ldots, 16$, where
$c_1, c_2, \ldots, c_{16}$ are the grid points above and $(a_1, a_2, a_3) = (50, -50, 0)$.

The initial policy $\pi_1(a|s)$ is chosen randomly, and the initial-state probability density $p_{\mathrm{I}}(s)$ is set to be uniform. The agent collects data samples $\mathcal{D}^{\pi_l}$ ($M = 10$ and $N = 100$) at each policy iteration, following the current policy $\pi_l$. The discount factor is set to $\gamma = 0.95$, and the policy is improved by the Gibbs policy (equation 7.3) with $\tau = 1$.
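The basis construction can be sketched as follows; the indexing and the indicator-gated form are our own reconstruction, with only the grid, σ = π, and the action set taken from the text:

```python
import numpy as np

# 3 actions x 16 Gaussian kernels on the 4x4 grid in (phi, phi_dot) = 48 bases
centers = np.array([(p, v) for p in [0.0, 2*np.pi/3, 4*np.pi/3, 2*np.pi]
                           for v in [-3*np.pi, -np.pi, np.pi, 3*np.pi]])
sigma = np.pi
actions = [50.0, -50.0, 0.0]

def features(s, a):
    """phi(s, a): Gaussian kernels on the state, gated by the chosen action."""
    k = np.exp(-np.sum((centers - np.asarray(s)) ** 2, axis=1) / (2 * sigma**2))
    out = np.zeros(len(actions) * len(centers))
    i = actions.index(a)                       # block of the chosen action
    out[i * len(centers):(i + 1) * len(centers)] = k
    return out

phi = features((np.pi, 0.0), 50.0)
print(phi.shape)  # -> (48,)
```

Only the block corresponding to the chosen action is nonzero, so each action effectively gets its own set of 16 state kernels.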
Figure 7.7a describes the performance of learned policies, measured by the discounted sum of rewards, as a function of the total number of episodes. The graph shows that SRPI nicely improves the performance throughout the entire policy iteration process. On the other hand, the performance when the flattening parameter is fixed to $\nu = 0$ or $\nu = 1$ is not properly improved after the middle iterations. The average flattening parameter value as a function of the
[Figure 7.7 appears here: (a) the discounted sum of rewards and (b) the average flattening parameter, each plotted against the total number of episodes, with curves for $\nu = 0$, $\nu = 1$, and $\nu = \widehat{\nu}_{\mathrm{IWCV}}$.]

Figure 7.7
The results of sample reuse policy iteration in the upward swing of an inverted pendulum. The agent collects training samples $\mathcal{D}^{\pi_l}$ ($M = 10$ and $N = 100$) at every iteration, and policy evaluation is performed using all collected samples $\{\mathcal{D}^{\pi_1}, \mathcal{D}^{\pi_2}, \ldots, \mathcal{D}^{\pi_l}\}$. (a) The performance of policies learned with $\nu = 0$, $\nu = 1$, and SRPI. The performance is measured by the average sum of discounted rewards computed from test samples over 20 trials. The total number of episodes is the number of training episodes ($M \times l$) collected by the agent in policy iteration. (b) Average flattening parameter used by SRPI over 20 trials.
total number of episodes is depicted in figure 7.7b, showing that the parameter value tends to increase quickly in the beginning and then is kept at medium values. These results indicate that the flattening parameter is well adjusted to reuse the previously collected samples effectively for policy evaluation, and thus SRPI can outperform the other methods.
III
LEARNING CAUSING COVARIATE SHIFT
8 Active Learning

Active learning [107, 33, 55], also referred to as experimental design in statistics [93, 48, 127], is the problem of determining the locations of training input points so that the generalization error is minimized (see figure 8.1). Active learning is particularly useful when the cost of sampling output value $y$ is very expensive. In such cases, we want to find the best input points at which to observe output values within a fixed budget (which corresponds to the number $n_{\mathrm{tr}}$ of training samples).

Since training input points are generated following a user-defined distribution, covariate shift naturally occurs in the active learning scenario. Thus, covariate shift adaptation techniques also play essential roles in the active learning scenario. In this chapter, we consider two types of active learning scenarios: population-based active learning, where one is allowed to sample output values at any input location (sections 8.2 and 8.3), and pool-based active learning, where the input locations at which output values are observed are chosen from a pool of (unlabeled) input samples (sections 8.4 and 8.5).
8.1 Preliminaries
In this section, we first summarize the common setup of this chapter. Then we
describe a general strategy of active learning.
8.1.1 Setup
We focus on a linear regression setup throughout this chapter:
• The squared loss (see section 1.3.2) is used, and the training output noise is assumed to be i.i.d. with mean zero and variance $\sigma^2$. Then the generalization
[Figure 8.1 appears here: the target function and the functions learned from good and poor training input locations.]

Figure 8.1
Active learning. In both cases, the target function is learned from two (noiseless) training samples. The only difference is the training input location.
error is expressed as follows (see section 3.2.2):

$$\mathrm{Gen} := \mathbb{E}_{x^{\mathrm{te}}}\, \mathbb{E}_{y^{\mathrm{te}}}\bigl[\bigl(\widehat{f}(x^{\mathrm{te}}) - y^{\mathrm{te}}\bigr)^2\bigr],$$

where $\mathbb{E}_{x^{\mathrm{te}}}$ denotes the expectation over $x^{\mathrm{te}}$ drawn from $p_{\mathrm{te}}(x)$, and $\mathbb{E}_{y^{\mathrm{te}}}$ denotes the expectation over $y^{\mathrm{te}}$ drawn from $p(y|x = x^{\mathrm{te}})$.
• The test input density $p_{\mathrm{te}}(x)$ is assumed to be known. This assumption is essential in the population-based active learning scenario, since the training input points $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ are designed so that prediction for the given $p_{\mathrm{te}}(x)$ is improved (see sections 8.2 and 8.3 for details). In the pool-based active learning scenario, $p_{\mathrm{te}}(x)$ is unknown, but a pool of (unlabeled) input samples drawn i.i.d. from $p_{\mathrm{te}}(x)$ is assumed to be given (see sections 8.4 and 8.5 for details).
• A linear-in-parameter model (see section 1.3.5.1) is used:

$$\widehat{f}(x; \boldsymbol{\theta}) = \sum_{\ell=1}^{b} \theta_{\ell}\, \varphi_{\ell}(x), \qquad (8.1)$$

where $\boldsymbol{\theta}$ is the parameter to be learned and $\{\varphi_{\ell}(x)\}_{\ell=1}^{b}$ are fixed, linearly independent functions.

• A linear method is used for learning the parameter $\boldsymbol{\theta}$; that is, the learned parameter $\widehat{\boldsymbol{\theta}}$ is given by

$$\widehat{\boldsymbol{\theta}} = L \boldsymbol{y}^{\mathrm{tr}},$$
where

$$\boldsymbol{y}^{\mathrm{tr}} := \bigl(y_1^{\mathrm{tr}}, y_2^{\mathrm{tr}}, \ldots, y_{n_{\mathrm{tr}}}^{\mathrm{tr}}\bigr)^{\top}.$$

$L$ is a $b \times n_{\mathrm{tr}}$ learning matrix that is independent of the training output noise; this independence assumption is essentially used as

$$\mathbb{E}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}}\bigl[\widehat{\boldsymbol{\theta}}\bigr] = L\, \mathbb{E}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}}\bigl[\boldsymbol{y}^{\mathrm{tr}}\bigr],$$

where $\mathbb{E}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}}$ denotes the expectation over $\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$, each of which is drawn from $p(y|x = x_i^{\mathrm{tr}})$.
8.1.2 Decomposition of Generalization Error

Let $\boldsymbol{\theta}^*$ be the optimal parameter under the generalization error:

$$\boldsymbol{\theta}^* := \mathop{\mathrm{argmin}}_{\boldsymbol{\theta}}\, \mathrm{Gen}.$$

Let $\delta r(x)$ be the residual function:

$$\delta r(x) := \widehat{f}(x; \boldsymbol{\theta}^*) - f(x).$$

We normalize $r(x)$ as

$$\mathbb{E}_{x^{\mathrm{te}}}\bigl[r^2(x^{\mathrm{te}})\bigr] = 1,$$

so that $\delta$ governs the "magnitude" of the residual and $r(x)$ expresses the "direction" of the residual. Note that $r(x)$ is orthogonal to $\widehat{f}(x; \boldsymbol{\theta})$ for any $\boldsymbol{\theta}$ under $p_{\mathrm{te}}(x)$ (see figure 3.2 in section 3.2.2):

$$\mathbb{E}_{x^{\mathrm{te}}}\bigl[r(x^{\mathrm{te}})\, \widehat{f}(x^{\mathrm{te}}; \boldsymbol{\theta})\bigr] = 0 \quad \text{for any } \boldsymbol{\theta}.$$

Then the generalization error expected over the training output values $\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ can be decomposed as

$$\mathbb{E}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}}[\mathrm{Gen}] = \mathbb{E}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}}\, \mathbb{E}_{x^{\mathrm{te}}}\Bigl[\bigl(\widehat{f}(x^{\mathrm{te}}; \widehat{\boldsymbol{\theta}}) - \widehat{f}(x^{\mathrm{te}}; \boldsymbol{\theta}^*) + \delta r(x^{\mathrm{te}})\bigr)^2\Bigr] + \sigma^2$$

$$= \mathbb{E}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}}\, \mathbb{E}_{x^{\mathrm{te}}}\Bigl[\bigl(\widehat{f}(x^{\mathrm{te}}; \widehat{\boldsymbol{\theta}}) - \widehat{f}(x^{\mathrm{te}}; \boldsymbol{\theta}^*)\bigr)^2\Bigr] + \delta^2\, \mathbb{E}_{x^{\mathrm{te}}}\bigl[r^2(x^{\mathrm{te}})\bigr]$$
$$\quad + 2\, \mathbb{E}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}}\, \mathbb{E}_{x^{\mathrm{te}}}\Bigl[\bigl(\widehat{f}(x^{\mathrm{te}}; \widehat{\boldsymbol{\theta}}) - \widehat{f}(x^{\mathrm{te}}; \boldsymbol{\theta}^*)\bigr)\bigl(\delta r(x^{\mathrm{te}})\bigr)\Bigr] + \sigma^2$$
$$= \mathrm{Gen}'' + \delta^2 + \sigma^2, \qquad (8.2)$$

where the cross term vanishes by the orthogonality of $r(x)$ and $\widehat{f}(x; \boldsymbol{\theta})$, and

$$\mathrm{Gen}'' := \mathbb{E}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}}\, \mathbb{E}_{x^{\mathrm{te}}}\Bigl[\bigl(\widehat{f}(x^{\mathrm{te}}; \widehat{\boldsymbol{\theta}}) - \widehat{f}(x^{\mathrm{te}}; \boldsymbol{\theta}^*)\bigr)^2\Bigr].$$
Applying the standard bias-variance decomposition technique to $\mathrm{Gen}''$, we have

$$\mathrm{Gen}'' = \mathbb{E}_{\{y_i^{\mathrm{tr}}\}}\, \mathbb{E}_{x^{\mathrm{te}}}\Bigl[\bigl(\widehat{f}(x^{\mathrm{te}}; \widehat{\boldsymbol{\theta}}) - \mathbb{E}_{\{y_i^{\mathrm{tr}}\}}[\widehat{f}(x^{\mathrm{te}}; \widehat{\boldsymbol{\theta}})] + \mathbb{E}_{\{y_i^{\mathrm{tr}}\}}[\widehat{f}(x^{\mathrm{te}}; \widehat{\boldsymbol{\theta}})] - \widehat{f}(x^{\mathrm{te}}; \boldsymbol{\theta}^*)\bigr)^2\Bigr]$$

$$= \mathbb{E}_{\{y_i^{\mathrm{tr}}\}}\, \mathbb{E}_{x^{\mathrm{te}}}\Bigl[\bigl(\widehat{f}(x^{\mathrm{te}}; \widehat{\boldsymbol{\theta}}) - \mathbb{E}_{\{y_i^{\mathrm{tr}}\}}[\widehat{f}(x^{\mathrm{te}}; \widehat{\boldsymbol{\theta}})]\bigr)^2\Bigr]$$
$$\quad + 2\, \mathbb{E}_{x^{\mathrm{te}}}\Bigl[\mathbb{E}_{\{y_i^{\mathrm{tr}}\}}\bigl[\widehat{f}(x^{\mathrm{te}}; \widehat{\boldsymbol{\theta}}) - \mathbb{E}_{\{y_i^{\mathrm{tr}}\}}[\widehat{f}(x^{\mathrm{te}}; \widehat{\boldsymbol{\theta}})]\bigr]\, \bigl(\mathbb{E}_{\{y_i^{\mathrm{tr}}\}}[\widehat{f}(x^{\mathrm{te}}; \widehat{\boldsymbol{\theta}})] - \widehat{f}(x^{\mathrm{te}}; \boldsymbol{\theta}^*)\bigr)\Bigr]$$
$$\quad + \mathbb{E}_{x^{\mathrm{te}}}\Bigl[\bigl(\mathbb{E}_{\{y_i^{\mathrm{tr}}\}}[\widehat{f}(x^{\mathrm{te}}; \widehat{\boldsymbol{\theta}})] - \widehat{f}(x^{\mathrm{te}}; \boldsymbol{\theta}^*)\bigr)^2\Bigr]$$
$$= \mathrm{Var} + \mathrm{Bias}^2,$$

where the middle term vanishes, and $\mathrm{Bias}^2$ and $\mathrm{Var}$ denote the squared bias and the variance defined as follows:

$$\mathrm{Bias}^2 := \mathbb{E}_{x^{\mathrm{te}}}\Bigl[\bigl(\mathbb{E}_{\{y_i^{\mathrm{tr}}\}}[\widehat{f}(x^{\mathrm{te}}; \widehat{\boldsymbol{\theta}})] - \widehat{f}(x^{\mathrm{te}}; \boldsymbol{\theta}^*)\bigr)^2\Bigr], \qquad (8.3)$$

$$\mathrm{Var} := \mathbb{E}_{\{y_i^{\mathrm{tr}}\}}\, \mathbb{E}_{x^{\mathrm{te}}}\Bigl[\bigl(\widehat{f}(x^{\mathrm{te}}; \widehat{\boldsymbol{\theta}}) - \mathbb{E}_{\{y_i^{\mathrm{tr}}\}}[\widehat{f}(x^{\mathrm{te}}; \widehat{\boldsymbol{\theta}})]\bigr)^2\Bigr]. \qquad (8.4)$$
Under the linear learning setup (equation 8.1), $\mathrm{Gen}''$, $\mathrm{Bias}^2$, and $\mathrm{Var}$ are expressed in matrix/vector form as follows (see figure 8.2):

$$\mathrm{Gen}'' = \mathbb{E}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}}\bigl[\bigl\langle U(\widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}^*),\ \widehat{\boldsymbol{\theta}} - \boldsymbol{\theta}^*\bigr\rangle\bigr],$$

where $U$ is the $b \times b$ matrix with the $(\ell, \ell')$-th element

$$U_{\ell,\ell'} := \mathbb{E}_{x^{\mathrm{te}}}\bigl[\varphi_{\ell}(x^{\mathrm{te}})\, \varphi_{\ell'}(x^{\mathrm{te}})\bigr]. \qquad (8.5)$$

[Figure 8.2 appears here: the bias-variance decomposition, showing $\widehat{\boldsymbol{\theta}}$, $\mathbb{E}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}}[\widehat{\boldsymbol{\theta}}]$, and $\boldsymbol{\theta}^*$ together with Bias² and Var.]

Figure 8.2
Bias-variance decomposition.
In some literature, $\mathrm{Bias}^2 + \delta^2$ may be called the bias term:

$$\mathrm{Bias}^2 + \delta^2 = \mathrm{Bias}^2 + \delta^2\, \mathbb{E}_{x^{\mathrm{te}}}\bigl[r^2(x^{\mathrm{te}})\bigr] = \mathbb{E}_{x^{\mathrm{te}}}\Bigl[\bigl(\mathbb{E}_{\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}}[\widehat{f}(x^{\mathrm{te}}; \widehat{\boldsymbol{\theta}})] - f(x^{\mathrm{te}})\bigr)^2\Bigr].$$

In our notation, we further decompose the above bias ($\mathrm{Bias}^2 + \delta^2$) into the reducible part ($\mathrm{Bias}^2$) and the irreducible part ($\delta^2$), and refer to them as the bias and the model error, respectively.
8.1.3 Basic Strategy of Active Learning

The goal of active learning is to determine the training input locations $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ so that the generalization error is minimized. Below, we focus on $\mathrm{Gen}''$, the reducible part of $\mathrm{Gen}$ defined by equation 8.2, which is expressed as

$$\mathrm{Gen}'' = \mathrm{Gen} - \delta^2 - \sigma^2.$$

Note that the way the generalization error is decomposed here is different from that in model selection (cf. $\mathrm{Gen}'$ defined by equation 3.2).

$\mathrm{Gen}''$ is unknown, so it needs to be estimated from samples. However, in active learning, we do not have the training output values $\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$, since the training input points $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ have not yet been determined. Thus, the generalization error should be estimated without observing the training output values $\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$. Therefore, generalization error estimation in active learning is generally more difficult than that in model selection, where both $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ and $\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ are available.

Since we cannot use $\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ for estimating the generalization error, it is generally not possible to estimate the bias (equation 8.3), which depends on the true target function $f(x)$. A general strategy for coping with this problem is to find a setup where the bias is guaranteed to be zero (or small enough to be neglected), and to design the locations of the training input points $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ so that the variance (equation 8.4) is minimized.

Following this basic strategy, we describe several active learning methods below.
8.2 Population-Based Active Learning Methods

Population-based active learning refers to the situation where we know the distribution of test input points and are allowed to locate training input points at any desired positions. The goal of population-based active learning is to find the optimal training input density $p_{\mathrm{tr}}(x)$ (from which the training input points $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ are generated) so that the generalization error is minimized.
8.2.1 Classical Method of Active Learning for Correct Models
A traditional variance-only active learning method assumes the following conditions (see, e.g., [48, 33, 55]):
• The model at hand is correctly specified (see section 1.3.6), that is, the model error δ is zero. In this case, we have

  f̂(x; θ*) = f(x).

• The design matrix X^tr, which is the n_tr × b matrix with the (i, ℓ)-th element

  [X^tr]_{i,ℓ} := φ_ℓ(x_i^tr),   (8.6)

has rank b. Then the inverse of X^trᵀ X^tr exists.
• Ordinary least squares (OLS, which is an ERM method with the squared loss; see chapter 2) is used for parameter learning:

  θ̂_OLS = argmin_θ [ Σ_{i=1}^{n_tr} ( f̂(x_i^tr; θ) − y_i^tr )² ].

The solution θ̂_OLS is given by

  θ̂_OLS = L_OLS y^tr,   (8.7)

where

  L_OLS := (X^trᵀ X^tr)^{−1} X^trᵀ,
  y^tr := (y_1^tr, y_2^tr, ..., y_{n_tr}^tr)ᵀ.
Under the above setup, Bias², defined in equation 8.3, is guaranteed to be zero, since

  E_{y^tr}[ θ̂_OLS ] = (X^trᵀ X^tr)^{−1} X^trᵀ X^tr θ* = θ*.

Then the training input points {x_i^tr}_{i=1}^{n_tr} are optimized so that Var (equation 8.4) is minimized:

  min_{{x_i^tr}_{i=1}^{n_tr}} [ tr( U (X^trᵀ X^tr)^{−1} ) ].   (8.8)
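As a concrete illustration, the following sketch (our construction, not the book's) evaluates the variance term tr(U (X^trᵀX^tr)^{−1}) of equation 8.8 for two candidate Gaussian designs; the quadratic basis, the test density N(0.2, 0.4²), and all constants are illustrative assumptions.

```python
import numpy as np

# Hypothetical 1-D quadratic basis phi(x) = (1, x, x^2); all constants are ours.
def design_matrix(x):
    return np.vstack([np.ones_like(x), x, x**2]).T  # n x b

rng = np.random.default_rng(0)

# U_{ll'} = E_te[phi_l(x) phi_l'(x)] under an assumed test density N(0.2, 0.4^2),
# approximated here by Monte Carlo.
x_te = rng.normal(0.2, 0.4, 100_000)
Phi_te = design_matrix(x_te)
U = Phi_te.T @ Phi_te / len(x_te)

def classical_criterion(x_tr):
    # Variance term of eq. 8.8, up to the constant factor sigma^2.
    X = design_matrix(x_tr)
    return np.trace(U @ np.linalg.inv(X.T @ X))

# Compare two candidate designs of 100 training points each.
narrow = classical_criterion(rng.normal(0.2, 0.4, 100))
wide = classical_criterion(rng.normal(0.2, 0.8, 100))
print(narrow, wide)
```

Smaller criterion values indicate designs with smaller estimation variance; sweeping over a family of candidate densities recovers the density-optimization view described above.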
8.2.2 Limitations of Classical Approach and Countermeasures
In the traditional active learning method explained above (see also [48, 33, 55]), the minimizer of the variance really minimizes the generalization error, since the bias is exactly zero under the correct model assumption. Thus, the minimizer is truly the optimal solution. However, this approach has two drawbacks.
The first drawback is that minimizing the variance with respect to {x_i^tr}_{i=1}^{n_tr} may not be computationally tractable, due to the simultaneous optimization of n_tr input points. This problem can be eased by optimizing the training input density p_tr(x) and drawing the training input points {x_i^tr}_{i=1}^{n_tr} from the determined density.
The second, and more critical, problem is that the correctness of the model may not be guaranteed in practice; then minimizing the variance does not necessarily result in minimizing the generalization error. In other words, the traditional active learning method based on OLS can have a large bias. This problem can be mitigated by the use of importance-weighted least squares (IWLS; see section 2.2.1):

  θ̂_IWLS = argmin_θ [ Σ_{i=1}^{n_tr} ( p_te(x_i^tr) / p_tr(x_i^tr) ) ( f̂(x_i^tr; θ) − y_i^tr )² ].

The solution θ̂_IWLS is given by

  θ̂_IWLS = L_IWLS y^tr,

where

  L_IWLS := (X^trᵀ W X^tr)^{−1} X^trᵀ W,   (8.9)

and W is the diagonal matrix with the i-th diagonal element

  W_{i,i} := p_te(x_i^tr) / p_tr(x_i^tr).
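A minimal sketch of the IWLS estimator of equation 8.9, under assumed Gaussian training/test densities and a quadratic target; all concrete choices below are ours, not the book's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x_tr = rng.normal(0.2, 0.8, 200)                       # drawn from p_tr = N(0.2, 0.8^2)
y_tr = 1.0 - x_tr + x_tr**2 + rng.normal(0, 0.3, 200)  # noisy observations
X = np.vstack([np.ones_like(x_tr), x_tr, x_tr**2]).T   # design matrix (eq. 8.6)

w = gauss_pdf(x_tr, 0.2, 0.4) / gauss_pdf(x_tr, 0.2, 0.8)  # importance weights p_te/p_tr
W = np.diag(w)
L_iwls = np.linalg.solve(X.T @ W @ X, X.T @ W)  # L_IWLS = (X' W X)^-1 X' W
theta_iwls = L_iwls @ y_tr
print(theta_iwls)  # should be close to (1, -1, 1): the model is correctly specified here
```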
IWLS is shown to be asymptotically unbiased even when the model is misspecified; indeed, we have

  E_{y^tr}[ θ̂_IWLS ] = (X^trᵀ W X^tr)^{−1} X^trᵀ W ( X^tr θ* + δ r^tr )
                     = θ* + δ (X^trᵀ W X^tr)^{−1} X^trᵀ W r^tr,

where r^tr is the n_tr-dimensional vector defined by

  r^tr := ( r(x_1^tr), r(x_2^tr), ..., r(x_{n_tr}^tr) )ᵀ.

As shown in section 3.2.3.2, the second term is of order O_p(n_tr^{−1/2}), where O_p denotes the asymptotic order in probability. Thus, we have

  E_{y^tr}[ θ̂_IWLS ] = θ* + O_p(n_tr^{−1/2}).

Below, we introduce useful active learning methods which can overcome the limitations of the traditional OLS-based approach.
8.2.3 Input-Independent Variance-Only Method
For the IWLS estimator θ̂_IWLS, it has been proved [89] that the generalization error expected over the training input points {x_i^tr}_{i=1}^{n_tr} and training output values {y_i^tr}_{i=1}^{n_tr} (i.e., input-independent analysis; see section 3.2.1) is asymptotically expressed as

  E_{x^tr, y^tr}[ Gen ] = (1/n_tr) tr( U^{−1} H ) + o(n_tr^{−1}),   (8.10)

where H is the b × b matrix defined by

  H := δ² S + σ² T.   (8.11)

S and T are b × b matrices; S involves the unknown residual function r(x), while T has the (ℓ, ℓ')-th element

  T_{ℓ,ℓ'} := E_{x^te}[ φ_ℓ(x^te) φ_{ℓ'}(x^te) p_te(x^te) / p_tr(x^te) ],   (8.12)

where E_{x^te} denotes the expectation over x^te drawn from p_te(x). Note that (δ²/n_tr) tr(U^{−1}S) corresponds to the bias term and (σ²/n_tr) tr(U^{−1}T) corresponds to the variance term. S is not accessible because the unknown r(x) is included, while T is accessible owing to the assumption that the test input density p_te(x) is known (see section 8.1.1).
Equation 8.10 suggests that tr(U^{−1}H) may be used as an active learning criterion. However, H includes the inaccessible quantities r(x) and σ², so tr(U^{−1}H) cannot be calculated directly. To cope with this problem, Wiens [199] proposed¹ to ignore S (the bias term) and to determine the training input density p_tr(x) so that the variance term is minimized:

  p*_tr = argmin_{p_tr} [ tr( U^{−1} T ) ].   (8.13)

As shown in [153], the above active learning method can be justified for approximately correct models, that is, for δ = o(1).
A notable feature of the above active learning method is that the optimal training input density p*_tr can be obtained analytically as

  p*_tr(x) ∝ p_te(x) ( Σ_{ℓ,ℓ'=1}^{b} [U^{−1}]_{ℓ,ℓ'} φ_ℓ(x) φ_{ℓ'}(x) )^{1/2},   (8.14)

which may be confirmed from the following equation [87]:

  tr( U^{−1} T ) ∝ 1 + ∫ ( p*_tr(x) − p_tr(x) )² / p_tr(x) dx.

1. In the original paper, the range of application was limited to the case where the input domain is bounded and p_te(x) is uniform over the domain. However, it can easily be extended to an arbitrary strictly positive p_te(x). For this reason, we deal with the extended version here.
Figure 8.3
Illustration of the weights in the optimal training input density p*_tr(x).

Equation 8.14 implies that in the optimal training input density p*_tr(x), p_te(x) is weighted according to the U^{−1}-norm of (φ_1(x), φ_2(x), ..., φ_b(x))ᵀ. Thus, input points far from the origin tend to have higher weights in p*_tr(x). Note that the matrix U defined by equation 8.5 corresponds to the second-order moments of the basis functions {φ_ℓ(x)}_{ℓ=1}^{b} around the origin, so the weight contours have ellipsoidal shapes (see figure 8.3).
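The weighting in equation 8.14 can be computed directly. The sketch below (our setup: a quadratic basis, the test density N(0.2, 0.4²), and a Monte Carlo estimate of U) evaluates the unnormalized optimal density p*_tr(x) ∝ p_te(x) √(φ(x)ᵀ U^{−1} φ(x)).

```python
import numpy as np

def phi(x):
    # Illustrative quadratic basis; not the book's general setting.
    return np.vstack([np.ones_like(x), x, x**2]).T

rng = np.random.default_rng(2)
x_te = rng.normal(0.2, 0.4, 100_000)   # assumed test density N(0.2, 0.4^2)
Phi = phi(x_te)
U = Phi.T @ Phi / len(x_te)            # eq. 8.5, approximated by Monte Carlo
U_inv = np.linalg.inv(U)

def p_star_unnormalized(x):
    B = phi(x)
    # sum_{l,l'} [U^-1]_{l,l'} phi_l(x) phi_l'(x), evaluated per point
    quad = np.einsum("nl,lk,nk->n", B, U_inv, B)
    p_te = np.exp(-0.5 * ((x - 0.2) / 0.4) ** 2)  # Gaussian density up to normalization
    return p_te * np.sqrt(quad)

vals = p_star_unnormalized(np.array([0.2, 0.6, 1.0]))
print(vals)
```

The square-root factor grows away from the center of mass of the basis moments, reweighting p_te exactly as described in the text.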
8.2.4 Input-Dependent Variance-Only Method
As explained in section 3.2.1, estimating the generalization error without taking the expectation over the training input points {x_i^tr}_{i=1}^{n_tr} is advantageous, since the estimate is more data-dependent.
Following this idea, an input-dependent variance-only active learning method called ALICE (Active Learning using Importance-weighted least-squares learning based on Conditional Expectation of the generalization error) has been proposed [153]. ALICE minimizes the input-dependent variance (equation 8.4):

  min_{p_tr} [ tr( U L_IWLS L_IWLSᵀ ) ].   (8.15)

This criterion can also be justified for approximately correct models, that is, for δ = o(1).
Input-dependent variance and input-independent variance are similar; indeed, they are asymptotically equivalent.
However, the input-dependent variance is a more accurate estimator of the generalization error than the input-independent variance. More precisely, for approximately correct models with δ = o_p(n_tr^{−1/4}), the input-dependent variance is shown to be at least as accurate an estimator if terms of o_p(n_tr^{−3/2}) are ignored [153]. This difference is shown to cause a significant difference in practical performance (see section 8.3 for experimental results).
As shown above, the input-dependent approach is more accurate than the input-independent approach. However, its downside is that no analytic solution is known for the ALICE criterion (equation 8.15). Thus, an exhaustive search strategy is needed to obtain a better training input density. A practical compromise is to use the analytic optimal solution p*_tr(x) of the input-independent variance-only active learning criterion (equation 8.13) and to search for a better solution in the vicinity of p*_tr(x). Then we only need to perform a simple one-dimensional search: find the best training input density with respect to a scalar parameter v:

  p_v(x) ∝ p_te(x) ( Σ_{ℓ,ℓ'=1}^{b} [U^{−1}]_{ℓ,ℓ'} φ_ℓ(x) φ_{ℓ'}(x) )^{v/2}.

The parameter v controls the "flatness" of the training input density: v = 0 corresponds to the test input density p_te(x) (i.e., passive learning, where the training and test distributions are equivalent), v = 1 corresponds to the original p*_tr(x), and v > 1 expresses a "sharper" density than p*_tr(x). Searching intensively for a good value of v around v = 1 would be useful in practice.
Pseudo code of the ALICE algorithm is summarized in figure 8.4.
Input: A test input density p_te(x) and basis functions {φ_ℓ(x)}_{ℓ=1}^{b}
Output: Learned parameter θ̂

Compute the b × b matrix U with U_{ℓ,ℓ'} = ∫ φ_ℓ(x) φ_{ℓ'}(x) p_te(x) dx;
For several different values of v (possibly around v = 1)
    Let p_v(x) = p_te(x) ( Σ_{ℓ,ℓ'=1}^{b} [U^{−1}]_{ℓ,ℓ'} φ_ℓ(x) φ_{ℓ'}(x) )^{v/2};
    Draw X_v^tr = {x_{v,i}^tr}_{i=1}^{n_tr} following the density proportional to p_v(x);
    Compute the n_tr × b matrix X_v^tr with [X_v]_{i,ℓ} = φ_ℓ(x_{v,i}^tr);
    Compute the n_tr × n_tr diagonal matrix W_v with [W_v]_{i,i} = p_te(x_{v,i}^tr) / p_v(x_{v,i}^tr);
    Compute L_v = (X_v^trᵀ W_v X_v^tr)^{−1} X_v^trᵀ W_v;
    Compute ALICE(v) = tr(U L_v L_vᵀ);
End
Compute v̂ = argmin_v [ALICE(v)];
Gather training output values y^tr = (y_1^tr, y_2^tr, ..., y_{n_tr}^tr)ᵀ at X_v̂^tr;
Compute θ̂ = L_v̂ y^tr;

Figure 8.4
Pseudo code of the ALICE algorithm.
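The loop of figure 8.4 can be sketched in Python. As assumptions of ours, we specialize to a 1-D quadratic basis and the Gaussian test density N(0.2, 0.4²), approximate U by Monte Carlo, and draw from p_v(x) by self-normalized resampling; note that L_v is invariant to the unknown normalization constants of p_v, so unnormalized weights suffice.

```python
import numpy as np

rng = np.random.default_rng(3)
n_tr = 100

def phi(x):
    return np.vstack([np.ones_like(x), x, x**2]).T  # quadratic basis (our assumption)

def quad(x, U_inv):
    B = phi(x)
    return np.einsum("nl,lk,nk->n", B, U_inv, B)

# U approximated by Monte Carlo under the assumed test density N(0.2, 0.4^2).
x_mc = rng.normal(0.2, 0.4, 200_000)
Phi = phi(x_mc)
U = Phi.T @ Phi / len(x_mc)
U_inv = np.linalg.inv(U)

def draw_from_pv(v, n):
    # p_v(x) ∝ p_te(x) quad(x)^(v/2): draw candidates from p_te and resample.
    cand = rng.normal(0.2, 0.4, 50 * n)
    w = quad(cand, U_inv) ** (v / 2)
    idx = rng.choice(len(cand), size=n, replace=True, p=w / w.sum())
    return cand[idx], w[idx]

best = None
for v in [0.0, 0.5, 1.0, 1.5, 2.0]:
    x_v, wv = draw_from_pv(v, n_tr)
    X = phi(x_v)
    W = np.diag(1.0 / wv)                      # p_te/p_v up to a constant; scale cancels in L
    L = np.linalg.solve(X.T @ W @ X, X.T @ W)  # L_v
    score = float(np.trace(U @ L @ L.T))       # ALICE(v) = tr(U L_v L_v')
    if best is None or score < best[0]:
        best = (score, v, x_v, L)

score, v_hat, x_tr, L_hat = best
# Only now are outputs "gathered" at the chosen inputs (simulated here).
y_tr = 1.0 - x_tr + x_tr**2 + rng.normal(0, 0.3, n_tr)
theta = L_hat @ y_tr
print(v_hat, theta)
```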
8.2.5 Input-Independent Bias-and-Variance Approach
The use of the input-independent/dependent variance-only approaches is justified only for approximately correct models (i.e., the model error δ vanishes asymptotically). Although this condition turns out to be less problematic in practice (see section 8.3 for numerical results), it cannot be satisfied theoretically, since the model error δ is a constant that does not depend on the number of samples.
To cope with this problem, a two-stage active learning method has been developed [89] which is theoretically justifiable for totally misspecified models within the input-independent framework (see section 8.2.3). The key idea is to use the samples gathered in the first stage for estimating the generalization error (i.e., both bias and variance); the training input distribution is then optimized in the second stage based on the estimated generalization error. Here, we explain the details of this algorithm.
In the first stage, ñ_tr (≤ n_tr) training input points {x̃_i^tr}_{i=1}^{ñ_tr} are created independently following the test input density p_te(x), and the corresponding training output values {ỹ_i^tr}_{i=1}^{ñ_tr} are observed.
Then we determine the training input density based on the above samples. More specifically, we prepare candidates for the training input density, estimate the generalization error for each candidate p_tr(x), and choose for the second stage the one that minimizes the estimated generalization error. The generalization error for a candidate density p_tr(x) is estimated as follows. Let W̃ and Q̃ be ñ_tr-dimensional diagonal matrices whose i-th diagonal elements are computed from the first-stage samples (see [89] for the explicit forms), where X̃^tr is the design matrix for {x̃_i^tr}_{i=1}^{ñ_tr} (i.e., the ñ_tr × b matrix with the (i, ℓ)-th element [X̃^tr]_{i,ℓ} = φ_ℓ(x̃_i^tr)), and

  ỹ^tr := ( ỹ_1^tr, ỹ_2^tr, ..., ỹ_{ñ_tr}^tr )ᵀ.

Then an approximation Ĥ of the unknown matrix H defined by equation 8.11 is computed from X̃^tr, W̃, Q̃, and ỹ^tr (equation 8.16). Although U^{−1} is accessible in the current setting, it is also replaced by a consistent estimator Ũ^{−1} [89], where Ũ is the b × b matrix with the (ℓ, ℓ')-th element

  Ũ_{ℓ,ℓ'} := (1/ñ_tr) Σ_{i=1}^{ñ_tr} φ_ℓ(x̃_i^tr) φ_{ℓ'}(x̃_i^tr).
Based on the above approximations, the training input density p_tr(x) is determined as follows:

  min_{p_tr} [ tr( Ũ^{−1} Ĥ ) ].   (8.17)
After determining the training input density p_tr(x), the remaining (n_tr − ñ_tr) training input points {x_i^tr}_{i=ñ_tr+1}^{n_tr} are created independently following the chosen p_tr(x), and the corresponding training output values {y_i^tr}_{i=ñ_tr+1}^{n_tr} are gathered. This is the second stage of sampling.
Finally, the learned parameter θ̂ is obtained, using {(x̃_i^tr, ỹ_i^tr)}_{i=1}^{ñ_tr} and {(x_i^tr, y_i^tr)}_{i=ñ_tr+1}^{n_tr}, as

  θ̂ = argmin_θ [ Σ_{i=1}^{ñ_tr} ( f̂(x̃_i^tr; θ) − ỹ_i^tr )² + Σ_{i=ñ_tr+1}^{n_tr} ( p_te(x_i^tr) / p_tr(x_i^tr) ) ( f̂(x_i^tr; θ) − y_i^tr )² ].   (8.18)
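The combined objective of equation 8.18 is an ordinary weighted least-squares problem: first-stage samples enter with weight 1, and second-stage samples enter with their importance weights. A sketch under assumed densities and a quadratic target (our choices, not the book's experiment):

```python
import numpy as np

rng = np.random.default_rng(4)

def gauss_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def phi(x):
    return np.vstack([np.ones_like(x), x, x**2]).T

def f(x):                      # assumed quadratic target
    return 1.0 - x + x**2

n1, n2 = 25, 75
x1 = rng.normal(0.2, 0.4, n1)  # first stage, drawn from p_te
x2 = rng.normal(0.2, 0.6, n2)  # second stage, drawn from the chosen p_tr
y1 = f(x1) + rng.normal(0, 0.3, n1)
y2 = f(x2) + rng.normal(0, 0.3, n2)

# Weights of eq. 8.18: 1 for first-stage points, p_te/p_tr for second-stage points.
w = np.concatenate([np.ones(n1), gauss_pdf(x2, 0.2, 0.4) / gauss_pdf(x2, 0.2, 0.6)])
X = phi(np.concatenate([x1, x2]))
y = np.concatenate([y1, y2])
theta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
print(theta)
```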
The above active learning method has a strong theoretical property: for ñ_tr = o(n_tr), lim_{n_tr→∞} ñ_tr = ∞, and δ = O(1), the estimated generalization error is asymptotically valid. This means that the use of this two-stage active learning method can be justified for totally misspecified models (i.e., δ = O(1)). Furthermore, this method can be applied to more general scenarios beyond regression [89], such as classification with logistic regression (see section 2.3.2).
On the other hand, although the two-stage active learning method has good theoretical properties, it seems not to perform very well in practice (see section 8.3 for numerical examples), which may be due to the following reasons.
• Since ñ_tr training input points must be gathered following p_te(x) in the first stage, users are allowed to optimize the location of only the (n_tr − ñ_tr) remaining training input points. This is particularly critical when the total number n_tr is not very large, which is the usual case in active learning.
• The performance depends on the choice of the number ñ_tr, and it is not straightforward to determine this number appropriately. Using ñ_tr = O(n_tr^{1/2}) is recommended in [89], but the exact choice of ñ_tr still seems open.
• The estimated generalization error corresponds to the case where n_tr points are chosen from p_tr(x) and IWLS is used; but in reality, ñ_tr points are taken from p_te(x) and (n_tr − ñ_tr) points are taken from p_tr(x), and a combination of OLS and IWLS is used for parameter learning. This discrepancy can degrade the performance. It is possible to resolve this problem by not using {(x̃_i^tr, ỹ_i^tr)}_{i=1}^{ñ_tr}, gathered in the first stage, for estimating the parameter. However, this may degrade the performance further, because only (n_tr − ñ_tr) training examples are then used for learning.
• Estimation of the bias and the noise variance is implicitly included (see equations 8.11 and 8.16). In practice, estimating the bias and noise variance from a small number of training samples is highly erroneous, and thus the performance of active learning can be degraded. The variance-only methods avoid this difficulty by ignoring the bias; the noise variance included in the variance then becomes just a proportional constant and can justifiably be ignored as well.
Currently, bias-and-variance approaches in the input-dependent framework remain an open research issue.
8.3 Numerical Examples of Population-Based Active Learning Methods
In this section, we illustrate how the population-based active learning methods
described in section 8.2 behave under a controlled setting.
8.3.1 Setup
Let the input dimension be d =1 and the learning target function be
f(x) = l-x + x2 + or(x),
where
Z3-3z x-0.2
rex) = ,J6 with z =----oA' (8.19)
Note that the above rex) is the Hermite polynomial, which ensures the
orthonormality of rex) to the second-order polynomial model under a Gaus­
sian test input distribution (see below for details). Let the number of training
examples to gather be ntr =100, and we add i.i.d. Gaussian noise with mean
zero and standard deviation 0.3 to output values:
where N(E; fL, a2) denotes the Gaussian density with mean fL and variance a2
with respect to a random variable E. Let the test input density Pte(x) be the
Gaussian density with mean 0.2 and standard deviation 0.4:
Pte(x) =N(x; 0.2, (0.4)2). (8.20)
Pte(x) is assumed to be known in this illustrative simulation. See the bottom
graph of figure 8.5 for the profile of Pte(x).
Let us use the following second-order polynomial model:
Note that for this model and the test input density Pte(x) defined by equation
8.20, the residual function r(x) in equation 8.19 is orthogonal to the model and
normalized to 1 (see section 3.2.2).
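The claimed orthonormality of the residual r(x) in equation 8.19 can be verified numerically; the following sketch (ours) checks by Monte Carlo that r is orthogonal to 1, x, and x², and has unit norm, under p_te = N(0.2, 0.4²).

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(0.2, 0.4, 2_000_000)   # samples from p_te = N(0.2, 0.4^2)
z = (x - 0.2) / 0.4                   # standardized input, z ~ N(0, 1)
r = (z**3 - 3 * z) / np.sqrt(6)       # residual function of eq. 8.19

# First three means should be ~0 (orthogonality to the quadratic model),
# the last should be ~1 (unit norm under p_te).
print(np.mean(r), np.mean(x * r), np.mean(x**2 * r), np.mean(r**2))
```

Since x is an affine function of z, orthogonality to {1, z, z²} (a property of the third Hermite polynomial) implies orthogonality to {1, x, x²}.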
Figure 8.5
Learning target function (top; δ = 0, 0.005, 0.05) and input density functions (bottom; p_te(x), p_tr(x) with c = 0.8, and p*_tr(x)).
Let us consider the following three setups for the experiments:

  δ = 0, 0.005, 0.05,   (8.21)

which roughly correspond to the correctly specified, approximately correct, and misspecified cases, respectively. See the top graph of figure 8.5 for the profiles of f(x) with different δ.
Training input densities are chosen from the Gaussian densities with mean 0.2 and standard deviation 0.4c:

  p_tr(x) = N(x; 0.2, (0.4c)²),

where

  c = 0.8, 0.9, 1.0, ..., 2.5.

See the bottom graph of figure 8.5 for the profiles of p_tr(x) with different c.
In this experiment, we compare the performance of the following methods:
• ALICE: c is determined by the ALICE criterion (equation 8.15). IWLS (equation 8.9) is used for parameter learning.
• Input-independent variance-only method (W): c is determined by equation 8.13. IWLS (equation 8.9) is used for parameter learning. We denote this method by W since it uses importance-weighted least squares.
• Input-independent variance-only method* (W*): The closed-form solution p*_tr(x) given by equation 8.14 is used as the training input density. The profile of p*_tr(x) under the current setting is illustrated in the bottom graph of figure 8.5, showing that p*_tr(x) is similar to the Gaussian density with c = 1.3. IWLS (equation 8.9) is used for parameter learning.
• Input-independent bias-and-variance method (OW): First, ñ_tr training input points are created following the test input density p_te(x), and the corresponding training output values are observed. Based on these ñ_tr training examples, c is determined by equation 8.17. Then the (n_tr − ñ_tr) remaining training input points are created following the determined input density. The combination of OLS and IWLS (see equation 8.18) is used for estimating the parameters (the abbreviation OW comes from this). We set ñ_tr = 25, which we experimentally confirmed to be a reasonable choice in this illustrative simulation.
• Traditional method (O): c is determined by equation 8.8. OLS (equation 8.7) is used for parameter learning (the abbreviation O comes from this).
• Passive method (P): The training input points {x_i^tr}_{i=1}^{n_tr} are created following the test input density p_te(x). OLS (equation 8.7) is used for parameter learning.
For W*, we generate random samples following p*_tr(x) by rejection sampling (see, e.g., [95]). We repeat the simulation 1000 times for each δ in equation 8.21 by changing the random seed.
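The W* method needs samples from p*_tr(x) ∝ p_te(x) g(x), where g is the square-root factor of equation 8.14. A minimal rejection-sampling sketch with p_te as the proposal; the bounded stand-in g below is an illustrative assumption of ours, not the actual factor of the experiment.

```python
import numpy as np

rng = np.random.default_rng(6)

def g(x):
    # Bounded stand-in for the square-root weight factor (illustrative).
    return np.sqrt(1.0 + x**2)

def sample_p_star(n):
    # Rejection sampling for p*(x) ∝ p_te(x) g(x), proposal p_te = N(0.2, 0.4^2).
    out = []
    M = g(5.0)  # upper bound on g over the effective support of the proposal
    while len(out) < n:
        x = rng.normal(0.2, 0.4, n)
        u = rng.uniform(0.0, 1.0, n)
        out.extend(x[u < g(x) / M])   # accept with probability g(x)/M
    return np.array(out[:n])

samples = sample_p_star(10_000)
print(samples.mean(), samples.std())
```

Because the proposal is p_te itself, only the ratio g(x)/M enters the accept/reject step; a tighter bound M gives a higher acceptance rate.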
8.3.2 Accuracy of Generalization Error Estimation
First, we evaluate the accuracy of each active learning criterion as an estimator of the generalization error Gen (see equation 8.2). Note that ALICE and W are estimators of the generalization error obtained by IWLS (which we denote by Gen_IWLS). OW is also derived as an estimator of Gen_IWLS, but its final solution is computed by the combination of OLS and IWLS given by equation 8.18. Therefore, OW should be regarded as an estimator of the generalization error obtained by this learning method (which we denote by Gen_OW). O is an estimator of the generalization error obtained by OLS, which we denote by Gen_OLS.
In figure 8.6, the means and standard deviations of the true generalization errors Gen and their estimates over 1000 runs are depicted as functions of c by the solid curves; the generalization error estimated by each method is denoted by Ĝen in the figure. The upper and lower error bars are calculated separately, since the distribution is not symmetric. The dashed curves show the means of the generalization error which the corresponding active learning criteria are trying to estimate. Note that the criterion values of ALICE, W, and O are multiplied by σ² = (0.3)² so that comparison with Gen_IWLS and Gen_OLS is clear.

Figure 8.6
The means and (asymmetric) standard deviations of the true generalization error Gen and its estimates over 1000 runs as functions of c, for δ = 0 ("correctly specified"), δ = 0.005 ("approximately correct"), and δ = 0.05 ("misspecified"). The generalization error estimated by each method is denoted by Ĝen. Ĝen_ALICE, Ĝen_W, and Ĝen_O are multiplied by σ² = (0.3)² so that comparison with Gen_IWLS and Gen_OLS is clear. The dashed curves show the means of the generalization error which the corresponding active learning criteria are trying to estimate.
These graphs show that when δ = 0 (correctly specified), ALICE and W give accurate estimates of Gen_IWLS. Note that the criterion value of W does not depend on the training input points {x_i^tr}_{i=1}^{n_tr}; thus, it does not fluctuate over the 1000 runs. OW is slightly biased in the negative direction for small c; we conjecture that this is caused by a small-sample effect. However, the profile of OW still roughly approximates that of Gen_OW. O gives accurate predictions of Gen_OLS. When δ = 0.005 (approximately correct), ALICE, W, and OW behave similarly to the case with δ = 0; that is, ALICE and W are accurate, and OW is negatively biased. On the other hand, O behaves differently: it tends to be biased in the negative direction for large c. Finally, when δ = 0.05 (misspecified), ALICE and W still give accurate estimates, although they have a slight negative bias for small c. OW still roughly approximates Gen_OW, while O gives a totally different profile from Gen_OLS.
These results show that, as estimators of the generalization error, ALICE and W are accurate and robust against the misspecification of models. OW is also reasonably accurate, although it tends to be rather inaccurate for small c. O is accurate in the correctly specified case, but it becomes totally inaccurate once the correct model assumption is violated.
Note that, by definition, ALICE, W, and O do not depend on the learning target function. Therefore, in the simulation they give the same values for all δ (ALICE and O depend on the realization of {x_i^tr}_{i=1}^{n_tr}, so they may show small fluctuations). On the other hand, the true generalization error of course depends on the learning target function even if the model error δ is subtracted, since the training output values depend on it. Note that the bias depends on δ, but the variance does not. The simulation results show that the profile of Gen_OLS changes heavily as the degree of model misspecification increases; this is likely caused by the increase of the bias, since OLS is not unbiased even asymptotically. The criterion value of O, however, stays the same even when δ increases. As a result, O becomes a very poor generalization error estimator for large δ. In contrast, the profile of Gen_IWLS appears to be very stable against changes in δ, which is in good agreement with the theoretical fact that IWLS is asymptotically unbiased. Thanks to this property, ALICE and W are more accurate than O for misspecified models.
8.3.3 Obtained Generalization Error
In table 8.1, the mean and standard deviation of the generalization error obtained by each method are described.

Table 8.1
The mean and standard deviation of the generalization error obtained by population-based active learning methods

            ALICE          W             W*            OW            O             P
δ = 0        2.08 ± 1.95   2.40 ± 2.15   2.32 ± 2.02   3.09 ± 3.03   °1.31 ± 1.70   3.11 ± 2.78
δ = 0.005   °2.10 ± 1.96   2.43 ± 2.15   2.35 ± 2.02   3.13 ± 3.00    2.53 ± 2.23   3.14 ± 2.78
δ = 0.05    °2.11 ± 2.12   2.39 ± 2.26   2.34 ± 2.14   3.45 ± 3.58    121 ± 67.4    3.51 ± 3.43

For better comparison, the model error δ² is subtracted from the generalization error, and all values are multiplied by 10³. In each row of the table, the best method and the ones comparable to it according to the t-test (e.g., [78]) at the significance level 5 percent are indicated with °. The value of O for δ = 0.05 is extremely large, but this is not a typo.
When δ = 0, O works significantly better than the other methods. Actually, in this case, training input densities that approximately minimize Gen_IWLS, Gen_OW, and Gen_OLS were successfully found by ALICE, W, OW, and O. This implies that the difference in the error is caused not by the quality of the active learning criteria, but by the difference between IWLS and OLS: IWLS generally has larger variance than OLS [145]. Thus, OLS is more accurate than IWLS, since both of them are unbiased when δ = 0. Although ALICE, W, W*, and OW are outperformed by O, they still work better than P. Note that ALICE is significantly better than W, W*, OW, and P according to the t-test.
When δ = 0.005, ALICE gives significantly smaller errors than the other methods. All the methods except O work similarly to the case with δ = 0, while O tends to perform poorly. This result is surprising, since the learning target functions with δ = 0 and δ = 0.005 are visually almost the same, as illustrated in the top graph of figure 8.5; intuitively, the result for δ = 0.005 should not differ much from that for δ = 0. However, this slight difference appears to make O unreliable.
When δ = 0.05, ALICE again works significantly better than the other methods. W and W* still work reasonably well. The performance of OW is slightly degraded, although it is still better than P. O gives extremely large errors.
The above results are summarized as follows. For all three cases (δ = 0, 0.005, 0.05), ALICE, W, W*, and OW work reasonably well and consistently outperform P. Among them, ALICE tends to be better than W, W*, and OW in all three cases. O works excellently when the model is correctly specified, but tends to perform very poorly once the correct model assumption is violated.
8.4 Pool-Based Active Learning Methods
Population-based active learning methods are applicable only when the test input density p_te(x) is known (see section 8.1.1). On the other hand, pool-based active learning considers the situation where the test input distribution is unknown, but samples from that distribution are given. Let us denote the pooled test input samples by {x_j^te}_{j=1}^{n_te}.
The goal of pool-based active learning is to choose, from the pool of test input points {x_j^te}_{j=1}^{n_te}, the best input-point subset {x_i^tr}_{i=1}^{n_tr} (with n_tr ≪ n_te) for gathering output values {y_i^tr}_{i=1}^{n_tr}, so that the generalization error is minimized. If we had infinitely many test input samples, the pool-based problem would reduce to the population-based problem. Owing to the finite sample size, the pool-based active learning problem is generally harder to solve than the population-based problem. Note that the sample selection process of pool-based active learning (i.e., choosing a subset of test input points as training input points) resembles that of the sample selection bias model described in chapter 6.
8.4.1 Classical Active Learning Method for Correct Models and Its Limitations
The traditional active learning method explained in section 8.2.1 can be extended to the pool-based scenario in a straightforward way. Let us consider the same setup as in section 8.2.1, except that p_te(x) is unknown but its i.i.d. samples {x_j^te}_{j=1}^{n_te} are given.
The active learning criterion (equation 8.8) contains only the design matrix X^tr and the matrix U. X^tr is still accessible even in the pool-based scenario (see equation 8.6), but U is not accessible, since it includes the expectation over the unknown test input distribution (see equation 8.5). To cope with this problem, we simply replace U with an empirical estimate Û, that is, the b × b matrix with the (ℓ, ℓ')-th element

  Û_{ℓ,ℓ'} := (1/n_te) Σ_{j=1}^{n_te} φ_ℓ(x_j^te) φ_{ℓ'}(x_j^te).   (8.22)

Then we immediately have a pool-based active learning criterion:

  min_{{x_i^tr}_{i=1}^{n_tr} ⊂ {x_j^te}_{j=1}^{n_te}} [ tr( Û (X^trᵀ X^tr)^{−1} ) ].   (8.23)
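A sketch (ours, with an illustrative quadratic basis and Gaussian pool) of the empirical moment matrix Û of equation 8.22 and the evaluation of the pool-based criterion of equation 8.23 for one randomly chosen candidate subset:

```python
import numpy as np

rng = np.random.default_rng(7)

def phi(x):
    return np.vstack([np.ones_like(x), x, x**2]).T

x_pool = rng.normal(0.2, 0.4, 1000)            # pooled test input points
Phi_pool = phi(x_pool)
U_hat = Phi_pool.T @ Phi_pool / len(x_pool)    # empirical estimate of U (eq. 8.22)

# Evaluate the criterion of eq. 8.23 for one candidate subset of n_tr = 100 points.
subset = rng.choice(len(x_pool), size=100, replace=False)
X = phi(x_pool[subset])
criterion = np.trace(U_hat @ np.linalg.inv(X.T @ X))
print(criterion)
```

In practice one would compare this value across many candidate subsets (or resampling weights), which is exactly the combinatorial difficulty discussed next.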
Unfortunately, this method has the same drawbacks as the population-based method:
• It requires the unrealistic assumption that the model at hand is correctly specified.
• The n_tr input points need to be optimized simultaneously, which may not be computationally tractable.
In order to overcome these drawbacks, we again employ importance-weighted least squares (IWLS; see section 2.2.1). However, in pool-based active learning, we cannot create training input points {x_i^tr}_{i=1}^{n_tr} at arbitrary locations; we are allowed to choose the training input points only from {x_j^te}_{j=1}^{n_te}. In order to meet this requirement, we restrict the search space of the training input density p_tr(x) appropriately. More specifically, we consider a resampling weight function ξ(x^te), which is a discrete probability function over the pooled points {x_j^te}_{j=1}^{n_te}. This means that the training input probability p_tr(x_j^te) is defined over the test input probability p_te(x_j^te) as

  p_tr(x_j^te) ∝ ξ(x_j^te) p_te(x_j^te),   (8.24)

where ξ(x_j^te) > 0 for j = 1, 2, ..., n_te, and

  Σ_{j=1}^{n_te} ξ(x_j^te) = 1.
8.4.2 Input-Independent Variance-Only Method
The population-based input-independent variance-only active learning criterion (see section 8.2.3) can be extended to pool-based scenarios as follows.
The active learning criterion (equation 8.13) contains only the matrices U and T. The matrix U may be replaced by its empirical estimator Û (see equation 8.22), but approximating T is not straightforward, since its empirical approximation T̂ still contains the importance values at the pooled points (see equation 8.12):

  T̂_{ℓ,ℓ'} := (1/n_te) Σ_{j=1}^{n_te} φ_ℓ(x_j^te) φ_{ℓ'}(x_j^te) w(x_j^te),

where

  w(x_j^te) = p_te(x_j^te) / p_tr(x_j^te).

Thus, T̂ still cannot be computed directly, since p_te(x_j^te) is inaccessible in the pool-based scenario. A possible solution to this problem is to use an importance estimation method described in chapter 4, but this produces some estimation error.
Here, we can utilize the fact that the training input probability p_tr(x_j^te) is defined over the test input probability p_te(x_j^te) through the resampling weight function ξ(x_j^te) (see equation 8.24). Indeed, equation 8.24 immediately shows that the importance weight can be expressed as follows [87]:

  w(x_j^te) ∝ 1 / ξ(x_j^te).   (8.25)

Note that the proportionality constant is not needed in active learning, since we are not interested in estimating the generalization error itself, but only in finding the minimizer of the generalization error. Thus, the right-hand side of equation 8.25 is sufficient for active learning purposes, and it does not produce any estimation error:

  ξ* = argmin_ξ [ tr( Û^{−1} T̂ ) ],   (8.26)

where T̂ is the b × b matrix with the (ℓ, ℓ')-th element

  T̂_{ℓ,ℓ'} := (1/n_te) Σ_{j=1}^{n_te} φ_ℓ(x_j^te) φ_{ℓ'}(x_j^te) / ξ(x_j^te).

Note also that equation 8.25 is sufficient for computing the IWLS solution (see section 2.2.1):

  θ̂_IWLS = argmin_θ [ Σ_{i=1}^{n_tr} ( 1 / ξ(x_i^tr) ) ( f̂(x_i^tr; θ) − y_i^tr )² ].   (8.27)

A notable feature of equation 8.26 is that the optimal resampling weight function ξ*(x) can be obtained in closed form [163]:

  ξ*(x_j^te) ∝ ( Σ_{ℓ,ℓ'=1}^{b} [Û^{−1}]_{ℓ,ℓ'} φ_ℓ(x_j^te) φ_{ℓ'}(x_j^te) )^{1/2}.
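The closed-form resampling weights can be computed for the whole pool at once. This sketch (our basis and pool) evaluates ξ*(x_j^te) up to a constant, samples a training subset with those probabilities, and forms the IWLS weights 1/ξ* of equation 8.25:

```python
import numpy as np

rng = np.random.default_rng(8)

def phi(x):
    return np.vstack([np.ones_like(x), x, x**2]).T

x_pool = rng.normal(0.2, 0.4, 2000)    # pooled test input points
Phi = phi(x_pool)
U_hat = Phi.T @ Phi / len(x_pool)      # empirical U (eq. 8.22)
U_inv = np.linalg.inv(U_hat)

# xi*(x_j) up to a constant: sqrt of the quadratic form phi' U^-1 phi per point.
xi = np.sqrt(np.einsum("nl,lk,nk->n", Phi, U_inv, Phi))
prob = xi / xi.sum()
chosen = rng.choice(len(x_pool), size=100, replace=False, p=prob)
x_tr = x_pool[chosen]

# IWLS importance weights are proportional to 1/xi* (eq. 8.25).
w = 1.0 / xi[chosen]
print(x_tr.std(), w.min())
```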
8.4.3 Input-Dependent Variance-Only Method
The population-based input-dependent variance-only active learning method (see section 8.2.4) can be extended to the pool-based scenario in a similar way.
We can obtain a pool-based version of the active learning criterion ALICE, given by equation 8.15, as follows. The metric matrix U is replaced with its empirical approximator Û, and the importance value p_te(x_i^tr)/p_tr(x_i^tr) (included in L_IWLS) is computed by equation 8.25. Then we have

  min_ξ [ tr( Û L_IWLS L_IWLSᵀ ) ].   (8.28)

The criterion (equation 8.28) is called PALICE (Pool-based ALICE). Thanks to its input dependence, PALICE would be more accurate as a generalization error estimator than its input-independent counterpart (equation 8.26) (see section 8.2.4).
However, no analytic solution is known for the PALICE criterion, and therefore an exhaustive search strategy is needed to obtain a better solution. Here, we can use the same idea as in the population-based case: the analytic optimal solution ξ* of the input-independent variance-only active learning criterion (equation 8.26) is used as a baseline, and one searches for a better solution in the vicinity of ξ*. In practice, we may adopt a simple one-dimensional search strategy: find the best resampling weight function with respect to a scalar parameter v:

  ξ_v(x_j^te) ∝ ( Σ_{ℓ,ℓ'=1}^{b} [Û^{−1}]_{ℓ,ℓ'} φ_ℓ(x_j^te) φ_{ℓ'}(x_j^te) )^{v/2}.   (8.29)

The parameter v controls the "shape" of the training input distribution: when v = 0, the resampling weight is uniform over all test input samples. Thus, the above choice includes passive learning (where the training and test distributions are equivalent) as a special case. In practice, the solution search may be carried out intensively around v = 1.
Pseudo code of the PALICE algorithm is summarized in figure 8.7.
8.4.4 Input-Independent Bias-and-Variance Approach
Finally, we show how the population-based input-independent bias-and-variance active learning method (see section 8.2.5) can be extended to the pool-based scenario.
As explained in section 8.2.3, if o(n_tr^{−3/2}) terms are ignored, the generalization error in the input-independent analysis is expressed as

  (1/n_tr) tr( U^{−1} H ),
Input: A pool of test input points {x_j^te}_{j=1}^{n_te} and basis functions {φ_ℓ(x)}_{ℓ=1}^b
Output: Learned parameter θ̂

Compute the b × b matrix Û with Û_{ℓ,ℓ'} = (1/n_te) Σ_{j=1}^{n_te} φ_ℓ(x_j^te) φ_{ℓ'}(x_j^te);
For several different values of v (possibly around v = 1)
    Compute {ζ_v(x_j^te)}_{j=1}^{n_te} with ζ_v(x) = (Σ_{ℓ,ℓ'=1}^b [Û^{-1}]_{ℓ,ℓ'} φ_ℓ(x) φ_{ℓ'}(x))^v;
    Choose x_v^tr = {x_i^tr}_{i=1}^{n_tr} from {x_j^te}_{j=1}^{n_te}
        with probability proportional to {ζ_v(x_j^te)}_{j=1}^{n_te};
    Compute the n_tr × b matrix X_v with [X_v]_{i,ℓ} = φ_ℓ(x_i^tr);
    Compute the n_tr × n_tr diagonal matrix W_v with [W_v]_{i,i} = (ζ_v(x_i^tr))^{-1};
    Compute L_v = (X_v^T W_v X_v)^{-1} X_v^T W_v;
    Compute PALICE(v) = tr(Û L_v L_v^T);
End
Compute v̂ = argmin_v [PALICE(v)];
Gather training output values y^tr = (y_1^tr, y_2^tr, ..., y_{n_tr}^tr)^T at x_v̂^tr;
Compute θ̂ = L_v̂ y^tr;

Figure 8.7
Pseudo code of the PALICE algorithm.
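The loop of figure 8.7 can be sketched in a few lines of NumPy. This is our own illustrative implementation for a toy setting, not the authors' code: the function and variable names are ours, and it assumes the pool is resampled without replacement.

```python
import numpy as np

def palice_select(x_pool, basis, n_tr, v_candidates, seed=0):
    """Choose n_tr training inputs from the test pool by the PALICE search
    over the scalar resampling parameter v (figure 8.7)."""
    rng = np.random.default_rng(seed)
    Phi = basis(x_pool)                        # (n_te, b) design matrix
    n_te, b = Phi.shape
    U = Phi.T @ Phi / n_te                     # empirical metric matrix U-hat
    Uinv = np.linalg.inv(U)
    best = None
    for v in v_candidates:
        # zeta_v(x) = (phi(x)^T U^{-1} phi(x))^v, evaluated on the pool
        zeta = np.einsum('nb,bc,nc->n', Phi, Uinv, Phi) ** v
        p = zeta / zeta.sum()
        idx = rng.choice(n_te, size=n_tr, replace=False, p=p)
        X, w = Phi[idx], 1.0 / zeta[idx]       # importance weights 1/zeta_v
        L = np.linalg.solve(X.T @ (w[:, None] * X), X.T * w)  # L_v
        score = np.trace(U @ L @ L.T)          # PALICE(v)
        if best is None or score < best[0]:
            best = (score, v, idx, L)
    return best  # (PALICE value, chosen v, chosen indices, IWLS matrix L_v)
```

After the search, the parameter is learned as θ̂ = L @ y_tr once the outputs at the selected points have been gathered.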
where H is defined by equation 8.11. It has been shown [89] that the optimal
training input density which minimizes the above asymptotic generalization
error is proportional to

p_te(x) (Σ_{ℓ,ℓ'=1}^b [U^{-1}]_{ℓ,ℓ'} φ_ℓ(x) φ_{ℓ'}(x) (δ² r²(x) + σ²))^{1/2},

where δr(x) is the residual function which cannot be approximated by the
model at hand (see section 8.1.2 and also figure 3.2 in section 3.2.2). Then the
optimal resampling weight function can be obtained as

(Σ_{ℓ,ℓ'=1}^b [U^{-1}]_{ℓ,ℓ'} φ_ℓ(x) φ_{ℓ'}(x) (δ² r²(x) + σ²))^{1/2}.
However, since (δ² r²(x) + σ²) is inaccessible, the above closed-form solution
cannot be used directly for active learning. To cope with this problem,
a regression method can be used in a two-stage sampling framework [87]. In
the first stage, ñ_tr (≤ n_tr) training input points {x̃_i^tr}_{i=1}^{ñ_tr} are uniformly chosen
from the pool {x_j^te}_{j=1}^{n_te}, and the corresponding training output values {ỹ_i^tr}_{i=1}^{ñ_tr} are
observed. It can be shown that a consistent estimator of the value of the optimal
resampling weight function at x̃_i^tr (i = 1, 2, ..., ñ_tr) is given by
Based on the input-output samples {(x̃_i^tr, ζ̂_i)}_{i=1}^{ñ_tr}, the optimal resampling
weight function can be learned by a regression method. Let us denote the
learned resampling weight function by ζ̂(x). Since the value of ζ̂(x) is
available at any input location x,

{ζ̂(x_j^te)}_{j=1}^{n_te}   (8.30)

can be computed and used for resampling from the pooled input points. Then,
in the second stage, the remaining (n_tr − ñ_tr) training input points {x_i^tr}_{i=ñ_tr+1}^{n_tr}
are chosen following the learned resampling weight function, and the corresponding
training output values {y_i^tr}_{i=ñ_tr+1}^{n_tr} are gathered. Finally, the parameter θ is
learned, using {(x̃_i^tr, ỹ_i^tr)}_{i=1}^{ñ_tr} and {(x_i^tr, y_i^tr)}_{i=ñ_tr+1}^{n_tr}, as

θ̂ = argmin_θ [ Σ_{i=1}^{ñ_tr} (f̂(x̃_i^tr; θ) − ỹ_i^tr)² + Σ_{i=ñ_tr+1}^{n_tr} (f̂(x_i^tr; θ) − y_i^tr)² / ζ̂(x_i^tr) ].   (8.31)
However, this method suffers from the limitations of the two-stage approach
pointed out in section 8.2.5. Furthermore, obtaining a good approximation
ζ̂(x) by regression is generally difficult, so this method may not be so
reliable in practice.
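For a linear-in-parameters model, equation 8.31 is an importance-weighted least squares problem and has a closed-form solution. A minimal sketch with our own names, where `zeta2` holds the learned weights ζ̂(x_i^tr) of the second-stage points:

```python
import numpy as np

def two_stage_wls(X1, y1, X2, y2, zeta2):
    """Equation 8.31: first-stage samples (drawn uniformly from the pool)
    enter with unit weight; second-stage samples are down-weighted by
    1/zeta_hat(x) to compensate for the resampling."""
    w = np.concatenate([np.ones(len(y1)), 1.0 / np.asarray(zeta2, float)])
    X = np.vstack([X1, X2])
    y = np.concatenate([y1, y2])
    sw = np.sqrt(w)                 # weighted LSQ via rescaled ordinary LSQ
    theta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
    return theta
```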
8.5 Numerical Examples of Pool-Based Active Learning Methods
In this section, we illustrate the behavior of the pool-based active learning
methods described in section 8.4.
Let the input dimension be d = 1, and let the learning target function be

f(x) = 1 − x + x² + δ r(x),

where

r(x) = (z³ − 3z)/√6  with  z = (x − 0.2)/0.4.   (8.32)

The reason for this choice of r(x) will be explained later. Let us consider the
following three cases:

δ = 0, 0.03, 0.06.
Figure 8.8
Learning target function f(x) for δ = 0, 0.03, 0.06 (top) and input density functions (bottom).
See the top graph of figure 8.8 for the profiles of f(x) with different δ.
Let the number of training samples to gather be n_tr = 100. We add i.i.d. Gaussian
noise with mean zero and standard deviation σ = 0.3 to the output values. Let the
test input density p_te(x) be the Gaussian density with mean 0.2 and standard
deviation 0.4; however, p_te(x) is treated as unknown here. See the bottom graph
of figure 8.8 for the profile of p_te(x). Let us draw n_te = 1000 test input points
independently from the test input distribution.
A polynomial model of order 2 is used for learning:

f̂(x; θ) = θ₁ + θ₂x + θ₃x².   (8.33)

Note that for these basis functions, the residual function r(x) in equation 8.32
is orthogonal to the model and normalized to 1 (see section 3.2.2).
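The setting above is easy to reproduce numerically. The sketch below (our own code, not the authors') generates the toy data and lets one check that r(x) is orthogonal to the order-2 basis under p_te and normalized to 1: z³ − 3z is the third Hermite polynomial, which under z ~ N(0, 1) has second moment 3! = 6 (hence the 1/√6 factor) and is orthogonal to all polynomials of degree at most 2.

```python
import numpy as np

def residual(x):
    # r(x) = (z^3 - 3z)/sqrt(6) with z = (x - 0.2)/0.4  (equation 8.32)
    z = (x - 0.2) / 0.4
    return (z**3 - 3 * z) / np.sqrt(6)

def target(x, delta):
    # f(x) = 1 - x + x^2 + delta * r(x)
    return 1 - x + x**2 + delta * residual(x)

def design(x):
    # order-2 polynomial model (equation 8.33)
    return np.stack([np.ones_like(x), x, x**2], axis=1)

rng = np.random.default_rng(0)
x_te = rng.normal(0.2, 0.4, size=1000)                       # pool from p_te
y_tr = target(x_te[:100], 0.03) + rng.normal(0, 0.3, 100)    # sigma = 0.3
```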
In this experiment, we compare the performance of the following sampling
strategies:

• PALICE Training input points are chosen following equation 8.29 for

v ∈ {0, 0.2, 0.4, ..., 2} ∪ {0.8, 0.82, 0.84, ..., 1.2}.   (8.34)
Then the best value of v is chosen from the above candidates based on
equation 8.28. IWLS (equation 8.27) is used for parameter learning.

• Pool-based input-independent variance-only method (PW) Training
input points are chosen following equation 8.26 (or equivalently equation 8.29
with v = 1). IWLS (equation 8.27) is used for parameter learning.

• Pool-based traditional method (PO) Training input points are chosen following
equation 8.29 for the candidates in equation 8.34, and the best value of v is chosen based
on equation 8.23. OLS (equation 8.7) is used for parameter learning.

• Pool-based input-independent bias-and-variance method (POW) Initially,
ñ_tr training input-output samples are gathered based on the test input
distribution, and they are used for learning the resampling weight function. The
resampling weight function is learned by kernel ridge regression with Gaussian
kernels, where the Gaussian width and the ridge parameter are optimized by
fivefold cross-validation with an exhaustive grid search. Then the remaining
n_tr − ñ_tr training input points are chosen based on equation 8.30. The combination
of OLS and IWLS (see equation 8.31) is used for parameter learning. We
set ñ_tr = 25.

• Passive method (P) Training input points are drawn uniformly from the
pool of test input samples (or equivalently equation 8.29 with v = 0). OLS is
used for parameter learning.
For reference, the profile of p*_tr(x) (= p_te(x) ζ*(x)), the optimal training
input density for PW (see section 8.4.2), is also depicted in the bottom graph
of figure 8.8.
Table 8.2 reports the generalization error obtained by each method.
The numbers in the table are means and standard deviations over 100 trials.
For better comparison, the model error δ² is subtracted from the obtained
error, and all values are multiplied by 10³. In each row of the table, the best
method and the ones comparable to it according to the Wilcoxon signed-rank test
(e.g., [78]) at the significance level 5 percent are indicated by '°'.
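The comparison protocol can be mimicked with a short, self-contained sketch of the two-sided Wilcoxon signed-rank test (normal approximation, average ranks for ties, zero differences discarded). This is our own implementation; for small trial counts an exact-distribution routine such as `scipy.stats.wilcoxon` would be preferable.

```python
import math
import numpy as np

def wilcoxon_signed_rank(a, b):
    """Two-sided paired Wilcoxon signed-rank test, normal approximation.
    Returns the approximate p-value; assumes at least one nonzero difference."""
    d = np.asarray(a, float) - np.asarray(b, float)
    d = d[d != 0]                       # discard zero differences
    n = d.size
    absd = np.abs(d)
    order = np.argsort(absd)
    ranks = np.empty(n)
    ranks[order] = np.arange(1, n + 1)
    for v in np.unique(absd):           # average the ranks of tied |d| values
        mask = absd == v
        ranks[mask] = ranks[mask].mean()
    w_plus = ranks[d > 0].sum()         # sum of ranks of positive differences
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```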
When δ = 0, PO works best, followed by PALICE. These two methods
have no statistically significant difference and are significantly better than
the other methods. When δ is increased from 0 to 0.03, the performances of
PALICE and PW are almost unchanged, while the performance of PO is considerably
degraded. Consequently, PALICE gives the best performance among
all methods. When δ is further increased to 0.06, the performance of PALICE and PW
is still almost unchanged. On the other hand, PO performs very poorly and
is outperformed even by the baseline passive method. POW does not
work well in any of the three cases.
Table 8.2
The mean and standard deviation of the generalization error obtained by pool-based active learning
methods

            PALICE        PW           PO           POW          P
δ = 0      °2.03 ± 1.81   2.59 ± 1.83  °1.82 ± 1.69  6.43 ± 6.61  3.10 ± 3.09
δ = 0.03   °2.17 ± 2.04   2.81 ± 2.01   2.62 ± 2.05  6.66 ± 6.54  3.40 ± 3.55
δ = 0.06   °2.42 ± 2.65   3.19 ± 2.59   4.85 ± 3.37  7.65 ± 7.21  4.12 ± 4.71
Average    °2.21 ± 2.19   2.86 ± 2.18   3.10 ± 2.78  6.91 ± 6.79  3.54 ± 3.85

For better comparison, the model error δ² is subtracted from the error, and all values are multiplied
by 10³. In each row of the table, the best method and the ones comparable to it by the Wilcoxon signed-rank
test at the significance level 5 percent are indicated by '°'.
Overall, PALICE and PW are shown to be highly robust against model
misspecification, while PO is very sensitive to the violation of the correct
model assumption. PALICE significantly outperforms PW, because ALICE is
a more accurate estimator of the single-trial generalization error than W (see
section 8.2.4).
8.6 Summary and Discussion
Active learning is an important issue, particularly when the cost of sampling
output values is very expensive. However, estimating the generalization error
before observing samples is hard. A standard strategy for generalization error
estimation in active learning scenarios is to assume that the bias is small
enough to be neglected, and to focus on the variance (section 8.1.3).
We first addressed the population-based active learning problems, where the
test input distribution is known. We began our discussion by pointing out that
the traditional active learning method based on ordinary least squares is not
practical since it requires the model to be correctly specified (section 8.2.1).
To overcome this problem, active learning methods based on importance-weighted
least squares have been developed (sections 8.2.3, 8.2.4, and 8.2.5);
among them, the input-dependent variance-only method called ALICE was
shown to perform excellently in experiments (section 8.3).
We then turned our focus to the pool-based active learning scenarios where
the test input distribution is unknown but a pool of test input samples is
given. We showed that all the population-based methods explained above can
be extended to the pool-based scenarios (sections 8.4.2, 8.4.3, and 8.4.4).
The simulations showed that the method called PALICE works very well
(section 8.5).
Table 8.3
Active learning methods

                          Population-based              Pool-based
                          Input          Input          Input          Input
                          independent    dependent      independent    dependent
Correct                   -              section 8.2.1  -              section 8.4.1
Approximately correct     section 8.2.3  section 8.2.4  section 8.4.2  section 8.4.3
Completely misspecified   section 8.2.5  -              section 8.4.4  -
In ALICE or PALICE, we need to prepare reasonable candidates for training
input distributions. We introduced a practical heuristic of searching around the
analytic solution obtained by other methods (sections 8.2.4 and 8.4.3), which
was shown to be reasonable through experiments. However, there may still
be room for further improvement, and it is important to explore alternative
strategies for preparing better candidates.
The active learning methods covered in this chapter are summarized in
table 8.3.
We have focused on regression scenarios in this chapter. A natural next step is
to extend the same ideas to classification scenarios. The conceptual issues we
have addressed in this chapter, namely the usefulness of the input-dependent
analysis of the generalization error and the practical importance of dealing with
approximately correct models, remain valid in classification scenarios.
Developing active learning methods for classification based on these
conceptual ideas will be promising.
The ALICE and PALICE criteria are random variables which depend not
only on training input distributions, but also on realizations of training input
points. This is why the minimizer of ALICE or PALICE cannot be obtained
analytically. On the other hand, this fact implies that ALICE and PALICE allow
one to evaluate the goodness not only of training input distributions but also
of realizations of training input points. It will be interesting to investigate this
issue systematically.
The ALICE and PALICE methods have been shown to be robust against
the existence of bias. However, if the input dimensionality is very high, the
variance tends to dominate the bias due to the small sample size, and therefore
the advantages of these methods tend to be lost. More critically, regression from
data samples is highly unreliable in such high-dimensional problems due to
extremely large variance. To address this issue, it will be important to first
reduce the dimensionality of the data [154, 178, 156, 175], which is another
challenge in active learning research. Active learning for classification in
high-dimensional problems is discussed in, for instance, [115, 139].
We have focused on linear models. However, the importance-weighting
technique used to compensate for the bias caused by model misspecification
is valid for any empirical error-based methods (see chapter 2). Thus, another
important direction to be pursued is extending the current active learning ideas
to more complex models, such as support vector machines [193] and neural
networks [20].
9
Active Learning with Model Selection
In chapters 3 and 8, we addressed the problems of model selection¹ and active
learning. When discussing model selection strategies, we assumed that the
training input points had been fixed. On the other hand, when discussing
active learning strategies, we assumed that the model had been fixed.
Although the problems of active learning and model selection share the
common goal of minimizing the generalization error, they have so far been studied
as two independent problems. If active learning and model selection
are performed at the same time, the generalization performance can be further
improved. We call the problem of simultaneously optimizing the training
input distribution and the model active learning with model selection. This is the
problem we address in this chapter.
Below, we focus on the model selection criterion IWSIC (equation 3.11) and
the population-based active learning criterion ALICE (equation 8.15).
However, the fundamental idea explained in this chapter is applicable to any model
selection and active learning criteria.
9.1 Direct Approach and the Active Learning/Model Selection Dilemma
A naive and direct solution to the problem of active learning with model selection
is to optimize the training input distribution and the model simultaneously.
However, this direct approach may not be possible simply by combining existing
active learning methods and model selection methods in a batch manner,
due to the active learning/model selection dilemma [166]: when selecting
the training input density p_tr(x) with existing active learning methods, the
model must have been fixed [48, 107, 33, 55, 199, 89, 153]. On the other hand,
when choosing the model with existing model selection methods, the training
input points {x_i^tr}_{i=1}^{n_tr} (or the training input density p_tr(x)) must have been
fixed, and the corresponding training output values {y_i^tr}_{i=1}^{n_tr} must have been
gathered [2, 135, 142, 35, 145, 162]. For example, the active learning criterion
(equation 8.15) cannot be computed without fixing the model M, and the
model selection criterion (equation 3.11) cannot be computed without fixing
the training input density p_tr(x).

1. "Model selection" refers to the selection of various tunable factors M, including the basis functions,
the regularization parameter, and the flattening parameter.
If training input points that are optimal for all model candidates exist, it
is possible to perform active learning and model selection at the same time
without facing the active learning/model selection dilemma: choose the
training input points {x_i^tr}_{i=1}^{n_tr} for some model M by an active learning method
(e.g., equation 8.15), gather the corresponding output values {y_i^tr}_{i=1}^{n_tr}, and choose
a model by a model selection method (e.g., equation 3.11). It has been shown that
such common optimal training input points exist for a class of correctly specified
trigonometric polynomial regression models [166]. However, common
optimal training input points may not exist in general, and thus the range of
application of this approach is limited.
9.2 Sequential Approach
A standard approach to coping with the above active learning/model selection
dilemma for arbitrary models is the sequential approach [106]. That is, a model
is iteratively chosen by a model selection method, and the next input point (or
a small batch) is optimized for the chosen model by an active learning method
(see figure 9.1a).

In the sequential approach, the chosen model varies through the sequential
learning process (see the dashed line in figure 9.1b). We refer to this phenomenon
as the model drift. The model drift phenomenon can be a weakness
of the sequential approach, since the location of optimal training input points
depends on the target model in active learning; thus a good training input point
for one model could be poor for another model. Depending on the transition of
the chosen models, the sequential approach can work very well. For example,
when the transition of the model is the solid line in figure 9.1b, most
of the training input points are chosen for the finally selected model M_{n_tr}, and
the sequential approach will have an excellent performance. However, when
the transition of the model is the dotted line in figure 9.1b, the performance
becomes poor, since most of the training input points are chosen for other models.
Note that we cannot control the transition of the model as desired, since we
do not know a priori which model will be chosen in the end. For this reason,
the sequential approach can be unreliable in practice.
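The sequential approach of figure 9.1a can be written as a generic loop. In the sketch below, the callables are placeholders for a model selection criterion (e.g., IWSIC) and an active learning criterion (e.g., ALICE); all names are ours.

```python
def sequential_active_learning(models, select_model, select_input,
                               observe, n_rounds, batch=1):
    """Figure 9.1a: alternate model selection and active learning.

    select_model(models, data) -> model chosen for the data gathered so far
    select_input(model, data)  -> next input point optimized for that model
    observe(x)                 -> output value sampled at x
    """
    data = []
    for _ in range(n_rounds):
        model = select_model(models, data)   # model selection step
        for _ in range(batch):               # gather a small batch of points
            x = select_input(model, data)    # active learning step
            data.append((x, observe(x)))
    final_model = select_model(models, data)
    return final_model, data
```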
Figure 9.1
Sequential approach. (a) Diagram: a model is chosen by model selection, the next training input point x_{i+1} is chosen for that model, and the output value y_{i+1} is gathered at x_{i+1}; this is repeated until enough samples are collected. (b) Transition of chosen models as a function of the number of training samples.
Another issue that needs to be taken into account in the sequential approach
is that the training input points are not i.i.d. in general: the choice of the
(i + 1)-th training input point x_{i+1}^tr depends on the previously gathered samples
{(x_j^tr, y_j^tr)}_{j=1}^i. Since standard active learning and model selection methods
require the i.i.d. assumption for establishing their statistical properties, such as
consistency or unbiasedness, they may not be directly employed in the sequential
approach [9]. The active learning criterion ALICE (equation 8.15) and the
model selection criterion IWSIC (equation 3.11) also suffer from the violation
of the i.i.d. condition, and they may lose their consistency and unbiasedness.
However, this problem can be settled by slightly modifying the criteria, which
is an advantage of ALICE and IWSIC: suppose we draw u input points from
p_tr^{(i)}(x) in each iteration (let n_tr = uv, where v is the number of iterations). If
u tends to infinity, simply redefining the diagonal matrix W as follows makes
ALICE and IWSIC still consistent and asymptotically unbiased:

W_{k,k} = p_te(x_k^tr) / p_tr^{(i)}(x_k^tr),   (9.1)

where k = (i − 1)u + j, i = 1, 2, ..., v, and j = 1, 2, ..., u.
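Equation 9.1 can be assembled directly: the k-th diagonal entry uses the sampling density of the iteration in which the k-th point was drawn. A small sketch with our own names and 0-based indices:

```python
import numpy as np

def sequential_importance_weights(x_tr, p_te, p_tr_per_iter, u):
    """Equation 9.1: W_{k,k} = p_te(x_k) / p_tr^{(i)}(x_k), where point k was
    drawn in iteration i = k // u (u points per iteration, 0-based)."""
    n = len(x_tr)
    w = np.empty(n)
    for k in range(n):
        i = k // u                                 # iteration index
        w[k] = p_te(x_tr[k]) / p_tr_per_iter[i](x_tr[k])
    return np.diag(w)
```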
9.3 Batch Approach

An alternative approach to active learning with model selection is to choose
all the training input points for an initially chosen model M_0. We refer to
this approach as the batch approach (see figure 9.2a). By its nature, this
approach does not suffer from the model drift (cf. figure 9.1b); the batch
approach can be optimal in terms of active learning if the initially chosen model
M_0 agrees with the finally chosen model M_{n_tr} (see the solid line in figure 9.2b).

In order to choose the initial model M_0, we would need a generalization error
estimator that can be computed before observing training samples, such as
the generalization error estimator in equation 8.15. However, this does not
work well, since equation 8.15 evaluates only the variance of the estimator
(see equation 8.4); thus, using equation 8.15 for choosing the initial model M_0
always results in selecting the simplest model among the candidates. Note
that this problem is not specific to the generalization error estimator in
equation 8.15, but is common to most generalization error estimators, since it is
generally not possible to estimate the bias of the estimator (see equation 8.3)
before observing training samples. Therefore, in practice, one may have to
choose the initial model M_0 randomly. If one has some prior preference over
models, p(M), the initial model may be drawn according to it; otherwise, one
has to choose the initial model M_0 uniformly at random.

Due to the randomness of the initial model choice, the performance of
the batch approach may be unreliable in practice (see the dashed line in
figure 9.2b).
Figure 9.2
Batch approach. (a) Diagram. (b) Transition of chosen models: the approach is optimal when the initially chosen model coincides with the finally chosen model, and poor otherwise.
9.4 Ensemble Active Learning
As pointed out above, the sequential and batch approaches have potential
limitations. In this section, we describe a method of active learning with model
selection that can cope with these limitations [167].

The weakness of the batch approach lies in the fact that the training input
points chosen by an active learning method are overfitted to the initially chosen
model; the training input points optimized for the initial model could be poor
if a different model is chosen later.

We may reduce the risk of overfitting by not optimizing the training input
distribution specifically for a single model, but by optimizing it for all model
candidates (see figure 9.3). This allows all the models to contribute to the
optimization of the training input distribution, and thus we can hedge the risk of
overfitting to a single (possibly inferior) model. Since this idea can be viewed
as applying the popular idea of ensemble learning to the problem of active
learning, this approach is called ensemble active learning (EAL).

The idea of ensemble active learning can be realized by determining the
training input density p_tr(x) so that the expected generalization error over all
model candidates is minimized:

min over p_tr of [ Σ_M ALICE_M p(M) ],

where ALICE_M denotes the value of the active learning criterion ALICE for a
model M (see equation 8.15), and p(M) is the prior preference for the model M.
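The ensemble criterion is a prior-weighted sum over model candidates. The sketch below uses our own names and assumes an `alice(candidate, model)` callable returning the ALICE value of a training-input-distribution candidate for a given model.

```python
import numpy as np

def ensemble_active_learning(candidates, models, prior, alice):
    """Section 9.4: choose the training input distribution candidate that
    minimizes the prior-weighted sum of per-model ALICE scores."""
    scores = [sum(p * alice(c, m) for p, m in zip(prior, models))
              for c in candidates]
    k = int(np.argmin(scores))
    return candidates[k], scores
```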
Figure 9.3
The ensemble approach. (a) Diagram: all training input points {x_i}_{i=1}^n are chosen at once for the ensemble of all models, and all output values {y_i}_{i=1}^n are gathered at {x_i}_{i=1}^n. (b) Transition of chosen models.
If no prior information on the goodness/preference of the models is available,
the uniform prior may be used. In the next section, we experimentally show
that this ensemble approach significantly outperforms the sequential and batch
approaches.
9.5 Numerical Examples
Here, we illustrate how the sequential (section 9.2), batch (section 9.3), and
ensemble (section 9.4) methods behave using a one-dimensional data set.
9.5.1 Setting

Let the input dimension be d = 1, and let the target function f(x) be the following
third-order polynomial (see the top graph of figure 9.4):

f(x) = 1 − x + x² + 0.05 r(x),

where

r(x) = (z³ − 3z)/√6  with  z = (x − 0.2)/0.4.
Figure 9.4
Target function f(x), training input densities p_tr(x) (e.g., c = 0.8 and c = 2.5), and test input density p_te(x).
Let the test input density p_te(x) be the Gaussian density with mean 0.2
and standard deviation 0.4, which is assumed to be known in this illustrative
simulation. We choose the training input density p_tr(x) from a set of Gaussian
densities with mean 0.2 and standard deviation 0.4c, where

c = 0.8, 0.9, 1.0, ..., 2.5.

These density functions are illustrated in the bottom graph of figure 9.4. We
add i.i.d. Gaussian noise with mean zero and standard deviation 0.3 to the
training output values.

Let us use the second-order polynomial model

f̂(x; θ) = θ₁ + θ₂x + θ₃x².

Note that the target function f(x), which is a third-order polynomial, is not
realizable by the second-order model.

The parameters are learned by adaptive importance-weighted least squares
(AIWLS; see section 2.2.1):

θ̂_AIWLS = argmin_θ [ Σ_{i=1}^{n_tr} (p_te(x_i^tr)/p_tr(x_i^tr))^γ (f̂(x_i^tr; θ) − y_i^tr)² ],

where γ is the flattening parameter (0 ≤ γ ≤ 1) for controlling the bias-variance
trade-off. Here, we focus on choosing the flattening parameter γ by
model selection; γ is selected from

γ = 0, 0.5, 1.
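For a linear model, AIWLS has a closed form, since flattening the importance weights still leaves a weighted least squares problem. A sketch with our own names, where `importance` holds the ratios p_te(x_i^tr)/p_tr(x_i^tr):

```python
import numpy as np

def aiwls(X, y, importance, gamma):
    """Adaptive importance-weighted least squares (section 2.2.1):
    gamma = 0 gives OLS, gamma = 1 gives fully importance-weighted LS."""
    w = np.asarray(importance, float) ** gamma   # flattened weights
    sw = np.sqrt(w)
    theta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
    return theta
```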
9.5.2 Analysis of Batch Approach
First, we investigate the dependency between the goodness of the training input
density (i.e., c) and the model (i.e., γ). For each γ and each c, we draw training
input points {x_i^tr}_{i=1}^{100} and gather output values {y_i^tr}_{i=1}^{100}. Then we learn the
parameter θ of the model by AIWLS and compute the generalization error.
The mean generalization error over 500 trials as a function of c for each γ is depicted in
figure 9.5a. This graph shows that the best training input density c can
depend strongly on the model γ, implying that a training input density that is
good for one model could be poor for others. For example, when the training
input density is optimized for the model γ = 0, c = 1.1 would be an excellent
choice. However, c = 1.1 is not so suitable for the models γ = 0.5 and γ = 1. This figure
illustrates a possible weakness of the batch method: when an initially chosen
model is significantly different from the finally chosen model, the training
input points optimized for the initial model could be less useful for the final
model, and the performance is degraded.

Figure 9.5
Simulation results of active learning with model selection. (a) The mean generalization error over 500 trials as a function of the training input density c for each γ (when n_tr = 100). (b) Frequency of chosen γ over 500 trials as a function of the number of training samples.
9.5.3 Analysis of Sequential Approach

Next, we investigate the behavior of the sequential approach. In our implementation,
ten training input points are chosen at each iteration. Figure 9.5b depicts
the transition of the frequency of the chosen γ in the sequential learning process
over 500 trials. It shows that the choice of models varies over the learning
process: a smaller γ (which yields smaller variance, and thus lower complexity)
is favored in the beginning, but a larger γ (which yields larger variance, and
thus higher complexity) tends to be chosen as the number of training samples
increases. Figure 9.5 illustrates a possible weakness of the sequential method:
the target model drifts during the sequential learning process (from small γ
to large γ), and the training input points designed in an early stage (for γ = 0)
could be poor for the finally chosen model (γ = 1).
9.5.4 Comparison of Obtained Generalization Error
Finally, we investigate the generalization performance of each method when
the number of training samples to be gathered is

n_tr = 50, 100, 150, 200, 250.
Table 9.1
Means and standard deviations of the generalization error when the flattening parameter γ is chosen
from {0, 0.5, 1} by model selection

n_tr   Passive        Sequential    Batch         Ensemble
50     10.63 ± 8.33   7.98 ± 4.57   8.04 ± 4.39   °7.59 ± 4.27
100     5.90 ± 3.42   5.66 ± 2.75   5.73 ± 3.01   °5.15 ± 2.49
150     4.80 ± 2.38   4.40 ± 1.74   4.61 ± 1.85   °4.13 ± 1.56
200     4.21 ± 1.66   3.97 ± 1.54   4.26 ± 1.63   °3.73 ± 1.25
250     3.79 ± 1.31   3.46 ± 1.00   3.88 ± 1.41   °3.35 ± 0.95

All values in the table are multiplied by 10³. The best method in terms of the mean generalization
error and the methods comparable to it according to the Wilcoxon signed-rank test at the significance level
5 percent are marked by '°'.
Table 9.1 describes the means and standard deviations of the generalization
error obtained by the sequential, batch, and ensemble methods; as a baseline,
we also include the result of passive learning, that is, the training input points
{x_i^tr}_{i=1}^{n_tr} are drawn from the test input density p_te(x) (or equivalently, c = 1).
The table shows that all three methods of active learning with model selection
tend to outperform passive learning. However, the improvement of the
sequential method is not so significant, as a result of the model drift phenomenon
(see figure 9.5). The batch method also does not provide a significant improvement,
due to overfitting to the randomly chosen initial model (see figure 9.5a).
On the other hand, the EAL method does not suffer from these problems and
works significantly better than the other methods. The best method in terms
of the mean generalization error and the methods comparable to it by the Wilcoxon
signed-rank test at the significance level 5 percent [78] are marked by '°' in the
table.
9.6 Summary and Discussion
Historically, the problems of active learning and model selection have been
studied as two independent problems, although they share the common goal of
minimizing the generalization error. We suggested that by simultaneously
performing active learning and model selection, called active learning
with model selection, a better generalization capability can be achieved.

We pointed out that the sequential approach, which would be a common
approach to active learning with model selection, can perform poorly due to the
model drift phenomenon (section 9.2). A batch approach does not suffer from
the model drift problem, but it is hard to choose the initial model appropriately.
For this reason, the batch approach is not reliable in practice (section 9.3). To
overcome the limitations of the sequential and batch approaches, we introduced
an approach called ensemble active learning (EAL), which performs
active learning not for a single model alone, but for an ensemble of models
(section 9.4). The EAL method was shown to compare favorably with the other
approaches through simulations (section 9.5).
Although we focused on regression problems in this chapter, EAL is applicable
to any supervised learning scenario, given that a suitable batch active
learning method is available. This implies that, in principle, it is possible to
extend the EAL method to classification problems. However, to the best of our
knowledge, there is no reliable batch active learning method for classification
tasks. Therefore, developing a method of active learning with model selection
for classification is still a challenging open problem which needs to be further
investigated.
10
Applications of Active Learning

In this chapter, we describe real-world applications of active learning techniques:
sampling policy design in reinforcement learning (section 10.1) and
wafer alignment in semiconductor exposure apparatus (section 10.2).
10.1 Design of Efficient Exploration Strategies in Reinforcement Learning

As shown in section 7.6, reinforcement learning [174] is a useful framework
for letting a robot agent learn optimal behavior in an unknown environment.

The accuracy of estimated value functions depends on the training samples
collected following the sampling policy π(a|s). In this section, we apply the
population-based active learning method described in section 8.2.4 to designing
good sampling policies [4]. The contents of this section are based on the
framework of sample reuse policy iteration described in section 7.6.

10.1.1 Efficient Exploration with Active Learning

Let us consider a situation where collecting state-action trajectory samples is
easy and cheap, but gathering immediate reward samples is hard and expensive.
For example, let us consider a robot-arm control task of hitting a ball with a
bat and driving the ball as far away as possible (see section 10.1). Let us adopt
the carry of the ball as the immediate reward. In this setting, obtaining state-action
trajectory samples of the robot arm is easy and relatively cheap, since we
just need to control the robot arm and record its state-action trajectories over
time. On the other hand, explicitly computing the carry of the ball from the
state-action samples is hard due to friction and elasticity of the links, air resistance,
unpredictable disturbances such as a current of air, and so on. Thus, in practice,
we may need to put the robot in an open space, let it really hit the ball, and measure
the carry of the ball manually. Hence, gathering immediate reward samples
is much more expensive than gathering the state-action trajectory samples.
The goal of active learning in the current setup is to determine the sampling policy so that the expected generalization error is minimized. The generalization error is not accessible in practice since the expected reward function R(s,a) and the transition probability P_T(s'|s,a) are unknown. Thus, for performing active learning, the generalization error needs to be estimated from samples. A difficulty of estimating the generalization error in the context of active learning is that its estimation needs to be carried out only from state-action trajectory samples, without using immediate reward samples, because gathering immediate reward samples is hard and expensive. Below, we explain how the generalization error for active learning is estimated without reward samples.
10.1.2 Reinforcement Learning Revisited
Here, we briefly revisit essential ingredients of reinforcement learning. See
section 7.6 for more details.
For state s, action a, reward r, and discount factor γ, the state-action value function Q^π(s,a) ∈ ℝ for policy π is defined as the expected discounted sum of rewards the agent will receive when taking action a in state s and following policy π thereafter. That is,

    Q^\pi(s,a) := \mathbb{E}_{\pi,P_T}\!\left[ \sum_{n=1}^{\infty} \gamma^{n-1} R(s_n,a_n) \,\middle|\, s_1 = s,\ a_1 = a \right],

where \mathbb{E}_{\pi,P_T} denotes the expectation over \{s_n,a_n\}_{n=1}^{\infty} following the policy π(a_n|s_n) and the transition probability P_T(s_{n+1}|s_n,a_n).

We approximate the state-action value function Q^π(s,a) using the following linear model:

    \widehat{Q}^\pi(s,a;\theta) := \theta^\top \phi(s,a),

where

    \phi(s,a) = (\phi_1(s,a), \phi_2(s,a), \ldots, \phi_B(s,a))^\top

are the fixed basis functions, B is the number of basis functions, and

    \theta = (\theta_1, \theta_2, \ldots, \theta_B)^\top

are model parameters.
We want to learn θ so that the true value function Q^π(s,a) is well approximated. As explained in section 7.6, this can be achieved by regressing the immediate reward function R(s,a), using the following transformed basis function:

    \psi(s,a;\mathcal{D}) := \phi(s,a) - \frac{\gamma}{|\mathcal{D}_{(s,a)}|} \sum_{s' \in \mathcal{D}_{(s,a)}} \mathbb{E}_{\pi(a'|s')}\!\left[ \phi(s',a') \right],

where D is a set of training samples, D_{(s,a)} is the set of four-tuple elements containing state s and action a in the training data D (the sum is over the successor states s' of those elements), and \mathbb{E}_{\pi(a'|s')} denotes the conditional expectation of a' over π(a'|s'), given s'.
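To make the definition concrete, the transformed basis function can be sketched in Python as follows; the helper names (`phi`, `transitions`, `policy`) and the data layout are hypothetical, chosen only for illustration.

```python
import numpy as np

def transformed_basis(s, a, phi, transitions, policy, actions, gamma=0.95):
    """Sketch of psi(s, a; D): phi(s, a) minus gamma times the average,
    over next states s' observed after (s, a) in the data D, of the
    policy-expected basis E_{pi(a'|s')}[phi(s', a')].
    phi maps (state, action) to a feature vector; transitions maps
    (state, action) to the list of observed successor states; policy
    maps (action, state) to a probability. All names are illustrative."""
    next_states = transitions[(s, a)]          # D_{(s,a)}: observed successors
    expected = np.zeros_like(phi(s, a))
    for s_next in next_states:
        for a_next in actions:
            expected += policy(a_next, s_next) * phi(s_next, a_next)
    expected /= len(next_states)
    return phi(s, a) - gamma * expected
```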
The generalization error of a parameter θ is measured by the following squared Bellman residual G:

    G(\theta) := \mathbb{E}_{P_I,\pi,P_T}\!\left[ \frac{1}{N} \sum_{n=1}^{N} \big( \theta^\top \psi(s_n,a_n;\mathcal{D}) - R(s_n,a_n) \big)^2 \right],

where \mathbb{E}_{P_I,\pi,P_T} denotes the expectation over \{s_n,a_n\}_{n=1}^{N} following the initial-state probability density P_I(s_1), the policy π(a_n|s_n), and the transition probability density P_T(s_{n+1}|s_n,a_n).
Here, we consider the off-policy setup (see section 7.6.5), where the current policy π(a|s) is used for evaluating the generalization error, and a sampling policy π̃(a|s) is used for collecting data samples. We use the per-decision importance-weighting (PIW) method for parameter learning (see section 7.6.6.2):

    \widehat{\theta} := \mathop{\mathrm{argmin}}_{\theta} \sum_{m=1}^{M} \sum_{n=1}^{N} \big( \theta^\top \psi(s_{m,n},a_{m,n};\mathcal{D}) - r_{m,n} \big)^2\, \widehat{w}_{m,n},

where

    \widehat{w}_{m,n} := \prod_{n'=1}^{n} \frac{\pi(a_{m,n'}|s_{m,n'})}{\tilde{\pi}(a_{m,n'}|s_{m,n'})}

is the importance weight.
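The per-decision importance weight is the cumulative product of policy ratios along an episode. A minimal sketch, assuming `pi` and `pi_tilde` are callables returning action probabilities (hypothetical names):

```python
import numpy as np

def per_decision_weights(states, actions, pi, pi_tilde):
    """Per-decision importance weights for one episode:
    w_n = prod_{n'=1}^{n} pi(a_{n'} | s_{n'}) / pi_tilde(a_{n'} | s_{n'}).
    pi and pi_tilde map (action, state) to a probability; the returned
    array holds the weight for every step n of the episode."""
    ratios = np.array([pi(a, s) / pi_tilde(a, s)
                       for s, a in zip(states, actions)])
    return np.cumprod(ratios)
```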
The goal of active learning here is to find the best sampling policy π̃(a|s) that minimizes the generalization error:

    \tilde{\pi}_{\mathrm{AL}} := \mathop{\mathrm{argmin}}_{\tilde{\pi}}\, G(\widehat{\theta}).
10.1.3 Decomposition of Generalization Error
The information we are allowed to use for estimating the generalization error is a set of roll-out samples without immediate rewards:

    \mathcal{D}^{\tilde\pi} := \big\{ \{ (s^{\tilde\pi}_{m,n},\ a^{\tilde\pi}_{m,n},\ s^{\tilde\pi}_{m,n+1}) \}_{n=1}^{N} \big\}_{m=1}^{M}.
Let us define the deviation of immediate rewards from the mean as

    \epsilon^{\tilde\pi}_{m,n} := r^{\tilde\pi}_{m,n} - R(s^{\tilde\pi}_{m,n}, a^{\tilde\pi}_{m,n}).

Note that ε^π̃_{m,n} can be regarded as additive noise in the context of least-squares function fitting. By definition, ε^π̃_{m,n} has mean zero, and its variance generally depends on s^π̃_{m,n} and a^π̃_{m,n} (i.e., heteroscedastic noise [21]). However, since estimating the variance of ε^π̃_{m,n} without using reward samples is not generally possible, we ignore the dependence of the variance on s^π̃_{m,n} and a^π̃_{m,n}. Let us denote the input-independent common variance by σ².
Now we would like to estimate the generalization error G(θ̂) from D^π̃, where θ̂ is the PIW estimator given by

    \widehat{\theta} = L \widehat{r}, \qquad L = (\widehat{X}^\top W \widehat{X})^{-1} \widehat{X}^\top W,

where X̂ is the design matrix whose (N(m−1)+n)-th row is the transformed basis vector ψ(s^π̃_{m,n}, a^π̃_{m,n}; D^π̃)^⊤, r̂ is the reward vector with

    \widehat{r}_{N(m-1)+n} = r_{m,n},

and W is the diagonal matrix with the (N(m−1)+n)-th diagonal element given by

    W_{N(m-1)+n,\, N(m-1)+n} = \widehat{w}_{m,n}.
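In matrix form, the PIW estimator is an importance-weighted least-squares fit. A small sketch, assuming `Psi` stacks the transformed basis vectors row-wise and `w` holds the importance weights (names are illustrative):

```python
import numpy as np

def piw_estimator(Psi, r, w):
    """Sketch of the per-decision importance-weighted least-squares fit:
    theta_hat = L r_hat with L = (Psi^T W Psi)^{-1} Psi^T W, where Psi
    is the design matrix of transformed basis vectors, r the stacked
    immediate rewards, and w the per-decision importance weights."""
    W = np.diag(w)
    # Solve the weighted normal equations instead of forming an explicit inverse.
    L = np.linalg.solve(Psi.T @ W @ Psi, Psi.T @ W)
    return L @ r, L
```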
Note that in the above calculation of θ̂, D^π̃ (a data set without immediate rewards) is used instead of D̄^π̃ (a data set with immediate rewards), since D̄^π̃ is not available in the active learning setup.
The expectation of the above generalization error over the "noise" can be decomposed as follows:

    \mathbb{E}_{\epsilon^{\tilde\pi}}\big[ G(\widehat{\theta}) \big] = \mathrm{Bias}^2 + \mathrm{Variance} + \delta^2,

where \mathbb{E}_{\epsilon^{\tilde\pi}} denotes the expectation over the "noise" \{\epsilon^{\tilde\pi}_{m,n}\}_{m=1,n=1}^{M,N}. Bias², Variance, and δ² are the bias term, the variance term, and the model error term, respectively, defined by

    \mathrm{Bias}^2 := \mathbb{E}_{P_I,\pi,P_T}\!\left[ \frac{1}{N} \sum_{n=1}^{N} \Big\{ \big( \mathbb{E}_{\epsilon^{\tilde\pi}}[\widehat{\theta}] - \theta^* \big)^\top \psi(s_n,a_n;\mathcal{D}^{\tilde\pi}) \Big\}^2 \right],

    \mathrm{Variance} := \mathbb{E}_{\epsilon^{\tilde\pi}} \mathbb{E}_{P_I,\pi,P_T}\!\left[ \frac{1}{N} \sum_{n=1}^{N} \Big\{ \big( \widehat{\theta} - \mathbb{E}_{\epsilon^{\tilde\pi}}[\widehat{\theta}] \big)^\top \psi(s_n,a_n;\mathcal{D}^{\tilde\pi}) \Big\}^2 \right],

    \delta^2 := \mathbb{E}_{P_I,\pi,P_T}\!\left[ \frac{1}{N} \sum_{n=1}^{N} \big( \theta^{*\top} \psi(s_n,a_n;\mathcal{D}^{\tilde\pi}) - R(s_n,a_n) \big)^2 \right].

θ* is the optimal parameter in the model, defined by equation 7.5 in section 7.6.
Note that the variance term can be expressed in a compact form as

    \mathrm{Variance} = \sigma^2\, \mathrm{tr}(U L L^\top),

where U is the B × B matrix with the (b, b')-th element

    U_{b,b'} := \mathbb{E}_{P_I,\pi,P_T}\!\left[ \frac{1}{N} \sum_{n=1}^{N} \psi_b(s_n,a_n;\mathcal{D}^{\tilde\pi})\, \psi_{b'}(s_n,a_n;\mathcal{D}^{\tilde\pi}) \right]. \tag{10.1}
10.1.4 Estimating Generalization Error for Active Learning
The model error is constant, and thus can be safely ignored in generalization error estimation, since we are interested in finding a minimizer of the generalization error with respect to π̃. For this reason, we focus on the bias term and the variance term. However, the bias term includes the unknown optimal parameter θ*, and thus it may not be possible to estimate the bias term without using reward samples; similarly, it may not be possible to estimate the "noise" variance σ² included in the variance term without using reward samples.
As explained in section 8.2.4, the bias term is small enough to be neglected when the model is approximately correct, that is, θ*^⊤ψ(s,a) approximately agrees with the true function R(s,a). Then we have

    \mathbb{E}_{\epsilon^{\tilde\pi}}\big[ G(\widehat{\theta}) \big] - \delta^2 - \mathrm{Bias}^2 \propto \mathrm{tr}(U L L^\top), \tag{10.2}

which does not require immediate reward samples for its computation. Since \mathbb{E}_{P_I,\pi,P_T} included in U is not accessible (see equation 10.1), we replace U by its consistent estimator Û:

    \widehat{U}_{b,b'} := \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \psi_b(s^{\tilde\pi}_{m,n},a^{\tilde\pi}_{m,n};\mathcal{D}^{\tilde\pi})\, \psi_{b'}(s^{\tilde\pi}_{m,n},a^{\tilde\pi}_{m,n};\mathcal{D}^{\tilde\pi}).

Consequently, we have the following generalization error estimator:

    J := \mathrm{tr}(\widehat{U} L L^\top),

which can be computed only from D^π̃, and thus can be employed in the active learning scenarios.
10.1.5 Designing Sampling Policies
Based on the generalization error estimator derived above, we give an algorithm for designing a good sampling policy which makes full use of the roll-out samples without immediate rewards.

1. Prepare K candidates of the sampling policy: {π̃_k}_{k=1}^K.
2. Collect episodic samples without immediate rewards for each sampling policy candidate: {D^{π̃_k}}_{k=1}^K.
3. Estimate U using all samples {D^{π̃_k}}_{k=1}^K; denote the estimate by Û.
4. Estimate the generalization error for each k:

    J_k := \mathrm{tr}\big( \widehat{U} L^{\tilde\pi_k} (L^{\tilde\pi_k})^\top \big), \qquad L^{\tilde\pi_k} := \big( \widehat{X}^{\tilde\pi_k\top} W^{\tilde\pi_k} \widehat{X}^{\tilde\pi_k} \big)^{-1} \widehat{X}^{\tilde\pi_k\top} W^{\tilde\pi_k},
where W^{π̃_k} is the diagonal matrix with the (N(m−1)+n)-th diagonal element given by the importance weight ŵ^{π̃_k}_{m,n}.

5. (If possible) repeat steps 2 to 4 several times and calculate the average for each k: {J̄_k}_{k=1}^K.
6. Determine the sampling policy: π̃_AL := argmin_k J̄_k.
7. Collect training samples with immediate rewards following π̃_AL: D̄^{π̃_AL}.
8. Learn the value function by LSPI using D̄^{π̃_AL}.
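Steps 1 to 6 above can be sketched as follows, assuming the reward-free design matrices and importance weights for each candidate have already been computed (all names are hypothetical):

```python
import numpy as np

def select_sampling_policy(Psi_list, w_list):
    """Sketch of steps 1-6: for each candidate sampling policy k, form
    the learning matrix L_k from its (reward-free) design matrix Psi_k
    and importance weights w_k, estimate U from all collected samples,
    and pick the candidate minimizing J_k = tr(U_hat L_k L_k^T)."""
    # Step 3: consistent estimator of U from all collected samples.
    all_Psi = np.vstack(Psi_list)
    U_hat = all_Psi.T @ all_Psi / len(all_Psi)
    # Step 4: generalization error estimate for each candidate.
    J = []
    for Psi, w in zip(Psi_list, w_list):
        W = np.diag(w)
        L = np.linalg.solve(Psi.T @ W @ Psi, Psi.T @ W)
        J.append(np.trace(U_hat @ L @ L.T))
    # Step 6: choose the minimizer.
    return int(np.argmin(J)), J
```

In the full algorithm the chosen candidate would then be used to collect rewarded samples (steps 7 and 8), which this sketch omits.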
10.1.6 Active Learning in Policy Iteration
As shown above, the unknown generalization error can be accurately estimated
without using immediate reward samples in one-step policy evaluation. Here,
we extend the idea to the full policy iteration setup.
Sample reuse policy iteration (SRPI) [66], described in section 7.6, is a framework of off-policy reinforcement learning [174, 125] which allows one to reuse previously collected samples effectively. Let us denote the evaluation policy at the l-th iteration by π_l and the maximum number of iterations by L.

In the policy iteration framework, new data samples D^{π_l} are collected following the new policy π_l for the next policy evaluation step. In ordinary policy iteration methods, only the new samples D^{π_l} are used for policy evaluation. Thus, the previously collected data samples {D^{π_1}, D^{π_2}, ..., D^{π_{l-1}}} are not utilized:

    π_1 --E:{D^{π_1}}--> Q̂^{π_1} --I--> π_2 --E:{D^{π_2}}--> Q̂^{π_2} --I--> π_3 --E:{D^{π_3}}--> ... --> π_{L+1},

where E:{D} indicates policy evaluation using the data sample D, and I denotes policy improvement. On the other hand, in SRPI, all previously collected data samples are reused for policy evaluation as

    π_1 --E:{D^{π_1}}--> Q̂^{π_1} --I--> π_2 --E:{D^{π_1},D^{π_2}}--> Q̂^{π_2} --I--> π_3 --E:{D^{π_1},D^{π_2},D^{π_3}}--> ... --> π_{L+1},
where appropriate importance weights are applied to each set of previously
collected samples in the policy evaluation step.
Here, we apply the active learning technique to the SRPI framework. More
specifically, we optimize the sampling policy at each iteration. Then the
iteration process becomes

    π_1 --E:{D^{π̃_1}}--> Q̂^{π_1} --I--> π_2 --E:{D^{π̃_1},D^{π̃_2}}--> Q̂^{π_2} --I--> π_3 --> ... --> π_{L+1},

where π̃_l denotes the sampling policy optimized at the l-th iteration.
Thus, we do not gather samples following the current policy π_l, but following the sampling policy π̃_l optimized on the basis of the active learning method. We call this framework active policy iteration (API).
10.1.7 Robot Control Experiments
Here, we evaluate the performance of the API method using a ball-batting robot (see figure 10.1), which consists of two links and two joints. The goal of the ball-batting task is to control the robot arm so that it drives the ball as far away as possible.

The state space S is continuous and consists of the angles φ_1 [rad] (∈ [0, π/4]) and φ_2 [rad] (∈ [−π/4, π/4]) and the angular velocities φ̇_1 [rad/s] and φ̇_2 [rad/s]. Thus, a state s (∈ S) is described by a four-dimensional vector:

    s = (φ_1, φ_2, φ̇_1, φ̇_2)^⊤.

The action space A is discrete and contains two elements, a^{(1)} and a^{(2)}, where the i-th element (i = 1, 2) of each vector corresponds to the torque [N·m] added to joint i.
We use the Open Dynamics Engine (ODE), which is available at http://ode.org/, for physical calculations, including the update of the angles and angular velocities, and collision detection between the robot arm, ball, and pin. The simulation time step is set to 7.5 [ms], and the next state is observed after 10 time steps. The action chosen in the current state is maintained for 10 time steps. To make the experiments realistic, we add noise to actions: if action (f_1, f_2)^⊤ is taken, the actual torques applied to the joints are f_1 + ε_1 and f_2 + ε_2, where ε_1 and ε_2 are drawn independently from the Gaussian distribution with mean 0 and variance 3.

Figure 10.1: A ball-batting robot. Object settings: link 1: 0.65 [m] (length), 11.5 [kg] (mass); link 2: 0.35 [m] (length), 6.2 [kg] (mass); ball: 0.1 [m] (radius), 0.1 [kg] (mass); pin: 0.3 [m] (height), 7.5 [kg] (mass).
The immediate reward is defined as the carry of the ball. This reward is given only when the robot arm hits the ball for the first time at state s' after taking action a at the current state s. For value function approximation, we use the 110 basis functions defined as

    \phi_{2(i-1)+j}(s,a) :=
    \begin{cases}
      I(a = a^{(j)}) \exp\!\left( -\dfrac{\|s - c_i\|^2}{2\tau^2} \right) & \text{for } i = 1,\ldots,54 \text{ and } j = 1,2, \\
      I(a = a^{(j)}) & \text{for } i = 55 \text{ and } j = 1,2,
    \end{cases}

where τ is set to 3π/2 and the Gaussian centers c_i (i = 1,...,54) are located on the regular grid

    {0, π/4} × {−π, 0, π} × {−π/4, 0, π/4} × {−π, 0, π}.

I(c) denotes the indicator function:

    I(c) :=
    \begin{cases}
      1 & \text{if the condition } c \text{ is true}, \\
      0 & \text{otherwise}.
    \end{cases}
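As an illustration, the 110 basis functions above can be constructed as follows; representing the action by its index j ∈ {1, 2} is an encoding chosen only for this sketch:

```python
import numpy as np
from itertools import product

def make_basis(tau=3 * np.pi / 2):
    """Sketch of the 110 basis functions: for each of the 54 Gaussian
    centers c_i on the regular grid and each action index j in {1, 2},
    phi_{2(i-1)+j}(s, a) = I(a = a_j) * exp(-||s - c_i||^2 / (2 tau^2)),
    plus two pure action-indicator functions for i = 55."""
    centers = [np.array(c) for c in product(
        [0, np.pi / 4],                    # phi_1 grid
        [-np.pi, 0, np.pi],                # second coordinate grid
        [-np.pi / 4, 0, np.pi / 4],        # third coordinate grid
        [-np.pi, 0, np.pi])]               # fourth coordinate grid

    def phi(s, a):  # a is the action index, 1 or 2
        feats = []
        for c in centers:
            g = np.exp(-np.sum((s - c) ** 2) / (2 * tau ** 2))
            feats.extend([g * (a == 1), g * (a == 2)])
        feats.extend([1.0 * (a == 1), 1.0 * (a == 2)])  # i = 55 indicators
        return np.array(feats)
    return phi
```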
We set L = 7 and N = 10. As for the number of episodes M, we compare the "decreasing M" strategy (M is decreased as 10, 10, 7, 7, 7, 4, and 4 from iteration 1 to iteration 7) and the "fixed M" strategy (M is fixed to 7 throughout the iterations). The initial state is always set to s = (π/4, 0, 0, 0)^⊤. The initial evaluation policy π_1 is set to the ε-greedy policy, defined as

    \pi_1(a|s) := 0.15\, p_u(a) + 0.85\, I\!\left( a = \mathop{\mathrm{argmax}}_{a'} Q_0(s,a') \right),

    Q_0(s,a) := \sum_{b=1}^{110} \phi_b(s,a),
b=1
where p_u(a) denotes the uniform distribution over actions. Policies are updated using the ε-greedy rule with ε = 0.15/l in the l-th iteration. We prepare the following four sampling policy candidates in the sampling policy selection step of the l-th iteration:

    \big\{ \pi_l^{0.15/l},\ \pi_l^{0.15/l+0.15},\ \pi_l^{0.15/l+0.5},\ \pi_l^{0.15/l+0.85} \big\},

where π_l denotes the policy obtained by greedy update using Q̂^{π_{l−1}}, and π_l^ε is the ε-greedy version of the base policy π_l. That is, the intended action is chosen with probability 1 − ε/2, and the other action is chosen with probability ε/2.
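A minimal sketch of the ε-greedy construction and the four candidate exploration levels (two-action case; names are illustrative):

```python
import numpy as np

def epsilon_greedy(Q, s, actions, eps, rng):
    """epsilon-greedy action selection in the two-action case: the
    greedy action under the current value estimate Q is taken with
    probability 1 - eps/2, the other action with probability eps/2."""
    greedy = max(actions, key=lambda a: Q(s, a))
    other = [a for a in actions if a != greedy][0]
    return greedy if rng.random() < 1 - eps / 2 else other

def candidate_epsilons(l, base_eps=0.15):
    """The four candidate exploration levels at the l-th iteration:
    eps = 0.15/l, 0.15/l + 0.15, 0.15/l + 0.5, 0.15/l + 0.85."""
    return [base_eps / l + d for d in (0.0, 0.15, 0.5, 0.85)]
```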
The discount factor γ is set to 1, and the performance of the learned policy π_{L+1} is measured by the discounted sum of immediate rewards for the test samples {r^{π_{L+1}}_{m,n}} (20 episodes with 10 steps, collected following π_{L+1}):

    \mathrm{Performance} := \sum_{m=1}^{20} \sum_{n=1}^{10} r^{\pi_{L+1}}_{m,n}.
The experiment is repeated 500 times with different random seeds, and the average performance of each learning method is evaluated. The results are depicted in figure 10.2, showing that the API method outperforms the passive learning strategy; for the "decreasing M" strategy, the performance difference is statistically significant by the t-test at the significance level of 1 percent for the error values at the seventh iteration.

The above experimental evaluation showed that the sampling policy design method, API, is useful for improving the performance of reinforcement learning. Moreover, the "decreasing M" strategy was shown to be a useful heuristic to further enhance the performance of API.
10.2 Wafer Alignment in Semiconductor Exposure Apparatus
In this section, we describe an application of pool-based active learning meth­
ods to a wafer alignment problem in semiconductor exposure apparatus [163].
A profile of the exposure apparatus is illustrated in figure 10.3.
Recent semiconductors have a layered circuit structure, which is built by exposing circuit patterns multiple times. In this process, it is extremely important to align the wafers at the same position with very high accuracy. To this end, the location of markers is measured to adjust the shift and rotation of wafers. However, measuring the location of markers is time-consuming, and therefore there is a strong need to reduce the number of markers to be measured in order to speed up the semiconductor production process.
Figure 10.2: The mean performance over 500 runs in the ball-batting experiment (average performance versus iteration, 1 to 7). The dotted lines denote the performance of passive learning (PL), and the solid lines denote the performance of the active learning (AL) method, each under the "decreasing M" and "fixed M" strategies. The error bars are omitted for clear visibility. For the "decreasing M" strategy, the performance of active learning after the seventh iteration is significantly better than that of PL according to the t-test at the significance level of 1 percent for the error values at the seventh iteration.
Figure 10.3: Semiconductor exposure apparatus (reticle, reticle stage, silicon wafer, wafer stage).

Figure 10.4: Silicon wafer with markers. Observed markers based on the conventional heuristic are also shown.
Figure 10.4 illustrates a wafer on which markers are printed uniformly. The goal is to choose the most "informative" markers to be measured for better alignment of the wafer. A conventional choice is to measure markers far from the center in a symmetric way, which provides a robust estimate of the rotation angle (see figure 10.4). However, this naive approach is not necessarily the best, since misalignment is caused not only by affine transformation but also by several other nonlinear factors, such as warping, biased characteristics of the measurement apparatus, and different temperature conditions. In practice, it is not easy to model such nonlinear factors accurately. For this reason, the linear affine model or the second-order model is often used in wafer alignment. However, this causes model misspecification, and therefore the active learning methods for approximately correct models explained in chapter 8 would be useful in this application.
Let us consider the functions whose input x = (u, v)^⊤ is the location on the wafer and whose output is the horizontal discrepancy Δu or the vertical discrepancy Δv. These functions are learned using the following second-order model:

    f(x;\theta) := \theta_1 u^2 + \theta_2 v^2 + \theta_3 u v + \theta_4 u + \theta_5 v + \theta_6.

For 220 wafer samples, experiments are carried out as follows. For each wafer, n_tr = 20 points are chosen from n_te = 38 markers, and the horizontal and the vertical discrepancies are observed. Then the above model is trained, and its prediction performance is tested using all 38 markers. This process is repeated for all 220 wafers. Since the choice of the sampling locations by the active learning methods is stochastic, the above experiment is repeated 100 times with different random seeds.

Table 10.1: The mean squared test error for the wafer alignment problem (means and standard deviations over 220 wafers)

    Order   PALICE       PW           PO           Passive      Conv.
    1       2.27±1.08    2.29±1.08    2.37±1.15    2.32±1.11    2.36±1.15
    2       1.93±0.89    2.09±0.98    1.96±0.91    2.32±1.15    2.13±1.08

PALICE, PW, and PO denote the active learning methods described in sections 8.4.3, 8.4.2, and 8.4.1, respectively. "Passive" indicates the passive sampling strategy where training input points are randomly chosen from all markers. "Conv." indicates the conventional heuristic of choosing the outer markers.
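A minimal sketch of the regression step, fitting the second-order model to the 20 observed markers by ordinary least squares and predicting the discrepancy at all 38 markers (this illustrates only the model fit, not the active learning marker selection):

```python
import numpy as np

def design_matrix(u, v):
    """Second-order polynomial features of the marker location (u, v),
    used to regress the horizontal or vertical discrepancy."""
    return np.column_stack([np.ones_like(u), u, v, u ** 2, u * v, v ** 2])

def fit_alignment(u_tr, v_tr, d_tr):
    """Least-squares fit of the discrepancy model on the chosen markers."""
    X = design_matrix(u_tr, v_tr)
    theta, *_ = np.linalg.lstsq(X, d_tr, rcond=None)
    return theta

def predict_alignment(theta, u, v):
    """Predicted discrepancy at arbitrary marker locations."""
    return design_matrix(u, v) @ theta
```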
The mean and standard deviation of the squared test error over 220 wafers are summarized in table 10.1. This shows that the PALICE method (see section 8.4.3) works significantly better than the other sampling strategies, and it provides about a 10 percent reduction in the squared error from the conventional heuristic of choosing the outer markers.
IV
CONCLUSIONS
11
Conclusions and Future Prospects
In this book, we provided a comprehensive overview of theory, algorithms, and applications of machine learning under covariate shift.
11.1 Conclusions
Part II of the book covered topics on learning under covariate shift. In chapters 2 and 3, importance sampling techniques were shown to form the theoretical basis of covariate shift adaptation in function learning and model selection. In practice, the importance weights needed in importance sampling are unknown. Thus, estimating the importance weights is a key component in covariate shift adaptation, which was covered in chapter 4. In chapter 5, a novel idea for estimating the importance weights in high-dimensional problems was explained. Chapter 6 was devoted to the review of a Nobel Prize-winning work on sample selection bias, and its relation to covariate shift adaptation was discussed. In chapter 7, applications of covariate shift adaptation techniques to various real-world problems were shown.
In part II, we considered the occurrence of covariate shift and how to cope with it. On the other hand, in part III, we considered the situation where covariate shift is intentionally caused by users in order to improve generalization ability. In chapter 8, the problem of active learning, where the training input distribution is designed by users, was addressed. Since active learning naturally induces covariate shift, its adaptation was shown to be essential for designing better active learning algorithms. In chapter 9, the challenging problem of active learning with model selection, where active learning and model selection are performed at the same time, was addressed. The ensemble approach was shown to be a promising method for this chicken-or-egg problem. In chapter 10, applications of active learning techniques to real-world problems were shown.
11.2 Future Prospects
11.2 Future Prospects

In the context of covariate shift adaptation, the importance weights played an essential role in systematically adjusting for the difference of distributions in the training and test phases. In chapter 4, we showed various methods for estimating the importance weights.

Beyond covariate shift adaptation, it has been shown recently that the ratio of probability densities can be used for solving various machine learning tasks [157, 170]. This novel machine learning framework includes multitask learning [16], privacy-preserving data mining [46], outlier detection [79, 146, 80], change detection in time series [91], two-sample tests [169], conditional density estimation [172], and probabilistic classification [155]. Furthermore, mutual information, which plays a central role in information theory [34], can be estimated via density ratio estimation [178, 179]. Since mutual information is a measure of statistical independence between random variables, density ratio estimation also can be used for variable selection [177], dimensionality reduction [175], independence tests [168], clustering [94], independent component analysis [176], and causal inference [203]. Thus, density ratio estimation is a promising and versatile tool for machine learning which needs to be further investigated.
Appendix: List of Symbols and Abbreviations
X  Input domain
d  Input dimensionality
Y  Output domain
x_i^tr  i-th training input
y_i^tr  i-th training output
y^tr  Training output vector
ε_i  i-th training output noise
ε  Training output noise vector
σ  Standard deviation of output noise
n_tr  Number of training samples
p_tr(x)  Training input density
p(y|x)  Conditional density of output y, given input x
f(x)  Conditional mean of output y, given input x
x_j^te  j-th test input
y_j^te  j-th test output
n_te  Number of test samples
p_te(x)  Test input density
w(x)  Importance weight
loss(x, y, ŷ)  Loss when output y at input x is estimated by ŷ
Gen  Generalization error
Gen', Gen''  Generalization error without irrelevant constant
Ĝen  Generalization error estimator
Bias²  (Squared) bias term in generalization error
Var  Variance term in generalization error
E_{x_i^tr}  Expectation over {x_i^tr}_{i=1}^{n_tr} drawn i.i.d. from p_tr(x)
E_{y_i^tr}  Expectation over {y_i^tr}_{i=1}^{n_tr}, each drawn from p(y|x = x_i^tr)
E_{x^te}  Expectation over x^te drawn from p_te(x)
E_{y^te}  Expectation over y^te drawn from p(y|x = x^te)
M  Model
f(x; θ)  Function model
θ_ℓ  ℓ-th parameter
φ_ℓ(x)  ℓ-th basis function
θ  Parameter vector
b  Number of parameters
K(x, x')  Kernel function
θ̂  Learned parameter
θ*  Optimal parameter
γ  Flattening parameter for importance weights
λ  Regularization parameter
R(θ)  Regularization function
X^tr  Training design matrix
W  Importance-weight matrix
L  Learning matrix
U  Metric matrix
K^tr  Training kernel matrix
N(x; μ, σ²)  Gaussian density with mean μ and variance σ²
N(x; μ, Σ)  Multidimensional Gaussian density with mean μ and covariance matrix Σ
r(x)  Residual function
δ  Model error
p*_tr(x)  "Optimal" training input density
w̃(x)  Resampling weight function
w̃*(x)  "Optimal" resampling weight function
ν  Flattening parameter of training input density
O  Asymptotic order ("big O")
O_p  Asymptotic order in probability ("big O")
o  Asymptotic order ("small o")
o_p  Asymptotic order in probability ("small o")
^T  Transpose of matrix or vector
⟨·,·⟩  Inner product
‖·‖  Norm
AIC  Akaike information criterion
AIWERM  Adaptive IWERM
AIWLS  Adaptive IWLS
ALICE  Active learning using IWLS learning based on conditional expectation of generalization error
API  Active policy iteration
BASIC  Bootstrap approximated SIC
CRF  Conditional random field
CSP  Common spatial pattern
CV  Cross-validation
D3  Direct density-ratio estimation with dimensionality reduction
EAL  Ensemble active learning
ERM  Empirical risk minimization
FDA  Fisher discriminant analysis
GMM  Gaussian mixture model
IWERM  Importance-weighted ERM
IWLS  Importance-weighted LS
KDE  Kernel density estimation
KKT  Karush-Kuhn-Tucker
KL  Kullback-Leibler
KLIEP  KL importance estimation procedure
KMM  Kernel mean matching
LASIC  Linearly approximated SIC
LFDA  Local FDA
LOOCV  Leave-one-out CV
LR  Logistic regression
LS  Least squares
LSIF  Least-squares importance fitting
MDP  Markov decision problem
MFCC  Mel-frequency cepstrum coefficient
MLE  Maximum likelihood estimation
MSE Mean squared error
NLP Natural language processing
OLS Ordinary LS
PALICE Pool-based ALICE
QP Quadratic program
RIWERM Regularized IWERM
RL  Reinforcement learning
SIC  Subspace information criterion
SRPI  Sample reuse policy iteration
SVM  Support vector machine
TD  Temporal difference
uLSIF  Unconstrained LSIF
Bibliography
1. H. Akaike. Statistical predictor identification. Annals of the Institute of Statistical Mathematics, 22(1):203-217, 1970.
2. H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic
Control, AC-19(6):716-723, 1974.
3. H. Akaike. Likelihood and the Bayes procedure. In J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith, editors, Bayesian Statistics, pages 141-166. University of Valencia Press, Valencia, Spain, 1980.
4. T. Akiyama, H. Hachiya, and M. Sugiyama. Efficient exploration through active learning
for value function approximation in reinforcement learning. Neural Networks, 23(5):639-648,
2010.
5. A. Albert. Regression and the Moore-Penrose Pseudoinverse. Academic Press, New York and
London, 1972.
6. N. Altman and C. Leger. On the optimality of prediction-based selection criteria and the
convergence rates of estimators. Journal of the Royal Statistical Society, series B, 59(1):205-216,
1997.
7. S. Amari. Theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers,
EC-16(3):299-307, 1967.
8. F. Babiloni, F. Cincotti, L. Lazzarini, J. del R. Millán, J. Mouriño, M. Varsta, J. Heikkonen, L. Bianchi, and M. G. Marciani. Linear classification of low-resolution EEG patterns produced by imagined hand movements. IEEE Transactions on Rehabilitation Engineering, 8(2):186-188, June 2000.
9. F. R. Bach. Active learning for misspecified generalized linear models. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems, volume 19, pages 65-72. MIT Press, Cambridge, MA, 2007.
10. P. Baldi and S. Brunak. Bioinformatics: The Machine Learning Approach. MIT Press,
Cambridge, MA, 1998.
11. L. Bao and S. S. Intille. Activity recognition from user-annotated acceleration data. In Proceedings of the 2nd IEEE International Conference on Pervasive Computing, pages 1-17. Springer, New York, 2004.
12. R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University Press,
Princeton, NJ, 1961.
13. D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Nashua, NH, 1996.
14. M. J. Best. An algorithm for the solution of the parametric quadratic programming problem. CORR Report 82-24, Faculty of Mathematics, University of Waterloo, Ontario, Canada, 1982.
15. N. B. Bharatula, M. Stäger, P. Lukowicz, and G. Tröster. Empirical study of design choices in multi-sensor context recognition. In Proceedings of the 2nd International Forum on Applied Wearable Computing, pages 79-93. 2005.
16. S. Bickel, J. Bogojeska, T. Lengauer, and T. Scheffer. Multi-task learning for HIV therapy screening. In A. McCallum and S. Roweis, editors, Proceedings of the 25th Annual International Conference on Machine Learning, pages 56-63. Omnipress, Madison, WI, 2008.
17. S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning for differing training and test distributions. In Z. Ghahramani, editor, Proceedings of the 24th International Conference on Machine Learning, pages 81-88. 2007.
18. S. Bickel and T. Scheffer. Dirichlet-enhanced spam filtering based on biased samples. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems, volume 19, pages 161-168. MIT Press, Cambridge, MA, 2007.
19. H. J. Bierens. Maximum likelihood estimation of Heckman's sample selection model. Unpublished manuscript, Pennsylvania State University, 2002. Available at http://econ.la.psu.edu/~hbierens/EasyRegTours/HECKMAN_Tourfiles/Heckman.PDF.
20. C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford,
1995.
21. C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.
22. B. Blankertz, G. Dornhege, M. Krauledat, and K.-R. Müller. The Berlin brain-computer interface: EEG-based communication without subject training. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 14(2):147-152, 2006.
23. B. Blankertz, G. Dornhege, M. Krauledat, K.-R. Müller, and G. Curio. The Berlin brain-computer interface: Report from the feedback sessions. Technical Report 1, Fraunhofer FIRST, 2005.
24. J. Blitzer, R. McDonald, and F. Pereira. Domain adaptation with structural correspondence learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 120-128. 2006.
25. K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49-e57, 2006.
26. B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classi­
fiers. In D. Haussler, editor, Proceedings of the Fifth Annual ACM Workshop on Computational
Learning Theory, pages 144-152. ACM Press, New York, 1992.
27. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press,
Cambridge, 2004.
28. L. Breiman. Arcing classifiers. Annals of Statistics, 26(3):801-849, 1998.
29. W. Campbell. Generalized linear discriminant sequence kernels for speaker recognition. In
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing,
volume 1, pages 161-164. 2002.
30. O. Chapelle, B. SchOlkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press,
Cambridge, MA, 2006.
31. S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM
Journal on Scientific Computing, 20(1):33-61, 1998.
32. K. F. Cheng and C. K. Chu. Semiparametric density estimation under a two-sample density
ratio model. Bernoulli, 10(4):583-604, 2004.
33. D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129-145, 1996.
34. T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, New
York, 1991.
35. P. Craven and G. Wahba. Smoothing noisy data with spline functions: Estimating the cor­
rect degree of smoothing by the method of generalized cross-validation. Numerische Mathematik,
31:377-403, 1979.
36. H. Daume III. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual
Meeting of the Association for Computational Linguistics, pages 256-263. 2007.
37. A. Demiriz, K. P. Bennett, and J. Shawe-Taylor. Linear programming boosting via column generation. Machine Learning, 46(1/3):225-254, 2002.
38. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via
the EM algorithm. Journal of the Royal Statistical Society, series B, 39(1):1-38, 1977.
39. G. Dornhege, B. Blankertz, G. Curio, and K.-R. Müller. Boosting bit rates in noninvasive EEG single-trial classifications by feature combination and multiclass paradigms. IEEE Transactions on Biomedical Engineering, 51(6):993-1002, June 2004.
40. G. Dornhege, J. del R. Millán, T. Hinterberger, D. McFarland, and K.-R. Müller, editors. Toward Brain-Computer Interfacing. MIT Press, Cambridge, MA, 2007.
41. N. R. Draper and H. Smith. Applied Regression Analysis, third edition. Wiley-Interscience,
New York, 1998.
42. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, second edition. Wiley­
Interscience, New York, 2000.
43. B. Efron. Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1):1-26,
1979.
44. B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of
Statistics, 32(2):407-499, 2004.
45. B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall/CRC, New
York, 1994.
46. C. Elkan. Privacy-preserving data mining via importance weighting. In Proceedings of the
ECML/PKDD Workshop on Privacy and Security Issues in Data Mining and Machine Learning,
pages 15-21, Springer, 2010.
47. T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines.
Advances in Computational Mathematics, 13(1):1-50, 2000.
48. V. V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, 1972.
49. FG-NET Consortium. The FG-NET Aging Database. http://www.fgnet.rsunit.com/.
50. R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics,
7(2):179-188, 1936.
51. G. S. Fishman. Monte Carlo: Concepts, Algorithms, and Applications. Springer-Verlag,
Berlin, 1996.
52. Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Proceedings of
the 13th International Conference on Machine Learning, pages 148-156. Morgan Kaufmann, San
Francisco, 1996.
53. J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of
boosting. Annals of Statistics, 28(2):337-407, 2000.
54. Y. Fu, Y. Xu, and T. S. Huang. Estimating human age by manifold analysis of face pictures and
regression on aging features. In Proceedings of the IEEE International Conference on Multimedia
and Expo, pages 1383-1386. 2007.
55. K. Fukumizu. Statistical active learning in multilayer perceptrons. IEEE Transactions on
Neural Networks, 11(1):17-26, 2000.
56. K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5(1):73-99,
January 2004.
57. K. Fukunaga. Introduction to Statistical Pattern Recognition, second edition. Academic Press,
Boston, 1990.
58. S. Furui. Cepstral analysis technique for automatic speaker verification. IEEE Transactions
on Acoustics, Speech, and Signal Processing, 29(2):254-272, 1981.
59. S. Furui. Comparison of speaker recognition methods using statistical features and dynamic
features. IEEE Transactions on Acoustics, Speech, and Signal Processing, 29(3):342-350, 1981.
60. C. Gao, F. Kong, and J. Tan. HealthAware: Tackling obesity with Health Aware smart phone
systems. In Proceedings of the IEEE International Conference on Robotics and Biomimetics, pages
1549-1554. IEEE Press, Piscataway, NJ, 2009.
61. X. Geng, Z. Zhou, Y. Zhang, G. Li, and H. Dai. Learning from facial aging patterns for automatic age estimation. In Proceedings of the 14th ACM International Conference on Multimedia,
pages 307-316. 2006.
62. A. Globerson and S. Roweis. Metric learning by collapsing classes. In Y. Weiss, B. Schölkopf,
and J. Platt, editors, Advances in Neural Information Processing Systems, volume 18, pages
451-458. MIT Press, Cambridge, MA, 2006.
63. J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing
Systems, volume 17, pages 513-520. MIT Press, Cambridge, MA, 2005.
64. G. H. Golub and C. F. van Loan. Matrix Computations, third edition. Johns Hopkins
University Press, Baltimore, 1996.
65. G. Guo, G. Mu, Y. Fu, C. Dyer, and T. Huang. A study on automatic age estimation using a
large database. In Proceedings of the IEEE International Conference on Computer Vision, pages
1986-1991. 2009.
66. H. Hachiya, T. Akiyama, M. Sugiyama, and J. Peters. Adaptive importance sampling
for value function approximation in off-policy reinforcement learning. Neural Networks,
22(10):1399-1410, 2009.
67. H. Hachiya, J. Peters, and M. Sugiyama. Efficient sample reuse in EM-based policy search. In
W. Buntine, M. Grobelnik, D. Mladenic, and J. Shawe-Taylor, editors, Machine Learning and
Knowledge Discovery in Databases, volume 5781 of Lecture Notes in Artificial Intelligence,
pages 469-484. Springer, Berlin, 2009.
68. H. Hachiya, M. Sugiyama, and N. Ueda. Importance-weighted least-squares probabilistic classifier for covariate shift adaptation with application to human activity recognition.
Neurocomputing, 2011. In press.
69. W. Härdle, M. Müller, S. Sperlich, and A. Werwatz. Nonparametric and Semiparametric
Models. Springer, Berlin, 2004.
70. T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support
vector machine. Journal of Machine Learning Research, 5:1391-1415, 2004.
71. T. Hastie and R. Tibshirani. Generalized Additive Models, volume 43 of Monographs on
Statistics and Applied Probability. Chapman & Hall/CRC, London, 1990.
72. T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 18(6):607-616, 1996.
73. T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal
Statistical Society, Series B, 58(1):155-176, 1996.
74. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. Springer, New York, 2001.
75. Y. Hattori, S. Inoue, T. Masaki, G. Hirakawa, and O. Sudo. Gathering large scale human activity information using mobile sensor devices. In Proceedings of the Second International Workshop
on Network Traffic Control, Analysis and Applications, pages 708-713. 2010.
76. J. J. Heckman. The common structure of statistical models of truncation, sample selection
and limited dependent variables and a simple estimator for such models. Annals of Economic and
Social Measurement, 5(4):120-137, 1976.
77. J. J. Heckman. Sample selection bias as a specification error. Econometrica, 47(1):153-161,
1979.
78. R. E. Henkel. Tests of Significance. Sage Publications, Beverly Hills, CA, 1976.
Bibliography 251
79. S. Hido, Y. Tsuboi, H. Kashima, M. Sugiyama, and T. Kanamori. Inlier-based outlier detection
via direct density ratio estimation. In F. Giannotti, D. Gunopulos, F. Turini, C. Zaniolo, N. Ramakrishnan, and X. Wu, editors, Proceedings of the IEEE International Conference on Data Mining,
pages 223-232. 2008.
80. S. Hido, Y. Tsuboi, H. Kashima, M. Sugiyama, and T. Kanamori. Statistical outlier detection
using direct density ratio estimation. Knowledge and Information Systems, 26(2):309-336, 2011.
81. G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural
Computation, 14(8):1771-1800, 2002.
82. J. Huang, A. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf. Correcting sample
selection bias by unlabeled data. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in
Neural Information Processing Systems, volume 19, pages 601-608. MIT Press, Cambridge, MA,
2007.
83. P. J. Huber. Robust Statistics. Wiley-Interscience, New York, 1981.
84. A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, New York,
2001.
85. M. Ishiguro, Y. Sakamoto, and G. Kitagawa. Bootstrapping log likelihood and EIC, an
extension of AIC. Annals of the Institute of Statistical Mathematics, 49(3):411-434, 1997.
86. J. Jiang and C. Zhai. Instance weighting for domain adaptation in NLP. In Proceedings
of the 45th Annual Meeting of the Association for Computational Linguistics, pages 264-271.
Association for Computational Linguistics, 2007.
87. T. Kanamori. Pool-based active learning with optimal sampling distribution and its information geometrical interpretation. Neurocomputing, 71(1-3):353-362, 2007.
88. T. Kanamori, S. Hido, and M. Sugiyama. A least-squares approach to direct importance
estimation. Journal of Machine Learning Research, 10:1391-1445, July 2009.
89. T. Kanamori and H. Shimodaira. Active learning algorithm using the maximum weighted
log-likelihood estimator. Journal of Statistical Planning and Inference, 116(1):149-162, 2003.
90. T. Kanamori, T. Suzuki, and M. Sugiyama. Condition number analysis of kernel-based density
ratio estimation. Technical report TR09-0006, Department of Computer Science, Tokyo Institute
of Technology, 2009. Available at: http://arxiv.org/abs/0912.2800.
91. Y. Kawahara and M. Sugiyama. Change-point detection in time-series data by direct density-ratio estimation. In H. Park, S. Parthasarathy, H. Liu, and Z. Obradovic, editors, Proceedings of
the SIAM International Conference on Data Mining, pages 389-400. SIAM, 2009.
92. M. Kawanabe, M. Sugiyama, G. Blanchard, and K.-R. Müller. A new algorithm of non-Gaussian component analysis with radial kernel functions. Annals of the Institute of Statistical
Mathematics, 59(1):57-75, 2007.
93. J. Kiefer. Optimum experimental designs. Journal of the Royal Statistical Society, Series B,
21:272-304, 1959.
94. M. Kimura and M. Sugiyama. Dependence-maximization clustering with least-squares mutual
information. Journal of Advanced Computational Intelligence and Intelligent Informatics, 15(7):
800-805, 2011.
95. D. E. Knuth. Seminumerical Algorithms, volume 2 of The Art of Computer Programming,
third edition. Addison-Wesley Professional, Boston, 1998.
96. S. Konishi and G. Kitagawa. Generalised information criteria in model selection. Biometrika,
83(4):875-890, 1996.
97. S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical
Statistics, 22(1):79-86, 1951.
98. J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for
segmenting and labeling sequence data. In Proceedings of the 18th International Conference on
Machine Learning, pages 282-289. Morgan Kaufmann, San Francisco, 2001.
99. M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning
Research, 4:1107-1149, December 2003.
100. P. Leijdekkers and V. Gay. A self-test to detect a heart attack using a mobile phone and
wearable sensors. In Proceedings of the 21stIEEE International Symposium on Computer-Based
Medical Systems, pages 93-98. IEEE Computer Society, Washington, DC, 2008.
101. S. Lemm, B. Blankertz, G. Curio, and K.-R. Müller. Spatio-spectral filters for improving
the classification of single trial EEG. IEEE Transactions on Biomedical Engineering, 52(9):1541-
1548, September 2005.
102. Y. Li, H. Kambara, Y. Koike, and M. Sugiyama. Application of covariate shift adaptation techniques in brain computer interfaces. IEEE Transactions on Biomedical Engineering,
57(6):1318-1324, 2010.
103. C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic
regression. Journal of Machine Learning Research, 9:627-650, April 2008.
104. R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. Wiley, New York,
1987.
105. A. Luntz and V. Brailovsky. On estimation of characters obtained in statistical procedure of
recognition. Technicheskaya Kibernetica, 3, 1969. (In Russian).
106. D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992.
107. D. J. C. MacKay. Information-based objective functions for active data selection. Neural
Computation, 4(4):590-604, 1992.
108. C. L. Mallows. Some comments on Cp. Technometrics, 15(4):661-675, 1973.
109. O. L. Mangasarian and D. R. Musicant. Robust linear and support vector regression. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 22(9):950-955, 2000.
110. J. Mariethoz and S. Bengio. A kernel trick for sequences applied to text-independent speaker
verification systems. Pattern Recognition, 40(8):2315-2324, 2007.
111. T. Matsui and K. Aikawa. Robust model for speaker verification against session-dependent
utterance variation. IEICE Transactions on Information and Systems, E86-D(4):712-718, 2003.
112. T. Matsui and S. Furui. Concatenated phoneme models for text-variable speaker recognition.
In Proceedings of the IEEE International Conference on Audio Speech and Signal Processing,
pages 391-394. 1993.
113. T. Matsui and K. Tanabe. Comparative study of speaker identification methods: dPLRM,
SVM, and GMM. IEICE Transactions on Information and Systems, E89-D(3):1066-1073, 2006.
114. P. McCullagh and J. A. Nelder. Generalized Linear Models, second edition. Chapman &
Hall/CRC, London, 1989.
115. P. Melville and R. J. Mooney. Diverse ensembles for active learning. In Proceedings of the
21st International Conference on Machine Learning, pages 584-591. ACM Press, New York,
2004.
116. J. del R. Millán. On the need for on-line learning in brain-computer interfaces. In Proceedings of the International Joint Conference on Neural Networks, volume 4, pages 2877-2882.
2004.
117. T. P. Minka. A comparison of numerical optimizers for logistic regression. Technical report,
Microsoft Research. 2007.
118. N. Murata, S. Yoshizawa, and S. Amari. Network information criterion-Determining the
number of hidden units for an artificial neural network model. IEEE Transactions on Neural
Networks, 5(6):865-872, 1994.
119. W. K. Newey, J. L. Powell, and J. R. Walker. Semiparametric estimation of selection models: Some empirical results. American Economic Review Papers and Proceedings, 80(2):324-328,
1990.
120. X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and
the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory,
56(11):5847-5861, 2010.
121. K. Pelckmans, J. A. K. Suykens, and B. de Moor. Additive regularization trade-off: Fusion
of training and validation levels in kernel methods. Machine Learning, 62(3):217-252, 2006.
122. G. Pfurtscheller and F. H. Lopes da Silva. Event-related EEG/MEG synchronization and
desynchronization: Basic principles. Clinical Neurophysiology, 110(11):1842-1857, November
1999.
123. P. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, J. Chang, K. Hoffman, J. Marques,
J. Min, and W. J. Worek. Overview of the face recognition grand challenge. In Proceedings of the
IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 947-954.
2005.
124. D. Precup, R. S. Sutton, and S. Dasgupta. Off-policy temporal-difference learning with function approximation. In Proceedings of the 18th International Conference on Machine Learning,
pages 417-424. 2001.
125. D. Precup, R. S. Sutton, and S. Singh. Eligibility traces for off-policy policy evaluation. In
Proceedings of the 17th International Conference on Machine Learning, pages 759-766. Morgan
Kaufmann, San Francisco, 2000.
126. P. A. Puhani. The Heckman correction for sample selection and its critique: A short survey.
Journal of Economic Surveys, 14(1):53-68, 2000.
127. F. Pukelsheim. Optimal Design of Experiments. Wiley, New York, 1993.
128. J. Qin. Inferences for case-control and semiparametric two-sample density ratio models.
Biometrika, 85(3):619-630, 1998.
129. L. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. Prentice Hall, Englewood
Cliffs, NJ, 1993.
130. H. Ramoser, J. Müller-Gerking, and G. Pfurtscheller. Optimal spatial filtering of single
trial EEG during imagined hand movement. IEEE Transactions on Rehabilitation Engineering,
8(4):441-446, 2000.
131. C. R. Rao. Linear Statistical Inference and Its Applications. Wiley, New York, 1965.
132. D. A. Reynolds, T. F. Quatieri, and R. B. Dunn. Speaker verification using adapted Gaussian
mixture models. Digital Signal Processing, 10(1-3):19-41, 2000.
133. D. A. Reynolds and R. C. Rose. Robust text-independent speaker verification using Gaussian
mixture speaker models. IEEE Transactions on Speech and Audio Processing, 3(1):72-83, 1995.
134. K. Ricanek and T. Tesafaye. MORPH: A longitudinal image database of normal adult age-progression. In Proceedings of the IEEE 7th International Conference on Automatic Face and
Gesture Recognition, pages 341-345. 2006.
135. J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465-471, 1978.
136. J. Rissanen. Stochastic complexity. Journal of the Royal Statistical Society, Series B,
49(3):223-239, 1987.
137. R. T. Rockafellar and S. Uryasev. Conditional value-at-risk for general loss distributions.
Journal of Banking & Finance, 26(7):1443-1471, 2002.
138. P. J. Rousseeuw and A. M. Leroy. Robust Regression and Outlier Detection. Wiley, New
York, 1987.
139. A. I. Schein and L. H. Ungar. Active learning for logistic regression: An evaluation. Machine
Learning, 68(3):235-265, 2007.
140. M. Schmidt. minFunc, 2005. http://people.cs.ubc.ca/~schmidtm/Software/minFunc.html.
141. B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2001.
142. G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461-464,
1978.
143. P. Shenoy, M. Krauledat, B. Blankertz, R. P. N. Rao, and K.-R. Müller. Towards adaptive
classification for BCI. Journal of Neural Engineering, 3(1):R13-R23, 2006.
144. R. Shibata. Statistical aspects of model selection. In J. C. Willems, editor, From Data to
Model, pages 215-240. Springer-Verlag, New York, 1989.
145. H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227-244, 2000.
146. A. Smola, L. Song, and C. H. Teo. Relative novelty detection. In D. van Dyk and M. Welling,
editors, Proceedings of the 12th International Conference on Artificial Intelligence and Statistics,
volume 5 of JMLR Workshop and Conference Proceedings, pages 536-543. 2009.
147. L. Song, A. Smola, A. Gretton, K. M. Borgwardt, and J. Bedo. Supervised feature selection via dependence estimation. In Proceedings of the 24th International Conference on Machine
Learning, pages 823-830. ACM Press, New York, 2007.
148. C. M. Stein. Estimation of the mean of a multivariate normal distribution. Annals of
Statistics, 9(6):1135-1151, 1981.
149. I. Steinwart. On the influence of the kernel on the consistency of support vector machines.
Journal of Machine Learning Research, 2:67-93, 2001.
150. M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the
Royal Statistical Society, Series B, 36(2):111-147, 1974.
151. M. Stone. Asymptotics for and against cross-validation. Biometrika, 64(1):29-35, 1977.
152. E. P. Stuntebeck, J. S. Davis II, G. D. Abowd, and M. Blount. HealthSense: Classification
of health-related sensor data through user-assisted machine learning. In Proceedings of the 9th
Workshop on Mobile Computing Systems and Applications, pages 1-5. ACM, New York, 2008.
153. M. Sugiyama. Active learning in approximately linear regression based on conditional
expectation of generalization error. Journal of Machine Learning Research, 7:141-166, January
2006.
154. M. Sugiyama. Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research, 8:1027-1061, May 2007.
155. M. Sugiyama. Superfast-trainable multi-class probabilistic classifier by least-squares posterior fitting. IEICE Transactions on Information and Systems, E93-D(10):2690-2701, 2010.
156. M. Sugiyama, T. Ide, S. Nakajima, and J. Sese. Semi-supervised local Fisher discriminant
analysis for dimensionality reduction. Machine Learning, 78(1-2):35-61, 2010.
157. M. Sugiyama, T. Kanamori, T. Suzuki, S. Hido, J. Sese, I. Takeuchi, and L. Wang. A
density-ratio framework for statistical data processing. IPSJ Transactions on Computer Vision
and Applications, 1:183-208, 2009.
158. M. Sugiyama, M. Kawanabe, and P. L. Chui. Dimensionality reduction for density ratio
estimation in high-dimensional spaces. Neural Networks, 23(1):44-59, 2010.
159. M. Sugiyama, M. Kawanabe, and K.-R. Müller. Trading variance reduction with unbiasedness: The regularized subspace information criterion for robust model selection in kernel
regression. Neural Computation, 16(5):1077-1104, 2004.
160. M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance
weighted cross validation. Journal of Machine Learning Research, 8:985-1005, May 2007.
161. M. Sugiyama and K.-R. Müller. The subspace information criterion for infinite dimensional
hypothesis spaces. Journal of Machine Learning Research, 3:323-359, November 2002.
162. M. Sugiyama and K.-R. Müller. Input-dependent estimation of generalization error under
covariate shift. Statistics & Decisions, 23(4):249-279, 2005.
163. M. Sugiyama and S. Nakajima. Pool-based active learning in approximate linear regression.
Machine Learning, 75(3):249-274, 2009.
164. M. Sugiyama, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance
estimation with model selection and its application to covariate shift adaptation. In J. C. Platt,
D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems,
volume 20, pages 1433-1440. MIT Press, Cambridge, MA, 2008.
165. M. Sugiyama and H. Ogawa. Subspace information criterion for model selection. Neural
Computation, 13(8):1863-1889, 2001.
166. M. Sugiyama and H. Ogawa. Active learning with model selection-Simultaneous optimization of sample points and models for trigonometric polynomial models. IEICE Transactions on
Information and Systems, E86-D(12):2753-2763, 2003.
167. M. Sugiyama and N. Rubens. A batch ensemble approach to active learning with model
selection. Neural Networks, 21(9):1278-1286, 2008.
168. M. Sugiyama and T. Suzuki. Least-squares independence test. IEICE Transactions on
Information and Systems, E94-D(6):1333-1336, 2011.
169. M. Sugiyama, T. Suzuki, Y. Itoh, T. Kanamori, and M. Kimura. Least-squares two-sample
test. Neural Networks, 24(7):735-751, 2011.
170. M. Sugiyama, T. Suzuki, and T. Kanamori. Density Ratio Estimation in Machine Learning.
Cambridge University Press, Cambridge, UK, 2012.
171. M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe.
Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical
Mathematics, 60(4):699-746, 2008.
172. M. Sugiyama, I. Takeuchi, T. Suzuki, T. Kanamori, H. Hachiya, and D. Okanohara. Least-squares conditional density estimation. IEICE Transactions on Information and Systems, E93-
D(3):583-594, 2010.
173. M. Sugiyama, M. Yamada, P. von Bünau, T. Suzuki, T. Kanamori, and M. Kawanabe. Direct
density-ratio estimation with dimensionality reduction via least-squares hetero-distributional
subspace search. Neural Networks, 24(2):183-198, 2011.
174. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press,
Cambridge, MA, 1998.
175. T. Suzuki and M. Sugiyama. Sufficient dimension reduction via squared-loss mutual
information estimation. In Y. W. Teh and M. Titterington, editors, Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, volume 9 of JMLR Workshop and
Conference Proceedings, pages 804-811. 2010.
176. T. Suzuki and M. Sugiyama. Least-squares independent component analysis. Neural
Computation, 23(1):284-301, 2011.
177. T. Suzuki, M. Sugiyama, T. Kanamori, and J. Sese. Mutual information estimation reveals
global associations between stimuli and biological processes. BMC Bioinformatics, 10(1):S52,
2009.
178. T. Suzuki, M. Sugiyama, J. Sese, and T. Kanamori. Approximating mutual information by
maximum likelihood density ratio estimation. In Y. Saeys, H. Liu, I. Inza, L. Wehenkel, and Y. Van
de Peer, editors, Proceedings of the ECML-PKDD2008 Workshop on New Challenges for Feature
Selection in Data Mining and Knowledge Discovery, volume 4 of JMLR Workshop and Conference
Proceedings, pages 5-20. 2008.
179. T. Suzuki, M. Sugiyama, and T. Tanaka. Mutual information approximation via maximum
likelihood estimation of density ratio. In Proceedings of the IEEE International Symposium on
Information Theory, pages 463-467. 2009.
180. A. Takeda, J. Gotoh, and M. Sugiyama. Support vector regression as conditional value-at-risk minimization with application to financial time-series analysis. In S. Kaski, D. J. Miller,
E. Oja, and A. Honkela, editors, IEEE International Workshop on Machine Learning for Signal
Processing, pages 118-123. 2010.
181. A. Takeda and M. Sugiyama. On generalization performance and non-convex optimization
of extended v-support vector machine. New Generation Computing, 27(3):259-279, 2009.
182. K. Takeuchi. Distribution of information statistics and validity criteria of models. Mathematical Sciences, 153:12-18, 1976. (In Japanese).
183. R. Tibshirani. Regression shrinkage and subset selection with the Lasso. Journal of the Royal
Statistical Society, Series B, 58(1):267-288, 1996.
184. F. H. C. Tivive and A. Bouzerdoum. A gender recognition system using shunting inhibitory
convolutional neural networks. In Proceedings of the IEEE International Joint Conference on
Neural Networks, volume 10, pages 5336-5341. 2006.
185. Y. Tsuboi, H. Kashima, S. Hido, S. Bickel, and M. Sugiyama. Direct density ratio estimation
for large-scale covariate shift adaptation. In M. J. Zaki, K. Wang, C. Apte, and H. Park, editors,
Proceedings of the 8th SIAM International Conference on Data Mining, pages 443-454. SIAM,
2008.
186. Y. Tsuboi, H. Kashima, S. Hido, S. Bickel, and M. Sugiyama. Direct density ratio estimation
for large-scale covariate shift adaptation. Journal of Information Processing, 17:138-155, 2009.
187. Y. Tsuboi, H. Kashima, S. Mori, H. Oda, and Y. Matsumoto. Training conditional random fields using incomplete annotations. In Proceedings of the 22nd International Conference
on Computational Linguistics, pages 897-904. 2008.
188. K. Ueki, M. Sugiyama, and Y. Ihara. Perceived age estimation under lighting condition
change by covariate shift adaptation. In Proceedings of the 20th International Conference on
Pattern Recognition, pages 3400-3403. 2010.
189. K. Ueki, M. Sugiyama, and Y. Ihara. Semi-supervised estimation of perceived age from
face images. In Proceedings of the International Conference on Computer Vision Theory and
Applications, pages 319-324. 2010.
190. L. G. Valiant. A theory of the learnable. Communications of the Association for Computing
Machinery, 27:1134-1142, 1984.
191. S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.
192. A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With
Applications to Statistics. Springer, New York, 1996.
193. V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
194. C. Vidaurre, A. Schlögl, R. Cabeza, and G. Pfurtscheller. About adaptive classifiers for brain
computer interfaces. Biomedizinische Technik, 49(1):85-86, 2004.
195. G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, 1990.
196. S. Watanabe. Algebraic Geometry and Statistical Learning Theory. Cambridge University
Press, Cambridge, UK, 2009.
197. H. White. Maximum likelihood estimation of misspecified models. Econometrica,
50(1):1-25, 1982.
198. G. Wichern, M. Yamada, H. Thornburg, M. Sugiyama, and A. Spanias. Automatic audio
tagging using covariate shift adaptation. In Proceedings of the IEEE International Conference on
Acoustics, Speech, and Signal Processing, pages 253-256. 2010.
199. D. P. Wiens. Robust weights and designs for biased regression models: Least squares and
generalized M-estimation. Journal of Statistical Planning and Inference, 83(2):395-412, 2000.
200. P. M. Williams. Bayesian regularization and pruning using a Laplace prior. Neural
Computation, 7(1):117-143, 1995.
201. J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan. Brain-computer interfaces for communication and control. Clinical Neurophysiology, 113(6):767-791,
2002.
202. M. Yamada and M. Sugiyama. Direct importance estimation with Gaussian mixture models.
IEICE Transactions on Information and Systems, E92-D(10):2159-2162, 2009.
203. M. Yamada and M. Sugiyama. Dependence minimizing regression with model selection
for non-linear causal inference under non-Gaussian noise. In Proceedings of the 24th AAAI
Conference on Artificial Intelligence, pages 643-648. AAAI Press, 2010.
204. M. Yamada, M. Sugiyama, and T. Matsui. Semi-supervised speaker identification under
covariate shift. Signal Processing, 90(8):2353-2361, 2010.
205. M. Yamada, M. Sugiyama, G. Wichern, and J. Simm. Direct importance estimation with a
mixture of probabilistic principal component analyzers. IEICE Transactions on Information and
Systems, E93-D(10):2846-2849, 2010.
206. M. Yamada, M. Sugiyama, G. Wichern, and J. Simm. Improving the accuracy of
least-squares probabilistic classifiers. IEICE Transactions on Information and Systems, E94-
D(6):1337-1340, 2011.
207. K. Yamagishi, N. Ito, H. Kosuga, N. Yasuda, K. Isogami, and N. Kozuno. A simplified
measurement of farm worker's load using an accelerometer. Journal of the Japanese Society of
Agricultural Technology Management, 9(2):127-132, 2002. (In Japanese.)
208. L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In L. K. Saul, Y. Weiss,
and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 17, pages
1601-1608. MIT Press, Cambridge, MA, 2005.
Index
Active learning, 183
ensemble, 219
with model selection, 215
pool-based, 204, 234
population-based, 188, 225
Active learning/model selection dilemma, 215
Active policy iteration, 231
Affine transform, 236
Age prediction, 152
Akaike information criterion
importance-weighted, 47
Ball batting, 232
Basis
polynomial, 10, 198, 210, 221, 236
trigonometric polynomial, 10, 216
Bayes decision rule, 143
Bellman equation, 166
Bellman residual, 168
Bi-orthogonality, 104
Bias-variance decomposition, 186, 228
Bias-variance trade-off, 23
Boosting
importance-weighted, 40
Cepstral mean normalization, 142
Chebyshev approximation, 35
Classification, 8, 35, 41
Common spatial patterns, 138
Conditional random field, 150
Conditional value-at-risk, 34, 40
Confidence, 38, 160
Consistency, 21
Covariate shift, 3, 9
Cross-validation, 74, 77, 80, 84
importance-weighted, 64, 144, 156, 169, 174
leave-one-out, 65, 88, 113
Curse of dimensionality, 11, 74
Design matrix, 27
Dimensionality reduction, 103
Dirac delta function, 166
Direct density-ratio estimation with
dimensionality reduction, 103
Distribution
conditional, 8
test input, 9, 73
training input, 7, 73
Domain adaptation, 149
Dual basis, 104
EM algorithm, 146
Empirical risk minimization, 21
importance-weighted, 23
Error function, 129
Event-related desynchronization, 138
Expected shortfall, 35
Experimental design, 183
Extrapolation, 5
Feasible region, 25
Feature selection, 25
Fisher discriminant analysis, 108, 137
importance-weighted, 36, 43, 138
local, 109
Flattening parameter, 23, 47, 140
F-measure, 151
Generalization error, 9, 53, 183
estimation, 47
input-dependent analysis, 51, 54, 193, 206
input-independent analysis, 51, 64, 191, 195,
205, 207
single-trial, 51
Generalized eigenvalue problem, 37
Gradient method, 28, 32
conjugate, 39
stochastic, 29, 33
Gram-Schmidt orthonormalization, 112
Hazard ratio, 131
Heteroscedastic noise, 228
Huber regression
importance-weighted, 31
Human activity recognition, 157
Idempotence, 105
Importance estimation
kernel density estimation, 73
kernel mean matching, 75
Kullback-Leibler importance estimation
procedure, 78, 144, 156
least-squares importance fitting, 83
logistic regression, 76
unconstrained least-squares importance
fitting, 87
Importance sampling, 22
Importance weight, 9, 22
adaptive, 23, 169
Akaike information criterion, 47
boosting, 40
classification, 35
cross-validation, 64, 144, 156, 169, 174
empirical risk minimization, 23
estimation, 73, 103
Fisher discriminant analysis, 138
least squares, 26, 41, 66, 190
logistic regression, 38, 144
regression, 25
regularized, 23, 153
subspace information criterion, 54
support vector machine, 39
Inverse Mills ratio, 131
Inverted pendulum, 176
Jacobian, 104
Kernel
Gaussian, 12, 29, 73, 75, 84, 144, 153
model, 12, 29, 143, 153
sequence, 143
Kernel density estimation, 73
Kernel mean matching, 75
k-means clustering, 146
Kullback-Leibler divergence, 47, 78
Kullback-Leibler importance estimation
procedure, 78, 144, 150, 156
Learning matrix, 27
Least squares, 38, 41, 55
general, 133
importance-weighted, 26, 41, 66, 205
Least-squares importance fitting, 83
Linear discriminant analysis, see Fisher
discriminant analysis
Linear learning, 49, 59, 184
Linear program, 30
Logistic regression, 76
importance weight, 38, 144
Index
Loss
0/1,8,35,64
absolute,30
classification,38
deadzone-linear,33
exponential,40
hinge,39
Huber,31
logistic,39
regression,25
squared,8,26,38,41,49,53,55,83,183
Machine learning,3
Mapping
hetero-distributional,106
homo-distributional,106
Markov decision problem,165
Maximum likelihood,38
Mel-frequency cepstrum coefficient,142
Model
additive,11
approximately correct,14,53,192,193,195,
229
correctly specified,13,189
Gaussian mixture,80,142
kernel,12,29,143,153
linear-in-input,10,25,37,39,41,43
linear-in-parameter,10,27,32,49,54,78,
80,83,184,198,221
log-linear,77,80,150
logistic,38,143,150
misspecified,14,53,190
multiplicative,11
nonparametric,13,73
parametric,13
probabilistic principal-component-analyzer
mixture,80,123
Model drift, 216
Model error,54,57,186,188,229
Model overfitting,219
Model selection,23,47,174
Natural language processing,149
Newton's method,39
Oblique projection,104
Off-policy reinforcement learning,169
Outliers,30,31
Parametric optimization,85
Policy iteration,166
Probit model,129
Quadratic program,32,75,84
Regression,8,25,40,66
Regularization parameter,24
Regularization path tracking,85
Regularizer
absolute,24
squared,24
Reinforcement learning,4,165,225
Rejection sampling,200
Resampling weight function,205
Robustness parameter,31
Sample
test,9,73
training,7,73
Sample selection bias,125
Sample reuse policy iteration,175,225
Scatter matrix
between-class,36,108
local between-class,110
local within-class,110
within-class,36,108
Semisupervised learning,4,142
Sherman-Morrison-Woodbury formula,88
Speaker identification,142
Sphering,105
Subspace
hetero-distributional,104
homo-distributional,104
Subspace information criterion
importance-weighted,54
Supervised learning,3,7
Support vector machine,75,142
importance-weighted, 39
Support vector regression,33
Test
input distribution, 9,73
samples,9,73
Training
input distribution, 7,73
samples,7,73
Unbiasedness,66
asymptotic,50,60,66
Unconstrained least-squares importance
fitting,87,113
Universal reproducing kernel Hilbert space,75
Unsupervised learning,3
Value function,166
Value-at-risk,34
Variable selection,25
Wafer alignment,234
Adaptive Computation and Machine Learning
Thomas Dietterich, Editor
Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns,
Associate Editors
Bioinformatics: The Machine Learning Approach, Pierre Baldi and Søren Brunak
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
Graphical Models for Machine Learning and Digital Communication, Brendan J. Frey
Learning in Graphical Models, Michael I. Jordan
Causation, Prediction, and Search, second edition, Peter Spirtes, Clark Glymour, and
Richard Scheines
Principles of Data Mining, David Hand, Heikki Mannila, and Padhraic Smyth
Bioinformatics: The Machine Learning Approach, second edition, Pierre Baldi and
Søren Brunak
Learning Kernel Classifiers: Theory and Algorithms, Ralf Herbrich
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and
Beyond, Bernhard Schölkopf and Alexander J. Smola
Introduction to Machine Learning, Ethem Alpaydin
Gaussian Processes for Machine Learning, Carl Edward Rasmussen and Christopher
K.I. Williams
Semi-Supervised Learning, Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien,
Eds.
The Minimum Description Length Principle, Peter D. Grünwald
Introduction to Statistical Relational Learning, Lise Getoor and Ben Taskar, Eds.
Probabilistic Graphical Models: Principles and Techniques, Daphne Koller and Nir
Friedman
Introduction to Machine Learning, second edition, Ethem Alpaydin
Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift
Adaptation, Masashi Sugiyama and Motoaki Kawanabe
Boosting: Foundations and Algorithms, Robert E. Schapire and Yoav Freund

II LEARNING UNDER COVARIATE SHIFT
2 Function Approximation 21
2.1 Importance-Weighting Techniques for Covariate Shift Adaptation 22
2.1.1 Importance-Weighted ERM 22
2.1.2 Adaptive IWERM 23
2.1.3 Regularized IWERM 23
2.2 Examples of Importance-Weighted Regression Methods 25
2.2.1 Squared Loss: Least-Squares Regression 26
2.2.2 Absolute Loss: Least-Absolute Regression 30
2.2.3 Huber Loss: Huber Regression 31
2.2.4 Deadzone-Linear Loss: Support Vector Regression 33
2.3 Examples of Importance-Weighted Classification Methods 35
2.3.1 Squared Loss: Fisher Discriminant Analysis 36
2.3.2 Logistic Loss: Logistic Regression Classifier 38
2.3.3 Hinge Loss: Support Vector Machine 39
2.3.4 Exponential Loss: Boosting 40
2.4 Numerical Examples 40
2.4.1 Regression 40
2.4.2 Classification 41
2.5 Summary and Discussion 45
3 Model Selection 47
3.1 Importance-Weighted Akaike Information Criterion 47
3.2 Importance-Weighted Subspace Information Criterion 50
3.2.1 Input Dependence vs. Input Independence in Generalization Error Analysis 51
3.2.2 Approximately Correct Models 53
3.2.3 Input-Dependent Analysis of Generalization Error 54
3.3 Importance-Weighted Cross-Validation 64
3.4 Numerical Examples 66
3.4.1 Regression 66
3.4.2 Classification 69
3.5 Summary and Discussion 70
4 Importance Estimation 73
4.1 Kernel Density Estimation 73
4.2 Kernel Mean Matching 75
4.3 Logistic Regression 76
4.4 Kullback-Leibler Importance Estimation Procedure 78
4.4.1 Algorithm 78
4.4.2 Model Selection by Cross-Validation 81
4.4.3 Basis Function Design 82
4.5 Least-Squares Importance Fitting 83
4.5.1 Algorithm 83
4.5.2 Basis Function Design and Model Selection 84
4.5.3 Regularization Path Tracking 85
4.6 Unconstrained Least-Squares Importance Fitting 87
4.6.1 Algorithm 87
4.6.2 Analytic Computation of Leave-One-Out Cross-Validation 88
4.7 Numerical Examples 88
4.7.1 Setting 90
4.7.2 Importance Estimation by KLIEP 90
4.7.3 Covariate Shift Adaptation by IWLS and IWCV 92
4.8 Experimental Comparison 94
4.9 Summary 101
5 Direct Density-Ratio Estimation with Dimensionality Reduction 103
5.1 Density Difference in Hetero-Distributional Subspace 103
5.2 Characterization of Hetero-Distributional Subspace 104
5.3 Identifying Hetero-Distributional Subspace 106
5.3.1 Basic Idea 106
5.3.2 Fisher Discriminant Analysis 108
5.3.3 Local Fisher Discriminant Analysis 109
5.4 Using LFDA for Finding Hetero-Distributional Subspace 112
5.5 Density-Ratio Estimation in the Hetero-Distributional Subspace 113
5.6 Numerical Examples 113
5.6.1 Illustrative Example 113
5.6.2 Performance Comparison Using Artificial Data Sets 117
5.7 Summary 121
6 Relation to Sample Selection Bias 125
6.1 Heckman's Sample Selection Model 125
6.2 Distributional Change and Sample Selection Bias 129
6.3 The Two-Step Algorithm 131
6.4 Relation to Covariate Shift Approach 134
7 Applications of Covariate Shift Adaptation 137
7.1 Brain-Computer Interface 137
7.1.1 Background 137
7.1.2 Experimental Setup 138
7.1.3 Experimental Results 140
7.2 Speaker Identification 142
7.2.1 Background 142
7.2.2 Formulation 142
7.2.3 Experimental Results 144
7.3 Natural Language Processing 149
7.3.1 Formulation 149
7.3.2 Experimental Results 151
7.4 Perceived Age Prediction from Face Images 152
7.4.1 Background 152
7.4.2 Formulation 153
7.4.3 Incorporating Characteristics of Human Age Perception 153
7.4.4 Experimental Results 155
7.5 Human Activity Recognition from Accelerometric Data 157
7.5.1 Background 157
7.5.2 Importance-Weighted Least-Squares Probabilistic Classifier 157
7.5.3 Experimental Results 160
7.6 Sample Reuse in Reinforcement Learning 165
7.6.1 Markov Decision Problems 165
7.6.2 Policy Iteration 166
7.6.3 Value Function Approximation 167
7.6.4 Sample Reuse by Covariate Shift Adaptation 168
7.6.5 On-Policy vs. Off-Policy 169
7.6.6 Importance Weighting in Value Function Approximation 170
7.6.7 Automatic Selection of the Flattening Parameter 174
7.6.8 Sample Reuse Policy Iteration 175
7.6.9 Robot Control Experiments 176
III LEARNING CAUSING COVARIATE SHIFT
8 Active Learning 183
8.1 Preliminaries 183
8.1.1 Setup 183
8.1.2 Decomposition of Generalization Error 185
8.1.3 Basic Strategy of Active Learning 188
8.2 Population-Based Active Learning Methods 188
8.2.1 Classical Method of Active Learning for Correct Models 189
8.2.2 Limitations of Classical Approach and Countermeasures 190
8.2.3 Input-Independent Variance-Only Method 191
8.2.4 Input-Dependent Variance-Only Method 193
8.2.5 Input-Independent Bias-and-Variance Approach 195
8.3 Numerical Examples of Population-Based Active Learning Methods 198
8.3.1 Setup 198
8.3.2 Accuracy of Generalization Error Estimation 200
8.3.3 Obtained Generalization Error 202
8.4 Pool-Based Active Learning Methods 204
8.4.1 Classical Active Learning Method for Correct Models and Its Limitations 204
8.4.2 Input-Independent Variance-Only Method 205
8.4.3 Input-Dependent Variance-Only Method 206
8.4.4 Input-Independent Bias-and-Variance Approach 207
8.5 Numerical Examples of Pool-Based Active Learning Methods 209
8.6 Summary and Discussion 212
9 Active Learning with Model Selection 215
9.1 Direct Approach and the Active Learning/Model Selection Dilemma 215
9.2 Sequential Approach 216
9.3 Batch Approach 218
9.4 Ensemble Active Learning 219
9.5 Numerical Examples 220
9.5.1 Setting 220
9.5.2 Analysis of Batch Approach 221
9.5.3 Analysis of Sequential Approach 222
9.5.4 Comparison of Obtained Generalization Error 222
9.6 Summary and Discussion 223
10 Applications of Active Learning 225
10.1 Design of Efficient Exploration Strategies in Reinforcement Learning 225
10.1.1 Efficient Exploration with Active Learning 225
10.1.2 Reinforcement Learning Revisited 226
10.1.3 Decomposition of Generalization Error 228
10.1.4 Estimating Generalization Error for Active Learning 229
10.1.5 Designing Sampling Policies 230
10.1.6 Active Learning in Policy Iteration 231
10.1.7 Robot Control Experiments 232
10.2 Wafer Alignment in Semiconductor Exposure Apparatus 234
IV CONCLUSIONS
11 Conclusions and Future Prospects 241
11.1 Conclusions 241
11.2 Future Prospects 242
Appendix: List of Symbols and Abbreviations 243
Bibliography 247
Index 259
  • 10. Foreword Modern machine learning faces a number of grand challenges. The ever growing World Wide Web, high-throughput methods in genomics, and modern imaging methods in brain science, to name just a few, pose ever larger problems where learning methods need to scale and to increase their efficiency, and algorithms need to become able to deal with million-dimensional inputs at terabytes of data. At the same time it becomes more and more important to efficiently and robustly model highly complex problems that are structured (e.g., a grammar underlies the data) and exhibit nonlinear behavior. In addition, data from the real world are typically non-stationary, so there is a need to compensate for the non-stationary aspects of the data in order to map the problem back to stationarity. Finally, while machine learning and modern statistics generate a vast number of algorithms that tackle the above challenges, it becomes increasingly important for the practitioner not only to predict and generalize well on unseen data but also to explain the nonlinear predictive learning machine, that is, to harvest the prediction capability for making inferences about the world that will contribute to a better understanding of the sciences. The present book contributes to one aspect of the above-mentioned grand challenges: namely, the world of non-stationary data is addressed. Classically, learning always assumes that the underlying probability distribution of the data from which inference is made stays the same. In other words, it is understood that there is no change in distribution between the sample from which we learn and the novel (unseen) out-of-sample data. In many practical settings this assumption is incorrect, and thus standard prediction will likely be suboptimal.
The present book very successfully assembles the state-of-the-art research results on learning in non-stationary environments, with a focus on the covariate shift model, and has embedded this body of work into the general literature from machine learning (semisupervised learning, online learning,
  • 11. transductive learning, domain adaptation) and statistics (sample selection bias). It will be an excellent starting point for future research in machine learning, statistics, and engineering that strives for truly autonomous learning machines that are able to learn under non-stationarity. Klaus-Robert Müller, Machine Learning Laboratory, Computer Science Department, Technische Universität Berlin, Germany
  • 12. Preface In the twenty-first century, theory and practical algorithms of machine learning have been studied extensively, and there has been a rapid growth of computing power and the spread of the Internet. These machine learning methods are usually based on the presupposition that the data generation mechanism does not change over time. However, modern real-world applications of machine learning such as image recognition, natural language processing, speech recognition, robot control, bioinformatics, computational chemistry, and brain signal analysis often violate this important presumption, raising a challenge in the machine learning and statistics communities. To cope with this non-stationarity problem, various approaches have been investigated in machine learning and related research fields. They are called by various names, such as covariate shift adaptation, sample selection bias, semisupervised learning, transfer learning, and domain adaptation. In this book, we consistently use the term covariate shift adaptation, and cover issues including theory and algorithms of function approximation, model selection, active learning, and real-world applications. We were motivated to write the present book when we held a seminar on machine learning in a non-stationary environment at the Mathematisches Forschungsinstitut Oberwolfach (MFO) in Germany in 2008, together with Prof. Dr. Klaus-Robert Müller and Mr. Paul von Bünau of the Technische Universität Berlin. We thank them for their constant support and their encouragement to finish this book. Most of this book is based on the journal and conference papers we have published since 2005.
We acknowledge all the collaborators for their fruitful discussions: Takayuki Akiyama, Hirotaka Hachiya, Shohei Hido, Tsuyoshi Ide, Yasuyuki Ihara, Takafumi Kanamori, Hisashi Kashima, Matthias Krauledat, Shin-ichi Nakajima, Hidemitsu Ogawa, Jun Sese, Taiji Suzuki, Ichiro Takeuchi, Yuta Tsuboi, Kazuya Ueki, and Makoto Yamada. Finally, we thank
  • 13. the Ministry of Education, Culture, Sports, Science and Technology in Japan, the Alexander von Humboldt Foundation in Germany, the Okawa Foundation, the Microsoft Institute for Japanese Academic Research Collaboration's Collaborative Research Project, the IBM Faculty Award, the Mathematisches Forschungsinstitut Oberwolfach Research-in-Pairs Program, the Asian Office of Aerospace Research and Development, the Support Center for Advanced Telecommunications Technology Research Foundation, the Japan Society for the Promotion of Science, the Funding Program for World-Leading Innovative R&D on Science and Technology, the Federal Ministry of Economics and Technology, Germany, and the IST program of the European Community under the PASCAL2 Network of Excellence for their financial support. [Photo taken at the Mathematisches Forschungsinstitut Oberwolfach (MFO) in October 2008. From right to left, Masashi Sugiyama, Prof. Dr. Klaus-Robert Müller, Motoaki Kawanabe, and Mr. Paul von Bünau. Photo courtesy of Archives of the Mathematisches Forschungsinstitut Oberwolfach.]
  • 15. 1 Introduction and Problem Formulation In this chapter, we provide an introduction to covariate shift adaptation toward machine learning in a non-stationary environment. 1.1 Machine Learning under Covariate Shift Machine learning is an interdisciplinary field of science and engineering that studies mathematical foundations and practical applications of systems that learn. Depending on the type of learning, paradigms of machine learning can be categorized into three types: • Supervised learning The goal of supervised learning is to infer an underlying input-output relation based on input-output samples. Once the underlying relation can be successfully learned, output values for unseen input points can be predicted. Thus, the learning machine can generalize to unexperienced situations. Studies of supervised learning are aimed at letting the learning machine acquire the best generalization performance from a small number of training samples. The supervised learning problem can be formulated as a function approximation problem from samples. • Unsupervised learning In contrast to supervised learning, output values are not provided as training samples in unsupervised learning. The general goal of unsupervised learning is to extract valuable information hidden behind data. However, its specific goal depends heavily on the situation, and the unsupervised learning problem is sometimes not mathematically well-defined. Data clustering aimed at grouping similar data is a typical example. In data clustering, how to measure the similarity between data samples needs to be predetermined, but there is no objective criterion that can quantitatively evaluate the validity of the affinity measure; often it is merely subjectively determined.
  • 16. • Reinforcement learning The goal of reinforcement learning is to acquire a policy function (a mapping from a state to an action) of a computer agent. The policy function is an input-output relation, so the goal of reinforcement learning is the same as that of supervised learning. However, unlike supervised learning, the output data cannot be observed directly. Therefore, the policy function needs to be learned without supervisors. However, in contrast to unsupervised learning, rewards are provided as training samples for an agent's action. Based on the reward information, reinforcement learning tries to learn the policy function in such a way that the sum of rewards the agent will receive in the future is maximized. The purpose of this book is to provide a comprehensive overview of theory, algorithms, and applications of supervised learning under the situation called covariate shift. When developing methods of supervised learning, it is commonly assumed that samples used as a training set and data points used for testing the generalization performance¹ follow the same probability distribution (e.g., [195, 20, 193, 42, 74, 141]). However, this common assumption is not fulfilled in recent real-world applications of machine learning such as robot control, brain signal analysis, and bioinformatics. Thus, there is a strong need for theories and algorithms of supervised learning under such a changing environment. However, if there is no connection between training data and test data, nothing about test data can be learned from training samples. This means that a reasonable assumption is necessary for relating training samples to test data. Covariate shift is one of the assumptions in supervised learning. The situation where the training input points and test input points follow different probability distributions, but the conditional distributions of output values given input points are unchanged, is called the covariate shift [145]. This means that the target function we want to learn is unchanged between the training phase and the test phase, but the distributions of input points are different for training and test data. A situation of supervised learning where input-only samples are available in addition to input-output samples is called semisupervised learning [30]. The covariate shift adaptation techniques covered in this book fall into the category of semisupervised learning since input-only samples drawn from the test distribution are utilized for improving the generalization performance under covariate shift. 1. Such test points are not available during the training phase; they are given in the future after training has been completed.
  • 17. 1.2. Quick Tour of Covariate Shift Adaptation 1.2 Quick Tour of Covariate Shift Adaptation 5 Before going into the technical detail, in this section we briefly describe the core idea of covariate shift adaptation, using an illustative example. To this end, let us consider a regression problem of learning a function f(x) from its samples {(x:r,yt)}7�1' Once a good approximation fuction l(x) is obtained, we can predict the output value yte at an unseen test input point xte by means of ?exte). Let us consider a covariate shift situation where the training and test input points follow different probability distributions, but the learning target func­ tion f(x) is common to both training and test samples. In the toy regression example illustrated in figure l.la, training samples are located in the left-hand side of the graph and test samples are distributed in the right-hand side. Thus, this is an extrapolation problem where the test samples are located outside the -0.5 0.5 1.5 2.5 (a) Training and test data 1.5 (c) Function learned by ordinary least squares Figure 1.1 0.5 -0.5 I , I , : I : I : I (b) Input data densities 1.5 (d) Function learned by importance­ weighted least squares A regression example with covariate shift. (a) The learning target function I(x) (the solid line), training samples (0), and test samples (x). (b) Probability density functions of training and test input points and their ratio. (c) Learned function lex) (the dashed line) obtained by ordinary least squares. (d) Learned function lex) (the dashed-dotted line) obtained by importance-weighted least squares. Note that the test samples are not used for function learning.
  • 18. 6 1 Introduction and Problem Formulation training region. Note that the test samples are not given to us in the training phase; they are plotted in the graph only for illustration purposes. The prob­ ability densities of the training and test input points, Plf(X) and Pie(X), are plotted in figure 1. 1b. Let us consider straight-line function fitting by the method of least squares: where This ordinary least squares gives a function that goes through the training sam­ ples well, as illustrated in figure LIe. However, the function learned by least squares is not useful for predicting the output values of the test samples located in the right-hand side of the graph. Intuitively, training samples that are far from the test region (say, training samples with x < 1 in figure 1. 1a) are less informative for predicting the output values of the test samples located in the right-hand side of the graph. This gives the idea that ignoring such less informative training samples and learning only from the training samples that are close to the test region (say, training samples with x> 1.2 in figure 1. 1a) is more promising. The key idea of covariate shift adaptation is to (softly) choose informative training samples in a systematic way, by considering the importance of each training sample in the prediction of test output values. More specifically, we use the ratio of training and test input densities (see figure l.lb), Ple(xn - ( If)'Plf Xj as a weight for the i-th training sample in the least-squares fitting: Then we can obtain a function that extrapolates the test samples well (see figure 1.ld). Note that the test samples are not used for obtaining this function. In this example, the training samples located in the left-hand side of the graph (say, x < 1.2) have almost zero importance (see figure LIb). Thus, these sam­ ples are essentially ignored in the above importance-weighted least-squares
method, and informative samples in the middle of the graph are automatically selected by importance weighting.

As illustrated above, importance weights play an essential role in covariate shift adaptation. Below, the problem of covariate shift adaptation is formulated more formally.

1.3 Problem Formulation

In this section, we formulate the supervised learning problem, which includes regression and classification. We pay particular attention to covariate shift and model misspecification; these two issues play the central roles in the following chapters.

1.3.1 Function Learning from Examples

Let us consider the supervised learning problem of estimating an unknown input-output dependency from training samples. Let

$$\{(x_i^{tr}, y_i^{tr})\}_{i=1}^{n_{tr}}$$

be the training samples, where the training input point $x_i^{tr} \in \mathcal{X}$ is an independent and identically distributed (i.i.d.) sample following a probability distribution $P_{tr}(x)$ with density $p_{tr}(x)$:

$$\{x_i^{tr}\}_{i=1}^{n_{tr}} \overset{\mathrm{i.i.d.}}{\sim} P_{tr}(x).$$

The training output value $y_i^{tr} \in \mathcal{Y} \subset \mathbb{R}$, $i = 1, 2, \ldots, n_{tr}$, follows a conditional probability distribution $P(y \mid x)$ with conditional density $p(y \mid x)$. $P(y \mid x)$ may be regarded as the superposition of the true output $f(x)$ and noise $\epsilon$:

$$y = f(x) + \epsilon.$$
We assume that the noise $\epsilon$ has mean $0$ and variance $\sigma^2$. Then the function $f(x)$ coincides with the conditional mean of $y$ given $x$. The above formulation is summarized in figure 1.2.

1.3.2 Loss Functions

Let $\mathrm{loss}(x, y, \widehat{y})$ be the loss function which measures the discrepancy between the true output value $y$ at an input point $x$ and its estimate $\widehat{y}$. In regression scenarios where $\mathcal{Y}$ is continuous, the squared loss is often used:

$$\mathrm{loss}(x, y, \widehat{y}) = (\widehat{y} - y)^2.$$

On the other hand, in binary classification scenarios where $\mathcal{Y} = \{+1, -1\}$, the following 0/1-loss is a typical choice, since it corresponds to the misclassification rate:

$$\mathrm{loss}(x, y, \widehat{y}) = \begin{cases} 0 & \text{if } \mathrm{sgn}(\widehat{y}) = y, \\ 1 & \text{otherwise}, \end{cases}$$

where $\mathrm{sgn}(\widehat{y})$ denotes the sign of $\widehat{y}$:

$$\mathrm{sgn}(\widehat{y}) := \begin{cases} +1 & \text{if } \widehat{y} > 0, \\ 0 & \text{if } \widehat{y} = 0, \\ -1 & \text{if } \widehat{y} < 0. \end{cases}$$

Although the above loss functions are independent of $x$, the loss can generally depend on $x$ [141].

Figure 1.2 Framework of supervised learning.
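The two loss functions above translate directly into code. This is a small illustrative sketch (the helper names are our own):

```python
import numpy as np

def squared_loss(y, y_hat):
    """Squared loss for regression: (y_hat - y)^2."""
    return (y_hat - y) ** 2

def zero_one_loss(y, y_hat):
    """0/1 loss for binary classification with y in {+1, -1}.

    np.sign follows the same convention as the sgn definition in the
    text: +1 for positive, 0 at zero, -1 for negative values."""
    return 0.0 if np.sign(y_hat) == y else 1.0
```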
1.3.3 Generalization Error

Let us consider a test sample $(x^{te}, y^{te})$, which is not given in the training phase but will be given in the test phase. $x^{te} \in \mathcal{X}$ is a test input point following a test distribution $P_{te}(x)$ with density $p_{te}(x)$, and $y^{te} \in \mathcal{Y}$ is a test output value following the conditional distribution $P(y \mid x = x^{te})$ with conditional density $p(y \mid x = x^{te})$. Note that the conditional distribution is common to both training and test samples.

The test error expected over all test samples (or the generalization error) is expressed as

$$\mathbb{E}_{x^{te}} \mathbb{E}_{y^{te}} \left[ \mathrm{loss}\left(x^{te}, y^{te}, \widehat{f}(x^{te}; \theta)\right) \right],$$

where $\mathbb{E}_{x^{te}}$ denotes the expectation over $x^{te}$ drawn from $P_{te}(x)$, and $\mathbb{E}_{y^{te}}$ denotes the expectation over $y^{te}$ drawn from $P(y \mid x = x^{te})$. The goal of supervised learning is to determine the value of the parameter $\theta$ so that the generalization error is minimized, that is, so that output values for unseen test input points can be accurately estimated in terms of the expected loss.

1.3.4 Covariate Shift

In standard supervised learning theories (e.g., [195,20,193,42,74,141,21]), the test input distribution $P_{te}(x)$ is assumed to agree with the training input distribution $P_{tr}(x)$. However, in this book we consider the situation under covariate shift [145], that is, the test input distribution $P_{te}(x)$ and the training input distribution $P_{tr}(x)$ are generally different:

$$P_{tr}(x) \neq P_{te}(x).$$

Under covariate shift, most of the standard machine learning techniques do not work properly due to the differing distributions. The main goal of this book is to provide machine learning methods that can mitigate the influence of covariate shift.

In the following chapters, we assume that the ratio of the test to training input densities is bounded, that is,

$$\frac{p_{te}(x)}{p_{tr}(x)} < \infty \quad \text{for all } x \in \mathcal{X}.$$

This means that the support of the test input distribution must be contained in that of the training input distribution. The above ratio is called the importance [51], and it plays a central role in covariate shift adaptation.
1.3.5 Models for Function Learning

Let us employ a parameterized function $\widehat{f}(x; \theta)$ for estimating the output value $y$, where

$$\theta = (\theta_1, \theta_2, \ldots, \theta_b)^\top \in \Theta.$$

Here, $\top$ denotes the transpose of a vector or a matrix, and $\Theta$ denotes the domain of the parameter $\theta$.

1.3.5.1 Linear-in-Input Model The simplest choice of parametric model would be the linear-in-input model:

$$\widehat{f}(x; \theta) = \theta_1 x^{(1)} + \theta_2 x^{(2)} + \cdots + \theta_d x^{(d)} + \theta_{d+1}, \qquad (1.1)$$

where

$$x = (x^{(1)}, x^{(2)}, \ldots, x^{(d)})^\top.$$

This model has linearity in both the input variable $x$ and the parameter $\theta$, and the number $b$ of parameters is $d + 1$, where $d$ is the dimensionality of $x$. The linear-in-input model can represent only a linear input-output relation, so its expressibility is limited. However, since the effect of each input variable $x^{(k)}$ can be specified directly by the parameter $\theta_k$, it has high interpretability. For this reason, this simple model is still often used in many practical data analysis tasks such as natural language processing, bioinformatics, and computational chemistry.

1.3.5.2 Linear-in-Parameter Model A slight extension of the linear-in-input model is the linear-in-parameter model:

$$\widehat{f}(x; \theta) = \sum_{\ell=1}^{b} \theta_\ell \varphi_\ell(x), \qquad (1.2)$$

where $\{\varphi_\ell(x)\}_{\ell=1}^{b}$ are fixed, linearly independent functions. This model is linear in the parameter $\theta$, and we often refer to it as the linear model. Popular choices of basis functions include polynomials and trigonometric polynomials. When the input dimensionality is $d = 1$, the polynomial basis functions are given by

$$\{\varphi_\ell(x)\}_{\ell=1}^{b} = \{1, x, x^2, \ldots, x^c\},$$
where $b = c + 1$. The trigonometric polynomial basis functions are given by

$$\{\varphi_\ell(x)\}_{\ell=1}^{b} = \{1, \sin x, \cos x, \sin 2x, \cos 2x, \ldots, \sin cx, \cos cx\},$$

where $b = 2c + 1$.

For multidimensional cases, basis functions are often built by combining one-dimensional basis functions. Popular choices include the additive model and the multiplicative model. The additive model is given by

$$\widehat{f}(x; \theta) = \sum_{k=1}^{d} \sum_{\ell=1}^{c} \theta_{k,\ell} \, \varphi_\ell(x^{(k)}).$$

Thus, a one-dimensional model for each dimension is combined with the others in an additive manner (figure 1.3a). The number of parameters in the additive model is $b = cd$.

The multiplicative model is given by

$$\widehat{f}(x; \theta) = \sum_{\ell_1, \ell_2, \ldots, \ell_d = 1}^{c} \theta_{\ell_1, \ell_2, \ldots, \ell_d} \prod_{k=1}^{d} \varphi_{\ell_k}(x^{(k)}).$$

Thus, a one-dimensional model for each dimension is combined with the others in a multiplicative manner (figure 1.3b). The number of parameters in the multiplicative model is $b = c^d$.

Figure 1.3 Examples of an additive model $f(x) = (x^{(1)})^2 - x^{(2)}$ and of a multiplicative model $f(x) = -x^{(1)} x^{(2)} + x^{(1)} (x^{(2)})^2$.
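The parameter counts $b = cd$ and $b = c^d$ can be checked concretely by building the two design matrices. The following is a small sketch using $c$ polynomial basis functions per dimension (the helper names are our own, illustrative choices):

```python
import itertools
import numpy as np

def poly_basis_1d(x, c):
    """One-dimensional polynomial basis {1, x, ..., x^(c-1)}: c functions."""
    return np.stack([x ** e for e in range(c)], axis=-1)

def additive_features(X, c):
    """Additive model features: concatenate per-dimension bases, b = c * d."""
    return np.concatenate(
        [poly_basis_1d(X[:, k], c) for k in range(X.shape[1])], axis=1
    )

def multiplicative_features(X, c):
    """Multiplicative model features: all products of one basis function
    per dimension, b = c ** d (the curse of dimensionality)."""
    d = X.shape[1]
    phis = [poly_basis_1d(X[:, k], c) for k in range(d)]
    cols = []
    for idx in itertools.product(range(c), repeat=d):
        col = np.ones(X.shape[0])
        for k, e in enumerate(idx):
            col = col * phis[k][:, e]
        cols.append(col)
    return np.stack(cols, axis=1)
```

With $d = 3$ and $c = 4$, the additive design matrix has $12$ columns while the multiplicative one has $64$, illustrating the linear versus exponential growth in $d$.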
In general, the multiplicative model can represent more complex functions than the additive model (see figure 1.3). However, the multiplicative model contains exponentially many parameters with respect to the input dimensionality $d$; such a phenomenon is often referred to as the curse of dimensionality [12]. Thus, the multiplicative model is not tractable in high-dimensional problems. On the other hand, the number of parameters in the additive model increases only linearly with respect to the input dimensionality $d$, which is preferable in high-dimensional cases [71].

1.3.5.3 Kernel Model The number of parameters in the linear-in-parameter model is related to the input dimensionality $d$. Another means for determining the number of parameters is to relate the number of parameters to the number of training samples, $n_{tr}$. The kernel model follows this idea, and is defined by

$$\widehat{f}(x; \theta) = \sum_{\ell=1}^{n_{tr}} \theta_\ell K(x, x_\ell^{tr}),$$

where $K(\cdot, \cdot)$ is a kernel function. The Gaussian kernel would be a typical choice (see figure 1.4):

$$K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2h^2} \right), \qquad (1.3)$$

where $h$ ($> 0$) controls the width of the Gaussian function.

In the kernel model, the number $b$ of parameters is set to $n_{tr}$, which is independent of the input dimensionality $d$. For this reason, the kernel model is often preferred in high-dimensional problems. The kernel model is still linear in parameters, so it is a kind of linear-in-parameter model; indeed, letting

Figure 1.4 Gaussian functions (equation 1.3) centered at the origin with width $h$.
$b = n_{tr}$ and $\varphi_\ell(x) = K(x, x_\ell^{tr})$ in the linear-in-parameter model (equation 1.2) yields the kernel model. Thus, many learning algorithms explained in this book can be applied to both models in the same way.

However, when we discuss convergence properties of the learned function $\widehat{f}(x; \widehat{\theta})$ as the number of training samples is increased to infinity, the kernel model should be treated differently from the linear-in-parameter model, because the number of parameters increases as the number of training samples grows. In such a case, standard asymptotic analysis tools such as the Cramér-Rao paradigm are not applicable. For this reason, statisticians categorize the linear-in-parameter model and the kernel model in different classes: the linear-in-parameter model is categorized as a parametric model, whereas the kernel model is categorized as a nonparametric model. Analysis of the asymptotic behavior of nonparametric models is generally more difficult than that of parametric models, and highly sophisticated mathematical tools are needed (see, e.g., [191,192,69]).

A practical compromise would be to use a fixed number of kernel functions, that is, for fixed $b$,

$$\widehat{f}(x; \theta) = \sum_{\ell=1}^{b} \theta_\ell K(x, c_\ell),$$

where $\{c_\ell\}_{\ell=1}^{b}$ are template points, chosen, for example, randomly from the domain or from the training input points $\{x_i^{tr}\}_{i=1}^{n_{tr}}$ without replacement.

1.3.6 Specification of Models

A model $\widehat{f}(x; \theta)$ is said to be correctly specified if there exists a parameter $\theta^*$ such that

$$\widehat{f}(x; \theta^*) = f(x).$$

Otherwise, the model is said to be misspecified. In practice, the model used for learning would be misspecified to a greater or lesser extent, since we do not generally have strong enough prior knowledge to correctly specify the model. Thus, it is important to consider misspecified models when developing machine learning algorithms.
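The Gaussian kernel model of section 1.3.5.3 is straightforward to implement. A minimal sketch with vectorized pairwise distances (the helper names are our own):

```python
import numpy as np

def gaussian_kernel(X, C, h):
    """Gaussian kernel matrix: K[i, l] = exp(-||x_i - c_l||^2 / (2 h^2)).

    X is (n, d), C is (m, d) (the kernel centers), h > 0 is the width."""
    sq = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(C ** 2, axis=1)[None, :]
        - 2.0 * X @ C.T
    )
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * h ** 2))

def kernel_model(theta, X, centers, h):
    """f_hat(x; theta) = sum_l theta_l K(x, c_l), evaluated at each row of X."""
    return gaussian_kernel(X, centers, h) @ theta
```

Using all training inputs as centers gives the full nonparametric kernel model; passing a fixed subset of $b$ centers gives the practical compromise described above.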
On the other hand, it is meaningless to discuss properties of learning algorithms if the model is totally misspecified; for example, approximating a highly nonlinearly fluctuating function by a straight line does not provide meaningful prediction (figure 1.5). Thus, we effectively consider the situation where the model at hand is not correctly specified but is approximately correct.
Figure 1.5 Approximating a highly nonlinear function $f(x)$ by the linear-in-input model $\widehat{f}(x)$, which is totally misspecified.

This approximate correctness plays an important role when designing model selection algorithms (chapter 3) and active learning algorithms (chapter 8).

1.4 Structure of This Book

This book covers issues related to the covariate shift problems, from fundamental learning algorithms to state-of-the-art applications. Figure 1.6 summarizes the structure of the chapters.

1.4.1 Part II: Learning under Covariate Shift

In part II, topics on learning under covariate shift are covered.

In chapter 2, function learning methods under covariate shift are introduced. Ordinary empirical risk minimization learning is not consistent under covariate shift for misspecified models, and this inconsistency issue can be resolved by considering importance-weighted loss functions. Here, various importance-weighted empirical risk minimization methods are introduced, including least squares and Huber's method for regression, and Fisher discriminant analysis, logistic regression, support vector machines, and boosting for classification. Their adaptive and regularized variants are also introduced. The numerical behavior of these importance-weighted learning methods is illustrated through experiments.

In chapter 3, the problem of model selection is addressed. The success of machine learning techniques depends heavily on the choice of hyperparameters such as basis functions, the kernel bandwidth, the regularization parameter, and the importance-flattening parameter. Thus, model selection is one of the most fundamental and crucial topics in machine learning. Standard model selection schemes such as the Akaike information criterion, cross-validation, and the subspace information criterion have their own theoretical justification
Figure 1.6 Structure of this book (Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation). Part I: Introduction. Part II, Learning under Covariate Shift: chapter 2, Function Approximation; chapter 3, Model Selection; chapter 4, Importance Estimation; chapter 5, Direct Density-Ratio Estimation with Dimensionality Reduction; chapter 6, Relation to Sample Selection Bias; chapter 7, Applications of Covariate Shift Adaptation. Part III, Learning Causing Covariate Shift: chapter 8, Active Learning. Conclusions: chapter 11, Conclusions and Future Prospects.
in terms of their unbiasedness as generalization error estimators. However, such theoretical guarantees are no longer valid under covariate shift. In this chapter, modified variants of these criteria using importance-weighting techniques are introduced, and the modified methods are shown to be properly unbiased even under covariate shift. The usefulness of these modified model selection criteria is illustrated through numerical experiments.

In chapter 4, the problem of importance estimation is addressed. As shown in the preceding chapters, importance-weighting techniques play essential roles in covariate shift adaptation. However, the importance values are usually unknown a priori, so they must be estimated from data samples. In this chapter, importance estimation methods are introduced, including importance estimation via kernel density estimation, the kernel mean matching method, a logistic regression approach, the Kullback-Leibler importance estimation procedure, and the least-squares importance fitting methods. The latter methods allow one to estimate the importance weights without going through density estimation. Since density estimation is known to be difficult, the direct importance estimation approaches would be more accurate and preferable in practice. The numerical behavior of direct importance estimation methods is illustrated through experiments. Characteristics of importance estimation methods are also discussed.

In chapter 5, a dimensionality reduction scheme for density-ratio estimation, called direct density-ratio estimation with dimensionality reduction (D3, pronounced "D-cube"), is introduced. The basic idea of D3 is to find a low-dimensional subspace in which the training and test densities are significantly different, and to estimate the density ratio only in this subspace.
A supervised dimensionality reduction technique called local Fisher discriminant analysis (LFDA) is employed for identifying such a subspace. The usefulness of the D3 approach is illustrated through numerical experiments.

In chapter 6, the covariate shift approach is compared with related formulations called sample selection bias. Studies of correcting sample selection bias were initiated by Heckman [77,76], who received the Nobel Prize in economics for this achievement in 2000. We give a comprehensive review of Heckman's correction model, and discuss its relation to covariate shift adaptation.

In chapter 7, state-of-the-art applications of covariate shift adaptation techniques to various real-world problems are described. This chapter includes non-stationarity adaptation in brain-computer interfaces, speaker identification through change in voice quality, domain adaptation in natural language processing, age prediction from face images under changing illumination conditions, user adaptation in human activity recognition, and efficient sample reuse in autonomous robot control.
1.4.2 Part III: Learning Causing Covariate Shift

In part III, we discuss the situation where covariate shift is intentionally caused by users in order to improve generalization ability.

In chapter 8, the problem of active learning is addressed. The goal of active learning is to find the most "informative" training input points so that learning can be successfully achieved from only a small number of training samples. Active learning is particularly useful when the cost of data sampling is expensive. In the active learning scenario, covariate shift (mismatch of training and test input distributions) occurs naturally, since the training input distribution is designed by users while the test input distribution is determined by the environment. Thus, covariate shift is inevitable in active learning. In this chapter, active learning methods for regression are introduced in light of covariate shift. Their mutual relations and numerical examples are also shown. Furthermore, these active learning methods are extended to pool-based scenarios, where a set of input-only samples is provided in advance and users want to specify good input-only samples for which to gather output values.

In chapter 9, the problem of active learning with model selection is addressed. As explained in the previous chapters, model selection and active learning are two important challenges for successful learning. A natural desire is to perform model selection and active learning at the same time; that is, we want to choose the best model and the best training input points. However, this is actually a chicken-and-egg problem, since training input samples should have been fixed for performing model selection, and models should have been fixed for performing active learning. In this chapter, several compromise approaches, such as the sequential approach, the batch approach, and the ensemble approach, are discussed.
Then, through numerical examples, limitations of the sequential and batch approaches are pointed out, and the usefulness of the ensemble active learning approach is demonstrated.

In chapter 10, applications of active learning techniques to real-world problems are shown. This chapter includes efficient exploration for autonomous robot control and efficient sensor design in semiconductor wafer alignment.
2 Function Approximation

In this chapter, we introduce learning methods that can cope with covariate shift. We employ a parameterized function $\widehat{f}(x; \theta)$ for approximating a target function $f(x)$ from training samples $\{(x_i^{tr}, y_i^{tr})\}_{i=1}^{n_{tr}}$ (see section 1.3.5). A standard method to learn the parameter $\theta$ would be empirical risk minimization (ERM) (e.g., [193,141]):

$$\widehat{\theta}_{\mathrm{ERM}} := \operatorname*{argmin}_{\theta} \left[ \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \mathrm{loss}\left(x_i^{tr}, y_i^{tr}, \widehat{f}(x_i^{tr}; \theta)\right) \right],$$

where $\mathrm{loss}(x, y, \widehat{y})$ is a loss function (see section 1.3.2). If $P_{tr}(x) = P_{te}(x)$, $\widehat{\theta}_{\mathrm{ERM}}$ is known to be consistent.¹ [145] Under covariate shift, where $P_{tr}(x) \neq P_{te}(x)$, however, the situation differs: ERM still gives a consistent estimator if the model is correctly specified, but it is no longer consistent if the model is misspecified [145]:

$$\operatorname*{plim}_{n_{tr} \to \infty} \widehat{\theta}_{\mathrm{ERM}} \neq \theta^*,$$

where "plim" denotes convergence in probability, and $\theta^*$ is the optimal parameter in the model:

$$\theta^* := \operatorname*{argmin}_{\theta} \mathrm{Gen}.$$

1. For correctly specified models, an estimator is said to be consistent if it converges to the true parameter in probability. For misspecified models, we use the term "consistency" for convergence to the optimal parameter in the model (i.e., the optimal approximation to the learning target function in the model under the generalization error Gen).
Gen is the generalization error, defined as

$$\mathrm{Gen} := \mathbb{E}_{x^{te}} \mathbb{E}_{y^{te}} \left[ \mathrm{loss}\left(x^{te}, y^{te}, \widehat{f}(x^{te}; \theta)\right) \right],$$

where $\mathbb{E}_{x^{te}}$ denotes the expectation over $x^{te}$ drawn from the test input distribution $P_{te}(x)$, and $\mathbb{E}_{y^{te}}$ denotes the expectation over $y^{te}$ drawn from the conditional distribution $P(y \mid x = x^{te})$.

This chapter is devoted to introducing various techniques of covariate shift adaptation in function learning.

2.1 Importance-Weighting Techniques for Covariate Shift Adaptation

In this section, we show how the inconsistency of ERM can be overcome.

2.1.1 Importance-Weighted ERM

The failure of the ERM method comes from the fact that the training input distribution is different from the test input distribution. Importance sampling (e.g., [51]) is a standard technique to compensate for the difference of distributions. The following identity shows the essential idea of importance sampling. For a function $g$,

$$\mathbb{E}_{x^{te}}\left[ g(x^{te}) \right] = \mathbb{E}_{x^{tr}}\left[ \frac{p_{te}(x^{tr})}{p_{tr}(x^{tr})} \, g(x^{tr}) \right],$$

where $\mathbb{E}_{x^{tr}}$ and $\mathbb{E}_{x^{te}}$ denote the expectation over $x$ drawn from $P_{tr}(x)$ and $P_{te}(x)$, respectively. The quantity

$$\frac{p_{te}(x)}{p_{tr}(x)}$$

is called the importance. The above identity shows that the expectation of a function $g$ over $x^{te}$ can be computed by the importance-weighted expectation of the function over $x^{tr}$. Thus, the difference of distributions can be systematically adjusted by importance weighting. This is the key equation that plays a central role in covariate shift adaptation throughout the book.
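The importance-sampling identity can be verified numerically by Monte Carlo. In this sketch, $p_{tr}$ and $p_{te}$ are Gaussian densities of our own choosing and $g(x) = x^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Illustrative densities (our own choice): p_tr = N(1, 0.5^2), p_te = N(2, 0.25^2)
def gauss_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def importance(x):
    """w(x) = p_te(x) / p_tr(x)."""
    return gauss_pdf(x, 2.0, 0.25) / gauss_pdf(x, 1.0, 0.5)

g = np.square  # any test function g

x_tr = rng.normal(1.0, 0.5, n)
x_te = rng.normal(2.0, 0.25, n)

lhs = np.mean(g(x_te))                       # E_te[g(x)]
rhs = np.mean(importance(x_tr) * g(x_tr))    # E_tr[w(x) g(x)]
# Both estimates approach E_te[x^2] = mu^2 + sigma^2 = 4.0625 as n grows.
```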
Under covariate shift, importance-weighted ERM (IWERM),

$$\widehat{\theta}_{\mathrm{IWERM}} := \operatorname*{argmin}_{\theta} \left[ \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \frac{p_{te}(x_i^{tr})}{p_{tr}(x_i^{tr})} \, \mathrm{loss}\left(x_i^{tr}, y_i^{tr}, \widehat{f}(x_i^{tr}; \theta)\right) \right],$$

is shown to be consistent even for misspecified models [145]; that is, it satisfies

$$\operatorname*{plim}_{n_{tr} \to \infty} \widehat{\theta}_{\mathrm{IWERM}} = \theta^*.$$

2.1.2 Adaptive IWERM

As shown above, IWERM gives a consistent estimator. However, it can also produce an unstable estimator, and therefore IWERM may not be the best possible method for finite samples [145]. In practice, a slightly stabilized variant of IWERM would be preferable, that is, one achieved by slightly "flattening" the importance weight in IWERM. We call this variant adaptive IWERM (AIWERM):

$$\widehat{\theta}_\gamma := \operatorname*{argmin}_{\theta} \left[ \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \left( \frac{p_{te}(x_i^{tr})}{p_{tr}(x_i^{tr})} \right)^{\gamma} \mathrm{loss}\left(x_i^{tr}, y_i^{tr}, \widehat{f}(x_i^{tr}; \theta)\right) \right], \qquad (2.1)$$

where $\gamma$ ($0 \le \gamma \le 1$) is called the flattening parameter.

The flattening parameter controls the stability and consistency of the estimator; $\gamma = 0$ corresponds to ordinary ERM (the uniform weight, which yields a stable but inconsistent estimator), and $\gamma = 1$ corresponds to IWERM (the importance weight, which yields a consistent but unstable estimator). An intermediate value of $\gamma$ would provide the optimal control of the trade-off between stability and consistency (which is also known as the bias-variance trade-off).

A good choice of $\gamma$ would roughly depend on the number $n_{tr}$ of training samples. When $n_{tr}$ is large, bias usually dominates variance, and thus a smaller-bias estimator obtained by large $\gamma$ is preferable. On the other hand, when $n_{tr}$ is small, variance generally dominates bias, and hence a smaller-variance estimator obtained by small $\gamma$ is appropriate. However, a good choice of $\gamma$ may also depend on many unknown factors, such as the learning target function and the noise level. Thus, the flattening parameter $\gamma$ should be determined carefully by a reliable model selection method (see chapter 3).
2.1.3 Regularized IWERM Instead of flattening the importance weight, we may add a regularizer to the empirical risk term. We call this regularized IWERM:
$$\widehat{\theta}_\lambda := \operatorname*{argmin}_{\theta} \left[ \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \frac{p_{te}(x_i^{tr})}{p_{tr}(x_i^{tr})} \, \mathrm{loss}\left(x_i^{tr}, y_i^{tr}, \widehat{f}(x_i^{tr}; \theta)\right) + \lambda R(\theta) \right], \qquad (2.2)$$

where $R(\theta)$ is a regularization function, and $\lambda$ ($\ge 0$) is the regularization parameter that controls the strength of regularization.

For some $C(\lambda)$ ($\ge 0$), the solution of equation 2.2 can also be obtained by solving the following constrained optimization problem:

$$\widehat{\theta}_\lambda = \operatorname*{argmin}_{\theta} \left[ \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \frac{p_{te}(x_i^{tr})}{p_{tr}(x_i^{tr})} \, \mathrm{loss}\left(x_i^{tr}, y_i^{tr}, \widehat{f}(x_i^{tr}; \theta)\right) \right] \quad \text{subject to } R(\theta) \le C(\lambda).$$

A typical choice of the regularization function $R(\theta)$ is the squared $\ell_2$-norm (see figure 2.1):

$$R(\theta) = \sum_{\ell=1}^{b} \theta_\ell^2.$$

This is differentiable and convex, so it is often convenient for devising computationally efficient optimization algorithms. The feasible region (where the constraint $R(\theta) \le C(\lambda)$ is satisfied) is illustrated in figure 2.2a.

Another useful choice is the $\ell_1$-norm (see figure 2.1):

$$R(\theta) = \sum_{\ell=1}^{b} |\theta_\ell|.$$

Figure 2.1 Regularization functions: squared regularizer ($\theta^2$) and absolute regularizer ($|\theta|$).
Figure 2.2 Feasible regions by regularization. (a) Squared regularization. (b) Absolute regularization.

This is not differentiable, but it is still convex. It is known that the absolute regularizer induces a sparse solution; that is, the parameters $\{\widehat{\theta}_\ell\}_{\ell=1}^{b}$ tend to become zero [200,183,31]. When the solution $\widehat{\theta}$ is sparse, output values $\widehat{f}(x; \widehat{\theta})$ may be computed efficiently, which is a useful property if the number of parameters is very large. Furthermore, when the linear-in-input model (see section 1.3.5)

$$\widehat{f}(x; \theta) = \theta_1 x^{(1)} + \theta_2 x^{(2)} + \cdots + \theta_d x^{(d)} + \theta_{d+1} \qquad (2.3)$$

is used, making a solution sparse corresponds to choosing a subset of the input variables $\{x^{(k)}\}_{k=1}^{d}$ which are responsible for predicting the output values $y$. This is highly useful when each input variable has some meaning (type of experiment, etc.), and we want to interpret the "reason" for the prediction. Such a technique can be applied in bioinformatics, natural language processing, and computational chemistry. The feasible region of the absolute regularizer is illustrated in figure 2.2b.

The reason why the solution becomes sparse under absolute regularization may be intuitively understood from figure 2.3, where the squared loss is adopted. The feasible region of the absolute regularizer has "corners" on the axes, and thus the solution tends to be on one of the corners, which is sparse. On the other hand, such sparseness is not obtained when the squared regularizer is used.

2.2 Examples of Importance-Weighted Regression Methods

The above importance-weighting idea is very general, and can be applied to various learning algorithms. In this section, we provide examples of regression methods, including least squares and robust regression. Classification methods will be covered in section 2.3.
Figure 2.3 Sparseness brought by absolute regularization. (a) Squared regularization. (b) Absolute regularization.

Figure 2.4 Loss functions for regression: squared loss, absolute loss, Huber loss, and deadzone-linear loss.

2.2.1 Squared Loss: Least-Squares Regression

Least squares (LS) is one of the most fundamental regression techniques in statistics and machine learning. The adaptive importance-weighting method for the squared loss, called adaptive importance-weighted LS (AIWLS), is given as follows (see figure 2.4):

$$\widehat{\theta}_\gamma := \operatorname*{argmin}_{\theta} \left[ \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \left( \frac{p_{te}(x_i^{tr})}{p_{tr}(x_i^{tr})} \right)^{\gamma} \left( \widehat{f}(x_i^{tr}; \theta) - y_i^{tr} \right)^2 \right], \qquad (2.4)$$
where $0 \le \gamma \le 1$. Let us employ the linear-in-parameter model (see section 1.3.5.2) for learning:

$$\widehat{f}(x; \theta) = \sum_{\ell=1}^{b} \theta_\ell \varphi_\ell(x), \qquad (2.5)$$

where $\{\varphi_\ell(x)\}_{\ell=1}^{b}$ are fixed, linearly independent functions. Then the above minimizer $\widehat{\theta}_\gamma$ is given analytically, as follows. Let $X^{tr}$ be the design matrix, that is, $X^{tr}$ is the $n_{tr} \times b$ matrix with the $(i, \ell)$-th element

$$X^{tr}_{i,\ell} = \varphi_\ell(x_i^{tr}).$$

Then equation 2.4 is expressed in matrix form as

$$\widehat{\theta}_\gamma = \operatorname*{argmin}_{\theta} \left[ \frac{1}{n_{tr}} \left( X^{tr} \theta - y^{tr} \right)^\top W_\gamma \left( X^{tr} \theta - y^{tr} \right) \right],$$

where $W_\gamma$ is the diagonal matrix with the $i$-th diagonal element

$$[W_\gamma]_{i,i} = \left( \frac{p_{te}(x_i^{tr})}{p_{tr}(x_i^{tr})} \right)^{\gamma},$$

and

$$y^{tr} = (y_1^{tr}, y_2^{tr}, \ldots, y_{n_{tr}}^{tr})^\top.$$

Taking the derivative with respect to $\theta$ and equating it to zero yields

$$X^{tr\top} W_\gamma X^{tr} \theta = X^{tr\top} W_\gamma y^{tr}.$$

Let $L_\gamma$ be the learning matrix given by

$$L_\gamma = \left( X^{tr\top} W_\gamma X^{tr} \right)^{-1} X^{tr\top} W_\gamma,$$
where we assume that the inverse of $X^{tr\top} W_\gamma X^{tr}$ exists. Then $\widehat{\theta}_\gamma$ is given by

$$\widehat{\theta}_\gamma = L_\gamma y^{tr}. \qquad (2.6)$$

A MATLAB® implementation of adaptive importance-weighted LS is available from http://sugiyama-www.cs.titech.ac.jp/~sugi/software/IWLS/.

The above analytic solution is easy to implement and useful for theoretical analysis. However, when the number of parameters is very large, computing the solution by means of equation 2.6 may not be tractable, since the matrix $X^{tr\top} W_\gamma X^{tr}$ that we need to invert is very high-dimensional.²

Another way to obtain the solution is gradient descent: the parameter $\theta$ is updated so that the squared-error term is reduced, and this procedure is repeated until convergence (see figure 2.5):

$$\theta_\ell \longleftarrow \theta_\ell - \varepsilon \sum_{i=1}^{n_{tr}} \left( \frac{p_{te}(x_i^{tr})}{p_{tr}(x_i^{tr})} \right)^{\gamma} \left( \sum_{\ell'=1}^{b} \theta_{\ell'} \varphi_{\ell'}(x_i^{tr}) - y_i^{tr} \right) \varphi_\ell(x_i^{tr}) \quad \text{for all } \ell,$$

where $\varepsilon$ ($> 0$) is a learning-rate parameter, and the rest of the second term corresponds to the gradient of the objective function (equation 2.4). If $\varepsilon$ is large, the solution goes down the slope very fast (see figure 2.5a); however, it can overshoot the bottom of the objective function and fluctuate around the

Figure 2.5 Schematic illustration of gradient descent. (a) When $\varepsilon$ is large. (b) When $\varepsilon$ is small.

2. In practice, we may solve the following linear equation,

$$X^{tr\top} W_\gamma X^{tr} \widehat{\theta}_\gamma = X^{tr\top} W_\gamma y^{tr},$$

for computing the solution. This would be slightly more computationally efficient than computing the solution by means of equation 2.6. However, solving this equation would still be intractable when the number of parameters is very large.
bottom. On the other hand, if $\varepsilon$ is small, the speed of going down the slope is slow, but the solution converges stably to the bottom (see figure 2.5b). A suitable scheme would be to start from a large $\varepsilon$ to go down the slope quickly, and then gradually decrease the value of $\varepsilon$ so that the solution properly converges to the bottom of the objective function. However, determining an appropriate schedule for $\varepsilon$ is highly problem-dependent, and it is not easy to choose $\varepsilon$ appropriately in practice.

Note that in general, the gradient method is guaranteed only to find one of the local optima, whereas in the case of LS for linear-in-parameter models (equation 2.5), we can always find the globally optimal solution thanks to the convexity of the objective function (see, e.g., [27]).

When the number $n_{tr}$ of training samples is very large, computing the gradient as above is rather time-consuming. In such a case, the following stochastic gradient method [7] is computationally more efficient: for a randomly chosen sample index $i \in \{1, 2, \ldots, n_{tr}\}$ in each iteration, repeat the following single-sample update until convergence:

$$\theta_\ell \longleftarrow \theta_\ell - \varepsilon \left( \frac{p_{te}(x_i^{tr})}{p_{tr}(x_i^{tr})} \right)^{\gamma} \left( \sum_{\ell'=1}^{b} \theta_{\ell'} \varphi_{\ell'}(x_i^{tr}) - y_i^{tr} \right) \varphi_\ell(x_i^{tr}) \quad \text{for all } \ell.$$

Convergence of the stochastic gradient method is guaranteed in the probabilistic sense.

The AIWLS method can easily be extended to kernel models (see section 1.3.5.3) by letting $b = n_{tr}$ and $\varphi_\ell(x) = K(x, x_\ell^{tr})$, where $K(\cdot, \cdot)$ is a kernel function:

$$\widehat{f}(x; \theta) = \sum_{\ell=1}^{n_{tr}} \theta_\ell K(x, x_\ell^{tr}).$$

The Gaussian kernel is a popular choice:

$$K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2h^2} \right),$$

where $h > 0$ controls the width of the Gaussian function. In this case, the design matrix $X^{tr}$ becomes the kernel Gram matrix $K^{tr}$, that is, $K^{tr}$ is the $n_{tr} \times n_{tr}$ matrix with the $(i, \ell)$-th element

$$K^{tr}_{i,\ell} = K(x_i^{tr}, x_\ell^{tr}).$$
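Equation 2.6 amounts to a single weighted least-squares solve, and the kernelized variant is obtained by simply passing the Gram matrix as the design matrix. A minimal sketch, assuming the importance weights $w_i = p_{te}(x_i^{tr})/p_{tr}(x_i^{tr})$ are given (in practice they must be estimated; see chapter 4):

```python
import numpy as np

def aiwls(design, y_tr, w, gamma):
    """Adaptive importance-weighted LS (equation 2.6).

    Solves X^T W_gamma X theta = X^T W_gamma y rather than forming the
    inverse explicitly, as footnote 2 suggests. `design` is the n_tr x b
    design matrix X^tr; for the kernel model, pass the Gram matrix K^tr."""
    wg = w ** gamma                        # diagonal of W_gamma
    A = design.T @ (wg[:, None] * design)  # X^T W_gamma X
    b = design.T @ (wg * y_tr)             # X^T W_gamma y
    return np.linalg.solve(A, b)
```

With $\gamma = 0$ this reduces to ordinary LS and with $\gamma = 1$ to plain importance-weighted LS; passing the symmetric Gram matrix $K^{tr}$ recovers the kernel learning matrix $L_\gamma = (K^{tr} W_\gamma K^{tr})^{-1} K^{tr} W_\gamma$ described next.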
Then the learned parameter $\widehat{\theta}_\gamma$ can be obtained by means of equation 2.6, with the learning matrix $L_\gamma$ given as

$$L_\gamma = \left( K^{tr} W_\gamma K^{tr} \right)^{-1} K^{tr} W_\gamma,$$

using the fact that the kernel matrix $K^{tr}$ is symmetric:

$$K^{tr\top} = K^{tr}.$$

The (stochastic) gradient descent method is similarly available by replacing $b$ and $\varphi_\ell(x)$ with $n_{tr}$ and $K(x, x_\ell^{tr})$, respectively, and it is still guaranteed to converge to the globally optimal solution.

2.2.2 Absolute Loss: Least-Absolute Regression

The LS method often suffers from excessive sensitivity to outliers (i.e., irregular values) and less reliability. Here, we introduce an alternative approach to LS based on the least-absolute (LA) method, which we refer to as adaptive importance-weighted least-absolute regression (AIWLAR); instead of the squared loss, the absolute loss is used (see figure 2.4):

$$\widehat{\theta}_\gamma := \operatorname*{argmin}_{\theta} \left[ \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \left( \frac{p_{te}(x_i^{tr})}{p_{tr}(x_i^{tr})} \right)^{\gamma} \left| \widehat{f}(x_i^{tr}; \theta) - y_i^{tr} \right| \right]. \qquad (2.7)$$

The LS regression method actually estimates the conditional mean of the output $y$ given the input $x$. This may be intuitively understood from the fact that minimization under the squared loss amounts to obtaining the mean of samples $\{z_i\}_{i=1}^{n}$:

$$\frac{1}{n} \sum_{i=1}^{n} z_i = \operatorname*{argmin}_{z} \sum_{i=1}^{n} (z - z_i)^2.$$

If one of the values in the set $\{z_i\}_{i=1}^{n}$ is extremely large or small due to, for instance, some measurement error, the mean will be strongly affected by that outlier sample. Thus, all the values $\{z_i\}_{i=1}^{n}$ are responsible for the mean, and therefore even a single outlier observation can significantly damage the learned function.

On the other hand, the LA regression method actually estimates the conditional median of the output $y$ given the input $x$. Indeed, minimization under the
absolute loss amounts to obtaining the median:

  argmin_z Σ_{i=1}^{2n+1} |z_i − z| = z_{n+1},

where z_1 ≤ z_2 ≤ ··· ≤ z_{2n+1}. The median is not influenced by the magnitude of the values {z_i}_{i≠n+1}, but only by their order. Thus, as long as the order is kept unchanged, the median is not affected by outliers; in fact, the median is known to be the most robust estimator in light of breakdown-point analysis [83, 138].

The minimization problem (equation 2.7) looks cumbersome due to the absolute value operator, which is non-differentiable. However, the following mathematical trick mitigates this issue [27]:

  |x| = min_b b  subject to  −b ≤ x ≤ b.

Then the minimization problem (equation 2.7) is reduced to the following optimization problem:

  min_{θ, {b_i}_{i=1}^{n_tr}}  Σ_{i=1}^{n_tr} ( p_te(x_i^tr) / p_tr(x_i^tr) )^γ b_i
  subject to  −b_i ≤ f̂(x_i^tr; θ) − y_i^tr ≤ b_i,  ∀i.

If the linear-in-parameter model (equation 2.5) is used for learning, the above optimization problem is reduced to a linear program [27], which can be solved efficiently with standard optimization software.

The number of constraints is n_tr in the above linear program. When n_tr is large, we may employ sophisticated optimization techniques such as column generation, which considers increasing sets of active constraints [37], for efficiently solving the linear programming problem. Alternatively, an approximate solution can be obtained by gradient descent or (quasi-)Newton methods if the absolute loss is approximated by a smooth loss (see section 2.2.3).

2.2.3 Huber Loss: Huber Regression

The LA regression is useful for suppressing the influence of outliers. However, when the training output noise is Gaussian, the LA method is not statistically efficient; that is, it tends to have a large variance when there are no outliers. A popular alternative is the Huber loss [83], which bridges the LS and LA methods. The adaptive importance-weighting method for the Huber loss, called
adaptive importance-weighted Huber regression (AIWHR), is as follows:

  θ̂_γ = argmin_θ [ (1/n_tr) Σ_{i=1}^{n_tr} ( p_te(x_i^tr) / p_tr(x_i^tr) )^γ ρ_τ( f̂(x_i^tr; θ) − y_i^tr ) ],

where τ (≥ 0) is the robustness parameter and ρ_τ is the Huber loss, defined as follows (see figure 2.4):

  ρ_τ(y) = y²/2           if |y| ≤ τ,
           τ|y| − τ²/2    if |y| > τ.

Thus, the squared loss is applied to "good" samples with small fitting error, and the absolute loss is applied to "bad" samples with large fitting error. Note that the Huber loss is a convex function, and therefore the unique global solution exists.

The Huber loss function is rather intricate, but for the linear-in-parameter model (equation 2.5) the solution can be obtained by solving a convex quadratic programming problem [109] over θ and auxiliary variables {u_i, v_i}_{i=1}^{n_tr}, with constraints of the form

  −v_i ≤ Σ_{ℓ=1}^b θ_ℓ φ_ℓ(x_i^tr) − y_i^tr − u_i ≤ v_i  for all i.

Another way to obtain the solution is gradient descent (notice that the Huber loss is once differentiable):

  θ_ℓ ← θ_ℓ − ε Σ_{i=1}^{n_tr} ( p_te(x_i^tr) / p_tr(x_i^tr) )^γ Δρ_τ( f̂(x_i^tr; θ) − y_i^tr ) φ_ℓ(x_i^tr)  for all ℓ,

where ε (> 0) is a learning rate parameter, and Δρ_τ is the derivative of ρ_τ, given by

  Δρ_τ(y) =  y    if |y| ≤ τ,
             τ    if y > τ,
            −τ    if y < −τ.
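As a concrete illustration, the batch gradient update above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the importance weights are assumed to be precomputed and flattened (w_i = (p_te(x_i^tr)/p_tr(x_i^tr))^γ), and the step size, iteration count, and toy data are arbitrary illustrative choices.

```python
import numpy as np

def aiwhr_gd(Phi, y, w, tau, eta=0.5, n_iter=5000):
    """Batch gradient descent for importance-weighted Huber regression.

    Phi : (n, b) design matrix of basis functions at the training inputs
    w   : (n,) flattened importance weights (p_te/p_tr)**gamma, assumed given
    """
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        r = Phi @ theta - y                 # residuals f(x_i; theta) - y_i
        # np.clip implements the Huber derivative: r, tau, or -tau
        theta -= eta * Phi.T @ (w * np.clip(r, -tau, tau)) / len(y)
    return theta

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 2 * x + 1 + 0.1 * rng.standard_normal(100)
y[0] = 50.0                                 # a single gross outlier
Phi = np.c_[x, np.ones_like(x)]             # straight-line model
theta = aiwhr_gd(Phi, y, w=np.ones_like(y), tau=0.5)
```

Because the outlier's residual gradient is clipped at τ, the fitted line stays close to the clean slope 2 and intercept 1, whereas plain LS would be pulled far upward.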
Its stochastic version is

  θ_ℓ ← θ_ℓ − ε ( p_te(x_i^tr) / p_tr(x_i^tr) )^γ Δρ_τ( f̂(x_i^tr; θ) − y_i^tr ) φ_ℓ(x_i^tr)  for all ℓ,

where the sample index i ∈ {1, 2, ..., n_tr} is randomly chosen in each iteration, and the above gradient descent process is repeated until convergence.

2.2.4 Deadzone-Linear Loss: Support Vector Regression

Another variant of the absolute loss is the deadzone-linear loss (see figure 2.4):

  θ̂_γ = argmin_θ [ (1/n_tr) Σ_{i=1}^{n_tr} ( p_te(x_i^tr) / p_tr(x_i^tr) )^γ | f̂(x_i^tr; θ) − y_i^tr |_ε ],

where |·|_ε is the deadzone-linear loss defined by

  |x|_ε := 0         if |x| ≤ ε,
           |x| − ε   if |x| > ε.

That is, if the magnitude of the residual |f̂(x_i^tr; θ) − y_i^tr| is less than ε, no error is assessed. This loss is also called the ε-insensitive loss and is used in support vector regression [193]. We refer to this method as adaptive importance-weighted support vector regression (AIWSVR).

When ε = 0, the deadzone-linear loss is reduced to the absolute loss (see section 2.2.2); thus the deadzone-linear loss and the absolute loss are related to one another. However, the effect of the deadzone-linear loss is quite different from that of the absolute loss when ε > 0: the influence of "good" samples (with small residual) is deemphasized in the deadzone-linear loss, while the absolute loss tends to suppress the influence of "bad" samples (with large residual) compared with the squared loss.

The solution θ̂_γ can be obtained by solving the following optimization problem [27]:

  min_{θ, {b_i}_{i=1}^{n_tr}}  Σ_{i=1}^{n_tr} ( p_te(x_i^tr) / p_tr(x_i^tr) )^γ b_i
  subject to  −b_i − ε ≤ f̂(x_i^tr; θ) − y_i^tr ≤ b_i + ε,  b_i ≥ 0,  ∀i.
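The optimization problem above maps directly onto a standard linear-programming solver. The sketch below is an illustration, not the book's code: it encodes the deadzone-linear problem with SciPy's `linprog`, stacking the parameter vector θ and the slack variables b_i. With ε = 0 it reduces to the least-absolute regression of section 2.2.2, so fitting a constant model recovers the sample median.

```python
import numpy as np
from scipy.optimize import linprog

def aiwsvr_lp(Phi, y, w, eps=0.0):
    """Deadzone-linear (epsilon-insensitive) regression as a linear program.

    Variables are [theta (b entries), b_1..b_n]; minimize sum_i w_i * b_i
    subject to |Phi @ theta - y|_i <= b_i + eps and b_i >= 0.
    w: importance weights (p_te/p_tr)**gamma, assumed precomputed.
    """
    n, b = Phi.shape
    c = np.r_[np.zeros(b), w]                      # objective: sum_i w_i b_i
    I = np.eye(n)
    A_ub = np.r_[np.c_[Phi, -I], np.c_[-Phi, -I]]  # +/- residual <= b_i + eps
    b_ub = np.r_[y + eps, -y + eps]
    bounds = [(None, None)] * b + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:b]

# constant model, eps=0: the LA fit of a constant is the sample median
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
Phi = np.ones((5, 1))
theta = aiwsvr_lp(Phi, y, w=np.ones(5), eps=0.0)
```

The recovered constant is 3, the median of {1, 2, 3, 4, 100}, untouched by the outlier at 100.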
If the linear-in-parameter model (equation 2.5) is used for learning, the above optimization problem is reduced to a linear program [27], which can be solved efficiently with standard optimization software.

Support vector regression was shown to be equivalent to minimizing the conditional value-at-risk (CVaR) of the absolute residuals [180]. The CVaR corresponds to the mean of the error for a set of "bad" samples (see figure 2.6), and is a popular risk measure in finance [137]. More specifically, let us consider the cumulative distribution of the absolute residuals |f̂(x_i^tr; θ) − y_i^tr| over all training samples {(x_i^tr, y_i^tr)}_{i=1}^{n_tr}:

  Φ(α | θ) := Prob_tr( |f̂(x_i^tr; θ) − y_i^tr| ≤ α ),

where Prob_tr denotes the probability over training samples {(x_i^tr, y_i^tr)}_{i=1}^{n_tr}. For β ∈ [0, 1), let α_β(θ) be the 100β-percentile of the distribution of absolute residuals:

  α_β(θ) := argmin_α α  subject to  Φ(α | θ) ≥ β.

Thus, only the fraction (1 − β) of the absolute residuals |f̂(x_i^tr; θ) − y_i^tr| exceeds the threshold α_β(θ), which is referred to as the value-at-risk (VaR). Let us consider the β-tail distribution of the absolute residuals:

  Φ_β(α | θ) = 0                          if α < α_β(θ),
               ( Φ(α | θ) − β ) / (1 − β)  if α ≥ α_β(θ).

Figure 2.6 The conditional value-at-risk (CVaR).
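Empirically, the distribution Φ, the VaR α_β, and the mean of the β-tail distribution can be computed directly from sorted residuals. The following sketch assumes each residual carries probability mass 1/n and that ties at the VaR threshold are negligible:

```python
import numpy as np

def var_cvar(abs_residuals, beta):
    """Empirical value-at-risk and beta-tail mean (CVaR) of |f(x) - y|.

    A sketch: the empirical distribution puts mass 1/n on each residual;
    ties exactly at the VaR threshold are not handled specially.
    """
    r = np.sort(np.asarray(abs_residuals, dtype=float))
    n = len(r)
    k = max(int(np.ceil(beta * n)), 1)   # number of samples at or below VaR
    var = r[k - 1]                       # alpha_beta: the 100*beta percentile
    # mean of the beta-tail distribution Phi_beta
    tail_mass = (k / n - beta) * var + r[k:].sum() / n
    cvar = tail_mass / (1.0 - beta)
    return var, cvar

r = np.arange(1.0, 11.0)                 # absolute residuals 1, 2, ..., 10
var, cvar = var_cvar(r, beta=0.8)        # VaR = 8; tail mean = (9+10)/2 = 9.5
```

With β = 0 the same function returns the plain mean absolute residual, matching the limiting behavior described in the text.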
2.3. Examples of Importance-Weighted Classification Methods

Let φ_β(θ) be the mean of the β-tail distribution of the absolute residuals:

  φ_β(θ) := E_{Φ_β}[ α ],

where E_{Φ_β} denotes the expectation over the distribution Φ_β. φ_β(θ) is called the CVaR. By definition, the CVaR of the absolute residuals is reduced to the mean absolute residual if β = 0, and it converges to the worst absolute residual as β tends to 1. Thus, the CVaR smoothly bridges the LA approach and the Chebyshev approximation (a.k.a. minimax) method. The CVaR is also referred to as the expected shortfall.

2.3 Examples of Importance-Weighted Classification Methods

In this section, we provide examples of importance-weighted classification methods, including Fisher discriminant analysis [50, 57], logistic regression [74], support vector machines [26, 193, 141], and boosting [52, 28, 53]. For simplicity, we focus on the binary classification case where Y = {+1, −1}. Let n_tr^+ and n_tr^− be the numbers of training samples in classes +1 and −1, respectively.

In the classification setup, the following 0/1-loss is typically used as the error metric, since it corresponds to the misclassification rate:

  loss(x, y, ŷ) = 0 if sgn(ŷ) = sgn(y),
                  1 otherwise,

where y is the true output value at an input point x, ŷ is an estimate of y, and sgn(y) denotes the sign of y:

  sgn(y) := +1 if y > 0,
             0 if y = 0,
            −1 if y < 0.

This means that, in classification scenarios, the sign of ŷ is important, and the magnitude of ŷ does not affect the misclassification error. The above 0/1-loss can be equivalently expressed as

  loss(x, y, ŷ) = 0 if sgn(yŷ) = 1,
                  1 otherwise.
Figure 2.7 Loss functions for classification (squared loss, logistic loss, hinge loss, exponential loss, and 0/1-loss), plotted as functions of the margin yŷ. y is the true output value at an input point x, and ŷ is an estimate of y.

For this reason, the loss function in classification is often expressed in terms of yŷ, which is called the margin. The profile of the 0/1-loss is illustrated as a function of yŷ in figure 2.7.

Minimizing the 0/1-loss is the ultimate goal of classification. However, since the 0/1-loss is not a convex function, optimization under it is hard. To cope with this problem, alternative convex loss functions have been proposed for classification scenarios (figure 2.7). In this section, we review classification methods with such convex loss functions.

2.3.1 Squared Loss: Fisher Discriminant Analysis

Fisher discriminant analysis (FDA) is one of the classical classification methods [50]. In FDA, the input samples are first projected onto a one-dimensional subspace (i.e., a line), and then the projected samples are linearly separated into two classes by thresholding; for a multiclass extension, see, for example, [57].

Let μ, μ^+, and μ^− be the means of {x_i^tr}_{i=1}^{n_tr}, {x_i^tr | y_i^tr = +1}, and {x_i^tr | y_i^tr = −1}, respectively:

  μ := (1/n_tr) Σ_{i=1}^{n_tr} x_i^tr,

  μ^+ := (1/n_tr^+) Σ_{i: y_i^tr = +1} x_i^tr,

  μ^− := (1/n_tr^−) Σ_{i: y_i^tr = −1} x_i^tr,
where Σ_{i: y_i^tr = +1} denotes the summation over indices i such that y_i^tr = +1. Let S^b and S^w be the between-class scatter matrix and the within-class scatter matrix, respectively, defined as

  S^b := n_tr^+ (μ^+ − μ)(μ^+ − μ)^T + n_tr^− (μ^− − μ)(μ^− − μ)^T,

  S^w := Σ_{i: y_i^tr = +1} (x_i^tr − μ^+)(x_i^tr − μ^+)^T + Σ_{i: y_i^tr = −1} (x_i^tr − μ^−)(x_i^tr − μ^−)^T.

The FDA projection direction φ_FDA ∈ R^d is defined as

  φ_FDA := argmax_φ ( φ^T S^b φ ) / ( φ^T S^w φ ).

That is, FDA seeks a projection direction φ with large between-class scatter and small within-class scatter after projection.

The above ratio is called the Rayleigh quotient. A strong advantage of the Rayleigh quotient formulation is that the globally optimal solution can be computed analytically even though the objective function is not convex. Indeed, the FDA projection direction φ_FDA is given by φ_max, the generalized eigenvector associated with the largest generalized eigenvalue of the generalized eigenvalue problem

  S^b φ = λ S^w φ.

Thanks to the analytic solution, the FDA projection direction can be computed efficiently. Finally, the projected samples are classified by thresholding.

The FDA solution can also be obtained in the LS regression framework where Y = R. Suppose the training output values {y_i^tr}_{i=1}^{n_tr} are

  y_i^tr ∝  1/n_tr^+   if x_i^tr belongs to class +1,
           −1/n_tr^−   if x_i^tr belongs to class −1.

We use the linear-in-input model (equation 2.3) for learning, and the classification result ŷ^te of a test sample x^te is obtained by the sign of the output of the
learned function:

  ŷ^te = sgn( f̂(x^te; θ̂) ).

In this setting, if the parameter θ is learned by the LS method, this classification method is essentially equivalent to FDA [42, 21].

Under covariate shift, the adaptive importance-weighting idea can be employed in FDA, which we call adaptive importance-weighted FDA (AIWFDA):

  θ̂_γ = argmin_θ [ (1/n_tr) Σ_{i=1}^{n_tr} ( p_te(x_i^tr) / p_tr(x_i^tr) )^γ ( f̂(x_i^tr; θ) − y_i^tr )² ].

The solution is given analytically in exactly the same way as in the regression case (see section 2.2.1).

As explained at the beginning of this section, the margin y_i^tr f̂(x_i^tr; θ) plays an important role in classification. If y_i^tr = ±1, the squared error used above can be expressed in terms of the margin as

  ( f̂(x_i^tr; θ) − y_i^tr )² = (y_i^tr)² ( f̂(x_i^tr; θ)/y_i^tr − 1 )² = ( 1 − y_i^tr f̂(x_i^tr; θ) )²,

where we used the facts that (y_i^tr)² = 1 and 1/y_i^tr = y_i^tr. This expression of the squared loss is illustrated in figure 2.7, showing that the above squared loss is a convex upper bound of the 0/1-loss. Therefore, minimizing the error under the squared loss corresponds to minimizing an upper bound of the 0/1-loss error, although the bound is rather loose.

2.3.2 Logistic Loss: Logistic Regression Classifier

Logistic regression (LR), which sounds like a regression method, is in fact a classifier that gives a confidence value (the class-posterior probability) for the classification results [74]. The LR classifier employs a parametric model of the following form for expressing the class-posterior probability p(y|x):

  p̂(y|x) = 1 / ( 1 + exp( −y f̂(x; θ) ) ).

The parameter θ is usually learned by maximum likelihood estimation (MLE). Since the negative log-likelihood can be regarded as the empirical error, the
adaptive importance-weighting idea can be employed in LR, which we call adaptive importance-weighted LR (AIWLR):

  θ̂_γ = argmin_θ [ Σ_{i=1}^{n_tr} ( p_te(x_i^tr) / p_tr(x_i^tr) )^γ log( 1 + exp( −y_i^tr f̂(x_i^tr; θ) ) ) ].

The profile of the above loss function, which is called the logistic loss, is illustrated in figure 2.7 as a function of the margin y f̂(x).

Since the above objective function is convex when a linear-in-parameter model (equation 2.5) is used, the globally optimal solution can be obtained by standard nonlinear optimization methods such as the gradient descent method, the conjugate gradient method, Newton's method, and quasi-Newton methods [117]. The gradient descent update rule is given by

  θ_ℓ ← θ_ℓ + ε Σ_{i=1}^{n_tr} ( p_te(x_i^tr) / p_tr(x_i^tr) )^γ y_i^tr φ_ℓ(x_i^tr) exp( −y_i^tr f̂(x_i^tr; θ) ) / ( 1 + exp( −y_i^tr f̂(x_i^tr; θ) ) )  for all ℓ.

A C-language implementation of importance-weighted kernel logistic regression is available from the Web page http://sugiyama-www.cs.titech.ac.jp/~yamada/iwklr.html.

2.3.3 Hinge Loss: Support Vector Machine

The support vector machine (SVM) [26, 193, 141] is a popular classification technique that finds a separating hyperplane with maximum margin. Although the original SVM was derived within the framework of the Vapnik-Chervonenkis theory and the maximum margin principle, the SVM learning criterion can be equivalently expressed in the following form:³

  θ̂ = argmin_θ Σ_{i=1}^{n_tr} max( 0, 1 − y_i^tr f̂(x_i^tr; θ) ).

This implies that SVM is actually similar to FDA and LR; only the loss function is different. The profile of the above loss function, which is called the hinge loss, is illustrated in figure 2.7. As shown in the graph, the hinge loss is a convex upper bound of the 0/1-loss and is sharper than the squared loss.

3. For simplicity, we have omitted the regularization term.
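The AIWLR gradient update above can be sketched as follows. This is an illustrative Python implementation, not the C code linked above: the importance weights are assumed given, and the step size, iteration count, and toy one-dimensional Gaussian classes are arbitrary choices.

```python
import numpy as np

def aiwlr_gd(Phi, y, w, eta=0.1, n_iter=3000):
    """Gradient descent for importance-weighted logistic regression.

    Minimizes sum_i w_i * log(1 + exp(-y_i * f(x_i; theta))) for a
    linear-in-parameter model f(x; theta) = Phi @ theta.
    w: flattened importance weights, assumed precomputed.
    """
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        m = y * (Phi @ theta)                       # margins y_i * f(x_i)
        # gradient of the logistic loss: -y * phi / (1 + exp(m))
        g = -(Phi.T @ (w * y / (1.0 + np.exp(m)))) / len(y)
        theta -= eta * g
    return theta

rng = np.random.default_rng(1)
n = 200
x = np.r_[rng.normal(-2, 1, n // 2), rng.normal(2, 1, n // 2)]
y = np.r_[-np.ones(n // 2), np.ones(n // 2)]
Phi = np.c_[x, np.ones(n)]                          # f(x) = t1*x + t2
theta = aiwlr_gd(Phi, y, w=np.ones(n))              # uniform weights: plain LR
acc = np.mean(np.sign(Phi @ theta) == y)            # training accuracy
```

Replacing `w` with the flattened density ratios turns this into AIWLR without touching the optimization loop.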
Adaptive importance weighting can be applied to SVM, which we call adaptive importance-weighted SVM (AIWSVM):

  θ̂_γ = argmin_θ [ Σ_{i=1}^{n_tr} ( p_te(x_i^tr) / p_tr(x_i^tr) )^γ max( 0, 1 − y_i^tr f̂(x_i^tr; θ) ) ].

The support vector classifier was shown to minimize the conditional value-at-risk (CVaR) of the margin [181] (see section 2.2.4 for the definition of CVaR).

2.3.4 Exponential Loss: Boosting

Boosting is an iterative learning algorithm that produces a convex combination of base classifiers [52]. Although boosting has its origin in the framework of probably approximately correct (PAC) learning [190], it can be regarded as a stagewise optimization of the exponential loss [28, 53]. The profile of the exponential loss is illustrated in figure 2.7; it is a convex upper bound of the 0/1-loss that is looser than the hinge loss.

The adaptive importance-weighting idea can be applied to boosting, a process we call adaptive importance-weighted boosting (AIWB):

  θ̂_γ = argmin_θ [ Σ_{i=1}^{n_tr} ( p_te(x_i^tr) / p_tr(x_i^tr) )^γ exp( −y_i^tr f̂(x_i^tr; θ) ) ].

2.4 Numerical Examples

In this section we illustrate how ERM, IWERM, and AIWERM behave, using toy regression and classification problems.

2.4.1 Regression

We assume that the conditional distribution p(y|x) has mean f(x) and variance σ²; that is, output values contain independent additive noise. Let the learning target function f(x) be the sinc function:

  f(x) = sinc(x) :=  1              if x = 0,
                     sin(πx)/(πx)   otherwise.
2.4. Numerical Examples

Let the training and test input densities be

  p_tr(x) = N(x; 1, (1/2)²),
  p_te(x) = N(x; 2, (1/4)²),

where N(x; μ, σ²) denotes the Gaussian density with mean μ and variance σ². The profiles of the densities are plotted in figure 2.8a. Since the training input points are distributed on the left-hand side of the input domain and the test input points are distributed on the right-hand side, we are considering a (weak) extrapolation problem.

We create the training output values {y_i^tr}_{i=1}^{n_tr} as

  y_i^tr = f(x_i^tr) + ε_i,

where {ε_i}_{i=1}^{n_tr} are i.i.d. noise samples drawn from a zero-mean Gaussian distribution. We let the number of training samples be n_tr = 150, and we use a linear-in-input model (see section 1.3.5.1) for function learning:

  f̂(x; θ) = θ₁x + θ₂.

If ordinary least squares (OLS, which is an ERM method with the squared loss) is used for fitting the straight-line model, we obtain a good approximation of the left-hand side of the sinc function (see figure 2.8b). However, this is not an appropriate function for estimating the test output values (× in the figure). Thus, OLS results in a large test error. Figure 2.8d depicts the learned function obtained by importance-weighted LS (IWLS), which gives a better function for estimating the test output values than OLS, although it is rather unstable. Figure 2.8c depicts a learned function obtained by AIWLS with γ = 0.5 (see section 2.2.1), which yields better estimation of the test output values than IWLS (AIWLS with γ = 1) and OLS (AIWLS with γ = 0).

2.4.2 Classification

Through the above regression examples, we found that importance weighting tends to improve the prediction performance in regression scenarios. Here, we apply the importance-weighting technique to a toy classification problem.
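The regression experiment just described can be reproduced in outline. The sketch below uses the densities above but chooses its own noise level, and computes the closed-form AIWLS solution θ̂_γ = (X_tr^T W^γ X_tr)^{-1} X_tr^T W^γ y^tr. By construction, the γ = 1 solution attains the smallest importance-weighted training error, which the final check confirms.

```python
import numpy as np

def gauss(x, mu, sigma):
    """One-dimensional Gaussian density N(x; mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def aiwls(Phi, y, w, gamma):
    """Closed-form AIWLS: theta = (Phi^T W^g Phi)^(-1) Phi^T W^g y."""
    A = Phi.T * w ** gamma                 # scales each sample's column by w_i^gamma
    return np.linalg.solve(A @ Phi, A @ y)

rng = np.random.default_rng(2)
n_tr = 150
x_tr = rng.normal(1.0, 0.5, n_tr)                        # p_tr = N(1, (1/2)^2)
y_tr = np.sinc(x_tr) + 0.1 * rng.standard_normal(n_tr)   # assumed noise level 0.1
w = gauss(x_tr, 2.0, 0.25) / gauss(x_tr, 1.0, 0.5)       # p_te = N(2, (1/4)^2)
Phi = np.c_[x_tr, np.ones(n_tr)]                         # linear-in-input model

theta_ols = aiwls(Phi, y_tr, w, gamma=0.0)               # OLS (ERM)
theta_iwls = aiwls(Phi, y_tr, w, gamma=1.0)              # IWLS (IWERM)

def werr(theta):                                         # importance-weighted error
    return np.mean(w * (Phi @ theta - y_tr) ** 2)
```

Note that NumPy's `np.sinc` is exactly the sin(πx)/(πx) convention used in the text.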
Figure 2.8 An illustrative regression example with covariate shift. (a) The probability density functions of the training and test input points and their ratio. (b)-(d) The learning target function f(x) (the solid line), training samples (○), a learned function f̂(x) (the dashed line), and test samples (×): (b) OLS (AIWLS with γ = 0); (c) AIWLS with γ = 0.5; (d) IWLS (AIWLS with γ = 1). Note that the test samples are not used for function learning.
Let us consider a binary classification problem in the two-dimensional input space (d = 2). We define the class-posterior probability, given input x, by

  p(y = +1 | x) = ( 1 + tanh( x^(1) + min(0, x^(2)) ) ) / 2,

where x = (x^(1), x^(2))^T and p(y = −1 | x) = 1 − p(y = +1 | x). The optimal decision boundary, that is, the set of all x such that p(y = +1 | x) = p(y = −1 | x), is illustrated in figure 2.9a.

Let the training and test input densities p_tr(x) and p_te(x) each be a mixture of two multivariate Gaussian densities (equations 2.8-2.10), where N(x; μ, Σ) denotes the multivariate Gaussian density with mean μ and covariance matrix Σ. This setup implies that we are considering a (weak) extrapolation problem. Contours of the training and test input densities are illustrated in figure 2.9a.

Let the number of training samples be n_tr = 500. We create training input points {x_i^tr}_{i=1}^{n_tr} following p_tr(x) and training output labels {y_i^tr}_{i=1}^{n_tr} following p(y | x = x_i^tr). Similarly, let the number of test samples be n_te = 500. We create n_te test input points {x_i^te}_{i=1}^{n_te} following p_te(x) and test output labels {y_i^te}_{i=1}^{n_te} following p(y | x = x_i^te).

We use the linear-in-input model for function learning:

  f̂(x; θ) = θ₁x^(1) + θ₂x^(2) + θ₃,

and determine the parameter θ by AIWFDA (see section 2.3.1). Figure 2.9b depicts an example of realizations of training and test samples, as well as decision boundaries obtained by AIWFDA with γ = 0, 0.5, 1. In this particular realization, γ = 0.5 or 1 works better than γ = 0.
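A runnable outline of this classification experiment is given below. Since the exact Gaussian-mixture parameters of equations 2.8-2.10 are not reproduced here, the sketch substitutes its own assumed unit-variance Gaussian input densities and uses ±1 targets rather than the ±1/n^± coding, so it illustrates AIWFDA's mechanics rather than the book's exact figures. The final check verifies only the guaranteed property that the γ = 1 solution minimizes the importance-weighted training error.

```python
import numpy as np

rng = np.random.default_rng(3)

def posterior(x):                       # p(y=+1|x) from the text
    return (1 + np.tanh(x[:, 0] + np.minimum(0.0, x[:, 1]))) / 2

def gauss2(x, mu):                      # assumed spherical unit-variance density
    d = x - mu
    return np.exp(-0.5 * (d ** 2).sum(axis=1)) / (2 * np.pi)

n_tr = 500
mu_tr, mu_te = np.array([0.0, 2.0]), np.array([0.0, -2.0])  # assumed, not the book's values
x_tr = rng.normal(size=(n_tr, 2)) + mu_tr
y_tr = np.where(rng.random(n_tr) < posterior(x_tr), 1.0, -1.0)
w = gauss2(x_tr, mu_te) / gauss2(x_tr, mu_tr)               # importance weights

Phi = np.c_[x_tr, np.ones(n_tr)]        # f(x; theta) = t1*x1 + t2*x2 + t3

def aiwfda(gamma):
    A = Phi.T * w ** gamma              # flattened-weighted normal equations
    return np.linalg.solve(A @ Phi, A @ y_tr)

def werr(theta):                        # importance-weighted training error
    return np.mean(w * (Phi @ theta - y_tr) ** 2)

t0, t1 = aiwfda(0.0), aiwfda(1.0)       # gamma = 0 (ERM) vs. gamma = 1 (IWERM)
```

Intermediate γ values such as 0.5 interpolate between the two boundaries, exactly as in figure 2.9b.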
Figure 2.9 An illustrative classification example with covariate shift. (a) Optimal decision boundary (the thick solid line) and contours of training and test input densities (thin solid lines). (b) Optimal decision boundary (solid line) and learned boundaries (dashed lines) obtained by AIWFDA with γ = 0, 0.5, 1. ○ and × denote the positive and negative training samples, while □ and + denote the positive and negative test samples. Note that the test samples are not given in the training phase; they are plotted in the figure for illustration purposes.
2.5 Summary and Discussion

Most standard machine learning methods assume that the data-generating mechanism does not change over time (e.g., [195, 193, 42, 74, 141, 21]). However, this fundamental prerequisite is often violated in practical problems, such as off-policy reinforcement learning [174, 66, 67, 4], spam filtering [18], speech recognition [204], audio tagging [198], natural language processing [186], bioinformatics [10, 25], face recognition [188], and brain-computer interfacing [201, 160, 102]. When training and test distributions are different, ordinary estimators are biased, and therefore good generalization performance may not be obtained.

If the training and test distributions have nothing in common, we may not be able to learn anything about the test distribution from training samples. Thus, we need a reasonable assumption that links the training and test distributions. In this chapter, we focused on a specific type of distribution change called covariate shift [145], where the input distribution changes but the conditional distribution of outputs given inputs does not change; extrapolation would be a typical example of covariate shift.

We have seen that the use of importance weights contributes to reducing the bias caused by covariate shift and allows us to obtain consistent estimators even under covariate shift (section 2.1.1). However, a naive use of the importance weights does not necessarily produce reliable solutions, since importance-weighted estimators tend to have large variance. This is because training samples which are not "typical" in the test distribution are downweighted, and thus the effective number of training samples becomes smaller. This is the price we have to pay for bias reduction. In order to mitigate this problem, we introduced stabilization techniques in sections 2.1.2 and 2.1.3: flattening the importance weights and regularizing the solution.
The importance-weighting techniques have a wide range of applicability: any learning method can be adjusted in a systematic manner as long as it is based on the empirical error (or the log-likelihood). We have shown examples of such learning methods in regression (section 2.2) and in classification (section 2.3). Numerical results illustrating the behavior of these methods were presented in section 2.4.

The introduction of stabilizers such as the flattening parameter and the regularization parameter raises another important issue to be addressed: how to optimally determine these trade-off parameters. This is a model selection problem, and it will be discussed in detail in chapter 3.
3 Model Selection

As shown in chapter 2, adaptive importance-weighted learning methods are promising in covariate shift scenarios, given that the flattening parameter γ is chosen appropriately. Although γ = 0.5 worked well for both the regression and the classification scenarios in the numerical examples in section 2.4, γ = 0.5 is not always the best choice; a good value of γ may depend on the learning target function, the models, the noise level in the training samples, and so on. Therefore, model selection needs to be carried out appropriately to enhance the generalization capability under covariate shift.

The goal of model selection is to determine the model (e.g., basis functions, the flattening parameter γ, and the regularization parameter λ) so that the generalization error is minimized [1, 108, 2, 182, 142, 135, 35, 3, 136, 144, 195, 45, 118, 96, 85, 193, 165, 161, 159]. The true generalization error is not accessible since it contains the unknown learning target function, so some generalization error estimator needs to be used instead. However, standard generalization error estimators such as cross-validation are heavily biased under covariate shift, and thus are no longer reliable. In this chapter, we describe generalization error estimators that possess proper unbiasedness even under covariate shift.

3.1 Importance-Weighted Akaike Information Criterion

In density estimation problems, the Akaike information criterion (AIC) [2] is an asymptotically unbiased estimator of the Kullback-Leibler divergence [97] from the true density to an estimated density, up to a constant term. AIC can be employed in supervised learning if the supervised learning problem is formulated as the problem of estimating the conditional density of output values, given input points. However, the asymptotic unbiasedness of AIC no longer holds under covariate shift; a variant of AIC, which we refer
to as importance-weighted AIC (IWAIC), is instead asymptotically unbiased [145]:

  IWAIC := (1/n_tr) Σ_{i=1}^{n_tr} ( p_te(x_i^tr) / p_tr(x_i^tr) ) loss( x_i^tr, y_i^tr, f̂(x_i^tr; θ̂) ) + (1/n_tr) tr( F̂ Ĝ^{-1} ),  (3.1)

where θ̂ is a learned parameter vector, and F̂ and Ĝ are the matrices with the (ℓ, ℓ')-th elements

  F̂_{ℓ,ℓ'} := (1/n_tr) Σ_{i=1}^{n_tr} ( p_te(x_i^tr) / p_tr(x_i^tr) )² ( ∂loss(x_i^tr, y_i^tr, f̂(x_i^tr; θ̂)) / ∂θ_ℓ ) ( ∂loss(x_i^tr, y_i^tr, f̂(x_i^tr; θ̂)) / ∂θ_{ℓ'} ),

  Ĝ_{ℓ,ℓ'} := (1/n_tr) Σ_{i=1}^{n_tr} ( p_te(x_i^tr) / p_tr(x_i^tr) ) ∂²loss(x_i^tr, y_i^tr, f̂(x_i^tr; θ̂)) / ∂θ_ℓ ∂θ_{ℓ'}.

When p_tr(x) = p_te(x), IWAIC is reduced to

  (1/n_tr) Σ_{i=1}^{n_tr} loss( x_i^tr, y_i^tr, f̂(x_i^tr; θ̂) ) + (1/n_tr) tr( F̂' (Ĝ')^{-1} ),

where F̂' and Ĝ' are the matrices with the (ℓ, ℓ')-th elements

  F̂'_{ℓ,ℓ'} := (1/n_tr) Σ_{i=1}^{n_tr} ( ∂loss(x_i^tr, y_i^tr, f̂(x_i^tr; θ̂)) / ∂θ_ℓ ) ( ∂loss(x_i^tr, y_i^tr, f̂(x_i^tr; θ̂)) / ∂θ_{ℓ'} ),

  Ĝ'_{ℓ,ℓ'} := (1/n_tr) Σ_{i=1}^{n_tr} ∂²loss(x_i^tr, y_i^tr, f̂(x_i^tr; θ̂)) / ∂θ_ℓ ∂θ_{ℓ'}.

This is called the Takeuchi information criterion (TIC) [182]. Furthermore, when the model f̂(x; θ) is correctly specified, F̂' agrees with Ĝ', so tr(F̂'(Ĝ')^{-1}) = dim(θ̂). Then TIC is reduced to

  AIC := (1/n_tr) Σ_{i=1}^{n_tr} loss( x_i^tr, y_i^tr, f̂(x_i^tr; θ̂) ) + dim(θ̂) / n_tr,
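For the squared loss and a linear-in-parameter model, the matrices F̂ and Ĝ take a simple form (∂loss/∂θ_ℓ = 2(f̂ − y)φ_ℓ and ∂²loss/∂θ_ℓ∂θ_ℓ' = 2φ_ℓφ_ℓ'), and the criterion can be computed in a few lines. The sketch below is an illustration under those assumptions, with the importance weights taken as given; setting all weights to one yields TIC.

```python
import numpy as np

def iwaic_sq(Phi, y, w, theta):
    """IWAIC for the squared loss and a linear-in-parameter model.

    F[l,l'] = mean_i w_i^2 * dloss/dtheta_l * dloss/dtheta_l'
    G[l,l'] = mean_i w_i   * d2loss/dtheta_l dtheta_l'
    """
    n = len(y)
    r = Phi @ theta - y                                # residuals
    grads = 2 * r[:, None] * Phi                       # per-sample loss gradients
    F = (grads * (w ** 2)[:, None]).T @ grads / n
    G = (Phi * w[:, None]).T @ (2 * Phi) / n
    emp = np.mean(w * r ** 2)                          # weighted empirical loss
    return emp + np.trace(F @ np.linalg.inv(G)) / n

rng = np.random.default_rng(4)
n = 200
x = rng.normal(1.0, 0.5, n)
y = np.sin(x) + 0.1 * rng.standard_normal(n)           # assumed toy data
w = np.ones(n)                                         # p_tr = p_te: reduces to TIC
Phi = np.c_[x, np.ones(n)]
theta = np.linalg.lstsq(Phi, y, rcond=None)[0]
score = iwaic_sq(Phi, y, w, theta)
```

The trace term is strictly positive whenever the residuals are nonzero, so the score always exceeds the bare (weighted) empirical error, playing the role of the complexity penalty.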
where dim(θ̂) denotes the dimensionality of the parameter vector θ̂. This is the original Akaike information criterion [2]. Thus, IWAIC may be regarded as a natural extension of AIC.

Note that in the derivation of IWAIC (and also of TIC and AIC), proper regularity conditions are assumed (see, e.g., [197]). This excludes the use of nonsmooth loss functions such as the 0/1-loss, and of nonidentifiable models such as multilayer perceptrons and Gaussian mixture models [196].

Let us consider the following Gaussian linear regression scenario:

• The conditional density p(y|x) is Gaussian with mean f(x) and variance σ²:

  p(y|x) = N(y; f(x), σ²),

where N(y; μ, σ²) denotes the Gaussian density with mean μ and variance σ².

• The linear-in-parameter model (see section 1.3.5.2) is used:

  f̂(x; θ) = Σ_{ℓ=1}^b θ_ℓ φ_ℓ(x),

where b is the number of parameters and {φ_ℓ(x)}_{ℓ=1}^b are fixed, linearly independent functions.

• The parameter is learned by a linear learning method; that is, the learned parameter θ̂ is given by

  θ̂ = L y^tr,

where L is a b × n_tr learning matrix that is independent of the training output noise contained in y^tr, and

  y^tr := (y_1^tr, y_2^tr, ..., y_{n_tr}^tr)^T.

• The generalization error is defined with the squared loss:

  Gen := E_{x^te} E_{y^te} [ ( f̂(x^te; θ̂) − y^te )² ].
To make the following discussion simple, we subtract constant terms from the generalization error:

  Gen' := Gen − C' − σ²,  (3.2)

where C' is defined as

  C' := E_{x^te} [ f²(x^te) ].

Note that C' is independent of the learned function. Under this setup, the generalization error estimator based on IWAIC is given as follows [145]:

  Ĝen_IWAIC = ⟨Û L y^tr, L y^tr⟩ − 2⟨Û L y^tr, L₁ y^tr⟩ + 2 tr( Û L Q L₁^T ),

where ⟨·,·⟩ denotes the inner product, L₁ is the importance-weighted least-squares learning matrix (γ = 1; see section 2.2.1),

  Û := (1/n_tr) X_tr^T W X_tr,

and Q is a diagonal matrix whose i-th diagonal element involves the importance weight at x_i^tr (see [145] for the explicit form).

IWAIC is asymptotically unbiased even under covariate shift; more precisely, IWAIC satisfies

  E_{{x_i^tr}} E_{{y_i^tr}} [ Ĝen_IWAIC ] = E_{{x_i^tr}} E_{{y_i^tr}} [ Gen' ] + o(n_tr^{-1}),  (3.3)

where E_{{x_i^tr}} denotes the expectation over {x_i^tr}_{i=1}^{n_tr} drawn i.i.d. from p_tr(x), and E_{{y_i^tr}} denotes the expectation over {y_i^tr}_{i=1}^{n_tr}, each drawn from p(y | x = x_i^tr).

3.2 Importance-Weighted Subspace Information Criterion

IWAIC has nice theoretical properties such as asymptotic unbiasedness (equation 3.3). However, there are two issues that leave room for improvement: input independence and model specification.
3.2.1 Input Dependence vs. Input Independence in Generalization Error Analysis

The first issue for improving IWAIC is the way unbiasedness is evaluated. In IWAIC, unbiasedness is evaluated in terms of the expectations over both training input points and training output values (see equation 3.3). However, in practice we are given only a single training set {(x_i^tr, y_i^tr)}_{i=1}^{n_tr}, and ideally we want to predict the single-trial generalization error, that is, the generalization error for the single realization of the training set at hand. From this viewpoint, we do not want to average out the random variables; we want to plug the realization of the random variables into the generalization error and evaluate the realized value of the generalization error.

However, we may not be able to avoid taking the expectation over the training output values {y_i^tr}_{i=1}^{n_tr} (i.e., the expectation over the training output noise), since the training output noise is inaccessible. In contrast, the location of the training input points {x_i^tr}_{i=1}^{n_tr} is accessible. Therefore, it would be advantageous to predict the generalization error without taking the expectation over the training input points, that is, to predict the conditional expectation of the generalization error given the training input points. Below, we refer to estimating the conditional expectation of the generalization error as input-dependent analysis of the generalization error. On the other hand, estimating the full expectation of the generalization error is referred to as input-independent analysis of the generalization error.

In order to illustrate a possible advantage of the input-dependent approach, let us consider a simple model selection scenario where we have only one training sample (x, y) (see figure 3.1).
The solid curves in figure 3.1a depict G_{M1}(y|x), the generalization error for a model M1 as a function of the (noisy) training output value y, given a training input point x. The three solid curves correspond to the cases where the realization of the training input point x is x', x'', and x''', respectively. The value of the generalization error for the model M1 in the input-independent approach is depicted by the dash-dotted line, where the expectation is taken over both the training input point x and the training output value y (this corresponds to the mean of the values of the three solid curves). The values of the generalization error in the input-dependent approach are depicted by the dotted lines, where the expectation is taken over only the training output value y, conditioned on x = x', x'', and x''', respectively (this corresponds to the mean of the values of each solid curve). The graph in figure 3.1b depicts the generalization errors for a model M2 in the same manner.
Figure 3.1 Schematic illustrations of the input-dependent and input-independent approaches to generalization error estimation: (a) generalization error G_{M1}(y|x) for model M1; (b) generalization error G_{M2}(y|x) for model M2.
In the input-independent framework, the model M1 is judged to be better than M2 regardless of the realization of the training input point, because the dash-dotted line in figure 3.1a is lower than that in figure 3.1b. However, M2 is actually better than M1 if x'' is realized as x. In the input-dependent framework, the goodness of the model is evaluated adaptively, depending on the realization of the training input point x. This illustrates the possibility that input-dependent analysis of the generalization error allows one to choose a better model than input-independent analysis.

3.2.2 Approximately Correct Models

The second issue for improving IWAIC is model misspecification. IWAIC has the same asymptotic accuracy for all models (see equation 3.3). However, in practice it may not be so difficult to distinguish good models from completely useless (i.e., heavily misspecified) models, since the magnitude of the generalization error differs significantly. This means that we are essentially interested in choosing a very good model from a set of reasonably good models. In this scenario, if a generalization error estimator is more accurate for better models (in other words, approximately correct models), the model selection performance will be further improved.

In order to formalize the concept of "approximately correct models," let us focus on the following setup:

• The squared loss (see section 1.3.2) is used: loss(x, y, ŷ) = (y − ŷ)².

• Noise in training and test output values is assumed to be i.i.d. with mean zero and variance σ².

Then the generalization error is expressed as follows (see section 1.3.3):

  Gen := E_{x^te} E_{y^te} [ ( f̂(x^te; θ̂) − y^te )² ]
       = E_{x^te} E_{y^te} [ ( f̂(x^te; θ̂) − f(x^te) + f(x^te) − y^te )² ]
       = E_{x^te} [ ( f̂(x^te; θ̂) − f(x^te) )² ] + E_{x^te} E_{y^te} [ ( f(x^te) − y^te )² ]
         + 2 E_{x^te} [ ( f̂(x^te; θ̂) − f(x^te) ) E_{y^te} [ f(x^te) − y^te ] ]
       = E_{x^te} [ ( f̂(x^te; θ̂) − f(x^te) )² ] + σ²,

where the cross term vanishes because the output noise has mean zero.
where E_{x^te} denotes the expectation over x^te drawn from p_te(x), and E_{y^te} denotes the expectation over y^te drawn from p(y | x = x^te).

• A linear-in-parameter model (see section 1.3.5.1) is used:

    f̂(x; θ) = Σ_{ℓ=1}^b θ_ℓ φ_ℓ(x),

where b is the number of parameters and {φ_ℓ(x)}_{ℓ=1}^b are fixed, linearly independent functions.

Let θ* be the optimal parameter under the above-defined generalization error:

    θ* := argmin_θ Gen.

Then the learning target function f(x) can be decomposed as

    f(x) = f̂(x; θ*) + δ r(x),    (3.4)

where δ r(x) is the residual. The function r(x) is orthogonal to the basis functions {φ_ℓ(x)}_{ℓ=1}^b under p_te(x):

    ∫ p_te(x) r(x) φ_ℓ(x) dx = 0  for ℓ = 1, 2, ..., b.

The function r(x) governs the nature of the model error, and δ is the possible magnitude of this error. In order to separate these two factors, we impose the following normalization condition on r(x):

    ∫ p_te(x) r(x)² dx = 1.

The above decomposition is illustrated in figure 3.2. An "approximately correct model" refers to a model with small δ (but δ ≠ 0). Rather informally, a good model should have a small model error δ.

3.2.3 Input-Dependent Analysis of Generalization Error

Based on input-dependent analysis (section 3.2.1) and approximately correct models (section 3.2.2), a generalization error estimator called the subspace information criterion (SIC) [165, 161] has been developed for linear regression, and has been extended to cope with covariate shift [162]. Here we review the derivation of this criterion, which we refer to
as importance-weighted SIC (IWSIC). More specifically, we first derive the basic form of IWSIC, "pre-IWSIC," in equation 3.8. Then versions of IWSIC for various learning methods are derived, including linear learning (equation 3.11), affine learning (also equation 3.11), smooth nonlinear learning (equation 3.17), and general nonlinear learning (equation 3.18).

Figure 3.2 Orthogonal decomposition of f(x): f(x) = f̂(x; θ*) + δ r(x), where δ r(x) is orthogonal to span({φ_ℓ(x)}_{ℓ=1}^b). An approximately correct model is a model with small model error δ.

3.2.3.1 Preliminaries  Since f̂(x; θ) and r(x) are orthogonal to one another (see figure 3.2), Gen' defined by equation 3.2 can be written as

    Gen' = E_{x^te} [ f̂(x^te; θ̂)² ] − 2 E_{x^te} [ f̂(x^te; θ̂) f(x^te) ]
         = E_{x^te} [ f̂(x^te; θ̂)² ] − 2 E_{x^te} [ f̂(x^te; θ̂) (f̂(x^te; θ*) + δ r(x^te)) ]
         = ⟨Uθ̂, θ̂⟩ − 2⟨Uθ̂, θ*⟩,

where U is the b × b matrix with the (ℓ, ℓ')-th element

    U_{ℓ,ℓ'} := E_{x^te} [ φ_ℓ(x^te) φ_{ℓ'}(x^te) ].    (3.5)

A basic concept of IWSIC is to replace the unknown θ* in equation 3.5 with the importance-weighted least-squares (IWLS) estimator θ̂₁ (see section 2.2.1):

    θ̂₁ := L₁ y^tr,  where  L₁ := (X^tr⊤ W₁ X^tr)⁻¹ X^tr⊤ W₁    (3.6)

and

    y^tr := (y₁^tr, y₂^tr, ..., y_{n_tr}^tr)⊤.
X^tr is the n_tr × b design matrix with the (i, ℓ)-th element

    X^tr_{i,ℓ} := φ_ℓ(x_i^tr),

and W₁ is the diagonal matrix with the i-th diagonal element

    [W₁]_{i,i} := p_te(x_i^tr) / p_tr(x_i^tr).

However, simply replacing θ* with θ̂₁ causes a bias, since the sample y^tr is also used for obtaining the parameter θ̂. The bias is expressed as

    Bias := E_{y^tr} [ ⟨Uθ̂, θ̂₁ − θ*⟩ ],

where E_{y^tr} denotes the expectation over {y_i^tr}_{i=1}^{n_tr}, each of which is drawn from p(y | x = x_i^tr). Below, we study the behavior of this bias.

3.2.3.2 Approximate Bias Correction and pre-IWSIC  The output values {f̂(x_i^tr; θ*)}_{i=1}^{n_tr} can be expressed as

    f̂(x_i^tr; θ*) = [X^tr θ*]_i.

Let r^tr be the n_tr-dimensional vector defined by

    r^tr := (r(x₁^tr), r(x₂^tr), ..., r(x_{n_tr}^tr))⊤,

where r(x) is the residual function (see figure 3.2). Then the training output vector y^tr can be expressed as

    y^tr = X^tr θ* + δ r^tr + ε^tr,

where δ is the model error (see figure 3.2), and

    ε^tr := (ε₁^tr, ε₂^tr, ..., ε_{n_tr}^tr)⊤
is the vector of training output noise. Due to the i.i.d. noise assumption, ε^tr satisfies

    E_{y^tr} [ ε^tr ] = 0_{n_tr}  and  E_{y^tr} [ ε^tr ε^tr⊤ ] = σ² I_{n_tr},

where 0_{n_tr} denotes the n_tr-dimensional vector with all zeros and I_{n_tr} is the n_tr-dimensional identity matrix.

For a parameter vector θ̂ learned from y^tr, the bias caused by replacing θ* with θ̂₁ can be expressed as

    Bias = E_{y^tr} [ ⟨Uθ̂, θ̂₁ − θ*⟩ ]
         = E_{y^tr} [ ⟨Uθ̂, L₁(X^tr θ* + δ r^tr + ε^tr) − θ*⟩ ]
         = E_{y^tr} [ ⟨Uθ̂, L₁ε^tr⟩ ] + δ E_{y^tr} [ ⟨Uθ̂, L₁r^tr⟩ ],    (3.7)

where E_{y^tr} denotes the expectation over {y_i^tr}_{i=1}^{n_tr}, each of which is drawn from p(y | x = x_i^tr). In the above derivation, we used the fact that L₁X^tr = I_b, which follows from equation 3.6.

Based on the above expression, we define "pre-IWSIC" as

    Gen'_pre-IWSIC := ⟨Uθ̂, θ̂⟩ − 2⟨Uθ̂, θ̂₁⟩ + 2 E_{y^tr} [ ⟨Uθ̂, L₁ε^tr⟩ ],    (3.8)

that is, the term δ E_{y^tr} [ ⟨Uθ̂, L₁r^tr⟩ ] in equation 3.7 was ignored. Below, we show that pre-IWSIC is asymptotically unbiased even under covariate shift. More precisely, it satisfies

    E_{y^tr} [ Gen'_pre-IWSIC − Gen' ] = δ O_p(n_tr^{−1/2}),    (3.9)

where O_p denotes the asymptotic order in probability with respect to the distribution of {x_i^tr}_{i=1}^{n_tr}.

[Proof of equation 3.9:]  L₁r^tr is expressed as
    L₁r^tr = ( (1/n_tr) X^tr⊤ W₁ X^tr )⁻¹ (1/n_tr) X^tr⊤ W₁ r^tr.

Then the law of large numbers [131] asserts that

    plim_{n_tr→∞} [ (1/n_tr) X^tr⊤ W₁ X^tr ]_{ℓ,ℓ'}
      = plim_{n_tr→∞} (1/n_tr) Σ_{i=1}^{n_tr} (p_te(x_i^tr)/p_tr(x_i^tr)) φ_ℓ(x_i^tr) φ_{ℓ'}(x_i^tr)
      = ∫ (p_te(x)/p_tr(x)) φ_ℓ(x) φ_{ℓ'}(x) p_tr(x) dx
      = ∫ p_te(x) φ_ℓ(x) φ_{ℓ'}(x) dx < ∞,

where "plim" denotes convergence in probability. This implies that

    ( (1/n_tr) X^tr⊤ W₁ X^tr )⁻¹ = O_p(1).

On the other hand, when n_tr is large, the central limit theorem [131] asserts that

    [ (1/n_tr) X^tr⊤ W₁ r^tr ]_ℓ = (1/n_tr) Σ_{i=1}^{n_tr} (p_te(x_i^tr)/p_tr(x_i^tr)) φ_ℓ(x_i^tr) r(x_i^tr) = O_p(n_tr^{−1/2}),

where we use the fact that φ_ℓ(x) and r(x) are orthogonal to one another under p_te(x) (see figure 3.2). Then we have

    L₁r^tr = O_p(n_tr^{−1/2}).

This implies that

    ⟨Uθ̂, L₁r^tr⟩ = O_p(n_tr^{−1/2})

when θ̂ is convergent, and therefore we have equation 3.9.

Note that IWSIC's asymptotic lack of bias shown above is in terms of the expectation only over the training output values {y_i^tr}_{i=1}^{n_tr}; the training input points {x_i^tr}_{i=1}^{n_tr} are fixed (i.e., input-dependent analysis; cf. input-independent analysis,
explained in section 3.1). The asymptotic order of the bias of pre-IWSIC is proportional to the model error δ, implying that pre-IWSIC is more accurate for "better" models; if the model is correct (i.e., δ = 0), pre-IWSIC is exactly unbiased with finite samples.

Below, for types of parameter learning including linear, affine, smooth nonlinear, and general nonlinear, we show how to approximate the third term of pre-IWSIC,

    E_{y^tr} [ ⟨Uθ̂, L₁ε^tr⟩ ],

from the training samples {(x_i^tr, y_i^tr)}_{i=1}^{n_tr}. In the above expression, E_{y^tr} essentially means the expectation over the training output noise {ε_i^tr}_{i=1}^{n_tr}. Below, we stick to using E_{y^tr} for expressing the expectation over {ε_i^tr}_{i=1}^{n_tr}.

3.2.3.3 Linear Learning  First, let us consider linear learning, that is, the learned parameter θ̂ is given by

    θ̂ = L y^tr,    (3.10)

where L is a b × n_tr learning matrix which is independent of the training output noise. This independence assumption is essentially used as

    E_{y^tr} [ L ε^tr ] = L E_{y^tr} [ ε^tr ] = 0_b.

Linear learning includes adaptive importance-weighted least squares (see section 2.2.1) and importance-weighted least squares with the squared regularizer (see section 2.1.3). For linear learning (equation 3.10), we have
    E_{y^tr} [ ⟨Uθ̂, L₁ε^tr⟩ ] = E_{y^tr} [ ⟨U L y^tr, L₁ε^tr⟩ ] = σ² tr(U L L₁⊤),

where σ² is the unknown noise variance. We approximate the noise variance σ² by

    σ̂² := ‖ y^tr − X^tr L₀ y^tr ‖² / (n_tr − b),

where L₀ is the learning matrix of the ordinary least-squares estimator:

    L₀ := (X^tr⊤ X^tr)⁻¹ X^tr⊤.

Then IWSIC for linear learning is given by

    Gen'_IWSIC := ⟨Uθ̂, θ̂⟩ − 2⟨Uθ̂, θ̂₁⟩ + 2 σ̂² tr(U L L₁⊤).    (3.11)

When the model is correctly specified (i.e., δ = 0), σ̂² is an unbiased estimator of σ² with finite samples. However, for misspecified models (i.e., δ ≠ 0), it is biased even asymptotically, with asymptotic bias proportional to δ². On the other hand, tr(U L L₁⊤) = O_p(n_tr^{−1}) yields

    (σ̂² − σ²) tr(U L L₁⊤) = δ² O_p(n_tr^{−1}),

implying that even for misspecified models, IWSIC still satisfies

    E_{y^tr} [ Gen'_IWSIC − Gen' ] = δ O_p(n_tr^{−1/2}).    (3.12)

In section 8.2.4, we introduce another input-dependent estimator of the generalization error under the active learning setup; it is interesting to note that the constant terms ignored above (see the definition of Gen' in equation 3.2) and the constant terms ignored in active learning (equation 8.2) are different.

The appearance of IWSIC above rather resembles IWAIC under linear regression scenarios:
    Gen'_IWAIC := ⟨Ûθ̂, θ̂⟩ − 2⟨Ûθ̂, θ̂₁⟩ + 2 tr(Û L Q L₁⊤),

where

    Û := (1/n_tr) X^tr⊤ W₁ X^tr,

and Q is the diagonal matrix with the i-th diagonal element

    Q_{i,i} := (p_te(x_i^tr)/p_tr(x_i^tr)) (y_i^tr − f̂(x_i^tr; θ̂))².

Asymptotic unbiasedness of IWAIC in the input-dependent framework is given by

    E_{y^tr} [ Gen'_IWAIC − Gen' ] = O_p(n_tr^{−1}),

that is, the model error δ does not affect the speed of convergence. This difference is significant when we are comparing the performance of "good" models (with small δ).

3.2.3.4 Affine Learning  Next, we consider a simple nonlinear learning method called affine learning. That is, for a b × n_tr matrix L and a b-dimensional vector c, both of which are independent of the noise ε^tr, the learned parameter vector θ̂ is given by

    θ̂ = L y^tr + c.    (3.13)

This includes additive regularization learning [121], where λ := (λ₁, λ₂, ..., λ_{n_tr})⊤ is a tuning parameter vector, and the learned parameter vector θ̂ is given by equation 3.13 with

    L = (X^tr⊤ X^tr + I_b)⁻¹ X^tr⊤,
    c = (X^tr⊤ X^tr + I_b)⁻¹ X^tr⊤ λ,
where I_b denotes the b-dimensional identity matrix.

For affine learning (equation 3.13), it holds that

    E_{y^tr} [ ⟨Uθ̂, L₁ε^tr⟩ ] = E_{y^tr} [ ⟨U(L y^tr + c), L₁ε^tr⟩ ] = σ² tr(U L L₁⊤).

Thus, we can still use linear IWSIC (equation 3.11) for affine learning without sacrificing its unbiasedness.

3.2.3.5 Smooth Nonlinear Learning  Let us consider a smooth nonlinear learning method. That is, using an almost differentiable [148] operator L, θ̂ is given by

    θ̂ = L(y^tr).    (3.14)

This includes Huber's robust regression (see section 2.2.3):

    θ̂ = argmin_θ Σ_{i=1}^{n_tr} ρ_τ( f̂(x_i^tr; θ) − y_i^tr ),

where τ (≥ 0) is a tuning parameter and

    ρ_τ(y) := { y²/2          if |y| ≤ τ,
              { τ|y| − τ²/2   if |y| > τ.

Note that ρ_τ(y) is twice almost differentiable, which yields a once almost differentiable operator L. Let V be the n_tr × n_tr matrix with the (i, i')-th element

    V_{i,i'} := ∇_i [ L₁⊤ U L(y^tr) ]_{i'},    (3.15)

where ∇_i is the partial derivative operator with respect to the i-th element of the input y^tr, and [L₁⊤ U L(y^tr)]_{i'} denotes the i'-th element of the output of the vector-valued function L₁⊤ U L(·). Suppose that the noise ε^tr is Gaussian. Then, for smooth nonlinear learning (equation 3.14), we have

    E_{y^tr} [ ⟨Uθ̂, L₁ε^tr⟩ ] = σ² E_{y^tr} [ tr(V) ].    (3.16)
This result follows from Stein's identity [148]: for an n_tr-dimensional i.i.d. Gaussian vector ε^tr with mean 0_{n_tr} and covariance matrix σ² I_{n_tr}, and for any almost differentiable function h: R^{n_tr} → R, it holds that

    E_{ε^tr} [ ε_{i'}^tr h(ε^tr) ] = σ² E_{ε^tr} [ ∇_{i'} h(ε^tr) ],

where ε_{i'}^tr is the i'-th element of ε^tr. If we let h(ε) = [L₁⊤ U L(ε)]_{i'}, then h(ε) is almost differentiable since L(·) is almost differentiable. An elementwise application of Stein's identity to the vector-valued function L₁⊤ U L(·) then establishes equation 3.16.

Based on this, we define importance-weighted linearly approximated SIC (IWLASIC) as

    Gen'_IWLASIC := ⟨Uθ̂, θ̂⟩ − 2⟨Uθ̂, θ̂₁⟩ + 2 σ² tr(V),    (3.17)

which still maintains the same lack of bias as the linear case under the Gaussian noise assumption. It is easy to confirm that equation 3.17 is reduced to the original form (equation 3.11) when θ̂ is obtained by linear learning (equation 3.10). Thus, IWLASIC may be regarded as a natural extension of the original IWSIC.

Note that IWLASIC (equation 3.17) is defined for the true noise variance σ²; if σ² in IWLASIC is replaced by an estimator σ̂², unbiasedness of IWLASIC can no longer be guaranteed. On the other hand, IWSIC for linear and affine learning (see equation 3.11) is defined with the noise variance estimator σ̂², but it is still guaranteed to be unbiased.

3.2.3.6 General Nonlinear Learning  Finally, let us consider general nonlinear learning methods. That is, for a general nonlinear operator L, the learned parameter vector θ̂ is given by

    θ̂ = L(y^tr).

This includes importance-weighted least squares with the absolute regularizer (see section 2.1.3).
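For such general nonlinear learners, the bias-correction term E_{y^tr}[⟨Uθ̂, L₁ε^tr⟩] is estimated by the residual bootstrap of figure 3.3. The following is a minimal sketch of that procedure; the fit/predict interface, the toy data, and the flat importance weights are illustrative assumptions, not the book's implementation:

```python
import numpy as np

def bootstrap_bias_term(x, y, fit, predict, U, L1, n_boot=100, seed=0):
    """Residual-bootstrap estimate of E[<U theta, L1 eps>] (cf. figure 3.3).

    fit(x, y) -> parameter vector; predict(x, theta) -> fitted values.
    U is b x b; L1 is the b x n_tr IWLS learning matrix.
    """
    rng = np.random.default_rng(seed)
    theta_hat = fit(x, y)                      # step 1: fit on the original data
    resid = y - predict(x, theta_hat)          # step 2: estimate the noise
    vals = []
    for _ in range(n_boot):
        eps_b = rng.choice(resid, size=len(y), replace=True)  # step 3: resample
        y_b = predict(x, theta_hat) + eps_b                   # step 4: rebuild outputs
        theta_b = fit(x, y_b)                                 #         and refit
        vals.append((U @ theta_b) @ (L1 @ eps_b))             # step 5: <U theta, L1 eps>
    return float(np.mean(vals))                               # step 6: average

# toy usage: ordinary least squares on a 1-D linear model, flat importance weights
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 50)
y = 2.0 * x + 0.3 * rng.normal(size=50)
design = lambda xs: np.vstack([np.ones(len(xs)), xs]).T
fit = lambda xs, ys: np.linalg.lstsq(design(xs), ys, rcond=None)[0]
predict = lambda xs, th: design(xs) @ th
X = design(x)
L1 = np.linalg.solve(X.T @ X, X.T)             # learning matrix (weights = 1 here)
bias_term = bootstrap_bias_term(x, y, fit, predict, np.eye(2), L1)
```

For a linear learner like this one, the bootstrap estimate should be close to the analytic value σ² tr(U L L₁⊤); the bootstrap is needed only when no such closed form exists.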
1. Obtain the learned parameter vector θ̂ using the training samples {(x_i^tr, y_i^tr)}_{i=1}^{n_tr} as usual.
2. Estimate the noise by {ε̂_i^tr | ε̂_i^tr = y_i^tr − f̂(x_i^tr; θ̂)}_{i=1}^{n_tr}.
3. Create bootstrap noise samples {ε̆_i^tr}_{i=1}^{n_tr} by sampling with replacement from {ε̂_i^tr}_{i=1}^{n_tr}.
4. Obtain the learned parameter vector θ̆ using the bootstrap samples {(x_i^tr, y̆_i^tr) | y̆_i^tr = f̂(x_i^tr; θ̂) + ε̆_i^tr}_{i=1}^{n_tr}.
5. Calculate ⟨Uθ̆, L₁ε̆^tr⟩.
6. Repeat steps 3 to 5 a number of times and output the mean of ⟨Uθ̆, L₁ε̆^tr⟩.

Figure 3.3 Bootstrap procedure in IWBASIC.

For general nonlinear learning, we estimate the third term E_{y^tr}[⟨Uθ̂, L₁ε^tr⟩] in pre-IWSIC (equation 3.8) using the bootstrap method [43, 45]:

    Ê_{y̆^tr} [ ⟨Uθ̆, L₁ε̆^tr⟩ ],

where Ê_{y̆^tr} denotes the expectation over the bootstrap replications, and θ̆ and ε̆^tr correspond to the learned parameter vector θ̂ and the training output noise vector ε^tr estimated from the bootstrap samples, respectively. More specifically, we compute Ê_{y̆^tr}[⟨Uθ̆, L₁ε̆^tr⟩] by bootstrapping residuals, as described in figure 3.3. Based on this bootstrap procedure, importance-weighted bootstrap-approximated SIC (IWBASIC) is defined as

    Gen'_IWBASIC := ⟨Uθ̂, θ̂⟩ − 2⟨Uθ̂, θ̂₁⟩ + 2 Ê_{y̆^tr} [ ⟨Uθ̆, L₁ε̆^tr⟩ ].    (3.18)

3.3 Importance-Weighted Cross-Validation

IWAIC and IWSIC do not accept the 0/1-loss (see section 1.3.2). Thus, they cannot be employed for estimating the misclassification rate in classification scenarios. In this section, we describe a more general model selection method that can be applied to an arbitrary loss function, including the 0/1-loss. Below, we consider the generalization error in the following general form:

    Gen := E_{x^te} E_{y^te} [ loss(x^te, y^te, f̂(x^te; θ̂)) ].
One of the popular techniques for estimating the generalization error for arbitrary loss functions is cross-validation (CV) [150, 195]. CV has been shown to give an almost unbiased estimate of the generalization error with finite samples [105, 141]. However, such unbiased estimation is no longer possible under covariate shift. To cope with this problem, a variant of CV called importance-weighted CV (IWCV) has been proposed [160]. Let us randomly divide the training set Z = {(x_i^tr, y_i^tr)}_{i=1}^{n_tr} into k disjoint nonempty subsets {Z_r}_{r=1}^k of (approximately) the same size. Let f̂_{Z_r}(x) be a function learned from {Z_{r'}}_{r'≠r} (i.e., without Z_r). Then the k-fold IWCV (kIWCV) estimate of the generalization error Gen is given by

    Ĝen_kIWCV := (1/k) Σ_{r=1}^k (1/|Z_r|) Σ_{(x,y)∈Z_r} (p_te(x)/p_tr(x)) loss(x, y, f̂_{Z_r}(x)),

where |Z_r| is the number of samples in the subset Z_r (see figure 3.4). When k = n_tr, kIWCV is called IW leave-one-out CV (IWLOOCV):

    Ĝen_IWLOOCV := (1/n_tr) Σ_{i=1}^{n_tr} (p_te(x_i^tr)/p_tr(x_i^tr)) loss(x_i^tr, y_i^tr, f̂_i(x_i^tr)),

where f̂_i(x) is a function learned from all samples except (x_i^tr, y_i^tr).

Figure 3.4 Cross-validation. The training set is divided into subsets 1, ..., k; for each i, a function f̂_{Z_i}(x) is learned from all subsets except Z_i (estimation), and the loss loss(x, y, f̂_{Z_i}(x)) is evaluated on the held-out subset Z_i (validation).

It has been proved that IWLOOCV gives an almost unbiased estimate of the generalization error even under covariate shift [160]. More precisely, IWLOOCV for n_tr training samples gives an unbiased estimate of the generalization error for
n_tr − 1 training samples:

    E_{{x_i^tr}} E_{{y_i^tr}} [ Ĝen_IWLOOCV ] = E_{{x_i^tr}} E_{{y_i^tr}} [ Gen^(n_tr − 1) ],

where E_{{x_i^tr}} denotes the expectation over {x_i^tr}_{i=1}^{n_tr} drawn i.i.d. from p_tr(x), E_{{y_i^tr}} denotes the expectation over {y_i^tr}_{i=1}^{n_tr}, each of which is drawn from p(y | x = x_i^tr), and Gen^(n_tr − 1) denotes the generalization error for n_tr − 1 training samples. A similar proof is possible for kIWCV, but the bias is slightly larger [74].

Almost unbiasedness of IWCV holds for any loss function, any model, and any parameter learning method; even nonidentifiable models [196] or nonparametric learning methods (e.g., [141]) are allowed. Thus, IWCV is very flexible and useful in practical model selection under covariate shift.

Unbiasedness of IWCV shown above is within the input-independent framework (see section 3.2.1); in the input-dependent framework, unbiasedness of IWCV is described as

    E_{{y_i^tr}} [ Ĝen_IWCV − Gen ] = O_p(n_tr^{−1/2}).

This is the same asymptotic order as in IWAIC and IWSIC. However, since this bound does not shrink with the model error δ (whereas IWSIC's bound is proportional to δ), IWSIC would be more accurate in regression scenarios with good models.

3.4 Numerical Examples

Here we illustrate how IWAIC, IWSIC, and IWCV behave.

3.4.1 Regression

Let us continue the one-dimensional regression simulation of section 2.4.1. As shown in figure 2.8 in section 2.4.1, adaptive importance-weighted least squares (AIWLS) with flattening parameter γ = 0.5 appears to work well for that particular realization. However, the best value of γ depends on the realization of samples. In order to investigate this issue systematically, let us repeat the simulation 1000 times with different random seeds. That is, in each run, {(x_i^tr, ε_i^tr)}_{i=1}^{n_tr} are randomly drawn and the scores of tenfold IWCV, IWSIC, IWAIC, and tenfold CV are calculated for γ = 0, 0.1, 0.2, ..., 1. The means and standard deviations of the generalization error Gen and its estimate by each method are depicted as functions of γ in figure 3.5.
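For concreteness, the kIWCV score used in these experiments can be sketched in code as follows; the fit/loss interface, the toy data, and the closed-form importance weights (two Gaussians with known parameters) are illustrative assumptions, not the book's experimental setup:

```python
import numpy as np

def kfold_iwcv(x, y, w, fit, loss, k=10, seed=0):
    """k-fold importance-weighted cross-validation (kIWCV) score.

    w holds the importance values p_te(x_i)/p_tr(x_i) at the training points;
    fit(x, y) returns a predictor g(x); loss(y, yhat) returns pointwise losses.
    """
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)
    scores = []
    for r in range(k):
        val = folds[r]
        trn = np.concatenate([folds[s] for s in range(k) if s != r])
        g = fit(x[trn], y[trn])                       # learn without fold r
        # importance-weight the held-out losses to correct for covariate shift
        scores.append(np.mean(w[val] * loss(y[val], g(x[val]))))
    return float(np.mean(scores))

# toy usage: p_tr = N(1, 0.5^2), p_te = N(2, 0.5^2), linear model, squared loss
rng = np.random.default_rng(1)
x = rng.normal(1.0, 0.5, 200)
y = np.sin(x) + 0.1 * rng.normal(size=200)
w = np.exp(-(x - 2.0) ** 2 / 0.5) / np.exp(-(x - 1.0) ** 2 / 0.5)  # p_te(x)/p_tr(x)

def fit(xs, ys):
    c = np.polyfit(xs, ys, 1)
    return lambda t: np.polyval(c, t)

score = kfold_iwcv(x, y, w, fit, lambda a, b: (a - b) ** 2)
```

Model selection then simply evaluates this score for each candidate (e.g., each flattening parameter γ) and picks the minimizer.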
Figure 3.5 Generalization error Gen and its estimates (Ĝen_IWCV, Ĝen_IWSIC, Ĝen_IWAIC, Ĝen_CV) as functions of the flattening parameter γ in adaptive importance-weighted least squares (AIWLS) for the regression examples in figure 2.8. Dashed curves in the bottom four graphs depict the true generalization error for clear comparison. Note that the vertical scale of CV is different from the others since it takes a wider range of values. Also note that IWAIC and IWSIC are estimators of the generalization error up to some constants (see equations 3.3 and 3.12). For clear comparison, we included those ignored constants in the plots, which does not essentially change the result.
The graphs show that IWCV, IWSIC, and IWAIC give reasonably good unbiased estimates of the generalization error, while CV is heavily biased. The variance of IWCV is slightly larger than those of IWSIC and IWAIC, which would be the price we have to pay in compensation for generality (as discussed in section 3.3, IWCV is a much more general method than IWSIC and IWAIC). Fortunately, as shown below, this rather large variance appears not to affect the model selection performance so much.

Next we investigate the model selection performance: the flattening parameter γ (see section 2.1.2) is chosen from {0, 0.1, 0.2, ..., 1} so that the score of each method is minimized. The mean and standard deviation of the generalization error Gen of the learned function obtained by each method over 1000 runs are described in the top row of table 3.1. This shows that IWCV, IWSIC, and IWAIC give significantly smaller generalization errors than CV under the t-test [78] at the significance level of 5 percent. IWCV, IWSIC, and IWAIC are comparable to each other. For reference, the generalization error when the flattening parameter γ is chosen optimally (i.e., for each trial, γ is chosen so that the true generalization error is minimized) is described as OPT in the table. The result shows that the generalization error values of IWCV, IWSIC, and IWAIC are rather close to the optimal value.

The bottom row of table 3.1 describes the results when the polynomial model of order 2 is used for learning. This shows that IWCV and IWSIC still work well, and outperform IWAIC and CV. When the second-order polynomial model is used, the target function is rather realizable in the test region (see figure 2.8).
Therefore, IWSIC tends to be more accurate (see section 3.2.3).

Table 3.1 The mean and standard deviation of the generalization error Gen obtained by each method for the toy regression data set

  (n_tr, k)   IWCV             IWSIC            IWAIC            CV              OPT
  (150, 1)    °0.077 ± 0.020   °0.077 ± 0.023   °0.076 ± 0.019   0.356 ± 0.086   0.069 ± 0.011
  (100, 2)    °0.104 ± 0.113   °0.101 ± 0.103   0.113 ± 0.135    0.110 ± 0.135   0.072 ± 0.014

The best method and comparable ones by the t-test at the 5 percent significance level are indicated by °. For reference, the generalization error obtained with the optimal γ (i.e., the minimum generalization error) is described as OPT. n_tr is the number of training samples, and k is the order of the polynomial regression model. (n_tr, k) = (150, 1) and (100, 2) roughly correspond to "misspecified and large sample size" and "approximately correct and small sample size," respectively.
The good performance of IWCV is maintained thanks to its almost unbiasedness. Ordinary CV is not extremely poor, due to the fact that the model is almost realizable, but it is still inferior to IWCV and IWSIC. IWAIC tends to work rather poorly since n_tr = 100 is relatively small compared with the high complexity of the second-order model, and hence the asymptotic results are not valid (see sections 3.1 and 3.2.3).

The above simulation results illustrate that IWCV performs quite well in regression under covariate shift; its performance is comparable to that of IWSIC, which is a generalization error estimator specialized for linear regression with linear parameter learning.

3.4.2 Classification

Let us continue the toy classification simulation in section 2.4.2. IWSIC and IWAIC cannot be applied to classification problems because they do not accept the 0/1-loss. For this reason, we compare only IWCV and ordinary CV here.

In figure 2.9b in section 2.4.2, adaptive importance-weighted Fisher discriminant analysis (AIWFDA) with a middle/large flattening parameter γ appears to work well for that particular realization. Here, we investigate the choice of the flattening parameter value by IWCV and CV more extensively. Figure 3.6 depicts the means and standard deviations of the generalization error Gen (which corresponds to the misclassification rate) and its estimate by each method over 1000 runs, as functions of the flattening parameter γ in AIWFDA. The graphs clearly show that IWCV gives much better estimates of the generalization error than CV does.

Next we investigate the model selection performance: the flattening parameter γ is chosen from {0, 0.1, 0.2, ..., 1} so that the score of each model selection criterion is minimized. The mean and standard deviation of the generalization error Gen of the learned function obtained by each method over 1000 runs are described in table 3.2.
The table shows that IWCV gives significantly smaller test errors than CV does.

Table 3.2 The mean and standard deviation of the generalization error Gen (i.e., the misclassification rate) obtained by each method for the toy classification data set

  IWCV             CV              OPT
  °0.108 ± 0.027   0.131 ± 0.029   0.091 ± 0.009

The best method and comparable ones by the t-test at the 5 percent significance level are indicated by °. For reference, the generalization error obtained with the optimal γ (i.e., the minimum generalization error) is described as OPT.
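The classification score that IWCV minimizes here is simply the importance-weighted 0/1-loss on held-out samples. A minimal sketch of that quantity follows; the synthetic data, the fixed threshold classifier standing in for AIWFDA, and the closed-form weights are all illustrative assumptions:

```python
import numpy as np

# Importance-weighted hold-out estimate of the misclassification rate
# (the 0/1-loss analogue of the kIWCV score).
rng = np.random.default_rng(0)
x_val = rng.normal(0.0, 1.0, 300)            # held-out inputs drawn from p_tr = N(0, 1)
y_val = (x_val > 0.0).astype(int)            # true labels (toy decision rule)
y_pred = (x_val > 0.2).astype(int)           # predictions of some fixed classifier
w_val = np.exp(0.5 * x_val - 0.125)          # p_te/p_tr for p_te = N(0.5, 1), p_tr = N(0, 1)

zero_one = (y_val != y_pred).astype(float)   # 0/1-loss on each held-out sample
iw_error = np.mean(w_val * zero_one)         # importance-weighted error estimate
plain_error = np.mean(zero_one)              # unweighted estimate (biased under shift)
```

The weighted average up-weights mistakes made where the test density is high, which is exactly why IWCV tracks the true misclassification rate in figure 3.6 while ordinary CV does not.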
Figure 3.6 The generalization error Gen (i.e., the misclassification rate) and its estimates (Ĝen_IWCV and Ĝen_CV) as functions of the flattening parameter γ in adaptive importance-weighted Fisher discriminant analysis (AIWFDA) for the toy classification examples in figure 2.9. Dashed curves in the bottom two graphs depict the true generalization error in the top graph for clear comparison.

This simulation result illustrates that IWCV also is useful in classification under covariate shift.

3.5 Summary and Discussion

In this chapter, we have addressed the model selection problem under the covariate shift paradigm: training input points and test input points are drawn from different distributions (i.e., p_train(x) ≠ p_test(x)), but the functional relation remains unchanged (i.e., p_train(y|x) = p_test(y|x)). Under covariate shift, standard model selection schemes such as the Akaike information criterion (AIC), the subspace information criterion (SIC), and cross-validation (CV) are heavily biased and do not work as desired. On the other hand, their importance-weighted counterparts, IWAIC (section 3.1), IWSIC (section 3.2), and IWCV (section 3.3), have been shown to possess proper unbiasedness.
  • 80. 3.5. Summary and Discussion 71 Through simulations (section 3.4), the importance-weighted model selection criteria were shown to be useful for improving the generalization performance under covariate shift. Although the importance-weighted model selection criteria were shown to work well, they tend to have larger variance, and therefore the model selec­ tion performance can be unstable. Investigating the effect of large variance on model selection performance will be an important direction to pursue, such as following the line of [151] and [6]. In the experiments in section 3.4, we assumed that the importance weights are known. However, this may not be the case in practice.We will discuss how the importance weights can be accurately estimated from data in chapter 4.
4 Importance Estimation

In chapters 2 and 3, we have seen that the importance weight

    w(x) = p_te(x) / p_tr(x)

can be used for asymptotically canceling the bias caused by covariate shift. However, the importance weight is unknown in practice, and needs to be estimated from data. In this chapter, we give a comprehensive overview of importance estimation methods.

The setup in this chapter is that, in addition to the i.i.d. training input samples

    {x_i^tr}_{i=1}^{n_tr} ~ p_tr(x),

we are given i.i.d. test input samples

    {x_j^te}_{j=1}^{n_te} ~ p_te(x).

Although this setup is similar to semisupervised learning [30], our attention is directed to covariate shift adaptation. The goal of importance estimation is to estimate the importance function w(x) (or the importance values at the training input points, {w(x_i^tr)}_{i=1}^{n_tr}) from {x_i^tr}_{i=1}^{n_tr} and {x_j^te}_{j=1}^{n_te}.

4.1 Kernel Density Estimation

Kernel density estimation (KDE) is a nonparametric technique to estimate a probability density function p(x) from its i.i.d. samples {x_i}_{i=1}^n. For the
Gaussian kernel

    K_σ(x, x') = exp( −‖x − x'‖² / (2σ²) ),    (4.1)

KDE is expressed as

    p̂(x) = (1 / (n (2πσ²)^{d/2})) Σ_{i=1}^n K_σ(x, x_i),

where d is the dimensionality of x. The performance of KDE depends on the choice of the kernel width σ. It can be optimized by cross-validation (CV) as follows [69]. First, divide the samples {x_i}_{i=1}^n into k disjoint subsets {X_r}_{r=1}^k of (approximately) the same size. Then obtain a density estimate p̂_{X_r}(x) from {X_{r'}}_{r'≠r} (i.e., without X_r), and compute its log-likelihood for X_r:

    (1/|X_r|) Σ_{x∈X_r} log p̂_{X_r}(x),

where |X_r| denotes the number of elements in the set X_r. Repeat this procedure for r = 1, 2, ..., k, and choose the value of σ such that the average of the above holdout log-likelihood over all r is maximized. Note that the average holdout log-likelihood is an almost unbiased estimate of the Kullback-Leibler divergence from p(x) to p̂(x), up to some irrelevant constant.

KDE can be used for importance estimation by first obtaining density estimators p̂_tr(x) and p̂_te(x) separately from {x_i^tr}_{i=1}^{n_tr} and {x_j^te}_{j=1}^{n_te}, respectively, and then estimating the importance by

    ŵ(x) = p̂_te(x) / p̂_tr(x).

However, a potential limitation of this naive approach is that KDE suffers from the curse of dimensionality [193, 69]; that is, the number of samples needed to maintain the same approximation quality grows exponentially as the dimension of the input space increases. This is critical when the number of available samples is limited. Therefore, the KDE-based approach may not be reliable in high-dimensional problems.

In the following sections, we consider directly estimating the importance w(x) without going through density estimation of p_tr(x) and p_te(x). An
intuitive advantage of this direct estimation approach is that knowing the densities p_tr(x) and p_te(x) implies knowing the importance w(x), but not vice versa: the importance w(x) cannot be uniquely decomposed into p_tr(x) and p_te(x). Thus, estimating the importance w(x) could be substantially simpler than estimating the densities p_tr(x) and p_te(x).

The Russian mathematician Vladimir Vapnik, who developed one of the most successful classification algorithms, the support vector machine, advocated the following principle [193]: one should not solve more difficult intermediate problems when solving a target problem. The support vector machine follows this principle by directly learning the decision boundary that is sufficient for pattern recognition, instead of solving the more general, and thus more difficult, problem of estimating the data generation probability. The idea of direct importance estimation also follows Vapnik's principle, since one can avoid solving the substantially more difficult problem of estimating the densities p_tr(x) and p_te(x).

4.2 Kernel Mean Matching

Kernel mean matching (KMM) allows one to directly obtain an estimate of the importance values without going through density estimation [82]. The basic idea of KMM is to find w(x) such that the mean discrepancy between nonlinearly transformed samples drawn from p_tr(x) and p_te(x) is minimized in a universal reproducing kernel Hilbert space (RKHS) [149]. The Gaussian kernel (equation 4.1) is an example of kernels that induce a universal RKHS, and it has been shown that the solution of the following optimization problem agrees with the true importance values:

    min_w  ‖ E_{x^tr} [ w(x^tr) K_σ(x^tr, ·) ] − E_{x^te} [ K_σ(x^te, ·) ] ‖²_H    (4.2)

    subject to  E_{x^tr} [ w(x^tr) ] = 1  and  w(x) ≥ 0,

where ‖·‖_H denotes the norm in the Gaussian RKHS H, K_σ(x, x') is the Gaussian kernel (equation 4.1), and E_{x^tr} and E_{x^te} denote the expectations over x^tr and x^te drawn from p_tr(x) and p_te(x), respectively. Note that, for each fixed x, K_σ(x, ·)
is a function belonging to the RKHS H.
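In practice, this matching problem is solved through its finite-sample quadratic program (derived next): minimize ½ w⊤Kw − κ⊤w over bounded weights whose mean stays near 1, with K built on the training inputs and κ from the test inputs. The following sketch solves that QP with a generic SLSQP solver; the solver choice and all tuning values (σ, B, ε) are illustrative, not those of [82]:

```python
import numpy as np
from scipy.optimize import minimize

def gauss_kernel(a, b, sigma):
    d2 = (a[:, None] - b[None, :]) ** 2          # 1-D inputs for simplicity
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kmm_weights(x_tr, x_te, sigma=0.5, B=10.0, eps=0.1):
    """Sketch of kernel mean matching:
    min 0.5 w'Kw - kappa'w  s.t.  |mean(w) - 1| <= eps,  0 <= w_i <= B."""
    n_tr, n_te = len(x_tr), len(x_te)
    K = gauss_kernel(x_tr, x_tr, sigma)
    kappa = (n_tr / n_te) * gauss_kernel(x_tr, x_te, sigma).sum(axis=1)
    obj = lambda w: 0.5 * w @ K @ w - kappa @ w
    grad = lambda w: K @ w - kappa
    # |mean(w) - 1| <= eps, written as two smooth inequality constraints
    cons = [{"type": "ineq", "fun": lambda w: eps - (w.mean() - 1.0)},
            {"type": "ineq", "fun": lambda w: eps + (w.mean() - 1.0)}]
    res = minimize(obj, np.ones(n_tr), jac=grad, method="SLSQP",
                   bounds=[(0.0, B)] * n_tr, constraints=cons)
    return res.x                                  # importance estimates at x_tr

rng = np.random.default_rng(0)
x_tr = rng.normal(0.0, 1.0, 100)                  # p_tr = N(0, 1)
x_te = rng.normal(0.5, 1.0, 100)                  # p_te = N(0.5, 1)
w_hat = kmm_weights(x_tr, x_te)
```

Note that KMM returns importance values only at the training points, which is precisely why tuning σ, B, and ε by ordinary CV is problematic, as discussed next.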
An empirical version of the above problem is reduced to the following quadratic program (QP):

    min_{{w_i}}  (1/2) Σ_{i,i'=1}^{n_tr} w_i w_{i'} K_σ(x_i^tr, x_{i'}^tr) − Σ_{i=1}^{n_tr} w_i κ_i

    subject to  | (1/n_tr) Σ_{i=1}^{n_tr} w_i − 1 | ≤ ε  and  0 ≤ w₁, w₂, ..., w_{n_tr} ≤ B,

where

    κ_i := (n_tr / n_te) Σ_{j=1}^{n_te} K_σ(x_i^tr, x_j^te).

B (≥ 0) and ε (≥ 0) are tuning parameters that control the regularization effects. The solution {ŵ_i}_{i=1}^{n_tr} is an estimate of the importance at {x_i^tr}_{i=1}^{n_tr}.

Since KMM does not involve density estimation, it is expected to work well even in high-dimensional cases. However, its performance depends on the choice of the tuning parameters B, ε, and σ, and they cannot simply be optimized by, for instance, CV, since estimates of the importance are available only at {x_i^tr}_{i=1}^{n_tr}. As shown in [90], an inductive variant of KMM (i.e., one in which the entire importance function is estimated) exists. This allows one to optimize B and ε by CV over the objective function (equation 4.2). However, the Gaussian kernel width σ may not be appropriately determined by CV, since changing the value of σ means changing the RKHS; the objective function (equation 4.2) is defined using the RKHS norm, and thus the objective values for different norms are not comparable.

A popular heuristic is to set the Gaussian width σ to the median distance between samples [141, 147]. However, there seems to be no strong justification for this heuristic. For the choice of ε, a theoretical result given in [82] can be used as guidance. However, it is still hard to determine the best value of ε in practice.

4.3 Logistic Regression

Another approach to directly estimating the importance is to use a probabilistic classifier. Let us assign a selector variable η = −1 to samples drawn from p_tr(x) and η = 1 to samples drawn from p_te(x). That is, the two densities are written as

    p_tr(x) = p(x | η = −1),
    p_te(x) = p(x | η = 1).
Note that η is regarded as a random variable. An application of Bayes' theorem shows that the importance can be expressed in terms of η as follows [128, 32, 17]:

    w(x) = ( p(η = −1) / p(η = 1) ) ( p(η = 1 | x) / p(η = −1 | x) ).

The ratio p(η = −1)/p(η = 1) may be easily estimated by the ratio of the numbers of samples:

    p(η = −1) / p(η = 1) ≈ n_tr / n_te.

The conditional probability p(η|x) can be approximated by discriminating {x_i^tr}_{i=1}^{n_tr} from {x_j^te}_{j=1}^{n_te} using a logistic regression (LR) classifier, where η plays the role of a class variable. Below we briefly explain the LR method.

The LR classifier employs a parametric model of the following form for expressing the conditional probability p(η|x):

    p̂(η|x) = 1 / ( 1 + exp( −η Σ_{ℓ=1}^m β_ℓ φ_ℓ(x) ) ),

where m is the number of basis functions and {φ_ℓ(x)}_{ℓ=1}^m are fixed basis functions. The parameter β is learned so that the negative regularized log-likelihood is minimized:

    β̂ = argmin_β [ Σ_{i=1}^{n_tr} log( 1 + exp( Σ_{ℓ=1}^m β_ℓ φ_ℓ(x_i^tr) ) )
                  + Σ_{j=1}^{n_te} log( 1 + exp( −Σ_{ℓ=1}^m β_ℓ φ_ℓ(x_j^te) ) ) + λ β⊤β ].

Since the above objective function is convex, the global optimal solution can be obtained by standard nonlinear optimization methods such as the gradient method and the (quasi-)Newton method [74, 117]. Then the importance estimator is given by

    ŵ(x) = (n_tr / n_te) exp( Σ_{ℓ=1}^m β̂_ℓ φ_ℓ(x) ).    (4.3)

This model is often called the log-linear model.
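This classifier-based estimator is easy to try with an off-the-shelf LR implementation. A minimal sketch follows, using scikit-learn with linear basis functions and Gaussian toy data; the regularization strength C and the toy densities are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lr_importance(x_tr, x_te, C=1.0):
    """Estimate w(x_i^tr) = (n_tr/n_te) * p(eta=+1|x) / p(eta=-1|x)
    by discriminating training (eta=-1) from test (eta=+1) inputs."""
    X = np.vstack([x_tr, x_te])
    eta = np.concatenate([-np.ones(len(x_tr)), np.ones(len(x_te))])
    clf = LogisticRegression(C=C).fit(X, eta)
    p = clf.predict_proba(x_tr)          # columns follow clf.classes_ = [-1, 1]
    return (len(x_tr) / len(x_te)) * p[:, 1] / p[:, 0]

# toy check against the analytic ratio for p_tr = N(0, 1), p_te = N(1, 1)
rng = np.random.default_rng(0)
x_tr = rng.normal(0.0, 1.0, (500, 1))
x_te = rng.normal(1.0, 1.0, (500, 1))
w_hat = lr_importance(x_tr, x_te)
w_true = np.exp(x_tr[:, 0] - 0.5)        # true p_te(x)/p_tr(x) = exp(x - 1/2)
```

Because the learning problem is an ordinary binary classification, the basis functions and λ (here, C) can be tuned by standard CV, as noted next.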
An advantage of the LR method is that model selection (i.e., the choice of the basis functions {φ_ℓ(x)}_{ℓ=1}^m as well as the regularization parameter λ) is possible by standard CV, since the learning problem involved above is a standard supervised classification problem.

When multiclass LR classifiers are used, importance values among multiple densities can be estimated simultaneously [16]. However, training LR classifiers is rather time-consuming. A computationally efficient alternative to LR, called the least-squares probabilistic classifier (LSPC), has been proposed [155]. LSPC would be useful in large-scale density-ratio estimation.

4.4 Kullback-Leibler Importance Estimation Procedure

The Kullback-Leibler importance estimation procedure (KLIEP) [164] directly gives an estimate of the importance function without going through density estimation, by matching the two distributions in terms of the Kullback-Leibler divergence [97].

4.4.1 Algorithm

Let us model the importance weight w(x) by means of the following linear-in-parameter model (see section 1.3.5.2):

  ŵ(x) = Σ_{ℓ=1}^t α_ℓ ψ_ℓ(x),   (4.4)

where t is the number of parameters, α = (α_1, α_2, ..., α_t)⊤ are parameters to be learned from data samples, ⊤ denotes the transpose, and {ψ_ℓ(x)}_{ℓ=1}^t are basis functions such that ψ_ℓ(x) ≥ 0 for all x in the domain and ℓ = 1, 2, ..., t. Note that t and {ψ_ℓ(x)}_{ℓ=1}^t can depend on the samples {x_i^tr}_{i=1}^{ntr} and {x_j^te}_{j=1}^{nte}, so kernel models are also allowed. Later, we explain how the basis functions {ψ_ℓ(x)}_{ℓ=1}^t are designed in practice. An estimate of the density p_te(x) is given by using the model ŵ(x) as
  p̂_te(x) = ŵ(x) p_tr(x).

In KLIEP, the parameters α are determined so that the Kullback-Leibler divergence from p_te(x) to p̂_te(x) is minimized:

  KL[p_te ‖ p̂_te] = E_{x^te}[ log( p_te(x^te) / (ŵ(x^te) p_tr(x^te)) ) ]
          = E_{x^te}[ log( p_te(x^te) / p_tr(x^te) ) ] − E_{x^te}[ log ŵ(x^te) ],

where E_{x^te} denotes the expectation over x^te drawn from p_te(x). The first term is a constant, so it can safely be ignored. We define the negative of the second term by KL′:

  KL′ := E_{x^te}[ log ŵ(x^te) ].   (4.5)

Since p̂_te(x) (= ŵ(x) p_tr(x)) is a probability density function, it should satisfy

  1 = ∫ p̂_te(x) dx = ∫ ŵ(x) p_tr(x) dx = E_{x^tr}[ ŵ(x^tr) ].   (4.6)

Consequently, the KLIEP optimization problem is given by replacing the expectations in equations 4.5 and 4.6 with empirical averages:

  max_α [ (1/nte) Σ_{j=1}^{nte} log( Σ_{ℓ=1}^t α_ℓ ψ_ℓ(x_j^te) ) ]
  subject to (1/ntr) Σ_{i=1}^{ntr} Σ_{ℓ=1}^t α_ℓ ψ_ℓ(x_i^tr) = 1 and α_1, α_2, ..., α_t ≥ 0.

This is a convex optimization problem, and the global solution, which tends to be sparse [27], can be obtained, for example, simply by performing gradient ascent and feasibility satisfaction iteratively. A pseudo code is summarized in figure 4.1. Properties of KLIEP-type algorithms are theoretically investigated in [128, 32, 171, 120]. In particular, the following facts are known regarding the convergence properties:

• When a fixed set of basis functions (i.e., a parametric model) is used for importance estimation, KLIEP converges to the optimal parameter in the model with convergence rate O_p(n^{−1/2}) under n = ntr = nte, where O_p denotes the
asymptotic order in probability. This is the optimal convergence rate in the parametric setup. Furthermore, KLIEP has asymptotic normality around the optimal solution.

• When a nonparametric model (e.g., kernel basis functions centered at the test samples; see section 4.4.3) is used for importance estimation, KLIEP converges to the optimal solution with a convergence rate slightly slower than O_p(n^{−1/2}). This is the optimal convergence rate in the minimax sense.

Note that the importance model of KLIEP is the linear-in-parameter model (equation 4.4), while that of LR is the log-linear model (equation 4.3). A variant of KLIEP for log-linear models has been studied in [185, 120], which is computationally more efficient when the number of test samples is large. The KLIEP idea can also be applied to Gaussian mixture models [202] and probabilistic principal-component-analyzer mixture models [205].

Input: m = {ψ_ℓ(x)}_{ℓ=1}^t, {x_i^tr}_{i=1}^{ntr}, and {x_j^te}_{j=1}^{nte}
Output: ŵ(x)

  A_{j,ℓ} ← ψ_ℓ(x_j^te) for j = 1, 2, ..., nte and ℓ = 1, 2, ..., t;
  b_ℓ ← (1/ntr) Σ_{i=1}^{ntr} ψ_ℓ(x_i^tr) for ℓ = 1, 2, ..., t;
  Initialize α (> 0_t) and ε (0 < ε ≪ 1);
  Repeat until convergence
    α ← α + ε A⊤(1_nte ./ (Aα));   % gradient ascent
    α ← α + (1 − b⊤α) b / (b⊤b);   % constraint satisfaction
    α ← max(0_t, α);               % constraint satisfaction
    α ← α / (b⊤α);                 % constraint satisfaction
  end
  ŵ(x) ← Σ_{ℓ=1}^t α_ℓ ψ_ℓ(x);

Figure 4.1 Pseudo code of KLIEP. 0_t denotes the t-dimensional vector with all zeros, and 1_nte denotes the nte-dimensional vector with all ones. ./ indicates elementwise division, and ⊤ denotes the transpose. Inequalities and the "max" operation for vectors are applied elementwise.

4.4.2 Model Selection by Cross-Validation

The performance of KLIEP depends on the choice of the basis functions {ψ_ℓ(x)}_{ℓ=1}^t. Here we explain how they can be appropriately chosen from data samples.
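The iteration of figure 4.1 can be sketched in NumPy as follows; the step size and iteration count are illustrative choices, not the book's.

```python
import numpy as np

def kliep(A, b, step=1e-4, n_iter=5000):
    """KLIEP by gradient ascent with feasibility satisfaction (figure 4.1).
    A[j, l] = psi_l(x_j^te);  b[l] = (1/n_tr) * sum_i psi_l(x_i^tr)."""
    alpha = np.ones(A.shape[1])
    alpha /= b @ alpha
    for _ in range(n_iter):
        alpha = alpha + step * A.T @ (1.0 / (A @ alpha))  # gradient ascent
        alpha = alpha + (1.0 - b @ alpha) * b / (b @ b)   # equality constraint
        alpha = np.maximum(0.0, alpha)                    # nonnegativity
        alpha = alpha / (b @ alpha)                       # renormalize
    return alpha
```

With Gaussian basis functions centered at test points (section 4.4.3), `A` and `b` are simply kernel matrices evaluated at the test and training inputs, respectively.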
Since KLIEP is based on the maximization of KL′ (see equation 4.5), it would be natural to select the model such that KL′ is maximized. The expectation over p_te(x) involved in KL′ can be numerically approximated by cross-validation (CV) as follows. First, divide the test samples {x_j^te}_{j=1}^{nte} into
k disjoint subsets {X_r^te}_{r=1}^k of (approximately) the same size. Then obtain an importance estimate ŵ_{X_r^te}(x) from {X_{r′}^te}_{r′≠r} (i.e., without X_r^te), and approximate KL′ using X_r^te as

  KL′_r := (1/|X_r^te|) Σ_{x ∈ X_r^te} log ŵ_{X_r^te}(x).

This procedure is repeated for r = 1, 2, ..., k, and the average is used as an estimate of KL′:

  KL′_CV := (1/k) Σ_{r=1}^k KL′_r.   (4.7)

For model selection, we compute KL′_CV for all model candidates (the basis functions {ψ_ℓ(x)}_{ℓ=1}^t in the current setting), and choose the one that maximizes KL′_CV. A pseudo code of the CV procedure is summarized in figure 4.2.

One of the potential limitations of CV in general is that it is not reliable for small samples, since data splitting by CV further reduces the sample size. On the other hand, in our CV procedure the data splitting is performed only over the test input samples {x_j^te}_{j=1}^{nte}, not over the training samples. Therefore, even when the number of training samples is small, our CV procedure does not suffer from the small sample problem as long as a large number of test input samples are available.

Input: M = {m | m = {ψ_ℓ(x)}_{ℓ=1}^t}, {x_i^tr}_{i=1}^{ntr}, and {x_j^te}_{j=1}^{nte}
Output: ŵ(x)

  Split {x_j^te}_{j=1}^{nte} into k disjoint subsets {X_r^te}_{r=1}^k;
  for each model m ∈ M
    for each split r = 1, 2, ..., k
      ŵ_{X_r^te}(x) ← KLIEP(m, {x_i^tr}_{i=1}^{ntr}, {X_{r′}^te}_{r′≠r});
      KL′_r(m) ← (1/|X_r^te|) Σ_{x ∈ X_r^te} log ŵ_{X_r^te}(x);
    end
    KL′(m) ← (1/k) Σ_{r=1}^k KL′_r(m);
  end
  m̂ ← argmax_{m ∈ M} KL′(m);
  ŵ(x) ← KLIEP(m̂, {x_i^tr}_{i=1}^{ntr}, {x_j^te}_{j=1}^{nte});

Figure 4.2 Pseudo code of CV-based model selection for KLIEP.
4.4.3 Basis Function Design

A good model may be chosen by the above CV procedure, given that a set of promising model candidates is prepared. As model candidates we use a Gaussian kernel model centered at the test input points {x_j^te}_{j=1}^{nte}. That is,

  ŵ(x) = Σ_{ℓ=1}^{nte} α_ℓ K_σ(x, x_ℓ^te),

where K_σ(x, x′) is the Gaussian kernel with width σ:

  K_σ(x, x′) := exp( −‖x − x′‖² / (2σ²) ).

Our reason for choosing the test input points {x_j^te}_{j=1}^{nte} as the Gaussian centers, not the training input points {x_i^tr}_{i=1}^{ntr}, is as follows. By definition, the importance w(x) tends to take large values if the training input density p_tr(x) is small and the test input density p_te(x) is large; conversely, w(x) tends to be small (i.e., close to zero) if p_tr(x) is large and p_te(x) is small. When a nonnegative function is approximated by a Gaussian kernel model, many kernels may be needed in the region where the output of the target function is large; on the other hand, only a small number of kernels will be enough in the region where the output of the target function is close to zero (see figure 4.3). Following this heuristic, we decided to allocate many kernels at high test input density regions, which can be achieved by setting the Gaussian centers at the test input points {x_j^te}_{j=1}^{nte}.

Alternatively, we may locate (ntr + nte) Gaussian kernels at both {x_i^tr}_{i=1}^{ntr} and {x_j^te}_{j=1}^{nte}. However, this seems not to further improve the performance, but slightly increases the computational cost. When nte is very large, just using all the test input points {x_j^te}_{j=1}^{nte} as Gaussian centers is already computationally

Figure 4.3 Heuristic of Gaussian center allocation.
Many kernels may be needed in the region where the output of the target function is large, and only a small number of kernels will be enough in the region where the output of the target function is close to zero.
rather demanding. To ease this problem, a subset of {x_j^te}_{j=1}^{nte} may in practice be used as Gaussian centers for computational efficiency. That is,

  ŵ(x) = Σ_{ℓ=1}^t α_ℓ K_σ(x, c_ℓ),   (4.8)

where c_ℓ is a template point randomly chosen from {x_j^te}_{j=1}^{nte} and t (≤ nte) is a prefixed number. A MATLAB® implementation of the entire KLIEP algorithm is available from
https://sugiyama-www.cs.titech.ac.jp/~sugi/software/KLIEP/.

4.5 Least-Squares Importance Fitting

KLIEP employed the Kullback-Leibler divergence for measuring the discrepancy between two densities. Least-squares importance fitting (LSIF) [88] uses the squared loss for importance function fitting.

4.5.1 Algorithm

The importance w(x) is again modeled by the linear-in-parameter model (equation 4.4). The parameters {α_ℓ}_{ℓ=1}^t in the model ŵ(x) are determined so that the following squared error J is minimized:

  J(α) := (1/2) E_{x^tr}[ (ŵ(x^tr) − w(x^tr))² ]
      = (1/2) E_{x^tr}[ ŵ²(x^tr) ] − E_{x^te}[ ŵ(x^te) ] + (1/2) E_{x^tr}[ w²(x^tr) ],

where the last term is a constant and therefore can be safely ignored; in the middle term, the identity E_{x^tr}[ŵ(x^tr) w(x^tr)] = E_{x^te}[ŵ(x^te)] is used. Let us denote the first two terms by J′:

  J′(α) := J(α) − (1/2) E_{x^tr}[ w²(x^tr) ]
       = (1/2) E_{x^tr}[ ŵ²(x^tr) ] − E_{x^te}[ ŵ(x^te) ].

Approximating the expectations in J′ by empirical averages, we obtain

  Ĵ′(α) := (1/(2ntr)) Σ_{i=1}^{ntr} ŵ²(x_i^tr) − (1/nte) Σ_{j=1}^{nte} ŵ(x_j^te)
       = (1/2) α⊤Ĥα − ĥ⊤α,
where Ĥ is the t × t matrix with the (ℓ, ℓ′)-th element

  Ĥ_{ℓ,ℓ′} := (1/ntr) Σ_{i=1}^{ntr} ψ_ℓ(x_i^tr) ψ_{ℓ′}(x_i^tr),   (4.9)

and ĥ is the t-dimensional vector with the ℓ-th element

  ĥ_ℓ := (1/nte) Σ_{j=1}^{nte} ψ_ℓ(x_j^te).   (4.10)

Taking into account the nonnegativity of the importance function w(x), the optimization problem is formulated as follows:

  min_α [ (1/2) α⊤Ĥα − ĥ⊤α + λ 1_t⊤α ] subject to α ≥ 0_t,   (4.11)

where 1_t and 0_t are the t-dimensional vectors with all ones and all zeros, respectively. The vector inequality α ≥ 0_t is applied in the elementwise manner, that is, α_ℓ ≥ 0 for ℓ = 1, 2, ..., t. In equation 4.11, a penalty term λ 1_t⊤α is included for regularization purposes, where λ (≥ 0) is a regularization parameter. Equation 4.11 is a convex quadratic programming problem, and therefore the unique global optimal solution can be computed efficiently by a standard optimization package.

4.5.2 Basis Function Design and Model Selection

Basis functions may be designed in the same way as for KLIEP, that is, Gaussian basis functions centered at (a subset of) the test input points {x_j^te}_{j=1}^{nte} (see section 4.4).

Model selection of the Gaussian width σ and the regularization parameter λ is possible by CV: First, {x_i^tr}_{i=1}^{ntr} and {x_j^te}_{j=1}^{nte} are divided into k disjoint subsets
{X_r^tr}_{r=1}^k and {X_r^te}_{r=1}^k, respectively. Then an importance estimate ŵ_{X_r^tr, X_r^te}(x) is obtained using {X_{r′}^tr}_{r′≠r} and {X_{r′}^te}_{r′≠r} (i.e., without X_r^tr and X_r^te), and the cost J′ is approximated using the holdout samples X_r^tr and X_r^te as

  Ĵ′_r := (1/(2|X_r^tr|)) Σ_{x ∈ X_r^tr} ŵ_{X_r^tr, X_r^te}(x)² − (1/|X_r^te|) Σ_{x ∈ X_r^te} ŵ_{X_r^tr, X_r^te}(x).

This procedure is repeated for r = 1, 2, ..., k, and the average Ĵ′ is used as an estimate of J′:

  Ĵ′ := (1/k) Σ_{r=1}^k Ĵ′_r.

For LSIF, an information criterion has also been derived [88], which is an asymptotically unbiased estimator of the error criterion J′.

4.5.3 Regularization Path Tracking

The LSIF solution α̂ is shown to be piecewise linear with respect to the regularization parameter λ (see figure 4.4). Therefore, the regularization path (i.e., the solutions for all λ) can be computed efficiently based on the parametric optimization technique [14, 44, 70].

A basic idea of regularization path tracking is to check the violation of the Karush-Kuhn-Tucker (KKT) conditions [27], which are necessary and sufficient conditions for optimality of convex programs, when the regularization parameter λ is changed. A pseudo code of the regularization path tracking algorithm for LSIF is described in figure 4.5.

Figure 4.4 Regularization path tracking of LSIF. The solution α̂(λ) is shown to be piecewise linear in the parameter space as a function of λ. Starting from λ = ∞, the trajectory of the solution is traced as λ is decreased to zero. When λ ≥ λ_0 for some λ_0 ≥ 0, the solution stays at the origin 0_t. When λ gets smaller than λ_0, the solution departs from the origin. As λ is further decreased, for some λ_1 such that 0 ≤ λ_1 ≤ λ_0, the solution goes straight to α̂(λ_1) with a constant "speed." Then the solution path changes direction and, for some λ_2 such that 0 ≤ λ_2 ≤ λ_1, the solution heads straight for α̂(λ_2) with a constant speed as λ is further decreased. This process is repeated until λ reaches zero.
Input: Ĥ and ĥ   % see equations 4.9 and 4.10 for the definitions
Output: entire regularization path α̂(λ) for λ ≥ 0

  τ ← 0;
  k ← argmax_i { ĥ_i | i = 1, 2, ..., t };
  λ_τ ← ĥ_k;
  A ← {1, 2, ..., t}\{k};
  α̂(λ_τ) ← 0_t;   % vector with all zeros
  While λ_τ > 0
    Ê ← O_{|A|×t};   % matrix with all zeros
    For i = 1, 2, ..., |A|
      Ê_{i,j_i} ← 1;   % A = {j_1, j_2, ..., j_|A| | j_1 < j_2 < ... < j_|A|}
    end
    Ĝ ← [ Ĥ  −Ê⊤ ; −Ê  O_{|A|×|A|} ];
    û ← Ĝ^{−1} (ĥ ; 0_|A|);
    v̂ ← Ĝ^{−1} (1_t ; 0_|A|);
    If v̂ ≤ 0_{t+|A|}   % final interval
      λ_{τ+1} ← 0;
      α̂(λ_{τ+1}) ← (û_1, û_2, ..., û_t)⊤;
    else   % an intermediate interval
      k ← argmax_i { û_i/v̂_i | v̂_i > 0, i = 1, 2, ..., t + |A| };
      λ_{τ+1} ← max{0, û_k/v̂_k};
      α̂(λ_{τ+1}) ← (û_1, û_2, ..., û_t)⊤ − λ_{τ+1} (v̂_1, v̂_2, ..., v̂_t)⊤;
      If 1 ≤ k ≤ t
        A ← A ∪ {k};
      else
        A ← A\{j_{k−t}};
      end
    end
    τ ← τ + 1;
  end

Figure 4.5 Pseudo code for computing the entire regularization path of LSIF. The computation of Ĝ^{−1} is sometimes unstable. For stabilization purposes, small positive diagonals may be added to Ĥ.
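Alternatively, the constrained problem (equation 4.11) can be handed directly to a general-purpose solver, without path tracking. The following sketch uses SciPy's bounded L-BFGS-B in place of a standard QP package; the function name and the default for λ are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def lsif(phi_tr, phi_te, lam=0.01):
    """LSIF (equation 4.11): minimize (1/2) a^T H a - h^T a + lam * 1^T a
    subject to a >= 0, with H and h as in equations 4.9 and 4.10.
    phi_tr: (n_tr, t) basis values at training points; phi_te: (n_te, t)."""
    H = phi_tr.T @ phi_tr / phi_tr.shape[0]  # (4.9)
    h = phi_te.mean(axis=0)                  # (4.10)
    t = len(h)
    res = minimize(lambda a: 0.5 * a @ H @ a - h @ a + lam * a.sum(),
                   np.ones(t),
                   jac=lambda a: H @ a - h + lam,
                   bounds=[(0.0, None)] * t)   # nonnegativity as box bounds
    return res.x
```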
The pseudo code shows that a quadratic programming solver is no longer needed for obtaining the LSIF solution; just computing matrix inverses is enough. This contributes highly to saving computation time. Furthermore, the regularization path algorithm is computationally very efficient when the solution is sparse (that is, most of the elements are zero), since the number of change points tends to be small for sparse solutions. An R implementation of the entire LSIF algorithm is available from
https://www.math.cm.is.nagoya-u.ac.jp/~kanamori/software/LSIF/.

4.6 Unconstrained Least-Squares Importance Fitting

LSIF combined with regularization path tracking is computationally very efficient. However, it sometimes suffers from a numerical problem, and therefore is not reliable in practice. To cope with this problem, an approximation method called unconstrained LSIF (uLSIF) has been introduced [88].

4.6.1 Algorithm

The approximation idea is very simple: the nonnegativity constraint in the optimization problem (equation 4.11) is dropped. This results in the following unconstrained optimization problem:

  min_β [ (1/2) β⊤Ĥβ − ĥ⊤β + (λ/2) β⊤β ].   (4.12)

In the above, a quadratic regularization term λβ⊤β/2 is included instead of the linear one λ1_t⊤α, since the linear penalty term does not work as a regularizer without the nonnegativity constraint. Equation 4.12 is an unconstrained convex quadratic program, and the solution can be analytically computed as

  β̃ = (Ĥ + λI_t)^{−1} ĥ,

where I_t is the t-dimensional identity matrix. Since the nonnegativity constraint β ≥ 0_t is dropped, some of the learned parameters can be negative. To compensate for this approximation error, the solution is modified as

  β̂ = max(0_t, β̃),   (4.13)
where the "max" operation for a pair of vectors is applied in the elementwise manner. The error caused by ignoring the nonnegativity constraint and the above rounding-up operation is theoretically investigated in [88]. An advantage of the above unconstrained formulation is that the solution can be computed just by solving a system of linear equations. Therefore, the computation is fast and stable. In addition, uLSIF has been shown to be superior in terms of condition numbers [90].

4.6.2 Analytic Computation of Leave-One-Out Cross-Validation

Another notable advantage of uLSIF is that the score of leave-one-out CV (LOOCV) can be computed analytically. Thanks to this property, the computational complexity of performing LOOCV is of the same order as that of computing a single solution, as explained below.

In the current setting, two sets of samples {x_i^tr}_{i=1}^{ntr} and {x_j^te}_{j=1}^{nte} are given, which generally are of different size. To explain the idea in a simple manner, we assume that ntr < nte, and that x_i^tr and x_i^te (i = 1, 2, ..., ntr) are held out at the same time; {x_j^te}_{j=ntr+1}^{nte} are always used for importance estimation. Let ŵ_i(x) be an estimate of the importance function obtained without x_i^tr and x_i^te. Then the LOOCV score is expressed as

  LOOCV := (1/ntr) Σ_{i=1}^{ntr} [ (1/2) (ŵ_i(x_i^tr))² − ŵ_i(x_i^te) ].   (4.14)

Our approach to efficiently computing the LOOCV score is to use the Sherman-Woodbury-Morrison formula [64] for computing matrix inverses. For an invertible square matrix A and vectors ξ and η such that η⊤A^{−1}ξ ≠ −1, the Sherman-Woodbury-Morrison formula states that

  (A + ξη⊤)^{−1} = A^{−1} − (A^{−1}ξη⊤A^{−1}) / (1 + η⊤A^{−1}ξ).

A pseudo code of uLSIF with LOOCV-based model selection is summarized in figure 4.6. MATLAB® and R implementations of the entire uLSIF algorithm are available from
https://sugiyama-www.cs.titech.ac.jp/~sugi/software/uLSIF/ and
https://www.math.cm.is.nagoya-u.ac.jp/~kanamori/software/LSIF/.
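The analytic uLSIF solution (equations 4.12 and 4.13) amounts to one ridge-type linear system followed by rounding; a minimal sketch:

```python
import numpy as np

def ulsif(phi_tr, phi_te, lam=0.1):
    """uLSIF: solve (H + lam * I) beta = h, then round negative
    coefficients up to zero (equation 4.13).
    phi_tr: (n_tr, t) basis values at training points; phi_te: (n_te, t)."""
    H = phi_tr.T @ phi_tr / phi_tr.shape[0]
    h = phi_te.mean(axis=0)
    beta = np.linalg.solve(H + lam * np.eye(len(h)), h)  # analytic solution
    return np.maximum(0.0, beta)                          # rounding-up step
```

The absence of any iterative optimization is what makes the analytic LOOCV of section 4.6.2 possible.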
4.7 Numerical Examples

In this section, we illustrate the behavior of the KLIEP method and how it can be applied to covariate shift adaptation.
Input: {x_i^tr}_{i=1}^{ntr} and {x_j^te}_{j=1}^{nte}
Output: ŵ(x)

  t ← min(100, nte); n ← min(ntr, nte);
  Randomly choose t centers {c_ℓ}_{ℓ=1}^t from {x_j^te}_{j=1}^{nte} without replacement;
  For each candidate of Gaussian width σ
    Ĥ_{ℓ,ℓ′} ← (1/ntr) Σ_{i=1}^{ntr} exp( −(‖x_i^tr − c_ℓ‖² + ‖x_i^tr − c_{ℓ′}‖²) / (2σ²) ) for ℓ, ℓ′ = 1, 2, ..., t;
    ĥ_ℓ ← (1/nte) Σ_{j=1}^{nte} exp( −‖x_j^te − c_ℓ‖² / (2σ²) ) for ℓ = 1, 2, ..., t;
    X^tr_{ℓ,i} ← exp( −‖x_i^tr − c_ℓ‖² / (2σ²) ) for i = 1, 2, ..., n and ℓ = 1, 2, ..., t;
    X^te_{ℓ,i} ← exp( −‖x_i^te − c_ℓ‖² / (2σ²) ) for i = 1, 2, ..., n and ℓ = 1, 2, ..., t;
    For each candidate of regularization parameter λ
      B̂ ← Ĥ + (λ(ntr − 1)/ntr) I_t;
      B_0 ← B̂^{−1} ĥ 1_n⊤ + B̂^{−1} X^tr diag( (ĥ⊤B̂^{−1}X^tr) ./ (ntr 1_n⊤ − 1_t⊤(X^tr ∗ B̂^{−1}X^tr)) );
      B_1 ← B̂^{−1} X^te + B̂^{−1} X^tr diag( (1_t⊤(X^te ∗ B̂^{−1}X^tr)) ./ (ntr 1_n⊤ − 1_t⊤(X^tr ∗ B̂^{−1}X^tr)) );
      B_2 ← max( O_{t×n}, ((ntr − 1)/(ntr(nte − 1))) (nte B_0 − B_1) );
      LOOCV(σ, λ) ← ‖(X^tr ∗ B_2)⊤1_t‖² / (2n) − 1_n⊤(X^te ∗ B_2)⊤1_t / n;
    end
  end
  (σ̂, λ̂) ← argmin_{(σ,λ)} LOOCV(σ, λ);
  Ĥ_{ℓ,ℓ′} ← (1/ntr) Σ_{i=1}^{ntr} exp( −(‖x_i^tr − c_ℓ‖² + ‖x_i^tr − c_{ℓ′}‖²) / (2σ̂²) ) for ℓ, ℓ′ = 1, 2, ..., t;
  ĥ_ℓ ← (1/nte) Σ_{j=1}^{nte} exp( −‖x_j^te − c_ℓ‖² / (2σ̂²) ) for ℓ = 1, 2, ..., t;
  α̂ ← max( 0_t, (Ĥ + λ̂ I_t)^{−1} ĥ );
  ŵ(x) ← Σ_{ℓ=1}^t α̂_ℓ exp( −‖x − c_ℓ‖² / (2σ̂²) );

Figure 4.6 Pseudo code of uLSIF with LOOCV. B ∗ B′ denotes the elementwise multiplication of matrices B and B′ of the same size. For n-dimensional vectors b and b′, diag(b ./ b′) denotes the n × n diagonal matrix whose i-th diagonal element is b_i/b′_i.
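Figure 4.6 computes the LOOCV score in closed form. For illustration only, the sketch below selects (σ, λ) by a plain k-fold estimate of the criterion J′ = (1/2)E_tr[ŵ²] − E_te[ŵ] instead of the analytic LOOCV; all function names and grid values are illustrative, not the book's.

```python
import numpy as np

def ulsif_fit(x_tr, x_te, centers, sigma, lam):
    """Fit uLSIF with Gaussian kernels; return the estimated importance
    function as a callable."""
    phi = lambda x: np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))
    H = phi(x_tr).T @ phi(x_tr) / len(x_tr)
    h = phi(x_te).mean(axis=0)
    a = np.maximum(0.0, np.linalg.solve(H + lam * np.eye(len(centers)), h))
    return lambda x: phi(x) @ a

def select_by_cv(x_tr, x_te, sigmas, lams, k=5):
    """Grid search over (sigma, lambda), scoring each pair by a k-fold
    holdout estimate of J' (a plain stand-in for figure 4.6's LOOCV)."""
    centers = x_te[:min(100, len(x_te))]
    best, best_score = None, np.inf
    for sigma in sigmas:
        for lam in lams:
            scores = []
            for r in range(k):
                tr_ho, te_ho = x_tr[r::k], x_te[r::k]          # holdout folds
                tr_in = np.delete(x_tr, np.s_[r::k])
                te_in = np.delete(x_te, np.s_[r::k])
                w = ulsif_fit(tr_in, te_in, centers, sigma, lam)
                scores.append(0.5 * np.mean(w(tr_ho) ** 2) - np.mean(w(te_ho)))
            if np.mean(scores) < best_score:
                best_score, best = np.mean(scores), (sigma, lam)
    return best
```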
4.7.1 Setting

Let us consider a one-dimensional toy regression problem of learning the following function:

  f(x) = sinc(x) := { 1            if x = 0,
            sin(πx)/(πx)  otherwise.

Let the training and test input densities be

  p_tr(x) = N(x; 1, (1/2)²),
  p_te(x) = N(x; 2, (1/4)²),

where N(x; μ, σ²) denotes the Gaussian density with mean μ and variance σ². We create the training output values {y_i^tr}_{i=1}^{ntr} by

  y_i^tr = f(x_i^tr) + ε_i^tr,

where the i.i.d. noise {ε_i^tr}_{i=1}^{ntr} has density N(ε; 0, (1/4)²). Test output values {y_j^te}_{j=1}^{nte} are generated in the same way. Let the number of training samples be ntr = 200 and the number of test samples be nte = 1000. The goal is to obtain a function f̂(x) such that the generalization error is minimized. This setting implies that we are considering a (weak) extrapolation problem (see figure 4.7, where only 100 test samples are plotted for clear visibility).

4.7.2 Importance Estimation by KLIEP

First, we illustrate the behavior of KLIEP in importance estimation, where we use only the input points {x_i^tr}_{i=1}^{ntr} and {x_j^te}_{j=1}^{nte}. Figure 4.8 depicts the true importance and its estimates by KLIEP; the Gaussian kernel model with 100 kernels is used, and three Gaussian widths σ = 0.02, 0.2, 0.8 are tested. The graphs show that the performance of KLIEP is highly dependent on the Gaussian width: the estimated importance function ŵ(x) fluctuates highly when σ is small, and is overly smoothed when σ is large. When σ is chosen appropriately, KLIEP seems to work reasonably well for this example.

Figure 4.9 depicts the values of the true KL′ (see equation 4.5) and its estimate by fivefold CV (see equation 4.7); the means, the 25th percentiles, and the 75th percentiles over 100 trials are plotted as functions of the Gaussian width σ. This shows that CV gives a very good estimate of KL′, which results in an appropriate choice of σ.
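The toy setting above can be reproduced in a few lines (a sketch; note that NumPy's `sinc` is the normalized sinc, matching the definition above, and that SciPy's `norm.pdf` takes the standard deviation, not the variance):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_tr, n_te = 200, 1000
x_tr = rng.normal(1.0, 0.5, n_tr)    # p_tr(x) = N(x; 1, (1/2)^2)
x_te = rng.normal(2.0, 0.25, n_te)   # p_te(x) = N(x; 2, (1/4)^2)
f = np.sinc                          # sin(pi x)/(pi x), with f(0) = 1
y_tr = f(x_tr) + rng.normal(0.0, 0.25, n_tr)   # noise N(0, (1/4)^2)
# in this toy problem the true importance is available in closed form:
w_true = norm.pdf(x_tr, 2.0, 0.25) / norm.pdf(x_tr, 1.0, 0.5)
```

Having the true importance in closed form is what allows figures 4.8 and 4.9 to compare estimates against the ground truth.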
Figure 4.7 Illustrative example. (a) Training input density p_tr(x) and test input density p_te(x). (b) Target function f(x), training samples {(x_i^tr, y_i^tr)}_{i=1}^{ntr}, and test samples {(x_j^te, y_j^te)}_{j=1}^{nte}.

Figure 4.8 Results of importance estimation by KLIEP for Gaussian widths (a) σ = 0.02, (b) σ = 0.2, and (c) σ = 0.8. w(x) is the true importance function, and ŵ(x) is its estimate obtained by KLIEP.
Figure 4.9 Model selection curve for KLIEP. KL′ is the true score of an estimated importance (see equation 4.5), and KL′_CV is its estimate by fivefold CV (see equation 4.7).

4.7.3 Covariate Shift Adaptation by IWLS and IWCV

Next, we illustrate how the estimated importance is used for covariate shift adaptation. Here we use {(x_i^tr, y_i^tr)}_{i=1}^{ntr} and {x_j^te}_{j=1}^{nte} for learning; the test output values {y_j^te}_{j=1}^{nte} are used only for evaluating the generalization performance. We use the following polynomial regression model:

  f̂(x; θ) := Σ_{ℓ=0}^t θ_ℓ x^ℓ,   (4.15)

where t is the order of the polynomials. The parameter vector θ is learned by importance-weighted least squares (IWLS):

  θ̂_IWLS := argmin_θ [ Σ_{i=1}^{ntr} ŵ(x_i^tr) ( f̂(x_i^tr; θ) − y_i^tr )² ].

IWLS is asymptotically unbiased when the true importance w(x^tr) is used as the weights. On the other hand, ordinary LS is not asymptotically unbiased under covariate shift, given that the model f̂(x; θ) is not correctly specified (see section 2.1.1).

For the linear regression model (equation 4.15), the minimizer θ̂_IWLS is given analytically by
  θ̂_IWLS = (X^tr⊤ Ŵ^tr X^tr)^{−1} X^tr⊤ Ŵ^tr y^tr,

where

  X^tr_{i,ℓ} := (x_i^tr)^{ℓ−1},
  Ŵ^tr := diag( ŵ(x_1^tr), ŵ(x_2^tr), ..., ŵ(x_ntr^tr) ),
  y^tr := (y_1^tr, y_2^tr, ..., y_ntr^tr)⊤,

and diag(a, b, ..., c) denotes the diagonal matrix with diagonal elements a, b, ..., c.

We choose the order t of the polynomials based on importance-weighted CV (IWCV; see section 3.3). More specifically, we first divide the training samples {z_i^tr | z_i^tr = (x_i^tr, y_i^tr)}_{i=1}^{ntr} into k disjoint subsets {Z_r^tr}_{r=1}^k. Then we learn a function f̂_r(x) from {Z_{r′}^tr}_{r′≠r} (i.e., without Z_r^tr) by IWLS, and compute its mean test error for the remaining samples Z_r^tr:

  Ĝen_r := (1/|Z_r^tr|) Σ_{(x,y) ∈ Z_r^tr} ŵ(x) ( f̂_r(x) − y )².

This procedure is repeated for r = 1, 2, ..., k, and the average Ĝen is used as an estimate of Gen:

  Ĝen := (1/k) Σ_{r=1}^k Ĝen_r.   (4.16)

For model selection, we compute Ĝen for all model candidates (the order t ∈ {1, 2, 3} of the polynomials in the current setting), and choose the one that minimizes Ĝen. We set the number of folds in IWCV to k = 5. IWCV is shown to be almost unbiased when the true importance w(x^tr) is used as the weights, while ordinary CV for misspecified models is heavily biased under covariate shift (see section 3.3).

Figure 4.10 depicts the functions learned by IWLS with different orders of polynomials. The results show that in all cases the learned functions reasonably go through the test samples (note that the test output points are not used for obtaining the learned functions). Figure 4.11a depicts the true generalization error of IWLS and its estimate by IWCV; the means, the 25th percentiles, and the 75th percentiles over 100 runs are plotted as functions of the order of the polynomials. This shows that IWCV roughly grasps the trend of the true generalization error. For comparison purposes, we include the results by ordinary LS and ordinary CV in figures 4.10 and 4.11. Figure 4.10 shows that the functions obtained by ordinary LS nicely go through the training samples, but not
Figure 4.10 Learned functions obtained by IWLS and LS, denoted by f̂_IWLS(x) and f̂_LS(x), respectively, for polynomials of order (a) 1, (b) 2, and (c) 3.

through the test samples. Figure 4.11 shows that the scores of ordinary CV tend to be biased, implying that model selection by ordinary CV is not reliable.

Finally, we compare the generalization errors obtained by IWLS/LS and IWCV/CV, which are summarized in figure 4.12 as box plots. This shows that IWLS+IWCV tends to outperform the other methods, illustrating the usefulness of the covariate shift adaptation method.

4.8 Experimental Comparison

In this section, we compare the accuracy and computational efficiency of density-ratio estimation methods. Let the dimension of the domain be d, and let

  p_tr(x) = N(x; (0, 0, ..., 0)⊤, I_d),
  p_te(x) = N(x; (1, 0, ..., 0)⊤, I_d),
Figure 4.11 Model selection curves for IWLS/LS and IWCV/CV: (a) IWLS; (b) LS. Gen denotes the true generalization error of a learned function; Ĝen_IWCV and Ĝen_CV denote its estimates by fivefold IWCV and fivefold CV, respectively (see equation 4.16).

Figure 4.12 Box plots of generalization errors for IWLS+IWCV, IWLS+CV, LS+IWCV, and LS+CV (5th, 25th, 50th, 75th, and 95th percentiles).

where I_d denotes the d × d identity matrix and N(x; μ, Σ) denotes the multidimensional Gaussian density with mean μ and covariance matrix Σ. The task is to estimate the importance at the training points:

  w_i = w(x_i^tr) = p_te(x_i^tr) / p_tr(x_i^tr) for i = 1, 2, ..., ntr.
We compare the following methods:

• KDE(CV): The Gaussian kernel (equation 4.1) is used, where the kernel widths of the training and test densities are separately optimized based on fivefold CV.
• KMM(med): The performance of KMM depends on B, ε, and σ. We set B = 1000 and ε = (√ntr − 1)/√ntr, following the original KMM paper [82], and the Gaussian width σ is set to the median distance between samples within the training set and the test set [141, 147].
• LR(CV): The Gaussian kernel model (equation 4.8) is used. The kernel width σ and the regularization parameter λ are chosen based on fivefold CV.
• KLIEP(CV): The Gaussian kernel model (equation 4.8) is used. The kernel width σ is selected based on fivefold CV.
• uLSIF(CV): The Gaussian kernel model (equation 4.8) is used. The kernel width σ and the regularization parameter λ are determined based on LOOCV.

All the methods are implemented in the MATLAB® environment, where the CPLEX® optimizer is used for solving quadratic programs in KMM and the LIBLINEAR implementation is used for LR [103]. We set the number of test points to nte = 1000, and considered the following setups for the number ntr of training samples and the input dimensionality d:

(a) ntr is fixed to ntr = 100, and d is changed as d = 1, 2, ..., 20;
(b) d is fixed to d = 10, and ntr is changed as ntr = 50, 60, ..., 150.

We ran the experiments 100 times for each d, each ntr, and each method, and evaluated the quality of the importance estimates {ŵ_i}_{i=1}^{ntr} by the normalized mean-squared error (NMSE):

  NMSE := (1/ntr) Σ_{i=1}^{ntr} ( ŵ_i / Σ_{i′=1}^{ntr} ŵ_{i′} − w_i / Σ_{i′=1}^{ntr} w_{i′} )².

For the purpose of covariate shift adaptation, the global scale of the importance values is not important. Thus, the above NMSE, which evaluates only the relative magnitudes among {ŵ_i}_{i=1}^{ntr}, would be a suitable error metric for the current experiments.
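The NMSE above is invariant to the global scale of either set of importance values, which is the point of the normalization; a minimal sketch:

```python
import numpy as np

def nmse(w_hat, w_true):
    """Normalized mean-squared error between importance estimates and
    true importance values, comparing only their relative magnitudes."""
    a = w_hat / w_hat.sum()
    b = w_true / w_true.sum()
    return np.mean((a - b) ** 2)
```

Rescaling either argument by a positive constant leaves the score unchanged, so only the shape of the importance function is evaluated.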
NMSEs averaged over 100 trials (a) as a function of input dimensionality d and (b) as a function of the training sample size ntr are plotted in log scale in
figure 4.13. Error bars are omitted for clear visibility; instead, the best method in terms of the mean error and comparable methods based on the t-test at the significance level 1 percent are indicated by ○, and the methods with significant difference from the best methods are indicated by ×.

Figure 4.13a shows that the error of KDE(CV) sharply increases as the input dimensionality grows, while LR, KLIEP, and uLSIF tend to give much smaller errors than KDE. This would be an advantage of directly estimating the importance without going through density estimation. KMM tends to perform poorly, which is caused by an inappropriate choice of the Gaussian kernel width. On the other hand, model selection in LR, KLIEP, and uLSIF seems to work quite well. Figure 4.13b shows that the errors of all methods tend to decrease as the number of training samples grows. Again, LR, KLIEP, and uLSIF tend to give much smaller errors than KDE and KMM.

Next, we investigate the computation time. Each method has a different model selection strategy: KMM does not involve CV; KDE and KLIEP involve CV over the kernel width; and LR and uLSIF involve CV over both the kernel width and the regularization parameter. Thus, a naive comparison of the total computation time is not so meaningful. For this reason, we first investigate the computation time of each importance estimation method after the model parameters have been determined. The average CPU computation time over 100 trials is summarized in figure 4.14. Figure 4.14a shows that the computation times of KDE, KLIEP, and uLSIF are almost independent of the input dimensionality, while those of KMM and LR are rather dependent on it. Note that LR for d ≤ 3 is slow due to a convergence problem of the LIBLINEAR package. uLSIF is one of the fastest methods.
Figure 4.14b shows that the computation times of LR, KLIEP, and uLSIF are nearly independent of the number of training samples, while those of KDE and KMM sharply increase as the number of training samples increases. Both LR and uLSIF have high accuracy, and their computation times after model selection are comparable. Finally, we compare the entire computation times of LR and uLSIF including CV, which are summarized in figure 4.15. The Gaussian width σ and the regularization parameter λ are chosen over the 9 × 9 grid for both LR and uLSIF; therefore, the comparison of the entire computation time is fair. Figures 4.15a and 4.15b show that uLSIF is approximately five times faster than LR.

Overall, uLSIF is shown to be comparable to the best existing method (LR) in terms of accuracy, but is computationally more efficient than LR.
Figure 4.13 NMSEs averaged over 100 trials, in log scale, for the artificial data set: (a) when the input dimensionality is changed; (b) when the training sample size is changed. Error bars are omitted for clear visibility. Instead, the best method in terms of the mean error and comparable methods based on the t-test at the significance level 1 percent are indicated by ○; the methods with significant difference from the best methods are indicated by ×.
[Figure 4.14 appears here: average computation time versus (a) input dimensionality d and (b) training sample size n_tr, for KDE(CV), KMM(med), LR(CV), KLIEP(CV), and uLSIF(CV).]
Figure 4.14 Average computation time (after model selection) over 100 trials for the artificial data set.
[Figure 4.15 appears here: average computation time including model selection versus (a) input dimensionality d and (b) training sample size n_tr, for LR and uLSIF.]
Figure 4.15 Average computation time over 100 trials for the artificial data set (including model selection of the Gaussian width σ and the regularization parameter λ over the 9 × 9 grid).
4.9 Summary

In this chapter, we have shown methods of importance estimation that can avoid solving the substantially more difficult task of density estimation. Table 4.1 summarizes the properties of the importance estimation methods.
Kernel density estimation (KDE; section 4.1) is computationally very efficient since no optimization is involved, and model selection is possible by cross-validation (CV). However, KDE may suffer from the curse of dimensionality due to the difficulty of density estimation in high dimensions.
Kernel mean matching (KMM; section 4.2) may potentially work better by directly estimating the importance. However, since objective model selection methods are missing for KMM, model parameters such as the Gaussian width need to be determined by hand. This is highly unreliable unless we have strong prior knowledge. Furthermore, the computation of KMM is rather demanding since a quadratic programming problem has to be solved.
Logistic regression (LR; section 4.3) and the Kullback-Leibler importance estimation procedure (KLIEP; section 4.4) also do not involve density estimation. However, in contrast to KMM, LR and KLIEP are equipped with CV for model selection, which is a significant advantage over KMM. Nevertheless, LR and KLIEP are computationally rather expensive since nonlinear optimization problems have to be solved.
Least-squares importance fitting (LSIF; section 4.5) is qualitatively similar to LR and KLIEP; that is, it can avoid density estimation, model selection is possible, and nonlinear optimization is involved. LSIF is more advantageous than LR and KLIEP in that it is equipped with a regularization path tracking algorithm. Thanks to this, model selection for LSIF is computationally much more efficient than for LR and KLIEP. However, the regularization path tracking algorithm tends to be numerically unstable.
Table 4.1
Importance estimation methods

Method  | Density estimation | Model selection | Optimization      | Out-of-sample prediction
--------|--------------------|-----------------|-------------------|-------------------------
KDE     | Necessary          | Available       | Analytic          | Possible
KMM     | Not necessary      | Not available   | Convex QP         | Not possible
LR      | Not necessary      | Available       | Convex non-linear | Possible
KLIEP   | Not necessary      | Available       | Convex non-linear | Possible
LSIF    | Not necessary      | Available       | Convex QP         | Possible
uLSIF   | Not necessary      | Available       | Analytic          | Possible

QP, quadratic program.
Unconstrained LSIF (uLSIF; section 4.6) exhibits the good properties of the other methods (e.g., no density estimation is involved and a built-in model selection method is available). In addition to these properties, the solution of uLSIF can be computed analytically by solving a system of linear equations. Therefore, uLSIF is computationally very efficient and numerically stable. Furthermore, thanks to the availability of the closed-form solution, the LOOCV score of uLSIF can be computed analytically without repeating holdout loops, which highly contributes to reducing the computation time in the model selection phase. Consequently, uLSIF is a preferable method for importance estimation.
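Since uLSIF reduces importance estimation to a single regularized linear system, its core fits in a few lines. The NumPy sketch below is an illustrative simplification rather than the authors' reference implementation; the function names, the fixed kernel width sigma, the regularization parameter lam, and the number of kernel centers are all hypothetical choices.

```python
import numpy as np

def gaussian_kernel(x, centers, sigma):
    # Pairwise Gaussian kernel values K(x_i, c_l).
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

def ulsif_fit(x_tr, x_te, sigma=1.0, lam=0.1, n_centers=100):
    # Kernel centers taken as a subset of the test samples.
    centers = x_te[:min(n_centers, len(x_te))]
    K_tr = gaussian_kernel(x_tr, centers, sigma)   # n_tr x b
    K_te = gaussian_kernel(x_te, centers, sigma)   # n_te x b
    H = K_tr.T @ K_tr / len(x_tr)                  # H-hat from training samples
    h = K_te.mean(axis=0)                          # h-hat from test samples
    # The analytic uLSIF step: one regularized linear system, no iteration.
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    # Negative outputs are clipped to zero at evaluation time.
    return lambda x: np.maximum(0.0, gaussian_kernel(x, centers, sigma) @ alpha)

# Toy check: p_tr = N(0, 1), p_te = N(0.5, 1).
rng = np.random.default_rng(0)
x_tr = rng.normal(0.0, 1.0, size=(500, 1))
x_te = rng.normal(0.5, 1.0, size=(500, 1))
w_hat = ulsif_fit(x_tr, x_te)
val = float(w_hat(np.array([[0.5]]))[0])
print(val)  # density-ratio estimate at x = 0.5 (true value is exp(0.125) ≈ 1.13)
```

In practice, sigma and lam would be chosen by the analytic LOOCV score described above rather than fixed by hand.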
5 Direct Density-Ratio Estimation with Dimensionality Reduction

As shown in chapter 4, various methods have been developed for directly estimating the density ratio without going through density estimation. However, even these methods can perform rather poorly when the dimensionality of the data domain is high. In this chapter, a dimensionality reduction scheme for density-ratio estimation, called direct density-ratio estimation with dimensionality reduction (D3; pronounced as "D-cube") [158], is introduced.

5.1 Density Difference in Hetero-Distributional Subspace

The basic assumption behind D3 is that the densities p_tr(x) and p_te(x) are different not in the entire space, but only in some subspace. This assumption can be mathematically formulated with the following linear mixing model. Let {u_i^tr}_{i=1}^{n_tr} be i.i.d. samples drawn from an m-dimensional distribution with density¹ p_tr(u), where m ∈ {1, 2, ..., d}. We assume p_tr(u) > 0 for all u. Let {u_j^te}_{j=1}^{n_te} be i.i.d. samples drawn from another m-dimensional distribution with density p_te(u). Let {v_i^tr}_{i=1}^{n_tr} and {v_j^te}_{j=1}^{n_te} be i.i.d. samples drawn from a (d − m)-dimensional distribution with density p(v). We assume p(v) > 0 for all v. Let A be a d × m matrix and B be a d × (d − m) matrix such that the column vectors of A and B span the entire space. Based on these quantities, we consider the case where the samples {x_i^tr}_{i=1}^{n_tr} and {x_j^te}_{j=1}^{n_te} are generated as

x_i^tr = A u_i^tr + B v_i^tr,  x_j^te = A u_j^te + B v_j^te.

1. With abuse of notation, we use p_tr(u) and p_te(u), which are different from p_tr(x) and p_te(x), for simplicity.
Thus, p_tr(x) and p_te(x) are expressed as

p_tr(x) = c p_tr(u) p(v),
p_te(x) = c p_te(u) p(v),

where c is the Jacobian between the observation x and (u, v). We call R(A) and R(B) the hetero-distributional subspace and the homo-distributional subspace, respectively, where R(·) denotes the range of a matrix. Note that R(A) and R(B) are not generally orthogonal to one another (see figure 5.1). Under the above decomposability assumption with independence of u and v, the density ratio of p_te(x) and p_tr(x) can be simplified as

w(x) = p_te(x) / p_tr(x) = (c p_te(u) p(v)) / (c p_tr(u) p(v)) = p_te(u) / p_tr(u) = w(u).  (5.1)

This means that the density ratio does not have to be estimated in the entire d-dimensional space, but only in the hetero-distributional subspace of dimension m (≤ d). Now we want to extract the hetero-distributional components u_i^tr and u_j^te from the original high-dimensional samples x_i^tr and x_j^te. This allows us to estimate the density ratio only in R(A) via equation 5.1. As illustrated in figure 5.1, the oblique projection of x_i^tr and x_j^te onto R(A) along R(B) gives u_i^tr and u_j^te.

5.2 Characterization of Hetero-Distributional Subspace

Let us denote the oblique projection matrix onto R(A) along R(B) by P_{R(A),R(B)}. In order to characterize the oblique projection matrix P_{R(A),R(B)}, let us consider matrices U and V whose rows consist of dual bases for the column vectors of A and B, respectively. More specifically, U is an m × d matrix and V is a (d − m) × d matrix such that they are bi-orthogonal to one another:

U B = 0_{m×(d−m)},  V A = 0_{(d−m)×m},

where 0_{m×m'} denotes the m × m' matrix with all zeros. Thus, R(B) and R(U^⊤) are orthogonal to one another, and R(A) and R(V^⊤) are orthogonal to one another, where ⊤ denotes the transpose. When R(A) and R(B) are orthogonal to one another, R(U^⊤) agrees with R(A) and R(V^⊤) agrees with R(B); however, in general they are different, as illustrated in figure 5.1.
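Equation 5.1 can be checked numerically. The sketch below uses arbitrary illustrative numbers (u ~ N(0, 1) under p_tr and u ~ N(1, 1/2) under p_te, a shared v ~ N(0, 1), and the mixing directions from figure 5.1); the dual basis U is obtained by inverting [A | B], which enforces UB = 0. The full density ratio p_te(x)/p_tr(x) then agrees with the subspace ratio p_te(u)/p_tr(u) at u = Ux to machine precision.

```python
import numpy as np

def gauss_pdf(x, mean, cov):
    # Density of a multivariate Gaussian N(x; mean, cov).
    d = len(mean)
    diff = x - mean
    quad = diff @ np.linalg.solve(cov, diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

A = np.array([[1.0], [0.0]])               # hetero-distributional direction
B = np.array([[1.0], [2.0]])               # homo-distributional direction
U = np.linalg.inv(np.hstack([A, B]))[:1]   # dual basis row: UA = 1, UB = 0

# x = A u + B v, with v ~ N(0, 1) shared by both distributions,
# u ~ N(0, 1) under p_tr and u ~ N(1, 1/2) under p_te (illustrative values).
mu_tr = (A * 0.0).ravel()
mu_te = (A * 1.0).ravel()
cov_tr = 1.0 * A @ A.T + 1.0 * B @ B.T
cov_te = 0.5 * A @ A.T + 1.0 * B @ B.T

x = np.array([0.7, -0.4])
ratio_x = gauss_pdf(x, mu_te, cov_te) / gauss_pdf(x, mu_tr, cov_tr)

u = (U @ x).item()                         # extracted hetero component, u = 0.9
p_te_u = np.exp(-(u - 1.0) ** 2 / (2 * 0.5)) / np.sqrt(2 * np.pi * 0.5)
p_tr_u = np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)
ratio_u = p_te_u / p_tr_u

print(abs(ratio_x - ratio_u) < 1e-10)  # True: w(x) = w(u)
```

The Jacobian c and the shared factor p(v) cancel in the ratio, which is exactly why estimating w in the m-dimensional subspace suffices.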
[Figure 5.1 appears here: (a) p_tr(x) ∝ p_tr(u) p(v); (b) p_te(x) ∝ p_te(u) p(v).]
Figure 5.1 A schematic picture of the hetero-distributional subspace for d = 2 and m = 1. Let A ∝ (1, 0)^⊤ and B ∝ (1, 2)^⊤. Then U ∝ (2, −1) and V ∝ (0, 1). R(A) and R(B) are called the hetero-distributional subspace and the homo-distributional subspace, respectively. If a data point x is projected onto R(A) along R(B), the homo-distributional component v can be eliminated and the hetero-distributional component u can be extracted.

The relation between A and B and the relation between U and V can be characterized in terms of the covariance matrix Σ (of either p_tr(x) or p_te(x)) as

A^⊤ Σ^{−1} B = 0_{m×(d−m)},   (5.2)
U Σ V^⊤ = 0_{m×(d−m)}.   (5.3)

These orthogonality relations in terms of Σ follow from the statistical independence between the components in R(A) and R(B); more specifically, equation 5.2 follows from the fact that the sphering operation (transforming samples x by x ← Σ^{−1/2} x in advance) orthogonalizes independent components u and v [84], and equation 5.3 is its dual expression [92]. After sphering, the covariance matrix becomes identity, and consequently all the discussions become simpler. However, estimating the covariance matrix from samples can be erroneous in high-dimensional problems, and taking its inverse further magnifies the estimation error. For this reason, we decided to deal directly with nonorthogonal A and B below. For normalization purposes, we further assume that

U A = I_m and V B = I_{d−m},
where I_m denotes the m-dimensional identity matrix. Then the oblique projection matrices P_{R(A),R(B)} and P_{R(B),R(A)} can be expressed as

P_{R(A),R(B)} = A U,  P_{R(B),R(A)} = B V,

which can be confirmed by the facts that P_{R(A),R(B)}² = P_{R(A),R(B)} (idempotence); the null space of P_{R(A),R(B)} is R(B); and the range of P_{R(A),R(B)} is R(A); the same is true for P_{R(B),R(A)}. The above expressions of P_{R(A),R(B)} and P_{R(B),R(A)} imply that U expresses projected images in an m-dimensional coordinate system within R(A), while V expresses projected images in a (d − m)-dimensional coordinate system within R(B). We call U and V the hetero-distributional mapping and the homo-distributional mapping, respectively. Now u_i^tr, u_j^te, v_i^tr, and v_j^te are expressed as

u_i^tr = U x_i^tr,  u_j^te = U x_j^te,
v_i^tr = V x_i^tr,  v_j^te = V x_j^te.

Thus, if the hetero-distributional mapping U is estimated, estimation of the density ratio w(x) can be carried out in a low-dimensional hetero-distributional subspace via equation 5.1. The above framework is called direct density-ratio estimation with dimensionality reduction (D3) [158]. For the time being, we assume that the dimension m of the hetero-distributional subspace is known; we show how m is estimated from data in section 5.5.

5.3 Identifying Hetero-Distributional Subspace by Supervised Dimensionality Reduction

In this section, we explain how the hetero-distributional subspace is estimated.

5.3.1 Basic Idea

In order to estimate the hetero-distributional subspace, we need a criterion that reflects the degree of distributional difference in a subspace. A key observation in this context is that the existence of a distributional difference can be checked by determining whether samples from the two distributions are separated from one another. That is, if samples of one distribution can be distinguished from the samples of the other distribution, one may conclude that two distributions
are different; otherwise, the distributions may be similar. We employ this idea for finding the hetero-distributional subspace. Let us denote the samples projected onto the hetero-distributional subspace by

{u_i^tr | u_i^tr = U x_i^tr}_{i=1}^{n_tr} and {u_j^te | u_j^te = U x_j^te}_{j=1}^{n_te}.

Then our goal is to find the matrix U such that {u_i^tr}_{i=1}^{n_tr} and {u_j^te}_{j=1}^{n_te} are maximally separated from one another. For that purpose, we may use any supervised dimensionality reduction method. Among supervised dimensionality reduction methods (e.g., [72, 73, 56, 63, 62]), we decided to use local Fisher discriminant analysis (LFDA; [154]), which is an extension of classical Fisher discriminant analysis (FDA; [50]) to multimodally distributed data. LFDA has useful properties in practice; for instance, there is no limitation on the dimension of the reduced subspace (FDA is limited to one-dimensional projection for two-class problems [57]); it works well even when data have multimodal structure such as separate clusters; it is robust against outliers; its solution can be computed analytically using eigenvalue decomposition, as stably and efficiently as the original FDA; and its experimental performance has been shown to be better than that of other supervised learning methods. Below, we briefly review the technical details of LFDA, showing how to use it for the hetero-distributional subspace search. To simplify the notation, we consider a set of binary-labeled training samples {(x_k, y_k)}_{k=1}^{n} and reduce the dimensionality of x_k using an m × d transformation matrix T, as

z_k = T x_k.

Effectively, the training samples {(x_k, y_k)}_{k=1}^{n} correspond to the following setup: for n = n_tr + n_te,

{x_k}_{k=1}^{n} = {x_i^tr}_{i=1}^{n_tr} ∪ {x_j^te}_{j=1}^{n_te},

where each y_k indicates whether x_k is a training or a test sample (e.g., y_k = +1 if x_k ∈ {x_i^tr}_{i=1}^{n_tr} and y_k = −1 if x_k ∈ {x_j^te}_{j=1}^{n_te}).
5.3.2 Fisher Discriminant Analysis

Since LFDA is an extension of FDA [50], we first briefly review the original FDA (see also section 2.3.1). Let n_+ and n_− be the numbers of samples in class +1 and class −1, respectively. Let μ, μ_+, and μ_− be the means of {x_k}_{k=1}^{n}, {x_k | y_k = +1}_{k=1}^{n}, and {x_k | y_k = −1}_{k=1}^{n}, respectively:

μ = (1/n) Σ_{k=1}^{n} x_k,  μ_+ = (1/n_+) Σ_{k: y_k=+1} x_k,  μ_− = (1/n_−) Σ_{k: y_k=−1} x_k.

Let S^b and S^w be the between-class scatter matrix and the within-class scatter matrix, respectively, defined as

S^b := n_+ (μ_+ − μ)(μ_+ − μ)^⊤ + n_− (μ_− − μ)(μ_− − μ)^⊤,
S^w := Σ_{k: y_k=+1} (x_k − μ_+)(x_k − μ_+)^⊤ + Σ_{k: y_k=−1} (x_k − μ_−)(x_k − μ_−)^⊤.

The FDA transformation matrix T_FDA is defined as

T_FDA := argmax_{T ∈ ℝ^{m×d}} [tr(T S^b T^⊤ (T S^w T^⊤)^{−1})].

That is, FDA seeks a transformation matrix T with large between-class scatter and small within-class scatter in the embedding space ℝ^m. In the above formulation, we implicitly assume that S^w is full rank and T has rank m so that the inverse of T S^w T^⊤ exists. Let {ψ_l}_{l=1}^{d} be the generalized eigenvectors associated with the generalized eigenvalues {η_l}_{l=1}^{d} of the following generalized eigenvalue problem [57]:

S^b ψ = η S^w ψ.

We assume that the generalized eigenvalues are sorted as

η_1 ≥ η_2 ≥ ⋯ ≥ η_d.
Then a solution T_FDA is analytically given as follows (e.g., [42]):

T_FDA = (ψ_1 | ψ_2 | ⋯ | ψ_m)^⊤.

FDA works very well if the samples in each class are Gaussian with common covariance structure. However, it tends to give undesired results if the samples in a class form several separate clusters or there are outliers. Furthermore, the between-class scatter matrix S^b is known to have rank 1 in the current setup (see, e.g., [57]), implying that we can obtain only one meaningful feature ψ_1 through the FDA criterion; the remaining features {ψ_l}_{l=2}^{d} found by FDA are arbitrary in the null space of S^b. This is an essential limitation of FDA in dimensionality reduction.

5.3.3 Local Fisher Discriminant Analysis

In order to overcome the weaknesses of FDA explained above, LFDA has been introduced [154]. Here, we explain the main idea of LFDA briefly. The scatter matrices S^b and S^w in the original FDA can be expressed in the pairwise form as follows [154]:

S^b = (1/2) Σ_{k,k'=1}^{n} W^b_{k,k'} (x_k − x_{k'})(x_k − x_{k'})^⊤,
S^w = (1/2) Σ_{k,k'=1}^{n} W^w_{k,k'} (x_k − x_{k'})(x_k − x_{k'})^⊤,

where

W^b_{k,k'} := 1/n − 1/n_+ if y_k = y_{k'} = +1,
              1/n − 1/n_− if y_k = y_{k'} = −1,
              1/n if y_k ≠ y_{k'},

W^w_{k,k'} := 1/n_+ if y_k = y_{k'} = +1,
              1/n_− if y_k = y_{k'} = −1,
              0 if y_k ≠ y_{k'}.

Note that (1/n − 1/n_+) and (1/n − 1/n_−) included in the definition of W^b are negative values.
Based on the above pairwise expression, let us define the local between-class scatter matrix S^lb and the local within-class scatter matrix S^lw as

S^lb := (1/2) Σ_{k,k'=1}^{n} W^lb_{k,k'} (x_k − x_{k'})(x_k − x_{k'})^⊤,
S^lw := (1/2) Σ_{k,k'=1}^{n} W^lw_{k,k'} (x_k − x_{k'})(x_k − x_{k'})^⊤,

where

W^lb_{k,k'} := A_{k,k'} (1/n − 1/n_+) if y_k = y_{k'} = +1,
               A_{k,k'} (1/n − 1/n_−) if y_k = y_{k'} = −1,
               1/n if y_k ≠ y_{k'},

W^lw_{k,k'} := A_{k,k'} / n_+ if y_k = y_{k'} = +1,
               A_{k,k'} / n_− if y_k = y_{k'} = −1,
               0 if y_k ≠ y_{k'}.

A_{k,k'} is the affinity value between x_k and x_{k'} (e.g., as defined based on the local scaling heuristic [208]):

A_{k,k'} = exp(−‖x_k − x_{k'}‖² / (τ_k τ_{k'})),

where τ_k is the local scaling factor around x_k defined by

τ_k = ‖x_k − x_k^{(K)}‖,

where x_k^{(K)} denotes the K-th nearest neighbor of x_k. A heuristic choice of K = 7 was shown to be useful through extensive simulations [208, 154]. Note that the local scaling factors are computed in a classwise manner in LFDA (see the pseudo code of LFDA in figure 5.2). Based on the local scatter matrices S^lb and S^lw, the LFDA transformation matrix T_LFDA is defined as

T_LFDA := argmax_{T ∈ ℝ^{m×d}} [tr(T S^lb T^⊤ (T S^lw T^⊤)^{−1})].
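As an illustration, the local-scaling affinity above can be computed in a few lines of NumPy. This is a sketch under the stated heuristic (K = 7, as recommended), computed here for a single class, matching the classwise computation used in LFDA; the function name is hypothetical.

```python
import numpy as np

def local_scaling_affinity(X, K=7):
    """Affinity A[k, k'] = exp(-||x_k - x_k'||^2 / (tau_k * tau_k')),
    where tau_k is the distance to the K-th nearest neighbor of x_k."""
    # Squared pairwise distances.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Row-wise sort: column 0 is the point itself (distance 0),
    # so column K is the K-th nearest neighbor.
    tau = np.sqrt(np.sort(d2, axis=1)[:, K])
    return np.exp(-d2 / (tau[:, None] * tau[None, :]))

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
Aff = local_scaling_affinity(X)
print(Aff.shape, bool(np.allclose(Aff, Aff.T)))  # (30, 30) True
```

The resulting matrix is symmetric with unit diagonal, and its entries decay with distance at a rate adapted to the local density around each pair of points, which is what lets LFDA handle clustered data gracefully.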
Input: Two sets of samples {x_i^tr}_{i=1}^{n_tr} and {x_j^te}_{j=1}^{n_te} on ℝ^d;
       dimensionality of embedding space m (1 ≤ m ≤ d)
Output: m × d transformation matrix T_LFDA

x̃_i^tr ← 7th nearest neighbor of x_i^tr among {x_{i'}^tr}_{i'=1}^{n_tr}, for i = 1, 2, ..., n_tr;
x̃_j^te ← 7th nearest neighbor of x_j^te among {x_{j'}^te}_{j'=1}^{n_te}, for j = 1, 2, ..., n_te;
τ_i^tr ← ‖x_i^tr − x̃_i^tr‖ for i = 1, 2, ..., n_tr;
τ_j^te ← ‖x_j^te − x̃_j^te‖ for j = 1, 2, ..., n_te;
A^tr_{i,i'} ← exp(−‖x_i^tr − x_{i'}^tr‖² / (τ_i^tr τ_{i'}^tr)) for i, i' = 1, 2, ..., n_tr;
A^te_{j,j'} ← exp(−‖x_j^te − x_{j'}^te‖² / (τ_j^te τ_{j'}^te)) for j, j' = 1, 2, ..., n_te;
X^tr ← (x_1^tr | x_2^tr | ⋯ | x_{n_tr}^tr);  X^te ← (x_1^te | x_2^te | ⋯ | x_{n_te}^te);
G^tr ← X^tr diag(A^tr 1_{n_tr}) X^tr⊤ − X^tr A^tr X^tr⊤;
G^te ← X^te diag(A^te 1_{n_te}) X^te⊤ − X^te A^te X^te⊤;
S^lw ← (1/n_tr) G^tr + (1/n_te) G^te;
n ← n_tr + n_te;
S^lb ← (1/n − 1/n_tr) G^tr + (1/n − 1/n_te) G^te + (n_te/n) X^tr X^tr⊤ + (n_tr/n) X^te X^te⊤
       − (1/n) X^tr 1_{n_tr} (X^te 1_{n_te})^⊤ − (1/n) X^te 1_{n_te} (X^tr 1_{n_tr})^⊤;
{γ_l, ψ_l}_{l=1}^{d} ← generalized eigenvalues and eigenvectors of S^lb ψ = γ S^lw ψ;  % γ_1 ≥ γ_2 ≥ ⋯ ≥ γ_d
{ψ̃_l}_{l=1}^{d} ← orthonormal basis of {ψ_l}_{l=1}^{d};  % span({ψ̃_l}_{l=1}^{m'}) = span({ψ_l}_{l=1}^{m'}) for m' = 1, 2, ..., m
T_LFDA ← (ψ̃_1 | ψ̃_2 | ⋯ | ψ̃_m)^⊤;

Figure 5.2 Pseudo code of LFDA. 1_n denotes the n-dimensional vector with all ones, and diag(b) denotes the diagonal matrix with diagonal elements specified by a vector b.

Recalling that (1/n − 1/n_+) and (1/n − 1/n_−) included in the definition of W^lb are negative values, the definitions of S^lb and S^lw imply that LFDA seeks a transformation matrix T such that nearby data pairs in the same class are made close and data pairs in different classes are made apart; far-apart data pairs in the same class are not forced to be close. By the localization effect brought about by the introduction of the affinity matrix, LFDA can overcome the weakness of the original FDA against clustered data and outliers. When A_{k,k'} = 1 for all k, k' (i.e., no locality), S^lw and S^lb are reduced to S^w and S^b.
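The last two steps of figure 5.2 (generalized eigendecomposition followed by orthonormalization of the leading eigenvectors) can be sketched with SciPy as follows. The scatter matrices below are small hypothetical values, just to exercise the routine, and QR factorization plays the role of the Gram-Schmidt step.

```python
import numpy as np
from scipy.linalg import eigh

def lfda_embedding_basis(S_lb, S_lw, m):
    """Solve S_lb psi = gamma S_lw psi, sort eigenvectors by decreasing
    gamma, and orthonormalize them so that the first m' vectors span the
    same subspace as the first m' eigenvectors for every m' <= m."""
    gamma, Psi = eigh(S_lb, S_lw)          # ascending generalized eigenvalues
    Psi = Psi[:, np.argsort(gamma)[::-1]]  # reorder to descending gamma
    Q, _ = np.linalg.qr(Psi)               # Gram-Schmidt via QR factorization
    return Q[:, :m].T                      # m x d transformation matrix

# Hypothetical 2 x 2 scatter matrices; S_lw must be positive definite.
S_lb = np.array([[4.0, 1.0], [1.0, 1.0]])
S_lw = np.array([[2.0, 0.0], [0.0, 2.0]])
U_hat = lfda_embedding_basis(S_lb, S_lw, m=1)
print(bool(np.allclose(U_hat @ U_hat.T, np.eye(1))))  # rows orthonormal: True
```

Because the orthonormalization preserves the nested spans, the solution for any m < d is obtained from a single eigendecomposition at m = d, which is the computational advantage noted in section 5.4.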
Thus, LFDA can be regarded as a natural localized
variant of FDA. The between-class scatter matrix S^b in the original FDA had only rank 1, whereas its local counterpart S^lb in LFDA usually has full rank with no multiplicity in eigenvalues (given n ≥ d). Therefore, LFDA can be applied to dimensionality reduction into any dimensional space, which is a significant advantage over the original FDA. A solution T_LFDA can be computed in the same way as the original FDA. Namely, the LFDA solution is given as

T_LFDA = (ψ_1 | ψ_2 | ⋯ | ψ_m)^⊤,

where {ψ_l}_{l=1}^{d} are the generalized eigenvectors associated with the generalized eigenvalues η_1 ≥ η_2 ≥ ⋯ ≥ η_d of the following generalized eigenvalue problem:

S^lb ψ = η S^lw ψ.  (5.4)

Since the LFDA solution can be computed in the same way as the original FDA solution, LFDA is computationally as efficient as the original FDA. A pseudo code of LFDA is summarized in figure 5.2. A MATLAB® implementation of LFDA is available from http://sugiyama-www.cs.titech.ac.jp/~sugi/software/LFDA/.

5.4 Using LFDA for Finding Hetero-Distributional Subspace

Finally, we show how to obtain an estimate of the transformation matrix U needed in the density-ratio estimation procedure (see section 5.3) from the LFDA transformation matrix T_LFDA. First, an orthonormal basis {ψ̃_l}_{l=1}^{m} of the LFDA subspace is computed from the generalized eigenvectors {ψ_l}_{l=1}^{m} so that the span of {ψ̃_l}_{l=1}^{m'} agrees with the span of {ψ_l}_{l=1}^{m'} for all m' (1 ≤ m' ≤ m). This can be carried out in a straightforward way, for instance, by the Gram-Schmidt orthonormalization (see, e.g., [5]). Then an estimate Û is given as

Û = (ψ̃_1 | ψ̃_2 | ⋯ | ψ̃_m)^⊤,

and the samples are transformed as

û_i^tr = Û x_i^tr for i = 1, 2, ..., n_tr,
û_j^te = Û x_j^te for j = 1, 2, ..., n_te.

The above expression of Û implies another useful advantage of LFDA. In density-ratio estimation, one needs the LFDA solution for each reduced
dimensionality m = 1, 2, ..., d (see section 5.5). However, we do not actually have to compute the LFDA solution for each m, but only to solve the generalized eigenvalue problem (equation 5.4) once for m = d and compute the orthonormal basis {ψ̃_l}_{l=1}^{d}; the solution for m < d can be obtained by simply taking the first m basis vectors {ψ̃_l}_{l=1}^{m}.

5.5 Density-Ratio Estimation in the Hetero-Distributional Subspace

Given that the hetero-distributional subspace has been successfully identified by the above procedure, the next step is to estimate the density ratio within the subspace. Since the direct importance estimator unconstrained least-squares importance fitting (uLSIF; [88]) explained in section 4.6 was shown to be accurate and computationally very efficient, it would be advantageous to combine LFDA with uLSIF. So far, we have explained how the dimensionality reduction idea can be incorporated into density-ratio estimation when the dimension m of the hetero-distributional subspace is known in advance. Here we address how the dimension m is estimated from samples, which results in a practical procedure. For dimensionality selection, the CV score of the uLSIF algorithm can be utilized. In particular, uLSIF allows one to compute the leave-one-out CV (LOOCV) score analytically (see section 4.6.2), where each held-out density-ratio estimate ŵ_i(u) is obtained without û_i^tr and û_i^te. Thus, the LOOCV score is computed as a function of m, and the value of m that minimizes the LOOCV score is chosen. The pseudo code of the entire algorithm is summarized in figure 5.3.

5.6 Numerical Examples

In this section, we illustrate how the D3 algorithm behaves.

5.6.1 Illustrative Example

Let the input domain be ℝ² (i.e., d = 2), and the denominator and numerator densities be set to
Input: Two sets of samples {x_i^tr}_{i=1}^{n_tr} and {x_j^te}_{j=1}^{n_te} on ℝ^d
Output: Density-ratio estimate ŵ(x)

Obtain orthonormal basis {ψ̃_l}_{l=1}^{d} using LFDA with {x_i^tr}_{i=1}^{n_tr} and {x_j^te}_{j=1}^{n_te};
For each reduced dimension m = 1, 2, ..., d
    Form projection matrix: Û_m = (ψ̃_1 | ψ̃_2 | ⋯ | ψ̃_m)^⊤;
    Project samples: {u_{i,m}^tr | u_{i,m}^tr = Û_m x_i^tr}_{i=1}^{n_tr} and {u_{j,m}^te | u_{j,m}^te = Û_m x_j^te}_{j=1}^{n_te};
    For each candidate of Gaussian width σ
        For each candidate of regularization parameter λ
            Compute LOOCV score LOOCV(m, σ, λ) using {u_{i,m}^tr}_{i=1}^{n_tr} and {u_{j,m}^te}_{j=1}^{n_te};
        end
    end
end
Choose the best model: (m̂, σ̂, λ̂) ← argmin_{(m,σ,λ)} LOOCV(m, σ, λ);
Estimate the density ratio from {u_{i,m̂}^tr}_{i=1}^{n_tr} and {u_{j,m̂}^te}_{j=1}^{n_te} using uLSIF with (σ̂, λ̂);

Figure 5.3 Pseudo code of direct density-ratio estimation with dimensionality reduction (D3).

Here, N(x; μ, Σ) denotes the multivariate Gaussian density with mean μ and covariance matrix Σ. The profiles of the above densities and their ratio are illustrated in figures 5.4 and 5.7a, respectively. We sample n_tr = 100 points from p_tr(x) and n_te = 100 points from p_te(x); the samples are illustrated in figure 5.5. In this data set, the distributions are different only in the one-dimensional subspace spanned by (1, 0)^⊤; that is, the true dimensionality of the hetero-distributional subspace is m = 1. The true hetero-distributional subspace is depicted by the solid line in figure 5.5. The dotted line in figure 5.5 depicts the hetero-distributional subspace estimated by LFDA with reduced dimensionality 1; when the reduced dimensionality is 2, LFDA gives the entire space. This shows that for reduced dimensionality 1, LFDA gives a very good estimate of the true hetero-distributional subspace. Next, we choose the reduced dimensionality m as well as the Gaussian width σ and the regularization parameter λ in uLSIF. Figure 5.6 depicts the LOOCV
[Figure 5.4 appears here: (a) profile of p_tr(x); (b) profile of p_te(x).]
Figure 5.4 Two-dimensional toy data set.

[Figure 5.5 appears here: scatter plot of the training and test samples with the true and LFDA-estimated subspaces.]
Figure 5.5 Samples and the hetero-distributional subspace of the two-dimensional toy data set. The LFDA estimate of the hetero-distributional subspace is spanned by (1.00, 0.01)^⊤, which is very close to the true hetero-distributional subspace spanned by (1, 0)^⊤.

score of uLSIF, from which the minimizing model is chosen. Finally, the density ratio is estimated by uLSIF. Figure 5.7 depicts the true density ratio, its estimate by uLSIF without dimensionality reduction, and its estimate by uLSIF with dimensionality reduction by LFDA. For uLSIF without
[Figure 5.6 appears here: LOOCV score as a function of log σ for the candidate models m = 1, 2 and λ = 10^{−1}, 10^{−0.5}, 10^{0}.]
Figure 5.6 LOOCV score of uLSIF for the two-dimensional toy data set.

[Figure 5.7 appears here: (a) true density ratio; (b) density-ratio estimate without dimensionality reduction, NMSE = 1.52 × 10^{−5}; (c) density-ratio estimate with dimensionality reduction, NMSE = 0.89 × 10^{−5}.]
Figure 5.7 True and estimated density-ratio functions. x'^{(1)} and x'^{(2)} in (c) denote the LFDA solution, that is, x'^{(1)} = 1.00x^{(1)} + 0.01x^{(2)} and x'^{(2)} = −0.01x^{(1)} + 1.00x^{(2)}, respectively.
dimensionality reduction, (σ̂, λ̂) = (1, 10^{−0.5}) is chosen by LOOCV (see figure 5.6 with m = 2). This shows that when dimensionality reduction is not performed, independence between the density ratio w(x) and the second input element x^{(2)} (figure 5.7a) is not incorporated, and the estimated density ratio has a Gaussian-tail structure along x^{(2)} (figure 5.7b). On the other hand, when dimensionality reduction is carried out, independence between the density-ratio function w(x) and the second input element x^{(2)} can be successfully captured, and consequently a more accurate estimator is obtained (figure 5.7c). The accuracy of an estimated density ratio is measured by the normalized mean-squared error (NMSE):

NMSE = (1/n_te) Σ_{j=1}^{n_te} ( ŵ(x_j^te) / Σ_{j'=1}^{n_te} ŵ(x_{j'}^te) − w(x_j^te) / Σ_{j'=1}^{n_te} w(x_{j'}^te) )².  (5.5)

By dimensionality reduction, the NMSE is reduced from 1.52 × 10^{−5} to 0.89 × 10^{−5}. Thus, we gain a 41.5 percent reduction in NMSE.

5.6.2 Performance Comparison Using Artificial Data Sets

Here, we investigate the performance of the D3 algorithm using six artificial data sets. The input domain of the data sets is d-dimensional (d ≥ 2), and the true dimensionality of the hetero-distributional subspace is m = 1 or 2. The homo-distributional component of the data sets is the (d − m)-dimensional Gaussian distribution with mean zero and covariance identity. The hetero-distributional component of each data set is given as follows:

(a) Data set 1 (shifting, m = 1): u^tr and u^te are Gaussians with common covariance, with the mean of u^te shifted from that of u^tr.

(b) Data set 2 (shrinking, m = 2): u^tr ~ N(0, I_2) and u^te ~ N(0, (1/4) I_2).
(c) Data set 3 (magnifying, m = 2): u^tr ~ N(0, (1/4) I_2) and u^te ~ N(0, I_2).

(d) Data set 4 (rotating, m = 2): u^tr is an anisotropic Gaussian, and u^te is the same Gaussian with its covariance rotated.

(e) Data set 5 (one-dimensional splitting, m = 1): u^tr is Gaussian, and u^te is an equal-weight two-component Gaussian mixture with component covariances scaled by 1/4, so that the test distribution splits into two clusters.

(f) Data set 6 (two-dimensional splitting, m = 2): u^tr ~ N(0, I_2), and u^te is an equal-weight four-component Gaussian mixture with component covariances (1/4) I_2, splitting along both dimensions.

The number of samples is set to n_tr = 200 and n_te = 1000 for all the data sets. Examples of realized samples are illustrated in figure 5.8. For each dimensionality d = 2, 3, ..., 10, the density ratio is estimated using uLSIF with/without dimensionality reduction. This experiment is repeated 100 times for each d with different random seeds. Figure 5.9 depicts the choice of the dimensionality of the hetero-distributional subspace by LOOCV for each d. This shows that for data sets 1, 2, 4, and
[Figure 5.8 appears here: scatter plots of the hetero-distributional components of the six artificial data sets, panels (a)-(f).]
Figure 5.8 Artificial data sets.
[Figure 5.9 appears here: histograms of the chosen dimensionality m̂ for each d, panels (a)-(f) corresponding to data sets 1-6.]
Figure 5.9 Dimension choice of the hetero-distributional subspace by LOOCV over 100 runs.
5, dimensionality choice by LOOCV works well. For data set 3, m̂ = 1 is always chosen although the true dimensionality is m = 2. For data set 6, dimensionality choice is rather unstable, but it still works reasonably well. Figure 5.10 depicts the value of NMSE (see equation 5.5) averaged over 100 trials. For each d, the t-test (see, e.g., [78]) at the significance level 5 percent is performed, and the best method as well as the comparable method in terms of mean NMSE are indicated by ×. (In other words, the method without the symbol × is significantly worse than the other method.) This shows that the mean NMSE of the baseline method (no dimensionality reduction) tends to grow rapidly as the dimensionality d increases. On the other hand, the increase of the mean NMSE of the D3 algorithm is much smaller than that of the baseline method. Consequently, the mean NMSE of D3 is much smaller than that of the baseline method when the input dimensionality d is large. The difference in mean NMSE is statistically significant for data sets 1, 2, 5, and 6. The above experiments show that the dimensionality reduction scheme is useful in high-dimensional density-ratio estimation.

5.7 Summary

The direct importance estimators explained in chapter 4 tend to perform better than naively taking the ratio of density estimators. However, in high-dimensional problems, there is still room for further improvement in terms of estimation accuracy. In this chapter, we have explained how dimensionality reduction can be incorporated into direct importance estimation. The basic idea was to perform importance estimation only in a subspace where the two densities are significantly different. We called this framework direct density-ratio estimation with dimensionality reduction (D3). Finding such a subspace, which we called the hetero-distributional subspace, is a key challenge in this context.
We explained that the hetero-distributional subspace can be identified by finding a subspace in which training and test samples are maximally separated. We showed that the accuracy of the unconstrained least-squares importance fitting (uLSIF; [88]) algorithm can be improved by combining uLSIF with a supervised dimensionality reduction method called local Fisher discriminant analysis (LFDA; [154]). We chose LFDA because it was shown to be superior in terms of both accuracy and computational efficiency. On the other hand, supervised dimensionality reduction is one of the most active research topics, and better methods will be developed in the future. The framework of D3 allows one to use any supervised dimensionality reduction method for importance estimation. Thus, if better methods of supervised dimensionality reduction are developed in the
[Figure 5.10: NMSE as a function of the input dimensionality d (d = 2, ..., 10) for the baseline method (no dimensionality reduction, dashed) and D3 (solid), shown in six panels: (a) data set 1, (b) data set 2, (c) data set 3, (d) data set 4, (e) data set 5, (f) data set 6.]

Figure 5.10
Mean NMSE of the estimated density-ratio functions over 100 runs. For each d, the t-test at the significance level 5 percent is performed, and the best method, as well as the comparable method in terms of mean NMSE, is indicated by "x".
future, the new dimensionality reduction methods can be incorporated into the direct importance estimation framework.

The D3 formulation explained in this chapter assumes that the components inside and outside the hetero-distributional subspace are statistically independent. A possible generalization of this framework is to weaken this condition, for example, by following the line of [173].

We focused on a linear hetero-distributional subspace. Another possible generalization of D3 would be to consider a nonlinear hetero-distributional manifold. Using a kernelized version of LFDA [154] is one possibility. Another possibility is to use a mixture of probabilistic principal component analyzers in importance estimation [205].
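As a concrete illustration, the D3 idea can be sketched in a few lines: estimate the density ratio with uLSIF, but only after projecting the data onto a given hetero-distributional subspace. This is a minimal sketch, not the book's reference implementation: the projection matrix U is assumed to come from a supervised dimensionality reduction method such as LFDA, and the kernel width and ridge parameter are fixed rather than chosen by cross-validation.

```python
import numpy as np

def ulsif(x_tr, x_te, sigma=1.0, lam=0.1):
    """Unconstrained least-squares importance fitting (uLSIF).

    Models w(x) ~ p_te(x)/p_tr(x) as a linear combination of Gaussian
    kernels centered at (a subset of) the test points; the coefficients
    have an analytic ridge-regression form.
    """
    centers = x_te[:min(100, len(x_te))]           # kernel centers
    def phi(x):                                    # Gaussian design matrix
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2 * sigma ** 2))
    Phi_tr, Phi_te = phi(x_tr), phi(x_te)
    H = Phi_tr.T @ Phi_tr / len(x_tr)              # training expectation of phi phi^T
    h = Phi_te.mean(axis=0)                        # test expectation of phi
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    alpha = np.maximum(alpha, 0.0)                 # keep the ratio non-negative
    return lambda x: phi(x) @ alpha

def d3_importance(x_tr, x_te, U, **kw):
    """D3: estimate the ratio only inside the hetero-distributional
    subspace spanned by the rows of the projection matrix U."""
    w_sub = ulsif(x_tr @ U.T, x_te @ U.T, **kw)
    return lambda x: w_sub(x @ U.T)
```

On data where only one coordinate shifts between training and test, a one-row U suffices and the estimated weights are large exactly where the test density exceeds the training density.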
6 Relation to Sample Selection Bias

One of the most famous works on learning under a changing environment is Heckman's method for coping with sample selection bias [76, 77]. Sample selection bias has been proposed and extensively studied in econometrics and sociology, and Heckman received the Nobel Prize in economics in 2000. Sample selection bias refers to the situation where the training data set consists of nonrandomly selected (i.e., biased) samples. Data samples collected through Internet surveys typically suffer from sample selection bias: samples corresponding to those who do not have access to the Internet are completely missing. Since conservative people, such as the elderly, are underrepresented in Internet surveys, rather progressive conclusions tend to be drawn.

Heckman introduced a sample selection model in his seminal papers [76, 77], which characterized the selection bias based on parametric assumptions. He then proposed a two-step procedure for correcting sample selection bias. In this chapter, we briefly explain Heckman's model and his two-step procedure, and discuss their relation to the covariate shift approach.

6.1 Heckman's Sample Selection Model

In order to deal with sample selection bias, Heckman considered a linear regression model specifying the selection process, in addition to the regression model that characterizes the target behavioral relationship for the entire population. The sign of the latent response variable of this selection regression model determines whether each sample is observed or not. The key assumption in this model is that the error terms of the two regression models are correlated with one another, causing nonrandom sample selection (see figure 6.1).
[Figure 6.1: scatter plot over x ∈ [0, 1]; circles (observed samples) lie above crosses (missing samples) for small x, and the OLS fit (solid line) deviates from the true regression (dotted line).]

Figure 6.1
A numerical example of sample selection bias. This data set was generated according to Heckman's model. The circles denote observed samples, and the crosses are missing samples. Because of the large positive correlation between the error terms (ρ = 0.9), only the samples with higher responses y are observed when the covariate x is small. Thus, the OLS estimator (solid line) computed from the observed samples is significantly biased from the true populational regressor (dotted line).

Let Y₁* be the target quantity on which we want to make inference. Heckman assumed a linear regression model

Y₁* = xᵀβ + U₁,   (6.1)

where x is a covariate and U₁ is an error term. However, Y₁* is not always observed. Samples are selected depending on another random variable Y₂*; the variable Y₁* is observable only if Y₂* > 0. Let Y be the observed response variable, that is,

Y = Y₁*  if Y₂* > 0,
Y = missing value  otherwise.   (6.2)

The variable Y₂* obeys another linear regression model,

Y₂* = zᵀγ + U₂,   (6.3)
where z is a covariate and U₂ is an error term. Note that the covariates x in equation 6.1 and z in equation 6.3 can contain common components.

In Heckman's papers [76, 77], the error terms U₁ and U₂ are assumed to be jointly bivariate-normally distributed with mean zero and covariance matrix Σ, conditioned on x and z. Since only the sign of the latent variable Y₂* matters later, we can fix the variance of U₂ at 1 without loss of generality. We denote the variance of U₁ by σ² and the covariance between U₁ and U₂ by ρσ, where ρ is the correlation coefficient:

Σ = ( σ²  ρσ
      ρσ  1  ).

If there is no correlation between U₁ and U₂ (ρ = 0), the selection process (equation 6.3) is independent of the regression model (equation 6.1) of interest. This situation is called missing at random (see [104]). On the other hand, if the error terms are correlated (ρ ≠ 0), the selection process makes the distribution of Y different from that of Y₁*. For example, the expectation of the observation Y is biased from that of Y₁* (i.e., xᵀβ).

[Figure 6.2: densities of Y₂* for (a) zᵀγ ≫ 0 and (b) zᵀγ ≪ 0, with the selection bias indicated in each panel.]

Figure 6.2
Sample selection biases for different selection probabilities in Heckman's model (when the correlation ρ is nonzero). The symbol ○ denotes the mean of Y₁* for all samples, while the symbol × denotes the mean of Y₁* only for Y₂* > 0. (a) If zᵀγ is large in the positive direction, most of the samples are observed and the selection bias is small. (b) If zᵀγ is large in the negative direction, only a small portion of samples can be observed and the sample selection bias becomes large.

Figure 6.2 illustrates the sample selection bias for different selection probabilities when the correlation ρ is nonzero. If the mean of Y₂*, zᵀγ, is large in the positive direction, most of the samples are observed and the selection bias
is small (see figure 6.2a). On the other hand, if zᵀγ is large in the negative direction, only a small portion of samples can be observed and the sample selection bias becomes large (see figure 6.2b). As illustrated in figure 6.3, the bias becomes positive in positively correlated cases (ρ > 0) and negative in negatively correlated cases (ρ < 0). Indeed, the conditional expectation of the observation Y, given x, can be expressed as

E[Y | x, z] = E[Y₁* | x, z, Y₂* > 0]
            = xᵀβ + E[U₁ | U₂ > −zᵀγ],   (6.4)

where the second term is the selection bias from the conditional expectation E[Y₁* | x] = xᵀβ of the latent variable Y₁*. Heckman calculated the second term in equation 6.4 explicitly, as explained below.

[Figure 6.3: joint densities of (Y₁*, Y₂*) for (a) no correlation (ρ = 0), (b) positive correlation (ρ > 0), and (c) negative correlation (ρ < 0), with the selection bias indicated.]

Figure 6.3
Sample selection bias corresponding to no, positive, and negative correlation ρ. When ρ ≠ 0, the conditional mean of the observed area (Y₂* ≥ 0) differs from the unconditional average xᵀβ. In Heckman's model, this gap (which is the sample selection bias) can be calculated explicitly.
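The effect in figure 6.1 is easy to reproduce numerically. The sketch below draws correlated error pairs (U₁, U₂), keeps a sample only when Y₂* > 0, and fits OLS to the kept samples; the parameter values here are illustrative assumptions (only ρ = 0.9 is taken from the figure caption), not the ones used in the book.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
beta = np.array([1.0, 5.0])      # behavioral model (assumed values)
gamma = np.array([-1.0, 2.0])    # selection model (assumed values)
rho, sigma = 0.9, 1.0            # rho = 0.9 as in figure 6.1

x = rng.uniform(0, 1, n)
X = np.column_stack([np.ones(n), x])          # intercept + covariate
# correlated errors (U1, U2): Var(U1) = sigma^2, Var(U2) = 1, Corr = rho
cov = [[sigma ** 2, rho * sigma], [rho * sigma, 1.0]]
U = rng.multivariate_normal([0.0, 0.0], cov, n)
y1 = X @ beta + U[:, 0]                        # latent outcome (eq. 6.1)
y2 = X @ gamma + U[:, 1]                       # latent selection (eq. 6.3)
obs = y2 > 0                                   # observed iff Y2* > 0 (eq. 6.2)

# OLS on the selected samples only: biased whenever rho != 0
b_ols = np.linalg.lstsq(X[obs], y1[obs], rcond=None)[0]
```

With ρ = 0.9 and this choice of γ, selection is rare at small x, so only high-error samples survive there: the fitted intercept is pushed up and the slope pushed down, exactly the distortion sketched in figure 6.1.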
6.2 Distributional Change and Sample Selection Bias

As defined in equation 6.1, the conditional density of the latent variable Y₁*, given x, is Gaussian:

p*(y | x; β, σ) = (1/σ) φ((y − xᵀβ)/σ),   (6.5)

where φ denotes the density function of the standard Gaussian distribution N(0, 1). Thanks to the Gaussian assumption on the error terms U₁ and U₂, we can also characterize the conditional distribution of the observation Y explicitly.

Let D be an indicator variable taking the value 1 if Y is observed, and 0 if not, that is, D = I(Y₂* > 0), where I(·) is the indicator function. The assumption is that in the selection process (equation 6.3), we cannot observe the value of Y₂*; we can observe only D (i.e., the sign of Y₂*). The conditional probability of D = 1 can be expressed as

P(D = 1 | x, z; γ) = P(Y₂* > 0 | x, z; γ)
                   = Φ(zᵀγ),   (6.6)

where Φ is the cumulative distribution function of the standard Gaussian (i.e., the error function). This binary-response distribution is called the probit model in statistics.

By the definition of the indicator D and the selection process (equation 6.3), the conditional distribution of the observation Y given D = 1, x, and z is rewritten as

P_O(y | x, z) = P(Y ≤ y | D = 1, x, z)
             = P(Y ≤ y, D = 1 | x, z) / P(D = 1 | x, z)
             = P(Y₁* ≤ y, Y₂* > 0 | x, z) / P(D = 1 | x, z).   (6.7)

From equation 6.6, the denominator of equation 6.7 is expressed by the error function Φ.
[Figure 6.4: contour plot of the bivariate Gaussian density of (Y₁*, Y₂*) with the region {Y₁* ≤ y, Y₂* > 0} shaded.]

Figure 6.4
In order to compute the distribution function of the observable Y in equation 6.7, the shaded area of the bivariate Gaussian distribution should be integrated.

In order to calculate the numerator of equation 6.7, we need to integrate the correlated Gaussian distribution over the region illustrated in figure 6.4 for each fixed y. To this end, the error term U₁ is decomposed into two terms:

U₁ = σ√(1 − ρ²) Ū₁ + ρσ U₂,

where Ū₁ is independent of U₂. Although the resulting formula still contains a single integral, by taking the partial derivative with respect to y, the conditional density p_O can be expressed using the density φ of the Gaussian and the error function Φ as shown below:

p_O(y | x, z; β, γ, ρ, σ) = ∂/∂y P_O(y | x, z)
  = 1/(σ Φ(zᵀγ)) · φ((y − xᵀβ)/σ) · Φ( (ρ(y − xᵀβ)/σ + zᵀγ) / √(1 − ρ²) ).   (6.8)

Note that if the selection process (equation 6.3) is independent of Y₁* (ρ = 0), this conditional density reduces to that of Y₁* (see equation 6.5). The detailed derivation can be found in [19].

Figure 6.5 shows the conditional density function (equation 6.8) for different sample selection thresholds when ρ = 0.5 and ρ = −0.5. Here, we took the standard Gaussian N(0, 1) for the distribution of Y₁*; that is, the other parameters are set to xᵀβ = 0 and σ = 1. The smaller the mean zᵀγ of Y₂* is, the larger the shift is from the unconditional distribution. In other words, we suffer
from larger selection bias when the threshold is relatively higher, and therefore a smaller portion of the entire population is observed (see figure 6.2).

[Figure 6.5: conditional density curves for (a) ρ = 0.5 and (b) ρ = −0.5.]

Figure 6.5
Conditional density functions for ρ = 0.5 and ρ = −0.5 (xᵀβ = 0 and σ = 1), denoted by the solid lines. The selection thresholds are set to zᵀγ = −3, −2, −1, 0, 1 from closest to farthest. The dashed line denotes the Gaussian density when ρ = 0.

The conditional expectation and variance of Y can be calculated from the moment generating function. As in the previous calculation, the resulting formula contains a single integral, but its derivative at the origin can be written explicitly with the Gaussian density φ and the error function Φ as follows:

E[Y | x, z, D = 1] = xᵀβ + ρσ λ(−zᵀγ),
V[Y | x, z, D = 1] = σ² { 1 − ρ² λ(−zᵀγ) (λ(−zᵀγ) + zᵀγ) }.   (6.9)

The function λ(t) = φ(t)/Φ(−t) is often called the inverse Mills ratio, and is also known as the hazard ratio in survival data analysis. As plotted in figure 6.6, it is a monotonically increasing function of t, and can be approximated by a linear function over a wide range of its argument.

6.3 The Two-Step Algorithm

Based on the theoretical development in the previous section, Heckman [76, 77] treated sample selection bias as an ordinary model specification error or "omitted variable" bias. More specifically, he reformulated the linear regression
model (equation 6.1) as

Y = E[Y | D = 1, x, z] + V
  = xᵀβ + ρσ λ(−zᵀγ) + V,   (6.10)

where V has mean 0 conditioned on being observable, that is, E[V | D = 1, x, z] = 0.

[Figure 6.6: plot of the hazard ratio λ(t) for t ∈ [−3, 3].]

Figure 6.6
Hazard ratio.

If the hazard ratio term were available or, equivalently, the parameter γ of the selection process (equation 6.3) were known, it would become just a linear regression problem with the extended covariates [x, λ(−zᵀγ)], and then ordinary least squares (OLS) would give a consistent estimator of the regression parameters β and ρσ. However, we need to estimate the parameter γ from the data as well.

In [77], Heckman considers a situation where the covariate z in the selection process can be observed in any case, that is, regardless of whether Y₁* is observed or not. Then it is possible to estimate the selection parameter γ from the input-output pairs of (z, D) by probit analysis [114], which is the first step of Heckman's algorithm. More specifically, the parameter γ is determined as

γ̂ := argmax_γ [ Σ_{i=1}^{n₀} log Φ(z_iᵀγ) + Σ_{i=n₀+1}^{n} log {1 − Φ(z_iᵀγ)} ],   (6.11)

where the first n₀ samples are the observed ones and the remaining n − n₀ samples are missing.
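The hazard ratio appearing in equations 6.10 and 6.11 is straightforward to compute; a minimal sketch using only the standard library (the closed form λ(t) = φ(t)/Φ(−t) is from the text; the large-argument check λ(t) ≈ t + 1/t is a standard asymptotic property, not stated in the book):

```python
import math

def gauss_pdf(t):
    """Standard Gaussian density phi(t)."""
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

def gauss_cdf(t):
    """Standard Gaussian cumulative distribution function Phi(t)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def inverse_mills(t):
    """Hazard ratio / inverse Mills ratio: lambda(t) = phi(t) / Phi(-t)."""
    return gauss_pdf(t) / gauss_cdf(-t)
```

As figure 6.6 suggests, λ is monotonically increasing and nearly linear for large arguments, which is exactly what makes it a potential source of collinearity in the second OLS step.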
Input: observed samples {(x_i, z_i, y_i)}_{i=1}^{n₀} and missing samples {z_i}_{i=n₀+1}^{n}
Output: parameter estimates (β̂, γ̂, σ̂, ρ̂)

1. Fit the selection model (i.e., estimate γ) by probit analysis (equation 6.11) from all input-output pairs {(z_i, 1)}_{i=1}^{n₀} and {(z_i, 0)}_{i=n₀+1}^{n}.
2. Calculate the estimates of the hazard ratio by plugging in γ̂ for the observed samples: λ̂_i = λ(−z_iᵀγ̂), i = 1, ..., n₀.
3. Estimate the parameters β and ρσ in the modified linear regression model (equation 6.10) by ordinary least squares (OLS) based on the extended covariates {(x_i, λ̂_i)}_{i=1}^{n₀} and the observed responses {y_i}_{i=1}^{n₀}. The resulting estimators β̂ and ρσ̂ are consistent estimators of β and ρσ, respectively.
4. Estimate the parameters σ² and ρ from the residuals (see the conditional variance in equation 6.9), and set ρ̂ = ρσ̂ / σ̂.

Figure 6.7
Pseudo code of Heckman's two-step algorithm. Step 4 is not necessary if we are interested only in prediction.

Since the probit estimator γ̂ is consistent, the plug-in hazard ratio λ(−zᵀγ̂) can be used as the extra covariate in the second OLS step. This is the core idea of Heckman's two-step algorithm. A pseudo code of Heckman's two-step procedure is summarized in figure 6.7.

Heckman considered his estimator to be useful for providing good initial values for more efficient methods or for exploratory analysis. Nevertheless, it has become a standard way to obtain final results for the sample selection model. However, there are several criticisms of Heckman's two-step algorithm [126], such as statistical inefficiency, collinearity, and the restrictive Gaussian-noise assumption.
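Figure 6.7 translates almost directly into code. The sketch below is a bare-bones illustration under the assumptions of this chapter (probit selection, Gaussian errors): it implements steps 1-3 only, fits the probit by generic numerical optimization rather than a dedicated routine, and omits the standard-error corrections a real implementation would need.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def heckman_two_step(X, Z_obs, y, Z_mis):
    """Steps 1-3 of figure 6.7.  X, Z_obs, y: covariates and responses of
    the n0 observed samples; Z_mis: selection covariates of the missing
    samples.  Returns (beta_hat, rho_sigma_hat, gamma_hat)."""
    # Step 1: probit fit of the selection model (equation 6.11).
    Z = np.vstack([Z_obs, Z_mis])
    d = np.r_[np.ones(len(Z_obs)), np.zeros(len(Z_mis))]
    def probit_nll(g):
        p = norm.cdf(Z @ g).clip(1e-10, 1 - 1e-10)
        return -(d * np.log(p) + (1 - d) * np.log(1 - p)).sum()
    gamma_hat = minimize(probit_nll, np.zeros(Z.shape[1])).x
    # Step 2: plug-in hazard ratios lambda(-z'gamma) = phi(z'gamma)/Phi(z'gamma).
    t = Z_obs @ gamma_hat
    lam = norm.pdf(t) / norm.cdf(t)
    # Step 3: OLS with the extended covariates [x, lambda_hat] (equation 6.10).
    coef = np.linalg.lstsq(np.column_stack([X, lam]), y, rcond=None)[0]
    return coef[:-1], coef[-1], gamma_hat
```

On simulated data with an exclusion variable in z (a predictor of selection absent from x, as the collinearity caveat in the text recommends), the recovered β̂ and ρσ̂ are close to the true values.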
At first, Heckman's estimator is statistically inefficient. Because the modified regression problem (equation 6.10) has heterogeneous variance, as shown in equation 6.9, generalized least squares (GLS) [41] may be used as an alternative to OLS; GLS minimizes the squared error weighted according to the inverse of the noise variance at each sample point. However, even GLS is not an efficient estimator. An efficient estimator can be obtained by the maximum likelihood estimator (MLE), which maximizes the following log-likelihood function:

log L(β, γ, ρ, σ) = Σ_{i=1}^{n₀} log p_O(y_i | x_i, z_i; β, γ, ρ, σ) + Σ_{i=1}^{n₀} log Φ(z_iᵀγ) + Σ_{i=n₀+1}^{n} log {1 − Φ(z_iᵀγ)},

where we assume that we are given n₀ input-output pairs {(x_i, z_i, y_i)}_{i=1}^{n₀} for observed samples and (n − n₀) input-only samples (i.e., covariates) {z_i}_{i=n₀+1}^{n} for missing samples. For computing the MLE solution, gradient ascent or quasi-Newton iterations are usually employed, and Heckman's two-step estimator can be used as an initial value since it is computationally very efficient and widely available in standard software toolboxes.

The second criticism is that collinearity problems (two variables being highly correlated with one another) can occur rather frequently. For example, if all the covariates in z are included in x, the extra covariate λ(−zᵀγ̂) in the second step can be collinear with the other covariates in x, because the hazard ratio λ(t) is an approximately linear function over a wide range of its argument. In other words, in order to let Heckman's algorithm work in practice, we need variables in z which are good predictors of Y₂* and do not appear in x. Unfortunately, it is often very difficult to find such variables in practice.

Finally, Heckman's procedure relies heavily on the Gaussian assumptions for the error terms in equations 6.1 and 6.3. Instead, some authors have proposed semiparametric or nonparametric procedures with milder distributional assumptions.
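For completeness, the efficient MLE variant can also be sketched. Multiplying the density p_O of equation 6.8 by Φ(zᵀγ) gives the joint likelihood contribution of an observed pair, which cancels the 1/Φ(zᵀγ) factor and simplifies the code; the reparametrization through tanh and exp (to enforce |ρ| < 1 and σ > 0) is an implementation convenience of this sketch, not part of the book's exposition.

```python
import numpy as np
from scipy.stats import norm

def neg_log_likelihood(theta, X, Z_obs, y, Z_mis):
    """Negative of the full log-likelihood of the sample selection model.
    theta packs (beta, gamma, atanh(rho), log(sigma))."""
    k, m = X.shape[1], Z_obs.shape[1]
    beta, gamma = theta[:k], theta[k:k + m]
    rho, sigma = np.tanh(theta[k + m]), np.exp(theta[k + m + 1])
    r = (y - X @ beta) / sigma
    # log p_O(y|x,z) + log Phi(z'gamma): the 1/Phi(z'gamma) factor of
    # equation 6.8 cancels against the selection probability of equation 6.6.
    ll_obs = (norm.logpdf(r) - np.log(sigma)
              + norm.logcdf((rho * r + Z_obs @ gamma) / np.sqrt(1.0 - rho ** 2)))
    ll_mis = norm.logcdf(-(Z_mis @ gamma))    # log{1 - Phi(z'gamma)}
    return -(ll_obs.sum() + ll_mis.sum())
```

Minimizing this with a quasi-Newton routine (ideally initialized from the two-step estimates, as the text suggests) recovers all parameters jointly.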
See [119, 126] for details.

6.4 Relation to Covariate Shift Approach

In this section, we discuss similarities and differences between Heckman's approach and the covariate shift approach. For ease of comparison, we consider the case where the covariates z and x in the sample selection model coincide. Note that the distribution of the entire population in Heckman's formulation
corresponds to the test distribution in the covariate shift formulation, and that of the observed samples corresponds to the training distribution:

Covariate shift          Heckman's model
p_te(x, y)      ⟺       p(x, y)
p_tr(x, y)      ⟺       p(x, y | D = 1).

Although Heckman did not specify the probability distribution for the covariates x, we will proceed as if they were generated from a probability distribution.

One of the major differences is that in Heckman's model, the conditional distribution p(y | x, D = 1) of observed samples changes from its populational counterpart p(y | x), in addition to a change in covariate distributions. In this sense, the sample selection model deals with more general distribution changes; it is reduced to covariate shift (i.e., the conditional distribution does not change) only if the selection process is independent of the behavioral model (ρ = 0 and no selection bias). On the other hand, in Heckman's model the distributional change, including the selection bias, is computed under a very strong assumption of linear regression models with bivariate Gaussian error terms. When this assumption is not fulfilled, there is no guarantee that the selection bias can be captured reasonably. In machine learning applications, we rarely expect that statistical models at hand are correctly specified and that noise is purely Gaussian. Thus, Heckman's model is too restrictive to be used in real-world machine learning applications.

Another major difference between Heckman's approach and the covariate shift approach is the way the sample selection bias is reduced. As explained in chapters 2 and 3, covariate shift adaptation is carried out via importance weighting, which substantially increases the estimation variance. On the other hand, Heckman's correction compensates directly for the selection bias in order to make the estimator consistent, which does not increase the estimation variance.
However, the price we have to pay for the "free" bias reduction is that Heckman's procedure is not at all robust against model misspecification. The bias caused by model misspecification can be corrected by post hoc bias subtraction, but this is a hard task in general (e.g., bootstrap bias estimates are not accurate; see [45]). On the other hand, in covariate shift adaptation, importance weighting is guaranteed to asymptotically minimize the bias for misspecified models.¹ Since the increase of variance caused by importance

1. Note that when the model is correctly specified, no importance weighting is needed (see chapter 2).
weighting can be controlled by regularization with model selection, the covariate shift approach would be more useful for machine learning applications where correctly specified models are not available.

Finally, in covariate shift adaptation, the importance weight w(x) = p_te(x)/p_tr(x) is estimated in a nonparametric way over the multidimensional input space (see chapter 4). If we compute the importance weight between the joint distributions² of (x, y) for Heckman's model, we get

w(x, y) = p(x, y) / p(x, y | D = 1)
        ∝ p(y | x) / { p(y | x, D = 1) P(D = 1 | x) }
        = { Φ( (ρ(y − xᵀβ)/σ + xᵀγ) / √(1 − ρ²) ) }⁻¹.

Due to the linear and Gaussian assumptions, this importance weight has a fixed functional form specified by the error function Φ, and depends only on the one-dimensional projection xᵀ(γ − (ρ/σ)β) of the input and the response variable y. In particular, if the selection process is independent of the behavioral model (ρ = 0, when only the covariate density changes), the importance weight is reduced to

w(x, y) ∝ {Φ(xᵀγ)}⁻¹ = {P(D = 1 | x)}⁻¹,

which no longer depends on y. In this respect, the covariate shift approach covers more flexible changes of distributions than Heckman's sample selection model.

2. Such a joint importance weight can be used for multitask learning [16].
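Under the coincidence z = x assumed in this section, the joint importance weight is essentially a one-liner; a sketch (up to the constant factor P(D = 1)):

```python
import numpy as np
from scipy.stats import norm

def heckman_importance(y, x, beta, gamma, rho, sigma):
    """Joint importance weight w(x, y) implied by Heckman's model with
    z = x, up to a multiplicative constant: the inverse of
    Phi((rho (y - x'beta)/sigma + x'gamma) / sqrt(1 - rho^2))."""
    t = (rho * (y - x @ beta) / sigma + x @ gamma) / np.sqrt(1.0 - rho ** 2)
    return 1.0 / norm.cdf(t)
```

For ρ = 0 the weight no longer depends on y and collapses to 1/Φ(xᵀγ) = 1/P(D = 1 | x); for ρ > 0, larger responses y are more likely to be observed and therefore receive smaller weights.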
7 Applications of Covariate Shift Adaptation

In this chapter, we show applications of covariate shift adaptation techniques to real-world problems: the brain-computer interface in section 7.1, speaker identification in section 7.2, natural language processing in section 7.3, face-based age prediction in section 7.4, and human activity recognition from accelerometric data in section 7.5. In section 7.6, covariate shift adaptation techniques are employed for efficient sample reuse in the framework of reinforcement learning.

7.1 Brain-Computer Interface

In this section, importance-weighting methods are applied to brain-computer interfaces (BCIs), which have attracted a great deal of attention in biomedical engineering and machine learning [160, 102].

7.1.1 Background

A BCI system allows direct communication from human to machine [201, 40]. Cerebral electric activity is recorded via the electroencephalogram (EEG): electrodes attached to the scalp measure the electric signals of the brain. These signals are amplified and transmitted to the computer, which translates them into device control commands. The crucial requirement for the successful functioning of a BCI is that the electric activity on the scalp surface already reflects, for instance, motor intentions, such as the neural correlate of preparation for hand or foot movements. A BCI system based on motor imagery can detect the motor-related EEG changes and use this information, for example, to perform a choice between two alternatives: the detection of the preparation to move the left hand leads to the choice of the first control command, whereas the right hand intention would lead to the second command. By this means, it is possible to operate devices which are connected to the computer (see figure 7.1).
[Figure 7.1: schematic of the BCI loop; the subject imagines left or right hand movements, the EEG is recorded and classified, and the result is translated into device control commands.]

Figure 7.1
Illustration of the BCI system.

For classification of appropriately preprocessed EEG signals [130, 122, 101], Fisher discriminant analysis (FDA) [50] has been shown to work well [201, 39, 8]. On the other hand, strong non-stationarity effects have often been observed in brain signals between training and test sessions [194, 116, 143], which could be regarded as an example of covariate shift. This indicates that employing importance-weighting methods could further improve the BCI recognition accuracy. Here, adaptive importance-weighted FDA (AIWFDA; see section 2.3.1) is employed for coping with the non-stationarity. AIWFDA is tested on 14 data sets obtained from 5 different subjects (see table 7.1 for specification), where the task is binary classification of EEG signals.

7.1.2 Experimental Setup

Here, how the data samples are gathered and preprocessed in the BCI experiments is briefly described. Further details are given in [23, 22].

The data in this study were recorded in a series of online BCI experiments in which the event-related desynchronization [122] was used to discriminate between various mental states. During imagined hand or foot movement, the spectral power in the frequency band between 8 Hz and 35 Hz is known to decrease in the EEG signals of the corresponding motor cortices. The data acquired from 128 EEG channels at a rate of 1000 Hz were downsampled to 100 Hz and band-pass filtered to specifically selected frequency bands. The
Table 7.1
Specification of BCI data

Subject | Session ID | Dim. of Samples | # of Training Samples | # of Unlabeled Samples | # of Test Samples
1       | 1          | 3               | 280                   | 112                    | 112
1       | 2          | 3               | 280                   | 120                    | 120
1       | 3          | 3               | 280                   | 35                     | 35
2       | 1          | 3               | 280                   | 113                    | 112
2       | 2          | 3               | 280                   | 112                    | 112
2       | 3          | 3               | 280                   | 35                     | 35
3       | 1          | 3               | 280                   | 91                     | 91
3       | 2          | 3               | 280                   | 112                    | 112
3       | 3          | 3               | 280                   | 30                     | 30
4       | 1          | 6               | 280                   | 112                    | 112
4       | 2          | 6               | 280                   | 126                    | 126
4       | 3          | 6               | 280                   | 35                     | 35
5       | 1          | 2               | 280                   | 112                    | 112
5       | 2          | 2               | 280                   | 112                    | 112

common spatial patterns (CSP) algorithm [130], a spatial filter that maximizes the band power in one class while minimizing it for the other class, was applied, and the data were projected onto three to six channels. The features were finally extracted by calculating the log-variance over a window one second long.

The experiments consisted of an initial training period and three test periods. During the training period, the letters (L), (R), or (F) were displayed on the screen to instruct the subjects to imagine left hand, right hand, or foot movements. Then a classifier was trained on the two classes with the best discriminability. This classifier was then used in the test periods. In each test period, a cursor could be controlled horizontally by the subjects, using the classifier output. One of the targets on the left and right sides of the computer screen was highlighted to indicate that the subjects should then try to select this target with the cursor.

The experiments were carried out with five subjects. Note that the third test period for the fifth subject is missing, so there was a total of 14 test sets (see table 7.1 for specification). In the evaluation below, the misclassification rates of the classifiers are reported. Note that the classifier output was not directly translated into the position of a cursor on the monitor, but underwent some postprocessing steps
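The CSP-plus-log-variance pipeline described above can be sketched compactly: the spatial filters come from a generalized eigenproblem on the two class-average covariances, and each feature is the log-variance of a filtered signal. This is a minimal sketch; filter selection, normalization, and regularization details of the actual BCI system are omitted, and the trial data here are arbitrary (channels x time) arrays.

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_a, trials_b, n_filters=3):
    """Common spatial patterns: filters maximizing band power for one
    class while minimizing it for the other.  trials_*: lists of
    (channels x time) arrays.  Returns n_filters filters per class."""
    def avg_cov(trials):
        return sum(t @ t.T / np.trace(t @ t.T) for t in trials) / len(trials)
    Ca, Cb = avg_cov(trials_a), avg_cov(trials_b)
    # generalized eigenproblem: Ca w = lambda (Ca + Cb) w
    vals, vecs = eigh(Ca, Ca + Cb)
    order = np.argsort(vals)[::-1]          # class-A-dominant filters first
    sel = np.r_[order[:n_filters], order[-n_filters:]]
    return vecs[:, sel].T

def log_variance_features(trial, W):
    """Band-power feature: log-variance of each spatially filtered channel."""
    return np.log((W @ trial).var(axis=1))
```

On synthetic trials where the two classes differ in the variance of one channel each, the resulting log-variance features separate the classes cleanly.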
(averaging, scaling, and biasing). Therefore, a wrong classification did not necessarily result in the wrong target being selected. Since the time window chosen for this evaluation is at the beginning of the trial, a better classification accuracy on these data will lead to a shortened trial length, and therefore to a higher bit rate (see [23] for details).

Training samples and unlabeled/test samples were gathered in different recording sessions, so the non-stationarity in brain signals may have changed the distributions. On the other hand, the unlabeled samples and test samples were gathered in the same recording session; more precisely, the unlabeled samples were gathered in the first half of the session and the test samples (with labels) were collected in the latter half. Therefore, unlabeled samples may have contained some information on the test input distribution. However, the input distributions of unlabeled and test samples are not necessarily identical, since the non-stationarity in brain signals can cause a small change in distributions even within the same session. Thus, this setting realistically renders the classifier update in online BCI systems.

p_tr(x) and p_te(x) were estimated by maximum likelihood fitting of a multidimensional Gaussian density with full covariance matrix: p_tr(x) was estimated using training samples, and p_te(x) was estimated using unlabeled samples.

7.1.3 Experimental Results

Table 7.2 describes the misclassification rates of test samples by FDA (corresponding to AIWFDA with flattening parameter λ = 0), AIWFDA with λ chosen based on tenfold IWCV or tenfold CV, and AIWFDA with optimal λ (i.e., for each case, λ is determined so that the misclassification rate is minimized). The value of the flattening parameter λ was selected from {0, 0.1, 0.2, ..., 1.0}; the chosen flattening parameter values are also shown in the table.
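The density estimation step just described can be sketched directly: fit a full-covariance Gaussian to each sample set by maximum likelihood and take the ratio at the training points. This is a minimal sketch of the setup in the text, with no regularization of the covariance estimates.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_importance_weights(x_tr, x_unl):
    """w(x) = p_te(x)/p_tr(x) at the training points, with p_tr fitted to
    the training samples and p_te to the unlabeled samples.  ddof=0 gives
    the maximum likelihood covariance estimate."""
    def ml_fit(x):
        return multivariate_normal(x.mean(axis=0),
                                   np.cov(x, rowvar=False, ddof=0))
    return ml_fit(x_unl).pdf(x_tr) / ml_fit(x_tr).pdf(x_tr)
```

Training points that lie where the unlabeled (test-session) density is high receive large weights, so the subsequent AIWFDA fit emphasizes the region the classifier will actually be used in.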
Table 7.2 also contains the Kullback-Leibler (KL) divergence [97] from the estimated training input distribution to the estimated test input distribution. Since we wanted to have an accurate estimate of the KL divergence, we used test samples for estimating the test input distribution when computing the KL divergence (only unlabeled samples were used when the test input distribution was estimated for AIWFDA and IWCV). The KL values may be interpreted as the level of covariate shift.

First, FDA is compared with OPT (AIWFDA with optimal λ). The table shows that OPT outperforms FDA in 8 out of 14 cases. This implies that the non-stationarity in the brain signals can be modeled well by covariate shift, which motivates us to employ AIWFDA in BCI. Within each subject, it can be
Table 7.2
Misclassification rates for BCI data

Subject | Trial | OPT         | FDA      | IWCV         | CV         | KL
1       | 1     | *8.7 (0.5)  | 9.3 (0)  | −10.0 (0.9)  | 10.0 (0.9) | 0.76
1       | 2     | *6.2 (0.3)  | 8.8 (0)  | 8.8 (0)      | 8.8 (0)    | 1.11
1       | 3     | 4.3 (0)     | 4.3 (0)  | 4.3 (0)      | 4.3 (0)    | 0.69
2       | 1     | 40.0 (0)    | 40.0 (0) | ○40.0 (0)    | 41.3 (0.7) | 0.97
2       | 2     | *38.7 (0.1) | 39.3 (0) | ○+38.7 (0.2) | 39.3 (0)   | 1.05
2       | 3     | 25.5 (0)    | 25.5 (0) | 25.5 (0)     | 25.5 (0)   | 0.43
3       | 1     | *34.4 (0.2) | 36.9 (0) | +34.4 (0.2)  | 34.4 (0.2) | 2.63
3       | 2     | *18.0 (0.4) | 21.3 (0) | +19.3 (0.6)  | 19.3 (0.9) | 2.88
3       | 3     | *15.0 (0.6) | 22.5 (0) | +17.5 (0.3)  | 17.5 (0.4) | 1.25
4       | 1     | 20.0 (0.2)  | 21.3 (0) | 21.3 (0)     | 21.3 (0)   | 9.23
4       | 2     | 2.4 (0)     | 2.4 (0)  | 2.4 (0)      | 2.4 (0)    | 5.58
4       | 3     | 6.4 (0)     | 6.4 (0)  | 6.4 (0)      | 6.4 (0)    | 1.83
5       | 1     | 21.3 (0)    | 21.3 (0) | 21.3 (0)     | 21.3 (0)   | 0.79
5       | 2     | *13.3 (0.5) | 15.3 (0) | ○+14.0 (0.1) | 15.3 (0)   | 2.01

All values are in percent. IWCV or CV refers to AIWFDA with λ chosen by tenfold IWCV or tenfold CV. OPT refers to AIWFDA with optimal λ. Values of the chosen λ are in parentheses (FDA corresponds to λ = 0). "*" indicates the cases where OPT is better than FDA. "+" marks the cases where IWCV outperforms FDA, and "−" the opposite cases where FDA outperforms IWCV. "○" denotes the cases where IWCV outperforms CV. KL refers to the Kullback-Leibler divergence between the (estimated) training and test input distributions.

observed that OPT outperforms FDA when the KL divergence is large, but they are comparable to one another when the KL divergence is small. This agrees well with the theory that AIWFDA corrects the bias caused by covariate shift, and AIWFDA is reduced to plain FDA in the absence of covariate shift.

Next, IWCV (applied to AIWFDA) is compared with FDA. IWCV outperforms FDA in five cases, whereas the opposite case occurs only once. Therefore, IWCV combined with AIWFDA certainly contributes to improving the classification accuracy in BCI. The table also shows that within each subject, IWCV tends to outperform FDA when the KL divergence is large.
Finally, IWCV is compared with CV (applied to AIWFDA). IWCV outperforms CV in three cases, and the opposite case does not occur. This substantiates the effectiveness of IWCV as a model selection criterion in BCI. IWCV tends to outperform CV when the KL divergence is large within each subject.

The above results show that non-stationarity in brain signals can be successfully compensated for by IWCV combined with AIWFDA, which certainly contributes to improving the recognition accuracy of BCI.
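The KL divergence used above as the measure of covariate shift has a closed form for two Gaussians; a sketch (the direction is from the estimated training distribution N(μ₀, Σ₀) to the estimated test distribution N(μ₁, Σ₁), as in the text):

```python
import numpy as np

def kl_gaussians(mu0, S0, mu1, S1):
    """KL(N(mu0, S0) || N(mu1, S1)) in closed form:
    0.5 [tr(S1^-1 S0) + (mu1-mu0)' S1^-1 (mu1-mu0) - d + ln det S1 - ln det S0]."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    dm = mu1 - mu0
    _, ld0 = np.linalg.slogdet(S0)
    _, ld1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + dm @ S1_inv @ dm - d + ld1 - ld0)
```

For identical Gaussians the divergence is zero, and for two unit-variance Gaussians whose means differ by one it equals 0.5, giving a quick sense of the scale of the KL column in table 7.2.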
7.2 Speaker Identification

In this section, we describe an application of covariate shift adaptation techniques to speaker identification [204].

7.2.1 Background

Speaker identification methods are widely used in real-world situations such as controlling access to information service systems, speaker detection in speech dialogue, and speaker indexing problems with large audio archives [133]. Recently, the speaker identification and indexing problems have attracted a great deal of attention.

Popular methods of text-independent speaker identification are based on the Gaussian mixture model [132] or kernel methods such as the support vector machine [29, 110]. In these supervised learning methods, it is implicitly assumed that training and test data follow the same probability distribution. However, since speech features vary over time due to session-dependent variation, changes in the recording environment, and physical conditions/emotions, the training and test distributions are not necessarily the same in practice. In a paper by Furui [59], the influence of the session-dependent variation of voice quality in speaker identification problems is investigated, and the identification performance is shown to decrease significantly over three months; the major cause of the performance degradation is the characteristic variation of the voice source.

To alleviate the influence of session-dependent variation, it is popular to use several sessions of speaker utterance samples [111, 113] or to use cepstral mean normalization [58]. However, gathering several sessions of speaker utterance data and assigning the speaker ID to the collected data are expensive in both time and cost, and therefore not realistic in practice. Moreover, it is not possible to perfectly remove the session-dependent variation by cepstral mean normalization alone.
A more practical and effective setup is semisupervised learning [30], where unlabeled samples are additionally given from the testing environment. In semisupervised learning, it is required that the training and test probability distributions be related to each other in some sense; otherwise, we may not be able to learn anything about the test distribution from the training samples. Below, the semisupervised speaker identification problem is formulated as a covariate shift adaptation problem [204].

7.2.2 Formulation

Here, the speaker identification problem is formulated.
An utterance feature X pronounced by a speaker is expressed as a set of N mel-frequency cepstrum coefficient (MFCC) vectors [129] of d dimensions:

X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{d \times N}.

For training, we are given n_{tr} labeled utterance samples \{(X_i^{tr}, y_i^{tr})\}_{i=1}^{n_{tr}}, where y_i^{tr} \in \{1, 2, \ldots, C\} denotes the index of the speaker who pronounces X_i^{tr}. The goal of speaker identification is to predict the speaker index of a test utterance sample X^{te} based on the training samples. The speaker index c of a test sample X^{te} is predicted based on the Bayes decision rule:

\widehat{c} = \mathop{\mathrm{argmax}}_{c} \ \widehat{p}(y = c \mid X^{te}).

For approximating the class-posterior probability, the following logistic model is used:

\widehat{p}(y = c \mid X) = \frac{\exp(f_c(X))}{\sum_{c'=1}^{C} \exp(f_{c'}(X))},

where f_c(X) is a discriminant function corresponding to speaker c. The following kernel model is used as the discriminant function [113]:

f_c(X) = \sum_{\ell=1}^{n_{tr}} \theta_{c,\ell} K(X, X_\ell^{tr}), \quad c = 1, 2, \ldots, C,

where \{\theta_{c,\ell}\}_{\ell=1}^{n_{tr}} are parameters corresponding to speaker c and K(X, X') is a kernel function. The sequence kernel [110] is used as the kernel function here because it allows us to handle features of different size; for two utterance samples X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{d \times N} and X' = [x'_1, x'_2, \ldots, x'_{N'}] \in \mathbb{R}^{d \times N'} (generally N \neq N'), the sequence kernel is defined as

K(X, X') = \frac{1}{N N'} \sum_{i=1}^{N} \sum_{i'=1}^{N'} k(x_i, x'_{i'}),
where k(x, x') is a vectorial kernel. We use the Gaussian kernel here:

k(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right).

The parameters \{\theta_{c,\ell}\}_{\ell=1}^{n_{tr}} are learned so that a penalized negative log-likelihood is minimized, where \lambda (\geq 0) is the regularization parameter.

In practical speaker identification tasks, speech features are not stationary due to time-dependent voice variation, changes in the recording environment, and physical conditions/emotions. Thus, the training and test feature distributions are not generally the same. Here, we deal with such changing environments via the covariate shift model, and employ importance-weighted logistic regression (IWLR; see section 2.3.2) with the sequence kernel. The tuning parameters in kernel logistic regression are chosen by importance-weighted cross-validation (IWCV; see section 3.3), where the importance weights are learned by the Kullback-Leibler importance estimation procedure (KLIEP; see section 4.4) with the sequence kernel.

7.2.3 Experimental Results

Training and test samples were collected from ten male speakers, and two types of experiments were conducted: text-dependent and text-independent speaker identification. In text-dependent speaker identification, the training and test sentences were common to all speakers. In text-independent speaker identification, the training sentences were common to all speakers, but the test sentences were different from the training sentences. The NTT data set [112] was used here. Each speaker uttered several Japanese sentences for the text-dependent and text-independent speaker identification evaluations.
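As an aside, the sequence kernel defined above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the function name and the toy frame counts are our own choices.

```python
import numpy as np

def sequence_kernel(X, Xp, sigma):
    """Sequence kernel K(X, X') = (1/(N N')) sum_{i,i'} k(x_i, x'_{i'}),
    with the vectorial Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).

    X : (d, N) array of N frame vectors; Xp : (d, N') array (N and N' may differ).
    """
    # Pairwise squared distances between all frame pairs of the two utterances.
    sq = (np.sum(X ** 2, axis=0)[:, None]
          + np.sum(Xp ** 2, axis=0)[None, :]
          - 2.0 * X.T @ Xp)
    # Averaging over all N * N' pairs gives the sequence kernel value.
    return float(np.exp(-sq / (2.0 * sigma ** 2)).mean())

# Two toy "utterances" with different numbers of 26-dimensional frames.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(26, 30))   # one 300 ms segment of size 26 x 30, as in the text
X2 = rng.normal(size=(26, 45))   # an utterance of a different length
k12 = sequence_kernel(X1, X2, sigma=1.0)
```

Because every summand lies in (0, 1], the kernel value does too, and K(X, X') = K(X', X), so the resulting Gram matrix over utterances is symmetric and can be plugged directly into kernel logistic regression.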
The following three sentences were used as training and test samples in the text-dependent speaker identification experiments (Japanese sentences written using the Hepburn system of romanization):

• seno takasawa hyakunanajusseNchi hodode mega ookiku yaya futotteiru,
• oogoeo dashisugite kasuregoeni natte shimau,
• tashizaN hikizaNwa dekinakutemo eha kakeru.
In the text-independent speaker identification experiments, the same three sentences were used as training samples:

• seno takasawa hyakunanajusseNchi hodode mega ookiku yaya futotteiru,
• oogoeo dashisugite kasuregoeni natte shimau,
• tashizaN hikizaNwa dekinakutemo eha kakeru.

The following five sentences were used as test samples:

• tobujiyuuwo eru kotowajiNruino yume datta,
• hajimete ruuburubijutsukaNe haittanowa juuyoneNmaeno kotoda,
• jibuNno jitsuryokuwa jibuNga ichibaN yoku shitteiru hazuda,
• koremade shouneNyakyuu mamasaN bareenado chiikisupootsuo sasae shimiNni micchakushite kitanowamusuuno boraNtiadatta,
• giNzakeno tamagoo yunyuushite fukasase kaichuude sodateru youshokumo hajimatteiru.

The utterance samples for training were recorded in 1990/12, and the utterance samples for testing were recorded in 1991/3, 1991/6, and 1991/9, respectively. Since the recording times of the training and test utterance samples are different, voice quality variation is expected to be included. Thus, this speaker identification problem is a challenging task. The total duration of the training sentences is about 9 seconds. The durations of the test sentences for text-dependent and text-independent speaker identification are 9 seconds and 24 seconds, respectively. There are approximately ten vowels in the sentences for every 1.5 seconds. The input utterance is sampled at 16 kHz. A feature vector consists of 26 components: 12 mel-frequency cepstrum coefficients, the normalized log energy, and their first derivatives. Feature vectors are derived every 10 milliseconds over 25.6-millisecond Hamming-windowed speech segments, and cepstral mean normalization is applied over the features to remove channel effects. Each utterance is divided into 300-millisecond disjoint segments, each of which corresponds to a set of features of size 26 × 30.
Thus, the training set \{(X_i^{tr}, y_i^{tr})\}_{i=1}^{n_{tr}} is given for both the text-independent and text-dependent speaker identification evaluations. For text-independent speaker identification, the sets of test samples for 1991/3, 1991/6, and 1991/9 are given as
X^{te1} = \{X_i^{te1}\}_{i=1}^{907}, \quad X^{te2} = \{X_i^{te2}\}_{i=1}^{919}, \quad \text{and} \quad X^{te3} = \{X_i^{te3}\}_{i=1}^{906}.

For text-dependent speaker identification, the sets of test data are given as

X^{te1} = \{X_i^{te1}\}_{i=1}^{407}, \quad X^{te2} = \{X_i^{te2}\}_{i=1}^{407}, \quad \text{and} \quad X^{te3} = \{X_i^{te3}\}_{i=1}^{412}.

In the above, we assume that the samples are sorted according to time, with the index i = 1 being the first sample. The speaker identification rate was computed every 1.5 seconds, 3.0 seconds, and 4.5 seconds, and the speaker was identified based on the average posterior probability over the most recent m segments, where m = 5, 10, and 15 for 1.5 seconds, 3.0 seconds, and 4.5 seconds, respectively.

Here, the product-of-Gaussians model (PoG) [81], plain logistic regression (LR) with the sequence kernel, and importance-weighted logistic regression (IWLR) with the sequence kernel were tested, and the speaker identification rates were compared on the data taken in 1991/3, 1991/6, and 1991/9. For PoG and LR training, we used the 1990/12 data set (inputs X^{tr} and their labels). For PoG training, the means, diagonal covariance matrices, and mixing coefficients were initialized by the results of k-means clustering on all training sentences for all speakers; these parameters were then estimated via the EM algorithm [38, 21] for each speaker. The number of components, b, was determined by fivefold cross-validation. In the test phase of PoG, the probability for speaker c was computed as

P_c(X_t^{te}) = \prod_{i=1}^{m} \prod_{j=1}^{N} q_c(x_{t-i+1, j}^{te}),

where X_t^{te} = [x_{t,1}^{te}, x_{t,2}^{te}, \ldots, x_{t,N}^{te}] and q_c(x) is the PoG for speaker c. In the current setup, N = 30 and m = 5, 10, and 15 (which correspond to 1.5 seconds, 3.0 seconds, and 4.5 seconds, respectively). For IWLR training, the unlabeled samples X^{te1}, X^{te2}, and X^{te3} were used in addition to the training inputs X^{tr} and their labels (i.e., semisupervised learning).
We first estimated the importance weights from the training and test data set pairs (X^{tr}, X^{te1}), (X^{tr}, X^{te2}), or (X^{tr}, X^{te3}) by KLIEP with fivefold CV (see section 4.4),
and we used fivefold IWCV (see section 3.3) to decide the Gaussian kernel width σ and the regularization parameter λ. In practice, the k-fold CV and k-fold IWCV scores can be strongly affected by the way the data samples are split into k disjoint subsets (we used k = 5). This phenomenon is due to the non-i.i.d. nature of the mel-frequency cepstrum coefficient features, which differs from the theory. To obtain reliable experimental results, the CV procedure was repeated 50 times with different random data splits, and the highest score was used for model selection.

Table 7.3 shows the text-independent speaker identification rates in percent for 1991/3, 1991/6, and 1991/9. IWLR refers to IWLR with σ and λ chosen by fivefold IWCV, LR refers to LR with σ and λ chosen by fivefold CV, and PoG refers to PoG with the number of mixtures b chosen by fivefold CV. The chosen values of these hyperparameters are given in parentheses. DoDC (degree of distribution change) refers to the standard deviation of the estimated importance weights \{\widehat{w}(X_i^{tr})\}_{i=1}^{n_{tr}}; the smaller the standard deviation is, the "flatter" the importance weights are. Flat importance weights imply that

Table 7.3
Correct classification rates for text-independent speaker identification

1991/3 (DoDC = 0.34)   IWLR (σ = 1.4, λ = 10^-4)   LR (σ = 1.0, λ = 10^-2)   PoG (b = 16)
1.5 sec.               91.0                         88.2                      89.7
3.0 sec.               95.0                         92.9                      94.4
4.5 sec.               97.7                         96.1                      94.6

1991/6 (DoDC = 0.37)   IWLR (σ = 1.3, λ = 10^-4)   LR (σ = 1.0, λ = 10^-2)   PoG (b = 16)
1.5 sec.               91.0                         87.7                      90.2
3.0 sec.               95.3                         91.1                      94.0
4.5 sec.               97.4                         93.4                      96.1

1991/9 (DoDC = 0.35)   IWLR (σ = 1.2, λ = 10^-4)   LR (σ = 1.0, λ = 10^-2)   PoG (b = 16)
1.5 sec.               94.8                         91.7                      92.1
3.0 sec.               97.9                         96.3                      95.0
4.5 sec.               98.8                         98.3                      95.8

All values are in percent. IWLR refers to IWLR with (σ, λ) chosen by fivefold IWCV, LR refers to LR with (σ, λ) chosen by fivefold CV, and PoG refers to PoG with the number of components b chosen by fivefold CV.
The chosen values of these hyperparameters are given in parentheses. DoDC refers to the standard deviation of the estimated importance weights \{\widehat{w}(X_i^{tr})\}_{i=1}^{n_{tr}}, which roughly indicates the degree of distribution change.
there is no significant distribution change between the training and test phases. Thus, the standard deviation of the estimated importance weights may be regarded as a rough indicator of the degree of distribution change.

As can be seen from the table, IWLR+IWCV outperforms PoG+CV and LR+CV for all sessions. This result implies that importance weighting is useful in coping with the influence of non-stationarity in practical speaker identification, such as utterance variation, changes in the recording environment, and physical conditions/emotions.

Table 7.4 summarizes the text-dependent speaker identification rates in percent for 1991/3, 1991/6, and 1991/9, showing that IWLR+IWCV and LR+CV slightly outperform PoG and are highly comparable to each other. The result that IWLR+IWCV and LR+CV are comparable in this experiment is a reasonable consequence, since the standard deviation of the estimated importance weights is very small in all three cases, implying that there is no significant distribution change and therefore no adaptation is necessary. This result indicates that

Table 7.4
Correct classification rates for text-dependent speaker identification

1991/3 (DoDC = 0.05)   IWLR (σ = 1.2, λ = 10^-4)   LR (σ = 1.0, λ = 10^-2)   PoG (b = 16)
1.5 sec.               100.0                        98.9                      96.8
3.0 sec.               100.0                        100.0                     97.7
4.5 sec.               100.0                        100.0                     97.9

1991/6 (DoDC = 0.05)   IWLR (σ = 1.2, λ = 10^-4)   LR (σ = 1.0, λ = 10^-2)   PoG (b = 16)
1.5 sec.               97.5                         96.2                      97.8
3.0 sec.               97.5                         97.2                      98.1
4.5 sec.               98.9                         97.4                      98.3

1991/9 (DoDC = 0.05)   IWLR (σ = 1.2, λ = 10^-4)   LR (σ = 1.0, λ = 10^-2)   PoG (b = 16)
1.5 sec.               100.0                        100.0                     98.2
3.0 sec.               100.0                        100.0                     98.4
4.5 sec.               100.0                        100.0                     98.5

All values are in percent. IWLR refers to IWLR with (σ, λ) chosen by fivefold IWCV, LR refers to LR with (σ, λ) chosen by fivefold CV, and PoG refers to PoG with the number of components b chosen by fivefold CV. The chosen values of these hyperparameters are given in parentheses.
DoDC refers to the standard deviation of the estimated importance weights \{\widehat{w}(X_i^{tr})\}_{i=1}^{n_{tr}}, which roughly indicates the degree of distribution change.
the proposed method does not degrade the accuracy when there is no significant distribution change.

Overall, the importance-weighting method tends to improve the performance when a significant distribution change exists. It also tends to maintain the good performance of the baseline method when no distribution change exists. Thus, the importance-weighting method is a promising approach to handling session-dependent variation in practical speaker identification.

7.3 Natural Language Processing

In this section, importance-weighting methods are applied to a domain adaptation task in natural language processing (NLP).

7.3.1 Formulation

A standard way to train an NLP system is to use data collected from the target domain in which the system is operated. In practice, though, collecting data from the target domain is often costly, so it is desirable to (re)use data obtained from other domains. However, due to differences in vocabulary and writing style, naive use of data obtained from other domains results in serious performance degradation. For this reason, domain adaptation is one of the most important challenges in NLP [24, 86, 36]. Here, domain adaptation experiments are conducted for a Japanese word segmentation task. It is not trivial to detect word boundaries in nonsegmented languages such as Japanese or Chinese. In the word segmentation task, X represents a sequence of features at the character boundaries of a sentence, and y is the sequence of corresponding labels, which specify whether the current position is a word boundary. It would be reasonable to consider the domain adaptation task of word segmentation as a covariate shift adaptation problem, since the word segmentation policy p(y|X) rarely changes across domains in the same language, but the distribution of characters p(X) tends to vary over different domains.
The goal of the experiments here is to adapt a word segmentation system from a daily conversation domain to a medical domain [187]. One of the characteristics of NLP tasks is their high dimensionality; the total number of distinct features is about d = 300,000 in this data set. n_{tr} = 13,000 labeled
(i.e., word-segmented) sentences from the source domain and n_{te} = 53,834 unlabeled (i.e., unsegmented) sentences from the target domain are used for learning. In addition, 1000 labeled sentences from the target domain are used for the evaluation of the domain adaptation performance.

As a word segmentation model, conditional random fields (CRFs) are used, which are generalizations of logistic regression models (see section 2.3.2) for structured prediction [98]. A CRF models the conditional probability p(y|X; \theta) of an output structure y given an input X:

p(y|X; \theta) = \frac{\exp(\theta^\top \varphi(X, y))}{\sum_{y'} \exp(\theta^\top \varphi(X, y'))},

where \varphi(X, y) is a basis function mapping (X, y) to a b-dimensional feature vector,^1 and b is the dimension of the parameter vector \theta. Although the conventional CRF learning algorithm minimizes the regularized negative log-likelihood, an importance-weighted CRF (IWCRF) training algorithm is used here for covariate shift adaptation. The Kullback-Leibler importance estimation procedure (KLIEP; see section 4.4) for log-linear models is used to estimate the importance weight w(X):

\widehat{w}(X) = \frac{\exp(\alpha^\top t(X))}{\frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \exp(\alpha^\top t(X_i^{tr}))},

where the basis function t(X) is set to the average value of the CRF features over a sentence. Since KLIEP for log-linear models is computationally more efficient than KLIEP for linear models when the number of test samples is large (see section 4.4), we chose the log-linear model here.

1. Its definition is omitted here. See [187] for details.
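Evaluating the log-linear importance model above is straightforward once the KLIEP parameter α has been fitted. The following sketch assumes a fitted α is given; the random α and feature matrices are placeholders, not the model from the experiment.

```python
import numpy as np

def loglinear_importance(alpha, T_tr, T_query):
    """w_hat(X) = exp(alpha^T t(X)) / ((1/n_tr) * sum_i exp(alpha^T t(X_i^tr))).

    T_tr    : (n_tr, b) matrix whose rows are t(X_i^tr) on source-domain sentences
    T_query : (m, b) matrix of basis vectors t(X) for the sentences to weight
    alpha   : (b,) fitted KLIEP parameter
    """
    log_num = T_query @ alpha
    log_tr = T_tr @ alpha
    # Log-sum-exp trick keeps the normalizer stable for large |alpha^T t(X)|.
    shift = log_tr.max()
    log_denom = shift + np.log(np.mean(np.exp(log_tr - shift)))
    return np.exp(log_num - log_denom)

rng = np.random.default_rng(1)
alpha = 0.1 * rng.normal(size=5)
T_tr = rng.normal(size=(200, 5))
w_tr = loglinear_importance(alpha, T_tr, T_tr)
```

One handy sanity check: by construction, the weights evaluated on the training sentences themselves average to exactly 1, mirroring the fact that the true importance w(x) = p_te(x)/p_tr(x) has expectation 1 under the training distribution.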
7.3.2 Experimental Results

The performance is evaluated by the F-measure,

F = \frac{2 \times R \times P}{R + P},

where R and P denote recall and precision, defined by

R = \frac{\#\text{ of correct words}}{\#\text{ of words in test data}} \times 100,

P = \frac{\#\text{ of correct words}}{\#\text{ of words in system output}} \times 100.

The hyperparameter of IWCRF is optimized based on an importance-weighted F-measure (IWF) for separate validation data, in which the number of correct words is weighted according to the importance of the sentence that these words belong to:

\mathrm{IWF}(D) = \frac{2 \times \mathrm{IWR}(D) \times \mathrm{IWP}(D)}{\mathrm{IWR}(D) + \mathrm{IWP}(D)}

for the validation set D, where IWR(D) and IWP(D) are the importance-weighted counterparts of R and P, in which each sentence's word counts are multiplied by its importance weight:

\mathrm{IWR}(D) = \frac{\sum_{(X, y) \in D} \widehat{w}(X) \sum_{t=1}^{T} [\widehat{y}_t = y_t]}{\sum_{(X, y) \in D} \widehat{w}(X) \, T} \times 100,

with IWP(D) defined analogously (the denominator counting the words in the system output), and

\widehat{y} = (\widehat{y}_1, \widehat{y}_2, \ldots, \widehat{y}_T)^\top = \mathop{\mathrm{argmax}}_{y} \ p(y|X; \widehat{\theta}).

Ten percent of the training data is used as the validation set D.

The performances of CRF, IWCRF, and CRF' (a CRF trained with an additional 1000 labeled manual word segmentation samples from the target domain [187]) are compared. For importance estimation, log-linear KLIEP and the logistic regression (LR) approach (see section 4.3) are tested, and fivefold cross-validation is used to find the optimal Gaussian width σ in the LR model. Table 7.5 summarizes the performance, showing that IWCRF+KLIEP significantly outperforms CRF. CRF', which uses additional labeled samples in the target domain and is thus highly expensive, tends to perform better
Table 7.5
Word segmentation performance in the target domain

              F       R       P
CRF           92.30   90.58   94.08
IWCRF+KLIEP   94.46   94.32   94.59
IWCRF+LR      93.68   94.30   93.07
CRF'          94.43   93.49   95.39

CRF' indicates the performance of a CRF trained with an additional 1000 manual word segmentation samples in the target domain.

than CRF, as expected. A notable fact is that IWCRF+KLIEP, which does not require expensive labeled samples in the target domain, is on par with CRF'. Empirically, the covariate shift adaptation technique seems to improve the coverage (R) in the target domain. Compared with LR, KLIEP seems to perform better in this experiment. Since it is easy to obtain a large amount of unlabeled text data in NLP tasks, domain adaptation by importance weighting is a promising approach in NLP.

7.4 Perceived Age Prediction from Face Images

In this section, covariate shift adaptation techniques are applied to perceived age prediction from face images [188].

7.4.1 Background

Recently, demographic analysis in public places such as shopping malls and stations has been attracting a great deal of attention. Such information is useful for various purposes, such as designing effective marketing strategies and targeting advertisements based on prospective customers' genders and ages. For this reason, a number of approaches have been explored for age estimation from face images [61, 54, 65], and several databases are now publicly available [49, 123, 134]. The accuracy of age prediction systems is significantly influenced by the type of camera, the camera calibration, and lighting variations. The publicly available databases were collected mainly in semicontrolled environments such as a studio with appropriate illumination. However, in real-world environments, lighting conditions vary considerably: strong sunlight may be cast on the sides of faces, or there may not be enough light.
For this reason, training and test data tend to have different distributions. Here, covariate shift adaptation techniques are employed to alleviate the influence of changes in lighting conditions.
7.4.2 Formulation

Let us consider a regression problem of estimating the age y of a subject x (x corresponds to a face-feature vector). Here, age estimation is performed not on the basis of subjects' real ages, but on their perceived ages. Thus, the "true" age of a subject is defined as the average perceived age evaluated by those who observed the subject's face images (rounded to the nearest integer). Suppose training samples \{(x_i^{tr}, y_i^{tr})\}_{i=1}^{n_{tr}} are given. We use the following kernel model for age estimation (see section 1.3.5.3):

\widehat{f}(x; \theta) = \sum_{\ell=1}^{n_{tr}} \theta_\ell K(x, x_\ell^{tr}),

where \theta = (\theta_1, \theta_2, \ldots, \theta_{n_{tr}})^\top are parameters and K(\cdot, \cdot) is a kernel function. Here the Gaussian kernel is used:

K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right),

where \sigma (> 0) denotes the Gaussian width. Under the covariate shift formulation, the parameter \theta may be learned by regularized importance-weighted least squares (see section 2.2.1):

\min_{\theta} \left[ \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \frac{p_{te}(x_i^{tr})}{p_{tr}(x_i^{tr})} \left( \widehat{f}(x_i^{tr}; \theta) - y_i^{tr} \right)^2 + \lambda \sum_{\ell=1}^{n_{tr}} \theta_\ell^2 \right],

where \lambda (\geq 0) is the regularization parameter.

7.4.3 Incorporating Characteristics of Human Age Perception

Human age perception is known to have heterogeneous characteristics. For example, it is rare to misjudge the age of a 5-year-old child as 15 years, but the age of a 35-year-old person is often misjudged as 45 years. A paper by Ueki et al. [189] quantified this phenomenon by carrying out a large-scale questionnaire survey: each of 72 volunteers was asked to give age labels y to approximately 1000 face images. Figure 7.2 depicts the relation between subjects' perceived ages and its standard deviation. The standard deviation is approximately 2 (years) when the true age is less than 15. It increases beyond 6 as the true age increases from 15 to 35. Then the standard deviation decreases to around 5 as the true age increases from 35 to 70. This graph shows that the perceived age deviation
tends to be small in younger age brackets and large in older age groups. This agrees well with our intuition concerning the human growth process.

Figure 7.2
The relation between subjects' perceived ages (horizontal axis) and its standard deviation (vertical axis).

In order to match the characteristics of age prediction systems to those of human age perception, the goodness-of-fit term in the training criterion is weighted according to the inverse variance of the perceived age:

\min_{\theta} \left[ \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \frac{p_{te}(x_i^{tr})}{p_{tr}(x_i^{tr})} \frac{1}{\xi^2(y_i^{tr})} \left( \widehat{f}(x_i^{tr}; \theta) - y_i^{tr} \right)^2 + \lambda \sum_{\ell=1}^{n_{tr}} \theta_\ell^2 \right],

where \xi(y) is the standard deviation of the perceived age at y (i.e., the values in figure 7.2). The solution \widehat{\theta} is given analytically by

\widehat{\theta} = (K_{tr} W K_{tr} + \lambda I_{n_{tr}})^{-1} K_{tr} W y^{tr},

where K_{tr} is the n_{tr} \times n_{tr} matrix with the (i, i')-th element K(x_i^{tr}, x_{i'}^{tr}), and W is the diagonal matrix with the i-th diagonal element

W_{i,i} = \frac{p_{te}(x_i^{tr})}{p_{tr}(x_i^{tr})} \frac{1}{\xi^2(y_i^{tr})},
I_{n_{tr}} is the n_{tr}-dimensional identity matrix, and

y^{tr} = (y_1^{tr}, y_2^{tr}, \ldots, y_{n_{tr}}^{tr})^\top.

7.4.4 Experimental Results

Face images recorded under 17 different lighting conditions were used for the experiments: for instance, average illuminance is approximately 1000 lux from above and 500 lux from the front in the standard lighting condition; 250 lux from above and 125 lux from the front in the dark setting; and 190 lux from the subject's right and 750 lux from the left in another setting (see figure 7.3). Images were recorded as movies with the camera at a negative elevation angle of 15 degrees. The number of subjects was approximately 500 (250 of each gender). A face detector was used for localizing the eye pupils, and then the image was rescaled to 64 × 64 pixels. The number of face images in each environment was about 2500 (5 face images × 500 subjects). As preprocessing, a feature extractor based on convolutional neural networks [184] was used to extract 100-dimensional features from the 64 × 64 face images. Learning for male/female data was performed separately, assuming that gender classification had been correctly carried out in advance.

The 250 subjects of each gender were split into a training set (200 subjects) and a test set (50 subjects). For the test samples \{(x_j^{te}, y_j^{te})\}_{j=1}^{n_{te}} corresponding to the environment with strong light from a side, the following age-weighted mean squared error (AWMSE) was calculated as a performance measure for a learned parameter \widehat{\theta}:

\mathrm{AWMSE} = \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \frac{1}{\xi^2(y_j^{te})} \left( y_j^{te} - \widehat{f}(x_j^{te}; \widehat{\theta}) \right)^2. \quad (7.1)

Figure 7.3
Examples of face images under different lighting conditions. Left: standard lighting; middle: dark; right: strong light from the subject's right-hand side.
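Both the analytic solution \widehat{\theta} = (K_{tr} W K_{tr} + \lambda I)^{-1} K_{tr} W y^{tr} and the AWMSE criterion of equation 7.1 translate directly into NumPy. This is a minimal sketch assuming the combined weights w_i = (p_{te}/p_{tr})(x_i^{tr}) / \xi^2(y_i^{tr}) are given; the function names and toy values are ours.

```python
import numpy as np

def weighted_kernel_ridge(K, y, w, lam):
    """theta_hat = (K W K + lam I)^{-1} K W y for the weighted criterion above.

    K : (n, n) symmetric kernel matrix K_tr;  y : (n,) targets y^tr;
    w : (n,) combined weights (importance weight times 1 / xi^2);  lam >= 0.
    """
    n = K.shape[0]
    KW = K * w[None, :]  # K @ diag(w) without materializing diag(w)
    return np.linalg.solve(KW @ K + lam * np.eye(n), KW @ y)

def awmse(y_true, y_pred, xi):
    """Age-weighted MSE (equation 7.1): squared errors are divided by xi^2(y),
    discounting mistakes where human raters themselves disagree widely."""
    y_true, y_pred, xi = (np.asarray(a, dtype=float) for a in (y_true, y_pred, xi))
    return float(np.mean((y_true - y_pred) ** 2 / xi ** 2))

# Sanity checks on toy values: with K = I and lam = 0 the solution
# interpolates y exactly, and a 2-year error at xi = 2 costs the same
# as a 4-year error at xi = 4 (each contributes (error/xi)^2 = 1).
theta = weighted_kernel_ridge(np.eye(3), np.array([18.0, 35.0, 60.0]),
                              np.array([1.0, 0.5, 0.25]), lam=0.0)
score = awmse([10.0, 40.0], [12.0, 44.0], xi=[2.0, 4.0])
```

Note that when all weights are equal and \xi \equiv 1, this reduces to ordinary kernel ridge regression, so the same routine covers the NIW baseline as well.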
The training set and the test set were shuffled five times in such a way that each subject was selected as a test sample once. The final performance was evaluated based on the average AWMSE over the five trials. The performances of the following three methods were compared:

• IW: Training samples were taken from all 17 lighting conditions. The importance weights were estimated by the Kullback-Leibler importance estimation procedure (KLIEP; see section 4.4), using the training samples and additional unlabeled test samples; the Gaussian width included in KLIEP was determined based on twofold likelihood cross-validation. The importance weights for training samples estimated by KLIEP were averaged over the samples in each lighting condition, and the average importance weights were used in the training of the regression models. This had the effect of smoothing the importance weights. The Gaussian width σ and the regularization parameter λ were determined based on fourfold importance-weighted cross-validation (IWCV; see section 3.3) over AWMSE; that is, the training set was further divided into a training part (150 subjects) and a validation part (50 subjects).

• NIW: Training samples were taken from all 17 lighting conditions. No importance weight was incorporated in the training criterion (the age weights were included). The Gaussian width σ and the regularization parameter λ were determined based on fourfold cross-validation over AWMSE.

• NIW': Only training samples taken under the standard lighting condition were used. Other setups were the same as in NIW.

Table 7.6 summarizes the experimental results, showing that for both male and female data, IW is better than NIW, and NIW is better than NIW'. This illustrates that covariate shift adaptation techniques are useful for alleviating the influence of changes in lighting conditions.
Table 7.6
The age prediction performance measured by AWMSE

        Male   Female
IW      2.54   3.90
NIW     2.64   4.40
NIW'    2.83   6.51

See equation 7.1.
7.5 Human Activity Recognition from Accelerometric Data

In this section, covariate shift adaptation techniques are applied to accelerometer-based human activity recognition [68].

7.5.1 Background

Human activity recognition from accelerometric data (e.g., obtained by a smartphone) has been gathering a great deal of attention recently [11, 15, 75], since it can be used for purposes such as remote health care [60, 152, 100] and worker behavior monitoring [207]. To construct a good classifier for activity recognition, users are required to prepare accelerometric data with activity labels for types of actions such as walking, running, and bicycle riding. However, since gathering labeled data is costly, this initial data collection phase prevents new users from using the activity recognition system. Thus, overcoming this new user problem is an important challenge in making human activity recognition systems practically useful.

Since unlabeled data are relatively easy to gather, we can typically use labeled data obtained from existing users and unlabeled data obtained from a new user for developing the new user's activity classifier. Such a situation is commonly called semisupervised learning, and various learning methods that utilize unlabeled samples have been proposed [30]. However, semisupervised learning methods tend to perform poorly if the unlabeled test data have a significantly different distribution than the labeled training data. Unfortunately, this is a typical situation in human activity recognition, since motion patterns (and thus distributions of motion data) depend heavily on users. In this section, we apply an importance-weighted variant of a probabilistic classification algorithm called the least-squares probabilistic classifier (LSPC) [155] to real-world human activity recognition.
7.5.2 Importance-Weighted Least-Squares Probabilistic Classifier

Here, we describe a covariate shift adaptation method called the importance-weighted least-squares probabilistic classifier (IWLSPC) [68]. Let us consider the problem of classifying an accelerometric sample x (\in \mathbb{R}^d) into an activity class y (\in \{1, 2, \ldots, c\}), where d is the input dimensionality and c denotes the number of classes. IWLSPC estimates the class-posterior probability p(y|x) from training input-output samples \{(x_i^{tr}, y_i^{tr})\}_{i=1}^{n_{tr}} and test input-only samples \{x_j^{te}\}_{j=1}^{n_{te}} under covariate shift. In the context of human activity recognition, the labeled training samples \{(x_i^{tr}, y_i^{tr})\}_{i=1}^{n_{tr}} correspond to the data obtained from existing users, and the unlabeled test input samples \{x_j^{te}\}_{j=1}^{n_{te}} correspond to the data obtained from a new user.
Let us model the class-posterior probability p(y|x) by

\widehat{p}(y|x; \theta_y) = \sum_{\ell=1}^{n_{te}} \theta_{y,\ell} K(x, x_\ell^{te}),

where \theta_y = (\theta_{y,1}, \theta_{y,2}, \ldots, \theta_{y,n_{te}})^\top is the parameter vector and K(x, x') is a kernel function. We focus on the Gaussian kernel here:

K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right),

where \sigma denotes the Gaussian kernel width. We determine the parameter \theta_y so that the following squared error J_y is minimized:

J_y(\theta_y) = \frac{1}{2} \int \left( \widehat{p}(y|x; \theta_y) - p(y|x) \right)^2 p_{te}(x) \, dx

= \frac{1}{2} \int \widehat{p}(y|x; \theta_y)^2 p_{te}(x) \, dx - \int \widehat{p}(y|x; \theta_y) \, p(y|x) \, p_{te}(x) \, dx + C

= \frac{1}{2} \theta_y^\top Q \theta_y - q_y^\top \theta_y + C,

where C is a constant independent of the parameter \theta_y, Q is the n_{te} \times n_{te} matrix, and q_y = (q_{y,1}, \ldots, q_{y,n_{te}})^\top is the n_{te}-dimensional vector defined as

Q_{\ell,\ell'} := \int K(x, x_\ell^{te}) K(x, x_{\ell'}^{te}) \, p_{te}(x) \, dx,

q_{y,\ell} := \int K(x, x_\ell^{te}) \, p(y|x) \, p_{te}(x) \, dx.

Here, we approximate Q and q_y using the adaptive importance sampling technique (see section 2.1.2) as follows. First, using the importance weight defined as

w(x) := \frac{p_{te}(x)}{p_{tr}(x)},

we express Q and q_y in terms of the training distribution as

Q_{\ell,\ell'} = \int K(x, x_\ell^{te}) K(x, x_{\ell'}^{te}) \, p_{tr}(x) w(x) \, dx,
q_{y,\ell} = \int K(x, x_\ell^{te}) \, p(y|x) \, p_{tr}(x) w(x) \, dx = p(y) \int K(x, x_\ell^{te}) \, p_{tr}(x|y) w(x) \, dx,

where p_{tr}(x|y) denotes the training input density for class y. Then, based on the above expressions, Q and q_y are approximated using the training samples \{(x_i^{tr}, y_i^{tr})\}_{i=1}^{n_{tr}} as follows:^2

\widehat{Q}_{\ell,\ell'} := \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} K(x_i^{tr}, x_\ell^{te}) K(x_i^{tr}, x_{\ell'}^{te}) \, w(x_i^{tr})^\nu,

\widehat{q}_{y,\ell} := \frac{1}{n_{tr}} \sum_{i : y_i^{tr} = y} K(x_i^{tr}, x_\ell^{te}) \, w(x_i^{tr})^\nu,

where the class-prior probability p(y) is approximated by n_{tr}^{(y)}/n_{tr}, and n_{tr}^{(y)} denotes the number of training samples with label y. Also, \nu (0 \leq \nu \leq 1) denotes the flattening parameter, which controls the bias-variance trade-off in importance sampling (see section 2.1.2). Consequently, we arrive at the following optimization problem:

\min_{\theta_y} \left[ \frac{1}{2} \theta_y^\top \widehat{Q} \theta_y - \widehat{q}_y^\top \theta_y + \frac{\lambda}{2} \theta_y^\top \theta_y \right],

where \frac{\lambda}{2} \theta_y^\top \theta_y is a regularization term to avoid overfitting and \lambda (\geq 0) denotes the regularization parameter. Then the IWLSPC solution is given analytically as

\widehat{\theta}_y = (\widehat{Q} + \lambda I_{n_{te}})^{-1} \widehat{q}_y,

where I_{n_{te}} denotes the n_{te}-dimensional identity matrix. Since the class-posterior probability is nonnegative by definition, we modify the solution as

2. When \nu = 1, Q may be approximated directly by using the test input samples \{x_j^{te}\}_{j=1}^{n_{te}}.
follows [206]:

\widehat{p}(y|x) = \frac{\max\left( 0, \sum_{\ell=1}^{n_{te}} \widehat{\theta}_{y,\ell} K(x, x_\ell^{te}) \right)}{Z(x)} \quad \text{if } Z(x) := \sum_{y'=1}^{c} \max\left( 0, \sum_{\ell=1}^{n_{te}} \widehat{\theta}_{y',\ell} K(x, x_\ell^{te}) \right) > 0;

otherwise, \widehat{p}(y|x) = 1/c, where c denotes the number of classes. The learned class-posterior probability \widehat{p}(y|x) allows us to predict the class label y of a new sample x, with confidence \widehat{p}(\widehat{y}|x), as

\widehat{y} = \mathop{\mathrm{argmax}}_{y} \ \widehat{p}(y|x).

In practice, the importance w(x) can be estimated by the methods described in chapter 4, and the tuning parameters, such as the regularization parameter \lambda, the Gaussian kernel width \sigma, and the flattening parameter \nu, may be chosen based on importance-weighted cross-validation (IWCV; see section 3.3).

An importance-weighted variant of kernel logistic regression (IWKLR; see section 2.3.2) can also be used for estimating the class-posterior probability under covariate shift [204]. However, training a large-scale IWKLR model is computationally challenging, since it requires numerical optimization of the all-class parameter of dimension c × n_{te}. On the other hand, IWLSPC optimizes the classwise parameter \theta_y of dimension n_{te} separately c times in an analytic form.

7.5.3 Experimental Results

Here, we apply IWLSPC to real-world human activity recognition. We use three-axis accelerometric data collected by iPod Touch.^3 In the data collection procedure, subjects were asked to perform a specific task such as walking, running, or bicycle riding. The duration of each task was arbitrary, and the sampling rate was 20 Hz with small variations. An example of three-axis accelerometric data for walking is plotted in figure 7.4. To extract features from the accelerometric data, each data stream was segmented in a sliding-window manner with a width of five seconds and a sliding step of one second. Depending on the subject, the position and orientation of the iPod Touch was arbitrary: held in the hand or kept in a pocket or a bag. For this reason, we decided to take the \ell_2-norm of the three-dimensional acceleration vector at each time step, and computed the following
3. The data set is available from http://alkan.mns.kyutech.ac.jp/web/data.html.
[Figure 7.4: Example of three-axis accelerometric data for walking (time in seconds on the horizontal axis; acceleration of each axis on the vertical axes).]

five orientation-invariant features from each window: mean, standard deviation, fluctuation of amplitude, average energy, and frequency-domain entropy [11, 15].

Let us consider a situation where two new users want to use the activity recognition system. Since they do not want to label their accelerometric data, only unlabeled test samples are available from the new users. Labeled data obtained from 20 existing users are available for training the new users' classifiers. Each existing user has at most 100 labeled samples for each action.

We compared the performance of the following six classification methods.

• LapRLS⁴+IWCV: A semisupervised learning method called Laplacian regularized least squares (LapRLS) [30] with the Gaussian kernel as basis functions. Hyperparameters are selected by IWCV.

4. Laplacian regularized least squares (LapRLS) is a standard semisupervised learning method which tries to impose smoothness over a nonlinear data manifold [30]. Let us consider a binary classification problem where $y \in \{+1, -1\}$. LapRLS uses a kernel model for class prediction:

$$\widehat{f}(x; \theta) := \sum_{\ell=1}^{n_{te}} \theta_\ell\, K(x, x_\ell^{te}).$$
• LapRLS+CV: LapRLS with hyperparameters chosen by ordinary CV (i.e., no importance weighting).

• IWKLR+IWCV: An adaptive and regularized variant of IWKLR (see section 2.3.2) with the Gaussian kernel as basis functions. The hyperparameters are selected by IWCV. A MATLAB implementation of a limited-memory BFGS (Broyden-Fletcher-Goldfarb-Shanno) quasi-Newton method included in the minFunc package [140] was used for optimization.

• KLR+CV: IWKLR+IWCV without importance weighting.

• IWLSPC+IWCV: The probabilistic classification method described in section 7.5.2. The hyperparameters are chosen by IWCV.

• LSPC+CV: IWLSPC+IWCV without importance weighting.

The uLSIF method (see section 4.6) was used for importance estimation. For stabilization purposes, estimated importance weights were averaged in a userwise manner in IWCV. The experiments were repeated 50 times with different sample choices.

Table 7.7 shows the experimental results for each new user (specified by u1 and u2) in three binary classification tasks: walk vs. run, walk vs. riding a bicycle, and walk vs. taking a train. The table shows that IWKLR+IWCV and IWLSPC+IWCV compare favorably with the other methods in terms of classification accuracy. Table 7.8 depicts the computation time for training

(Footnote 4, continued.) The parameter $\theta$ is determined as

$$\min_\theta \left[ \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \left( \widehat{f}(x_i^{tr}; \theta) - y_i^{tr} \right)^2 + \lambda\, \theta^\top \theta + \eta \sum_{i,i'=1}^{n} L_{i,i'}\, \widehat{f}(x_i; \theta)\, \widehat{f}(x_{i'}; \theta) \right],$$

where the first term is the goodness of fit, the second term is the $\ell_2$-regularizer to avoid overfitting, and the third term is the Laplacian regularizer to impose smoothness over the data manifold. $n := n_{tr} + n_{te}$, $L := D - W$ is an $n \times n$ graph Laplacian matrix, $W$ is an affinity matrix defined by

$$W_{i,i'} := \exp\left( -\frac{\|x_i - x_{i'}\|^2}{2\tau^2} \right),$$

$(x_1, \ldots, x_n) := (x_1^{tr}, \ldots, x_{n_{tr}}^{tr}, x_1^{te}, \ldots, x_{n_{te}}^{te})$, $\tau$ is an affinity-controlling parameter, and $D$ is the diagonal matrix given by $D_{i,i} := \sum_{i'=1}^{n} W_{i,i'}$.
The solution of LapRLS can be computed analytically since the optimization problem is an unconstrained quadratic program. However, covariate shift is not taken into account in LapRLS, and thus it will not perform well if the training and test distributions are significantly different.
Table 7.7: Mean misclassification rates [%] and standard deviations (in parentheses) averaged over 50 trials for each new user (specified by u1 and u2) in human activity recognition.

Walk vs.      LapRLS+CV   LapRLS+IWCV  KLR+CV      IWKLR+IWCV  LSPC+CV     IWLSPC+IWCV
Run (u1)      21.2 (4.7)  11.4 (4.0)   15.0 (6.6)  9.0 (0.5)   13.3 (3.9)  9.0 (0.4)
Bicycle (u1)  9.9 (1.1)   12.6 (1.3)   9.5 (0.7)   8.7 (4.6)   9.7 (0.7)   8.7 (5.0)
Train (u1)    2.2 (0.2)   2.2 (0.2)    1.4 (0.4)   1.2 (1.5)   1.5 (0.4)   1.1 (1.5)
Run (u2)      24.7 (5.0)  20.6 (9.7)   25.6 (1.3)  22.4 (6.1)  25.6 (0.8)  21.9 (5.9)
Bicycle (u2)  13.0 (1.5)  14.0 (2.1)   11.1 (1.8)  11.0 (1.7)  10.9 (1.8)  10.4 (1.7)
Train (u2)    3.9 (1.3)   3.7 (1.1)    3.6 (0.6)   2.9 (0.7)   3.5 (0.5)   3.1 (0.5)

A number in boldface indicates that the method is the best or comparable to the best in terms of the mean misclassification rate by the t-test at the significance level 5 percent.

Table 7.8: Mean computation time [s] and standard deviations (in parentheses) averaged over 50 trials for each new user (specified by u1 and u2) in human activity recognition.

Walk vs.      LapRLS+CV   LapRLS+IWCV  KLR+CV       IWKLR+IWCV    LSPC+CV    IWLSPC+IWCV
Run (u1)      14.1 (0.7)  14.5 (0.8)   86.8 (16.2)  78.8 (23.2)   7.3 (1.1)  6.6 (1.3)
Bicycle (u1)  38.8 (4.8)  52.8 (8.1)   38.8 (4.8)   52.8 (8.1)    4.2 (0.8)  3.7 (0.8)
Train (u1)    5.5 (0.6)   5.4 (0.6)    19.8 (7.3)   30.9 (6.0)    3.9 (0.8)  4.0 (0.8)
Run (u2)      12.6 (2.1)  12.1 (2.2)   70.1 (12.9)  128.5 (51.7)  8.2 (1.3)  7.8 (1.5)
Bicycle (u2)  16.8 (7.0)  27.2 (5.6)   16.8 (7.0)   27.2 (5.6)    3.7 (0.8)  3.1 (0.9)
Train (u2)    5.6 (0.7)   5.6 (0.6)    24.9 (10.8)  29.4 (10.3)   4.1 (0.8)  3.9 (0.8)

A number in boldface indicates that the method is the best or comparable to the best in terms of the mean computation time by the t-test at the significance level 5 percent.

classifiers, showing that the LSPC-based methods are computationally much more efficient than the KLR-based methods.
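As a concrete illustration, the IWLSPC training and prediction steps described in section 7.5.2 can be sketched as follows. This is a minimal sketch under assumptions not in the text: synthetic one-dimensional Gaussian data, fixed hyperparameters ($\sigma$, $\lambda$, $\nu$), and importance weights supplied externally (set to 1 here as a stand-in for uLSIF estimates); the function names are our own.

```python
import numpy as np

def gauss_kernel(X, C, sigma):
    # Pairwise Gaussian kernel values K(x, c) = exp(-||x - c||^2 / (2 sigma^2)).
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

def iwlspc_fit(X_tr, y_tr, X_te, w_tr, sigma=1.0, lam=0.1, nu=1.0):
    # Test input points serve as kernel centers (one parameter per center per class).
    K = gauss_kernel(X_tr, X_te, sigma)           # (n_tr, n_te)
    wv = w_tr ** nu                               # flattened importance weights
    n_tr, n_te = K.shape
    Qh = (K * wv[:, None]).T @ K / n_tr           # \hat{Q}
    Theta = {}
    for y in np.unique(y_tr):
        qy = (K[y_tr == y] * wv[y_tr == y, None]).sum(axis=0) / n_tr  # \hat{q}_y
        Theta[y] = np.linalg.solve(Qh + lam * np.eye(n_te), qy)       # analytic solution
    return Theta

def iwlspc_posterior(x, X_te, Theta, sigma=1.0):
    k = gauss_kernel(np.atleast_2d(x), X_te, sigma)[0]
    scores = {y: max(0.0, th @ k) for y, th in Theta.items()}  # clip negatives
    Z = sum(scores.values())
    c = len(scores)
    return {y: (s / Z if Z > 0 else 1.0 / c) for y, s in scores.items()}

# Toy example: two Gaussian classes with a covariate shift in the mean.
rng = np.random.default_rng(0)
X_tr = np.vstack([rng.normal(-1, 1, (100, 1)), rng.normal(1, 1, (100, 1))])
y_tr = np.array([0] * 100 + [1] * 100)
X_te = np.vstack([rng.normal(-0.5, 1, (50, 1)), rng.normal(1.5, 1, (50, 1))])
w_tr = np.ones(len(X_tr))  # stand-in for uLSIF-estimated importance weights
Theta = iwlspc_fit(X_tr, y_tr, X_te, w_tr)
post = iwlspc_posterior(np.array([-1.5]), X_te, Theta)
pred = max(post, key=post.get)
```

Because the per-class solution is a regularized linear system, training cost is one $n_{te} \times n_{te}$ solve per class, which matches the efficiency advantage over IWKLR noted above.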
Figure 7.5 depicts the mean misclassification rate for various coverage levels, where the coverage is the fraction of the test samples used for evaluating the misclassification rate. For example, a coverage of 0.8 means that the 80 percent of test samples with the highest confidence level (obtained from an estimated class-posterior probability) are used for evaluating the misclassification rate. This represents a realistic situation where predictions with a low confidence level are rejected; if a prediction is rejected, the prediction obtained in the previous time step is inherited, since an action usually continues for a certain duration. The
[Figure 7.5: Misclassification rate as a function of coverage for each new user (specified by u1 and u2) in human activity recognition. Panels: (a) walk vs. run (u1); (b) walk vs. run (u2); (c) walk vs. bicycle (u1); (d) walk vs. bicycle (u2); (e) walk vs. train (u1); (f) walk vs. train (u2).]
graphs show that for most of the coverage levels, IWLSPC+IWCV outperforms LSPC+CV. The misclassification rates of IWLSPC+IWCV and LSPC+CV tend to decrease as the coverage level decreases, implying that the confidence estimation by (IW)LSPC is reliable, since erroneous predictions can be successfully rejected.

7.6 Sample Reuse in Reinforcement Learning

Reinforcement learning (RL) [174] is a framework which allows a robot agent to act optimally in an unknown environment through interaction. Because of its usefulness and generality, reinforcement learning is gathering a great deal of attention in the machine learning, artificial intelligence, and robotics communities. In this section, we show that covariate shift adaptation techniques can be successfully employed in the reinforcement learning scenario [66].

7.6.1 Markov Decision Problems

Let us consider a Markov decision problem (MDP) specified by $(S, A, P_I, P_T, R, \gamma)$, where

• $S$ is a set of states,
• $A$ is a set of actions,
• $P_I(s) \in [0, 1]$ is the initial-state probability density,
• $P_T(s'|s, a) \in [0, 1]$ is the transition probability density from state $s$ to state $s'$ when action $a$ is taken,
• $R(s, a, s') \in \mathbb{R}$ is a reward for transition from $s$ to $s'$ by taking action $a$,
• $\gamma \in (0, 1]$ is the discount factor for future rewards.

Let $\pi(a|s) \in [0, 1]$ be a stochastic policy of the agent, which is the conditional probability density of taking action $a$ given state $s$. The state-action value function $Q^\pi(s, a) \in \mathbb{R}$ for policy $\pi$ is the expected discounted sum of rewards the agent will receive when taking action $a$ in state $s$ and following policy $\pi$ thereafter. That is,

$$Q^\pi(s, a) := \mathop{\mathbb{E}}_{\pi, P_T} \left[ \sum_{n=1}^{\infty} \gamma^{n-1} R(s_n, a_n, s_{n+1}) \,\Big|\, s_1 = s,\ a_1 = a \right],$$

where $\mathbb{E}_{\pi, P_T}$ denotes the expectation over $\{s_n, a_n\}_{n=1}^{\infty}$ following $\pi(a_n|s_n)$ and $P_T(s_{n+1}|s_n, a_n)$.
The goal of reinforcement learning is to obtain the policy which maximizes the sum of future rewards. The optimal policy can be expressed as follows⁵:

$$\pi^*(a|s) = \delta\left(a - \mathop{\mathrm{argmax}}_{a'}\, Q^*(s, a')\right),$$

where $\delta(\cdot)$ is the Dirac delta function and $Q^*(s, a)$ is the optimal state-action value function defined by

$$Q^*(s, a) := \max_\pi Q^\pi(s, a).$$

$Q^\pi(s, a)$ can be expressed in the following recurrent form, called the Bellman equation [174]:

$$Q^\pi(s, a) = R(s, a) + \gamma \mathop{\mathbb{E}}_{P_T(s'|s,a)} \mathop{\mathbb{E}}_{\pi(a'|s')} \left[ Q^\pi(s', a') \right], \quad \forall s \in S,\ \forall a \in A, \tag{7.2}$$

where $R(s, a)$ is the expected reward when the agent takes action $a$ in state $s$:

$$R(s, a) := \mathop{\mathbb{E}}_{P_T(s'|s,a)} \left[ R(s, a, s') \right].$$

$\mathbb{E}_{P_T(s'|s,a)}$ denotes the conditional expectation of $s'$ over $P_T(s'|s, a)$, given $s$ and $a$. $\mathbb{E}_{\pi(a'|s')}$ denotes the conditional expectation of $a'$ over $\pi(a'|s')$, given $s'$.

7.6.2 Policy Iteration

The computation of the value function $Q^\pi(s, a)$ is called policy evaluation. With $Q^\pi(s, a)$, one can find a better policy $\pi'(a|s)$ by means of

$$\pi'(a|s) = \delta\left(a - \mathop{\mathrm{argmax}}_{a'}\, Q^\pi(s, a')\right).$$

This is called (greedy) policy improvement. It is known that repeating policy evaluation and policy improvement results in the optimal policy $\pi^*(a|s)$ [174]. This entire process is called policy iteration:

$$\pi_1 \xrightarrow{E} Q^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} Q^{\pi_2} \xrightarrow{I} \pi_3 \xrightarrow{E} \cdots \xrightarrow{I} \pi^*,$$

5. We assume that given state $s$, there is only one action maximizing the optimal value function $Q^*(s, a)$.
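The evaluation and improvement steps above can be sketched on a tiny hypothetical MDP (two states, two actions, known transition probabilities and expected rewards; all numbers are made up for illustration). Policy evaluation solves the Bellman equation (7.2) exactly as a linear system, and greedy improvement follows.

```python
import numpy as np

# Tiny hypothetical MDP: 2 states, 2 actions, known P_T and expected reward R.
n_s, n_a, gamma = 2, 2, 0.9
P = np.zeros((n_s, n_a, n_s))          # P[s, a, s'] = P_T(s'|s, a)
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.7, 0.3]; P[1, 1] = [0.05, 0.95]
R = np.array([[0.0, 1.0],              # R[s, a]: expected reward
              [0.0, 2.0]])

def evaluate(policy):
    # Solve the Bellman equation exactly for a deterministic policy:
    # Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * Q(s', policy[s']),
    # rewritten as a linear system in the n_s * n_a unknowns Q[s, a].
    A = np.eye(n_s * n_a)
    for s in range(n_s):
        for a in range(n_a):
            for s2 in range(n_s):
                A[s * n_a + a, s2 * n_a + policy[s2]] -= gamma * P[s, a, s2]
    return np.linalg.solve(A, R.ravel()).reshape(n_s, n_a)

policy = np.zeros(n_s, dtype=int)      # initial policy: always action 0
for _ in range(20):                    # E -> I -> E -> I -> ...
    Q = evaluate(policy)               # policy evaluation (E)
    new_policy = Q.argmax(axis=1)      # greedy policy improvement (I)
    if np.array_equal(new_policy, policy):
        break                          # converged to the optimal policy
    policy = new_policy
```

For continuous state or action spaces this exact tabular solve is infeasible, which is exactly the motivation for the value function approximation of section 7.6.3.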
where $\pi_1$ is an initial policy. $E$ and $I$ indicate the policy evaluation and policy improvement steps, respectively. For technical reasons, we assume that all policies are strictly positive (i.e., all actions have nonzero probability densities). In order to guarantee this, explorative policy improvement strategies such as the Gibbs policy and the $\varepsilon$-greedy policy are used here. In the case of the Gibbs policy,

$$\pi'(a|s) = \frac{\exp(Q^\pi(s, a)/\tau)}{\int_A \exp(Q^\pi(s, a')/\tau)\, da'}, \tag{7.3}$$

where $\tau$ is a positive parameter which determines the randomness of the new policy $\pi'$. In the case of the $\varepsilon$-greedy policy,

$$\pi'(a|s) = \begin{cases} 1 - \varepsilon + \varepsilon/|A| & \text{if } a = a^*, \\ \varepsilon/|A| & \text{otherwise,} \end{cases} \tag{7.4}$$

where

$$a^* = \mathop{\mathrm{argmax}}_a\, Q^\pi(s, a),$$

and $\varepsilon \in (0, 1]$ determines how stochastic the new policy $\pi'$ is.

7.6.3 Value Function Approximation

Although policy iteration is guaranteed to produce the optimal policy, it is often computationally intractable since the number of state-action pairs $|S| \times |A|$ is very large; $|S|$ or $|A|$ can even become infinite when the state space or action space is continuous. To overcome this problem, the state-action value function $Q^\pi(s, a)$ is approximated using the following linear model [174, 124, 99]:

$$\widehat{Q}^\pi(s, a; \theta) := \sum_{b=1}^{B} \theta_b\, \phi_b(s, a) = \theta^\top \phi(s, a),$$

where

$$\phi(s, a) = (\phi_1(s, a), \ldots, \phi_B(s, a))^\top$$

are the fixed basis functions, $\top$ denotes the transpose, $B$ is the number of basis functions, and

$$\theta = (\theta_1, \ldots, \theta_B)^\top$$
are model parameters. Note that $B$ is usually chosen to be much smaller than $|S| \times |A|$.

For $N$-step transitions, the ideal way to learn the parameters $\theta$ is to minimize the approximation error of the state-action value function $Q^\pi(s, a)$:

$$\mathop{\mathbb{E}}_{P_I, \pi, P_T} \left[ \frac{1}{N} \sum_{n=1}^{N} \left( \widehat{Q}^\pi(s_n, a_n; \theta) - Q^\pi(s_n, a_n) \right)^2 \right],$$

where $\mathbb{E}_{P_I, \pi, P_T}$ denotes the expectation over $\{s_n, a_n\}_{n=1}^{N}$ following the initial-state probability density $P_I(s_1)$, the policy $\pi(a_n|s_n)$, and the transition probability density $P_T(s_{n+1}|s_n, a_n)$.

A fundamental problem of the above formulation is that the target function $Q^\pi(s, a)$ cannot be observed directly. To cope with this problem, we attempt to minimize the square of the Bellman residual instead [99]:

$$\theta^* := \mathop{\mathrm{argmin}}_\theta\, G, \tag{7.5}$$

$$G := \mathop{\mathbb{E}}_{P_I, \pi, P_T} \left[ \frac{1}{N} \sum_{n=1}^{N} g(s_n, a_n; \theta) \right], \quad g(s, a; \theta) := \left( \widehat{Q}^\pi(s, a; \theta) - R(s, a) - \gamma \mathop{\mathbb{E}}_{P_T(s'|s,a)} \mathop{\mathbb{E}}_{\pi(a'|s')} \left[ \widehat{Q}^\pi(s', a'; \theta) \right] \right)^2, \tag{7.6}$$

where $g(s, a; \theta)$ is the approximation error for one step $(s, a)$ derived from the Bellman equation⁶ (equation 7.2).

7.6.4 Sample Reuse by Covariate Shift Adaptation

In policy iteration, the optimal policy is obtained by iteratively performing policy evaluation and improvement steps [174, 13]. When policies are updated, many popular policy iteration methods require the user to gather new samples following the updated policy, and the new samples are used for value function approximation. However, this approach is inefficient particularly when

6. Note that $g(s, a; \theta)$ with a reward observation $r$ instead of the expected reward $R(s, a)$ corresponds to the square of the temporal difference (TD) error. That is,

$$g_{TD}(s, a, r; \theta) := \left( \widehat{Q}^\pi(s, a; \theta) - r - \gamma \mathop{\mathbb{E}}_{P_T(s'|s,a)} \mathop{\mathbb{E}}_{\pi(a'|s')} \left[ \widehat{Q}^\pi(s', a'; \theta) \right] \right)^2.$$

Although we use the Bellman residual for measuring the approximation error, it can easily be replaced with the TD error.
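The Gibbs (equation 7.3) and $\varepsilon$-greedy (equation 7.4) improvement rules of section 7.6.2 can be sketched for a discrete action set as follows; the Q-values used here are hypothetical toy numbers.

```python
import numpy as np

def gibbs_policy(Q_row, tau):
    # pi'(a|s) proportional to exp(Q(s,a)/tau) over a discrete action set.
    z = np.exp((Q_row - Q_row.max()) / tau)   # shift by max for numerical stability
    return z / z.sum()

def epsilon_greedy_policy(Q_row, eps):
    # Greedy action gets probability 1 - eps + eps/|A|; others get eps/|A|.
    n_actions = len(Q_row)
    pi = np.full(n_actions, eps / n_actions)
    pi[np.argmax(Q_row)] += 1.0 - eps
    return pi

Q_row = np.array([1.0, 2.0, 0.5])   # hypothetical Q(s, .) for one state, three actions
pi_gibbs = gibbs_policy(Q_row, tau=1.0)
pi_eps = epsilon_greedy_policy(Q_row, eps=0.1)
```

Both rules keep every action probability strictly positive (for $\tau > 0$ and $\varepsilon > 0$), which is exactly the strict-positivity requirement stated above.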
the sampling cost is high, and it will be more cost-efficient if one can reuse the data collected in the past. A situation where the sampling policy (a policy used for gathering data samples) and the current policy are different is called off-policy reinforcement learning [174]. In the off-policy setup, simply employing a standard policy iteration method such as least-squares policy iteration [99] does not lead to the optimal policy, as the sampling policy can introduce bias into value function approximation. This distribution mismatch problem can be eased by the use of importance-weighting techniques, which cancel the bias asymptotically. However, the approximation error is not necessarily small when the bias is reduced by importance weighting; the variance of estimators also needs to be taken into account, since the approximation error is the sum of squared bias and variance. Due to large variance, naive importance-weighting techniques used in reinforcement learning tend to be unstable [174, 125].

To overcome the instability problem, an adaptive importance-weighting technique is useful. As shown in section 2.1.2, the adaptive importance-weighted estimator smoothly bridges the ordinary estimator and the importance-weighted estimator, allowing one to control the trade-off between bias and variance. Thus, given that the trade-off parameter is determined carefully, the optimal performance may be achieved in terms of both bias and variance. However, the optimal value of the trade-off parameter is heavily dependent on data samples and policies, and therefore using a predetermined parameter value may not always be effective in practice.

For optimally choosing the value of the trade-off parameter, importance-weighted cross-validation [160] (see section 3.3) enables one to estimate the approximation error of value functions in an almost unbiased manner even under off-policy situations.
Thus, one can adaptively choose the trade-off parameter based on the data samples at hand.

7.6.5 On-Policy vs. Off-Policy

Suppose that a data set consisting of $M$ episodes of $N$ steps is available. The agent initially starts from a randomly selected state $s_1$ following the initial-state probability density $P_I(s)$ and chooses an action based on a sampling policy $\tilde{\pi}(a_n|s_n)$. Then the agent makes a transition following $P_T(s_{n+1}|s_n, a_n)$, and receives a reward $r_n$ $(= R(s_n, a_n, s_{n+1}))$. This is repeated for $N$ steps; thus, the training data $D^{\tilde{\pi}}$ are expressed as

$$D^{\tilde{\pi}} := \{ d_m^{\tilde{\pi}} \}_{m=1}^{M},$$
where each episodic sample $d_m^{\tilde{\pi}}$ consists of a set of quadruple elements as

$$d_m^{\tilde{\pi}} := \{ (s_{m,n}^{\tilde{\pi}},\, a_{m,n}^{\tilde{\pi}},\, r_{m,n}^{\tilde{\pi}},\, s_{m,n+1}^{\tilde{\pi}}) \}_{n=1}^{N}.$$

Two types of policies which have different purposes are used here: the sampling policy $\tilde{\pi}(a|s)$ for collecting data samples and the current policy $\pi(a|s)$ for computing the value function $Q^\pi$. When $\tilde{\pi}(a|s)$ is equal to $\pi(a|s)$ (the situation called on-policy), just replacing the expectation contained in the error $G$ defined in equation 7.6 by a sample average gives a consistent estimator (i.e., the estimated parameter converges to the optimal value as the number of episodes $M$ goes to infinity):

$$\widehat{\theta}_{NIW} := \mathop{\mathrm{argmin}}_\theta\, \widehat{G}_{NIW}, \qquad \widehat{G}_{NIW} := \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \widehat{g}_{m,n}, \qquad \widehat{g}_{m,n} := \widehat{g}(s_{m,n}^{\tilde{\pi}}, a_{m,n}^{\tilde{\pi}}; \theta, D^{\tilde{\pi}}),$$

$$\widehat{g}(s, a; \theta, D) := \left( \widehat{Q}^\pi(s, a; \theta) - \frac{1}{|D_{(s,a)}|} \sum_{r \in D_{(s,a)}} r - \frac{\gamma}{|D_{(s,a)}|} \sum_{s' \in D_{(s,a)}} \mathop{\mathbb{E}}_{\pi(a'|s')} \left[ \widehat{Q}^\pi(s', a'; \theta) \right] \right)^2,$$

where $D_{(s,a)}$ is the set of quadruple elements containing state $s$ and action $a$ in the training data $D$, and $\sum_{r \in D_{(s,a)}}$ and $\sum_{s' \in D_{(s,a)}}$ denote the summation over $r$ and $s'$ in the set $D_{(s,a)}$, respectively. Note that NIW stands for "no importance weight."

However, in reality, $\tilde{\pi}(a|s)$ is usually different from $\pi(a|s)$, since the current policy is updated in policy iteration. The situation where $\tilde{\pi}(a|s)$ is different from $\pi(a|s)$ is called off-policy. In the off-policy setup, $\widehat{\theta}_{NIW}$ is no longer consistent. This inconsistency can be avoided by gathering new samples. That is, when the current policy is updated, new samples are gathered following the updated policy, and the new samples are used for policy evaluation. However, when the data sampling cost is high, this is not cost-efficient; it would be more practical if one could reuse the previously gathered samples.

7.6.6 Importance Weighting in Value Function Approximation

For coping with the off-policy situation, we would like to use importance-weighting techniques. However, this is not that straightforward since our
training samples of state $s$ and action $a$ are not i.i.d. due to the sequential nature of MDPs. Below, standard importance-weighting techniques in the context of MDPs are reviewed.

7.6.6.1 Episodic Importance Weighting

The method of episodic importance weighting (EIW) [174] utilizes the independence between episodes:

$$p(d, d') = p(d)\, p(d').$$

Based on the independence between episodes, the error $G$ defined by equation 7.6 can be rewritten as

$$G = \mathop{\mathbb{E}}_{P_I, \tilde{\pi}, P_T} \left[ \frac{1}{N} \sum_{n=1}^{N} g(s_n, a_n; \theta)\, w_N \right],$$

where

$$w_N := \frac{p_\pi(d)}{p_{\tilde{\pi}}(d)}.$$

$p_\pi(d)$ and $p_{\tilde{\pi}}(d)$ are the probability densities of observing episodic data $d$ under policy $\pi$ and policy $\tilde{\pi}$, respectively:

$$p_\pi(d) := P_I(s_1) \prod_{n=1}^{N} \pi(a_n|s_n)\, P_T(s_{n+1}|s_n, a_n),$$

$$p_{\tilde{\pi}}(d) := P_I(s_1) \prod_{n=1}^{N} \tilde{\pi}(a_n|s_n)\, P_T(s_{n+1}|s_n, a_n).$$

Note that the importance weights can be computed without explicitly knowing $P_I$ and $P_T$ since they are canceled out:

$$w_N = \prod_{n=1}^{N} \frac{\pi(a_n|s_n)}{\tilde{\pi}(a_n|s_n)}.$$

Thus, in the off-policy reinforcement learning scenario, importance estimation (chapter 4) is not necessary. Using the training data $D^{\tilde{\pi}}$, one can construct a consistent estimator of $G$ as
$$\widehat{G}_{EIW} := \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \widehat{g}_{m,n}\, \widehat{w}_{m,N}, \qquad \widehat{w}_{m,N} := \prod_{n'=1}^{N} \frac{\pi(a_{m,n'}^{\tilde{\pi}} | s_{m,n'}^{\tilde{\pi}})}{\tilde{\pi}(a_{m,n'}^{\tilde{\pi}} | s_{m,n'}^{\tilde{\pi}})}. \tag{7.7}$$

Based on this, the parameter $\theta$ is estimated by

$$\widehat{\theta}_{EIW} := \mathop{\mathrm{argmin}}_\theta\, \widehat{G}_{EIW}.$$

7.6.6.2 Per-Decision Importance Weighting

A more sophisticated importance-weighting technique, called the per-decision importance-weighting (PIW) method, was proposed in [125]. A crucial observation in PIW is that the error at the $n$-th step does not depend on the samples after the $n$-th step; that is, the error $G$ can be rewritten as

$$G = \mathop{\mathbb{E}}_{P_I, \tilde{\pi}, P_T} \left[ \frac{1}{N} \sum_{n=1}^{N} g(s_n, a_n; \theta)\, w_n \right].$$

Using the training data $D^{\tilde{\pi}}$, one can construct a consistent estimator as follows (cf. equation 7.7):

$$\widehat{G}_{PIW} := \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \widehat{g}_{m,n}\, \widehat{w}_{m,n}. \tag{7.8}$$

$\widehat{w}_{m,n}$ in equation 7.8 contains only the relevant terms up to the $n$-th step, while $\widehat{w}_{m,N}$ in equation 7.7 includes all the terms up to the end of the episode. Based on this, the parameter $\theta$ is estimated by

$$\widehat{\theta}_{PIW} := \mathop{\mathrm{argmin}}_\theta\, \widehat{G}_{PIW}.$$

7.6.6.3 Adaptive Per-Decision Importance Weighting

The importance-weighted estimator $\widehat{\theta}_{PIW}$ (also $\widehat{\theta}_{EIW}$) is guaranteed to be consistent. However, both are not efficient in the statistical sense [145]; that is, they do not have the smallest admissible variance. For this reason, $\widehat{\theta}_{PIW}$ can have large variance
in finite sample cases, and therefore learning with PIW can be unstable in practice. Below, an adaptive importance-weighting method (see section 2.1.2) is employed for enhancing stability.

In order to improve the estimation accuracy, it is important to control the trade-off between consistency and efficiency (or, similarly, bias and variance) based on the training data. Here, the flattening parameter $\nu$ $(\in [0, 1])$ is introduced to control the trade-off by slightly "flattening" the importance weights [145, 160]:

$$\widehat{G}_{AIW} := \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \widehat{g}_{m,n}\, (\widehat{w}_{m,n})^\nu, \tag{7.9}$$

where AIW stands for "adaptive PIW." Based on this, the parameter $\theta$ is estimated as follows:

$$\widehat{\theta}_{AIW} := \mathop{\mathrm{argmin}}_\theta\, \widehat{G}_{AIW}.$$

When $\nu = 0$, $\widehat{\theta}_{AIW}$ is reduced to the ordinary estimator $\widehat{\theta}_{NIW}$. Therefore, it has large bias but relatively small variance. On the other hand, when $\nu = 1$, $\widehat{\theta}_{AIW}$ is reduced to the importance-weighted estimator $\widehat{\theta}_{PIW}$. Therefore, it has small bias but relatively large variance. In practice, an intermediate $\nu$ would yield the best performance.

The solution $\widehat{\theta}_{AIW}$ can be computed analytically as a weighted least-squares solution [66], with the $B$-dimensional column vector $\widehat{\psi}(s, a; D)$ defined by

$$\widehat{\psi}(s, a; D) := \phi(s, a) - \frac{\gamma}{|D_{(s,a)}|} \sum_{s' \in D_{(s,a)}} \mathop{\mathbb{E}}_{\pi(a'|s')} \left[ \phi(s', a') \right]. \tag{7.10}$$

This implies that the cost for computing $\widehat{\theta}_{AIW}$ is essentially the same as for $\widehat{\theta}_{NIW}$ and $\widehat{\theta}_{PIW}$.
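The episodic, per-decision, and flattened importance weights can be computed from one episode's action sequence as follows. This is a minimal sketch: the two policies are hypothetical, state-independent toy policies chosen only to keep the example short; in a real MDP both would depend on the state.

```python
import numpy as np

def importance_weights(pi_new, pi_old, actions, states, nu=1.0):
    # Stepwise ratios pi(a_n|s_n) / pi_tilde(a_n|s_n) along one episode.
    ratios = np.array([pi_new(a, s) / pi_old(a, s)
                       for a, s in zip(actions, states)])
    w_piw = np.cumprod(ratios) ** nu   # per-decision weights w_n, flattened by nu
    w_eiw = np.prod(ratios)            # episodic weight w_N: the full product
    return w_piw, w_eiw

# Hypothetical state-independent policies over two actions {0, 1}.
pi_new = lambda a, s: 0.8 if a == 1 else 0.2
pi_old = lambda a, s: 0.5
actions = [1, 1, 0]
states = [None, None, None]            # states unused by these toy policies
w_piw, w_eiw = importance_weights(pi_new, pi_old, actions, states, nu=1.0)
```

Note that neither $P_I$ nor $P_T$ appears anywhere: as shown above, they cancel in the ratio, so no density estimation is needed. Setting `nu=0` recovers the unweighted (NIW) case, and `nu=1` the full PIW weights.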
7.6.7 Automatic Selection of the Flattening Parameter

As shown above, the performance of AIW depends on the choice of the flattening parameter $\nu$. Ideally, $\nu$ is set so that the approximation error $G$ is minimized. However, the true $G$ is inaccessible in practice. To cope with this problem, the approximation error $G$ may be replaced by its estimator obtained using IWCV [160] (see section 3.3). Below we explain how IWCV can be applied to the selection of the flattening parameter $\nu$ in the context of value function approximation.

Let us divide the training data set $D^{\tilde{\pi}}$ containing $M$ episodes into $K$ subsets $\{D_k^{\tilde{\pi}}\}_{k=1}^{K}$ of approximately the same size (we used $K = 5$ in the experiments). For simplicity, assume that $M$ is divisible by $K$. Let $\widehat{\theta}_{AIW}^{k}$ be the parameter learned from $\{D_{k'}^{\tilde{\pi}}\}_{k' \ne k}$ with AIW (see equation 7.9). Then the approximation error is estimated by

$$\widehat{G}_{IWCV} = \frac{1}{K} \sum_{k=1}^{K} \widehat{G}_{IWCV}^{k},$$

where $\widehat{G}_{IWCV}^{k}$ is the importance-weighted error of $\widehat{\theta}_{AIW}^{k}$ computed on the held-out subset $D_k^{\tilde{\pi}}$.

The approximation error is estimated by the above $K$-fold IWCV method for all candidate models (in the current setting, a candidate model corresponds to a different value of the flattening parameter $\nu$). Then the model candidate that minimizes the estimated error is selected:

$$\widehat{\nu}_{IWCV} = \mathop{\mathrm{argmin}}_\nu\, \widehat{G}_{IWCV}.$$

In general, the use of IWCV is computationally rather expensive since $\widehat{\theta}_{AIW}^{k}$ and $\widehat{G}_{IWCV}^{k}$ need to be computed many times. For example, when performing fivefold IWCV for 11 candidates of the flattening parameter $\nu \in \{0.0, 0.1, \ldots, 0.9, 1.0\}$, $\widehat{\theta}_{AIW}^{k}$ and $\widehat{G}_{IWCV}^{k}$ need to be computed 55 times. However, this would be acceptable in practice for two reasons. First, sensible model selection via IWCV allows one to obtain a much better solution with a small number of samples. Thus, in total, the computation time may not grow that much. The second reason is that cross-validation is suitable for parallel computing, since error estimation for different flattening parameters and different folds are independent of one another.
For instance, when performing fivefold
IWCV for 11 candidates of the flattening parameter, one can compute $\widehat{G}_{IWCV}$ for all candidates at once in parallel, using 55 CPUs; this would be highly realistic in the current computing environment. If a simulated problem is solved, the storage of all sequences can be more costly than resampling. However, for real-world applications, it is essential to reuse data, and IWCV will be one of the more promising approaches.

7.6.8 Sample Reuse Policy Iteration

So far, the AIW+IWCV method has been considered only in the context of policy evaluation. Here, this method is extended to the full policy iteration setup. Let us denote the policy at the $l$-th iteration by $\pi_l$ and the maximum number of iterations by $L$. In general policy iteration methods, new data samples $D^{\pi_l}$ are collected following the new policy $\pi_l$ during the policy evaluation step. Thus, previously collected data samples $\{D^{\pi_1}, D^{\pi_2}, \ldots, D^{\pi_{l-1}}\}$ are not used:

$$\pi_1 \xrightarrow{E:\{D^{\pi_1}\}} \widehat{Q}^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E:\{D^{\pi_2}\}} \widehat{Q}^{\pi_2} \xrightarrow{I} \pi_3 \xrightarrow{E:\{D^{\pi_3}\}} \cdots \xrightarrow{I} \pi_L,$$

where $E:\{D\}$ indicates policy evaluation using the data sample $D$. It would be more cost-efficient if one could reuse all previously collected data samples to perform policy evaluation with a growing data set as

$$\pi_1 \xrightarrow{E:\{D^{\pi_1}\}} \widehat{Q}^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E:\{D^{\pi_1}, D^{\pi_2}\}} \widehat{Q}^{\pi_2} \xrightarrow{I} \pi_3 \xrightarrow{E:\{D^{\pi_1}, D^{\pi_2}, D^{\pi_3}\}} \cdots \xrightarrow{I} \pi_L.$$

Reusing previously collected data samples turns this into an off-policy scenario, because the previous policies and the current policy are different unless the current policy has converged to the optimal one. Here, the AIW+IWCV method is applied to policy iteration. For this purpose, the definition of $\widehat{G}_{AIW}$ is extended so that multiple sampling policies $\{\pi_1, \pi_2, \ldots, \pi_l\}$ are taken into account:

$$\widehat{\theta}_{AIW}^{l} := \mathop{\mathrm{argmin}}_\theta\, \widehat{G}_{AIW}^{l},$$
$$\widehat{G}_{AIW}^{l} := \frac{1}{lMN} \sum_{l'=1}^{l} \sum_{m=1}^{M} \sum_{n=1}^{N} \widehat{g}\left(s_{m,n}^{\pi_{l'}}, a_{m,n}^{\pi_{l'}}; \theta, \{D^{\pi_{l'}}\}_{l'=1}^{l}\right) \left( \frac{\prod_{n'=1}^{n} \pi_l(a_{m,n'}^{\pi_{l'}} | s_{m,n'}^{\pi_{l'}})}{\prod_{n'=1}^{n} \pi_{l'}(a_{m,n'}^{\pi_{l'}} | s_{m,n'}^{\pi_{l'}})} \right)^{\nu_l}, \tag{7.11}$$
where $\widehat{G}_{AIW}^{l}$ is the approximation error estimated at the $l$-th policy evaluation using AIW. The flattening parameter $\nu_l$ is chosen based on IWCV before performing policy evaluation. This method is called sample reuse policy iteration (SRPI) [66].

7.6.9 Robot Control Experiments

Here, the performance of the SRPI method is evaluated in a robot control task of swinging up an inverted pendulum. We consider the task of an upward swinging inverted pendulum illustrated in figure 7.6a, consisting of a rod hinged at the top of a cart. The goal of the task is to swing the rod up by moving the cart. We have three actions: applying positive force $+50$ [kg·m/s²] to the cart to move right, negative force $-50$ to move left, and zero force just to coast. That is, the action space $A$ is discrete and described by

$$A = \{50, -50, 0\} \ \text{[kg·m/s²]}.$$

Note that the force itself is not strong enough, so the cart needs to be moved back and forth several times to swing the rod up. The state space $S$ is continuous and consists of the angle $\varphi$ [rad] $(\in [0, 2\pi])$ and the angular velocity $\dot{\varphi}$ [rad/s] $(\in [-\pi, \pi])$; thus, a state $s$ is described by a two-dimensional vector $s = (\varphi, \dot{\varphi})^\top$. Figure 7.6b shows the parameter setting used in the simulation. The angle $\varphi$ and the angular velocity $\dot{\varphi}$ are updated as follows (reconstructed here from the standard cart-pendulum benchmark dynamics):

$$\varphi_{t+1} = \varphi_t + \Delta t\, \dot{\varphi}_{t+1},$$

$$\dot{\varphi}_{t+1} = \dot{\varphi}_t + \Delta t\, \frac{9.8 \sin(\varphi_t) - \alpha w d\, (\dot{\varphi}_t)^2 \sin(2\varphi_t)/2 + \alpha \cos(\varphi_t)\, a_t}{4d/3 - \alpha w d \cos^2(\varphi_t)},$$

[Figure 7.6: (a) Illustration of the inverted pendulum task. (b) Parameter setting used in the simulation: mass of the cart $W = 8$ [kg]; mass of the rod $w = 2$ [kg]; length of the rod $d = 0.5$ [m]; simulation time step $\Delta t = 0.1$ [s].]
where $\alpha = 1/(W + w)$ and $a_t$ is the action in $A$ chosen at time $t$. The reward function $R(s, a, s')$ is defined as

$$R(s, a, s') = \cos(\varphi_{s'}),$$

where $\varphi_{s'}$ denotes the angle $\varphi$ of state $s'$.

Forty-eight Gaussian kernels with standard deviation $\sigma = \pi$ are used as basis functions, and kernel centers $\{c_j\}_{j=1}^{16}$ are distributed over the following grid points:

$$\{0,\, 2\pi/3,\, 4\pi/3,\, 2\pi\} \times \{-3\pi,\, -\pi,\, \pi,\, 3\pi\}.$$

That is, the basis functions are set to

$$\phi_{16(i-1)+j}(s, a) = I(a = a^{(i)}) \exp\left( -\frac{\|s - c_j\|^2}{2\sigma^2} \right)$$

for $i = 1, 2, 3$ and $j = 1, 2, \ldots, 16$, where $I(\cdot)$ denotes the indicator function and $(a^{(1)}, a^{(2)}, a^{(3)}) = (50, -50, 0)$.

The initial policy $\pi_1(a|s)$ is chosen randomly, and the initial-state probability density $P_I(s)$ is set to be uniform. The agent collects data samples $D^{\pi_l}$ ($M = 10$ and $N = 100$) at each policy iteration following the current policy $\pi_l$. The discount factor is set to $\gamma = 0.95$, and the policy is improved by the Gibbs policy (equation 7.3) with $\tau = 1$.

Figure 7.7a describes the performance of learned policies, measured by the discounted sum of rewards, as a function of the total number of episodes. The graph shows that SRPI nicely improves the performance throughout the entire policy iteration process. On the other hand, the performance when the flattening parameter is fixed to $\nu = 0$ or $\nu = 1$ is not properly improved after the middle iterations. The average flattening parameter value as a function of the
[Figure 7.7: The results of sample reuse policy iteration in upward swing of an inverted pendulum. The agent collects training samples $D^{\pi_l}$ ($M = 10$ and $N = 100$) at every iteration, and policy evaluation is performed using all collected samples $\{D^{\pi_1}, D^{\pi_2}, \ldots, D^{\pi_l}\}$. (a) The performance of policies learned with $\nu = 0$, $\nu = 1$, and SRPI ($\nu = \widehat{\nu}_{IWCV}$). The performance is measured by the average sum of discounted rewards computed from test samples over 20 trials. The total number of episodes is the number of training episodes ($M \times l$) collected by the agent in policy iteration. (b) Average flattening parameter used by SRPI over 20 trials.]
total number of episodes is depicted in figure 7.7b, showing that the parameter value tends to increase quickly in the beginning and then is kept at medium values. These results indicate that the flattening parameter is well adjusted to reuse the previously collected samples effectively for policy evaluation, and thus SRPI can outperform the other methods.
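The simulation setup above (pendulum dynamics, reward, and the 48 Gaussian basis functions) can be sketched as follows. The update rule is our reconstruction of the standard cart-pendulum benchmark dynamics consistent with $\alpha = 1/(W + w)$; treat its exact form as an assumption rather than the authors' verbatim equations.

```python
import numpy as np

W, w, d, dt = 8.0, 2.0, 0.5, 0.1          # cart mass, rod mass, rod length, time step
alpha = 1.0 / (W + w)

def step(phi, phi_dot, a):
    # One Euler step of the (assumed) cart-pendulum dynamics; a in {+50, -50, 0}.
    acc = (9.8 * np.sin(phi)
           - alpha * w * d * phi_dot ** 2 * np.sin(2 * phi) / 2
           + alpha * np.cos(phi) * a) / (4 * d / 3 - alpha * w * d * np.cos(phi) ** 2)
    phi_dot = np.clip(phi_dot + dt * acc, -np.pi, np.pi)   # keep velocity in [-pi, pi]
    phi = (phi + dt * phi_dot) % (2 * np.pi)               # keep angle in [0, 2*pi)
    return phi, phi_dot

def reward(phi_next):
    return np.cos(phi_next)                # R(s, a, s') = cos(phi of s')

# Gaussian basis: 16 grid centers x 3 actions = 48 basis functions.
centers = np.array([(p, v) for p in [0, 2 * np.pi / 3, 4 * np.pi / 3, 2 * np.pi]
                            for v in [-3 * np.pi, -np.pi, np.pi, 3 * np.pi]])
sigma = np.pi
actions = [50.0, -50.0, 0.0]

def features(s, a):
    g = np.exp(-((centers - s) ** 2).sum(axis=1) / (2 * sigma ** 2))
    out = np.zeros(48)
    i = actions.index(a)                   # indicator I(a = a^(i)) selects one block
    out[16 * i:16 * (i + 1)] = g
    return out

phi, phi_dot = np.pi, 0.0                  # hanging-down state
phi, phi_dot = step(phi, phi_dot, 50.0)
f = features(np.array([phi, phi_dot]), 50.0)
```

The block structure of `features` mirrors the fact that each discrete action gets its own copy of the 16 spatial kernels, so the linear value model can represent different value surfaces per action.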
8 Active Learning

Active learning [107, 33, 55], also referred to as experimental design in statistics [93, 48, 127], is the problem of determining the locations of training input points so that the generalization error is minimized (see figure 8.1). Active learning is particularly useful when sampling the output value $y$ is very expensive. In such cases, we want to find the best input points at which to observe output values within a fixed budget (which corresponds to the number $n_{tr}$ of training samples).

Since training input points are generated following a user-defined distribution, covariate shift naturally occurs in the active learning scenario. Thus, covariate shift adaptation techniques also play essential roles in the active learning scenario. In this chapter, we consider two types of active learning scenarios: population-based active learning, where one is allowed to sample output values at any input location (sections 8.2 and 8.3), and pool-based active learning, where input locations that are to have output values are chosen from a pool of (unlabeled) input samples (sections 8.4 and 8.5).

8.1 Preliminaries

In this section, we first summarize the common setup of this chapter. Then we describe a general strategy of active learning.

8.1.1 Setup

We focus on a linear regression setup throughout this chapter:

• The squared loss (see section 1.3.2) is used, and the training output noise is assumed to be i.i.d. with mean zero and variance $\sigma^2$. Then the generalization
error is expressed as follows (see section 3.2.2):

$$Gen := \mathop{\mathbb{E}}_{x^{te}} \mathop{\mathbb{E}}_{y^{te}} \left[ \left( \widehat{f}(x^{te}) - y^{te} \right)^2 \right],$$

where $\mathbb{E}_{x^{te}}$ denotes the expectation over $x^{te}$ drawn from $p_{te}(x)$ and $\mathbb{E}_{y^{te}}$ denotes the expectation over $y^{te}$ drawn from $p(y|x = x^{te})$.

[Figure 8.1: Active learning (panels: target function, learned function, good training inputs, poor training inputs). In both cases, the target function is learned from two (noiseless) training samples. The only difference is the training input location.]

• The test input density $p_{te}(x)$ is assumed to be known. This assumption is essential in the population-based active learning scenario, since the training input points $\{x_i^{tr}\}_{i=1}^{n_{tr}}$ are designed so that prediction for the given $p_{te}(x)$ is improved (see sections 8.2 and 8.3 for details). In the pool-based active learning scenario, $p_{te}(x)$ is unknown, but a pool of (unlabeled) input samples drawn i.i.d. from $p_{te}(x)$ is assumed to be given (see sections 8.4 and 8.5 for details).

• A linear-in-parameter model (see section 1.3.5.1) is used:

$$\widehat{f}(x; \theta) = \sum_{\ell=1}^{b} \theta_\ell\, \varphi_\ell(x), \tag{8.1}$$

where $\theta$ is the parameter to be learned and $\{\varphi_\ell(x)\}_{\ell=1}^{b}$ are fixed, linearly independent functions.

• A linear method is used for learning the parameter $\theta$; that is, the learned parameter $\widehat{\theta}$ is given by

$$\widehat{\theta} = L y^{tr},$$
where

  y^tr := (y_1^tr, y_2^tr, ..., y_{n_tr}^tr)^⊤.

L is a b × n_tr learning matrix that is independent of the training output noise; this independence assumption is essentially used as

  E_{y^tr} [ θ̂ ] = L E_{y^tr} [ y^tr ],

where E_{y^tr} denotes the expectation over {y_i^tr}_{i=1}^{n_tr}, each of which is drawn from p(y | x = x_i^tr).

8.1.2 Decomposition of Generalization Error

Let θ* be the optimal parameter under the generalization error:

  θ* := argmin_θ [ Gen ].

Let δ r(x) be the residual function:

  δ r(x) := f̂(x; θ*) − f(x).

We normalize r(x) as

  E_{x^te} [ r²(x^te) ] = 1,

so that δ governs the "magnitude" of the residual and r(x) expresses the "direction" of the residual. Note that r(x) is orthogonal to f̂(x; θ) for any θ under p_te(x) (see figure 3.2 in section 3.2.2):

  E_{x^te} [ r(x^te) f̂(x^te; θ) ] = 0  for any θ.

Then the generalization error expected over the training output values {y_i^tr}_{i=1}^{n_tr} can be decomposed as

  E_{y^tr} [ Gen ] = E_{y^tr} E_{x^te} [ ( f̂(x^te; θ̂) − f̂(x^te; θ*) + δ r(x^te) )² ] + σ²
where

  E_{y^tr} [ Gen ] = Gen'' + δ² E_{x^te} [ r²(x^te) ]
      + 2 E_{y^tr} E_{x^te} [ ( f̂(x^te; θ̂) − f̂(x^te; θ*) ) δ r(x^te) ] + σ²
      = Gen'' + δ² + σ²,    (8.2)

with

  Gen'' := E_{y^tr} E_{x^te} [ ( f̂(x^te; θ̂) − f̂(x^te; θ*) )² ].

The cross term vanishes because f̂(x; θ̂) − f̂(x; θ*) lies in the span of the basis functions, to which r(x) is orthogonal under p_te(x), and E_{x^te}[r²(x^te)] = 1 by the normalization above.

Applying the standard bias-variance decomposition technique to Gen'', we have

  Gen'' = E_{y^tr} E_{x^te} [ ( f̂(x^te; θ̂) − E_{y^tr}[ f̂(x^te; θ̂) ]
          + E_{y^tr}[ f̂(x^te; θ̂) ] − f̂(x^te; θ*) )² ]
        = E_{y^tr} E_{x^te} [ ( f̂(x^te; θ̂) − E_{y^tr}[ f̂(x^te; θ̂) ] )² ]
          + 2 E_{x^te} [ E_{y^tr} [ f̂(x^te; θ̂) − E_{y^tr}[ f̂(x^te; θ̂) ] ]
            ( E_{y^tr}[ f̂(x^te; θ̂) ] − f̂(x^te; θ*) ) ]
          + E_{x^te} [ ( E_{y^tr}[ f̂(x^te; θ̂) ] − f̂(x^te; θ*) )² ]
        = Var + Bias²,

since the middle (cross) term vanishes, where Bias² and Var denote the squared bias and the variance defined as follows:
  Bias² := E_{x^te} [ ( E_{y^tr}[ f̂(x^te; θ̂) ] − f̂(x^te; θ*) )² ],    (8.3)

  Var := E_{y^tr} E_{x^te} [ ( f̂(x^te; θ̂) − E_{y^tr}[ f̂(x^te; θ̂) ] )² ].    (8.4)

Under the linear learning setup (equation 8.1), Gen'', Bias², and Var can be expressed in matrix/vector form as follows (see figure 8.2):

  Gen'' = E_{y^tr} [ ⟨ U (θ̂ − θ*), θ̂ − θ* ⟩ ],
  Bias² = ⟨ U ( E_{y^tr}[θ̂] − θ* ), E_{y^tr}[θ̂] − θ* ⟩,
  Var = E_{y^tr} [ ⟨ U (θ̂ − E_{y^tr}[θ̂]), θ̂ − E_{y^tr}[θ̂] ⟩ ] = σ² tr( U L L^⊤ ),

where U is the b × b matrix with the (ℓ, ℓ′)-th element

  U_{ℓ,ℓ′} := E_{x^te} [ φ_ℓ(x^te) φ_{ℓ′}(x^te) ].    (8.5)

[Figure 8.2: Bias-variance decomposition.]
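The decomposition above can be checked numerically. The following sketch verifies both Gen'' = Bias² + Var and the quadratic-form expression Gen'' = E[⟨U(θ̂ − θ*), θ̂ − θ*⟩] by Monte Carlo over repeated noise draws; the basis, target function, and densities are illustrative assumptions, not the book's experiment, and the empirical test sample stands in for p_te.

```python
# Monte Carlo check of Gen'' = Bias^2 + Var and of the quadratic-form
# expression Gen'' = E[<U(theta_hat - theta*), theta_hat - theta*>].
# Basis, target f, and densities are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
phi = lambda x: np.stack([np.ones_like(x), x, x**2], axis=-1)  # b = 3 basis
f = lambda x: np.sin(2 * x)        # true target (so the model is misspecified)
sigma = 0.1                        # noise standard deviation

x_tr = np.linspace(-1.0, 1.0, 20)     # fixed training inputs
x_te = rng.uniform(-1.0, 1.0, 1000)   # test sample standing in for p_te

X = phi(x_tr)
L = np.linalg.solve(X.T @ X, X.T)  # OLS learning matrix: theta_hat = L y^tr
Pte = phi(x_te)
U = Pte.T @ Pte / len(x_te)        # U_{ll'} = E_te[phi_l phi_l']

# theta* = argmin E_te[(f_hat(x; theta) - f(x))^2]: weighted projection of f
theta_star = np.linalg.solve(U, Pte.T @ f(x_te) / len(x_te))

runs = 2000
thetas = np.array([L @ (f(x_tr) + sigma * rng.standard_normal(len(x_tr)))
                   for _ in range(runs)])
pred = thetas @ Pte.T              # predictions, shape (runs, n_te)
pred_star = Pte @ theta_star

gen2 = np.mean((pred - pred_star) ** 2)                  # Gen''
bias2 = np.mean((pred.mean(axis=0) - pred_star) ** 2)    # Bias^2
var = np.mean((pred - pred.mean(axis=0)) ** 2)           # Var
D = thetas - theta_star
qform = np.einsum('rl,lm,rm->r', D, U, D).mean()         # E[<U d, d>]
```

Because the cross term averages to exactly zero over the runs, the two decompositions agree to floating-point precision.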
In some literature, Bias² + δ² may be called the bias term:

  Bias² + δ² = Bias² + δ² E_{x^te} [ r²(x^te) ]
             = E_{x^te} [ ( E_{y^tr}[ f̂(x^te; θ̂) ] − f(x^te) )² ].

In our notation, we further decompose the above bias (Bias² + δ²) into the reducible part (Bias²) and the irreducible part (δ²), and refer to them as the bias and the model error, respectively.

8.1.3 Basic Strategy of Active Learning

The goal of active learning is to determine the training input locations {x_i^tr}_{i=1}^{n_tr} so that the generalization error is minimized. Below, we focus on Gen'', the reducible part of Gen defined by equation 8.2, which is expressed as

  Gen'' = Gen − δ² − σ².

Note that the way the generalization error is decomposed here is different from that in model selection (cf. Gen′ defined by equation 3.2).

Gen'' is unknown, so it needs to be estimated from samples. However, in active learning we do not yet have the training output values {y_i^tr}_{i=1}^{n_tr}, since the training input points {x_i^tr}_{i=1}^{n_tr} have not yet been determined. Thus, the generalization error should be estimated without observing the training output values. Generalization error estimation in active learning is therefore generally more difficult than in model selection, where both {x_i^tr}_{i=1}^{n_tr} and {y_i^tr}_{i=1}^{n_tr} are available.

Since we cannot use {y_i^tr}_{i=1}^{n_tr} for estimating the generalization error, it is generally not possible to estimate the bias (equation 8.3), which depends on the true target function f(x). A general strategy for coping with this problem is to find a setup in which the bias is guaranteed to be zero (or small enough to be neglected), and to design the locations of the training input points {x_i^tr}_{i=1}^{n_tr} so that the variance (equation 8.4) is minimized. Following this basic strategy, we describe several active learning methods below.

8.2 Population-Based Active Learning Methods

Population-based active learning refers to the situation where we know the distribution of test input points and are allowed to locate training input
points at any desired positions. The goal of population-based active learning is to find the optimal training input density p_tr(x) (from which the training input points {x_i^tr}_{i=1}^{n_tr} are generated) so that the generalization error is minimized.

8.2.1 Classical Method of Active Learning for Correct Models

A traditional variance-only active learning method assumes the following conditions (see, e.g., [48, 33, 55]):

• The model at hand is correctly specified (see section 1.3.6); that is, the model error δ is zero. In this case, we have f̂(x; θ*) = f(x).

• The design matrix X^tr, which is the n_tr × b matrix with the (i, ℓ)-th element

  X^tr_{i,ℓ} := φ_ℓ(x_i^tr),    (8.6)

has rank b. Then the inverse of X^tr⊤ X^tr exists.

• Ordinary least squares (OLS, which is an ERM method with the squared loss; see chapter 2) is used for parameter learning:

  θ̂_OLS := argmin_θ [ Σ_{i=1}^{n_tr} ( f̂(x_i^tr; θ) − y_i^tr )² ].

The solution θ̂_OLS is given by

  θ̂_OLS = L_OLS y^tr,  where  L_OLS := (X^tr⊤ X^tr)^{-1} X^tr⊤    (8.7)

and

  y^tr := (y_1^tr, y_2^tr, ..., y_{n_tr}^tr)^⊤.

Under the above setup, Bias², defined in equation 8.3, is guaranteed to be zero, since
  E_{y^tr} [ θ̂_OLS ] = (X^tr⊤ X^tr)^{-1} X^tr⊤ E_{y^tr} [ y^tr ]
                      = (X^tr⊤ X^tr)^{-1} X^tr⊤ X^tr θ*
                      = θ*.

Then the training input points {x_i^tr}_{i=1}^{n_tr} are optimized so that Var (equation 8.4) is minimized:

  min_{ {x_i^tr}_{i=1}^{n_tr} } [ tr( U (X^tr⊤ X^tr)^{-1} ) ].    (8.8)

8.2.2 Limitations of Classical Approach and Countermeasures

In the traditional active learning method explained above (see also [48, 33, 55]), the minimizer of the variance really minimizes the generalization error, since the bias is exactly zero under the correct model assumption. Thus, the minimizer is truly the optimal solution. However, this approach has two drawbacks.

The first drawback is that minimizing the variance with respect to {x_i^tr}_{i=1}^{n_tr} may not be computationally tractable, due to the simultaneous optimization of n_tr input points. This problem can be eased by optimizing the training input density p_tr(x) and drawing the training input points {x_i^tr}_{i=1}^{n_tr} from the determined density.

The second, and more critical, problem is that correctness of the model may not be guaranteed in practice; then minimizing the variance does not necessarily result in minimizing the generalization error. In other words, the traditional active learning method based on OLS can have large bias. This problem can be mitigated by the use of importance-weighted least squares (IWLS; see section 2.2.1). The solution θ̂_IWLS is given by

  θ̂_IWLS = L_IWLS y^tr,  where  L_IWLS := (X^tr⊤ W X^tr)^{-1} X^tr⊤ W,    (8.9)

and W is the diagonal matrix with the i-th diagonal element

  W_{i,i} := p_te(x_i^tr) / p_tr(x_i^tr).
IWLS is shown to be asymptotically unbiased even when the model is misspecified; indeed, we have

  E_{y^tr} [ θ̂_IWLS ] = (X^tr⊤ W X^tr)^{-1} X^tr⊤ W ( X^tr θ* − δ r^tr )
                       = θ* − δ (X^tr⊤ W X^tr)^{-1} X^tr⊤ W r^tr,

where r^tr is the n_tr-dimensional vector defined by

  r^tr := ( r(x_1^tr), r(x_2^tr), ..., r(x_{n_tr}^tr) )^⊤.

As shown in section 3.2.3.2, the second term is of order O_p(n_tr^{-1/2}), where O_p denotes the asymptotic order in probability. Thus, we have

  E_{y^tr} [ θ̂_IWLS ] = θ* + O_p(n_tr^{-1/2}).

Below, we introduce useful active learning methods which can overcome the limitations of the traditional OLS-based approach.

8.2.3 Input-Independent Variance-Only Method

For the IWLS estimator θ̂_IWLS, it has been proved [89] that the generalization error expected over the training input points {x_i^tr}_{i=1}^{n_tr} and the training output values {y_i^tr}_{i=1}^{n_tr} (i.e., input-independent analysis; see section 3.2.1) is asymptotically expressed as

  E [ Gen'' ] = (1/n_tr) tr( U^{-1} H ) + O(n_tr^{-3/2}),    (8.10)

where H is the b × b matrix defined by

  H := δ² S + σ² T.    (8.11)
S and T are the b × b matrices with the (ℓ, ℓ′)-th elements

  S_{ℓ,ℓ′} := E_{x^te} [ φ_ℓ(x^te) φ_{ℓ′}(x^te) r²(x^te) p_te(x^te)/p_tr(x^te) ],

  T_{ℓ,ℓ′} := E_{x^te} [ φ_ℓ(x^te) φ_{ℓ′}(x^te) p_te(x^te)/p_tr(x^te) ],    (8.12)

where E_{x^te} denotes the expectation over x^te drawn from p_te(x). Note that (δ²/n_tr) tr(U^{-1}S) corresponds to the bias term and (σ²/n_tr) tr(U^{-1}T) corresponds to the variance term. S is not accessible because the unknown r(x) is included, while T is accessible thanks to the assumption that the test input density p_te(x) is known (see section 8.1.1).

Equation 8.10 suggests that tr(U^{-1}H) may be used as an active learning criterion. However, H includes the inaccessible quantities r(x) and σ², so tr(U^{-1}H) cannot be calculated directly. To cope with this problem, Wiens [199] proposed¹ to ignore S (the bias term) and determine the training input density p_tr(x) so that the variance term is minimized:

  p_tr^* := argmin_{p_tr} [ tr( U^{-1} T ) ].    (8.13)

As shown in [153], the above active learning method can be justified for approximately correct models, that is, for δ = o(1).

A notable feature of the above active learning method is that the optimal training input density p_tr^* can be obtained analytically as

  p_tr^*(x) ∝ p_te(x) ( Σ_{ℓ,ℓ′=1}^{b} [U^{-1}]_{ℓ,ℓ′} φ_ℓ(x) φ_{ℓ′}(x) )^{1/2},    (8.14)

which may be confirmed from the following equation [87]:

  tr( U^{-1} T ) ∝ 1 + ∫ ( p_tr^*(x) − p_tr(x) )² / p_tr(x) dx.

1. In the original paper, the range of application was limited to cases where the input domain is bounded and p_te(x) is uniform over the domain. However, it can easily be extended to an arbitrary strictly positive p_te(x). For this reason, we deal with the extended version here.
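The closed-form density of equation 8.14 can be evaluated numerically. The sketch below uses an illustrative polynomial basis and borrows the Gaussian p_te of the simulation in section 8.3 for concreteness:

```python
# Evaluate the optimal training input density of equation 8.14,
#   p*_tr(x) ∝ p_te(x) * (sum_{l,l'} [U^{-1}]_{l,l'} phi_l(x) phi_l'(x))^{1/2},
# on a grid, with a polynomial basis (illustrative) and p_te = N(0.2, 0.4^2).
import numpy as np

phi = lambda x: np.stack([np.ones_like(x), x, x**2], axis=-1)

x = np.linspace(-2.8, 3.2, 6001)           # grid covering p_te's support
dx = x[1] - x[0]
p_te = np.exp(-(x - 0.2) ** 2 / (2 * 0.4 ** 2)) / (0.4 * np.sqrt(2 * np.pi))

P = phi(x)
U = (P * (p_te * dx)[:, None]).T @ P       # U_{ll'} = ∫ phi_l phi_l' p_te dx
q = np.einsum('il,lm,im->i', P, np.linalg.inv(U), P)  # squared U^{-1}-norm

p_star = p_te * np.sqrt(q)
p_star /= p_star.sum() * dx                # normalize to a density

mean = np.sum(x * p_star) * dx
std = np.sqrt(np.sum((x - mean) ** 2 * p_star) * dx)
```

Consistent with the discussion in section 8.3, p_tr^*(x) here comes out wider than p_te(x) (standard deviation roughly 0.5 versus 0.4), since points with a large U^{-1}-norm of the basis vector are up-weighted.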
[Figure 8.3: Illustration of the weights in the optimal training input density p_tr^*(x).]

Equation 8.14 implies that in the optimal training input density p_tr^*(x), p_te(x) is weighted according to the U^{-1}-norm of (φ_1(x), φ_2(x), ..., φ_b(x))^⊤. Thus, input points far from the origin tend to have higher weights in p_tr^*(x). Note that the matrix U defined by equation 8.5 corresponds to the second-order moments of the basis functions {φ_ℓ(x)}_{ℓ=1}^{b} around the origin; that is, the level sets of the weighting factor have ellipsoidal shapes (see figure 8.3).

8.2.4 Input-Dependent Variance-Only Method

As explained in section 3.2.1, estimating the generalization error without taking the expectation over the training input points {x_i^tr}_{i=1}^{n_tr} is advantageous, since the estimate is more data-dependent. Following this idea, an input-dependent variance-only active learning method called ALICE (Active Learning using Importance-weighted least-squares learning based on Conditional Expectation of the generalization error) has been proposed [153]. ALICE minimizes the input-dependent variance (equation 8.4):

  min_{p_tr} [ tr( U L_IWLS L_IWLS^⊤ ) ].    (8.15)

This criterion can also be justified for approximately correct models, that is, for δ = o(1). Input-dependent variance and input-independent variance are similar; indeed, they are asymptotically equivalent.
However, the input-dependent variance is a more accurate estimator of the generalization error than the input-independent variance. More precisely, for approximately correct models with δ = o_p(n_tr^{-1/4}), the input-dependent variance approximates Gen'' with an asymptotically smaller error than the input-independent variance, if terms of o_p(n_tr^{-3/2}) are ignored [153]. This difference is shown to cause a significant difference in practical performance (see section 8.3 for experimental results).

As shown above, the input-dependent approach is more accurate than the input-independent approach. However, its downside is that no analytic solution is known for the ALICE criterion (equation 8.15). Thus, an exhaustive search strategy is needed to obtain a better training input density. A practical compromise is to use the analytic optimal solution p_tr^*(x) of the input-independent variance-only active learning criterion (equation 8.13) as a baseline, and to search for a better solution in the vicinity of p_tr^*(x). Then we only need to perform a simple one-dimensional search: find the best training input density with respect to a scalar parameter ν:

  p_ν(x) ∝ p_te(x) ( Σ_{ℓ,ℓ′=1}^{b} [U^{-1}]_{ℓ,ℓ′} φ_ℓ(x) φ_{ℓ′}(x) )^{ν/2}.

The parameter ν controls the "flatness" of the training input density: ν = 0 corresponds to the test input density p_te(x) (i.e., passive learning, where the training and test distributions are equivalent), ν = 1 corresponds to the original p_tr^*(x), and ν > 1 expresses a "sharper" density than p_tr^*(x). Searching intensively for a good value of ν around ν = 1 would be useful in practice. A pseudo code of the ALICE algorithm is summarized in figure 8.4.
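The ν-search described above can also be sketched numerically. The sketch below uses an illustrative polynomial basis, the Gaussian p_te of section 8.3, and grid-based inverse-transform sampling as a stand-in for exact sampling from p_ν:

```python
# Sketch of the ALICE one-dimensional nu-search: for each candidate nu,
# draw training inputs from p_nu and score tr(U L_nu L_nu^T).
# Basis, densities, and the grid sampler are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(7)
phi = lambda x: np.stack([np.ones_like(x), x, x**2], axis=-1)

x = np.linspace(-2.5, 3.0, 4001)
dx = x[1] - x[0]
p_te = np.exp(-(x - 0.2) ** 2 / (2 * 0.4 ** 2)) / (0.4 * np.sqrt(2 * np.pi))
P = phi(x)
U = (P * (p_te * dx)[:, None]).T @ P
q = np.einsum('il,lm,im->i', P, np.linalg.inv(U), P)   # phi^T U^{-1} phi

n_tr = 100
scores = {}
for v in [0.0, 0.5, 1.0, 1.5, 2.0]:
    p_v = p_te * q ** (v / 2)
    p_v /= p_v.sum() * dx                              # normalized p_nu
    cdf = np.cumsum(p_v) * dx
    x_v = np.interp(rng.random(n_tr), cdf, x)          # inverse-transform draw
    Xv = phi(x_v)
    w = np.interp(x_v, x, p_te / p_v)                  # importance weights
    Lv = np.linalg.solve(Xv.T @ (w[:, None] * Xv), (w[:, None] * Xv).T)
    scores[v] = np.trace(U @ Lv @ Lv.T)                # ALICE(nu)
v_best = min(scores, key=scores.get)
```

Once the best ν is found, the outputs are gathered at the corresponding input points and the final parameter is computed with the same learning matrix, as in figure 8.4.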
Figure 8.4 Pseudo code of the ALICE algorithm:

  Input: a test input density p_te(x) and basis functions {φ_ℓ(x)}_{ℓ=1}^{b}
  Output: learned parameter θ̂

  Compute the b × b matrix U with U_{ℓ,ℓ′} = ∫ φ_ℓ(x) φ_{ℓ′}(x) p_te(x) dx;
  For several different values of ν (possibly around ν = 1):
      Let p_ν(x) ∝ p_te(x) ( Σ_{ℓ,ℓ′=1}^{b} [U^{-1}]_{ℓ,ℓ′} φ_ℓ(x) φ_{ℓ′}(x) )^{ν/2};
      Draw X_ν^tr = {x_{ν,i}^tr}_{i=1}^{n_tr} following the density proportional to p_ν(x);
      Compute the n_tr × b matrix X_ν^tr with [X_ν]_{i,ℓ} = φ_ℓ(x_{ν,i}^tr);
      Compute the n_tr × n_tr diagonal matrix W_ν with [W_ν]_{i,i} = p_te(x_{ν,i}^tr) / p_ν(x_{ν,i}^tr);
      Compute L_ν = (X_ν^tr⊤ W_ν X_ν^tr)^{-1} X_ν^tr⊤ W_ν;
      Compute ALICE(ν) = tr( U L_ν L_ν^⊤ );
  End
  Compute ν̂ = argmin_ν [ ALICE(ν) ];
  Gather training output values y^tr = (y_1^tr, y_2^tr, ..., y_{n_tr}^tr)^⊤ at X_ν̂^tr;
  Compute θ̂ = L_ν̂ y^tr;

8.2.5 Input-Independent Bias-and-Variance Approach

The use of the input-independent/dependent variance-only approaches is justified only for approximately correct models (i.e., the model error δ vanishes asymptotically). Although this condition turns out to be less problematic in practice (see section 8.3 for numerical results), it cannot be satisfied theoretically, since the model error δ is a constant that does not depend on the number of samples.

To cope with this problem, a two-stage active learning method has been developed [89] which is theoretically justifiable for totally misspecified models within the input-independent framework (see section 8.2.3). The key idea is to use the samples gathered in the first stage for estimating the generalization error (i.e., both bias and variance); the training input distribution is then optimized based on the estimated generalization error in the second stage. Here, we explain the details of this algorithm.

In the first stage, ñ_tr (≤ n_tr) training input points {x̃_i^tr}_{i=1}^{ñ_tr} are created independently following the test input density p_te(x), and the corresponding training output values {ỹ_i^tr}_{i=1}^{ñ_tr} are observed.
Then we determine the training input density based on the above samples. More specifically, we prepare candidates for the training input density. We then estimate the generalization error for each candidate p_tr(x), and the one that
minimizes the estimated generalization error is chosen for the second stage. The generalization error for a candidate density p_tr(x) is estimated as follows. Let W̃ and Q̃ be the ñ_tr-dimensional diagonal matrices with the i-th diagonal elements

  W̃_{i,i} := p_te(x̃_i^tr) / p_tr(x̃_i^tr)  and  Q̃_{i,i} := ( ỹ_i^tr − f̂(x̃_i^tr; θ̃) )²,

respectively, where θ̃ is the OLS solution computed from the first-stage samples, X̃^tr is the design matrix for {x̃_i^tr}_{i=1}^{ñ_tr} (i.e., the ñ_tr × b matrix with the (i, ℓ)-th element φ_ℓ(x̃_i^tr)), and

  ỹ^tr := ( ỹ_1^tr, ỹ_2^tr, ..., ỹ_{ñ_tr}^tr )^⊤.

Then an approximation Ĥ of the unknown matrix H defined by equation 8.11 is given as

  Ĥ := (1/ñ_tr) X̃^tr⊤ W̃ Q̃ X̃^tr.    (8.16)

Although U^{-1} is accessible in the current setting, it was also replaced by a consistent estimator Ũ^{-1} in [89], where

  Ũ := (1/ñ_tr) X̃^tr⊤ X̃^tr.

Based on the above approximations, the training input density p_tr(x) is determined as follows:

  min_{p_tr} [ tr( Ũ^{-1} Ĥ ) ].    (8.17)

After determining the training input density p_tr(x), the remaining (n_tr − ñ_tr) training input points {x_i^tr}_{i=ñ_tr+1}^{n_tr} are created independently following the chosen p_tr(x), and the corresponding training output values {y_i^tr}_{i=ñ_tr+1}^{n_tr} are gathered. This is the second stage of sampling. Finally, the learned parameter θ̂ is obtained, using {(x̃_i^tr, ỹ_i^tr)}_{i=1}^{ñ_tr} and {(x_i^tr, y_i^tr)}_{i=ñ_tr+1}^{n_tr}, as
  θ̂ = argmin_θ [ Σ_{i=1}^{ñ_tr} ( f̂(x̃_i^tr; θ) − ỹ_i^tr )²
       + Σ_{i=ñ_tr+1}^{n_tr} (p_te(x_i^tr)/p_tr(x_i^tr)) ( f̂(x_i^tr; θ) − y_i^tr )² ].    (8.18)

The above active learning method has a strong theoretical property: for ñ_tr = o(n_tr), lim_{n_tr→∞} ñ_tr = ∞, and δ = O(1), the estimated generalization error converges to the true asymptotic generalization error. This means that the use of this two-stage active learning method can be justified for totally misspecified models (i.e., δ = O(1)). Furthermore, this method can be applied to more general scenarios beyond regression [89], such as classification with logistic regression (see section 2.3.2).

On the other hand, although the two-stage active learning method has good theoretical properties, it seems not to perform very well in practice (see section 8.3 for numerical examples), which may be due to the following reasons.

• Since ñ_tr training input points are gathered following p_te(x) in the first stage, users are allowed to optimize the locations of only the (n_tr − ñ_tr) remaining training input points. This is particularly critical when the total number n_tr is not very large, which is the usual case in active learning.

• The performance depends on the choice of the number ñ_tr, and it is not straightforward to determine this number appropriately. Using ñ_tr = O(n_tr^{1/2}) is recommended in [89], but the exact choice of ñ_tr still seems to be open.

• The estimated generalization error corresponds to the case where n_tr points are chosen from p_tr(x) and IWLS is used; but in reality, ñ_tr points are taken from p_te(x) and (n_tr − ñ_tr) points are taken from p_tr(x), and a combination of OLS and IWLS is used for parameter learning. This difference can degrade the performance. It is possible to resolve this problem by not using {(x̃_i^tr, ỹ_i^tr)}_{i=1}^{ñ_tr}, gathered in the first stage, for estimating the parameter. However, this may yield further degradation of the performance, because only (n_tr − ñ_tr) training examples are then used for learning.
• Estimation of the bias and the noise variance is implicitly included (see equations 8.11 and 8.16). In practice, estimating the bias and the noise variance from a small number of training samples is highly erroneous, and thus the performance of active learning can be degraded. The variance-only methods avoid this difficulty by ignoring the bias; the noise variance included in the variance term then becomes just a proportional constant and can also justifiably be ignored.
Currently, bias-and-variance approaches in the input-dependent framework seem to be an open research issue.

8.3 Numerical Examples of Population-Based Active Learning Methods

In this section, we illustrate how the population-based active learning methods described in section 8.2 behave under a controlled setting.

8.3.1 Setup

Let the input dimension be d = 1 and the learning target function be

  f(x) = 1 − x + x² + δ r(x),

where

  r(x) = (z³ − 3z)/√6  with  z = (x − 0.2)/0.4.    (8.19)

Note that the above r(x) is the Hermite polynomial, which ensures the orthonormality of r(x) to the second-order polynomial model under a Gaussian test input distribution (see below for details). Let the number of training examples to gather be n_tr = 100, and let us add i.i.d. Gaussian noise with mean zero and standard deviation 0.3 to the output values:

  y_i^tr = f(x_i^tr) + ε_i,  ε_i ~ N(ε; 0, (0.3)²),

where N(ε; μ, σ²) denotes the Gaussian density with mean μ and variance σ² with respect to a random variable ε. Let the test input density p_te(x) be the Gaussian density with mean 0.2 and standard deviation 0.4:

  p_te(x) = N(x; 0.2, (0.4)²).    (8.20)

p_te(x) is assumed to be known in this illustrative simulation. See the bottom graph of figure 8.5 for the profile of p_te(x). Let us use the following second-order polynomial model:

  f̂(x; θ) = θ_1 + θ_2 x + θ_3 x².

Note that for this model and the test input density p_te(x) defined by equation 8.20, the residual function r(x) in equation 8.19 is orthogonal to the model and normalized to 1 (see section 3.2.2).
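The stated orthogonality and unit normalization of r(x) can be verified numerically; the sketch below uses dense-grid Riemann-sum integration (the grid range and tolerances are assumptions of the sketch):

```python
# Check numerically that r(x) = (z^3 - 3z)/sqrt(6), z = (x - 0.2)/0.4, is
# orthogonal to the basis 1, x, x^2 and has unit norm under
# p_te(x) = N(x; 0.2, 0.4^2), via Riemann-sum integration on a dense grid.
import numpy as np

x = np.linspace(-2.3, 2.7, 200_001)          # about +-6 sigma around 0.2
dx = x[1] - x[0]
p_te = np.exp(-(x - 0.2) ** 2 / (2 * 0.4 ** 2)) / (0.4 * np.sqrt(2 * np.pi))
z = (x - 0.2) / 0.4
r = (z**3 - 3 * z) / np.sqrt(6)

norm = np.sum(r**2 * p_te) * dx              # E_te[r^2], should be 1
inners = [np.sum(x**k * r * p_te) * dx for k in range(3)]  # vs 1, x, x^2
```

The factor √6 is exactly the norm of the third-order Hermite polynomial z³ − 3z under a standard Gaussian, which is why E_te[r²] comes out as 1.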
[Figure 8.5: Learning target function f(x) for δ = 0, 0.005, 0.05 (top), and input density functions: p_te(x), p_tr(x) for various c, and p_tr^*(x) (bottom).]

Let us consider the following three setups for the experiments:

  δ = 0, 0.005, 0.05,    (8.21)

which roughly correspond to the correctly specified, approximately correct, and misspecified cases, respectively. See the top graph of figure 8.5 for the profiles of f(x) with different δ.

Training input densities are chosen from the Gaussian densities with mean 0.2 and standard deviation 0.4c:

  p_tr(x) = N(x; 0.2, (0.4c)²),  where  c = 0.8, 0.9, 1.0, ..., 2.5.

See the bottom graph of figure 8.5 for the profiles of p_tr(x) with different c. In this experiment, we compare the performance of the following methods:

• ALICE: c is determined by the ALICE criterion (equation 8.15). IWLS (equation 8.9) is used for parameter learning.
• Input-independent variance-only method (W): c is determined by equation 8.13. IWLS (equation 8.9) is used for parameter learning. We denote this method by W since it uses importance-weighted least squares.

• Input-independent variance-only method* (W*): The closed-form solution p_tr^*(x) given by equation 8.14 is used as the training input density. The profile of p_tr^*(x) under the current setting is illustrated in the bottom graph of figure 8.5, showing that p_tr^*(x) is similar to the Gaussian density with c = 1.3. IWLS (equation 8.9) is used for parameter learning.

• Input-independent bias-and-variance method (OW): First, ñ_tr training input points are created following the test input density p_te(x), and the corresponding training output values are observed. Based on these ñ_tr training examples, c is determined by equation 8.17. Then the n_tr − ñ_tr remaining training input points are created following the determined input density. The combination of OLS and IWLS (see equation 8.18) is used for estimating the parameters (the abbreviation OW comes from this). We set ñ_tr = 25, which we experimentally confirmed to be a reasonable choice in this illustrative simulation.

• Traditional method (O): c is determined by equation 8.8. OLS (equation 8.7) is used for parameter learning (the abbreviation O comes from this).

• Passive method (P): Training input points {x_i^tr}_{i=1}^{n_tr} are created following the test input density p_te(x). OLS (equation 8.7) is used for parameter learning.

For W*, we generate random samples following p_tr^*(x) by rejection sampling (see, e.g., [95]). We repeat the simulation 1000 times for each δ in equation 8.21, changing the random seed each time.

8.3.2 Accuracy of Generalization Error Estimation

First, we evaluate the accuracy of each active learning criterion as an estimator of the generalization error Gen'' (see equation 8.2). Note that ALICE and W are estimators of the generalization error obtained by IWLS (which we denote by Gen''_W).
OW is also derived as an estimator of Gen''_W, but the final solution is computed by the combination of OLS and IWLS given by equation 8.18. Therefore, OW should be regarded as an estimator of the generalization error obtained by this learning method (which we denote by Gen''_OW). O is an estimator of the generalization error obtained by OLS, which we denote by Gen''_O.

In figure 8.6, the means and standard deviations of the true generalization errors Gen'' and their estimates over 1000 runs are depicted as functions of c by the solid curves. The generalization error estimated by each method is denoted by Ĝen in the figure. Here the upper and lower error bars are calculated separately, since the distribution is not symmetric. The dashed curves show the means of
[Figure 8.6: The means and (asymmetric) standard deviations of the true generalization error Gen'' and its estimates over 1000 runs as functions of c, for δ = 0 ("correctly specified"), δ = 0.005 ("approximately correct"), and δ = 0.05 ("misspecified"). The generalization error estimated by each method is denoted by Ĝen. Ĝen_ALICE, Ĝen_W, and Ĝen_O are multiplied by σ² = (0.3)² so that comparison with Gen''_W and Gen''_O is clear. The dashed curves show the means of the generalization error that the corresponding active learning criteria are trying to estimate.]
the generalization error that the corresponding active learning criteria are trying to estimate. Note that the criterion values of ALICE, W, and O are multiplied by σ² = (0.3)² so that comparison with Gen''_W and Gen''_O is clear.

These graphs show that when δ = 0 (correctly specified), ALICE and W give accurate estimates of Gen''_W. Note that the criterion value of W does not depend on the training input points {x_i^tr}_{i=1}^{n_tr}; thus, it does not fluctuate over the 1000 runs. OW is slightly biased in the negative direction for small c. We conjecture that this is caused by a small sample effect. However, the profile of OW still roughly approximates that of Gen''_OW. O gives accurate predictions of Gen''_O.

When δ = 0.005 (approximately correct), ALICE, W, and OW work similarly to the case with δ = 0; that is, ALICE and W are accurate and OW is negatively biased. On the other hand, O behaves differently: it tends to be biased in the negative direction for large c. Finally, when δ = 0.05 (misspecified), ALICE and W still give accurate estimates, although they have a slight negative bias for small c. OW still roughly approximates Gen''_OW, while O gives a totally different profile from Gen''_O.

These results show that, as estimators of the generalization error, ALICE and W are accurate and robust against misspecification of the model. OW is also reasonably accurate, although it tends to be rather inaccurate for small c. O is accurate in the correctly specified case, but it becomes totally inaccurate once the correct model assumption is violated.

Note that, by definition, ALICE, W, and O do not depend on the learning target function. Therefore, in the simulation they give the same values for all δ (ALICE and O depend on the realization of {x_i^tr}_{i=1}^{n_tr}, so they may have small fluctuations). On the other hand, the true generalization error of course depends on the learning target function even if the model error δ² is subtracted, since the training output values depend on it.
Note that the bias depends on δ, but the variance does not. The simulation results show that the profile of Gen''_O changes heavily as the degree of model misspecification increases. This would be caused by the increase of the bias, since OLS is not unbiased even asymptotically. On the other hand, the criterion O stays the same even when δ increases. As a result, O becomes a very poor generalization error estimator for large δ. In contrast, the profile of Gen''_W appears to be very stable against changes in δ, which is in good agreement with the theoretical fact that IWLS is asymptotically unbiased. Thanks to this property, ALICE and W are more accurate than O for misspecified models.

8.3.3 Obtained Generalization Error

In table 8.1, the mean and standard deviation of the generalization error obtained by each method are described. In each row of the table, the best
Table 8.1 The mean and standard deviation of the generalization error obtained by the population-based active learning methods

              ALICE          W             W*            OW            O              P
  δ = 0       2.08 ± 1.95    2.40 ± 2.15   2.32 ± 2.02   3.09 ± 3.03   °1.31 ± 1.70   3.11 ± 2.78
  δ = 0.005   °2.10 ± 1.96   2.43 ± 2.15   2.35 ± 2.02   3.13 ± 3.00   2.53 ± 2.23    3.14 ± 2.78
  δ = 0.05    °2.11 ± 2.12   2.39 ± 2.26   2.34 ± 2.14   3.45 ± 3.58   121 ± 67.4     3.51 ± 3.43

For better comparison, the model error δ² is subtracted from the generalization error, and all values are multiplied by 10³. In each row of the table, the best method and those comparable to it by the t-test at the significance level 5 percent are indicated by °. The value of O for δ = 0.05 is extremely large, but this is not a typo.

method and those comparable to it by the t-test (e.g., [78]) at the significance level 5 percent are indicated by °. For better comparison, the model error δ² is subtracted from the generalization error, and all values in the table are multiplied by 10³. The value of O for δ = 0.05 is extremely large, but this is not a typo.

When δ = 0, O works significantly better than the other methods. Actually, in this case, training input densities that approximately minimize Gen''_W, Gen''_OW, and Gen''_O were successfully found by ALICE, W, OW, and O. This implies that the difference in the error is caused not by the quality of the active learning criteria, but by the difference between IWLS and OLS: IWLS generally has larger variance than OLS [145]. Thus, OLS is more accurate than IWLS, since both are unbiased when δ = 0. Although ALICE, W, W*, and OW are outperformed by O, they still work better than P. Note that ALICE is significantly better than W, W*, OW, and P by the t-test.

When δ = 0.005, ALICE gives significantly smaller errors than the other methods. All the methods except O work similarly to the case with δ = 0, while O tends to perform poorly.
This result is surprising, since the learning target functions with δ = 0 and δ = 0.005 are visually almost the same, as illustrated in the top graph of figure 8.5. Therefore, it intuitively seems that the result for δ = 0.005 should not differ much from the result for δ = 0. However, this slight difference appears to make O unreliable.

When δ = 0.05, ALICE again works significantly better than the other methods. W and W* still work reasonably well. The performance of OW is slightly degraded, although it is still better than P. O gives extremely large errors.

The above results are summarized as follows. For all three cases (δ = 0, 0.005, 0.05), ALICE, W, W*, and OW work reasonably well and consistently outperform P. Among them, ALICE tends to be better than W, W*, and OW for all three cases. O works excellently when the model is correctly
specified, but tends to perform very poorly once the correct model assumption is violated.

8.4 Pool-Based Active Learning Methods

Population-based active learning methods are applicable only when the test input density p_te(x) is known (see section 8.1.1). On the other hand, pool-based active learning considers the situation where the test input distribution is unknown, but samples from that distribution are given. Let us denote the pooled test input samples by {x_j^te}_{j=1}^{n_te}.

The goal of pool-based active learning is to choose, from the pool of test input points {x_j^te}_{j=1}^{n_te}, the best input-point subset {x_i^tr}_{i=1}^{n_tr} (with n_tr ≪ n_te) for gathering output values {y_i^tr}_{i=1}^{n_tr}, so that the generalization error is minimized. If we have infinitely many test input samples, the pool-based problem reduces to the population-based problem. Due to the finite sample size, the pool-based active learning problem is generally harder to solve than the population-based problem. Note that the sample selection process of pool-based active learning (i.e., choosing a subset of test input points as training input points) resembles that of the sample selection bias model described in chapter 6.

8.4.1 Classical Active Learning Method for Correct Models and Its Limitations

The traditional active learning method explained in section 8.2.1 can be extended to the pool-based scenario in a straightforward way. Let us consider the same setup as in section 8.2.1, except that p_te(x) is unknown but its i.i.d. samples {x_j^te}_{j=1}^{n_te} are given.

The active learning criterion (equation 8.8) contains only the design matrix X^tr and the matrix U. X^tr is still accessible even in the pool-based scenario (see equation 8.6), but U is not, since it includes the expectation over the unknown test input distribution (see equation 8.5).
To cope with this problem, we simply replace U with an empirical estimate Û; that is, Û is the b × b matrix with the (ℓ, ℓ′)-th element

  Û_{ℓ,ℓ′} := (1/n_te) Σ_{j=1}^{n_te} φ_ℓ(x_j^te) φ_{ℓ′}(x_j^te).    (8.22)

Then we immediately have a pool-based active learning criterion:

  min_{ {x_i^tr}_{i=1}^{n_tr} ⊂ {x_j^te}_{j=1}^{n_te} } [ tr( Û (X^tr⊤ X^tr)^{-1} ) ].    (8.23)
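One simple way to search criterion 8.23 is greedy forward selection from the pool; this is an illustrative heuristic, since the joint optimization over all n_tr points is generally intractable, as noted among the drawbacks below:

```python
# Greedy sketch of the classical pool-based criterion (equation 8.23):
# pick training points from the pool one at a time so that
# tr(U_hat (X^T X)^{-1}) is minimized after each addition.  Greedy search
# is an illustrative heuristic, not the book's prescribed procedure.
import numpy as np

rng = np.random.default_rng(3)
phi = lambda x: np.stack([np.ones_like(x), x, x**2], axis=-1)

pool = rng.normal(0.2, 0.4, 300)             # pooled test inputs {x_j^te}
Phi = phi(pool)
U_hat = Phi.T @ Phi / len(pool)              # empirical U (equation 8.22)

b = Phi.shape[1]
chosen = list(range(b))                      # seed with b points so X^T X is invertible
for _ in range(12 - b):                      # select n_tr = 12 points in total
    best_j, best_val = None, np.inf
    for j in range(len(pool)):
        if j in chosen:
            continue
        X = Phi[chosen + [j]]
        val = np.trace(U_hat @ np.linalg.inv(X.T @ X))
        if val < best_val:
            best_j, best_val = j, val
    chosen.append(best_j)
```

Since X^⊤X grows in the Loewner order as points are added, the criterion decreases monotonically with every addition; the greedy step simply picks the point that decreases it most.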
Unfortunately, this method has the same drawbacks as the population-based method:

• It requires the unrealistic assumption that the model at hand is correctly specified.

• The n_tr input points need to be optimized simultaneously, which may not be computationally tractable.

In order to overcome these drawbacks, we again employ importance-weighted least squares (IWLS; see section 2.2.1). However, in pool-based active learning, we cannot create training input points {x_i^tr}_{i=1}^{n_tr} at arbitrary locations; we are allowed to choose the training input points only from {x_j^te}_{j=1}^{n_te}. In order to meet this requirement, we restrict the search space of the training input density p_tr(x) properly. More specifically, we consider a resampling weight function ξ(x_j^te), which is a discrete probability function over the pooled points {x_j^te}_{j=1}^{n_te}. This means that the training input probability p_tr(x_j^te) is defined over the test input probability p_te(x_j^te) as

  p_tr(x_j^te) ∝ ξ(x_j^te) p_te(x_j^te),    (8.24)

where ξ(x_j^te) > 0 for j = 1, 2, ..., n_te, and

  Σ_{j=1}^{n_te} ξ(x_j^te) = 1.

8.4.2 Input-Independent Variance-Only Method

The population-based input-independent variance-only active learning criterion (see section 8.2.3) can be extended to pool-based scenarios as follows. The active learning criterion (equation 8.13) contains only the matrices U and T. The matrix U may be replaced by its empirical estimator Û (see equation 8.22), but approximating T is not straightforward, since its empirical approximation T̂ still contains the importance values at the training points (see equation 8.12):

  T̂_{ℓ,ℓ′} := (1/n_te) Σ_{j=1}^{n_te} φ_ℓ(x_j^te) φ_{ℓ′}(x_j^te) w(x_j^te),

where

  w(x_j^te) = p_te(x_j^te) / p_tr(x_j^te).
Thus, T̂ still cannot be directly computed, since p_te(x_j^te) is inaccessible in the pool-based scenarios. A possible solution to this problem is to use an importance estimation method described in chapter 4, but this produces some estimation error. Here, we can utilize the fact that the training input probability p_tr(x_j^te) is defined over the test input probability p_te(x_j^te) through the resampling weight function ξ(x_j^te) (see equation 8.24). Indeed, equation 8.24 immediately shows that the importance weight can be expressed as follows [87]:

  w(x_j^te) ∝ 1 / ξ(x_j^te).  (8.25)

Note that the proportionality constant is not needed in active learning, since we are not interested in estimating the generalization error itself, but only in finding the minimizer of the generalization error. Thus, the right-hand side of equation 8.25 is sufficient for active learning purposes, and it does not produce any estimation error. The pool-based input-independent variance-only active learning criterion is then

  ξ* = argmin_ξ [tr(Û T̂')],  (8.26)

where T̂' is the b x b matrix with the (ℓ, ℓ')-th element

  T̂'_{ℓ,ℓ'} = (1/n_te) Σ_{j=1}^{n_te} φ_ℓ(x_j^te) φ_{ℓ'}(x_j^te) / ξ(x_j^te).

Note also that equation 8.25 is sufficient for computing the IWLS solution (see section 2.2.1):

  θ̂_IWLS = argmin_θ [ Σ_{i=1}^{n_tr} (1/ξ(x_i^tr)) (f̂(x_i^tr; θ) - y_i^tr)² ].  (8.27)

A notable feature of equation 8.26 is that the optimal resampling weight function ξ*(x) can be obtained in a closed form [163]:

  ξ*(x_j^te) ∝ ( Σ_{ℓ,ℓ'=1}^{b} [Û^{-1}]_{ℓ,ℓ'} φ_ℓ(x_j^te) φ_{ℓ'}(x_j^te) )^{1/2}.

8.4.3 Input-Dependent Variance-Only Method

The population-based input-dependent variance-only active learning method (see section 8.2.4) can be extended to the pool-based scenarios in a similar way.
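Putting the pieces together, a minimal NumPy sketch of the input-independent variance-only procedure might look as follows: empirical Û (equation 8.22), closed-form weights ξ*, resampling from the pool (equation 8.24), and IWLS with weights proportional to 1/ξ* (equations 8.25 and 8.27). The quadratic basis, the Gaussian pool, and the noiseless outputs are illustrative assumptions:

```python
import numpy as np

def pw_select(x_pool, basis, n_tr, rng):
    """Pool-based input-independent variance-only selection (a sketch of
    equations 8.24-8.27); returns chosen indices and the IWLS matrix L."""
    Phi = basis(x_pool)                              # (n_te, b) design over the pool
    U_hat = Phi.T @ Phi / len(x_pool)                # empirical U (equation 8.22)
    # Quadratic form phi(x_j)^T U_hat^{-1} phi(x_j) for each pooled point j
    q = np.einsum('jl,lm,jm->j', Phi, np.linalg.inv(U_hat), Phi)
    xi = np.sqrt(q)                                  # closed-form xi* (up to a constant)
    xi /= xi.sum()                                   # discrete probability over the pool
    idx = rng.choice(len(x_pool), size=n_tr, replace=False, p=xi)
    X = Phi[idx]
    W = np.diag(1.0 / xi[idx])                       # importance weights ∝ 1/xi (equation 8.25)
    L = np.linalg.solve(X.T @ W @ X, X.T @ W)        # IWLS matrix (equation 8.27)
    return idx, L

# Hypothetical demo with a quadratic polynomial basis
rng = np.random.default_rng(0)
basis = lambda x: np.vander(x, N=3, increasing=True)
x_pool = rng.normal(0.2, 0.4, size=1000)
idx, L = pw_select(x_pool, basis, n_tr=100, rng=rng)
theta_hat = L @ (1 - x_pool[idx] + x_pool[idx]**2)   # noiseless outputs for illustration
```

Because the illustrative outputs lie exactly in the model span, the IWLS estimate recovers the true coefficients; with noisy outputs it would do so only approximately.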
We can obtain a pool-based version of the active learning criterion ALICE, given by equation 8.15, as follows. The metric matrix U is replaced with its empirical approximator Û, and the importance value p_te(x_i^tr)/p_tr(x_i^tr) (included in L_IWLS) is computed by equation 8.25. Then we have

  min_ξ [tr(Û L_IWLS L_IWLS^T)].  (8.28)

The criterion (equation 8.28) is called PALICE (Pool-based ALICE). Thanks to the input dependence, PALICE is expected to be a more accurate generalization error estimator than the input-independent counterpart (equation 8.26) (see section 8.2.4).

However, no analytic solution is known for the PALICE criterion, and therefore an exhaustive search strategy is needed to obtain a good solution. Here, we can use the same idea as in the population-based case; that is, the analytic optimal solution ξ* of the input-independent variance-only active learning criterion (equation 8.26) is used as a baseline, and one searches for a better solution in the vicinity of ξ*. In practice, we may adopt a simple one-dimensional search strategy, finding the best resampling weight function within a family indexed by a scalar parameter v:

  ξ_v(x_j^te) ∝ ( Σ_{ℓ,ℓ'=1}^{b} [Û^{-1}]_{ℓ,ℓ'} φ_ℓ(x_j^te) φ_{ℓ'}(x_j^te) )^{v/2}.  (8.29)

The parameter v controls the "shape" of the training input distribution: when v = 0, the resampling weight is uniform over all test input samples. Thus, the above family includes passive learning (the training and test distributions are equivalent) as a special case, while v = 1 recovers ξ*. In practice, the search may be carried out intensively around v = 1. A pseudo code of the PALICE algorithm is summarized in figure 8.7.

8.4.4 Input-Independent Bias-and-Variance Approach

Finally, we show how the population-based input-independent bias-and-variance active learning method (see section 8.2.5) can be extended to the pool-based scenarios. As explained in section 8.2.3, if O(n_tr^{-3/2}) terms are ignored, the generalization error in the input-independent analysis is expressed as follows:
Input: a pool of test input points {x_j^te}_{j=1}^{n_te} and basis functions {φ_ℓ(x)}_{ℓ=1}^{b}
Output: learned parameter θ̂

Compute the b x b matrix Û with Û_{ℓ,ℓ'} = (1/n_te) Σ_{j=1}^{n_te} φ_ℓ(x_j^te) φ_{ℓ'}(x_j^te);
For several different values of v (possibly around v = 1)
  Compute {ξ_v(x_j^te)}_{j=1}^{n_te} with ξ_v(x) ∝ (Σ_{ℓ,ℓ'=1}^{b} [Û^{-1}]_{ℓ,ℓ'} φ_ℓ(x) φ_{ℓ'}(x))^{v/2};
  Choose {x_i^tr}_{i=1}^{n_tr} from {x_j^te}_{j=1}^{n_te} with probability proportional to {ξ_v(x_j^te)}_{j=1}^{n_te};
  Compute the n_tr x b matrix X_v^tr with [X_v^tr]_{i,ℓ} = φ_ℓ(x_i^tr);
  Compute the n_tr x n_tr diagonal matrix W_v with [W_v]_{i,i} = (ξ_v(x_i^tr))^{-1};
  Compute L_v = (X_v^tr^T W_v X_v^tr)^{-1} X_v^tr^T W_v;
  Compute PALICE(v) = tr(Û L_v L_v^T);
End
Compute v̂ = argmin_v [PALICE(v)];
Gather training output values y^tr = (y_1^tr, y_2^tr, ..., y_{n_tr}^tr)^T at the points chosen for v̂;
Compute θ̂ = L_v̂ y^tr;

Figure 8.7 Pseudo code of the PALICE algorithm.

where H is defined by equation 8.11. It has been shown [89] that the optimal training input density which minimizes the above asymptotic generalization error is proportional to

  p_te(x) ( Σ_{ℓ,ℓ'=1}^{b} [U^{-1}]_{ℓ,ℓ'} φ_ℓ(x) φ_{ℓ'}(x) (δ² r²(x) + σ²) )^{1/2},

where δ r(x) is the residual function which cannot be approximated by the model at hand (see section 8.1.2 and also figure 3.2 in section 3.2.2). Then the optimal resampling weight function can be obtained as

  ξ(x) ∝ ( Σ_{ℓ,ℓ'=1}^{b} [U^{-1}]_{ℓ,ℓ'} φ_ℓ(x) φ_{ℓ'}(x) (δ² r²(x) + σ²) )^{1/2}.

However, since (δ² r²(x) + σ²) is inaccessible, the above closed-form solution cannot be used directly for active learning. To cope with this problem, a regression method can be used in a two-stage sampling framework [87]. In the first stage, ñ_tr (≤ n_tr) training input points {x̃_i^tr}_{i=1}^{ñ_tr} are uniformly chosen from the pool {x_j^te}_{j=1}^{n_te}, and corresponding training output values {ỹ_i^tr}_{i=1}^{ñ_tr} are observed. It can be shown that a consistent estimate ĝ_i of the value of the optimal resampling weight function at x̃_i^tr (i = 1, 2, ..., ñ_tr) can be obtained from these first-stage samples.
Based on the input-output pairs {(x̃_i^tr, ĝ_i)}_{i=1}^{ñ_tr}, the optimal resampling weight function can be learned by a regression method. Let us denote the learned resampling weight function by ξ̂(x). Since the value of ξ̂(x) is available at any input location x,

  {ξ̂(x_j^te)}_{j=1}^{n_te}  (8.30)

can be computed and used for resampling from the pooled input points. Then, in the second stage, the remaining (n_tr - ñ_tr) training input points {x_i^tr}_{i=1}^{n_tr - ñ_tr} are chosen following the learned resampling weight function, and the corresponding training output values {y_i^tr}_{i=1}^{n_tr - ñ_tr} are gathered. Finally, the parameter θ is learned using {(x̃_i^tr, ỹ_i^tr)}_{i=1}^{ñ_tr} and {(x_i^tr, y_i^tr)}_{i=1}^{n_tr - ñ_tr} as

  θ̂ = argmin_θ [ Σ_{i=1}^{ñ_tr} (f̂(x̃_i^tr; θ) - ỹ_i^tr)² + Σ_{i=1}^{n_tr - ñ_tr} (f̂(x_i^tr; θ) - y_i^tr)² / ξ̂(x_i^tr) ].  (8.31)

However, this method suffers from the limitations of the two-stage approach pointed out in section 8.2.5. Furthermore, obtaining a good approximation ξ̂(x) by regression is generally difficult, so this method may not be so reliable in practice.

8.5 Numerical Examples of Pool-Based Active Learning Methods

In this section, we illustrate the behavior of the pool-based active learning methods described in section 8.4. Let the input dimension be d = 1, and let the learning target function be

  f(x) = 1 - x + x² + δ r(x),

where

  r(x) = (z³ - 3z)/√6 with z = (x - 0.2)/0.4.  (8.32)

The reason for this choice of r(x) will be explained later. Let us consider the following three cases: δ = 0, 0.03, 0.06.
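The target function of equation 8.32 is easy to reproduce; the following NumPy sketch generates the pool and noisy outputs under the setting used in this section (Gaussian test inputs with mean 0.2 and standard deviation 0.4, noise standard deviation 0.3):

```python
import numpy as np

def r(x):
    # Residual function of equation 8.32
    z = (x - 0.2) / 0.4
    return (z**3 - 3*z) / np.sqrt(6)

def f_target(x, delta):
    # Learning target f(x) = 1 - x + x^2 + delta * r(x)
    return 1 - x + x**2 + delta * r(x)

rng = np.random.default_rng(0)
x_pool = rng.normal(0.2, 0.4, size=1000)       # n_te = 1000 pooled test inputs
delta = 0.03                                   # one of the three cases
y_noisy = f_target(x_pool, delta) + rng.normal(0.0, 0.3, size=x_pool.size)
```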
Figure 8.8 Learning target function f(x) for δ = 0, 0.03, 0.06 (top) and input density functions (bottom).

See the top graph of figure 8.8 for the profiles of f(x) with different δ. Let the number of training samples to gather be n_tr = 100. We add i.i.d. Gaussian noise with mean zero and standard deviation σ = 0.3 to the output values. Let the test input density p_te(x) be the Gaussian density with mean 0.2 and standard deviation 0.4; however, p_te(x) is treated as unknown here. See the bottom graph of figure 8.8 for the profile of p_te(x). Let us draw n_te = 1000 test input points independently from the test input distribution. A polynomial model of order 2 is used for learning:

  f̂(x; θ) = θ₁ + θ₂ x + θ₃ x².  (8.33)

Note that for these basis functions, the residual function r(x) in equation 8.32 is orthogonal to the model and normalized to 1 (see section 3.2.2). In this experiment, we compare the performance of the following sampling strategies:

• PALICE: Training input points are chosen following equation 8.29 for

  v ∈ {0, 0.2, 0.4, ..., 2} ∪ {0.8, 0.82, 0.84, ..., 1.2}.  (8.34)
Then the best value of v is chosen from the above candidates based on equation 8.28. IWLS (equation 8.27) is used for parameter learning.

• Pool-based input-independent variance-only method (PW): Training input points are chosen following equation 8.26 (or equivalently, equation 8.29 with v = 1). IWLS (equation 8.27) is used for parameter learning.

• Pool-based traditional method (PO): Training input points are chosen following equation 8.29 for the candidates in equation 8.34, and the best value of v is chosen based on equation 8.23. OLS (equation 8.7) is used for parameter learning.

• Pool-based input-independent bias-and-variance method (POW): Initially, ñ_tr training input-output samples are gathered based on the test input distribution, and they are used for learning the resampling weight function. The resampling weight function is learned by kernel ridge regression with Gaussian kernels, where the Gaussian width and the ridge parameter are optimized by fivefold cross-validation with an exhaustive grid search. Then the remaining n_tr - ñ_tr training input points are chosen based on equation 8.30. The combination of OLS and IWLS (see equation 8.31) is used for parameter learning. We set ñ_tr = 25.

• Passive method (P): Training input points are drawn uniformly from the pool of test input samples (or equivalently, following equation 8.29 with v = 0). OLS is used for parameter learning.

For reference, the profile of p_tr*(x) (= p_te(x) ξ*(x)), the optimal training input density according to PW (see section 8.4.2), is also depicted in the bottom graph of figure 8.8.

In table 8.2, the generalization error obtained by each method is described. The numbers in the table are means and standard deviations over 100 trials.
For better comparison, the model error δ² is subtracted from the obtained error, and all values are multiplied by 10³. For each value of δ, the best method and the ones comparable to it by the Wilcoxon signed-rank test (e.g., [78]) at the significance level 5 percent are indicated by '°'. When δ = 0, PO works the best and is followed by PALICE. These two methods have no statistically significant difference and are significantly better than the other methods. When δ is increased from 0 to 0.03, the performances of PALICE and PW are almost unchanged, while the performance of PO is considerably degraded. Consequently, PALICE gives the best performance among all. When δ is further increased to 0.06, the performance of PALICE and PW is still almost unchanged. On the other hand, PO performs very poorly and is outperformed even by the baseline passive method. POW does not seem to work well in any of the three cases.
Table 8.2 The mean and standard deviation of the generalization error obtained by pool-based active learning methods

Method   δ = 0          δ = 0.03       δ = 0.06       Average
PALICE   °2.03 ± 1.81   °2.17 ± 2.04   °2.42 ± 2.65   °2.21 ± 2.19
PW        2.59 ± 1.83    2.81 ± 2.01    3.19 ± 2.59    2.86 ± 2.18
PO       °1.82 ± 1.69    2.62 ± 2.05    4.85 ± 3.37    3.10 ± 2.78
POW       6.43 ± 6.61    6.66 ± 6.54    7.65 ± 7.21    6.91 ± 6.79
P         3.10 ± 3.09    3.40 ± 3.55    4.12 ± 4.71    3.54 ± 3.85

For better comparison, the model error δ² is subtracted from the error, and all values are multiplied by 10³. In each column of the table, the best method and the ones comparable to it by the Wilcoxon signed-rank test at the significance level 5 percent are indicated by '°'.

Overall, PALICE and PW are shown to be highly robust against model misspecification, while PO is very sensitive to the violation of the correct model assumption. PALICE significantly outperforms PW because ALICE is a more accurate estimator of the single-trial generalization error than the input-independent criterion (see section 8.2.4).

8.6 Summary and Discussion

Active learning is an important issue, particularly when the cost of sampling output values is very expensive. However, estimating the generalization error before observing samples is hard. A standard strategy for generalization error estimation in active learning scenarios is to assume that the bias is small enough to be neglected, and to focus on the variance (section 8.1.3). We first addressed the population-based active learning problem, where the test input distribution is known. We began our discussion by pointing out that the traditional active learning method based on ordinary least squares is not practical, since it requires the model to be correctly specified (section 8.2.1).
To overcome this problem, active learning methods based on importance-weighted least squares have been developed (sections 8.2.3, 8.2.4, and 8.2.5); among them, the input-dependent variance-only method called ALICE was shown to perform excellently in experiments (section 8.3). We then turned our focus to the pool-based active learning scenarios, where the test input distribution is unknown but a pool of test input samples is given. We showed that all the population-based methods explained above can be extended to the pool-based scenarios (sections 8.4.2, 8.4.3, and 8.4.4). The simulations showed that the method called PALICE works very well (section 8.5).
Table 8.3 Active learning methods

                                       Correct         Approximately correct   Completely misspecified
Population-based  Input independent                    section 8.2.3           section 8.2.5
                  Input dependent      section 8.2.1   section 8.2.4
Pool-based        Input independent                    section 8.4.2           section 8.4.4
                  Input dependent      section 8.4.1   section 8.4.3

In ALICE and PALICE, we need to prepare reasonable candidates for the training input distribution. We introduced a practical heuristic of searching around the analytic solution obtained by other methods (sections 8.2.4 and 8.4.3), which was shown to be reasonable through experiments. However, there may still be room for further improvement, and it is important to explore alternative strategies for preparing better candidates. The active learning methods covered in this chapter are summarized in table 8.3.

We have focused on regression scenarios in this chapter. A natural desire is to extend the same idea to classification scenarios. The conceptual issues we have addressed in this chapter (the usefulness of the input-dependent analysis of the generalization error and the practical importance of dealing with approximately correct models) will still be valid in classification scenarios. Developing active learning methods for classification based on these conceptual ideas is a promising direction.

The ALICE and PALICE criteria are random variables which depend not only on the training input distributions but also on the realizations of the training input points. This is why the minimizer of ALICE or PALICE cannot be obtained analytically. On the other hand, this fact implies that ALICE and PALICE allow one to evaluate the goodness not only of training input distributions but also of realizations of training input points. It will be interesting to investigate this issue systematically. The ALICE and PALICE methods have been shown to be robust against the existence of bias.
However, if the input dimensionality is very high, the variance tends to dominate the bias due to the small sample size, and therefore the advantages of these methods tend to be lost. More critically, regression from data samples is highly unreliable in such high-dimensional problems due to extremely large variance. To address this issue, it will be important to first reduce the dimensionality of the data [154, 178, 156, 175], which is another
challenge in active learning research. Active learning for classification in high-dimensional problems is discussed in, for instance, [115, 139].

We have focused on linear models. However, the importance-weighting technique used to compensate for the bias caused by model misspecification is valid for any empirical error-based method (see chapter 2). Thus, another important direction to be pursued is extending the current active learning ideas to more complex models, such as support vector machines [193] and neural networks [20].
9 Active Learning with Model Selection

In chapters 3 and 8, we addressed the problems of model selection¹ and active learning. When discussing model selection strategies, we assumed that the training input points had been fixed. On the other hand, when discussing active learning strategies, we assumed that the model had been fixed. Although the problems of active learning and model selection share the common goal of minimizing the generalization error, they have been studied as two independent problems so far. If active learning and model selection are performed at the same time, the generalization performance can be further improved. We call the problem of simultaneously optimizing the training input distribution and the model active learning with model selection. This is the problem we address in this chapter.

Below, we focus on the model selection criterion IWSIC (equation 3.11) and the population-based active learning criterion ALICE (equation 8.15). However, the fundamental idea explained in this chapter is applicable to any model selection and active learning criteria.

9.1 Direct Approach and the Active Learning/Model Selection Dilemma

A naive and direct solution to the problem of active learning with model selection is to simultaneously optimize the training input distribution and the model. However, this direct approach may not be possible simply by combining existing active learning methods and model selection methods in a batch manner, due to the active learning/model selection dilemma [166]: when selecting the training input density p_tr(x) with existing active learning methods, the model must have been fixed [48, 107, 33, 55, 199, 89, 153]. On the other hand,

1. "Model selection" refers to the selection of various tunable factors M, including basis functions, the regularization parameter, and the flattening parameter.
when choosing the model with existing model selection methods, the training input points {x_i^tr}_{i=1}^{n_tr} (or the training input density p_tr(x)) must have been fixed, and the corresponding training output values {y_i^tr}_{i=1}^{n_tr} must have been gathered [2, 135, 142, 35, 145, 162]. For example, the active learning criterion (equation 8.15) cannot be computed without fixing the model M, and the model selection criterion (equation 3.11) cannot be computed without fixing the training input density p_tr(x).

If training input points that are optimal for all model candidates exist, it is possible to perform active learning and model selection at the same time without regard to the active learning/model selection dilemma: choose the training input points {x_i^tr}_{i=1}^{n_tr} for some model M by an active learning method (e.g., equation 8.15), gather the corresponding output values {y_i^tr}_{i=1}^{n_tr}, and choose a model by a model selection method (e.g., equation 3.11). It has been shown that such common optimal training input points exist for a class of correctly specified trigonometric polynomial regression models [166]. However, common optimal training input points may not exist in general, and thus the range of application of this approach is limited.

9.2 Sequential Approach

A standard approach to coping with the above active learning/model selection dilemma for arbitrary models is the sequential approach [106]. That is, a model is iteratively chosen by a model selection method, and the next input point (or a small batch of points) is optimized for the chosen model by an active learning method (see figure 9.1a).

In the sequential approach, the chosen model varies through the sequential learning process (see the dashed line in figure 9.1b). We refer to this phenomenon as the model drift.
The model drift phenomenon can be a weakness of the sequential approach: since the location of optimal training input points depends on the target model in active learning, a good training input point for one model can be poor for another model. Depending on the transition of the chosen models, the sequential approach can work very well. For example, when the transition of the model follows the solid line in figure 9.1b, most of the training input points are chosen for the finally selected model M_{n_tr}, and the sequential approach will have excellent performance. However, when the transition of the model follows the dotted line in figure 9.1b, the performance becomes poor, since most of the training input points are chosen for other models. Note that we cannot control the transition of the model as desired, since we do not know a priori which model will be chosen in the end. For this reason, the sequential approach can be unreliable in practice.
Figure 9.1 Sequential approach: (a) diagram; (b) transition of chosen models.

Another issue that needs to be taken into account in the sequential approach is that the training input points are not i.i.d. in general: the choice of the (i+1)-th training input point x_{i+1}^tr depends on the previously gathered samples {(x_{i'}^tr, y_{i'}^tr)}_{i'=1}^{i}. Since standard active learning and model selection methods require the i.i.d. assumption for establishing their statistical properties, such as consistency or unbiasedness, they may not be directly employed in the sequential approach [9]. The active learning criterion ALICE (equation 8.15) and the model selection criterion IWSIC (equation 3.11) also suffer from the violation of the i.i.d. condition, and they may lose their consistency and unbiasedness. However, this problem can be settled by slightly modifying the criteria, which is an advantage of ALICE and IWSIC: suppose we draw u input points from p_tr^{(i)}(x) in each iteration (let n_tr = uv, where v is the number of iterations). If u tends to infinity, simply redefining the diagonal matrix W as follows makes ALICE and IWSIC still consistent and asymptotically unbiased:

  W_{k,k} = p_te(x_k^tr) / p_tr^{(i)}(x_k^tr),  (9.1)

where k = (i-1)u + j, i = 1, 2, ..., v, and j = 1, 2, ..., u.
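A minimal sketch of the redefined weight matrix of equation 9.1, assuming the importance values p_te(x_k^tr)/p_tr^{(i)}(x_k^tr) have already been computed for the u points of each of the v iterations:

```python
import numpy as np

def sequential_weight_matrix(importance_per_iter):
    """Equation 9.1: W[k, k] = p_te(x_k^tr) / p_tr^(i)(x_k^tr), with k = (i-1)u + j.
    `importance_per_iter` is a list of v arrays, each of length u, holding the
    importance values of the u points drawn from p_tr^(i) at iteration i."""
    return np.diag(np.concatenate(importance_per_iter))

# Hypothetical example: v = 3 iterations, u = 2 points per iteration
W = sequential_weight_matrix([np.array([1.0, 0.5]),
                              np.array([2.0, 1.5]),
                              np.array([0.8, 1.2])])
```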
9.3 Batch Approach

An alternative approach to active learning with model selection is to choose all the training input points for an initially chosen model M_0. We refer to this approach as the batch approach (see figure 9.2a). Due to its nature, this approach does not suffer from the model drift (cf. figure 9.1b); the batch approach can be optimal in terms of active learning if the initially chosen model M_0 agrees with the finally chosen model M_{n_tr} (see the solid line in figure 9.2b).

In order to choose the initial model M_0, we need a generalization error estimator that can be computed before observing the training samples, for example, the generalization error estimator of equation 8.15. However, this does not work well, since equation 8.15 evaluates only the variance of the estimator (see equation 8.4); thus, using equation 8.15 for choosing the initial model M_0 always results in selecting the simplest model from among the candidates. Note that this problem is not specific to the generalization error estimator of equation 8.15, but is common to most generalization error estimators, since it is generally not possible to estimate the bias of the estimator (see equation 8.3) before observing the training samples. Therefore, in practice, one may have to choose the initial model M_0 randomly. If one has some prior preference over models, p(M), the initial model may be drawn according to it; otherwise, one has to choose the initial model M_0 from the uniform distribution. Due to the randomness of the initial model choice, the performance of the batch approach may be unreliable in practice (see the dashed line in figure 9.2b).

Figure 9.2 Batch approach: (a) diagram; (b) transition of chosen models.
9.4 Ensemble Active Learning

As pointed out above, the sequential and batch approaches have potential limitations. In this section, we describe a method of active learning with model selection that can cope with these limitations [167].

The weakness of the batch approach lies in the fact that the training input points chosen by an active learning method are overfitted to the initially chosen model; the training input points optimized for the initial model could be poor if a different model is chosen later. We may reduce the risk of overfitting by optimizing the training input distribution not for a single model, but for all model candidates (see figure 9.3). This allows all the models to contribute to the optimization of the training input distribution, and thus we can hedge the risk of overfitting to a single (possibly inferior) model. Since this idea can be viewed as applying the popular idea of ensemble learning to the problem of active learning, this approach is called ensemble active learning (EAL).

The idea of ensemble active learning can be realized by determining the training input density p_tr(x) so that the expected generalization error over all model candidates is minimized:

  min over p_tr of Σ_M ALICE_M p(M),

where ALICE_M denotes the value of the active learning criterion ALICE for a model M (see equation 8.15), and p(M) is the prior preference of the model M.

Figure 9.3 The ensemble approach: (a) diagram (all training input points {x_i}_{i=1}^{n} are chosen for the ensemble of all models, and then all output values {y_i}_{i=1}^{n} are gathered); (b) transition of chosen models.
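The ensemble criterion just described reduces to a p(M)-weighted average of ALICE values. The sketch below is a plain-Python illustration; the function `alice(model, density)` and the toy score table are hypothetical stand-ins for the actual criterion evaluation:

```python
def ensemble_active_learning(densities, models, prior, alice):
    """Pick the training input density minimizing sum_M p(M) * ALICE_M (section 9.4).
    `alice(model, density)` is assumed to return the ALICE value for one model."""
    def expected_error(density):
        return sum(prior[m] * alice(m, density) for m in models)
    return min(densities, key=expected_error)

# Toy check with a hypothetical ALICE table: density 'b' is best on average
table = {('M1', 'a'): 3.0, ('M1', 'b'): 2.0,
         ('M2', 'a'): 1.0, ('M2', 'b'): 1.5}
best = ensemble_active_learning(['a', 'b'], ['M1', 'M2'],
                                {'M1': 0.5, 'M2': 0.5},
                                lambda m, d: table[(m, d)])
# best -> 'b' (average 1.75 versus 2.0 for 'a')
```

With a uniform prior, this is exactly the "no prior information" case discussed next.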
If no prior information on the goodness/preference of the models is available, the uniform prior may be used. In the next section, we experimentally show that this ensemble approach significantly outperforms the sequential and batch approaches.

9.5 Numerical Examples

Here, we illustrate how the sequential (section 9.2), batch (section 9.3), and ensemble (section 9.4) methods behave using a one-dimensional data set.

9.5.1 Setting

Let the input dimension be d = 1, and let the target function f(x) be the following third-order polynomial (see the top graph of figure 9.4):

  f(x) = 1 - x + x² + 0.05 r(x),

where

  r(x) = (z³ - 3z)/√6 with z = (x - 0.2)/0.4.

Figure 9.4 Target function, training input densities p_tr(x), and test input density p_te(x).
Let the test input density p_te(x) be the Gaussian density with mean 0.2 and standard deviation 0.4, which is assumed to be known in this illustrative simulation. We choose the training input density p_tr(x) from a set of Gaussian densities with mean 0.2 and standard deviation 0.4c, where c = 0.8, 0.9, 1.0, ..., 2.5. These density functions are illustrated in the bottom graph of figure 9.4. We add i.i.d. Gaussian noise with mean zero and standard deviation 0.3 to the training output values. Let us use the second-order polynomial model

  f̂(x; θ) = θ₁ + θ₂ x + θ₃ x².

Note that the target function f(x), which is a third-order polynomial, is not realizable by the second-order model. The parameters are learned by adaptive importance-weighted least squares (AIWLS; see section 2.2.1):

  θ̂_γ = argmin_θ [ Σ_{i=1}^{n_tr} ( p_te(x_i^tr) / p_tr(x_i^tr) )^γ ( f̂(x_i^tr; θ) - y_i^tr )² ],

where γ is the flattening parameter (0 ≤ γ ≤ 1) for controlling the bias-variance trade-off. Here, we focus on choosing the flattening parameter γ by model selection; γ is selected from {0, 0.5, 1}.

9.5.2 Analysis of Batch Approach

First, we investigate the dependency between the goodness of the training input density (i.e., c) and the model (i.e., γ). For each γ and each c, we draw training input points {x_i^tr}_{i=1}^{100} and gather output values {y_i^tr}_{i=1}^{100}. Then we learn the parameter θ of the model by AIWLS and compute the generalization error. The mean generalization error over 500 trials as a function of c for each γ is depicted in figure 9.5a. This graph shows that the best training input density c can depend strongly on the model γ, implying that a training input density that is good for one model can be poor for others. For example, when the training input density is optimized for the model γ = 0, c = 1.1 would be an excellent choice. However, c = 1.1 is not so suitable for the models γ = 0.5, 1. This figure illustrates a possible weakness of the batch method: when an initially chosen model is significantly different from the finally chosen model, the training
input points optimized for the initial model could be less useful for the final model, and the performance is degraded.

Figure 9.5 Simulation results of active learning with model selection: (a) the mean generalization error over 500 trials as a function of the training input density parameter c for each γ (when n_tr = 100); (b) the frequency of the chosen γ over 500 trials as a function of the number of training samples.

9.5.3 Analysis of Sequential Approach

Next, we investigate the behavior of the sequential approach. In our implementation, ten training input points are chosen at each iteration. Figure 9.5b depicts the transition of the frequency of the chosen γ in the sequential learning process over 500 trials. It shows that the choice of models varies over the learning process: a smaller γ (which yields smaller variance, and thus lower complexity) is favored in the beginning, but a larger γ (which yields larger variance, and thus higher complexity) tends to be chosen as the number of training samples increases. Figure 9.5 illustrates a possible weakness of the sequential method: the target model drifts during the sequential learning process (from small γ to large γ), and the training input points designed in an early stage (for γ = 0) could be poor for the finally chosen model (γ = 1).

9.5.4 Comparison of Obtained Generalization Error

Finally, we investigate the generalization performance of each method when the number of training samples to be gathered is n_tr = 50, 100, 150, 200, 250.
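The AIWLS estimator with flattening parameter γ used in these experiments (section 9.5.1) can be sketched in a few lines of NumPy; the demo data and the uniform importance values are illustrative assumptions:

```python
import numpy as np

def aiwls(Phi, y, importance, gamma):
    """Adaptive importance-weighted least squares (section 2.2.1):
    theta_hat = argmin_theta sum_i w_i^gamma * (phi(x_i)^T theta - y_i)^2."""
    w = importance ** gamma                        # flattened importance weights
    return np.linalg.solve(Phi.T @ (w[:, None] * Phi), Phi.T @ (w * y))

# Hypothetical demo: quadratic basis, noisy cubic-ish target, uniform importance
rng = np.random.default_rng(0)
x = rng.normal(0.2, 0.4, size=200)
Phi = np.vander(x, N=3, increasing=True)           # second-order polynomial model
y = 1 - x + x**2 + rng.normal(0, 0.3, size=x.size)
theta = aiwls(Phi, y, importance=np.ones_like(x), gamma=0.5)
```

With uniform importance values, every γ reduces to OLS; in the experiment, the importance values p_te/p_tr are computed from the known Gaussian densities.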
Table 9.1 Means and standard deviations of the generalization error when the flattening parameter γ is chosen from {0, 0.5, 1} by model selection

n_tr   Passive        Sequential     Batch          Ensemble
50     10.63 ± 8.33    7.98 ± 4.57    8.04 ± 4.39   °7.59 ± 4.27
100     5.90 ± 3.42    5.66 ± 2.75    5.73 ± 3.01   °5.15 ± 2.49
150     4.80 ± 2.38    4.40 ± 1.74    4.61 ± 1.85   °4.13 ± 1.56
200     4.21 ± 1.66    3.97 ± 1.54    4.26 ± 1.63   °3.73 ± 1.25
250     3.79 ± 1.31    3.46 ± 1.00    3.88 ± 1.41   °3.35 ± 0.95

All values in the table are multiplied by 10³. In each row, the best method in terms of the mean generalization error and the methods comparable to it according to the Wilcoxon signed-rank test at the significance level 5 percent are marked by '°'.

Table 9.1 describes the means and standard deviations of the generalization error obtained by the sequential, batch, and ensemble methods; as a baseline, we included the result of passive learning, that is, the training input points {x_i^tr}_{i=1}^{n_tr} are drawn from the test input density p_te(x) (or equivalently, c = 1). The table shows that all three methods of active learning with model selection tend to outperform passive learning. However, the improvement of the sequential method is not so significant, as a result of the model drift phenomenon (see figure 9.5). The batch method also does not provide significant improvement, due to overfitting to the randomly chosen initial model (see figure 9.5a). On the other hand, the EAL method does not suffer from these problems and works significantly better than the other methods. The best method in terms of the mean generalization error and the methods comparable to it by the Wilcoxon signed-rank test at the significance level 5 percent [78] are marked by '°' in the table.

9.6 Summary and Discussion

Historically, the problems of active learning and model selection have been studied as two independent problems, although they share the common goal of minimizing the generalization error.
We suggested that by simultaneously performing active learning and model selection (which is called active learning with model selection), a better generalization capability can be achieved. We pointed out that the sequential approach, which would be a common approach to active learning with model selection, can perform poorly due to the model drift phenomenon (section 9.2). A batch approach does not suffer from the model drift problem, but it is hard to choose the initial model appropriately. For this reason, the batch approach is not reliable in practice (section 9.3). To
overcome the limitations of the sequential and batch approaches, we introduced an approach called ensemble active learning (EAL), which performs active learning not only for a single model, but also for an ensemble of models (section 9.4). The EAL method was shown to compare favorably with other approaches through simulations (section 9.5).

Although we focused on regression problems in this chapter, EAL is applicable to any supervised learning scenario, given that a suitable batch active learning method is available. This implies that, in principle, it is possible to extend the EAL method to classification problems. However, to the best of our knowledge, there is no reliable batch active learning method for classification tasks. Therefore, developing a method of active learning with model selection for classification is still a challenging open problem which needs to be further investigated.
10 Applications of Active Learning

In this chapter, we describe real-world applications of active learning techniques: sampling policy design in reinforcement learning (section 10.1) and wafer alignment in semiconductor exposure apparatus (section 10.2).

10.1 Design of Efficient Exploration Strategies in Reinforcement Learning

As shown in section 7.6, reinforcement learning [174] is a useful framework to let a robot agent learn optimal behavior in an unknown environment. The accuracy of estimated value functions depends on the training samples collected following the sampling policy π(a|s). In this section, we apply the population-based active learning method described in section 8.2.4 to designing good sampling policies [4]. The contents of this section are based on the framework of sample reuse policy iteration described in section 7.6.

10.1.1 Efficient Exploration with Active Learning

Let us consider a situation where collecting state-action trajectory samples is easy and cheap, but gathering immediate reward samples is hard and expensive. For example, let us consider a robot-arm control task of hitting a ball with a bat and driving the ball as far away as possible (see section 10.1). Let us adopt the carry of the ball as the immediate reward. In this setting, obtaining state-action trajectory samples of the robot arm is easy and relatively cheap, since we just need to control the robot arm and record its state-action trajectories over time. On the other hand, explicitly computing the carry of the ball from the state-action samples is hard due to friction and elasticity of the links, air resistance, unpredictable disturbances such as a current of air, and so on. Thus, in practice, we may need to put the robot in an open space, let it actually hit the ball, and measure the carry of the ball manually. Thus, gathering immediate reward samples is much more expensive than gathering the state-action trajectory samples.
The goal of active learning in the current setup is to determine the sampling policy so that the expected generalization error is minimized. The generalization error is not accessible in practice, since the expected reward function R(s,a) and the transition probability P_T(s'|s,a) are unknown. Thus, for performing active learning, the generalization error needs to be estimated from samples. A difficulty of estimating the generalization error in the context of active learning is that its estimation needs to be carried out only from state-action trajectory samples, without using immediate reward samples, because gathering immediate reward samples is hard and expensive. Below, we explain how the generalization error for active learning is estimated without reward samples.

10.1.2 Reinforcement Learning Revisited

Here, we briefly revisit essential ingredients of reinforcement learning. See section 7.6 for more details.

For state s, action a, reward r, and discount factor γ, the state-action value function Q^π(s,a) ∈ ℝ for policy π is defined as the expected discounted sum of rewards the agent will receive when taking action a in state s and following policy π thereafter. That is,

    Q^\pi(s,a) := \mathbb{E}_{\pi,P_T}\!\left[ \sum_{n=1}^{\infty} \gamma^{n-1} r_n \,\middle|\, s_1 = s,\ a_1 = a \right],

where \mathbb{E}_{\pi,P_T} denotes the expectation over {s_n, a_n}_{n=1}^∞ following the policy π(a_n|s_n) and the transition probability P_T(s_{n+1}|s_n, a_n).

We approximate the state-action value function Q^π(s,a) using the following linear model:

    \widehat{Q}^\pi(s,a;\theta) := \theta^\top \phi(s,a),

where φ(s,a) = (φ_1(s,a), φ_2(s,a), ..., φ_B(s,a))^⊤ are the fixed basis functions, B is the number of basis functions, and θ = (θ_1, θ_2, ..., θ_B)^⊤ are model parameters.
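As a numerical sketch of this linear architecture, the following code builds Gaussian state kernels gated by a discrete-action indicator and evaluates Q̂(s,a) = θ^⊤φ(s,a). The per-action block structure is one common construction, assumed here for illustration; the book's concrete basis choice appears in section 10.1.7:

```python
import numpy as np

def make_basis(centers, n_actions, tau):
    """phi(s, a): Gaussian kernels on the state, gated by the action.

    Returns a feature function producing vectors of length
    n_actions * len(centers); only the block of the chosen action is nonzero.
    """
    centers = np.asarray(centers, float)
    B = n_actions * len(centers)

    def phi(s, a):
        k = np.exp(-np.sum((centers - s) ** 2, axis=1) / (2 * tau ** 2))
        out = np.zeros(B)
        out[a * len(centers):(a + 1) * len(centers)] = k
        return out

    return phi

phi = make_basis(centers=[[0.0, 0.0], [1.0, 1.0]], n_actions=2, tau=1.0)
theta = np.ones(4)                        # model parameters
q = theta @ phi(np.array([0.0, 0.0]), 0)  # Q_hat(s, a) = theta^T phi(s, a)
print(round(q, 3))  # 1 + exp(-1) ~ 1.368
```

Learning then reduces to estimating the weight vector theta, which is what the importance-weighted regression below does.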
We want to learn θ so that the true value function Q^π(s,a) is well approximated. As explained in section 7.6, this can be achieved by regressing the immediate reward function R(s,a), using the following transformed basis function:

    \psi(s,a;\mathcal{D}) := \phi(s,a) - \frac{\gamma}{|\mathcal{D}(s,a)|} \sum_{s' \in \mathcal{D}(s,a)} \mathbb{E}_{\pi(a'|s')}\!\left[ \phi(s',a') \right],

where D is a set of training samples, D(s,a) is the set of quadruple elements containing state s and action a in the training data D, and \mathbb{E}_{\pi(a'|s')} denotes the conditional expectation of a' over π(a'|s'), given s'.

The generalization error of a parameter θ is measured by the following squared Bellman residual G:

    G(\theta) := \mathbb{E}_{P_I,\pi,P_T}\!\left[ \frac{1}{N} \sum_{n=1}^{N} \left( \theta^\top \psi(s_n,a_n;\mathcal{D}) - R(s_n,a_n) \right)^2 \right],

where \mathbb{E}_{P_I,\pi,P_T} denotes the expectation over {s_n, a_n}_{n=1}^N following the initial state probability density P_I(s_1), the policy π(a_n|s_n), and the transition probability density P_T(s_{n+1}|s_n, a_n).

Here, we consider the off-policy setup (see section 7.6.5), where the current policy π(a|s) is used for evaluating the generalization error, and a sampling policy π̃(a|s) is used for collecting data samples. We use the per-decision importance-weighting (PIW) method for parameter learning (see section 7.6.6.2):

    \widehat{\theta} := \arg\min_{\theta} \sum_{m=1}^{M} \sum_{n=1}^{N} w_{m,n} \left( \theta^\top \psi(s_{m,n},a_{m,n};\mathcal{D}) - r_{m,n} \right)^2,

where

    w_{m,n} := \prod_{n'=1}^{n} \frac{\pi(a_{m,n'}|s_{m,n'})}{\tilde{\pi}(a_{m,n'}|s_{m,n'})}

is the per-decision importance weight.
The goal of active learning here is to find the best sampling policy π̃(a|s) that minimizes the generalization error.

10.1.3 Decomposition of Generalization Error

The information we are allowed to use for estimating the generalization error is a set of roll-out samples without immediate rewards:

    \bar{\mathcal{D}}^{\tilde{\pi}} := \{ (s_{m,n}^{\tilde{\pi}}, a_{m,n}^{\tilde{\pi}}, s_{m,n+1}^{\tilde{\pi}}) \}_{m=1,\,n=1}^{M,\,N}.

Let us define the deviation of immediate rewards from the mean as

    \epsilon_{m,n}^{\tilde{\pi}} := r_{m,n}^{\tilde{\pi}} - R(s_{m,n}^{\tilde{\pi}}, a_{m,n}^{\tilde{\pi}}).

Note that ε_{m,n}^π̃ can be regarded as additive noise in the context of least-squares function fitting. By definition, ε_{m,n}^π̃ has mean zero, and its variance generally depends on s_{m,n}^π̃ and a_{m,n}^π̃ (i.e., heteroscedastic noise [21]). However, since estimating the variance of ε_{m,n}^π̃ without using reward samples is not generally possible, we ignore the dependence of the variance on s_{m,n}^π̃ and a_{m,n}^π̃. Let us denote the input-independent common variance by σ².

Now we would like to estimate the generalization error G(θ̂^π̃) from D̄^π̃, where θ̂^π̃ is the PIW estimator given by

    \widehat{\theta}^{\tilde{\pi}} = \widehat{L} r^{\tilde{\pi}}, \qquad \widehat{L} = (\widehat{X}^\top \widehat{W} \widehat{X})^{-1} \widehat{X}^\top \widehat{W},

where r^π̃ is the reward vector with r^π̃_{N(m-1)+n} = r^π̃_{m,n}, and Ŵ is the diagonal matrix with the (N(m-1)+n)-th diagonal element given by

    \widehat{W}_{N(m-1)+n,\,N(m-1)+n} = w_{m,n}^{\tilde{\pi}}.
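The weighted least-squares step θ̂ = (X̂^⊤ŴX̂)^{-1}X̂^⊤Ŵ r is straightforward to implement. A minimal sketch (generic NumPy, with hypothetical data standing in for the transformed basis vectors and per-decision weights):

```python
import numpy as np

def piw_least_squares(Psi, r, w):
    """theta_hat = L_hat r with L_hat = (Psi^T W Psi)^{-1} Psi^T W, W = diag(w).

    Psi: (N, B) matrix whose rows are the transformed basis vectors psi(s, a)
    r:   (N,) vector of immediate rewards
    w:   (N,) per-decision importance weights
    """
    W = np.diag(np.asarray(w, float))
    L_hat = np.linalg.solve(Psi.T @ W @ Psi, Psi.T @ W)  # "learning matrix"
    theta_hat = L_hat @ r
    return theta_hat, L_hat

# Sanity check: with noiseless rewards r = Psi theta*, any positive weights
# recover theta* exactly.
rng = np.random.default_rng(1)
Psi = rng.normal(size=(50, 3))
theta_star = np.array([2.0, -1.0, 0.5])
r = Psi @ theta_star
w = rng.uniform(0.5, 2.0, size=50)
theta_hat, L_hat = piw_least_squares(Psi, r, w)
print(np.allclose(theta_hat, theta_star))  # True
```

The learning matrix L_hat is the only ingredient of the estimator that does not involve rewards, which is exactly what the active learning criterion below exploits.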
Note that in the above calculation of θ̂^π̃, D̄^π̃ (a data set without immediate rewards) is used instead of D^π̃ (a data set with immediate rewards), since D^π̃ is not available in the active learning setup.

The expectation of the above generalization error over the "noise" can be decomposed as follows:

    \mathbb{E}_{\epsilon}\!\left[ G(\widehat{\theta}^{\tilde{\pi}}) \right] = \mathrm{Bias}^2 + \mathrm{Variance} + \delta^2,

where \mathbb{E}_{\epsilon} denotes the expectation over the "noise" {ε_{m,n}^π̃}_{m,n=1}^{M,N}, and Bias², Variance, and δ² are the bias term, the variance term, and the model error term, respectively. The bias term is defined by

    \mathrm{Bias}^2 := \mathbb{E}_{P_I,\pi,P_T}\!\left[ \frac{1}{N} \sum_{n=1}^{N} \left\{ \left( \mathbb{E}_{\epsilon}[\widehat{\theta}^{\tilde{\pi}}] - \theta^* \right)^{\!\top} \psi(s_n,a_n;\bar{\mathcal{D}}^{\tilde{\pi}}) \right\}^2 \right],

where θ* is the optimal parameter in the model, defined by equation 7.5 in section 7.6. Note that the variance term can be expressed in a compact form as

    \mathrm{Variance} = \sigma^2 \operatorname{tr}(U \widehat{L} \widehat{L}^\top),

where U is the B × B matrix with the (b, b')-th element

    U_{b,b'} = \mathbb{E}_{P_I,\pi,P_T}\!\left[ \frac{1}{N} \sum_{n=1}^{N} \psi_b(s_n,a_n;\bar{\mathcal{D}}^{\tilde{\pi}})\, \psi_{b'}(s_n,a_n;\bar{\mathcal{D}}^{\tilde{\pi}}) \right]. \qquad (10.1)

10.1.4 Estimating Generalization Error for Active Learning

The model error is constant, and thus can be safely ignored in generalization error estimation, since we are interested in finding a minimizer of the generalization error with respect to π̃. For this reason, we focus on the bias term and the variance term. However, the bias term includes the unknown optimal parameter θ*, and thus it may not be possible to estimate the bias term without using reward samples; similarly, it may not be possible to estimate the "noise" variance σ² included in the variance term without using reward samples.
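The variance term tr(U L̂ L̂^⊤) is the quantity the active learning criterion is built on, since L̂ requires no rewards. The following rough sketch (plain NumPy on hypothetical data, not the book's code) computes this trace for two candidate sample sets and confirms that, all else being equal, more roll-out samples yield a smaller predicted variance:

```python
import numpy as np

def variance_criterion(Psi_all, Psi_k, w_k):
    """tr(U_hat L_hat L_hat^T): U_hat is an empirical second moment of psi
    over all reward-free samples; L_hat = (X^T W X)^{-1} X^T W is built from
    candidate k's samples and importance weights."""
    U_hat = Psi_all.T @ Psi_all / len(Psi_all)
    W = np.diag(np.asarray(w_k, float))
    L_hat = np.linalg.solve(Psi_k.T @ W @ Psi_k, Psi_k.T @ W)
    return float(np.trace(U_hat @ L_hat @ L_hat.T))

rng = np.random.default_rng(2)
Psi_small = rng.normal(size=(30, 4))   # candidate with few roll-out samples
Psi_large = rng.normal(size=(120, 4))  # candidate with many roll-out samples
Psi_all = np.vstack([Psi_small, Psi_large])
j_small = variance_criterion(Psi_all, Psi_small, np.ones(30))
j_large = variance_criterion(Psi_all, Psi_large, np.ones(120))
print(j_large < j_small)  # True: more samples -> smaller variance term
```

Scoring each candidate sampling policy by this trace and taking the argmin is the essence of the sampling-policy design algorithm that follows.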
As explained in section 8.2.4, the bias term is small enough to be neglected when the model is approximately correct, that is, when θ*^⊤ψ(s,a) approximately agrees with the true function R(s,a). Then we have

    \mathbb{E}_{\epsilon}\!\left[ G(\widehat{\theta}^{\tilde{\pi}}) \right] - \delta^2 - \mathrm{Bias}^2 \propto \operatorname{tr}(U \widehat{L} \widehat{L}^\top), \qquad (10.2)

which does not require immediate reward samples for its computation. Since \mathbb{E}_{P_I,\pi,P_T} included in U is not accessible (see equation 10.1), we replace U by its consistent estimator Û, obtained by replacing the expectation with the empirical average over the roll-out samples. Consequently, we have the following generalization error estimator:

    \widehat{J} := \operatorname{tr}(\widehat{U} \widehat{L} \widehat{L}^\top),

which can be computed only from D̄^π̃, and thus can be employed in the active learning scenarios.

10.1.5 Designing Sampling Policies

Based on the generalization error estimator derived above, we give an algorithm for designing a good sampling policy which makes full use of the roll-out samples without immediate rewards.

1. Prepare K candidates of the sampling policy: {π̃_k}_{k=1}^K.
2. Collect episodic samples without immediate rewards for each sampling policy candidate: {D̄^π̃_k}_{k=1}^K.
3. Estimate U using all samples {D̄^π̃_k}_{k=1}^K.
4. Estimate the generalization error for each k:

    \widehat{J}_k := \operatorname{tr}(\widehat{U} \widehat{L}^{\tilde{\pi}_k} (\widehat{L}^{\tilde{\pi}_k})^\top), \qquad \widehat{L}^{\tilde{\pi}_k} := \bigl( (\widehat{X}^{\tilde{\pi}_k})^\top \widehat{W}^{\tilde{\pi}_k} \widehat{X}^{\tilde{\pi}_k} \bigr)^{-1} (\widehat{X}^{\tilde{\pi}_k})^\top \widehat{W}^{\tilde{\pi}_k},
where Ŵ^π̃_k is the diagonal matrix with the (N(m-1)+n)-th diagonal element given by the importance weight w_{m,n}^π̃_k.

5. (If possible) repeat steps 2 to 4 several times and calculate the average for each k: {J̄_k}_{k=1}^K.
6. Determine the sampling policy: π̃_AL := argmin_k J̄_k.
7. Collect training samples with immediate rewards following π̃_AL: D^π̃_AL.
8. Learn the value function by LSPI using D^π̃_AL.

10.1.6 Active Learning in Policy Iteration

As shown above, the unknown generalization error can be accurately estimated without using immediate reward samples in one-step policy evaluation. Here, we extend the idea to the full policy iteration setup.

Sample reuse policy iteration (SRPI) [66], described in section 7.6, is a framework of off-policy reinforcement learning [174, 125] which allows one to reuse previously collected samples effectively. Let us denote the evaluation policy at the l-th iteration by π_l and the maximum number of iterations by L.

In the policy iteration framework, new data samples D^{π_l} are collected following the new policy π_l for the next policy evaluation step. In ordinary policy iteration methods, only the new samples D^{π_l} are used for policy evaluation. Thus, the previously collected data samples {D^{π_1}, D^{π_2}, ..., D^{π_{l-1}}} are not utilized:

    π_1 → [E:{D^{π_1}}] Q̂^{π_1} → [I] π_2 → [E:{D^{π_2}}] Q̂^{π_2} → [I] π_3 → [E:{D^{π_3}}] ⋯ → [I] π_{L+1},

where E:{D} indicates policy evaluation using the data sample D, and I denotes policy improvement. On the other hand, in SRPI, all previously collected data samples are reused for policy evaluation as

    π_1 → [E:{D^{π_1}}] Q̂^{π_1} → [I] π_2 → [E:{D^{π_1}, D^{π_2}}] Q̂^{π_2} → [I] π_3 → [E:{D^{π_1}, D^{π_2}, D^{π_3}}] ⋯ → [I] π_{L+1},

where appropriate importance weights are applied to each set of previously collected samples in the policy evaluation step.

Here, we apply the active learning technique to the SRPI framework. More specifically, we optimize the sampling policy at each iteration. Then the
iteration process becomes

    π_1 → [E:{D^{π̃_1}}] Q̂^{π_1} → [I] π_2 → [E:{D^{π̃_1}, D^{π̃_2}}] Q̂^{π_2} → [I] π_3 → ⋯ → [I] π_{L+1}.

Thus, we do not gather samples following the current policy π_l, but following the sampling policy π̃_l optimized on the basis of the active learning method. We call this framework active policy iteration (API).

10.1.7 Robot Control Experiments

Here, we evaluate the performance of the API method using a ball-batting robot (see figure 10.1), which consists of two links and two joints. The goal of the ball-batting task is to control the robot arm so that it drives the ball as far as possible. The state space S is continuous and consists of the angles φ₁ [rad] (∈ [0, π/4]) and φ₂ [rad] (∈ [-π/4, π/4]) and the angular velocities φ̇₁ [rad/s] and φ̇₂ [rad/s]. Thus, a state s (∈ S) is described by the four-dimensional vector

    s = (φ₁, φ₂, φ̇₁, φ̇₂)^⊤.

The action space A is discrete and contains two elements, each a two-dimensional torque vector whose i-th element (i = 1, 2) corresponds to the torque [N·m] added to joint i.

Figure 10.1 A ball-batting robot. Object settings: link 1: 0.65 [m] (length), 11.5 [kg] (mass); link 2: 0.35 [m] (length), 6.2 [kg] (mass); ball: 0.1 [m] (radius), 0.1 [kg] (mass); pin: 0.3 [m] (height), 7.5 [kg] (mass).

We use the Open Dynamics Engine (ODE), which is available at http://ode.org/, for physical calculations including the update of the angles and
angular velocities, and collision detection between the robot arm, ball, and pin. The simulation time step is set to 7.5 [ms], and the next state is observed after 10 time steps. The action chosen in the current state is kept for 10 time steps. To make the experiments realistic, we add noise to actions: if action (f₁, f₂)^⊤ is taken, the actual torques applied to the joints are f₁ + ε₁ and f₂ + ε₂, where ε₁ and ε₂ are drawn independently from the Gaussian distribution with mean 0 and variance 3.

The immediate reward is defined as the carry of the ball. This reward is given only when the robot arm hits the ball for the first time at state s' after taking action a at the current state s.

For value function approximation, we use the 110 basis functions defined as

    \phi_{2(i-1)+j}(s,a) = \begin{cases} I(a = a^{(j)}) \exp\!\left( -\dfrac{\|s - c_i\|^2}{2\tau^2} \right) & \text{for } i = 1,\ldots,54 \text{ and } j = 1,2, \\ I(a = a^{(j)}) & \text{for } i = 55 \text{ and } j = 1,2, \end{cases}

where τ is set to 3π/2, and the Gaussian centers c_i (i = 1, ..., 54) are located on the regular grid {0, π/4} × {-π, 0, π} × {-π/4, 0, π/4} × {-π, 0, π}. I(c) denotes the indicator function:

    I(c) = 1 if the condition c is true, and 0 otherwise.

We set L = 7 and N = 10. As for the number of episodes M, we compare the "decreasing M" strategy (M is decreased as 10, 10, 7, 7, 7, 4, and 4 from iteration 1 to iteration 7) and the "fixed M" strategy (M is fixed to 7 throughout the iterations). The initial state is always set to s = (π/4, 0, 0, 0)^⊤. The initial evaluation policy π₁ is set to the ε-greedy policy, defined as

    π₁(a|s) := 0.15 p_u(a) + 0.85 I(a = argmax_{a'} Q₀(s, a')),

    Q₀(s, a) := Σ_{b=1}^{110} φ_b(s, a),

where p_u(a) denotes the uniform distribution over actions. Policies are updated using the ε-greedy rule with ε = 0.15/l at the l-th iteration. We prepare the following four sampling policy candidates in the sampling policy
selection step of the l-th iteration:

    { π_l^{0.15/l}, π_l^{0.15/l+0.15}, π_l^{0.15/l+0.5}, π_l^{0.15/l+0.85} },

where π_l denotes the policy obtained by greedy update using Q̂^{π_{l-1}}, and π_l^ε is the ε-greedy version of the base policy π_l. That is, the intended action can be successfully chosen with probability 1 - ε/2, and the other action is chosen with probability ε/2.

The discount factor γ is set to 1, and the performance of the learned policy π_{L+1} is measured by the discounted sum of immediate rewards for test samples {r_{m,n}^{π_{L+1}}}_{m,n=1}^{M,N} (20 episodes with 10 steps collected following π_{L+1}):

    Performance = Σ_{m=1}^{M} Σ_{n=1}^{N} r_{m,n}^{π_{L+1}}.

The experiment is repeated 500 times with different random seeds, and the average performance of each learning method is evaluated. The results are depicted in figure 10.2, showing that the API method outperforms the passive learning strategy; for the "decreasing M" strategy, the performance difference is statistically significant by the t-test at the significance level 1 percent for the error values at the seventh iteration.

The above experimental evaluation showed that the sampling policy design method, API, is useful for improving the performance of reinforcement learning. Moreover, the "decreasing M" strategy was shown to be a useful heuristic to further enhance the performance of API.

10.2 Wafer Alignment in Semiconductor Exposure Apparatus

In this section, we describe an application of pool-based active learning methods to a wafer alignment problem in semiconductor exposure apparatus [163]. A profile of the exposure apparatus is illustrated in figure 10.3.

Recent semiconductors have a layered circuit structure, which is built by exposing circuit patterns multiple times. In this process, it is extremely important to align the wafers at the same position with very high accuracy. To this end, the location of markers is measured to adjust the shift and rotation of wafers.
However, measuring the location of markers is time-consuming, and therefore there is a strong need to reduce the number of markers to be measured in order to speed up the semiconductor production process.
Figure 10.2 The mean performance over 500 runs in the ball-batting experiment (average performance vs. iteration, for AL (decreasing M), PL (decreasing M), AL (fixed M), and PL (fixed M)). The dotted lines denote the performance of passive learning (PL), and the solid lines denote the performance of the active learning (AL) method. The error bars are omitted for clear visibility. For the "decreasing M" strategy, the performance of active learning after the seventh iteration is significantly better than that of PL according to the t-test at the significance level 1 percent for the error values at the seventh iteration.

Figure 10.3 Semiconductor exposure apparatus (reticle stage, wafer stage, silicon wafer).
Figure 10.4 Silicon wafer with markers. Observed markers based on the conventional heuristic are also shown.

Figure 10.4 illustrates a wafer on which markers are printed uniformly. The goal is to choose the most "informative" markers to be measured for better alignment of the wafer. A conventional choice is to measure markers far from the center in a symmetric way, which will provide robust estimation of the rotation angle (see figure 10.4). However, this naive approach is not necessarily the best, since misalignment is caused not only by affine transformation but also by several other nonlinear factors, such as a warp, a biased characteristic of the measurement apparatus, and different temperature conditions. In practice, it is not easy to model such nonlinear factors accurately. For this reason, the linear affine model or the second-order model is often used in wafer alignment. However, this causes model misspecification, and therefore active learning methods for approximately correct models, explained in chapter 8, would be useful in this application.

Let us consider the functions whose input x = (u, v)^⊤ is the location on the wafer and whose output is the horizontal discrepancy Δu or the vertical discrepancy Δv. These functions are learned using a second-order model in u and v.

For 220 wafer samples, experiments are carried out as follows. For each wafer, n_tr = 20 points are chosen from n_te = 38 markers, and the horizontal and the vertical discrepancies are observed. Then the above model is trained, and its prediction performance is tested using all 38 markers in the 220 wafers. This
Table 10.1 The mean squared test error for the wafer alignment problem (means and standard deviations over 220 wafers)

Method    Order 1        Order 2
PALICE    °2.27±1.08     °1.93±0.89
PW        2.29±1.08      2.09±0.98
PO        2.37±1.15      1.96±0.91
Passive   2.32±1.11      2.32±1.15
Conv.     2.36±1.15      2.13±1.08

PALICE, PW, and PO denote the active learning methods described in sections 8.4.3, 8.4.2, and 8.4.1, respectively. "Passive" indicates the passive sampling strategy, where training input points are randomly chosen from all markers. "Conv." indicates the conventional heuristic of choosing the outer markers.

process is repeated for all 220 wafers. Since the choice of the sampling location by active learning methods is stochastic, the above experiment is repeated 100 times with different random seeds. The mean and standard deviation of the squared test error over the 220 wafers are summarized in table 10.1. This shows that the PALICE method (see section 8.4.3) works significantly better than the other sampling strategies, and it provides about a 10 percent reduction in the squared error from the conventional heuristic of choosing the outer markers.
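The second-order discrepancy model can be sketched as an ordinary least-squares fit of a quadratic surface over the marker locations. The feature map [1, u, v, u², uv, v²] below is one common parameterization of a second-order model, assumed here for illustration; the book's exact form may differ:

```python
import numpy as np

def quad_features(u, v):
    """Hypothetical second-order feature map over wafer coordinates (u, v)."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return np.column_stack([np.ones_like(u), u, v, u**2, u*v, v**2])

def fit_discrepancy(u, v, delta):
    """Least-squares fit of the discrepancy (delta_u or delta_v) over (u, v)."""
    theta, *_ = np.linalg.lstsq(quad_features(u, v), delta, rcond=None)
    return theta

# Recover a known quadratic surface from 20 "marker" locations.
rng = np.random.default_rng(4)
u, v = rng.uniform(-1, 1, 20), rng.uniform(-1, 1, 20)
theta_true = np.array([0.1, 0.5, -0.3, 0.2, 0.0, -0.1])
delta_u = quad_features(u, v) @ theta_true
theta = fit_discrepancy(u, v, delta_u)
print(np.allclose(theta, theta_true))  # True: noiseless data is recovered
```

In the actual application the fit would use the 20 selected markers per wafer and be evaluated on all 38 markers, with the active learning methods choosing which 20 locations to measure.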
11 Conclusions and Future Prospects

In this book, we provided a comprehensive overview of the theory, algorithms, and applications of machine learning under covariate shift.

11.1 Conclusions

Part II of the book covered topics on learning under covariate shift. In chapters 2 and 3, importance sampling techniques were shown to form the theoretical basis of covariate shift adaptation in function learning and model selection. In practice, the importance weights needed in importance sampling are unknown. Thus, estimating the importance weights is a key component of covariate shift adaptation, which was covered in chapter 4. In chapter 5, a novel idea for estimating the importance weights in high-dimensional problems was explained. Chapter 6 was devoted to a review of a Nobel Prize-winning work on sample selection bias, and its relation to covariate shift adaptation was discussed. In chapter 7, applications of covariate shift adaptation techniques to various real-world problems were shown.

In part II, we considered the occurrence of covariate shift and how to cope with it. On the other hand, in part III, we considered the situation where covariate shift is intentionally caused by users in order to improve generalization ability. In chapter 8, the problem of active learning, where the training input distribution is designed by users, was addressed. Since active learning naturally induces covariate shift, its adaptation was shown to be essential for designing better active learning algorithms. In chapter 9, the challenging problem of active learning with model selection, where active learning and model selection are performed at the same time, was addressed. The ensemble approach was shown to be a promising method for this chicken-or-egg problem. In chapter 10, applications of active learning techniques to real-world problems were shown.
11.2 Future Prospects

In the context of covariate shift adaptation, the importance weights played an essential role in systematically adjusting for the difference of distributions between the training and test phases. In chapter 4, we showed various methods for estimating the importance weights.

Beyond covariate shift adaptation, it has been shown recently that the ratio of probability densities can be used for solving various machine learning tasks [157, 170]. This novel machine learning framework includes multitask learning [16], privacy-preserving data mining [46], outlier detection [79, 146, 80], change detection in time series [91], two-sample tests [169], conditional density estimation [172], and probabilistic classification [155]. Furthermore, mutual information, which plays a central role in information theory [34], can be estimated via density ratio estimation [178, 179]. Since mutual information is a measure of statistical independence between random variables, density ratio estimation can also be used for variable selection [177], dimensionality reduction [175], independence tests [168], clustering [94], independent component analysis [176], and causal inference [203]. Thus, density ratio estimation is a promising, versatile tool for machine learning which needs to be further investigated.
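As a minimal illustration of direct density-ratio estimation, the following sketch implements a uLSIF-style estimator of w(x) = p_te(x)/p_tr(x): the ratio is modeled as a sum of Gaussian kernels centered at the test points, and the regularized least-squares system is solved in closed form. The kernel width and regularization constant here are illustrative choices, not the book's recommended model selection procedure:

```python
import numpy as np

def ulsif_weights(x_tr, x_te, sigma=0.5, lam=1e-3):
    """uLSIF-style sketch: w_hat(x) = sum_l alpha_l K(x, c_l), centers c_l at
    the test points; alpha solves (H + lam I) alpha = h in closed form."""
    def K(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2 * sigma ** 2))
    Phi_tr = K(x_tr, x_te)                  # (n_tr, n_te) kernel design matrix
    H = Phi_tr.T @ Phi_tr / len(x_tr)       # empirical H = E_tr[phi phi^T]
    h = K(x_te, x_te).mean(axis=0)          # empirical h = E_te[phi]
    alpha = np.linalg.solve(H + lam * np.eye(len(h)), h)
    return np.maximum(Phi_tr @ alpha, 0.0)  # clip to nonnegative ratios

# Toy covariate shift in 1-D: training N(0, 1), test N(0.5, 1).
rng = np.random.default_rng(5)
x_tr = rng.normal(0.0, 1.0, size=(200, 1))
x_te = rng.normal(0.5, 1.0, size=(200, 1))
w = ulsif_weights(x_tr, x_te)
print(w.shape)
```

The estimated weights can then be plugged into importance-weighted learning methods such as IWLS or IWERM, or used for the other density-ratio applications listed above.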
Appendix: List of Symbols and Abbreviations

X: Input domain
d: Input dimensionality
Y: Output domain
x_i^tr: i-th training input
y_i^tr: i-th training output
y^tr: Training output vector
ε_i^tr: i-th training output noise
ε^tr: Training output noise vector
σ: Standard deviation of output noise
n_tr: Number of training samples
p_tr(x): Training input density
p(y|x): Conditional density of output y, given input x
f(x): Conditional mean of output y, given input x
x_j^te: j-th test input
y_j^te: j-th test output
n_te: Number of test samples
p_te(x): Test input density
w(x): Importance weight
loss(x, y, ŷ): Loss when output y at input x is estimated by ŷ
Gen: Generalization error
Gen', Gen'': Generalization error without irrelevant constant
Ĝen: Generalization error estimator
Bias²: (Squared) bias term in generalization error
Var: Variance term in generalization error
E_{x^tr}: Expectation over {x_i^tr}_{i=1}^{n_tr} drawn i.i.d. from p_tr(x)
E_{y^tr}: Expectation over {y_i^tr}_{i=1}^{n_tr}, each drawn from p(y|x = x_i^tr)
E_{x^te}: Expectation over x^te drawn from p_te(x)
E_{y^te}: Expectation over y^te drawn from p(y|x = x^te)
M: Model
f(x; θ): Function model
θ_ℓ: ℓ-th parameter
φ_ℓ(x): ℓ-th basis function
θ: Parameter vector
b: Number of parameters
K(x, x'): Kernel function
θ̂: Learned parameter
θ*: Optimal parameter
γ: Flattening parameter for importance weights
λ: Regularization parameter
R(θ): Regularization function
X^tr: Training design matrix
W: Importance-weight matrix
L: Learning matrix
U: Metric matrix
K^tr: Training kernel matrix
N(x; μ, σ²): Gaussian density with mean μ and variance σ²
N(x; μ, Σ): Multidimensional Gaussian density with mean μ and covariance matrix Σ
r(x): Residual function
δ: Model error
p*_tr(x): "Optimal" training input density
w̃(x): Resampling weight function
w̃*(x): "Optimal" resampling weight function
ν: Flattening parameter of training input density
O: Asymptotic order ("big O")
O_p: Asymptotic order in probability ("big O")
o: Asymptotic order ("small o")
o_p: Asymptotic order in probability ("small o")
⊤: Transpose of matrix or vector
⟨·,·⟩: Inner product
‖·‖: Norm
AIC: Akaike information criterion
AIWERM: Adaptive IWERM
AIWLS: Adaptive IWLS
ALICE: Active learning using IWLS based on conditional expectation of generalization error
API: Active policy iteration
BASIC: Bootstrap approximated SIC
CRF: Conditional random field
CSP: Common spatial pattern
CV: Cross-validation
D3: Direct density-ratio estimation with dimensionality reduction
EAL: Ensemble active learning
ERM: Empirical risk minimization
FDA: Fisher discriminant analysis
GMM: Gaussian mixture model
IWERM: Importance-weighted ERM
IWLS: Importance-weighted LS
KDE: Kernel density estimation
KKT: Karush-Kuhn-Tucker
KL: Kullback-Leibler
KLIEP: KL importance estimation procedure
KMM: Kernel mean matching
LASIC: Linearly approximated SIC
LFDA: Local FDA
LOOCV: Leave-one-out CV
LR: Logistic regression
LS: Least squares
LSIF: Least-squares importance fitting
MDP: Markov decision problem
MFCC: Mel-frequency cepstrum coefficient
MLE: Maximum likelihood estimation
MSE: Mean squared error
NLP: Natural language processing
OLS: Ordinary LS
PALICE: Pool-based ALICE
QP: Quadratic program
RIWERM: Regularized IWERM
RL: Reinforcement learning
SIC: Subspace information criterion
SRPI: Sample reuse policy iteration
SVM: Support vector machine
TD: Temporal difference
uLSIF: Unconstrained LSIF
Bibliography

1. H. Akaike. Statistical predictor identification. Annals of the Institute of Statistical Mathematics, 22(1):203-217, 1970.
2. H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19(6):716-723, 1974.
3. H. Akaike. Likelihood and the Bayes procedure. In J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith, editors, Bayesian Statistics, pages 141-166. University of Valencia Press, Valencia, Spain, 1980.
4. T. Akiyama, H. Hachiya, and M. Sugiyama. Efficient exploration through active learning for value function approximation in reinforcement learning. Neural Networks, 23(5):639-648, 2010.
5. A. Albert. Regression and the Moore-Penrose Pseudoinverse. Academic Press, New York and London, 1972.
6. N. Altman and C. Léger. On the optimality of prediction-based selection criteria and the convergence rates of estimators. Journal of the Royal Statistical Society, series B, 59(1):205-216, 1997.
7. S. Amari. Theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, EC-16(3):299-307, 1967.
8. F. Babiloni, F. Cincotti, L. Lazzarini, J. del R. Millán, J. Mouriño, M. Varsta, J. Heikkonen, L. Bianchi, and M. G. Marciani. Linear classification of low-resolution EEG patterns produced by imagined hand movements. IEEE Transactions on Rehabilitation Engineering, 8(2):186-188, June 2000.
9. F. R. Bach. Active learning for misspecified generalized linear models. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems, volume 19, pages 65-72. MIT Press, Cambridge, MA, 2007.
10. P. Baldi and S. Brunak. Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge, MA, 1998.
11. L. Bao and S. S. Intille. Activity recognition from user-annotated acceleration data. In Proceedings of the 2nd IEEE International Conference on Pervasive Computing, pages 1-17. Springer, New York, 2004.
12. R. Bellman.
Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton, NJ, 1961.
13. D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Nashua, NH, 1996.
14. M. J. Best. An algorithm for the solution of the parametric quadratic programming problem. CORR Report 82-24, Faculty of Mathematics, University of Waterloo, Ontario, Canada, 1982.
15. N. B. Bharatula, M. Stäger, P. Lukowicz, and G. Tröster. Empirical study of design choices in multi-sensor context recognition. In Proceedings of the 2nd International Forum on Applied Wearable Computing, pages 79-93, 2005.
16. S. Bickel, J. Bogojeska, T. Lengauer, and T. Scheffer. Multi-task learning for HIV therapy screening. In A. McCallum and S. Roweis, editors, Proceedings of the 25th Annual International Conference on Machine Learning, pages 56-63. Omnipress, Madison, WI, 2008.
17. S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning for differing training and test distributions. In Z. Ghahramani, editor, Proceedings of the 24th International Conference on Machine Learning, pages 81-88, 2007.
18. S. Bickel and T. Scheffer. Dirichlet-enhanced spam filtering based on biased samples. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems, volume 19, pages 161-168. MIT Press, Cambridge, MA, 2007.
19. H. J. Bierens. Maximum likelihood estimation of Heckman's sample selection model. Unpublished manuscript, Pennsylvania State University, 2002. Available at http://econ.la.psu.edu/~hbierens/EasyRegTours/HECKMAN_Tourfiles/Heckman.PDF.
20. C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
21. C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.
22. B. Blankertz, G. Dornhege, M. Krauledat, and K.-R. Müller. The Berlin brain-computer interface: EEG-based communication without subject training. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 14(2):147-152, 2006.
23. B. Blankertz, G. Dornhege, M. Krauledat, K.-R. Müller, and G. Curio. The Berlin brain-computer interface: Report from the feedback sessions. Technical Report 1, Fraunhofer FIRST, 2005.
24. J. Blitzer, R. McDonald, and F. Pereira. Domain adaptation with structural correspondence learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 120-128, 2006.
25. K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola.
Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49-e57, 2006.
26. B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 144-152. ACM Press, New York, 1992.
27. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.
28. L. Breiman. Arcing classifiers. Annals of Statistics, 26(3):801-849, 1998.
29. W. Campbell. Generalized linear discriminant sequence kernels for speaker recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 161-164. 2002.
30. O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.
31. S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33-61, 1998.
32. K. F. Cheng and C. K. Chu. Semiparametric density estimation under a two-sample density ratio model. Bernoulli, 10(4):583-604, 2004.
33. D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129-145, 1996.
34. T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, 1991.
35. P. Craven and G. Wahba. Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31:377-403, 1979.
36. H. Daumé III. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 256-263. 2007.
37. A. Demiriz, K. P. Bennett, and J. Shawe-Taylor. Linear programming boosting via column generation. Machine Learning, 46(1/3):225-254, 2002.
38. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.
39. G. Dornhege, B. Blankertz, G. Curio, and K.-R. Müller. Boosting bit rates in noninvasive EEG single-trial classifications by feature combination and multiclass paradigms. IEEE Transactions on Biomedical Engineering, 51(6):993-1002, June 2004.
40. G. Dornhege, J. del R. Millán, T. Hinterberger, D. McFarland, and K.-R. Müller, editors. Toward Brain-Computer Interfacing. MIT Press, Cambridge, MA, 2007.
41. N. R. Draper and H. Smith. Applied Regression Analysis, third edition. Wiley-Interscience, New York, 1998.
42. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, second edition. Wiley-Interscience, New York, 2000.
43. B. Efron. Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1):1-26, 1979.
44. B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407-499, 2004.
45. B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall/CRC, New York, 1994.
46. C. Elkan. Privacy-preserving data mining via importance weighting. In Proceedings of the ECML/PKDD Workshop on Privacy and Security Issues in Data Mining and Machine Learning, pages 15-21. Springer, 2010.
47. T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1-50, 2000.
48. V. V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, 1972.
49. FG-NET Consortium. The FG-NET Aging Database.
http://www.fgnet.rsunit.com/.
50. R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179-188, 1936.
51. G. S. Fishman. Monte Carlo: Concepts, Algorithms, and Applications. Springer-Verlag, Berlin, 1996.
52. Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on Machine Learning, pages 148-156. Morgan Kaufmann, San Francisco, 1996.
53. J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28(2):337-407, 2000.
54. Y. Fu, Y. Xu, and T. S. Huang. Estimating human age by manifold analysis of face pictures and regression on aging features. In Proceedings of the IEEE International Conference on Multimedia and Expo, pages 1383-1386. 2007.
55. K. Fukumizu. Statistical active learning in multilayer perceptrons. IEEE Transactions on Neural Networks, 11(1):17-26, 2000.
56. K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5(1):73-99, January 2004.
57. K. Fukunaga. Introduction to Statistical Pattern Recognition, second edition. Academic Press, Boston, 1990.
58. S. Furui. Cepstral analysis technique for automatic speaker verification. IEEE Transactions on Acoustics, Speech, and Signal Processing, 29(2):254-272, 1981.
59. S. Furui. Comparison of speaker recognition methods using statistical features and dynamic features. IEEE Transactions on Acoustics, Speech, and Signal Processing, 29(3):342-350, 1981.
60. C. Gao, F. Kong, and J. Tan. HealthAware: Tackling obesity with Health Aware smart phone systems. In Proceedings of the IEEE International Conference on Robotics and Biomimetics, pages 1549-1554. IEEE Press, Piscataway, NJ, 2009.
61. X. Geng, Z. Zhou, Y. Zhang, G. Li, and H. Dai. Learning from facial aging patterns for automatic age estimation. In Proceedings of the 14th ACM International Conference on Multimedia, pages 307-316. 2006.
62. A. Globerson and S. Roweis. Metric learning by collapsing classes. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems, volume 18, pages 451-458. MIT Press, Cambridge, MA, 2006.
63. J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 17, pages 513-520. MIT Press, Cambridge, MA, 2005.
64. G. H. Golub and C. F. van Loan. Matrix Computations, third edition. Johns Hopkins University Press, Baltimore, 1996.
65. G. Guo, G. Mu, Y. Fu, C. Dyer, and T. Huang. A study on automatic age estimation using a large database. In Proceedings of the IEEE International Conference on Computer Vision, pages 1986-1991. 2009.
66. H. Hachiya, T. Akiyama, M. Sugiyama, and J. Peters. Adaptive importance sampling for value function approximation in off-policy reinforcement learning. Neural Networks, 22(10):1399-1410, 2009.
67. H. Hachiya, J. Peters, and M. Sugiyama. Efficient sample reuse in EM-based policy search. In W. Buntine, M. Grobelnik, D. Mladenic, and J. Shawe-Taylor, editors, Machine Learning and Knowledge Discovery in Databases, volume 5781 of Lecture Notes in Artificial Intelligence, pages 469-484. Springer, Berlin, 2009.
68. H.
Hachiya, M. Sugiyama, and N. Ueda. Importance-weighted least-squares probabilistic classifier for covariate shift adaptation with application to human activity recognition. Neurocomputing, 2011. In press.
69. W. Härdle, M. Müller, S. Sperlich, and A. Werwatz. Nonparametric and Semiparametric Models. Springer, Berlin, 2004.
70. T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5:1391-1415, 2004.
71. T. Hastie and R. Tibshirani. Generalized Additive Models, volume 43 of Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, London, 1990.
72. T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):607-616, 1996.
73. T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B, 58(1):155-176, 1996.
74. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, 2001.
75. Y. Hattori, S. Inoue, T. Masaki, G. Hirakawa, and O. Sudo. Gathering large scale human activity information using mobile sensor devices. In Proceedings of the Second International Workshop on Network Traffic Control, Analysis and Applications, pages 708-713. 2010.
76. J. J. Heckman. The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. Annals of Economic and Social Measurement, 5(4):120-137, 1976.
77. J. J. Heckman. Sample selection bias as a specification error. Econometrica, 47(1):153-161, 1979.
78. R. E. Henkel. Tests of Significance. Sage Publications, Beverly Hills, CA, 1976.
79. S. Hido, Y. Tsuboi, H. Kashima, M. Sugiyama, and T. Kanamori. Inlier-based outlier detection via direct density ratio estimation. In F. Giannotti, D. Gunopulos, F. Turini, C. Zaniolo, N. Ramakrishnan, and X. Wu, editors, Proceedings of the IEEE International Conference on Data Mining, pages 223-232. 2008.
80. S. Hido, Y. Tsuboi, H. Kashima, M. Sugiyama, and T. Kanamori. Statistical outlier detection using direct density ratio estimation. Knowledge and Information Systems, 26(2):309-336, 2011.
81. G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.
82. J. Huang, A. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems, volume 19, pages 601-608. MIT Press, Cambridge, MA, 2007.
83. P. J. Huber. Robust Statistics. Wiley-Interscience, New York, 1981.
84. A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, New York, 2001.
85. M. Ishiguro, Y. Sakamoto, and G. Kitagawa. Bootstrapping log likelihood and EIC, an extension of AIC. Annals of the Institute of Statistical Mathematics, 49(3):411-434, 1997.
86. J. Jiang and C. Zhai. Instance weighting for domain adaptation in NLP. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 264-271. Association for Computational Linguistics, 2007.
87. T. Kanamori. Pool-based active learning with optimal sampling distribution and its information geometrical interpretation. Neurocomputing, 71(1-3):353-362, 2007.
88. T. Kanamori, S. Hido, and M. Sugiyama. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10:1391-1445, July 2009.
89. T. Kanamori and H. Shimodaira. Active learning algorithm using the maximum weighted log-likelihood estimator.
Journal of Statistical Planning and Inference, 116(1):149-162, 2003.
90. T. Kanamori, T. Suzuki, and M. Sugiyama. Condition number analysis of kernel-based density ratio estimation. Technical Report TR09-0006, Department of Computer Science, Tokyo Institute of Technology, 2009. Available at http://arxiv.org/abs/0912.2800.
91. Y. Kawahara and M. Sugiyama. Change-point detection in time-series data by direct density-ratio estimation. In H. Park, S. Parthasarathy, H. Liu, and Z. Obradovic, editors, Proceedings of the SIAM International Conference on Data Mining, pages 389-400. SIAM, 2009.
92. M. Kawanabe, M. Sugiyama, G. Blanchard, and K.-R. Müller. A new algorithm of non-Gaussian component analysis with radial kernel functions. Annals of the Institute of Statistical Mathematics, 59(1):57-75, 2007.
93. J. Kiefer. Optimum experimental designs. Journal of the Royal Statistical Society, Series B, 21:272-304, 1959.
94. M. Kimura and M. Sugiyama. Dependence-maximization clustering with least-squares mutual information. Journal of Advanced Computational Intelligence and Intelligent Informatics, 15(7):800-805, 2011.
95. D. E. Knuth. Seminumerical Algorithms, volume 2 of The Art of Computer Programming, third edition. Addison-Wesley Professional, Boston, 1998.
96. S. Konishi and G. Kitagawa. Generalised information criteria in model selection. Biometrika, 83(4):875-890, 1996.
97. S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22(1):79-86, 1951.
98. J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, pages 282-289. Morgan Kaufmann, San Francisco, 2001.
99. M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107-1149, December 2003.
100. P. Leijdekkers and V. Gay. A self-test to detect a heart attack using a mobile phone and wearable sensors. In Proceedings of the 21st IEEE International Symposium on Computer-Based Medical Systems, pages 93-98. IEEE Computer Society, Washington, DC, 2008.
101. S. Lemm, B. Blankertz, G. Curio, and K.-R. Müller. Spatio-spectral filters for improving the classification of single trial EEG. IEEE Transactions on Biomedical Engineering, 52(9):1541-1548, September 2005.
102. Y. Li, H. Kambara, Y. Koike, and M. Sugiyama. Application of covariate shift adaptation techniques in brain computer interfaces. IEEE Transactions on Biomedical Engineering, 57(6):1318-1324, 2010.
103. C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. Journal of Machine Learning Research, 9:627-650, April 2008.
104. R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. Wiley, New York, 1987.
105. A. Luntz and V. Brailovsky. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica, 3, 1969. (In Russian.)
106. D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992.
107. D. J. C. MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590-604, 1992.
108. C. L. Mallows. Some comments on Cp. Technometrics, 15(4):661-675, 1973.
109. O. L. Mangasarian and D. R. Musicant. Robust linear and support vector regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(9):950-955, 2000.
110. J. Mariéthoz and S. Bengio. A kernel trick for sequences applied to text-independent speaker verification systems. Pattern Recognition, 40(8):2315-2324, 2007.
111. T. Matsui and K. Aikawa.
Robust model for speaker verification against session-dependent utterance variation. IEICE Transactions on Information and Systems, E86-D(4):712-718, 2003.
112. T. Matsui and S. Furui. Concatenated phoneme models for text-variable speaker recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 391-394. 1993.
113. T. Matsui and K. Tanabe. Comparative study of speaker identification methods: dPLRM, SVM, and GMM. IEICE Transactions on Information and Systems, E89-D(3):1066-1073, 2006.
114. P. McCullagh and J. A. Nelder. Generalized Linear Models, second edition. Chapman & Hall/CRC, London, 1989.
115. P. Melville and R. J. Mooney. Diverse ensembles for active learning. In Proceedings of the 21st International Conference on Machine Learning, pages 584-591. ACM Press, New York, 2004.
116. J. del R. Millán. On the need for on-line learning in brain-computer interfaces. In Proceedings of the International Joint Conference on Neural Networks, volume 4, pages 2877-2882. 2004.
117. T. P. Minka. A comparison of numerical optimizers for logistic regression. Technical report, Microsoft Research, 2007.
118. N. Murata, S. Yoshizawa, and S. Amari. Network information criterion: Determining the number of hidden units for an artificial neural network model. IEEE Transactions on Neural Networks, 5(6):865-872, 1994.
119. W. K. Newey, J. L. Powell, and J. R. Walker. Semiparametric estimation of selection models: Some empirical results. American Economic Review Papers and Proceedings, 80(2):324-328, 1990.
120. X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847-5861, 2010.
121. K. Pelckmans, J. A. K. Suykens, and B. de Moor. Additive regularization trade-off: Fusion of training and validation levels in kernel methods. Machine Learning, 62(3):217-252, 2006.
122. G. Pfurtscheller and F. H. Lopes da Silva. Event-related EEG/MEG synchronization and desynchronization: Basic principles. Clinical Neurophysiology, 110(11):1842-1857, November 1999.
123. P. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and W. J. Worek. Overview of the face recognition grand challenge. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 947-954. 2005.
124. D. Precup, R. S. Sutton, and S. Dasgupta. Off-policy temporal-difference learning with function approximation. In Proceedings of the 18th International Conference on Machine Learning, pages 417-424. 2001.
125. D. Precup, R. S. Sutton, and S. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pages 759-766. Morgan Kaufmann, San Francisco, 2000.
126. P. A. Puhani. The Heckman correction for sample selection and its critique: A short survey. Journal of Economic Surveys, 14(1):53-68, 2000.
127. F. Pukelsheim. Optimal Design of Experiments. Wiley, New York, 1993.
128. J. Qin. Inferences for case-control and semiparametric two-sample density ratio models. Biometrika, 85(3):619-630, 1998.
129. L. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ, 1993.
130. H. Ramoser, J. Müller-Gerking, and G. Pfurtscheller. Optimal spatial filtering of single trial EEG during imagined hand movement. IEEE Transactions on Rehabilitation Engineering, 8(4):441-446, 2000.
131. C. R. Rao. Linear Statistical Inference and Its Applications. Wiley, New York, 1965.
132. D. A. Reynolds, T. F. Quatieri, and R. B. Dunn. Speaker verification using adapted Gaussian mixture models.
Digital Signal Processing, 10(1-3):19-41, 2000.
133. D. A. Reynolds and R. C. Rose. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing, 3(1):72-83, 1995.
134. K. Ricanek and T. Tesafaye. MORPH: A longitudinal image database of normal adult age-progression. In Proceedings of the IEEE 7th International Conference on Automatic Face and Gesture Recognition, pages 341-345. 2006.
135. J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465-471, 1978.
136. J. Rissanen. Stochastic complexity. Journal of the Royal Statistical Society, Series B, 49(3):223-239, 1987.
137. R. T. Rockafellar and S. Uryasev. Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26(7):1443-1471, 2002.
138. P. J. Rousseeuw and A. M. Leroy. Robust Regression and Outlier Detection. Wiley, New York, 1987.
139. A. I. Schein and L. H. Ungar. Active learning for logistic regression: An evaluation. Machine Learning, 68(3):235-265, 2007.
140. M. Schmidt. minFunc, 2005. http://people.cs.ubc.ca/~schmidtm/Software/minFunc.html.
141. B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2001.
142. G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461-464, 1978.
143. P. Shenoy, M. Krauledat, B. Blankertz, R. P. N. Rao, and K.-R. Müller. Towards adaptive classification for BCI. Journal of Neural Engineering, 3(1):R13-R23, 2006.
144. R. Shibata. Statistical aspects of model selection. In J. C. Willems, editor, From Data to Model, pages 215-240. Springer-Verlag, New York, 1989.
145. H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227-244, 2000.
146. A. Smola, L. Song, and C. H. Teo. Relative novelty detection. In D. van Dyk and M. Welling, editors, Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, volume 5 of JMLR Workshop and Conference Proceedings, pages 536-543. 2009.
147. L. Song, A. Smola, A. Gretton, K. M. Borgwardt, and J. Bedo. Supervised feature selection via dependence estimation. In Proceedings of the 24th International Conference on Machine Learning, pages 823-830. ACM Press, New York, 2007.
148. C. M. Stein. Estimation of the mean of a multivariate normal distribution. Annals of Statistics, 9(6):1135-1151, 1981.
149. I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67-93, 2001.
150. M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36(2):111-147, 1974.
151. M. Stone. Asymptotics for and against cross-validation. Biometrika, 64(1):29-35, 1977.
152. E. P. Stuntebeck, J. S. Davis II, G. D. Abowd, and M. Blount. HealthSense: Classification of health-related sensor data through user-assisted machine learning. In Proceedings of the 9th Workshop on Mobile Computing Systems and Applications, pages 1-5. ACM, New York, 2008.
153. M. Sugiyama. Active learning in approximately linear regression based on conditional expectation of generalization error. Journal of Machine Learning Research, 7:141-166, January 2006.
154. M. Sugiyama. Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis.
Journal of Machine Learning Research, 8:1027-1061, May 2007.
155. M. Sugiyama. Superfast-trainable multi-class probabilistic classifier by least-squares posterior fitting. IEICE Transactions on Information and Systems, E93-D(10):2690-2701, 2010.
156. M. Sugiyama, T. Ide, S. Nakajima, and J. Sese. Semi-supervised local Fisher discriminant analysis for dimensionality reduction. Machine Learning, 78(1-2):35-61, 2010.
157. M. Sugiyama, T. Kanamori, T. Suzuki, S. Hido, J. Sese, I. Takeuchi, and L. Wang. A density-ratio framework for statistical data processing. IPSJ Transactions on Computer Vision and Applications, 1:183-208, 2009.
158. M. Sugiyama, M. Kawanabe, and P. L. Chui. Dimensionality reduction for density ratio estimation in high-dimensional spaces. Neural Networks, 23(1):44-59, 2010.
159. M. Sugiyama, M. Kawanabe, and K.-R. Müller. Trading variance reduction with unbiasedness: The regularized subspace information criterion for robust model selection in kernel regression. Neural Computation, 16(5):1077-1104, 2004.
160. M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985-1005, May 2007.
161. M. Sugiyama and K.-R. Müller. The subspace information criterion for infinite dimensional hypothesis spaces. Journal of Machine Learning Research, 3:323-359, November 2002.
162. M. Sugiyama and K.-R. Müller. Input-dependent estimation of generalization error under covariate shift. Statistics & Decisions, 23(4):249-279, 2005.
163. M. Sugiyama and S. Nakajima. Pool-based active learning in approximate linear regression. Machine Learning, 75(3):249-274, 2009.
164. M. Sugiyama, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In J. C. Platt, D. Koller, Y. Singer, and S.
Roweis, editors, Advances in Neural Information Processing Systems, volume 20, pages 1433-1440. MIT Press, Cambridge, MA, 2008.
165. M. Sugiyama and H. Ogawa. Subspace information criterion for model selection. Neural Computation, 13(8):1863-1889, 2001.
166. M. Sugiyama and H. Ogawa. Active learning with model selection: Simultaneous optimization of sample points and models for trigonometric polynomial models. IEICE Transactions on Information and Systems, E86-D(12):2753-2763, 2003.
167. M. Sugiyama and N. Rubens. A batch ensemble approach to active learning with model selection. Neural Networks, 21(9):1278-1286, 2008.
168. M. Sugiyama and T. Suzuki. Least-squares independence test. IEICE Transactions on Information and Systems, E94-D(6):1333-1336, 2011.
169. M. Sugiyama, T. Suzuki, Y. Itoh, T. Kanamori, and M. Kimura. Least-squares two-sample test. Neural Networks, 24(7):735-751, 2011.
170. M. Sugiyama, T. Suzuki, and T. Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, Cambridge, UK, 2012.
171. M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699-746, 2008.
172. M. Sugiyama, I. Takeuchi, T. Suzuki, T. Kanamori, H. Hachiya, and D. Okanohara. Least-squares conditional density estimation. IEICE Transactions on Information and Systems, E93-D(3):583-594, 2010.
173. M. Sugiyama, M. Yamada, P. von Bünau, T. Suzuki, T. Kanamori, and M. Kawanabe. Direct density-ratio estimation with dimensionality reduction via least-squares hetero-distributional subspace search. Neural Networks, 24(2):183-198, 2011.
174. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
175. T. Suzuki and M. Sugiyama. Sufficient dimension reduction via squared-loss mutual information estimation. In Y. W. Teh and M.
Titterington, editors, Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, volume 9 of JMLR Workshop and Conference Proceedings, pages 804-811. 2010.
176. T. Suzuki and M. Sugiyama. Least-squares independent component analysis. Neural Computation, 23(1):284-301, 2011.
177. T. Suzuki, M. Sugiyama, T. Kanamori, and J. Sese. Mutual information estimation reveals global associations between stimuli and biological processes. BMC Bioinformatics, 10(1):S52, 2009.
178. T. Suzuki, M. Sugiyama, J. Sese, and T. Kanamori. Approximating mutual information by maximum likelihood density ratio estimation. In Y. Saeys, H. Liu, I. Inza, L. Wehenkel, and Y. Van de Peer, editors, Proceedings of the ECML-PKDD 2008 Workshop on New Challenges for Feature Selection in Data Mining and Knowledge Discovery, volume 4 of JMLR Workshop and Conference Proceedings, pages 5-20. 2008.
179. T. Suzuki, M. Sugiyama, and T. Tanaka. Mutual information approximation via maximum likelihood estimation of density ratio. In Proceedings of the IEEE International Symposium on Information Theory, pages 463-467. 2009.
180. A. Takeda, J. Gotoh, and M. Sugiyama. Support vector regression as conditional value-at-risk minimization with application to financial time-series analysis. In S. Kaski, D. J. Miller, E. Oja, and A. Honkela, editors, IEEE International Workshop on Machine Learning for Signal Processing, pages 118-123. 2010.
181. A. Takeda and M. Sugiyama. On generalization performance and non-convex optimization of extended ν-support vector machine. New Generation Computing, 27(3):259-279, 2009.
182. K. Takeuchi. Distribution of information statistics and validity criteria of models. Mathematical Sciences, 153:12-18, 1976. (In Japanese.)
183. R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.
184. F. H. C. Tivive and A. Bouzerdoum. A gender recognition system using shunting inhibitory convolutional neural networks. In Proceedings of the IEEE International Joint Conference on Neural Networks, volume 10, pages 5336-5341. 2006.
185. Y. Tsuboi, H. Kashima, S. Hido, S. Bickel, and M. Sugiyama. Direct density ratio estimation for large-scale covariate shift adaptation. In M. J. Zaki, K. Wang, C. Apte, and H. Park, editors, Proceedings of the 8th SIAM International Conference on Data Mining, pages 443-454. SIAM, 2008.
186. Y. Tsuboi, H. Kashima, S. Hido, S. Bickel, and M. Sugiyama. Direct density ratio estimation for large-scale covariate shift adaptation. Journal of Information Processing, 17:138-155, 2009.
187. Y. Tsuboi, H. Kashima, S. Mori, H. Oda, and Y. Matsumoto. Training conditional random fields using incomplete annotations. In Proceedings of the 22nd International Conference on Computational Linguistics, pages 897-904. 2008.
188. K. Ueki, M. Sugiyama, and Y. Ihara. Perceived age estimation under lighting condition change by covariate shift adaptation. In Proceedings of the 20th International Conference on Pattern Recognition, pages 3400-3403. 2010.
189. K. Ueki, M. Sugiyama, and Y. Ihara. Semi-supervised estimation of perceived age from face images. In Proceedings of the International Conference on Computer Vision Theory and Applications, pages 319-324. 2010.
190. L. G. Valiant. A theory of the learnable. Communications of the Association for Computing Machinery, 27:1134-1142, 1984.
191. S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, Cambridge, 2000.
192. A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York, 1996.
193. V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
194. C. Vidaurre, A. Schlögl, R. Cabeza, and G. Pfurtscheller. About adaptive classifiers for brain computer interfaces.
Biomedizinische Technik, 49(1):85-86, 2004.
195. G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, 1990.
196. S. Watanabe. Algebraic Geometry and Statistical Learning Theory. Cambridge University Press, Cambridge, UK, 2009.
197. H. White. Maximum likelihood estimation of misspecified models. Econometrica, 50(1):1-25, 1982.
198. G. Wichern, M. Yamada, H. Thornburg, M. Sugiyama, and A. Spanias. Automatic audio tagging using covariate shift adaptation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 253-256. 2010.
199. D. P. Wiens. Robust weights and designs for biased regression models: Least squares and generalized M-estimation. Journal of Statistical Planning and Inference, 83(2):395-412, 2000.
200. P. M. Williams. Bayesian regularization and pruning using a Laplace prior. Neural Computation, 7(1):117-143, 1995.
201. J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan. Brain-computer interfaces for communication and control. Clinical Neurophysiology, 113(6):767-791, 2002.
202. M. Yamada and M. Sugiyama. Direct importance estimation with Gaussian mixture models. IEICE Transactions on Information and Systems, E92-D(10):2159-2162, 2009.
203. M. Yamada and M. Sugiyama. Dependence minimizing regression with model selection for non-linear causal inference under non-Gaussian noise. In Proceedings of the 24th AAAI Conference on Artificial Intelligence, pages 643-648. AAAI Press, 2010.
204. M. Yamada, M. Sugiyama, and T. Matsui. Semi-supervised speaker identification under covariate shift. Signal Processing, 90(8):2353-2361, 2010.
205. M. Yamada, M. Sugiyama, G. Wichern, and J. Simm. Direct importance estimation with a mixture of probabilistic principal component analyzers. IEICE Transactions on Information and Systems, E93-D(10):2846-2849, 2010.
206. M. Yamada, M. Sugiyama, G. Wichern, and J. Simm. Improving the accuracy of least-squares probabilistic classifiers. IEICE Transactions on Information and Systems, E94-D(6):1337-1340, 2011.
207. K. Yamagishi, N. Ito, H. Kosuga, N. Yasuda, K. Isogami, and N. Kozuno. A simplified measurement of farm worker's load using an accelerometer. Journal of the Japanese Society of Agricultural Technology Management, 9(2):127-132, 2002. (In Japanese.)
208. L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 17, pages 1601-1608. MIT Press, Cambridge, MA, 2005.
Index

Active learning, 183
  ensemble, 219
  with model selection, 215
  pool-based, 204, 234
  population-based, 188, 225
Active learning/model selection dilemma, 215
Active policy iteration, 231
Affine transform, 236
Age prediction, 152
Akaike information criterion
  importance-weighted, 47
Ball batting, 232
Basis
  polynomial, 10, 198, 210, 221, 236
  trigonometric polynomial, 10, 216
Bayes decision rule, 143
Bellman equation, 166
Bellman residual, 168
Bi-orthogonality, 104
Bias-variance decomposition, 186, 228
Bias-variance trade-off, 23
Boosting
  importance-weighted, 40
Cepstral mean normalization, 142
Chebyshev approximation, 35
Classification, 8, 35, 41
Common spatial patterns, 138
Conditional random field, 150
Conditional value-at-risk, 34, 40
Confidence, 38, 160
Consistency, 21
Covariate shift, 3, 9
Cross-validation, 74, 77, 80, 84
  importance-weighted, 64, 144, 156, 169, 174
  leave-one-out, 65, 88, 113
Curse of dimensionality, 11, 74
Design matrix, 27
Dimensionality reduction, 103
Dirac delta function, 166
Direct density-ratio estimation with dimensionality reduction, 103
Distribution
  conditional, 8
  test input, 9, 73
  training input, 7, 73
Domain adaptation, 149
Dual basis, 104
EM algorithm, 146
Empirical risk minimization, 21
  importance-weighted, 23
Error function, 129
Event-related desynchronization, 138
Expected shortfall, 35
Experimental design, 183
Extrapolation, 5
Feasible region, 25
Feature selection, 25
Fisher discriminant analysis, 108, 137
  importance-weighted, 36, 43, 138
  local, 109
Flattening parameter, 23, 47, 140
F-measure, 151
Generalization error, 9, 53, 183
  estimation, 47
  input-dependent analysis, 51, 54, 193, 206
  input-independent analysis, 51, 64, 191, 195, 205, 207
  single-trial, 51
Generalized eigenvalue problem, 37
Gradient method, 28, 32
  conjugate, 39
  stochastic, 29, 33
Gram-Schmidt orthonormalization, 112
Hazard ratio, 131
Heteroscedastic noise, 228
Huber regression
  importance-weighted, 31
Human activity recognition, 157
Idempotence, 105
Importance estimation
  kernel density estimation, 73
  kernel mean matching, 75
  Kullback-Leibler importance estimation procedure, 78, 144, 156
  least-squares importance fitting, 83
  logistic regression, 76
  unconstrained least-squares importance fitting, 87
Importance sampling, 22
Importance weight, 9, 22
  adaptive, 23, 169
  Akaike information criterion, 47
  boosting, 40
  classification, 35
  cross-validation, 64, 144, 156, 169, 174
  empirical risk minimization, 23
  estimation, 73, 103
  Fisher discriminant analysis, 138
  least squares, 26, 41, 66, 190
  logistic regression, 38, 144
  regression, 25
  regularized, 23, 153
  subspace information criterion, 54
  support vector machine, 39
Inverse Mills ratio, 131
Inverted pendulum, 176
Jacobian, 104
Kernel
  Gaussian, 12, 29, 73, 75, 84, 144, 153
  model, 12, 29, 143, 153
  sequence, 143
Kernel density estimation, 73
Kernel mean matching, 75
k-means clustering, 146
Kullback-Leibler divergence, 47, 78
Kullback-Leibler importance estimation procedure, 78, 144, 150, 156
Learning matrix, 27
Least squares, 38, 41, 55
  general, 133
  importance-weighted, 26, 41, 66, 205
Least-squares importance fitting, 83
Linear discriminant analysis, see Fisher discriminant analysis
Linear learning, 49, 59, 184
Linear program, 30
Logistic regression, 76
  importance weight, 38, 144
Loss
  0/1, 8, 35, 64
  absolute, 30
  classification, 38
  deadzone-linear, 33
  exponential, 40
  hinge, 39
  Huber, 31
  logistic, 39
  regression, 25
  squared, 8, 26, 38, 41, 49, 53, 55, 83, 183
Machine learning, 3
Mapping
  hetero-distributional, 106
  homo-distributional, 106
Markov decision problem, 165
Maximum likelihood, 38
Mel-frequency cepstrum coefficient, 142
Model
  additive, 11
  approximately correct, 14, 53, 192, 193, 195, 229
  correctly specified, 13, 189
  Gaussian mixture, 80, 142
  kernel, 12, 29, 143, 153
  linear-in-input, 10, 25, 37, 39, 41, 43
  linear-in-parameter, 10, 27, 32, 49, 54, 78, 80, 83, 184, 198, 221
  log-linear, 77, 80, 150
  logistic, 38, 143, 150
  misspecified, 14, 53, 190
  multiplicative, 11
  nonparametric, 13, 73
  parametric, 13
  probabilistic principal-component-analyzer mixture, 80, 123
Model drift, 216
Model error, 54, 57, 186, 188, 229
Model overfitting, 219
Model selection, 23, 47, 174
Natural language processing, 149
Newton's method, 39
Oblique projection, 104
Off-policy reinforcement learning, 169
Outliers, 30, 31
Parametric optimization, 85
Policy iteration, 166
Probit model, 129
Quadratic program, 32, 75, 84
Regression, 8, 25, 40, 66
Regularization parameter, 24
Regularization path tracking, 85
Regularizer
  absolute, 24
  squared, 24
Reinforcement learning, 4, 165, 225
Rejection sampling, 200
Resampling weight function, 205
Robustness parameter, 31
Sample
  test, 9, 73
  training, 7, 73
Sample reuse policy iteration, 175, 225
Sample selection bias, 125
Scatter matrix
  between-class, 36, 108
  local between-class, 110
  local within-class, 110
  within-class, 36, 108
Semisupervised learning, 4, 142
Sherman-Woodbury-Morrison formula, 88
Speaker identification, 142
Sphering, 105
Subspace
  hetero-distributional, 104
  homo-distributional, 104
Subspace information criterion
  importance-weighted, 54
Supervised learning, 3, 7
Support vector machine, 75, 142
  importance-weighted, 39
Support vector regression, 33
Test
  input distribution, 9, 73
  samples, 9, 73
Training
  input distribution, 7, 73
  samples, 7, 73
Unbiasedness, 66
  asymptotic, 50, 60, 66
Unconstrained least-squares importance fitting, 87, 113
Universal reproducing kernel Hilbert space, 75
Unsupervised learning, 3
Value function, 166
Value-at-risk, 34
Variable selection, 25
Wafer alignment, 234
Adaptive Computation and Machine Learning
Thomas Dietterich, Editor
Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, Associate Editors

Bioinformatics: The Machine Learning Approach, Pierre Baldi and Søren Brunak
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
Graphical Models for Machine Learning and Digital Communication, Brendan J. Frey
Learning in Graphical Models, Michael I. Jordan
Causation, Prediction, and Search, second edition, Peter Spirtes, Clark Glymour, and Richard Scheines
Principles of Data Mining, David Hand, Heikki Mannila, and Padhraic Smyth
Bioinformatics: The Machine Learning Approach, second edition, Pierre Baldi and Søren Brunak
Learning Kernel Classifiers: Theory and Algorithms, Ralf Herbrich
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Bernhard Schölkopf and Alexander J. Smola
Introduction to Machine Learning, Ethem Alpaydin
Gaussian Processes for Machine Learning, Carl Edward Rasmussen and Christopher K. I. Williams
Semi-Supervised Learning, Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, Eds.
The Minimum Description Length Principle, Peter D. Grünwald
Introduction to Statistical Relational Learning, Lise Getoor and Ben Taskar, Eds.
Probabilistic Graphical Models: Principles and Techniques, Daphne Koller and Nir Friedman
Introduction to Machine Learning, second edition, Ethem Alpaydin
Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation, Masashi Sugiyama and Motoaki Kawanabe
Boosting: Foundations and Algorithms, Robert E. Schapire and Yoav Freund