
CAMBRIDGE MONOGRAPHS ON
APPLIED AND COMPUTATIONAL
MATHEMATICS

Series Editors
M. J. ABLOWITZ, S. H. DAVIS, E. J. HINCH, A. ISERLES,
J. OCKENDON, P. J. OLVER

Learning Theory:
An Approximation Theory Viewpoint
The Cambridge Monographs on Applied and Computational Mathematics reflect the
crucial role of mathematical and computational techniques in contemporary science.
The series publishes expositions on all aspects of applicable and numerical mathematics,
with an emphasis on new developments in this fast-moving area of research.
State-of-the-art methods and algorithms as well as modern mathematical descriptions
of physical and mechanical ideas are presented in a manner suited to graduate research
students and professionals alike. Sound pedagogical presentation is a prerequisite. It is
intended that books in the series will serve to inform a new generation of researchers.
Within the series will be published titles in the Library of Computational
Mathematics, published under the auspices of the Foundations of Computational
Mathematics organisation. Learning Theory: An Approximation Theory Viewpoint
is the first title within this new subseries.
The Library of Computational Mathematics is edited by the following editorial board:
Felipe Cucker (Managing Editor) Ron Devore, Nick Higham, Arieh Iserles, David
Mumford, Allan Pinkus, Jim Renegar, Mike Shub.

Also in this series:


A Practical Guide to Pseudospectral Methods, Bengt Fornberg
Dynamical Systems and Numerical Analysis, A. M. Stuart and A. R. Humphries
Level Set Methods, J. A. Sethian
The Numerical Solution of Integral Equations of the Second Kind,
Kendall E. Atkinson
Orthogonal Rational Functions, Adhemar Bultheel, Pablo González-Vera,
Erik Hendriksen, and Olav Njåstad
Theory of Composites, Graeme W. Milton
Geometry and Topology for Mesh Generation, Herbert Edelsbrunner
Schwarz-Christoffel Mapping, Tobin A. Driscoll and Lloyd N. Trefethen
High-Order Methods for Incompressible Fluid Flow, M.O. Deville,
E.H. Mund and P. Fischer
Practical Extrapolation Methods, Avram Sidi
Generalized Riemann Problems in Computational Fluid Dynamics,
M. Ben-Artzi and J. Falcovitz
Radial Basis Functions, Martin Buhmann
Learning Theory: An Approximation
Theory Viewpoint

FELIPE CUCKER
City University of Hong Kong
DING-XUAN ZHOU
City University of Hong Kong
CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo

Cambridge University Press


The Edinburgh Building, Cambridge CB2 8RU, UK
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9780521865593

© Cambridge University Press 2007

This publication is in copyright. Subject to statutory exception and to the provision of


relevant collective licensing agreements, no reproduction of any part may take place
without the written permission of Cambridge University Press.

First published in print format 2007

ISBN-13 978-0-511-27551-7 eBook (NetLibrary)


ISBN-10 0-511-27551-X eBook (NetLibrary)

ISBN-13 978-0-521-86559-3 hardback


ISBN-10 0-521-86559-X hardback

Cambridge University Press has no responsibility for the persistence or accuracy of urls
for external or third-party internet websites referred to in this publication, and does not
guarantee that any content on such websites is, or will remain, accurate or appropriate.
Contents

Foreword ix
Preface xi

1 The framework of learning 1


1.1 Introduction 1
1.2 A formal setting 5
1.3 Hypothesis spaces and target functions 9
1.4 Sample, approximation, and generalization errors 11
1.5 The bias–variance problem 13
1.6 The remainder of this book 14
1.7 References and additional remarks 15
2 Basic hypothesis spaces 17
2.1 First examples of hypothesis space 17
2.2 Reminders I 18
2.3 Hypothesis spaces associated with Sobolev spaces 21
2.4 Reproducing Kernel Hilbert Spaces 22
2.5 Some Mercer kernels 24
2.6 Hypothesis spaces associated with an RKHS 31
2.7 Reminders II 33
2.8 On the computation of empirical target functions 34
2.9 References and additional remarks 35
3 Estimating the sample error 37
3.1 Exponential inequalities in probability 37
3.2 Uniform estimates on the defect 43
3.3 Estimating the sample error 44
3.4 Convex hypothesis spaces 46
3.5 References and additional remarks 49


4 Polynomial decay of the approximation error 54


4.1 Reminders III 55
4.2 Operators defined by a kernel 56
4.3 Mercer’s theorem 59
4.4 RKHSs revisited 61
4.5 Characterizing the approximation error in RKHSs 63
4.6 An example 68
4.7 References and additional remarks 69
5 Estimating covering numbers 72
5.1 Reminders IV 73
5.2 Covering numbers for Sobolev smooth kernels 76
5.3 Covering numbers for analytic kernels 83
5.4 Lower bounds for covering numbers 101
5.5 On the smoothness of box spline kernels 106
5.6 References and additional remarks 108
6 Logarithmic decay of the approximation error 109
6.1 Polynomial decay of the approximation error for C ∞
kernels 110
6.2 Measuring the regularity of the kernel 112
6.3 Estimating the approximation error in RKHSs 117
6.4 Proof of Theorem 6.1 125
6.5 References and additional remarks 125
7 On the bias–variance problem 127
7.1 A useful lemma 128
7.2 Proof of Theorem 7.1 129
7.3 A concrete example of bias–variance 132
7.4 References and additional remarks 133
8 Least squares regularization 134
8.1 Bounds for the regularized error 135
8.2 On the existence of target functions 139
8.3 A first estimate for the excess generalization error 140
8.4 Proof of Theorem 8.1 148
8.5 Reminders V 151
8.6 Compactness and regularization 151
8.7 References and additional remarks 155
9 Support vector machines for classification 157
9.1 Binary classifiers 159

9.2 Regularized classifiers 161


9.3 Optimal hyperplanes: the separable case 166
9.4 Support vector machines 169
9.5 Optimal hyperplanes: the nonseparable case 171
9.6 Error analysis for separable measures 173
9.7 Weakly separable measures 182
9.8 References and additional remarks 185
10 General regularized classifiers 187
10.1 Bounding the misclassification error in terms of the
generalization error 189
10.2 Projection and error decomposition 194
10.3 Bounds for the regularized error D(γ , φ) of fγ 196
10.4 Bounds for the sample error term involving fγ 198
10.5 Bounds for the sample error term involving f^φ_{z,γ} 201
10.6 Stronger error bounds 204
10.7 Improving learning rates by imposing noise conditions 210
10.8 References and additional remarks 211

References 214
Index 222
Foreword

This book by Felipe Cucker and Ding-Xuan Zhou provides solid mathematical
foundations and new insights into the subject called learning theory.
Some years ago, Felipe and I were trying to find something about brain
science and artificial intelligence starting from literature on neural nets. It was
in this setting that we encountered the beautiful ideas and fast algorithms of
learning theory. Eventually we were motivated to write on the mathematical
foundations of this new area of science.
I have found this arena, with its new challenges and growing number of
applications, to be exciting. For example, the unification of dynamical systems and
learning theory is a major problem. Another problem is to develop a comparative
study of the useful algorithms currently available and to give unity to these
algorithms. How can one talk about the “best algorithm” or find the most
appropriate algorithm for a particular task when there are so many desirable
features, with their associated trade-offs? How can one see the working of
aspects of the human brain and machine vision in the same framework?
I know both authors well. I visited Felipe in Barcelona more than 13 years
ago for several months, and when I took a position in Hong Kong in 1995, I
asked him to join me. There Lenore Blum, Mike Shub, Felipe, and I finished
a book on real computation and complexity. I returned to the USA in 2001,
but Felipe continues his job at the City University of Hong Kong. Despite the
distance we have continued to write papers together. I came to know Ding-Xuan
as a colleague in the math department at City University. We have written a
number of papers together on various aspects of learning theory. It gives me
great pleasure to continue to work with both mathematicians. I am proud of our
joint accomplishments.
I leave to the authors the task of describing the contents of their book. I will
give some personal perspective on and motivation for what they are doing.


Computational science demands an understanding of fast, robust algorithms.


The same applies to modern theories of artificial and human intelligence. Part of
this understanding is a complexity-theoretic analysis. Here I am not speaking of
a literal count of arithmetic operations (although that is a by-product), but rather
to the question: What sample size yields a given accuracy? Better yet, describe
the error of a computed hypothesis as a function of the number of examples,
the desired confidence, the complexity of the task to be learned, and variants
of the algorithm. If the answer is given in terms of a mathematical theorem, the
practitioner may not find the result useful. On the other hand, it is important
for workers in the field or leaders in laboratories to have some background
in theory, just as economists depend on knowledge of economic equilibrium
theory. Most important, however, is the role of mathematical foundations and
analysis of algorithms as a precursor to research into new algorithms, and into
old algorithms in new and different settings.
I have great confidence that many learning-theory scientists will profit from
this book. Moreover, scientists with some mathematical background will find
in this account a fine introduction to the subject of learning theory.
Stephen Smale
Chicago
Preface

Broadly speaking, the goal of (mainstream) learning theory is to approximate


a function (or some function features) from data samples, perhaps perturbed
by noise. To attain this goal, learning theory draws on a variety of diverse
subjects. It relies on statistics whose purpose is precisely to infer information
from random samples. It also relies on approximation theory, since our estimate
of the function must belong to a prespecified class, and therefore the ability
of this class to approximate the function accurately is of the essence. And
algorithmic considerations are critical because our estimate of the function is
the outcome of algorithmic procedures, and the efficiency of these procedures
is crucial in practice. Ideas from all these areas have blended together to form
a subject whose many successful applications have triggered its rapid growth
during the past two decades.
This book aims to give a general overview of the theoretical foundations of
learning theory. It is not the first to do so. Yet we wish to emphasize a viewpoint
that has drawn little attention in other expositions, namely, that of approximation
theory. This emphasis fulfills two purposes. First, we believe it provides a
balanced view of the subject. Second, we expect to attract mathematicians
working in related fields who find the problems raised in learning theory close
to their interests.
While writing this book, we faced a dilemma common to the writing of any
book in mathematics: to strike a balance between clarity and conciseness. In
particular, we faced the problem of finding a suitable degree of self-containment
for a book relying on a variety of subjects. Our solution to this problem consists
of a number of sections, all called “Reminders,” where several basic notions
and results are briefly reviewed using a unified notation.
We are indebted to several friends and colleagues who have helped us in
many ways. Steve Smale deserves a special mention. We first became interested
in learning theory as a result of his interest in the subject, and much of the


material in this book comes from or evolved from joint papers we wrote with
him. Qiang Wu, Yiming Ying, Fangyan Lu, Hongwei Sun, Di-Rong Chen,
Song Li, Luoqing Li, Bingzheng Li, Lizhong Peng, and Tiangang Lei regularly
attended our weekly seminars on learning theory at City University of Hong
Kong, where we exposed early drafts of the contents of this book. They, and
José Luis Balcázar, read preliminary versions and were very generous in their
feedback. We are indebted also to David Tranah and the staff of Cambridge
University Press for their patience and willingness to help. We have also been
supported by the University Grants Council of Hong Kong through the grants
CityU 1087/02P, 103303, and 103704.
1
The framework of learning

1.1 Introduction
We begin by describing some cases of learning, simplified to the extreme, to
convey an intuition of what learning is.

Case 1.1 Among the most used instances of learning (although not necessarily
with this name) is linear regression. This amounts to finding a straight line that
best approximates a functional relationship presumed to be implicit in a set of
data points in R2 , {(x1 , y1 ), (x2 , y2 ), . . . , (xm , ym )} (Figure 1.1). The yardstick
used to measure how good an approximation a given line Y = aX + b is, is
called least squares. The best line is the one that minimizes


Q(a, b) = ∑_{i=1}^m (yi − axi − b)².

Figure 1.1
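As a concrete illustration of Case 1.1, here is a minimal NumPy sketch (with made-up data points) that minimizes Q(a, b) by solving the corresponding linear least squares problem.

```python
import numpy as np

# illustrative data points (x_i, y_i); any sample in R^2 would do
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.9, 3.1, 4.8, 7.2, 9.1])

# minimizing Q(a, b) = sum_i (y_i - a*x_i - b)^2 is a linear least squares problem
A = np.column_stack([x, np.ones_like(x)])      # design matrix with rows (x_i, 1)
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(a, b)                                     # slope and intercept of the best line Y = aX + b
```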


Case 1.2 Case 1.1 readily extends to a classical situation in science, namely,
that of learning a physical law by curve fitting to data. Assume that the law at
hand, an unknown function f : R → R, has a specific form and that the space
of all functions with this form can be parameterized by N real numbers. For
instance, if f is assumed to be a polynomial of degree d , then N = d +1 and the
parameters are the unknown coefficients w0 , . . . , wd of f . In this case, finding
the best fit by the least squares method estimates the unknown f from a set
of pairs {(x1 , y1 ), . . . , (xm , ym )}. If the measurements generating this set were
exact, then yi would be equal to f (xi ). However, in general one expects the
values yi to be affected by noise. That is, yi = f (xi ) + ε, where ε is a random
variable (which may depend on xi ) with mean zero. One then computes the
vector of coefficients w such that the value


∑_{i=1}^m (fw(xi) − yi)²,   with fw(x) = ∑_{j=0}^d wj x^j

is minimized, where, typically, m > N . In general, the minimum value above is


not 0. To solve this minimization problem, one uses the least squares technique,
a method going back to Gauss and Legendre that is computationally efficient
and relies on numerical linear algebra.
Since the values yi are affected by noise, one might take as starting point,
instead of the unknown f , a family of probability measures εx on R varying
with x ∈ R. The only requirement on these measures is that for all x ∈ R, the
mean of εx is f (x). Then yi is randomly drawn from εxi . In some contexts the
xi , rather than being chosen, are also generated by a probability measure ρX
on R. Thus, the starting point could even be a single measure ρ on R × R –
capturing both the measure ρX and the measures εx for x ∈ R – from which the
pairs (xi , yi ) are randomly drawn.
A more general form of the functions in our approximating class could be
given by


fw(x) = ∑_{i=1}^N wi φi(x),

where the φi are the elements of a basis of a specific function space, not
necessarily of polynomials.
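A minimal sketch of this general least squares fit, assuming a made-up basis φ1, φ2, φ3 (here monomials) and synthetic noisy data:

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical basis phi_1, ..., phi_N of the approximating class (here N = 3 monomials)
phis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x**2]

# noisy measurements y_i = f(x_i) + eps_i of an unknown f
x = rng.uniform(-1.0, 1.0, size=50)
y = 0.5 - x + 2.0 * x**2 + 0.1 * rng.standard_normal(50)

# least squares over f_w(x) = sum_j w_j phi_j(x): minimize sum_i (f_w(x_i) - y_i)^2, with m > N
Phi = np.column_stack([phi(x) for phi in phis])
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)   # estimated coefficients, close to (0.5, -1.0, 2.0)
```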

Case 1.3 The training of neural networks is an extension of Case 1.2. Roughly
speaking, a neural network is a directed graph containing some input nodes,
some output nodes, and some intermediate nodes where certain functions are

computed. If X denotes the input space (whose elements are fed to the input
nodes) and Y the output space (of possible elements returned by the output
nodes), a neural network computes a function from X to Y . The literature on
neural networks shows a variety of choices for X and Y , which can be continuous
or discrete, as well as for the functions computed at the intermediate nodes. A
common feature of all neural nets, though, is the dependence of these functions
on a set of parameters, usually called weights, w = {wj }j∈J . This set determines
the function fw : X → Y computed by the network.
Neural networks are trained to learn functions. As in Case 1.2, there is a
target function f : X → Y , and the network is given a set of randomly chosen
pairs (x1 , y1 ), . . . , (xm , ym ) in X × Y . Then, training algorithms select a set of
weights w attempting to minimize some distance from fw to the target function
f :X → Y.

Case 1.4 A standard example of pattern recognition involves handwritten


characters. Consider the problem of classifying handwritten letters of the
English alphabet. Here, elements in our space X could be matrices with entries
in the interval [0, 1] – each entry representing a pixel in a certain gray scale of a
digitized photograph of the handwritten letter or some features extracted from
the letter. We may take Y to be
 

Y = { y ∈ R^26 | y = ∑_{i=1}^{26} λi ei such that ∑_{i=1}^{26} λi = 1 }.

Here ei is the ith coordinate vector in R26 , each coordinate corresponding to


a letter. If Δ ⊂ Y is the set of points y as above such that 0 ≤ λi ≤ 1, for
i = 1, . . . , 26, one can interpret a point in Δ as a probability measure on the set
{A, B, C, . . . , X, Y, Z}. The problem is to learn the ideal function f : X → Y that
associates, to a given handwritten letter x, a linear combination of the ei with
coefficients {Prob{x = A}, Prob{x = B}, . . . , Prob{x = Z}}. Unambiguous
letters are mapped into a coordinate vector, and in the (pure) classification
problem f takes values on these ei . “Learning f ” means finding a sufficiently
good approximation of f within a given prescribed class.
The approximation of f is constructed from a set of samples of handwritten
letters, each of them with a label in Y . The set {(x1 , y1 ), . . . , (xm , ym )} of these
m samples is randomly drawn from X × Y according to a measure ρ on X × Y .
This measure satisfies ρ(X × Δ) = 1. In addition, in practice, it is concentrated
around the set of pairs (x, y) with y = ei for some 1 ≤ i ≤ 26. That is, the
occurring elements x ∈ X are handwritten letters and not, say, a digitized image
of the Mona Lisa. The function f to be learned is the regression function fρ of ρ.

That is, fρ (x) is the average of the y values of {x} × Y (we are more precise
about ρ and the regression function in Section 1.2).
Case 1.5 A standard approach for approximating characteristic (or indicator)
functions of sets is known as PAC learning (from “probably approximately
correct”). Let T (the target concept) be a subset of Rn and ρX be a probability
measure on Rn that we assume is not known in advance. Intuitively, a set
S ⊂ R^n approximates T when the symmetric difference S Δ T = (S \ T) ∪
(T \ S) is small, that is, has a small measure. Note that if fS and fT denote the
characteristic functions of S and T, respectively, this measure, called the error
of S, is ∫_{R^n} |fS − fT| dρX. Note that since the functions take values in {0, 1}
only, this integral coincides with ∫_{R^n} (fS − fT)² dρX.
Let C be a class of subsets of Rn and assume that T ∈ C. One strategy for
constructing an approximation of T in C is the following. First, draw points
x1 , . . . , xm ∈ Rn according to ρX and label each of them with 1 or 0 according
to whether they belong to T . Second, compute any function fS : Rn → {0, 1},
fS ∈ C, that coincides with this labeling over {x1 , . . . , xm }. Such a function will
provide a good approximation S of T (small error with respect to ρX ) as long
as m is large enough and C is not too wild. Thus the measure ρX is used in both
capacities, governing the sample drawing and measuring the error set S Δ T.
A major goal in PAC learning is to estimate how large m needs to be to obtain
an ε approximation of T with probability at least 1 − δ as a function of ε and δ.
The situation described above is noise free since each randomly drawn point
xi ∈ Rn is correctly labeled. Extensions of PAC learning allowing for labeling
mistakes with small probability exist.
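The following toy sketch illustrates the PAC strategy just described, under the assumption (for illustration only) that C is the class of axis-aligned rectangles in R² and ρX is uniform on [0, 1]²; the error of S is estimated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical target concept T: an axis-aligned rectangle in R^2, with C = all such rectangles
T_lo, T_hi = np.array([0.2, 0.3]), np.array([0.7, 0.9])

m = 500
x = rng.uniform(0.0, 1.0, size=(m, 2))                 # points drawn according to rho_X (uniform here)
labels = np.all((x >= T_lo) & (x <= T_hi), axis=1)     # 1 if x_i belongs to T, 0 otherwise

# S: the tightest rectangle containing the positively labeled points (it agrees with the labeling)
S_lo, S_hi = x[labels].min(axis=0), x[labels].max(axis=0)

# Monte Carlo estimate of the error rho_X(S Delta T) using fresh points
t = rng.uniform(0.0, 1.0, size=(100_000, 2))
in_S = np.all((t >= S_lo) & (t <= S_hi), axis=1)
in_T = np.all((t >= T_lo) & (t <= T_hi), axis=1)
print(np.mean(in_S != in_T))                           # small when m is large
```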
Case 1.6 (Monte Carlo integration) An early instance of randomization in
algorithmics appeared in numerical integration. Let f : [0, 1]^n → R. One way
of approximating the integral ∫_{x∈[0,1]^n} f(x) dx consists of randomly drawing
points x1 , . . . , xm ∈ [0, 1]n and computing

Im(f) = (1/m) ∑_{i=1}^m f(xi).

Under mild conditions on the regularity of f, Im(f) → ∫ f with probability 1;
that is, for all ε > 0,

lim_{m→∞} Prob_{x1,...,xm} { | Im(f) − ∫_{x∈[0,1]^n} f(x) dx | > ε } = 0.

Again we find the theme of learning an object (here a single real number,
although defined in a nontrivial way through f ) from a sample. In this case

the measure governing the sample is known (the measure in [0, 1]n inherited
from the standard Lebesgue measure on Rn ), but the same idea can be used
for an unknown measure. If ρX is a probability measure on X ⊂ R^n, a
domain or manifold, Im(f) will approximate ∫_{x∈X} f(x) dρX for large m with
high probability as long as the points x1, . . . , xm are drawn from X according to
the measure ρX . Note that no noise is involved here. An extension of this idea
to include noise is, however, possible.
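A minimal Monte Carlo sketch, with a made-up integrand on [0, 1]² and the uniform measure:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # a regular integrand on [0, 1]^n, here with n = 2; its exact integral is 1/pi
    return np.sin(np.pi * x[:, 0]) * x[:, 1]

m = 100_000
x = rng.uniform(0.0, 1.0, size=(m, 2))   # x_1, ..., x_m drawn from the uniform measure on [0, 1]^2
I_m = f(x).mean()                        # I_m(f) = (1/m) * sum_i f(x_i)
print(I_m, 1 / np.pi)                    # the estimate is close to the exact value for large m
```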
A common characteristic of Cases 1.2–1.5 is the existence of both an
“unknown” function f : X → Y and a probability measure allowing one
to randomly draw points in X × Y . That measure can be on X (Case 1.5), on Y
varying with x ∈ X (Cases 1.2 and 1.3), or on the product X ×Y (Case 1.4). The
only requirement it satisfies is that, if for x ∈ X a point y ∈ Y can be randomly
drawn, then the expected value of y is f (x). That is, the noise is centered at zero.
Case 1.6 does not follow this pattern. However, we have included it since it is
a well-known algorithm and shares the flavor of learning an unknown object
from random data.
The development in this book, for reasons of unity and generality, is based on
a single measure on X × Y . However, one should keep in mind the distinction
between “inputs” x ∈ X and “outputs” y ∈ Y .

1.2 A formal setting


Since we want to study learning from random sampling, the primary object in
our development is a probability measure ρ governing the sampling that is not
known in advance.
Let X be a compact metric space (e.g., a domain or a manifold in Euclidean
space) and Y = Rk . For convenience we will take k = 1 for the time being. Let
ρ be a Borel probability measure on Z = X × Y whose regularity properties
will be assumed as required. In the following we try to utilize concepts formed
naturally and solely from X , Y , and ρ.
Throughout this book, if ξ is a random variable (i.e., a real-valued function
on a probability space Z), we will use E(ξ ) to denote the expected value (or
average, or mean) of ξ and σ 2 (ξ ) to denote its variance. Thus

E(ξ) = ∫_{z∈Z} ξ(z) dρ   and   σ²(ξ) = E((ξ − E(ξ))²) = E(ξ²) − (E(ξ))².

A central concept in the next few chapters is the generalization error (or
least squares error or, if there is no risk of ambiguity, simply error) of f , for

f : X → Y , defined by

E(f) = Eρ(f) = ∫_Z (f(x) − y)² dρ.

For each input x ∈ X and output y ∈ Y , (f (x) − y)2 is the error incurred through
the use of f as a model for the process producing y from x. This is a local error.
By integrating over X × Y (w.r.t. ρ, of course) we average out this local error
over all pairs (x, y). Hence the word “error” for E(f ).
The problem posed is: What is the f that minimizes the error E(f )? To answer
this question we note that the error E(f ) naturally decomposes as a sum. For
every x ∈ X , let ρ(y|x) be the conditional (w.r.t. x) probability measure on Y .
Let also ρX be the marginal probability measure of ρ on X , that is, the measure
on X defined by ρX (S) = ρ(π −1 (S)), where π : X × Y → X is the projection.
For every integrable function ϕ : X × Y → R a version of Fubini’s theorem
relates ρ, ρ(y|x), and ρX as follows:
  
∫_{X×Y} ϕ(x, y) dρ = ∫_X ( ∫_Y ϕ(x, y) dρ(y|x) ) dρX.

This “breaking” of ρ into the measures ρ(y|x) and ρX corresponds to looking


at Z as a product of an input domain X and an output set Y . In what follows,
unless otherwise specified, integrals are to be understood as being over ρ, ρ(y|x)
or ρX .
Define fρ : X → Y by

fρ(x) = ∫_Y y dρ(y|x).

The function fρ is called the regression function of ρ. For each x ∈ X , fρ (x) is


the average of the y coordinate of {x} × Y (in topological terms, the average of y
on the fiber of x). Regularity hypotheses on ρ will induce regularity properties
on fρ .
We will assume throughout this book that fρ is bounded .
Fix x ∈ X and consider the function from Y to R mapping y into (y − fρ (x)).
Since the expected value of this function is 0, its variance is

σ²(x) = ∫_Y (y − fρ(x))² dρ(y|x).

Now average over X , to obtain



σρ² = ∫_X σ²(x) dρX = E(fρ).

The number σρ2 is a measure of how well conditioned ρ is, analogous to the
notion of condition number in numerical linear algebra.

Remark 1.7
(i) It is important to note that whereas ρ and fρ are generally “unknown,” ρX
is known in some situations and can even be the Lebesgue measure on X
inherited from Euclidean space (as in Cases 1.2 and 1.6).
(ii) In the remainder of this book, if formulas do not make sense or ∞ appears,
then the assertions where these formulas occur should be considered
vacuous.

Proposition 1.8 For every f : X → Y ,



E(f) = ∫_X (f(x) − fρ(x))² dρX + σρ².

Proof From the definition of fρ(x), for each x ∈ X, ∫_Y (fρ(x) − y) = 0. Therefore,

E(f) = ∫_Z (f(x) − fρ(x) + fρ(x) − y)²
     = ∫_X (f(x) − fρ(x))² + ∫_X ∫_Y (fρ(x) − y)² + 2 ∫_X ∫_Y (f(x) − fρ(x))(fρ(x) − y)
     = ∫_X (f(x) − fρ(x))² + σρ² + 2 ∫_X (f(x) − fρ(x)) ∫_Y (fρ(x) − y)
     = ∫_X (f(x) − fρ(x))² + σρ².   □¹
The first term on the right-hand side of Proposition 1.8 provides an average
(over X ) of the error suffered from the use of f as a model for fρ . In addition,
since σρ2 is independent of f , Proposition 1.8 implies that fρ has the smallest
possible error among all functions f : X → Y . Thus σρ2 represents a lower
bound on the error E and it is due solely to our primary object, the measure ρ.
Thus, Proposition 1.8 supports the following statement:
The goal is to “learn” (i.e., to find a good approximation of) fρ from random
samples on Z.

1 Throughout this book, the square □ denotes the end of a proof or the fact that no proof is given.
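Proposition 1.8 can be checked numerically on a simulated measure ρ; in the sketch below the regression function, the noise level, and the model f are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

f_rho = lambda x: np.sin(2 * np.pi * x)        # regression function of the simulated rho
sigma = 0.3                                    # constant conditional noise, so sigma_rho^2 = sigma^2
f = lambda x: 2.0 * x - 1.0                    # an arbitrary model f : X -> Y

m = 1_000_000
x = rng.uniform(0.0, 1.0, size=m)              # rho_X: uniform on X = [0, 1]
y = f_rho(x) + sigma * rng.standard_normal(m)  # y drawn from rho(y|x), with mean f_rho(x)

lhs = np.mean((f(x) - y) ** 2)                          # Monte Carlo estimate of E(f)
rhs = np.mean((f(x) - f_rho(x)) ** 2) + sigma ** 2      # int_X (f - f_rho)^2 d rho_X + sigma_rho^2
print(lhs, rhs)                                         # the two agree up to sampling error
```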

We now consider sampling. Let

z ∈ Z m, z = ((x1 , y1 ), . . . , (xm , ym ))

be a sample in Z m , that is, m examples independently drawn according to ρ.


Here Z m denotes the m-fold Cartesian product of Z. We define the empirical
error of f (w.r.t. z) to be

Ez(f) = (1/m) ∑_{i=1}^m (f(xi) − yi)².

If ξ is a random variable on Z, we denote the empirical mean of ξ (w.r.t. z) by


Ez (ξ ). Thus,

Ez(ξ) = (1/m) ∑_{i=1}^m ξ(zi).

For any function f : X → Y we denote by fY the function

fY : X × Y → Y
(x, y) ↦ f(x) − y.

With these notations we may write E(f) = E(fY²) and Ez(f) = Ez(fY²). We have
already remarked that the expected value of (fρ)Y is 0; we now remark that its
variance is σρ².
Remark 1.9 Consider the PAC learning setting discussed in Case 1.5 where
X = Rn and T is a subset of Rn .2 The measure ρX described there can be
extended to a measure ρ on Z by defining, for A ⊂ Z,

ρ(A) = ρX ({x ∈ X | (x, fT (x)) ∈ A}),

where, we recall, fT is the characteristic function of the set T . The marginal


measure of ρ on X is our original ρX . In addition, σρ2 = 0, the error E specializes
to the error mentioned in Case 1.5, and the regression function fρ of ρ coincides
with fT except for a set of measure zero in X .

2 Note, in this case, that X is not compact. In fact, most of the results in this book do not require
compactness of X but only completeness and separability.

1.3 Hypothesis spaces and target functions


Learning processes do not take place in a vacuum. Some structure needs to
be present at the beginning of the process. In our formal development, we
assume that this structure takes the form of a class of functions (e.g., a space
of polynomials, of splines, etc.). The goal of the learning process is thus to find
the best approximation of fρ within this class.
Let C (X ) be the Banach space of continuous functions on X with the norm

‖f‖∞ = sup_{x∈X} |f(x)|.

We consider a subset H of C (X ) – in what follows called hypothesis space –


where algorithms will work to find, as well as is possible, the best approximation
for fρ . A main choice in this book is a compact, infinite-dimensional subset of
C (X ), but we will also consider closed balls in finite-dimensional subspaces
of C (X ) and whole linear spaces.
If fρ ∈ H, simplifications will occur, but in general we will not even assume
that fρ ∈ C (X ) and we will have to consider a target function fH in H. Define
fH to be any function minimizing the error E(f ) over f ∈ H, namely, any
optimizer of

min_{f∈H} ∫_Z (f(x) − y)² dρ.


Notice that since E(f) = ∫_X (f − fρ)² + σρ², fH is also an optimizer of

min_{f∈H} ∫_X (f − fρ)² dρX.

Let z ∈ Z m be a sample. We define the empirical target function fH,z = fz


to be a function minimizing the empirical error Ez (f ) over f ∈ H, that is, an
optimizer of

min_{f∈H} (1/m) ∑_{i=1}^m (f(xi) − yi)².    (1.1)

Note that although fz is not produced by an algorithm, it is close to algorithmic. The
statement of the minimization problem (1.1) depends on ρ only through its
dependence on z, but once z is given, so is (1.1), and its solution fz can be looked for
without further involvement of ρ. In contrast to fH, fz is “empirical” from its
dependence on the sample z. Note finally that E(fz) and Ez(f) are different objects.
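The sketch below computes an empirical target function fz for a toy finite hypothesis space (chosen only to keep the minimization of (1.1) trivial; in the text H is typically a compact subset of C(X)).

```python
import numpy as np

rng = np.random.default_rng(4)

# a toy finite hypothesis space H (in the text H is typically a compact subset of C(X))
H = [lambda x, c=c: c * x for c in np.linspace(-2.0, 2.0, 81)]

# a sample z = ((x_1, y_1), ..., (x_m, y_m)) drawn from a simulated rho
m = 200
x = rng.uniform(-1.0, 1.0, size=m)
y = 1.3 * x + 0.2 * rng.standard_normal(m)

def empirical_error(f):
    # E_z(f) = (1/m) * sum_i (f(x_i) - y_i)^2
    return np.mean((f(x) - y) ** 2)

f_z = min(H, key=empirical_error)   # an empirical target function: a minimizer of E_z over H
print(empirical_error(f_z))         # E_z(f_z), the minimum of (1.1) over this H
```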
We next prove that fH and fz exist under a mild condition on H.

Definition 1.10 Let f : X → Y and z ∈ Z m . The defect of f (w.r.t. z) is

Lz (f ) = Lρ,z (f ) = E(f ) − Ez (f ).

Notice that the theoretical error E(f ) cannot be measured directly, whereas
Ez (f ) can. A bound on Lz (f ) becomes useful since it allows one to bound
the actual error from an observed quantity. Such bounds are the object of
Theorems 3.8 and 3.10.
Let f1 , f2 ∈ C (X ). Toward the proof of the existence of fH and fz , we first
estimate the quantity

|Lz (f1 ) − Lz (f2 )|

linearly by ‖f1 − f2‖∞ for almost all z ∈ Z^m (a Lipschitz estimate). We recall


that a set U ⊆ Z is said to be full measure when Z \ U has measure zero.
Proposition 1.11 If, for j = 1, 2, |fj (x) − y| ≤ M on a full measure set U ⊆ Z
then, for all z ∈ U m ,

|Lz(f1) − Lz(f2)| ≤ 4M ‖f1 − f2‖∞.

Proof. First note that since

(f1 (x) − y)2 − (f2 (x) − y)2 = (f1 (x) + f2 (x) − 2y)(f1 (x) − f2 (x)),

we have
 
 
|E(f1) − E(f2)| = | ∫_Z (f1(x) + f2(x) − 2y)(f1(x) − f2(x)) dρ |
                ≤ ∫_Z |(f1(x) − y) + (f2(x) − y)| ‖f1 − f2‖∞ dρ
                ≤ 2M ‖f1 − f2‖∞.

Also, for all z ∈ U m , we have


|Ez(f1) − Ez(f2)| = | (1/m) ∑_{i=1}^m (f1(xi) + f2(xi) − 2yi)(f1(xi) − f2(xi)) |
                  ≤ (1/m) ∑_{i=1}^m |(f1(xi) − yi) + (f2(xi) − yi)| ‖f1 − f2‖∞
                  ≤ 2M ‖f1 − f2‖∞.

Thus

|Lz(f1) − Lz(f2)| = |E(f1) − Ez(f1) − E(f2) + Ez(f2)| ≤ 4M ‖f1 − f2‖∞.   □

Remark 1.12 Notice that for bounding |Ez(f1) − Ez(f2)| in this proof – in
contrast to the bound for |E(f1) − E(f2)| – the use of the ‖·‖∞ norm is crucial.
Nothing less will do.
Corollary 1.13 Let H ⊆ C (X ) and ρ be such that, for all f ∈ H, |f (x) − y| ≤
M almost everywhere. Then E, Ez : H → R are continuous.

Proof. The proof follows from the bounds |E(f1) − E(f2)| ≤ 2M ‖f1 − f2‖∞ and
|Ez(f1) − Ez(f2)| ≤ 2M ‖f1 − f2‖∞ shown in the proof of Proposition 1.11.  □

Corollary 1.14 Let H ⊆ C (X ) be compact and such that for all f ∈ H,


|f (x) − y| ≤ M almost everywhere. Then fH and fz exist.

Proof. The proof follows from the compactness of H and the continuity of
E, Ez : C (X ) → R. 

Remark 1.15
(i) The functions fH and fz are not necessarily unique. However, we see a
uniqueness result for fH in Section 3.4 when H is convex.
(ii) Note that the requirement of H to be compact is what allows Corollary 1.14
to be proved and therefore guarantees the existence of fH and fz .
Other consequences (e.g., the finiteness of covering numbers) follow in
subsequent chapters.

1.4 Sample, approximation, and generalization errors


For a given hypothesis space H, the error in H of a function f ∈ H is the
normalized error

EH (f ) = E(f ) − E(fH ).

Note that EH (f ) ≥ 0 for all f ∈ H and that EH (fH ) = 0. Also note that E(fH )
and EH (f ) are different objects.

Continuing the discussion after Proposition 1.8, it follows from our


definitions and proposition that

∫_X (fz − fρ)² dρX + σρ² = E(fz) = EH(fz) + E(fH).    (1.2)

The quantities in (1.2) are the main characters in this book. We have already
noted that σρ2 is a lower bound on the error E that is solely due to the measure ρ.
The generalization error E(fz ) of fz depends on ρ, H, the sample z, and the
scheme (1.1) defining fz. The squared distance ∫_X (fz − fρ)² dρX is the excess
generalization error of fz . A goal of this book is to show that under some
hypotheses on ρ and H, this excess generalization error becomes arbitrarily
small with high probability as the sample size m tends to infinity.
Now consider the sum EH (fz ) + E(fH ). The second term in this sum
depends on the choice of H but is independent of sampling. We will call it
the approximation error. Note that this approximation error is the sum

A(H) + σρ2 ,

where A(H) = ∫_X (fH − fρ)² dρX. Therefore, σρ² is a lower bound for the
approximation error.
The first term, EH (fz ), is called the sample error or estimation error.
Equation (1.2) thus reduces our goal above – to estimate ∫_X (fz − fρ)² or,
equivalently, E(fz ) – into two different problems corresponding to finding
estimates for the sample and approximation errors. The way these problems
depend on the measure ρ calls for different methods and assumptions in their
analysis.
The second problem (to estimate A(H)) is independent of the sample z.
But it depends heavily on the regression function fρ . The worse behaved fρ is
(e.g., the more it oscillates), the more difficult it will be to approximate fρ well
with functions in H. Consequently, all bounds for A(H) will depend on some
parameter measuring the behavior of fρ .
The first problem (to estimate the sample error EH (fz )) is posed on the space
H, and its dependence on ρ is through the sample z. In contrast with the
approximation error, it is essentially independent of fρ . Consequently, bounds
for EH (fz ) will not depend on properties of fρ . However, due to their dependence
on the random sample z, they will hold with only a certain confidence. That is,
the bound will depend on a parameter δ and will hold with a confidence of at
least 1 − δ.
This discussion extends to some algorithmic issues. Although dependence
on the behavior of fρ seems unavoidable in the estimates of the approximation

error (and hence on the generalization error E(fz ) of fz ), such a dependence


is undesirable in the design of the algorithmic procedures leading to fz (e.g.,
the selection of H). Ultimately, the goal is to be able, given a sample z, to
select a hypothesis space H and compute the resulting fz without assumptions
on fρ and then to exhibit bounds on ∫_X (fz − fρ)² dρX that are, with high
probability, reasonably good in the measure that fρ is well behaved. Yet, in
many situations, the choice of some parameter related to the selection of H is
performed with methods that, although satisfactory in practice, lack a proper
theoretical justification. For these methods, our best theoretical results rely on
information about fρ .

1.5 The bias–variance problem


For fixed H the sample error decreases when the number m of examples
increases (as we see in Theorem 3.14). Fix m instead. Then, typically, the
approximation error will decrease when enlarging H, but the sample error will
increase. The bias–variance problem consists of choosing the size of H when
m is fixed so that the error E(fz ) is minimized with high probability. Roughly
speaking, the “bias” of a solution f coincides with the approximation error, and
its “variance” with the sample error. This is common terminology:
A model which is too simple, or too inflexible, will have a large bias, while one
which has too much flexibility in relation to the particular data set will have a
large variance. Bias and variance are complementary quantities, and the best
generalization [i.e. the smallest error] is obtained when we have the best
compromise between the conflicting requirements of small bias and small
variance.3

Thus, a too small space H will yield a large bias, whereas one that is too large
will yield a large variance. Several parameters (radius of balls, dimension, etc.)
determine the “size” of H, and different instances of the bias–variance problem
are obtained by fixing all of them except one and minimizing the error over this
nonfixed parameter.
Failing to find a good compromise between bias and variance leads to what
is called underfitting (large bias) or overfitting (large variance). As an example,
consider Case 1.2 and the curve C in Figure 1.2(a) with the set of sample
points and assume we want to approximate that curve with a polynomial of
degree d (the parameter d determines in our case the dimension of H). If d is
too small, say d = 2, we obtain a curve as in Figure 1.2(b) which necessarily
“underfits” the data points. If d is too large, we can tightly fit the data points
3 [18] p. 332.

Figure 1.2

but this “overfitting” yields a curve as in Figure 1.2(c). In terms of the error
decomposition (1.2) this overfitting corresponds to a small approximation error
but large sample error.
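The under/overfitting behavior just described can be reproduced with the following sketch, in which the curve, the noise level, and the degrees tried are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

f_rho = lambda x: np.sin(2 * np.pi * x)             # stands in for the curve C of Figure 1.2(a)
m = 20
x = rng.uniform(0.0, 1.0, size=m)
y = f_rho(x) + 0.2 * rng.standard_normal(m)         # the sample points

x_test = rng.uniform(0.0, 1.0, size=10_000)         # fresh points to estimate int (f_z - f_rho)^2 d rho_X
for d in (2, 5, 15):                                # degree too small, moderate, and too large
    w = np.polyfit(x, y, d)                         # least squares fit by a polynomial of degree d
    emp = np.mean((np.polyval(w, x) - y) ** 2)      # empirical error E_z(f_z)
    exc = np.mean((np.polyval(w, x_test) - f_rho(x_test)) ** 2)  # estimated excess generalization error
    print(d, emp, exc)   # d = 2 underfits (both errors large); d = 15 overfits (tiny empirical, large excess)
```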
As another example of overfitting, consider the PAC learning situation in
Case 1.5 with C consisting of all subsets of Rn . Consider also a sample
{(x1 , 1), . . . , (xk , 1), (xk+1 , 0), . . . , (xm , 0)}. The characteristic function of the
set S = {x1 , . . . , xk } has zero sample error, but its approximation error is the
measure (w.r.t. ρX) of the set T Δ {x1, . . . , xk}, which equals the measure of T as
long as ρX has no points with positive probability mass.

1.6 The remainder of this book


In Chapter 2 we describe some common choices for the hypothesis space H. One
of them, derived from the use of reproducing kernel Hilbert spaces (RKHSs),
will be systematically used in the remainder of the book.
The focus of Chapter 3 is on estimating the sample error. We want to
estimate how close one may expect fz and fH to be, depending on the size of
the sample and with a given confidence. Or, equivalently,
How many examples do we need to draw to assert, with a confidence greater than
1 − δ, that ∫_X (fz − fH)² dρX is not more than ε?

Our main result in Chapter 3, Theorem 3.3, gives an answer.


Chapter 4 characterizes the measures ρ and some families {HR }R>0 of
hypothesis spaces for which A(HR ) tends to zero with polynomial decay; that
is, A(HR ) = O(R−θ ) for some θ > 0. These families of hypothesis spaces are
defined using RKHSs. Consequently, the chapter opens with several results on
these spaces, including a proof of Mercer’s theorem.
The bounds for the sample error in Chapter 3 are in terms of a specific
measure of the size of the hypothesis space H, namely, its covering numbers.
This measure is not explicit for all of the common choices of H. In Chapter 5

we give bounds for these covering numbers for most of the spaces H introduced
in Chapter 2. These bounds are in terms of explicit geometric parameters of H
(e.g., dimension, diameter, smoothness, etc.).
In Chapter 6 we continue along the lines of Chapter 4. We first show some
conditions under which the approximation error can decay as O(R−θ ) only if
fρ is C ∞ . Then we show a polylogarithmic decay in the approximation error
of hypothesis spaces defined via RKHSs for some common instances of these
spaces.
Chapter 7 gives a solution to the bias–variance problem for a particular family
of hypothesis spaces (and under some assumptions on fρ ).
Chapter 8 describes a new setting, regularization, in which the hypothesis
space is no longer required to be compact and argues some equivalence
with the setting described above. In this new setting the computation of the
empirical target function is algorithmically very simple. The notion of excess
generalization error has a natural version, and a bound for it is exhibited.
A special case of learning is that in which Y is finite and, most particularly,
when it has two elements (cf. Case 1.5). Learning problems of this kind are
called classification problems as opposed to the ones with Y = R, which are
called regression problems. For classification problems it is possible to take
advantage of the special structure of Y to devise learning schemes that perform
better than simply specializing the schemes used for regression problems.
One such scheme, known as the support vector machine, is described, and
its error analyzed, in Chapter 9. Chapter 10 gives a detailed analysis for natural
extensions of the support vector machine.
We have begun Chapters 3–10 with brief introductions. Our intention is that
maybe after reading Chapter 2, a reader can form an accurate idea of the contents
of this book simply by reading these introductions.

1.7 References and additional remarks


The setting described in Section 1.2 was first considered in learning theory by
V. Vapnik and his collaborators. An account of Vapnik’s work can be found
in [134].
For the bias–variance problem in the context of learning theory see [18, 54]
and the references therein.
There is a vast literature in learning theory dealing with the sample error. A
pair of representative books for this topic are [6, 44].
Probably the first studies of the two terms in the error decomposition (1.2)
were [96] and [36].

In this book we will not go deeper into the details of PAC learning. A standard
reference for this is [67].
Other (but not all) books dealing with diverse mathematical aspects of
learning theory are [7, 29, 37, 57, 59, 61, 92, 95, 107, 111, 124, 125, 132,
133, 136, 137]. In addition, a number of scientific journals publish papers on
learning theory. Two devoted wholly to the theory as developed in this book
are Journal of Machine Learning Research and Machine Learning.
Finally, we want to mention that the exposition and structure of this chapter
largely follow [39].
2
Basic hypothesis spaces

In this chapter we describe several examples of hypothesis spaces. One of these


examples (or, rather, a family of them) – a subset H of an RKHS – will be
systematically used in the remainder of this book.

2.1 First examples of hypothesis space


Example 2.1 (Homogeneous polynomials) Let Hd = Hd (Rn+1 ) be the
linear space of homogeneous polynomials of degree d in x0 , x1 , . . . , xn . Let
X = S(Rn+1 ), the n-dimensional unit sphere. An element in Hd defines a
function from X to R and can be written as

f = ∑_{|α|=d} wα x^α.

Here, α = (α0 , . . . , αn ) ∈ Nn+1 is a “multi-index,” |α| = α0 + · · · + αn ,


and x^α = x0^{α0} · · · xn^{αn}. Thus, Hd is a finite-dimensional vector space. We may
consider H = {f ∈ Hd | ‖f‖∞ ≤ 1} to be a hypothesis space. Because of the
scaling f(λx) = λ^d f(x), taking the bound ‖f‖∞ ≤ 1 causes no loss.

Example 2.2 (Finite-dimensional function spaces) This generalizes the


previous example. Let φ1 , . . . , φN ∈ C (X ) and E be the linear subspace of
C(X) spanned by {φ1, . . . , φN}. Here we may take H = {f ∈ E | ‖f‖∞ ≤ R}
for some R > 0.
The next two examples deal with infinite-dimensional linear spaces. To
describe them better, we first remind the reader of a few basic notions and
notations.


2.2 Reminders I
(I) We first recall some commonly used spaces of functions.
We have already defined C (X ). Recall that this is the Banach space of
bounded continuous functions on X with the norm

‖f‖_{C(X)} = ‖f‖∞ = sup_{x∈X} |f(x)|.

When X ⊆ Rn , for s ∈ N, we denote by C s (X ) the space of functions on X that


are s times differentiable and whose sth partial derivatives Dα f are continuous.
This is also a Banach space with the norm

‖f‖_{C^s(X)} = max_{|α|≤s} ‖D^α f‖∞.

Here, α ∈ N^n and D^α f is the partial derivative ∂^{|α|} f / (∂x1^{α1} · · · ∂xn^{αn}).


The space C ∞ (X ) is the intersection of the spaces C s (X ) for s ∈ N. We do
not define any norm on C ∞ (X ) and consider it only as a linear space.
Let ν be a Borel measure on X , p ∈ [1, ∞), and L be the linear space of
functions f : X → Y such that the integral

∫_X |f(x)|^p dν

exists. The space L^p_ν(X) is defined to be the quotient of L under the equivalence
relation ≡ given by

f ≡ g ⟺ ∫_X |f(x) − g(x)|^p dν = 0.

This is a Banach space with the norm


‖f‖_{L^p_ν(X)} = ( ∫_X |f(x)|^p dν )^{1/p}.

If p = 2, L²_ν(X) is actually a Hilbert space with the scalar product

⟨f, g⟩_{L²_ν(X)} = ∫_X f(x) g(x) dν.

When there is no risk of confusion we write ⟨ , ⟩_ν and ‖·‖_ν instead of ⟨ , ⟩_{L²_ν(X)}
and ‖·‖_{L²_ν(X)}. In addition, when ν = ρX we simply write ‖·‖_ρ instead of the
more cumbersome ‖·‖_{ρX}.

Note that elements in L^p_ν(X) are classes of functions. In general, however,
one abuses language and refers to them as functions on X. For instance, we say
that f ∈ L^p_ν(X) is continuous when there exists a continuous function in the
class of f .
The support of a measure ν on X is the smallest closed subset Xν of X such
that ν(X \ Xν ) = 0.
A function f : X → R is measurable when, for all α ∈ R, the set {x ∈ X |
f (x) ≤ α} is a Borel subset of X .
The space L^∞_ν(X) is defined to be the set of all measurable functions on X
such that

‖f‖_{L^∞_ν(X)} := sup_{x∈Xν} |f(x)| < ∞.

Each element in L^∞_ν(X) is a class of functions that are identical on Xν.


A measure ν is finite when ν(X) < ∞. Also, we say that ν is nondegenerate
when, for each nonempty open subset U ⊆ X , ν(U ) > 0. Note that ν is
nondegenerate if and only if Xν = X .
If ν is finite and nondegenerate, then we have a well-defined injection
C(X) → L^p_ν(X), for all 1 ≤ p ≤ ∞.

When ν is the Lebesgue measure, we sometimes denote L^p_ν(X) by L^p(X)
or, if there is no risk of confusion, simply by L^p.
(II) We next briefly recall some basics about the Fourier transform. The
Fourier transform F : L 1 (Rn ) → L 1 (Rn ) is defined by

F(f)(w) = ∫_{R^n} e^{−iw·x} f(x) dx.

The function F( f ) is well defined and continuous on Rn (note, however,


that F( f ) is a complex-valued function). One major property of the Fourier
transform in L 1 (Rn ) is the convolution property

F( f ∗ g) = F( f )F(g),

where f ∗ g denotes the convolution of f and g defined by



(f ∗ g)(x) = ∫_{R^n} f(x − u) g(u) du.

The extension of the Fourier transform to L 2 (Rn ) requires some caution.


Let C0 (Rn ) denote the space of continuous functions on Rn with compact

support. Clearly, C0 (Rn ) ⊂ L 1 (Rn ) ∩ L 2 (Rn ). In addition, C0 (Rn ) is dense


in L 2 (Rn ). Thus, for any f ∈ L 2 (Rn ), there exists {φk }k≥1 ⊂ C0 (Rn ) such
that ‖φk − f‖ → 0 when k → ∞. One can prove that for any such sequence
(φk ), the sequence (F(φk )) converges to the same element in L 2 (Rn ). We
denote this element by F( f ) and we say that it is the Fourier transform of f .
The notation f̂ instead of F(f) is often used.
The following result summarizes the main properties of F : L 2 (Rn ) →
L 2 (Rn ).

Theorem 2.3 (Plancherel’s theorem) For f ∈ L 2 (Rn )



(i) F(f)(w) = lim_{k→∞} ∫_{[−k,k]^n} e^{−iw·x} f(x) dx, where the convergence is for the
norm in L²(R^n).
(ii) ‖F(f)‖ = (2π)^{n/2} ‖f‖.
(iii) f(x) = lim_{k→∞} (1/(2π)^n) ∫_{[−k,k]^n} e^{iw·x} F(f)(w) dw, where the convergence is
for the norm in L²(R^n). If f ∈ L¹(R^n) ∩ L²(R^n), then the convergence
holds almost everywhere.
(iv) The map F : L²(R^n) → L²(R^n) is an isomorphism of Hilbert spaces.
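A rough numerical check of item (ii) in dimension n = 1, using a Gaussian and plain Riemann sums on a truncated grid (grid and truncation chosen only for illustration):

```python
import numpy as np

# f is a Gaussian, for which everything decays fast enough that a truncated grid suffices
x = np.linspace(-15.0, 15.0, 3001)
dx = x[1] - x[0]
f = np.exp(-x**2 / 2)

# direct Riemann-sum approximation of F(f)(w) = int e^{-i w x} f(x) dx  (n = 1)
w = np.linspace(-15.0, 15.0, 3001)
dw = w[1] - w[0]
F = np.array([np.sum(np.exp(-1j * wi * x) * f) * dx for wi in w])

norm_f_sq = np.sum(np.abs(f) ** 2) * dx   # ||f||^2 in L^2(R)
norm_F_sq = np.sum(np.abs(F) ** 2) * dw   # ||F(f)||^2
print(norm_F_sq / norm_f_sq, 2 * np.pi)   # the ratio is (2*pi)^n with n = 1, as in item (ii)
```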

(III) Our third reminder is about compactness.


It is well known that a subset of Rn is compact if and only if it is closed and
bounded. This is not true for subsets of C (X ). Yet a characterization of compact
subsets of C (X ) in similar terms is still possible.
A subset S of C (X ) is said to be equicontinuous at x ∈ X when for every
ε > 0 there exists a neighborhood V of x such that for all y ∈ V and f ∈ S,
| f (x) − f (y)| < ε. The set S is said to be equicontinuous when it is so at every
x in X .

Theorem 2.4 (Arzelá–Ascoli theorem) Let X be compact and S be a subset of


C (X ). Then S is a compact subset of C (X ) if and only if S is closed, bounded,
and equicontinuous. 

The fact that every closed ball in Rn is compact is not true in Hilbert space.
However, we will use the fact that closed balls in a Hilbert space H are weakly
compact. That is, every sequence {fn }n∈N in a closed ball B in H has a weakly
convergent subsequence {fnk }k∈N , or, in other words, there is some f ∈ B
such that
lim_{k→∞} ⟨f_{nk}, g⟩ = ⟨f, g⟩,   ∀ g ∈ H.

(IV) We close this section with a discussion of completely monotonic


functions. This discussion is on a less general topic than the preceding contents
of these reminders.
A function f : [0, ∞) → R is completely monotonic if it is continuous on
[0, ∞), C ∞ on (0, ∞), and, for all r > 0 and k ≥ 0, (−1)k f (k) (r) ≥ 0.
We will use the following characterization of completely monotonic
functions.
Proposition 2.5 A function f : [0, ∞) → R is completely monotonic if and only
if, for all t ∈ (0, ∞),
f(t) = ∫_0^∞ e^{−tσ} dν(σ),

where ν is a finite Borel measure on [0, ∞). 

2.3 Hypothesis spaces associated with Sobolev spaces


Definition 2.6 Let J : E → F be a linear map between the Banach spaces E
and F. We say that J is bounded when there exists b ∈ R such that for all x ∈ E
with ‖x‖ = 1, ‖J(x)‖ ≤ b. The operator norm of J is

‖J‖ = sup_{‖x‖=1} ‖J(x)‖.

If J is not bounded, then we write ‖J‖ = ∞. We say that J is compact when
the closure of J(B) is compact for any bounded set B ⊂ E.

Example 2.7 (Sobolev spaces) Let X be a domain in Rn with smooth boundary.


For every s ∈ N we can define an inner product in C ∞ (X ) by
 
⟨f, g⟩_s = ∑_{|α|≤s} ∫_X D^α f D^α g.

Here we are integrating with respect to the Lebesgue measure µ on X


inherited from Euclidean space. We will denote by ‖·‖_s the norm induced
by ⟨ , ⟩_s. Notice that when s = 0, the inner product above coincides with
that of L²(X). That is, ‖·‖_0 = ‖·‖. We define the Sobolev space H^s(X) to
be the completion of C^∞(X) with respect to the norm ‖·‖_s. The Sobolev
embedding theorem asserts that for all r ∈ N and all s > n/2 + r, the
inclusion

Js : H s (X ) → C r (X )

is well defined and bounded. In particular, for all s > n/2, the inclusion

Js : H s (X ) → C (X )

is well defined and bounded. From Rellich’s theorem it follows that if X is


compact, this last embedding is compact as well. Thus, if BR denotes the closed
ball of radius R in H s (X ) we may take HR,s = H = Js (BR ).

2.4 Reproducing Kernel Hilbert Spaces


Definition 2.8 Let X be a metric space. We say that K : X × X → R is
symmetric when K(x, t) = K(t, x) for all x, t ∈ X and that it is positive
semidefinite when for all finite sets x = {x1 , . . . , xk } ⊂ X the k × k matrix
K[x] whose (i, j) entry is K(xi , xj ) is positive semidefinite. We say that K is
a Mercer kernel if it is continuous, symmetric, and positive semidefinite. The
matrix K[x] above is called the Gramian of K at x.
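A quick numerical way to inspect the Gramian of a candidate kernel at a finite set of points (here the kernel K(x, y) = 1 + xy, which reappears in Example 2.12 below):

```python
import numpy as np

# a candidate kernel on X subset of R; this one (K(x, t) = 1 + x*t) is used in Example 2.12 below
K = lambda x, t: 1.0 + x * t

x = np.array([-1.0, -0.3, 0.0, 0.5, 2.0])      # a finite set x = {x_1, ..., x_k} in X
Gram = K(x[:, None], x[None, :])               # the Gramian K[x], with (i, j) entry K(x_i, x_j)

print(np.allclose(Gram, Gram.T))               # symmetry of the Gramian
print(np.linalg.eigvalsh(Gram).min() >= -1e-12)  # positive semidefiniteness, up to round-off
```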
For the remainder of this section we fix a compact metric space X and a
Mercer kernel K:X × X → R. Note that the positive semidefiniteness implies
that K(x, x) ≥ 0 for each x ∈ X . We define

CK := sup_{x∈X} √K(x, x).

Then

CK = sup_{x,t∈X} √|K(x, t)|

since, by the positive semidefiniteness of the matrix K[{x, t}], for all x, t ∈ X ,

(K(x, t))2 ≤ K(x, x)K(t, t).

For x ∈ X , we denote by Kx the function

Kx : X → R
t ↦ K(x, t).

The main result of this section is given in the following theorem.



Theorem 2.9 There exists a unique Hilbert space (HK, ⟨ , ⟩_{HK}) of functions
on X satisfying the following conditions:
(i) for all x ∈ X, Kx ∈ HK,
(ii) the span of the set {Kx | x ∈ X} is dense in HK, and
(iii) for all f ∈ HK and x ∈ X, f(x) = ⟨Kx, f⟩_{HK}.
Moreover, HK consists of continuous functions and the inclusion IK : HK →
C(X) is bounded with ‖IK‖ ≤ CK.
Proof. Let H0 be the span of the set {Kx | x ∈ X }. We define an inner product
in H0 as

 
⟨f, g⟩ = ∑_{1≤i≤s, 1≤j≤r} αi βj K(xi, tj),   for f = ∑_{i=1}^s αi K_{xi},  g = ∑_{j=1}^r βj K_{tj}.

The conditions for the inner product can be easily checked. For example, if
⟨f, f⟩ = 0, then for each t ∈ X the positive semidefiniteness of the Gramian of
K at the subset {xi}_{i=1}^s ∪ {t} tells us that for each ε ∈ R

∑_{i,j=1}^s αi K(xi, xj) αj + 2ε ∑_{i=1}^s αi K(xi, t) + ε² K(t, t) ≥ 0.

However, ∑_{i,j=1}^s αi K(xi, xj) αj = ⟨f, f⟩ = 0. By letting ε be arbitrarily small,
we see that f(t) = ∑_{i=1}^s αi K(xi, t) = 0. This is true for each t ∈ X; hence f is
the zero function.
Let HK be the completion of H0 with the associated norm. It is easy to check
that HK satisfies the three conditions in the statement. We need only prove that
it is unique. So, assume H is another Hilbert space of functions on X satisfying
the conditions noted. We want to show that

H = HK and ⟨ , ⟩_H = ⟨ , ⟩_{HK}.    (2.1)

We first observe that H0 ⊂ H. Also, for any x, t ∈ X, ⟨Kx, Kt⟩_H = K(x, t) =
⟨Kx, Kt⟩_{HK}. By linearity, for every f, g ∈ H0, ⟨f, g⟩_H = ⟨f, g⟩_{HK}. Since both
H and HK are completions of H0, (2.1) follows from the uniqueness of the
completion.
To see the remaining assertion consider f ∈ HK and x ∈ X . Then

|f(x)| = |⟨Kx, f⟩_{HK}| ≤ ‖f‖_{HK} ‖Kx‖_{HK} = ‖f‖_{HK} √K(x, x).



This implies that ‖f‖∞ ≤ CK ‖f‖_{HK} and, thus, ‖IK‖ ≤ CK. Therefore,
convergence in ‖·‖_{HK} implies convergence in ‖·‖∞, and this shows that f
is continuous since f is the limit of elements in H0 that are continuous.  □

In what follows, to reduce the amount of notation, we will write ⟨ , ⟩_K instead
of ⟨ , ⟩_{HK} and ‖·‖_K instead of ‖·‖_{HK}.
Definition 2.10 The Hilbert space HK in Theorem 2.9 is said to be a
Reproducing Kernel Hilbert Space (RKHS). Property (iii) in Theorem 2.9 is
referred to as the reproducing property.
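The inner product on H0 makes ‖f‖_K and the bound |f(x)| ≤ ‖f‖_K √K(x, x) computable for f in the span of the Kx; a small sketch with the kernel K(x, y) = 1 + xy of Example 2.12 below:

```python
import numpy as np

K = lambda x, t: 1.0 + x * t                 # the Mercer kernel used in Example 2.12 below

# f = sum_i alpha_i K_{x_i}, an element of the dense subspace H_0 of H_K
x = np.array([-0.5, 0.2, 1.0])
alpha = np.array([1.0, -2.0, 0.5])
Gram = K(x[:, None], x[None, :])             # Gramian of K at {x_1, x_2, x_3}

norm_sq = alpha @ Gram @ alpha               # ||f||_K^2 = <f, f>_K, from the inner product on H_0
f = lambda t: np.sum(alpha * K(x, t))        # pointwise values f(t) = sum_i alpha_i K(x_i, t)

# the reproducing property gives |f(t)| = |<K_t, f>_K| <= ||f||_K * sqrt(K(t, t))
t = 0.7
print(abs(f(t)), np.sqrt(norm_sq) * np.sqrt(K(t, t)))   # the bound holds (here it is nearly tight)
```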

2.5 Some Mercer kernels


In this section we discuss some families of Mercer kernels on subsets of Rn . In
most cases, checking the symmetry and continuity of a given kernel K will be
straightforward. Checking that K is positive semidefinite will be more involved.
The first family of Mercer kernels we look at is that of dot product kernels.
Let X = {x ∈ R^n : ‖x‖ ≤ R} be a ball of R^n with radius R > 0. A dot product
kernel is a function K : X × X → R given by

K(x, y) = ∑_{d=0}^∞ a_d (x · y)^d,

where a_d ≥ 0 and ∑_{d=0}^∞ a_d R^{2d} < ∞.
Proposition 2.11 Dot product kernels are Mercer kernels on X .
Proof. The kernel K is obviously symmetric and continuous on X × X .
To check its positive semidefiniteness, recall that the multinomial coefficients
associated with the pairs (d , α)
C_α^d = d! / (α1! · · · αn!),   α ∈ Z_+^n, |α| = d,

satisfy

(x · y)^d = ∑_{|α|=d} C_α^d x^α y^α,   ∀ x, y ∈ R^n.

Let {x1 , . . . , xk } ⊂ X . Then, for all c1 , . . . , ck ∈ R,



∑_{i,j=1}^k ci cj K(xi, xj) = ∑_{d=0}^∞ a_d ∑_{|α|=d} C_α^d ( ∑_{i=1}^k ci xi^α )² ≥ 0.

Therefore, K is a Mercer kernel.  □



An explicit example is the linear polynomial kernel.

Example 2.12 Let X be a subset of R containing at least two points, and K


the Mercer kernel on X given by K(x, y) = 1 + x · y. Then HK is the space of
linear functions and {1, x} forms an orthonormal basis of HK .

Proof. Note that for a ∈ X, Ka is the function 1 + ax of the variable x in
X ⊆ R. Take a ≠ b ∈ X. By the definition of the inner product in HK,

‖Ka − Kb‖²_K = ⟨Ka − Kb, Ka − Kb⟩_K = K(a, a) − 2K(a, b) + K(b, b)
             = 1 + a² − 2(1 + ab) + 1 + b² = (a − b)².

But (Ka − Kb)(x) = (1 + ax) − (1 + bx) = (a − b)x. So ‖Ka − Kb‖²_K =
‖(a − b)x‖²_K = (a − b)² ‖x‖²_K. It follows that ‖x‖_K = 1.
In the same way,

⟨Ka, Ka − Kb⟩_K = K(a, a) − K(a, b) = (1 + a²) − (1 + ab) = a(a − b).

But (Ka − Kb)(x) = (a − b)x and Ka(x) = 1 + ax. So

⟨Ka, Ka − Kb⟩_K = ⟨1 + ax, (a − b)x⟩_K = (a − b)⟨1, x⟩_K + a(a − b)‖x‖²_K.

Since ‖x‖²_K = 1, we have (a − b)⟨1, x⟩_K = 0. But a − b ≠ 0, and hence
⟨1, x⟩_K = 0.
Similarly, ⟨Ka, Ka⟩_K = K(a, a) = 1 + a² can also be written as

⟨1 + ax, 1 + ax⟩_K = ‖1‖²_K + 2a⟨1, x⟩_K + a²‖x‖²_K = ‖1‖²_K + a².

Hence ‖1‖_K = 1. We have thus proved that {1, x} is an orthonormal system
of HK.
Finally, each function Kc = 1 + cx with c ∈ X is contained in span{1, x},
which is a closed subspace of HK. Therefore, span{1, x} = HK and {1, x} is an
orthonormal basis of HK.  □

The above extends to the multivariate case under a slightly stronger


assumption on X .

Example 2.13 Let X ⊆ Rn containing 0 and the coordinate vectors ej , j =


1, . . . , n. Let K be the Mercer kernel on X given by K(x, y) = 1 + x · y. Then
HK is the space of linear functions and {1, x1 , x2 , . . . , xn } forms an orthonormal
basis of HK .
26 2 Basic hypothesis spaces

Proof. Note that K(v, x) = 1 + v · x = 1 + v1 x1 + · · · + vn xn with v =


(v1 , . . . , vn ) ∈ X ⊂ Rn .
We argue as in Example 2.12. For each 1 ≤ j ≤ n,

Kej − K0 2
K = K(ej , ej ) − 2K(ej , 0) + K(0, 0) = 1.

But (Kej −K0 )(x) = (1+xj )−1 = xj , and therefore 1 = Kej −K0 2K = xj 2K .
One can prove, similarly, that 1, xj K = 0 and 1, 1K = 1. Now consider i  = j:

Kei , Kej K = K(ei , ej ) = 1 = 1 + xi , 1 + xj K


= 1 2
K + 1, xi K + 1, xj K + xi , xj K .

Since we have shown that 1, xi K = 1, xj K = 0 and 1 K = 1, it follows that


xi , xj K = 0. We have thus proved that {1, x1 , x2 , . . . , xn } is an orthonormal
system of HK . But each function Kv = 1 + v · x with v ∈ X is contained
in span{1, x1 , x2 , . . . , xn }, which is a closed subspace of HK . Therefore,
span{1, x1 , x2 , . . . , xn } = HK and {1, x1 , x2 , . . . , xn } is an orthonormal basis
of HK . 

The second family is that of translation invariant kernels as given by,

K(x, y) = k(x − y),

where k is an even function on Rn , that is, k(−x) = k(x) for all x ∈ Rn .


We say that the Fourier transform k of k is nonnegative (respectively, positive)
when it is real valued and k(ξ ) ≥ 0 (respectively, k(ξ ) > 0) for all ξ ∈ Rn .

Proposition 2.14 Let k ∈ L 2 (Rn ) be continuous and even. Suppose the


Fourier transform of k is nonnegative. Then the kernel K(x, y) = k(x − y)
is a Mercer kernel on Rn and hence a Mercer kernel on any subset X of Rn .

Proof. We need only show the positive semidefiniteness. To do so, for any
x1 , . . . , xm ∈ Rn and c1 , . . ., cm ∈ R, we apply the inverse Fourier transform


k(x) = (2π)−n k(ξ )eix·ξ d ξ
Rn
2.5 Some Mercer kernels 27

to get


m 
m 
cj c K(xj , x ) = cj c (2π)−n k(ξ )eixj ·ξ e−ix ·ξ d ξ
j,=1 j,=1 Rn
⎛ ⎞ 
 m 
m
= (2π)−n k(ξ ) ⎝ cj e j ⎠
ix ·ξ
c eix ·ξ d ξ
Rn j=1 =1
 2
  
 m

= (2π)−n k(ξ )  cj eixj ·ξ  d ξ ≥ 0,
Rn  j=1 

where | | means the module in C and z is the complex conjugate of z. Thus, K


is a Mercer kernel on any subset of Rn . 
Example 2.15 (A spline kernel) Let k be the univariate function supported on
[−2, 2] given by k(x) = 1 − |x|/2 for −2 ≤ x ≤ 2. Then the kernel K defined
by K(x, y) = k(x − y) is a Mercer kernel on any subset X of R.
Proof. One can easily check that 2k(x) equals the convolution of
the characteristic function χ[−1,1] with itself. But χ[−1,1] (ξ ) = 2 sin ξ/ξ .
Thus, k(ξ ) = 2(sin ξ/ξ )2 ≥ 0 and the Mercer property follows from
Proposition 2.14. 
Remark 2.16 Note that the kernel K defined in Example 2.15 is given by

1 − |x−y| if |x − y| ≤ 2
K(x, y) = 2
0 otherwise,

and therefore CK = 1.
Multivariate splines can also be used to construct translation-invariant
kernels. Take B = [b1 b2 . . . bq ] to be an n × q matrix (called the direction
set) such that q ≥ n and the n × n submatrix B0 = [b1 b2 . . . bn ] is invertible.
Define
1
MB0 = χpar(B0 ) ,
| det B0 |

the normalized characteristic function of the parallepiped


⎧ ⎫
⎨ n
1 ⎬
par(B0 ) := tj bj | |tj | ≤ , 1 ≤ j ≤ n
⎩ 2 ⎭
j=1
28 2 Basic hypothesis spaces

spanned by the vectors b1 , . . . , bn in Rn . Then the (centered) box spline MB can


be inductively defined by

 1
2
M[b1 b2 ... bn+j ] (x) = M[b1 b2 ... bn+j−1 ] (x − tbn+j ) dt
− 12

for j = 1, . . . , q − n. One can check by induction that its Fourier transform


satisfies


q
sin(ξ · bj /2)
MB (ξ ) = .
ξ · bj /2
j=1

Example 2.17 (A box spline kernel) Let B = [b1 b2 . . . bq ] be an n×q matrix


where [b1 b2 . . . bn ] is invertible. Choose k(x) = (MB ∗ MB )(x) to be the box
spline with direction set [B, B]. Then, for all ξ ∈ Rn ,


q
sin(ξ · bj /2) 2
k(ξ ) =
ξ · bj /2
j=1

and the kernel K(x, y) = k(x − y) is a Mercer kernel on any subset X of Rn .


An interesting class of translation invariant kernels is provided by radial
basis functions. Here the kernel takes the form K(x, y) = f ( x − y 2 ) for
a univariate function f on [0, +∞). The following result allows us to verify
positive semidefiniteness easily for this type of kernel.

Proposition 2.18 Let X ⊂ Rn , f : [0, ∞) → R and K:X × X → R defined


by K(x, y) = f ( x − y 2 ). If f is completely monotonic, then K is positive
semidefinite.

Proof. By Proposition 2.5, there is a finite Borel measure ν on [0, ∞) for


which
 ∞
f (t) = e−tσ d ν(σ )
0

for all t ∈ [0, ∞). It follows that


 ∞
e−σ
2
K(x, y) = f ( x − y 2 ) = x−y
d ν(σ ).
0
2.5 Some Mercer kernels 29

Now note that for each σ ∈ [0, ∞), the Fourier transform of e−σ x 2
equals

( π/σ )n e− ξ /4σ . Hence,
2

 π n ξ 2
−σ x −n
e−
2 2
e = (2π) 4σ eix·ξ d ξ .
Rn σ

Therefore, reasoning as in the proof of Proposition 2.14, we have, for all x =


(x1 , . . . , xm ) ∈ X m ,
 2
  π n  m 

m ∞ ξ 2  −ix ·ξ 
c cj K(x , xj ) = (2π)−n
2
e−  c e j  d ξ d ν(σ ) ≥ 0.
σ

 j 
,j=1 0 Rn  j=1 

Corollary 2.19 Let c > 0. The following functions are Mercer kernels on any
subset X ⊂ Rn :

(i) (Gaussian) K(x, t) = e− x−t /c .


2 2

(ii) (Inverse multiquadrics) K(x, t) = (c2 + x − t 2 )−α with α > 0.

Proof. Clearly, both kernels are continuous and symmetric. In (i) K is positive
semidefinite by Proposition 2.18 with f (r) = e−r/c . The same is true for (ii)
2

taking f (r) = (c2 + r)−α . 

Remark 2.20 The kernels of (i) and (ii) in Corollary 2.19 satisfy CK = 1 and
CK = c−α , respectively.

A key example of a finite-dimensional RKHS induced by a Mercer kernel


follows. Unlike in the case of the Mercer kernels of Corollary 2.19, we will not
use Proposition 2.18 to show positivity.

Example 2.1 (continued) Recall that Hd = Hd (Rn+1 ) is the linear space


of homogeneous polynomials of degree d in x0 , x1 , . . . , xn . Its dimension (the
number of coefficients of a polynomial f ∈ Hd ) is

n+d
N= .
n

The number N is exponential in n and d . We notice, however, that in some


situations one may consider a linear space of polynomials with a given
monomial structure; that is, only a prespecified set of monomials may appear.
30 2 Basic hypothesis spaces

We can make Hd an inner product space by taking



f , gW = wα vα (Cαd )−1
|α|=d
 
for f , g ∈ Hd , f = wα xα , g = vα xα . This inner product, which we call
the Weyl inner product, is natural and has an important invariance property. Let
O(n + 1) be the orthogonal group in Rn+1 , that is, the group of (n + 1) × (n + 1)
real matrices whose action on Rn+1 preserves the inner product on Rn+1 ,

σ (x) · σ (y) = x · y, for all x, y ∈ Rn+1 and all σ ∈ O(n + 1).

The action of O(n + 1) on Rn+1 induces an action of O(n + 1) on Hd . For


f ∈ Hd and σ ∈ O(n + 1) we define σ ( f ) ∈ Hd by σ f (x) = f (σ −1 (x)).
The invariance property of  , W , called orthogonal invariance, is that for all
f , g ∈ Hd ,

σ ( f ), σ (g)W = f , gW .

Note that if f W denotes the norm induced by  , W , then

| f (x)| ≤ f W x d
,

where x is the standard norm of x ∈ Rn+1 . This follows from taking the
action of σ ∈ O(n + 1) such that σ (x) = ( x , 0, . . ., 0).
Let X = S(Rn+1 ) and

K :X × X → R
(x, t)  → (x · t)d .

Let also

 : X → RN
 
x  → xα (Cαd )1/2 .
|α|=d

Then, for x, t ∈ X , we have



(x) · (t) = xα t α Cαd = (x · t)d = K(x, t).
|α|=d

This equality enables us to prove that K is positive semidefinite. For t1 , . . . , tk ∈


X , the entry in row i and column j of K[t] is (ti )·(tj ). Therefore, if M denotes
2.6 Hypothesis spaces associated with an RKHS 31

the matrix whose jth column is (tj ), we have that K[t] = M T M , from which
the positivity of K[t] follows. Since K is clearly continuous and symmetric, we
conclude that K is a Mercer kernel.
The next proposition shows the RKHS associated with K.
Proposition 2.21 Hd = HK as function spaces and inner product spaces.
Proof. We know from the proof of Theorem 2.9 that HK is the completion of
H0 , the span of {Kx | x ∈ X }. Since H0 ⊆ Hd and Hd has finite dimension, the
same holds for H0 . But then H0 is complete and we deduce that

HK = H0 ⊆ Hd .

The map V : Rn+1 → RN defined by V(x) = (xα )|α|=d is a well-known


object in algebraic geometry, where it is called a Veronese embedding. We
note here that the map  defined above is related to V, since for every x ∈ X ,
(x) = DV(x), where D is the diagonal matrix with entries (Cαd )1/2 . The image
of Rn+1 by the Veronese embedding is an algebraic variety called the Veronese
variety, which is known to be nondegenerate, that is, to span all of RN . This
implies that HK = Hd as vector spaces. We now show that they are actually
the same inner product space.
By definition of the inner product in H0 , for all x, t ∈ X ,

Kx , Kt H0 = K(x, t) = Cαd xα t α .
|α|=d


On the other hand, since Kx (w) = |α|=d Cαd xα wα , we know that the Weyl
inner product of Kx and Kt satisfies
 
Kx , Kt W = (Cαd )−1 Cαd xα Cαd t α = Cαd xα t α = Kx , Kt K .
|α|=d |α|=d

We conclude that since the polynomials Kx span all of H0 , the inner product in
HK = H0 is the Weyl inner product. 

2.6 Hypothesis spaces associated with an RKHS


We now proceed with the last example in this chapter.
Proposition 2.22 Let K be a Mercer kernel on a compact metric space X , and
HK its RKHS. For all R > 0, the ball BR := {f ∈ HK : f K ≤ R} is a closed
subset of C (X ).
32 2 Basic hypothesis spaces

Proof. Suppose {fn } ⊂ BR converges in C (X ) to a function f ∈ C (X ). Then,


for all x ∈ X ,

f (x) = lim fn (x).


n→∞

Since a closed ball of a Hilbert space is weakly compact, we have that BR is


weakly compact. Therefore, there exists a subsequence {fnk }k∈N of {fn } and an
element f ∈ BR such that

lim fnk , gK = f , gK , ∀g ∈ HK .


k→∞

For each x ∈ X , take g = Kx to obtain

lim fnk (x) = lim fnk , Kx K = f , Kx K = f (x).


k→∞ k→∞

But limk→∞ fnk (x) = f (x), so we have f (x) = f (x) for every point x ∈ X .
Hence, as continuous functions on X , f = f . Therefore, f ∈ BR . This shows
that BR is closed as a subset of C (X ). 
Proposition 2.23 Let K be a Mercer kernel on a compact metric space X , and
HK be its RKHS. For all R > 0, the set IK (BR ) is compact.
Proof. By the Arzelá–Ascoli theorem (Theorem 2.4) it suffices to prove that
BR is equicontinuous.
Since X is compact, so is X × X . Therefore, since K is continuous on X × X ,
K must be uniformly continuous on X × X . It follows that for any ε > 0, there
exists δ > 0 such that for all x, y, y ∈ X with d (y, y ) ≤ δ,

|K(x, y) − K(x, y )| ≤ ε.

For f ∈ BR and y, y ∈ X with d (y, y ) ≤ δ, we have

| f (y) − f (y )| = |f , Ky − Ky K | ≤ f K Ky − K y  K



≤ R(K(y, y) − K(y, y ) + K(y , y ) − K(y, y ))1/2 ≤ R 2ε.
  


Example 2.24 (Hypothesis spaces associated with an RKHS) Let X be
compact and K : X × X → R be a Mercer kernel. By Proposition 2.23, for
all R > 0 we may consider IK (BR ) to be a hypothesis space. Here and in what
follows BR denotes the closed ball of radius R centered on the origin.
2.7 Reminders II 33

2.7 Reminders II
The general nonlinear programming problem is the problem of finding x ∈ Rn
to solve the following minimization problem:

min f (x)
s.t. gi (x) ≤ 0, i = 1, . . . , m, (2.2)
hj (x) = 0, j = 1, . . . , p,

where f , gi , hj : Rn → R. The function f is called the objective function, and the


equalities and inequalities on gi and hj are called the constraints. Points x ∈ Rn
satisfying the constraints are feasible and the subset of Rn of all feasible points
is the feasible set.
Although stating this problem in all its generality leads to some conceptual
clarity, it would seem that the search for an efficient algorithm to solve it is
hopeless. A vast amount of research has thus focused on particular cases and
the emphasis has been on those cases for which efficient algorithms exist. We do
not develop here the complexity theory giving formal substance to the notion of
efficiency – we do not need such a development; instead we content ourselves
with understanding the notion of efficiency according to its intuitive meaning:
an efficient algorithm is one that computes its outcome in a reasonably short
time for reasonably long inputs. This property can be found in practice and
studied in theory (via several well-developed measures of complexity available
to complexity theorists).
One example of a well-studied case is linear programming. This is the case
in which both the objective function and the constraints are linear. It is also a
case in which efficient algorithms exist (and have been both used in practice
and studied in theory). A much more general case for which efficient algorithms
exist is that of convex programming.
A subset S of a linear space H is said to be convex when, for all x, y ∈ S and
all λ ∈ [0, 1], λx + (1 − λ)y ∈ S.
A function f on a convex domain S is said to be convex if, for all λ ∈ [0, 1]
and all x, y ∈ S, f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y). If S is an interval on
R, then, for x0 , x ∈ S, x0 < x, we have

f (x0 + λ(x − x0 )) = f (λx + (1 − λ)x0 ) ≤ λf (x) + (1 − λ)f (x0 )

and
f (x0 + λ(x − x0 )) − f (x0 ) f (x) − f (x0 )
≤ .
λ(x − x0 ) x − x0
34 2 Basic hypothesis spaces

This means that the function t → ( f (t) − f (x0 ))/(t − x0 ) is increasing in the
interval [x0 , x]. Hence, the right derivative

f (t) − f (x0 )
f+ (x0 ) := lim
t→(x0 )+ t − x0

exists. In the same way we see that the left derivative

f (t) − f (x0 )
f− (x0 ) := lim
t→(x0 )− t − x0

exists. These two derivatives, in addition, satisfy f− (x0 ) ≤ f+ (x0 ) whenever x0
is a point in the interior of S. Hence, both f− (x0 ) and f+ (x0 ) are nondecreasing
in S.
In addition to those listed above, convex functions satisfy other properties.
We highlight the fact that the addition of convex functions is convex and that if a
function f is convex and C 2 then its Hessian D2 f (x) at x is positive semidefinite
for all x in its domain.
The convex programming problem is the problem of finding x ∈ Rn
to solve (2.2) with f and gi convex functions and hj linear. As we have
remarked, efficient algorithms for the convex programming problem exist.
In particular, when f and the gi are quadratic functions, the corresponding
programming problem, called the convex quadratic programming problem,
can be solved by even more efficient algorithms. In fact, convex quadratic
programs are a particular case of second-order cone programs. And second-
order cone programming today provides an example of the success of interior
point methods: very large amounts of input data can be efficiently dealt with,
and commercial code is available. (For references see Section 2.9).

2.8 On the computation of empirical target functions


A remarkable property of the hypothesis space H = IK (BR ), where BR is the
ball of radius R in an RKHS HK , is the fact that the optimization problem of
computing the empirical target function fz reduces to a convex programming
problem.
Let K be a Mercer kernel, and HK its associated RKHS. Let z ∈ Z m . Denote
by HK,z the finite-dimensional subspace of HK spanned by {Kx1 , . . ., Kxm } and
let P be the orthogonal projection P : HK → HK,z .
Proposition 2.25 Let B ⊆ HK . If f ∈ HK is a minimizer of Ez in B, then P( f )
is a minimizer of Ez in P(B), the image of B under P.
2.9 References and additional remarks 35

Proof. For all f ∈ B and all i = 1, . . ., m, f , Kxi K = P( f ), Kxi K . Since


both f and P( f ) are in HK , the reproducing property implies that

f (xi ) = f , Kxi K = P( f ), Kxi K = (P( f ))(xi ).

It follows that Ez ( f ) = Ez (P( f )). Taking f to be a minimizer of Ez in B proves


the statement. 
Corollary 2.26 Let B ⊆ HK be such that P(B) ⊆ B. If Ez can be minimized in
B then such a minimizer can be chosen in P(B). 

Corollary 2.26 shows that in many situations – for example, when B is


convex – the empirical target function fz may be chosen in HK,z . Recall from
Theorem 2.9 that the norm K restricted to HK,z is given by
 m 2
  
m
 
 ci Kxi  = ci K(xi , xj )cj = cT K[x]c.
 
i=1 i,j=1
m ∗
Therefore, when B = BR , we may take fz = i=1 ci Kxi , where c∗ ∈ Rm is a
solution of the following problem:
 m 2
1 
m
min ci K(xi , xj ) − yj
m
j=1 i=1

s.t. cT K[x]c ≤ R2 .

Note that this is a convex quadratic programming problem and, therefore, can
be efficiently solved.

2.9 References and additional remarks


An exhaustive exposition of the Fourier transform can be found in [120].
For a proof of Plancherel’s theorem see section 4.11 of [40]. The Arzelá–
Ascoli theorem is proved, for example, in section 11.4 of [70] or section 9 of [2].
Proposition 2.5 is shown in [106] (together with a more difficult converse). For
extensions to conditionally positive semidefinite kernels generated by radial
basis functions see [86].
The definition of H s (X ) can be extended to s ∈ R, s ≥ 0 (called fractional
Sobolev spaces), using a Fourier transform argument [120]. We will do so in
Section 5.1.
36 2 Basic hypothesis spaces

References for Sobolev space are [1, 129], and [47] for embedding theorems.
A substantial amount of the theory of RKHSs was surveyed by
N. Aronszajn [9]. On page 344 of this reference, Theorem 2.9, in essence,
is attributed to E. H. Moore.
The special dot product kernel K(x, y) = (c + x · y)d for some c ≥ 0 and
d ∈ N was introduced into the field of statistical learning theory by Vapnik (see,
e.g., [134]). General dot product kernels are described in [118]; see also [101]
and [79]. Spline kernels are discussed extensively in [137].
Chapter 14 of [19] is a reference for the unitary and orthogonal invariance of
 , W . A reference for the nondegeneracy of the Veronese variety mentioned
in Proposition 2.21 is section 4.4 of [109].
A comprehensive introduction to convex optimization is the book [25]. For
second-order cone programming see the articles [3, 119].
For more families of Mercer kernels in learning theory see [107]. More
examples of box splines can be found in [41]. Reducing the computation of fz
from HK to HK,z is ensured by representer theorems [137]. For a general form
of these theorems see [117].
3
Estimating the sample error

The main result in this chapter provides bounds for the sample error of a compact
and convex hypothesis space. We have already noted that with m fixed, the
sample error increases with the size of H. The bounds we deduce in this chapter
show this behavior with respect to a particular measure for the size of H: its
capacity as measured by covering numbers.

Definition 3.1 Let S be a metric space and η > 0. We define the covering
number N (S, η) to be the minimal  ∈ N such that there exist  disks in S with
radius η covering S. When S is compact this number is finite.

Definition 3.2 Let M > 0 and ρ be a probability measure on Z. We say that a


set H of functions from X to R is M-bounded when

sup | f (x) − y| ≤ M
f ∈H

holds almost everywhere on Z.

Theorem 3.3 Let H be a compact and convex subset of C (X ). If H is


M-bounded, then, for all ε > 0,
 ε   mε
Prob {EH ( fz ) ≤ ε} ≥ 1 − N H, exp − .
z∈Zm 12M 300M 2

3.1 Exponential inequalities in probability


Write the sample error EH ( fz ) = E( fz ) − E( fH ) as

EH ( fz ) = E( fz ) − Ez ( fz ) + Ez ( fz ) − Ez ( fH ) + Ez ( fH ) − E( fH ).

37
38 3 Estimating the sample error

Since fz minimizes Ez in H, Ez ( fz ) − Ez ( fH ) ≤ 0. Then a bound for EH ( fz )


follows from bounds for E( fz ) − Ez ( fz ) and Ez ( fH ) − E( fH ). For f : X → R
consider a random variable ξ on Z given by ξ(z) = ( f (x) − y)2 , where z =

(x, y) ∈ Z. Then Ez ( f ) − E( f ) = m1 m i=1 ξ(zi ) − E(ξ ) = Ez (ξ ) − E(ξ ).
The rate of convergence of this quantity is the subject of some well-known
inequalities in probability theory.
If ξ is a nonnegative random variable and t > 0, then ξ ≥ ξ χ{ξ ≥t} ≥ tχ{ξ ≥t} ,
where χJ denotes the characteristic function of J . Noting that Prob{ξ ≥ t} =
E(χ{ξ ≥t} ), we obtain Markov’s inequality,

E(ξ )
Prob{ξ ≥ t} ≤ .
t

Applying Markov’s inequality to (ξ − E(ξ ))2 for an arbitrary random variable


ξ yields Chebyshev’s inequality, for any t > 0,

σ 2 (ξ )
Prob{|ξ − E(ξ )| ≥ t} = Prob{(ξ − E(ξ ))2 ≥ t 2 } ≤ .
t2
One particular use of Chebyshev’s inequality is for sums of independent random
variables. If ξ is a random variable on a probability space Z with mean E(ξ ) = µ
and variance σ 2 (ξ ) = σ 2 , then, for all ε > 0,
  
1  m  σ2
 
Prob  ξ(z i ) − µ  ≥ ε ≤ .
z∈Z m  m  mε 2
i=1

This inequality provides a simple form of the weak law of large numbers since

it shows that when m → ∞, m1 m i=1 ξ(zi ) → µ with probability 1.
For any 0 < δ < 1 and by taking ε = σ 2 /(mδ) in the inequality above it
follows that with confidence 1 − δ,
  !
1 
m  σ2
 
 ξ(zi ) − µ ≤ . (3.1)
m  mδ
i=1

The goal of this section is to extend inequality (3.1) to show a faster rate of decay.
1
Typical bounds with confidence 1 − δ will be of the form c(log(2/δ)/m) 2 +θ
with 0 < θ < 12 depending on the variance of ξ . The improvement in the error is
seen both in its dependence on δ – from 2/δ to log(2/δ) – and in its dependence
1 1
on m – from m− 2 to m−( 2 +θ) . Note that {ξi = ξ(zi )}mi=1 are independent random
variables with the same mean and variance.
3.1 Exponential inequalities in probability 39

Proposition 3.4 (Bennett) Let {ξi }m


i=1 be independent random variables on a
m
probability space Z with means {µi } and variances {σi }. Set  := i=1 σi2 .
2 2

If for each i |ξi − µi | ≤ M holds almost everywhere, then for every ε > 0
we have

m    
" # ε 2 Mε
Prob ξi − µi > ε ≤ exp − 1+ log 1 + 2 −1 .
M Mε 
i=1

$ % loss of generality, we assume µi = 0. Then the variance of ξi


Proof. Without
is σi2 = E ξi2 .
Let c be an arbitrary positive constant that will be determined later. Then

m    m  
" #
I := Prob ξi − µi > ε = Prob exp cξi > e .

i=1 i=1

By Markov’s inequality and the independence of {ξi }, we have



m  
m
$ %
I ≤ e−cε E exp cξi = e−cε E ecξi . (3.2)
i=1 i=1

Since |ξi | ≤ M almost everywhere and E(ξi ) = 0, the Taylor expansion for ex
yields
+∞  $  %
 +∞  −2 2

$ % c E ξi c M σi
E e cξi
=1+ ≤1+ .
! !
=2 =2

Using 1 + t ≤ et , it follows that



+∞  −2 2   cM 
$ % c M σi e − 1 − cM 2
E ecξi ≤ exp = exp σi ,
! M2
=2

and therefore
 
ecM − 1 − cM 2
I ≤ exp −cε +  .
M2

Now choose the constant c to be the minimizer of the bound on the right-hand
side above:
1 Mε
c= log 1 + 2 .
M 
40 3 Estimating the sample error

That is, ecM − 1 = M ε/ 2 . With this choice,


  
ε 2 Mε
I ≤ exp − 1+ log 1 + 2 −1 .
M Mε 

This proves the desired inequality. 

Let g : [0, +∞) → R be given by

g(λ) := (1 + λ) log(1 + λ) − λ.

Then Bennett’s inequality asserts that



m   
" # 2 Mε
Prob ξi − µi > ε ≤ exp − 2 g . (3.3)
M 2
i=1

Proposition 3.5 Let {ξi }m


i=1 be independent random variables on a probability
space Z with means {µi } and variances {σi2 } and satisfying |ξi (z)−E(ξi )| ≤ M

for each i and almost all z ∈ Z. Set  2 := m i=1 σi . Then for every ε > 0,
2

(Generalized Bennett’s inequality)


 m  
" # ε Mε
Prob ξi − µi > ε ≤ exp − log 1 + 2 .
2M 
i=1

(Bernstein)
 m ⎧ ⎫
" # ⎨ ε 2 ⎬
Prob ξi − µi > ε ≤ exp −   .
⎩ 2 2 + 1 M ε ⎭
i=1 3

(Hoeffding)
 m  
" # ε2
Prob ξi − µi > ε ≤ exp − .
2mM 2
i=1

Proof. The first inequality follows from (3.3) and the inequality

λ
g(λ) ≥ log(1 + λ), ∀λ ≥ 0. (3.4)
2
3.1 Exponential inequalities in probability 41

To verify (3.4), define a C 2 function f on [0, ∞) by

f (λ) := 2 log(1 + λ) − 2λ + λ log(1 + λ).

We can see that f (0) = 0, f  (0) = 0, and f  (λ) = λ(1 + λ)−2 ≥ 0 for
λ ≥ 0. Hence f (λ) ≥ 0 and

log(1 + λ) − λ ≥ − 12 λ log(1 + λ), ∀λ ≥ 0.

It follows that
λ
g(λ) = λ log(1 + λ) + log(1 + λ) − λ ≥ log(1 + λ), ∀λ > 0.
2
This verifies (3.4) and then the generalized Bennett’s inequality.
Since g(λ) ≥ 0, we find that the function h defined on [0, ∞) by h(λ) =
(6 + 2λ)g(λ) − 3λ2 satisfies similar conditions: h(0) = h (0) = 0, and h (λ) =
(4/(1 + λ))g(λ) ≥ 0. Hence h(λ) ≥ 0 for λ ≥ 0 and

3λ2
g(λ) ≥ .
6 + 2λ
Applying this to (3.3), we get the proof of Bernstein’s inequality.
To prove Hoeffding’s inequality, we follow the proof of Proposition 3.4 and
use (3.2). As the exponential function is convex and −M ≤ ξi ≤ M almost
surely,

cξi − (−cM ) cM cM − cξi −cM


ecξi ≤ e + e
2cM 2cM
holds almost everywhere. It follows from E(ξi ) = 0 and the Taylor expansion
for ex that
∞ ∞ ∞
$ % 1 1 1  (−cM ) 1  (cM )  (cM )2j
E ecξi ≤ e−cM + ecM = + =
2 2 2 ! 2 ! (2j)!
=0 =0 j=0

∞ $ %j ∞ $ %j  
 (cM )2 /2 j
1  (cM )2 /2 (cM )2
= ≤ = exp .
j! 2 − 1 j! 2
j=0 =1 j=0
& '
This, together with (3.2), implies
& that I ≤ exp
' −cε + m(cM )2 /2 . Choose

c = ε/(mM 2 ). Then I ≤ exp −ε2 /(2mM 2 ) . 


Bounds for the distance between empirical mean and expected value follow
from Proposition 3.5.
42 3 Estimating the sample error

Corollary 3.6 Let ξ be a random variable on a probability space Z with mean


E(ξ ) = µ and variance σ 2 (ξ ) = σ 2 , and satisfying |ξ(z) − E(ξ )| ≤ M for
almost all z ∈ Z. Then for all ε > 0,
(Generalized Bennett)
   
1
m
mε Mε
Prob ξ(zi ) − µ ≥ ε ≤ exp − log 1 + 2 .
z∈Z m m 2M σ
i=1

  ⎧ ⎫
1
m ⎨ mε2 ⎬
(Bernstein) Prob ξ(z i ) − µ ≥ ε ≤ exp −   .
z∈Z m m ⎩ 2 σ2 + 1M ε ⎭
i=1 3
   
1
m
mε 2
(Hoeffding) Prob ξ(zi ) − µ ≥ ε ≤ exp − .
z∈Zm m 2M 2
i=1

Proof. Apply Proposition 3.5 to the random variables {ξi = ξ(zi )/m} that

satisfy |ξi − E(ξi )| ≤ M /m, σ 2 (ξi ) = σ 2 /m2 , and σi2 = σ 2 /m. 

Remark 3.7 Each estimate given in Corollary 3.6 is said to be a one-side


probability inequality. The same bound holds true when “≥ε” is replaced by
“≤ −ε.” By taking the union of these two events we obtain a two-side
 
probability inequality stating that Probz∈Z m | m1 m 
i=1 ξ(zi ) − µ ≥ ε is
bounded by twice the bound occurring in the corresponding one-side inequality.

Recall the definition of the defect function Lz ( f ) = E( f ) − Ez ( f ). Our


first main result, Theorem 3.8, states a bound for Prob{Lz ( f ) ≥ −ε} for a
single function f : X → Y . This bound follows from Hoeffding’s bound in
Corollary 3.6 by taking ξ = −fY2 = −( f (x) − y)2 satisfying |ξ | ≤ M 2 when
f is M-bounded.

Theorem 3.8 Let M > 0 and f : X → Y be M-bounded. Then, for all  > 0,

 
mε2
Prob {Lz ( f ) ≥ −ε} ≥ 1 − exp − . 
z∈Z m 2M 4

Remark 3.9
(i) Note that the confidence (i.e., the right-hand side in the inequality above)
is positive and approaches 1 exponentially quickly with m.
3.2 Uniform estimates on the defect 43

(ii) A case implying the M-boundedness for f is the following. Define


& '
Mρ = inf M̄ ≥ 0 | {(x, y) ∈ Z : |y − fρ (x)| ≥ M̄ } has measure zero .

Then take M = P + Mρ , where P ≥ f − fρ Lρ∞ = sup |f (x) − fρ (x)|.


X
x∈XρX
(iii) It follows from Theorem 3.8 that for any 0 < δ < 1, Ez ( f ) − E( f ) ≤
M 2 2 log(1/δ)/m with confidence 1 − δ.

3.2 Uniform estimates on the defect


The second main result in this chapter extends Theorem 3.8 to families of
functions.

Theorem 3.10 Let H be a compact M-bounded subset of C (X ). Then, for all


ε > 0,
   
 ε  mε2
Prob sup L z ( f ) ≤ ε ≥ 1 − N H, exp − .
z∈Z m f ∈H 8M 8M 4

Notice the resemblance to Theorem 3.8. The only essential difference is in


the covering number, which takes into account the extension from a single f
to the family H. This has the effect of requiring the sample size m to increase
accordingly to achieve the confidence level of Theorem 3.8.

Lemma 3.11 Let H = S1 ∪ . . . ∪ S and ε > 0. Then


  
 

Prob
m
sup Lz ( f ) ≥ ε ≤ Prob
m
sup Lz ( f ) ≥ ε .
z∈Z f ∈H z∈Z f ∈Sj
j=1

Proof. The proof follows from the equivalence

sup Lz ( f ) ≥ ε ⇐⇒ ∃j ≤  s.t. sup Lz ( f ) ≥ ε


f ∈H f ∈Sj

and the fact that the probability of a union of events is bounded by the sum of
the probabilities of those events. 
$ ε
%
Proof of Theorem 3.10 Let  = N H, 4M and consider f1 , . . . , f such that
ε
the disks Dj centered at fj and with radius 4M cover H. Let U be a full measure
44 3 Estimating the sample error

set on which supf ∈H |f (x) − y| ≤ M . By Proposition 1.11, for all z ∈ U m and


all f ∈ Dj ,
ε
|Lz ( f ) − Lz ( fj )| ≤ 4M f − fj ∞ ≤ 4M = ε.
4M
Since this holds for all z ∈ U m and all f ∈ Dj , we get

sup Lz ( f ) ≥ 2ε ⇒ Lz ( fj ) ≥ ε.
f ∈Dj

We conclude that for j = 1, . . . , ,


   
& '
mε2
Prob sup Lz ( f ) ≥ 2ε ≤ Prob Lz ( fj ) ≥ ε ≤ exp − ,
z∈Z m f ∈Dj z∈Z m 2M 4

where the last inequality follows from Hoeffding’s bound in Corollary 3.6 for
ξ = −( f (x) − y)2 on Z. The statement now follows from Lemma 3.11 by
replacing ε by 2ε . 
Remark 3.12 Hoeffding’s inequality can be seen as a quantitative instance of
the law of large numbers. An “abstract” uniform version of this law can be
extracted from the proof of Theorem 3.10.
Proposition 3.13 Let F be a family of functions from a probability space Z to
R and d a metric on F. Let U ⊂ Z be of full measure and B, L > 0 such that
(i) |ξ(z)| ≤ B for all ξ ∈ F and all z ∈ U , and
(ii) |Lz (ξ1 ) − Lz (ξ2 )| ≤ L d (ξ1 , ξ2 ) for all ξ1 , ξ2 ∈ F and all z ∈ U m , where

1
m
Lz (ξ ) = ξ(z) − ξ(zi ).
Z m
i=1

Then, for all ε > 0,


   
ε  mε 2
Prob sup |Lz (ξ )| ≤ ε ≥ 1 − N F, 2 exp − 2 . 
z∈Z m ξ ∈F 2L 8B

3.3 Estimating the sample error


How good an approximation of fH can we expect fz to be? In other words, how
small can we expect the sample error EH ( fz ) to be? The third main result in
this chapter, Theorem 3.14, gives the first answer.
3.3 Estimating the sample error 45

Theorem 3.14 Let H be a compact M-bounded subset of C (X ). Then, for


all ε > 0,

(  
ε  ) mε 2
Prob {EH ( fz ) ≤ ε} ≥ 1 − N H, + 1 exp − .
z∈Zm 16M 32M 4

Proof. Recall that EH ( fz ) ≤ E( fz ) − Ez ( fz ) + Ez ( fH ) − E( fH ).


By Theorem 3.8 applied to the single function fH and 2ε , we know that

Ez ( fH ) − E( fH ) ≤ 2ε with probability at least 1 − exp − 8M mε2
4 .
ε
On the other hand, Theorem 3.10 with ε replaced by 2 tells us that with
$ % 
ε mε2
probability at least 1 − N H, 16M exp − 32M 4 ,

ε
sup Lz ( f ) = sup {E( f ) − Ez ( f )} ≤
f ∈H f ∈H 2

holds, which implies in particular that E( fz ) − Ez ( fz ) ≤ 2ε . Combining these


two bounds, we know that EH ( fz ) ≤ ε with probability at least
*   + *  +
ε  mε 2 mε 2
1 − N H, exp − 1 − exp −
16M 32M 4 8M 4
(   
ε  ) mε 2
≥ 1 − N H, + 1 exp − .
16M 32M 4

This is the desired bound. 

Remark 3.15 Theorem 3.14 helps us deal with the question posed in
Section 1.3. Given ε, δ > 0, to ensure that

Prob
m
{EH ( fz ) ≤ ε} ≥ 1 − δ,
z∈Z

it is sufficient that the number m of examples satisfies


*   +
32M 4 ε  1
m≥ ln 1 + N H, + ln . (3.5)
ε2 16M δ

& $ % ' 
ε mε2
To prove this, take δ = N H, 16M + 1 exp − 32M 4 and solve for m. Note,

furthermore, that (3.5) gives a relation between the three basic variables ε, δ,
and m.
46 3 Estimating the sample error

3.4 Convex hypothesis spaces


The dependency on ε in Theorem 3.14 is quadratic. Our next goal is to show
that when the hypothesis space H is convex, this dependency is linear. This is
Theorem 3.3. Its Corollary 3.17 estimates directly fz − fH ρ as well.
Toward the proof of Theorem 3.3, we show an additional property of convex
hypothesis spaces. From the discussion in Section 1.3 it follows that for a convex
H, there exists a function fH in H whose distance in Ł2 to fρ is minimal. We
next prove that if H is convex and ρX is nondegenerate, then fH is unique.
Lemma 3.16 Let H be a convex subset of C (X ) such that fH exists. Then fH
is unique as an element in Ł2 and, for all f ∈ H,

( fH − f )2 ≤ EH ( f ).
X

In particular, if ρX is not degenerate, then fH is unique in H.


Proof. Let s = fH f be the segment of line with extremities fH and f :
Since H is convex, s ⊂ H. And, since fH minimizes the distance in Ł2 to

fρ over H, we have that for all g ∈ s, fH − fρ ρ ≤ g − fρ ρ . This means that


for each t ∈ [0, 1],

fH − fρ 2
ρ ≤ tf + (1 − t)fH − fρ 2ρ
 
, - t
= fH − fρ ρ + 2t f − fH , fH − fρ ρ + f − fH ρ .
2 2
2
, -
By taking t to be small enough, we see that f − fH , fH − fρ ρ ≥ 0. That is, the
angle f
ρ fH f is obtuse, which implies (note that the squares are crucial)

fH − f 2
ρ ≤ f − fρ 2
ρ − fH − fρ 2
ρ;
3.4 Convex hypothesis spaces 47

that is,

( fH − f )2 ≤ E( f ) − E( fH ) = EH ( f ).
X

This proves the desired inequality. The uniqueness of fH follows by considering


the line segment joining two minimizers fH and fH . Reasoning as above, one
can show that both angles fρf  f  and fρ
H H f  f  are obtuse. This is possible only
H H
if fH = fH . 
Corollary 3.17 With the hypotheses of Theorem 3.3, for all ε > 0,
   ε   mε
Prob ( fz − fH )2 ≤ ε ≥ 1 − N H, exp − . 
z∈Zm 12M 300M 2

Now, in addition to convexity and M-boundedness, assume that H is a


compact subset of C (X ), so that the covering numbers N (H, η) make sense
and are finite. The main stepping stone toward the proof of Theorem 3.3 is a
ratio probability inequality.
We first give such an inequality for a single random variable.
Lemma 3.18 Suppose a random variable ξ on Z satisfies E(ξ ) = µ ≥ 0, and
|ξ − µ| ≤ B almost everywhere. If E(ξ 2 ) ≤ cE(ξ ), then, for every ε > 0 and
0 < α ≤ 1,
    
µ − m1 mi=1 ξ(zi ) √ α 2 mε
Prob √ > α ε ≤ exp −
z∈Z m µ+ε 2c + 23 B

holds.
Proof. Since ξ satisfies |ξ − µ| ≤ B, the one-side Bernstein inequality in
Corollary 3.6 implies that
  ⎧ ⎫
m ⎨ ⎬
µ− 1
ξ(z i ) √ α m(µ + ε)ε
2
Prob √ i=1
m
> α ε ≤ exp −   .
z∈Zm µ+ε ⎩ 2 σ 2 (ξ ) + 1 Bα √µ + ε √ε ⎭
3

Here σ 2 (ξ ) ≤ E(ξ 2 ) ≤ cE(ξ ) = cµ. Then we find that

1 √ √ 1 B
σ 2 (ξ ) + Bα µ + ε ε ≤ cµ + B(µ + ε) ≤ c + (µ + ε).
3 3 3

This yields the desired inequality. 


48 3 Estimating the sample error

We next give a ratio probability inequality involving a set of functions.


Lemma 3.19 Let G be a set of functions on Z and c > 0 such that for each
g ∈ G, E(g) ≥ 0, E(g 2 ) ≤ cE(g), and |g − E(g)| ≤ B almost everywhere.
Then, for every ε > 0 and 0 < α ≤ 1, we have
   
E(g) − Ez (g) √ α 2 mε
Prob sup > 4α ε ≤ N (G, αε) exp − .
z∈Z m g∈G E(g) + ε 2c + 23 B

Proof. Let {gj }Jj=1 ⊂ G with J = N (G, αε) be such that G is covered by balls
in C (Z) centered on gj with radius αε.
Applying Lemma 3.18 to ξ = gj for each j, we have
   
E(gj ) − Ez (gj ) √ α 2 mε
Prob ≥ α ε ≤ exp − .
z∈Z m E(gj ) + ε 2c + 23 B

For each g ∈ G, there is some j such that g − gj C (Z) ≤ αε. Then |Ez (g) −
Ez (gj )| and |E(g) − E(gj )| are both bounded by αε. Hence

|Ez (g) − Ez (gj )| √ |E(g) − E(gj )| √


≤α ε and ≤ α ε.
E(g) + ε E(g) + ε

The latter implies that



E(gj ) + ε = E(gj ) − E(g) + E(g) + ε ≤ α ε E(g) + ε + (E(g) + ε)
√ $ %
≤ ε E(g) + ε + (E(g) + ε) ≤ 2 E(g) + ε .

It follows that E(gj ) + ε ≤ 2 E(g) + ε. We have thus seen that



(E(g) − Ez (g))/ E(g) + ε ≥ 4α ε implies (E(gj ) − Ez (gj ))/ E(g) + ε ≥
√ √
2α ε and hence (E(gj ) − Ez (gj ))/ E(gj ) + ε ≥ α ε. Therefore,
   
E(g) − Ez (g) √ 
J
E(gj ) − Ez (gj ) √
Prob sup ≥ 4α ε ≤ Prob ≥α ε ,
z∈Zm
g∈G E(g) + ε j=1
z∈Zm
E(gj ) + ε

which is bounded by J · exp −α 2 mε/(2c + 23 B) . 

We are in a position to prove Theorem 3.3.

Proof of Theorem 3.3 Consider the function set



G = ( f (x) − y)2 − ( fH (x) − y)2 : f ∈ H .
3.5 References and additional remarks 49

Each function g in G satisfies E(g) = EH ( f ) ≥ 0. Since H is M-bounded, we


have −M 2 ≤ g(z) ≤ M 2 almost everywhere. It follows that |g −E(g)| ≤ B :=
2M 2 almost everywhere. Observe that

g(z) = ( f (x) − fH (x)) [( f (x) − y) + ( fH (x) − y)] , z = (x, y) ∈ Z.



It follows that |g(z)| ≤ 2M |f (x) − fH (x)| and E(g 2 ) ≤ 4M 2 X ( f − fH )2 .
Taken together with Lemma 3.16, this implies E(g 2 ) ≤ 4M 2 EH ( f ) = cE(g)
with c = 4M 2 . Thus, all the conditions in Lemma 3.19 hold true and we can
draw the following conclusion from the identity Ez (g) = EH,z ( f ): for every
ε > 0 and 0 < α ≤ 1, with probability at least
 
α 2 mε
1 − N (G, αε) exp − ,
8M 2 + 23 2M 2
EH ( f ) − EH,z ( f ) √
sup ≤ 4α ε
f ∈H EH ( f ) + ε

holds, and,√therefore, for all f ∈ H, EH ( f ) ≤ EH,z ( f ) + 4α ε EH ( f ) + ε.
Take α = 2/8 and f = fz . Since EH,z ( fz ) ≤ 0 by definition of fz , we have

EH ( fz ) ≤ ε/2 EH ( fz ) + ε.

Solving the quadratic equation about EH ( fz ), we have EH ( fz ) ≤ ε.


Finally, by the inequality g1 − g2 C (Z) ≤ ( f1 (x) − f2 (x)) [( f1 (x) − y)+
( f2 (x) − y)] C (Z) ≤ 2M f1 − f2 C (X ) , it follows that
 αε 
N (G, αε) ≤ N H, .
2M

The desired inequality now follows by taking α = 2/8. 
Remark 3.20 Note that to obtain Theorem 3.3, convexity was only used in the
proof of Lemma 3.16. But the inequality proved in this lemma may hold true
in other situations as well. A case that stands out is when fρ ∈ H. In this case
fH = fρ and the inequality in Lemma 3.16 is trivial.

3.5 References and additional remarks


The exposition in this chapter largely follows [39]. The derivation of
Theorem 3.3 deviates from that paper – hence the slightly different constants.
50 3 Estimating the sample error

The probability inequalities given in Section 3.1 are standard in the literature
on the law of large numbers or central limit theorems (e.g., [21, 103, 132]).
There is a vast literature on further extensions of the inequalities in Section 3.2
stated in terms of empirical covering numbers and other capacity measures [13,
71] (called concentration inequalities) that is outside the scope of this book.
We mention the McDiarmid inequality [83], the Talagrand inequality [126, 22],
and probability inequalities in Banach spaces [99].
The inequalities in Section 3.4 are improvements of those in the Vapnik–
Chervonenkis theory [135]. In particular, Lemmas 3.18 and 3.19 are a covering
number version of an inequality given by Anthony and Shawe-Taylor [8]. The
convexity of the hypothesis space plays a central role in improving the sample
error bounds, as in Theorem 3.3. This can be seen in [10, 12, 74].
A natural question about the sample error is whether upper bounds such as
those in Theorem 3.3 are tight. In this regard, lower bounds called minimax
rates of convergence can be obtained (see, e.g., [148]).
The ideas described in this chapter can be developed for a more general
class of learning algorithms known as empirical risk minimization (ERM) or
structural risk minimization algorithms [17, 60, 110]. We devote the remainder
of this section to briefly describe some aspects of this development.
The greater generality of ERM comes from the fact that algorithms in this
class minimize empirical errors with respect to a loss function ψ : R → R+ .
The loss function measures how the sample value y approximates the function
value f (x) by evaluating ψ(y − f (x)).

Definition 3.21 We say that ψ : R → R+ is a regression loss function if it is


even, convex, and continuous and ψ(0) = 0.

For (x, y) ∈ Z, the value ψ(y−f (x)) is the local error suffered from the use of
f as a model for the process producing y at x. The condition ψ(0) = 0 ensures
a zero error when y = f (x). Examples of regression loss functions include the
least squares loss and Vapnik’s -insensitive norm.

Example 3.22 The least squares loss corresponds to the loss function ψ(t) =
t 2 . For  > 0, the -insensitive norm is the loss function defined by


|t| −  if |t| > 
ψ(t) = ψ (t) =
0 otherwise.
3.5 References and additional remarks 51

Given a regression loss function ψ one defines its associated generalization


error by

Eψ ( f ) = ψ(y − f (x)) dρ
Z

and, given z ∈ Z m as well, its associated empirical error by

1
m
ψ
Ez ( f ) = ψ(yi − f (xi )).
m
i=1

As in our development, given a hypothesis class H, these errors allow one to


ψ ψ
define a target function fH and empirical target function fz and to derive a
decomposition bounding the excess generalization error

 
ψ ψ ψ ψ ψ ψ ψ ψ
E ψ ( fz ) − E ψ ( fH ) ≤ E ψ ( fz ) − Ez ( fz ) + Ez ( fH ) − E ψ ( fH ) .
(3.6)

The second term on the right-hand side of (3.6) converges to zero, with
high probability when m → ∞, and its convergence rate can be estimated by
standard probability inequalities.
The first term on the right-hand side of (3.6) is more
 involved. If one writes
ψ ψ ψ ψ ψ 
ξz (z) = ψ(y −fz (x)), then E ( fz )−Ez ( fz ) = Z ξz (z)d ρ − m1 m i=1 ξz (zi ).
But ξz is not a single random variable; it depends on the sample z. Therefore,
the usual law of large numbers does not guarantee the convergence of this first
term. One major goal of classical statistical learning theory [134] is to estimate
ψ ψ ψ
this error term (i.e., E ψ ( fz )−Ez ( fz )). The collection of ideas and techniques
used to get such estimates, known as the theory of uniform convergence, plays
the role of a uniform law of large numbers. To see why, consider the quantity

ψ
sup |Ez ( f ) − E ψ ( f )|, (3.7)
f ∈H

which bounds the first term on the right-hand side of (3.6), hence providing
(together with bounds for the second term) an estimate for the sample error
ψ ψ
E ψ ( fz )−E ψ ( fH ). The theory of uniform convergence studies the convergence
of this quantity. It characterizes those function sets H such that the quantity (3.7)
tends to zero in probability as m → ∞.
52 3 Estimating the sample error

Definition 3.23 We say that a set H of real-valued functions on a metric space


X is uniform Glivenko–Cantelli (UGC) if for every  > 0,
    
1 
m

lim sup Prob sup sup  f (xi ) − f (x) dµ ≥  = 0,
→+∞ µ m≥ f ∈H m X
i=1

where the supremum is taken with respect to all Borel probability distributions
µ on X , and Prob denotes the probability with respect to the samples x1 , x2 , . . .
independently drawn according to such a distribution µ.
The UGC property can be characterized by the Vγ dimensions of H, as has
been done in [5].
Definition 3.24 Let H be a set of functions from X to [0, 1] and γ > 0. We say
that A ⊂ X is Vγ shattered by H if there is a number α ∈ R with the following
property: for every subset E of A there exists some function fE ∈ H such that
fE (x) ≤ α − γ for every x ∈ A \ E, and fE (x) ≥ α + γ for every x ∈ E. The Vγ
dimension of H, Vγ (H), is the maximal cardinality of a set A ⊂ X that is Vγ
shattered by H.
The concept of Vγ dimension is related to many other quantities involving
capacity of function sets studied in approximation theory or functional analysis:
covering numbers, entropy numbers, VC dimensions, packing numbers, metric
entropy, and others.
The following characterization of the UGC property is given in [5].
Theorem 3.25 Let H be a set of functions from X to [0, 1]. Then H is UGC if
and only if the Vγ dimension of H is finite for every γ > 0.
Theorem 3.25 may be used to verify the convergence of ERM schemes when
the hypothesis space H is a noncompact UGC set such as the union of unit balls
of reproducing kernel Hilbert spaces associated with a set of Mercer kernels. In
particular, for the Gaussian kernels with flexible variances, the UGC property
holds [150].
Many fundamental problems about the UGC property remain to be solved.
As an example, consider the empirical covering numbers.
Definition 3.26 For x = (xi )m ∞
i=1 ∈ X and H ⊂ C (X ), the  -empirical
m

covering number N∞ (H, x, η) is the covering number of H|x := {( f (xi ))m i=1 :
f ∈ H} as a subset of Rm with the following metric. For f , g ∈ C (X ) we take
dx ( f , g) = maxi≤m |f (xi ) − g(xi )|. The metric entropy of H is defined as

Hm (H, η) = sup log N∞ (H, x, η), m ∈ N, η > 0.


x∈X m
3.5 References and additional remarks 53

It is known [46] that a set H of functions from X to [0, 1] is UGC if and only if,
for every η > 0, limm→∞ Hm (H, η)/m = 0. In this case, one has Hm (H, η) =
O(log2 m) for every η > 0. It is conjectured in [5] that Hm (H, η) = O(log m)
is true for every η > 0. A weak form is, Is it true that for some α ∈ [1, 2), every
UGC set H satisfies

Hm (H, η) = O(logα m), ∀η > 0?


4
Polynomial decay of the approximation error

We continue to assume that X is a compact metric space (which may be a


compact subset of Rn ). Let K be a Mercer kernel on X and HK be its induced
RKHS. We observed in Section 2.6 that the space HK,R = IK (BR ) may be
considered as a hypothesis space. Here IK denotes the inclusion IK : HK →
C (X ). When R increases the quantity

A(fρ , R) : = inf E(f ) − E(fρ ) = inf f − fρ L


2
2
f ∈HK,R f K ≤R ρX

(which coincides with the approximation error modulo σρ2 ) decreases. The main
result in this chapter characterizes the measures ρ and kernels K for which this
decay is polynomial, that is, A(fρ , R) = O(R−θ ) with θ > 0.
Theorem 4.1 Suppose ρ is a Borel probability measure on Z. Let K be a Mercer
kernel on X and LK : Lρ2X → Lρ2X be the operator given by

LK f (x) = K(x, t)f (t)d ρX (t), x ∈ X.
X

θ/(4+2θ) θ/(4+2θ)
Let θ > 0. If fρ ∈ Range(LK ), that is, fρ = LK (g) for some g ∈
Lρ2X , then A(fρ , R) ≤ 22+θ g 2+θ R −θ . Conversely, if ρ is nondegenerate
LρX
2 X

and A(fρ , R) ≤ CR−θ for some constants C and θ , then fρ lies in the range of
θ/(4+2θ)−
LK for all  > 0.
Although Theorem 4.1 may be applied to spline kernels (see Section 4.6),
we show in Theorem 6.2 that for C ∞ kernels (e.g., the Gaussian kernel) and
under some conditions on ρX , the approximation error decay cannot reach the
order A(fρ , R) = O(R−θ ) unless fρ is C ∞ itself. Instead, also in Chapter 6,
we derive logarithmic orders like A(fρ , R) = O((log R)−θ ) for analytic kernels
and Sobolev smooth regression functions.

54
4.1 Reminders III 55

4.1 Reminders III


We recall some basic facts about Hilbert spaces.
A sequence {φn }n≥1 in a Hilbert space H is said to be a complete orthonormal
system (or an orthonormal basis) if the following conditions hold:

(i) for all n  = m ≥ 1, φn , φm  = 0,


(ii) for all n ≥ 1, φn = 1, and

(iii) for all f ∈ H , f = ∞ n=1 f , φn φn .

A sequence satisfying (i) and (ii) only is said to be an orthonormal system. The
numbers f , φn  are the Fourier coefficients of f in the basis {φn }n≥1 . It is easy

to see that these coefficients are unique since, if f = an φn , an = f , φn  for
all n ≥ 1.

Theorem 4.2 (Parseval’s theorem) If {φn } is an orthonormal system of a



Hilbert space H , then, for all f ∈ H , n f , φn 2 ≤ f 2 . Equality holds for
all f ∈ H if and only if {φn } is complete. 

We defined compactness of an operator in Section 2.3. We next recall


some other basic properties of linear operators and a main result for operators
satisfying them.

Definition 4.3 A linear operator L : H → H on a Hilbert space H is said to


be self-adjoint if, for all f , g ∈ H , Lf , g = f , Lg. It is said to be positive
(respectively strictly positive) if it is self-adjoint and, for all nontrivial f ∈ H ,
Lf , f  ≥ 0 (respectively Lf , f  > 0).

Theorem 4.4 (Spectral theorem) Let L be a compact self-adjoint linear


operator on a Hilbert space H . Then there exists in H an orthonormal
basis {φ1 , φ2 , . . . } consisting of eigenvectors of L. If λk is the eigenvalue
corresponding to φk , then either the set {λk } is finite or λk → 0 when k → ∞.
In addition, maxk≥1 |λk | = L . If, in addition, L is positive, then λk ≥ 0 for
all k ≥ 1, and if L is strictly positive, then λk > 0 for all k ≥ 1. 

We close this section by defining the power of a self-adjoint, positive,


compact, linear operator. If L is such an operator and θ > 0, then Lθ is the
operator defined by

  
Lθ ck φk = ck λθk φk .
56 4 Polynomial decay of the approximation error

4.2 Operators defined by a kernel


In the remainder of this chapter we consider ν, a finite Borel measure on X , and
Lν2 (X ), the Hilbert space of square integrable functions on X . Note that ν can
be any Borel measure. Significant particular cases are the Lebesgue measure µ
and the marginal measure ρX of Chapter 1.
Let K : X × X → R be a continuous function. Then the linear map

LK : Lν2 (X ) → C (X )

given by the following integral transform



(LK f )(x) = K(x, t)f (t) d ν(t), x ∈ X,

is well defined. Composition with the inclusion C (X ) → Lν2 (X ) yields a linear


operator LK : Lν2 (X ) → Lν2 (X ), which, abusing notation, we also denote by
LK .
The function K is said to be the kernel of LK , and several properties of LK
follow from properties of K. Recall the definitions of CK and Kx introduced in
Section 2.4.

Proposition 4.5 If K is continuous, then LK : Lν2 (X ) → C (X ) is well defined



and compact. In addition, LK ≤ ν(X )C2K . Here ν(X ) denotes the measure
of X .

Proof. To see that LK is well defined, we need to show that LK f is continuous


for every f ∈ Lν2 (X ). To do so, we consider f ∈ Lν2 (X ) and x1 , x2 ∈ X . Then
 
 
|(LK f )(x1 ) − (LK f )(x2 )| =  (K(x1 , t) − K(x2 , t))f (t) d ν(t)


≤ Kx1 − Kx2 Lν2 (X ) f Lν2 (X )


(by Cauchy–Schwarz)
 
≤ ν(X ) max |K(x1 , t) − K(x2 , t)| f Lν2 (X ) .
t∈X
(4.1)

Since K is continuous and X is compact, K is uniformly continuous. This


implies the continuity of LK f .
4.2 Operators defined by a kernel 57


The assertion LK ≤ ν(X )C2K follows from the inequality

|(LK f )(x)| ≤ ν(X ) sup |K(x, t)| f Lν2 (X ) ,


t∈X

which is proved as above.


Finally, to see that LK is compact, let (fn ) be a bounded sequence in Lν2 (X ).

Since LK f ∞ ≤ ν(X )C2K f Lν2 (X ) , we have that (LK fn ) is uniformly
bounded. By (4.1) we have that the sequence (LK fn ) is equicontinuous. By the
Arzelá–Ascoli theorem (Theorem 2.4), (LK fn ) contains a uniformly convergent
subsequence. 

Two more important properties of LK follow from properties of K. Recall


that we say K is positive semidefinite if, for all finite sets {x1 , . . . , xk } ⊂ X , the
k × k matrix K[x] whose (i, j) entry is K(xi , xj ) is positive semidefinite.

Proposition 4.6
(i) If K is symmetric, then LK : Lν2 (X ) → Lν2 (X ) is self-adjoint.
(ii) If, in addition, K is positive semidefinite, then LK is positive.

Proof. Part (i) follows easily from Fubini’s theorem and the symmetry of K.
For Part (ii), just note that

 
(ν(X ))2 
k
K(x, t)f (x)f (t) d ν(x) d ν(t) = lim K(xi , xj )f (xi )f (xj )
X X k→∞ k2
i,j=1

(ν(X ))2 T
= lim fx K[x]fx ,
k→∞ k2

where, for all k ≥ 1, x1 , . . . , xk ∈ X is a set of points conveniently chosen


and fx = (f (x1 ), . . . , f (xk ))T . Since K[x] is positive semidefinite the result
follows. 

Theorem 4.7 Let K : X × X → R be a Mercer kernel. There exists an


orthonormal basis {φ1 , φ2 , . . . } of Lν2 (X ) consisting of eigenfunctions of LK .
If λk is the eigenvalue corresponding to φk , then either the set {λk } is finite or
λk → 0 when k → ∞. In addition, λk ≥ 0 for all k ≥ 1, maxk≥1 λk = LK ,
and if λk  = 0, then φk can be chosen to be continuous on X .

Proof. By Propositions 4.5 and 4.6 LK : Lν2 (X ) → Lν2 (X ) is a self-adjoint,


positive, compact operator. Theorem 4.4 yields all the statements except the
continuity of the φk .
58 4 Polynomial decay of the approximation error

To prove this last fact, use the fact that φk = (1/λk )/LK (φk ) in Lν2 (X ). Then
we can choose the eigenfunction to be (1/λk )LK (φk ), which is a continuous
function. 
In what follows we fix a Mercer kernel K and let {φk ∈ Lν2 (X )} be an
orthonormal basis of Lν2 (X ) consisting of eigenfunctions of LK . We call the
φk orthonormal eigenfunctions. Denote by λk , k ≥ 1, the eigenvalue of LK
corresponding to φk . If λk > 0, the function φk is continuous. In addition, it lies
in the RKHS HK . This is so since

1 1
φk (x) = LK (φk )(x) = K(x, t)φk (t) d ν(t),
λk λk

and, thus, φk can be approximated by elements in the span of {Kx | x ∈ X }.


Theorem 2.9 then shows that φk ∈ HK . In fact,
   
1  1
φk K =  λ K φ
t k (t) d ν(t) 
 ≤ Kt K |φk (t)| d ν(t)
k X K λk X
CK
≤ ν(X ) < ∞.
λk

We shall assume, without loss of generality, that λk ≥ λk+1 for all k ≥ 1.


Using the eigenfunctions {φk }, we can find an orthonormal system of the
RKHS HK .
Theorem 4.8 Let ν be a Borel measure on X , and K : X × X → R a
Mercer kernel. Let λk be the kth eigenvalue of LK , and φk the corresponding

orthonormal eigenfunction. Then { λk φk : λk > 0} forms an orthonormal
system in HK .
Proof. We apply the reproducing property stated in Theorem 2.9. Assume
λi , λj > 0; we have
.  /
1
 λ i φi , λ j φ j  K = √ K(·, y)φi (y) d ν(y), λj φj
λi X K

λj
=√ φi (y)Ky , φj K d ν(y)
λi X

λj λj
=√ φi (y)φj (y) d ν(y) = √ φi , φj Lν2 (X ) = δij .
λi X λi

It follows that { λk φk : λk > 0} forms an orthonormal system in HK . 
4.3 Mercer’s theorem 59

Remark 4.9 When ν is nondegenerate, one can easily see from the definition
of the integral operator that LK has no eigenvalue 0 if and only if HK is dense
in Lν2 (X ).
In fact, the orthonormal system above forms an orthonormal basis of HK
when ρX is nondegenerate. This will be proved in Section 4.4. Toward this end,
we next prove Mercer’s theorem.

4.3 Mercer’s theorem


If f ∈ Lν2 (X ) and {φ1 , φ2 , . . .} is an orthonormal basis of Lν2 (X ), f can

be uniquely written as f = k≥1 ak φk , and, when the basis has infinitely

many functions, the partial sums N k=1 ak φk converge to f in Lν (X ). If this
2

convergence also holds in C (X ), we say that the series converges uniformly to f .


 
Also, we say that a series ak converges absolutely if the series |ak | is
convergent.
When LK has only finitely many positive eigenvalues {λk }m k=1 , K(x, t) =
m
λ φ
k=1 k k (x)φ k (t).

Theorem 4.10 Let ν be a Borel, nondegenerate measure on X , and K : X ×X →


R a Mercer kernel. Let λk be the kth positive eigenvalue of LK , and φk the
corresponding continuous orthonormal eigenfunction. For all x, t ∈ X ,

K(x, t) = λk φk (x)φk (t),
k≥1

where the convergence is absolute (for each x, t ∈ X × X ) and uniform


(on X × X ).

Proof. By Theorem 4.8, the sequence { λk φk }k≥1 is an orthonormal system
of HK . Let x ∈ X . The Fourier coefficients of the function Kx ∈ HK with
respect to this system are

 λk φk , Kx K = λk φk (x),

where Theorem 2.9(iii) is used. Then, by Parseval’s theorem, we know that


 
| λk φk (x)|2 = λk |φk (x)|2 ≤ Kx 2
K = K(x, x) ≤ C2K .
k≥1 k≥1


k≥1 λk |φk (x)| converges. This is true for each point x ∈ X .
Hence the series 2
60 4 Polynomial decay of the approximation error

Now we fix a point x ∈ X . When the basis {φk }k≥1 has infinitely many
functions, the estimate above, together with the Cauchy–Schwarz inequality,
tells us that for each t ∈ X ,
m+  m+ 1/2 m+ 1/2
   
 
 λk φk (x)φk (t) ≤ λk |φk (t)| 2
λk |φk (x)| 2
 
k=m k=m k=m
m+ 1/2

≤ CK λk |φk (x)|2 ,
k=m

which tends to zero uniformly (for t ∈ X ). Hence the series k≥1 λk φk (x)φk (t)
(as a function of t) converges absolutely and uniformly on X to a continuous
function gx . On the other hand, as a function in Lν2 (X ), Kx can be expanded by
means of an orthonormal basis consisting of {φk } and an orthonormal basis 
of the nullspace of LK . For f ∈ ,

Kx , f Lν2 (X ) = K(x, y)f (y) d ν = 0.
X

Hence the expansion of Kx is


  
Kx = Kx , φk Lν2 (X ) φk = LK (φk )(x)φk = λk φk (x)φk .
k≥1 k≥1 k≥1

Thus, as functions in Lν2 (X ), Kx = gx . Since ν is nondegenerate, Kx and


gx are equal on a dense subset of X . But both are continuous functions and,
therefore, must be equal for each t ∈ X . It follows that for any x, t ∈ X , the

series k≥1 λk φk (x)φk (t) converges to K(x, t). Since the limit function K(x, t)
is continuous, we know that the series must converge uniformly on X × X . 

Corollary 4.11 The sum λk is convergent and
 
λk = K(x, x) ≤ ν(X )C2K .
k≥1 X

Moreover, for all k ≥ 1, λk ≤ ν(X )C2K /k.



Proof. Taking x = t in Theorem 4.10, we get K(x, x) = k≥1 λk φk (x) .
2

Integrating on both sides of this equality gives


  
λk φk (x)2 d ν = K(x, x) d ν ≤ ν(X )C2K .
k≥1 X X
4.4 RKHSs revisited 61


However, since {φ1 , φ2 , . . .} is an orthonormal basis, φk2 = 1 for all k ≥ 1
and the first statement follows. The second statement holds true because the

assumption λk ≥ λj for j > k tells us that kλk ≤ kj=1 λj ≤ ν(X )C2K . 

4.4 RKHSs revisited



In this section we show that the RKHS HK has an orthonormal basis { λk φk }
derived from the integral operator LK (and thus dependent on the measure ν).
Theorem 4.12 Let ν be a Borel, nondegenerate measure on X , and K : X ×X →
R a Mercer kernel. Let λk be the kth positive eigenvalue of LK , and φk the

corresponding continuous orthonormal eigenfunction. Then { λk φk : λk > 0}
is an orthonormal basis of HK .

Proof. By Theorem 4.8, { λk φk : λk > 0} is an orthonormal system in HK .
To prove the completeness we need only show that for each x ∈ X , Kx lies
in the closed span of this orthonormal system. Complete this system to form

an orthonormal basis { λk φk : λk > 0} ∪ {ψj : j ≥ 0} of HK . By Parseval’s
theorem,
0 12 
Kx 2
K = K x , λk φ k + Kx , ψj 2K .
K
k j

Therefore, to show that Kx lies in the closed span of { λk φk : λk > 0}, it is
enough to require that

Kx 2K = Kx , λk φk 2K ,
k

that is, that


 
K(x, x) = λk Kx , φk 2K = λk φk (x)2 .
k k

Theorem 4.10 with x = t yields this identity for each x ∈ X . 


Since the RKHS HK is independent of the measure ν, it follows that when ν is nondegenerate and dim HK = ∞, LK has infinitely many positive eigenvalues λk , k ≥ 1, and
$$H_K = \Big\{ f = \sum_{k=1}^{\infty} a_k \sqrt{\lambda_k}\,\phi_k : \{a_k\}_{k=1}^{\infty} \in \ell^2 \Big\}.$$
When dim HK = m < ∞, LK has only m positive (repeated) eigenvalues. In this case,
$$H_K = \Big\{ f = \sum_{k=1}^{m} a_k \sqrt{\lambda_k}\,\phi_k : (a_1, \ldots, a_m) \in \mathbb{R}^m \Big\}.$$

In both cases, the map
$$L_K^{1/2} : L^2_\nu(X) \to H_K, \qquad \sum a_k \phi_k \mapsto \sum a_k \sqrt{\lambda_k}\,\phi_k$$
defines an isomorphism of Hilbert spaces between the closed span of {φk : λk > 0} in Lν2 (X ) and HK . In addition, considered as an operator on Lν2 (X ), $L_K^{1/2}$ is the square root of LK in the sense that $L_K = L_K^{1/2} \circ L_K^{1/2}$ (hence the notation $L_K^{1/2}$). This yields the following corollary.

Corollary 4.13 Let ν be a Borel, nondegenerate measure on X , and K : X × X → R a Mercer kernel. Then $H_K = L_K^{1/2}(L^2_\nu(X))$. That is, every function f ∈ HK can be written as $f = L_K^{1/2} g$ for some $g \in L^2_\nu(X)$ with $\|f\|_K = \|g\|_{L^2_\nu(X)}$. □

A different approach to the orthonormal basis $\{\sqrt{\lambda_k}\,\phi_k\}$ is to regard it as a function on X with values in ℓ2 .

Theorem 4.14 The map
$$\Phi : X \to \ell^2, \qquad x \mapsto \big(\sqrt{\lambda_k}\,\phi_k(x)\big)_{k\ge 1}$$
is well defined and continuous, and satisfies
$$K(x,t) = \langle \Phi(x), \Phi(t)\rangle_{\ell^2}.$$

Proof. For every x ∈ X , by Mercer's theorem $\sum \lambda_k\phi_k^2(x)$ converges to K(x, x). This shows that Φ(x) ∈ ℓ2 . Also by Mercer's theorem, for every x, t ∈ X ,
$$K(x,t) = \sum_{k\ge 1} \lambda_k\phi_k(x)\phi_k(t) = \langle \Phi(x), \Phi(t)\rangle_{\ell^2}.$$
It remains only to prove that Φ : X → ℓ2 is continuous. For any x, t ∈ X ,
$$\|\Phi(x)-\Phi(t)\|_{\ell^2}^2 = \langle \Phi(x),\Phi(x)\rangle_{\ell^2} + \langle \Phi(t),\Phi(t)\rangle_{\ell^2} - 2\langle \Phi(x),\Phi(t)\rangle_{\ell^2} = K(x,x) + K(t,t) - 2K(x,t),$$
which tends to zero when x tends to t by the continuity of K. □

4.5 Characterizing the approximation error in RKHSs


In this section we prove Theorem 4.1. It actually follows from a more
general characterization of the decay of the approximation error given using
interpolation spaces.

Definition 4.15 Let (B, ‖·‖) and (H, ‖·‖H ) be Banach spaces and assume H is a subspace of B. The K-functional K : B × (0, ∞) → R of the pair (B, H) is defined, for a ∈ B and t > 0, by
$$K(a,t) := \inf_{b\in H}\big\{\|a-b\| + t\|b\|_H\big\}. \qquad (4.2)$$

It can easily be seen that for fixed a ∈ B, the function K(a, t) of t is continuous,
nondecreasing, and bounded by a (take b = 0 in (4.2)). When H is dense in
B, K(a, t) tends to zero as t → 0. The interpolation spaces for the pair (B, H)
are defined in terms of the convergence rate of this function.
For 0 < r < 1, the interpolation space (B, H)r consists of all the elements
a ∈ B such that the norm
$$\|a\|_r := \sup_{t>0}\big\{K(a,t)/t^r\big\}$$

is finite.

Theorem 4.16 Let (B, ‖·‖) be a Banach space, and (H, ‖·‖H ) a subspace, such that ‖b‖ ≤ C0 ‖b‖H for all b ∈ H and a constant C0 > 0. Let 0 < r < 1. If a ∈ (B, H)r , then, for all R > 0,
$$A(a,R) := \inf_{\|b\|_H\le R}\|a-b\|^2 \le \|a\|_r^{2/(1-r)}\, R^{-2r/(1-r)}.$$
Conversely, if $A(a,R) \le CR^{-2r/(1-r)}$ for all R > 0, then a ∈ (B, H)r and $\|a\|_r \le 2C^{(1-r)/2}$.

Proof. Consider the function f (t) := K(a, t)/t. It is continuous on (0, +∞). Since K(a, t) ≤ ‖a‖, inf t>0 { f (t)} = 0.
Fix R > 0. If supt>0 { f (t)} ≥ R, then, for any 0 < ε < 1, there exists some tR,ε ∈ (0, +∞) such that
$$f(t_{R,\varepsilon}) = \frac{K(a, t_{R,\varepsilon})}{t_{R,\varepsilon}} = (1-\varepsilon)R.$$
By the definition of the K-functional, we can find b ∈ H such that
$$\|a-b\| + t_{R,\varepsilon}\|b\|_H \le K(a, t_{R,\varepsilon})/(1-\varepsilon).$$
It follows that
$$\|b\|_H \le \frac{K(a, t_{R,\varepsilon})}{(1-\varepsilon)\,t_{R,\varepsilon}} = R$$
and
$$\|a-b\| \le \frac{K(a, t_{R,\varepsilon})}{1-\varepsilon}.$$
But the definition of the norm ‖a‖r implies that
$$\frac{K(a, t_{R,\varepsilon})}{t_{R,\varepsilon}^{\,r}} \le \|a\|_r.$$
Therefore
$$\|a-b\| \le \left[\frac{K(a, t_{R,\varepsilon})}{(1-\varepsilon)\,t_{R,\varepsilon}}\right]^{-r/(1-r)}\left[\frac{K(a, t_{R,\varepsilon})}{(1-\varepsilon)\,t_{R,\varepsilon}^{\,r}}\right]^{1/(1-r)} \le \left(\frac{1}{1-\varepsilon}\right)^{1/(1-r)} R^{-r/(1-r)}\,\big(\|a\|_r\big)^{1/(1-r)}.$$
Thus,
$$A(a,R) \le \inf_{0<\varepsilon<1}\|a-b\|^2 \le \|a\|_r^{2/(1-r)}\,R^{-2r/(1-r)};$$
that is, the desired error estimate holds in this case.


Turn now to the case where supt>0 { f (t)} < R. Then, for any 0 < ε < 1 − supu>0 { f (u)}/R and any t > 0, there exists some bt,ε ∈ H such that
$$\|a - b_{t,\varepsilon}\| + t\|b_{t,\varepsilon}\|_H \le K(a,t)/(1-\varepsilon).$$
This implies that
$$\|b_{t,\varepsilon}\|_H \le \frac{K(a,t)}{(1-\varepsilon)\,t} \le \frac{1}{1-\varepsilon}\sup_{u>0}\{f(u)\} < R$$
and
$$\|a - b_{t,\varepsilon}\| \le \frac{K(a,t)}{1-\varepsilon}.$$
Hence
$$A(a,R) \le \inf_{t>0}\|a - b_{t,\varepsilon}\|^2 \le \Big(\inf_{t>0}\{K(a,t)\}/(1-\varepsilon)\Big)^2 \le \Big(\inf_{t>0}\big\{\|a\|_r\,t^r\big\}/(1-\varepsilon)\Big)^2 = 0.$$
This again proves the desired error estimate. Hence the first statement of the theorem holds.
Conversely, suppose that $A(a,R) \le CR^{-2r/(1-r)}$ for all R > 0. Let t > 0. Choose $R_t = (\sqrt{C}/t)^{1-r}$. Then, for any ε > 0, we can find bt,ε ∈ H such that
$$\|b_{t,\varepsilon}\|_H \le R_t \quad\text{and}\quad \|a - b_{t,\varepsilon}\|^2 \le C R_t^{-2r/(1-r)}(1+\varepsilon)^2.$$
It follows that
$$K(a,t) \le \|a - b_{t,\varepsilon}\| + t\|b_{t,\varepsilon}\|_H \le \sqrt{C}\,R_t^{-r/(1-r)}(1+\varepsilon) + tR_t \le 2(1+\varepsilon)\,C^{(1-r)/2}\,t^r.$$
Since ε can be arbitrarily small, we have
$$K(a,t) \le 2C^{(1-r)/2}\,t^r.$$
Thus, $\|a\|_r = \sup_{t>0}\{K(a,t)/t^r\} \le 2C^{(1-r)/2} < \infty$. □


The proof shows that if a ∈ H, then A(a, R) = 0 for R > ‖a‖H . In addition, $A(a,R) = O(R^{-2r/(1-r)})$ if and only if a ∈ (B, H)r . A special case of Theorem 4.16 characterizes the decay of A for RKHSs.
Corollary 4.17 Suppose ρ is a Borel probability measure on Z. Let θ > 0. Then $A(f_\rho, R) = O(R^{-\theta})$ if and only if $f_\rho \in (L^2_{\rho_X}, H_K^+)_{\theta/(2+\theta)}$, where $H_K^+$ is the closed subspace of HK spanned by the orthonormal system $\{\sqrt{\lambda_k}\,\phi_k : \lambda_k > 0\}$ given in Theorem 4.8 with the measure ρX .
Proof. Take $B = L^2_{\rho_X}$ and $H = H_K^+$ with the norm inherited from HK . Then $\|b\| \le \sqrt{\lambda_1}\,\|b\|_H$ for all b ∈ H. The statement now follows from Theorem 4.16 taking r = θ/(2 + θ ). □
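As an illustration of the quantity A( fρ , R) appearing in Theorem 4.16 and Corollary 4.17, the following Python sketch (an added numerical aside, not part of the original argument) computes A( fρ , R) = inf{‖fρ − f ‖² : ‖f ‖K ≤ R} in the eigenbasis: writing fρ = Σ ck φk and f = Σ bk φk with ‖f ‖²K = Σ bk²/λk , the constrained minimizer is bk = ck λk /(λk + µ) for a Lagrange multiplier µ ≥ 0, found here by bisection. The eigenvalue decay and the coefficients ck are made-up test data.

import numpy as np

# Hypothetical test data: polynomially decaying eigenvalues and coefficients of f_rho.
k = np.arange(1, 2001)
lam = k**-2.0                                 # eigenvalues lambda_k of L_K
c = k**-1.2                                   # coefficients of f_rho in the basis {phi_k}

def A(R):
    # A(f_rho, R) = inf_{||f||_K <= R} ||f_rho - f||^2 via a Lagrange multiplier mu >= 0.
    if np.sum(c**2 / lam) <= R**2:            # f_rho itself has RKHS norm <= R
        return 0.0
    h = lambda mu: np.sum(c**2 * lam / (lam + mu)**2)   # ||f_mu||_K^2, decreasing in mu
    lo, hi = 0.0, 1.0
    while h(hi) > R**2:                       # bracket the multiplier
        hi *= 2.0
    for _ in range(80):                       # bisection on h(mu) = R^2
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) > R**2 else (lo, mid)
    mu = 0.5 * (lo + hi)
    return np.sum((c * mu / (lam + mu))**2)   # squared L^2 distance to the best f

for R in [1.0, 2.0, 4.0, 8.0]:
    print(R, A(R))                            # monotone decay of A(f_rho, R) in R

The printed values decrease as R grows, which is the qualitative behavior whose rate Theorem 4.1 quantifies.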

Remark 4.18 When ρX is nondegenerate, HK+ = HK by Theorem 4.12.



Recall that the space $L_K^r(L^2_\nu(X))$ is $\big\{\sum_{\lambda_k>0} a_k\lambda_k^r\phi_k : \{a_k\}\in\ell^2\big\}$ with the norm
$$\Big\|\sum_{\lambda_k>0} a_k\lambda_k^r\phi_k\Big\|_{L_K^r(L^2_\nu(X))} = \Big(\sum_{\lambda_k>0} |a_k|^2\Big)^{1/2}.$$

Proof of Theorem 4.1 Take $H_K^+$ as in Corollary 4.17 and r = θ/(2 + θ ). If $f_\rho \in \mathrm{Range}(L_K^{\theta/(4+2\theta)})$, then $f_\rho = L_K^{\theta/(4+2\theta)} g$ for some $g \in L^2_{\rho_X}$. Without loss of generality, we may take $g = \sum_{\lambda_k>0} a_k\phi_k$. Then $\|g\|^2 = \sum_{\lambda_k>0} a_k^2 < \infty$ and $f_\rho = \sum_{\lambda_k>0} a_k\lambda_k^{\theta/(4+2\theta)}\phi_k$.
We show that $f_\rho \in (L^2_{\rho_X}, H_K^+)_r$. Indeed, for every $t \le \sqrt{\lambda_1}$, there exists some N ∈ N such that
$$\lambda_{N+1} < t^2 \le \lambda_N.$$
Choose $f = \sum_{k=1}^{N} a_k\lambda_k^{\theta/(4+2\theta)}\phi_k \in H_K^+$. We can see from Theorem 4.8 that
$$\|f\|_K^2 = \Big\|\sum_{k=1}^{N} a_k\lambda_k^{(\theta/(4+2\theta))-1/2}\sqrt{\lambda_k}\,\phi_k\Big\|_K^2 = \sum_{k=1}^{N} a_k^2\lambda_k^{-2/(2+\theta)} \le \lambda_N^{-2/(2+\theta)}\|g\|^2.$$
In addition,
$$\|f_\rho - f\|_{L^2_{\rho_X}}^2 = \Big\|\sum_{k>N} a_k\lambda_k^{\theta/(4+2\theta)}\phi_k\Big\|_{L^2_{\rho_X}}^2 = \sum_{k>N} a_k^2\lambda_k^{\theta/(2+\theta)} \le \lambda_{N+1}^{\theta/(2+\theta)}\|g\|^2.$$
Let K be the K-functional for the pair $(L^2_{\rho_X}, H_K^+)$. Then
$$K(f_\rho, t) \le \|f_\rho - f\|_{L^2_{\rho_X}} + t\|f\|_K \le \lambda_{N+1}^{\theta/(4+2\theta)}\|g\| + t\lambda_N^{-1/(2+\theta)}\|g\|.$$

By the choice of N , we have
$$K(f_\rho, t) \le \|g\|\,2t^{\theta/(2+\theta)} = 2\|g\|\,t^r.$$
Since $K(f_\rho, t) \le \|f_\rho\|_{L^2_{\rho_X}} \le \lambda_1^{r/2}\|g\|$, we can also see that for $t > \sqrt{\lambda_1}$, $K(f_\rho, t)/t^r \le \|g\|$ holds. Therefore, $f_\rho \in (L^2_{\rho_X}, H_K^+)_r$ and $\|f_\rho\|_r \le 2\|g\|$. It follows from Theorem 4.16 that
$$A(f_\rho, R) \le \inf_{f\in H_K^+,\ \|f\|_K\le R}\|f - f_\rho\|_{L^2_{\rho_X}}^2 \le \big(2\|g\|\big)^{2/(1-r)}R^{-2r/(1-r)} = 2^{2+\theta}\,\|g\|^{2+\theta}\,R^{-\theta}.$$
Conversely, if ρX is nondegenerate and $A(f_\rho, R) \le CR^{-\theta}$ for some constant C and all R > 0, then Theorem 4.12 states that $H_K^+ = H_K$. This, together with Theorem 4.16 and the polynomial decay of $A(f_\rho, R)$, implies that $f_\rho \in (L^2_{\rho_X}, H_K)_r$ and $\|f_\rho\|_r \le 2C^{1/(2+\theta)}$.
Let m ∈ N. There exists a function fm ∈ HK such that
$$\|f_\rho - f_m\|_{L^2_{\rho_X}} + 2^{-m}\|f_m\|_K \le 4C^{1/(2+\theta)}\,2^{-mr}.$$
Then
$$\|f_\rho - f_m\|_{L^2_{\rho_X}} \le 4C^{1/(2+\theta)}\,2^{-mr} \quad\text{and}\quad \|f_m\|_K \le 4C^{1/(2+\theta)}\,2^{m(1-r)}.$$

Write $f_\rho = \sum_k c_k\phi_k$ and $f_m = \sum_k b_k^{(m)}\phi_k$. Then, for all 0 < ε < r,
$$\sum_{2^{-2m}\le\lambda_k<2^{-2(m-1)}} \frac{c_k^2}{\lambda_k^{r-2\varepsilon}} \le 2\sum_{2^{-2m}\le\lambda_k<2^{-2(m-1)}} \frac{\big(c_k - b_k^{(m)}\big)^2}{\lambda_k^{r-2\varepsilon}} + 2\sum_{2^{-2m}\le\lambda_k<2^{-2(m-1)}} \frac{\big(b_k^{(m)}\big)^2}{\lambda_k^{r-2\varepsilon}},$$
which can be bounded by
$$2^{1+2m(r-2\varepsilon)}\|f_\rho - f_m\|_{L^2_{\rho_X}}^2 + 2^{1+2(1-m)(1-r+2\varepsilon)}\|f_m\|_K^2 \le C^{2/(2+\theta)}2^{5-4m\varepsilon} + C^{2/(2+\theta)}2^{5+2(1-r)+4\varepsilon(1-m)}.$$
Therefore,
$$\sum_{\lambda_k<1} \frac{c_k^2}{\lambda_k^{r-2\varepsilon}} \le \frac{160}{16^{\varepsilon}-1}\,C^{2/(2+\theta)} < \infty.$$
This means that $f_\rho \in \mathrm{Range}(L_K^{\theta/(4+2\theta)-\varepsilon})$. □

4.6 An example
In this section we describe a simple example for the approximation error in
RKHSs.

Example 4.19 Let X = [−1, 1], and let K be the spline kernel given in Example 2.15, that is, K(x, y) = max{1 − |x − y|/2, 0}. We claim that HK is the Sobolev space H 1 (X ) with the following equivalent inner product:
$$\langle f, g\rangle_K = \langle f', g'\rangle_{L^2[-1,1]} + \tfrac12\big(f(-1)+f(1)\big)\cdot\big(g(-1)+g(1)\big). \qquad (4.3)$$
Assume now that ρX is the Lebesgue measure. For θ > 0 and a function $f_\rho \in L^2[-1,1]$, we also claim that $A(f_\rho, R) = O(R^{-\theta})$ if and only if $\|f_\rho(x+t) - f_\rho(x)\|_{L^2[-1,1-t]} = O(t^{\theta/(2+\theta)})$.
To prove the first claim, note that we know from Example 2.15 that K is a Mercer kernel. Also, Kx ∈ H 1 (X ) for any x ∈ X . To show that (4.3) is the inner product in HK , it is sufficient to prove that $\langle f, K_x\rangle_K = f(x)$ for any f ∈ H 1 (X ) and x ∈ X . To see this, note that $K_x' = \tfrac12\chi_{[-1,x)} - \tfrac12\chi_{(x,1]}$ and $K_x(-1) + K_x(1) = 1$. Then,
$$\langle f, K_x\rangle_K = \tfrac12\int_{-1}^{x} f'(y)\,dy - \tfrac12\int_{x}^{1} f'(y)\,dy + \tfrac12\big(f(-1)+f(1)\big) = f(x).$$
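The reproducing identity just verified can also be checked numerically. The short Python sketch below (an added illustration; the grid size and the test function are arbitrary choices, and the quadrature is a crude Riemann sum) discretizes the inner product (4.3) and compares ⟨ f , Kx ⟩K with f (x).

import numpy as np

# Hypothetical check of <f, K_x>_K = f(x) for K(x,y) = max{1 - |x-y|/2, 0} on [-1,1].
y = np.linspace(-1.0, 1.0, 4001)
dy = y[1] - y[0]
f = np.sin(2.0 * y) + y**2                  # a smooth test function
fp = np.gradient(f, y)                      # f'

x0 = 0.3
Kx = 1.0 - np.abs(x0 - y) / 2.0             # K_x(y); the max is inactive on [-1,1]
Kxp = np.gradient(Kx, y)                    # K_x' = 1/2 on [-1,x0), -1/2 on (x0,1]

inner = np.sum(fp * Kxp) * dy + 0.5 * (f[0] + f[-1]) * (Kx[0] + Kx[-1])
print(inner, np.sin(2 * x0) + x0**2)        # the two numbers should nearly agree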

To prove the second claim, we use Theorem 4.16 with $B = L^2(X)$ and $H = H_K = H^1(X)$. Our conclusion follows from the following statement: for 0 < r < 1 and $f \in L^2$, $K(f,t) = O(t^r)$ if and only if $\|f(x+t) - f(x)\|_{L^2[-1,1-t]} = O(t^r)$.
To verify the sufficiency of this statement, define the function ft : X → R by
$$f_t(x) = \frac{1}{t}\int_0^t f(x+h)\,dh.$$

Taking norms of functions on the variable x, we can see that
$$\|f - f_t\|_{L^2} = \Big\|\frac{1}{t}\int_0^t \big(f(x+h)-f(x)\big)\,dh\Big\|_{L^2} \le \frac{1}{t}\int_0^t \|f(x+h)-f(x)\|_{L^2}\,dh \le \frac{1}{t}\int_0^t Ch^r\,dh = \frac{C}{r+1}\,t^r$$
and
$$\|f_t\|_{H^1(X)} = \Big\|\frac{1}{t}\big(f(x+t)-f(x)\big)\Big\|_{L^2} \le Ct^{r-1}.$$
Hence
$$K(f,t) \le \|f - f_t\|_{L^2} + t\|f_t\|_{H^1(X)} \le \big(C/(r+1) + C\big)t^r.$$

Conversely, if $K(f,t) \le Ct^r$, then, for any $g \in H^1(X)$, we have
$$\|f(x+t)-f(x)\|_{L^2} = \|f(x+t)-g(x+t)+g(x+t)-g(x)+g(x)-f(x)\|_{L^2},$$
which can be bounded by
$$2\|f-g\|_{L^2} + \Big\|\int_0^t g'(x+h)\,dh\Big\|_{L^2} \le 2\|f-g\|_{L^2} + \int_0^t \|g'(x+h)\|_{L^2}\,dh \le 2\|f-g\|_{L^2} + t\|g\|_{H^1(X)}.$$
Taking the infimum over $g \in H^1(X)$, we see that
$$\|f(x+t)-f(x)\|_{L^2} \le 2K(f,t) \le 2Ct^r.$$
This proves the statement and, with it, the second claim.

4.7 References and additional remarks


For a proof of the spectral theorem for compact operators see, for example, [73]
and Section 4.10 of [40].
Mercer’s theorem was originally proved [85] for X = [0, 1] and ν, the
Lebesgue measure. Proofs for this simple case can also be found in [63, 73].

Theorems 4.10 and 4.12 are for general nondegenerate measures ν on a compact
space X . For an extension to a noncompact space X see [123].
The map Φ in Theorem 4.14 is called the feature map in the literature on learning theory [37, 107, 134]. More general characterizations for the decay of the approximation error being of type O(ϕ(R)) with ϕ decreasing on (0, +∞) can be derived from the literature on approximation theory (e.g., [87, 94]) by means of K-functionals and moduli of smoothness. For interpolation spaces see [16].
RKHSs generated by general spline kernels are described in [137]. In the
proof of Example 4.19 we have used a standard technique in approximation
theory (see [78]). Here the function needs to be extended outside [−1, 1] for
defining ft , or the norm L 2 should be taken on [−1, 1 − t]. For simplicity, we
have omitted this discussion.
The characterization of the approximation error described in Section 4.5 is
taken from [113].
Consider the approximation for the ERM scheme with a general loss function ψ in Section 3.5. The target function $f_H^\psi$ minimizes the generalization error $E^\psi$ over H. If we minimize instead over the set of all measurable functions we obtain a version (w.r.t. ψ) of the regression function.
Definition 4.20 Given the regression loss function ψ, the ψ-regression function is given by
$$f_\rho^\psi(x) = \operatorname*{arg\,min}_{t\in\mathbb{R}} \int_Y \psi(y-t)\,d\rho(y|x), \qquad x \in X.$$

The approximation error (w.r.t. ψ) associated with the hypothesis space H is defined as
$$E^\psi(f_H^\psi) - E^\psi(f_\rho^\psi) = \min_{f\in H} E^\psi(f) - E^\psi(f_\rho^\psi).$$

Proposition 4.21 Let ψ be a regression loss function. If Δ ≥ 0 is the largest zero of ψ and |y| ≤ M almost surely, then, for all x ∈ X , $f_\rho^\psi(x) \in [-M-\Delta, M+\Delta]$.

The approximation error can be estimated as follows [141].

Theorem 4.22 Assume $|y - f(x)| \le M$ and $|y - f_\rho^\psi(x)| \le M$ almost surely.

(i) If ψ is a regression loss function satisfying, for some 0 < s ≤ 1,
$$\sup_{t,t'\in[-M,M]} \frac{|\psi(t)-\psi(t')|}{|t-t'|^s} = C < \infty, \qquad (4.4)$$
then
$$E^\psi(f) - E^\psi(f_\rho^\psi) \le C\int_X |f(x)-f_\rho^\psi(x)|^s\,d\rho_X \le C\,\|f-f_\rho^\psi\|_{L^1_{\rho_X}}^{s} \le C\,\big(\|f-f_\rho^\psi\|_{L^2_{\rho_X}}^{2}\big)^{s/2}.$$
(ii) If ψ is $C^1$ on [−M , M ] and its derivative satisfies (4.4), then
$$E^\psi(f) - E^\psi(f_\rho^\psi) \le C\,\|f-f_\rho^\psi\|_{L^{1+s}_{\rho_X}}^{1+s} \le C\,\big(\|f-f_\rho^\psi\|_{L^2_{\rho_X}}^{2}\big)^{(1+s)/2}.$$
(iii) If $\psi''(u) \ge c > 0$ for every u ∈ [−M , M ], then
$$E^\psi(f) - E^\psi(f_\rho^\psi) \ge \frac{c}{2}\,\|f-f_\rho^\psi\|_{L^2_{\rho_X}}^{2}. \qquad\square$$
5 Estimating covering numbers

The bounds for the sample error described in Chapter 3 are in terms of, among
other quantities, some covering numbers. In this chapter, we provide estimates
for these covering numbers when we take a ball in an RKHS as a hypothesis
space. Our estimates are given in terms of the regularity of the kernel. As a
particular case, we obtain the following.

Theorem 5.1 Let X be a compact subset of Rn , and Diam(X ) := maxx,y∈X ‖x − y‖ its diameter.

(i) If K ∈ C s (X × X ) for some s > 0 and X has piecewise smooth boundary, then there is C > 0 depending on X and s only such that
$$\ln N\big(I_K(B_R), \eta\big) \le C(\mathrm{Diam}(X))^n\,\|K\|_{C^s(X\times X)}^{n/s}\left(\frac{R}{\eta}\right)^{2n/s}, \qquad \forall\, 0 < \eta \le R/2.$$

(ii) If $K(x,y) = \exp\big(-\|x-y\|^2/\sigma^2\big)$ for some σ > 0, then, for all 0 < η ≤ R/2,
$$\ln N\big(I_K(B_R), \eta\big) \le n\left(32 + \frac{640\,n(\mathrm{Diam}(X))^2}{\sigma^2}\right)^{n+1}\left(\ln\frac{R}{\eta}\right)^{n+1}.$$
If, moreover, X contains a cube in the sense that X ⊇ x∗ + [−ℓ/2, ℓ/2]n for some x∗ ∈ X and ℓ > 0, then, for all 0 < η ≤ R/2,
$$\ln N\big(I_K(B_R), \eta\big) \ge C_1\left(\ln\frac{R}{\eta}\right)^{n/2}.$$
Here C1 is a positive constant depending only on σ and ℓ.


Part (i) of Theorem 5.1 follows from Theorem 5.5 and Lemma 5.6. It shows
how the covering number decreases as the index s of the Sobolev smooth kernel
increases. A case where the hypothesis of Part (ii) applies is that of the box spline
kernels described in Example 2.17. We show this is so in Proposition 5.25.
When the kernel is analytic, hence better than Sobolev smooth for any index s > 0, one can see from Part (i) that $\ln N\big(I_K(B_R), \eta\big)$ is bounded by a constant times $(R/\eta)^{\varepsilon}$ for any ε > 0. Hence one would expect a bound such as $\big(\ln(R/\eta)\big)^s$ for some s. This is exactly what Part (ii) of Theorem 5.1 shows for Gaussian kernels. The lower bound stated in Part (ii) also tells us that the upper bound is almost sharp. The proof for Part (ii) is given in Corollaries 5.14 and 5.24 together with Proposition 5.13, where an explicit formula for the constant C1 can be found.
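To get a numerical feel for why a logarithmic-power bound is eventually much smaller than any power bound, the following lines of Python (an added illustration with arbitrary parameter choices; it does not use the constants of Theorem 5.1) compare the two growth rates as η decreases.

import math

# Illustrative comparison: (ln(R/eta))^(n+1) versus (R/eta)^eps for small eta.
n, eps, R = 2, 0.5, 1.0
for eta in [1e-2, 1e-4, 1e-8, 1e-12, 1e-16]:
    log_bound = math.log(R / eta) ** (n + 1)
    power_bound = (R / eta) ** eps
    print(f"eta={eta:.0e}  (ln(R/eta))^(n+1)={log_bound:12.1f}  (R/eta)^eps={power_bound:12.1f}")

For moderate η the two quantities are comparable, but as η → 0 the power bound grows without comparison faster, which is why the estimate in Part (ii) is a genuine improvement for analytic kernels.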

5.1 Reminders IV
To prove the main results of this chapter, we use some basic knowledge from
function spaces and approximation theory.
Approximation theory studies the approximation of functions by functions in
some “good” family – for example, polynomials, splines, wavelets, radial basis
functions, ridge functions. The quality of the approximation usually depends
on, in addition to the size of the approximating family, the regularity of the
approximated function. In this section, we describe some common measures of
regularity for functions.
(I) Consider functions on an arbitrary metric space (X, d ). Let 0 < s ≤ 1. We
say that a continuous function f on X is Lipschitz-s when there exists a constant
C > 0 such that for all x, y ∈ X ,

| f (x) − f (y)| ≤ C(d (x, y))s .

We denote by Lip(s) the space of all Lipschitz-s functions with the norm

f Lip(s) := | f |Lip(s) + f C (X ) ,

where | |Lip(s) is the seminorm

| f (x) − f (y)|
| f |Lip(s) := sup .
x=y∈X (d (x, y))s

This is a Banach space. The regularity of a function f ∈ Lip(s) is measured by


the index s. The bigger the index s, the higher the regularity of f .

(II) When X is a subset of a Euclidean space Rn , we can consider a more


general measure of regularity. This can be done by means of various orders of
divided differences.
Let X be a closed subset of Rn and f : X → R. For r ∈ N, t ∈ Rn , and x ∈ X
such that x, x + t, . . . , x + rt ∈ X , define the divided difference


r
r
rt f (x) := (−1)r−j f (x + jt).
j
j=0

In particular, when r = 1, 1t f (x) = f (x + t) − f (x). Divided differences can


be used to characterize various types of function spaces. Let

Xr,t = {x ∈ X | x, x + t, . . . , x + rt ∈ X }.

For 0 < s < r and 1 ≤ p < ∞, the generalized Lipschitz space


Lip∗(s, L p (X )) consists of functions f in L p (X ) for which the seminorm

 1/p
−s
| f |Lip∗(s,L p (X )) := sup t |rt f (x)|p dx
t∈Rn Xr,t

is finite. This is a Banach space with the norm

f ∗
Lip (s,L p (X ))
:= | f |Lip∗(s,L p (X )) + f L p (X ) .

The space Lip∗(s, C (X )) is defined in a similar way, taking

−s
| f |Lip∗(s,C (X )) := sup t sup |rt f (x)|
t∈Rn x∈Xr,t

and

f ∗
Lip (s,C (X ))
:= | f |Lip∗(s,C (X )) + f C (X ) .

Clearly, f Lip∗(s,L p (X )) ≤ (µ(X ))1/p f Lip∗(s,C (X )) for all p < ∞ when X is


compact.
When X has piecewise smooth boundary, each function f ∈ Lip∗(s, L p (X ))
can be extended to Rn . If we still denote this extension by f , then

there exists a constant CX ,s,p depending on X , s, and p such that


for all f ∈ Lip∗(s, L p (X )),

f ∗
Lip (s,L p (Rn ))
≤ CX ,s,p f ∗
Lip (s,L p (X ))
.

When p = ∞, we write f Lip∗(s) instead of f Lip∗(s,C (X )) . In this case, under


some mild regularity condition for X (e.g., when the boundary of X is piecewise
smooth), for s not an integer, s = ℓ + s0 with ℓ ∈ N and 0 < s0 < 1, Lip∗(s, C (X )) consists of continuous functions on X such that Dα f ∈ Lip(s0 ) for any α = (α1 , . . . , αn ) ∈ Nn with |α| ≤ ℓ. This holds, in particular, if X = [0, 1]n or Rn . In this case, Lip∗(s, C (X )) = C s (X ) when s is not an integer, and C s (X ) ⊂
Lip∗(s, C (X )) when s is an integer. Also, C 1 (X ) ⊂ Lip(1) ⊂ Lip∗(1, C (X )), the
last being known as the Zygmund class.
Again, the regularity of a function f ∈ Lip∗(s, L p (X )) is measured by the
index s. The bigger the index s, the higher the regularity of f .
(III) When p = 2 and X = Rn , it is also natural to measure the regularity
of functions in L 2 (Rn ) by means of the Fourier transform. Let s > 0. The
fractional Sobolev space H s (Rn ) consists of functions in L 2 (Rn ) such that the
following norm is finite:
  1/2
1
f H s (Rn ) := (1 + ξ 2 s
) | f (ξ )|2 d ξ .
(2π)n Rn

When s ∈ N, H s (Rn ) coincides with the Sobolev space defined in Section 2.3.
Note that for a function f ∈ H s (Rn ), its regularity s is tied to the decay of
f . The larger s is, the faster the decay of f is. These subspaces of L 2 (Rn ) and
those described in (II) are related as follows. For any  > 0,

Lip∗(s, L 2 (Rn )) ⊂ H s (Rn ) ⊂ Lip∗(s − , L 2 (Rn )).

For any integer d < s − n2 , f ∈ C d (Rn ) and f C d ≤ Cd f H s . In particular,


if s > n2 , it follows by taking d = 0 that H s (Rn ) ⊆ C (Rn ). Note that this is
the Sobolev embedding theorem mentioned in Section 2.3 for X = Rn . (These
facts can be easily shown using the inverse Fourier transform when s > n and
d < s − n. When n ≥ s > n2 and s − n ≤ d < s − n2 , the proofs are more
involved.) Thus, if X ⊆ Rn has piecewise smooth boundary and d < s − n2 ,
each function f ∈ Lip∗(s, L 2 (X )) can be extended to a C d function on Rn and
there exists a constant CX ,s,d such that for all f ∈ Lip∗(s, L 2 (X )),

f C d (Rn ) ≤ CX ,s,d f ∗
Lip (s,L 2 (X ))
. (5.1)

5.2 Covering numbers for Sobolev smooth kernels


Recall that if E is a Banach space and R > 0, we denote

BR (E) = {x ∈ E : x ≤ R}.

If the space E is clear from the context, we simply write BR .


Lemma 5.2 Let E ⊆ C (X ) be a Banach space. For all η, R > 0,
 η
N (BR , η) = N B1 , .
R
Proof. The proof follows from the fact that {B( f1 , η), . . . , B( fk , η)} is a
covering of BR if and only if {B ( f1 /R, η/R) , . . . , B ( fk /R, η/R)} is a covering
of B1 . 
It follows from Lemma 5.2 that it is enough to estimate covering numbers of
the unit ball. We start with balls in finite-dimensional spaces.
Theorem 5.3 Let E be a finite-dimensional Banach space, N = dim E, and
R > 0. For 0 < η < R,
N
2R
N (BR , η) ≤ +1
η

and, for η ≥ R, N (BR , η) = 1.


Proof. Choose a basis {ej }N
j=1 of E. Define a norm | | on R by
N

 
 N 
 
|x| := 
 x 
j j ,
e x = (x1 , . . . , xN ) ∈ RN .
 j=1 

Let η > 0. Suppose that N (BR , η) > ((2R/η)+1)N . Then BR cannot be covered
by ((2R/η)+1)N balls with radius η. Hence we can find elements f (1) , . . . , f ()
in BR such that

2R N 4
j−1
> +1 and f ( j)  ∈ B( f (i) , η), ∀j ∈ {2, . . . , }.
η
i=1

Therefore, for i = j, f (i) − f (j) > η.


 ( j) ( j) = (x ( j) , . . . , x ( j) ) ∈ RN . Then, for
Set f ( j) = Nm=1 xm em ∈ BR and x 1 N
i  = j,

|x(i) − x( j) | > η.

Also, |x( j) | ≤ R.
Denote by Br the ball of radius r > 0 centered on the origin in (RN , | |). Then

 
4 η
x(j) + B1 ⊆ B
R+ η2 ,
2
j=1

and the sets in this union are disjoint. Therefore, if µ denotes the Lebesgue
measure on RN ,
⎛ ⎞
   η   
4 
η
µ⎝ x(j) + B1 ⎠ = µ x(j) + B1 ≤ µ B R+ η2 .
2 2
j=1 j=1

It follows that
 η N $ %  η N $ %
 µ B1 ≤ R + µ B1 ,
2 2
and thereby
N N
R + (η/2) 2R
≤ = +1 .
η/2 η

This is a contradiction. Therefore we must have, for all η > 0,


N
2R
N (BR , η) ≤ +1 .
η

If, in addition, η ≥ R, then BR can be covered by the ball with radius η


centered on the origin and hence N (BR , η) = 1. 
The study of covering numbers is a standard topic in the field of function
spaces. The asymptotic behavior of the covering numbers for Sobolev spaces
is a well-known result. For example, the ball BR (Lip∗(s, C ([0, 1]n ))) in the
generalized Lipschitz space on [0, 1]n satisfies

R n/s $ % R n/s
Cs ≤ ln N BR (Lip∗(s, C ([0, 1]n ))), η ≤ Cs , (5.2)
η η

where the positive constants Cs and Cs depend only on s and n (i.e., they are
independent of R and η).
We will not prove the bound (5.2) here in all its generality (references for a
proof can be found in Section 5.6). However, to give an idea of the methods

involved in such a proof, we deal next with the special case 0 < s < 1 and
n = 1. Recall that in this case, Lip∗(s, C ([0, 1])) = Lip(s). Since B1 (Lip(s)) ⊂
B1 (C ([0, 1])), we have N (B1 (Lip(s)), η) = 1 for all η ≥ 1.

Proposition 5.4 Let 0 < s < 1 and X = [0, 1]. Then, for all 0 < η ≤ 14 ,

1/s 1/s
1 1 4
≤ ln N (B1 (Lip(s)), η) ≤ 4 .
8 2η η

The restriction η ≤ 1
4 is required only for the lower bound.

Proof. We first deal with the upper bound. Set ε = (η/4)1/s . Define x =
{xi = iε}di=1 , where d =  1ε  denotes the integer part of 1/ε. Then x is an ε-net
of X = [0, 1] (i.e., for all x ∈ X , the distance from x to x is at most ε). If
f ∈ B1 (Lip(s)), then f C (X ) ≤ 1 and −1 ≤ f (xi ) ≤ 1 for all i = 1, . . . , d .
Hence, (νi − 1) η2 ≤ f (xi ) ≤ νi η2 for some νi ∈ J := {−m + 1, . . . , m}, where
m is the smallest integer greater than η2 . For ν = (ν1 , . . . , νd ) ∈ J d define

 η η
Vν := f ∈ B1 (Lip(s)) | (νi − 1) ≤ f (xi ) ≤ νi for i = 1, . . . , d .
2 2
5
Then B1 (Lip(s)) ⊆ ν∈J d Vν . If f , g ∈ Vν , then, for each i ∈ {1, . . . , d },

max | f (x) − g(x)| ≤ | f (xi ) − g(xi )| + max | f (x) − f (xi )|


|x−xi |≤ε |x−xi |≤ε

+ max |g(x) − g(xi )|


|x−xi |≤ε
η
≤ + 2ε s = η.
2

Therefore, Vν has diameter at most η as a subset of C (X ). That is, {Vν }ν∈J d is


an η-covering of B1 (Lip(s)). What is left is to count nonempty sets Vν .
If Vν is nonempty, then Vν contains some function f ∈ B1 (Lip(s)). Since
− η2 ≤ f (xi+1 ) − νi+1 η2 ≤ 0 and − η2 ≤ f (xi ) − νi η2 ≤ 0, we have

η η η
− ≤ f (xi+1 ) − f (xi ) − (νi+1 − νi ) ≤ .
2 2 2

It follows that for each i = 1, . . . , d − 1,

η η
|νi+1 − νi | − ≤ | f (xi+1 ) − f (xi )| ≤ |xi+1 − xi |s ≤ εs .
2 2

This yields |νi+1 −νi | ≤ η2 (ε s + η2 ) = 32 and then νi+1 ∈ {νi −1, νi , νi +1}. Since
ν1 has 2m possible values, the number of nonempty Vν is at most 2m · 3d −1 .
Therefore,

ln N (B1 (Lip(s)), η) ≤ ln(2m · 3d −1 )


= ln 2 + (d − 1) ln 3 + ln m
1 2
≤ ln 2 + ln 3 + ln +1 .
ε η

But ln(1 + t) ≤ t for all t ≥ 0, so we have

1/s
2 2 1
ln +1 ≤ ≤2
η η η

and hence
1/s 1/s 1/s
4 1 4
ln N (B1 (Lip(s)), η) ≤ ln 2 + ln 3 + 2 ≤4 .
η η η

We now prove the lower bound. Set ε = (2η)1/s and x as above. For
i = 1, . . . , d − 1, define fi to be the hat function of height η2 on the interval
[xi − ε, xi + ε]; that is,
⎧ η η

⎨ 2 − 2ε t if 0 ≤ t ≤ ε
η η
fi (xi + t) = + 2ε t if −ε ≤ t < 0


2
0 if t  ∈ [−ε, ε].

Note that fi (xj ) = η2 δij .


For every nonempty subset I of {1, . . . , d − 1} we define

fI (x) = fi (x).
i∈I

If I1  = I2 , there is some i ∈ (I1 \ I2 ) ∪ (I2 \ I1 ) and


η
fI1 − fI2 C (X ) ≥ | fI1 (xi ) − fI2 (xi )| = .
2

It follows that N (B1 (Lip(s)), η) is at least the number of nonempty subsets of


{1, . . . , d − 1} (i.e., 2d −1 − 1), provided that each fI lies in B1 (Lip(s)). Let us
prove that this is the case.

Observe that fI is piecewise linear on each [xi , xi+1 ] and its values on xi are
either η2 or 0. Hence fI C (X ) ≤ η2 ≤ 12 . To evaluate the Lipschitz-s seminorm
of fI we take x, x + t ∈ X with t > 0. If t ≥ ε, then

| fI (x + t) − fI (x)| 2 fI C (X ) η 1
≤ ≤ s = .
t s ε s ε 2

If t < ε and xi  ∈ (x, x + t) for all i ≤ d − 1, then fI is linear on [x, x + t] with


η
slope at most 2ε and hence

| fI (x + t) − fI (x)| (η/2ε)t η η 1
s
≤ s
= t 1−s < ε 1−s = .
t t 2ε 2ε 4

If t < ε and xi ∈ (x, x + t) for some i ≤ d − 1, then

| fI (x + t) − fI (x)| ≤ | fI (x + t) − fI (xi )| + | fI (xi ) − fI (x)|


η η η
≤ (x + t − xi ) + (xi − x) = t
2ε 2ε 2ε

and hence

| fI (x + t) − fI (x)| η η 1
≤ t 1−s ≤ ε −s = .
ts 2ε 2 4

Thus, in all three cases, | fI |Lip(s) ≤ 12 , and therefore fI Lip(s) ≤ 1. This shows
fI ∈ B1 (Lip(s)).
Finally, since η ≤ 41 , we have ε ≤ 12 and d ≥ 2. It follows that

N (B1 (Lip(s)), η) ≥ 2d −1 − 1 ≥ 2d −2 ≥ 2(1/2ε)−2 ,

which implies

1/s 1/s
1 1 1 1
ln N (B1 (Lip(s)), η) ≥ ln 2 − ln 4 ≥ . 
2 2η 8 2η

Now we can give some upper bounds for the covering number of balls in
RKHSs. The bounds depend on the regularity of the Mercer kernel. When the
kernel K has Sobolev or generalized Lipschitz regularity, we can show that
the RKHS HK can be embedded into a generalized Lipschitz space. Then an
estimate for the covering number follows.

Theorem 5.5 Let X be a closed subset of Rn , and K : X × X → R be a Mercer


kernel. If s > 0 and K ∈ Lip∗(s, C (X × X )), then HK ⊂ Lip∗( 2s , C (X )) and,
for all r ∈ N, r > s,
7
f ∗
Lip (s/2)
≤ 2r+1 K ∗
Lip (s)
f K, ∀f ∈ HK .

Proof. Let s < r ∈ N and f ∈ HK . Let x, t ∈ Rn such that x, x+t, . . . , x+rt ∈


X . By Theorem 2.9(iii),
8 r 9

r
r , -  r
rt f (x) = (−1) r−j
Kx+jt , f K
= (−1)r−j Kx+jt , f .
j j
j=0 j=0 K

It follows from the Cauchy–Schwarz inequality that


⎛ ⎞1/2
r
r 
r
r
|rt f (x)| ≤ f K ⎝ (−1)r−j (−1)r−i K(x + jt, x + it)⎠
j i
j=0 i=0
⎛ ⎞1/2

r
r
= f K⎝ (−1)r−j r(0,t) K(x + jt, x)⎠ .
j
j=0

Here (0, t) denotes the vector in R2n where the first n components are zero.
By hypothesis, K ∈ Lip∗(s, C (X × X )). Hence
 
 r 
 K(x + jt, x) ≤ |K| ∗ t s .
 (0,t)  Lip (s)

This yields
⎛ ⎞1/2

r
r 7
|rt f (x)| ≤ f K ⎝ |K|Lip∗(s) t s⎠
≤ 2r |K|Lip∗(s) f K t s/2
.
j
j=0

Therefore,
7
| f |Lip∗( s ) ≤ 2r |K|Lip∗(s) f K.
2

Combining this inequality with the fact (cf. Theorem 2.9) that

f ∞ ≤ K ∞ f K,

we conclude that f ∈ Lip∗( 2s , C (X )) and


7
f ∗
Lip (s/2)
≤ 2r+1 K ∗
Lip (s)
f K.

The proof of the theorem is complete. 


Lemma 5.6 Let n ∈ N, s > 0, D ≥ 1, and x∗ ∈ Rn . If X ⊆ x∗ + D[0, 1]n and
X has piecewise smooth boundary then

N (BR (Lip∗(s, C (X ))), η) ≤ N (BCX ,s Ds R (Lip∗(s, C ([0, 1]n ))), η/2),

where CX ,s is a constant depending only on X and s.


Proof. For f ∈ C ([0, 1]n ) define f ∗ ∈ C (x∗ + D[0, 1]n ) by

x − x∗
f ∗ (x) = f .
D

Then f ∗ ∈ Lip∗(s, C (x∗ + D[0, 1]n )) if and only if f ∈ Lip∗(s, C ([0, 1]n )).
Moreover,

D−s f ∗
Lip (s,C ([0,1]n ))
≤ f∗ ∗
Lip (s,C (x∗ +D[0,1]n ))
≤ f ∗
Lip (s,C ([0,1]n ))
.

Since X ⊆ x∗ + D[0, 1]n and X has piecewise smooth boundary, there is a


constant CX ,s depending only on X and s such that

BR (Lip∗(s, C (X ))) ⊆ { f ∗ |X | f ∗ ∈ BCX ,s R (Lip∗(s, C (x∗ + D[0, 1]n )))}


⊆ { f ∗ |X | f ∈ BCX ,s Ds R (Lip∗(s, C ([0, 1]n )))}.

Let {f1 , . . . , fN } be an η2 -net of BCX ,s Ds R (Lip∗(s, C ([0, 1]n ))) and N its cover-
ing number. For each j = 1, . . . , N take a function gj∗ |X ∈ BR (Lip∗(s, C (X )))
with gj ∈ BCX ,s Ds R (Lip∗(s, C ([0, 1]n ))) and gj − fj Lip∗(s,C ([0,1]n )) ≤ η2 if it
exists. Then {gj∗ |X | j = 1, . . . , N } provides an η-net of BR (Lip∗(s, C (X ))):
each f ∈ BR (Lip∗(s, C (X ))) can be written as the restriction g ∗ |X to X of
some function g ∈ BCX ,s Ds R (Lip∗(s, C ([0, 1]n ))), so there is some j such that
g − fj Lip∗(s,C ([0,1]n )) ≤ η2 . This implies that

f − gj∗ |X ∗
Lip (s,C (X ))
≤ g ∗ − gj∗ ∗
Lip (s,C (x∗ +D[0,1]n ))

≤ g − gj ∗
Lip (s,C ([0,1]n ))
≤ η.

This proves the statement. 



Proof of Theorem 5.1(i) Recall that C s (X ) ⊂ Lip∗(s, C (X )) for any s > 0.


Then, by Theorem 5.5 with s < r ≤ s + 1, the assumption K ∈ C s (X × X )
implies that IK (BR ) ⊆ B72s+2 K ∗ R (Lip∗(s/2, C (X ))). This, together with
Lip (s)
(5.2) and Lemma 5.6 with D ≥ Diam(X ), shows Theorem 5.1(i). 

When s is not an integer, and the boundary of X is piecewise smooth,


C s (X ) = Lip∗(s, C (X )). As a corollary of Theorem 5.5, we have the following.

Proposition 5.7 Let X be a closed subset of Rn with piecewise smooth


boundary, and K : X × X → R a Mercer kernel. If s > 0 is not an even integer
and K ∈ C s (X × X ), then HK ⊂ C s/2 (X ) and
7
f ∗
Lip (s/2)
≤ 2r+1 K ∗
Lip (s)
f K, ∀f ∈ HK . 

Theorem 5.5 and the upper bound in (5.2) yield upper-bound estimates for the
covering numbers of RKHSs when the Mercer kernel has Sobolev regularity.

Theorem 5.8 Let X be a closed subset of Rn with piecewise smooth boundary,


and K : X × X → R a Mercer kernel. Let s > 0 such that K belongs to
Lip∗(s, C (X × X )). Then, for all 0 < η ≤ R,

2n/s
R
ln N (IK (BR ), η) ≤ C ,
η

where C is a constant independent of R and η. 

It is natural to expect covering numbers to have smaller upper bounds when


the kernel is analytic, a regularity stronger than Sobolev smoothness. Proving
this is our next step.

5.3 Covering numbers for analytic kernels


In this section we continue our discussion of the covering numbers of balls
of RKHSs and provide better estimates for analytic kernels. We consider a
convolution kernel K given by K(x, t) = k(x − t), where k is an even function
in L 2 (Rn ) and k(ξ ) > 0 almost everywhere on Rn . Let X = [0, 1]n . Then K
is a Mercer kernel on X . Our purpose here is to bound the covering number
N (IK (BR ), η) when k is analytic.
We will use the Lagrange interpolation polynomials. Denote by s (R)
the space of real polynomials in one variable of degree at most s. Let

t0 , . . . , ts ∈ R be different. We say that wl,s ∈ s (R), l = 0, . . . , s, are the


Lagrange interpolation polynomials with interpolating points {t0 , . . . , ts } when
s
l=0 wl,s (t) ≡ 1 and

wl,s (tm ) = δl,m , l, m ∈ {0, 1, . . . , s}.

It is easy to check that

 t − tj
wl,s (t) =
tl − t j
j∈{0,1,...,s}\{l}

satisfy these conditions.


We consider the set of interpolating points {0, 1s , 2s , . . . , 1} and univariate
functions {wl,s (t)}sl=0 defined by


s
st(st − 1) · · · (st − j + 1) j
wl,s (t) := (−1) j−l . (5.3)
j! l
j=l

Since


j
j
(−1) j−l z l = (z − 1) j ,
l
l=0

the following functions of the variable z are equal:


s 
s
st(st − 1) · · · (st − j + 1)
wl,s (t)z l = (z − 1) j . (5.4)
j!
l=0 j=0

s
In particular, l=0 wl,s (t) ≡ 1. In addition, it can be easily checked that
m
wl,s = δl,m , l, m ∈ {0, 1, . . . , s}. (5.5)
s

This means that the wl,s are the Lagrange interpolation polynomials, and hence
$$w_{l,s}(t) = \prod_{j\in\{0,1,\ldots,s\}\setminus\{l\}} \frac{t - j/s}{l/s - j/s} = \prod_{j\in\{0,1,\ldots,s\}\setminus\{l\}} \frac{st - j}{l - j}.$$
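The two defining properties — the partition of unity and the interpolation conditions (5.5) — are easy to confirm numerically. The sketch below (an added illustration; the degree s and the random evaluation points are arbitrary choices) evaluates the product formula for wl,s at the nodes m/s and at a few other points.

import numpy as np

def w(l, s, t):
    # Lagrange basis polynomial w_{l,s}(t) for the nodes 0, 1/s, ..., 1.
    return np.prod([(s * t - j) / (l - j) for j in range(s + 1) if j != l])

s = 6
nodes = np.array([m / s for m in range(s + 1)])
vals = np.array([[w(l, s, t) for t in nodes] for l in range(s + 1)])
print(np.allclose(vals, np.eye(s + 1)))           # w_{l,s}(m/s) = delta_{l,m}

for t in np.random.default_rng(0).uniform(0, 1, 3):
    print(sum(w(l, s, t) for l in range(s + 1)))  # should equal 1 for every t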

The norm of these polynomials (as elements in C ([0, 1])) can be estimated as
follows:

Lemma 5.9 Let s ∈ N, l ∈ {0, 1, . . . , s}. Then, for all t ∈ [0, 1],

s
|wl,s (t)| ≤ s .
l

Proof. Let m ∈ {0, 1, . . . , s − 1} and st ∈ (m, m + 1). Then for l ∈ {0, 1, . . . ,


m − 1},
: :m :s 
 l−1 
 j=0 (st − j) j=l+1 (st − j) j=m+1 (st − j)
|wl,s (t)| =
l!(s − l)!
(m + 1)!(s − m)! s
≤ ≤s .
(st − l)l!(s − l)! l

When l ∈ {m + 1, . . . , s},
: :l−1 :s 
 m 
 j=0 (st − j) j=m+1 (st − j) j=l+1 (st − j)
|wl,s (t)| =
l!(s − l)!
(m + 1)!(s − m)! s
≤ ≤s .
(l − m)l!(s − l)! l

The case l = m can be dealt with in the same way. 

We now turn to the multivariate case. Denote XN := {0, 1, . . . , N }n . The


multivariate polynomials {wα,N (x)}α∈XN are defined as


n
wα,N (x) = wαj ,N (xj ), x = (x1 , . . . , xn ), α = (α1 , . . . , αn ). (5.6)
j=1

We use the polynomials in (5.6) as a family of multivariate polynomials, not


as interpolation polynomials any more. For these polynomials, we have the
following result.

Lemma 5.10 Let x ∈ [0, 1]n and N ∈ N. Then



|wα,N (x)| ≤ (N 2N )n (5.7)
α∈XN

and, for θ ∈ [− 12 , 21 ]n ,

 
 −iθ·Nx   1 n−1 N
e − (x)e −iθ·α 
 w α,N  ≤ n 1 + 2N max |θj |
1≤ j≤n
(5.8)
α∈XN

holds.

Proof. The bound (5.7) follows directly from Lemma 5.9.


To derive the second bound (5.8), we first consider the univariate case. Let
t ∈ [0, 1]. Then the univariate function z Nt is analytic on the region |z − 1| ≤ 12 .
On this region,


 Nt(Nt − 1) · · · (Nt − j + 1)
z Nt = (1 + (z − 1))Nt = (z − 1) j .
j!
j=0

It follows that for η ∈ [− 12 , 12 ] and z = e−iη ,

 
 −iη·Nt N
Nt(Nt − 1) · · · (Nt − j + 1) −iη 
e − (e j
− 1) 
 j!
j=0

  
 Nt(Nt − 1) · · · (Nt − j + 1)  j
≤   |η| ≤ |η|N .
 j! 
j=N +1

This, together with (5.4) for z = e−iη , implies

   
 N   N Nt(Nt − 1) · · · (Nt − j + 1) −iη 
 wl,N (t)e −iη·l   j
 = j!
(e − 1) 
l=0 j=0

1
≤ 1 + |η|N ≤ 1 + (5.9)
2N

and
 
 −iη·Nt N

e − wl,N (t)e −iη·l 
 ≤ |η| .
N
 (5.10)
l=0

Now we can derive the bound in the multivariate case. Let θ ∈


:n
[− 12 , 12 ]n . Then e−iθ·Nx = m=1 e
−iθm ·Nxm . We approximate e−iθm ·Nxm by

N −iθm ·αm
αm =0 wαm ,N (xm )e for m = 1, 2, . . . , n. We have

    2 3
 −iθ ·Nx    n m−1 
e − wα,N (x)e −iθ·α = e −iθs ·Nxs
  
α∈XN m=1 s=1
⎡ ⎤ ⎡ ⎤

N n N

× ⎣e−iθm ·Nxm − wαm ,N (xm )e−iθm ·αm ⎦ ⎣ wαs ,N (xs )e−iθs ·αs ⎦ .
αm =0 s=m+1 αs =0 

Applying (5.9) to the last term and (5.10) to the middle term, we see that this
expression can be bounded by


n N
1 n−m
1 n−1 N
max |θj | 1+ ≤n 1+ max |θj | .
1≤ j≤n 2N 2N 1≤ j≤n
m=1

Thus, bound (5.8) holds. 


We can now state estimates on the covering number N (IK (BR ), η) for a
convolution-type kernel K(x, t) = k(x − t). The following function measures
the regularity of the kernel function k:
 
2n−2 2N
1 |ξ |
(2π)−n max
j
ϒk (N ) = n3 1 + N k(ξ ) dξ
2 1≤ j≤n ξ ∈[−N /2,N /2]n N
2 
+ 1 + (N 2N )n (2π)−n k(ξ ) d ξ .
ξ ∈[−N /2,N /2]n

The domain of this function is split into two$ parts. %In the first part, ξ ∈
N
[−N /2, N /2]n , and therefore, for j = 1, . . . , n, |ξj |/N ≤ 2−N ; hence this
first part decays exponentially quickly as N becomes large. In the second part,
ξ  ∈ [−N /2, N /2]n , and therefore ξ is large when N is large. The decay of k
(which is equivalent to the regularity of k; see Part (III) in Section 5.1) yields
the fast decay of ϒk on this second part. For more details and examples of
bounding ϒk (N ) by means of the decay of k (or, equivalently, the regularity
of k), see Corollaries 5.12 and 5.16.
Theorem 5.11 Assume that k is an even function in L 2 (Rn ) and k(ξ ) > 0
almost everywhere on Rn . Let K(x, t) = k(x − t) for x, t ∈ [0, 1]n . Suppose
limN →∞ ϒk (N ) = 0. Then, for 0 < η < R2 ,

R
ln N (IK (BR ), η) ≤ (N + 1)n ln 8 k(0)(N + 1)n/2 (N 2N )n (5.11)
η

holds, where N is any integer satisfying

 η 2
ϒk (N ) ≤ . (5.12)
2R

Proof. By Proposition 2.14, K is a Mercer kernel on Rn . Let f ∈ BR .


Then, by reproducing that property of K, f (x) =  f , Kx K . Recall that
XN = {0, 1, . . . , N }n . For x ∈ [0, 1]n , we have
  8 9 
   
  α    
 f (x) − f wα,N (x) =  f , Kx − wα,N (x)K Nα 
 
 α∈XN
N   α∈XN 
K

≤ f K {QN (x)}
1/2
,


where {QN (x)}1/2 is the HK -norm of the function Kx − α∈XN wα,N (x)Kα/N .
It is explicitly given by

  α
QN (x) := k(0) − 2 wα,N (x)k x −
N
α∈XN
 α−β
+ wα,N (x)k wβ,N (x). (5.13)
N
α,β∈XN

By the evenness of k and the inverse Fourier transform,



α α
k(x − ) = (2π)−n k(ξ )eiξ ·(x− N ) d ξ ,
N R n

we obtain

  2
  
−n  iξ ·(x− Nα ) 
QN (x) = (2π) k(ξ )1 − wα,N (x)e  dξ
Rn α∈XN
  
2
 −i ξ ·Nx 
−i Nξ ·α 
= (2π)−n 
k(ξ )e − wα,N (x)e
N
 dξ.
Rn α∈XN

Now we separate this integral into two parts, one with ξ ∈ [− N2 , N2 ]n and the
other with ξ  ∈ [− N2 , N2 ]n . For the first region, (5.8) in Lemma 5.10 with θ = Nξ

tells us that
  2
 −i ξ ·Nx  
 −i Nξ ·α 
k(ξ )e N − wα,N (x)e  dξ
ξ ∈[−N /2,N /2]n α∈XN

1 2n−2   |ξj | 2N
≤ n2 1 + k(ξ ) dξ.
2N N
1≤ j≤n ξ ∈[−N /2,N /2]
n

For the second region, we apply (5.7) in Lemma 5.10 and obtain

  
2
 ξ ξ 
k(ξ )e−i N ·Nx − wα,N (x)e−i N ·α  d ξ
ξ ∈[−N /2,N /2]n α∈XN

≤ (1 + (N 2N )n )2 k(ξ ) d ξ .
ξ ∈[−N /2,N /2]n

Combining the two cases above, we have


  
2n−2 2N
1 −n |ξj |
QN (x) ≤ n 3
1+ N max (2π) k(ξ ) dξ
2 1≤ j≤n ξ ∈[−N /2,N /2]n N

(1 + (N 2N )n )2
+ k(ξ ) d ξ = ϒk (N ).
(2π )n ξ ∈[−N /2,N /2]n

Hence

⎨   α
sup k(0) − 2 wα,N (x)k x −
x∈[0,1]n ⎩ α∈XN
N

 α−β ⎬
+ wα,N (x)k wβ,N (x) ≤ ϒk (N ). (5.14)
N ⎭
α,β∈XN

Since N satisfies (5.12), we have


 
 
  α  η
 f (x) − f wα,N (x) ≤ .
 
 α∈XN
N  2
C (X )

Also, by the reproducing property, | f ( Nα )| = | f , Kα/N K | ≤ f


√ √ K
K(α/N , α/N ) ≤ R k(0). Hence
  α  
 
 f 2 ≤ R k(0)(N + 1)n/2 .
N  (XN )

Here 2 (XN ) is the 2 space of sequences {x(α)}α∈XN indexed by XN .



Apply Theorem 5.3 to the ball of radius r := R k(0)(N + 1)n/2 in the
finite-dimensional space 2 (XN ) and  = η/(2(N 2N )n ). Then there are {cl :
l = 1, . . . , [(2r/ + 1)#XN ]} ⊂ 2 (XN ) such that for any d ∈ 2 (XN ) with
d 2 (XN ) ≤ r, we can find some l satisfying

d − cl 2 (XN ) ≤ .

This, together with Lemma 5.10, yields


 
 
  
 d w (x) − c l
w (x)  ≤ d − cl ∞ (XN )
 α α,N α α,N 
α∈XN α∈XN 
C (X )
 
 
 
 |wα,N (x)| ≤ (N 2N )n  ≤ η/2.
 
α∈XN 
C (X )

Here ∞ (XN ) is the ∞ space of sequences {x(α)}α∈XN indexed by XN that


satisfies the relationship c ∞ (XN ) ≤ c 2 (XN ) for all c ∈ ∞ (XN ).

Thus, with d = { f ( Nα )}, we see that f (x) − α∈XN cαl wα,N (x) C (X ) can
be bounded by
 
  α 
 
 f (x) − f w (x) 
 α,N 
 α∈XN
N 
C (X )
 
 
  
+  d w
α α,N (x) − c l
w
α α,N (x) 
 ≤ η.
α∈XN α∈XN 
C (X )


We have covered IK (BR ) by balls with centers α∈XN cαl wα,N (x) and radius
η. Therefore,

#XN
2r
N (IK (BR ), η) ≤ +1 .


That is,

2r
ln N (IK (BR ), η) ≤ (N + 1)n ln +1

R
≤ (N + 1)n ln 8 k(0)(N + 1)n/2 (N 2N )n .
η

The proof of Theorem 5.11 is complete. 

To see how to handle the function ϒk (N ) measuring the regularity of the


kernel, and then to estimate the covering number, we turn to the example of
Gaussian kernels.

Corollary 5.12 Let σ > 0, X = [0, 1]n , and K(x, y) = k(x − y) with

 
x 2
k(x) = exp − 2 , x ∈ Rn .
σ

Then, for 0 < η < R2 ,

n
R 54n R 90n2
ln N (IK (BR ), η) ≤ 3 ln + 2 +6 (6n + 1) ln + 2 + 11n + 3
η σ η σ
(5.15)

holds. In particular, when 0 < η < R exp{−(90n2 /σ 2 ) − 11n − 3}, we have

n+1
R
ln N (IK (BR ), η) ≤ 4n (6n + 2) ln . (5.16)
η

Proof. It is well known that


k(ξ ) = (σ π)n e−σ ξ
2 2 /4
. (5.17)

Hence k(ξ ) > 0 for any ξ ∈ Rn .


Let us estimate the function ϒk . For the first part, with 1 ≤ j ≤ n, we have


(2π)−1 σ πe−σ ξ /4 d ξ < 1
2 2

ξ ∈[−N /2,N /2]



when   = j. Hence

√ |ξj | 2N
(2π)−n (σ π)n e−σ ξ
2 2 /4

ξ ∈[−N /2,N /2]n N
√ 
σ π N /2 −σ 2 |ξj |2 /4 |ξj | 2N
≤ e d ξj
2π −N /2 N
2N
1 2 1
≤√  N+ .
π σN 2

If we apply Stirling’s formula, this expression can be bounded by


2N N +(1/2) N
2 2N + 1 1 2
√ e1/(6(2N +1)) ≤ .
σN 2e 2N + 1 σ eN
2

As for the second term of ϒk , we have




(2π)−n (σ π)n e−σ ξ
2 2 /4

ξ ∈[−N /2,N /2]n
√ n 
σ π
e−σ |ξj | /4 d ξj
2 2

2π ξj ∈[−N /2,N /2]
j=1
√ 
nσ π +∞ −(σ 2 /4)(t 2 −t/2) −(σ 2 /4)(t/2)
≤ e e dt
π N /2
nσ 8
≤ √ e−σ N (N −1)/16 2 e−σ N /16
2 2

π σ
8n −(σ 2 /16)N 2
= √ e .
σ π

If we combine these two estimates, the function ϒk satisfies


2n−2 N
1 2 8n
√ (1 + (N 2N )n )2 e−(σ /16)N .
2 2
ϒk (N ) ≤ n3 1 + +
2N σ eN
2 σ π

Notice that when N ≥ n + 3,


2n−2 2n−2
1 1
1+ ≤ 1+ ≤e
2N 2n

and

(1 + (N 2N )n )2 ≤ 21−2n+4Nn .

It follows that
N
2 n42−n −(σ 2 /16)N 2 +4nN ln 2
ϒk (N ) ≤ n3 e + √ e .
σ eN
2 σ π

Choose N ≥ 80n ln 2/σ 2 . Then


N
1 4
ϒk (N ) ≤ n3 e + √ 2−nN
16en ln 2 σ π
N
1 4
≤e √ 2−nN + (5.18)
σ π
16n
  N
4 1 1
≤ e+ √ max , n .
σ π 16n 2

When N ≥ 2 ln Rη + 52 and N ≥ 3n ln Rη + 5n − ln(σ π )/(n ln 2), we know
that each term in the estimates for ϒk is bounded by (η/(2R))2 /2. Hence (5.12)
holds.
Finally, we choose the smallest N satisfying

80n ln 2 R
N≥ + 3 ln + 5.
σ2 η

Then, by checking the cases σ ≥ 1 and σ < 1, we see that (5.12) is valid for
any 0 < η < R2 . By Theorem 5.11,
n
R 80n ln 2 5 R
ln N (IK (BR ), η) ≤ 3 ln + +6 ln 2 nN + ln + ln 8
η σ2 2 η
n
R 54n R 90n2
≤ 3 ln + 2 +6 (6n + 1) ln + 2 + 11n + 3 .
η σ η σ

This proves (5.15).


When 0 < η < Re(90n /σ )−11n−3 , we have
2 2

n+1
R
ln N (IK (BR ), η) ≤ 4n (6n + 2) ln .
η

This yields the last inequality in the statement. 


One can easily derive from the covering number estimates for X = [0, 1]n
such estimates for an arbitrary X .

Proposition 5.13 Let K(x, y) = k(x − y) be a translation-invariant Mercer


kernel on Rn and X ⊆ Rn . Let  > 0 and K  be the Mercer kernel on
X  = [0, 1]n given by

K  (x, y) = k((x − y)), x, y ∈ [0, 1]n .

Then
(i) If X ⊆ x∗ + [−/2, /2]n for some x∗ ∈ X , then, for all η, R > 0,

N (IK (BR ), η) ≤ N (IK  (BR ), η).

(ii) If X ⊇ x∗ + [−/2, /2]n for some x∗ ∈ X , then, for all η, R > 0,

N (IK (BR ), η) ≥ N (IK  (BR ), η).

Proof.
m
(i) Denote t0 = ( 12 , 12 , . . . , 12 ) ∈ [0, 1]n . Let g = i=1 ci Kxi ∈ IK (BR ). Then


m 
m
xi − x∗ xj − x ∗
g 2
K = ci cj k(xi − xj ) = ci c j k  + t0 − − t0
 
i,j=1 i,j=1
 m 2

m
xi − x ∗ xj − x∗  
   
= ci cj K + t0 , + t0 = ci K xi −x∗  ,
    +t0 
i,j=1 i=1 K

the last line by the definition of K  . Since g ∈ IK (BR ), we have


m
ci K 
xi −x∗ ∈ IK  (BR ).
 +t0
i=1

If { f1 , . . . , fN } is an η-net of IK  (BR ) on X  = [0, 1]n with


N = N (IK  (BR ), η), then there is some j ∈ {1, . . . , N } such that
 m 
 
  
 ci K xi −x∗ − fj  ≤ η.
  +t0 
i=1 C ([0,1]n )

This means that


 m 
 xi − x∗ 
  
sup  ci K t, + t0 − fj (t) ≤ η.
t∈[0,1]n   
i=1

Take t = ((x − x∗ )/) + t0 . When x ∈ X ⊆ x∗ + [−/2, /2]n , we have


t ∈ [0, 1]n . Hence
 m 
 x − x∗ xi − x∗ x − x∗ 
 
sup  ci K  + t0 , + t0 − fj + t0  ≤ η.
x∈X  i=1
   

This is the same as


 m 
 x − x∗ 
 
sup  ci k(x − xi ) − fj + t0 
x∈X  i=1  
 m 
 x − x∗ 
 
= sup  ci Kxi (x) − fj + t0  ≤ η.
x∈X  i=1
 

This shows that if we define fj∗ (x) := fj (((x − x∗ )/) + t0 ), the set

{f1∗ , . . . , fN∗ } is an η-net of the function set { mi=1 ci Kxi ∈ IK (BR )} in C (X ).
Since this function set is dense in IK (BR ), we have

N (IK (BR ), η) ≤ N = N (IK  (BR ), η).

(ii) If X ⊇ x∗ + [−/2, /2]n and {g1 , . . . , gN } is an η-net of IK (BR ) with


N = N (IK (BR ), η), then, for each g ∈ BR , we can find some j ∈ {1, . . . , N }
such that g − gj C (X ) ≤ η.
  
Let f = m i=1 ci Kti ∈ IK  (BR ). Then, for any t ∈ X ,


m 
m 
m
f (t) = ci k((t − ti )) = ci k(x − xi ) = ci Kxi (x),
i=1 i=1 i=1

where x = x(t) = x∗ + (t − t0 ) ∈ X and xi = x∗ + (ti − t0 ) ∈ X . It


follows from this expression that


m 
m
f 2
K
= ci cj K  (ti , tj ) = ci cj k((ti − tj ))
i,j=1 i,j=1


m
= ci cj K(xi , xj ) = g 2
K ≤ R,
i,j=1


where g = m i=1 ci Kxi . So, g ∈ IK (BR ) and we have g − gj C (X ) =
supx∈X |g(x) − gj (x)| ≤ η for some j ∈ {1, . . . , N }. But for x = x(t) ∈ X ,


m 
m
g(x) = ci k(x − xi ) = ci k((t − ti )) = f (t).
i=1 i=1

It follows that
   
sup f (t) − gj (x∗ + (t − t0 )) ≤ sup g(x) − gj (x) ≤ η.
t∈[0,1]n x∈X

This shows that if we define gj∗ (t) := gj (x∗ +(t−t0 )), the set {g1∗ , . . . , gN∗ }
is an η-net of IK  (BR ) in C ([0, 1]n ). 

If we take  = 2Diam(X ) to be twice the diameter of X , then the


condition X ⊆ x∗ + [−/2, /2]n holds for any x∗ ∈ X . If, moreover,
k(x) = exp{− x 2 /σ 2 } then K  (x, y) = exp{− x − y 2 /(σ 2 /2 )}. In this
situation, Corollary 5.12 and Proposition 5.13 yield the following upper bound
for the covering numbers of Gaussian kernels.
Corollary 5.14 Let σ > 0, X ⊂ Rn with Diam(X ) < ∞, and K(x, y) =
exp{− x − y 2 /σ 2 }. Then, for any 0 < η ≤ R2 , we have
n
R 216n(Diam(X ))2
ln N (IK (BR ), η) ≤ 3 ln + +6
η σ2
R 360n2 (Diam(X ))2
(6n + 1) ln + + 11n + 3 . 
η σ2

We now note that the upper bound in Part (ii) of Theorem 5.1 follows from
Corollary 5.14. We next apply Theorem 5.11 to kernels with exponentially
decaying Fourier transforms.
Theorem 5.15 Let k be as in Theorem 5.11, and assume that for some constants
C0 > 0 and λ > n(6 + 2 ln 4),

k(ξ ) ≤ C0 e−λ ξ
, ∀ξ ∈ Rn .

Denote  := max{1/eλ, 4n /eλ/2 }. Then for 0 < η ≤ 2R C0 (2n−1)/4 ,
n 
4 R 4 R
ln N (IK (BR ), η) ≤ ln + 1 + C1 + 1 ln + C2
ln (1/) η ln (1/) η
(5.19)

holds, where
?
2 ln(32C0 ) C0 n/2 −3/8 n C1
C1 := 1 + , C2 := ln 8 2 ( 2 ) .
ln (1/) λ

Proof. Let N ∈ N and 1 ≤ j ≤ n. Since |ξj |/N < 1 for ξ ∈ [−N /2, N /2]n ,
we have
 
|ξj | 2N |ξj | 2N
k(ξ ) d ξ ≤ C0 e−λ ξ dξ
ξ ∈[−N /2,N /2]n N ξ ∈[−N /2,N /2]n N

C0
≤ N N n−1 |ξj |N e−λ|ξj | d ξj
N ξj ∈[−N /2,N /2]

2C0
≤ N n−1 N !
λ +1 N N
N

√ N n−1/2
≤ 2C0 2π2(1/12)+1 ,
(eλ)N +1
the last inequality by Stirling’s formula. Hence the first term of ϒk (N ) is at
most

1 2n−2 √ N n−(1/2)
n3 1 + (2π)−n 2C0 2π2(1/12)+1
2N (eλ)N +1
N +1
1
≤ 4C0 N n−(1/2) .

Here we have bounded the constant term as follows


$ %2 n−1
1 2n−2 −n
√ (1/12)+1 1 + (1/2N ) 2(1/12)+2
n 1+ N
3
(2π) 2 2π2 =n 3

2 2π 2π
 
2 n−1 (1/12)+2
3 (1 + (1/2)) 2
≤n √
2π 2π
9 n−1
2(1/12)+2
= n3 √
8π 2π
≤ 4, for all n ∈ N.

For the other term in ϒk (N ), we have


 
k(ξ ) d ξ ≤ C0 e−λ ξ
dξ. (5.20)
ξ ∈[−N /2,N /2]n ξ ≥N /2

To estimate this integral, we recall spherical coordinates in Rn :

ξ1 = r cos θ1
ξ2 = r sin θ1 cos θ2
ξ3 = r sin θ1 sin θ2 cos θ3
..
.
ξn−1 = r sin θ1 sin θ2 . . . sin θn−2 cos θn−1
ξn = r sin θ1 sin θ2 . . . sin θn−2 sin θn−1 ,

where r ∈ (0, ∞), θ1 , . . . , θn−2 ∈ [0, π ), and θn−1 ∈ [0, 2π ). For a radial
function f ( ξ ) we have
  r2
f ( ξ ) d ξ = wn−1 f (r)r n−1 dr,
r1 ≤ ξ ≤r2 r1

where
 2π  π  π
wn−1 = ... sinn−2 θ1 sinn−3 θ2 . . . sin θn−2 d θ1 d θ2 . . . d θn−2 d θn−1
0 0 0

 π
n−2
= 2π sinn−j−1 θj d θj
j=1 0

2π n/2
= .
(n/2)

Applying this with f ( ξ ) = e−λ ξ , we find that


  ∞
2π n/2
e−λ ξ
dξ = e−λr r n−1 dr.
ξ ≥N /2 (n/2) N /2

By repeatedly integrating by parts, we see that


 ∞ n−1  ∞
−λr n−1 1 N −λN /2 n−1
e r dr = e + e−λr r n−2 dr
N /2 λ 2 λ N /2


n
1 (n − 1)! N n−j
= ... = e−λN /2 .
λj (n − j)! 2
j=1

Therefore, returning to (5.20), since λ > 2n,



2π n/2  1 (n − 1)!
n n−j
N
k(ξ ) d ξ ≤ C0 e−λN /2
ξ ∈[−N /2,N /2]n (n/2) λj (n − j)! 2
j=1

2π n/2  1 (n − 1)!
n n−j
N
≤ C0 e−λN /2
(n/2) (2n) j (n − j)! 2
j=1

2π n/2  (n − 1)! −n n−j −λN /2


n
≤ C0 2 N e
(n/2) (n − j)!j!
j=1

2C0 π n/2
≤ 2N n−1 e−λN /2 .
(n/2)
It follows that the second term in ϒk (N ) is bounded by
  n 2 2C0 π n/2 n−1 −λN /2 4n N
1 + N 2N (2π)−n 2N e ≤ 4C0 N 3n .
(n/2) eλ/2
Combining the two bounds above, we have
N +1 N
1 4n
ϒk (N ) ≤ 4C0 N n−1/2 + 4C0 N 3n ≤ 8C0 N 3n N .
eλ eλ/2
Since λ > n(6 + 2 ln 4), the definition of  yields
 
1 4n
 ≤ max , < e−3n .
e(2n ln 4 + n) en ln 4+3n

Since xe−x ≤ e−1 for all x ∈ (0, ∞), we have


 3n  3n
N 3n N /2 = N N /6n = Ne−N /6n ln(1/)
3n 3n
6n 2
≤ e−1 ≤ < 1.
log(1/) e
Then, for N ≥ 4n/ ln(1/),

ϒk (N ) ≤ 8C0 N /2 . (5.21)

Thus, for 0 < η ≤ 2R C0 (2n−1)/4 , we may take N ∈ N such that N ≥
4n/ ln(1/) and N > 2 to obtain
 η 2
8C0 N /2 ≤ ≤ 8C0 (N −1)/2 . (5.22)
2R

Under this choice, (5.12) holds. Then, by Theorem 5.11,

R
ln N (IK (BR ), η) ≤ (N + 1)n ln 8 k(0)(N + 1)n/2 (N 2N )n .
η

Now, by (5.22),

2 ln(32C0 ) 4 R
N ≤1+ + ln .
ln(1/) ln(1/) η

Also, since

(N + 1)n/2 (N 2N )n ≤ 2n/2 (−3/8 2n )N ,

we have
n
4 R 2 ln(32C0 )
ln N (IK (BR ), η) ≤ ln + 2 + ln 8 k(0)2n/2
ln(1/) η ln (1/)

(−3/8 2n )1+2 ln(32C0 )/ ln(1/) (R/η)(4/ ln(1/))+1 .

Finally observe that


 
−n −n C0
|k(0)| = |(2π) k(ξ ) d ξ | ≤ (2π) C0 e−λ ξ
dξ ≤ .
λ

Then (5.19) follows. 

We can apply Theorem 5.15 to inverse multiquadric kernels.

Corollary 5.16 Let c > n(6 + 2 ln 4), α > n/2, and

k(x) = (c2 + x 2 )−α , x ∈ Rn .

Then there is a constant C0 depending only on α such that for 0 < η ≤



2R C0 (2n−1)/4 ,
n 
4 R 4 R
ln N (IK (BR ), η) ≤ ln + 1 + C1 + 1 ln + C2
ln(1/) η ln(1/) η

holds, where  = max{1/ec, 4n /ec/2 } and C1 , C2 are the constants defined in


Theorem 5.15.

Proof. For any  > 0, we know that there are positive constants C0 ≥ 1,
depending only on α, and C0∗ , depending only on α and , such that

C0∗ e−(c+) ξ
≤ k(ξ ) ≤ C0 e−c ξ
∀ξ ∈ Rn . (5.23)

Then we can apply Theorem 5.15 with λ = c, and the desired estimate
follows. 

5.4 Lower bounds for covering numbers


In this section we continue our discussion of the covering numbers of balls in
RKHSs and provide some lower-bound estimates. This is done by bounding
the related packing numbers.

Definition 5.17 Let S be a compact set in a metric space and η > 0. The
packing number M(S, η) is the largest integer m ∈ N such that there exist m
points x1 , . . . , xm ∈ S being η-separated; that is, the distance between xi and xj
is greater than η if i  = j.

Covering and packing numbers are closely related.

Proposition 5.18 For any η > 0,

M(S, 2η) ≤ N (S, η) ≤ M(S, η).

Proof. Let k = M(S, 2η) and {a1 , . . . , ak } be a set of 2η-separated points


in S. Then, by the triangle inequality, no closed ball of radius η can contain
more than one ai . This shows that N (S, η) ≥ k.
To prove the other inequality, let k = M(S, η) and {a1 , . . . , ak } be a set of
η-separated points in S. Then, the balls B(ai , η) cover S. Otherwise, there would
exist a point ak+1 whose distance to aj , j = 1, . . . , k, was greater than η and
one would have M(S, η) ≥ k + 1. 
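Proposition 5.18 can be seen in action on a simple set. The Python sketch below (an added illustration; the set S = [0, 1]², the Euclidean metric, the sample size, and the greedy strategy are arbitrary choices) builds an η-separated family by a greedy pass over random points. The resulting centers are η-separated and, up to the density of the sample, every point of S lies within η of one of them, so the count lies between N (S, η) and M(S, η).

import numpy as np

# Greedy construction of an eta-separated set in S = [0,1]^2 (Euclidean metric).
rng = np.random.default_rng(1)
pts = rng.uniform(0.0, 1.0, size=(5000, 2))    # dense random sample standing in for S
eta = 0.1

centers = []
for p in pts:
    if all(np.linalg.norm(p - c) > eta for c in centers):
        centers.append(p)

# The centers are eta-separated; every sampled point is within eta of some center,
# so (up to the sampling) N(S, eta) <= len(centers) <= M(S, eta).
print(len(centers), "centers for eta =", eta)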

The lower bounds for the packing numbers are presented in terms of the
Gramian matrix
$ %m
K[x] = K(xi , xj ) i,j=1 , (5.24)

where x := {x1 , . . . , xm } is a set of points in X . Denote by K[x]−1 2 the norm


of K[x]−1 (if it exists) as an operator on Rm with the 2-norm.

We use nodal functions in the RKHS HK to provide lower bounds of covering


numbers. They are used in the next chapter as well to construct interpolation
schemes to estimate the approximation error.
Definition 5.19 Let x := {x1 , . . . , xm } ⊆ X . We say that {ui }m
i=1 is a set of nodal
functions associated with the nodes x1 , . . . , xm if ui ∈ span(Kx1 , . . . , Kxm ) and
ui (xj ) = δij .
The following result characterizes the existence of nodal functions.
Proposition 5.20 Let K be a Mercer kernel on X and x := {x1 , . . . , xm } ⊂ X .
Then the following statements are equivalent:
(i) The nodal functions {ui }mi=1 exist.
(ii) The functions {Kxi }mi=1 are linearly independent.
(iii) The Gramian matrix K[x] is invertible.
(iv) There exists a set of functions { fi }m
i=1 ∈ HK such that fi (xj ) = δij for
i, j = 1, . . . , m.
In this case, the nodal functions are uniquely given by


m
ui (x) = (K[x]−1 )i,j Kxj (x), i = 1, . . . , m. (5.25)
j=1

Moreover, for each x ∈ X , the vector (ui (x))m


i=1 is the unique minimizer in R
m

of the quadratic function Q given by


m 
m
Q(w) = wi K(xi , xj )wj − 2 wi K(x, xi ) + K(x, x), w ∈ Rm .
i,j=1 i=1

Proof.
(i) ⇒ (ii). The nodal function property implies that the nodal functions
{ui } are linearly independent. Hence (i) implies (ii), since the
m-dimensional space span{ui }m i=1 is contained in span{Kxi }i=1 .
m

(ii) ⇒ (iii). A solution d = (d1 , . . . , dm ) ∈ Rm of the linear system

K[x]d = 0

satisfies
 2 ⎧ ⎫
 m  ⎨ ⎬
  m m
 d K  = d K(x , x )d = 0.
 j xj  i
⎩ i j j

 j=1  i=1 j=1
K

Then the linear independence of {Kxj }m j=1 implies that the linear
system has only the zero solution; that is, K[x] is invertible.
(iii) ⇒ (iv). When K[x] is invertible, the functions { fi }m i=1 given by fi =
m −1 ) K satisfy
j=1 (K[x] i,j xj


m
fi (xj ) = (K[x]−1 )i, K(x , xj ) = (K[x]−1 K[x])i,j = δij .
=1

These are the desired functions.


(iv) ⇒ (i). Let Px be the orthogonal projection from HK onto span{Kxi }m
i=1 .
Then for i, j = 1, . . . , m,

Px ( fi )(xj ) = Px ( fi ), Kxj K = fi , Kxj K = fi (xj ) = δij .

So {ui = Px (fi )}m


i=1 are the desired nodal functions.
The uniqueness of the nodal functions follows from the invertibility
of the Gramian matrix K[x].
Since the quadratic form Q can be written as Q(w) = wT K[x]w −
2bT w + K(x, x) with the positive definite matrix K[x] and the
vector b = (K(x, xi ))m ∗
i=1 , we know that the minimizer w of Q in

R is given by the linear system K[x]w = b, which is exactly
m

(ui (x))m
i=1 . 
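Formula (5.25) is straightforward to implement. The following sketch (an added illustration; the Gaussian kernel, the nodes, and the evaluation points are arbitrary choices) forms the Gramian K[x], builds the nodal functions ui , and checks the defining property ui (xj ) = δij .

import numpy as np

def gauss(x, y, sigma=0.5):
    # Gaussian kernel values K(x_i, y_j); used here as a concrete Mercer kernel.
    return np.exp(-np.subtract.outer(x, y)**2 / sigma**2)

nodes = np.array([0.0, 0.2, 0.5, 0.7, 1.0])     # x_1, ..., x_m
Kx = gauss(nodes, nodes)                        # Gramian K[x]
Kinv = np.linalg.inv(Kx)

def u(i, t):
    # Nodal function u_i(t) = sum_j (K[x]^{-1})_{ij} K_{x_j}(t), cf. (5.25).
    return gauss(t, nodes) @ Kinv[i]

vals = np.array([[u(i, xj) for xj in nodes] for i in range(len(nodes))])
print(np.allclose(vals, np.eye(len(nodes))))    # u_i(x_j) = delta_{ij}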

When the RKHS has finite dimension , then, for any m ≤ , we can find
nodal functions {uj }m
j=1 associated with some subset x = {x1 , . . . , xm } ⊆ X ,
whereas for m >  no such nodal functions exist. When dim HK = ∞, then,
for any m ∈ N, we can find a subset x = {x1 , . . . , xm } ⊆ X that possesses a set
of nodal functions.

Theorem 5.21 Let K be a Mercer kernel on X , m ∈ N, and x = {x1 , . . . , xm } ⊆


X such that K[x] is invertible. Then

M(IK (BR ), η) ≥ 2m − 1

for all η > 0 satisfying

2
1 R
K[x]−1 2 < .
m η

Proof. By Proposition 5.20, the set of nodal functions {uj (x)}m


j=1 associated
with x exists and can be expressed by


m
ui (x) = (K[x]−1 )ij Kxj (x), i = 1, . . . , m.
j=1

For each nonempty subset J of {1, . . . , m}, we define $the function % uJ (x) :=
  u (x), where η > η satisfies K[x]−1  2 . These 2m − 1
j∈J η j 2 < 1
m R/η
functions are η-separated in C (X ).
For J1  = J2 , there exists some j0 ∈ {1, . . . , m} lying in one of the sets J1 , J2 ,
but not in the other. Hence

uJ1 − uJ2 ∞ ≥ |uJ1 (xj0 ) − uJ2 (xj0 )| ≥ η > η.

What is left is to show that the functions uJ lie in BR . To see this, take
∅  = J ⊆ {1, . . . , m}. Then
8 m 
9
  m 
 
 −1  −1
uJ K = η
2
K[x] Kx , η K[x] 
K xs
j js
j∈J =1 j ∈J s=1 K
m 
  
m  
= η2 K[x]−1 K[x]−1 (K[x])s
j j s
j,j ∈J =1 s=1

  m 
 
= η2 K[x]−1 = η2 (K[x])−1 e
jj i
j,j ∈J i=1

≤ η2 (K[x])−1 e 1 (J ) ≤ η2 m (K[x])−1 e 2 (J )

≤ η2 m e (K[x])−1 2 = η2 m (K[x])−1 2 ,

where e is the vector in 2 (J ) with all components 1. It$ follows


%2 that uJ ≤
√ K
η m K[x]−1 2 , and uJ ∈ BR since K[x]−1 2 ≤ m1 R/η .
1/2

Thus lower bounds for packing numbers and covering numbers can be
obtained in terms of the norm of the inverse of the Gramian matrix. The latter
can be estimated for convolution-type kernels, that is, kernels K(x, y) = k(x−y)
for some function k in Rn , in terms of the Fourier transform k of k.
Proposition 5.22 Suppose K(x, y) = k(x−y) is a Mercer kernel on X = [0, 1]n
and the Fourier transform of k is positive; that is,

k(ξ ) > 0, ∀ξ ∈ Rn .

For N ∈ N, if XN := {0, 1, . . . , N − 1}n and x = {α/N }α∈XN , then

−1
K[x]−1 2 ≤ N −n inf k(ξ ) .
ξ ∈[−N π,N π] n

Proof. By the inverse Fourier transform,



k(x) = (2π)−n k(ξ )eix·ξ d ξ ,
Rn

we know that for any vector c := (cα )α∈XN ,


 
cT K[x]c = cα cβ (2π)−n k(ξ )ei((α/N )−(β/N ))·ξ d ξ
α,β Rn
  2
 
= (2π) −n 
k(ξ ) iα·ξ/N 
cα e  dξ
Rn α
  2
 
= (2π) −n
N n 
k(N ξ ) iα·ξ 
cα e  d ξ .
Rn α

Bounding from below the integral over the subset [−N π , N π ]n , we see that
  2
 
c K[x]c ≥ (2π )
T −n
N n
inf k(η)  iα·ξ 
cα e  d ξ
η∈[−N π,N π]n 
[−π,π]n
α

= c 2
2 (XN )
Nn inf k(ξ ) .
ξ ∈[−N π,N π]n

It follows that the smallest eigenvalue of the matrix K[x] is at least

Nn inf k(ξ ) ,
ξ ∈[−N π,N π]n

from which the estimate for the norm of the inverse matrix follows. 
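Proposition 5.22 can be checked numerically for the Gaussian kernel, whose Fourier transform is given explicitly in (5.17). The sketch below (an added illustration; n = 1 and the parameter values σ, N are arbitrary choices) compares the smallest eigenvalue of the Gramian on the grid {α/N} with the predicted lower bound $N^n \inf_{\xi\in[-N\pi,N\pi]^n}\hat k(\xi)$.

import numpy as np

sigma, N = 0.2, 8
grid = np.arange(N) / N                                  # nodes alpha/N in [0, 1)
K = np.exp(-np.subtract.outer(grid, grid)**2 / sigma**2) # Gramian K[x] for the Gaussian kernel

smallest = np.linalg.eigvalsh(K).min()
k_hat = lambda xi: sigma * np.sqrt(np.pi) * np.exp(-sigma**2 * xi**2 / 4)  # (5.17) with n = 1
bound = N * k_hat(N * np.pi)                             # N^n * inf over [-N*pi, N*pi]

print(smallest, ">=", bound)                             # so K[x] is invertible, as claimed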
Combining Theorem 5.21 and Proposition 5.22, we obtain the following
result.
Theorem 5.23 Suppose K(x, y) = k(x − y) is a Mercer kernel on X = [0, 1]n
and the Fourier transform of k is positive. Then, for N ∈ N,
 η
ln N IK (BR ), ≥ ln M(IK (BR ), η) ≥ ln 2{N n − 1},
2

provided N satisfies
 η 2
inf k(ξ ) ≥ . 
ξ ∈[−N π ,N π]n R

As an example, we use Theorem 5.23 to give lower bounds for covering


numbers of balls in RKHSs in the case of Gaussian kernels.
Corollary 5.24 Let σ > 0, n ∈ N, and
 
x 2
k(x) = exp − 2 , x ∈ Rn .
σ

Set X = [0, 1]n , and let the kernel K be given by

K(x, t) = k(x − t), x, t ∈ [0, 1]n .



Then, for 0 < η ≤ R2 (σ π/2)n/2 e−nσ π /8 ,
2 2

n
2 1 R √
ln N (IK (BR ), η) ≥ ln 2 ln + ln(σ π )
σπ n η
n/2
2
− + 1 ln 2 − ln 2.
n

Proof. Since k is positive, we may use Theorem 5.23. 

5.5 On the smoothness of box spline kernels


The only result of this section, Proposition 5.25, shows a way to construct box
spline kernels with a prespecified smoothness r ∈ N.
Proposition 5.25 Let B0 = [b1 , . . . , bn ] be an invertible n × n matrix. Let
B = [B0 B0 . . . B0 ] be an s-fold copy of B0 and k(x) = (MB ∗MB )(x) be induced
by the convolution of the box spline MB with itself. Finally, let X ⊆ Rn , and let
K : X × X → R be defined by K(x, y) = k(x − y). Then K ∈ C r (X × X ) for
all r < 2s − n.
Proof. By Example 2.17, the Fourier transform k of k(x) = (MB ∗ MB )(x)
satisfies
⎧ ⎫s
⎨ n
sin((ξ · bj )/2) 2 ⎬
k(ξ ) = .
⎩ (ξ · bj )/2 ⎭
j=1
5.5 On the smoothness of box spline kernels 107

To get the smoothness of K we estimate the decay of k. First we observe that


the function t  → (sin t)/t satisfies, for all t ∈ (−1, 1),
 
 sin t  1 2
 
 t  ≤ |t| ≤ 1 + |t| .

Also, when t ∈ (−1, 1), |(sin t)/t| ≤ 1. Hence, for all t ∈ R,

 
 sin t 2 2 2
4
 
 t  ≤ 1 + |t|

1 + t2
.

It follows that for all ξ ∈ Rn ,


n
sin((ξ · bj )/2) 2
4n
≤:   2  .
(ξ · bj )/2 n
1 + (ξ · bj )/2
j=1 j=1

If we denote η = B0T ξ , then ξ · bj = bTj ξ = ηj and

      

n
 ξ · bj 2 n ηj2 1 2
n
1
1 +   = 1 + ≥ 1 + ηj = 1 + η 2 .
2  4 4 4
j=1 j=1 j=1

But η 2 = B0T ξ 2 ≥ |λ0 |2 ξ 2 , where λ0 is the smallest (in modulus)


eigenvalue of B0 . It follows that for all ξ ∈ Rn ,
 s
s
4n 4n 2 −s
k(ξ ) ≤ ≤ (1 + ξ ) .
1 + 14 |λ0 |2 ξ 2 min{1, |λ0 |2 /4}

Therefore, for any p < 2s − n2 ,

 2s
4n
(1 + ξ ) |k(ξ )|2 d ξ ≤
2 p
Rn min{1, |λ0 |2 /4}
 2s−p
1
dξ < ∞
Rn 1+ ξ 2

and, thus, k ∈ H p (Rn ). By the Sobolev embedding theorem, K ∈ C r (X × X )


for all r < 2s − n. 
108 5 Estimating covering numbers

5.6 References and additional remarks


Properties of function spaces on bounded domains X are discussed in [120].
In particular, one can find conditions on X (such as having a minimally smooth
boundary) for the extension of function classes on a bounded domain X to the
corresponding classes on Rn .
Estimating covering numbers for various function spaces is a standard theme
in the fields of function spaces [47] and approximation theory [78, 100]. The
upper and lower bounds (5.2) for generalized Lipschitz spaces and, more
generally, Triebel–Lizorkin spaces can be found in [47].
The upper bounds for covering numbers of balls of RKHSs associated with
Sobolev smooth kernels described in Section 5.2 (Theorem 5.8) and the lower
bounds given in Section 5.4 (Theorem 5.21) can be found in [156]. The bounds
for analytic translation invariant kernels discussed in Section 5.3 are taken
from [155].
The bounds (5.23) for the Fourier transform of the inverse multiquadrics
can be found in [82] and [105], where properties of nodal functions and
Proposition 5.20 can also be found.
For estimates of smoothness of general box splines sharper than those in
Proposition 5.25, see [41].
6
Logarithmic decay of the approximation error

In Chapter 4 we characterized the regression functions and kernels for which


the approximation error has a decay of order O(R−θ ). This characterization was
in terms of the integral operator LK and interpolation spaces. In this chapter we
continue this discussion.
We first show, in Theorem 6.2, that for a C ∞ kernel K (and under a mild
condition on ρX ) the approximation error can decay as O(R−θ ) only if fρ
is C ∞ as well. Since the latter is too strong a requirement on fρ , we now
focus on regression functions and kernels for which a logarithmic decay in the
approximation error holds. Our main result, Theorem 6.7, is very general and
allows for several applications. The result, which will be proved in Section 6.4,
shows some such consequences for our two main examples of analytic kernels.

Theorem 6.1 Let X be a compact subset of Rn with piecewise smooth boundary


and fρ ∈ H s (X ) with s > 0. Let σ , c > 0.

(i) (Gaussian) For K(x, t) = e−|x−t|


2 /σ 2
we have

inf fρ − g L 2 (X ) ≤ C(ln R)−s/8 , R ≥ 1,


g K ≤R

where C is a constant independent of R. When s > n2 ,

inf fρ − g C (X ) ≤ C(ln R)(n/16)−(s/8) , R ≥ 1.


g K ≤R

(ii) (Inverse multiquadrics) For K(x, t) = (c2 + |x − t|2 )−α with α > 0 we have

inf fρ − g L 2 (X ) ≤ C(ln R)−s/2 , R ≥ 1,


g K ≤R

109
110 6 Logarithmic decay of the approximation error

where C is a constant independent of R. When s > n2 ,

inf fρ − g C (X ) ≤ C(ln R)(n/4)−(s/2) , R ≥ 1.


g K ≤R

The quantity inf g K ≤ R fρ − g L 2 (X ) is not the approximation error (−σρ2 )


A( fρ , R) unless ρX is the Lebesgue measure µ. It is, however, possible to obtain
bounds on A( fρ , R) using Theorem 6.1. If s > n2 , one can use the bound in C (X )
for bounding inf g K ≤ R fρ − g Lρ2 (X ) for an arbitrary ρX . This is so since
X
sup{ f Lρ2 (X ) : ρX is a probability measure on X } = f C (X ) . In the general
X
case,
inf fρ − g Lρ2 (X ) ≤ Dµρ inf fρ − g L 2 (X ) ,
g K ≤R X g K ≤R

where Dµρ denotes the operator norm of the identity

Id
L 2 (X ) → Lρ2X (X ).

We call Dµρ the distortion of ρ (with respect to µ). It measures how much
ρX distorts the ambient measure µ. It is often reasonable to suppose that the
distortion Dµρ is finite.
Since ρ is not known, neither, in general is Dµρ . In some cases, however,
the context may provide some information about Dµρ . An important case is the
one in which, despite ρ not being known, we do know ρX . In this case Dµρ
may be derived.
In Theorem 6.1 we assume Sobolev regularity only for the approximated
function fρ . To have better approximation orders, more information about ρ
should be used: for instance, analyticity of fρ or degeneracy of the marginal
distribution ρX .

6.1 Polynomial decay of the approximation error for C ∞


kernels
In this section we use Corollary 4.17, Theorem 5.5, and the embedding
relation (5.1) to prove that a C ∞ kernel cannot yield a polynomial decay in the
approximation error unless fρ is C ∞ itself, assuming a mild condition on the
measure ρX .
We say that a measure ν dominates the Lebesgue measure on X when d ν(x) ≥
C0 dx for some constant C0 > 0.
6.1 Polynomial decay of the approximation error for C ∞ kernels 111

Theorem 6.2 Assume X ⊆ Rn has piecewise smooth boundary and K is a C ∞


Mercer kernel on X . Assume as well that ρX dominates the Lebesgue measure
on X . If for some θ > 0

A( fρ , R) := inf f − g Lρ2 (X ) = O(R−θ ),


g K ≤R X

then fρ is C ∞ on X .

Proof. Since ρX dominates the Lebesgue measure µ we have that ρX is


nondegenerate. Hence, HK+ = HK by Remark 4.18. By Corollary 4.17, our decay
assumption implies that fρ ∈ (Lρ2X (X ), HK )θ/(2+θ) . We show that for all s > 0,
fρ ∈ Lip∗(s, Lµ2 (X )). To do so, we take r ∈ N, r ≥ 2s(2 + θ )/θ > s, and t ∈ Rn .
Let g ∈ HK and x ∈ Xr,t . Then


r
r
rt fρ (x) = rt ( fρ −g)(x)+rt g(x) = (−1)r−j ( fρ −g)(x + jt)+rt g(x).
j
j=0

Let  = 2s(2 + θ )/θ. Using the triangle inequality and the definition of

Lip (/2,C (X ))
, it follows that

 1/2

r
r
|rt fρ (x)|2 dx ≤ fρ − g Lµ2 (X ) + rt g Lµ2 (X )
Xr,t j
j=0
/2
≤ 2r fρ − g Lµ2 (X ) + µ(X ) g ∗
Lip (/2,C (X ))
t .

Since K is C ∞ , g ∈ HK , and r > 2 , we can apply Theorem 5.5 to deduce that

7
g ∗
Lip (/2,C (X ))
≤ 2r+1 K ∗
Lip ()
g K.


Also, d ρX (x) ≥ C0 dx implies that fρ − g Lµ2 (X ) ≤ (1/ C0 ) fρ − g Lρ2 (X )
X
and µ(X ) ≤ 1/C0 . By taking the infimum over g ∈ HK , we see that

 1/2
1 
|rt fρ (x)|2 dx ≤√ inf 2r fρ − g Lρ2 (X )
Xr,t C0 g∈HK X

7
+ 2r+1 K Lip∗() g K t /2
112 6 Logarithmic decay of the approximation error

1  r 7 r+1 
≤√ 2 + 2 K Lip∗()
C0

inf fρ − g Lρ2 (X ) + t /2 g K .
g∈HK X

Since fρ ∈ (Lρ2X (X ), HK )θ/(2+θ) , by the definition of the interpolation space


in terms of the K-functional, we have
  
inf fρ − g Lρ2 (X ) + t /2 g K = K fρ , t /2
g∈HK X

≤ C0 t θ/2(2+θ)
= C0 t s ,

where C0 may be taken as the norm of fρ in the interpolation space. It follows
that
 1/2
−s
| fρ |Lip∗(s,L 2 (X )) = sup t |rt fρ (x)|2 dx
µ
t∈Rn Xr,t

1  r 7 r+1 
≤√ 2 + 2 K ∗
Lip ()
C0 < ∞.
C0

Therefore, fρ ∈ Lip∗(s, Lµ2 (X )). By (5.1) this implies fρ ∈ C d (X ) for any


integer d < s − n2 . But s can be arbitrarily large, from which it follows that
fρ ∈ C ∞ (X ). 

6.2 Measuring the regularity of the kernel


The approximation error depends not only on the regularity of the approximated
function but also on the regularity of the Mercer kernel. We next measure the
regularity of a Mercer kernel K on a finite set of points x = {x1 , . . . , xm } ⊂ X .
To this end, we introduce the following function:
⎧ ⎧ ⎫1/2 ⎫

⎨ ⎨ ⎬ ⎪ ⎬
m  m
K (x) := sup inf K(x, x) − 2 wi K(x, xi ) + wi K(xi , xj )wj .
x∈X ⎪
⎩w∈Rm ⎩ i=1 i,j = 1
⎭ ⎪ ⎭
(6.1)

We show that by choosing wi appropriately, one has K (x) → 0 as x becomes


dense in X . It is the order of decay of K (x) with respect to the density of x in
X that now measures the regularity of functions in HK . The faster the decay,
the more regular the functions.
6.2 Measuring the regularity of the kernel 113

As an example to see how K (x) measures the regularity of functions in HK ,


suppose that for some 0 < s ≤ 1, the kernel K is Lip(s); that is,

|K(x, y) − K(x, t)| ≤ C(d (y, t))s , ∀x, y, t ∈ X ,

where C is a constant independent of x, y, t. Define the number

dx := max min d (x, xi )


x∈X i ≤ m

to measure the density of x in X . Let x ∈ X . Choose x ∈ x such that


d (x, x ) ≤ dx . Set the coefficients {wj }m
j = 1 as w = 1, and wj = 0 if j  = . Then


m 
K(x, x)−2 wi K(x, xi )+ wi K(xi , xj )wj = K(x, x)−2K(x, x )+K(x , x ).
i=1 i,j = 1

The Lip(s) regularity and the symmetry of K yield

K(x, x) − 2K(x, x ) + K(x , x ) ≤ 2C(d (x, x ))s ≤ 2Cdxs .

Hence
K (x) ≤ 2Cdxs .
In particular, if X = [0, 1] and x = { j/N }N j=0 , then dx ≤ 2N , and therefore
1
−s
K (x) ≤ 2 CN . We obtain a polynomial decay with exponent s.
1−s

When K is C s , the function K (x) decays as O(dxs ). For analytic kernels,


K (x) often decays exponentially. In this section we derive these decaying
rates for the function K (x).
Recall the Lagrange interpolation polynomials {wl,s−1 (t)}s−1 l=0 on [0, 1] with
interpolating points {0, 1/s − 1, . . . , 1} defined by (5.3) with s replaced by s−1.
For any polynomial p of degree at most s − 1,


s−1
wl,s−1 (t)p(l/(s − 1)) = p(t).
l=0

This, together with the Taylor expansion of f at t,


s−1 (j)
f (t)
f (y) = (y − t)j + Rs ( f )(y, t),
j!
j=0
114 6 Logarithmic decay of the approximation error

implies that
 s−1   s−1 
   
   
 wl,s−1 (t)( f (l/(s − 1)) − f (t)) =  wl,s−1 (t)Rs ( f )(l/(s − 1), t) .
   
l=0 l =0

Here Rs ( f ) is the linear operator representing the remainder of the Taylor


expansion and satisfies
  y 
 1 

|Rs ( f )(y, t)| =  (y − u) f (u)du
s−1 (s)
(s − 1)! t
|y − t|s (s) 1 (s)
≤ f ∞≤ f ∞.
s! s!

Using Lemma 5.9, we now obtain


 s−1 
  (s − 1)2s−1
 
 wl,s−1 (t)( f (l/(s − 1)) − f (t)) ≤ f (s) ∞. (6.2)
  s!
l=0

Recall also the multivariate Lagrange interpolation polynomials


n
wα,s−1 (x) = wαj ,s−1 (xj ),
j=1

x = (x1 , . . . , xn ), α = (α1 , . . . , αn ) ∈ {0, . . . , s − 1}n

defined by (5.6).
Now we can estimate K (x) for C s kernels as follows.

Theorem 6.3 Let X = [0, 1]n , s ∈ N, and K be a Mercer kernel on X such that
for each α ∈ Nn with |α| ≤ s,

∂α ∂ |α|
K(x, y) = K(x, y) ∈ C ([0, 1]2n ).
∂yα ∂y1α1 · · · ∂ynαn

Then, for N ≥ s and x = {α/N }α∈{0,1, ..., N −1}n , we have


 n  s 

22sn+1 ss+2n  ∂ K 
K (x) ≤   N −s .
s!  ∂ys 
i=1 i C (X ×X )
6.2 Measuring the regularity of the kernel 115

Proof. Let x ∈ X . Then x = Nβ + s−1 N t for some β ∈ {0, 1, . . . , N − s + 1}


n

and t ∈ [0, 1] . Choose the coefficients in the definition of K (x) to be


n


wγ ,s−1 (t) if α = β + γ , γ ∈ {0, . . . , s − 1}n
wα =
0 otherwise.

Then we can see that the expression in the definition of K is


 β +γ
K(x, x) − 2 wγ ,s−1 (t)K x,
N
γ ∈{0,...,s−1}n
 β +γ β +η
+ wγ ,s−1 (t)K , wη,s−1 (t).
N N
γ ,η∈{0,...,s−1}n


But γ ∈{0,...,s−1}n wγ ,s−1 (t) ≡ 1. So the above expression equals

   
β +γ
wγ ,s−1 (t) K(x, x) − K x, + wγ ,s−1 (t)
N
γ ∈{0,...,s−1} n γ ∈{0,...,s−1}n
⎧ ⎫
⎨  β +γ β +η β +γ ⎬
wη,s−1 (t) K , −K ,x .
⎩ N N N ⎭
η∈{0,...,s−1}n

$
Using Equation (6.2) for %the univariate function g(z) = f γ1 /(s − 1), . . . ,
(γi−1 )/(s − 1), z, ti+1 , . . . , tn with z ∈ [0, 1] and all the other variables fixed,
we get
 
 s−1   
 γi  (s − 1)2s−1  ∂ s f 
 w (t ) g − g(t )  ≤   .
 γi ,s−1 i
s−1
i   ∂t s 
 γi =0  s! i C ([0,1]n )

Using Lemma 5.9 for γj , j  = i, we conclude that for a function f on [0, 1]n and
for i = 1, . . . , n,


  γ1 γi−1 γi
 wγ ,s−1 (t) f ,..., , , ti+1 , . . . , tn
 s−1 s−1 s−1
γ ∈{0,...,s−1}n

γ1 γi−1 
−f ,..., , ti , ti+1 , . . . , tn  ≤ ((s − 1)2s−1 )n−1
s−1 s−1
 s 
(s − 1)2 s−1 ∂ f 
  .
s!  ∂t s 
i C ([0,1]n )
116 6 Logarithmic decay of the approximation error

Replacing γi /(s − 1) by ti each time for one i ∈ {1, . . . , n}, we obtain


 
 
  γ 
 wγ ,s−1 (t) f − f (t) 
 s−1
γ ∈{0,...,s−1}n 
n  s 
((s − 1)2s−1 )n  
∂ f 

≤  .
s! ∂ti C ([0,1]n )
s
i=1

  
β+γ
Applying this estimate to the functions f (t) = K x, β+(s−1)t
N and K N ,

β+(s−1)t
N , we find that the expression for K can be bounded by

 n  s 
 ((s − 1)2s−1 )n  s s  ∂ K 
1 + ((s − 1)2 )
s−1 n   .
s! N  ∂ys 
i C (X ×X )
i=1

This bound is valid for each x ∈ X . Therefore, we obtain the required estimate
for K (x) by taking the supremum for x ∈ X . 

The behavior of the quantity K (x) is better if the kernel is of convolution


type, that is, if K(x, y) = k(x − y) for an analytic function k.

Theorem 6.4 Let X = [0, 1]n and K(x, y) = k(x − y) be a Mercer kernel on X
with
k(ξ ) ≤ C0 e−λ|ξ | , ∀ξ ∈ Rn
for some constants C0 > 0 and λ > 4 + 2n ln 4. Then, for x = { Nα }α∈{0,1,...,N −1}n
with N ≥ 4n/ ln min{eλ, 4−n eλ/2 }, we have
  N /2
1 4n
K (x) ≤ 4C0 max , .
eλ eλ/2

Proof. Let XN := {0, . . . , N − 1}n . For a fixed x ∈ X , choose the coefficients


wi in (6.1) to be wα,N (x). Then the expression of the definition of K (x) becomes
QN (x) given by (5.13). It follows that K (x) ≤ supx∈X QN (x). Hence by (5.14),

K (x) ≤ ϒk (N ).

But the assumption of the kernel here verifies the condition in Theorem 5.15.
Thus, we can apply the estimate (5.21) for ϒk (N ) to draw our conclusion
here. 
6.3 Estimating the approximation error in RKHSs 117

6.3 Estimating the approximation error in RKHSs


Recall the nodal functions {ui = ui,x }mi=1 associated with a finite subset x of X ,
given by (5.25). We use them in the RKHS HK on a compact metric space
(X , d ) to construct an interpolation scheme. This scheme is defined as follows:


m
Ix ( f )(x) = f (xi )ui (x), x ∈ X , f ∈ C (X ). (6.3)
i=1

It satisfies Ix ( f )(xi ) = f (xi ) for i = 1, . . . , m.


The error of the interpolation scheme for functions in HK can be estimated
as follows.
Proposition 6.5 Let K be a Mercer kernel and x = {x1 , . . . , xm } ⊂ X such
that K[x] is invertible. Define the interpolation scheme Ix by (6.3). Then, for
f ∈ HK ,
Ix ( f ) − f C (X ) ≤ K (x) f K
and
Ix ( f ) K ≤ f K.

Proof. For x ∈ X

m 
m
Ix ( f )(x) − f (x) = f (xi )ui (x) − f (x) = ui (x)Kxi , f K − Kx , f K
i=1 i=1
.
m /
= ui (x)Kxi − Kx , f ,
i=1 K

the second equality by the reproducing property of HK applied to f . By the


Cauchy–Schwarz inequality in HK ,
 
 m 
|Ix ( f )(x) − f (x)| ≤ 
 u i (x)K xi − K 
x f K.
i=1 K

Since Ks , Kt K = K(s, t), we have


 2
 
 m  m m
 u (x)K −K  = K(x, x)−2 u (x)K(x, x )+ ui (x)K(xi , xj )uj (x).
 i xi x i i
i=1 K i=1 i,j=1

By Proposition 5.20, the quadratic function


m 
m
Q(w) = K(x, x) − 2 wi K(x, xi ) + wi K(xi , xj )wj
i=1 i,j=1
118 6 Logarithmic decay of the approximation error

is minimized over Rm at (ui (x))m


i=1 . Therefore,

 m 
 
 ui (x)Kxi − Kx 
  ≤ K (x).
i=1 K

It follows that
|Ix ( f )(x) − f (x)| ≤ K (x) f K.

This proves the estimate for Ix ( f ) − f C (X ) .


To prove the other inequality, note that, since Ix ( f ) ∈ HK and
Ix ( f )(xi ) = f (xi ), for i = 1, . . . , m,

0 = Ix ( f )(xi ) − f (xi ) = Kxi , Ix ( f ) − f K .

This means that Ix ( f ) − f is orthogonal to span{Kxi }m


i=1 . Hence Ix ( f ) is the
orthogonal projection of f onto span{Kxi }mi=1 and therefore Ix ( f ) K ≤ f K .

Proposition 6.5 bounds the interpolation error Ix ( f ) − f C (X ) in terms of
the regularity of K and the density of x in X measured by K (x), when the
approximated function f lies in the RKHS (i.e., f ∈ HK ).
In the remainder of this section, we deal with the interpolation error when
the approximated function is from a larger function space (e.g., a Sobolev
space), not necessarily from HK . To this end, we need to know how large
HK is compared with the space where the approximated function lies. For a
convolution-type kernel, that is, a kernel K(x, y) = k(x − y) with k(ξ ) > 0, this
depends on how slowly k decays. So, in addition to the function K measuring
the smoothness of K, we use the function k : R+ → R+ defined by

−1/2
k (r) := inf k(ξ ) ,
[−rπ,rπ] n

measuring the speed of decay of k. Note that k is a nondecreasing function.


We also use the function ϒk (N ), which measures the regularity of K, introduced
in Section 5.3.
Lemma 6.6 Let k ∈ L 2 (Rn ) be an even function with k(ξ ) > 0, and K be
the kernel on X = [0, 1]n given by K(x, y) = k(x − y). For f ∈ L 2 (Rn ) and
M ≤ N ∈ N, we define fM ∈ L 2 (Rn ) by

f (ξ ) if ξ ∈ [−M π , M π ]n
fM (ξ ) =
0 otherwise.
6.3 Estimating the approximation error in RKHSs 119

Then, for x = {0, N1 , . . . , NN−1 }n , we have


(i) Ix ( fM ) K ≤ f L 2 k (N ).
(ii) fM − Ix ( fM ) C (X ) ≤ f L 2 k (M )ϒk (N ).
(iii) f − fM 2L 2 (X ) ≤ (2π)−n ξ ∈[−M π,M π ]n |f (ξ )|2 d ξ → 0 (as M → ∞).
Proof.
(i) For i, j ∈ XN := {0, 1, . . . , N − 1}n and xi = i/N ∈ x, expression (5.25) for
the nodal function ui associated with x gives

ui , uj K = (K[x]−1 )is (K[x]−1 )jt Kxs , Kxt K
s,t∈XN
$ % $ % $ % $ %
= K[x]−1 is
K[x]−1 jt
K[x] ts
= K[x]−1 ij .
s,t∈XN

Then, for g ∈ C (X ), we have


 2
 
Ix (g) 2 
= g(xi )ui (x)
K 
i∈XN K
 $ %T $ %
= g(xi )g(xj )ui , uj K = g|x K[x]−1 g|x ,
i,j∈XN

n
where g|x is the vector (g(xi ))i∈XN ∈ RN . It follows that

Ix (g) 2
K ≤ K[x]−1 2 g|x 2
2 (XN )
= K[x]−1 2 |g(xi )|2 .
i∈XN

We now apply this analysis to the function fM satisfying fM (ξ ) = 0 for


ξ ∈ [−N π, N π]n \ [−M π, M π]n to obtain

   2
 
| fM (xj )| =2 (2π)−n fM (ξ )e i(j/N )·ξ
d ξ 

j∈XN j∈XN ξ ∈[−M π,M π]n
  2
 
≤ (2π)−n fM (N ξ )eij·ξ N n d ξ 

j∈XN ξ ∈[−π,π]n

 
≤ (2π)−n fM (N ξ )N n 2 d ξ ≤ N n f 2 2 .
L
ξ ∈[−π ,π] n

Then
Ix ( fM ) 2
K ≤ K[x]−1 2 N n f 2L 2 .
120 6 Logarithmic decay of the approximation error

$ %2
But, by Proposition 5.22, K[x]−1 2≤N
−n k (N ) . Therefore,

Ix ( fM ) K ≤ f L 2 k (N ).

This proves the statement in (i).


(ii) Let x ∈ X . Then
   
−n ixj ·ξ
fM (x)−Ix ( fM )(x) = (2π) f (ξ ) e −
ix·ξ
uj (x)e dξ.
ξ ∈[−M π,M π]n j∈XN

By the Cauchy–Schwarz inequality,


  1/2
−n |f (ξ )|2
| fM (x) − Ix ( fM )(x)| ≤ (2π) dξ
ξ ∈[−M π,M π]n k(ξ )
   2 1/2
 ix·ξ  
(2π) −n 
k(ξ )e − ixj ·ξ 
uj (x)e  d ξ .
R n
j∈XN

The first term on the right is bounded by f L 2 k (M ), since k(ξ ) ≥


−2
k (M ) for ξ ∈ [−M π, M π] . The second term is
n

  1/2
k(0) − 2 uj (x)k(x − xj ) + ui (x)k(xi − xj )uj (x) ,
j∈XN i,j∈XN

which can be bounded by ϒk (N ) according to (5.14) with {0, 1, . . . , N }n


replaced by {0, 1, . . . , N − 1}n . Therefore,

fM − Ix ( fM ) C (X ) ≤ f L 2 k (M )ϒk (N ).

(iii) By Plancherel’s formula (Theorem 2.3),



f − fM 2L 2 (Rn ) = (2π)−n |f (ξ )|2 d ξ .
ξ ∈[−M π,M π]n

Thus, all the statements hold true. 

Lemma 6.6 provides quantitative estimates for the interpolation error:


  1/2
f − Ix ( fM ) L 2 (X ) ≤ (2π)−n |f (ξ )|2 d ξ
ξ ∈[−M π,M π]n

+ f L 2 k (M )ϒk (N )
6.3 Estimating the approximation error in RKHSs 121

with
Ix ( fM ) K ≤ f L 2 k (N ).
Choose N = N (M ) ≥ M such that k (M )ϒk (N ) → 0 as M → +∞. We
then have f − Ix ( fM ) L 2 (X ) → 0. Also, the RKHS norm of Ix ( fM ) is
asymptotically controlled by k (N ).
We can now state the main estimates for the approximation error for balls
in the RKHS HK on X = [0, 1]n . Denote by −1 k the inverse function of the
nondecreasing function k ,

−1
k (R) := max{r > 0 : k (r) ≤ R}, for R > 1/k(0).

Theorem 6.7 Let X = [0, 1]n , s > 0, and f ∈ H s (Rn ). Then, for R > f 2,
& −s
'
inf f − g L 2 (X ) ≤ inf k (M ) f 2 ϒk (NR ) + f s (π M ) ,
g K ≤R 0 < M ≤ NR

where NR = −1
k (R/ f 2 ), the integer part of −1
k (R/ f 2 ). If s > n2 , then
 
f s
inf f −g C (X ) ≤ inf k (M ) f 2 ϒk (NR ) + √ M (n/2)−s .
g K ≤R 0 < M ≤ NR s − n/2

Proof. Take N to be NR .
Let M ∈ (0, N ]. Set the function fM as in Lemma 6.6. Then, by Lemma 6.6,

Ix ( fM ) K ≤ f 2 k (N ) ≤ R

and

f − Ix ( fM ) L 2 (X ) ≤ fM − Ix ( fM ) C (X ) + f − fM L 2 (X )
−s
≤ k (M ) f 2 ϒk (N ) + f s (π M ) .

If s > n2 , then

f s n
f − fM C (X ) ≤ (2π)−n |f (ξ )|d ξ ≤ √ M 2 −s .
ξ ∈[−M π,M π ]n s − n/2

Hence the second statement of the theorem also holds. 

Corollary 6.8 Let X = [0, 1]n , s > 0, and f ∈ H s (Rn ). If for some
α1 , α2 , C1 , C2 > 0, one has

k(ξ ) ≥ C1 (1 + |ξ |)−α1 , ∀ξ ∈ Rn
122 6 Logarithmic decay of the approximation error

and
ϒk (N ) ≤ C2 N −α2 , ∀N ∈ N,

then, for R > (1 + 1/C1 )(1 + 1/k(0)) f 2,

 R −γ
−2s/α1
inf f − g L 2 (X ) ≤ C3 f 2 + C1 f s ,
g K ≤R f 2

−2α2 /α1 $ $√ %α1 %1/2


where C3 := 2α2 C2 C1 2α1 /C1 + nπ C1 and

 4α2 s
α1 (α1 +2s) if α1 + 2s ≥ 2α2
γ= 2s
α1 , if α1 + 2s < 2α2 .

If, in addition, s > n/2, then


 
(n−2s)/α1 s−n/2 −γ 
C1 2 R
inf f − g C (X ) ≤ C3 f 2+ √ f s ,
g K ≤R s − n/2 f 2

where

α2 (4s−2n)
 α1 (α1 +2s−n) if α1 + 2s − n ≥ 2α2 ,
γ = 2s−n
α1 , if α1 + 2s − n < 2α2 .

Proof. By the assumption on the lower bound of k, we find

1 √
k (r) ≤ √ (1 + nπ r)α1 /2 .
C1

It follows that for R/ f 2 ≥ max{1/C1 , 1/k(0)},

−1
2/α1
k (R/ f 2) ≥ C1 (R/ f 2)
2/α1

and
2/α1
NR ≥ 12 C1 (R/ f 2)
2/α1 .

Also,
 −α2
ϒk (N ) ≤ C2 [−1
k (R/ f 2 )] ≤ 2α2 C2 (C1 R/ f 2)
−2α2 /α1
.
6.3 Estimating the approximation error in RKHSs 123

Then, by Theorem 6.7,


 √
nπM )α1
1/2
(1 +
inf f − g L 2 (X ) ≤ inf
g K ≤R 0 < M ≤ NR C1

−2α2 /α1
C1 R
2α2 C2 f 2 + f s (π M )
−s
.
f 2

Take M = 12 C1
2/α1
(R/ f γ /s
2) with γ as in the statement. Then, M ≤ NR , and
we can see that
 R −γ
−2s/α1
inf f − g L 2 (X ) ≤ C3 f 2 + C1 f s .
g K ≤R f 2

This proves the first statement of the corollary. The second statement can be
proved in the same way. 

Corollary 6.9 Let X = [0, 1]n , s > 0, and f ∈ H s (Rn ). If for some
α1 , α2 , δ1 , δ2 , C1 , C2 > 0, one has
& '
k(ξ ) ≥ C1 exp −δ1 |ξ |α1 , ∀ξ ∈ Rn

and
& '
ϒk (N ) ≤ C2 exp −δ2 N α2 , ∀N ∈ N,
then, for R > (1 + A/C1 ) f 2,

  −γ s
C2 B s/α1 s s/2 C1
inf f −g L 2 (X ) ≤ √ f 2 + δ1 2n f s ln R + ln ,
g K ≤R C1 f 2

where A, B are constants depending only on α1 , α2 , δ1 , δ2 , and


 1
α1 if α1 < α2
γ= α2
if α1 ≥ α2 .
2α12

If, in addition, s > n2 , then, for R > (1 + A/C1 ) f 2,

  γ ( n2 −s)
 C2 B C1
inf f − g C (X ) ≤ B √ f 2 + f s ln R + ln ,
g K ≤R C1 f 2

where B is a constant depending only on α1 , α2 , δ1 , δ2 , n, and s.


124 6 Logarithmic decay of the approximation error

Proof. By the assumption on the lower bound of k, we find


 
1 δ1 √ α1
k (r) ≤ √ exp ( nπ r) .
C1 2

It follows that for R/ f 2 ≥ max{1/ C1 , 1/k(0)},
√ 1/α1
1 2 R C1
−1
k (R/ f 2) ≥ √ ln ,
nπ δ1 f 2

and its integer part NR satisfies


√ 1/α1
1 2 R C1
NR ≥ R := √ ln .
2 nπ δ1 f 2
√ √
Also, for R > f 2 exp{(2/δ1 )1 (2 nπ)α1 }/ C1 ,
 & '
ϒk (NR ) ≤ C2 exp −δ2 [−1
k (R/ f 2 )]
α2
≤ C2 exp −δ2 Rα2 .

√ √
Then, by Theorem 6.7, for R > (1 + (1 + exp{(2/δ1 )(2 nπ )α1 }/ C1 )) f 2,
  
C2 f 2 δ1 √ α1 α2
inf f − g L 2 (X ) ≤ inf √ exp ( nπ M ) − δ2 R
g K ≤R 0<M ≤R C1 2
'
+ f s (πM )−s .

Take √
−1/α1 1/α1 −1 γ
δ1 2 R C1
M= √ ,
ln
nπ f 2

where γ is given in our statement.
 For R > f 2 / C1 , M ≤ R ≤ −1 k (R/ f 2 )
−α −α2 /α1 √ −α α1 /(γ α12 −α2 )

holds. Therefore, if R > exp (δ2 2 δ12 ( nπ ) )2 f 2 / C1 ,
we have

C2 f 2 −α /α √
inf f − g L 2 (X ) ≤ √ exp −δ2 2−α2 −1 δ1 2 1 ( nπ )−α2
g K ≤R C1
√ α /α
 √ −γ s
R C1 2 1 s/α R C1
ln + δ1 1 2s ns/2 f s ln
f 2 f 2
  √ −γ s
C2 f 2 s/α C1
≤ √ C3 + δ1 1 2s ns/2 f s ln R + ln ,
C1 f 2
6.5 References and additional remarks 125

−α /α √
where C3 := supx≥1 x−γ s exp{−δxα2 /α1 } and δ = δ22−α2 −1 δ1 2 1 ( nπ )−α2 .
This proves the first statement of the corollary. The second statement follows
using the same argument. 

6.4 Proof of Theorem 6.1


Now we can apply Corollary 6.9 to verify Theorem 6.1. We may assume that
X ⊆ [0, 1]n . Since X has piecewise smooth boundary, every function f ∈ H s
can be extended to a function F ∈ H s (Rn ) such that f H s (Rn ) ≤ CX f H s (X ) ,
where the constant CX depends only on X , not on f ∈ H s (X ).
(i) From (5.17) and (5.18), we know that the condition of Corollary 6.9 is
satisfied with
√ σ2
C1 = (σ π)n , δ1 = , α1 = 2
4
and
C2 > 0, α2 = 1, δ2 = ln min{16n, 2n }.
Then the bounds given in Corollary 6.9 hold. Since α1 > α2 , we have γ = 18
and the first statement of Theorem 6.1 follows with bounds depending
on CX .
(ii) By (5.23), we see that the condition of Corollary 6.9 is valid. Moreover,

C1 > 0, δ1 = c + ε, α1 = 1

and  
1 ec/2
C2 > 0, α2 = 1, δ2 = ln min ec, n .
2 4
Then α1 = α2 and the bounds of Corollary 6.9 hold with γ = 12 . This yields
the second statement of Theorem 6.1.

6.5 References and additional remarks


Logarithmic decays of the approximation error can be characterized by general
interpolation spaces [16], as done for polynomial decays in Chapter 4. However,
sharp bounds for the decay of the K-functional are hard to obtain. For example,
it is unknown how far the power index 8s in Theorem 6.1 can be improved.
The function K (x) defined by (6.1) is called the power function in
the literature on radial basis functions. It was introduced by Madych and
126 6 Logarithmic decay of the approximation error

Nelson [82], and extensively used by Wu and Schaback [147], and it plays
an important role in error estimates for scattered data interpolation using radial
basis functions. In that literature (e.g., [147, 66]), the interpolation scheme (6.3)
is essential. What is different in learning theory is the presence of an RKHS
HK , not necessarily a Sobolev space.
Theorem 6.1 was proved in [113], and the approach in Section 6.3 was
presented in [157].
7
On the bias–variance problem

Let K be a Mercer kernel, and HK its induced RKHS. Assume that


(i) K ∈ C s (X × X ), and
θ/(4+2θ)
(ii) the regression function fρ satisfies fρ ∈ Range(LK ), for some θ > 0.
Fix a sample size m and a confidence 1 − δ, with 0 < δ < 1. To each R > 0
we associate a hypothesis space H = HK,R , and we can consider fH and, for
z ∈ Z m , fz . The bias–variance problem consists of finding the value of R that
minimizes a natural bound for the error E( fz ) (with confidence 1 − δ). This
value of R determines a particular hypothesis space in the family of such spaces
parameterized by R, or, to use a terminology common in the learning literature,
it selects a model.
Theorem 7.1 Let K be a Mercer kernel on X ⊂ Rn satisfying conditions (i)
and (ii) above.
(i) We exhibit, for each m ∈ N and δ ∈ [0, 1), a function

Em,δ = E : R+ → R

such that for all R > 0 and randomly chosen z ∈ Z m ,



( fz − fρ )2 d ρX ≤ E(R)
X

with confidence 1 − δ.
(ii) There is a unique minimizer R∗ of E(R).
(iii) When m → ∞, we have R∗ → ∞ and E(R∗ ) → 0.
The proof of Theorem 7.1 relies on the main results of Chapters 3, 4, and 5.
We show$ in Section 7.3 that% R∗ and E(R∗ ) $have the asymptotic% expressions
−θ/((2+θ)(1+2n/s))
R∗ = O m 1/((2+θ)(1+2n/s)) and E(R∗ ) = O m .

127
128 7 On the bias–variance problem

It follows from the proof of Theorem 7.1 that R∗ may be easily computed
from m, δ, IK , Mρ , fρ ∞ , g Lρ2 , and θ . Here g ∈ Lρ2X (X ) is such that
X
θ/(4+2θ)
LK (g) = fρ . Note that this requires substantial information about ρ and,
in particular, about fρ . The next chapter provides an alternative approach to
the one considered thus far whose corresponding bias–variance problem can be
solved without information on ρ.

7.1 A useful lemma


The following lemma will be used here and in Chapter 8.

Lemma 7.2 Let c1 , c2 , . . . , c > 0 and s > q1 > q2 > . . . > q−1 > 0. Then
the equation

xs − c1 xq1 − c2 xq2 − · · · − c−1 xq−1 − c = 0

has a unique positive solution x∗ . In addition,



x∗ ≤ max (c1 )1/(s−q1 ) , (c2 )1/(s−q2 ) , . . . , (c−1 )1/(s−q−1 ) , (c )1/s .

Proof. We prove the first assertion by induction on . If  = 1, then the


equation is xs − c1 = 0, which has a unique positive solution.
For  > 1 let ϕ(x) = xs − c1 xq1 − c2 xq2 − · · · − c−1 xq−1 − c . Then, taking
the derivative with respect to x,

ϕ  (x) = sxs−1 − q1 c1 xq1 −1 − · · · − c−1 q−1 xq−1 −1


 q1 c1 q1 −q−1 c−1 q−1 
= sxq−1 −1 xs−q−1 − x − ··· −
s s
=: sxq−1 −1 ψ(x).

By induction, hypothesis ϕ  has a unique positive zero that is the unique


positive zero x̄ of ψ. Since ψ(0) < 0 and limx→+∞ ψ(x) = +∞, we deduce
that ψ(x) < 0 for x ∈ [0, x̄) and ψ(x) > 0 for x ∈ (x̄, +∞). This implies that
ϕ  (x) < 0 for x ∈ (0, x̄) and ϕ  (x) > 0 for x ∈ (x̄, +∞). Therefore, ϕ is strictly
decreasing on [0, x̄) and strictly increasing on (x̄, +∞). But ϕ(0) < 0 and, hence
ϕ(x̄) < 0. Since ϕ is strictly increasing on (x̄, +∞) and limx→+∞ ϕ(x) = +∞,
we conclude that ϕ has a unique zero x∗ on (x̄, +∞) which is its unique positive
zero. The shape of ϕ is as in Figure 7.1. This proves the first statement.
7.2 Proof of Theorem 7.1 129

x x*

Figure 7.1

& '
To prove the second statement, letx > max (ci )1/(s−qi ) | i = 1, . . . ,  ,
where we set q = 0. Then, for i = 1, . . . , , ci < 1 xs−qi . It follows that


 
 1
ci xqi < xs−qi xqi = xs ;

i=1 i=1

that is, ϕ(x) > 0. 

Remark 7.3 Note that given c1 , c2 , . . . , c and s, q1 , q2 , . . . , q−1 , one can


efficiently compute (a good approximation of ) x∗ using algorithms such as
Newton’s method.

7.2 Proof of Theorem 7.1


We first describe the natural bound we plan to minimize. Recall that E( fz )
equals the sum EH ( fz ) + E( fH ) of the sample and approximation errors, or,
equivalently,

( fz − fρ )2 = EH (fz ) + inf f − fρ 2L 2 .
X f K ≤R ρX

We first want to bound the sample error. Let

M = M (R) = IK R + Mρ + fρ ∞. (7.1)
130 7 On the bias–variance problem

Then, for all f ∈ HK,R , | f (x) − y| ≤ M almost everywhere since

| f (x) − y| ≤ | f (x)| + |y| ≤ | f (x)| + |y − fρ (x)| + | fρ (x)|


≤ IK R + Mρ + fρ ∞.

The sample error ε satisfies, with confidence 1 − δ, by Theorem 3.3,


 ε  −mε/300M 2
N HK,R , e ≥δ
12M
ε
and therefore, by Theorem 5.1(i) with η = 12M (which applies due to
assumption (i)),
 
12MR 2n/s  mε
exp C exp − ≥δ
ε 300M 2

n/s
(with C = C(Diam(X ))n K C s (X ×X ) and C depending on X and s but
independent of R, ε, and M ) or
2n/s
mε 1 12M 2
− ln −C ≤ 0,
300M 2 δ IK ε

where we have also used that R IK ≤ M . Write v = ε/M 2 . Multiplying by


v2n/s , the inequality above takes the form

c0 v d +1 − c1 v d − c2 ≤ 0, (7.2)
 
where d = 2ns , c0 = 300 , c1 = ln δ , and c2 = C (12/ IK ) .
m 1 d

If we take the equality in (7.2), we obtain an equation that, by Lemma 7.2,


has exactly one positive solution for v. Let v ∗ (m, δ) be this solution. Then
ε(R) = M 2 v ∗ (m, δ) is the best bound we can obtain from Theorem 3.3 for the
sample error.
Now consider the approximation error. Owing to assumption (ii),
Theorem 4.1 applies to yield

A(fρ , R) ≤ 22+θ g 2+θ


Lρ2X
R−θ =: A(R),

θ/(4+2θ)
where g ∈ Lρ2X (X ) is such that LK (g) = fρ and

A( fρ , R) = inf E( f ) − E( fρ ) = inf f − fρ L
2
2 .
f ∈HK,R f K ≤R ρX

We can therefore take E(R) = A(R) + ε(R) and Part (i) is proved.
7.2 Proof of Theorem 7.1 131

We now proceed with Part (ii). For a point R > 0 to be a minimum of


A(R) + ε(R), it is necessary that A (R) + ε  (R) = 0. Taking derivatives and
noting that by (7.1), M  (R) = IK , we get

A (R) = −22+θ g 2+θ


Lρ2X
θ R−θ −1 and ε  (R) = 2M IK v ∗ (m, δ).

Therefore, writing Q = R1 , it is necessary that


  $ %
22+θ g 2+θ
θ Qθ+2 − 2(Mρ + fρ ∞) IK v ∗ (m, δ) Q
2 ∗
− 2 IK v (m, δ) = 0. (7.3)

By Lemma 7.2, it follows that there is a unique positive solution Q∗ of (7.3)


and, thus, a unique positive solution R∗ of A (R) + ε (R) = 0. This solution is
the only minimum of E since E(R) → ∞ when R → 0 and when R → ∞.
We finally prove Part (iii). Note that by Lemma 7.2, the solution of the
equation induced by (7.2) satisfies
⎧   ⎫
⎨ 600 ln(1/δ) 600C d 1/(d +1) ⎬
12
v∗ (m, δ) ≤ max , .
⎩ m m IK ⎭

Therefore, v∗ (m, δ) → 0 when m → ∞. Also, since 1/R∗ is a root of (7.3),


Lemma 7.2 applies again to yield

1 4(Mρ + fρ ∞ ) IK v ∗ (m, δ) 1/(θ+1)
≤ max ,
R∗ 22+θ g 2+θ θ
1/(θ+2)

4 IK 2 v ∗ (m, δ)
× ,
22+θ g 2+θ θ

from which it follows that R∗ → ∞ when m → ∞. Note that this implies that

2+θ −θ
lim A(R∗ ) ≤ lim 22+θ g R∗ = 0.
m→∞ m→∞

Finally, since Q∗ is a solution of equation (7.3),


  ( )
22+θ g 2+θ
θ Q∗θ − 2(Mρ + fρ ∞) IK Q∗ − 2 I K 2

v∗ (m, δ)
× = 0,
Q∗2
132 7 On the bias–variance problem

and therefore v∗ (m, δ)R2∗ = v∗ (m, δ)/Q∗2 → 0 when m → ∞, and, by (7.1),

lim ε(R∗ ) = lim M 2 v ∗ (m, δ)


m→∞ m→∞
2 ∗
= lim ( IK R∗ + Mρ + fρ ∞) v (m, δ) = 0.
m→∞

This finishes the proof of the theorem. 

7.3 A concrete example of bias–variance


Let R ≥ 1 in the proof of Theorem 7.1. Then M ≤ ( IK + Mρ + fρ ∞ )R
and we may take

2 2 ∗
ε(R) = ( IK + Mρ + fρ ∞) R v (m, δ)

as an upper bound for the sample error with confidence 1 − δ. Hence, under
conditions (i) and (ii), we may choose

2 2 ∗ 2+θ −θ
E(R) = ( IK + Mρ + fρ ∞) R v (m, δ) + 22+θ g R .

With this choice,


1/(2+θ)
θ
R∗ = (v ∗ (m, δ))−1/(2+θ) 2 g ( IK + Mρ + fρ ∞)
−2/(2+θ)
2

tends to infinity as m does so and


2 3
2/(2+θ) θ/(2+θ)
θ 2
E(R∗ ) = +
2 θ

× 4 g 2 ( IK + Mρ + fρ ∞ )2θ/(2+θ) (v ∗ (m, δ))θ/(2+θ)


 
= O m−θ/((2+θ)(1+2n/s)) → 0 as m → ∞.

Example 7.4 Let K be the spline kernel on X = [−1, 1] given in Example 4.19.
$ %
If ρX is the Lebesgue measure and fρ (x + t) − fρ (x) L 2 ([−1,1−t]) = O t θ
for some θ > 0, then A( fρ , R) = O(R−θ ). Take s = 21 and n = 1. Then we
have
 
E(R∗ ) = O m−θ/(5(2+θ)) .
7.4 References and additional remarks 133

$ %
When θ is sufficiently large, fz − fρ 2L 2 = O m−(1/5)+ε for an arbitrarily
ρX
small ε.

7.4 References and additional remarks


In this chapter we have considered a form of the bias–variance problem that
optimizes the parameter R, fixing all the others. One can consider other forms
of the bias–variance problem by optimizing other parameters. For instance, in
Example 2.24, one can consider the degree of smoothness of the kernel K. The
smoother K is, the smaller HK is. Therefore, the sample error decreases and
the approximation error increases with a parameter reflecting this smoothness.
We have already discussed the bias–variance problem in Section 1.5. Further
ideas on this problem can be found in Chapter 9 of 18 and in [95].
Bounds for the roots of real and complex polynomials such as those in
Lemma 7.2 are a standand theme in algebra going back to Gauss. A reference
for several such bounds is [91]. Theorem 7.1 was originally proved in [39].
8
Least squares regularization

We now abandon the setting of a compact hypothesis space adopted thus far
and change the perspective slightly. We will consider as a hypothesis space an
RKHS HK but we will add a penalization term in the error to avoid overfitting,
as in the setting of compact hypothesis spaces.
In what follows, we consider as a hypothesis space H = HK – that is, H is
a whole linear space – and the regularized error Eγ defined by

Eγ ( f ) = (f (x) − y)2 d ρ + γ f 2
K
Z

for a fixed γ > 0. For a sample z, the regularized empirical error Ez,γ is
defined by
1
m
Ez,γ ( f ) = (yi − f (xi ))2 + γ f 2K .
m
i=1
One can consider a target function fγ minimizing Eγ ( f ) over HK and an
empirical target fz,γ minimizing Ez,γ over HK . We prove in Section 8.2 the
existence and uniqueness of these target and empirical target functions. One
advantage of this new approach, which becomes apparent from the results in
this section, is that the empirical target function can be given an explicit form,
readily computable, in terms of the sample z, the parameter γ , and the kernel K.
Our discussion of Sections 1.4 and 1.5 remains valid in this context and the
following questions concerning fz,γ require an answer: Given γ > 0, how large
is the excess generalization error E(fz,γ ) − E(fρ )? Which value of γ minimizes
the excess generalization error? The main result of this chapter provides some
answer to these questions.

Theorem 8.1 Assume that K satisfies log N (B1 , η) ≤ C0 (1/η)s for some
θ/2
s∗ > 0, and ρ satisfies fρ ∈ Range(LK ) for some 0 < θ ≤ 1. Take γ∗ = m−ζ

134
8.1 Bounds for the regularized error 135

with ζ < 1/(1 + s∗ ). Then, for every 0 < δ < 1 and m ≥ mδ , with confidence
1 − δ,

$ %2 $ %
fz,γ∗ (x) − fρ (x) d ρX ≤ C0 log 2/δ m−θζ
X
holds.
−θ/2
Here C0 is a constant depending only on s∗ , ζ , CK , M, C0 , and LK fρ ,
and mδ depends also on δ. We may take
$ %1/s∗ $ %1+1/s∗ $ %2/(ζ −1/(1+s∗ ))
mδ := max 108/C0 log(2/δ) , 1/(2c) ,

$ %1/(1+s∗ )
where c = (2CK + 5) 108C0 .

At the end of this chapter, in Section 8.6, we show that the regularization
approach just introduced and the minimization in compact hypothesis spaces
considered thus far are closely related.
The parameter γ is said to be the regularization parameter. The whole
approach outlined above is called a regularization scheme.
Note that γ∗ can be computed from knowledge of m and s∗ only. No
information on fρ is required. The next example shows a simple situation where
Theorem 8.1 applies and yields bounds on the generalization error from a simple
assumption on fρ .

Example 8.2 Let K be the spline kernel on X = [−1, 1] given in Example 4.19.
If ρX is the Lebesgue measure and fρ (x + t) − fρ (x) L 2 ([−1,1−t]) = O(t θ ) for
some 0< θ ≤ 1,then, by the conclusion of Example 4.19 and Theorem 4.1, fρ ∈
(θ −ε)/2
Range LK for any ε > 0. Theorem 5.8 also tells us that log N (B1 , η) ≤
C0 (1/η) . So, we may take s∗ = 2. Choose γ∗ = m−ζ with ζ =
2 1−2ε
3 < 31 .
Then Theorem 8.1 yields

2
E(fz,γ∗ ) − E(fρ ) = fz,γ∗ − fρ 2L 2 = O log m−(θ/3)+ε
ρX δ

with confidence 1 − δ.

8.1 Bounds for the regularized error


Let X , K, fγ , and fz,γ = fz be as above. Assume, for the time being, that fγ and
fz,γ exist.
136 8 Least squares regularization

Theorem 8.3 Let fγ ∈ HK and fz be as above. Then E(fz ) − E(fρ ) ≤ E(fz ) −


E(fρ ) + γ fz 2K , which can be bounded by
 
E(fγ ) − E(fρ ) + γ fγ 2
K + E(fz ) − Ez (fz ) + Ez (fγ ) − E(fγ ) . (8.1)

Proof. Write E(fz ) − E(fρ ) + γ fz 2


K as
   
{E(fz ) − Ez (fz )} + Ez (fz ) + γ fz 2K − Ez (fγ ) + γ fγ 2
K
 
+ Ez (fγ ) − E(fγ ) + E(fγ ) − E(fρ ) + γ fγ 2K .

The definition of fz implies that the second term is at most zero. Hence E(fz ) −
E(fρ ) + γ fz 2K is bounded by (8.1). 

The first term in (8.1) is the regularized error of fγ . We denote it by D(γ );


that is,

D(γ ) := E( fγ ) − E( fρ ) + γ fγ 2
K = inf E( f ) − E( fρ ) + γ f 2
K .
f ∈HK
(8.2)

Note that by Proposition 1.8,

D(γ ) = fγ − fρ 2
ρ + γ fγ 2
K ≥ fγ − fρ 2
ρ.

We call the second term in (8.1) the sample error (this use of the expression
differs slightly from the one in Section 1.4).
In this section we give bounds for the regularized error. The bounds
(Proposition 8.5 below) easily follow from the next general result.

Theorem 8.4 Let H be a Hilbert space and A a self-adjoint, positive compact


operator on H . Let s > 0 and γ > 0. Then
(i) For all $ a ∈ H , the minimizer
% b of the optimization problem
minb∈H b − a + γ A b
2 −s 2 exists and is given by
$ %−1
b = A2s + γ Id A2s a.

(ii) For 0 < θ ≤ 2s,


b − a ≤ γ θ/(2s) A−θ a ,
where we define A−θ a = ∞ if a  ∈ Range(Aθ ).
8.1 Bounds for the regularized error 137

(iii) For 0 < θ ≤ s,


 
min b−a 2
+ γ A−s b 2
≤ γ θ/s A−θ a 2 .
b∈H

(iv) For s < θ ≤ 3s,

A−s (b − a) ≤ γ θ/(2s)−1/2 A−θ a .

Proof. First note that replacing A by As and θ/(2s) by θ we can reduce the
" #−θ/(2s)
problem to the case s = 1 where A−θ a = (As )2 a.

(i) Consider
ϕ(b) = b − a 2
+ γ A−1 b 2 .
If a point b minimizes ϕ, then it must be a zero of the derivative Dϕ whose
value at b ∈ Range(A) satisfies ϕ(b + εf ) − ϕ(b) = Dϕ(b), εf  + o(ε)
for f ∈ Range(A). But ϕ(b + εf ) − ϕ(b) = 2b − a, εf  + 2γ A−2 b, εf  +
ε2 f 2 + ε2 γ A−1 f 2 . So b satisfies (Id + γ A−2 )b = a, which implies
b = (Id+γ A−2 )−1 a = (A2 +γ Id)−1 A2 a. Note that the operator Id+γ A−2
is invertible since it is the sum of the identity and a positive (but maybe
unbounded) operator.
We use the method from Chapter 4 to prove the remaining statements. If
λ1 ≥ λ2 ≥ . . . denote the eigenvalues of A2 corresponding to normalized
eigenvectors {φk }, then

 λk
b= ak φk ,
λk + γ
k≥1


where a = k≥1 ak φk . It follows that

 −γ
b−a = ak φk .
λk + γ
k≥1

& 2 θ '1/2
Assume A−θ a = k ak /λk < ∞.
(ii) For 0 < θ ≤ 2, we have

 −γ 2  γ 2−θ
λk θ
ak2
b−a 2
= ak2 = γ θ
k≥1
λk + γ
k≥1
λk + γ λk + γ λθk

≤ γ θ A−θ a 2 .
138 8 Least squares regularization

 $√ %
(iii) For 0 < θ ≤ 1, A−1 b = k≥1 ( λk )/(λk + γ ) ak φk . Hence

 γ 2  λk
b−a 2
+ γ A−1 b 2
= ak2 + γ a2
λk + γ (λk + γ )2 k
k≥1 k≥1
 γ
= a2 ,
λk + γ k
k≥1

which is bounded by

 γ 1−θ
λk θ
ak2
γθ ≤ γ θ A−θ a 2 .
k≥1
λk + γ λk + γ λθk

(iv) When 1 < θ ≤ 3, we find that


 −γ
A−1 (b − a) = √ ak φk .
(λk + γ ) λk
k≥1

It follows that

 γ2
A−1 (b − a) 2
= a2
(λk + γ )2 λk k
k≥1

 γ 3−θ
λk θ−1 2
ak
= γ θ−1 ≤ γ θ−1 A−θ a 2 .
k≥1
λk + γ λk + γ λθk

Thus all the estimates have been verified. 

Bounds for the regularized error D(γ ) follow from Theorem 8.4.

Proposition 8.5 Let X ⊂ Rn be a compact domain and K a Mercer kernel


θ/2
such that for some 0 < θ ≤ 2 fρ ∈ Range(LK ). Then
−θ/2
(i) fγ − fρ 2ρ = E( fγ ) − E( fρ ) ≤ γ θ LK fρ 2.

(ii) For 0 < θ ≤ 1,

−θ/2
D(γ ) = E(fγ ) − E( fρ ) + γ fγ 2
K ≤ γ θ LK fρ 2
ρ.

(iii) For 1 < θ ≤ 3,

−θ/2
fγ − fρ K ≤ γ (θ−1)/2 LK fρ ρ.
8.2 On the existence of target functions 139

1/2
Proof. Apply Theorem 8.4 with H = Ł2, s = 1, A = LK , and a = fρ , and
−1/2
use that LK f = A−1 f = f K . We know that fγ is the minimizer of

min ( f − fρ 2
+γ f K)
2
= min ( f − fρ 2
+γ f K)
2
f ∈Ł2 f ∈HK

since f K = ∞ for f  ∈ HK . Our conclusion follows from Theorem 8.4 and


Proposition 1.8. 

8.2 On the existence of target functions


Let X , K, fγ , and fz,γ = fz be as above. Since the hypothesis space HK is not
compact, the existence of fγ and fz,γ is not obvious. The goal of this section is
to prove that both fγ and fz,γ exist and are unique. In addition, we show that
fz,γ is easily computable from γ , the sample z, and the kernel K on the compact
metric space X .
Proposition 8.6 Let ν = ρX in the definition of the integral operator LK . For
all γ > 0 the function fγ = (LK + γ Id)−1 LK fρ is the unique minimizer of Eγ
over HK .
1/2
Proof. Apply Theorem 8.4 with H = Lν2 (X ), s = 1, A = LK , and a = fρ .
−1/2
Since, for all f ∈ HK , f K = LK f Lν2 (X ) , we have

b−a 2
+ γ A−s b 2
= b − fρ 2Lρ (X ) + γ b 2
K = Eγ (b) − σρ2 .
X

Thus, the minimizer fγ is b in Theorem 8.4, and the proposition follows. 


Proposition 8.7 Let z ∈ Z m and γ > 0. The empirical target function can be
expressed as

m
fz (x) = ai K(x, xi ),
i=1
where a = (a1 , . . . , am ) is the unique solution of the well-posed linear system
in Rm
(γ m Id + K[x])a = y.
Here, we recall that K[x] is the m × m matrix whose (i, j) entry is
K(xi , xj ), x = (x1 , . . . , xm ) ∈ X m , and y = (y1 , . . . , ym ) ∈ Y m such that
z = ((x1 , y1 ), . . . , (xm , ym )).

Proof. Let H (f ) = m1 m i=1 (yi − f (xi )) + γ f K . Take ν to be a Borel,
2 2

nondegenerate measure on X and LK to be the corresponding integral operator.


140 8 Least squares regularization

Let {φk }k≥1 be an orthonormal basis of Lν2 (X ) consisting of eigenfunctions of


LK , and let {λk }k≥1 be their corresponding eigenvalues. By Theorem 4.12, we
 
can then write, for any f ∈ HK , f = λk >0 ck φk with f 2K = λk >0 ck2 /λk .
m
For every k with λk > 0, ∂H /∂ck = −2 m i=1 (yi −f (xi ))φk (xi )+2γ (ck /λk ).
If f is a minimum of H , then, for each k with λk > 0, we must have ∂H /∂ck = 0
or, solving for ck ,

m
ck = λk ai φk (xi ),
i=1

where ai = (yi − f (xi ))/γ m. Thus,

  
m
f (x) = ck φk (x) = λk ai φk (xi )φk (x)
λk >0 λk >0 i=1


m  
m
= ai λk φk (xi )φk (x) = ai K(xi , x),
i=1 λk >0 i=1

where we have applied Theorem 4.10 in the last equality. Replacing f (xi ) in
the definition of ai above we obtain

m
yi − j=1 aj K(xj , xi )
ai = .
γm

Multiplying both sides by γ m and writing the result in matrix form we obtain
(γ m Id + K[x])a = y, and this system is well posed since K[x] is positive
semidefinite and the result of adding a positive semidefinite matrix and the
identity is positive definite. 

8.3 A first estimate for the excess generalization error


In this section we bound the confidence for the sample error to be small enough.
The main result is Theorem 8.10.
In what follows we assume that

& '
M = inf M̄ ≥ 0 | {(x, y) ∈ Z | |y| ≥ M̄ } has measure zero
8.3 A first estimate for the excess generalization error 141

is finite. Note that


|y| ≤ M and |fρ (x)| ≤ M
almost surely.
For R > 0 let BR = {f ∈ HK : f K ≤ R}. Recall that for each f ∈ HK ,

f ∞ ≤ CK f K , where CK = supx∈X K(x, x). & '
For the sample error estimates, we require the confidence N (B1 , η) exp − mη
54
to be at most δ. So we define the following quantity to realize this confidence.
Definition 8.8 Let g = gK,m : R+ → R be the function given by

g(η) = log N (B1 , η) − .
54
The function g is strictly decreasing in (0, + ∞) with g(0) = + ∞ and
g(+∞) = −∞. Also, g(1) = − 54 m
. Moreover, limη→ε+ N (B1 , η) = N (B1 , ε)
for all ε > 0. Therefore, for 0 < δ < 1, the inequality

g(η) ≤ log δ (8.3)

has a unique minimal solution v∗ (m, δ). Moreover,

lim v ∗ (m, δ) = 0.
m→∞


More quantitatively, when K is C s on X ⊂ Rn , log N (B1 , η) ≤ C0 (1/η)s with
s∗ = 2n
s (cf. Theorem 5.1(i)). In this case the following decay holds.

Lemma 8.9 If the Mercer kernel K satisfies log N (B1 , η) ≤ C0 (1/η)s for
some s∗ > 0, then
 1/(1+s∗ ) 
108 log(1/δ) 108C0
v ∗ (m, δ) ≤ max , .
m m

Proof. Observe that g(η) ≤ h(η) := C0 (1/η)s − mη 54 . Since h is also strictly
decreasing and continuous on (0, +∞), we can take  to be the unique positive
solution of the equation h(t) = log δ. We know that v∗ (m, δ) ≤ . The equation
h(t) = log δ can be expressed as

∗ 54 log(1/δ) s∗ 54C0
t 1+s − t − = 0.
m m
&
Then Lemma 7.2 with d = 2 yields  ≤ max 108 log(1/δ)/m,
$ %1/(1+s∗ ) '
108C0 /m . This verifies the bound for v ∗ (m, δ). 
142 8 Least squares regularization

Theorem 8.10 For all γ ∈ (0, 1] and 0 < δ < 1, with confidence 1 − δ,
$ %
2(CK + 3)2 M2 v ∗ (m, δ/2) 8C2K log 2/δ
E(fz ) − E(fρ ) ≤ + + 6M + 4
γ mγ
$ % $ %
48M2 + 6M log 2/δ
× D(γ ) + .
m

holds.
Theorem 8.10 will follow from some lemmas and propositions given in the
remainder of this section. Before proceeding with these results, however, we
note that from Theorem 8.10, a convergence property for the regularized scheme
follows.
Corollary 8.11 Let 0 < δ < 1 be arbitrary. Take γ = γ (m) to satisfy γ (m) →
0, limm→∞ mγ (m) ≥ 1, and γ (m)/(v ∗ (m, δ/2)) → +∞. If D(γ ) → 0, then,
for any ε > 0, there is some Mδ,ε ∈ N such that with confidence 1 − δ,

E(fz ) − E( fρ ) ≤ ε, ∀m ≥ Mδ,ε

holds. 
As an example, for Cs kernels on X ⊂ Rn ,
the decay of v ∗ (m, δ) shown in
Lemma 8.9 with s∗ = 2n
s yields the following convergence rate.

Corollary 8.12 Assume that K satisfies log N (B1 , η) ≤ C0 (1/η)s for some
θ/2
s∗ > 0, and ρ satisfies fρ ∈ Range(LK ) for some 0 < θ ≤ 1. Then, for all
γ ∈ (0, 1] and all 0 < δ < 1, with confidence 1 − δ,
  $2%
$ %2 log δ 1
fz (x) − fρ (x) d ρX ≤ C1 + 1/(1+s∗ ) + γ θ +
X mγ m γ

2 γθ 1
log +
δ mγ m
.
−θ/2
holds, where C1 is a constant depending only on s, CK , M, C0 , and LK fρ .
∗ ))  $ % 2
If γ = m$ −1/((1+θ)(1+s
% −θ/((1+θ)(1+s , then the convergence rate is X fz (x)−fρ (x) dρX ≤
∗ ))
6C1 log 2/δ m .
Proof. The proof is an easy consequence of Theorem 8.10, Lemma 8.9, and
Proposition 8.5. 
For C ∞ kernels, s∗ can be arbitrarily small. Then the decay rate exhibited in
Corollary 8.12 is m−(1/2)+ε for any ε > 0, achieved with θ = 1. We improve
8.3 A first estimate for the excess generalization error 143

Theorem 8.10 in the next section, where more satisfactory bounds (with decay
rate m−1+ε ) are presented. The basic ideas of the proof are included in this
section.
To move toward the proof of Theorem 8.10, we write the sample error as
 
1
m
E(fz ) − Ez (fz ) + Ez (fγ ) − E(fγ ) = E(ξ1 ) − ξ1 (zi )
m
i=1
 
m 
1
+ ξ2 (zi ) − E(ξ2 ) , (8.4)
m
i=1

where
$ %2 $ %2 $ %2 $ %2
ξ1 := fz (x) − y − fρ (x) − y and ξ2 := fγ (x) − y − fρ (x) − y .

The second term on the right-hand side of (8.4) is about the random variable
ξ2 on Z. Since its mean E(ξ2 ) = E(fγ )−E(fρ ) is nonnegative, we may apply the
Bernstein inequality to estimate this term. To do so, however, we need bounds
for fγ ∞ .
Lemma 8.13 For all γ > 0,

fγ K ≤ D(γ )/γ and fγ ∞ ≤ CK D(γ )/γ .

Proof. Since fγ is a minimizer of (8.2), we know that

γ fγ 2
K ≤ E( fγ ) − E( fρ ) + γ fγ 2
K = D(γ ).

Thus, the first inequality holds. The second follows from fγ ∞ ≤ CK fγ K.



Proposition 8.14 For every 0 < δ < 1, with confidence at least 1 − δ,
$ %
1
m
4C2K log 1/δ
ξ2 (zi ) − E(ξ2 ) ≤ + 3M + 1 D(γ )
m mγ
i=1
$ % $ %
24M2 + 3M log 1/δ
+
m
holds.
Proof. From the definition of ξ2 , it follows that
$ %&$ % $ %'
ξ2 = fγ (x) − fρ (x) fγ (x) − y + fρ (x) − y .
144 8 Least squares regularization

Almost everywhere, since |fρ (x)| ≤ M, we have


$ %$ % $ %2
|ξ2 | ≤ fγ ∞ +M fγ ∞ + 3M ≤ c := fγ ∞ + 3M .

Hence |ξ2 (z) − E(ξ2 )| ≤ B := 2c. Moreover, we have


$ %2 &$ % $ %'2
E(ξ22 ) = E fγ (x) − fρ (x) fγ (x) − y + fρ (x) − y
$ %2
≤ fγ − fρ 2
fγ ∞ + 3M ,

which implies that σ 2 (ξ2 ) ≤ E(ξ22 ) ≤ cD(γ ). Now we apply the one-side
Bernstein inequality in Corollary 3.6 to ξ2 . It asserts that for any t > 0,

1
m
ξ2 (zi ) − E(ξ2 ) ≤ t
m
i=1

with confidence at least


   
mt 2 mt 2
1 − exp − $ % ≥ 1 − exp − $ % .
2 σ 2 (ξ2 ) + 13 Bt 2c D(γ ) + 23 t

Choose t ∗ to be the unique positive solution of the quadratic equation

mt 2
− $ % = log δ.
2c D(γ ) + 23 t

Then, with confidence 1 − δ,

1
m
ξ2 (zi ) − E(ξ2 ) ≤ t ∗
m
i=1

holds. But
⎛ ! ⎞
@
2c $ % 2c 2 $ %
t ∗ = ⎝ log 1/δ + log(1/δ) + 2cm log 1/δ D(γ )⎠ m
3 3
$ % 7
4c log 1/δ $ %
≤ + 2c log 1/δ D(γ )/m.
3m
By Lemma 8.13, c ≤ 2C2K D(γ )/γ + 18M2 . It follows that
7 $ % 7 $ %
2CK D(γ )
2c log 1/δ D(γ )/m ≤ log 1/δ √ + 6M D(γ )/m

8.3 A first estimate for the excess generalization error 145

and, therefore, that


7 $ %
$ % $ %
8C2K
log 1/δ 1/δ 72M2 log log 1/δ
t∗ ≤ D(γ ) + + 2CK √ D(γ )
3mγ 3m mγ
$ %
3M log 1/δ
+ + 3MD(γ ).
m
This implies the desired estimate. 
The first term on the right-hand side of (8.4) is more difficult to deal with
because ξ1 involves the sample z through fz . We use a result from Chapter 3,
Lemma 3.19, to bound this term by means of a covering number. For R > 0,
define FR to be the set of functions from Z to R
$ %2 $ %2
FR := f (x) − y − fρ (x) − y : f ∈ BR . (8.5)

Proposition 8.15 For all ε > 0 and R ≥ M,


 $ % 
E( f ) − E( fρ ) − Ez ( f ) − Ez ( fρ ) √
Prob sup ≤ ε
z∈Zm
f ∈BR E( f ) − E(fρ ) + ε
 
ε mε
≥ 1 − N B1 , exp − .
(CK + 3)2 R2 54(CK + 3)2 R2

%2 $ the set %F2 R . Each function g ∈ FR has the form g(z) =


$Proof. Consider
f (x) − y − fρ (x) − y with f ∈ BR . Hence E(g) = E( f ) − E( fρ ) =
f − fρ 2 ≥ 0, Ez (g) = Ez ( f ) − Ez ( fρ ), and
&$ % $ %'
g(z) = ( f (x) − fρ (x)) f (x) − y + fρ (x) − y .

Since f ∞ ≤ CK f K ≤ CK R and |fρ (x)| ≤ M almost everywhere, we find


that $ %$ %
|g(z)| ≤ CK R + M CK R + 3M ≤ c := (CK R + 3M)2 .
So we have |g(z) − E(g)| ≤ B := 2c almost everywhere.
In addition,
( &$ % $ %'2 )
E(g 2 ) = E ( f (x) − fρ (x))2 f (x) − y + fρ (x) − y
$ %2
≤ CK R + 3M f − fρ 2
.

Thus, E(g 2 ) ≤ cE(g) for each g ∈ FR .


146 8 Least squares regularization

Applying Lemma 3.19 with α = 1


4 to the function set FR , we deduce that
$ % m
E(f ) − E( fρ ) − Ez ( f ) − Ez ( fρ ) E(g) − 1
i=1 g(zi ) √
sup = sup m
≤ ε
f ∈BR E( f ) − E(fρ ) + ε g∈FR E(g) + ε

with confidence at least


   
mε/16 mε
1−N (FR , ε/4) exp − ≥ 1−N (FR , ε/4) exp − .
2c + 23 B 54(CK + 3)2 R2

Here we have used the expressions for c, B = 2c, and the restriction R ≥ M.
What is left is to bound the covering number N (FR , ε/4). To do so, we note
that
$ % $ %  $ % $ %
 f1 (x) − y 2 − f2 (x) − y 2  ≤ f1 − f2 ∞
 f1 (x) − y + f2 (x) − y .

But |y| ≤ M almost surely, and f ∞ ≤ CK f K ≤ CK R for each f ∈ BR .


Therefore, almost surely,
$ % $ %  $ %
 f1 (x) − y 2 − f2 (x) − y 2  ≤ 2 M + CK R f1 − f2 ∞, ∀f1 , f2 ∈ BR .

Since an η/(2(MR + CK R2 ))-covering of B1 yields an η/(2(M + CK R))-


covering of BR , and vice versa, we see that for any η > 0, an
η/(2(MR + CK R2 ))-covering of B1 provides an η-covering of FR . That is,

η
N (FR , η) ≤ N B1 , , ∀η > 0. (8.6)
2(MR + CK R2 )

But R ≥ M and 2(1 + CK ) ≤ (CK + 3)2 . So our desired estimate follows. 


Now we can derive the error bounds. For R > 0, denote

W(R) := {z ∈ Z m : fz K ≤ R}.

Proposition 8.16 For all 0 < δ < 1 and R ≥ M, there is a set VR ⊂ Z m with
ρ(VR ) ≤ δ such that for all z ∈ W(R) \ VR , the regularized error Eγ ( fz ) =
E( fz ) − E(fρ ) + γ fz 2K is bounded by
$ %
2 2 ∗ 8C2K log 2/δ
2(CK + 3) R v (m, δ/2) + + 6M + 4 D(γ )

$ % $ %
48M2 + 6M log 2/δ
+ .
m
8.3 A first estimate for the excess generalization error 147

√ $ %
Proof. Note that E( f ) − E( fρ ) + ε ε ≤ 12 E( f ) − E( fρ ) + ε. Using the
quantity v∗ (m, δ), Proposition 8.15 with ε = (CK + 3)2 R2 v ∗ (m, 2δ ) tells us that
there is a set VR ⊂ Z m of measure at most 2δ such that
$ % $ %
E(f ) − E( fρ ) − Ez ( f ) − Ez ( fρ ) ≤ 12 E( f ) − E(fρ ) + (CK + 3)2
R2 v ∗ (m, δ/2), ∀f ∈ BR , z ∈ Z m \ VR .

In particular, when z ∈ W(R) \ VR , fz ∈ BR and

1
m
$ %
E(ξ1 ) − ξ1 (zi ) = E( fz ) − E( fρ ) − Ez ( fz ) − Ez ( fρ )
m
i=1
 
≤ 12 E( fz ) − E( fρ ) + (CK + 3)2 R2 v ∗ (m, δ/2).

Now apply Proposition 8.14 with δ replaced by 2δ . We can find another set
VR⊂ Z m of measure at most 2δ such that for all z ∈ Z m \ VR ,
$ %
1
m
4C2K log 2/δ
ξ2 (zi ) − E(ξ2 ) ≤ + 3M + 1 D(γ )
m mγ
i=1
$ % $ %
24M2 + 3M log 2/δ
+ .
m

Combining these two bounds with (8.4), we see that for all z ∈ W(R)\(VR ∪VR ),

1$ %
E(fz ) − Ez (fz ) + Ez ( fγ ) − E( fγ )≤ E(fz ) − E( fρ ) + (CK + 3)2 R2 v ∗ (m, δ/2)
2
$ %
4C2K log 2/δ
+ + 3M + 1 D(γ )

$ % $ %
24M2 + 3M log 2/δ
+ .
m

This inequality, together with Theorem 8.3, tells us that for all z ∈ W(R) \
(VR ∪ VR ),

1$ %
E(fz ) − E(fρ ) + γ fz 2
K ≤ D(γ ) + E(fz ) − E( fρ ) + (CK + 3)2
2
$ %
2 ∗ 4C2K log 2/δ
R v (m, δ/2) + + 3M + 1 D(γ )

$ % $ %
24M2 + 3M log 2/δ
+ .
m
148 8 Least squares regularization

This gives the desired bound with VR = VR ∪ VR . 

To prove Theorem 8.10, we still need an R satisfying W(R) = Z m .

Lemma 8.17 For all γ > 0 and almost all z ∈ Z m ,

M
fz K ≤√ .
γ

Proof. Since fz minimizes Ez,γ , we have

1
m
γ fz 2
K ≤ Ez,γ ( fz ) ≤ Ez,γ (0) = (yi − 0)2 ≤ M2 ,
m
i=1


the last almost surely. Therefore, fz K ≤ M/ γ for almost all z ∈ Z m . 

Lemma 8.17 says that W(M/ γ ) = Z m up to a set of measure zero (we

ignore this null set later). Take R := M/ γ ≥ M. Theorem 8.10 follows from
Proposition 8.16.

8.4 Proof of Theorem 8.1


In this section we improve the excess generalization error estimate of
Theorem 8.10. The method in the previous section was rough because we used

the bound fz K ≤ M/ γ shown in Lemma 8.17. This is much worse than
√ √
the bound for fγ given in Lemma 8.13, namely, fγ K ≤ D(γ )/ γ . Yet we
expect the minimizer fz of Ez,γ to be a good approximation of the minimizer
fγ of Eγ . In particular, we expect fz K also to be bounded by, essentially,
√ √
D(γ )/ γ . We prove that this is the case with high probability by applying
Proposition 8.16 iteratively. As a consequence, we obtain better bounds for the
excess generalization error.

Lemma 8.18 For all 0 < δ < 1 and R ≥ M, there is a set VR ⊂ Z m with
ρ(VR ) ≤ δ such that

W(R) ⊆ W(am R + bm ) ∪ VR ,

where am := (2CK + 5) v ∗ (m, δ/2)/γ and
7 $ % ! 7 $ %
2CK 2 log 2/δ √ D(γ ) (7M + 1) 2 log 2/δ
bm := √ + 6M + 4 + √ .
mγ γ mγ
8.4 Proof of Theorem 8.1 149

Proof. By Proposition 8.16, there is a set VR ⊂ Z m with ρ(VR ) ≤ δ such that


for all z ∈ W(R) \ VR ,
$ %
2 2 ∗ 8C2K log 2/δ
γ fz K ≤ 2(CK + 3) R v (m, δ/2) +
2
+ 6M + 4 D(γ )

$ % $ %
48M2 + 6M log 2/δ
+ .
m
This implies that

fz K ≤ am R + bm , ∀z ∈ W(R) \ VR ,

with am and bm as given in our statement. In other words, W(R) \ VR ⊆


W(am R + bm ). 

Lemma 8.19 Assume that K satisfies log N (B1 , η) ≤ C0 (1/η)s for some
s∗ > 0. Take γ = m−ζ with
$ ζ < 1/(1 +% s∗ ). For all 0 < δ < 1 and m ≥ mδ ,

with confidence 1 − 3δ/ 1/(1 + s ) − ζ
7 $ %$ %
fz K ≤ C2 log 2/δ D(γ )/γ + 1

holds. Here C2 > 0 is a constant that depends only on s∗ , ζ , CK , C0 , and M,


and mδ ∈ N depends also on δ.
$ %1/s∗ $ %1+1/s∗
Proof. By Lemma 8.9, when m ≥ 108/C0 log(2/δ) ,
$ %1/(1+s∗ )
v∗ (m, δ/2) ≤ 108C0 /m (8.7)

holds. It follows that


$ %1/(2+2s∗ ) ζ /2−1/(2+2s∗ )
am ≤ (2CK + 5) 108C0 m . (8.8)
$ %1/(2+2s∗ ) $ %2/(ζ −1/(1+s∗ ))
Denote c := (2CK + 5) 108C0 . When m ≥ 1/(2c) ,
we have am ≤ 12 .
 
$ %1/s∗ $ %1+1/s∗ $ %2/(ζ −1/(1+s∗ ))
Choose mδ := max 108/C0 log(2/δ) , 1/(2c) .

Define a sequence {R(j) }j∈N by R(0) = M/ γ and, for j ≥ 1,

R(j) = am R(j−1) + bm .

Then Lemma 8.17 proves W(R(0) ) = Z m , and Lemma 8.18 asserts that for
each j ≥ 1, W(R( j−1) ) ⊆ W(R( j) ) ∪ VR( j−1) with ρ(VR( j−1) ) ≤ δ. Apply this
150 8 Least squares regularization

$ %
∗) − ζ ≤ J ≤
inclusion
$ for j = %1, 2, . . . , J , with J satisfying 2/ 1/(1 + s
3/ 1/(1 + s∗ ) − ζ . We see that

−1
J4
Z m = W(R(0) ) ⊆ W(R(1) ) ∪ VR(0) ⊆ · · · ⊆ W(R(J ) ) ∪ VR( j) .
j=0

It $follows that the% measure of the set W(R(J ) ) is at least 1 − J δ ≥ 1 − 3


δ/ 1/(1 + s∗ ) − ζ . By the definition of the sequence, we have

J −1
 $ %
ζ /2−1/(2+2s∗ ) +ζ /2
R(J ) = am
J (0) j
R + bm am ≤ McJ mJ + bm ≤ McJ + bm .
j=0

Here we have used (8.8) and am ≤ 12 in the first inequality, and then the
restriction J ≥ 2/(1/(1 + s∗ ) − ζ ) > ζ /(1/(1 + s∗ ) −$ ζ ) in the% second
$ %3/ 1/(1+s∗ )−ζ
inequality. Note that cJ ≤ (2CK + 5)(108C0 + 1) . Since
γ = m−ζ , bm can be bounded as

7 $ % $ √ %
bm ≤ 2 log 2/δ 2CK + 6M + 4 D(γ )/γ + 7M + 1 .

7 $ %$√ %
Thus, R(J ) ≤ C2 log 2/δ D(γ )/γ + 1 with C2 depending only on
s∗ , ζ , CK , C0 , and M. 
7 $ %
Proof of Theorem 8.1 Applying Proposition 8.16 with R := C2 log 2/δ
$√ % −θ/2
D(γ )/γ + 1 , and using that γ∗ = m−ζ and D(γ∗ ) ≤ γ∗θ LK fρ 2 , we
deduce from (8.7) that for m ≥ mδ and all z ∈ W(R) \ VR ,

$ % ∗ ∗
E(fz ) − E(fρ ) ≤ 2(CK + 3)2 C22 log 2/δ 2mζ (1−θ) (108C0 )1/(1+s ) m−1/(1+s )
$ %
+ C2 log 2/δ m−θζ
$ %
≤ C3 log 2/δ m−θζ

holds. Here C2 , C3 are positive constants depending only on s∗ , ζ , CK , C0 , M,


−θ/2
and LK fρ . $ %
By Lemma 8.19, the set W(R) has measure at least 1 − 3δ/ 1/(1 + s ∗) − ζ
$ %
when m ≥ mδ . Replacing 3δ/ 1/(1 + s∗ ) − ζ + δ (the last δ from the
8.6 Compactness and regularization 151

bound on VR ) by δ and letting C0 be the resulting C3 , our conclusion


follows. 

When s∗ → 0 and θ → 1, we see that the convergence rate θζ can be


arbitrarily close to 1.

8.5 Reminders V
We use the following result on Lagrange multipliers.

Proposition 8.20 Let U be a Hilbert space and F, H be real-valued C 1


functions on U . Let c ∈ U be a solution of the problem

min F( f )
s.t. H ( f ) ≤ 0.

Then, there exist real numbers µ, λ, not both zero, such that

µDF(c) + λDH (c) = 0. (8.9)

Here D means derivative. Furthermore, if H (c) < 0, then λ = 0. Finally, if


either H (c) < 0 or DH (c)  = 0, then µ = 0 and µλ ≥ 0. 

If µ  = 0 above, we can take µ = 1 and we call the resulting λ the Lagrange


multiplier of the problem at c.

8.6 Compactness and regularization


Let z = (z1 , . . . , zm ) with zi = (xi , yi ) ∈ X × Y for i = 1, . . . , m. We also
write x = (x1 , . . . , xm ) and y = (y1 , . . . , ym ). Assume that y  = 0 and K[x] is
 ∗
invertible. Let a∗ = K[x]−1 y, f ∗ = m i=1 ai Kxi , and R0 = f
2 ∗ 2 = yK[x]−1 y.
K
Let Ez (γ ) and Ez (R) be the problems

1
m
min (f (xi ) − yi )2 + γ f 2
K
m
i=1

s.t. f ∈ HK
152 8 Least squares regularization

and

1
m
min ( f (xi ) − yi )2
m
i=1

s.t. f ∈ B(HK , R),

respectively, where R, γ > 0.


In Proposition 8.7 and Corollary 1.14 we have seen that the minimizers fz,γ
and fz,R of Ez (γ ) and Ez (R), respectively, exist. A natural question is, What is
the relationship between the problems Ez (γ ) and Ez (R) and their minimizers?
The main result in this section answers this question.
Theorem 8.21 There exists a decreasing global homeomorphism

z : (0, +∞) → (0, R0 )

satisfying
(i) for all γ > 0, fz,γ is the minimizer of Ez (z (γ )), and
(ii) for all R ∈ (0, R0 ), fz,R is the minimizer of Ez (−1
z (R)).

To prove Theorem 8.21, we use Proposition 8.20 for the problems Ez (R) with

U = HK , F(f ) = m1 m i=1 ( f (xi ) − yi ) , and H ( f ) = f K − R . Note that
2 2 2

for x ∈ X , the mapping

HK → R
f  → f (x) = f , Kx K

is a bounded linear functional and therefore C 1 with derivative Kx . It follows



that F is C 1 as well and DF(f ) = m2 m i=1 ( f (xi ) − yi )Kxi . Also, H is C and
1

DH (f ) = 2f . Define

z : (0, +∞) → (0, R0 )


γ  → fz,γ K.

Also, for each R ∈ (0, R0 ), choose one minimizer fz,R of Ez (R) and let

z : (0, R0 ) → (0, +∞)


R  → the Lagrange multiplier of Ez (R) at fz,R .
8.6 Compactness and regularization 153

Lemma 8.22 The function z is well defined.


Proof. We apply Proposition 8.20 to the problem Ez (R) and claim that fz,R
is not the zero function. Otherwise, H (fz,R ) = H (0) < 0, which implies

DF(fz,R ) = DF(0) = − m2 m i=1 yi Kxi = 0 by (8.9), contradicting the
invertibility of K[x].
Since fz,R is not the zero function, DH ( fz,R ) = 2fz,R  = 0. Also, DF( fz,R ) =
2 m
m i=1 (fz,R (xi ) − yi )Kxi  = 0, since K[x] is invertible. By Proposition 8.20,
µ = 0 and λ  = 0. Taking µ = 1, we conclude that the Lagrange multiplier λ
is positive. 
Proposition 8.23
(i) For all γ > 0, fz,γ is the minimizer of Ez (z (γ )).
(ii) Let R ∈ (0, R0 ). Then fz,R is the minimizer of Ez (z (R)).
Proof. Assume that

1 1
m m
( f (xi ) − yi )2 < ( fz,γ (xi ) − yi )2
m m
i=1 i=1

for some f ∈ B(HK , z (γ )). Then

1 1
m m
(f (xi ) − yi )2 + γ f 2
K < ( fz,γ (xi ) − yi )2 + γ z (γ )2
m m
i=1 i=1

1
m
= ( fz,γ (xi ) − yi )2 + γ fz,γ 2
K,
m
i=1

contradicting the requirement that fz,γ minimizes the objective function of


Ez (γ ). This proves (i).
For Part (ii) note that the proof of Lemma 8.22 yields µ = 1 and
λ = z (R) > 0. Since fz,R minimizes Ez (R), by Proposition 8.20, fz,R satisfies

D(F + λH )(fz,R ) = D(F + λ f K )( fz,R )


2
= 0;

that is, the derivative of the objective function of Ez (λ) vanishes at fz,R . Since
this function is convex and Ez (λ) is an unconstrained problem, we conclude
that fz,R is the minimizer of Ez (λ) = Ez (z (R)). 
Proposition 8.24 z is a decreasing global homeomorphism with inverse z .
Proof. Since K is a Mercer kernel, the matrix K[x] is positive definite by
the invertibility assumption. So there exist an orthogonal matrix P and a
154 8 Least squares regularization

diagonal matrix D such that K[x] = PDP −1 . Moreover, the main diagonal
entries d1 , . . . , dm of D are positive. Let y = P −1 y. Then, by Proposition 8.7,
 −1 −1 
fz,γ = m i=1 ai Kxi with a satisfying (γ mId + D)P a = P y = y . It follows
that
yi m
P −1 a =
γ m + di i=1
and, using P T = P −1 ,
A
B m
B yi 2
z (γ ) = fz,γ K = a K[x]a = (P a) DP a = C
T −1 T −1
di ,
γ m + di
i=1

which is positive for all γ ∈ [0, +∞) since y  = 0 by assumption.


Differentiating with respect to γ ,

1 
m
y 2i
z (γ ) = − di .
fz,γ K (γ m + di )3
i=1

This expression is negative for all γ ∈ [0, +∞). This shows that z is strictly
decreasing in its domain. The first statement now follows since z is continuous,
z (0) = R0 , and z (γ ) → 0 when γ → ∞.
To prove the second statement, consider γ > 0. Then, by Proposition 8.23(i)
and (ii),
fz,γ = fz,z (γ ) = fz,z (z (γ )) .
To prove that γ = z (z (γ )), it is thus enough to prove that for γ , γ  ∈
(0, +∞), if fz,γ = fz,γ  , then γ = γ  . To do so, let i be such that yi  = 0 (such
an i exists since y  = $0). Since the coefficient
%m vectors for fz,γ and fz,γ  are the
same a with P −1 a = yi /(γ m + di ) i=1 , we have in particular

yi y
=  i ,
γ m + di γ m + di

whence it follows that γ = γ  . 

Corollary 8.25 For all R < R0 , the minimizer fz,R of Ez (R) is unique.

Proof. Let γ = z (R). Then fz,R = fz,γ by Proposition 8.23(ii). Now use that
fz,γ is unique. 

Theorem 8.21 now follows from Propositions 8.23 and 8.24 and
Corollary 8.25.
8.7 References and additional remarks 155

Remark 8.26 Let E(γ ) and E(R) be the problems



min ( f (x) − y)2 dρ + γ f 2
K

s.t. f ∈ HK

and

min ( f (x) − y)2 dρ

s.t. f ∈ B(HK , R),

respectively, where R, γ > 0. Denote by fγ and fR their minimizers, respectively.


Also, let R1 = fρ K if fρ ∈ HK and R1 = ∞ otherwise. A development
similar to the one in this section shows the existence of a decreasing global
homeomorphism
 : (0, +∞) → (0, R1 )
satisfying

(i) for all γ > 0, fγ is the minimizer of E((γ )), and


(ii) for all R ∈ (0, R1 ), fR is the minimizer of E(−1 (R)).

Here fR is the target function fH when H is IK (BR ).

8.7 References and additional remarks


The problem of approximating a function from sparse data is often ill posed.
A standard approach to dealing with ill-posedness is regularization theory
[36, 54, 64, 102, 130]. Regularization schemes with RKHSs were introduced to
learning theory in [137] using spline kernels and in [53, 134, 133] using general
Mercer kernels. A key feature of RKHSs is ensuring that the minimizer of the
regularization scheme can be found in the subspace spanned by {Kxi }m i=1 . Hence,
the minimization over the possibly infinite-dimensional function space HK is
reduced to minimization over a finite-dimensional space [50]. This follows
from the reproducing property in RKHSs. We have devoted Section 2.8 to this
feature. It is extended to other contexts in the next two chapters.
The error analysis for the least squares regularization scheme was considered
in [38] in terms of covering numbers. The distance between fz,γ and fγ was
studied in [24] using stability analysis. In [153], using leave-one-out techniques,
156 8 Least squares regularization

it was proved that


 
$ % 2C2K 2
E m E( fz,γ ) ≤ 1 + inf E( f ) + γ f 2
K .
z∈Z mγ f ∈HK

In [42, 43] a functional analysis approach was employed to show that for any
0 < δ < 1, with confidence 1 − δ,

  2 7 $ %
E(fz,γ ) − E( fγ ) ≤ M√CK 1 + √
CK
1+ 2 log 2/δ .
m γ

Parts (i) and (ii) of Proposition 8.5 were given in [39]. Part (iii) with 1 < θ ≤ 2
was proved in [115], and the extension to 2 < θ ≤ 3 was shown by Mihn in
the appendix to [116]. In [115], a modified McDiarmid inequality was used to
derive error bounds in the metric induced by K . If fρ is in the range of LK ,
then, for any 0 < δ < 1 with confidence 1 − δ,
$ $ %%2 1/3
$ $ %%2 1/3
log 4/δ log 4/δ
fz,γ − fρ 2K ≤C by taking γ =
m m

holds, where C is a constant independent of m and δ. In [116] a Bennett


inequality for vector-valued random variables with values in Hilbert spaces
is applied, which yields better error bounds. If fρ is in the range of LK , then we
have
$ $ %%2 $ %
fz,γ −fρ 2L 2 ≤ C log 4/δ (1/m)2/3 by taking γ = log 4/δ (1/m)1/3 .
ρX

These results are capacity-independent error bounds. The error analysis


presented in this chapter is capacity dependent and was mainly done in [143].
When fρ ∈ HK and s∗ < 2, the learning rate given by Theorem 8.1 is better
than capacity-independent ones.
A proof of Proposition 8.20 can be found, for instance, in [11].
For some applications, such as signal processing, inverse problems, and
numerical analysis, the data (xi )mi=1 may be deterministic, not randomly drawn
according to ρX . Then the regularization scheme inf f ∈HK Ez,γ ( f ) involves only
the random data (yi )m
i=1 . For active learning [33, 81], the data (xi )i=1 are drawn
m

according to a user-defined distribution that is different from ρX . Such schemes


and their connections to richness of data have been studied in [114].
9
Support vector machines for classification

In the previous chapters we have dealt with the problem of learning a


function f : X → Y when Y = R. We have described algorithms producing an
approximation fz of f from a given sample z ∈ Z m and we have measured the
quality of this approximation with the generalization error E as a ruler.
Although this setting applies to a good number of situations arising in
practice, there are quite a few that can be better approached. One paramount
example is that described in Case 1.5. Recall that in this case we dealt with
a space Y consisting of two elements (in Case 1.5 they were 0 and 1).
Problems consisting of learning a binary (or finitely) valued function are
called classification problems. They occur frequently in practice (e.g., the
determination, from a given sample of clinical data, of whether a patient suffers
a certain disease), and they will be the subject of this (and the next) chapter.
A binary classifier on a compact metric space X is a function f : X → {1, −1}.
To provide some continuity in our notation, we denote Y = {−1, 1} and keep
Z = X × Y . Classification problems thus consist of learning binary classifiers.
To measure the quality of our approximations, an appropriate notion of error is
essential.
Definition 9.1 Let ρ be a probability distribution on Z := X × Y . The
misclassification error R( f ) for a classifier f : X → Y is defined to be the
probability of a wrong prediction, that is, the measure of the event { f (x)  = y},

R( f ) := Prob {f (x)  = y} = Prob(y  = f (x) | x) d ρX . (9.1)
z∈Z X y∈Y

Our target concept (in the sense of Case 1.5) is the set T := {x ∈ X | Prob{y =
1 | x} ≥ 12 }, since the conditional distribution at x is a binary distribution.
One goal of this chapter is to describe an approach to producing classifiers
from samples (and an RKHS HK ) known as support vector machines.

157
158 9 Support vector machines for classification

Figure 9.1

Needless to say, we are interested in bounding the misclassification error of


the classifiers obtained in this way. We do so for a certain class of noise-free
measures that we call weakly separable. Roughly speaking (a formal definition
follows in Section 9.7), these are measures for which there exists a function
fsp ∈ HK such that x ∈ T ⇐⇒ fsp (x) ≥ 0 and satisfy a decay condition near
the boundary of T . The situation is as in Figure 9.1, where the rectangle is the
set X , the dashed regions represent the set T , and their boundaries are the zero
set of fsp .
Support vector machines produce classifiers from a sample z, a real number
γ > 0 (a regularization parameter as in Chapter 8), and an RKHS HK . Let us
denote by Fz,γ such a classifier. One major result in this chapter is the following
(a more detailed statement is given in Theorem 9.26 below).

Theorem 9.2 Assume ρ is weakly separable by HK . Let B1 denote the unit ball
in HK .

(i) If log N (B1 , η) ≤ C0 (1/η)p for some p, C0 > 0 and all η > 0, then, taking
γ = m−β (for some β > 0), we have, with confidence 1 − δ,
 
r 2
1 2
R(Fz,γ ) ≤ C log ,
m δ

where r and C are positive constants independent of m and δ.


9.1 Binary classifiers 159

(ii) If log N (B1 , η) ≤ C0 (log(1/η))p for some p, C0 > 0 and all 0 < η < 1,
then, for sufficiently large m and some β > 0, taking γ = m−β , we have,
with confidence 1 − δ,
 
1 r 2 2
R(Fz,γ ) ≤ C (log m)p log ,
m δ

where r and C are positive constants independent of m and δ.


We note that the exponent β in both parts of Theorem 9.2, unfortunately,
depends on ρ. Details of this dependence are made explicit in Section 9.7,
where Theorem 9.26 is proved.

9.1 Binary classifiers


Just as in Chapter 1, where we saw that the real-valued function minimizing
the error is the regression function fρ , we may wonder which binary classifier
minimizes R. The answer is simple. For a function f : X → R define

1 if f (x) ≥ 0
sgn( f )(x) =
−1 if f (x) < 0.

Also, let Kρ := {x ∈ X : fρ (x) = 0} and κρ = ρX (Kρ ).


Proposition 9.3
(i) For any classifier f ,

R( f ) = 12 κρ + X \Kρ Prob(y  = f (x)|x) d ρX .
Y

(ii) R is minimized by any classifier coinciding on X \ Kρ with

fc := sgn( fρ ).

Proof. Since Y = {1, −1}, we have fρ (x) = ProbY (y = 1 | x) − ProbY (y =


−1 | x). This means that

1 if ProbY ( y = 1 | x) ≥ ProbY ( y = −1 | x)
fc (x) = sgn( fρ )(x) =
−1 if ProbY ( y = 1 | x) < ProbY ( y = −1 | x).
(9.2)

For any classifier f and any x ∈ Kρ , we have ProbY (y  = f (x)|x) = 12 . Hence


statement (i) holds.
160 9 Support vector machines for classification

For the second statement, we observe that for x ∈ X \ Kρ , ProbY (y = fc (x) |


x) > ProbY (y  = fc (x) | x). Then, for any classifier f , we have either f (x) =
fc (x) or Prob(y  = f (x) | x) = Prob(y = fc (x) | x) > Prob(y  = fc (x) | x).
Hence R(f ) ≥ R(fc ), and equality holds if and only if f and fc are equal almost
everywhere on X /Kρ . 

The classifier fc is called Bayes rule.

Remark 9.4 The role played by the quantity κρ is reminiscent of that played
by σρ2 in the regression setting. Note that κρ depends only on ρ. Therefore,
its occurrence in Proposition 9.3(i) – just as that of σρ2 in Proposition 1.8 – is
independent of f . In this sense, it yields a lower bound for the misclassification
error and is, again, a measure of how well conditioned ρ is.

As ρ is unknown, the best classifier fc cannot be found directly. The goal


of classification algorithms is to find classifiers that approximate Bayes rule fc
from samples z ∈ Z m .
A possible strategy to achieve this goal could consist of fixing an RKHS HK
and a γ > 0, finding the minimizer fz,γ of the regularized empirical error (as
described in Chapter 8), that is,

1
m
fz,γ = argmin ( f (xi ) − yi )2 + γ f 2
K, (9.3)
f ∈HK m
i=1

and then taking the function sgn( fz,γ ) as an approximation of fc . Note that this
strategy minimizes a functional on a set of real-valued continuous functions
and then applies the sgn function to the computed minimizer to obtain a
classifier.
A different strategy consists of first taking signs to obtain the set {sgn( f ) |
f ∈ HK } of classifiers and then minimizing an empirical error over this set.
To see which empirical error we want to minimize, note that for a classifier
f : X → Y,
 
R(f ) = χ{ f (x)=y} d ρ = χ{yf (x)=−1} dρ.
Z Z

Then, for f ∈ HK satisfying f (x)  = 0 almost everywhere,


  
R(sgn( f ))= χ{sgn( f (x))y=−1} dρ= χ{sgn( yf (x))=−1} dρ= χ{yf (x)<0} dρ.
Z Z Z
9.2 Regularized classifiers 161

By discretizing the integral into a sum, given the sample z = {(xi , yi )}m
i=1 ∈ Z ,
m

one might consider a binary classifier sgn( f ), where f is a solution of

1
m
argmin χ{yi f (xi )<0} ,
f ∈HK m
i=1
f (x)=0 a.e.

or, dropping the restriction that f (x)  = 0 a.e. for simplicity,

1
m
argmin χ{yi f (xi )<0} . (9.4)
f ∈HK m
i=1

Note that in practical terms, we are again minimizing over HK . But we are
now minimizing a different functional.
It is clear, however, that if f is any minimizer of (9.4), so is αf for all α > 0.
This shows that the regularized version of (9.4) (regularized by adding the term
γ f 2K to the functional to be minimized) has no solution. It also shows that
we can take as minimizer a function with norm 1. We conclude that we can
approximate the Bayes rule by sgn(fz0 ), where fz0 is given by

1
m
fz0 := argmin χ{ yi f (xi )<0} . (9.5)
f ∈HK m
i=1
f K =1

We show in the next section that although we can reduce the computation
of fz0 to a nonlinear programming problem, the problem is not a convex one.
Hence, we do not possess efficient algorithms to find fz0 (cf. Section 2.7). We
also introduce a third approach that lies somewhere in between those leading to
problems (9.3) and (9.5). This new approach then occupies us for the remainder
of this (and the next) chapter. We focus on its geometric background, error
analysis, and algorithmic features.

9.2 Regularized classifiers


A loss (function) is a function φ : R → R+ . For (x, y) ∈ Z and f : X → R, the
quantity φ(yf (x)) measures the local error (w.r.t. φ). Recall from Chapter 1 that
this is the error resulting from the use of f as a model for the process producing
y from x. Global errors are obtained by averaging over Z and empirical errors
by averaging over a sample z ∈ Z m .
162 9 Support vector machines for classification

Definition 9.5 The generalization error associated with the loss φ is defined as

φ
E (f ) := φ(yf (x)) dρ.
Z

The empirical error associated with the loss φ and a sample z ∈ Z m is defined as

1
m
φ
Ez (f ) := φ(yi f (xi )).
m
i=1

If f ∈ HK for some Mercer kernel K, then we can define regularized versions


of these errors. For γ > 0, we define the regularized error

Eγφ (f ) := φ(yf (x)) d ρ + γ f 2K
Z

and the regularized empirical error

1
m
φ
Ez,γ (f ) := φ(yi f (xi )) + γ f 2
K.
m
i=1

Examples of loss functions are the misclassification loss



0 if t ≥ 0
φ0 (t) =
1 if t < 0

and the least-squares loss φls = (1 − t)2 . Note that for functions f : X → R
and points x ∈ X such that f (x)  = 0, φ0 (yf (x)) = χ{y=sgn(f (x))} ; that is, the
local error is 1 if y and f (x) have different signs and 0 when the signs are
the same.
Proposition 9.6 Restricted to binary classifiers, the generalization error w.r.t.
φ0 is the misclassification error R; that is, for all classifiers f ,

R(f ) = E φ0 (f ).

In addition, the generalization error w.r.t. φls is the generalization error E.


Similar statements hold for the empirical errors.
Proof. The first statement follows from the equalities
 
R(f ) = χ{yf (x)=−1} dρ = φ0 (yf (x)) dρ = E φ0 (f ).
Z Z
9.2 Regularized classifiers 163

For the second statement, note that the generalization error E(f ) of f
satisfies
 
E(f ) = (y − f (x)) dρ = (1 − yf (x))2 dρ = E φls (f ),
2
Z Z

since elements y ∈ Y = {−1, 1} satisfy y2 = 1 and therefore

(y − f (x))2 = (y − y2 f (x))2 = y2 (1 − yf (x))2 = (1 − yf (x))2 . 

Recall that HK,z is the finite-dimensional subspace of HK spanned by


{Kx1 , . . . , Kxm } and P : HK → HK,z is the orthogonal projection. Corollary 2.26
showed that when H = IK (BR ), the empirical target function fz for the regression
problem can be chosen in HK,z . Proposition 8.7 gave a similar statement for the
regularized empirical target function fz,γ (and exhibited explicit expressions for
the coefficients of fz,γ as a linear combination of {Kx1 , . . . , Kxm }). The proofs
of Proposition 2.25 and Corollary 2.26 readily extend to show the following
result.

Proposition 9.7 Let K be a Mercer kernel on X , and φ a loss function. Let also
φ
B ⊆ HK , γ > 0, and z ∈ Z m . If f ∈ HK is a minimizer of Ez in B, then P(f ) is
φ φ
a minimizer of Ez in P(B). If, in addition, P(B) ⊆ B and Ez can be minimized
in B, then such a minimizer can be chosen in P(B). Similar statements hold for
φ
Ez,γ . 

We can use Proposition 9.7 to state the problem of computing fz0 as a


nonlinear programming problem.

Corollary 9.8 We can take

1
m
fz0 := argmin χ{yi f (xi )<0} .
f ∈HK,z m
i=1
f K =1

φ 
Proof. Let f∗ be a minimizer of Ez 0 (f ) = m1 m i=1 χ{yi f (xi )<0} in HK ∩ {f |
φ φ
f K = 1}. By Proposition 9.7, P(f∗ ) ∈ HK,z satisfies Ez 0 (f∗ ) = Ez 0 (P(f∗ )).
φ0 φ0
If P(f∗ ) = 0, we thus have Ez (f∗ ) = Ez (P(f∗ )/ P(f∗ ) K ), showing that a
minimizer exists in HK,z ∩ {f | f K = 1}.
φ φ
If P(f∗ ) = 0, then Ez 0 (f∗ ) = Ez 0 (0) = 1, the maximal possible error. This
φ0
means that for all f ∈ HK , Ez (f ) = 1, so we may take any function in
φ
HK,z ∩ {f | f K = 1} as a minimizer of Ez 0 . 
164 9 Support vector machines for classification

Proposition 9.7 (and Corollary 9.8 when φ = φ0 ) places the problem of


φ φ
finding minimizers of Ez or Ez,γ in the setting of the general nonlinear
programming problem. But we would actually like to deal with a programming
problem for which efficient algorithms exist – for instance, a convex
programming problem. This is not the case, unfortunately, for the loss
function φ0 .
φ
Remark 9.9 Take φ = φ0 and consider the problem of minimizing Ez on
{f ∈ HK | f K = 1}. By Corollary 9.8, we can minimize on {f ∈ HK,z |
f K = 1} and take
m
fz0 = cz,j Kxj ,
j=1

where

1 
m
cz = (cz,1 , . . . , cz,m ) = argmin χ m .
m j=1 cj yi K(xi ,xj )<0
c∈Rm i=1
cT K[x]c=1

Since SKm−1 = {c ∈ Rm | cT K[x]c = 1} is not a convex subset of Rm and


χ{mj=1 cj yi K(xi ,xj )<0} may not be a convex function of c ∈ SKm−1 , the optimization
problem of computing cz is not, in general, a convex programming problem.

We would like thus to replace the loss φ0 by a loss φ that, on one hand,
approximates Bayes rule – for which we will require that φ is close to
the misclassification loss φ0 – and, on the other hand, leads to a convex
programming problem. Although we could do so in the setting described in
Chapter 1 (we actually did it with fz0 above), we instead consider the regularized
setting of Chapter 8.

Definition 9.10 Let K be a Mercer kernel, φ a loss function, z ∈ Z m , and


γ > 0. The regularized classifier associated with K, φ, z, and γ is defined as
φ
sgn(fz,γ ), where
 
1 $
m
φ %
fz,γ := argmin φ yi f (xi ) + γ f 2
K . (9.6)
f ∈HK m
i=1

Note that (9.6) is a regularization scheme like those described in Chapter 8.


The constant γ > 0 is called the regularization parameter, and it is often
selected as a function of m, γ = γ (m).
9.2 Regularized classifiers 165

Proposition 9.11 If φ : R → R+ is convex, then the optimization problem


induced by (9.6) is a convex programming one.

φ m
Proof. According to Proposition 9.7, fz,γ = j=1 cz,j Kxj , where

⎛ ⎞
1 
m m
cz = (cz,1 , . . . , cz,m ) = argmin φ⎝ yi K(xi , xj )cj ⎠
c∈R m m
i=1 j=1


m
+γ ci K(xi , xj )cj .
i,j=1

 
m
For each i = 1, . . . , m, φ j=1 yi K(xi , xj )cj = φ(yT K[x]c) is a convex
function of c ∈ R . In addition, since K is a Mercer kernel, the Gramian matrix
m

K[x] is positive semidefinite. Therefore, the function c  → cT K[x]c is convex.


Thus, cz is the minimizer of a convex function. 

Regularized classifiers associated with general loss functions are discussed


in the next chapter. In particular, we show there that the least squares loss φls
yields a satisfactory algorithm from the point of view of convergence rates in
its error analysis. Here we restrict our exposition to a special loss, called hinge
loss,

φh (t) = (1 − t)+ = max{1 − t, 0}. (9.7)

The regularized classifier associated with the hinge loss, the support
vector machine, has been used extensively and appears to have a small
misclassification error in practice. One nice property of the hinge loss φh , not
possessed by the least squares loss φls , is the elimination of the local error
φ
when yf (x) > 1. This property often makes the solution fz,γh of (9.6) sparse
φ 
in the representation fz,γh = m i=1 cz,i Kxi . That is, most coefficients cz,i in this
φ
representation vanish. Hence the computation of fz,γh can, in practice, be very
fast. We return to this issue at the end of Section 9.4.
Although the definition of the hinge loss may not suggest at a first glance
any particular reason for inducing good classifiers, it turns out that there
is some geometry to explain why it may do so. We next disgress on this
geometry.
166 9 Support vector machines for classification

9.3 Optimal hyperplanes: the separable case


Suppose X ⊆ Rn and z = (z1 , . . . , zm ) is a sample set with zi = (xi , yi ),
i = 1, . . . , m. Then z consists of two classes with the following sets of indices:
I = {i | yi = 1} and II = {i | yi = −1}. Let H be a hyperplane given by
w · x = b with w ∈ Rn , w = 1, and b ∈ R. We say that I and II are separable
by H when, for i = 1, . . . , m,

w · xi > b if i ∈ I
w · xi < b if i ∈ II.

That is, points xi corresponding to I and II lie on different sides of H . We


say that I and II are separable (or that z is so) when there exists a hyperplane
H separating them. As shown in Figure 9.2, if w is a unit vector in Rn , then
the distance from a point x∗ ∈ Rn to the plane w · x = 0 is x∗ | cos θ| =
w x∗ | cos θ| = |w · x∗ |. For any b ∈ R, the hyperplane H given by w · x = b
is parallel to w · x = 0 and the distance from the point x∗ to H is |w · x∗ − b|.
When w · x∗ − b < 0, the point x∗ lies on the side of H opposite to the
direction w.
If I and II are separable by H , points xi with i ∈ I satisfy w·xi −b > 0 and the
point(s) in this set closest to H is (are) at a distance bI (w) := mini∈I {w·xi −b} =
mini∈I w·xi −b. Similarly, points xi with i ∈ II satisfy w·xi −b < 0 and the points
in this set closest to H is (are) at a distance bII (w) := − maxi∈II {w · xi − b} =
b − maxi∈II w · xi .
If we shift the separating hyperplane to w · x = c(w) with
 
c(w) = 1
2 min w · xi + max w · xi ,
i∈I i∈II

Figure 9.2
9.3 Optimal hyperplanes: the separable case 167

these distances become the same and equal to


 
(w) = 2 min w · xi − max w · xi
1
i∈I i∈II

= 12 {bI (w) + b + (bII (w) − b)}


= 12 {bI (w) + bII (w)} > 0.

Therefore, the two classes of points are separated by the hyperplane w · x = c(w)
and satisfy


⎪w · xi − c(w) ≥ mini∈I w · xi − c(w)


⎨ = 1 {min w · x − max
2 i∈I i i∈II w · xi } = (w) if i ∈ I

⎪w · xi − c(w) ≤ maxi∈II w · xi − c(w)



= 12 {maxi∈II w · xi − mini∈I w · xi } = −(w) if i ∈ II.

Moreover, there exist points from z on both hyperplanes w · x = c(w) ± (w)


(see Figure 9.3).
The quantity (w) is called the margin associated with the direction w, and
the set {x | w · x = c(w)} is the associated separating hyperplane.
Different directions w induce different separating hyperplanes. In Figure 9.3,
one can rotate w such that a hyperplane with smaller angle still separates
the data, and such a separating hyperplane will have a larger margin
(see Figure 9.4).

Figure 9.3
168 9 Support vector machines for classification

Figure 9.4

Any hyperplane in Rn induces a classifier. If its equation is w ·x −b = 0, then


the function x  → sgn(w · x − b) is such a classifier. This reasoning suggests
that the best classifier among those induced in this way may be that for which
the direction w yields a separating hyperplane with the largest possible margin
(w). Given z, such a direction is obtained by solving the optimization problem

max (w)
w =1

or, in other words,


 
max 21 min w · xi − max w · xi . (9.8)
w =1 yi =1 yi =−1

If w∗ is a maximizer of (9.8) with (w∗ ) > 0, then w∗ · x = c(w∗ ) is called the


optimal hyperplane and (w∗ ) is called the (maximal) margin of the sample.

Theorem 9.12 If z is separable with I and II both nonempty, then the


optimization problem (9.8) has a unique solution w∗ , (w∗ ) > 0, and the
optimal separating hyperplane is given by w∗ · x = c(w∗ ).

Proof. The function  : Rn → R defined by


 
(w) = 1
2 min w · xi − max w · xi
yi =1 yi =−1

= min w · xi + min w · xi ,
i∈I i∈II
9.4 Support vector machines 169

where  1
2 xi if yi = 1
xi =
− 12 xi if yi = −1,
is continuous. Therefore,  achieves a maximum value over the compact set
{w ∈ Rn | w ≤ 1}. The maximum cannot be achieved in the interior of this
set; for w∗ with w∗ < 1, we have

w∗ w∗ w∗ 1
 = min · x i + min · xi = (w∗ ) > (w∗ ).
w∗ i∈I w ∗ i∈II w ∗ w∗

Furthermore, the maximum cannot be attained at two different points.


Otherwise, for two maximizers w1∗  = w2∗ , we would have, for any i ∈ I and
j ∈ II,

w1∗ · xi + w1∗ · xj ≥ (w1∗ ), w2∗ · xi + w2∗ · xj ≥ (w2∗ ) = (w1∗ ),

which implies
   
1 ∗
2 w1 + 12 w2∗ · xi + 12 w1∗ + 12 w2∗ · xj ≥ (w1∗ ).

 
That is, 12 w1∗ + 21 w2∗ would be another maximizer, lying in the interior, which
is not possible. 

For the optimal hyperplane w∗ · x = c(w∗ ), all the vectors xi satisfy

yi (w∗ · xi − c(w∗ )) ≥ (w∗ )

no matter whether yi = 1 or yi = −1. The vectors xi for which equality holds


are called support vectors. From Figure 9.4, we see that these are points lying on
the two separating hyperplanes w∗ · x = c(w∗ ) ± (w∗ ). The classifier Rn → Y
associated with w∗ is given by
$ %
x  → sgn w∗ · x − c(w∗ ) .

9.4 Support vector machines


When z is separable, we can obtain a classifier by solving (9.8) and then taking,
if w∗ is the computed solution, the classifier x  → sgn(w∗ · x − c(w∗ )). We can
also solve an equivalent form of (9.8).
170 9 Support vector machines for classification

Theorem 9.13 Assume (9.8) has a solution w∗ with (w∗ ) > 0. Then w∗ =
w/ w , where w is a solution of

min w 2
w∈Rn , b∈R (9.9)
s.t. yi (w · xi − b) ≥ 1, i = 1, . . . , m.

Moreover, (w∗ ) = 1/ w is the margin.


Proof. A minimizer (w, b) of the quadratic function w 2 subject to the linear
constraints exists. Recall that
 
(w) = 12 min w · xi − max w · xi .
yi =1 yi =−1

Then
  
w 1 w b
 = min · xi −
w 2 yi =1 w w
 
w b 1
− max · xj − ≥ ,
yj =−1 w w w

since w · xi − b ≥ 1 when yi = 1, and w · xj − b ≤ −1 when yj = −1.


We claim that (w0 ) ≤ 1/ w for each unit vector w0 . If this is so, we
can conclude from Theorem 9.12 that  (w/ w ) = 1/ w = (w∗ ) and
w∗ = w/ w .
Suppose, to the contrary, that for some unit vector w0 ∈ Rn , (w0 ) > 1/ w
holds. Consider the vector w̄ = w0 /(w0 ) together with
$ %
2 min yi =1 w0 · xi + maxyj =−1 w0 · xj
1
b= .
(w0 )
They satisfy
$ %
w0 · xi − 1
2 minyi =1 w0 · xi + maxyj =−1 w0 · xj
w̄ · xi − b =
(w0 )
≥1 if yi = 1

and
$ %
w0 · xj − 1
2 minyi =1 w0 · xi + maxyj =−1 w0 · xj
w̄ · xj − b =
(w0 )
≤ −1 if yj = −1.
9.5 Optimal hyperplanes: the nonseparable case 171

But w̄ 2 = w0 2 /(w0 )2 = 1/(w0 )2 < w 2 , which is in contradiction


with w being a minimizer of (9.9). 
Thus, in the separable case, we can proceed by solving either the optimization
problem (9.8) or that given by (9.9). The resulting classifier is called the hard
margin classifier, and its margin is given by (w∗ ) with w∗ the solution of (9.8)
or by 1/ w with w the solution of (9.9).
It follows from Theorem 9.12 that there are at least n support vectors. In most
applications of the support vector machine, the number of support vectors is
much smaller than the sample size m. This makes the algorithm solving (9.9)
run faster.
Support vector machines (SVMs) consist of a family of efficient classification
algorithms: the SVM hard margin classifier (9.9), which works for separable
data, the SVM soft margin classifier (9.10) for nonseparable data (see next
section), and the general SVM algorithm (9.6) associated with the hinge loss
φh and a general Mercer kernel K. The first two classifiers can be expressed
in terms of the linear kernel K(x, y) = x · y + 1, whereas the general SVM
involves general Mercer kernels: the polynomial kernel (x · y + 1)d with d ∈ N
or Gaussians exp{− x − y 2 /σ 2 } with σ > 0. These SVM algorithms share a
φ 
special feature caused by the hinge loss φh : the solution fz,γh = m i=1 cz,i Kxi
often has a sparse vector of coefficients cz = (cz,1 , . . . , cz,m ), which makes the
algorithm computing cz run faster.

9.5 Optimal hyperplanes: the nonseparable case


In the nonseparable situation, there are no w ∈ Rn and b ∈ R such that the points
in z can be separated in to two classes with yi = 1 and yi = −1 by the hyperplane
w · x = b. In this case, we look for the soft margin classifier. This is defined by
introducing slack variables ξ = (ξ1 , . . . , ξm ) and considering the problem


m
min w 2 + 1
γm ξi
w∈Rn , b∈R, ξ ∈Rm i=1
(9.10)
s.t. yi (w · xi − b) ≥ 1 − ξi
ξi ≥ 0, i = 1, . . . , m.

Here γ > 0 is a regularization parameter. If (w, b, ξ ) is a solution of (9.10),


then its associated soft margin classifier is defined by x  → sgn(w · x − b).
The hard margin problem (9.9) in the separable case can be seen as a special
case of the soft margin one (9.10) corresponding to γ1 = ∞, in which case all
solutions have ξ = 0.
172 9 Support vector machines for classification

We claimed at the end of Section 9.2 that the regularized classifier associated
with the hinge loss was related to our previous discussion of margins and
separating hyperplanes. To see why this is so we next show that the soft
margin classifier is a special example of (9.6). Recall that the hinge loss φh is
defined by
φh (t) = (1 − t)+ = max{1 − t, 0}.

If (w, b, ξ ) is a solution of (9.10), then we must have ξi = (1 − yi (w · xi − b))+ ;


that is, ξi = φh (yi (w · xi − b)). Hence, (9.10) can be expressed by means of the
loss φh as
1
m
min φh (yi (w · xi − b)) + γ w 2 .
w∈Rn , b∈R m
i=1

If we consider the linear Mercer kernel K on Rn × Rn given by K(x, y) = x · y,


then HK = {w · x | x ∈ Rn }, Kw 2K = w 2 , and (9.10) can be written as

1
m
min φh (yi (f (xi ) − b)) + γ f 2
K. (9.11)
f ∈HK ,b∈R m
i=1

The scheme (9.11) is the same as (9.6) with the linear kernel except for the
constant term b, called offset. 1
One motivation to consider scheme (9.6) with an arbitrary Mercer kernel is
the expectation of separating data by surfaces instead of hyperplanes only. Let
f be a function on Rn , and f (x) = 0 the corresponding surface. The two classes
I and II are separable by this surface if, for i = 1, . . . , m,

f (xi ) > 0 if i ∈ I
f (xi ) < 0 if i ∈ II;

that is, if yi f (xi ) > 0 for i = 1, . . . , m. This set of inequalities is an empirical


version of the separation condition “yf (x) > 0 almost surely” for the probability
distribution ρ on Z. Such a separation condition is more general than the
separation by hyperplanes. In order to find such a separating surface using
efficient algorithms (convex optimization), we require that the function f lies
in an RKHS HK . Under such a separation condition, one may take γ = 0 and
algorithm (9.6) corresponds to a hard margin classifier. This is the context of
the next two sections, on error analysis.

1 We could have considered the scheme (9.11) with offset. We did not do so for simplicity of
exposition. References to work on the general case can be found in Section 9.8.
9.6 Error analysis for separable measures 173

9.6 Error analysis for separable measures


In this section we present an error analysis for scheme (9.6) with the hinge loss
φh (t) = (1 − t)+ for separable distributions.

Definition 9.14 Let HK be an RKHS of functions on X , and ρ a probability


measure on Z = X × Y . We say that ρ is strictly separable by HK with margin
 > 0 if there is some fsp ∈ HK such that fsp K = 1 and yfsp (x) ≥  almost
surely.

Remark 9.15
(i) Even under the weaker condition that yfsp (x) > 0 almost surely (which
we consider in the next section), we have y = sgn(fsp (x)) almost surely.
Hence, the variance σρ2 vanishes (i.e., ρ is noise free) and so does κρ .
(ii) As a consequence of (i), fc = sgn(fsp ).
(iii) Since fsp is continuous and |fsp (x)| ≥  almost surely, it follows that if ρ
is strictly separable, then
$ %
ρX T ∩ X \ T = 0,

where T = {x ∈ X | fc (x) = 1}. This implies that if X is connected,


ρX (T ) > 0, and ρX (X \ T ) > 0, then ρ is degenerate. The situation would
be as in Figure 9.5, where the two dashed regions represent the support of
the set T , those with dots represent the support of X \ T , and the remainder
of the rectangle has measure zero (for the measure ρX ).

Figure 9.5
174 9 Support vector machines for classification

Theorem 9.16 If ρ is strictly separable by HK with margin , then, for almost


every z ∈ Z m ,
φ φ γ
Ez h (fz,γh ) ≤ 2

φ
and fz,γh K ≤ 1
.
φ
Proof. Since fsp / ∈ HK , we see from the definition of fz,γh that
 2
φ φ φ φ fsp  fsp 
Ez h (fz,γh ) + γ fz,γh 2K ≤ Ez h +γ 
 .

 K

But y(fsp (x)/) ≥ 1 almost surely, that is, 1 − y(fsp (x)/) ≤ 0, so we have
$ % φ $ %
φh y(fsp (x)/) = 0 almost surely. It follows that Ez h fsp / = 0. Since
 
fsp /2 = 1/2 ,
K
φ φ φ γ
Ez h (fz,γh ) + γ fz,γh 2K ≤ 2

holds and the statement follows. 
φ
The results in Chapter 8 lead us to expect the solution fz,γh of (9.6) to satisfy
φ φ φ
E φh (fz,γh ) → E φh (fρ h ), where fρ h is a minimizer of E φh . We next show that
φ
this is indeed the case. To this end, we first characterize fρ h . For x ∈ X , let
ηx := ProbY (y = 1 | x).
Theorem 9.17 For any measurable function f : X → R

E φh ( f ) ≥ E φh (fc )

holds.
φ
That is, the Bayes rule fc is a minimizer fρ h of E φh .

Proof. Write E φh ( f ) = X h,x (f (x)) dρX , where

h,x (t) = φh (yt) dρ(y | x) = φh (t)ηx + φh (−t)(1 − ηx ).
Y

When t = fc (x) ∈ {1, −1}, for y = fc (x) one finds that yt = 1 and φh (yt) = 0,
whereas for y = −fc (x)  = fc (x), yt = −1 and φh (yt) = 2. So Y φh (yt) dρ(y |
x) = 2 Prob(y  = fc (x) | x) and h,x ( fc (x)) = 2 Prob(y  = fc (x) | x).
According to (9.2), Prob(y  = fc (x) | x) ≤ Prob(y = s | x) for s = ±1.
Hence, h,x (fc (x)) ≤ 2 Prob(y = s | x) for any s ∈ {1, −1}.
If t ≥ 1, then φh (t) = 0 and h,x (t) = (1 + t)(1 − ηx ) ≥ 2(1 − ηx ) ≥
h,x ( fc (x)).
9.6 Error analysis for separable measures 175

If t ≤ −1, then φh (−t) = 0 and h,x (t) = (1 − t)ηx ≥ 2ηx ≥ h,x ( fc (x)).
If −1 < t < 1, then h,x (t) = (1 − t)ηx + (1 + t)(1 − ηx )≥(1 − t)
2 h,x ( fc (x)) + (1 + t) 2 h,x ( fc (x)) = h,x ( fc (x)).

1 1

Thus, we have h,x (t) ≥ h,x ( fc (x)) for all t ∈ R. In particular,


 
E φh ( f ) = h,x ( f (x)) dρX ≥ h,x ( fc (x)) dρX = E φh ( fc ). 
X X

When ρ is strictly separable by HK , we see that sgn( y fsp (x)) = 1 and hence
y = sgn( fsp (x)) almost surely. This means fc (x) = sgn( fsp (x)) and y = fc (x)
almost surely. In this case, we have E φh ( fc ) = Z (1−y fc (x))+ = 0. Therefore,
φ
we expect E φh ( fz,γh ) → 0. To get error bounds showing that this is the case we
write
φ φ φ φ φ φ φ φ φ γ
E φh ( fz,γh ) = E φh ( fz,γh ) − Ez h ( fz,γh ) + Ez h ( fz,γh ) ≤ E φh ( fz,γh ) − Ez h ( fz,γh ) + .
2
(9.12)

Here we have used the first inequality in Theorem 9.16. The second inequality
φ
of that theorem tells us that fz,γh lies in the set { f ∈ HK | f K ≤ 1/}. So
φ
it is sufficient to estimate E φh ( f ) − Ez h ( f ) for functions f in this set in some
uniform way. We can use the same idea we used in Lemmas 3.18 and 3.19.
Lemma 9.18 Suppose a random variable ξ satisfies 0 ≤ ξ ≤ M . Denote
µ = E(ξ ). For every ε > 0 and 0 < α ≤ 1,
 m  
µ− 1
i=1 ξ(zi ) √ 3α 2 mε
Prob m
√ ≥ α ε ≤ exp −
z∈Zm µ+ε 8M

holds.
Proof. The proof follows from Lemma 3.18, since the assumption 0 ≤ ξ ≤ M
implies |ξ − µ| ≤ M and E(ξ 2 ) ≤ M E(ξ ). 
Lemma 9.19 Let F be a subset of C (X ) such that f C (X ) ≤ B for all f ∈ F.
Then, for every ε > 0 and 0 < α ≤ 1, we have
 φ
  
E φh ( f ) − Ez h ( f ) √ 3α 2 mε
Prob sup ≥ 4α ε ≤ N (F, αε) exp − .
z∈Zm
f ∈F E φh ( f ) + ε 8(1 + B)

Proof. Let { f1 , . . . , fN } be an αε-net for F with N = N (F, αε). Then, for


each f ∈ F, there is some j ≤ N such that f − fj C (X ) ≤ αε. Since
φh is Lipschitz, |φh (t) − φh (t  )| ≤ |t − t  | for all t, t  ∈ R. Therefore,
176 9 Support vector machines for classification

|φh ( y f (x)) − φh ( y fj (x))| ≤ f − fj C (X ) ≤ αε. It follows that |E φh ( f ) −


φ φ
E φh ( fj )| ≤ αε and |Ez h (f ) − Ez h ( fj )| ≤ αε. Hence,
φ φ
|E φh ( f ) − E φh ( fj )| √ |Ez h ( f ) − Ez h ( fj )| √
≤α ε and ≤ α ε.
E φh ( f ) + ε E φh ( f ) + ε

Also,
7 7
E φh ( fj ) + ε ≤ E φh ( f ) + ε + E φh ( fj ) − E φh ( f )
7 7
≤ E φh ( f ) + ε + |E φh ( fj ) − E φh ( f )|.

Since α ≤ 1, we have |E φh ( f ) − E φh (fj )| ≤ ε ≤ ε + E φh ( f ) and then


7 7
E φh ( fj ) + ε ≤ 2 E φh ( f ) + ε. (9.13)

Therefore,
φ φ φ
E φh ( f ) − Ez h ( f ) E φh ( f ) − E φh ( fj ) Ez h ( fj ) − Ez h ( f )
= +
E φh ( f ) + ε E φh ( f ) + ε E φh ( f ) + ε
φ φ
E φh ( fj ) − Ez h ( fj ) √ E φh ( fj ) − Ez h ( fj )
+ ≤ 2α ε + .
E φh ( f ) + ε E φh ( f ) + ε
φ √
It follows that if (E φh ( f ) − Ez h ( f ))/ E φh ( f ) + ε) ≥ 4α ε for some f ∈ F,
then
φ
E φh ( fj ) − Ez h ( fj ) √
≥ 2α ε.
E φh ( f ) + ε
This, together with (9.13), tells us that
φ
E φh ( fj ) − Ez h ( fj ) √
7 ≥ α ε.
E φh ( fj ) + ε

Thus,
 φ

E φh ( f ) − Ez h ( f ) √
Prob sup ≥ 4α ε
z∈Zm
f ∈F E φh ( f ) + ε
⎧ ⎫

⎨ ⎪
N φ φ
E h ( fj ) − Ez (fj )
h
√ ⎬
≤ Prob 7 ≥α ε .
z∈Z m ⎪
⎩ ⎪

j=1 E φh ( fj ) + ε
9.6 Error analysis for separable measures 177

The statement now follows from Lemma 9.18 applied to the random variable
ξ = φh ( y fj (x)), for j = 1, . . . , N , which satisfies 0 ≤ ξ ≤ 1 + fj C (X ) ≤
1 + B. 

We can now derive error bounds for strictly separable measures. Recall that
B1 denotes the unit ball of HK as a subset of C (X ).

Theorem 9.20 If ρ is strictly separable by HK with margin , then, for any


0 < δ < 1,
φ 2γ
E φh ( fz,γh ) ≤ 2ε∗ (m, δ) + 2

with confidence at least 1 − δ, where ε ∗ (m, δ) is the smallest positive solution
of the inequality in ε

ε 3mε
log N B1 , − ≤ log δ.
4 128(1 + CK /)

In addition,

(i) If log N (B1 , η) ≤ C0 (1/η)p for some p, C0 > 0, and all η > 0, then

∗ log(1/δ) 1/(1+p)
ε (m, δ) ≤ 86(1 + CK /) max , C0
m

p/(1+p) 1/(1+p)
4 1
.
 m

(ii) If log N (B1 , η) ≤ C0 (log(1/η))p for some p, C0 > 0 and all 0 < η < 1,
then, for m ≥ max{4/, 3},

(log m)p & '


ε ∗ (m, δ) ≤ 1 + 43(1 + CK /)(2p C0 + log(1/δ)) .
m

Proof. We apply Lemma 9.19 to the set F = { f ∈ HK | f K ≤ 1 }. Each


function f ∈ F satisfies f C (X ) ≤ CK f K ≤ CK /.& By 2Lemma 9.19, for any '
0 < α ≤ 1, with confidence at least 1−N (F, αε) exp −3α mε/(8(1 + CK /)) ,
we have
φ
E φh ( f ) − Ez h ( f ) √
sup ≤ 4α ε.
f ∈F E φh ( f ) + ε
178 9 Support vector machines for classification

φ
In particular, the function fz,γh , which belongs to F by Theorem 9.16, satisfies

φ φ φ
E φh ( fz,γh ) − Ez h ( fz,γh ) √
7 ≤ 4α ε.
φ
E φh ( fz,γh ) + ε

This, together with (9.12), yields


7
φh φ √ φ γ
E ( fz,γh ) ≤ 4α ε E φh ( fz,γh ) + ε + 2 .

7
φ
If we denote t = E φh ( fz,γh ) + ε, this inequality becomes

√  γ 
t 2 − 4α εt − ε + 2 ≤ 0.

Solving
7 the associated quadratic equation and taking into account that
φ
t = E φh ( fz,γh ) + ε ≥ 0, we deduce that
?
√ √ γ
0 ≤ t ≤ 2α ε + (2α ε)2 + ε + 2 .


Hence, using the elementary inequality (a + b)2 ≤ 2a2 + 2b2 , we obtain


 √ γ  2γ
φ
E φh ( fz,γh ) = t 2 −ε ≤ 2(4α 2 ε)+2 (2α ε)2 + ε + 2 −ε = 16α 2 ε+ε+ 2 .
 

Set α = 14 and ε = ε∗ (m, δ) as in the statement. Then the confidence 1 −


N (F, αε) exp{−3α 2 mε/(8(1 + CK /))} is at least 1 − δ, and with this
confidence we have

φ 2γ
E φh ( fz,γh ) ≤ 2ε∗ (m, δ) + .
2

(i) If log N (B1 , η) ≤ C0 (1/η)p , then ε ∗ (m, δ) ≤ ε∗ , where ε ∗ satisfies


p
4 3mε
C0 − = log δ.
ε 128(1 + CK /)

This equation can be written as


p
128(1 + CK /) 1 128(1 + CK /)C0 4
ε1+p − log ε p − = 0.
3m δ 3m 
9.6 Error analysis for separable measures 179

Then Lemma 7.2 with d = 2 yields



256(1 + CK /) 1
ε∗ (m, δ) ≤ ε∗ ≤ max log ,
3m δ

1/(1+p) p/(1+p)
256(1 + CK /)C0 4 −1/(1+p)
m .
3 

(ii) If log N (B1 , η) ≤ C0 (log(1/η))p , then ε ∗ (m, δ) ≤ ε∗ , where ε ∗ satisfies


p
4 3mε
C0 log − = log δ.
ε 128(1 + CK /)
 p
The function h : R+ → R defined by h(ε) = C0 log ε
4
− 3mε/
(128(1 + CK /)) is decreasing. Take
 
(2p C0 + log(1/δ))128(1 + CK /)
A = max ,1
3

and ε = A(log m)p /m. Then, for m ≥ max{4/, 3},

A(log m)p 1 4
≥ and log + log m ≤ 2 log m.
m m 

It follows that
p
A(log m)p 4 1
h ≤ C0 log + log m − (log m)p 2p C0 + log
m  δ
1
≤ −(log m)p log ≤ log δ.
δ

Since h is decreasing, we have

A(log m)p
ε∗ (m, δ) ≤ ε∗ ≤ . 
m

In order to obtain estimates for the misclassification error from the


generalization error, we need to compare R with E φh . This is simple.

Theorem 9.21 For any measure ρ and any measurable function f : X → R,

R(sgn( f )) − R( fc ) ≤ E φh ( f ) − E φh ( fc ).
180 9 Support vector machines for classification

Proof. Denote Xc = {x ∈ X | sgn( f )(x)  = fc (x)}. By the definition of the


misclassification error,

R(sgn( f )) − R( fc ) = Prob( y  = sgn( f )(x) | x)
Xc Y

− Prob( y  = fc (x) | x) dρX .


Y

For a point x ∈ Xc , we know that ProbY ( y  = sgn( f )(x) | x) = ProbY (y =


fc (x) | x). Hence, ProbY ( y  = sgn( f )(x) | x) − ProbY ( y  = fc (x) | x) = fρ (x)
or −fρ (x) according to whether fρ (x) ≥ 0. It follows that | fρ (x)| = ProbY ( y =
fc (x) | x) − ProbY ( y  = fc (x) | x) and, therefore,

 
R(sgn( f )) − R( fc ) =  fρ (x) dρX . (9.14)
Xc

By the definition of φh ,

0 if y = fc (x)
φh ( y fc (x)) = (1 − y fc (x))+ =
2 if y  = fc (x).
  
Hence E φh ( fc ) = X Y φh ( y fc (x))
 dρ( y|x) dρX = X 2 ProbY ( y  = fc (x)|x)
dρX . Furthermore, E φh ( f ) = X Y φh ( y f (x)) dρ( y|x) dρX . Thus, it is
sufficient for us to prove that

φh ( y f (x)) dρ( y|x) − 2 Prob( y  = fc (x)|x) ≥ | fρ (x)|, ∀x ∈ Xc . (9.15)
Y Y

We prove (9.15) in two cases.


If | f (x)| > 1, then x ∈ Xc implies that sgn( f (x))  = fc (x) and φh (−fc (x)
f (x)) = 0. Hence,

φh ( y f (x)) d ρ( y|x) = φh ( fc (x)f (x)) Prob( y = fc (x)|x)
Y Y

= (1 − fc (x)f (x)) Prob( y = fc (x)|x).


Y

Since −fc (x)f (x) = | f (x)| > 1, it follows that



φh ( y f (x)) dρ( y|x) − 2 Prob( y  = fc (x)|x)
Y Y

≥ (1 + | f (x)|) Prob( y = fc (x)|x) − Prob( y  = fc (x)|x)


Y Y

= (1 + | f (x)|)| fρ (x)| ≥ | fρ (x)|.


9.7 Error analysis for separable measures 181

If | f (x)| ≤ 1, then

φh ( y f (x)) dρ( y|x) − 2 Prob( y  = fc (x)|x)
Y Y

= (1 − fc (x)f (x)) Prob( y = fc (x)|x) + (1 + fc (x)f (x)) Prob( y  = fc (x)|x)


Y Y

− 2 Prob( y  = fc (x)|x)
Y

= Prob( y = fc (x)|x) − Prob( y  = fc (x)|x)


Y Y

+ fc (x)f (x) Prob( y  = fc (x)|x) − Prob( y = fc (x)|x)


Y Y

= (1 − fc (x)f (x))| fρ (x)|.

But x ∈ Xc implies that fc (x)f (x) ≤ 0 and 1 − fc (x)f (x) = 1 + | f (x)|. So in


this case we have

φh ( y f (x)) dρ(y|x) − 2 Prob( y  = fc (x)|x) = (1 + | f (x)|)| fρ (x)| ≥ | fρ (x)|.
Y Y


Combining Theorems 9.20 and 9.21, we can derive bounds for the
misclassification error for the support vector machine soft margin classifier
for strictly separable measures satisfying R( fc ) = E φh ( fc ) = 0.
Corollary 9.22 Assume ρ is strictly separable by HK with margin .
(i) If log N (B1 , η) ≤ C0 (1/η)p for some p, C0 > 0 and all η > 0, then, with
confidence 1 − δ,

φ 2γ
R(sgn( fz,γh )) ≤ 2 + 172(1 + CK /)

 
log(1/δ) 1/(1+p) 4 p/(1+p) 1 1/(1+p)
max , C0 .
m  m

(ii) If log N (B1 , η) ≤ C0 (log(1/η))p for some p, C0 > 0 and all 0 < η < 1,
then, for m ≥ max{4/, 3}, with confidence 1 − δ,

φ 2γ (log m)p & '


R(sgn( fz,γh )) ≤ + 2 + 86(1 + CK /)(2p C0 + log(1/δ)) . 
 2 m
It follows from Corollary 9.22 that for strictly separable measures, we may
take γ = 0. In this case, the penalized term in (9.6) vanishes and the soft margin
classifier becomes a hard margin one.
182 9 Support vector machines for classification

9.7 Weakly separable measures


We continue our discussion on separable measures. By abandoning the positive
margin  > 0 assumption, we consider weakly separable measures.

Definition 9.23 We say that ρ is weakly separable by HK if there is some


function fsp ∈ HK satisfying fsp K = 1 and y fsp (x) > 0 almost surely. It has
separation triple (θ , , C0 ) ∈ (0, ∞] × (0, ∞)2 if, for all t > 0,

ρX {x ∈ X : | fsp (x)| < t} ≤ C0 t θ . (9.16)

The largest θ for which there are positive constants , C0 such that (θ , , C0 )
is a separation triple is called the separation exponent of ρ (w.r.t. HK and fsp ).

Remark 9.24 Note that when θ = ∞, condition (9.16) is the same as ρX {x ∈


X : | fsp (x)| < t} = 0 for all 0 < t < 1. That is, | fsp (x)| ≥  almost surely. But
y fsp (x) > 0 almost surely. So a weakly separable measure with θ = ∞ is exactly
a strictly separable measure with margin .

Lemma 9.25 Assume ρ is weakly separable by HK with separation triple


(θ , , C0 ). Take

1/(2+θ)
C0
fγ = −θ/(2+θ) fsp . (9.17)
γ

Then  γ θ/(2+θ)
E φh ( fγ ) + γ fγ
2/(2+θ)
2
K ≤ 2C0 .
2
Proof. Write fγ = fsp /t, with t > 0 to be determined. Since y fsp (x) > 0
almost surely, the same holds for y fγ (x) > 0. Hence, φh (y fγ (x)) < 1 and
φh ( y fγ (x)) > 0 only if y fγ (x) < 1, that is, if | fγ (x)| =  fsp (x)/t  < 1.
Therefore,
 
E φh ( fγ ) = φh ( y fγ (x)) dρ = φh (| fγ (x)|) dρX
Z X

= (1 − | fγ (x)|) dρX ≤ ρX {x ∈ X : | fγ (x)|
| fγ (x)|<1

< 1} ≤ ρX {x ∈ X : | fsp (x)| < t} ≤ C0 t θ .

$ %1/(2+θ)
But γ fγ 2
K = γ 1/(t)2 . Setting t = γ /C0 2 proves our statement.

9.7 Weakly separable measures 183

Theorem 9.26 If ρ is weakly separable by HK with a separation triple


(θ , , C0 ) then, for any 0 < δ < 1, with confidence 1 − δ, we have
 γ θ/(2+θ) 3 log(2/δ)
φ
R(sgn( fz,γh )) ≤ 2ε∗ (m, δ, γ ) + 8C0
2/(2+θ)
+ ,
2 m
where ε ∗ (m, δ, γ ) is the smallest positive solution of the inequality
 ε  3mε δ
log N B1 , − ≤ log
4R 128(1 + CK R) 2
7
1/(2+θ) −θ/(2+θ) −1/(2+θ)
with R = 2C0  γ + 2 log(2/δ)
mγ .
In addition,
(i) If log N (B1 , η) ≤ C0 (1/η)p for some p, C0 > 0 and all η > 0, then, taking
γ = m−β with 0 < β < max{θ,1+2p}
2+θ
, we have, with confidence 1 − δ,
 
r 2
φ 1 2
R(sgn( fz,γh )) ≤C log
m δ

−β 2+θ−β−2pβ θβ
with r = min 2+θ 2+θ , (2+θ)(1+p) , 2+θ and C a constant independent of
m and δ.
(ii) If log N (B1 , η) ≤ C0 (log(1/η))p for some p, C0 > 0 and all 0 < η < 1,
then, for m ≥ max{4/, 3}, taking γ = m−(2+θ)/(1+θ) , we have, with
confidence 1 − δ,
 
φh 1 θ/(1+θ) 2 2
R(sgn( fz,γ )) ≤ C (log m) log
p
.
m δ

φ
Proof. By Theorem 9.21, it is sufficient to bound E φh ( fz,γh ) as stated, since
y = sgn( fsp (x)) = fc (x) almost surely and therefore E φh ( fc ) = 0 and R( fc ) =
0.
φ φ
Choose fγ by (9.17). Decompose E φh ( fz,γh ) + γ fz,γh 2K as
  
φ φ φ φ φ φ φ
E φh ( fz,γh ) − Ez h ( fz,γh ) + Ez h ( fz,γh ) + γ fz,γh 2
K − Ez h ( fγ ) + γ fγ 2
K
φ
+ Ez h ( fγ ) + γ fγ 2
K.

Since the middle term is at most zero by (9.6), we have


 
φ φ φ φ φ φ
E φh ( fz,γh ) + γ fz,γh 2K ≤ E φh ( fz,γh ) − Ez h ( fz,γh ) + Ez h ( fγ ) + γ fγ 2
K .
184 9 Support vector machines for classification

To bound the last term, consider the random variable ξ = φh ( y fγ (x)). Since
y fγ (x) > 0 almost surely, we have 0 ≤ ξ ≤ 1. Also, σ 2 (ξ ) ≤ E(ξ ) = E φh ( fγ ).
Apply the one-side Bernstein inequality to ξ and deduce, for each ε > 0,
 
 mε2
φ
Prob Ez h ( fγ ) − E φh ( fγ ) ≤ ε ≥ 1 − exp − .
z∈Zm
2(E φh ( fγ ) + 13 ε)

By solving the quadratic equation

mε2 δ
− = log ,
2(E φh ( fγ ) + 13 ε) 2

we conclude that, with confidence 1 − 2δ ,


? 2
1
3 log 2
δ + 1
3 log 2δ + 2mE φh ( fγ ) log 2δ
φ
Ez h ( fγ ) − E φh ( fγ ) ≤
m
2
7 log
≤ δ
+ E φh ( fγ ).
6m

δ
Thus, there exists a subset U1 of Z m with ρ(U1 ) ≥ 1 − 2 such that

φ 7 log 2δ
Ez h ( fγ ) ≤ 2E φh ( fγ ) + , ∀z ∈ U1 .
6m

Then, by Lemma 9.25, for all z ∈ U1 ,

φ φ φ φ
Ez h ( fz,γh ) + γ fz,γh 2
K ≤ Ez h ( fγ ) + γ fγ 2
K
 γ θ/(2+θ) 7 log 2
2/(2+θ) δ
≤ 4C0 + . (9.18)
2 6m
7
−θ/(2+θ) γ −1/(2+θ) +
1/(2+θ) 2 log(2/δ)
In particular, taking R = 2C0 mγ , we have,
for all z ∈ U1 ,

φ
fz,γh ∈ F = { f ∈ HK : f K ≤ R} , ∀z ∈ U1 .

Now apply Lemma 9.19 to the set F with α = 14 . We have N (F, 4ε ) =


ε
N (B1 , 4R ) and f C (X ) ≤ CK R for all f ∈ F. By Lemma 9.19, we can find
9.8 References and additional remarks 185

ε
a subset U2 of Z m with ρ(U2 ) ≥ 1 − N (B1 , 4R ) exp{−3mε/(128(1 + CK R))}
such that, for all f ∈ F,

φ
E φh ( f ) − Ez h (f ) √
≤ ε.
E φh ( f ) + ε

φ
In particular, when z ∈ U1 ∩ U2 , we have fz,γh ∈ F and, hence,
7
φ φ φ √ φ φ
E φh ( fz,γh ) − Ez h ( fz,γh ) ≤ ε E φh ( fz,γh ) + ε ≤ 21 E φh ( fz,γh ) + ε.

Take ε = ε∗ (m, δ, γ ) to be the smallest positive solution of


 ε  3mε δ
log N B1 , − ≤ log .
4R 128(1 + CK R) 2
δ
Then ρ(U2 ) ≥ 1 − 2 and, for z ∈ U1 ∩ U2 , (9.18) implies

1 φh φh  γ θ/(2+θ) 7 log 2
φ
E φh ( fz,γh ) ≤ E ( fz,γ ) + ε ∗ (m, δ, γ ) + 4C0 δ
2/(2+θ)
+ .
2 2 6m
It follows that
 γ θ/(2+θ) 7 log 2
φ
E φh ( fz,γh ) ≤ 2ε∗ (m, δ, γ ) + 8C0 δ
2/(2+θ)
+ .
2 3m
Since ρ(U1 ∩ U2 ) ≥ 1 − δ, our first statement holds. The rest of the result,
statements (i) and (ii), follows from Theorem 9.20 after replacing  by R1
and δ by 2δ . 

9.8 References and additional remarks


The support vector machine was introduced by Vapnik and his collaborators.
It appeared in [20] with polynomial kernels K(x, y) = (1 + x · y)d and in [35]
with general Mercer kernels. More details about the algorithm for solving
optimization problem (9.11) can be found in [37, 107, 134, 152].
Proposition 9.3 and some other properties of the Bayes rule can be found
in [44]. Proposition 9.7 (a representer theorem) can be found in [137]. The
material in Sections 9.3–9.5 is taken from [134].
Theorem 9.17 was proved in [138] and Theorem 9.21 in [154]. The idea of
comparing excess errors also appeared in [80].
186 9 Support vector machines for classification

The error analysis for support vector machines and strictly separable
distributions was already well understood in the early works on support vector
machines (see [134, 37]). The concept of weakly separable distribution was
introduced, and the error analysis for such a distribution was performed, in [31].
When the support vector machine soft margin classifier contains an offset
term b as in (9.11), the algorithm is more flexible and more general data
can be separated. But the error analysis is more complex than for scheme
φ
(9.6), which has no offset. The bound for fz,γh K becomes larger than those
shown in Theorem 9.16 and (9.18). But the approach we have used for scheme
(9.6) can be applied as well and a similar error analysis can be performed.
For details, see [31].
10
General regularized classifiers

In Chapter 9 we saw that solving classification problems amounts to


approximating the Bayes rule fc (w.r.t. the misclassification error) and we
described a learning algorithm, the support vector machine, producing such
aproximations from a sample z, a Mercer kernel K, and a regularization
parameter γ > 0. The main result in Chapter 9 estimated the quality of
the approximations obtained under a separability hypothesis on ρ. The
classifier produced by the support vector machine is the regularized classifier
associated with z, K, γ , and a particular loss function, the hinge loss φh .
Recall that for a loss function φ, this regularized classifier is given by
φ
sgn( fz,γ ), with
 
1
m
φ
fz,γ := argmin φ(yi f (xi )) + γ f 2
K . (10.1)
f ∈HK m
i=1

In this chapter we extend this development in two ways. First, we remove the
separability assumption. Second, we replace the hinge loss φh by arbitrary loss
functions within a certain class. Note that it would not be of interest to consider
completely arbitrary loss functions, since many such functions would lead to
optimization problems (10.1) for which no efficient algorithm is known. The
following definition yields an intermediate class of loss functions.

Definition 10.1 We say that φ : R → R+ is a classifying loss (function) if


it is convex and differentiable at 0 with φ  (0) < 0, and if the smallest zero of
φ is 1.

Examples of classifying loss functions are the least squares loss φls (t), the
hinge loss φh , and, for 1 ≤ q < ∞, the q-norm (support vector machine) loss
defined by φq (t) := (φh (t))q .

187
188 10 General regularized classifiers

h= 1

0 1

Figure 10.1

Note that Proposition 9.11 implies that optimization problem (10.1) for a
classifying loss function is a convex programming problem. One special feature
shared by φls , φh , and φ2 = (φh )2 is that their associated convex programming
problems are quadratic programming problems. This allows for many efficient
algorithms to be applied when computing a solution of (10.1). Note that φls
differs from φ2 by the addition of a symmetric part on the right of 1.
Figure 10.1 shows the shape of some of these classifying loss functions
(together with that of φ0 ).
φ
Our goal, as in previous chapters, is to understand how close sgn( fz,γ ) is to
fc (w.r.t. the misclassification error). In other words, we want to estimate the
φ
excess misclassification error R(sgn( fz,γ )) − R( fc ). Note that in Chapter 9
we had R( fc ) = 0 because of the separability assumption. This is no longer
the case. The main result in this chapter, Theorem 10.24, this goal achieves for
various kernels K and classifying loss functions.
The following two theorems, easily derived from Theorem 10.24, become
specific for C ∞ kernels and the hinge loss φh and the least squares loss φls ,
respectively.

Theorem 10.2 Assume that X ⊆ Rn , K is C ∞ in X × X , and, for


some β > 0,

inf { f − fc Lρ1 + γ f K}
2
= O(γ β ). (10.2)
f ∈HK X
10.1 Bounding the misclassification error 189

Choose γ = m−2/(1+β) . Then, for any 0 < ε < 1


2 and 0 < δ < 1, with
confidence 1 − δ,
θ
φ 2 1
R(sgn( fz,γh )) − R( fc ) ≤ C log
δ m
 
2β 1
holds, where θ = min , − ε and C is a constant depending on ε but
1+β 2
not on m or δ.
Condition (10.2) measures how quickly fc is approximated by functions from
HK in the metric Lρ1X . When HK is dense in Lρ1X (X ), the quantity on the left-
hand side of (10.2) tends to zero as γ → 0. What (10.2) requires is a certain
decay for this convergence. This can be stated as some interpolation space
condition for the function fc .
Theorem 10.3 Assume that K is C ∞ in X × X and that for some β > 0,

inf { f − fρ 2L 2 + γ f K}
2
= O(γ β ).
f ∈HK ρX

Choose γ = 1
m. Then, for any 0 < ε < 1
2 and 0 < δ < 1, with confidence
1 − δ,
1/2 (1/2) min{β,1−ε}
φ 2 1
R(sgn( fz,γls )) − R( fc ) ≤ C log
δ m

holds, where C is a constant depending on ε but not on m or δ.


Again, the exponents β in Theorems 10.2 and 10.3 depend on the measure
ρ. We note, however, that in the latter this exponent occurs only in the bounds;
that is, the regularization parameter γ can be chosen without knowing β and,
actually, without any knowledge about ρ.

10.1 Bounding the misclassification error in terms


of the generalization error
The classification algorithm induced by (10.1) is a regularization scheme. Thus,
we expect that our knowledge from Chapter 8 can be used in its analysis. Note,
however, that the minimized errors – the generalization error in Chapter 8 and
the error with respect to the loss φ here – are different and, naturally enough,
so are their minimizers.
190 10 General regularized classifiers

φ
Definition 10.4 Denote by fρ : X → R any measurable function minimizing
the generalization error with respect to φ for example, for almost all
x ∈ X,

fρφ (x) := argmin φ(yt) d ρ(y | x) = argmin φ(t)ηx + φ(−t)(1 − ηx ).
t∈R Y t∈R

Our goal in this chapter is to show that under some mild conditions, for any
φ φ
classifying loss φ satisfying φ  (0) > 0, we have E φ ( fz,γ ) − E φ ( fρ ) → 0 with
high confidence as m → ∞ and γ = γ (m) → 0. We saw in Chapter 9 that this
is the case for φ = φh and weakly separable measures. We begin in this section
by extending Theorem 9.21.

Theorem 10.5 Let φ be a classifying loss such that φ  (0) exists and is positive.
Then there is a constant cφ > 0 such that for all measurable functions
f : X → R,
7
φ
R(sgn( f )) − R( fc ) ≤ cφ E φ ( f ) − E φ ( fρ ).

If R( fc ) = 0, then the bound can be improved to


& '
R(sgn( f )) − R( fc ) ≤ cφ E φ ( f ) − E φ ( fρφ ) .

φ
To prove Theorem 10.5, we want to understand the behavior of fρ . To this
end, we introduce an auxiliary function. In what follows, fix a classifying
loss φ.

Definition 10.6 Define the localizing function  = x : (R ∪ {±∞}) → R+


to be the function associated with φ, ρ, and x given by

(t) = ηx φ(t) + (1 − ηx )φ(−t). (10.3)

The following property of classifying loss functions follows immediately


from their convexity. Denote by φ−  (respectively, φ  ) the left derivative
+
(respectively, right derivative) of φ.
Figure 10.2 shows the localizing functions corresponding to φ0 , φ1 , and φ2
for η(x) = 0.75.

Lemma 10.7 A classifying loss φ is strictly decreasing on (−∞, 1]


 (t) < 0 for t ∈ (−∞, 1) and
and nondecreasing on (1, +∞). It satisfies φ+

φ− (t) ≥ 0 for t ∈ (1, +∞). 
10.1 Bounding the misclassification error 191

[ 2]

[ 1]

[ 0]

0 1

Figure 10.2

 if φ is a classifying loss, then, for all x ∈ X , x is convex and


Note that
E φ ( f ) = X x ( f (x)) dρ X . Denote

fρ− (x) := sup{t ∈ R | − (t) = ηx φ−


 
(t) − (1 − ηx )φ+ (−t) < 0}

and

fρ+ (x) := inf {t ∈ R | + (t) = ηx φ+


 
(t) − (1 − ηx )φ− (−t) > 0}.

The convexity of  implies that fρ− (x) ≤ fρ+ (x).


Theorem 10.8 Let φ be a classifying loss function and x ∈ X .
(i) The convex function x is strictly decreasing on (−∞, fρ− (x)], strictly
increasing on [fρ+ (x), +∞), and constant on [fρ− (x), fρ+ (x)].
φ
(ii) fρ (x) is a minimizer of x and can be taken to be any value in
[fρ− (x), fρ+ (x)].
⎧ φ

⎪ 0 ≤ fρ− (x) ≤ fρ (x) if fρ (x) > 0


(iii) The following holds: fρφ (x) ≤ fρ+ (x) ≤ 0 if fρ (x) < 0



⎩ −
fρ (x) ≤ 0 ≤ fρ+ (x) if fρ (x) = 0.
(iv) fρ− (x) ≤ 1 and fρ+ (x) ≥ −1.
192 10 General regularized classifiers

Proof.
(i) Since  = x is convex, its one-side derivatives are both well defined and
nondecreasing, and − (t)$ ≤ + (t) for # every t ∈ R. Then  is strictly
decreasing on the interval −∞, fρ− (x) , since − (t) < 0 on this interval.
 +
"In +the same% way, + (t) > 0 for t > fρ (x), so  is strictly increasing on
fρ (x), ∞ .
For t ∈ [ fρ− (x), fρ+ (x)], we have 0 ≤ − (t) ≤ + (t) ≤ 0. Hence  is
constant on [fρ− (x), fρ+ (x)] and its value on this interval is its minimum.
(ii) Let x ∈ X . If we denote

φ
E ( f | x) := φ(yf (x)) d ρ(y | x) = ηx φ( f (x)) + (1 − ηx )φ(−f (x)),
Y
(10.4)
φ
then E φ ( f | x) = ( f (x)). It follows that fρ (x), which minimizes
E φ (· | x), is also a minimizer of .
(iii) Observe that
 
fρ (x) = ηx − (1 − ηx ) = 2 ηx − 12 . (10.5)

Since φ is differentiable at 0, so is , and  (0) = (2ηx − 1)


φ  (0) = fρ (x)φ  (0). We now reason by cases and use that φ  (0) < 0. When
fρ (x) > 0, we have − (0) =  (0) < 0 and fρ− (x) ≥ 0. When fρ (x) < 0,
+ (0) =  (0) > 0 and fρ+ (x) ≤ 0 hold. Finally, when fρ (x) = 0, we
have − (0) = + (0) = 0, which implies fρ− (x) ≤ 0 and fρ+ (x) ≥ 0.
(iv) When t > 1, Lemma 10.7 tells us that φ−  (t) ≥ 0 and φ  (−t) < 0. Hence
+
− (t) ≥ 0 and fρ (x) ≤ 1. In the same way, fρ+ (x) ≥ −1 follows from
 −

φ+ (t) < 0 and φ  (−t) ≥ 0 for t < −1.




φ
Assumption 10.9 It follows from Theorem 10.8 that fρ can be chosen to
satisfy

| fρφ (x)| ≤ 1 and fρφ (x) = 0 if fρ (x) = 0. (10.6)


φ
In the remainder of this chapter we assume, without loss of generality, that fρ
satisfies (10.6).
Lemma 10.10 Let φ be a classifying loss such that φ  (0) exists and is positive.
Then there is a constant C = Cφ > 0 such that for all x ∈ X ,
 2
(0) − ( fρφ (x)) ≥ C ηx − 12 .
10.1 Bounding the misclassification error 193

Proof. By the definition of φ  (0), there exists some 12 ≥ c0 > 0 such that for
all t ∈ [−c0 , c0 ],
  
 φ (t) − φ  (0)  φ  (0)
 − φ 
(0)≤ .
 t  2
This implies that

φ  (0)
φ  (0) + φ  (0)t − |t| ≤ φ  (t) ≤ φ  (0) + φ  (0)t
2
φ  (0)
+ |t|, ∀t ∈ [−c0 , c0 ]. (10.7)
2
Let x ∈ X . Consider the case ηx > 1
2 first.
 (0)
Denote  = min{ −φ
φ  (0) (ηx − 12 ), c0 }. For 0 ≤ t ≤ c0 ,

φ  (0)
 (t) = ηx φ  (t) − (1 − ηx )φ  (−t) ≤ (2ηx − 1)φ  (0) + φ  (0)t + t.
2
−φ  (0)
Thus, for 0 ≤ t ≤  ≤ φ  (0) (ηx − 12 ), we have
 
3 −φ  (0) 1
 (t) ≤ (2ηx − 1)φ  (0) + φ  (0) η x −
2 φ  (0) 2
φ  (0) 1
≤ ηx − < 0.
2 2
φ
Therefore  is strictly decreasing on the interval [0, ]. But fρ (x) is its
minimal point, so

φ  (0) 1
(0) − ( fρφ (x)) ≥ (0) − () ≥ − ηx − .
2 2

   (0)
 
When −φ (0)
ηx − 12 ≤ c0 , we have  = −φ
φ  (0) φ  (0) η x − 1
2 . When
 
−φ  (0)

φ (0) η x − 2 > c0 , we have  = c0 ≥ 2c0 (ηx − 2 ). In both cases, we have
1 1

 
−φ  (0) 1 2
−φ  (0)
(0) − ( fρφ (x)) ≥ ηx − min , 2c0 .
2 2 φ  (0)
That is, the desired inequality holds with
 $  %2 
−φ (0)
C = min −φ  (0)c0 , .
2φ  (0)
194 10 General regularized classifiers

The proof for ηx < 1


2 is similar: one estimates the upper bound of (t) for
t < 0. 
Proof of Theorem 10.5 Denote Xc = {x ∈ X | sgn( f )(x)  = fc (x)}. Recall
(9.14). Applying the Cauchy–Schwarz inequality and the fact that ρX is a
probability measure on X , we get
 1/2  1/2
 
R(sgn( f )) − R( fc ) ≤  fρ (x)2 dρ X 1 d ρX
Xc Xc
 1/2
 
≤  fρ (x)2 dρ X .
Xc

We then use Lemma 10.10 and (10.5) to find that


  1/2
1 & '
R(sgn( f )) − R( fc ) ≤ 2 (0) − ( fρφ (x)) dρ X .
C Xc

Let x ∈ Xc . If fρ (x) > 0, then fc (x) = 1 and f (x) < 0. By Theorem 10.8,
fρ− (x) ≥ 0 and  is strictly decreasing on (−∞, 0]. So ( f (x)) > (0) in
this case. In the same way, if fρ (x) < 0, then f (x) ≥ 0. By Theorem 10.8,
fρ+ (x) ≤ 0 and  is strictly increasing on [0, +∞). So ( f (x)) ≥ (0).
φ φ
Finally, if fρ (x) = 0, by (10.6), fρ (x) = 0 and then (0) − ( fρ (x)) = 0.
φ φ
In all three cases we have (0)−( fρ (x)) ≤ ( f (x))−( fρ (x)). Hence,
 
& ' & '
(0) − ( fρφ (x)) dρ X ≤ ( f (x)) − ( fρφ (x)) dρX
Xc Xc

& '
≤ ( f (x)) − ( fρφ (x)) d ρX = E φ ( f ) − E φ ( fρφ ).
X

This proves the first desired bound with cφ = 2/ C.
If R( fc ) = 0, then y = fc (x) almost surely and ηx = 1 or 0 almost everywhere.
This means that | fρ (x)| = 1 and | fρ (x)| = | fρ (x)|2 almost everywhere.
Using this with
 respect
2 to relation (9.14), we see that R(sgn( f )) −
R( fc ) = Xc  fρ (x) d ρX . Then the above procedure yields the second bound
with cφ = C4 . 

10.2 Projection and error decomposition


Since regularized classifiers are obtained by composing the sgn function with
a real-valued function f : X → R, we may improve the error estimates by
10.2 Projection and error decomposition 195

replacing image values of f by their projection onto [−1, 1]. This section
develops this idea.

Definition 10.11 The projection operator π on the space of measurable


functions f : X → R is defined by

⎨ 1 if f (x) > 1
π( f )(x) = −1 if f (x) < −1

f (x) if −1 ≤ f (x) ≤ 1.

Trivially, sgn(π( f )) = sgn( f ). Lemma 10.7 tells us that φ(y(π( f ))(x)) ≤


φ(yf (x)). Then

φ φ
E φ (π( f )) ≤ E φ ( f ) and Ez (π( f )) ≤ Ez ( f ). (10.8)

Together with Theorem 10.5, this implies that if φ  (0) > 0,


7
φ φ φ
R(sgn( fz,γ )) − R( fc ) ≤ cφ E φ (π( fz,γ )) − E φ ( fρ ).

φ
Thus the analysis for the excess misclassification error of fz,γ is reduced into
φ φ
that for the excess generalization error E φ (π( fz,γ )) − E φ ( fρ ). We carry out
the latter analysis in the next two sections.
The following result is similar to Theorem 8.3.
φ
Theorem 10.12 Let φ be a classifying loss, fz,γ be defined by (10.1), and
φ φ
fγ ∈ HK . Then E φ (π( fz,γ )) − E φ ( fρ ) is bounded by

φ φ
E φ (π( fz,γ )) − E φ ( fρφ ) + γ fz,γ 2K ≤ E φ ( fγ ) − E φ ( fρφ ) + γ fγ 2K
( ) " #
φ φ
+ Ez ( fγ ) − Ez ( fρφ ) − E φ ( fγ ) − E φ ( fρφ ) (10.9)
( ) ( )
φ φ φ φ
+ E φ (π( fz,γ )) − E φ ( fρφ ) − Ez (π( fz,γ )) − Ez ( fρφ ) .

Proof. The proof follows from (10.8) using the procedure in the proof of
Theorem 8.3. 

The function fγ ∈ HK in Theorem 10.12 is called a regularizing function.


It is arbitrarily chosen and depends on γ . One standard choice is the function

fγφ := argmin E φ ( f ) + γ f 2
K . (10.10)
f ∈HK
196 10 General regularized classifiers

The first term on the right-hand side of (10.9) is estimated in the next section.
It is the regularized error (w.r.t. φ) of fγ ,

D(γ , φ) := E φ ( fγ ) − E φ ( fρφ ) + γ fγ 2
K. (10.11)

The second and third terms on the right-hand side of (10.9) decompose
φ
the sample error E φ (π( fz,γ )) − E φ ( fγ ). The second term is about a single
random variable involving only one function fγ and is easy to handle; we bound
it in Section 10.4. The third term is more complex. In the form presented,
φ φ
the function π( fz,γ ) is projected from fz,γ . This projection maintains the
φ φ
misclassification error: R(sgn(π( fz,γ ))) = R(sgn( fz,γ )). However, it causes
φ
the random variable φ(yπ( fz,γ )(x)) to be bounded by φ(−1), a bound that
φ
is often smaller than that for φ(yfz,γ (x)). This allows for improved bounds
for the sample error for classification algorithms. We bound this third term in
Section 10.5.

10.3 Bounds for the regularized error D(γ , φ) of fγ


In this section we estimate the regularized error D(γ , φ) of fγ . This estimate
φ
follows from estimates for E φ ( fγ )−E φ ( fρ ). Define the function  : R+ → R+
by

   
(t) = max{|φ− (t)|, |φ+ (t)|, |φ− (−t)|, |φ+ (−t)|}.

Theorem 10.13 Let φ be a classifying loss. For any measurable function f ,



 
E φ ( f ) − E φ ( fρφ ) ≤ (| f (x)|)  f (x) − fρφ (x) dρ X .
X

If, in addition, φ ∈ C 2 (R), then we have



&
E φ ( f ) − E φ ( fρφ ) ≤ 1
2 φ  L ∞ [−1,1]
X
' 2
+ φ  L ∞ [−| f (x)|,| f (x)|]  f (x) − fρφ (x) dρ X .

Proof. It follows from (10.3) and (10.4) that



E φ ( f ) − E φ ( fρφ ) = ( f (x)) − ( fρφ (x)) dρ X .
X
10.3 Bounds for the regularized error D(γ , φ) of fγ 197

By Theorem 10.8,  is constant on [fρ− (x), fρ+ (x)]. So we need only bound for
those points x for which the value f (x) is outside this interval.
If f (x) > fρ+ (x), then, by Theorem 10.8 and since fρ+ (x) ≥ −1,  is strictly
increasing on [fρ+ (x), f (x)]. Moreover, the convexity of  implies that

$ %
( f (x)) − ( fρφ (x)) ≤ − ( f (x)) f (x) − fρφ (x)
&  ' 
≤ max φ− 
( f (x)), |φ+ (−f (x))|  f (x) − fρφ (x)
 
≤ (| f (x)|)  f (x) − fρφ (x) .

Similarly, if f (x) < fρ− (x), then, by Theorem 10.8 again and since fρ− (x) ≤ 1,
 is strictly decreasing on [f (x), fρ− (x)], and

$ %  
( f (x)) − ( fρφ (x)) ≤ + ( f (x)) f (x) − fρφ (x) ≤ (| f (x)|)  f (x) − fρφ (x) .

Thus, we have
 
( f (x)) − ( fρφ (x)) ≤ (| f (x)|)  f (x) − fρφ (x) .

This gives the first bound.


φ φ
If φ ∈ C 2 (R), so is . Then  ( fρ (x)) = 0 since fρ (x) is a minimum of .
+
When f (x) > fρ (x), using Taylor’s expansion,

 f (x)
$ %
( f (x)) − ( fρφ (x)) = 
( fρφ (x)) f (x) − fρφ (x) + ( f (x) − t)  (t) dt
φ
fρ (x)
 2
≤  L ∞ [f φ (x), f (x)] 21  f (x) − fρφ (x) .
ρ

φ
Now, since fρ (x) ∈ [−1, 1], use that

 L ∞ [f φ (x),f (x)] ≤ max{ φ  L ∞ [−1,1] , φ  L ∞ [−| f (x)|,| f (x)|] }


ρ

to get the desired result.


The case f (x) < fρ− (x) is dealt with in the same way. 

Corollary 10.14 Let φ be a classifying loss with φ  ∞ < ∞. For any


measurable function f ,

E φ ( f ) − E φ ( fρφ ) ≤ φ  ∞ f − fρφ Lρ1 .


X
198 10 General regularized classifiers

If φ ∈ C 2 (R) and φ  ∞ < ∞, then we have

E φ ( f ) − E φ ( fρφ ) ≤ φ  ∞ f − fρφ 2L 2 . 
ρX

10.4 Bounds for the sample error term involving fγ


In this (section, we bound) the (sample error term) in (10.9) involving fγ ,
φ φ φ φ
that is, Ez ( fγ ) − Ez ( fρ ) − E φ ( fγ ) − E φ ( fρ ) . This can be written as
 m
i=1 ξ(zi ) − E(ξ ) with ξ the random variable on (Z, ρ) given by ξ(z) =
1
m
φ
φ(yfγ (x)) − φ(yfρ (x)). To bound this quantity using the Bernstein inequalities,
we need to control the variance. We do so by means of the following constant
determined by φ and ρ.

Definition 10.15 The variancing power τ = τφ,ρ of the pair (φ, ρ) is defined
to be the maximal number τ in [0, 1] such that for some constant C1 > 0 and
any measurable function f : X → [−1, 1],
$ %2 $ %τ
E φ(yf (x)) − φ(yfρφ (x)) ≤ C1 E φ ( f ) − E φ ( fρφ ) . (10.12)

Since (10.12) always holds with τ = 0 and C1 = (φ(−1))2 , the variancing


power τφ,ρ is well defined.

Example 10.16 For φls (t) = (1 − t)2 we have τφ,ρ = 1 for any probability
measure ρ.

Proof. For φls (t) = (1 − t)2 we know that φls (yf (x)) = (y − f (x))2
φ
and fρ = fρ . Hence (10.12) is valid with τ = 1 and C1 = sup(x,y)∈Z
$ %2
y − f (x) + y − fρ (x) ≤ 16. 

In general, τ depends on the convexity of φ and on how much noise ρ


contains.
In particular, for the q-norm loss φq , τφq ,ρ = 1 when 1 < q ≤ 2 and
0 < τφq ,ρ < 1 when q > 2.

Lemma 10.17 Let q, q∗ > 1 be such that 1


q + 1
q∗ = 1. Then

1 q 1 ∗
a·b≤ a + ∗ bq , ∀a, b > 0.
q q
10.4 Bounds for the sample error term involving fγ 199


Proof. Let b > 0. Define a function f : R+ → R by f (a) = a·b− q1 aq − q1∗ bq .
This satisfies

f  (a) = b − aq−1 , f  (a) = −(q − 1)aq−2 < 0, ∀a > 0.

Hence f is a concave function on R+ and takes its maximum value at the unique
point a∗ = b1/(q−1) where f  (a∗ ) = 0. But q∗ = q−1
q
and

1 1 ∗ 1 1
f (a∗ ) = a∗ · b − (a∗ )q − ∗ bq = bq/(q−1) − bq/(q−1) − ∗ bq/(q−1) = 0.
q q q q

Therefore, f (a) ≤ f (a∗ ) = 0 for all a ∈ R+ . This is true for any b > 0. So the
inequality holds. 

Proposition 10.18 Let τ = τφ,ρ and Bγ := max{φ( ( fγ ∞ ), φ(− fγ ∞))}.


φ φ φ
For any 0 < δ < 1, with confidence 1 − δ, the quantity Ez ( fγ ) − Ez ( fρ ) −
( )
φ
E φ ( fγ ) − E φ ( fρ ) is bounded by

1/(2−τ )
5Bγ + 2φ(−1) 2 2C1 log (2/δ)
log + + E φ ( fγ ) − E φ ( fρφ ).
3m δ m

φ
Proof. Write the random variable ξ(z) = φ(yfγ (x)) − φ(yfρ (x)) on (Z, ρ) as
ξ = ξ1 + ξ2 , where

ξ1 := φ(yfγ (x)) − φ(yπ( fγ )(x)), ξ2 := φ(yπ( fγ )(x)) − φ(yfρφ (x)).


(10.13)

The first part ξ1 is a random variable satisfying 0 ≤ ξ1 ≤ Bγ . Applying the


one-side Bernstein inequality to ξ1 , we obtain, for any ε > 0,

  ⎧ ⎫
1 m ⎨ mε 2 ⎬
Prob ξ 1 (z i ) − E(ξ 1 ) > ε ≤ exp −   .
z∈Z m m ⎩ 2 σ 2 (ξ ) + 1 B ε ⎭
i=1 1 3 γ
200 10 General regularized classifiers

Solving the quadratic equation for ε given by

mε 2 2
  = log ,
σ 2 (ξ δ
1) + 3 Bγ ε
1
2

we see that for any 0 < δ < 1, there exists a subset U1 of Z m with measure at
least 1 − 2δ such that for every z ∈ U1 ,

1
m
ξ1 (zi ) − E(ξ1 )
m
i=1
? 2
1
3 Bγ log(2/δ) + 1
3 Bγ log(2/δ) + 2mσ 2 (ξ1 ) log(2/δ)
≤ .
m

But σ 2 (ξ1 ) ≤ E(ξ12 ) ≤ Bγ E(ξ1 ). Therefore,

1
m
5Bγ log(2/δ)
ξ1 (zi ) − E(ξ1 ) ≤ + E(ξ1 ), ∀z ∈ U1 .
m 3m
i=1

Next we consider ξ2 . This is a random variable bounded by φ(−1). Applying


the one-side Bernstein inequality as above, we obtain another subset U2 of Z m
with measure at least 1 − 2δ such that for every z ∈ U2 ,
!
1
m
2φ(−1) log(2/δ) 2 log(2/δ)σ 2 (ξ2 )
ξ2 (zi ) − E(ξ2 ) ≤ + .
m 3m m
i=1

The definition of the variancing power τ gives σ 2 (ξ2 ) ≤ E(ξ22 ) ≤ C1 {E(ξ2 )}τ .
Applying Lemma 10.17 to q = 2−τ 2
, q∗ = τ2 , a = 2 log(2/δ)C1 /m, and
√ τ
b = {E(ξ2 )} , we obtain
!
2 log(2/δ)σ 2 (ξ2 )  τ  2 log(2/δ)C1 1/(2−τ )
τ
≤ 1− + E(ξ2 ).
m 2 m 2

Hence, for all z ∈ U2 ,

1 1/(2−τ )
m
2φ(−1) log(2/δ) 2 log(2/δ)C1
ξ2 (zi ) − E(ξ2 ) ≤ + + E(ξ2 ).
m 3m m
i=1
φ
10.5 Bounds for the sample error term involving fz,γ 201

Combining these inequalities for ξ1 and ξ2 with the fact that E(ξ1 ) + E(ξ2 ) =
φ
E(ξ ) = E φ ( fγ ) − E φ ( fρ ), we conclude that for all z ∈ U1 ∩ U2 ,

( ) " #
φ φ
Ez ( fγ ) − Ez ( fρφ ) − E φ ( fγ ) − E φ ( fρφ )
(1/2−τ )
5Bγ log(2/δ) + 2φ(−1) log(2/δ) 2 log(2/δ)C1
≤ +
3m m
+ E φ ( fγ ) − E φ ( fρφ ).

Since the measure of U1 ∩ U2 is at least 1 − δ, this bound holds with the


confidence claimed. 

φ
10.5 Bounds for the sample error term involving fz,γ
( )
φ φ
The other term of the sample error in (10.9), E φ (π( fz,γ )) − E φ ( fρ ) −
( )
φ φ φ φ φ
Ez (π( fz,γ )) − Ez ( fρ ) , involves the function fz,γ and thus runs over a set
of functions. To bound it, we use – as we have already done in similar cases –
a probability inequality for a function set in terms of the covering numbers of
the set.
The following probability inequality can be proved using the one-side
Bernstein inequality as in Lemma 3.18.

Lemma 10.19 Let ξ be a random variable on Z with mean µ and variance σ 2 .


Assume that µ ≥ 0, |ξ − µ| ≤ B almost everywhere, and σ 2 ≤ cµτ for some
0 ≤ τ ≤ 2 and c, B ≥ 0. Then, for every ε > 0,

    
µ − m1 m i=1 ξ(zi ) 1− τ2 mε2−τ
Prob √ τ >ε ≤ exp −
z∈Z m µ + ετ 2(c + 13 Bε 1−τ )

holds. 
Also, the following inequality for a function set can be proved in the same way
as Lemma 3.19.

Lemma 10.20 Let 0 ≤ τ ≤ 1, c, B ≥ 0, and G be a set of functions on Z such


that for every g ∈ G, E(g) ≥ 0, g − E(g) Lρ∞ ≤ B, and E(g 2 ) ≤ c(E(g))τ .
202 10 General regularized classifiers

Then, for all ε > 0,


 m 
E(g) − 1
i=1 g(zi )
Prob sup m
> 4ε 1−τ/2
z∈Zm
g∈G (E(g)) + ε τ
τ
 
mε2−τ
≤ N (G, ε) exp − . 
2(c + 13 Bε 1−τ )

We can now derive the sample error bounds along the same lines we followed
in the previous chapter for the regression problem.

Lemma 10.21 Let τ = τφ,ρ . For any R > 0 and any ε > 0.
⎧   ⎫

⎪ φ φ φ φ ⎪

⎨ E φ (π( f )) − E φ ( fρ ) − Ez (π( f )) − Ez ( fρ ) ⎬
Prob sup ? τ ≤ 4ε1−τ/2
z∈Zm ⎪
⎪ ⎪

⎩f ∈BR E φ (π( f )) − E φ ( fρ )
φ
+ ετ ⎭
 
ε mε2−τ
≥ 1 − N B1 , exp −
R|φ  (−1)| 2C1 + 43 φ(−1)ε 1−τ

holds.

Proof. Apply Lemma 10.20 to the function set

φ & '
FR = φ(y(πf )(x)) − φ(yfρφ (x)) : f ∈ BR . (10.14)

φ
Each function g ∈ FR satisfies E(g 2 ) ≤ c (E(g))τ for c = C1 and
g − E(g) Lρ∞ ≤ B := 2φ(−1). Therefore, to draw our conclusion from
φ
Lemma 10.20, we need only bound the covering number N (FR , ε). To do so,
we note that for f1 , f2 ∈ BR and (x, y) ∈ Z, we have
& ' & '
 φ(y(π f1 )(x)) − φ(yf φ (x)) − φ(y(π f2 )(x)) − φ(yf φ (x)) 
ρ ρ

= |φ(y(π f1 )(x)) − φ(y(πf2 )(x))| ≤ |φ (−1)| f1 − f2 ∞.

Therefore
  ε
φ
N FR , ε ≤ N BR , ,
|φ  (−1)|

proving the statement. 


φ
10.5 Bounds for the sample error term involving fz,γ 203

Define ε∗ (m, R, δ) to be the smallest positive number ε satisfying

ε mε 2−τ
log N B1 , 
− ≤ log δ. (10.15)
R|φ (−1)| 2C1 + 43 φ(−1)ε 1−τ

Then the confidence for the error ε = ε∗ (m, R, δ) in Lemma 10.21 is at least
1 − δ.
For R > 0, denote

φ
W(R) = z ∈ Z m : fz,γ K ≤ R .

Proposition 10.22 For all 0 < δ < 1 and R > 0, there is a subset VR of
Z m with measure at most δ such that for all z ∈ W(R) \ VR , the quantity
φ φ φ
E φ (π( fz,γ )) − E φ ( fρ ) + γ fz,γ 2K is bounded by

10Bγ + 4φ(−1)
4D(γ , φ) + 24ε ∗ (m, R, δ/2) + log (4/δ)
3m
1/(2−τ )
2C1 log (4/δ)
+2 .
m

Proof. Lemma 10.17 implies that for 0 < τ < 1,


? τ
φ τ $ φ %
E φ (π( f )) − E φ ( fρ ) + ε τ · 4ε 1−τ/2 ≤ E (π( f )) − E φ ( fρφ )
2
+ (1 − τ/2)41/(1−τ/2) ε + 4ε
1$ φ %
≤ E (π( f )) − E φ ( fρφ ) + 12ε.
2
Putting this into Lemma 10.21 with ε = ε ∗ (m, R, δ/2), we deduce that for
z ∈ W(R), with confidence 1 − 2δ ,
( ) ( )
φ φ φ φ
E φ (π( fz,γ )) − E φ ( fρφ ) − Ez (π( fz,γ )) − Ez ( fρφ )
 
φ
≤ 12 E φ (π( fz,γ )) − E φ ( fρφ ) + 12ε ∗ (m, R, δ/2).

Proposition 10.18 with δ replaced by 2δ guarantees that with confidence 1− 2δ ,


( ) ( )
φ φ φ φ
the quantity Ez ( fγ ) − Ez ( fρ ) − E φ ( fγ ) − E φ ( fρ ) is bounded by

1/(2−τ )
5Bγ + 2φ(−1) 2C1 log (4/δ)
log (4/δ) + + E φ ( fγ ) − E φ ( fρφ ).
3m m
204 10 General regularized classifiers

Combining these two bounds with (10.9), we see that for z ∈ W(R), with
confidence 1 − δ,

φ φ 1 φ φ

E φ (π( fz,γ )) − E φ ( fρφ ) + γ fz,γ 2
K ≤ D(γ , φ) + E (π( fz,γ )) − E φ ( fρφ )
2
1/(2−τ )
5Bγ + 2φ(−1) 2C1 log (4/δ)
+ 12ε∗ (m, R, δ/2) + log (4/δ) +
3m m
+ E φ ( fγ ) − E φ ( fρφ ).

This gives the desired bound. 


Lemma 10.23 For all γ > 0 and z ∈ Z m ,
φ
fz,γ K ≤ φ(0)/γ .

φ φ
Proof. Since fz,γ minimizes Ez ( f ) + γ f 2
K in HK , choosing f = 0 implies
that

1
m
φ φ φ φ φ
γ fz,γ 2
K ≤ Ez ( fz,γ ) + γ fz,γ 2
K ≤ Ez (0) + 0 = φ(0) = φ(0).
m
i=1

φ √
Therefore, fz,γ K ≤
φ(0)/γ for all z ∈ Z m . 
√ √
By Lemma 10.23, W( φ(0)/γ ) = Z m . Taking R := φ(0)/γ , we can
derive a weak error bound, as we did in Section 8.3. But we can do better.
φ
A bound for the norm fz,γ K improving that of Lemma 10.23 can be shown to
hold with high probability. To show this is the target of the next section. Note that
we could now wrap the results in this and the two preceding sections into a single
φ
statement bounding the excess misclassification error R(sgn( fz,γ )) − R( fc ).
We actually do that, in Corollary 10.25, once we have obtained a better bound
φ
for the norm fz,γ K .

10.6 Stronger error bounds


φ φ
In this section we derive bounds for E φ (π( fz,γ ))−E φ ( fρ ), improving those that
would follow from the preceding sections, at the cost of a few mild assumptions.
Theorem 10.24 Assume the following with positive constants p, C0 , Cφ , A,
q ≥ 1, and β ≤ 1.
(i) K satisfies log N (B1 , η) ≤ C0 (1/η)p .
10.6 Stronger error bounds 205

(ii) φ(t) ≤ Cφ |t|q for all t  ∈ (−1, 1).


(iii) D(γ , φ) ≤ Aγ β for each γ > 0.
Choose γ = m−ζ with ζ = β+q(1−β)/2
1
. Then, for all 0 < η ≤ 1
2 and all
0 < δ < 1, with confidence 1 − δ,

φ 2
E φ (π( fz,γ )) − E φ ( fρφ ) ≤ Cη log m−θ ,
δ

where
 
β 1 − pr p
θ := min , , s := ,
β + q(1 − β)/2 2 − τ + p 2(1 + p)
 
ζ − 1/(2 − τ + p) 1−β 1 q 1
r := max + η, ζ,ζ + (1 − β) −
2(1 − s) 2 2 4 2

and Cη is a constant depending on η and the constants in conditions (i)–(iii),


but not on m or δ.
The following corollary follows from Theorems 10.5 and 10.24.
Corollary 10.25 Under the hypothesis and with the notations of
Theorem 10.24, if φ  (0) > 0, then, with confidence at least 1 − δ, we have
?
φ 2
R(sgn( fz,γ )) − R( fc ) ≤ cφ Cη log m−θ . 
δ

When the kernel is C ∞ on X ⊂ Rn , we know (cf. Theorem 5.1(i)) that p in


Theorem 10.24(i) can be arbitrarily small. We thus get the following result.
Corollary 10.26 Assume that K is C ∞ on X × X and φ(t) ≤ Cφ |t|q for all
t  ∈ (−1, 1) and some q ≥ 1. If D(γ , φ) ≤ Aγ β for all γ > 0 and some
0 < β ≤ 1, choose γ = m−ζ with ζ = β+q(1−β)/2
1
. Then for any 0 < η ≤ 12
and 0 < δ < 1, with confidence 1 − δ,

φ 2
E φ (π( fz,γ )) − E φ ( fρφ ) ≤ Cη log m−θ ,
δ

where
 
β 1
θ := min , −η
β + q(1 − β)/2 2 − τ

and Cη is a constant depending on η, but not on m or δ. 


206 10 General regularized classifiers

Theorem 10.2 follows from Corollary 10.26, Corollary 10.14, and


φ
Theorem 9.21 by taking fγ = fγ defined in (10.10).
Theorem 10.3 is a consequence of Corollary 10.26 and Theorem 10.5. In this
case, in addition, we can take q = 2, which implies ζ = 1.
The proof of Theorem 10.24 will follow from several lemmas. The idea is to
find a radius R such that W(R) is close to Z m with high probability.
First we establish a bound for the number ε∗ (m, R, δ).
Lemma 10.27 Assume K satisfies log N (B1 , η) ≤ C0 (1/η)p for some p > 0.
Then for R ≥ 1 and 0 < δ ≤ 12 , the quantity ε ∗ (m, R, δ) defined by (10.15) can
be bounded by

∗ log(1/δ) 1/(2−τ )
ε (m, R, δ) ≤ C2
m
 
Rp 1/(1+p) Rp 1/(2−τ +p)
+ max , ,
m m

where C2 := (6φ(−1) + 8C1 + 1)(C0 + 1)(|φ  (−1)| + 1).


Proof. Using the covering number assumption, we see from (10.15) that
ε∗ (m, R, δ) ≤ , where  is the unique positive number  satisfying

R|φ  (−1)| p
m 2−τ
C0 − = log δ.
 2C1 + 43 φ(−1) 1−τ

We can rewrite this equation as

4φ(−1) log(1/δ) 1−τ +p 2C1 log(1/δ) p


 2−τ +p −  − 
3m m
4φ(−1)C0 $ %p 2C1 C0 $ %p
− R|φ  (−1)|  1−τ − R|φ  (−1)| = 0.
3m m
Applying Lemma 7.2 with d = 4 to this equation, we find that the solution 
satisfies

16φ(−1) log(1/δ) 8C1 log(1/δ) 1/(2−τ )
 ≤ max , ,
3m m

16φ(−1)C0 $ 
%p 1/(1+p) 8C1 C0 $ 
%p 1/(2−τ +p)
R|φ (−1)| , R|φ (−1)| .
3m m

Therefore the desired bound for ε∗ (m, R, δ) follows. 


10.6 Stronger error bounds 207

The following lemma is a consequence of Lemma 10.27 and


Proposition 10.22.

Lemma 10.28 Under the assumptions of Theorem 10.24, choose γ = m−ζ for
some ζ > 0. Then, for any 0 < δ < 1 and R ≥ 1, there is a set VR ⊆ Z m with
measure at most δ such that for m ≥ (C2K A)−1/(ζ (1−β)) ,

W(R) ⊆ W(am Rs + bm ) ∪ VR ,

√  1/2
p
where s := 2(1+p) , am := 5 C2 m(ζ −1/(2−τ +p))/2 , and bm := C3 log 4δ mr .
Here the constants are
 
ζ − 1/(2 − τ ) 1 − β 1 q 1
r := max , ζ,ζ + (1 − β) −
2 2 2 4 2

and

√ 7
1/(4−2τ )
φ(−1) + 2 A + 2 Cφ CK Aq/4 .
q/2
C3 = 5 C2 + 2C1 + √2
3

Proof. By Proposition 10.22, there is a set VR with measure at most δ such


that for all z ∈ W(R) \ VR ,

φ 10Bγ + 4φ(−1)
γ fz,γ 2
K ≤ 4Aγ β + 24ε ∗ (m, R, δ/2) + log (4/δ)
3m
1/(2−τ )
2C1 log (4/δ)
+2 .
m

Since φ(t) ≤ Cφ |t|q for each t  ∈ (−1, 1), we see that Bγ =
$ %q
max{φ( fγ ∞ ), φ(− fγ ∞ )} is bounded by Cφ max{ fγ ∞ , 1} . But the
assumption D(γ , φ) ≤ Aγ β implies that


fγ ∞ ≤ CK fγ K ≤ CK D(γ , φ)/γ ≤ CK Aγ (β−1)/2 .

Hence,


Bγ ≤ Cφ CK Aq/2 γ q(β−1)/2 , when CK Aγ (β−1)/2 ≥ 1.
q
(10.16)
208 10 General regularized classifiers

Under this restriction it follows from Lemma 10.27 that z ∈ W(R) for any R
satisfying

1/(2−τ )
1 1/(2−τ ) 4 log(4/δ)
R≥ √ 4Aγ β + 24C2 + 4C1 + φ(−1)
γ 3 m
1/2
−1/(2−τ +p) p/(1+p) 10 log(4/δ)
+ Cφ CK Aq/2 γ q(β−1)/2
q
+24C2 m R .
3 m

Taking γ = m−ζ , we see that we can choose

1/2
4
R = 5 C2 m(ζ −1/(2−τ +p))/2 r p/(2(1+p)) + C3 log mr .
δ

This proves the lemma. 

Lemma 10.29 Under the assumptions of Theorem 10.24, take γ = m−ζ for
some ζ > 0 and let m ≥ (C2K A)−1/(ζ (1−β)) . Then, for any η > 0 and 0 < δ < 1,

the set W(R∗ ) has measure at least 1 − Jη δ, where R∗ = C4 mr ,
 
ζ 1 1
Jη := log2 max , + log2 + 1,
2 (2 − τ + p) η

and
 
ζ − 1/(2 − τ + p)
r ∗ := max r, +η .
2(1 − s)

The constant C4 is given by

 2  2 4 1/2
C4 = 5 C2 (φ(0) + 1) + Jη 5 C2 C3 log .
δ

Proof. Let J be a positive integer that will be determined


$ later.
%s Define a

sequence {R(j) }Jj=0 by R(0) = φ(0)/γ and R(j) = am R(j−1) + bm for
1 ≤ j ≤ J . Then we have

2 +···+sJ −1
 sJ J −1
  s j
R(J ) = (am )1+s+s R(0)
2 +···+sj−1
+ (am )1+s+s bm .
j=0
(10.17)
10.6 Stronger error bounds 209

The first term on the right-hand side of (10.17) equals


  1−sJ ζ −1/(2−τ +p) 1−sJ ζ J
· 1−s
(φ(0))s /2 m 2 s ,
1−s J
5 C2 m 2

which, since 0 < s < 12 , is bounded by


 2  
ζ −1/(2−τ +p) ζ
− ζ −1/(2−τ +p)
·sJ
5 C2 (φ(0) + 1) m 2(1−s) · m 2 2(1−s)
.

ζ
When 2J ≥ max{ 2η , (2−τ1+p)η }, this upper bound is controlled by
 2 ζ −1/(2−τ +p)

5 C2 (φ(0) + 1) m 2(1−s) .

The second term on the right-hand side of (10.17) equals


J −1 
  1−s j
 sj
ζ −1/(2−τ +p) 1−s
5 C2 m 2 C3 (log(4/δ))1/2 mr ,
j=0

which is bounded by
J −1  2  
ζ −1/(2−τ +p)  r− ζ −1/(2−τ +p)
·sj
m 2(1−s) 5 C2 C3 (log(4/δ)) 1/2
m 2(1−s)
.
j=0

ζ −1/(2−τ +p)
If r ≥ 2(1−s) , this last expression is bounded by

ζ −1/(2−τ +p)  2 ζ −1/(2−τ +p)


m 2(1−s) J 5 C2 C3 (log(4/δ))1/2 mr− 2(1−s)
 2
= J 5 C2 C3 (log(4/δ))1/2 mr .

ζ −1/(2−τ +p)
If r < 2(1−s) , an upper bound is easier:

ζ −1/(2−τ +p)  2
m 2(1−s) J 5 C2 C3 (log(4/δ))1/2 .

$ √ %2
Thus, in either case, the second term has the upper bound J 5 C2 C3

(log(4/δ))1/2 mr .
Combining the bounds for the two terms, we have
 2  2 

R(J ) ≤ 5 C2 (φ(0) + 1) + J 5 C2 C3 (log(4/δ))1/2 mr .
210 10 General regularized classifiers

ζ
Taking J to be Jη , we have 2J > max{ 2η , (2−τ1+p)η } and we finish the
proof. 
The proof of Theorem 10.24 follows from Lemmas 10.27 and 10.29 and
Proposition 10.22. The constant Cη can be explicitly obtained.

10.7 Improving learning rates by imposing noise conditions


There is a difference in the learning rates given by Theorem 10.2 (where the
best rate is 12 − ε) and Theorem 9.26 (where the rate can be arbitrarily close to
1). This motivates the idea of improving the learning rates stated in this chapter
by imposing some conditions on the measures. In this section we introduce one
possible such condition.
Definition 10.30 Let 0 ≤ q ≤ ∞. We say that ρ has Tsybakov noise exponent
q if there exists a constant cq > 0 such that for all t > 0,
$ %
ρX {x ∈ X : | fρ (x)| ≤ cq t} ≤ t q . (10.18)

All distributions have at least noise exponent 0 since t 0 = 1. Deterministic


distributions (which satisfy | fρ (x)| ≡ 1) have noise exponent q = ∞ with
c∞ = 1.
The Tsybakov noise condition improves the variancing power τφ,ρ . Let us
show this for the hinge loss.
Lemma 10.31 Let 0 ≤ q ≤ ∞. If ρ has Tsybakov noise exponent q with (10.18)
valid, then, for every function f : X → [−1, 1],
 1 q/(q+1) $ %q/(q+1)
E (φh (yf (x)) − φh (yfc (x)))2 ≤ 8 E φh ( f ) − E φh ( fc )
2cq

holds.
Proof. Since f (x) ∈ [−1, 1], we have φh (yf (x)) − φh (yfc (x)) = y( fc (x) −
f (x)). It follows that
 
E φh ( f ) − E φh ( fc ) = ( fc (x) − f (x))fρ (x) d ρX = | fc (x) − f (x)| | fρ (x)| d ρX
X X

and
 
E (φh (yf (x)) − φh (yfc (x)))2 = | fc (x) − f (x)|2 d ρX .
X
10.8 References and additional remarks 211

Let t > 0 and separate the domain X into two sets: Xt+ := {x ∈ X : | fρ (x)| >
cq t} and Xt− := {x ∈ X : | fρ (x)| ≤ cq t}. On Xt+ we have | fc (x) − f (x)|2 ≤
| f (x)|
2| fc (x) − f (x)| cρq t . On Xt− we have | fc (x) − f (x)|2 ≤ 4. It follows from
(10.18) that
 $ %
2 E φh ( f ) − E φh ( fc )
| fc (x) − f (x)| d ρX ≤
2
+ 4ρX (Xt− )
X cq t
$ %
2 E φh ( f ) − E φh ( fc )
≤ + 4t q .
cq t
& '1/(q+1)
Choosing t = (E φh ( f ) − E φh ( fc ))/(2cq ) yields the desired bound.

Lemma 10.31 tells us that the variancing power τφh ,ρ of the hinge loss equals
q
q+1 when the measure ρ has Tsybakov noise exponent q. Combining this
with Corollary 10.26 gives the following result on improved learning rates
for measures satisfying the Tsybakov noise condition.
Theorem 10.32 Under the assumption of Theorem 10.2, if ρ has Tsybakov
noise exponent q with 0 ≤ q ≤ ∞, then, for any 0 < ε < 12 and 0 < δ < 1,
with confidence 1 − δ, we have
θ
φ 2 1
R(sgn( fz,γh )) − R( fc ) ≤ C log
δ m

2β q+1
where θ = min 1+β , q+2 − ε and C is a constant independent of m and δ.

In Theorem 10.32, the learning rate can be arbitrarily close to 1 when q is
sufficiently large.

10.8 References and additional remarks


General expositions of convex loss functions for classification can be found
in [14, 31]. Theorem 10.5, the use of the projection operator, and some estimates
for the regularized error were provided in [31]. The error decomposition for
regularization schemes was introduced in [145].
The convergence of the support vector machine (SVM) 1-norm soft margin
classifier for general probability distributions (without separability conditions)
was established in [121] when HK is dense in C (X ) (such a kernel K is called
212 10 General regularized classifiers

universal). Convergence rates in this situation were derived in [154]. For further
results and references on convergence rates, see the thesis [140].
The error analysis in this chapter is taken from [142], where more technical
and better error bounds are provided by means of the local Rademacher process,
empirical covering numbers, and the entropy integral [84, 132]. The Tsybakov
noise condition of Section 10.7 was introduced in [131].
The iteration technique used in the proof of Lemma 10.29 was given in [122]
(see also [144]).
SVMs have many modifications for various purposes in different fields [134].
These include q-norm soft margin classifiers [31, 77], multiclass SVMs
[4, 32, 75, 139], ν-SVMs [108], linear programming SVMs [26, 96, 98, 146],
maximum entropy discrimination [65], and one-class SVMs [107, 128].
We conclude with some brief comments on current trends.
Learning theory is a rapidly growing field. Many people are working on both
its foundations and its applications, from different points of view. This work
develops the theory but also leaves many open questions. Here we mention
some involving regularization schemes [48].

(i) Feature selection. One purpose is to understand structures of high-


dimensional data. Topics include manifold learning or semisupervised
learning [15, 23, 27, 34, 45, 97] and dimensionality reduction (see the
introduction [55] of a special issue and references therein). Another
purpose is to determine important features (variables) of functions defined
on huge-dimensional spaces. Two approaches are the filter method and
the wrapper method [69]. Regularization schemes for this purpose include
those in [56, 58] and a least squares–type algorithm in [93] that learns
gradients as vector-valued functions [89].
(ii) Multikernel regularization schemes. Let K = {Kσ : σ ∈ } be a set of
Mercer kernels on X such as Gaussian kernels with variances σ 2 running
over (0, ∞). The multikernel regularization scheme associated with K is
defined as
 
1
m
fz,γ , = arginf inf V (yi , f (xi )) + γ f 2
Kσ .
σ ∈ f ∈HKσ m
i=1

Here V : R2 → R+ is a general loss function. In [30] SVMs with multiple


parameters are investigated. In [76, 104] mixture-density estimation is
considered and Gaussian kernels with variance σ 2 varying on an interval
[σ12 , σ22 ] with 0 < σ1 < σ2 < +∞ are used to derive bounds. Multitask
learning algorithms involve kernels from a convex hull of several Mercer
10.8 References and additional remarks 213

kernels and spaces with changing norms (e.g. [49, 62]). The learning of
kernel functions is studied in [72, 88, 90].
Another related class of multikernel regularization schemes consists that
of schemes generated by polynomial kernels {Kd (x, y) = (1 + x · y)d }
with d ∈ N. In [158] convergence rates in the univariate case (n = 1) for
multikernel regularized classifiers generated by polynomial kernels are
derived.
(iii) Online learning algorithms. These algorithms improve the efficiency of
learning methods when the sample size m is very large. Their convergence
is investigated in [28, 51, 52, 68, 134], and their error with respect to
the step size has been analyzed for the least squares regression in [112]
and for regularized classification with a general classifying loss in [151].
Error analysis for online schemes with varying regularization parameters
is performed in [127] and [149].
References

[1] R.A. Adams. Sobolev Spaces. Academic Press, 1975.


[2] C.A. Aliprantis and O. Burkinshaw. Principles of Real Analysis. Academic Press,
3rd edition, 1998.
[3] F. Alizadeh and D. Goldfarb. Second-order cone programming. Math. Program.,
95:3–51, 2003.
[4] E.L. Allwein, R.E. Schapire, and Y. Singer. Reducing multiclass to binary: a
unifying approach for margin classifiers. J. Mach. Learn. Res., 1:113–141, 2000.
[5] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive
dimensions, uniform convergence and learnability. J. ACM, 44:615–631, 1997.
[6] M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations.
Cambridge University Press, 1999.
[7] M. Anthony and N. Biggs. Computational Learning Theory. Cambridge
University Press, 1992.
[8] M. Anthony and J. Shawe-Taylor. A result of Vapnik with applications. Discrete
Appl. Math., 47:207–217, 1993.
[9] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337–
404, 1950.
[10] A.R. Barron. Complexity regularization with applications to artificial neural
networks. In G. Roussas, editor, Nonparametric Functional Estimation,
pages 561–576. Kluwer Academic Publishers 1990.
[11] R.G. Bartle. The Elements of Real Analysis. John Wiley & Sons, 2nd edition,
1976.
[12] P.L. Bartlett. The sample complexity of pattern classification with neural
networks: the size of the weights is more important than the size of the network.
IEEE Trans. Inform. Theory, 44:525–536, 1998.
[13] P.L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities.
Ann. Stat., 33:1497–1537, 2005.
[14] P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk
bounds. J. Amer. Stat. Ass., 101:138–156, 2006.
[15] M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds.
Mach. Learn., 56:209–239, 2004.

214
References 215

[16] J. Bergh and J. Löfström. Interpolation Spaces: An Introduction. Springer-Verlag,


1976.
[17] P. Binev, A. Cohen, W. Dahmen, R. DeVore, and V. Temlyakov. Universal
algorithms for learning theory. Part I: piecewise constant functions. J. Mach.
Learn. Res., 6:1297–1321, 2005.
[18] C.M. Bishop. Neural Networks for Pattern Recognition. Cambridge University
Press, 1995.
[19] L. Blum, F. Cucker, M. Shub, and S. Smale. Complexity and Real Computation.
Springer-Verlag, 1998.
[20] B.E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin
classifiers. In Proceedings of the Fifth Annual Workshop of Computational
Learning Theory, pages 144–152. Association for Computing Machinery,
New York, 1992.
[21] S. Boucheron, O. Bousquet, and G. Lugosi. Concentration inequalities. In
O. Bousquet, U. von Luxburg, and G. Rátsch, editors, Advanced Lectures in Mac-
hine Learning, pages 208–240. Springer-Verlag, 2004.
[22] S. Boucheron, G. Lugosi, and P. Massart. A sharp concentration inequality with
applications in random combinatorics and learning. Random Struct. Algorithms,
16:277–292, 2000.
[23] O. Bousquet, O. Chapelle, and M. Hein. Measure based regularizations. In
S. Thrun, L.K. Saul, and B. Schölkopf, editors, Advances in Neural Information
Processing Systems, volume 16, pages 1221–1228. MIT Press, 2004.
[24] O. Bousquet and A. Elisseeff. Stability and generalization. J. Mach. Learn. Res.,
2:499–526, 2002.
[25] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press,
2004.
[26] P.S. Bradley and O.L. Mangasarian. Massive data discrimination via linear
support vector machines. Optimi. Methods and Softw., 13:1–10, 2000.
[27] A. Caponnetto and S. Smale. Risk bounds for random regression graphs. To appear
at Found Comput. Math.
[28] N. Cesa-Bianchi, P.M. Long, and M.K. Warmuth. Worst-case quadratic loss
bounds for prediction using linear functions and gradient descent. IEEE Trans.
Neural Networks, 7:604–619, 1996.
[29] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge
University Press, 2006.
[30] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple
parameters for support vector machines. Mach. Learn., 46:131–159, 2002.
[31] D.R. Chen, Q. Wu, Y. Ying, and D.X. Zhou. Support vector machine soft margin
classifiers: error analysis. J. Mach. Learn. Res., 5:1143–1175, 2004.
[32] D.R. Chen and D.H. Xiang. The consistency of multicategory support vector
machines. Adv. Comput. Math., 24:155–169, 2006.
[33] D.A. Cohn, Z. Ghahramani, and M.I. Jordan. Active learning with statistical
models. J. Artif. Intell. Res., 4:129–145, 1996.
[34] R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, and
S.W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure
definition of data: diffusion maps. Proc. Natl. Acad. Sci., 102:7426–7431, 2005.
216 References

[35] C. Cortes and V. Vapnik. Support-vector networks. Mach. Learn., 20:273–297,


1995.
[36] D.D. Cox. Approximation of least squares regression on nested subspaces. Ann.
Stat., 16:713–732, 1988.
[37] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines.
Cambridge University Press, 2000.
[38] F. Cucker and S. Smale. Best choices for regularization parameters in learning
theory. Found. Comput. Math., 2:413–428, 2002.
[39] F. Cucker and S. Smale. On the mathematical foundations of learning. Bull. Amer.
Math. Soc., 39:1–49, 2002.
[40] L. Debnath and P. Mikusiński. Introduction to Hilbert Spaces with Applications.
Academic Press, 2nd edition, 1999.
[41] C. de Boor, K. Höllig, and S. Riemenschneider. Box Splines. Springer-Verlag,
1993.
[42] E. De Vito, A. Caponnetto, and L. Rosasco. Model selection for regularized
least-squares algorithm in learning theory. Found. Comput. Math., 5:59–85,
2005.
[43] E. De Vito, L. Rosasco, A. Caponnetto, U. de Giovannini, and F. Odone. Learning
from examples as an inverse problem. J. Mach. Learn. Res., 6:883–904, 2005.
[44] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern
Recognition. Springer-Verlag, 1996.
[45] D.L. Donoho and C. Grimes. Hessian eigenmaps: locally linear embedding
techniques for high-dimensional data. Proc. Natl. Acad. Sci., 100:5591–5596,
2003.
[46] R.M. Dudley, E. Giné, and J. Zinn. Uniform and universal Glivenko–Cantelli
classes. J. Theor. Prob., 4:485–510, 1991.
[47] D.E. Edmunds and H. Triebel. Function Spaces, Entropy Numbers, Differential
Operators. Cambridge University Press, 1996.
[48] H.W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems,
volume 375 of Mathematics and Its Applications. Kluwer, 1996.
[49] T. Evgeniou and M. Pontil. Regularized multi-task learning. In C.E. Brodley,
editor, Proc. 17th SIGKDD Conf. Knowledge Discovery and Data Mining,
Association for Computing Machinery, New York, 2004.
[50] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector
machines. Adv. Comput. Math., 13:1–50, 2000.
[51] J. Forster and M.K. Warmuth. Relative expected instantaneous loss bounds.
J. Comput. Syst. Sci., 64:76–102, 2002.
[52] Y. Freund and R.E. Shapire.Adecision-theoretic generalization of on-line learning
and an application to boosting. J. Comput. Syst. Sci., 55:119–139, 1997.
[53] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks
architectures. Neural Comp., 7:219–269, 1995.
[54] G. Golub, M. Heat, and G. Wahba. Generalized cross-validation as a method for
choosing a good ridge parameter. Technometrics, 21:215–223, 1979.
[55] I. Guyon and A. Ellisseeff. An introduction to variable and feature selection.
J. Mach. Learn. Res., 3:1157–1182, 2003.
[56] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer
classification using support vector machines. Mach. Learn., 46:389–422, 2002.
References 217

[57] L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of


Nonparametric Regression. Springer-Verlag, 2002.
[58] D. Hardin, I. Tsamardinos, and C.F. Aliferis. A theoretical characterization of
linear SVM-based feature selection. In Proc. 21st Int. Conf. Machine Learning,
2004.
[59] T. Hastie, R.J. Tibshirani, and J.H. Friedman. The Elements of Statistical Learning.
Springer-Verlag, 2001.
[60] D. Haussler. Decision theoretic generalizations of the PAC model for neural net
and other learning applications. Inform. and Comput., 100:78–150, 1992.
[61] R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press,
2002.
[62] M. Herbster. Relative loss bounds and polynomial-time predictions for the k-lms-
net algorithm. In S. Ben-David, J. Case, and A. Maruoka, editors, Proc. 15th Int.
Conf. Algorithmic Learning Theory, Springer 2004.
[63] H. Hochstadt. Integral Equations. John Wiley & Sons, 1973.
[64] V.V. Ivanov. The Theory of Approximate Methods and Their Application to the
Numerical Solution of Singular Integral Equations. Nordhoff International, 1976.
[65] T. Jaakkola, M. Meila, and T. Jebara. Maximum entropy discrimination. In S.A.
Solla, T.K. Leen, and K.-R. Müller, editors, Advances in Neural Information
Processing Systems, volume 12, pages 470–476. MIT Press, 2000.
[66] K. Jetter, J. Stöckler, and J.D. Ward. Error estimates for scattered data interpolation
on spheres. Math. Comp., 68:733–747, 1999.
[67] M.J. Kearns and U.V. Vazirani. An Introduction to Computational Learning
Theory. MIT Press, 1994.
[68] J. Kivinen, A.J. Smola, and R.C. Williamson. Online learning with kernels. IEEE
Trans. Signal Processing, 52:2165–2176, 2004.
[69] R. Kohavi and G. John. Wrappers for feature subset selection. Artif. Intell.,
97:273–324, 1997.
[70] A.N. Kolmogorov and S.V. Fomin. Introductory Real Analysis. Dover
Publications, 1975.
[71] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding
the generalization error of combined classifiers. Ann. Stat., 30:1–50, 2002.
[72] G.R.G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M.I. Jordan.
Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res.,
5:27–72, 2004.
[73] P. Lax. Functional Analysis. John Wiley & Sons, 2002.
[74] W.-S. Lee, P. Bartlett, and R. Williamson. The importance of convexity in learning
with squared loss. IEEE Trans. Inform. Theory, 44:1974–1980, 1998.
[75] Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines, theory,
and application to the classification of microarray data and satellite radiance data.
J. Amer. Stat. Ass., 99:67–81, 2004.
[76] J. Li and A. Barron. Mixture density estimation. In S.A. Solla, T.K. Leen,
and K.R. Müller, editors, Advances in Neural Information Processing Systems,
volume 12, pages 279–285. Morgan Kaufmann Publishers, 1999.
[77] Y. Lin. Support vector machines and the Bayes rule in classification. Data Min.
Knowl. Discov., 6:259–275, 2002.
[78] G.G. Lorentz. Approximation of Functions. Holt, Rinehart and Winston, 1966.
218 References

[79] F. Lu and H. Sun. Positive definite dot product kernels in learning theory. Adv.
Comput. Math., 22:181–198, 2005.
[80] G. Lugosi and N. Vayatis. On the Bayes-risk consistency of regularized boosting
methods. Ann. Stat., 32:30–55, 2004.
[81] D.J.C. Mackay. Information-based objective functions for active data selection.
Neural Comp., 4:590–604, 1992.
[82] W.R. Madych and S.A. Nelson. Bounds on multivariate polynomials and
exponential error estimates for multiquadric interpolation. J. Approx. Theory,
70:94–114, 1992.
[83] C. McDiarmid. Concentration. In M. Habib et al., editors, Probabilistic Methods
for Algorithmic Discrete Mathematics, pages 195–248. Springer-Verlag, 1998.
[84] S. Mendelson. Improving the sample complexity using global data. IEEE Trans.
Inform. Theory, 48:1977–1991, 2002.
[85] J. Mercer. Functions of positive and negative type and their connection with the
theory of integral equations. Philos. Trans. Roy. Soc. London Ser. A, 209:415–446,
1909.
[86] C.A. Micchelli. Interpolation of scattered data: distance matrices and
conditionally positive definite functions. Constr. Approx., 2:11–22, 1986.
[87] C.A. Micchelli andA. Pinkus. Variational problems arising from balancing several
error criteria. Rend. Math. Appl., 14:37–86, 1994.
[88] C.A. Micchelli and M. Pontil. Learning the kernel function via regularization.
J. Mach. Learn. Res., 6:1099–1125, 2005.
[89] C.A. Micchelli and M. Pontil. On learning vector-valued functions. Neural Comp.,
17:177–204, 2005.
[90] C.A. Micchelli, M. Pontil, Q. Wu, and D.X. Zhou. Error bounds for learning the
kernel. Preprint, 2006.
[91] M. Mignotte. Mathematics for Computer Algebra. Springer-Verlag, 1992.
[92] T.M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[93] S. Mukherjee and D.X. Zhou. Learning coordinate covariances via gradients.
J. Mach. Learn. Res., 7:519–549, 2006.
[94] F.J. Narcowich, J.D. Ward, and H. Wendland. Refined error estimates for radial
basis function interpolation. Constr. Approx., 19:541–564, 2003.
[95] P. Niyogi. The Informational Complexity of Learning. Kluwer Academic
Publishers, 1998.
[96] P. Niyogi and F. Girosi. On the relationship between generalization error,
hypothesis complexity and sample complexity for radial basis functions. Neural
Comput., 8:819–842, 1996.
[97] P. Niyogi, S. Smale, and S. Weinberger. Finding the homology of submanifolds
with high confidence from random samples. Preprint, 2004.
[98] J.P. Pedroso and N. Murata. Support vector machines with different norms:
motivation, formulations and results. Pattern Recognit. Lett., 22:1263–1272,
2001.
[99] I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces.
Ann. Probab., 22:1679–1706, 1994.
[100] A. Pinkus. N-widths in Approximation Theory. Springer-Verlag, 1996.
[101] A. Pinkus. Strictly positive definite kernels on a real inner product space. Adv.
Comput. Math., 20:263–271, 2004.
References 219

[102] T. Poggio, V. Torre, and C. Koch. Computational vision and regularization theory.
Nature, 317:314–319, 1985.
[103] D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, 1984.
[104] A. Rakhlin, D. Panchenko, and S. Mukherjee. Risk bounds for mixture density
estimation. ESAIM: Prob. Stat., 9:220–229, 2005.
[105] R. Schaback. Reconstruction of multivariate functions from scattered data.
Manuscript, 1997.
[106] I.J. Schoenberg. Metric spaces and completely monotone functions. Ann. Math.,
39:811–841, 1938.
[107] B. Schölkopf and A.J. Smola. Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond. MIT Press, 2002.
[108] B. Schölkopf, A.J. Smola, R.C. Williamson, and P.L. Bartlett. New support vector
algorithms. Neural Comp., 12:1207–1245, 2000.
[109] I.R. Shafarevich. Basic Algebraic Geometry. 1: Varieties in Projective Space.
Springer-Verlag, 2nd edition, 1994.
[110] J. Shawe-Taylor, P.L. Bartlet, R.C. Williamson, and M. Anthony. Structural risk
minimization over data dependent hierarchies. IEEE Trans. Inform. Theory,
44:1926–1940, 1998.
[111] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis.
Cambridge University Press, 2004.
[112] S. Smale and Y. Yao. Online learning algorithms. Found. Comput. Math., 6:145–
170, 2006.
[113] S. Smale and D.X. Zhou. Estimating the approximation error in learning theory.
Anal. Appl., 1:17–41, 2003.
[114] S. Smale and D.X. Zhou. Shannon sampling and function reconstruction from
point values. Bull. Amer. Math. Soc., 41:279–305, 2004.
[115] S. Smale and D.X. Zhou. Shannon sampling II: Connections to learning theory.
Appl. Comput. Harmonic Anal., 19:285–302, 2005.
[116] S. Smale and D.X. Zhou. Learning theory estimates via integral operators and
their approximations. To appear in Constr. Approx.
[117] A. Smola, B. Schölkopf, and R. Herbricht. A generalized representer theorem.
Comput. Learn. Theory, 14:416–426, 2001.
[118] A. Smola, B. Schölkopf, and K.R. Müller. The connection between regularization
operators and support vector kernels. Neural Networks, 11:637–649, 1998.
[119] M. Sousa Lobo, L. Vandenberghe, S. Boyd, and H. Lebret.Applications of second-
order cone programming. Linear Algebra Appl., 284:193–228, 1998.
[120] E.M. Stein. Singular Integrals and Differentiability Properties of Functions.
Princeton University Press, 1970.
[121] I. Steinwart. Support vector machines are universally consistent. J. Complexity,
18:768–791, 2002.
[122] I. Steinwart and C. Scovel. Fast rates for support vector machines. In P. Auer and
R. Meir, editors, Proc. 18th Ann. Conf. Learn. Theory, pages 279–294, Springer
2005.
[123] H.W. Sun. Mercer theorem for RKHS on noncompact sets. J. Complexity, 21:337–
349, 2005.
[124] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press,
1998.
220 References

[125] J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle.


Least Squares Support Vector Machines. World Scientific, 2002.
[126] M. Talagrand. New concentration inequalities in product spaces. Invent. Math.,
126:505–563, 1996.
[127] P. Tarrès and Y. Yao. Online learning as stochastic approximations of
regularization paths. Preprint, 2005.
[128] D.M.J. Tax and R.P.W. Duin. Support vector domain description. Pattern
Recognit. Lett., 20:1191–1199, 1999.
[129] M.E. Taylor. Partial Differential Equations I: Basic Theory, volume 115 of
Applied Mathematical Sciences. Springer-Verlag, 1996.
[130] A.N. Tikhonov and V.Y. Arsenin. Solutions of Ill-Posed Problems. W.H. Winston,
1977.
[131] A.B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann.
Stat., 32:135–166, 2004.
[132] A.W. van der Vaart and J.A. Wellner. Weak Convergence and Empirical Processes.
Springer-Verlag, 1996.
[133] V.N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-
Verlag, 1982.
[134] V. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
[135] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative
frequencies of events to their probabilities. Theory Prob. Appl., 16:264–280, 1971.
[136] M. Vidyasagar. Learning and Generalization. Springer-Verlag, 2003.
[137] G. Wahba. Spline Models for Observational Data. SIAM, 1990.
[138] G. Wahba. Support vector machines, reproducing kernel Hilbert spaces and the
randomized GACV. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances
in Kernel Methods – Support Vector Learning, pages 69–88. MIT Press, 1999.
[139] J. Weston and C. Watkins. Multi-class support vector machines. Technical Report
CSD-TR-98-04, Department of Computer Science, Royal Holloway, University
of London, 1998.
[140] Q. Wu. Classification and regularization in learning theory. PhD thesis, City
University of Hong Kong, 2005.
[141] Q. Wu, Y. Ying, and D.X. Zhou. Learning theory: from regression to classification.
In K. Jetter, M. Buhmann, W. Haussmann, R. Schaback, and J. Stoeckler, editors,
Topics in Multivariate Approximation and Interpolation, volume 12 of Studies in
Computational Mathematics, pages 257–290. Elsevier, 2006.
[142] Q. Wu, Y. Ying, and D.X. Zhou. Multi-kernel regularized classifiers. To appear
in J. Complexity.
[143] Q. Wu, Y. Ying, and D.X. Zhou. Learning rates of least-square regularized
regression. Found. Comput. Math., 6:171–192, 2006.
[144] Q. Wu and D.X. Zhou. SVM soft margin classifiers: linear programming versus
quadratic programming. Neural Comp., 17:1160–1187, 2005.
[145] Q. Wu and D.X. Zhou. Analysis of support vector machine classification.
J. Comput. Anal. Appl., 8:99–119, 2006.
[146] Q. Wu and D.X. Zhou. Learning with sample dependent hypothesis spaces.
Preprint, 2006.
[147] Z. Wu and R. Schaback. Local error estimates for radial basis function
interpolation of scattered data. IMA J. Numer. Anal., 13:13–27, 1993.
[148] Y. Yang and A.R. Barron. Information-theoretic determination of minimax rates
of convergence. Ann. Stat., 27:1564–1599, 1999.
[149] G.B. Ye and D.X. Zhou. Fully online classification by regularization. To appear
in Appl. Comput. Harmonic Anal.
[150] Y. Ying and D.X. Zhou. Learnability of Gaussians with flexible variances. To
appear in J. Mach. Learn. Res.
[151] Y. Ying and D.X. Zhou. Online regularized classification algorithms. IEEE Trans.
Inform. Theory, 52:4775–4788, 2006.
[152] T. Zhang. On the dual formulation of regularized linear systems with convex
risks. Machine Learning, 46:91–129, 2002.
[153] T. Zhang. Leave-one-out bounds for kernel methods. Neural Comp., 15:1397–
1437, 2003.
[154] T. Zhang. Statistical behavior and consistency of classification methods based on
convex risk minimization. Ann. Stat., 32:56–85, 2004.
[155] D.X. Zhou. The covering number in learning theory. J. Complexity, 18:739–767,
2002.
[156] D.X. Zhou. Capacity of reproducing kernel spaces in learning theory. IEEE Trans.
Inform. Theory, 49:1743–1752, 2003.
[157] D.X. Zhou. Density problem and approximation error in learning theory. Preprint,
2006.
[158] D.X. Zhou and K. Jetter. Approximation with polynomial kernels and SVM
classifiers. Adv. Comput. Math., 25:323–344, 2006.
Index
‖ · ‖, 7
‖ · ‖_{C^s(X)}, 18
‖ · ‖_{Lip(s)}, 73
‖ · ‖_{Lip*(s,C(X))}, 74
‖ · ‖_{Lip*(s,L^p(X))}, 74
‖ · ‖_s, 21
‖ · ‖_{H^s(R^n)}, 75
A(f_ρ, R), 54
A(H), 12
B_R, 76
C_K, 22
C(X), 9
C^s(X), 18
C^∞(X), 18
D(γ), 136
D(γ, φ), 196
Diam(X), 72
Dµρ, 110
Δ_t^r, 74
E, 6
E_γ, 134
E_H, 11
E^φ, 162
E_γ^φ, 162
E_{z,γ}^φ, 162
E_z^φ, 162
E_z, 8
E_{z,γ}, 134
K (x), 112
η_x, 174
f_c, 159
f_γ, 134
f_H, 9
f_ρ, 3, 6
f_Y, 8
f_z, 9
f_z^0, 161
f_{z,γ}, 134
f_{z,γ}^φ, 164
H_K, 23
H_{K,z}, 34, 163
H^s(R^n), 75
H^s(X), 21
K_ρ, 159
κ_ρ, 159
K_x, 22
K[x], 22
ℓ^2(X_N), 90
ℓ^∞(X_N), 90
Lip(s), 73
Lip*(s, C(X)), 74
Lip*(s, L^p(X)), 74
L_K, 56
L_ν^p(X), 18
L_ν^∞(X), 19
L_z, 10
M(S, η), 101
N(S, η), 37
O(n), 30
φ_0, 162
φ_h, 165
φ_ls, 162
s (R), 84
R(f), 5
ρ, 5
ρ_X, 6
ρ(y|x), 6
sgn, 159
σ_ρ^2, 6
ϒ_k, 87
X, 5
X_{r,t}, 74
Y, 5
Z, 5
Z^m, 8

Bayes rule, 160
Bennett's inequality, 38
Bernstein's inequality, 40, 42
best fit, 2
bias–variance problem, 13, 127
bounded linear map, 21
box spline, 28
Chebyshev's inequality, 38
classification algorithms, 160
classifier
  binary, 157
  regularized, 164
compact linear map, 21
confidence, 42
constraints, 33
convex function, 33
convex programming, 33
convex set, 33
convolution, 19
covering number, 37
defect, 10
distortion, 110
divided difference, 74
domination of measures, 110
efficient algorithm, 33
ε-net, 78
ERM, 50
error
  approximation, 12
  approximation (associated with ψ), 70
  empirical, 8
  empirical (associated with φ), 162
  empirical (associated with ψ), 51
  excess generalization, 12
  excess misclassification, 188
  generalization, 5
  generalization (associated with φ), 162
  generalization (associated with ψ), 51
  in H, 11
  local, 6, 161
  misclassification, 157
  regularized, 134
  regularized (associated with φ), 162
  regularized empirical, 134
  regularized empirical (associated with φ), 162
expected value, 5
feasible points, 33
feasible set, 33
feature map, 70
Fourier coefficients, 55
Fourier transform, 19
  nonnegative, 26
  positive, 26
full measure, 10
function
  completely monotonic, 21
  even, 26
  measurable, 19
generalized Bennett's inequality, 40, 42
Gramian, 22
Hoeffding's inequality, 40, 42
homogeneous polynomials, 17, 29
hypothesis space, 9
  convex, 46
interpolation space, 63
K-functional, 63
kernel, 56
  box spline, 28
  dot product, 24
  Mercer, 22
  spline, 27
  translation invariant, 26
  universal, 212
Lagrange interpolation polynomials, 84
Lagrange multiplier, 151
least squares, 1, 2
left derivative, 34
linear programming, 33
localizing function, 190
loss
  classifying function, 187
  ε-insensitive, 50
  function, 161
  function, regression, 50
  hinge, 165
  least squares, 50, 162
  misclassification, 162
  q-norm, 187
margin, 167
  hard, 171
  maximal, 168
  of the sample, 168
  soft, 171
Markov's inequality, 38
measure
  finite, 19
  marginal, 6
  nondegenerate, 19
  strictly separable by H_K, 173
  support of, 19
  weakly separable by H_K, 158, 182
Mercer kernel, see kernel
metric entropy, 52
model selection, 127
multinomial coefficients, 24
net, 78
nodal functions, 102
objective function, 33
offset, 172
operator
  positive, 55
  self-adjoint, 55
  strictly positive, 55
optimal hyperplane, 168
orthogonal group, 30
orthogonal invariance, 30
orthonormal basis, 55
orthonormal complete system, 55
orthonormal system, 55
packing number, 101
power function, 125
problems
  classification, 15
  regression, 15
programming
  convex, 34
  convex quadratic, 34
  general nonlinear, 33
  second-order cone, 34
projection operator, 195
radial basis functions, 28
regression function, 3, 6
regularization
  parameter, 135, 164
  scheme, 135, 164
regularizing function, 195
reproducing kernel Hilbert space, 24
reproducing property, 24
right derivative, 34
RKHS, see reproducing kernel Hilbert space
sample, 8
  separable, 166
  separable by a hyperplane, 166
sample error, regularized, 136
separating hyperplane, 167
separation
  exponent, 182
  triple, 182
Sobolev embedding theorem, 21
Sobolev space, 21
  fractional, 35
spherical coordinates, 98
support vector machine, 165
support vectors, 169
SVM, see support vector machine
target function, 9, 51, 134
  empirical, 9, 51, 134
uniform Glivenko–Cantelli, 52
variance, 5
variancing power, 198
Veronese embedding, 31
Veronese variety, 31, 36
weak compactness, 20
weak convergence, 20
Weyl inner product, 30
Zygmund class, 75