CAMBRIDGE MONOGRAPHS ON
APPLIED AND COMPUTATIONAL
MATHEMATICS
Series Editors
M. J. ABLOWITZ, S. H. DAVIS, E. J. HINCH, A. ISERLES,
J. OCKENDON, P. J. OLVER
Learning Theory:
An Approximation Theory Viewpoint
The Cambridge Monographs on Applied and Computational Mathematics reflect the
crucial role of mathematical and computational techniques in contemporary science.
The series publishes expositions on all aspects of applicable and numerical mathematics,
with an emphasis on new developments in this fast-moving area of research.
State-of-the-art methods and algorithms as well as modern mathematical descriptions
of physical and mechanical ideas are presented in a manner suited to graduate research
students and professionals alike. Sound pedagogical presentation is a prerequisite. It is
intended that books in the series will serve to inform a new generation of researchers.
Within the series will be published titles in the Library of Computational
Mathematics, published under the auspices of the Foundations of Computational
Mathematics organisation. Learning Theory: An Approximation Theory Viewpoint
is the first title within this new subseries.
The Library of Computational Mathematics is edited by the following editorial board:
Felipe Cucker (Managing Editor), Ron DeVore, Nick Higham, Arieh Iserles, David
Mumford, Allan Pinkus, Jim Renegar, Mike Shub.
FELIPE CUCKER
City University of Hong Kong
DING-XUAN ZHOU
City University of Hong Kong
CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo
Cambridge University Press has no responsibility for the persistence or accuracy of urls
for external or third-party internet websites referred to in this publication, and does not
guarantee that any content on such websites is, or will remain, accurate or appropriate.
Contents
Foreword
Preface
References
Index
Foreword
This book by Felipe Cucker and Ding-Xuan Zhou provides solid mathematical
foundations and new insights into the subject called learning theory.
Some years ago, Felipe and I were trying to find something about brain
science and artificial intelligence starting from literature on neural nets. It was
in this setting that we encountered the beautiful ideas and fast algorithms of
learning theory. Eventually we were motivated to write on the mathematical
foundations of this new area of science.
I have found this arena, with its new challenges and growing number of
applications, to be exciting. For example, the unification of dynamical systems and
learning theory is a major problem. Another problem is to develop a comparative
study of the useful algorithms currently available and to give unity to these
algorithms. How can one talk about the “best algorithm” or find the most
appropriate algorithm for a particular task when there are so many desirable
features, with their associated trade-offs? How can one see the working of
aspects of the human brain and machine vision in the same framework?
I know both authors well. I visited Felipe in Barcelona more than 13 years
ago for several months, and when I took a position in Hong Kong in 1995, I
asked him to join me. There Lenore Blum, Mike Shub, Felipe, and I finished
a book on real computation and complexity. I returned to the USA in 2001,
but Felipe continues his job at the City University of Hong Kong. Despite the
distance we have continued to write papers together. I came to know Ding-Xuan
as a colleague in the math department at City University. We have written a
number of papers together on various aspects of learning theory. It gives me
great pleasure to continue to work with both mathematicians. I am proud of our
joint accomplishments.
I leave to the authors the task of describing the contents of their book. I will
give some personal perspective on and motivation for what they are doing.
Preface
material in this book comes from or evolved from joint papers we wrote with
him. Qiang Wu, Yiming Ying, Fangyan Lu, Hongwei Sun, Di-Rong Chen,
Song Li, Luoqing Li, Bingzheng Li, Lizhong Peng, and Tiangang Lei regularly
attended our weekly seminars on learning theory at City University of Hong
Kong, where we exposed early drafts of the contents of this book. They, and
José Luis Balcázar, read preliminary versions and were very generous in their
feedback. We are indebted also to David Tranah and the staff of Cambridge
University Press for their patience and willingness to help. We have also been
supported by the University Grants Council of Hong Kong through the grants
CityU 1087/02P, 103303, and 103704.
1
The framework of learning
1.1 Introduction
We begin by describing some cases of learning, simplified to the extreme, to
convey an intuition of what learning is.
Case 1.1 Among the most used instances of learning (although not necessarily
with this name) is linear regression. This amounts to finding a straight line that
best approximates a functional relationship presumed to be implicit in a set of
data points in R2 , {(x1 , y1 ), (x2 , y2 ), . . . , (xm , ym )} (Figure 1.1). The yardstick
used to measure how good an approximation a given line Y = aX + b is, is
called least squares. The best line is the one that minimizes
Q(a, b) = \sum_{i=1}^{m} (y_i - a x_i - b)^2.
Figure 1.1
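For concreteness, here is a minimal numerical sketch of Case 1.1 (our own illustration, not from the text): it fits the line Y = aX + b to noisy data by solving the least squares problem; the sample data and variable names are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    m = 30
    x = rng.uniform(-1.0, 1.0, size=m)
    y = 2.0 * x + 0.5 + rng.normal(scale=0.2, size=m)   # noisy samples around a "true" line

    # Design matrix for the model y ~ a*x + b; least squares minimizes Q(a, b).
    A = np.column_stack([x, np.ones(m)])
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

    Q = np.sum((y - a * x - b) ** 2)                     # the minimized value of Q(a, b)
    print(f"a = {a:.3f}, b = {b:.3f}, Q(a, b) = {Q:.4f}")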
Case 1.2 Case 1.1 readily extends to a classical situation in science, namely,
that of learning a physical law by curve fitting to data. Assume that the law at
hand, an unknown function f : R → R, has a specific form and that the space
of all functions with this form can be parameterized by N real numbers. For
instance, if f is assumed to be a polynomial of degree d , then N = d +1 and the
parameters are the unknown coefficients w0 , . . . , wd of f . In this case, finding
the best fit by the least squares method estimates the unknown f from a set
of pairs {(x1 , y1 ), . . . , (xm , ym )}. If the measurements generating this set were
exact, then yi would be equal to f (xi ). However, in general one expects the
values yi to be affected by noise. That is, yi = f (xi ) + ε, where ε is a random
variable (which may depend on xi ) with mean zero. One then computes the
vector of coefficients w such that the value
\sum_{i=1}^{m} (f_w(x_i) - y_i)^2, \quad \text{with } f_w(x) = \sum_{j=0}^{d} w_j x^j,
is minimized. More generally, one may take
f_w(x) = \sum_{i=1}^{N} w_i \phi_i(x),
where the φi are the elements of a basis of a specific function space, not
necessarily of polynomials.
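The same computation extends verbatim to a general basis {φ_i}: one builds the m × N design matrix with entries φ_j(x_i) and solves a least squares problem for the coefficient vector w. The sketch below is our own illustration (the basis and data are arbitrary), not the authors' code.

    import numpy as np

    def fit_coefficients(x, y, basis):
        """Least squares fit of f_w(x) = sum_i w_i * phi_i(x) for a list of basis functions."""
        design = np.column_stack([phi(x) for phi in basis])   # entries phi_j(x_i)
        w, *_ = np.linalg.lstsq(design, y, rcond=None)
        return w

    # Example: a cubic polynomial basis phi_j(x) = x**j, j = 0, ..., 3.
    basis = [lambda x, j=j: x ** j for j in range(4)]
    rng = np.random.default_rng(1)
    x = np.linspace(-1, 1, 50)
    y = np.sin(np.pi * x) + rng.normal(scale=0.1, size=x.size)  # samples with mean-zero noise
    w = fit_coefficients(x, y, basis)
    print("fitted coefficients:", np.round(w, 3))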
Case 1.3 The training of neural networks is an extension of Case 1.2. Roughly
speaking, a neural network is a directed graph containing some input nodes,
some output nodes, and some intermediate nodes where certain functions are
computed. If X denotes the input space (whose elements are fed to the input
nodes) and Y the output space (of possible elements returned by the output
nodes), a neural network computes a function from X to Y . The literature on
neural networks shows a variety of choices for X and Y , which can be continuous
or discrete, as well as for the functions computed at the intermediate nodes. A
common feature of all neural nets, though, is the dependence of these functions
on a set of parameters, usually called weights, w = {wj }j∈J . This set determines
the function fw : X → Y computed by the network.
Neural networks are trained to learn functions. As in Case 1.2, there is a
target function f : X → Y , and the network is given a set of randomly chosen
pairs (x1 , y1 ), . . . , (xm , ym ) in X × Y . Then, training algorithms select a set of
weights w attempting to minimize some distance from fw to the target function
f :X → Y.
That is, fρ (x) is the average of the y values of {x} × Y (we are more precise
about ρ and the regression function in Section 1.2).
Case 1.5 A standard approach for approximating characteristic (or indicator)
functions of sets is known as PAC learning (from “probably approximately
correct”). Let T (the target concept) be a subset of Rn and ρX be a probability
measure on Rn that we assume is not known in advance. Intuitively, a set
S \subset \mathbb{R}^n approximates T when the symmetric difference S \triangle T = (S \setminus T) \cup
(T \setminus S) is small, that is, has a small measure. Note that if f_S and f_T denote the
characteristic functions of S and T, respectively, this measure, called the error
of S, is \int_{\mathbb{R}^n} |f_S - f_T| \, d\rho_X. Note that since the functions take values in \{0, 1\}
only, this integral coincides with \int_{\mathbb{R}^n} (f_S - f_T)^2 \, d\rho_X.
Let C be a class of subsets of Rn and assume that T ∈ C. One strategy for
constructing an approximation of T in C is the following. First, draw points
x1 , . . . , xm ∈ Rn according to ρX and label each of them with 1 or 0 according
to whether they belong to T . Second, compute any function fS : Rn → {0, 1},
fS ∈ C, that coincides with this labeling over {x1 , . . . , xm }. Such a function will
provide a good approximation S of T (small error with respect to ρX ) as long
as m is large enough and C is not too wild. Thus the measure ρX is used in both
capacities, governing the sample drawing and measuring the error set S \triangle T.
A major goal in PAC learning is to estimate how large m needs to be to obtain
an ε approximation of T with probability at least 1 − δ as a function of ε and δ.
The situation described above is noise free since each randomly drawn point
xi ∈ Rn is correctly labeled. Extensions of PAC learning allowing for labeling
mistakes with small probability exist.
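To make the PAC setting concrete, here is a small simulation (our own illustration, not part of the text) for the simplest case n = 1 with C the class of closed intervals: the learner returns the tightest interval containing the positively labeled points, and the error of its output is measured under the same ρ_X (uniform on [0, 1] here) that generated the sample. The target interval and sample size are arbitrary.

    import numpy as np

    rng = np.random.default_rng(2)
    T = (0.3, 0.7)            # target concept: an interval in [0, 1] (illustrative choice)
    m = 200                   # sample size

    x = rng.uniform(0.0, 1.0, size=m)          # drawn according to rho_X (uniform here)
    labels = (x >= T[0]) & (x <= T[1])         # noise-free labels: 1 iff x in T

    # Consistent hypothesis: tightest interval around the positive examples.
    if labels.any():
        S = (x[labels].min(), x[labels].max())
    else:
        S = (0.0, 0.0)                         # degenerate hypothesis if no positives were seen

    # Error = rho_X(S symmetric-difference T); since S is contained in T, this is the measure of T \ S.
    error = max(S[0] - T[0], 0.0) + max(T[1] - S[1], 0.0)
    print(f"learned interval {S}, error = {error:.4f}")

As m grows the error shrinks, which is the quantitative question PAC learning asks about.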
Case 1.6 (Monte Carlo integration) An early instance of randomization in
algorithmics appeared in numerical integration. Let f : [0, 1]n → R. One way
of approximating the integral \int_{[0,1]^n} f(x)\, dx consists of randomly drawing
points x_1, \ldots, x_m \in [0, 1]^n and computing
I_m(f) = \frac{1}{m} \sum_{i=1}^{m} f(x_i).
Under mild conditions on the regularity of f, I_m(f) \to \int f with probability 1;
that is, for all \varepsilon > 0,
\lim_{m \to \infty} \operatorname{Prob}_{x_1, \ldots, x_m} \left\{ \left| I_m(f) - \int_{x \in [0,1]^n} f(x)\, dx \right| > \varepsilon \right\} = 0.
Again we find the theme of learning an object (here a single real number,
although defined in a nontrivial way through f ) from a sample. In this case
the measure governing the sample is known (the measure in [0, 1]n inherited
from the standard Lebesgue measure on Rn ), but the same idea can be used
for an unknown measure. If \rho_X is a probability measure on X \subset \mathbb{R}^n, a
domain or manifold, I_m(f) will approximate \int_{x \in X} f(x)\, d\rho_X for large m with
high probability as long as the points x_1, \ldots, x_m are drawn from X according to
the measure ρX . Note that no noise is involved here. An extension of this idea
to include noise is, however, possible.
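A minimal sketch of the estimate I_m(f) (our illustration; the integrand, dimension, and sample sizes are arbitrary):

    import numpy as np

    def monte_carlo_integral(f, n, m, rng):
        """I_m(f) = (1/m) * sum_i f(x_i) with x_1, ..., x_m drawn uniformly from [0, 1]^n."""
        x = rng.uniform(0.0, 1.0, size=(m, n))
        return np.mean(f(x))

    f = lambda x: np.prod(np.sin(np.pi * x), axis=1)   # example integrand on [0, 1]^n
    rng = np.random.default_rng(3)
    for m in (10, 1_000, 100_000):
        print(m, monte_carlo_integral(f, n=3, m=m, rng=rng))
    # Exact value for comparison: (2/pi)**3, roughly 0.258, approached as m increases.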
A common characteristic of Cases 1.2–1.5 is the existence of both an
“unknown” function f : X → Y and a probability measure allowing one
to randomly draw points in X × Y . That measure can be on X (Case 1.5), on Y
varying with x ∈ X (Cases 1.2 and 1.3), or on the product X ×Y (Case 1.4). The
only requirement it satisfies is that, if for x ∈ X a point y ∈ Y can be randomly
drawn, then the expected value of y is f (x). That is, the noise is centered at zero.
Case 1.6 does not follow this pattern. However, we have included it since it is
a well-known algorithm and shares the flavor of learning an unknown object
from random data.
The development in this book, for reasons of unity and generality, is based on
a single measure on X × Y . However, one should keep in mind the distinction
between “inputs” x ∈ X and “outputs” y ∈ Y .
A central concept in the next few chapters is the generalization error (or
least squares error or, if there is no risk of ambiguity, simply error) of f , for
f : X → Y , defined by
E(f) = E_\rho(f) = \int_Z (f(x) - y)^2 \, d\rho.
For each input x ∈ X and output y ∈ Y , (f (x) − y)2 is the error incurred through
the use of f as a model for the process producing y from x. This is a local error.
By integrating over X × Y (w.r.t. ρ, of course) we average out this local error
over all pairs (x, y). Hence the word “error” for E(f ).
The problem posed is: What is the f that minimizes the error E(f )? To answer
this question we note that the error E(f ) naturally decomposes as a sum. For
every x ∈ X , let ρ(y|x) be the conditional (w.r.t. x) probability measure on Y .
Let also ρX be the marginal probability measure of ρ on X , that is, the measure
on X defined by ρX (S) = ρ(π −1 (S)), where π : X × Y → X is the projection.
For every integrable function ϕ : X × Y → R a version of Fubini’s theorem
relates ρ, ρ(y|x), and ρX as follows:
\int_{X \times Y} \varphi(x, y)\, d\rho = \int_X \left( \int_Y \varphi(x, y)\, d\rho(y|x) \right) d\rho_X.
The number σρ2 is a measure of how well conditioned ρ is, analogous to the
notion of condition number in numerical linear algebra.
Remark 1.7
(i) It is important to note that whereas ρ and fρ are generally “unknown,” ρX
is known in some situations and can even be the Lebesgue measure on X
inherited from Euclidean space (as in Cases 1.2 and 1.6).
(ii) In the remainder of this book, if formulas do not make sense or ∞ appears,
then the assertions where these formulas occur should be considered
vacuous.
The first term on the right-hand side of Proposition 1.8 provides an average
(over X ) of the error suffered from the use of f as a model for fρ . In addition,
since σρ2 is independent of f , Proposition 1.8 implies that fρ has the smallest
possible error among all functions f : X → Y . Thus σρ2 represents a lower
bound on the error E and it is due solely to our primary object, the measure ρ.
Thus, Proposition 1.8 supports the following statement:
The goal is to “learn” (i.e., to find a good approximation of) fρ from random
samples on Z.
1 Throughout this book, the square denotes the end of a proof or the fact that no proof is given.
For a sample z \in Z^m, z = ((x_1, y_1), \ldots, (x_m, y_m)), define the empirical error of f (with respect to z) as
E_z(f) = \frac{1}{m} \sum_{i=1}^{m} (f(x_i) - y_i)^2.
More generally, for a random variable \xi on Z, define
E_z(\xi) = \frac{1}{m} \sum_{i=1}^{m} \xi(z_i).
fY : X × Y → Y
(x, y) → f (x) − y.
With these notations we may write E(f ) = E(fY2 ) and Ez (f ) = Ez (fY2 ). We have
already remarked that the expected value of (fρ )Y is 0; we now remark that its
variance is σρ2 .
Remark 1.9 Consider the PAC learning setting discussed in Case 1.5 where
X = Rn and T is a subset of Rn .2 The measure ρX described there can be
extended to a measure ρ on Z by defining, for A ⊂ Z,
2 Note, in this case, that X is not compact. In fact, most of the results in this book do not require
compactness of X but only completeness and separability.
1.3 Hypothesis spaces and target functions
\|f\|_\infty = \sup_{x \in X} |f(x)|.
Notice that since E(f) = \int_X (f - f_\rho)^2 \, d\rho_X + \sigma_\rho^2, f_H is also an optimizer of
\min_{f \in H} \int_X (f - f_\rho)^2 \, d\rho_X.
\min_{f \in H} \frac{1}{m} \sum_{i=1}^{m} (f(x_i) - y_i)^2. \qquad (1.1)
Note that although f_z is not produced by an algorithm, it is close to algorithmic. The
statement of the minimization problem (1.1) depends on \rho only through its
dependence on z, but once z is given, so is (1.1), and its solution f_z can be looked for
without further involvement of \rho. In contrast to f_H, f_z is “empirical” from its
dependence on the sample z. Note finally that E(f_z) and E_z(f) are different objects.
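The scheme (1.1) is easy to simulate when H is a small finite set of candidate functions. The following sketch is our own illustration (the choices of H and of the data-generating rule are arbitrary): it computes the empirical error E_z(f) for each f in H and returns a minimizer f_z.

    import numpy as np

    def empirical_error(f, z):
        """E_z(f) = (1/m) * sum_i (f(x_i) - y_i)**2 for z = (x_values, y_values)."""
        x, y = z
        return np.mean((f(x) - y) ** 2)

    def erm(H, z):
        """Return a minimizer f_z of the empirical error over a finite hypothesis set H."""
        return min(H, key=lambda f: empirical_error(f, z))

    rng = np.random.default_rng(4)
    x = rng.uniform(-1, 1, size=100)
    y = np.abs(x) + rng.normal(scale=0.1, size=x.size)    # samples from an unknown rule plus noise

    # A toy hypothesis space: a few fixed functions (in the text H is typically a ball of functions).
    H = [np.abs, np.sin, lambda t: t ** 2, lambda t: 0.5 + 0 * t]
    f_z = erm(H, (x, y))
    print("empirical errors:", [round(empirical_error(f, (x, y)), 4) for f in H])
    print("selected f_z has empirical error", round(empirical_error(f_z, (x, y)), 4))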
We next prove that fH and fz exist under a mild condition on H.
Lz (f ) = Lρ,z (f ) = E(f ) − Ez (f ).
Notice that the theoretical error E(f ) cannot be measured directly, whereas
Ez (f ) can. A bound on Lz (f ) becomes useful since it allows one to bound
the actual error from an observed quantity. Such bounds are the object of
Theorems 3.8 and 3.10.
Let f_1, f_2 \in C(X). Toward the proof of the existence of f_H and f_z, we first
estimate the quantity E(f_1) - E(f_2). Since
(f_1(x) - y)^2 - (f_2(x) - y)^2 = (f_1(x) + f_2(x) - 2y)(f_1(x) - f_2(x)),
we have
|E(f_1) - E(f_2)| = \left| \int_Z (f_1(x) + f_2(x) - 2y)(f_1(x) - f_2(x)) \, d\rho \right|
\le \int_Z |(f_1(x) - y) + (f_2(x) - y)| \, \|f_1 - f_2\|_\infty \, d\rho
\le 2M \|f_1 - f_2\|_\infty.
Similarly,
|E_z(f_1) - E_z(f_2)| \le \frac{1}{m} \sum_{i=1}^{m} |(f_1(x_i) - y_i) + (f_2(x_i) - y_i)| \, \|f_1 - f_2\|_\infty
\le 2M \|f_1 - f_2\|_\infty.
Thus
Remark 1.12 Notice that for bounding |E_z(f_1) - E_z(f_2)| in this proof – in
contrast to the bound for |E(f_1) - E(f_2)| – the use of the \|\cdot\|_\infty norm is crucial.
Nothing less will do.
Corollary 1.13 Let H \subseteq C(X) and \rho be such that, for all f \in H, |f(x) - y| \le
M almost everywhere. Then E, E_z : H \to \mathbb{R} are continuous.
Proof. The proof follows from the bounds |E(f_1) - E(f_2)| \le 2M \|f_1 - f_2\|_\infty and
|E_z(f_1) - E_z(f_2)| \le 2M \|f_1 - f_2\|_\infty shown in the proof of Proposition 1.11.
Proof. The proof follows from the compactness of H and the continuity of
E, Ez : C (X ) → R.
Remark 1.15
(i) The functions fH and fz are not necessarily unique. However, we see a
uniqueness result for fH in Section 3.4 when H is convex.
(ii) Note that the requirement of H to be compact is what allows Corollary 1.14
to be proved and therefore guarantees the existence of fH and fz .
Other consequences (e.g., the finiteness of covering numbers) follow in
subsequent chapters.
EH (f ) = E(f ) − E(fH ).
Note that EH (f ) ≥ 0 for all f ∈ H and that EH (fH ) = 0. Also note that E(fH )
and EH (f ) are different objects.
The quantities in (1.2) are the main characters in this book. We have already
noted that σρ2 is a lower bound on the error E that is solely due to the measure ρ.
The generalization error E(f_z) of f_z depends on \rho, H, the sample z, and the
scheme (1.1) defining f_z. The squared distance \int_X (f_z - f_\rho)^2 \, d\rho_X is the excess
generalization error of fz . A goal of this book is to show that under some
hypotheses on ρ and H, this excess generalization error becomes arbitrarily
small with high probability as the sample size m tends to infinity.
Now consider the sum EH (fz ) + E(fH ). The second term in this sum
depends on the choice of H but is independent of sampling. We will call it
the approximation error. Note that this approximation error is the sum
A(H) + \sigma_\rho^2,
where A(H) = \int_X (f_H - f_\rho)^2 \, d\rho_X. Therefore, \sigma_\rho^2 is a lower bound for the
approximation error.
The first term, EH (fz ), is called the sample error or estimation error.
Equation (1.2) thus reduces our goal above – to estimate \int_X (f_z - f_\rho)^2 \, d\rho_X or,
equivalently, E(f_z) – into two different problems corresponding to finding
estimates for the sample and approximation errors. The way these problems
depend on the measure ρ calls for different methods and assumptions in their
analysis.
The second problem (to estimate A(H)) is independent of the sample z.
But it depends heavily on the regression function fρ . The worse behaved fρ is
(e.g., the more it oscillates), the more difficult it will be to approximate fρ well
with functions in H. Consequently, all bounds for A(H) will depend on some
parameter measuring the behavior of fρ .
The first problem (to estimate the sample error EH (fz )) is posed on the space
H, and its dependence on ρ is through the sample z. In contrast with the
approximation error, it is essentially independent of fρ . Consequently, bounds
for EH (fz ) will not depend on properties of fρ . However, due to their dependence
on the random sample z, they will hold with only a certain confidence. That is,
the bound will depend on a parameter δ and will hold with a confidence of at
least 1 − δ.
This discussion extends to some algorithmic issues. Although dependence
on the behavior of fρ seems unavoidable in the estimates of the approximation
Thus, a too small space H will yield a large bias, whereas one that is too large
will yield a large variance. Several parameters (radius of balls, dimension, etc.)
determine the “size” of H, and different instances of the bias–variance problem
are obtained by fixing all of them except one and minimizing the error over this
nonfixed parameter.
Failing to find a good compromise between bias and variance leads to what
is called underfitting (large bias) or overfitting (large variance). As an example,
consider Case 1.2 and the curve C in Figure 1.2(a) with the set of sample
points and assume we want to approximate that curve with a polynomial of
degree d (the parameter d determines in our case the dimension of H). If d is
too small, say d = 2, we obtain a curve as in Figure 1.2(b) which necessarily
“underfits” the data points. If d is too large, we can tightly fit the data points
3 [18] p. 332.
Figure 1.2
but this “overfitting” yields a curve as in Figure 1.2(c). In terms of the error
decomposition (1.2) this overfitting corresponds to a small approximation error
but large sample error.
As another example of overfitting, consider the PAC learning situation in
Case 1.5 with C consisting of all subsets of Rn . Consider also a sample
\{(x_1, 1), \ldots, (x_k, 1), (x_{k+1}, 0), \ldots, (x_m, 0)\}. The characteristic function of the
set S = \{x_1, \ldots, x_k\} has zero sample error, but its approximation error is the
measure (w.r.t. \rho_X) of the set T \triangle \{x_1, \ldots, x_k\}, which equals the measure of T as
long as \rho_X has no points with positive probability mass.
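The polynomial example is easy to reproduce numerically. The sketch below is our own illustration (the target curve, noise level, and degrees are arbitrary): it fits polynomials of several degrees to a small noisy sample and compares the empirical (training) error with the error on fresh data; a very small degree underfits both, while a very large degree drives the empirical error down but increases the error on new samples.

    import numpy as np

    rng = np.random.default_rng(5)
    f_true = lambda x: np.sin(2 * np.pi * x)                # stands in for the curve C

    x_train = rng.uniform(0, 1, size=15)
    y_train = f_true(x_train) + rng.normal(scale=0.2, size=x_train.size)
    x_test = rng.uniform(0, 1, size=1000)
    y_test = f_true(x_test) + rng.normal(scale=0.2, size=x_test.size)

    for d in (2, 5, 12):
        coeffs = np.polyfit(x_train, y_train, deg=d)        # least squares fit of degree d
        train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(f"degree {d:2d}: empirical error {train_err:.4f}, error on fresh data {test_err:.4f}")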
we give bounds for these covering numbers for most of the spaces H introduced
in Chapter 2. These bounds are in terms of explicit geometric parameters of H
(e.g., dimension, diameter, smoothness, etc.).
In Chapter 6 we continue along the lines of Chapter 4. We first show some
conditions under which the approximation error can decay as O(R−θ ) only if
fρ is C ∞ . Then we show a polylogarithmic decay in the approximation error
of hypothesis spaces defined via RKHSs for some common instances of these
spaces.
Chapter 7 gives a solution to the bias–variance problem for a particular family
of hypothesis spaces (and under some assumptions on fρ ).
Chapter 8 describes a new setting, regularization, in which the hypothesis
space is no longer required to be compact and argues some equivalence
with the setting described above. In this new setting the computation of the
empirical target function is algorithmically very simple. The notion of excess
generalization error has a natural version, and a bound for it is exhibited.
A special case of learning is that in which Y is finite and, most particularly,
when it has two elements (cf. Case 1.5). Learning problems of this kind are
called classification problems as opposed to the ones with Y = R, which are
called regression problems. For classification problems it is possible to take
advantage of the special structure of Y to devise learning schemes that perform
better than simply specializing the schemes used for regression problems.
One such scheme, known as the support vector machine, is described, and
its error analyzed, in Chapter 9. Chapter 10 gives a detailed analysis for natural
extensions of the support vector machine.
We have begun Chapters 3–10 with brief introductions. Our intention is that
maybe after reading Chapter 2, a reader can form an accurate idea of the contents
of this book simply by reading these introductions.
In this book we will not go deeper into the details of PAC learning. A standard
reference for this is [67].
Other (but not all) books dealing with diverse mathematical aspects of
learning theory are [7, 29, 37, 57, 59, 61, 92, 95, 107, 111, 124, 125, 132,
133, 136, 137]. In addition, a number of scientific journals publish papers on
learning theory. Two devoted wholly to the theory as developed in this book
are Journal of Machine Learning Research and Machine Learning.
Finally, we want to mention that the exposition and structure of this chapter
largely follow [39].
2
Basic hypothesis spaces
2.2 Reminders I
(I) We first recall some commonly used spaces of functions.
We have already defined C (X ). Recall that this is the Banach space of
bounded continuous functions on X with the norm
\|f\|_{C(X)} = \|f\|_\infty = \sup_{x \in X} |f(x)|.
\|f\|_{C^s(X)} = \max_{|\alpha| \le s} \|D^\alpha f\|_{C(X)}.
exists. The space L^p_\nu(X) is defined to be the quotient of L under the equivalence
relation \equiv given by
f \equiv g \iff \int_X |f(x) - g(x)|^p \, d\nu = 0.
Note that elements in L^p_\nu(X) are classes of functions. In general, however,
one abuses language and refers to them as functions on X. For instance, we say
that f \in L^p_\nu(X) is continuous when there exists a continuous function in the
class of f.
The support of a measure ν on X is the smallest closed subset Xν of X such
that ν(X \ Xν ) = 0.
A function f : X → R is measurable when, for all α ∈ R, the set {x ∈ X |
f (x) ≤ α} is a Borel subset of X .
The space L^\infty_\nu(X) is defined to be the set of all measurable functions on X
such that
F( f ∗ g) = F( f )F(g),
The fact that every closed ball in Rn is compact is not true in Hilbert space.
However, we will use the fact that closed balls in a Hilbert space H are weakly
compact. That is, every sequence {fn }n∈N in a closed ball B in H has a weakly
convergent subsequence {fnk }k∈N , or, in other words, there is some f ∈ B
such that
\lim_{k \to \infty} \langle f_{n_k}, g \rangle = \langle f, g \rangle, \quad \forall g \in H.
2.3 Hypothesis spaces associated with Sobolev spaces
\|J\| = \sup_{\|x\| = 1} \|J(x)\|.
Js : H s (X ) → C r (X )
is well defined and bounded. In particular, for all s > n/2, the inclusion
Js : H s (X ) → C (X )
Then
since, by the positive semidefiniteness of the matrix K[{x, t}], for all x, t ∈ X ,
Kx : X → R
t → K(x, t).
Theorem 2.9 There exists a unique Hilbert space (H_K, \langle \cdot, \cdot \rangle_{H_K}) of functions
on X satisfying the following conditions:
(i) for all x \in X, K_x \in H_K,
(ii) the span of the set \{K_x \mid x \in X\} is dense in H_K, and
(iii) for all f \in H_K and x \in X, f(x) = \langle K_x, f \rangle_{H_K}.
Moreover, H_K consists of continuous functions and the inclusion I_K : H_K \to
C(X) is bounded with \|I_K\| \le C_K.
Proof. Let H_0 be the span of the set \{K_x \mid x \in X\}. We define an inner product
in H_0 as
\langle f, g \rangle = \sum_{1 \le i \le s,\ 1 \le j \le r} \alpha_i \beta_j K(x_i, t_j), \quad \text{for } f = \sum_{i=1}^{s} \alpha_i K_{x_i},\ g = \sum_{j=1}^{r} \beta_j K_{t_j}.
The conditions for the inner product can be easily checked. For example, if
\langle f, f \rangle = 0, then for each t \in X the positive semidefiniteness of the Gramian of
K at the subset \{x_i\}_{i=1}^{s} \cup \{t\} tells us that for each \varepsilon \in \mathbb{R}
\sum_{i,j=1}^{s} \alpha_i K(x_i, x_j) \alpha_j + 2\varepsilon \sum_{i=1}^{s} \alpha_i K(x_i, t) + \varepsilon^2 K(t, t) \ge 0.
However, \sum_{i,j=1}^{s} \alpha_i K(x_i, x_j) \alpha_j = \langle f, f \rangle = 0. By letting \varepsilon be arbitrarily small,
we see that f(t) = \sum_{i=1}^{s} \alpha_i K(x_i, t) = 0. This is true for each t \in X; hence f is
the zero function.
Let HK be the completion of H0 with the associated norm. It is easy to check
that HK satisfies the three conditions in the statement. We need only prove that
it is unique. So, assume H is another Hilbert space of functions on X satisfying
the conditions noted. We want to show that
\|K_a - K_b\|_K^2 = \langle K_a - K_b, K_a - K_b \rangle_K = K(a, a) - 2K(a, b) + K(b, b)
= 1 + a^2 - 2(1 + ab) + 1 + b^2 = (a - b)^2.
\langle 1 + ax, 1 + ax \rangle_K = \|1\|_K^2 + 2a \langle 1, x \rangle_K + a^2 \|x\|_K^2 = \|1\|_K^2 + a^2.
\|K_{e_j} - K_0\|_K^2 = K(e_j, e_j) - 2K(e_j, 0) + K(0, 0) = 1.
But (K_{e_j} - K_0)(x) = (1 + x_j) - 1 = x_j, and therefore 1 = \|K_{e_j} - K_0\|_K^2 = \|x_j\|_K^2.
One can prove, similarly, that \langle 1, x_j \rangle_K = 0 and \langle 1, 1 \rangle_K = 1. Now consider i \ne j:
Proof. We need only show the positive semidefiniteness. To do so, for any
x1 , . . . , xm ∈ Rn and c1 , . . ., cm ∈ R, we apply the inverse Fourier transform
k(x) = (2\pi)^{-n} \int_{\mathbb{R}^n} \hat{k}(\xi)\, e^{i x \cdot \xi} \, d\xi
to get
\sum_{j,\ell=1}^{m} c_j c_\ell K(x_j, x_\ell) = \sum_{j,\ell=1}^{m} c_j c_\ell (2\pi)^{-n} \int_{\mathbb{R}^n} \hat{k}(\xi)\, e^{i x_j \cdot \xi} e^{-i x_\ell \cdot \xi} \, d\xi
= (2\pi)^{-n} \int_{\mathbb{R}^n} \hat{k}(\xi) \left( \sum_{j=1}^{m} c_j e^{i x_j \cdot \xi} \right) \left( \sum_{\ell=1}^{m} c_\ell e^{-i x_\ell \cdot \xi} \right) d\xi
= (2\pi)^{-n} \int_{\mathbb{R}^n} \hat{k}(\xi) \left| \sum_{j=1}^{m} c_j e^{i x_j \cdot \xi} \right|^2 d\xi \ge 0,
and therefore CK = 1.
Multivariate splines can also be used to construct translation-invariant
kernels. Take B = [b1 b2 . . . bq ] to be an n × q matrix (called the direction
set) such that q ≥ n and the n × n submatrix B0 = [b1 b2 . . . bn ] is invertible.
Define
M_{B_0} = \frac{1}{|\det B_0|} \chi_{\mathrm{par}(B_0)},
M_{[b_1 b_2 \ldots b_{n+j}]}(x) = \int_{-1/2}^{1/2} M_{[b_1 b_2 \ldots b_{n+j-1}]}(x - t\, b_{n+j}) \, dt
\hat{M}_B(\xi) = \prod_{j=1}^{q} \frac{\sin(\xi \cdot b_j / 2)}{\xi \cdot b_j / 2}.
\hat{k}(\xi) = \prod_{j=1}^{q} \left( \frac{\sin(\xi \cdot b_j / 2)}{\xi \cdot b_j / 2} \right)^2
Now note that for each \sigma \in (0, \infty), the Fourier transform of e^{-\sigma \|x\|^2}
equals (\sqrt{\pi/\sigma})^{\,n} e^{-\|\xi\|^2/(4\sigma)}. Hence,
e^{-\sigma \|x\|^2} = (2\pi)^{-n} \int_{\mathbb{R}^n} \left( \frac{\pi}{\sigma} \right)^{n/2} e^{-\|\xi\|^2/(4\sigma)} e^{i x \cdot \xi} \, d\xi.
Corollary 2.19 Let c > 0. The following functions are Mercer kernels on any
subset X ⊂ Rn :
Proof. Clearly, both kernels are continuous and symmetric. In (i), K is positive
semidefinite by Proposition 2.18 with f(r) = e^{-r/c^2}. The same is true for (ii)
Remark 2.20 The kernels of (i) and (ii) in Corollary 2.19 satisfy CK = 1 and
CK = c−α , respectively.
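Positive semidefiniteness can also be checked numerically on any finite set of points: the Gram matrix K[x] = (K(x_i, x_j)) must have no negative eigenvalues (up to round-off). The sketch below is our own check for the Gaussian kernel of (i); the points and the parameter c are arbitrary.

    import numpy as np

    def gram_matrix(kernel, points):
        """K[x] with entries K(x_i, x_j) for a list of points in R^n."""
        return np.array([[kernel(xi, xj) for xj in points] for xi in points])

    c = 1.5
    gaussian = lambda x, t: np.exp(-np.sum((x - t) ** 2) / c ** 2)

    rng = np.random.default_rng(6)
    points = rng.normal(size=(20, 3))                  # 20 arbitrary points in R^3
    K = gram_matrix(gaussian, points)

    eigenvalues = np.linalg.eigvalsh(K)                # Gram matrix is symmetric
    print("smallest eigenvalue:", eigenvalues.min())   # should be nonnegative up to round-off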
N = \binom{n + d}{n}.
\langle \sigma(f), \sigma(g) \rangle_W = \langle f, g \rangle_W.
|f(x)| \le \|f\|_W \|x\|^d,
where \|x\| is the standard norm of x \in \mathbb{R}^{n+1}. This follows from taking the
action of \sigma \in O(n + 1) such that \sigma(x) = (\|x\|, 0, \ldots, 0).
Let X = S(\mathbb{R}^{n+1}) and
K : X \times X \to \mathbb{R}, \quad (x, t) \mapsto (x \cdot t)^d.
Let also
\Phi : X \to \mathbb{R}^N, \quad x \mapsto \left( x^\alpha (C^d_\alpha)^{1/2} \right)_{|\alpha| = d}.
Denoting by M the matrix whose jth column is \Phi(t_j), we have that K[t] = M^T M, from which
the positivity of K[t] follows. Since K is clearly continuous and symmetric, we
conclude that K is a Mercer kernel.
The next proposition shows the RKHS associated with K.
Proposition 2.21 Hd = HK as function spaces and inner product spaces.
Proof. We know from the proof of Theorem 2.9 that HK is the completion of
H0 , the span of {Kx | x ∈ X }. Since H0 ⊆ Hd and Hd has finite dimension, the
same holds for H0 . But then H0 is complete and we deduce that
HK = H0 ⊆ Hd .
On the other hand, since K_x(w) = \sum_{|\alpha|=d} C^d_\alpha x^\alpha w^\alpha, we know that the Weyl
inner product of K_x and K_t satisfies
\langle K_x, K_t \rangle_W = \sum_{|\alpha|=d} (C^d_\alpha)^{-1} C^d_\alpha x^\alpha\, C^d_\alpha t^\alpha = \sum_{|\alpha|=d} C^d_\alpha x^\alpha t^\alpha = \langle K_x, K_t \rangle_K.
We conclude that since the polynomials Kx span all of H0 , the inner product in
HK = H0 is the Weyl inner product.
But \lim_{k \to \infty} f_{n_k}(x) = f(x), so we have \tilde{f}(x) = f(x) for every point x \in X.
Hence, as continuous functions on X, \tilde{f} = f. Therefore, f \in B_R. This shows
that B_R is closed as a subset of C(X).
Proposition 2.23 Let K be a Mercer kernel on a compact metric space X , and
HK be its RKHS. For all R > 0, the set IK (BR ) is compact.
Proof. By the Arzelá–Ascoli theorem (Theorem 2.4) it suffices to prove that
BR is equicontinuous.
Since X is compact, so is X × X . Therefore, since K is continuous on X × X ,
K must be uniformly continuous on X \times X. It follows that for any \varepsilon > 0, there
exists \delta > 0 such that for all x, y, y' \in X with d(y, y') \le \delta,
|K(x, y) - K(x, y')| \le \varepsilon.
Example 2.24 (Hypothesis spaces associated with an RKHS) Let X be
compact and K : X × X → R be a Mercer kernel. By Proposition 2.23, for
all R > 0 we may consider IK (BR ) to be a hypothesis space. Here and in what
follows BR denotes the closed ball of radius R centered on the origin.
2.7 Reminders II
The general nonlinear programming problem is the problem of finding x ∈ Rn
to solve the following minimization problem:
min f (x)
s.t. gi (x) ≤ 0, i = 1, . . . , m, (2.2)
hj (x) = 0, j = 1, . . . , p,
and
\frac{f(x_0 + \lambda(x - x_0)) - f(x_0)}{\lambda(x - x_0)} \le \frac{f(x) - f(x_0)}{x - x_0}.
This means that the function t \mapsto (f(t) - f(x_0))/(t - x_0) is increasing in the
interval [x_0, x]. Hence, the right derivative
f'_+(x_0) := \lim_{t \to x_0^+} \frac{f(t) - f(x_0)}{t - x_0}
and the left derivative
f'_-(x_0) := \lim_{t \to x_0^-} \frac{f(t) - f(x_0)}{t - x_0}
exist. These two derivatives, in addition, satisfy f'_-(x_0) \le f'_+(x_0) whenever x_0
is a point in the interior of S. Hence, both f'_-(x_0) and f'_+(x_0) are nondecreasing
in S.
In addition to those listed above, convex functions satisfy other properties.
We highlight the fact that the addition of convex functions is convex and that if a
function f is convex and C 2 then its Hessian D2 f (x) at x is positive semidefinite
for all x in its domain.
The convex programming problem is the problem of finding x ∈ Rn
to solve (2.2) with f and gi convex functions and hj linear. As we have
remarked, efficient algorithms for the convex programming problem exist.
In particular, when f and the gi are quadratic functions, the corresponding
programming problem, called the convex quadratic programming problem,
can be solved by even more efficient algorithms. In fact, convex quadratic
programs are a particular case of second-order cone programs. And second-
order cone programming today provides an example of the success of interior
point methods: very large amounts of input data can be efficiently dealt with,
and commercial code is available. (For references see Section 2.9).
s.t. \; c^T K[x]\, c \le R^2.
Note that this is a convex quadratic programming problem and, therefore, can
be efficiently solved.
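The constraint c^T K[x] c ≤ R^2 is the ball condition ‖f‖_K ≤ R for f = Σ_j c_j K_{x_j}. As an illustration only — the objective is not reproduced above, so we assume here that it is the empirical least squares error of such an f — the problem can be handed to a general-purpose solver; this is only a sketch, not the dedicated convex quadratic programming algorithms the text refers to.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(7)
    x = rng.uniform(-1, 1, size=(30, 1))
    y = np.sin(np.pi * x[:, 0]) + rng.normal(scale=0.1, size=30)

    gaussian = lambda s, t: np.exp(-np.sum((s - t) ** 2) / 0.5)
    K = np.array([[gaussian(si, sj) for sj in x] for si in x])     # the matrix K[x]
    R = 5.0

    def objective(c):
        # Assumed objective: empirical least squares error of f = sum_j c_j K_{x_j}.
        return np.mean((K @ c - y) ** 2)

    constraint = {"type": "ineq", "fun": lambda c: R ** 2 - c @ K @ c}   # c^T K[x] c <= R^2
    result = minimize(objective, x0=np.zeros(len(y)), method="SLSQP", constraints=[constraint])
    print("empirical error at solution:", objective(result.x))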
References for Sobolev space are [1, 129], and [47] for embedding theorems.
A substantial amount of the theory of RKHSs was surveyed by
N. Aronszajn [9]. On page 344 of this reference, Theorem 2.9, in essence,
is attributed to E. H. Moore.
The special dot product kernel K(x, y) = (c + x · y)d for some c ≥ 0 and
d ∈ N was introduced into the field of statistical learning theory by Vapnik (see,
e.g., [134]). General dot product kernels are described in [118]; see also [101]
and [79]. Spline kernels are discussed extensively in [137].
Chapter 14 of [19] is a reference for the unitary and orthogonal invariance of
\langle \cdot, \cdot \rangle_W. A reference for the nondegeneracy of the Veronese variety mentioned
in Proposition 2.21 is section 4.4 of [109].
A comprehensive introduction to convex optimization is the book [25]. For
second-order cone programming see the articles [3, 119].
For more families of Mercer kernels in learning theory see [107]. More
examples of box splines can be found in [41]. Reducing the computation of fz
from HK to HK,z is ensured by representer theorems [137]. For a general form
of these theorems see [117].
3
Estimating the sample error
The main result in this chapter provides bounds for the sample error of a compact
and convex hypothesis space. We have already noted that with m fixed, the
sample error increases with the size of H. The bounds we deduce in this chapter
show this behavior with respect to a particular measure for the size of H: its
capacity as measured by covering numbers.
Definition 3.1 Let S be a metric space and \eta > 0. We define the covering
number \mathcal{N}(S, \eta) to be the minimal \ell \in \mathbb{N} such that there exist \ell disks in S with
radius \eta covering S. When S is compact this number is finite.
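Covering numbers are rarely computed exactly, but a greedy construction gives a usable upper bound for a finite point cloud (for instance, a function class evaluated on a grid). The following sketch is our own illustration; it counts how many η-balls in the sup metric a greedy cover uses, which upper-bounds N(S, η) for the discretized set.

    import numpy as np

    def greedy_cover_size(points, eta):
        """Upper bound on the covering number of a finite set under the sup metric."""
        remaining = list(points)
        count = 0
        while remaining:
            center = remaining.pop(0)
            remaining = [p for p in remaining if np.max(np.abs(p - center)) > eta]
            count += 1
        return count

    # Example: the "class" of functions t -> a * t with |a| <= 1, evaluated on a grid of 50 points.
    grid = np.linspace(0, 1, 50)
    points = [a * grid for a in np.linspace(-1, 1, 201)]
    print(greedy_cover_size(points, eta=0.1))   # roughly 10-11 balls, since the sup distance is |a1 - a2|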
\sup_{f \in H} |f(x) - y| \le M
EH ( fz ) = E( fz ) − Ez ( fz ) + Ez ( fz ) − Ez ( fH ) + Ez ( fH ) − E( fH ).
\operatorname{Prob}\{\xi \ge t\} \le \frac{E(\xi)}{t}.
\operatorname{Prob}\{|\xi - E(\xi)| \ge t\} = \operatorname{Prob}\{(\xi - E(\xi))^2 \ge t^2\} \le \frac{\sigma^2(\xi)}{t^2}.
One particular use of Chebyshev’s inequality is for sums of independent random
variables. If \xi is a random variable on a probability space Z with mean E(\xi) = \mu
and variance \sigma^2(\xi) = \sigma^2, then, for all \varepsilon > 0,
\operatorname{Prob}_{z \in Z^m} \left\{ \left| \frac{1}{m} \sum_{i=1}^{m} \xi(z_i) - \mu \right| \ge \varepsilon \right\} \le \frac{\sigma^2}{m\varepsilon^2}.
This inequality provides a simple form of the weak law of large numbers since
it shows that when m \to \infty, \frac{1}{m} \sum_{i=1}^{m} \xi(z_i) \to \mu with probability 1.
For any 0 < \delta < 1, by taking \varepsilon = \sqrt{\sigma^2/(m\delta)} in the inequality above it
follows that with confidence 1 - \delta,
\left| \frac{1}{m} \sum_{i=1}^{m} \xi(z_i) - \mu \right| \le \sqrt{\frac{\sigma^2}{m\delta}}. \qquad (3.1)
The goal of this section is to extend inequality (3.1) to show a faster rate of decay.
Typical bounds with confidence 1 - \delta will be of the form c\left(\log(2/\delta)/m\right)^{\frac{1}{2}+\theta}
with 0 < \theta < \frac{1}{2} depending on the variance of \xi. The improvement in the error is
seen both in its dependence on \delta – from 2/\delta to \log(2/\delta) – and in its dependence
on m – from m^{-\frac{1}{2}} to m^{-(\frac{1}{2}+\theta)}. Note that \{\xi_i = \xi(z_i)\}_{i=1}^{m} are independent random
variables with the same mean and variance.
If for each i, |\xi_i - \mu_i| \le M holds almost everywhere, then for every \varepsilon > 0
we have
\operatorname{Prob}\left\{ \sum_{i=1}^{m} (\xi_i - \mu_i) > \varepsilon \right\} \le \exp\left\{ -\frac{\varepsilon}{M} \left[ \left( 1 + \frac{\sum_{i=1}^{m} \sigma_i^2}{M\varepsilon} \right) \log\left( 1 + \frac{M\varepsilon}{\sum_{i=1}^{m} \sigma_i^2} \right) - 1 \right] \right\}.
Since |\xi_i| \le M almost everywhere and E(\xi_i) = 0, the Taylor expansion for e^x
yields
E\left( e^{c\xi_i} \right) = 1 + \sum_{\ell=2}^{+\infty} \frac{c^\ell E(\xi_i^\ell)}{\ell!} \le 1 + \sum_{\ell=2}^{+\infty} \frac{c^\ell M^{\ell-2} \sigma_i^2}{\ell!},
and therefore
I \le \exp\left\{ -c\varepsilon + \frac{e^{cM} - 1 - cM}{M^2} \sum_{i=1}^{m} \sigma_i^2 \right\}.
Now choose the constant c to be the minimizer of the bound on the right-hand
side above:
c = \frac{1}{M} \log\left( 1 + \frac{M\varepsilon}{\sum_{i=1}^{m} \sigma_i^2} \right).
g(λ) := (1 + λ) log(1 + λ) − λ.
(Bernstein)
\operatorname{Prob}\left\{ \sum_{i=1}^{m} (\xi_i - \mu_i) > \varepsilon \right\} \le \exp\left\{ -\frac{\varepsilon^2}{2\left( \sum_{i=1}^{m} \sigma_i^2 + \frac{1}{3} M\varepsilon \right)} \right\}.
(Hoeffding)
\operatorname{Prob}\left\{ \sum_{i=1}^{m} (\xi_i - \mu_i) > \varepsilon \right\} \le \exp\left\{ -\frac{\varepsilon^2}{2 m M^2} \right\}.
Proof. The first inequality follows from (3.3) and the inequality
g(\lambda) \ge \frac{\lambda}{2} \log(1 + \lambda), \quad \forall \lambda \ge 0. \qquad (3.4)
We can see that f(0) = 0, f'(0) = 0, and f''(\lambda) = \lambda(1 + \lambda)^{-2} \ge 0 for
\lambda \ge 0. Hence f(\lambda) \ge 0 and
It follows that
g(\lambda) = \lambda \log(1 + \lambda) + \log(1 + \lambda) - \lambda \ge \frac{\lambda}{2} \log(1 + \lambda), \quad \forall \lambda > 0.
This verifies (3.4) and then the generalized Bennett’s inequality.
Since g(\lambda) \ge 0, we find that the function h defined on [0, \infty) by h(\lambda) =
(6 + 2\lambda) g(\lambda) - 3\lambda^2 satisfies similar conditions: h(0) = h'(0) = 0, and h''(\lambda) =
(4/(1 + \lambda)) g(\lambda) \ge 0. Hence h(\lambda) \ge 0 for \lambda \ge 0 and
g(\lambda) \ge \frac{3\lambda^2}{6 + 2\lambda}.
Applying this to (3.3), we get the proof of Bernstein’s inequality.
To prove Hoeffding’s inequality, we follow the proof of Proposition 3.4 and
use (3.2). As the exponential function is convex and −M ≤ ξi ≤ M almost
surely,
E\left( e^{c\xi_i} \right) \le \frac{e^{cM} + e^{-cM}}{2} = \sum_{j=0}^{\infty} \frac{\left((cM)^2/2\right)^j}{j! \prod_{\ell=1}^{j}(2\ell - 1)} \le \sum_{j=0}^{\infty} \frac{\left((cM)^2/2\right)^j}{j!} = \exp\left( \frac{(cM)^2}{2} \right).
This, together with (3.2), implies that I \le \exp\left\{ -c\varepsilon + m(cM)^2/2 \right\}. Choose
(Bernstein) \quad \operatorname{Prob}_{z \in Z^m}\left\{ \frac{1}{m} \sum_{i=1}^{m} \xi(z_i) - \mu \ge \varepsilon \right\} \le \exp\left\{ -\frac{m\varepsilon^2}{2\left( \sigma^2 + \frac{1}{3} M\varepsilon \right)} \right\}.
(Hoeffding) \quad \operatorname{Prob}_{z \in Z^m}\left\{ \frac{1}{m} \sum_{i=1}^{m} \xi(z_i) - \mu \ge \varepsilon \right\} \le \exp\left\{ -\frac{m\varepsilon^2}{2 M^2} \right\}.
Proof. Apply Proposition 3.5 to the random variables \{\xi_i = \xi(z_i)/m\}, which
satisfy |\xi_i - E(\xi_i)| \le M/m, \sigma^2(\xi_i) = \sigma^2/m^2, and \sum_{i=1}^{m} \sigma_i^2 = \sigma^2/m.
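The one-sided bounds are easy to test empirically. The sketch below is our own illustration (the distribution of ξ is an arbitrary bounded choice): it estimates Prob{(1/m)Σξ(z_i) − µ ≥ ε} by simulation and compares it with the Hoeffding bound exp(−mε²/(2M²)).

    import numpy as np

    rng = np.random.default_rng(8)
    m, eps, trials = 200, 0.1, 20000
    M = 1.0                                   # xi uniform on [-M, M], so |xi - mu| <= M
    mu = 0.0

    samples = rng.uniform(-M, M, size=(trials, m))
    deviations = samples.mean(axis=1) - mu
    empirical = np.mean(deviations >= eps)
    hoeffding = np.exp(-m * eps ** 2 / (2 * M ** 2))
    print(f"empirical tail probability {empirical:.4f} <= Hoeffding bound {hoeffding:.4f}")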
Theorem 3.8 Let M > 0 and f : X \to Y be M-bounded. Then, for all \varepsilon > 0,
\operatorname{Prob}_{z \in Z^m}\left\{ L_z(f) \ge -\varepsilon \right\} \ge 1 - \exp\left\{ -\frac{m\varepsilon^2}{2M^4} \right\}.
Remark 3.9
(i) Note that the confidence (i.e., the right-hand side in the inequality above)
is positive and approaches 1 exponentially quickly with m.
and the fact that the probability of a union of events is bounded by the sum of
the probabilities of those events.
Proof of Theorem 3.10 Let \ell = \mathcal{N}\left(H, \frac{\varepsilon}{4M}\right) and consider f_1, \ldots, f_\ell such that
the disks D_j centered at f_j and with radius \frac{\varepsilon}{4M} cover H. Let U be a full measure
\sup_{f \in D_j} L_z(f) \ge 2\varepsilon \;\Longrightarrow\; L_z(f_j) \ge \varepsilon.
where the last inequality follows from Hoeffding’s bound in Corollary 3.6 for
\xi = -(f(x) - y)^2 on Z. The statement now follows from Lemma 3.11 by
replacing \varepsilon by \varepsilon/2.
Remark 3.12 Hoeffding’s inequality can be seen as a quantitative instance of
the law of large numbers. An “abstract” uniform version of this law can be
extracted from the proof of Theorem 3.10.
Proposition 3.13 Let F be a family of functions from a probability space Z to
R and d a metric on F. Let U ⊂ Z be of full measure and B, L > 0 such that
(i) |ξ(z)| ≤ B for all ξ ∈ F and all z ∈ U , and
(ii) |Lz (ξ1 ) − Lz (ξ2 )| ≤ L d (ξ1 , ξ2 ) for all ξ1 , ξ2 ∈ F and all z ∈ U m , where
L_z(\xi) = \int_Z \xi - \frac{1}{m} \sum_{i=1}^{m} \xi(z_i).
\operatorname{Prob}_{z \in Z^m}\left\{ \mathcal{E}_H(f_z) \le \varepsilon \right\} \ge 1 - \left( \mathcal{N}\left(H, \frac{\varepsilon}{16M}\right) + 1 \right) \exp\left\{ -\frac{m\varepsilon^2}{32M^4} \right\}.
\sup_{f \in H} L_z(f) = \sup_{f \in H} \{E(f) - E_z(f)\} \le \frac{\varepsilon}{2}
Remark 3.15 Theorem 3.14 helps us deal with the question posed in
Section 1.3. Given ε, δ > 0, to ensure that
\operatorname{Prob}_{z \in Z^m}\left\{ \mathcal{E}_H(f_z) \le \varepsilon \right\} \ge 1 - \delta,
To prove this, take \delta = \left( \mathcal{N}\left(H, \frac{\varepsilon}{16M}\right) + 1 \right) \exp\left\{ -\frac{m\varepsilon^2}{32M^4} \right\} and solve for m. Note,
furthermore, that (3.5) gives a relation between the three basic variables ε, δ,
and m.
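Solving the displayed expression for δ with respect to m makes the relation between ε, δ, and m explicit: m ≥ (32M⁴/ε²) ln((N(H, ε/(16M)) + 1)/δ) suffices. A small helper (ours; the covering number is treated as a given, hypothetical input) computes this sample size.

    import math

    def sample_size(eps, delta, M, covering_number):
        """Smallest m with (N(H, eps/(16 M)) + 1) * exp(-m * eps**2 / (32 * M**4)) <= delta."""
        return math.ceil(32 * M ** 4 / eps ** 2 * math.log((covering_number + 1) / delta))

    # Example: M = 1, accuracy eps = 0.1, confidence 1 - delta = 0.99,
    # and a (hypothetical) covering number N(H, eps/(16 M)) = 10**6.
    print(sample_size(eps=0.1, delta=0.01, M=1.0, covering_number=10 ** 6))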
\|f_H - f_\rho\|_\rho^2 \le \|t f + (1 - t) f_H - f_\rho\|_\rho^2
= \|f_H - f_\rho\|_\rho^2 + 2t \left\langle f - f_H, f_H - f_\rho \right\rangle_\rho + t^2 \|f - f_H\|_\rho^2.
By taking t to be small enough, we see that \left\langle f - f_H, f_H - f_\rho \right\rangle_\rho \ge 0. That is, the
angle \widehat{f_\rho f_H f} is obtuse, which implies (note that the squares are crucial)
\|f_H - f\|_\rho^2 \le \|f - f_\rho\|_\rho^2 - \|f_H - f_\rho\|_\rho^2;
that is,
\int_X (f_H - f)^2 \, d\rho_X \le E(f) - E(f_H) = \mathcal{E}_H(f).
holds.
Proof. Since \xi satisfies |\xi - \mu| \le B, the one-side Bernstein inequality in
Corollary 3.6 implies that
\operatorname{Prob}_{z \in Z^m}\left\{ \frac{\mu - \frac{1}{m}\sum_{i=1}^{m} \xi(z_i)}{\sqrt{\mu + \varepsilon}} > \alpha \sqrt{\varepsilon} \right\} \le \exp\left\{ -\frac{\alpha^2 m (\mu + \varepsilon)\varepsilon}{2\left( \sigma^2(\xi) + \frac{1}{3} B \alpha \sqrt{\mu + \varepsilon}\, \sqrt{\varepsilon} \right)} \right\}.
\sigma^2(\xi) + \frac{1}{3} B \alpha \sqrt{\mu + \varepsilon}\, \sqrt{\varepsilon} \le c\mu + \frac{1}{3} B(\mu + \varepsilon) \le \left( c + \frac{B}{3} \right)(\mu + \varepsilon).
Proof. Let {gj }Jj=1 ⊂ G with J = N (G, αε) be such that G is covered by balls
in C (Z) centered on gj with radius αε.
Applying Lemma 3.18 to ξ = gj for each j, we have
\operatorname{Prob}_{z \in Z^m}\left\{ \frac{E(g_j) - E_z(g_j)}{\sqrt{E(g_j) + \varepsilon}} \ge \alpha \sqrt{\varepsilon} \right\} \le \exp\left\{ -\frac{\alpha^2 m \varepsilon}{2c + \frac{2}{3} B} \right\}.
For each g \in G, there is some j such that \|g - g_j\|_{C(Z)} \le \alpha\varepsilon. Then |E_z(g) -
E_z(g_j)| and |E(g) - E(g_j)| are both bounded by \alpha\varepsilon. Hence
EH ( fz ) ≤ ε/2 EH ( fz ) + ε.
The probability inequalities given in Section 3.1 are standard in the literature
on the law of large numbers or central limit theorems (e.g., [21, 103, 132]).
There is a vast literature on further extensions of the inequalities in Section 3.2
stated in terms of empirical covering numbers and other capacity measures [13,
71] (called concentration inequalities) that is outside the scope of this book.
We mention the McDiarmid inequality [83], the Talagrand inequality [126, 22],
and probability inequalities in Banach spaces [99].
The inequalities in Section 3.4 are improvements of those in the Vapnik–
Chervonenkis theory [135]. In particular, Lemmas 3.18 and 3.19 are a covering
number version of an inequality given by Anthony and Shawe-Taylor [8]. The
convexity of the hypothesis space plays a central role in improving the sample
error bounds, as in Theorem 3.3. This can be seen in [10, 12, 74].
A natural question about the sample error is whether upper bounds such as
those in Theorem 3.3 are tight. In this regard, lower bounds called minimax
rates of convergence can be obtained (see, e.g., [148]).
The ideas described in this chapter can be developed for a more general
class of learning algorithms known as empirical risk minimization (ERM) or
structural risk minimization algorithms [17, 60, 110]. We devote the remainder
of this section to briefly describe some aspects of this development.
The greater generality of ERM comes from the fact that algorithms in this
class minimize empirical errors with respect to a loss function ψ : R → R+ .
The loss function measures how the sample value y approximates the function
value f (x) by evaluating ψ(y − f (x)).
For (x, y) ∈ Z, the value ψ(y−f (x)) is the local error suffered from the use of
f as a model for the process producing y at x. The condition ψ(0) = 0 ensures
a zero error when y = f (x). Examples of regression loss functions include the
least squares loss and Vapnik’s \varepsilon-insensitive norm.
Example 3.22 The least squares loss corresponds to the loss function \psi(t) =
t^2. For \varepsilon > 0, the \varepsilon-insensitive norm is the loss function defined by
\psi(t) = \psi_\varepsilon(t) = \begin{cases} |t| - \varepsilon & \text{if } |t| > \varepsilon \\ 0 & \text{otherwise.} \end{cases}
E_z^\psi(f) = \frac{1}{m} \sum_{i=1}^{m} \psi(y_i - f(x_i)).
E^\psi(f_z^\psi) - E^\psi(f_H^\psi) \le \left( E^\psi(f_z^\psi) - E_z^\psi(f_z^\psi) \right) + \left( E_z^\psi(f_H^\psi) - E^\psi(f_H^\psi) \right). \qquad (3.6)
The second term on the right-hand side of (3.6) converges to zero, with
high probability when m → ∞, and its convergence rate can be estimated by
standard probability inequalities.
The first term on the right-hand side of (3.6) is more involved. If one writes
\xi_z(z) = \psi(y - f_z^\psi(x)), then E^\psi(f_z^\psi) - E_z^\psi(f_z^\psi) = \int_Z \xi_z(z)\, d\rho - \frac{1}{m} \sum_{i=1}^{m} \xi_z(z_i).
But ξz is not a single random variable; it depends on the sample z. Therefore,
the usual law of large numbers does not guarantee the convergence of this first
term. One major goal of classical statistical learning theory [134] is to estimate
ψ ψ ψ
this error term (i.e., E ψ ( fz )−Ez ( fz )). The collection of ideas and techniques
used to get such estimates, known as the theory of uniform convergence, plays
the role of a uniform law of large numbers. To see why, consider the quantity
\sup_{f \in H} \left| E_z^\psi(f) - E^\psi(f) \right|, \qquad (3.7)
which bounds the first term on the right-hand side of (3.6), hence providing
(together with bounds for the second term) an estimate for the sample error
E^\psi(f_z^\psi) - E^\psi(f_H^\psi). The theory of uniform convergence studies the convergence
of this quantity. It characterizes those function sets H such that the quantity (3.7)
tends to zero in probability as m → ∞.
where the supremum is taken with respect to all Borel probability distributions
µ on X , and Prob denotes the probability with respect to the samples x1 , x2 , . . .
independently drawn according to such a distribution µ.
The UGC property can be characterized by the Vγ dimensions of H, as has
been done in [5].
Definition 3.24 Let H be a set of functions from X to [0, 1] and γ > 0. We say
that A ⊂ X is Vγ shattered by H if there is a number α ∈ R with the following
property: for every subset E of A there exists some function fE ∈ H such that
fE (x) ≤ α − γ for every x ∈ A \ E, and fE (x) ≥ α + γ for every x ∈ E. The Vγ
dimension of H, Vγ (H), is the maximal cardinality of a set A ⊂ X that is Vγ
shattered by H.
The concept of Vγ dimension is related to many other quantities involving
capacity of function sets studied in approximation theory or functional analysis:
covering numbers, entropy numbers, VC dimensions, packing numbers, metric
entropy, and others.
The following characterization of the UGC property is given in [5].
Theorem 3.25 Let H be a set of functions from X to [0, 1]. Then H is UGC if
and only if the Vγ dimension of H is finite for every γ > 0.
Theorem 3.25 may be used to verify the convergence of ERM schemes when
the hypothesis space H is a noncompact UGC set such as the union of unit balls
of reproducing kernel Hilbert spaces associated with a set of Mercer kernels. In
particular, for the Gaussian kernels with flexible variances, the UGC property
holds [150].
Many fundamental problems about the UGC property remain to be solved.
As an example, consider the empirical covering numbers.
Definition 3.26 For x = (x_i)_{i=1}^{m} \in X^m and H \subset C(X), the \ell^\infty-empirical
covering number \mathcal{N}_\infty(H, x, \eta) is the covering number of H|_x := \{(f(x_i))_{i=1}^{m} :
f \in H\} as a subset of \mathbb{R}^m with the following metric. For f, g \in C(X) we take
d_x(f, g) = \max_{i \le m} |f(x_i) - g(x_i)|. The metric entropy of H is defined as
It is known [46] that a set H of functions from X to [0, 1] is UGC if and only if,
for every \eta > 0, \lim_{m \to \infty} H_m(H, \eta)/m = 0. In this case, one has H_m(H, \eta) =
O(\log^2 m) for every \eta > 0. It is conjectured in [5] that H_m(H, \eta) = O(\log m)
is true for every \eta > 0. A weak form is: Is it true that for some \alpha \in [1, 2), every
UGC set H satisfies
4
Polynomial decay of the approximation error
(which coincides with the approximation error modulo \sigma_\rho^2) decreases. The main
result in this chapter characterizes the measures ρ and kernels K for which this
decay is polynomial, that is, A(fρ , R) = O(R−θ ) with θ > 0.
Theorem 4.1 Suppose \rho is a Borel probability measure on Z. Let K be a Mercer
kernel on X and L_K : L^2_{\rho_X} \to L^2_{\rho_X} be the operator given by
L_K f(x) = \int_X K(x, t) f(t) \, d\rho_X(t), \quad x \in X.
Let \theta > 0. If f_\rho \in \operatorname{Range}(L_K^{\theta/(4+2\theta)}), that is, f_\rho = L_K^{\theta/(4+2\theta)}(g) for some g \in
L^2_{\rho_X}, then A(f_\rho, R) \le 2^{2+\theta} \|g\|_{L^2_{\rho_X}}^{2+\theta} R^{-\theta}. Conversely, if \rho is nondegenerate
and A(f_\rho, R) \le C R^{-\theta} for some constants C and \theta, then f_\rho lies in the range of
L_K^{\theta/(4+2\theta)-\epsilon} for all \epsilon > 0.
Although Theorem 4.1 may be applied to spline kernels (see Section 4.6),
we show in Theorem 6.2 that for C ∞ kernels (e.g., the Gaussian kernel) and
under some conditions on ρX , the approximation error decay cannot reach the
order A(fρ , R) = O(R−θ ) unless fρ is C ∞ itself. Instead, also in Chapter 6,
we derive logarithmic orders like A(fρ , R) = O((log R)−θ ) for analytic kernels
and Sobolev smooth regression functions.
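Although L_K is defined by an integral, its leading eigenvalues can be approximated from a finite sample: if x_1, ..., x_m are drawn from ρ_X, the eigenvalues of (1/m) K[x] approximate those of L_K. This Nyström-type approximation is a standard numerical device, not a construction from the text; the kernel and measure below are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(9)
    m = 400
    x = rng.uniform(-1.0, 1.0, size=m)                 # draws from rho_X (uniform on X = [-1, 1])

    kernel = lambda s, t: np.exp(-(s - t) ** 2)        # a Mercer kernel on X
    K = kernel(x[:, None], x[None, :])                 # Gram matrix K[x]

    eigs = np.linalg.eigvalsh(K / m)                   # eigenvalues of (1/m) K[x]
    print("leading approximate eigenvalues of L_K:", np.sort(eigs)[::-1][:5].round(4))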
A sequence satisfying (i) and (ii) only is said to be an orthonormal system. The
numbers \langle f, \phi_n \rangle are the Fourier coefficients of f in the basis \{\phi_n\}_{n \ge 1}. It is easy
to see that these coefficients are unique since, if f = \sum a_n \phi_n, then a_n = \langle f, \phi_n \rangle for
all n \ge 1.
L^\theta\left( \sum c_k \phi_k \right) = \sum c_k \lambda_k^\theta \phi_k.
L_K : L^2_\nu(X) \to C(X)
The assertion \|L_K\| \le \sqrt{\nu(X)}\, C_K^2 follows from the inequality
Proposition 4.6
(i) If K is symmetric, then L_K : L^2_\nu(X) \to L^2_\nu(X) is self-adjoint.
(ii) If, in addition, K is positive semidefinite, then L_K is positive.
Proof. Part (i) follows easily from Fubini’s theorem and the symmetry of K.
For Part (ii), just note that
\int_X \int_X K(x, t) f(x) f(t)\, d\nu(x)\, d\nu(t) = \lim_{k \to \infty} \frac{(\nu(X))^2}{k^2} \sum_{i,j=1}^{k} K(x_i, x_j) f(x_i) f(x_j)
= \lim_{k \to \infty} \frac{(\nu(X))^2}{k^2} f_x^T K[x] f_x,
To prove this last fact, use the fact that \phi_k = (1/\lambda_k) L_K(\phi_k) in L^2_\nu(X). Then
we can choose the eigenfunction to be (1/\lambda_k) L_K(\phi_k), which is a continuous
function.
In what follows we fix a Mercer kernel K and let {φk ∈ Lν2 (X )} be an
orthonormal basis of Lν2 (X ) consisting of eigenfunctions of LK . We call the
φk orthonormal eigenfunctions. Denote by λk , k ≥ 1, the eigenvalue of LK
corresponding to φk . If λk > 0, the function φk is continuous. In addition, it lies
in the RKHS HK . This is so since
\phi_k(x) = \frac{1}{\lambda_k} L_K(\phi_k)(x) = \frac{1}{\lambda_k} \int_X K(x, t) \phi_k(t) \, d\nu(t),
Remark 4.9 When ν is nondegenerate, one can easily see from the definition
of the integral operator that LK has no eigenvalue 0 if and only if HK is dense
in Lν2 (X ).
In fact, the orthonormal system above forms an orthonormal basis of HK
when ρX is nondegenerate. This will be proved in Section 4.4. Toward this end,
we next prove Mercer’s theorem.
\langle \lambda_k \phi_k, K_x \rangle_K = \lambda_k \phi_k(x),
Hence the series \sum_{k \ge 1} \lambda_k |\phi_k(x)|^2 converges. This is true for each point x \in X.
Now we fix a point x ∈ X . When the basis {φk }k≥1 has infinitely many
functions, the estimate above, together with the Cauchy–Schwarz inequality,
tells us that for each t ∈ X ,
\left| \sum_{k=m}^{m+\ell} \lambda_k \phi_k(x) \phi_k(t) \right| \le \left( \sum_{k=m}^{m+\ell} \lambda_k |\phi_k(t)|^2 \right)^{1/2} \left( \sum_{k=m}^{m+\ell} \lambda_k |\phi_k(x)|^2 \right)^{1/2} \le C_K \left( \sum_{k=m}^{m+\ell} \lambda_k |\phi_k(x)|^2 \right)^{1/2},
which tends to zero uniformly (for t \in X). Hence the series \sum_{k \ge 1} \lambda_k \phi_k(x) \phi_k(t)
(as a function of t) converges absolutely and uniformly on X to a continuous
function g_x. On the other hand, as a function in L^2_\nu(X), K_x can be expanded by
means of an orthonormal basis consisting of \{\phi_k\} and an orthonormal basis
of the nullspace of L_K. For f in the latter basis,
\langle K_x, f \rangle_{L^2_\nu(X)} = \int_X K(x, y) f(y)\, d\nu = 0.
However, since \{\phi_1, \phi_2, \ldots\} is an orthonormal basis, \|\phi_k\|^2 = 1 for all k \ge 1
and the first statement follows. The second statement holds true because the
assumption \lambda_k \ge \lambda_j for j > k tells us that k\lambda_k \le \sum_{j=1}^{k} \lambda_j \le \nu(X) C_K^2.
L_K^{1/2} : L^2_\nu(X) \to H_K
\sum a_k \phi_k \mapsto \sum a_k \lambda_k^{1/2} \phi_k
\Phi : X \to \ell^2
x \mapsto \left( \sqrt{\lambda_k}\, \phi_k(x) \right)_{k \ge 1}
K(x, t) = \sum_{k=1}^{\infty} \lambda_k \phi_k(x) \phi_k(t) = \langle \Phi(x), \Phi(t) \rangle_{\ell^2}.
\|\Phi(x) - \Phi(t)\|_{\ell^2}^2 = \langle \Phi(x), \Phi(x) \rangle_{\ell^2} + \langle \Phi(t), \Phi(t) \rangle_{\ell^2} - 2\langle \Phi(x), \Phi(t) \rangle_{\ell^2}
= K(x, x) + K(t, t) - 2K(x, t),
Definition 4.15 Let (B, \|\cdot\|) and (H, \|\cdot\|_H) be Banach spaces and assume H
is a subspace of B. The K-functional K : B \times (0, \infty) \to \mathbb{R} of the pair (B, H) is
defined, for a \in B and t > 0, by
K(a, t) := \inf_{b \in H} \left\{ \|a - b\| + t \|b\|_H \right\}. \qquad (4.2)
It can easily be seen that for fixed a \in B, the function K(a, t) of t is continuous,
nondecreasing, and bounded by \|a\| (take b = 0 in (4.2)). When H is dense in
B, K(a, t) tends to zero as t \to 0. The interpolation spaces for the pair (B, H)
are defined in terms of the convergence rate of this function.
For 0 < r < 1, the interpolation space (B, H)_r consists of all the elements
a \in B such that the norm
\|a\|_r := \sup_{t > 0} \left\{ K(a, t)/t^r \right\}
is finite.
Theorem 4.16 Let (B, \|\cdot\|) be a Banach space, and (H, \|\cdot\|_H) a subspace, such
that \|b\| \le C_0 \|b\|_H for all b \in H and a constant C_0 > 0. Let 0 < r < 1. If
a \in (B, H)_r, then, for all R > 0,
A(a, R) := \inf_{\|b\|_H \le R} \|a - b\|^2 \le \|a\|_r^{2/(1-r)} R^{-2r/(1-r)}.
Conversely, if A(a, R) \le C R^{-2r/(1-r)} for all R > 0, then a \in (B, H)_r and
\|a\|_r \le 2 C^{(1-r)/2}.
Proof. Consider the function f(t) := K(a, t)/t. It is continuous on (0, +\infty).
Since K(a, t) \le \|a\|, \inf_{t>0} \{f(t)\} = 0.
Fix R > 0. If \sup_{t>0} \{f(t)\} \ge R, then, for any 0 < \varepsilon < 1, there exists some
t_{R,\varepsilon} \in (0, +\infty) such that
f(t_{R,\varepsilon}) = \frac{K(a, t_{R,\varepsilon})}{t_{R,\varepsilon}} = (1 - \varepsilon) R.
It follows that
\|b\|_H \le \frac{K(a, t_{R,\varepsilon})}{(1 - \varepsilon) t_{R,\varepsilon}} = R
and
\|a - b\| \le \frac{K(a, t_{R,\varepsilon})}{1 - \varepsilon}.
But the definition of the norm \|a\|_r implies that
\frac{K(a, t_{R,\varepsilon})}{t_{R,\varepsilon}^r} \le \|a\|_r.
Therefore
\|a - b\| \le \left( \frac{K(a, t_{R,\varepsilon})}{(1 - \varepsilon) t_{R,\varepsilon}} \right)^{-r/(1-r)} \left( \frac{K(a, t_{R,\varepsilon})}{(1 - \varepsilon) t_{R,\varepsilon}^r} \right)^{1/(1-r)}
\le \left( \frac{1}{1 - \varepsilon} \right)^{1/(1-r)} R^{-r/(1-r)} \left( \|a\|_r \right)^{1/(1-r)}.
Thus,
A(a, R) \le \inf_{0 < \varepsilon < 1} \|a - b\|^2 \le \|a\|_r^{2/(1-r)} R^{-2r/(1-r)};
\|b_{t,\varepsilon}\|_H \le \frac{K(a, t)}{(1 - \varepsilon) t} \le \frac{1}{1 - \varepsilon} \sup_{u > 0} \{f(u)\} < R
and
\|a - b_{t,\varepsilon}\| \le \frac{K(a, t)}{1 - \varepsilon}.
Hence
A(a, R) \le \inf_{t > 0} \|a - b_{t,\varepsilon}\|^2 \le \inf_{t > 0} \{K(a, t)\}^2 / (1 - \varepsilon)^2
\le \inf_{t > 0} \left\{ \|a\|_r t^r \right\}^2 / (1 - \varepsilon)^2 = 0.
This again proves the desired error estimate. Hence the first statement of the
theorem holds.
Conversely, suppose that A(a, R) \le C R^{-2r/(1-r)} for all R > 0. Let t > 0.
Choose R_t = (\sqrt{C}/t)^{1-r}. Then, for any \varepsilon > 0, we can find b_{t,\varepsilon} \in H such that
\|b_{t,\varepsilon}\|_H \le R_t \quad \text{and} \quad \|a - b_{t,\varepsilon}\|^2 \le C R_t^{-2r/(1-r)} (1 + \varepsilon)^2.
It follows that
K(a, t) \le \|a - b_{t,\varepsilon}\| + t \|b_{t,\varepsilon}\|_H \le \sqrt{C}\, R_t^{-r/(1-r)} (1 + \varepsilon) + t R_t \le 2(1 + \varepsilon) C^{(1-r)/2} t^r.
Letting \varepsilon \to 0,
K(a, t) \le 2 C^{(1-r)/2} t^r.
Proof. Take B = L^2_{\rho_X} and H = H_K^+ with the norm inherited from H_K. Then
\|b\| \le \sqrt{\lambda_1}\, \|b\|_H for all b \in H. The statement now follows from Theorem 4.16
taking r = \theta/(2 + \theta).
Proof of Theorem 4.1 Take H_K^+ as in Corollary 4.17 and r = \theta/(2 + \theta).
If f_\rho \in \operatorname{Range}(L_K^{\theta/(4+2\theta)}), then f_\rho = L_K^{\theta/(4+2\theta)} g for some g \in L^2_{\rho_X}. Without
loss of generality, we may take g = \sum_{\lambda_k > 0} a_k \phi_k. Then \|g\|^2 = \sum_{\lambda_k > 0} a_k^2 < \infty
and f_\rho = \sum_{\lambda_k > 0} a_k \lambda_k^{\theta/(4+2\theta)} \phi_k.
We show that f_\rho \in (L^2_{\rho_X}, H_K^+)_r. Indeed, for every t \le \sqrt{\lambda_1}, there exists N such that
\lambda_{N+1} < t^2 \le \lambda_N.
Choose f = \sum_{k=1}^{N} a_k \lambda_k^{\theta/(4+2\theta)} \phi_k \in H_K^+. We can see from Theorem 4.8 that
\|f\|_K^2 = \left\| \sum_{k=1}^{N} a_k \lambda_k^{(\theta/(4+2\theta)) - 1/2} \sqrt{\lambda_k}\, \phi_k \right\|_K^2 = \sum_{k=1}^{N} a_k^2 \lambda_k^{-2/(2+\theta)} \le \lambda_N^{-2/(2+\theta)} \|g\|^2.
In addition,
\|f_\rho - f\|_{L^2_{\rho_X}}^2 = \left\| \sum_{k > N} a_k \lambda_k^{\theta/(4+2\theta)} \phi_k \right\|_{L^2_{\rho_X}}^2 = \sum_{k > N} a_k^2 \lambda_k^{\theta/(2+\theta)} \le \lambda_{N+1}^{\theta/(2+\theta)} \|g\|^2.
Hence
K(f_\rho, t) \le \|f_\rho - f\|_{L^2_{\rho_X}} + t \|f\|_K \le \lambda_{N+1}^{\theta/(4+2\theta)} \|g\| + t \lambda_N^{-1/(2+\theta)} \|g\|.
K(f_\rho, t) \le \|g\| \cdot 2 t^{\theta/(2+\theta)} = 2\|g\| t^r.
Since K(f_\rho, t) \le \|f_\rho\|_{L^2_{\rho_X}} \le \lambda_1^{r/2} \|g\|, we can also see that for t > \sqrt{\lambda_1},
K(f_\rho, t)/t^r \le \|g\| holds. Therefore, f_\rho \in (L^2_{\rho_X}, H_K^+)_r and \|f_\rho\|_r \le 2\|g\|. It
follows from Theorem 4.16 that
A(f_\rho, R) \le \inf_{f \in H_K^+,\, \|f\|_K \le R} \|f - f_\rho\|_{L^2_{\rho_X}}^2 \le \left( 2\|g\| \right)^{2/(1-r)} R^{-2r/(1-r)} = 2^{2+\theta} \|g\|^{2+\theta} R^{-\theta}.
Then
Write f_\rho = \sum_k c_k \phi_k and f_m = \sum_k b_k^{(m)} \phi_k. Then, for all 0 < \epsilon < r,
\sum_{2^{-2m} \le \lambda_k < 2^{-2(m-1)}} \frac{c_k^2}{\lambda_k^{r - 2\epsilon}} \le 2 \sum_{2^{-2m} \le \lambda_k < 2^{-2(m-1)}} \frac{\left( c_k - b_k^{(m)} \right)^2}{\lambda_k^{r - 2\epsilon}} + 2 \sum_{2^{-2m} \le \lambda_k < 2^{-2(m-1)}} \frac{\left( b_k^{(m)} \right)^2}{\lambda_k^{r - 2\epsilon}}
\le 2^{1 + 2m(r - 2\epsilon)} \|f_\rho - f_m\|_{L^2_{\rho_X}}^2 + 2^{1 + 2(1-m)(1 - r + 2\epsilon)} \|f_m\|_K^2 \le C^{2/(2+\theta)} 2^{5 - 4m\epsilon} + C^{2/(2+\theta)} 2^{5 + 2(1-r) + 4(1-m)\epsilon}.
Therefore,
\sum_{\lambda_k < 1} \frac{c_k^2}{\lambda_k^{r - 2\epsilon}} \le C^{2/(2+\theta)} \frac{160}{16^{\epsilon} - 1} < \infty.
This means that f_\rho \in \operatorname{Range}(L_K^{\theta/(4+2\theta) - \epsilon}).
4.6 An example
In this section we describe a simple example for the approximation error in
RKHSs.
Example 4.19 Let X = [−1, 1], and let K be the spline kernel given in
Example 2.15, that is, K(x, y) = max{1 − |x − y|/2, 0}. We claim that HK
is the Sobolev space H 1 (X ) with the following equivalent inner product:
Assume now that \rho_X is the Lebesgue measure. For \theta > 0 and a function
f_\rho \in L^2[-1, 1], we also claim that A(f_\rho, R) = O(R^{-\theta}) if and only if
\|f_\rho(x + t) - f_\rho(x)\|_{L^2[-1, 1-t]} = O(t^{\theta/(2+\theta)}).
To prove the first claim, note that we know from Example 2.15 that K is
a Mercer kernel. Also, K_x \in H^1(X) for any x \in X. To show that (4.3) is
the inner product in H_K, it is sufficient to prove that \langle f, K_x \rangle_K = f(x) for any
f \in H^1(X) and x \in X. To see this, note that K_x' = \frac{1}{2}\chi_{[-1,x)} - \frac{1}{2}\chi_{(x,1]} and
K_x(-1) + K_x(1) = 1. Then,
\langle f, K_x \rangle_K = \frac{1}{2} \int_{-1}^{x} f'(y)\, dy - \frac{1}{2} \int_{x}^{1} f'(y)\, dy + \frac{1}{2}(f(-1) + f(1)) = f(x).
and
\|f_t\|_{H^1(X)} = \left\| \frac{1}{t} (f(x + t) - f(x)) \right\|_{L^2} \le C t^{r-1}.
Hence
\le 2 \|f - g\|_{L^2} + t \|g\|_{H^1(X)}.
This proves the statement and, with it, the second claim.
Theorems 4.10 and 4.12 are for general nondegenerate measures ν on a compact
space X . For an extension to a noncompact space X see [123].
The map \Phi in Theorem 4.14 is called the feature map in the literature on
learning theory [37, 107, 134]. More general characterizations for the decay of
the approximation error being of type O(\varphi(R)) with \varphi decreasing on (0, +\infty)
can be derived from the literature on approximation theory (e.g., [87, 94]) by
means of K-functionals and moduli of smoothness. For interpolation spaces
see [16].
RKHSs generated by general spline kernels are described in [137]. In the
proof of Example 4.19 we have used a standard technique in approximation
theory (see [78]). Here the function needs to be extended outside [−1, 1] for
defining f_t, or the L^2 norm should be taken on [-1, 1 - t]. For simplicity, we
have omitted this discussion.
The characterization of the approximation error described in Section 4.5 is
taken from [113].
Consider the approximation for the ERM scheme with a general loss function
\psi in Section 3.5. The target function f_H^\psi minimizes the generalization error E^\psi
over H. If we minimize instead over the set of all measurable functions we
obtain a version (w.r.t. \psi) of the regression function.
Definition 4.20 Given the regression loss function \psi, the \psi-regression function
is given by
f_\rho^\psi(x) = \operatorname{arg\,min}_{t \in \mathbb{R}} \int_Y \psi(y - t) \, d\rho(y|x), \quad x \in X.
\sup_{t, t' \in [-M, M]} \frac{|\psi(t) - \psi(t')|}{|t - t'|^s} = C < \infty, \qquad (4.4)
then
E^\psi(f) - E^\psi(f_\rho^\psi) \le C \int_X |f(x) - f_\rho^\psi(x)|^s \, d\rho_X
\le C \|f - f_\rho^\psi\|_{L^1_{\rho_X}}^{s} \le C \|f - f_\rho^\psi\|_{L^2_{\rho_X}}^{s}.
E^\psi(f) - E^\psi(f_\rho^\psi) \ge \frac{c}{2} \|f - f_\rho^\psi\|_{L^2_{\rho_X}}^2.
5
Estimating covering numbers
The bounds for the sample error described in Chapter 3 are in terms of, among
other quantities, some covering numbers. In this chapter, we provide estimates
for these covering numbers when we take a ball in an RKHS as a hypothesis
space. Our estimates are given in terms of the regularity of the kernel. As a
particular case, we obtain the following.
Theorem 5.1 Let $X \subset \mathbb{R}^n$ be compact and $K$ a Mercer kernel on $X$.
(i) If $K \in C^s(X \times X)$ for some $s > 0$, then
$$\ln \mathcal N\bigl(I_K(B_R), \eta\bigr) \le C\, (\mathrm{Diam}(X))^n\, \|K\|^{n/s}_{C^s(X \times X)} \Bigl(\frac{R}{\eta}\Bigr)^{2n/s}, \qquad \forall\, 0 < \eta \le R/2.$$
(ii) If $K(x, y) = \exp\bigl(-\|x - y\|^2/\sigma^2\bigr)$ for some $\sigma > 0$, then, for all $0 < \eta \le R/2$,
$$\ln \mathcal N\bigl(I_K(B_R), \eta\bigr) \le n \Bigl(32 + \frac{640\, n\, (\mathrm{Diam}(X))^2}{\sigma^2}\Bigr)^{n+1} \Bigl(\ln \frac{R}{\eta}\Bigr)^{n+1}.$$
Moreover, in this case there is a constant $C_1 > 0$ such that
$$\ln \mathcal N\bigl(I_K(B_R), \eta\bigr) \ge C_1 \Bigl(\ln \frac{R}{\eta}\Bigr)^{n/2}.$$
Part (i) of Theorem 5.1 follows from Theorem 5.5 and Lemma 5.6. It shows
how the covering number decreases as the index s of the Sobolev smooth kernel
increases. A case where the hypothesis of Part (ii) applies is that of the box spline
kernels described in Example 2.17. We show this is so in Proposition 5.25.
When the kernel is analytic (that is, better than Sobolev smooth for every index $s > 0$), one can see from Part (i) that $\ln \mathcal N\bigl(I_K(B_R), \eta\bigr)$ grows more slowly than $(R/\eta)^{\varepsilon}$ for any $\varepsilon > 0$. Hence one would expect a growth rate such as $\bigl(\ln(R/\eta)\bigr)^{s}$ for some $s$. This is exactly what Part (ii) of Theorem 5.1 shows for Gaussian
for some s. This is exactly what Part (ii) of Theorem 5.1 shows for Gaussian
kernels. The lower bound stated in Part (ii) also tells us that the upper bound
is almost sharp. The proof for Part (ii) is given in Corollaries 5.14 and 5.24
together with Proposition 5.13, where an explicit formula for the constant C1
can be found.
5.1 Reminders IV
To prove the main results of this chapter, we use some basic knowledge from
function spaces and approximation theory.
Approximation theory studies the approximation of functions by functions in
some “good” family – for example, polynomials, splines, wavelets, radial basis
functions, ridge functions. The quality of the approximation usually depends
on, in addition to the size of the approximating family, the regularity of the
approximated function. In this section, we describe some common measures of
regularity for functions.
(I) Consider functions on an arbitrary metric space $(X, d)$. Let $0 < s \le 1$. We say that a continuous function $f$ on $X$ is Lipschitz-$s$ when there exists a constant $C > 0$ such that for all $x, y \in X$,
$$|f(x) - f(y)| \le C\, (d(x, y))^{s}.$$
We denote by $\mathrm{Lip}(s)$ the space of all Lipschitz-$s$ functions with the norm
$$\|f\|_{\mathrm{Lip}(s)} := |f|_{\mathrm{Lip}(s)} + \|f\|_{C(X)}, \qquad |f|_{\mathrm{Lip}(s)} := \sup_{x \ne y \in X} \frac{|f(x) - f(y)|}{(d(x, y))^{s}}.$$
r
r
rt f (x) := (−1)r−j f (x + jt).
j
j=0
Xr,t = {x ∈ X | x, x + t, . . . , x + rt ∈ X }.
1/p
−s
| f |Lip∗(s,L p (X )) := sup t |rt f (x)|p dx
t∈Rn Xr,t
f ∗
Lip (s,L p (X ))
:= | f |Lip∗(s,L p (X )) + f L p (X ) .
−s
| f |Lip∗(s,C (X )) := sup t sup |rt f (x)|
t∈Rn x∈Xr,t
and
f ∗
Lip (s,C (X ))
:= | f |Lip∗(s,C (X )) + f C (X ) .
f ∗
Lip (s,L p (Rn ))
≤ CX ,s,p f ∗
Lip (s,L p (X ))
.
When s ∈ N, H s (Rn ) coincides with the Sobolev space defined in Section 2.3.
Note that for a function f ∈ H s (Rn ), its regularity s is tied to the decay of
f . The larger s is, the faster the decay of f is. These subspaces of L 2 (Rn ) and
those described in (II) are related as follows. For any > 0,
f C d (Rn ) ≤ CX ,s,d f ∗
Lip (s,L 2 (X ))
. (5.1)
BR (E) = {x ∈ E : x ≤ R}.
N
|x| :=
x
j j ,
e x = (x1 , . . . , xN ) ∈ RN .
j=1
Let η > 0. Suppose that N (BR , η) > ((2R/η)+1)N . Then BR cannot be covered
by ((2R/η)+1)N balls with radius η. Hence we can find elements f (1) , . . . , f ()
in BR such that
2R N 4
j−1
> +1 and f ( j) ∈ B( f (i) , η), ∀j ∈ {2, . . . , }.
η
i=1
|x(i) − x( j) | > η.
Also, |x( j) | ≤ R.
Denote by Br the ball of radius r > 0 centered on the origin in (RN , | |). Then
4 η
x(j) + B1 ⊆ B
R+ η2 ,
2
j=1
and the sets in this union are disjoint. Therefore, if µ denotes the Lebesgue
measure on RN ,
⎛ ⎞
η
4
η
µ⎝ x(j) + B1 ⎠ = µ x(j) + B1 ≤ µ B R+ η2 .
2 2
j=1 j=1
It follows that
η N $ % η N $ %
µ B1 ≤ R + µ B1 ,
2 2
and thereby
N N
R + (η/2) 2R
≤ = +1 .
η/2 η
R n/s $ % R n/s
Cs ≤ ln N BR (Lip∗(s, C ([0, 1]n ))), η ≤ Cs , (5.2)
η η
where the positive constants Cs and Cs depend only on s and n (i.e., they are
independent of R and η).
We will not prove the bound (5.2) here in all its generality (references for a
proof can be found in Section 5.6). However, to give an idea of the methods
involved in such a proof, we deal next with the special case 0 < s < 1 and
n = 1. Recall that in this case, Lip∗(s, C ([0, 1])) = Lip(s). Since B1 (Lip(s)) ⊂
B1 (C ([0, 1])), we have N (B1 (Lip(s)), η) = 1 for all η ≥ 1.
Proposition 5.4 Let $0 < s < 1$ and $X = [0, 1]$. Then, for all $0 < \eta \le \tfrac{1}{4}$,
$$\frac{1}{8}\Bigl(\frac{1}{2\eta}\Bigr)^{1/s} \le \ln \mathcal N\bigl(B_1(\mathrm{Lip}(s)), \eta\bigr) \le 4 \Bigl(\frac{4}{\eta}\Bigr)^{1/s}.$$
The restriction $\eta \le \tfrac{1}{4}$ is required only for the lower bound.
Proof. We first deal with the upper bound. Set ε = (η/4)1/s . Define x =
{xi = iε}di=1 , where d = 1ε denotes the integer part of 1/ε. Then x is an ε-net
of X = [0, 1] (i.e., for all x ∈ X , the distance from x to x is at most ε). If
f ∈ B1 (Lip(s)), then f C (X ) ≤ 1 and −1 ≤ f (xi ) ≤ 1 for all i = 1, . . . , d .
Hence, (νi − 1) η2 ≤ f (xi ) ≤ νi η2 for some νi ∈ J := {−m + 1, . . . , m}, where
m is the smallest integer greater than η2 . For ν = (ν1 , . . . , νd ) ∈ J d define
η η
Vν := f ∈ B1 (Lip(s)) | (νi − 1) ≤ f (xi ) ≤ νi for i = 1, . . . , d .
2 2
5
Then B1 (Lip(s)) ⊆ ν∈J d Vν . If f , g ∈ Vν , then, for each i ∈ {1, . . . , d },
η η η
− ≤ f (xi+1 ) − f (xi ) − (νi+1 − νi ) ≤ .
2 2 2
η η
|νi+1 − νi | − ≤ | f (xi+1 ) − f (xi )| ≤ |xi+1 − xi |s ≤ εs .
2 2
This yields |νi+1 −νi | ≤ η2 (ε s + η2 ) = 32 and then νi+1 ∈ {νi −1, νi , νi +1}. Since
ν1 has 2m possible values, the number of nonempty Vν is at most 2m · 3d −1 .
Therefore,
1/s
2 2 1
ln +1 ≤ ≤2
η η η
and hence
1/s 1/s 1/s
4 1 4
ln N (B1 (Lip(s)), η) ≤ ln 2 + ln 3 + 2 ≤4 .
η η η
We now prove the lower bound. Set ε = (2η)1/s and x as above. For
i = 1, . . . , d − 1, define fi to be the hat function of height η2 on the interval
[xi − ε, xi + ε]; that is,
⎧ η η
⎪
⎨ 2 − 2ε t if 0 ≤ t ≤ ε
η η
fi (xi + t) = + 2ε t if −ε ≤ t < 0
⎪
⎩
2
0 if t ∈ [−ε, ε].
Observe that fI is piecewise linear on each [xi , xi+1 ] and its values on xi are
either η2 or 0. Hence fI C (X ) ≤ η2 ≤ 12 . To evaluate the Lipschitz-s seminorm
of fI we take x, x + t ∈ X with t > 0. If t ≥ ε, then
| fI (x + t) − fI (x)| 2 fI C (X ) η 1
≤ ≤ s = .
t s ε s ε 2
| fI (x + t) − fI (x)| (η/2ε)t η η 1
s
≤ s
= t 1−s < ε 1−s = .
t t 2ε 2ε 4
and hence
| fI (x + t) − fI (x)| η η 1
≤ t 1−s ≤ ε −s = .
ts 2ε 2 4
Thus, in all three cases, | fI |Lip(s) ≤ 12 , and therefore fI Lip(s) ≤ 1. This shows
fI ∈ B1 (Lip(s)).
Finally, since η ≤ 41 , we have ε ≤ 12 and d ≥ 2. It follows that
which implies
1/s 1/s
1 1 1 1
ln N (B1 (Lip(s)), η) ≥ ln 2 − ln 4 ≥ .
2 2η 8 2η
Now we can give some upper bounds for the covering number of balls in
RKHSs. The bounds depend on the regularity of the Mercer kernel. When the
kernel K has Sobolev or generalized Lipschitz regularity, we can show that
the RKHS HK can be embedded into a generalized Lipschitz space. Then an
estimate for the covering number follows.
Here (0, t) denotes the vector in R2n where the first n components are zero.
By hypothesis, K ∈ Lip∗(s, C (X × X )). Hence
r
K(x + jt, x) ≤ |K| ∗ t s .
(0,t) Lip (s)
This yields
⎛ ⎞1/2
r
r 7
|rt f (x)| ≤ f K ⎝ |K|Lip∗(s) t s⎠
≤ 2r |K|Lip∗(s) f K t s/2
.
j
j=0
Therefore,
7
| f |Lip∗( s ) ≤ 2r |K|Lip∗(s) f K.
2
Combining this inequality with the fact (cf. Theorem 2.9) that
f ∞ ≤ K ∞ f K,
x − x∗
f ∗ (x) = f .
D
Then f ∗ ∈ Lip∗(s, C (x∗ + D[0, 1]n )) if and only if f ∈ Lip∗(s, C ([0, 1]n )).
Moreover,
D−s f ∗
Lip (s,C ([0,1]n ))
≤ f∗ ∗
Lip (s,C (x∗ +D[0,1]n ))
≤ f ∗
Lip (s,C ([0,1]n ))
.
Let {f1 , . . . , fN } be an η2 -net of BCX ,s Ds R (Lip∗(s, C ([0, 1]n ))) and N its cover-
ing number. For each j = 1, . . . , N take a function gj∗ |X ∈ BR (Lip∗(s, C (X )))
with gj ∈ BCX ,s Ds R (Lip∗(s, C ([0, 1]n ))) and gj − fj Lip∗(s,C ([0,1]n )) ≤ η2 if it
exists. Then {gj∗ |X | j = 1, . . . , N } provides an η-net of BR (Lip∗(s, C (X ))):
each f ∈ BR (Lip∗(s, C (X ))) can be written as the restriction g ∗ |X to X of
some function g ∈ BCX ,s Ds R (Lip∗(s, C ([0, 1]n ))), so there is some j such that
g − fj Lip∗(s,C ([0,1]n )) ≤ η2 . This implies that
f − gj∗ |X ∗
Lip (s,C (X ))
≤ g ∗ − gj∗ ∗
Lip (s,C (x∗ +D[0,1]n ))
≤ g − gj ∗
Lip (s,C ([0,1]n ))
≤ η.
Theorem 5.5 and the upper bound in (5.2) yield upper-bound estimates for the
covering numbers of RKHSs when the Mercer kernel has Sobolev regularity.
2n/s
R
ln N (IK (BR ), η) ≤ C ,
η
5.3 Covering numbers for analytic kernels
Recall that, given distinct nodes $t_0, \dots, t_s$, the associated Lagrange interpolation polynomials are
$$\prod_{j \in \{0, 1, \dots, s\} \setminus \{l\}} \frac{t - t_j}{t_l - t_j}, \qquad l \in \{0, 1, \dots, s\}.$$
For $s \in \mathbb{N}$ and $l \in \{0, 1, \dots, s\}$ define
$$w_{l,s}(t) := \sum_{j=l}^{s} (-1)^{j-l}\, \frac{st(st-1)\cdots(st-j+1)}{j!} \binom{j}{l}. \qquad (5.3)$$
Since
$$\sum_{l=0}^{j} (-1)^{j-l} \binom{j}{l} z^{l} = (z - 1)^{j},$$
we obtain
$$\sum_{l=0}^{s} w_{l,s}(t)\, z^{l} = \sum_{j=0}^{s} \frac{st(st-1)\cdots(st-j+1)}{j!}\, (z - 1)^{j}. \qquad (5.4)$$
In particular, $\sum_{l=0}^{s} w_{l,s}(t) \equiv 1$. In addition, it can be easily checked that
$$w_{l,s}\Bigl(\frac{m}{s}\Bigr) = \delta_{l,m}, \qquad l, m \in \{0, 1, \dots, s\}. \qquad (5.5)$$
This means that the $w_{l,s}$ are the Lagrange interpolation polynomials at the nodes $\{j/s\}_{j=0}^{s}$, and hence
$$w_{l,s}(t) = \prod_{j \in \{0,1,\dots,s\} \setminus \{l\}} \frac{t - j/s}{l/s - j/s} = \prod_{j \in \{0,1,\dots,s\} \setminus \{l\}} \frac{st - j}{l - j}.$$
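The following sketch (illustrative; the degree $s$ is arbitrary) evaluates $w_{l,s}$ through the product formula above and checks numerically the two facts just derived: the partition of unity $\sum_l w_{l,s}(t) \equiv 1$ and the interpolation property (5.5).

```python
import numpy as np

def w(l, s, t):
    """w_{l,s}(t) = prod_{j != l} (s*t - j)/(l - j), the Lagrange polynomial at nodes j/s."""
    return np.prod([(s * t - j) / (l - j) for j in range(s + 1) if j != l])

s = 5
for t in np.linspace(0.0, 1.0, 7):
    assert abs(sum(w(l, s, t) for l in range(s + 1)) - 1.0) < 1e-9   # partition of unity
for m in range(s + 1):
    assert np.allclose([w(l, s, m / s) for l in range(s + 1)], np.eye(s + 1)[m])  # (5.5)
print("checks passed for s =", s)
```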
The norm of these polynomials (as elements in C ([0, 1])) can be estimated as
follows:
Lemma 5.9 Let $s \in \mathbb{N}$, $l \in \{0, 1, \dots, s\}$. Then, for all $t \in [0, 1]$,
$$|w_{l,s}(t)| \le s \binom{s}{l}.$$
When l ∈ {m + 1, . . . , s},
: :l−1 :s
m
j=0 (st − j) j=m+1 (st − j) j=l+1 (st − j)
|wl,s (t)| =
l!(s − l)!
(m + 1)!(s − m)! s
≤ ≤s .
(l − m)l!(s − l)! l
n
wα,N (x) = wαj ,N (xj ), x = (x1 , . . . , xn ), α = (α1 , . . . , αn ). (5.6)
j=1
and, for θ ∈ [− 12 , 21 ]n ,
−iθ·Nx 1 n−1 N
e − (x)e −iθ·α
w α,N ≤ n 1 + 2N max |θj |
1≤ j≤n
(5.8)
α∈XN
holds.
∞
Nt(Nt − 1) · · · (Nt − j + 1)
z Nt = (1 + (z − 1))Nt = (z − 1) j .
j!
j=0
−iη·Nt N
Nt(Nt − 1) · · · (Nt − j + 1) −iη
e − (e j
− 1)
j!
j=0
∞
Nt(Nt − 1) · · · (Nt − j + 1) j
≤ |η| ≤ |η|N .
j!
j=N +1
N N Nt(Nt − 1) · · · (Nt − j + 1) −iη
wl,N (t)e −iη·l j
= j!
(e − 1)
l=0 j=0
1
≤ 1 + |η|N ≤ 1 + (5.9)
2N
and
−iη·Nt N
e − wl,N (t)e −iη·l
≤ |η| .
N
(5.10)
l=0
N −iθm ·αm
αm =0 wαm ,N (xm )e for m = 1, 2, . . . , n. We have
2 3
−iθ ·Nx n m−1
e − wα,N (x)e −iθ·α = e −iθs ·Nxs
α∈XN m=1 s=1
⎡ ⎤ ⎡ ⎤
N n N
× ⎣e−iθm ·Nxm − wαm ,N (xm )e−iθm ·αm ⎦ ⎣ wαs ,N (xs )e−iθs ·αs ⎦ .
αm =0 s=m+1 αs =0
Applying (5.9) to the last term and (5.10) to the middle term, we see that this
expression can be bounded by
n N
1 n−m
1 n−1 N
max |θj | 1+ ≤n 1+ max |θj | .
1≤ j≤n 2N 2N 1≤ j≤n
m=1
The domain of this function is split into two$ parts. %In the first part, ξ ∈
N
[−N /2, N /2]n , and therefore, for j = 1, . . . , n, |ξj |/N ≤ 2−N ; hence this
first part decays exponentially quickly as N becomes large. In the second part,
ξ ∈ [−N /2, N /2]n , and therefore ξ is large when N is large. The decay of k
(which is equivalent to the regularity of k; see Part (III) in Section 5.1) yields
the fast decay of ϒk on this second part. For more details and examples of
bounding ϒk (N ) by means of the decay of k (or, equivalently, the regularity
of k), see Corollaries 5.12 and 5.16.
Theorem 5.11 Assume that k is an even function in L 2 (Rn ) and k(ξ ) > 0
almost everywhere on Rn . Let K(x, t) = k(x − t) for x, t ∈ [0, 1]n . Suppose
limN →∞ ϒk (N ) = 0. Then, for 0 < η < R2 ,
R
ln N (IK (BR ), η) ≤ (N + 1)n ln 8 k(0)(N + 1)n/2 (N 2N )n (5.11)
η
η 2
ϒk (N ) ≤ . (5.12)
2R
≤ f K {QN (x)}
1/2
,
where {QN (x)}1/2 is the HK -norm of the function Kx − α∈XN wα,N (x)Kα/N .
It is explicitly given by
α
QN (x) := k(0) − 2 wα,N (x)k x −
N
α∈XN
α−β
+ wα,N (x)k wβ,N (x). (5.13)
N
α,β∈XN
we obtain
2
−n iξ ·(x− Nα )
QN (x) = (2π) k(ξ )1 − wα,N (x)e dξ
Rn α∈XN
2
−i ξ ·Nx
−i Nξ ·α
= (2π)−n
k(ξ )e − wα,N (x)e
N
dξ.
Rn α∈XN
Now we separate this integral into two parts, one with ξ ∈ [− N2 , N2 ]n and the
other with ξ ∈ [− N2 , N2 ]n . For the first region, (5.8) in Lemma 5.10 with θ = Nξ
tells us that
2
−i ξ ·Nx
−i Nξ ·α
k(ξ )e N − wα,N (x)e dξ
ξ ∈[−N /2,N /2]n α∈XN
1 2n−2 |ξj | 2N
≤ n2 1 + k(ξ ) dξ.
2N N
1≤ j≤n ξ ∈[−N /2,N /2]
n
For the second region, we apply (5.7) in Lemma 5.10 and obtain
2
ξ ξ
k(ξ )e−i N ·Nx − wα,N (x)e−i N ·α d ξ
ξ ∈[−N /2,N /2]n α∈XN
≤ (1 + (N 2N )n )2 k(ξ ) d ξ .
ξ ∈[−N /2,N /2]n
Hence
⎧
⎨ α
sup k(0) − 2 wα,N (x)k x −
x∈[0,1]n ⎩ α∈XN
N
⎫
α−β ⎬
+ wα,N (x)k wβ,N (x) ≤ ϒk (N ). (5.14)
N ⎭
α,β∈XN
d − cl 2 (XN ) ≤ .
We have covered IK (BR ) by balls with centers α∈XN cαl wα,N (x) and radius
η. Therefore,
#XN
2r
N (IK (BR ), η) ≤ +1 .
That is,
2r
ln N (IK (BR ), η) ≤ (N + 1)n ln +1
R
≤ (N + 1)n ln 8 k(0)(N + 1)n/2 (N 2N )n .
η
Corollary 5.12 Let σ > 0, X = [0, 1]n , and K(x, y) = k(x − y) with
x 2
k(x) = exp − 2 , x ∈ Rn .
σ
n
R 54n R 90n2
ln N (IK (BR ), η) ≤ 3 ln + 2 +6 (6n + 1) ln + 2 + 11n + 3
η σ η σ
(5.15)
n+1
R
ln N (IK (BR ), η) ≤ 4n (6n + 2) ln . (5.16)
η
√
k(ξ ) = (σ π)n e−σ ξ
2 2 /4
. (5.17)
when = j. Hence
√ |ξj | 2N
(2π)−n (σ π)n e−σ ξ
2 2 /4
dξ
ξ ∈[−N /2,N /2]n N
√
σ π N /2 −σ 2 |ξj |2 /4 |ξj | 2N
≤ e d ξj
2π −N /2 N
2N
1 2 1
≤√ N+ .
π σN 2
π σ
8n −(σ 2 /16)N 2
= √ e .
σ π
and
(1 + (N 2N )n )2 ≤ 21−2n+4Nn .
It follows that
N
2 n42−n −(σ 2 /16)N 2 +4nN ln 2
ϒk (N ) ≤ n3 e + √ e .
σ eN
2 σ π
80n ln 2 R
N≥ + 3 ln + 5.
σ2 η
Then, by checking the cases σ ≥ 1 and σ < 1, we see that (5.12) is valid for
any 0 < η < R2 . By Theorem 5.11,
n
R 80n ln 2 5 R
ln N (IK (BR ), η) ≤ 3 ln + +6 ln 2 nN + ln + ln 8
η σ2 2 η
n
R 54n R 90n2
≤ 3 ln + 2 +6 (6n + 1) ln + 2 + 11n + 3 .
η σ η σ
n+1
R
ln N (IK (BR ), η) ≤ 4n (6n + 2) ln .
η
Then
(i) If X ⊆ x∗ + [−/2, /2]n for some x∗ ∈ X , then, for all η, R > 0,
Proof.
m
(i) Denote t0 = ( 12 , 12 , . . . , 12 ) ∈ [0, 1]n . Let g = i=1 ci Kxi ∈ IK (BR ). Then
m
m
xi − x∗ xj − x ∗
g 2
K = ci cj k(xi − xj ) = ci c j k + t0 − − t0
i,j=1 i,j=1
m 2
m
xi − x ∗ xj − x∗
= ci cj K + t0 , + t0 = ci K xi −x∗ ,
+t0
i,j=1 i=1 K
m
ci K
xi −x∗ ∈ IK (BR ).
+t0
i=1
This shows that if we define fj∗ (x) := fj (((x − x∗ )/) + t0 ), the set
{f1∗ , . . . , fN∗ } is an η-net of the function set { mi=1 ci Kxi ∈ IK (BR )} in C (X ).
Since this function set is dense in IK (BR ), we have
m
m
m
f (t) = ci k((t − ti )) = ci k(x − xi ) = ci Kxi (x),
i=1 i=1 i=1
m
m
f 2
K
= ci cj K (ti , tj ) = ci cj k((ti − tj ))
i,j=1 i,j=1
m
= ci cj K(xi , xj ) = g 2
K ≤ R,
i,j=1
where g = m i=1 ci Kxi . So, g ∈ IK (BR ) and we have g − gj C (X ) =
supx∈X |g(x) − gj (x)| ≤ η for some j ∈ {1, . . . , N }. But for x = x(t) ∈ X ,
m
m
g(x) = ci k(x − xi ) = ci k((t − ti )) = f (t).
i=1 i=1
It follows that
sup f (t) − gj (x∗ + (t − t0 )) ≤ sup g(x) − gj (x) ≤ η.
t∈[0,1]n x∈X
This shows that if we define gj∗ (t) := gj (x∗ +(t−t0 )), the set {g1∗ , . . . , gN∗ }
is an η-net of IK (BR ) in C ([0, 1]n ).
We now note that the upper bound in Part (ii) of Theorem 5.1 follows from
Corollary 5.14. We next apply Theorem 5.11 to kernels with exponentially
decaying Fourier transforms.
Theorem 5.15 Let k be as in Theorem 5.11, and assume that for some constants
C0 > 0 and λ > n(6 + 2 ln 4),
k(ξ ) ≤ C0 e−λ ξ
, ∀ξ ∈ Rn .
√
Denote := max{1/eλ, 4n /eλ/2 }. Then for 0 < η ≤ 2R C0 (2n−1)/4 ,
n
4 R 4 R
ln N (IK (BR ), η) ≤ ln + 1 + C1 + 1 ln + C2
ln (1/) η ln (1/) η
(5.19)
holds, where
?
2 ln(32C0 ) C0 n/2 −3/8 n C1
C1 := 1 + , C2 := ln 8 2 ( 2 ) .
ln (1/) λ
Proof. Let N ∈ N and 1 ≤ j ≤ n. Since |ξj |/N < 1 for ξ ∈ [−N /2, N /2]n ,
we have
|ξj | 2N |ξj | 2N
k(ξ ) d ξ ≤ C0 e−λ ξ dξ
ξ ∈[−N /2,N /2]n N ξ ∈[−N /2,N /2]n N
C0
≤ N N n−1 |ξj |N e−λ|ξj | d ξj
N ξj ∈[−N /2,N /2]
2C0
≤ N n−1 N !
λ +1 N N
N
√ N n−1/2
≤ 2C0 2π2(1/12)+1 ,
(eλ)N +1
the last inequality by Stirling’s formula. Hence the first term of ϒk (N ) is at
most
1 2n−2 √ N n−(1/2)
n3 1 + (2π)−n 2C0 2π2(1/12)+1
2N (eλ)N +1
N +1
1
≤ 4C0 N n−(1/2) .
eλ
ξ1 = r cos θ1
ξ2 = r sin θ1 cos θ2
ξ3 = r sin θ1 sin θ2 cos θ3
..
.
ξn−1 = r sin θ1 sin θ2 . . . sin θn−2 cos θn−1
ξn = r sin θ1 sin θ2 . . . sin θn−2 sin θn−1 ,
where $r \in (0, \infty)$, $\theta_1, \dots, \theta_{n-2} \in [0, \pi)$, and $\theta_{n-1} \in [0, 2\pi)$. For a radial function $f(\|\xi\|)$ we have
$$\int_{r_1 \le \|\xi\| \le r_2} f(\|\xi\|)\, d\xi = w_{n-1} \int_{r_1}^{r_2} f(r)\, r^{n-1}\, dr,$$
where
$$w_{n-1} = \int_0^{2\pi} \int_0^{\pi} \cdots \int_0^{\pi} \sin^{n-2}\theta_1\, \sin^{n-3}\theta_2 \cdots \sin\theta_{n-2}\; d\theta_1\, d\theta_2 \cdots d\theta_{n-2}\, d\theta_{n-1} = 2\pi \prod_{j=1}^{n-2} \int_0^{\pi} \sin^{n-j-1}\theta_j\, d\theta_j = \frac{2\pi^{n/2}}{\Gamma(n/2)}.$$
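As a quick numerical check of the constant $w_{n-1} = 2\pi^{n/2}/\Gamma(n/2)$, the sketch below (the tested dimensions are arbitrary) compares it with the surface area of the unit sphere $S^{n-1}$ obtained from the volume of the unit ball.

```python
import math

def w_closed_form(n):
    return 2.0 * math.pi ** (n / 2) / math.gamma(n / 2)        # 2*pi^(n/2)/Gamma(n/2)

def w_from_ball_volume(n):
    # |S^{n-1}| = n * vol(B^n), with vol(B^n) = pi^(n/2)/Gamma(n/2 + 1)
    return n * math.pi ** (n / 2) / math.gamma(n / 2 + 1)

for n in (2, 3, 4, 7):
    print(n, w_closed_form(n), w_from_ball_volume(n))          # the two columns coincide
```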
n
1 (n − 1)! N n−j
= ... = e−λN /2 .
λj (n − j)! 2
j=1
2π n/2 1 (n − 1)!
n n−j
N
≤ C0 e−λN /2
(n/2) (2n) j (n − j)! 2
j=1
2C0 π n/2
≤ 2N n−1 e−λN /2 .
(n/2)
It follows that the second term in ϒk (N ) is bounded by
n 2 2C0 π n/2 n−1 −λN /2 4n N
1 + N 2N (2π)−n 2N e ≤ 4C0 N 3n .
(n/2) eλ/2
Combining the two bounds above, we have
N +1 N
1 4n
ϒk (N ) ≤ 4C0 N n−1/2 + 4C0 N 3n ≤ 8C0 N 3n N .
eλ eλ/2
Since λ > n(6 + 2 ln 4), the definition of yields
1 4n
≤ max , < e−3n .
e(2n ln 4 + n) en ln 4+3n
ϒk (N ) ≤ 8C0 N /2 . (5.21)
√
Thus, for 0 < η ≤ 2R C0 (2n−1)/4 , we may take N ∈ N such that N ≥
4n/ ln(1/) and N > 2 to obtain
η 2
8C0 N /2 ≤ ≤ 8C0 (N −1)/2 . (5.22)
2R
R
ln N (IK (BR ), η) ≤ (N + 1)n ln 8 k(0)(N + 1)n/2 (N 2N )n .
η
Now, by (5.22),
2 ln(32C0 ) 4 R
N ≤1+ + ln .
ln(1/) ln(1/) η
Also, since
we have
n
4 R 2 ln(32C0 )
ln N (IK (BR ), η) ≤ ln + 2 + ln 8 k(0)2n/2
ln(1/) η ln (1/)
Proof. For any > 0, we know that there are positive constants C0 ≥ 1,
depending only on α, and C0∗ , depending only on α and , such that
C0∗ e−(c+) ξ
≤ k(ξ ) ≤ C0 e−c ξ
∀ξ ∈ Rn . (5.23)
Then we can apply Theorem 5.15 with λ = c, and the desired estimate
follows.
5.4 Lower bounds for covering numbers
Definition 5.17 Let $S$ be a compact set in a metric space and $\eta > 0$. The packing number $\mathcal M(S, \eta)$ is the largest integer $m \in \mathbb{N}$ such that there exist $m$ points $x_1, \dots, x_m \in S$ that are η-separated; that is, the distance between $x_i$ and $x_j$ is greater than $\eta$ whenever $i \ne j$.
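Packing numbers are easy to bound from below empirically: a greedy sweep keeps a point whenever it is more than η away from every point kept so far, and the number of kept points is a lower bound for $\mathcal M(S, \eta)$. The sketch below (the set $S$, the metric, and η are placeholders) illustrates this for a sample of the unit ball of $(\mathbb{R}^3, \|\cdot\|_\infty)$.

```python
import numpy as np

def greedy_packing(points, eta, dist):
    """Greedily select an eta-separated subset; its size is a lower bound for M(S, eta)."""
    kept = []
    for p in points:
        if all(dist(p, q) > eta for q in kept):
            kept.append(p)
    return kept

rng = np.random.default_rng(1)
S = rng.uniform(-1.0, 1.0, size=(5000, 3))      # sample of the unit ball of (R^3, sup norm)
sup_dist = lambda p, q: np.max(np.abs(p - q))
print(len(greedy_packing(S, eta=0.5, dist=sup_dist)))   # lower bound for M(B_1, 0.5)
```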
The lower bounds for the packing numbers are presented in terms of the Gramian matrix
$$K[\mathbf x] = \bigl(K(x_i, x_j)\bigr)_{i,j=1}^{m}, \qquad (5.24)$$
the associated functions
$$u_i(x) = \sum_{j=1}^{m} \bigl(K[\mathbf x]^{-1}\bigr)_{i,j} K_{x_j}(x), \qquad i = 1, \dots, m, \qquad (5.25)$$
and the quadratic form
$$Q(w) = \sum_{i,j=1}^{m} w_i K(x_i, x_j) w_j - 2 \sum_{i=1}^{m} w_i K(x, x_i) + K(x, x), \qquad w \in \mathbb{R}^m.$$
Proof.
(i) ⇒ (ii). The nodal function property implies that the nodal functions
{ui } are linearly independent. Hence (i) implies (ii), since the
m-dimensional space span{ui }m i=1 is contained in span{Kxi }i=1 .
m
K[x]d = 0
satisfies
2 ⎧ ⎫
m ⎨ ⎬
m m
d K = d K(x , x )d = 0.
j xj i
⎩ i j j
⎭
j=1 i=1 j=1
K
Then the linear independence of {Kxj }m j=1 implies that the linear
system has only the zero solution; that is, K[x] is invertible.
(iii) ⇒ (iv). When K[x] is invertible, the functions { fi }m i=1 given by fi =
m −1 ) K satisfy
j=1 (K[x] i,j xj
m
fi (xj ) = (K[x]−1 )i, K(x , xj ) = (K[x]−1 K[x])i,j = δij .
=1
(ui (x))m
i=1 .
When the RKHS has finite dimension , then, for any m ≤ , we can find
nodal functions {uj }m
j=1 associated with some subset x = {x1 , . . . , xm } ⊆ X ,
whereas for m > no such nodal functions exist. When dim HK = ∞, then,
for any m ∈ N, we can find a subset x = {x1 , . . . , xm } ⊆ X that possesses a set
of nodal functions.
M(IK (BR ), η) ≥ 2m − 1
2
1 R
K[x]−1 2 < .
m η
m
ui (x) = (K[x]−1 )ij Kxj (x), i = 1, . . . , m.
j=1
For each nonempty subset J of {1, . . . , m}, we define $the function % uJ (x) :=
u (x), where η > η satisfies K[x]−1 2 . These 2m − 1
j∈J η j 2 < 1
m R/η
functions are η-separated in C (X ).
For J1 = J2 , there exists some j0 ∈ {1, . . . , m} lying in one of the sets J1 , J2 ,
but not in the other. Hence
What is left is to show that the functions uJ lie in BR . To see this, take
∅ = J ⊆ {1, . . . , m}. Then
8 m
9
m
−1 −1
uJ K = η
2
K[x] Kx , η K[x]
K xs
j js
j∈J =1 j ∈J s=1 K
m
m
= η2 K[x]−1 K[x]−1 (K[x])s
j j s
j,j ∈J =1 s=1
m
= η2 K[x]−1 = η2 (K[x])−1 e
jj i
j,j ∈J i=1
√
≤ η2 (K[x])−1 e 1 (J ) ≤ η2 m (K[x])−1 e 2 (J )
√
≤ η2 m e (K[x])−1 2 = η2 m (K[x])−1 2 ,
k(ξ ) > 0, ∀ξ ∈ Rn .
−1
K[x]−1 2 ≤ N −n inf k(ξ ) .
ξ ∈[−N π,N π] n
Bounding from below the integral over the subset [−N π , N π ]n , we see that
2
c K[x]c ≥ (2π )
T −n
N n
inf k(η) iα·ξ
cα e d ξ
η∈[−N π,N π]n
[−π,π]n
α
= c 2
2 (XN )
Nn inf k(ξ ) .
ξ ∈[−N π,N π]n
Nn inf k(ξ ) ,
ξ ∈[−N π,N π]n
from which the estimate for the norm of the inverse matrix follows.
Combining Theorem 5.21 and Proposition 5.22, we obtain the following
result.
Theorem 5.23 Suppose K(x, y) = k(x − y) is a Mercer kernel on X = [0, 1]n
and the Fourier transform of k is positive. Then, for N ∈ N,
η
ln N IK (BR ), ≥ ln M(IK (BR ), η) ≥ ln 2{N n − 1},
2
provided N satisfies
η 2
inf k(ξ ) ≥ .
ξ ∈[−N π ,N π]n R
n
2 1 R √
ln N (IK (BR ), η) ≥ ln 2 ln + ln(σ π )
σπ n η
n/2
2
− + 1 ln 2 − ln 2.
n
sin t 2 2 2
4
t ≤ 1 + |t|
≤
1 + t2
.
n
sin((ξ · bj )/2) 2
4n
≤: 2 .
(ξ · bj )/2 n
1 + (ξ · bj )/2
j=1 j=1
n
ξ · bj 2 n ηj2 1 2
n
1
1 + = 1 + ≥ 1 + ηj = 1 + η 2 .
2 4 4 4
j=1 j=1 j=1
2s
4n
(1 + ξ ) |k(ξ )|2 d ξ ≤
2 p
Rn min{1, |λ0 |2 /4}
2s−p
1
dξ < ∞
Rn 1+ ξ 2
(ii) (Inverse multiquadrics) For K(x, t) = (c2 + |x − t|2 )−α with α > 0 we have
6
Logarithmic decay of the approximation error
Id
L 2 (X ) → Lρ2X (X ).
We call Dµρ the distortion of ρ (with respect to µ). It measures how much
ρX distorts the ambient measure µ. It is often reasonable to suppose that the
distortion Dµρ is finite.
Since ρ is not known, neither, in general is Dµρ . In some cases, however,
the context may provide some information about Dµρ . An important case is the
one in which, despite ρ not being known, we do know ρX . In this case Dµρ
may be derived.
In Theorem 6.1 we assume Sobolev regularity only for the approximated
function fρ . To have better approximation orders, more information about ρ
should be used: for instance, analyticity of fρ or degeneracy of the marginal
distribution ρX .
then fρ is C ∞ on X .
r
r
rt fρ (x) = rt ( fρ −g)(x)+rt g(x) = (−1)r−j ( fρ −g)(x + jt)+rt g(x).
j
j=0
Let = 2s(2 + θ )/θ. Using the triangle inequality and the definition of
∗
Lip (/2,C (X ))
, it follows that
1/2
r
r
|rt fρ (x)|2 dx ≤ fρ − g Lµ2 (X ) + rt g Lµ2 (X )
Xr,t j
j=0
/2
≤ 2r fρ − g Lµ2 (X ) + µ(X ) g ∗
Lip (/2,C (X ))
t .
7
g ∗
Lip (/2,C (X ))
≤ 2r+1 K ∗
Lip ()
g K.
√
Also, d ρX (x) ≥ C0 dx implies that fρ − g Lµ2 (X ) ≤ (1/ C0 ) fρ − g Lρ2 (X )
X
and µ(X ) ≤ 1/C0 . By taking the infimum over g ∈ HK , we see that
1/2
1
|rt fρ (x)|2 dx ≤√ inf 2r fρ − g Lρ2 (X )
Xr,t C0 g∈HK X
7
+ 2r+1 K Lip∗() g K t /2
1 r 7 r+1
≤√ 2 + 2 K Lip∗()
C0
inf fρ − g Lρ2 (X ) + t /2 g K .
g∈HK X
≤ C0 t θ/2(2+θ)
= C0 t s ,
where C0 may be taken as the norm of fρ in the interpolation space. It follows
that
1/2
−s
| fρ |Lip∗(s,L 2 (X )) = sup t |rt fρ (x)|2 dx
µ
t∈Rn Xr,t
1 r 7 r+1
≤√ 2 + 2 K ∗
Lip ()
C0 < ∞.
C0
m
K(x, x)−2 wi K(x, xi )+ wi K(xi , xj )wj = K(x, x)−2K(x, x )+K(x , x ).
i=1 i,j = 1
Hence
K (x) ≤ 2Cdxs .
In particular, if X = [0, 1] and x = { j/N }N j=0 , then dx ≤ 2N , and therefore
1
−s
K (x) ≤ 2 CN . We obtain a polynomial decay with exponent s.
1−s
s−1
wl,s−1 (t)p(l/(s − 1)) = p(t).
l=0
s−1 (j)
f (t)
f (y) = (y − t)j + Rs ( f )(y, t),
j!
j=0
implies that
s−1 s−1
wl,s−1 (t)( f (l/(s − 1)) − f (t)) = wl,s−1 (t)Rs ( f )(l/(s − 1), t) .
l=0 l =0
n
wα,s−1 (x) = wαj ,s−1 (xj ),
j=1
defined by (5.6).
Now we can estimate K (x) for C s kernels as follows.
Theorem 6.3 Let X = [0, 1]n , s ∈ N, and K be a Mercer kernel on X such that
for each α ∈ Nn with |α| ≤ s,
∂α ∂ |α|
K(x, y) = K(x, y) ∈ C ([0, 1]2n ).
∂yα ∂y1α1 · · · ∂ynαn
wγ ,s−1 (t) if α = β + γ , γ ∈ {0, . . . , s − 1}n
wα =
0 otherwise.
But γ ∈{0,...,s−1}n wγ ,s−1 (t) ≡ 1. So the above expression equals
β +γ
wγ ,s−1 (t) K(x, x) − K x, + wγ ,s−1 (t)
N
γ ∈{0,...,s−1} n γ ∈{0,...,s−1}n
⎧ ⎫
⎨ β +γ β +η β +γ ⎬
wη,s−1 (t) K , −K ,x .
⎩ N N N ⎭
η∈{0,...,s−1}n
$
Using Equation (6.2) for %the univariate function g(z) = f γ1 /(s − 1), . . . ,
(γi−1 )/(s − 1), z, ti+1 , . . . , tn with z ∈ [0, 1] and all the other variables fixed,
we get
s−1
γi (s − 1)2s−1 ∂ s f
w (t ) g − g(t ) ≤ .
γi ,s−1 i
s−1
i ∂t s
γi =0 s! i C ([0,1]n )
Using Lemma 5.9 for γj , j = i, we conclude that for a function f on [0, 1]n and
for i = 1, . . . , n,
γ1 γi−1 γi
wγ ,s−1 (t) f ,..., , , ti+1 , . . . , tn
s−1 s−1 s−1
γ ∈{0,...,s−1}n
γ1 γi−1
−f ,..., , ti , ti+1 , . . . , tn ≤ ((s − 1)2s−1 )n−1
s−1 s−1
s
(s − 1)2 s−1 ∂ f
.
s! ∂t s
i C ([0,1]n )
β+γ
Applying this estimate to the functions f (t) = K x, β+(s−1)t
N and K N ,
β+(s−1)t
N , we find that the expression for K can be bounded by
n s
((s − 1)2s−1 )n s s ∂ K
1 + ((s − 1)2 )
s−1 n .
s! N ∂ys
i C (X ×X )
i=1
This bound is valid for each x ∈ X . Therefore, we obtain the required estimate
for K (x) by taking the supremum for x ∈ X .
Theorem 6.4 Let X = [0, 1]n and K(x, y) = k(x − y) be a Mercer kernel on X
with
k(ξ ) ≤ C0 e−λ|ξ | , ∀ξ ∈ Rn
for some constants C0 > 0 and λ > 4 + 2n ln 4. Then, for x = { Nα }α∈{0,1,...,N −1}n
with N ≥ 4n/ ln min{eλ, 4−n eλ/2 }, we have
N /2
1 4n
K (x) ≤ 4C0 max , .
eλ eλ/2
K (x) ≤ ϒk (N ).
But the assumption of the kernel here verifies the condition in Theorem 5.15.
Thus, we can apply the estimate (5.21) for ϒk (N ) to draw our conclusion
here.
6.3 Estimating the approximation error in RKHSs
$$I_{\mathbf x}(f)(x) = \sum_{i=1}^{m} f(x_i)\, u_i(x), \qquad x \in X,\ f \in C(X). \qquad (6.3)$$
Proof. For $x \in X$,
$$I_{\mathbf x}(f)(x) - f(x) = \sum_{i=1}^{m} f(x_i) u_i(x) - f(x) = \sum_{i=1}^{m} u_i(x) \langle K_{x_i}, f \rangle_K - \langle K_x, f \rangle_K = \Bigl\langle \sum_{i=1}^{m} u_i(x) K_{x_i} - K_x,\ f \Bigr\rangle_K.$$
Since
$$Q(w) = K(x, x) - 2 \sum_{i=1}^{m} w_i K(x, x_i) + \sum_{i,j=1}^{m} w_i K(x_i, x_j) w_j = \Bigl\| \sum_{i=1}^{m} w_i K_{x_i} - K_x \Bigr\|_K^2,$$
taking $w = (u_i(x))_{i=1}^m$ and using the definition of $\Lambda_K(\mathbf x)$ gives
$$\Bigl\| \sum_{i=1}^{m} u_i(x) K_{x_i} - K_x \Bigr\|_K \le \Lambda_K(\mathbf x).$$
It follows that
$$|I_{\mathbf x}(f)(x) - f(x)| \le \Lambda_K(\mathbf x)\, \|f\|_K.$$
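Combining (6.3) with the nodal functions gives the usual kernel interpolant: exact at the nodes, with uniform error controlled by the bound just stated. A minimal sketch (Gaussian kernel, nodes, and test function are illustrative choices):

```python
import numpy as np

def kernel_interpolant(kernel, nodes, f_values):
    """I_x(f)(x) = sum_i f(x_i) u_i(x); by (5.25) this equals k_x^T K[x]^{-1} f|_x."""
    G = np.array([[kernel(a, b) for b in nodes] for a in nodes])
    coeffs = np.linalg.solve(G, np.asarray(f_values))            # K[x]^{-1} f|_x
    return lambda x: sum(c * kernel(xi, x) for c, xi in zip(coeffs, nodes))

gauss = lambda a, b: np.exp(-((a - b) / 0.15) ** 2)
f = lambda x: np.cos(3.0 * x)
nodes = np.linspace(0.0, 1.0, 9)
If = kernel_interpolant(gauss, nodes, f(nodes))

print(max(abs(If(x) - f(x)) for x in nodes))                      # ~0: exact at the nodes
print(max(abs(If(x) - f(x)) for x in np.linspace(0.0, 1.0, 401))) # uniform error on [0, 1]
```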
−1/2
k (r) := inf k(ξ ) ,
[−rπ,rπ] n
n
where g|x is the vector (g(xi ))i∈XN ∈ RN . It follows that
Ix (g) 2
K ≤ K[x]−1 2 g|x 2
2 (XN )
= K[x]−1 2 |g(xi )|2 .
i∈XN
2
| fM (xj )| =2 (2π)−n fM (ξ )e i(j/N )·ξ
d ξ
j∈XN j∈XN ξ ∈[−M π,M π]n
2
≤ (2π)−n fM (N ξ )eij·ξ N n d ξ
j∈XN ξ ∈[−π,π]n
≤ (2π)−n fM (N ξ )N n 2 d ξ ≤ N n f 2 2 .
L
ξ ∈[−π ,π] n
Then
Ix ( fM ) 2
K ≤ K[x]−1 2 N n f 2L 2 .
$ %2
But, by Proposition 5.22, K[x]−1 2≤N
−n k (N ) . Therefore,
Ix ( fM ) K ≤ f L 2 k (N ).
1/2
k(0) − 2 uj (x)k(x − xj ) + ui (x)k(xi − xj )uj (x) ,
j∈XN i,j∈XN
fM − Ix ( fM ) C (X ) ≤ f L 2 k (M )ϒk (N ).
+ f L 2 k (M )ϒk (N )
with
Ix ( fM ) K ≤ f L 2 k (N ).
Choose N = N (M ) ≥ M such that k (M )ϒk (N ) → 0 as M → +∞. We
then have f − Ix ( fM ) L 2 (X ) → 0. Also, the RKHS norm of Ix ( fM ) is
asymptotically controlled by k (N ).
We can now state the main estimates for the approximation error for balls
in the RKHS HK on X = [0, 1]n . Denote by −1 k the inverse function of the
nondecreasing function k ,
−1
k (R) := max{r > 0 : k (r) ≤ R}, for R > 1/k(0).
Theorem 6.7 Let X = [0, 1]n , s > 0, and f ∈ H s (Rn ). Then, for R > f 2,
& −s
'
inf f − g L 2 (X ) ≤ inf k (M ) f 2 ϒk (NR ) + f s (π M ) ,
g K ≤R 0 < M ≤ NR
where NR = −1
k (R/ f 2 ), the integer part of −1
k (R/ f 2 ). If s > n2 , then
f s
inf f −g C (X ) ≤ inf k (M ) f 2 ϒk (NR ) + √ M (n/2)−s .
g K ≤R 0 < M ≤ NR s − n/2
Proof. Take N to be NR .
Let M ∈ (0, N ]. Set the function fM as in Lemma 6.6. Then, by Lemma 6.6,
Ix ( fM ) K ≤ f 2 k (N ) ≤ R
and
f − Ix ( fM ) L 2 (X ) ≤ fM − Ix ( fM ) C (X ) + f − fM L 2 (X )
−s
≤ k (M ) f 2 ϒk (N ) + f s (π M ) .
If s > n2 , then
f s n
f − fM C (X ) ≤ (2π)−n |f (ξ )|d ξ ≤ √ M 2 −s .
ξ ∈[−M π,M π ]n s − n/2
Corollary 6.8 Let X = [0, 1]n , s > 0, and f ∈ H s (Rn ). If for some
α1 , α2 , C1 , C2 > 0, one has
k(ξ ) ≥ C1 (1 + |ξ |)−α1 , ∀ξ ∈ Rn
and
ϒk (N ) ≤ C2 N −α2 , ∀N ∈ N,
R −γ
−2s/α1
inf f − g L 2 (X ) ≤ C3 f 2 + C1 f s ,
g K ≤R f 2
4α2 s
α1 (α1 +2s) if α1 + 2s ≥ 2α2
γ= 2s
α1 , if α1 + 2s < 2α2 .
where
α2 (4s−2n)
α1 (α1 +2s−n) if α1 + 2s − n ≥ 2α2 ,
γ = 2s−n
α1 , if α1 + 2s − n < 2α2 .
1 √
k (r) ≤ √ (1 + nπ r)α1 /2 .
C1
−1
2/α1
k (R/ f 2) ≥ C1 (R/ f 2)
2/α1
and
2/α1
NR ≥ 12 C1 (R/ f 2)
2/α1 .
Also,
−α2
ϒk (N ) ≤ C2 [−1
k (R/ f 2 )] ≤ 2α2 C2 (C1 R/ f 2)
−2α2 /α1
.
Take M = 12 C1
2/α1
(R/ f γ /s
2) with γ as in the statement. Then, M ≤ NR , and
we can see that
R −γ
−2s/α1
inf f − g L 2 (X ) ≤ C3 f 2 + C1 f s .
g K ≤R f 2
This proves the first statement of the corollary. The second statement can be
proved in the same way.
Corollary 6.9 Let X = [0, 1]n , s > 0, and f ∈ H s (Rn ). If for some
α1 , α2 , δ1 , δ2 , C1 , C2 > 0, one has
& '
k(ξ ) ≥ C1 exp −δ1 |ξ |α1 , ∀ξ ∈ Rn
and
& '
ϒk (N ) ≤ C2 exp −δ2 N α2 , ∀N ∈ N,
then, for R > (1 + A/C1 ) f 2,
−γ s
C2 B s/α1 s s/2 C1
inf f −g L 2 (X ) ≤ √ f 2 + δ1 2n f s ln R + ln ,
g K ≤R C1 f 2
γ ( n2 −s)
C2 B C1
inf f − g C (X ) ≤ B √ f 2 + f s ln R + ln ,
g K ≤R C1 f 2
√ √
Then, by Theorem 6.7, for R > (1 + (1 + exp{(2/δ1 )(2 nπ )α1 }/ C1 )) f 2,
C2 f 2 δ1 √ α1 α2
inf f − g L 2 (X ) ≤ inf √ exp ( nπ M ) − δ2 R
g K ≤R 0<M ≤R C1 2
'
+ f s (πM )−s .
Take √
−1/α1 1/α1 −1 γ
δ1 2 R C1
M= √ ,
ln
nπ f 2
√
where γ is given in our statement.
For R > f 2 / C1 , M ≤ R ≤ −1 k (R/ f 2 )
−α −α2 /α1 √ −α α1 /(γ α12 −α2 )
√
holds. Therefore, if R > exp (δ2 2 δ12 ( nπ ) )2 f 2 / C1 ,
we have
C2 f 2 −α /α √
inf f − g L 2 (X ) ≤ √ exp −δ2 2−α2 −1 δ1 2 1 ( nπ )−α2
g K ≤R C1
√ α /α
√ −γ s
R C1 2 1 s/α R C1
ln + δ1 1 2s ns/2 f s ln
f 2 f 2
√ −γ s
C2 f 2 s/α C1
≤ √ C3 + δ1 1 2s ns/2 f s ln R + ln ,
C1 f 2
−α /α √
where C3 := supx≥1 x−γ s exp{−δxα2 /α1 } and δ = δ22−α2 −1 δ1 2 1 ( nπ )−α2 .
This proves the first statement of the corollary. The second statement follows
using the same argument.
C1 > 0, δ1 = c + ε, α1 = 1
and
1 ec/2
C2 > 0, α2 = 1, δ2 = ln min ec, n .
2 4
Then α1 = α2 and the bounds of Corollary 6.9 hold with γ = 12 . This yields
the second statement of Theorem 6.1.
6.5 References and additional remarks
The interpolation approach of Section 6.3 goes back to Madych and Nelson [82], was extensively used by Wu and Schaback [147], and plays an important role in error estimates for scattered data interpolation using radial basis functions. In that literature (e.g., [147, 66]), the interpolation scheme (6.3)
is essential. What is different in learning theory is the presence of an RKHS
HK , not necessarily a Sobolev space.
Theorem 6.1 was proved in [113], and the approach in Section 6.3 was
presented in [157].
7
On the bias–variance problem
Em,δ = E : R+ → R
with confidence 1 − δ.
(ii) There is a unique minimizer R∗ of E(R).
(iii) When m → ∞, we have R∗ → ∞ and E(R∗ ) → 0.
The proof of Theorem 7.1 relies on the main results of Chapters 3, 4, and 5.
We show$ in Section 7.3 that% R∗ and E(R∗ ) $have the asymptotic% expressions
−θ/((2+θ)(1+2n/s))
R∗ = O m 1/((2+θ)(1+2n/s)) and E(R∗ ) = O m .
It follows from the proof of Theorem 7.1 that R∗ may be easily computed
from m, δ, IK , Mρ , fρ ∞ , g Lρ2 , and θ . Here g ∈ Lρ2X (X ) is such that
X
θ/(4+2θ)
LK (g) = fρ . Note that this requires substantial information about ρ and,
in particular, about fρ . The next chapter provides an alternative approach to
the one considered thus far whose corresponding bias–variance problem can be
solved without information on ρ.
Lemma 7.2 Let $c_1, c_2, \dots, c_\ell > 0$ and $s > q_1 > q_2 > \dots > q_{\ell-1} > 0$. Then the equation
$$x^{s} - c_1 x^{q_1} - c_2 x^{q_2} - \dots - c_{\ell-1} x^{q_{\ell-1}} - c_\ell = 0$$
has exactly one positive zero $x^*$. In addition, $x^* \le \max\bigl\{(\ell c_i)^{1/(s - q_i)} \mid i = 1, \dots, \ell\bigr\}$, where we set $q_\ell := 0$.

Figure 7.1

To prove the second statement, let $x > \max\bigl\{(\ell c_i)^{1/(s - q_i)} \mid i = 1, \dots, \ell\bigr\}$. Then, for $i = 1, \dots, \ell$, $c_i < \frac{1}{\ell}\, x^{s - q_i}$. It follows that
$$\sum_{i=1}^{\ell} c_i x^{q_i} < \sum_{i=1}^{\ell} \frac{1}{\ell}\, x^{s - q_i} x^{q_i} = x^{s};$$
that is, $x$ is not a zero of the equation, and the bound on $x^*$ follows.
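Since the left-hand side of the equation in Lemma 7.2 is negative at 0 and eventually positive, its unique positive zero can also be located numerically, for example by bisection on $[0, B]$ with $B$ taken beyond the bound in the lemma. A sketch with arbitrary coefficients:

```python
import numpy as np

def positive_root(s, q, c):
    """Unique positive zero of x^s - c_1 x^{q_1} - ... - c_{l-1} x^{q_{l-1}} - c_l (Lemma 7.2),
    found by bisection on [0, B], where B lies beyond the bound given in the lemma."""
    exps = list(q) + [0.0]                                   # q_l = 0 for the constant term
    F = lambda x: x ** s - sum(ci * x ** qi for ci, qi in zip(c, exps))
    B = 2.0 * max((len(c) * ci) ** (1.0 / (s - qi)) for ci, qi in zip(c, exps))
    lo, hi = 0.0, B                                          # F(0) = -c_l < 0 < F(B)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if F(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

x_star = positive_root(s=3.0, q=[2.0, 1.2], c=[0.7, 0.4, 2.0])      # illustrative data
print(x_star, x_star ** 3 - 0.7 * x_star ** 2 - 0.4 * x_star ** 1.2 - 2.0)  # residual ~ 0
```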
M = M (R) = IK R + Mρ + fρ ∞. (7.1)
n/s
(with C = C(Diam(X ))n K C s (X ×X ) and C depending on X and s but
independent of R, ε, and M ) or
2n/s
mε 1 12M 2
− ln −C ≤ 0,
300M 2 δ IK ε
c0 v d +1 − c1 v d − c2 ≤ 0, (7.2)
where d = 2ns , c0 = 300 , c1 = ln δ , and c2 = C (12/ IK ) .
m 1 d
θ/(4+2θ)
where g ∈ Lρ2X (X ) is such that LK (g) = fρ and
A( fρ , R) = inf E( f ) − E( fρ ) = inf f − fρ L
2
2 .
f ∈HK,R f K ≤R ρX
We can therefore take E(R) = A(R) + ε(R) and Part (i) is proved.
from which it follows that R∗ → ∞ when m → ∞. Note that this implies that
2+θ −θ
lim A(R∗ ) ≤ lim 22+θ g R∗ = 0.
m→∞ m→∞
v∗ (m, δ)
× = 0,
Q∗2
2 2 ∗
ε(R) = ( IK + Mρ + fρ ∞) R v (m, δ)
as an upper bound for the sample error with confidence 1 − δ. Hence, under
conditions (i) and (ii), we may choose
2 2 ∗ 2+θ −θ
E(R) = ( IK + Mρ + fρ ∞) R v (m, δ) + 22+θ g R .
Example 7.4 Let K be the spline kernel on X = [−1, 1] given in Example 4.19.
$ %
If ρX is the Lebesgue measure and fρ (x + t) − fρ (x) L 2 ([−1,1−t]) = O t θ
for some θ > 0, then A( fρ , R) = O(R−θ ). Take s = 21 and n = 1. Then we
have
E(R∗ ) = O m−θ/(5(2+θ)) .
$ %
When θ is sufficiently large, fz − fρ 2L 2 = O m−(1/5)+ε for an arbitrarily
ρX
small ε.
We now abandon the setting of a compact hypothesis space adopted thus far
and change the perspective slightly. We will consider as a hypothesis space an
RKHS HK but we will add a penalization term in the error to avoid overfitting,
as in the setting of compact hypothesis spaces.
In what follows, we consider as a hypothesis space H = HK – that is, H is
a whole linear space – and the regularized error Eγ defined by
Eγ ( f ) = (f (x) − y)2 d ρ + γ f 2
K
Z
for a fixed γ > 0. For a sample z, the regularized empirical error Ez,γ is
defined by
1
m
Ez,γ ( f ) = (yi − f (xi ))2 + γ f 2K .
m
i=1
One can consider a target function fγ minimizing Eγ ( f ) over HK and an
empirical target fz,γ minimizing Ez,γ over HK . We prove in Section 8.2 the
existence and uniqueness of these target and empirical target functions. One
advantage of this new approach, which becomes apparent from the results in
this section, is that the empirical target function can be given an explicit form,
readily computable, in terms of the sample z, the parameter γ , and the kernel K.
Our discussion of Sections 1.4 and 1.5 remains valid in this context and the
following questions concerning fz,γ require an answer: Given γ > 0, how large
is the excess generalization error E(fz,γ ) − E(fρ )? Which value of γ minimizes
the excess generalization error? The main result of this chapter provides some
answer to these questions.
∗
Theorem 8.1 Assume that K satisfies log N (B1 , η) ≤ C0 (1/η)s for some
θ/2
s∗ > 0, and ρ satisfies fρ ∈ Range(LK ) for some 0 < θ ≤ 1. Take γ∗ = m−ζ
with ζ < 1/(1 + s∗ ). Then, for every 0 < δ < 1 and m ≥ mδ , with confidence
1 − δ,
$ %2 $ %
fz,γ∗ (x) − fρ (x) d ρX ≤ C0 log 2/δ m−θζ
X
holds.
−θ/2
Here C0 is a constant depending only on s∗ , ζ , CK , M, C0 , and LK fρ ,
and mδ depends also on δ. We may take
$ %1/s∗ $ %1+1/s∗ $ %2/(ζ −1/(1+s∗ ))
mδ := max 108/C0 log(2/δ) , 1/(2c) ,
$ %1/(1+s∗ )
where c = (2CK + 5) 108C0 .
At the end of this chapter, in Section 8.6, we show that the regularization
approach just introduced and the minimization in compact hypothesis spaces
considered thus far are closely related.
The parameter γ is said to be the regularization parameter. The whole
approach outlined above is called a regularization scheme.
Note that γ∗ can be computed from knowledge of m and s∗ only. No
information on fρ is required. The next example shows a simple situation where
Theorem 8.1 applies and yields bounds on the generalization error from a simple
assumption on fρ .
Example 8.2 Let K be the spline kernel on X = [−1, 1] given in Example 4.19.
If ρX is the Lebesgue measure and fρ (x + t) − fρ (x) L 2 ([−1,1−t]) = O(t θ ) for
some 0< θ ≤ 1,then, by the conclusion of Example 4.19 and Theorem 4.1, fρ ∈
(θ −ε)/2
Range LK for any ε > 0. Theorem 5.8 also tells us that log N (B1 , η) ≤
C0 (1/η) . So, we may take s∗ = 2. Choose γ∗ = m−ζ with ζ =
2 1−2ε
3 < 31 .
Then Theorem 8.1 yields
2
E(fz,γ∗ ) − E(fρ ) = fz,γ∗ − fρ 2L 2 = O log m−(θ/3)+ε
ρX δ
with confidence 1 − δ.
The definition of fz implies that the second term is at most zero. Hence E(fz ) −
E(fρ ) + γ fz 2K is bounded by (8.1).
D(γ ) = fγ − fρ 2
ρ + γ fγ 2
K ≥ fγ − fρ 2
ρ.
We call the second term in (8.1) the sample error (this use of the expression
differs slightly from the one in Section 1.4).
In this section we give bounds for the regularized error. The bounds
(Proposition 8.5 below) easily follow from the next general result.
Proof. First note that replacing A by As and θ/(2s) by θ we can reduce the
" #−θ/(2s)
problem to the case s = 1 where A−θ a = (As )2 a.
(i) Consider
ϕ(b) = b − a 2
+ γ A−1 b 2 .
If a point b minimizes ϕ, then it must be a zero of the derivative Dϕ whose
value at b ∈ Range(A) satisfies ϕ(b + εf ) − ϕ(b) = Dϕ(b), εf + o(ε)
for f ∈ Range(A). But ϕ(b + εf ) − ϕ(b) = 2b − a, εf + 2γ A−2 b, εf +
ε2 f 2 + ε2 γ A−1 f 2 . So b satisfies (Id + γ A−2 )b = a, which implies
b = (Id+γ A−2 )−1 a = (A2 +γ Id)−1 A2 a. Note that the operator Id+γ A−2
is invertible since it is the sum of the identity and a positive (but maybe
unbounded) operator.
We use the method from Chapter 4 to prove the remaining statements. If
λ1 ≥ λ2 ≥ . . . denote the eigenvalues of A2 corresponding to normalized
eigenvectors {φk }, then
λk
b= ak φk ,
λk + γ
k≥1
where a = k≥1 ak φk . It follows that
−γ
b−a = ak φk .
λk + γ
k≥1
& 2 θ '1/2
Assume A−θ a = k ak /λk < ∞.
(ii) For 0 < θ ≤ 2, we have
−γ 2 γ 2−θ
λk θ
ak2
b−a 2
= ak2 = γ θ
k≥1
λk + γ
k≥1
λk + γ λk + γ λθk
≤ γ θ A−θ a 2 .
$√ %
(iii) For 0 < θ ≤ 1, A−1 b = k≥1 ( λk )/(λk + γ ) ak φk . Hence
γ 2 λk
b−a 2
+ γ A−1 b 2
= ak2 + γ a2
λk + γ (λk + γ )2 k
k≥1 k≥1
γ
= a2 ,
λk + γ k
k≥1
which is bounded by
γ 1−θ
λk θ
ak2
γθ ≤ γ θ A−θ a 2 .
k≥1
λk + γ λk + γ λθk
It follows that
γ2
A−1 (b − a) 2
= a2
(λk + γ )2 λk k
k≥1
γ 3−θ
λk θ−1 2
ak
= γ θ−1 ≤ γ θ−1 A−θ a 2 .
k≥1
λk + γ λk + γ λθk
Bounds for the regularized error D(γ ) follow from Theorem 8.4.
−θ/2
D(γ ) = E(fγ ) − E( fρ ) + γ fγ 2
K ≤ γ θ LK fρ 2
ρ.
−θ/2
fγ − fρ K ≤ γ (θ−1)/2 LK fρ ρ.
1/2
Proof. Apply Theorem 8.4 with H = Ł2, s = 1, A = LK , and a = fρ , and
−1/2
use that LK f = A−1 f = f K . We know that fγ is the minimizer of
min ( f − fρ 2
+γ f K)
2
= min ( f − fρ 2
+γ f K)
2
f ∈Ł2 f ∈HK
b−a 2
+ γ A−s b 2
= b − fρ 2Lρ (X ) + γ b 2
K = Eγ (b) − σρ2 .
X
$$f(x) = \sum_{\lambda_k > 0} c_k \phi_k(x) = \sum_{\lambda_k > 0} \lambda_k \Bigl(\sum_{i=1}^{m} a_i \phi_k(x_i)\Bigr) \phi_k(x) = \sum_{i=1}^{m} a_i \sum_{\lambda_k > 0} \lambda_k \phi_k(x_i) \phi_k(x) = \sum_{i=1}^{m} a_i K(x_i, x),$$
where we have applied Theorem 4.10 in the last equality. Replacing $f(x_i)$ in the definition of $a_i$ above we obtain
$$a_i = \frac{y_i - \sum_{j=1}^{m} a_j K(x_j, x_i)}{\gamma m}.$$
Multiplying both sides by γ m and writing the result in matrix form we obtain
(γ m Id + K[x])a = y, and this system is well posed since K[x] is positive
semidefinite and the result of adding a positive semidefinite matrix and the
identity is positive definite.
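Proposition 8.7 thus reduces the computation of $f_{\mathbf z,\gamma}$ to one $m \times m$ linear solve. A minimal Python sketch (Gaussian kernel and synthetic one-dimensional data are our own choices, not the book's):

```python
import numpy as np

def regularized_least_squares(kernel, X, y, gamma):
    """Solve (gamma*m*Id + K[x]) a = y and return f_{z,gamma} = sum_i a_i K_{x_i}."""
    m = len(X)
    G = np.array([[kernel(a, b) for b in X] for a in X])        # Gramian K[x]
    a = np.linalg.solve(gamma * m * np.eye(m) + G, y)
    return lambda x: sum(ai * kernel(xi, x) for ai, xi in zip(a, X))

rng = np.random.default_rng(0)
kernel = lambda a, b: np.exp(-((a - b) / 0.3) ** 2)
X = rng.uniform(-1.0, 1.0, size=50)
y = np.sin(np.pi * X) + 0.1 * rng.standard_normal(50)           # noisy samples of f_rho
f_hat = regularized_least_squares(kernel, X, y, gamma=1e-3)

grid = np.linspace(-1.0, 1.0, 201)
print(np.max(np.abs(np.array([f_hat(x) for x in grid]) - np.sin(np.pi * grid))))
```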
& '
M = inf M̄ ≥ 0 | {(x, y) ∈ Z | |y| ≥ M̄ } has measure zero
lim v ∗ (m, δ) = 0.
m→∞
∗
More quantitatively, when K is C s on X ⊂ Rn , log N (B1 , η) ≤ C0 (1/η)s with
s∗ = 2n
s (cf. Theorem 5.1(i)). In this case the following decay holds.
∗
Lemma 8.9 If the Mercer kernel K satisfies log N (B1 , η) ≤ C0 (1/η)s for
some s∗ > 0, then
1/(1+s∗ )
108 log(1/δ) 108C0
v ∗ (m, δ) ≤ max , .
m m
∗
Proof. Observe that g(η) ≤ h(η) := C0 (1/η)s − mη 54 . Since h is also strictly
decreasing and continuous on (0, +∞), we can take to be the unique positive
solution of the equation h(t) = log δ. We know that v∗ (m, δ) ≤ . The equation
h(t) = log δ can be expressed as
∗ 54 log(1/δ) s∗ 54C0
t 1+s − t − = 0.
m m
&
Then Lemma 7.2 with d = 2 yields ≤ max 108 log(1/δ)/m,
$ %1/(1+s∗ ) '
108C0 /m . This verifies the bound for v ∗ (m, δ).
Theorem 8.10 For all γ ∈ (0, 1] and 0 < δ < 1, with confidence 1 − δ,
$ %
2(CK + 3)2 M2 v ∗ (m, δ/2) 8C2K log 2/δ
E(fz ) − E(fρ ) ≤ + + 6M + 4
γ mγ
$ % $ %
48M2 + 6M log 2/δ
× D(γ ) + .
m
holds.
Theorem 8.10 will follow from some lemmas and propositions given in the
remainder of this section. Before proceeding with these results, however, we
note that from Theorem 8.10, a convergence property for the regularized scheme
follows.
Corollary 8.11 Let 0 < δ < 1 be arbitrary. Take γ = γ (m) to satisfy γ (m) →
0, limm→∞ mγ (m) ≥ 1, and γ (m)/(v ∗ (m, δ/2)) → +∞. If D(γ ) → 0, then,
for any ε > 0, there is some Mδ,ε ∈ N such that with confidence 1 − δ,
E(fz ) − E( fρ ) ≤ ε, ∀m ≥ Mδ,ε
holds.
As an example, for Cs kernels on X ⊂ Rn ,
the decay of v ∗ (m, δ) shown in
Lemma 8.9 with s∗ = 2n
s yields the following convergence rate.
∗
Corollary 8.12 Assume that K satisfies log N (B1 , η) ≤ C0 (1/η)s for some
θ/2
s∗ > 0, and ρ satisfies fρ ∈ Range(LK ) for some 0 < θ ≤ 1. Then, for all
γ ∈ (0, 1] and all 0 < δ < 1, with confidence 1 − δ,
$2%
$ %2 log δ 1
fz (x) − fρ (x) d ρX ≤ C1 + 1/(1+s∗ ) + γ θ +
X mγ m γ
2 γθ 1
log +
δ mγ m
.
−θ/2
holds, where C1 is a constant depending only on s, CK , M, C0 , and LK fρ .
∗ )) $ % 2
If γ = m$ −1/((1+θ)(1+s
% −θ/((1+θ)(1+s , then the convergence rate is X fz (x)−fρ (x) dρX ≤
∗ ))
6C1 log 2/δ m .
Proof. The proof is an easy consequence of Theorem 8.10, Lemma 8.9, and
Proposition 8.5.
For C ∞ kernels, s∗ can be arbitrarily small. Then the decay rate exhibited in
Corollary 8.12 is m−(1/2)+ε for any ε > 0, achieved with θ = 1. We improve
Theorem 8.10 in the next section, where more satisfactory bounds (with decay
rate m−1+ε ) are presented. The basic ideas of the proof are included in this
section.
To move toward the proof of Theorem 8.10, we write the sample error as
1
m
E(fz ) − Ez (fz ) + Ez (fγ ) − E(fγ ) = E(ξ1 ) − ξ1 (zi )
m
i=1
m
1
+ ξ2 (zi ) − E(ξ2 ) , (8.4)
m
i=1
where
$ %2 $ %2 $ %2 $ %2
ξ1 := fz (x) − y − fρ (x) − y and ξ2 := fγ (x) − y − fρ (x) − y .
The second term on the right-hand side of (8.4) is about the random variable
ξ2 on Z. Since its mean E(ξ2 ) = E(fγ )−E(fρ ) is nonnegative, we may apply the
Bernstein inequality to estimate this term. To do so, however, we need bounds
for fγ ∞ .
Lemma 8.13 For all γ > 0,
γ fγ 2
K ≤ E( fγ ) − E( fρ ) + γ fγ 2
K = D(γ ).
which implies that σ 2 (ξ2 ) ≤ E(ξ22 ) ≤ cD(γ ). Now we apply the one-side
Bernstein inequality in Corollary 3.6 to ξ2 . It asserts that for any t > 0,
1
m
ξ2 (zi ) − E(ξ2 ) ≤ t
m
i=1
mt 2
− $ % = log δ.
2c D(γ ) + 23 t
1
m
ξ2 (zi ) − E(ξ2 ) ≤ t ∗
m
i=1
holds. But
⎛ ! ⎞
@
2c $ % 2c 2 $ %
t ∗ = ⎝ log 1/δ + log(1/δ) + 2cm log 1/δ D(γ )⎠ m
3 3
$ % 7
4c log 1/δ $ %
≤ + 2c log 1/δ D(γ )/m.
3m
By Lemma 8.13, c ≤ 2C2K D(γ )/γ + 18M2 . It follows that
7 $ % 7 $ %
2CK D(γ )
2c log 1/δ D(γ )/m ≤ log 1/δ √ + 6M D(γ )/m
mγ
Here we have used the expressions for c, B = 2c, and the restriction R ≥ M.
What is left is to bound the covering number N (FR , ε/4). To do so, we note
that
$ % $ % $ % $ %
f1 (x) − y 2 − f2 (x) − y 2 ≤ f1 − f2 ∞
f1 (x) − y + f2 (x) − y .
η
N (FR , η) ≤ N B1 , , ∀η > 0. (8.6)
2(MR + CK R2 )
W(R) := {z ∈ Z m : fz K ≤ R}.
Proposition 8.16 For all 0 < δ < 1 and R ≥ M, there is a set VR ⊂ Z m with
ρ(VR ) ≤ δ such that for all z ∈ W(R) \ VR , the regularized error Eγ ( fz ) =
E( fz ) − E(fρ ) + γ fz 2K is bounded by
$ %
2 2 ∗ 8C2K log 2/δ
2(CK + 3) R v (m, δ/2) + + 6M + 4 D(γ )
mγ
$ % $ %
48M2 + 6M log 2/δ
+ .
m
√ $ %
Proof. Note that E( f ) − E( fρ ) + ε ε ≤ 12 E( f ) − E( fρ ) + ε. Using the
quantity v∗ (m, δ), Proposition 8.15 with ε = (CK + 3)2 R2 v ∗ (m, 2δ ) tells us that
there is a set VR ⊂ Z m of measure at most 2δ such that
$ % $ %
E(f ) − E( fρ ) − Ez ( f ) − Ez ( fρ ) ≤ 12 E( f ) − E(fρ ) + (CK + 3)2
R2 v ∗ (m, δ/2), ∀f ∈ BR , z ∈ Z m \ VR .
1
m
$ %
E(ξ1 ) − ξ1 (zi ) = E( fz ) − E( fρ ) − Ez ( fz ) − Ez ( fρ )
m
i=1
≤ 12 E( fz ) − E( fρ ) + (CK + 3)2 R2 v ∗ (m, δ/2).
Now apply Proposition 8.14 with δ replaced by 2δ . We can find another set
VR⊂ Z m of measure at most 2δ such that for all z ∈ Z m \ VR ,
$ %
1
m
4C2K log 2/δ
ξ2 (zi ) − E(ξ2 ) ≤ + 3M + 1 D(γ )
m mγ
i=1
$ % $ %
24M2 + 3M log 2/δ
+ .
m
Combining these two bounds with (8.4), we see that for all z ∈ W(R)\(VR ∪VR ),
1$ %
E(fz ) − Ez (fz ) + Ez ( fγ ) − E( fγ )≤ E(fz ) − E( fρ ) + (CK + 3)2 R2 v ∗ (m, δ/2)
2
$ %
4C2K log 2/δ
+ + 3M + 1 D(γ )
mγ
$ % $ %
24M2 + 3M log 2/δ
+ .
m
This inequality, together with Theorem 8.3, tells us that for all z ∈ W(R) \
(VR ∪ VR ),
1$ %
E(fz ) − E(fρ ) + γ fz 2
K ≤ D(γ ) + E(fz ) − E( fρ ) + (CK + 3)2
2
$ %
2 ∗ 4C2K log 2/δ
R v (m, δ/2) + + 3M + 1 D(γ )
mγ
$ % $ %
24M2 + 3M log 2/δ
+ .
m
M
fz K ≤√ .
γ
1
m
γ fz 2
K ≤ Ez,γ ( fz ) ≤ Ez,γ (0) = (yi − 0)2 ≤ M2 ,
m
i=1
√
the last almost surely. Therefore, fz K ≤ M/ γ for almost all z ∈ Z m .
√
Lemma 8.17 says that W(M/ γ ) = Z m up to a set of measure zero (we
√
ignore this null set later). Take R := M/ γ ≥ M. Theorem 8.10 follows from
Proposition 8.16.
Lemma 8.18 For all 0 < δ < 1 and R ≥ M, there is a set VR ⊂ Z m with
ρ(VR ) ≤ δ such that
W(R) ⊆ W(am R + bm ) ∪ VR ,
√
where am := (2CK + 5) v ∗ (m, δ/2)/γ and
7 $ % ! 7 $ %
2CK 2 log 2/δ √ D(γ ) (7M + 1) 2 log 2/δ
bm := √ + 6M + 4 + √ .
mγ γ mγ
fz K ≤ am R + bm , ∀z ∈ W(R) \ VR ,
R(j) = am R(j−1) + bm .
Then Lemma 8.17 proves W(R(0) ) = Z m , and Lemma 8.18 asserts that for
each j ≥ 1, W(R( j−1) ) ⊆ W(R( j) ) ∪ VR( j−1) with ρ(VR( j−1) ) ≤ δ. Apply this
$ %
∗) − ζ ≤ J ≤
inclusion
$ for j = %1, 2, . . . , J , with J satisfying 2/ 1/(1 + s
3/ 1/(1 + s∗ ) − ζ . We see that
−1
J4
Z m = W(R(0) ) ⊆ W(R(1) ) ∪ VR(0) ⊆ · · · ⊆ W(R(J ) ) ∪ VR( j) .
j=0
J −1
$ %
ζ /2−1/(2+2s∗ ) +ζ /2
R(J ) = am
J (0) j
R + bm am ≤ McJ mJ + bm ≤ McJ + bm .
j=0
Here we have used (8.8) and am ≤ 12 in the first inequality, and then the
restriction J ≥ 2/(1/(1 + s∗ ) − ζ ) > ζ /(1/(1 + s∗ ) −$ ζ ) in the% second
$ %3/ 1/(1+s∗ )−ζ
inequality. Note that cJ ≤ (2CK + 5)(108C0 + 1) . Since
γ = m−ζ , bm can be bounded as
7 $ % $ √ %
bm ≤ 2 log 2/δ 2CK + 6M + 4 D(γ )/γ + 7M + 1 .
7 $ %$√ %
Thus, R(J ) ≤ C2 log 2/δ D(γ )/γ + 1 with C2 depending only on
s∗ , ζ , CK , C0 , and M.
7 $ %
Proof of Theorem 8.1 Applying Proposition 8.16 with R := C2 log 2/δ
$√ % −θ/2
D(γ )/γ + 1 , and using that γ∗ = m−ζ and D(γ∗ ) ≤ γ∗θ LK fρ 2 , we
deduce from (8.7) that for m ≥ mδ and all z ∈ W(R) \ VR ,
$ % ∗ ∗
E(fz ) − E(fρ ) ≤ 2(CK + 3)2 C22 log 2/δ 2mζ (1−θ) (108C0 )1/(1+s ) m−1/(1+s )
$ %
+ C2 log 2/δ m−θζ
$ %
≤ C3 log 2/δ m−θζ
8.5 Reminders V
We use the following result on Lagrange multipliers.
min F( f )
s.t. H ( f ) ≤ 0.
Then, there exist real numbers µ, λ, not both zero, such that
1
m
min (f (xi ) − yi )2 + γ f 2
K
m
i=1
s.t. f ∈ HK
and
1
m
min ( f (xi ) − yi )2
m
i=1
satisfying
(i) for all γ > 0, fz,γ is the minimizer of Ez (z (γ )), and
(ii) for all R ∈ (0, R0 ), fz,R is the minimizer of Ez (−1
z (R)).
To prove Theorem 8.21, we use Proposition 8.20 for the problems Ez (R) with
U = HK , F(f ) = m1 m i=1 ( f (xi ) − yi ) , and H ( f ) = f K − R . Note that
2 2 2
HK → R
f → f (x) = f , Kx K
DH (f ) = 2f . Define
Also, for each R ∈ (0, R0 ), choose one minimizer fz,R of Ez (R) and let
1 1
m m
( f (xi ) − yi )2 < ( fz,γ (xi ) − yi )2
m m
i=1 i=1
1 1
m m
(f (xi ) − yi )2 + γ f 2
K < ( fz,γ (xi ) − yi )2 + γ z (γ )2
m m
i=1 i=1
1
m
= ( fz,γ (xi ) − yi )2 + γ fz,γ 2
K,
m
i=1
that is, the derivative of the objective function of Ez (λ) vanishes at fz,R . Since
this function is convex and Ez (λ) is an unconstrained problem, we conclude
that fz,R is the minimizer of Ez (λ) = Ez (z (R)).
Proposition 8.24 z is a decreasing global homeomorphism with inverse z .
Proof. Since K is a Mercer kernel, the matrix K[x] is positive definite by
the invertibility assumption. So there exist an orthogonal matrix P and a
diagonal matrix D such that K[x] = PDP −1 . Moreover, the main diagonal
entries d1 , . . . , dm of D are positive. Let y = P −1 y. Then, by Proposition 8.7,
−1 −1
fz,γ = m i=1 ai Kxi with a satisfying (γ mId + D)P a = P y = y . It follows
that
$$P^{-1} a = \Bigl( \frac{y'_i}{\gamma m + d_i} \Bigr)_{i=1}^{m}$$
and, using $P^{T} = P^{-1}$,
$$\|f_{\mathbf z, \gamma}\|_K = \sqrt{a^{T} K[x]\, a} = \sqrt{(P^{-1} a)^{T} D\, (P^{-1} a)} = \sqrt{\sum_{i=1}^{m} d_i \Bigl( \frac{y'_i}{\gamma m + d_i} \Bigr)^{2}}.$$
Differentiating with respect to γ,
$$\frac{d}{d\gamma} \|f_{\mathbf z, \gamma}\|_K = -\frac{m}{\|f_{\mathbf z, \gamma}\|_K} \sum_{i=1}^{m} \frac{d_i\, (y'_i)^{2}}{(\gamma m + d_i)^{3}}.$$
This expression is negative for all γ ∈ [0, +∞). This shows that z is strictly
decreasing in its domain. The first statement now follows since z is continuous,
z (0) = R0 , and z (γ ) → 0 when γ → ∞.
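The eigendecomposition above also makes the map $\gamma \mapsto \|f_{\mathbf z,\gamma}\|_K$ cheap to evaluate, and its monotonicity can be observed directly; in the sketch below the kernel and the data are illustrative.

```python
import numpy as np

def norm_of_f_z_gamma(gamma, d, y_prime, m):
    """||f_{z,gamma}||_K = sqrt( sum_i d_i * y'_i^2 / (gamma*m + d_i)^2 ),
    with d the eigenvalues of K[x] and y' = P^{-1} y."""
    return np.sqrt(np.sum(d * y_prime ** 2 / (gamma * m + d) ** 2))

rng = np.random.default_rng(0)
kernel = lambda a, b: np.exp(-((a - b) / 0.3) ** 2)
X = rng.uniform(-1.0, 1.0, size=30)
y = np.sin(np.pi * X)
G = np.array([[kernel(a, b) for b in X] for a in X])
d, P = np.linalg.eigh(G)                      # K[x] = P D P^T
y_prime = P.T @ y                             # y' = P^{-1} y

gammas = np.logspace(-6, 1, 8)
values = [norm_of_f_z_gamma(g, d, y_prime, len(X)) for g in gammas]
print(np.all(np.diff(values) < 0))            # the norm decreases as gamma grows
```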
To prove the second statement, consider γ > 0. Then, by Proposition 8.23(i)
and (ii),
fz,γ = fz,z (γ ) = fz,z (z (γ )) .
To prove that γ = z (z (γ )), it is thus enough to prove that for γ , γ ∈
(0, +∞), if fz,γ = fz,γ , then γ = γ . To do so, let i be such that yi = 0 (such
an i exists since y = $0). Since the coefficient
%m vectors for fz,γ and fz,γ are the
same a with P −1 a = yi /(γ m + di ) i=1 , we have in particular
yi y
= i ,
γ m + di γ m + di
Corollary 8.25 For all R < R0 , the minimizer fz,R of Ez (R) is unique.
Proof. Let γ = z (R). Then fz,R = fz,γ by Proposition 8.23(ii). Now use that
fz,γ is unique.
Theorem 8.21 now follows from Propositions 8.23 and 8.24 and
Corollary 8.25.
s.t. f ∈ HK
and
min ( f (x) − y)2 dρ
In [42, 43] a functional analysis approach was employed to show that for any
0 < δ < 1, with confidence 1 − δ,
2 7 $ %
E(fz,γ ) − E( fγ ) ≤ M√CK 1 + √
CK
1+ 2 log 2/δ .
m γ
Parts (i) and (ii) of Proposition 8.5 were given in [39]. Part (iii) with 1 < θ ≤ 2
was proved in [115], and the extension to 2 < θ ≤ 3 was shown by Mihn in
the appendix to [116]. In [115], a modified McDiarmid inequality was used to
derive error bounds in the metric induced by K . If fρ is in the range of LK ,
then, for any 0 < δ < 1 with confidence 1 − δ,
$ $ %%2 1/3
$ $ %%2 1/3
log 4/δ log 4/δ
fz,γ − fρ 2K ≤C by taking γ =
m m
Our target concept (in the sense of Case 1.5) is the set T := {x ∈ X | Prob{y =
1 | x} ≥ 12 }, since the conditional distribution at x is a binary distribution.
One goal of this chapter is to describe an approach to producing classifiers
from samples (and an RKHS HK ) known as support vector machines.
Figure 9.1
Theorem 9.2 Assume ρ is weakly separable by HK . Let B1 denote the unit ball
in HK .
(i) If log N (B1 , η) ≤ C0 (1/η)p for some p, C0 > 0 and all η > 0, then, taking
γ = m−β (for some β > 0), we have, with confidence 1 − δ,
r 2
1 2
R(Fz,γ ) ≤ C log ,
m δ
(ii) If log N (B1 , η) ≤ C0 (log(1/η))p for some p, C0 > 0 and all 0 < η < 1,
then, for sufficiently large m and some β > 0, taking γ = m−β , we have,
with confidence 1 − δ,
1 r 2 2
R(Fz,γ ) ≤ C (log m)p log ,
m δ
fc := sgn( fρ ).
Remark 9.4 The role played by the quantity κρ is reminiscent of that played
by σρ2 in the regression setting. Note that κρ depends only on ρ. Therefore,
its occurrence in Proposition 9.3(i) – just as that of σρ2 in Proposition 1.8 – is
independent of f . In this sense, it yields a lower bound for the misclassification
error and is, again, a measure of how well conditioned ρ is.
1
m
fz,γ = argmin ( f (xi ) − yi )2 + γ f 2
K, (9.3)
f ∈HK m
i=1
and then taking the function sgn( fz,γ ) as an approximation of fc . Note that this
strategy minimizes a functional on a set of real-valued continuous functions
and then applies the sgn function to the computed minimizer to obtain a
classifier.
A different strategy consists of first taking signs to obtain the set {sgn( f ) |
f ∈ HK } of classifiers and then minimizing an empirical error over this set.
To see which empirical error we want to minimize, note that for a classifier
f : X → Y,
R(f ) = χ{ f (x)=y} d ρ = χ{yf (x)=−1} dρ.
Z Z
By discretizing the integral into a sum, given the sample z = {(xi , yi )}m
i=1 ∈ Z ,
m
1
m
argmin χ{yi f (xi )<0} ,
f ∈HK m
i=1
f (x)=0 a.e.
1
m
argmin χ{yi f (xi )<0} . (9.4)
f ∈HK m
i=1
Note that in practical terms, we are again minimizing over HK . But we are
now minimizing a different functional.
It is clear, however, that if f is any minimizer of (9.4), so is αf for all α > 0.
This shows that the regularized version of (9.4) (regularized by adding the term
γ f 2K to the functional to be minimized) has no solution. It also shows that
we can take as minimizer a function with norm 1. We conclude that we can
approximate the Bayes rule by sgn(fz0 ), where fz0 is given by
1
m
fz0 := argmin χ{ yi f (xi )<0} . (9.5)
f ∈HK m
i=1
f K =1
We show in the next section that although we can reduce the computation
of fz0 to a nonlinear programming problem, the problem is not a convex one.
Hence, we do not possess efficient algorithms to find fz0 (cf. Section 2.7). We
also introduce a third approach that lies somewhere in between those leading to
problems (9.3) and (9.5). This new approach then occupies us for the remainder
of this (and the next) chapter. We focus on its geometric background, error
analysis, and algorithmic features.
Definition 9.5 The generalization error associated with the loss φ is defined as
φ
E (f ) := φ(yf (x)) dρ.
Z
The empirical error associated with the loss φ and a sample z ∈ Z m is defined as
1
m
φ
Ez (f ) := φ(yi f (xi )).
m
i=1
1
m
φ
Ez,γ (f ) := φ(yi f (xi )) + γ f 2
K.
m
i=1
and the least-squares loss φls = (1 − t)2 . Note that for functions f : X → R
and points x ∈ X such that f (x) = 0, φ0 (yf (x)) = χ{y=sgn(f (x))} ; that is, the
local error is 1 if y and f (x) have different signs and 0 when the signs are
the same.
Proposition 9.6 Restricted to binary classifiers, the generalization error w.r.t.
φ0 is the misclassification error R; that is, for all classifiers f ,
R(f ) = E φ0 (f ).
For the second statement, note that the generalization error E(f ) of f
satisfies
E(f ) = (y − f (x)) dρ = (1 − yf (x))2 dρ = E φls (f ),
2
Z Z
Proposition 9.7 Let K be a Mercer kernel on X , and φ a loss function. Let also
φ
B ⊆ HK , γ > 0, and z ∈ Z m . If f ∈ HK is a minimizer of Ez in B, then P(f ) is
φ φ
a minimizer of Ez in P(B). If, in addition, P(B) ⊆ B and Ez can be minimized
in B, then such a minimizer can be chosen in P(B). Similar statements hold for
φ
Ez,γ .
1
m
fz0 := argmin χ{yi f (xi )<0} .
f ∈HK,z m
i=1
f K =1
φ
Proof. Let f∗ be a minimizer of Ez 0 (f ) = m1 m i=1 χ{yi f (xi )<0} in HK ∩ {f |
φ φ
f K = 1}. By Proposition 9.7, P(f∗ ) ∈ HK,z satisfies Ez 0 (f∗ ) = Ez 0 (P(f∗ )).
φ0 φ0
If P(f∗ ) = 0, we thus have Ez (f∗ ) = Ez (P(f∗ )/ P(f∗ ) K ), showing that a
minimizer exists in HK,z ∩ {f | f K = 1}.
φ φ
If P(f∗ ) = 0, then Ez 0 (f∗ ) = Ez 0 (0) = 1, the maximal possible error. This
φ0
means that for all f ∈ HK , Ez (f ) = 1, so we may take any function in
φ
HK,z ∩ {f | f K = 1} as a minimizer of Ez 0 .
where
1
m
cz = (cz,1 , . . . , cz,m ) = argmin χ m .
m j=1 cj yi K(xi ,xj )<0
c∈Rm i=1
cT K[x]c=1
We would like thus to replace the loss φ0 by a loss φ that, on one hand,
approximates Bayes rule – for which we will require that φ is close to
the misclassification loss φ0 – and, on the other hand, leads to a convex
programming problem. Although we could do so in the setting described in
Chapter 1 (we actually did it with fz0 above), we instead consider the regularized
setting of Chapter 8.
φ m
Proof. According to Proposition 9.7, fz,γ = j=1 cz,j Kxj , where
⎛ ⎞
1
m m
cz = (cz,1 , . . . , cz,m ) = argmin φ⎝ yi K(xi , xj )cj ⎠
c∈R m m
i=1 j=1
m
+γ ci K(xi , xj )cj .
i,j=1
m
For each i = 1, . . . , m, φ j=1 yi K(xi , xj )cj = φ(yT K[x]c) is a convex
function of c ∈ R . In addition, since K is a Mercer kernel, the Gramian matrix
m
The regularized classifier associated with the hinge loss, the support
vector machine, has been used extensively and appears to have a small
misclassification error in practice. One nice property of the hinge loss φh , not
possessed by the least squares loss φls , is the elimination of the local error
φ
when yf (x) > 1. This property often makes the solution fz,γh of (9.6) sparse
φ
in the representation fz,γh = m i=1 cz,i Kxi . That is, most coefficients cz,i in this
φ
representation vanish. Hence the computation of fz,γh can, in practice, be very
fast. We return to this issue at the end of Section 9.4.
Although the definition of the hinge loss may not suggest at a first glance
any particular reason for inducing good classifiers, it turns out that there
is some geometry to explain why it may do so. We next disgress on this
geometry.
Figure 9.2
Therefore, the two classes of points are separated by the hyperplane w · x = c(w)
and satisfy
⎧
⎪
⎪w · xi − c(w) ≥ mini∈I w · xi − c(w)
⎪
⎪
⎨ = 1 {min w · x − max
2 i∈I i i∈II w · xi } = (w) if i ∈ I
⎪
⎪w · xi − c(w) ≤ maxi∈II w · xi − c(w)
⎪
⎪
⎩
= 12 {maxi∈II w · xi − mini∈I w · xi } = −(w) if i ∈ II.
Figure 9.3
Figure 9.4
max (w)
w =1
= min w · xi + min w · xi ,
i∈I i∈II
where 1
2 xi if yi = 1
xi =
− 12 xi if yi = −1,
is continuous. Therefore, achieves a maximum value over the compact set
{w ∈ Rn | w ≤ 1}. The maximum cannot be achieved in the interior of this
set; for w∗ with w∗ < 1, we have
w∗ w∗ w∗ 1
= min · x i + min · xi = (w∗ ) > (w∗ ).
w∗ i∈I w ∗ i∈II w ∗ w∗
which implies
1 ∗
2 w1 + 12 w2∗ · xi + 12 w1∗ + 12 w2∗ · xj ≥ (w1∗ ).
That is, 12 w1∗ + 21 w2∗ would be another maximizer, lying in the interior, which
is not possible.
Theorem 9.13 Assume (9.8) has a solution w∗ with (w∗ ) > 0. Then w∗ =
w/ w , where w is a solution of
min w 2
w∈Rn , b∈R (9.9)
s.t. yi (w · xi − b) ≥ 1, i = 1, . . . , m.
Then
w 1 w b
= min · xi −
w 2 yi =1 w w
w b 1
− max · xj − ≥ ,
yj =−1 w w w
and
$ %
w0 · xj − 1
2 minyi =1 w0 · xi + maxyj =−1 w0 · xj
w̄ · xj − b =
(w0 )
≤ −1 if yj = −1.
9.5 Optimal hyperplanes: the nonseparable case 171
m
min w 2 + 1
γm ξi
w∈Rn , b∈R, ξ ∈Rm i=1
(9.10)
s.t. yi (w · xi − b) ≥ 1 − ξi
ξi ≥ 0, i = 1, . . . , m.
We claimed at the end of Section 9.2 that the regularized classifier associated
with the hinge loss was related to our previous discussion of margins and
separating hyperplanes. To see why this is so we next show that the soft
margin classifier is a special example of (9.6). Recall that the hinge loss φh is
defined by
φh (t) = (1 − t)+ = max{1 − t, 0}.
1
m
min φh (yi (f (xi ) − b)) + γ f 2
K. (9.11)
f ∈HK ,b∈R m
i=1
The scheme (9.11) is the same as (9.6) with the linear kernel except for the
constant term b, called offset. 1
One motivation to consider scheme (9.6) with an arbitrary Mercer kernel is
the expectation of separating data by surfaces instead of hyperplanes only. Let
f be a function on Rn , and f (x) = 0 the corresponding surface. The two classes
I and II are separable by this surface if, for i = 1, . . . , m,
f (xi ) > 0 if i ∈ I
f (xi ) < 0 if i ∈ II;
1 We could have considered the scheme (9.11) with offset. We did not do so for simplicity of
exposition. References to work on the general case can be found in Section 9.8.
9.6 Error analysis for separable measures 173
Remark 9.15
(i) Even under the weaker condition that yfsp (x) > 0 almost surely (which
we consider in the next section), we have y = sgn(fsp (x)) almost surely.
Hence, the variance σρ2 vanishes (i.e., ρ is noise free) and so does κρ .
(ii) As a consequence of (i), fc = sgn(fsp ).
(iii) Since fsp is continuous and |fsp (x)| ≥ almost surely, it follows that if ρ
is strictly separable, then
$ %
ρX T ∩ X \ T = 0,
Figure 9.5
174 9 Support vector machines for classification
But y(fsp (x)/) ≥ 1 almost surely, that is, 1 − y(fsp (x)/) ≤ 0, so we have
$ % φ $ %
φh y(fsp (x)/) = 0 almost surely. It follows that Ez h fsp / = 0. Since
fsp /2 = 1/2 ,
K
φ φ φ γ
Ez h (fz,γh ) + γ fz,γh 2K ≤ 2
holds and the statement follows.
φ
The results in Chapter 8 lead us to expect the solution fz,γh of (9.6) to satisfy
φ φ φ
E φh (fz,γh ) → E φh (fρ h ), where fρ h is a minimizer of E φh . We next show that
φ
this is indeed the case. To this end, we first characterize fρ h . For x ∈ X , let
ηx := ProbY (y = 1 | x).
Theorem 9.17 For any measurable function f : X → R
E φh ( f ) ≥ E φh (fc )
holds.
φ
That is, the Bayes rule fc is a minimizer fρ h of E φh .
Proof. Write E φh ( f ) = X h,x (f (x)) dρX , where
h,x (t) = φh (yt) dρ(y | x) = φh (t)ηx + φh (−t)(1 − ηx ).
Y
When t = fc (x) ∈ {1, −1}, for y = fc (x) one finds that yt = 1 and φh (yt) = 0,
whereas for y = −fc (x) = fc (x), yt = −1 and φh (yt) = 2. So Y φh (yt) dρ(y |
x) = 2 Prob(y = fc (x) | x) and h,x ( fc (x)) = 2 Prob(y = fc (x) | x).
According to (9.2), Prob(y = fc (x) | x) ≤ Prob(y = s | x) for s = ±1.
Hence, h,x (fc (x)) ≤ 2 Prob(y = s | x) for any s ∈ {1, −1}.
If t ≥ 1, then φh (t) = 0 and h,x (t) = (1 + t)(1 − ηx ) ≥ 2(1 − ηx ) ≥
h,x ( fc (x)).
9.6 Error analysis for separable measures 175
If t ≤ −1, then φh (−t) = 0 and h,x (t) = (1 − t)ηx ≥ 2ηx ≥ h,x ( fc (x)).
If −1 < t < 1, then h,x (t) = (1 − t)ηx + (1 + t)(1 − ηx )≥(1 − t)
2 h,x ( fc (x)) + (1 + t) 2 h,x ( fc (x)) = h,x ( fc (x)).
1 1
When ρ is strictly separable by HK , we see that sgn( y fsp (x)) = 1 and hence
y = sgn( fsp (x)) almost surely. This means fc (x) = sgn( fsp (x)) and y = fc (x)
almost surely. In this case, we have E φh ( fc ) = Z (1−y fc (x))+ = 0. Therefore,
φ
we expect E φh ( fz,γh ) → 0. To get error bounds showing that this is the case we
write
φ φ φ φ φ φ φ φ φ γ
E φh ( fz,γh ) = E φh ( fz,γh ) − Ez h ( fz,γh ) + Ez h ( fz,γh ) ≤ E φh ( fz,γh ) − Ez h ( fz,γh ) + .
2
(9.12)
Here we have used the first inequality in Theorem 9.16. The second inequality
φ
of that theorem tells us that fz,γh lies in the set { f ∈ HK | f K ≤ 1/}. So
φ
it is sufficient to estimate E φh ( f ) − Ez h ( f ) for functions f in this set in some
uniform way. We can use the same idea we used in Lemmas 3.18 and 3.19.
Lemma 9.18 Suppose a random variable ξ satisfies 0 ≤ ξ ≤ M . Denote
µ = E(ξ ). For every ε > 0 and 0 < α ≤ 1,
m
µ− 1
i=1 ξ(zi ) √ 3α 2 mε
Prob m
√ ≥ α ε ≤ exp −
z∈Zm µ+ε 8M
holds.
Proof. The proof follows from Lemma 3.18, since the assumption 0 ≤ ξ ≤ M
implies |ξ − µ| ≤ M and E(ξ 2 ) ≤ M E(ξ ).
Lemma 9.19 Let F be a subset of C (X ) such that f C (X ) ≤ B for all f ∈ F.
Then, for every ε > 0 and 0 < α ≤ 1, we have
φ
E φh ( f ) − Ez h ( f ) √ 3α 2 mε
Prob sup ≥ 4α ε ≤ N (F, αε) exp − .
z∈Zm
f ∈F E φh ( f ) + ε 8(1 + B)
Also,
7 7
E φh ( fj ) + ε ≤ E φh ( f ) + ε + E φh ( fj ) − E φh ( f )
7 7
≤ E φh ( f ) + ε + |E φh ( fj ) − E φh ( f )|.
Therefore,
φ φ φ
E φh ( f ) − Ez h ( f ) E φh ( f ) − E φh ( fj ) Ez h ( fj ) − Ez h ( f )
= +
E φh ( f ) + ε E φh ( f ) + ε E φh ( f ) + ε
φ φ
E φh ( fj ) − Ez h ( fj ) √ E φh ( fj ) − Ez h ( fj )
+ ≤ 2α ε + .
E φh ( f ) + ε E φh ( f ) + ε
φ √
It follows that if (E φh ( f ) − Ez h ( f ))/ E φh ( f ) + ε) ≥ 4α ε for some f ∈ F,
then
φ
E φh ( fj ) − Ez h ( fj ) √
≥ 2α ε.
E φh ( f ) + ε
This, together with (9.13), tells us that
φ
E φh ( fj ) − Ez h ( fj ) √
7 ≥ α ε.
E φh ( fj ) + ε
Thus,
φ
E φh ( f ) − Ez h ( f ) √
Prob sup ≥ 4α ε
z∈Zm
f ∈F E φh ( f ) + ε
⎧ ⎫
⎪
⎨ ⎪
N φ φ
E h ( fj ) − Ez (fj )
h
√ ⎬
≤ Prob 7 ≥α ε .
z∈Z m ⎪
⎩ ⎪
⎭
j=1 E φh ( fj ) + ε
9.6 Error analysis for separable measures 177
The statement now follows from Lemma 9.18 applied to the random variable
ξ = φh ( y fj (x)), for j = 1, . . . , N , which satisfies 0 ≤ ξ ≤ 1 + fj C (X ) ≤
1 + B.
We can now derive error bounds for strictly separable measures. Recall that
B1 denotes the unit ball of HK as a subset of C (X ).
ε 3mε
log N B1 , − ≤ log δ.
4 128(1 + CK /)
In addition,
(i) If log N (B1 , η) ≤ C0 (1/η)p for some p, C0 > 0, and all η > 0, then
∗ log(1/δ) 1/(1+p)
ε (m, δ) ≤ 86(1 + CK /) max , C0
m
p/(1+p) 1/(1+p)
4 1
.
m
(ii) If log N (B1 , η) ≤ C0 (log(1/η))p for some p, C0 > 0 and all 0 < η < 1,
then, for m ≥ max{4/, 3},
φ
In particular, the function fz,γh , which belongs to F by Theorem 9.16, satisfies
φ φ φ
E φh ( fz,γh ) − Ez h ( fz,γh ) √
7 ≤ 4α ε.
φ
E φh ( fz,γh ) + ε
√ γ
t 2 − 4α εt − ε + 2 ≤ 0.
Solving
7 the associated quadratic equation and taking into account that
φ
t = E φh ( fz,γh ) + ε ≥ 0, we deduce that
?
√ √ γ
0 ≤ t ≤ 2α ε + (2α ε)2 + ε + 2 .
φ 2γ
E φh ( fz,γh ) ≤ 2ε∗ (m, δ) + .
2
A(log m)p 1 4
≥ and log + log m ≤ 2 log m.
m m
It follows that
p
A(log m)p 4 1
h ≤ C0 log + log m − (log m)p 2p C0 + log
m δ
1
≤ −(log m)p log ≤ log δ.
δ
A(log m)p
ε∗ (m, δ) ≤ ε∗ ≤ .
m
R(sgn( f )) − R( fc ) ≤ E φh ( f ) − E φh ( fc ).
180 9 Support vector machines for classification
By the definition of φh ,
0 if y = fc (x)
φh ( y fc (x)) = (1 − y fc (x))+ =
2 if y = fc (x).
Hence E φh ( fc ) = X Y φh ( y fc (x))
dρ( y|x) dρX = X 2 ProbY ( y = fc (x)|x)
dρX . Furthermore, E φh ( f ) = X Y φh ( y f (x)) dρ( y|x) dρX . Thus, it is
sufficient for us to prove that
φh ( y f (x)) dρ( y|x) − 2 Prob( y = fc (x)|x) ≥ | fρ (x)|, ∀x ∈ Xc . (9.15)
Y Y
If | f (x)| ≤ 1, then
φh ( y f (x)) dρ( y|x) − 2 Prob( y = fc (x)|x)
Y Y
− 2 Prob( y = fc (x)|x)
Y
Combining Theorems 9.20 and 9.21, we can derive bounds for the
misclassification error for the support vector machine soft margin classifier
for strictly separable measures satisfying R( fc ) = E φh ( fc ) = 0.
Corollary 9.22 Assume ρ is strictly separable by HK with margin .
(i) If log N (B1 , η) ≤ C0 (1/η)p for some p, C0 > 0 and all η > 0, then, with
confidence 1 − δ,
φ 2γ
R(sgn( fz,γh )) ≤ 2 + 172(1 + CK /)
log(1/δ) 1/(1+p) 4 p/(1+p) 1 1/(1+p)
max , C0 .
m m
(ii) If log N (B1 , η) ≤ C0 (log(1/η))p for some p, C0 > 0 and all 0 < η < 1,
then, for m ≥ max{4/, 3}, with confidence 1 − δ,
The largest θ for which there are positive constants , C0 such that (θ , , C0 )
is a separation triple is called the separation exponent of ρ (w.r.t. HK and fsp ).
1/(2+θ)
C0
fγ = −θ/(2+θ) fsp . (9.17)
γ
Then γ θ/(2+θ)
E φh ( fγ ) + γ fγ
2/(2+θ)
2
K ≤ 2C0 .
2
Proof. Write fγ = fsp /t, with t > 0 to be determined. Since y fsp (x) > 0
almost surely, the same holds for y fγ (x) > 0. Hence, φh (y fγ (x)) < 1 and
φh ( y fγ (x)) > 0 only if y fγ (x) < 1, that is, if | fγ (x)| = fsp (x)/t < 1.
Therefore,
E φh ( fγ ) = φh ( y fγ (x)) dρ = φh (| fγ (x)|) dρX
Z X
= (1 − | fγ (x)|) dρX ≤ ρX {x ∈ X : | fγ (x)|
| fγ (x)|<1
$ %1/(2+θ)
But γ fγ 2
K = γ 1/(t)2 . Setting t = γ /C0 2 proves our statement.
9.7 Weakly separable measures 183
φ
Proof. By Theorem 9.21, it is sufficient to bound E φh ( fz,γh ) as stated, since
y = sgn( fsp (x)) = fc (x) almost surely and therefore E φh ( fc ) = 0 and R( fc ) =
0.
φ φ
Choose fγ by (9.17). Decompose E φh ( fz,γh ) + γ fz,γh 2K as
φ φ φ φ φ φ φ
E φh ( fz,γh ) − Ez h ( fz,γh ) + Ez h ( fz,γh ) + γ fz,γh 2
K − Ez h ( fγ ) + γ fγ 2
K
φ
+ Ez h ( fγ ) + γ fγ 2
K.
To bound the last term, consider the random variable ξ = φh ( y fγ (x)). Since
y fγ (x) > 0 almost surely, we have 0 ≤ ξ ≤ 1. Also, σ 2 (ξ ) ≤ E(ξ ) = E φh ( fγ ).
Apply the one-side Bernstein inequality to ξ and deduce, for each ε > 0,
mε2
φ
Prob Ez h ( fγ ) − E φh ( fγ ) ≤ ε ≥ 1 − exp − .
z∈Zm
2(E φh ( fγ ) + 13 ε)
mε2 δ
− = log ,
2(E φh ( fγ ) + 13 ε) 2
δ
Thus, there exists a subset U1 of Z m with ρ(U1 ) ≥ 1 − 2 such that
φ 7 log 2δ
Ez h ( fγ ) ≤ 2E φh ( fγ ) + , ∀z ∈ U1 .
6m
φ φ φ φ
Ez h ( fz,γh ) + γ fz,γh 2
K ≤ Ez h ( fγ ) + γ fγ 2
K
γ θ/(2+θ) 7 log 2
2/(2+θ) δ
≤ 4C0 + . (9.18)
2 6m
7
−θ/(2+θ) γ −1/(2+θ) +
1/(2+θ) 2 log(2/δ)
In particular, taking R = 2C0 mγ , we have,
for all z ∈ U1 ,
φ
fz,γh ∈ F = { f ∈ HK : f K ≤ R} , ∀z ∈ U1 .
ε
a subset U2 of Z m with ρ(U2 ) ≥ 1 − N (B1 , 4R ) exp{−3mε/(128(1 + CK R))}
such that, for all f ∈ F,
φ
E φh ( f ) − Ez h (f ) √
≤ ε.
E φh ( f ) + ε
φ
In particular, when z ∈ U1 ∩ U2 , we have fz,γh ∈ F and, hence,
7
φ φ φ √ φ φ
E φh ( fz,γh ) − Ez h ( fz,γh ) ≤ ε E φh ( fz,γh ) + ε ≤ 21 E φh ( fz,γh ) + ε.
1 φh φh γ θ/(2+θ) 7 log 2
φ
E φh ( fz,γh ) ≤ E ( fz,γ ) + ε ∗ (m, δ, γ ) + 4C0 δ
2/(2+θ)
+ .
2 2 6m
It follows that
γ θ/(2+θ) 7 log 2
φ
E φh ( fz,γh ) ≤ 2ε∗ (m, δ, γ ) + 8C0 δ
2/(2+θ)
+ .
2 3m
Since ρ(U1 ∩ U2 ) ≥ 1 − δ, our first statement holds. The rest of the result,
statements (i) and (ii), follows from Theorem 9.20 after replacing by R1
and δ by 2δ .
The error analysis for support vector machines and strictly separable
distributions was already well understood in the early works on support vector
machines (see [134, 37]). The concept of weakly separable distribution was
introduced, and the error analysis for such a distribution was performed, in [31].
When the support vector machine soft margin classifier contains an offset
term b as in (9.11), the algorithm is more flexible and more general data
can be separated. But the error analysis is more complex than for scheme
φ
(9.6), which has no offset. The bound for fz,γh K becomes larger than those
shown in Theorem 9.16 and (9.18). But the approach we have used for scheme
(9.6) can be applied as well and a similar error analysis can be performed.
For details, see [31].
10
General regularized classifiers
In this chapter we extend this development in two ways. First, we remove the
separability assumption. Second, we replace the hinge loss φh by arbitrary loss
functions within a certain class. Note that it would not be of interest to consider
completely arbitrary loss functions, since many such functions would lead to
optimization problems (10.1) for which no efficient algorithm is known. The
following definition yields an intermediate class of loss functions.
Examples of classifying loss functions are the least squares loss φls (t), the
hinge loss φh , and, for 1 ≤ q < ∞, the q-norm (support vector machine) loss
defined by φq (t) := (φh (t))q .
187
188 10 General regularized classifiers
h= 1
0 1
Figure 10.1
Note that Proposition 9.11 implies that optimization problem (10.1) for a
classifying loss function is a convex programming problem. One special feature
shared by φls , φh , and φ2 = (φh )2 is that their associated convex programming
problems are quadratic programming problems. This allows for many efficient
algorithms to be applied when computing a solution of (10.1). Note that φls
differs from φ2 by the addition of a symmetric part on the right of 1.
Figure 10.1 shows the shape of some of these classifying loss functions
(together with that of φ0 ).
φ
Our goal, as in previous chapters, is to understand how close sgn( fz,γ ) is to
fc (w.r.t. the misclassification error). In other words, we want to estimate the
φ
excess misclassification error R(sgn( fz,γ )) − R( fc ). Note that in Chapter 9
we had R( fc ) = 0 because of the separability assumption. This is no longer
the case. The main result in this chapter, Theorem 10.24, this goal achieves for
various kernels K and classifying loss functions.
The following two theorems, easily derived from Theorem 10.24, become
specific for C ∞ kernels and the hinge loss φh and the least squares loss φls ,
respectively.
inf { f − fc Lρ1 + γ f K}
2
= O(γ β ). (10.2)
f ∈HK X
10.1 Bounding the misclassification error 189
inf { f − fρ 2L 2 + γ f K}
2
= O(γ β ).
f ∈HK ρX
Choose γ = 1
m. Then, for any 0 < ε < 1
2 and 0 < δ < 1, with confidence
1 − δ,
1/2 (1/2) min{β,1−ε}
φ 2 1
R(sgn( fz,γls )) − R( fc ) ≤ C log
δ m
φ
Definition 10.4 Denote by fρ : X → R any measurable function minimizing
the generalization error with respect to φ for example, for almost all
x ∈ X,
fρφ (x) := argmin φ(yt) d ρ(y | x) = argmin φ(t)ηx + φ(−t)(1 − ηx ).
t∈R Y t∈R
Our goal in this chapter is to show that under some mild conditions, for any
φ φ
classifying loss φ satisfying φ (0) > 0, we have E φ ( fz,γ ) − E φ ( fρ ) → 0 with
high confidence as m → ∞ and γ = γ (m) → 0. We saw in Chapter 9 that this
is the case for φ = φh and weakly separable measures. We begin in this section
by extending Theorem 9.21.
Theorem 10.5 Let φ be a classifying loss such that φ (0) exists and is positive.
Then there is a constant cφ > 0 such that for all measurable functions
f : X → R,
7
φ
R(sgn( f )) − R( fc ) ≤ cφ E φ ( f ) − E φ ( fρ ).
φ
To prove Theorem 10.5, we want to understand the behavior of fρ . To this
end, we introduce an auxiliary function. In what follows, fix a classifying
loss φ.
[ 2]
[ 1]
[ 0]
0 1
Figure 10.2
and
Proof.
(i) Since = x is convex, its one-side derivatives are both well defined and
nondecreasing, and − (t)$ ≤ + (t) for # every t ∈ R. Then is strictly
decreasing on the interval −∞, fρ− (x) , since − (t) < 0 on this interval.
+
"In +the same% way, + (t) > 0 for t > fρ (x), so is strictly increasing on
fρ (x), ∞ .
For t ∈ [ fρ− (x), fρ+ (x)], we have 0 ≤ − (t) ≤ + (t) ≤ 0. Hence is
constant on [fρ− (x), fρ+ (x)] and its value on this interval is its minimum.
(ii) Let x ∈ X . If we denote
φ
E ( f | x) := φ(yf (x)) d ρ(y | x) = ηx φ( f (x)) + (1 − ηx )φ(−f (x)),
Y
(10.4)
φ
then E φ ( f | x) = ( f (x)). It follows that fρ (x), which minimizes
E φ (· | x), is also a minimizer of .
(iii) Observe that
fρ (x) = ηx − (1 − ηx ) = 2 ηx − 12 . (10.5)
Proof. By the definition of φ (0), there exists some 12 ≥ c0 > 0 such that for
all t ∈ [−c0 , c0 ],
φ (t) − φ (0) φ (0)
− φ
(0)≤ .
t 2
This implies that
φ (0)
φ (0) + φ (0)t − |t| ≤ φ (t) ≤ φ (0) + φ (0)t
2
φ (0)
+ |t|, ∀t ∈ [−c0 , c0 ]. (10.7)
2
Let x ∈ X . Consider the case ηx > 1
2 first.
(0)
Denote = min{ −φ
φ (0) (ηx − 12 ), c0 }. For 0 ≤ t ≤ c0 ,
φ (0)
(t) = ηx φ (t) − (1 − ηx )φ (−t) ≤ (2ηx − 1)φ (0) + φ (0)t + t.
2
−φ (0)
Thus, for 0 ≤ t ≤ ≤ φ (0) (ηx − 12 ), we have
3 −φ (0) 1
(t) ≤ (2ηx − 1)φ (0) + φ (0) η x −
2 φ (0) 2
φ (0) 1
≤ ηx − < 0.
2 2
φ
Therefore is strictly decreasing on the interval [0, ]. But fρ (x) is its
minimal point, so
φ (0) 1
(0) − ( fρφ (x)) ≥ (0) − () ≥ − ηx − .
2 2
(0)
When −φ (0)
ηx − 12 ≤ c0 , we have = −φ
φ (0) φ (0) η x − 1
2 . When
−φ (0)
φ (0) η x − 2 > c0 , we have = c0 ≥ 2c0 (ηx − 2 ). In both cases, we have
1 1
−φ (0) 1 2
−φ (0)
(0) − ( fρφ (x)) ≥ ηx − min , 2c0 .
2 2 φ (0)
That is, the desired inequality holds with
$ %2
−φ (0)
C = min −φ (0)c0 , .
2φ (0)
194 10 General regularized classifiers
Let x ∈ Xc . If fρ (x) > 0, then fc (x) = 1 and f (x) < 0. By Theorem 10.8,
fρ− (x) ≥ 0 and is strictly decreasing on (−∞, 0]. So ( f (x)) > (0) in
this case. In the same way, if fρ (x) < 0, then f (x) ≥ 0. By Theorem 10.8,
fρ+ (x) ≤ 0 and is strictly increasing on [0, +∞). So ( f (x)) ≥ (0).
φ φ
Finally, if fρ (x) = 0, by (10.6), fρ (x) = 0 and then (0) − ( fρ (x)) = 0.
φ φ
In all three cases we have (0)−( fρ (x)) ≤ ( f (x))−( fρ (x)). Hence,
& ' & '
(0) − ( fρφ (x)) dρ X ≤ ( f (x)) − ( fρφ (x)) dρX
Xc Xc
& '
≤ ( f (x)) − ( fρφ (x)) d ρX = E φ ( f ) − E φ ( fρφ ).
X
√
This proves the first desired bound with cφ = 2/ C.
If R( fc ) = 0, then y = fc (x) almost surely and ηx = 1 or 0 almost everywhere.
This means that | fρ (x)| = 1 and | fρ (x)| = | fρ (x)|2 almost everywhere.
Using this with
respect
2 to relation (9.14), we see that R(sgn( f )) −
R( fc ) = Xc fρ (x) d ρX . Then the above procedure yields the second bound
with cφ = C4 .
replacing image values of f by their projection onto [−1, 1]. This section
develops this idea.
φ φ
E φ (π( f )) ≤ E φ ( f ) and Ez (π( f )) ≤ Ez ( f ). (10.8)
φ
Thus the analysis for the excess misclassification error of fz,γ is reduced into
φ φ
that for the excess generalization error E φ (π( fz,γ )) − E φ ( fρ ). We carry out
the latter analysis in the next two sections.
The following result is similar to Theorem 8.3.
φ
Theorem 10.12 Let φ be a classifying loss, fz,γ be defined by (10.1), and
φ φ
fγ ∈ HK . Then E φ (π( fz,γ )) − E φ ( fρ ) is bounded by
φ φ
E φ (π( fz,γ )) − E φ ( fρφ ) + γ fz,γ 2K ≤ E φ ( fγ ) − E φ ( fρφ ) + γ fγ 2K
( ) " #
φ φ
+ Ez ( fγ ) − Ez ( fρφ ) − E φ ( fγ ) − E φ ( fρφ ) (10.9)
( ) ( )
φ φ φ φ
+ E φ (π( fz,γ )) − E φ ( fρφ ) − Ez (π( fz,γ )) − Ez ( fρφ ) .
Proof. The proof follows from (10.8) using the procedure in the proof of
Theorem 8.3.
The first term on the right-hand side of (10.9) is estimated in the next section.
It is the regularized error (w.r.t. φ) of fγ ,
D(γ , φ) := E φ ( fγ ) − E φ ( fρφ ) + γ fγ 2
K. (10.11)
The second and third terms on the right-hand side of (10.9) decompose
φ
the sample error E φ (π( fz,γ )) − E φ ( fγ ). The second term is about a single
random variable involving only one function fγ and is easy to handle; we bound
it in Section 10.4. The third term is more complex. In the form presented,
φ φ
the function π( fz,γ ) is projected from fz,γ . This projection maintains the
φ φ
misclassification error: R(sgn(π( fz,γ ))) = R(sgn( fz,γ )). However, it causes
φ
the random variable φ(yπ( fz,γ )(x)) to be bounded by φ(−1), a bound that
φ
is often smaller than that for φ(yfz,γ (x)). This allows for improved bounds
for the sample error for classification algorithms. We bound this third term in
Section 10.5.
(t) = max{|φ− (t)|, |φ+ (t)|, |φ− (−t)|, |φ+ (−t)|}.
By Theorem 10.8, is constant on [fρ− (x), fρ+ (x)]. So we need only bound for
those points x for which the value f (x) is outside this interval.
If f (x) > fρ+ (x), then, by Theorem 10.8 and since fρ+ (x) ≥ −1, is strictly
increasing on [fρ+ (x), f (x)]. Moreover, the convexity of implies that
$ %
( f (x)) − ( fρφ (x)) ≤ − ( f (x)) f (x) − fρφ (x)
& '
≤ max φ−
( f (x)), |φ+ (−f (x))| f (x) − fρφ (x)
≤ (| f (x)|) f (x) − fρφ (x) .
Similarly, if f (x) < fρ− (x), then, by Theorem 10.8 again and since fρ− (x) ≤ 1,
is strictly decreasing on [f (x), fρ− (x)], and
$ %
( f (x)) − ( fρφ (x)) ≤ + ( f (x)) f (x) − fρφ (x) ≤ (| f (x)|) f (x) − fρφ (x) .
Thus, we have
( f (x)) − ( fρφ (x)) ≤ (| f (x)|) f (x) − fρφ (x) .
f (x)
$ %
( f (x)) − ( fρφ (x)) =
( fρφ (x)) f (x) − fρφ (x) + ( f (x) − t) (t) dt
φ
fρ (x)
2
≤ L ∞ [f φ (x), f (x)] 21 f (x) − fρφ (x) .
ρ
φ
Now, since fρ (x) ∈ [−1, 1], use that
E φ ( f ) − E φ ( fρφ ) ≤ φ ∞ f − fρφ 2L 2 .
ρX
Definition 10.15 The variancing power τ = τφ,ρ of the pair (φ, ρ) is defined
to be the maximal number τ in [0, 1] such that for some constant C1 > 0 and
any measurable function f : X → [−1, 1],
$ %2 $ %τ
E φ(yf (x)) − φ(yfρφ (x)) ≤ C1 E φ ( f ) − E φ ( fρφ ) . (10.12)
Example 10.16 For φls (t) = (1 − t)2 we have τφ,ρ = 1 for any probability
measure ρ.
Proof. For φls (t) = (1 − t)2 we know that φls (yf (x)) = (y − f (x))2
φ
and fρ = fρ . Hence (10.12) is valid with τ = 1 and C1 = sup(x,y)∈Z
$ %2
y − f (x) + y − fρ (x) ≤ 16.
1 q 1 ∗
a·b≤ a + ∗ bq , ∀a, b > 0.
q q
10.4 Bounds for the sample error term involving fγ 199
∗
Proof. Let b > 0. Define a function f : R+ → R by f (a) = a·b− q1 aq − q1∗ bq .
This satisfies
Hence f is a concave function on R+ and takes its maximum value at the unique
point a∗ = b1/(q−1) where f (a∗ ) = 0. But q∗ = q−1
q
and
1 1 ∗ 1 1
f (a∗ ) = a∗ · b − (a∗ )q − ∗ bq = bq/(q−1) − bq/(q−1) − ∗ bq/(q−1) = 0.
q q q q
Therefore, f (a) ≤ f (a∗ ) = 0 for all a ∈ R+ . This is true for any b > 0. So the
inequality holds.
1/(2−τ )
5Bγ + 2φ(−1) 2 2C1 log (2/δ)
log + + E φ ( fγ ) − E φ ( fρφ ).
3m δ m
φ
Proof. Write the random variable ξ(z) = φ(yfγ (x)) − φ(yfρ (x)) on (Z, ρ) as
ξ = ξ1 + ξ2 , where
⎧ ⎫
1 m ⎨ mε 2 ⎬
Prob ξ 1 (z i ) − E(ξ 1 ) > ε ≤ exp − .
z∈Z m m ⎩ 2 σ 2 (ξ ) + 1 B ε ⎭
i=1 1 3 γ
200 10 General regularized classifiers
mε 2 2
= log ,
σ 2 (ξ δ
1) + 3 Bγ ε
1
2
we see that for any 0 < δ < 1, there exists a subset U1 of Z m with measure at
least 1 − 2δ such that for every z ∈ U1 ,
1
m
ξ1 (zi ) − E(ξ1 )
m
i=1
? 2
1
3 Bγ log(2/δ) + 1
3 Bγ log(2/δ) + 2mσ 2 (ξ1 ) log(2/δ)
≤ .
m
1
m
5Bγ log(2/δ)
ξ1 (zi ) − E(ξ1 ) ≤ + E(ξ1 ), ∀z ∈ U1 .
m 3m
i=1
The definition of the variancing power τ gives σ 2 (ξ2 ) ≤ E(ξ22 ) ≤ C1 {E(ξ2 )}τ .
Applying Lemma 10.17 to q = 2−τ 2
, q∗ = τ2 , a = 2 log(2/δ)C1 /m, and
√ τ
b = {E(ξ2 )} , we obtain
!
2 log(2/δ)σ 2 (ξ2 ) τ 2 log(2/δ)C1 1/(2−τ )
τ
≤ 1− + E(ξ2 ).
m 2 m 2
1 1/(2−τ )
m
2φ(−1) log(2/δ) 2 log(2/δ)C1
ξ2 (zi ) − E(ξ2 ) ≤ + + E(ξ2 ).
m 3m m
i=1
φ
10.5 Bounds for the sample error term involving fz,γ 201
Combining these inequalities for ξ1 and ξ2 with the fact that E(ξ1 ) + E(ξ2 ) =
φ
E(ξ ) = E φ ( fγ ) − E φ ( fρ ), we conclude that for all z ∈ U1 ∩ U2 ,
( ) " #
φ φ
Ez ( fγ ) − Ez ( fρφ ) − E φ ( fγ ) − E φ ( fρφ )
(1/2−τ )
5Bγ log(2/δ) + 2φ(−1) log(2/δ) 2 log(2/δ)C1
≤ +
3m m
+ E φ ( fγ ) − E φ ( fρφ ).
φ
10.5 Bounds for the sample error term involving fz,γ
( )
φ φ
The other term of the sample error in (10.9), E φ (π( fz,γ )) − E φ ( fρ ) −
( )
φ φ φ φ φ
Ez (π( fz,γ )) − Ez ( fρ ) , involves the function fz,γ and thus runs over a set
of functions. To bound it, we use – as we have already done in similar cases –
a probability inequality for a function set in terms of the covering numbers of
the set.
The following probability inequality can be proved using the one-side
Bernstein inequality as in Lemma 3.18.
µ − m1 m i=1 ξ(zi ) 1− τ2 mε2−τ
Prob √ τ >ε ≤ exp −
z∈Z m µ + ετ 2(c + 13 Bε 1−τ )
holds.
Also, the following inequality for a function set can be proved in the same way
as Lemma 3.19.
We can now derive the sample error bounds along the same lines we followed
in the previous chapter for the regression problem.
Lemma 10.21 Let τ = τφ,ρ . For any R > 0 and any ε > 0.
⎧ ⎫
⎪
⎪ φ φ φ φ ⎪
⎪
⎨ E φ (π( f )) − E φ ( fρ ) − Ez (π( f )) − Ez ( fρ ) ⎬
Prob sup ? τ ≤ 4ε1−τ/2
z∈Zm ⎪
⎪ ⎪
⎪
⎩f ∈BR E φ (π( f )) − E φ ( fρ )
φ
+ ετ ⎭
ε mε2−τ
≥ 1 − N B1 , exp −
R|φ (−1)| 2C1 + 43 φ(−1)ε 1−τ
holds.
φ & '
FR = φ(y(πf )(x)) − φ(yfρφ (x)) : f ∈ BR . (10.14)
φ
Each function g ∈ FR satisfies E(g 2 ) ≤ c (E(g))τ for c = C1 and
g − E(g) Lρ∞ ≤ B := 2φ(−1). Therefore, to draw our conclusion from
φ
Lemma 10.20, we need only bound the covering number N (FR , ε). To do so,
we note that for f1 , f2 ∈ BR and (x, y) ∈ Z, we have
& ' & '
φ(y(π f1 )(x)) − φ(yf φ (x)) − φ(y(π f2 )(x)) − φ(yf φ (x))
ρ ρ
= |φ(y(π f1 )(x)) − φ(y(πf2 )(x))| ≤ |φ (−1)| f1 − f2 ∞.
Therefore
ε
φ
N FR , ε ≤ N BR , ,
|φ (−1)|
ε mε 2−τ
log N B1 ,
− ≤ log δ. (10.15)
R|φ (−1)| 2C1 + 43 φ(−1)ε 1−τ
Then the confidence for the error ε = ε∗ (m, R, δ) in Lemma 10.21 is at least
1 − δ.
For R > 0, denote
φ
W(R) = z ∈ Z m : fz,γ K ≤ R .
Proposition 10.22 For all 0 < δ < 1 and R > 0, there is a subset VR of
Z m with measure at most δ such that for all z ∈ W(R) \ VR , the quantity
φ φ φ
E φ (π( fz,γ )) − E φ ( fρ ) + γ fz,γ 2K is bounded by
10Bγ + 4φ(−1)
4D(γ , φ) + 24ε ∗ (m, R, δ/2) + log (4/δ)
3m
1/(2−τ )
2C1 log (4/δ)
+2 .
m
1/(2−τ )
5Bγ + 2φ(−1) 2C1 log (4/δ)
log (4/δ) + + E φ ( fγ ) − E φ ( fρφ ).
3m m
204 10 General regularized classifiers
Combining these two bounds with (10.9), we see that for z ∈ W(R), with
confidence 1 − δ,
φ φ 1 φ φ
E φ (π( fz,γ )) − E φ ( fρφ ) + γ fz,γ 2
K ≤ D(γ , φ) + E (π( fz,γ )) − E φ ( fρφ )
2
1/(2−τ )
5Bγ + 2φ(−1) 2C1 log (4/δ)
+ 12ε∗ (m, R, δ/2) + log (4/δ) +
3m m
+ E φ ( fγ ) − E φ ( fρφ ).
φ φ
Proof. Since fz,γ minimizes Ez ( f ) + γ f 2
K in HK , choosing f = 0 implies
that
1
m
φ φ φ φ φ
γ fz,γ 2
K ≤ Ez ( fz,γ ) + γ fz,γ 2
K ≤ Ez (0) + 0 = φ(0) = φ(0).
m
i=1
φ √
Therefore, fz,γ K ≤
φ(0)/γ for all z ∈ Z m .
√ √
By Lemma 10.23, W( φ(0)/γ ) = Z m . Taking R := φ(0)/γ , we can
derive a weak error bound, as we did in Section 8.3. But we can do better.
φ
A bound for the norm fz,γ K improving that of Lemma 10.23 can be shown to
hold with high probability. To show this is the target of the next section. Note that
we could now wrap the results in this and the two preceding sections into a single
φ
statement bounding the excess misclassification error R(sgn( fz,γ )) − R( fc ).
We actually do that, in Corollary 10.25, once we have obtained a better bound
φ
for the norm fz,γ K .
φ 2
E φ (π( fz,γ )) − E φ ( fρφ ) ≤ Cη log m−θ ,
δ
where
β 1 − pr p
θ := min , , s := ,
β + q(1 − β)/2 2 − τ + p 2(1 + p)
ζ − 1/(2 − τ + p) 1−β 1 q 1
r := max + η, ζ,ζ + (1 − β) −
2(1 − s) 2 2 4 2
φ 2
E φ (π( fz,γ )) − E φ ( fρφ ) ≤ Cη log m−θ ,
δ
where
β 1
θ := min , −η
β + q(1 − β)/2 2 − τ
R|φ (−1)| p
m 2−τ
C0 − = log δ.
2C1 + 43 φ(−1) 1−τ
Lemma 10.28 Under the assumptions of Theorem 10.24, choose γ = m−ζ for
some ζ > 0. Then, for any 0 < δ < 1 and R ≥ 1, there is a set VR ⊆ Z m with
measure at most δ such that for m ≥ (C2K A)−1/(ζ (1−β)) ,
W(R) ⊆ W(am Rs + bm ) ∪ VR ,
√ 1/2
p
where s := 2(1+p) , am := 5 C2 m(ζ −1/(2−τ +p))/2 , and bm := C3 log 4δ mr .
Here the constants are
ζ − 1/(2 − τ ) 1 − β 1 q 1
r := max , ζ,ζ + (1 − β) −
2 2 2 4 2
and
√ 7
1/(4−2τ )
φ(−1) + 2 A + 2 Cφ CK Aq/4 .
q/2
C3 = 5 C2 + 2C1 + √2
3
φ 10Bγ + 4φ(−1)
γ fz,γ 2
K ≤ 4Aγ β + 24ε ∗ (m, R, δ/2) + log (4/δ)
3m
1/(2−τ )
2C1 log (4/δ)
+2 .
m
Since φ(t) ≤ Cφ |t|q for each t ∈ (−1, 1), we see that Bγ =
$ %q
max{φ( fγ ∞ ), φ(− fγ ∞ )} is bounded by Cφ max{ fγ ∞ , 1} . But the
assumption D(γ , φ) ≤ Aγ β implies that
√
fγ ∞ ≤ CK fγ K ≤ CK D(γ , φ)/γ ≤ CK Aγ (β−1)/2 .
Hence,
√
Bγ ≤ Cφ CK Aq/2 γ q(β−1)/2 , when CK Aγ (β−1)/2 ≥ 1.
q
(10.16)
208 10 General regularized classifiers
Under this restriction it follows from Lemma 10.27 that z ∈ W(R) for any R
satisfying
1/(2−τ )
1 1/(2−τ ) 4 log(4/δ)
R≥ √ 4Aγ β + 24C2 + 4C1 + φ(−1)
γ 3 m
1/2
−1/(2−τ +p) p/(1+p) 10 log(4/δ)
+ Cφ CK Aq/2 γ q(β−1)/2
q
+24C2 m R .
3 m
1/2
4
R = 5 C2 m(ζ −1/(2−τ +p))/2 r p/(2(1+p)) + C3 log mr .
δ
Lemma 10.29 Under the assumptions of Theorem 10.24, take γ = m−ζ for
some ζ > 0 and let m ≥ (C2K A)−1/(ζ (1−β)) . Then, for any η > 0 and 0 < δ < 1,
∗
the set W(R∗ ) has measure at least 1 − Jη δ, where R∗ = C4 mr ,
ζ 1 1
Jη := log2 max , + log2 + 1,
2 (2 − τ + p) η
and
ζ − 1/(2 − τ + p)
r ∗ := max r, +η .
2(1 − s)
2 2 4 1/2
C4 = 5 C2 (φ(0) + 1) + Jη 5 C2 C3 log .
δ
2 +···+sJ −1
sJ J −1
s j
R(J ) = (am )1+s+s R(0)
2 +···+sj−1
+ (am )1+s+s bm .
j=0
(10.17)
10.6 Stronger error bounds 209
ζ
When 2J ≥ max{ 2η , (2−τ1+p)η }, this upper bound is controlled by
2 ζ −1/(2−τ +p)
+η
5 C2 (φ(0) + 1) m 2(1−s) .
which is bounded by
J −1 2
ζ −1/(2−τ +p) r− ζ −1/(2−τ +p)
·sj
m 2(1−s) 5 C2 C3 (log(4/δ)) 1/2
m 2(1−s)
.
j=0
ζ −1/(2−τ +p)
If r ≥ 2(1−s) , this last expression is bounded by
ζ −1/(2−τ +p)
If r < 2(1−s) , an upper bound is easier:
ζ −1/(2−τ +p) 2
m 2(1−s) J 5 C2 C3 (log(4/δ))1/2 .
$ √ %2
Thus, in either case, the second term has the upper bound J 5 C2 C3
∗
(log(4/δ))1/2 mr .
Combining the bounds for the two terms, we have
2 2
∗
R(J ) ≤ 5 C2 (φ(0) + 1) + J 5 C2 C3 (log(4/δ))1/2 mr .
210 10 General regularized classifiers
ζ
Taking J to be Jη , we have 2J > max{ 2η , (2−τ1+p)η } and we finish the
proof.
The proof of Theorem 10.24 follows from Lemmas 10.27 and 10.29 and
Proposition 10.22. The constant Cη can be explicitly obtained.
holds.
Proof. Since f (x) ∈ [−1, 1], we have φh (yf (x)) − φh (yfc (x)) = y( fc (x) −
f (x)). It follows that
E φh ( f ) − E φh ( fc ) = ( fc (x) − f (x))fρ (x) d ρX = | fc (x) − f (x)| | fρ (x)| d ρX
X X
and
E (φh (yf (x)) − φh (yfc (x)))2 = | fc (x) − f (x)|2 d ρX .
X
10.8 References and additional remarks 211
Let t > 0 and separate the domain X into two sets: Xt+ := {x ∈ X : | fρ (x)| >
cq t} and Xt− := {x ∈ X : | fρ (x)| ≤ cq t}. On Xt+ we have | fc (x) − f (x)|2 ≤
| f (x)|
2| fc (x) − f (x)| cρq t . On Xt− we have | fc (x) − f (x)|2 ≤ 4. It follows from
(10.18) that
$ %
2 E φh ( f ) − E φh ( fc )
| fc (x) − f (x)| d ρX ≤
2
+ 4ρX (Xt− )
X cq t
$ %
2 E φh ( f ) − E φh ( fc )
≤ + 4t q .
cq t
& '1/(q+1)
Choosing t = (E φh ( f ) − E φh ( fc ))/(2cq ) yields the desired bound.
Lemma 10.31 tells us that the variancing power τφh ,ρ of the hinge loss equals
q
q+1 when the measure ρ has Tsybakov noise exponent q. Combining this
with Corollary 10.26 gives the following result on improved learning rates
for measures satisfying the Tsybakov noise condition.
Theorem 10.32 Under the assumption of Theorem 10.2, if ρ has Tsybakov
noise exponent q with 0 ≤ q ≤ ∞, then, for any 0 < ε < 12 and 0 < δ < 1,
with confidence 1 − δ, we have
θ
φ 2 1
R(sgn( fz,γh )) − R( fc ) ≤ C log
δ m
2β q+1
where θ = min 1+β , q+2 − ε and C is a constant independent of m and δ.
In Theorem 10.32, the learning rate can be arbitrarily close to 1 when q is
sufficiently large.
universal). Convergence rates in this situation were derived in [154]. For further
results and references on convergence rates, see the thesis [140].
The error analysis in this chapter is taken from [142], where more technical
and better error bounds are provided by means of the local Rademacher process,
empirical covering numbers, and the entropy integral [84, 132]. The Tsybakov
noise condition of Section 10.7 was introduced in [131].
The iteration technique used in the proof of Lemma 10.29 was given in [122]
(see also [144]).
SVMs have many modifications for various purposes in different fields [134].
These include q-norm soft margin classifiers [31, 77], multiclass SVMs
[4, 32, 75, 139], ν-SVMs [108], linear programming SVMs [26, 96, 98, 146],
maximum entropy discrimination [65], and one-class SVMs [107, 128].
We conclude with some brief comments on current trends.
Learning theory is a rapidly growing field. Many people are working on both
its foundations and its applications, from different points of view. This work
develops the theory but also leaves many open questions. Here we mention
some involving regularization schemes [48].
kernels and spaces with changing norms (e.g. [49, 62]). The learning of
kernel functions is studied in [72, 88, 90].
Another related class of multikernel regularization schemes consists that
of schemes generated by polynomial kernels {Kd (x, y) = (1 + x · y)d }
with d ∈ N. In [158] convergence rates in the univariate case (n = 1) for
multikernel regularized classifiers generated by polynomial kernels are
derived.
(iii) Online learning algorithms. These algorithms improve the efficiency of
learning methods when the sample size m is very large. Their convergence
is investigated in [28, 51, 52, 68, 134], and their error with respect to
the step size has been analyzed for the least squares regression in [112]
and for regularized classification with a general classifying loss in [151].
Error analysis for online schemes with varying regularization parameters
is performed in [127] and [149].
References
214
References 215
[79] F. Lu and H. Sun. Positive definite dot product kernels in learning theory. Adv.
Comput. Math., 22:181–198, 2005.
[80] G. Lugosi and N. Vayatis. On the Bayes-risk consistency of regularized boosting
methods. Ann. Stat., 32:30–55, 2004.
[81] D.J.C. Mackay. Information-based objective functions for active data selection.
Neural Comp., 4:590–604, 1992.
[82] W.R. Madych and S.A. Nelson. Bounds on multivariate polynomials and
exponential error estimates for multiquadric interpolation. J. Approx. Theory,
70:94–114, 1992.
[83] C. McDiarmid. Concentration. In M. Habib et al., editors, Probabilistic Methods
for Algorithmic Discrete Mathematics, pages 195–248. Springer-Verlag, 1998.
[84] S. Mendelson. Improving the sample complexity using global data. IEEE Trans.
Inform. Theory, 48:1977–1991, 2002.
[85] J. Mercer. Functions of positive and negative type and their connection with the
theory of integral equations. Philos. Trans. Roy. Soc. London Ser. A, 209:415–446,
1909.
[86] C.A. Micchelli. Interpolation of scattered data: distance matrices and
conditionally positive definite functions. Constr. Approx., 2:11–22, 1986.
[87] C.A. Micchelli andA. Pinkus. Variational problems arising from balancing several
error criteria. Rend. Math. Appl., 14:37–86, 1994.
[88] C.A. Micchelli and M. Pontil. Learning the kernel function via regularization.
J. Mach. Learn. Res., 6:1099–1125, 2005.
[89] C.A. Micchelli and M. Pontil. On learning vector-valued functions. Neural Comp.,
17:177–204, 2005.
[90] C.A. Micchelli, M. Pontil, Q. Wu, and D.X. Zhou. Error bounds for learning the
kernel. Preprint, 2006.
[91] M. Mignotte. Mathematics for Computer Algebra. Springer-Verlag, 1992.
[92] T.M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[93] S. Mukherjee and D.X. Zhou. Learning coordinate covariances via gradients.
J. Mach. Learn. Res., 7:519–549, 2006.
[94] F.J. Narcowich, J.D. Ward, and H. Wendland. Refined error estimates for radial
basis function interpolation. Constr. Approx., 19:541–564, 2003.
[95] P. Niyogi. The Informational Complexity of Learning. Kluwer Academic
Publishers, 1998.
[96] P. Niyogi and F. Girosi. On the relationship between generalization error,
hypothesis complexity and sample complexity for radial basis functions. Neural
Comput., 8:819–842, 1996.
[97] P. Niyogi, S. Smale, and S. Weinberger. Finding the homology of submanifolds
with high confidence from random samples. Preprint, 2004.
[98] J.P. Pedroso and N. Murata. Support vector machines with different norms:
motivation, formulations and results. Pattern Recognit. Lett., 22:1263–1272,
2001.
[99] I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces.
Ann. Probab., 22:1679–1706, 1994.
[100] A. Pinkus. N-widths in Approximation Theory. Springer-Verlag, 1996.
[101] A. Pinkus. Strictly positive definite kernels on a real inner product space. Adv.
Comput. Math., 20:263–271, 2004.
References 219
[102] T. Poggio, V. Torre, and C. Koch. Computational vision and regularization theory.
Nature, 317:314–319, 1985.
[103] D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, 1984.
[104] A. Rakhlin, D. Panchenko, and S. Mukherjee. Risk bounds for mixture density
estimation. ESAIM: Prob. Stat., 9:220–229, 2005.
[105] R. Schaback. Reconstruction of multivariate functions from scattered data.
Manuscript, 1997.
[106] I.J. Schoenberg. Metric spaces and completely monotone functions. Ann. Math.,
39:811–841, 1938.
[107] B. Schölkopf and A.J. Smola. Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond. MIT Press, 2002.
[108] B. Schölkopf, A.J. Smola, R.C. Williamson, and P.L. Bartlett. New support vector
algorithms. Neural Comp., 12:1207–1245, 2000.
[109] I.R. Shafarevich. Basic Algebraic Geometry. 1: Varieties in Projective Space.
Springer-Verlag, 2nd edition, 1994.
[110] J. Shawe-Taylor, P.L. Bartlet, R.C. Williamson, and M. Anthony. Structural risk
minimization over data dependent hierarchies. IEEE Trans. Inform. Theory,
44:1926–1940, 1998.
[111] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis.
Cambridge University Press, 2004.
[112] S. Smale and Y. Yao. Online learning algorithms. Found. Comput. Math., 6:145–
170, 2006.
[113] S. Smale and D.X. Zhou. Estimating the approximation error in learning theory.
Anal. Appl., 1:17–41, 2003.
[114] S. Smale and D.X. Zhou. Shannon sampling and function reconstruction from
point values. Bull. Amer. Math. Soc., 41:279–305, 2004.
[115] S. Smale and D.X. Zhou. Shannon sampling II: Connections to learning theory.
Appl. Comput. Harmonic Anal., 19:285–302, 2005.
[116] S. Smale and D.X. Zhou. Learning theory estimates via integral operators and
their approximations. To appear in Constr. Approx.
[117] A. Smola, B. Schölkopf, and R. Herbricht. A generalized representer theorem.
Comput. Learn. Theory, 14:416–426, 2001.
[118] A. Smola, B. Schölkopf, and K.R. Müller. The connection between regularization
operators and support vector kernels. Neural Networks, 11:637–649, 1998.
[119] M. Sousa Lobo, L. Vandenberghe, S. Boyd, and H. Lebret.Applications of second-
order cone programming. Linear Algebra Appl., 284:193–228, 1998.
[120] E.M. Stein. Singular Integrals and Differentiability Properties of Functions.
Princeton University Press, 1970.
[121] I. Steinwart. Support vector machines are universally consistent. J. Complexity,
18:768–791, 2002.
[122] I. Steinwart and C. Scovel. Fast rates for support vector machines. In P. Auer and
R. Meir, editors, Proc. 18th Ann. Conf. Learn. Theory, pages 279–294, Springer
2005.
[123] H.W. Sun. Mercer theorem for RKHS on noncompact sets. J. Complexity, 21:337–
349, 2005.
[124] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press,
1998.
220 References
, 7 fz , 9
C s (X ) , 18 fz0 , 161
Lip(s) , 73 fz,γ , 134
φ
Lip∗(s,C (X )) , 74 fz,γ , 164
Lip∗(s,L p (X )) , 74 HK , 23
s , 21 HK,z , 34, 163
H s (Rn ) , 75 H s (Rn ), 75
A(fρ , R), 54 H s (X ), 21
A(H), 12 Kρ , 159
BR , 76 κρ , 159
CK , 22 Kx , 22
C (X ), 9 K[x], 22
C s (X ), 18 2 (XN ), 90
C ∞ (X ), 18 ∞ (XN ), 90
D(γ ), 136
Lip(s), 73
D(γ , φ), 196
Lip∗(s, C (X )), 74
Diam(X ), 72
Lip∗(s, L p (X )), 74
Dµρ , 110
LK , 56
rt , 74 p
Lν (X ), 18
E, 6
Lν∞ (X ), 19
Eγ , 134
Lz , 10
EH , 11
M(S, η), 101
E φ , 162
φ N (S, η), 37
Eγ , 162
O(n), 30
φ
Ez,γ , 162 φ0 , 162
φ φh , 165
Ez , 162
Ez , 8 φls , 162
Ez,γ , 134 s (R), 84
K (x), 112 R(f ) 5
ηx , 174 ρ, 5
fc , 159 ρX , 6
fγ , 134 ρ(y|x), 6
fH , 9 sgn, 159
fρ , 3, 6 σρ2 , 6
fY , 8 ϒk , 87
222
Index 223
interpolation space, 63
defect, 10
distortion, 110
divided difference, 74 K-functional, 63
domination of measures, 110 kernel, 56
box spline, 28
efficient algorithm, 33 dot product, 24
ε-net, 78 Mercer, 22
ERM, 50 spline, 27
error translation invariant, 26
approximation, 12 universal, 212
approximation (associated with ψ), 70
empirical, 8 Lagrange interpolation polynomials, 84
empirical (associated with φ), 162 Lagrange multiplier, 151
empirical (associated with ψ), 51 least squares, 1, 2
excess generalization, 12 left derivative, 34
excess misclassification, 188 linear programming, 33
generalization, 5 localizing function, 190
generalization (associated with φ), 162 loss
generalization (associated with ψ), 51 classifying function, 187
in H, 11 -insensitive, 50
local, 6, 161 function, 161
misclassification, 157 function, regression, 50
regularized, 134 hinge, 165
224 Index