Fundamentals of Optimization Theory
With Applications to Machine Learning
© Jean Gallier
In recent years, computer vision, robotics, machine learning, and data science have been
some of the key areas that have contributed to major advances in technology. Anyone who
looks at papers or books in the above areas will be baffled by a strange jargon involving exotic
terms such as kernel PCA, ridge regression, lasso regression, support vector machines (SVM),
Lagrange multipliers, KKT conditions, etc. Do support vector machines chase cattle to catch
them with some kind of super lasso? No! But one will quickly discover that behind the jargon
which always comes with a new field (perhaps to keep the outsiders out of the club), lies a
lot of “classical” linear algebra and techniques from optimization theory. And there comes
the main challenge: in order to understand and use tools from machine learning, computer
vision, and so on, one needs to have a firm background in linear algebra and optimization
theory. To be honest, some probability theory and statistics should also be included, but we
already have enough to contend with.
Many books on machine learning struggle with the above problem. How can one understand what the dual variables of a ridge regression problem are if one doesn’t know about the
Lagrangian duality framework? Similarly, how is it possible to discuss the dual formulation
of SVM without a firm understanding of the Lagrangian framework?
The easy way out is to sweep these difficulties under the rug. If one is just a consumer
of the techniques we mentioned above, the cookbook recipe approach is probably adequate.
But this approach doesn’t work for someone who really wants to do serious research and
make significant contributions. To do so, we believe that one must have a solid background
in linear algebra and optimization theory.
This is a problem because it means investing a great deal of time and energy studying
these fields, but we believe that perseverance will be amply rewarded.
This second volume covers some elements of optimization theory and applications, espe-
cially to machine learning. This volume is divided into five parts:
(1) Preliminaries of Optimization Theory.
(2) Linear Optimization.
(3) Nonlinear Optimization.
(4) Applications to Machine Learning.
(5) An appendix on Hilbert Bases and the Riesz–Fischer Theorem.
Part I is devoted to some preliminaries of optimization theory. The goal of most optimiza-
tion problems is to minimize (or maximize) some objective function J subject to equality
or inequality constraints. Therefore it is important to understand when a function J has
a minimum or a maximum (an optimum). In most optimization problems, we need to find
necessary conditions for a function J : Ω → R to have a local extremum with respect to a
subset U of Ω (where Ω is open). This can be done in two cases:
(1) The set U is defined by a set of equations,
U = {x ∈ Ω | ϕi(x) = 0, 1 ≤ i ≤ m}.
(2) The set U is defined by a set of inequalities,
U = {x ∈ Ω | ϕi(x) ≤ 0, 1 ≤ i ≤ m}.
A basic problem arising in Chapter 14 is that of separating two disjoint sets of points, {u_i}_{i=1}^p and {v_j}_{j=1}^q, using a hyperplane satisfying some optimality property (to maximize the margin).
Section 14.7 contains the most important results of the chapter. The notion of Lagrangian
duality is presented and we discuss weak duality and strong duality.
In Chapter 15, we consider some deeper aspects of the theory of convex functions
that are not necessarily differentiable at every point of their domain. Some substitute for the
gradient is needed. Fortunately, for convex functions, there is such a notion, namely subgra-
dients. A major motivation for developing this more sophisticated theory of differentiation
of convex functions is to extend the Lagrangian framework to convex functions that are not
necessarily differentiable.
Chapter 16 is devoted to the presentation of one of the best methods currently known for solving optimization problems involving equality constraints, called ADMM (al-
ternating direction method of multipliers). In fact, this method can also handle more general
constraints, namely, membership in a convex set. It can also be used to solve lasso mini-
mization.
In Section 16.4, we prove the convergence of ADMM under exactly the same assumptions
as in Boyd et al. [17]. It turns out that Assumption (2) in Boyd et al. [17] implies that the
matrices AᵀA and BᵀB are invertible (as we show after the proof of Theorem 16.1). This
allows us to prove a convergence result stronger than the convergence result proven in Boyd
et al. [17].
The next three chapters constitute Part IV, which covers some applications of optimiza-
tion theory (in particular Lagrangian duality) to machine learning.
In Chapter 17, we discuss linear regression, ridge regression and lasso regression.
Chapter 18 is an introduction to positive definite kernels and the use of kernel functions
in machine learning.
We illustrate the kernel methods on two examples: (1) kernel PCA (see Section 18.3),
and (2) ν-SV Regression, which is a variant of linear regression in which certain points are
allowed to be “misclassified” (see Section 18.4).
In Chapter 19 we return to the problem of separating two disjoint sets of points, {u_i}_{i=1}^p and {v_j}_{j=1}^q, but this time we do not assume that these two sets are separable. To cope with
nonseparability, we allow points to invade the safety zone around the separating hyperplane,
and even points on the wrong side of the hyperplane. Such a method is called a soft margin
support vector machine. We discuss variations of this method, including ν-SV classification.
In each case, we present a careful derivation of the dual.
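For orientation, here is the soft margin problem in the most common textbook notation, with class labels yi ∈ {−1, +1} (the book instead works directly with the two families of points ui and vj, so its formulation and variable names differ):

minimize (1/2)‖w‖₂² + C Σ_{i=1}^m ξi
subject to yi(wᵀxi + b) ≥ 1 − ξi, ξi ≥ 0, 1 ≤ i ≤ m,

where the slack variables ξi measure by how much each point is allowed to invade the safety zone or cross to the wrong side of the hyperplane.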
With a few exceptions, we provide complete proofs. We did so to make this book
self-contained, but also because we believe that no deep knowledge of this material can be
acquired without working out some proofs. However, our advice is to skip some of the proofs
upon first reading, especially if they are long and intricate.
The chapters or sections marked with the symbol ~ contain material that is typically
more specialized or more advanced, and they can be omitted upon first (or second) reading.
Acknowledgement: We would like to thank Christine Allen-Blanchette, Kostas Daniilidis,
Carlos Esteves, Spyridon Leonardos, Stephen Phillips, João Sedoc, and Marcelo Siqueira,
for reporting typos and for helpful comments.
Contents
1 Introduction
3 Differential Calculus
3.1 Directional Derivatives, Total Derivatives
3.2 Jacobian Matrices
3.3 The Implicit and The Inverse Function Theorems
3.4 Second-Order and Higher-Order Derivatives
3.5 Taylor’s Formula, Faà di Bruno’s Formula
3.6 Further Readings
3.7 Summary
3.8 Problems
V Appendix
A Total Orthogonal Families in Hilbert Spaces
A.1 Total Orthogonal Families, Fourier Coefficients
A.2 The Hilbert Space ℓ2(K) and the Riesz–Fischer Theorem
Bibliography
Chapter 1
Introduction
This second volume covers some elements of optimization theory and applications, especially
to machine learning. This volume is divided into five parts:
(1) Preliminaries of Optimization Theory.
(2) Linear Optimization.
(3) Nonlinear Optimization.
(4) Applications to Machine Learning.
(5) An appendix on Hilbert Bases and the Riesz–Fischer Theorem.
Part I is devoted to some preliminaries of optimization theory. The goal of most optimiza-
tion problems is to minimize (or maximize) some objective function J subject to equality
or inequality constraints. Therefore it is important to understand when a function J has a
minimum or a maximum (an optimum). If the function J is sufficiently differentiable, then
a necessary condition for a function to have an optimum typically involves the derivative of
the function J, and if J is real-valued, its gradient ∇J.
Thus it is desirable to review some basic notions of topology and calculus, in particular,
to have a firm grasp of the notion of derivative of a function between normed vector spaces.
Partial derivatives ∂f /∂A of functions whose range and domain are spaces of matrices tend
to be used casually, even though in most cases a correct definition is never provided. It is
possible, and simple, to define rigorously derivatives, gradients, and directional derivatives
of functions defined on matrices and to avoid these nonsensical partial derivatives.
Chapter 2 contains a review of basic topological notions used in analysis. We pay par-
ticular attention to complete metric spaces and complete normed vector spaces. In fact, we
provide a detailed construction of the completion of a metric space (and of a normed vector
space) using equivalence classes of Cauchy sequences. Chapter 3 is devoted to some notions
of differential calculus, in particular, directional derivatives, total derivatives, gradients, Hes-
sians, and the inverse function theorem.
In most optimization problems, we need to find necessary conditions for a function J : Ω → R to have a local extremum with respect to a subset U of Ω (where Ω is open). This can be done in two cases:
(1) The set U is defined by a set of equations,
U = {x ∈ Ω | ϕi(x) = 0, 1 ≤ i ≤ m}.
(2) The set U is defined by a set of inequalities,
U = {x ∈ Ω | ϕi(x) ≤ 0, 1 ≤ i ≤ m}.
In (1), the equations ϕi (x) = 0 are called equality constraints, and in (2), the inequalities
ϕi (x) ≤ 0 are called inequality constraints. The case of equality constraints is much easier
to deal with and is treated in Chapter 4.
If the functions ϕi are convex and Ω is convex, then U is convex. This is a very important
case that we will discuss later. In particular, if the functions ϕi are affine, then the equality
constraints can be written as Ax = b, and the inequality constraints as Ax ≤ b, for some
m × n matrix A and some vector b ∈ Rm . We will also discuss the case of affine constraints
later.
In the case of equality constraints, a necessary condition for a local extremum with respect
to U can be given in terms of Lagrange multipliers. In the case of inequality constraints, there
is also a necessary condition for a local extremum with respect to U in terms of generalized
Lagrange multipliers and the Karush–Kuhn–Tucker conditions. This will be discussed in
Chapter 14.
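As a toy illustration (ours, not an example from the text), consider minimizing J(x, y) = x² + y² subject to the single inequality constraint ϕ(x, y) = 1 − x − y ≤ 0. The KKT conditions ask for a multiplier λ ≥ 0 such that ∇J + λ∇ϕ = 0 and λϕ = 0; here they read 2x − λ = 0, 2y − λ = 0, λ(1 − x − y) = 0, whose solution is x = y = 1/2 with λ = 1 (the constraint is active). The short Python check below merely verifies these conditions numerically; the variable names are ours.

import numpy as np

# Candidate primal solution and multiplier for
#   minimize x^2 + y^2  subject to  1 - x - y <= 0.
x = np.array([0.5, 0.5])
lam = 1.0

grad_J = 2.0 * x                           # gradient of the objective
grad_phi = np.array([-1.0, -1.0])          # gradient of the constraint 1 - x - y
phi = 1.0 - x.sum()                        # constraint value at x

print("stationarity:", grad_J + lam * grad_phi)             # [0. 0.]
print("primal feasibility:", phi <= 1e-12)                  # True (constraint active)
print("dual feasibility:", lam >= 0)                        # True
print("complementary slackness:", abs(lam * phi) < 1e-12)   # True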
In Chapter 5 we discuss Newton’s method and some of its generalizations (the Newton–
Kantorovich theorem). These are methods to find the zeros of a function.
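To make the basic idea concrete, here is a minimal one-variable sketch (our code, not the book's): starting from a guess x0, Newton's method iterates x_{k+1} = x_k − f(x_k)/f′(x_k) to approximate a zero of f.

import math

def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Newton's method for a zero of f; assumes f'(x) != 0 near the root."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: the positive zero of f(x) = x^2 - 2 is sqrt(2).
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root, math.sqrt(2.0))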
Chapter 6 covers the special case of determining when a quadratic function has a mini-
mum, subject to affine equality constraints. A complete answer is provided in terms of the
notion of symmetric positive semidefinite matrices.
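For instance (a sketch of ours, not code from the text), to minimize the quadratic J(x) = (1/2)xᵀAx − xᵀb, with A symmetric positive definite, subject to Cx = d, one introduces Lagrange multipliers λ for the equality constraints and solves the resulting KKT linear system.

import numpy as np

# minimize (1/2) x^T A x - b^T x  subject to  C x = d,
# with A symmetric positive definite (assumed here).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
C = np.array([[1.0, 1.0]])            # single constraint x1 + x2 = 1
d = np.array([1.0])

n, p = A.shape[0], C.shape[0]
K = np.block([[A, C.T], [C, np.zeros((p, p))]])   # KKT matrix
sol = np.linalg.solve(K, np.concatenate([b, d]))
x, lam = sol[:n], sol[n:]
print("minimizer:", x, "multiplier:", lam)
print("constraint residual:", C @ x - d)          # ~0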
The Schur complement is introduced in Chapter 7. We give a complete proof of a cri-
terion for a matrix to be positive definite (or positive semidefinite) stated in Boyd and
Vandenberghe [18] (Appendix B).
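To give the flavor of such a criterion (our sketch, stating the standard fact rather than quoting the book): for a symmetric block matrix M = [[A, B], [Bᵀ, C]] with A invertible, M is positive definite iff A is positive definite and the Schur complement C − BᵀA⁻¹B is positive definite. The NumPy check below compares the two tests on a random example.

import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2
R = rng.standard_normal((n + m, n + m))
M = R @ R.T + (n + m) * np.eye(n + m)     # symmetric positive definite block matrix
A, B, C = M[:n, :n], M[:n, n:], M[n:, n:]

schur = C - B.T @ np.linalg.solve(A, B)   # Schur complement of A in M

def is_pd(X):
    return bool(np.all(np.linalg.eigvalsh(X) > 0))

print(is_pd(M), is_pd(A) and is_pd(schur))   # both True here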
Part II deals with the special case where the objective function is a linear form and the
constraints are affine inequality and equality constraints. This subject is known as linear
programming, and the next four chapters give an introduction to the subject. Although
linear programming has been supplanted by convex programming and its variants, it is still
a great workhorse. It is also a great warm up for the general treatment of Lagrangian duality.
We pay particular attention to versions of Farkas’ lemma, which is at the heart of duality in
linear programming.
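As a tiny concrete example (ours; we call a standard solver only to show what a linear program looks like in practice, and we assume SciPy is available), consider maximizing x1 + 2x2 subject to x1 + x2 ≤ 4 and x1, x2 ≥ 0.

import numpy as np
from scipy.optimize import linprog

# linprog minimizes, so we negate the objective to maximize x1 + 2 x2.
c = np.array([-1.0, -2.0])
A_ub = np.array([[1.0, 1.0]])
b_ub = np.array([4.0])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)   # optimal solution (0, 4) with value 8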
Part III is devoted to nonlinear optimization, which is the case where the objective
function J is not linear and the constraints are inequality constraints. Since it is practically
impossible to say anything interesting if the constraints are not convex, we quickly consider
the convex case.
In optimization theory one often deals with function spaces of infinite dimension. Typ-
ically, these spaces either are Hilbert spaces or can be completed as Hilbert spaces. Thus
it is important to have some minimum knowledge about Hilbert spaces, and we feel that
this minimum knowledge includes the projection lemma, the fact that a closed subspace has
an orthogonal complement, the Riesz representation theorem, and a version of the Farkas–
Minkowski lemma. Chapter 12 covers these topics. A more detailed introduction to Hilbert
spaces is given in Appendix A.
Chapter 13 is devoted to some general results of optimization theory. A main theme is
to find sufficient conditions that ensure that an objective function has a minimum which
is achieved. We define the notion of a coercive function. The most general result is The-
orem 13.2, which applies to a coercive convex function on a convex subset of a separable
Hilbert space. In the special case of a coercive quadratic functional, we obtain the Lions–
Stampacchia theorem (Theorem 13.6), and the Lax–Milgram theorem (Theorem 13.7). We
define elliptic functionals, which generalize quadratic functions defined by symmetric posi-
tive definite matrices. We define gradient descent methods, and discuss their convergence.
A gradient descent method looks for a descent direction and a stepsize parameter, which is
obtained either using an exact line search or a backtracking line search. A popular technique
to find the search direction is steepest descent. In addition to steepest descent for the Eu-
clidean norm, we discuss steepest descent for an arbitrary norm. We also consider a special
case of steepest descent, Newton’s method. This method converges faster than the other
gradient descent methods, but it is quite expensive since it requires computing and storing
Hessians. We also present the method of conjugate gradients and prove its correctness. We
briefly discuss the method of gradient projection and the penalty method in the case of
constrained optima.
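For readers who want to see the basic scheme in code, here is a minimal steepest descent with backtracking line search on a small quadratic (our own sketch; the book's treatment is more general and includes the convergence analysis). The constants 0 < α < 1/2 and 0 < β < 1 are the usual backtracking parameters.

import numpy as np

def gradient_descent(f, grad, x0, alpha=0.3, beta=0.5, tol=1e-8, max_iter=500):
    """Steepest descent (Euclidean norm) with backtracking line search."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        t = 1.0
        # Backtrack until the sufficient-decrease (Armijo) condition holds.
        while f(x - t * g) > f(x) - alpha * t * (g @ g):
            t *= beta
        x = x - t * g
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
print(gradient_descent(f, grad, np.array([5.0, -5.0])))
print(np.linalg.solve(A, b))   # exact minimizer, for comparison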
Chapter 14 contains the most important results of nonlinear optimization theory. We
begin by defining the cone of feasible directions and then state a necessary condition for a
function to have local minimum on a set U that is not necessarily convex in terms of the
cone of feasible directions. The cone of feasible directions is not always convex, but it is if
the constraints are inequality constraints. An inequality constraint ϕ(u) ≤ 0 is said to be
active if ϕ(u) = 0. One can also define the notion of qualified constraint. Theorem 14.5
gives necessary conditions for a function J to have a minimum on a subset U defined by
qualified inequality constraints in terms of the Karush–Kuhn–Tucker conditions (for short
KKT conditions), which involve nonnegative Lagrange multipliers. The proof relies on a
version of the Farkas–Minkowski lemma. Some of the KKT conditions assert that λi ϕi(u) = 0 for i = 1, . . . , m.
Section 14.7 deals with Lagrangian duality. The dual function G(µ) is obtained by minimizing the Lagrangian with respect to v, with µ ∈ R^m_+. The dual program (D) is then to maximize G(µ) with respect to µ ∈ R^m_+. It turns out that G is a concave function, so the dual program is the maximization of a concave function over the convex set R^m_+. If p∗ denotes the optimal value of the primal program (P) and d∗ the optimal value of the dual program (D), then
d∗ ≤ p∗,
which is known as weak duality. Under certain conditions, d∗ = p∗ , that is, the duality gap
is zero, in which case we say that strong duality holds. Also, under certain conditions, a
solution of the dual yields a solution of the primal, and if the primal has an optimal solution,
then the dual has an optimal solution, but beware that the converse is generally false (see
Theorem 14.16). We also show how to deal with equality constraints, and discuss the use of
conjugate functions to find the dual function. Our coverage of Lagrangian duality is quite
thorough, but we do not discuss more general orderings such as the semidefinite ordering.
For these topics which belong to convex optimization, the reader is referred to Boyd and
Vandenberghe [18].
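To summarize the framework in symbols (this display is our own recapitulation, writing ϕ1, . . . , ϕm for the inequality constraints):

L(v, µ) = J(v) + Σ_{i=1}^m µi ϕi(v),   µ ∈ R^m_+,
G(µ) = inf_v L(v, µ),
(D): maximize G(µ) subject to µ ∈ R^m_+,
d∗ = sup_{µ ∈ R^m_+} G(µ) ≤ inf {J(v) | ϕi(v) ≤ 0, 1 ≤ i ≤ m} = p∗   (weak duality).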
In Chapter 15, we consider some deeper aspects of the theory of convex functions
that are not necessarily differentiable at every point of their domain. Some substitute for
the gradient is needed. Fortunately, for convex functions, there is such a notion, namely
subgradients. Geometrically, given a (proper) convex function f , the subgradients at x are
vectors normal to supporting hyperplanes to the epigraph of the function at (x, f (x)). The
subdifferential ∂f(x) of f at x is the set of all subgradients at x. A crucial property is that f is
differentiable at x iff ∂f (x) = {∇fx }, where ∇fx is the gradient of f at x. Another important
property is that a (proper) convex function f attains its minimum at x iff 0 ∈ ∂f (x). A
major motivation for developing this more sophisticated theory of “differentiation” of convex
functions is to extend the Lagrangian framework to convex functions that are not necessarily
differentiable.
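A one-dimensional example (ours) may help. For f(x) = |x|, which is convex but not differentiable at 0,

∂f(x) = {1} if x > 0,   ∂f(x) = {−1} if x < 0,   ∂f(0) = [−1, 1].

Since 0 ∈ ∂f(0), the criterion recalled above certifies that f attains its minimum at x = 0, even though the gradient does not exist there.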
Experience shows that the applicability of convex optimization is significantly increased
by considering extended real-valued functions, namely functions f : S → R ∪ {−∞, +∞},
where S is some subset of Rn (usually convex). This is reminiscent of what happens in
measure theory, where it is natural to consider functions that take the value +∞.
In Section 15.1, we introduce extended real-valued functions, which are functions that
may also take the values ±∞. In particular, we define proper convex functions, and the
closure of a convex function. Subgradients and subdifferentials are defined in Section 15.2.
We discuss some properties of subgradients in Section 15.3 and Section 15.4. In particular,
we relate subgradients to one-sided directional derivatives. In Section 15.5, we discuss the
problem of finding the minimum of a proper convex function and give some criteria in terms
of subdifferentials. In Section 15.6, we sketch the generalization of the results presented in
Chapter 14 about the Lagrangian framework to programs allowing an objective function and
inequality constraints which are convex but not necessarily differentiable.
This chapter relies heavily on Rockafellar [59]. We tried to distill the body of results
needed to generalize the Lagrangian framework to convex but not necessarily differentiable
functions. Some of the results in this chapter are also discussed in Bertsekas [9, 12, 10].
Chapter 16 is devoted to the presentation of one of the best methods currently known for solving optimization problems involving equality constraints, called ADMM (al-
ternating direction method of multipliers). In fact, this method can also handle more general
constraints, namely, membership in a convex set. It can also be used to solve lasso mini-
mization.
In this chapter, we consider the problem of minimizing a convex function J (not neces-
sarily differentiable) under the equality constraints Ax = b. In Section 16.1, we discuss the
dual ascent method. It is essentially gradient descent applied to the dual function G, but
since G is maximized, gradient descent becomes gradient ascent.
In order to make the minimization step of the dual ascent method more robust, one can
use the trick of adding the penalty term (ρ/2)‖Au − b‖₂² to the Lagrangian. We obtain the augmented Lagrangian
Lρ(u, λ) = J(u) + λᵀ(Au − b) + (ρ/2)‖Au − b‖₂²,
with λ ∈ R^m, and where ρ > 0 is called the penalty parameter. We obtain the minimization Problem (Pρ),
minimize J(u) + (ρ/2)‖Au − b‖₂² subject to Au = b,
which has the same solutions as the original problem; applying dual ascent to this augmented problem yields the method of multipliers. ADMM applies to problems in which the objective splits as f(x) + g(z) under the constraint Ax + Bz = c, with the augmented Lagrangian
Lρ(x, z, λ) = f(x) + g(z) + λᵀ(Ax + Bz − c) + (ρ/2)‖Ax + Bz − c‖₂²,
with λ ∈ R^p and for some ρ > 0. The major difference with the method of multipliers is that
instead of performing a minimization step jointly over x and z, ADMM first performs an
19
x-minimization step and then a z-minimization step. Thus x and z are updated in an alter-
nating or sequential fashion, which accounts for the term alternating direction. Because the
Lagrangian is augmented, some mild conditions on A and B imply that these minimization
steps are guaranteed to terminate. ADMM is presented in Section 16.3.
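In symbols, writing Lρ for the augmented Lagrangian above, one ADMM iteration performs the three updates below (this display is our summary; the precise statement and assumptions are in Section 16.3):

x^{k+1} = argmin_x Lρ(x, z^k, λ^k),
z^{k+1} = argmin_z Lρ(x^{k+1}, z, λ^k),
λ^{k+1} = λ^k + ρ(Ax^{k+1} + Bz^{k+1} − c).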
In Section 16.4, we prove the convergence of ADMM under exactly the same assumptions
as in Boyd et al. [17]. It turns out that Assumption (2) in Boyd et al. [17] implies that the
matrices AᵀA and BᵀB are invertible (as we show after the proof of Theorem 16.1). This
allows us to prove a convergence result stronger than the convergence result proven in Boyd
et al. [17]. In particular, we prove that all of the sequences (x^k), (z^k), and (λ^k) converge to optimal solutions (x̃, z̃) and λ̃.
In Section 16.5, we discuss stopping criteria. In Section 16.6, we present some applications
of ADMM, in particular, minimization of a proper closed convex function f over a closed
convex set C in Rn and quadratic programming. The second example provides one of the
best methods for solving quadratic problems, in particular, the SVM problems discussed in
Chapter 19. Section 16.7 gives applications of ADMM to ℓ1-norm problems, in particular,
lasso regularization which plays an important role in machine learning.
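As an illustration of the ℓ1 applications just mentioned, here is a compact NumPy sketch of ADMM applied to the lasso problem minimize (1/2)‖Ax − b‖₂² + τ‖z‖₁ subject to x − z = 0, using the standard splitting (as in Boyd et al. [17]); the code and parameter names are ours, not the book's.

import numpy as np

def admm_lasso(A, b, tau, rho=1.0, num_iters=200):
    """ADMM for (1/2)||Ax - b||^2 + tau*||z||_1 subject to x - z = 0."""
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)    # u = lambda / rho (scaled dual)
    Q = np.linalg.inv(A.T @ A + rho * np.eye(n))       # cached for the x-update
    Atb = A.T @ b
    for _ in range(num_iters):
        x = Q @ (Atb + rho * (z - u))                                    # x-minimization
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - tau / rho, 0.0)  # soft thresholding
        u = u + x - z                                                    # dual update
    return z

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 10))
x_true = np.zeros(10)
x_true[:3] = [3.0, -2.0, 1.5]
b = A @ x_true + 0.01 * rng.standard_normal(50)
print(np.round(admm_lasso(A, b, tau=1.0), 3))   # recovers a sparse vector close to x_true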
The next three chapters constitute Part IV, which covers some applications of optimiza-
tion theory (in particular Lagrangian duality) to machine learning.
In Chapter 17, we discuss linear regression. This problem can be cast as a learning
problem. We observe a sequence of pairs ((x1 , y1 ), . . . , (xm , ym )) called a set of training
data, where xi ∈ Rn and yi ∈ R, viewed as input-output pairs of some unknown function f
that we are trying to infer. The simplest kind of function is a linear function f(x) = xᵀw,
where w ∈ Rn is a vector of coefficients usually called a weight vector. Since the problem
is overdetermined and since our observations may be subject to errors, we can’t solve for w
exactly as the solution of the system Xw = y, so instead we solve the least-squares problem
of minimizing ‖Xw − y‖₂². In general, there are still infinitely many solutions so we add a regularizing term. If we add the term K‖w‖₂² to the objective function J(w) = ‖Xw − y‖₂²,
then we have ridge regression. This problem is discussed in Section 17.1.
We derive the dual program. The dual has a unique solution which yields a solution of the
primal. However, the solution of the dual is given in terms of the matrix XXᵀ (whereas the solution of the primal is given in terms of XᵀX), and since our data points xi are represented
by the rows of the matrix X, we see that this solution only involves inner products of the
xi . This observation is the core of the idea of kernel functions, which we introduce. We also
explain how to solve the problem of learning an affine function f(x) = xᵀw + b.
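A small numerical check (our code; K denotes the regularization constant as in the text) makes the observation concrete: the primal solution w = (XᵀX + K I)⁻¹Xᵀy coincides with w = Xᵀ(XXᵀ + K I)⁻¹y, and the second expression touches the data only through the matrix XXᵀ of inner products of the xi. The book's dual variables may be normalized differently, but the identity below illustrates the point.

import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 5))       # rows are the data points x_i
y = rng.standard_normal(20)
K = 0.7                                # ridge regularization constant

w_primal = np.linalg.solve(X.T @ X + K * np.eye(5), X.T @ y)    # uses X^T X
alpha = np.linalg.solve(X @ X.T + K * np.eye(20), y)            # uses X X^T only
w_dual = X.T @ alpha

print(np.allclose(w_primal, w_dual))   # True: both expressions give the same w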
In general, the vectors w produced by ridge regression have few zero entries. In practice, it
is highly desirable to obtain sparse solutions, that is, vectors w with many components equal
to zero. This can be achieved by replacing the regularizing term K‖w‖₂² by the regularizing term K‖w‖₁; that is, to use the ℓ1-norm instead of the ℓ2-norm; see Section 17.2. This
method has the exotic name of lasso regression. This time, there is no closed-form solution,
but this is a convex optimization problem and there are efficient iterative methods to solve
it, although we do not discuss such methods here.
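One such iterative method, shown here purely as an illustration since the text does not cover it, is the proximal gradient method (ISTA): alternate a gradient step on ‖Xw − y‖₂² with the soft-thresholding operator, which is the proximal map of K‖w‖₁. The sketch and its parameter names are ours.

import numpy as np

def ista(X, y, K, num_iters=500):
    """Proximal gradient (ISTA) for min ||Xw - y||_2^2 + K * ||w||_1."""
    w = np.zeros(X.shape[1])
    L = 2.0 * np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    t = 1.0 / L                                  # step size
    for _ in range(num_iters):
        g = 2.0 * X.T @ (X @ w - y)              # gradient of the smooth part
        v = w - t * g
        w = np.sign(v) * np.maximum(np.abs(v) - t * K, 0.0)   # soft thresholding
    return w

X = np.random.default_rng(3).standard_normal((30, 8))
y = X[:, 0] - 2.0 * X[:, 1]
print(np.round(ista(X, y, K=0.5), 3))   # typically sparse: several entries are exactly zero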
Chapter 18 is an introduction to positive definite kernels and the use of kernel functions
in machine learning.
Let X be a nonempty set. If the set X represents a set of highly nonlinear data, it
may be advantageous to map X into a space F of much higher dimension called the feature
space, using a function ϕ : X → F called a feature map. The idea is that ϕ “unwinds” the
description of the objects in F in an attempt to make it linear. The space F is usually a
vector space equipped with an inner product ⟨−, −⟩. If F is infinite dimensional, then we
assume that it is a Hilbert space.
Many algorithms that analyze or classify data make use of the inner products ⟨ϕ(x), ϕ(y)⟩, where x, y ∈ X. These algorithms make use of the function κ : X × X → C given by
κ(x, y) = ⟨ϕ(x), ϕ(y)⟩, x, y ∈ X,
called a kernel function.
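A classical example (ours) makes the definition concrete: for X = R² and the feature map ϕ(x) = (x1², √2 x1x2, x2²) into F = R³, one checks that ⟨ϕ(x), ϕ(y)⟩ = ⟨x, y⟩², so the kernel κ(x, y) = ⟨x, y⟩² can be evaluated without ever computing ϕ.

import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel on R^2."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(phi(x) @ phi(y), (x @ y) ** 2)   # both equal 1.0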
Chapter 2
Topology
Geometrically, Condition (D3) expresses the fact that in a triangle with vertices x, y, z,
the length of any side is bounded by the sum of the lengths of the other two sides. From
(D3), we immediately get
|d(x, y) − d(y, z)| ≤ d(x, z).
Let us give some examples of metric spaces. Recall that the absolute value |x| of a real
number x ∈ R is defined such that |x| = x if x ≥ 0, |x| = −x if x < 0, and for a complex number x = a + ib, by |x| = √(a² + b²).
Example 2.1.
1. Let E = R, and d(x, y) = |x − y|, the absolute value of x − y. This is the so-called
natural metric on R.
3. For every set E, we can define the discrete metric, defined such that d(x, y) = 1 iff
x 6= y, and d(x, x) = 0.
[a, b) = {x ∈ R | a ≤ x < b}, (interval closed on the left, open on the right)
(a, b] = {x ∈ R | a < x ≤ b}, (interval open on the left, closed on the right)
Let E = [a, b], and d(x, y) = |x − y|. Then ([a, b], d) is a metric space.
We will need to define the notion of proximity in order to define convergence of limits
and continuity of functions. For this we introduce some standard “small neighborhoods.”
Definition 2.2. Given a metric space E with metric d, for every a ∈ E, for every ρ ∈ R,
with ρ > 0, the set
B(a, ρ) = {x ∈ E | d(a, x) ≤ ρ}
is called the closed ball of center a and radius ρ, the set
B0(a, ρ) = {x ∈ E | d(a, x) < ρ}
is called the open ball of center a and radius ρ, and the set
S(a, ρ) = {x ∈ E | d(a, x) = ρ}
is called the sphere of center a and radius ρ. It should be noted that ρ is finite (i.e., not
+∞). A subset X of a metric space E is bounded if there is a closed ball B(a, ρ) such that
X ⊆ B(a, ρ).
Example 2.2.
1. In E = R with the distance |x − y|, an open ball of center a and radius ρ is the open
interval (a − ρ, a + ρ).
2. In E = R2 with the Euclidean metric, an open ball of center a and radius ρ is the set
of points inside the disk of center a and radius ρ, excluding the boundary points on
the circle.
3. In E = R3 with the Euclidean metric, an open ball of center a and radius ρ is the set
of points inside the sphere of center a and radius ρ, excluding the boundary points on
the sphere.
One should be aware that intuition can be misleading in forming a geometric image of a
closed (or open) ball. For example, if d is the discrete metric, a closed ball of center a and
radius ρ < 1 consists only of its center a, and a closed ball of center a and radius ρ ≥ 1
consists of the entire space!
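The discrete-metric statement can be checked directly (a small sketch of ours, for a four-element set):

def discrete_metric(x, y):
    """The discrete metric: d(x, y) = 1 if x != y, and d(x, x) = 0."""
    return 0.0 if x == y else 1.0

E = {"a", "b", "c", "d"}

def closed_ball(center, rho):
    return {x for x in E if discrete_metric(center, x) <= rho}

print(closed_ball("a", 0.5))   # {'a'}: radius < 1 gives only the center
print(closed_ball("a", 1.0))   # the whole space E: radius >= 1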
If E = [a, b], and d(x, y) = |x − y|, as in Example 2.1, an open ball B0 (a, ρ), with
ρ < b − a, is in fact the interval [a, a + ρ), which is closed on the left.
We now consider a very important special case of metric spaces, normed vector spaces.
Normed vector spaces have already been defined in Volume I, but for the reader’s convenience we repeat the definition.
Definition 2.3. Let E be a vector space over a field K, where K is either the field R of
reals, or the field C of complex numbers. A norm on E is a function ‖ ‖ : E → R+, assigning a nonnegative real number ‖u‖ to any vector u ∈ E, and satisfying the following conditions for all x, y ∈ E and all λ ∈ K:
(N1) ‖x‖ ≥ 0, and ‖x‖ = 0 iff x = 0 (positivity);
(N2) ‖λx‖ = |λ| ‖x‖ (homogeneity, or scaling);
(N3) ‖x + y‖ ≤ ‖x‖ + ‖y‖ (triangle inequality).
In particular, (N2) implies that ‖−x‖ = ‖x‖.
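For instance (our illustration), on E = R³ the usual ℓ1, ℓ2, and ℓ∞ norms all satisfy (N1)-(N3); the snippet below computes them for two vectors and spot-checks the triangle inequality (N3).

import numpy as np

x = np.array([1.0, -2.0, 3.0])
y = np.array([0.5, 4.0, -1.0])

for p in (1, 2, np.inf):
    nx, ny, nxy = np.linalg.norm(x, p), np.linalg.norm(y, p), np.linalg.norm(x + y, p)
    print(p, nx, ny, nxy, nxy <= nx + ny)   # triangle inequality holds in each case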