Machine Learning Lecture Note

Seongjai Kim
This lecture note will grow as time marches on; various core algorithms, useful techniques, and interesting examples will be incorporated over time.
Seongjai Kim
April 6, 2022
Contents
Prologue i
1 Introduction 1
1.1. Why and What in Machine Learning (ML)? . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1. Inference problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2. Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.3. Machine learning examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2. Three Different Types of ML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1. Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.2. Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3. Python in 25 Minutes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Bibliography 311
Index 317
Chapter 1
Introduction
What is machine learning, and why do we study it? This is a hard question to which we can only give a somewhat fuzzy answer. But at a high enough level of abstraction, there are two answers: machine learning is about drawing inferences from data, and it is about building models of data.
These answers are so abstract that they are probably completely unsatisfying. But let's (start to) clear things up by looking at some particular examples of “inference” and “modeling” problems.
Contents of Chapter 1
1.1. Why and What in Machine Learning (ML)? . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2. Three Different Types of ML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3. Python in 25 Minutes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
This Course
• This course will deal with some actual ML algorithms.
• However, we will put an emphasis on basic mathematical concepts
on which these algorithms are built. In particular, to really under-
stand anything about ML, you need to have a very good grasp of
– calculus,
– linear algebra, and
– probabilistic inference (i.e. the mathematical theory of probabil-
ity and how to use it).
• We will cover the key parts of these branches of applied mathematics
in depth and in the context of ML.
• More specifically, the mathematical topics to be treated in this
course can be broken down into four basic subject areas:
– Representations for data and operators that map data to decisions,
estimates, or both. We will start with a thorough discussion of linear
representations; these are important/useful by themselves, and also
are used as building-blocks for nonlinear representations. Here is
where we will need a lot of linear algebra and its extension.
– Estimation. What does it mean to estimate a parameter for/from a
data set? We will try to put this question on as firm a mathematical
footing as we can, using the language of statistics.
– Modeling.
– Computing. Finally, we will get a look at how computations for
solving problems arising in ML are actually done. We will look at
some basic algorithms from optimization, and some algebraic tech-
niques from numerical linear algebra.
In particular, we will experiment with examples from several chapters of Python Machine Learning, 3rd Ed. [53].
Required Coding/Reading : Read Chapter 1, Python Machine Learning, 3rd Ed. Install
required packages, including scikit-learn.
(c) If I give you a recording of somebody speaking, can you produce text of
what they are saying?
1.1.2. Modeling
A second type of problem associated with machine learning (ML) might
be roughly described as:
One example is regression analysis. Most models can be broken into two
categories:
• Question: How should we pick the hypothesis space, the set of possible
functions f ?
– Often, there exist infinitely many hypotheses
– Let’s select f from PM , polynomials of degree ≤ M .
– Which one is best?
• Error: misfit
Key Idea 1.6. Training and test performance. Assume that each
training and test example–label pair (x, y) is drawn independently at
random from the same (but unknown) population of examples and labels.
Represent this population as a probability distribution p(x, y), so that:
• As of the end of 2021, Python, R, and Java are the most popular skills when it comes to ML and data science jobs, with Python being the leader.
• R was built as a statistical language and is more functional, whereas Python is more object-oriented. R is the language of choice for a quick prototype, but for the long term Python is the preferred language.
• Many users with Matlab experience, in particular, will be capa-
ble of taking the step to using modules in scientific-computing-and-
graphics packages such as numpy, scipy, and matplotlib, because the
commands are sufficiently similar to Matlab.
Here we deal with one of the popular optimization problems, a Python code for solving it, and some essentials of Python programming.
 1 import numpy as np
 2 import scipy.optimize as opt
 3 import time
 4
 5 x0 = np.array([-1., 2.])
 6
 7 def rosen(xn):
 8     return (1.-xn[0])**2 + 100*(xn[1]-xn[0]**2)**2
 9
10 # default: method='BFGS'
11 t0 = time.time()
12 res1 = opt.minimize(rosen, x0, tol=1e-7,
13                     options={'disp':True})
14 print(res1.x)
15 print('Method = %s; E-time = %.4f\n' %('BFGS', time.time()-t0))
16
17 t0 = time.time()
18 res2 = opt.minimize(opt.rosen, x0, method='Newton-CG', tol=1e-7,
19                     jac=opt.rosen_der, hess=opt.rosen_hess)
20 print('Method = %s; E-time = %.4f' %('Newton-CG', time.time()-t0))
21 print(res2)
Note:
• A Python code is implemented to find the minimizer of the Rosenbrock
function (1.5), saved in the file rosenbrock_opt.py, as shown in Fig-
ure 1.7 (top).
• Under the Linux OS, you can execute it by giving the following com-
mand on the Terminal:
python rosenbrock_opt.py Enter
• Its result is attached in Figure 1.7 (bottom).
• Here are some remarks for the Python code.
– Lines 1-3: Python requires you to import the modules and packages to be used.
– Line 5: It shows how to make an array. Try np.ndarray or
np.zeros for a Matlab-like multi-dimensional array creation.
– Lines 7-8: It shows how to define (def) functions.
– Indentation: Leading whitespace (spaces and tabs) at the begin-
ning of a logical line is used to compute the indentation level of
the line, which in turn is used to determine the grouping of state-
ments.
– Line 10 (Comment): After #, the rest of the line is ignored by
Python.
– Lines 12-13 and 18-19: They call the function minimize from scipy.optimize.
– Lines 14-15 and 20-21: They show how to print formatted output and the result objects.
Footnote 1: Matplotlib is the most popular plotting (data visualization) package for Python. Many users with Matlab experience will be capable of taking the step to using import matplotlib.pyplot as plt, as the commands are sufficiently similar to Matlab.
Chapter 2
Simple Machine Learning Algorithms for Classification
In this chapter, we will make use of two of the first algorithmically described machine learning algorithms for classification: the perceptron and adaptive linear neurons (Adaline). We will start by implementing a perceptron step by step in Python and training it to classify different flower species in the Iris dataset.
Contents of Chapter 2
2.1. Binary Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2. The Perceptron Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3. Adaline: ADAptive LInear NEuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Exercises for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Remark 2.2. Neurons are interconnected nerve cells that are involved
in the processing and transmitting of chemical and electrical signals.
Such a nerve cell can be described as a simple logic gate with binary
outputs;
• multiple signals arrive at the dendrites,
• they are integrated into the cell body,
• and if the accumulated signal exceeds a certain threshold, an output
signal is generated that will be passed on by the axon.
Linear classifiers
As artificial neurons, they have the following characteristics:
• Inputs are feature values: x
• Each feature has a weight: w
• Weighted sum (integration) is the activation:

    activation_w(x) = Σ_j w_j x_j = w · x.    (2.1)
Unknowns, in ML:
Training : w
Prediction : activationw (x)
Examples:
• Perceptron
• Adaline (ADAptive LInear NEuron)
• Support Vector Machine (SVM) ⇒ nonlinear decision boundaries, too
where θ is a threshold.
For simplicity, we can bring the threshold θ in (2.2) to the left side of the
equation; define a weight-zero as w0 = −θ and x0 = 1 and reformulate as
    φ(z) = 1 if z ≥ 0,  −1 otherwise,   where z = w^T x = w_0 x_0 + w_1 x_1 + · · · + w_m x_m.    (2.3)
In the ML literature, the negative threshold w_0 = −θ is called the bias (or intercept), while x_0 (which always has the value 1) is the bias unit.
The update of the weight vector w can be more formally written as:

    w := w + Δw,   Δw = η (y^(i) − ŷ^(i)) x^(i),    (2.4)

where η is the learning rate, 0 < η < 1, y^(i) is the true class label of the i-th training sample, and ŷ^(i) denotes the predicted class label. (See (2.15) below for the update with the gradient descent method.)
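A minimal NumPy sketch of the perceptron rule (2.4) may look as follows; the function name perceptron_fit, the learning rate eta, and n_iter are illustrative choices, not taken from the textbook.

import numpy as np

def perceptron_fit(X, y, eta=0.1, n_iter=10):
    """Train perceptron weights w = [w_0, w_1, ..., w_m] by the rule (2.4)."""
    w = np.zeros(X.shape[1] + 1)             # w[0] is the bias unit w_0
    for _ in range(n_iter):
        for xi, yi in zip(X, y):             # one sample at a time
            z = w[0] + xi.dot(w[1:])
            yhat = 1 if z >= 0. else -1      # threshold function (2.3)
            dw = eta * (yi - yhat)           # zero when the prediction is correct
            w[1:] += dw * xi
            w[0] += dw
    return w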
Theorem 2.8. Assume the data set D = {(x^(i), y^(i))} is linearly separable with margin γ, i.e.,

    ∃ ŵ, ||ŵ|| = 1,   y^(i) ŵ^T x^(i) ≥ γ > 0, ∀ i.    (2.6)

Suppose that ||x^(i)|| ≤ R, ∀ i, for some R > 0. Then, the maximum number of mistakes made by the perceptron algorithm is bounded by R²/γ².
Proof. Assume the perceptron algorithm makes yet another mistake for (x^(ℓ), y^(ℓ)). Then

    ||w^(ℓ+1)||² = ||w^(ℓ) + η (y^(ℓ) − ŷ^(ℓ)) x^(ℓ)||²
                 = ||w^(ℓ)||² + ||η (y^(ℓ) − ŷ^(ℓ)) x^(ℓ)||² + 2η (y^(ℓ) − ŷ^(ℓ)) w^(ℓ)T x^(ℓ)    (2.7)
                 ≤ ||w^(ℓ)||² + ||η (y^(ℓ) − ŷ^(ℓ)) x^(ℓ)||² ≤ ||w^(ℓ)||² + (2ηR)²,

so that, starting from w^(0) = 0 and applying (2.7) after each of the first ℓ mistakes,

    ||w^(ℓ)||² ≤ ℓ · (2ηR)².    (2.9)

On the other hand, by (2.6) each mistake increases the projection of w onto ŵ by at least 2ηγ, which implies

    ŵ^T w^(ℓ) ≥ ℓ · (2ηγ)    (2.10)

and therefore, since ||ŵ|| = 1,

    ||w^(ℓ)||² ≥ ℓ² · (2ηγ)².    (2.11)

It follows from (2.9) and (2.11) that ℓ ≤ R²/γ².
Averaged perceptron is an algorithmic modification that helps with both issues:
– Average the weight vectors across either all iterations or a last part of the iterations. (Figure 2.3)
Optimal Separator?
Figure 2.4
Example 2.9. Support Vector Machine (Cortes & Vapnik, 1995) chooses
the linear separator with the largest margin.
Figure 2.5
• − vs {◦, +} ⇒ weights w −
• + vs {◦, −} ⇒ weights w +
• ◦ vs {+, −} ⇒ weights w ◦
φ(w T x) = w T x.
Definition 2.11. In the case of Adaline, we can define the cost function
J to learn the weights as the Sum of Squared Errors (SSE) between
the calculated outcomes and the true class labels:
    J(w) = (1/2) Σ_i ( y^(i) − φ(w^T x^(i)) )².    (2.12)
Algorithm 2.12. The Gradient Descent Method uses −∇J (w) for
the search direction (update direction):
Compare the above Widrow-Hoff rule with the perceptron rule in (2.4) on
p. 21.
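The Widrow-Hoff (batch gradient descent) update just referenced can be sketched in NumPy as below; the function name and default parameters are illustrative, not from the textbook.

import numpy as np

def adaline_gd(X, y, eta=0.01, n_iter=50):
    """Minimize the SSE cost (2.12) with batch gradient descent."""
    w = np.zeros(X.shape[1] + 1)
    for _ in range(n_iter):
        z = w[0] + X.dot(w[1:])          # linear activation: phi(z) = z
        errors = y - z
        w[1:] += eta * X.T.dot(errors)   # Delta w = eta * sum_i (y_i - phi(z_i)) x_i
        w[0] += eta * errors.sum()
    return w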
♦ Hyperparameters
Definition 2.13. In ML, a hyperparameter is a parameter whose
value is set before the learning process begins. Thus it is an algorithmic
parameter. Examples are
• The learning rate (η)
• The number of maximum epochs/iterations (n_iter)
• The SGD method updates the weights based on a single training ex-
ample, while the gradient descent method uses the sum of the accu-
mulated errors over all samples.
• The SGD method typically reaches convergence much faster be-
cause of the more frequent weight updates.
• Since each search direction is calculated based on a single training example, the error surface is noisier than in the gradient descent method, which can also have the advantage that the SGD method can escape shallow local minima more readily.
• To obtain accurate results via the SGD method, it is important to
present it with data in a random order, which is why we want
to shuffle the training set for every epoch to prevent cycles.
• In the SGD method, the learning rate η is often set adaptively, de-
creasing over iteration k . For example, ηk = c1 /(k + c2 ).
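A sketch of the SGD variant described above, with shuffling at every epoch and the decreasing learning rate ηk = c1/(k + c2); the constants c1 and c2 and the function name are placeholders.

import numpy as np

def adaline_sgd(X, y, c1=1.0, c2=10.0, n_epochs=20, seed=1):
    """Stochastic gradient descent for the Adaline cost, one sample per update."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1] + 1)
    k = 0
    for _ in range(n_epochs):
        idx = rng.permutation(len(y))        # shuffle to prevent cycles
        for i in idx:
            eta = c1 / (k + c2)              # decreasing learning rate
            z = w[0] + X[i].dot(w[1:])
            err = y[i] - z
            w[1:] += eta * err * X[i]
            w[0] += eta * err
            k += 1
    return w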
♦ Mini-batch learning
Definition 2.16. A compromise between the batch gradient descent and
the SGD is the so-called mini-batch learning. Mini-batch learning can
be understood as applying batch gradient descent to smaller subsets of
the training data – for example, 32 samples at a time.
Required Coding/Reading : Read Chapter 2 and experiment all examples in it, Python
Machine Learning, 3rd Ed.
To get the Iris dataset, you may reuse some lines of code from earlier pages (p. 29 and after).
2.3. Perturb the dataset X by a random Gaussian noise Gσ with a noticeable σ (so that Gσ(X) is not linearly separable) and do the examples in Exercise 2 again.
Chapter 3
Gradient-based Methods for Optimization
Contents of Chapter 3
3.1. Gradient Descent Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2. Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3. Variants of Optimization Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4. The Levenberg–Marquardt Algorithm, for Nonlinear Least-Squares Problems . . . . . 61
Exercises for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Project.1. Gaussian Sailing to Overcome Local Minima Problems . . . . . . . . . . . . . . . 67
Footnote 1: The Rosenbrock function in 3D is given as f(x, y, z) = [(1 − x)² + 100(y − x²)²] + [(1 − y)² + 100(z − y²)²], which has exactly one minimum at (1, 1, 1). Similarly, one can define the Rosenbrock function in general N-dimensional spaces, for N ≥ 4, by adding one more component for each enlarged dimension. That is,

    f(x) = Σ_{i=1}^{N−1} [ (1 − x_i)² + 100(x_{i+1} − x_i²)² ],   where x = [x_1, x_2, · · · , x_N] ∈ R^N.

See Wikipedia (https://en.wikipedia.org/wiki/Rosenbrock_function) for details.
The goal of the gradient descent method is to address directly the process
of minimizing the function f , using the fact that −∇f (x) is the direction of
steepest descent of f at x . Given an initial point x 0 , we move it to the
direction of −∇f (x 0 ) so as to get a smaller function value. That is,
x 1 = x 0 − γ ∇f (x 0 ) ⇒ f (x 1 ) < f (x 0 ).
x n+1 = x n − γ ∇f (x n ), (3.3)
for some γ > 0. The parameter γ is called the step length or the learn-
ing rate.
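A direct sketch of iteration (3.3) for the 2D Rosenbrock function; the hand-written gradient, the step length γ = 1e-3, and the iteration count are only working guesses.

import numpy as np

def rosen_grad(x):
    """Gradient of f(x,y) = (1-x)^2 + 100(y-x^2)^2."""
    dfdx = -2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2)
    dfdy = 200*(x[1] - x[0]**2)
    return np.array([dfdx, dfdy])

x = np.array([-1., 2.])
gamma = 1e-3
for n in range(100000):            # x_{n+1} = x_n - gamma * grad f(x_n)
    x = x - gamma * rosen_grad(x)
print(x)                           # slowly approaches the minimizer (1, 1)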
Picking the step length γ : Assume that the step length was chosen to
be independent of n, although one can play with other choices as well. The
question is how to select γ in order to make the best gain of the method. To
turn the right-hand side of (3.5) into a more manageable form, we invoke
Taylor’s Theorem:2
    f(x + t) = f(x) + t f'(x) + ∫_x^{x+t} (x + t − s) f''(s) ds.    (3.6)

Assuming that |f''(s)| ≤ L, we have

    f(x + t) ≤ f(x) + t f'(x) + (t²/2) L.

Now, letting x = x_n and t = −γ f'(x_n) reads

    f(x_{n+1}) = f(x_n − γ f'(x_n))
               ≤ f(x_n) − γ f'(x_n) f'(x_n) + (1/2) L [γ f'(x_n)]²    (3.7)
               = f(x_n) − [f'(x_n)]² ( γ − (L/2) γ² ).

The gain (learning) from the method occurs when

    γ − (L/2) γ² > 0   ⇒   0 < γ < 2/L,    (3.8)

and it will be best when γ − (L/2)γ² is maximal. This happens at the point

    γ = 1/L.    (3.9)
Footnote 2: Taylor's Theorem with integral remainder: Suppose f ∈ C^{n+1}[a, b] and x_0 ∈ [a, b]. Then, for every x ∈ [a, b],

    f(x) = Σ_{k=0}^{n} [ f^(k)(x_0) / k! ] (x − x_0)^k + R_n(x),   R_n(x) = (1/n!) ∫_{x_0}^{x} (x − s)^n f^{(n+1)}(s) ds.
    lim_{n→∞} x_n = x̂.    (3.11)

    f'(x̂) = lim_{n→∞} f'(x_n) = 0,    (3.13)
The gradient descent method thus updates its iterates minimizing the fol-
lowing surrogate function:
    Q_n(x) = f(x_n) + ∇f(x_n) · (x − x_n) + (1/(2γ)) ||x − x_n||².    (3.18)
Differentiating the function and equating to zero reads
We can implement the full gradient descent algorithm as follows. The algo-
rithm has only one free parameter: γ.
3.1.3. Examples
Here we examine the convergence of gradient descent on three examples: a well-conditioned quadratic, a badly-conditioned quadratic, and a non-convex function, as shown by Dr. Fabian Pedregosa, UC Berkeley.
γ = 0.2
Figure 3.3: On a well-conditioned quadratic function, the gradient descent converges in a
few iterations to the optimum
γ = 0.02
Figure 3.4: On a badly-conditioned quadratic function, the gradient descent converges and
takes many more iterations to converge than on the above well-conditioned problem. This
is partially because gradient descent requires a much smaller step size on this problem
to converge.
γ = 0.02
Figure 3.5: Gradient descent also converges on a badly-conditioned non-convex problem.
Convergence is slow in this case.
The gradient descent algorithm with backtracking line search then becomes
Algorithm 3.11. (The Gradient Descent Algorithm, with Backtracking Line Search).
    input: initial guess x_0, step size γ > 0;
    for n = 0, 1, 2, · · · do
        initial step size estimate γ_n;
        while (TRUE) do
            if f(x_n − γ_n ∇f(x_n)) ≤ f(x_n) − (γ_n/2) ||∇f(x_n)||²
                break;                                                  (3.24)
            else γ_n = γ_n / 2;
        end while
        x_{n+1} = x_n − γ_n ∇f(x_n);
    end for
    return x_{n+1};
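A sketch of Algorithm 3.11 in NumPy; the initial step-size estimate and the stopping rule (a fixed iteration count) are simplifications made for illustration.

import numpy as np
from scipy.optimize import rosen, rosen_der

def gd_backtracking(f, grad, x0, gamma0=1.0, n_iter=5000):
    """Gradient descent with the halving line search (3.24)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        g = grad(x)
        gamma = gamma0                                  # initial step size estimate
        while f(x - gamma*g) > f(x) - 0.5*gamma*g.dot(g):
            gamma *= 0.5                                # else: gamma_n = gamma_n / 2
        x = x - gamma*g
    return x

print(gd_backtracking(rosen, rosen_der, [-1., 2.]))     # improves on the fixed-step version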
The following examples show the convergence of gradient descent with the
aforementioned backtracking line search strategy for the step size.
Figure 3.7: In this example we can clearly see the effect of the backtracking line search strategy: once the algorithm is in a region of low curvature, it can take larger step sizes. The final result is a much improved convergence compared with the fixed step-size equivalent.
Figure 3.8: The backtracking line search also improves convergence on non-convex prob-
lems.
H n = ∇2 f (x n ), (3.26)
We can find the optimum of the function by differentiating and equating to zero. This way we find (assuming the Hessian is invertible)

    x_{n+1} = arg min_x Q_n(x) = x_n − γ [∇²f(x_n)]^{−1} ∇f(x_n).    (3.28)
Remark 3.15. The Newton's method can be seen as a way to find the critical points of f, i.e., x̂ such that ∇f(x̂) = 0. Let

    x_{n+1} = x_n + Δx.    (3.29)

Then

    ∇f(x_{n+1}) = ∇f(x_n + Δx) = ∇f(x_n) + ∇²f(x_n) Δx + O(|Δx|²).
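A sketch of the Newton iteration (with γ = 1), using SciPy's built-in Rosenbrock derivatives rosen_der and rosen_hess.

import numpy as np
from scipy.optimize import rosen, rosen_der, rosen_hess

x = np.array([-1., 2.])
for n in range(20):
    g = rosen_der(x)
    H = rosen_hess(x)
    x = x - np.linalg.solve(H, g)    # Newton step: solve H dx = grad f, then subtract
    print(n, x, rosen(x))
# converges to (1, 1) within a dozen iterations, much faster than gradient descent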
For the three example functions in Section 3.1.3, the Newton’s method per-
forms better as shown in the following.
γ=1
Figure 3.10: In this case the approximation is exact and it converges in a single iteration.
γ=1
Figure 3.11: Although badly-conditioned, the cost function is quadratic; it converges in a
single iteration.
γ=1
Figure 3.12: When the Hessian is close to singular, there might be some numerical insta-
bilities. However, it is better than the result of the gradient descent method in Figure 3.5.
Claim 3.17. The Hessian (or Hessian matrix) describes the local
curvature of a function. The eigenvalues and eigenvectors of the Hes-
sian have geometric meaning:
• The first principal eigenvector (corresponding to the largest eigen-
value in modulus) is the direction of greatest curvature.
• The last principal eigenvector (corresponding to the smallest eigen-
value in modulus) is the direction of least curvature.
• The corresponding eigenvalues are the respective amounts of these
curvatures.
The eigenvectors of the Hessian are called principal directions, which
are always orthogonal to each other. The eigenvalues of the Hessian
are called principal curvatures and are invariant under rotation and
always real-valued.
and therefore

    H^{−1} v = Σ_{j=1}^{d} (1/λ_j) ξ_j u_j,    (3.32)

where

    angle(a, b) = arccos( (a · b) / (||a|| ||b||) ).
This implies that by setting v = −∇f (x n ), the adjusted vector H −1 v is a
rotation (and scaling) of the steepest descent vector towards the least
curvature direction.
Claim 3.19. The net effect of H −1 is to rotate and scale the gradi-
ent vector to face towards the minimizer by a certain degree, which may
make the Newton’s method converge much faster than the gradient de-
scent method.
Example 3.20. One can easily check that at each point (x, y) on the ellipsoid

    z = f(x, y) = (x − h)²/a² + (y − k)²/b²,    (3.34)

the vector −[∇²f(x, y)]^{−1} ∇f(x, y) is always facing towards the minimizer (h, k). See Exercise 2.
    y_n · s_n > 0.    (3.38)

• If the function f is not strongly convex, then the above curvature condition has to be enforced explicitly. In order to maintain the symmetry and positive definiteness of H_{n+1}, the update formula can be chosen as³
• Imposing the secant condition H_{n+1} s_n = y_n and with (3.39), we get the update equation of H_{n+1}:

    H_{n+1} = H_n + (y_n y_n^T)/(y_n · s_n) − (H_n s_n)(H_n s_n)^T/(s_n · H_n s_n).    (3.40)

Footnote 3: Rank-one matrices: Let A be an m × n matrix. Then rank(A) = 1 if and only if there exist column vectors v ∈ R^m and w ∈ R^n such that A = v w^T.
• Let B_n = H_n^{−1}, the inverse of H_n. Then, applying the Sherman-Morrison formula, we can update B_{n+1} = H_{n+1}^{−1} as follows:

    B_{n+1} = ( I − (s_n y_n^T)/(y_n · s_n) ) B_n ( I − (y_n s_n^T)/(y_n · s_n) ) + (s_n s_n^T)/(y_n · s_n).    (3.41)

See Exercise 4.
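A sketch of a quasi-Newton loop that maintains B_n = H_n^{−1} via (3.41); the crude halving line search and the curvature guard are illustrative simplifications, not the full BFGS of a library.

import numpy as np
from scipy.optimize import rosen, rosen_der

def bfgs_sketch(f, grad, x0, n_iter=100):
    x = np.asarray(x0, dtype=float)
    n = x.size
    B = np.eye(n)                        # B_0 = I approximates the inverse Hessian
    g = grad(x)
    for _ in range(n_iter):
        p = -B.dot(g)                    # quasi-Newton search direction
        t = 1.0
        while f(x + t*p) > f(x):         # crude backtracking on the step length
            t *= 0.5
        s = t*p                          # s_n = x_{n+1} - x_n
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g                    # y_n = grad f(x_{n+1}) - grad f(x_n)
        if y.dot(s) > 1e-12:             # curvature condition (3.38)
            rho = 1.0 / y.dot(s)
            I = np.eye(n)
            B = (I - rho*np.outer(s, y)).dot(B).dot(I - rho*np.outer(y, s)) \
                + rho*np.outer(s, s)     # inverse update (3.41)
        x, g = x_new, g_new
    return x

print(bfgs_sketch(rosen, rosen_der, [-1., 2.]))   # typically lands near (1, 1)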
Figure 3.15: On the badly-conditioned quadratic problem, the BFGS algorithm quickly builds a good estimator of the Hessian and is able to converge very fast towards the optimum. Note that, just like Newton's method (and unlike gradient descent), BFGS does not seem to be affected (much) by a bad conditioning of the problem.
Figure 3.16: Even on the ill-conditioned nonconvex problem, the BFGS algorithm also
converges extremely fast, with a convergence that is more similar to Newton’s method
than to gradient descent.
∇f (x n ) ≈ ∇fi (x n )
We can write the full stochastic gradient algorithm as follows. The algo-
rithm has only one free parameter: γ.
Algorithm 3.23. (Stochastic Gradient Descent).
The SGD can be much more efficient than gradient descent in the case in
which the objective consists of a large sum, because at each iteration we
only need to evaluate a partial gradient and not the full gradient.
Example 3.24. A least-squares problem can be written in the form acceptable by SGD, since

    (1/m) ||A x − b||² = (1/m) Σ_{i=1}^{m} (A_i^T x − b_i)²,    (3.44)
γn = γ
γ = 0.2
Figure 3.17: For the well-conditioned convex problem, stochastic gradient with constant
step size converges quickly to a neighborhood of the optimum, but then bounces around.
Figure 3.18: Stochastic Gradient with decreasing step sizes is quite robust to the choice of
step size. On one hand there is really no good way to set the step size (e.g., no equivalent
of line search for Gradient Descent) but on the other hand it converges for a wide range of
step sizes.
Question. Why does the SGD converge, despite its update being a very rough estimate of the gradient?
This is the crucial property that makes SGD work. For a full proof, see
e.g. [7].
If the function ŷ(x; p) is nonlinear in the model parameters p, then the minimization of the χ-squared function f with respect to the parameters must be carried out iteratively:

    p := p + Δp.    (3.49)
The goal of each iteration is to find the parameter update ∆p that reduces
f . We will begin with the gradient descent method and the Gauss-Newton
method.
    Δp_gd = γ J^T W (y − ŷ(p)),    (3.51)

    [J^T W J + λ I] Δp_lm = J^T W (y − ŷ(p)),    (3.56)
Acceptance of the step: There have been many variations of the Levenberg-Marquardt method, particularly for acceptance criteria. For example, at the k-th step, we first compute

    ρ_k(Δp_lm) = [ f(p) − f(p + Δp_lm) ] / [ (y − ŷ)^T W (y − ŷ) − (y − ŷ − J Δp_lm)^T W (y − ŷ − J Δp_lm) ]    (3.57)
               = [ f(p) − f(p + Δp_lm) ] / [ Δp_lm^T ( λ_k Δp_lm + J^T W (y − ŷ(p)) ) ],   (using (3.56))

and then the step is accepted when ρ_k(Δp_lm) > ε_0, for some threshold ε_0 > 0 (e.g., ε_0 = 0.1). An example implementation reads

    Initialize p_0 and λ_0; (e.g. λ_0 = 0.01)
    Get Δp_lm from (3.56);
    Get ρ_k from (3.57);
    If ρ_k > ε_0:                                                           (3.58)
        p_{k+1} = p_k + Δp_lm;   λ_{k+1} = λ_k · max[1/3, 1 − (2ρ_k − 1)³];   ν_k = 2;
    otherwise:   λ_{k+1} = λ_k ν_k;   ν_{k+1} = 2ν_k;
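A sketch of the loop (3.56)–(3.58) with W = I; the model, the synthetic data, and all parameter values below are only for illustration.

import numpy as np

def levenberg_marquardt(residual, jac, p0, lam=1e-2, eps0=0.1, n_iter=50):
    """Minimize f(p) = ||r(p)||^2 with r(p) = y - yhat(x; p), W = I."""
    p = np.asarray(p0, dtype=float)
    nu = 2.0
    for _ in range(n_iter):
        r = residual(p)                        # r = y - yhat(p)
        J = jac(p)                             # J = d yhat / d p
        A = J.T.dot(J) + lam*np.eye(p.size)    # [J^T J + lambda I]
        dp = np.linalg.solve(A, J.T.dot(r))    # LM step, eq. (3.56)
        r_new = residual(p + dp)
        rho = (r.dot(r) - r_new.dot(r_new)) / dp.dot(lam*dp + J.T.dot(r))  # cf. (3.57)
        if rho > eps0:                         # accept the step
            p = p + dp
            lam *= max(1.0/3.0, 1.0 - (2.0*rho - 1.0)**3)
            nu = 2.0
        else:                                  # reject and increase damping
            lam *= nu
            nu *= 2.0
    return p

# illustration: fit yhat(x; p) = p0*exp(p1*x) to synthetic data
rng = np.random.default_rng(0)
x = np.linspace(0., 3., 20)
y = 2.0*np.exp(0.5*x) + 0.05*rng.standard_normal(x.size)
res = lambda p: y - p[0]*np.exp(p[1]*x)
jac = lambda p: np.column_stack([np.exp(p[1]*x), p[0]*x*np.exp(p[1]*x)])
print(levenberg_marquardt(res, jac, [1.0, 1.0]))   # should approach [2.0, 0.5]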
3.1. (Gradient descent method). Implement the gradient descent algorithm (3.22) and
the gradient descent algorithm with backtracking line search (3.24).
3.2. (Net effect of the inverse Hessian matrix). Verify the claim in Example 3.20.
3.3. (Newton’s method). Implement a line search version of the Newton’s method (3.31)
with the Rosenbrock function in 2D.
(a) Recall the results in Exercise 1. With the backtracking line search, is the New-
ton’s method better than the gradient descent method?
(b) Now, we will approximate the Hessian matrix by its diagonal. That is,

    D_n := diag( ∂²f/∂x², ∂²f/∂y² )(x_n)  ≈  ∇²f(x_n) = [ ∂²f/∂x²    ∂²f/∂x∂y
                                                           ∂²f/∂y∂x   ∂²f/∂y²  ](x_n).    (3.59)

How does the Newton's method perform when the Hessian matrix is replaced by D_n?
3.4. (BFGS update). Consider H_{n+1} and B_{n+1} in (3.40) and (3.41), respectively:

    H_{n+1} = H_n + (y_n y_n^T)/(y_n · s_n) − (H_n s_n)(H_n s_n)^T/(s_n · H_n s_n),
    B_{n+1} = ( I − (s_n y_n^T)/(y_n · s_n) ) B_n ( I − (y_n s_n^T)/(y_n · s_n) ) + (s_n s_n^T)/(y_n · s_n).
      i   |   1     2     3     4
    ------+------------------------
     x_i  |   0.0   1.0   2.0   3.0
     y_i  |   1.1   2.6   7.2   21.1
(a) Implement the three algorithms introduced in Section 3.4: the gradient descent
method, the Gauss-Newton method, and the Levenberg-Marquardt method.
(b) Ignore the weight vector W , i.e., set W = I .
(c) For each method, set p 0 = [a0 , b0 ] = [1.0, 0.8].
(d) Discuss how to choose γ for the gradient descent and λ for the Levenberg-Marquardt.
Hint: The Jacobian for this example must be in R^{4×2}; more precisely,

    J = ∂ŷ(x, p)/∂p = [ 1         0
                        e^b       a e^b
                        e^{2b}    2a e^{2b}
                        e^{3b}    3a e^{3b} ].
for c_w ≥ 0. Since

    [ 1     c_w    1   ]     [ 1   ]
    [ c_w   c_w²   c_w ]  =  [ c_w ] [ 1   c_w   1 ],
    [ 1     c_w    1   ]     [ 1   ]

the convolution smoothing can be implemented easily and conveniently as

    S ∗ A = 1/(2 + c_w)² · [ 1; c_w; 1 ] ∗ [ 1  c_w  1 ] ∗ A,
Tasks to do:
1. Go to http://www.sfu.ca/∼ssurjano/optimization.html (Virtual Library
of Simulation Experiments) and select three functions of your inter-
ests having multiple local minima. (e.g., two of them are the Ackley
function and Griewank function.)
2. Store each of the functions in a 2D-array A which has dimensions large
enough.
3. Compute

    A_σ = gaussian_filter(A, σ),   or   A_t = S ∗ S ∗ · · · ∗ S ∗ A   (t times),
You may work in a group of two people; however, you must report individu-
ally.
Chapter 4
Popular Machine Learning Classifiers
In this chapter, we will take a tour through a selection of popular and pow-
erful machine learning algorithms that are commonly used in academia as
well as in the industry. While learning about the differences between sev-
eral supervised learning algorithms for classification, we will also develop
an intuitive appreciation of their individual strengths and weaknesses.
The topics that we will learn about throughout this chapter are as follows:
Contents of Chapter 4
4.1. Logistic Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2. Classification via Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3. Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4. Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.5. k -Nearest Neighbors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Exercises for Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
1. Selection of features.
2. Choosing a performance metric.
3. Choosing a classifier and optimization algorithm.
4. Evaluating the performance of the model.
5. Tuning the algorithm.
Required Coding/Reading : Read Chapter 3 and experiment all examples in it, Python
Machine Learning, 3rd Ed.
    s'(x) = [ e^x (1 + e^x) − e^x · e^x ] / (1 + e^x)²  =  e^x / (1 + e^x)²  =  s(x)(1 − s(x)).    (4.4)
• Let p stand for the probability of the particular event (that has class label y = 1). Then the odds ratio of the particular event is defined as

    p / (1 − p).

• We can then define the logit function, which is simply the logarithm of the odds ratio (log-odds):

    logit(p) = ln[ p / (1 − p) ].    (4.7)
• The logit function takes input values in the range 0 to 1 and transforms
them to values over the entire real line, which we can use to express a
linear relationship between feature values and the log-odds:
Solution.

Ans: logit^{−1}(z) = 1 / (1 + e^{−z}), the standard logistic sigmoid function.
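A quick numerical check, in NumPy, that the logistic sigmoid inverts the logit function (4.7); the function names are illustrative.

import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))          # log-odds, eq. (4.7)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # logit^{-1}(z)

p = np.linspace(0.01, 0.99, 5)
print(sigmoid(logit(p)))                  # recovers p
print(sigmoid(0.0))                       # 0.5: even odds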
Footnote 1: Note that the Adaline minimizes the sum-squared-error (SSE) cost function defined as J(w) = (1/2) Σ_i ( φ(z^(i)) − y^(i) )², where z^(i) = w^T x^(i), using the gradient descent method; see Section 2.3.1.
Assume that the individual samples in our dataset are independent of one another. Then we can define the likelihood L as

    L(w) = P(y | x; w) = Π_{i=1}^{n} P(y^(i) | x^(i); w)
         = Π_{i=1}^{n} [ φ(z^(i)) ]^{y^(i)} [ 1 − φ(z^(i)) ]^{1−y^(i)},    (4.13)
Note:
• Firstly, applying the log function reduces the potential for numerical
underflow, which can occur if the likelihoods are very small.
• Secondly, we can convert the product of factors into a summation of
factors, which makes it easier to obtain the derivative of this func-
tion via the addition trick, as you may remember from calculus.
• We can adopt the negation of the log-likelihood as a cost function J
that can be minimized using gradient descent.
and therefore

    ∇J(w) = − Σ_{i=1}^{n} ( y^(i) − φ(z^(i)) ) x^(i).    (4.17)
Note: The above gradient descent rule for logistic regression is equal to that of Adaline; see (2.15) on p. 28. The only difference is the activation function φ(z).
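A sketch of batch gradient descent with the gradient (4.17); the interface mirrors the Adaline code of Chapter 2, and the names and defaults are illustrative.

import numpy as np

def logreg_gd(X, y, eta=0.05, n_iter=1000):
    """Logistic regression by batch gradient descent on the negative log-likelihood."""
    w = np.zeros(X.shape[1] + 1)
    for _ in range(n_iter):
        z = w[0] + X.dot(w[1:])
        phi = 1.0 / (1.0 + np.exp(-z))        # sigmoid activation
        errors = y - phi                      # y in {0, 1}
        w[1:] += eta * X.T.dot(errors)        # -grad J, eq. (4.17)
        w[0] += eta * errors.sum()
    return w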
Figure 4.6
Regularization
• One way of finding a good bias-variance tradeoff.
• It is useful to prevent overfitting, also handling
  – collinearity (high correlation among features),
  – filtering out noise from the data,
  – the multiple minima problem.
• The concept behind regularization is to introduce additional in-
formation (bias) to penalize extreme parameter (weight) values.
• The most common form of regularization is so-called L² regularization (sometimes also called L² shrinkage or weight decay):

    (λ/2) ||w||² = (λ/2) Σ_{j=1}^{m} w_j²,    (4.20)
The cost function for logistic regression can be regularized by adding a simple regularization term, which will shrink the weights during model training: for z^(i) = w^T x^(i),

    J(w) = Σ_{i=1}^{n} [ −y^(i) ln φ(z^(i)) − (1 − y^(i)) ln( 1 − φ(z^(i)) ) ] + (λ/2) ||w||².    (4.21)
Note:
• Regularization is another reason why feature scaling such as stan-
dardization is important. For regularization to work properly, we
need to ensure that all our features are on comparable scales. See
§ 6.3 for more details of feature scaling.
• Via the regularization parameter λ, we can then control how well we
fit the training data while keeping the weights small. By increasing
the value of λ, we increase the regularization strength.
To find an optimal hyperplane that maximizes the margin, let's begin with considering the positive and negative hyperplanes that are parallel to the decision boundary:

    w_0 + w^T x_+ = 1,
    w_0 + w^T x_− = −1,    (4.22)

where w = [w_1, w_2, · · · , w_d]^T. If we subtract those two linear equations from each other, then we have

    w · (x_+ − x_−) = 2

and therefore

    (w / ||w||) · (x_+ − x_−) = 2 / ||w||.    (4.23)
(b) Evaluate f at all these points, to find the maximum and minimum.
Self-study 4.12. Use the method of Lagrange multipliers to find the extreme values of f(x, y) = x² + 2y² on the circle x² + y² = 1.

Hint: ∇f = λ∇g  =⇒  [2x, 4y]^T = λ [2x, 2y]^T. Therefore,

    2x = 2xλ,        (1)
    4y = 2yλ,        (2)
    x² + y² = 1.     (3)

From (1), x = 0 or λ = 1.
Now, consider

    min_x max_α L(x, α)   subj.to α ≥ 0.    (4.29)

Figure 4.9: min_x x² subj.to x ≥ 1.

(3) Let x < 1.  ⇒  max_{α≥0} {−α(x − 1)} = ∞. However, min_x won't make this happen! (min_x is fighting max_α.) That is, when x < 1, the objective L(x, α) becomes huge as α grows; then, min_x will push x ↗ 1 or increase it to become x ≥ 1. In other words, min_x forces max_α to behave, so constraints will be satisfied.
Now, the goal is to solve (4.29). In the following, we will define the dual
problem of (4.29), which is equivalent to the primal problem.
where

    L(x, α) = x² − α(x − 1).

The term min_x L(x, α) is called the Lagrange dual function and the Lagrange multiplier α is also called the dual variable.

How to solve it. For the Lagrange dual function min_x L(x, α), the minimum occurs where the gradient is equal to zero:

    d/dx L(x, α) = 2x − α = 0   ⇒   x = α/2.

Plugging this into L(x, α), we have

    L(x, α) = (α/2)² − α(α/2 − 1) = α − α²/4.
    ∇_w L([w, w_0], α) = w − Σ_{i=1}^{N} α_i y^(i) x^(i) = 0   ⇒   w = Σ_{i=1}^{N} α_i y^(i) x^(i),

    ∂/∂w_0 L([w, w_0], α) = − Σ_{i=1}^{N} α_i y^(i) = 0   ⇒   Σ_{i=1}^{N} α_i y^(i) = 0,    (4.36)

    α_i ≥ 0,                                    (dual feasibility)
    α_i [ y^(i) (w_0 + w^T x^(i)) − 1 ] = 0,    (complementary slackness)
    y^(i) (w_0 + w^T x^(i)) − 1 ≥ 0.            (primal feasibility)
Again using the first KKT condition, we can rewrite the first term:

    −(1/2) ||w||² = −(1/2) ( Σ_{i=1}^{N} α_i y^(i) x^(i) ) · ( Σ_{j=1}^{N} α_j y^(j) x^(j) )
                  = −(1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y^(i) y^(j) x^(i) · x^(j).    (4.38)

Plugging (4.38) into the (simplified) Lagrangian (4.37), we see that the Lagrangian now depends on α only.
Support vectors

Assume momentarily that we have w_0^∗. Consider the complementary slackness KKT condition along with the primal and dual feasibility conditions. Then

    α_i^∗ > 0   ⇒   y^(i) f^∗(x^(i)) = scaled margin = 1,
    y^(i) f^∗(x^(i)) > 1   ⇒   α_i^∗ = 0.    (4.43)

Definition 4.19. The examples in the first category, for which the scaled margin is 1 and the constraints are active, are called support vectors. They are the closest to the decision boundary.
Note: The motivation for introducing the slack variable ξ was that the linear constraints need to be relaxed for nonlinearly separable data, to allow the convergence of the optimization in the presence of misclassifications, under appropriate cost penalization. Such a strategy of SVM is called soft-margin classification.
The constraints allow some slack of size ξ_i, but we pay a price for it in the objective. That is, if y^(i) f(x^(i)) ≥ 1, then ξ_i = 0 and the penalty is 0; otherwise, y^(i) f(x^(i)) = 1 − ξ_i and we pay the price ξ_i > 0.
So the only difference from the original problem’s Lagrangian (4.39) is that
0 ≤ αi was changed to 0 ≤ αi ≤ C . Neat!
See § 4.3.6, p. 100, for the solution of (4.49), using the SMO algorithm.
Note:
• G = ZZ T ∈ RN×N is called the Gram matrix. That is,
• Margins
– to reduce overfitting
– to enhance classification accuracy
• Feature expansion
– mapping to a higher-dimension
– to classify inseparable datasets
• Kernel trick
– to avoid writing out high-dimensional feature vectors
For example, for the inseparable data set in Figure 4.12, we define
Kernel trick

Recall: the dual problem to the soft-margin SVM given in (4.49):

    max_α [ Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y^(i) y^(j) x^(i) · x^(j) ],   subj.to
        0 ≤ α_i ≤ C, ∀ i,                                                                          (4.52)
        Σ_{i=1}^{N} α_i y^(i) = 0.
One of the most widely used kernels is the Radial Basis Function (RBF) kernel, or simply called the Gaussian kernel.

Common kernels
• Sigmoid:
    K(x^(i), x^(j)) = tanh( a x^(i) · x^(j) + b )    (4.57)
• Gaussian RBF:
    K(x^(i), x^(j)) = exp( − ||x^(i) − x^(j)||² / (2σ²) )    (4.58)
• And many others: Fisher kernel, graph kernel, string kernel, ... a very active area of research!
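In scikit-learn, the soft-margin SVM with the Gaussian RBF kernel (4.58) can be tried as in the sketch below; the dataset and the values of gamma and C are only starting guesses, not recommendations.

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.2, random_state=1)  # not linearly separable
svm = SVC(kernel='rbf', gamma=1.0, C=10.0)    # gamma plays the role of 1/(2 sigma^2)
svm.fit(X, y)
print(svm.score(X, y))                        # training accuracy
print(svm.n_support_)                         # number of support vectors per class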
Question. Start with (4.61). Let's say you want to hold α_2, · · · , α_N fixed and take a coordinate step in the first direction. That is, change α_1 to maximize the objective in (4.61). Can we make any progress? Can we get a better feasible solution by doing this?

It turns out, no. Let's see why. Look at the constraint in (4.61), Σ_{i=1}^{N} α_i y^(i) = 0. This means

    α_1 y^(1) = − Σ_{i=2}^{N} α_i y^(i)   ⇒   α_1 = − y^(1) Σ_{i=2}^{N} α_i y^(i).
Figure 4.13
• In practice , this can result in a very deep tree with many nodes,
which can easily lead to overfitting
(We typically set a limit for the maximal depth of the tree)
where
    f : the feature to perform the split,
    D_P : the parent dataset,
    D_j : the dataset of the j-th child node,
    I : the impurity measure,
    N_P : the total number of samples at the parent node,
    N_j : the number of samples in the j-th child node.
The information gain is simply the difference between the impurity of the
parent node and the sum of the child node impurities — the lower the
impurity of the child nodes, the larger the information gain.
However, for simplicity and to reduce the combinatorial search space, most
libraries implement binary decision trees, where each parent node is
split into two child nodes, DL and DR :
    IG(D_P, f) = I(D_P) − (N_L/N_P) I(D_L) − (N_R/N_P) I(D_R).    (4.65)
Impurity measure?
Commonly used in binary decision trees:
• Entropy
    I_H(t) = − Σ_{i=1}^{c} p(i|t) log₂ p(i|t)    (4.66)
• Gini impurity
    I_G(t) = Σ_{i=1}^{c} p(i|t) ( 1 − p(i|t) ) = 1 − Σ_{i=1}^{c} p(i|t)²    (4.67)
• Classification error
    I_E(t) = 1 − max_i { p(i|t) }    (4.68)
where p(i|t) denotes the proportion of the samples that belong to class i for a particular node t.
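The three impurity measures (4.66)–(4.68), written out in NumPy for a vector of class proportions p(i|t); the function names are illustrative.

import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))            # I_H, eq. (4.66)

def gini(p):
    return 1.0 - np.sum(p**2)                 # I_G, eq. (4.67)

def class_error(p):
    return 1.0 - np.max(p)                    # I_E, eq. (4.68)

p = np.array([0.5, 0.5])                      # uniform two-class node
print(entropy(p), gini(p), class_error(p))    # 1.0, 0.5, 0.5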
Mind simulation
• The entropy is maximal, if we have a uniform class distribution; it is
0, if all samples at the node t belong to the same class.
e.g., when c = 2:
    I_H(t) = 0, if p(i=1|t) = 1 or p(i=1|t) = 0
    I_H(t) = 1, if p(i=1|t) = p(i=0|t) = 0.5
⇒ We can say that the entropy criterion attempts to maximize the mutual
information in the tree.
• Intuitively, the Gini impurity can be understood as a criterion to min-
imize the probability of misclassification.
• The Gini impurity is maximal, if the classes are perfectly mixed.
e.g., when c = 2:
    I_G(t) = 1 − Σ_{i=1}^{c} 0.5² = 0.5
⇒ In practice, both Gini impurity and entropy yield very similar results.
• The classification error is less sensitive to changes in the class prob-
abilities of the nodes.
⇒ The classification error is a useful criterion for pruning, but not recom-
mended for growing a decision tree.
Figure 4.15: A decision tree result with Gini impurity measure, for three classes with two
features (petal length, petal width). Page 97, Python Machine Learning, 3rd Ed..
2. Let

    f_{q(p)} = arg max_{i,j} IG(D_P, f_j^(i)).    (4.69)

The maximum in (4.69) often happens when one of the child impurities is zero or very small.
Typically, the larger the number of trees, the better the performance
of the random forest classifier at the expense of an increased computa-
tional cost.
Figure 4.16: Illustration for how a new data point (?) is assigned the triangle class label,
based on majority voting, when k = 5.
Based on the chosen distance metric, the k -NN algorithm finds the k sam-
ples in the training dataset that are closest (most similar) to the point that
we want to classify. The class label of the new data point is then determined
by a majority vote among its k nearest neighbors.
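With scikit-learn, the majority-vote classifier just described is available as KNeighborsClassifier; a minimal usage sketch follows (the dataset and the parameter choices are illustrative).

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)
sc = StandardScaler().fit(X_train)                  # k-NN is sensitive to feature scales
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)  # Euclidean distance
knn.fit(sc.transform(X_train), y_train)
print(knn.score(sc.transform(X_test), y_test))      # test accuracy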
4.1. For this problem, you would modify the code used for Problem 2.2 in Chapter 2. For
the standardized data (XSD ),
(a) Apply the logistic regression gradient descent (Algorithm 4.8) for the noisy data
Gσ (XSD ).
(b) Modify the code for the logistic regression with regularization (4.21) and apply
the resulting algorithm for Gσ (XSD ).
(c) Compare their performances
4.3. Verify the formulation in (4.49), which is dual to the minimization of (4.48).
4.4. Experiment examples on pp. 84–91, Python Machine Learning, 3rd Ed., in order to
optimize the performance of kernel SVM by finding a best kernel and optimal hyper-
parameters (gamma and C ).
Choose one of Exercises 5 and 6 below to implement and experiment with. The experiment will guide you to understand how the ML software has been composed from scratch. You may use the example codes thankfully shared by Dr. Jason Brownlee, who is the founder of machinelearningmastery.com.
4.5. Implement a decision tree algorithm that incorporates the Gini impurity measure,
from scratch, to run for the data used on page 96, Python Machine Learning, 3rd Ed..
Compare your results with the figure on page 97 of the book. You may refer to
https://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python/
4.6. Implement a k -NN algorithm, from scratch, to run for the data used on page 106,
Python Machine Learning, 3rd Ed.. Compare your results with the figure on page
103 of the book. You may refer to
https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/
Chapter 5
Quadratic Programming
Quadratic programming (QP) is the process of solving a constrained
quadratic optimization problem. That is, the objective f is quadratic and
the constraints are linear in several variables x ∈ Rn . Quadratic program-
ming is a particular type of nonlinear programming. Its general form is
    min_{x∈R^n} f(x) := (1/2) x^T A x − x^T b,   subj.to
        C x = c,    (5.1)
        D x ≤ d,

where A ∈ R^{n×n} is symmetric, C ∈ R^{m×n}, D ∈ R^{p×n}, b ∈ R^n, c ∈ R^m, and d ∈ R^p.
Contents of Chapter 5
5.1. Equality Constrained Quadratic Programming . . . . . . . . . . . . . . . . . . . . . . . 114
5.2. Direct Solution for the KKT System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.3. Linear Iterative Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.4. Iterative Solution of the KKT System . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.5. Active Set Strategies for Convex QP Problems . . . . . . . . . . . . . . . . . . . . . . . 131
5.6. Interior-point Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.7. Logarithmic Barriers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Exercises for Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Let Z ∈ Rn×(n−m) be a matrix whose columns span the null space of C , N (C),
i.e.,
CZ = 0 or Span(Z) = N (C). (5.6)
Definition 5.1. The matrix K in (5.5) is called the KKT matrix and
the matrix Z T AZ is referred to as the reduced Hessian.
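As a small numerical illustration, an equality-constrained QP can be solved directly by assembling the KKT matrix and calling a dense solver; the data below are made up for the example.

import numpy as np

# minimize (1/2) x^T A x - x^T b   subject to   C x = c
A = np.array([[4., 1.], [1., 3.]])          # symmetric positive definite (illustrative)
b = np.array([1., 2.])
C = np.array([[1., 1.]])                    # one equality constraint: x1 + x2 = 1
c = np.array([1.])

n, m = A.shape[0], C.shape[0]
K = np.block([[A, C.T], [C, np.zeros((m, m))]])   # the KKT matrix, cf. (5.5)
rhs = np.concatenate([b, c])
sol = np.linalg.solve(K, rhs)
x_star, lam_star = sol[:n], sol[n:]
print(x_star, lam_star)
print(C.dot(x_star))                        # equals c: the constraint is satisfied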
Proposition 5.6.
• If A, B ⪰ 0, then A · B ≥ 0, and A · B = 0 implies AB = 0.
• A symmetric matrix A is positive semidefinite if A · B ≥ 0 for every B ⪰ 0.
which implies
A x + C T z = 0 and Cx = 0. (5.10)
It follows from the above equations that
0 = x T A x + x T C T z = x T A x,
p T A x ∗ = p T (b − C T λ∗ ) = p T b − p T C T λ∗ = p T b,
For direct solutions of the KKT system (5.19), this section considers sym-
metric factorization, the range-space approach, and the null-space approach.
9 print('L0=\n',L0)
10 print('D=\n',D)
11 print('P=\n',P)
12 print('L0*D*L0^T=\n',L0.dot(D).dot(L0.T),'\n#-----------')
13
where w Y ∈ Rm , w Z ∈ Rn−m .
7. On the other hand, substituting (5.26) into the first equation of (5.19), we get
A x + C T λ = AY w Y + AZw Z + C T λ = b. (5.28)
The reduced KKT system (5.29) can be solved easily e.g. by a Cholesky factoriza-
tion of the reduced Hessian Z T AZ ∈ R(n−m)×(n−m) .
8. Once w Y and w Z have been computed as solutions of (5.27) and (5.29), respectively,
x ∗ is obtained from (5.26).
9. When Lagrange multipliers λ∗ is to be computed, we multiply (5.28) by Y T and solve
the resulting equation:
(CY )T λ∗ = Y T b − Y T A x ∗ . (5.30)
    A x = b,    (5.31)
    A = M − N,    (5.32)
    M x = N x + b.    (5.33)

    M x_k = N x_{k−1} + b,    (5.34)

or, equivalently,

    x_k = M^{−1}(N x_{k−1} + b) = x_{k−1} + M^{−1}(b − A x_{k−1}),    (5.35)

• It follows from (5.33) and (5.34) that the error equation reads

    M e_k = N e_{k−1}    (5.36)

or, equivalently,

    e_k = M^{−1} N e_{k−1}.    (5.37)

• Since

    ||e_k|| ≤ ||M^{−1}N|| · ||e_{k−1}|| ≤ ||M^{−1}N||² · ||e_{k−2}|| ≤ · · · ≤ ||M^{−1}N||^k · ||e_0||,    (5.38)

a sufficient condition for the convergence is

    ||M^{−1}N|| < 1.    (5.39)
connecting from Pi to Pj .
|z − aii | ≤ Λi , 1 ≤ i ≤ n. (5.43)
|λ − 2| < 2
Positiveness
Definition 5.21. An n × n complex-valued matrix A = [a_ij] is diagonally dominant if

    |a_ii| ≥ Λ_i := Σ_{j=1, j≠i}^{n} |a_ij|,    (5.44)
Implementation of (5.51):

    compute r_k = (r_k^(1), r_k^(2)) := β − K ψ_k;
    compute L r_k = [ r_k^(1) ;  −C A^{−1} r_k^(1) + r_k^(2) ];
    solve M_1 Δψ_k = L r_k;                                       (5.52)
    set ψ_{k+1} = ψ_k + Δψ_k;
(C i and D i are row vectors; we will deal with them like column vectors.)
The primal active set strategy is an iterative procedure:
• Given a feasible iterate x k , k ≥ 0, we determine its active set
x k +1 = x k − αk p k , (5.60)
Remark 5.31. (Update for I_ac(x_{k+1})). Let's begin with defining the set of blocking constraints:

    I_bl(p_k) := { i ∉ I_ac(x_k)  |  D_i^T p_k < 0,  (D_i^T x_k − d_i)/(D_i^T p_k) ≤ 1 }.    (5.65)

Then we specify I_ac(x_{k+1}) by adding the most restrictive blocking constraint to I_ac(x_k):

    I_ac(x_{k+1}) := I_ac(x_k) ∪ { j ∈ I_bl(p_k)  |  (D_j^T x_k − d_j)/(D_j^T p_k) = α̂_k }.    (5.66)

For such a j (the index of the newly added constraint), we clearly have

    D_j^T x_{k+1} = D_j^T x_k − α_k D_j^T p_k = D_j^T x_k − α̂_k D_j^T p_k = d_j,    (5.67)
Newton's method:
• Given a feasible iterate (x, μ, z) = (x_k, μ_k, z_k), we introduce a duality measure θ:

    θ := (1/p) Σ_{i=1}^{p} z_i μ_i = (z^T μ)/p.    (5.75)

where

    ∇F(x, μ, z) := [ A   D^T   0
                     D   0     I
                     0   Z     M ].
Algorithms based on barrier functions are iterative methods where the iterates are forced to stay within the interior of the feasible set:

    F_int := { x ∈ R^n  |  D_i^T x − d_i < 0,  1 ≤ i ≤ p }.    (5.79)
βk → 0, as k → ∞. (5.81)
Then there holds:
1. For any β > 0, the logarithmic barrier function Bβ (x) is convex in
F int and attains a minimizer x β ∈ F int .
2. There is a unique minimizer; any local minimizer x β is also a
global minimizer of Bβ (x).
Recall: The KKT conditions for (5.78) are given in (5.71), p. 134:

    ∇_x Q(x) + D^T μ = A x − b + D^T μ = 0,
    D x − d ≤ 0,
    μ_i (D x − d)_i = 0,   i = 1, 2, · · · , p,    (5.83)
    μ_i ≥ 0,               i = 1, 2, · · · , p.
(b) Now, considering the inequality constraints, discuss why the problem is yet ad-
mitting a unique solution α∗ .
5.6. Consider the following equality-constrained QP problem
min Q(x) := (x1 − 1)2 + 2(x2 − 2)2 + 2(x3 − 2)2 − 2x1 x2 , subj.to
x ∈R3
x1 + x2 − 3x3 ≤ 0, (5.93)
4x1 − x2 + x3 ≤ 1.
Implement the interior-point method with the Newton’s iterative update, pre-
sented in Section 5.6, to solve the QP problem (5.93).
(a) As in Exercise 6, you should first formulate the KKT system for (5.93) by identi-
fying A ∈ R3×3 , D ∈ R2×3 , and vectors b ∈ R3 and d ∈ R2 .
(b) Choose a feasible initial value (x 0 , µ0 , z 0 ).
(c) Select the algorithm parameter σ ∈ [0, 1] appropriately for Newton’s method.
(d) Discuss how to determine α in (5.77) in order for the new iterate (x k +1 , µk +1 , z k +1 )
to stay feasible.
Chapter 6
Data Preprocessing in Machine Learning
Contents of Chapter 6
6.1. General Remarks on Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.2. Dealing with Missing Data & Categorical Data . . . . . . . . . . . . . . . . . . . . . . . 144
6.3. Feature Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.4. Feature Selection: Selecting Meaningful Variables . . . . . . . . . . . . . . . . . . . . . 147
6.5. Feature Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Exercises for Chapter 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
The more disciplined you are in your handling of data, the more
consistent and better results you are likely to achieve.
Software: pandas.DataFrame
• Encoding nominal features: one-hot-encoding, e.g.,

    color:  blue   ↔  0  ↔  [1 0 0]
            green  ↔  1  ↔  [0 1 0]
            red    ↔  2  ↔  [0 0 1].
There are two common approaches to bring different features onto the same scale:
• min-max scaling (normalization):

    x_{j,norm}^(i) = ( x_j^(i) − x_{j,min} ) / ( x_{j,max} − x_{j,min} ) ∈ [0, 1],    (6.1)

where x_{j,min} and x_{j,max} are the minimum and maximum of the j-th feature column (in the training dataset), respectively.
• standardization:

    x_{j,std}^(i) = ( x_j^(i) − μ_j ) / σ_j,    (6.2)

where μ_j is the sample mean of the j-th feature column and σ_j is the corresponding standard deviation.²
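Both scalings are available in scikit-learn, and (6.1)–(6.2) are one-liners in NumPy; a sketch with made-up data:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1., 200.], [2., 300.], [3., 400.]])    # illustrative training data

# fit on the training data only, then reuse the same parameters on the test data
print(MinMaxScaler().fit_transform(X))                # min-max scaling, eq. (6.1)
print(StandardScaler().fit_transform(X))              # standardization, eq. (6.2)

# the same by hand
print((X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)))
print((X - X.mean(axis=0)) / X.std(axis=0))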
Figure 6.1: L²-regularization ( ||w||₂² = Σ_{i=1}^{m} w_i² ) and L¹-regularization ( ||w||₁ = Σ_{i=1}^{m} |w_i| ).

where

    R_p(w) := (1/p) ||w||_p^p,   p = 1, 2.    (6.4)
• Regularization can be considered as adding a penalty term to the
cost function to encourage smaller weights; or in other words, we
penalize large weights.
• Thus, by increasing the regularization strength (λ ↑),
– we can shrink the weights towards zero, and
– decrease the dependence of our model on the training data.
• The minimizer w ∗ keeps a balance between Q(X , y; w) and λRp (w);
the weight coefficients cannot fall outside the shaded region, which
shrinks as λ increases.
• The minimizer w ∗ must be the point where the L p -ball intersects
with the minimum-valued contour of the unpenalized cost function.
where

    sign(w_i) = {  1,  if w_i > 0;   −1,  if w_i < 0;   0,  if w_i = 0 }.
There is no research addressing the question of training vs. test data; more
research and more experience are needed to gain a better understanding.
(a) Perform the same experiment with the k -NN classifier replaced by the support
vector machine (soft-margin SVM classification).
(b) In particular, analyze accuracy of the soft-margin SVM and plot the result
as in the figure on p. 139.
6.2. On pp. 141-143, the permutation feature importance is assessed from the ran-
dom forest classifier, using the wine dataset.
(a) Discuss whether or not you can derive feature importance for a k -NN classifier.
(b) Assess feature importance with the logistic regression classifier, using the
same dataset.
(c) Based on the computed feature importance, analyze and plot accuracy of the
logistic regression classifier for k_features = 1, 2, · · · , 13.
Chapter 7
Feature Extraction: Data Compression
Contents of Chapter 7
7.1. Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
7.2. Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
7.3. Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
7.4. Kernel Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
Exercises for Chapter 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Z = X W, (7.1)
    w_1 = arg max_{w≠0} ( w^T X^T X w ) / ( w^T w ) = v_1 / ||v_1||,   (X^T X) v_1 = λ_1 v_1,    (7.3)

where λ_1 is the largest eigenvalue of X^T X ∈ R^{d×d}.
and then (2) finding the weight vector which extracts the maximum variance from this new data matrix:

    w_k = arg max_{||w||=1} ||X̂_k w||²,    (7.5)
    W = [ w_1 | w_2 | · · · | w_d ],   (X^T X) w_j = λ_j w_j,   w_i^T w_j = δ_ij,    (7.6)

where λ_1 ≥ λ_2 ≥ · · · ≥ λ_d ≥ 0.
z = xW , (7.7)
zT = W T xT , (7.8)
where
U : n × d orthogonal (the left singular vectors of X .)
Σ : d × d diagonal (the singular values of X .)
V : d × d orthogonal (the right singular vectors of X .)
where σ1 ≥ σ2 ≥ · · · ≥ σd ≥ 0.
• In terms of this factorization, the matrix X^T X reads

    X^T X = (U Σ V^T)^T U Σ V^T = V Σ U^T U Σ V^T = V Σ² V^T,    (7.14)
and therefore each column of Z is given by one of the left singular vectors
of X multiplied by the corresponding singular value. This form is also the
polar decomposition of Z . See (7.9) on p. 160.
• As with the eigen-decomposition, using the SVD the truncated score matrix Z_k ∈ R^{N×k} can be obtained by considering only the first k largest singular values and their singular vectors:

    Z_k = X W_k = U Σ V^T W_k = U Σ_k,    (7.16)

where

    Σ_k := diag(σ_1, · · · , σ_k, 0, · · · , 0).    (7.17)
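A NumPy sketch of (7.16): the first k principal-component scores of a centered data matrix are the first k columns of UΣ. The data here are random and only for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X = X - X.mean(axis=0)                              # center the columns first

U, s, Vt = np.linalg.svd(X, full_matrices=False)    # X = U diag(s) V^T
k = 2
Z_k = U[:, :k] * s[:k]                              # truncated scores, Z_k = U Sigma_k
W_k = Vt[:k].T                                      # first k right singular vectors (loadings)
print(np.allclose(Z_k, X.dot(W_k)))                 # True: Z_k = X W_k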
    ||X − X_k||₂ = ||U Σ V^T − U Σ_k V^T||₂ = ||U (Σ − Σ_k) V^T||₂ = ||Σ − Σ_k||₂ = σ_{k+1},    (7.19)
for a tolerance ε.
σ1 ≥ σ2 ≥ · · · ≥ σn ≥ 0.
where
U : m × n orthogonal (the left singular vectors of A .)
Σ : n × n diagonal (the singular values of A .)
V : n × n orthogonal (the right singular vectors of A .)
Proof. (of Theorem 7.9) Use induction on m and n: we assume that the SVD exists for (m − 1) × (n − 1) matrices, and prove it for m × n. We assume A ≠ 0; otherwise we can take Σ = 0 and let U and V be arbitrary orthogonal matrices.
• Let u = A v / ||A v||₂, which is a unit vector. Choose Ũ, Ṽ such that U = [u  Ũ] and V = [v  Ṽ] are orthogonal.
• Now, we write

    U^T A V = [ u^T ] · A · [v  Ṽ] = [ u^T A v    u^T A Ṽ  ]
              [ Ũ^T ]                [ Ũ^T A v    Ũ^T A Ṽ  ].

Since the off-diagonal blocks vanish by the choice of u and v, and since, by the induction hypothesis, Ũ^T A Ṽ = U₁ Σ₁ V₁^T, we have

    U^T A V = [ σ    0           ] = [ 1   0  ] [ σ   0  ] [ 1   0  ]^T
              [ 0    U₁ Σ₁ V₁^T  ]   [ 0   U₁ ] [ 0   Σ₁ ] [ 0   V₁ ]  ,

or equivalently

    A = ( U [ 1   0  ] ) [ σ   0  ] ( V [ 1   0  ] )^T.    (7.26)
            [ 0   U₁ ]   [ 0   Σ₁ ]     [ 0   V₁ ]
    A = U Σ V^T   ⇐⇒   A V = U Σ V^T V = U Σ,

we have

    A V = A [v_1  v_2  · · ·  v_n] = [A v_1   A v_2   · · ·   A v_n]
        = [u_1  · · ·  u_r  · · ·  u_n] diag(σ_1, · · · , σ_r, 0, · · · , 0)    (7.27)
        = [σ_1 u_1   · · ·   σ_r u_r   0   · · ·   0].

Therefore,

    A = U Σ V^T   ⇔   A v_j = σ_j u_j,  j = 1, 2, · · · , r;   A v_j = 0,  j = r + 1, · · · , n.    (7.28)
• Equation (7.30) gives how to find the singular values {σ_j} and the right singular vectors V, while (7.28) shows a way to compute the left singular vectors U.
• (Dyadic decomposition) The matrix A ∈ R^{m×n} can be expressed as

    A = Σ_{j=1}^{n} σ_j u_j v_j^T.    (7.32)

When rank(A) = r ≤ n,

    A = Σ_{j=1}^{r} σ_j u_j v_j^T.    (7.33)

This property has been utilized for various approximations and applications, e.g., by dropping singular vectors corresponding to small singular values.
    B_1 = {v_1, v_2, · · · , v_n},
    B_2 = {σ_1 u_1, σ_2 u_2, · · · , σ_r u_r}   for a subspace of R^m:

    A :  B_1 = {v_1, v_2, · · · , v_n}  →  B_2 = {σ_1 u_1, σ_2 u_2, · · · , σ_r u_r}.    (7.34)

Then, ∀ x ∈ S^{n−1},

    x = x_1 v_1 + x_2 v_2 + · · · + x_n v_n,
    A x = σ_1 x_1 u_1 + σ_2 x_2 u_2 + · · · + σ_r x_r u_r
        = y_1 u_1 + y_2 u_2 + · · · + y_r u_r,   (y_j = σ_j x_j).    (7.35)

So, we have

    y_j = σ_j x_j   ⇐⇒   x_j = y_j / σ_j ;
    Σ_{j=1}^{n} x_j² = 1 (sphere)   ⇐⇒   Σ_{j=1}^{r} y_j² / σ_j² = α ≤ 1 (ellipsoid).    (7.36)
Then, for x ∈ S¹,

    A x = U Σ V^T x = U Σ (V^T x).

In general,

    ||A||₂ = σ_1   (See Exercise 2.)
    ||A||_F² = σ_1² + · · · + σ_r²   (See Exercise 3.)
    min_{x≠0} ||A x||₂ / ||x||₂ = σ_n   (m ≥ n)    (7.38)
    κ₂(A) = ||A||₂ · ||A^{−1}||₂ = σ_1 / σ_n   (when m = n and A^{−1} exists)

Suppose A has singular values σ_1 ≥ σ_2 ≥ · · · ≥ σ_n > 0. Then

    ||(A^T A)^{−1}||₂ = σ_n^{−2},
    ||(A^T A)^{−1} A^T||₂ = σ_n^{−1},
    ||A (A^T A)^{−1}||₂ = σ_n^{−1},    (7.39)
    ||A (A^T A)^{−1} A^T||₂ = 1.
Suppose A has singular values σ_1 ≥ · · · ≥ σ_r > 0. Define, for k = 1, · · · , r − 1,

    A_k = Σ_{j=1}^{k} σ_j u_j v_j^T   (a sum of rank-1 matrices).

Then

    ||A − A_k||_F² = σ_{k+1}² + · · · + σ_r²,

and A_k can also be written as

    A_k = U Σ_k V^T,    (7.42)
Full SVD
• For A ∈ R^{m×n},

    A = U Σ V^T   ⇐⇒   U^T A V = Σ,

where U ∈ R^{m×n} and Σ, V ∈ R^{n×n}.
• Expand

    U → Ũ = [U  U₂] ∈ R^{m×m}   (orthogonal),
    Σ → Σ̃ = [ Σ ; O ] ∈ R^{m×n},

where O is an (m − n) × n zero matrix.
• Then,

    Ũ Σ̃ V^T = [U  U₂] [ Σ ; O ] V^T = U Σ V^T = A.    (7.43)
    A^T A = V Λ V^T,
Lemma 7.18. Let A ∈ Rn×n be symmetric. Then (a) all the eigenvalues
of A are real and (b) eigenvectors corresponding to distinct eigenvalues
are orthogonal.
    λ_1 = 18 and λ_2 = 5,

3. σ_1 = √λ_1 = √18 = 3√2,  σ_2 = √λ_2 = √5. So

    Σ = [ √18    0
          0      √5 ].

4.  u_1 = (1/σ_1) A v_1 = (1/√18) A [ 3/√13 ; 2/√13 ] = (1/√234) [ 7 ; −4 ; 13 ] = [ 7/√234 ; −4/√234 ; 13/√234 ],

    u_2 = (1/σ_2) A v_2 = (1/√5) A [ −2/√13 ; 3/√13 ] = (1/√65) [ 4 ; 7 ; 0 ] = [ 4/√65 ; 7/√65 ; 0 ].

5.  A = U Σ V^T = [ 7/√234     4/√65
                    −4/√234    7/√65
                    13/√234    0      ] [ √18    0
                                          0      √5 ] [ 3/√13     2/√13
                                                        −2/√13    3/√13 ].
    A⁺ = (A^T A)^{−1} A^T = V Σ^{−1} U^T,

when

    A = [ 1     2
          −2    1
          3     2 ].

Solution. From Example 7.19, we have

    A = U Σ V^T = [ 7/√234     4/√65
                    −4/√234    7/√65
                    13/√234    0      ] [ √18    0
                                          0      √5 ] [ 3/√13     2/√13
                                                        −2/√13    3/√13 ].

Thus,

    A⁺ = V Σ^{−1} U^T = [ 3/√13    −2/√13
                          2/√13     3/√13 ] [ 1/√18    0
                                              0        1/√5 ] [ 7/√234    −4/√234    13/√234
                                                                4/√65      7/√65      0       ]
       = [ −1/30    −4/15    1/6
           11/45    13/45    1/9 ].
Pn · · · P1 A Q1 · · · Qn−2 = B (7.44)
Numerical rank
In the absence of round-off errors and uncertainties in the data, the SVD
reveals the rank of the matrix. Unfortunately, the presence of errors makes rank determination problematic. For example, consider
      A = [ 1/3   1/3   2/3 ;
            2/3   2/3   4/3 ;
            1/3   2/3   3/3 ;    (7.45)
            2/5   2/5   4/5 ;
            3/5   1/5   4/5 ].
• Obviously A is of rank 2, as its third column is the sum of the first two.
• Matlab “svd" (with IEEE double precision) produces
σ1 = 2.5987, σ2 = 0.3682, and σ3 = 8.6614 × 10−17 .
In Matlab
• Matlab has a “rank" command, which computes the numerical rank of the matrix with a default threshold
      T = 2 max{m, n} u ||A||_2,    (7.47)
  where u is the unit roundoff (u ≈ 1.1 × 10^{−16} in IEEE double precision).
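The experiment is easy to reproduce outside Matlab as well; a minimal NumPy sketch for the matrix in (7.45), with a threshold of the same form as (7.47):

import numpy as np

A = np.array([[1/3, 1/3, 2/3],
              [2/3, 2/3, 4/3],
              [1/3, 2/3, 3/3],
              [2/5, 2/5, 4/5],
              [3/5, 1/5, 4/5]])

s = np.linalg.svd(A, compute_uv=False)                # singular values
T = 2 * max(A.shape) * np.finfo(float).eps * s[0]     # threshold of the form (7.47)
print(s)                         # sigma_3 is at round-off level (~1e-16)
print(int(np.sum(s > T)))        # numerical rank: 2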
• The approximation
      A_k = U Σ_k V^T = Σ_{i=1}^{k} σ_i u_i v_i^T    (7.49)
Figure: Low-rank approximations A_k of an image for k = 1, 10, 20, 50, and 100, compared with the original image (k = 270).
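A minimal sketch of how such rank-k approximations can be produced; the 2-D grayscale array img is an assumption (load it however you like):

import numpy as np

def rank_k_approx(img, k):
    """Best rank-k approximation A_k = U_k Sigma_k V_k^T of a 2-D (grayscale) array."""
    U, s, Vt = np.linalg.svd(img, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# e.g., img_k = rank_k_approx(img, 20); storage drops from m*n to about k*(m+n) numbers.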
LDA objective
• The LDA objective is to perform dimensionality reduction.
– So what? PCA does that, too!
• However, we want to preserve as much of the class discriminatory
information as possible.
– OK, this is new!
LDA
• Consider a pattern classification problem, where we have c classes.
• Suppose each class has Nk samples in Rd , where k = 1, 2, · · · , c .
• Let Xk = {x (1) , x (2) , · · · , x (Nk ) } be the set of d -dimensional samples for
class k .
• Let X ∈ R^{d×N} be the data matrix, stacking all the samples from all classes, such that each column represents a sample, where N = Σ_k N_k.
• The LDA seeks to obtain a transformation of X to Z through projecting
the samples in X onto a hyperplane with dimension c − 1.
¹In PCA, the main idea is to re-express the available dataset so as to extract the relevant information, by reducing redundancy and minimizing noise. While (unsupervised) PCA attempts to find the orthogonal component axes of maximum variance in a dataset, the goal in the (supervised) LDA is to find the feature subspace that optimizes class separability.
Fisher’s LDA
The solution proposed by Fisher is to maximize a function that repre-
sents the difference between the means, normalized by a measure of the
within-class variability, or the so-called scatter.
• For each class k, we define the scatter (an equivalent of the variance) as
      s̃_k² = Σ_{x∈X_k} (z − µ̃_k)²,    z = w^T x.    (7.55)
• The quantity s̃_k² measures the variability within class X_k after projecting it on the z-axis.
• Thus, s̃_1² + s̃_2² measures the variability within the two classes at hand after projection; it is called the within-class scatter of the projected samples.
• Fisher’s linear discriminant is defined as the linear function w^T x that maximizes the objective function:
      w* = arg max_w J(w),   where   J(w) = (µ̃_1 − µ̃_2)² / (s̃_1² + s̃_2²),    (7.56)
  where
      µ_k = (1/N_k) Σ_{x∈X_k} x,    µ̃_k = w^T µ_k,    s̃_k² = Σ_{x∈X_k} (z − µ̃_k)².
s̃_1² + s̃_2² = w^T S_1 w + w^T S_2 w = w^T S_w w =: S̃_w,    (7.60)
where S̃_w is the within-class scatter of the projected samples z.
• Similarly, the difference between the projected means (in z-space) can be expressed in terms of the means in the original feature space (x-space):
      (µ̃_1 − µ̃_2)² = (w^T µ_1 − w^T µ_2)² = w^T (µ_1 − µ_2)(µ_1 − µ_2)^T w = w^T S_b w =: S̃_b,    (7.61)
  where S_b := (µ_1 − µ_2)(µ_1 − µ_2)^T.
Sw−1 Sb w = λ w ⇐⇒ Sb w = λ Sw w; (7.64)
9 S1 = cov(X1,0)*n;
10 S2 = cov(X2,0)*n;
11 Sw = S1+S2; % Sw = [20,13; 13,31]
12
15 invSw_Sb = inv(Sw)*Sb;
16 [V,L] = eig(invSw_Sb); % V1 = [ 0.9503,0.3113]; L1 = 1.0476
17 % V2 = [-0.6905,0.7234]; L2 = 0.0000
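For readers working in Python rather than Matlab, here is a minimal NumPy sketch of the same two-class computation; X1 and X2 are assumed to be (N_k x d) arrays of class samples:

import numpy as np

def fisher_lda_direction(X1, X2):
    """Two-class Fisher LDA: solve S_w^{-1} S_b w = lambda w, cf. (7.64)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)              # class-1 scatter matrix
    S2 = (X2 - mu2).T @ (X2 - mu2)              # class-2 scatter matrix
    Sw = S1 + S2                                # within-class scatter
    Sb = np.outer(mu1 - mu2, mu1 - mu2)         # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    w = eigvecs[:, np.argmax(eigvals.real)].real
    return w / np.linalg.norm(w)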
z_k = w_k^T x   ⟹   z = W^T x ∈ R^{c−1}.    (7.66)
• If we have N sample (column) vectors, we can stack them into one matrix as follows:
      Z = W^T X,    (7.67)
  where X ∈ R^{d×N}, W ∈ R^{d×(c−1)}, and Z ∈ R^{(c−1)×N}.
Recall: For the two-class case, the within-class scatter was computed as
      S_w = S_1 + S_2.
For the c-class case, it generalizes to S_w = Σ_{k=1}^{c} S_k, where S_k is the scatter matrix of class X_k, and S_w ∈ R^{d×d}.
Recall: For the two-class case, the between-class scatter was computed as
      S_b = (µ_1 − µ_2)(µ_1 − µ_2)^T.
For the c-class case, the between-class scatter matrix generalizes accordingly (see (7.69)), where rank(S_b) ≤ c − 1.
W* = arg max_W J(W) = arg max_W (W^T S_b W) / (W^T S_w W).    (7.70)
Recall: For the two-class case, when we set ∂J(w)/∂w = 0, the optimization problem reduced to the eigenvalue problem
      S_w^{−1} S_b w* = λ* w*,   where λ* = J(w*).
Similarly, for the multi-class case,
      S_w^{−1} S_b W* = λ W*,   λ = J(W*),    (7.73)
where S_w^{−1} S_b ∈ R^{d×d}.
Illustration – 3 classes
• Let us generate a dataset for each class to illustrate the LDA transfor-
mation.
• For each class:
– Use the random number generator to generate a uniform stream of
500 samples that follows U (0, 1).
– Using the Box-Muller approach, convert the generated uniform stream
to N (0, 1).
– Then use the method of eigenvalues and eigenvectors to manipulate the standard normal to have the required mean vector and covariance matrix.
– Estimate the mean and covariance matrix of the resulting dataset.
4 %% uniform stream
5 U = rand(2,1000); u1 = U(:,1:2:end); u2 = U(:,2:2:end);
6
41 lda_c3_visualize;
denormalize.m
1 function Xnew = denormalize(X,Mu,Cov)
2 % it manipulates data samples in N(0,1) to something else.
3
6 Xnew = zeros(size(X));
7 for j=1:size(X,2)
8 Xnew(:,j)= VsD * X(:,j);
9 end
10
lda_c3_visualize.m
1 figure, hold on; axis([-10 20 -5 20]);
2 xlabel('x_1 - the first feature','fontsize',12);
3 ylabel('x_2 - the second feature','fontsize',12);
4 plot(X1(1,:)',X1(2,:)','ro','markersize',4,"linewidth",2)
5 plot(X2(1,:)',X2(2,:)','g+','markersize',4,"linewidth",2)
6 plot(X3(1,:)',X3(2,:)','bd','markersize',4,"linewidth",2)
7 hold off
8 print -dpng 'LDA_c3_Data.png'
9
17 plot(Mu1(1),Mu1(2),'c.','markersize',20)
18 plot(Mu2(1),Mu2(2),'m.','markersize',20)
19 plot(Mu3(1),Mu3(2),'r.','markersize',20)
20 plot(Mu(1),Mu(2),'k*','markersize',15,"linewidth",3)
21 text(Mu(1)+0.5,Mu(2)-0.5,'\mu','fontsize',18)
22
33
43 figure, plot(z1_wk,z1_wk_pdf,'ro',z2_wk,z2_wk_pdf,'g+',...
44 z3_wk,z3_wk_pdf,'bd')
45 xlabel('z','fontsize',12); ylabel('p(z|w1)','fontsize',12);
46 print -dpng 'LDA_c3_Xw1_pdf.png'
47
61 figure, plot(z1_wk,z1_wk_pdf,'ro',z2_wk,z2_wk_pdf,'g+',...
62 z3_wk,z3_wk_pdf,'bd')
63 xlabel('z','fontsize',12); ylabel('p(z|w2)','fontsize',12);
64 print -dpng 'LDA_c3_Xw2_pdf.png'
λ1 = 3991.2, λ2 = 1727.7.
Figure 7.10: Classes PDF, along the first projection vector w ∗1 ; λ1 = 3991.2.
Figure 7.11: Classes PDF, along the second projection vector w ∗2 ; λ2 = 1727.7.
As the figures show, the projection vector with the largest eigenvalue provides the higher discrimination power between the classes.
The (supervised) LDA classifier is expected to work better than the (unsupervised) PCA for the datasets in Figures 7.9 and 7.12.
Recall: Fisher’s LDA was generalized under the assumption of equal class
covariances and normally distributed classes.
However, even if one or more of those assumptions are (slightly) violated,
the LDA for dimensionality reduction can still work reasonably well.
Let X ∈ RN×d be the data matrix, in which each row represents a sample.
We summarize the main steps that are required to perform the LDA for
dimensionality reduction.
1. Standardize the d -dimensional dataset (d is the number of features).
2. For each class j , compute the d -dimensional mean vector µj .
3. Construct the within-class scatter matrix Sw (7.68) and the
between-class scatter matrix Sb (7.69).
4. Compute the eigenvectors and corresponding eigenvalues of the ma-
trix Sw−1 Sb (7.71).
5. Sort the eigenvalues by decreasing order to rank the corresponding
eigenvectors.
6. Choose the k eigenvectors that correspond to the k largest eigenvalues
to construct a transformation matrix
W = [w 1 |w 2 | · · · |w k ] ∈ Rd ×k ; (7.74)
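These steps are carried out by scikit-learn's LinearDiscriminantAnalysis; a minimal usage sketch on a toy three-class dataset (the dataset and parameter choices are illustrative only):

from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_blobs(n_samples=300, centers=3, n_features=4, random_state=1)  # toy 3-class data
lda = LinearDiscriminantAnalysis(n_components=2)     # k <= c - 1 = 2
Z = lda.fit_transform(X, y)                          # projected data, shape (300, 2)
print(lda.explained_variance_ratio_)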
Remark 7.24.
• rank(Sw−1 Sb ) ≤ c − 1; we must have k ≤ c − 1.
• The projected feature Zij is x (i) · w j in the projected coordinates and
(x (i) · w j ) w j in the original coordinates.
where λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_k ≥ 0.
• (Remark 7.4, p. 160). The matrix Z_k ∈ R^{N×k} consists of scaled eigenvectors of X X^T:
      Z_k = [ √λ_1 u_1 | √λ_2 u_2 | ⋯ | √λ_k u_k ],    (X X^T) u_j = λ_j u_j,    u_i^T u_j = δ_ij.    (7.77)
Then,
      V ≅ W;    σ_j² = λ_j,   j = 1, 2, ⋯, d.    (7.79)
Let, for λ1 ≥ λ2 ≥ · · · ≥ λk ≥ · · · ≥ λp ≥ 0,
Vk = [v 1 |v 2 | · · · |v k ] ∈ Rp ×k , Cv j = λj v j , v Ti v j = δij . (7.84)
and therefore
      v = (1/(λN)) Σ_{i=1}^{N} φ(x^(i)) φ(x^(i))^T v = (1/(λN)) Σ_{i=1}^{N} [φ(x^(i)) · v] φ(x^(i)),    (7.87)
Note:
• The above claim means that all eigenvectors v with λ 6= 0 lie in the
span of φ(x (1) ), · · · , φ(x (N) ).
• Thus, finding the eigenvectors in (7.83) is equivalent to finding the
coefficients α = (α1 , α2 , · · · , αN )T .
How to find α
• By substituting this back into the equation and using (7.82), we get
      C v_j = λ_j v_j   ⟹   (1/N) φ(X)^T φ(X) φ(X)^T α_j = λ_j φ(X)^T α_j,    (7.89)
  and therefore
      (1/N) φ(X) φ(X)^T φ(X) φ(X)^T α_j = λ_j φ(X) φ(X)^T α_j.    (7.90)
• Let K be the similarity (kernel) matrix:
      K := φ(X) φ(X)^T ∈ R^{N×N},    (7.91)
  where
      K_ij := K(x^(i), x^(j)) = φ(x^(i))^T φ(x^(j)),    (7.92)
  where K(·, ·) is called the kernel function.ᵃ
• Then, (7.90) can be rewritten as
      K² α_j = (N λ_j) K α_j.    (7.93)
Referring to (7.77), derived for the standard PCA, we may conclude that the k principal components for the kernel PCA are
      A_k = [ √µ_1 α_1 | √µ_2 α_2 | ⋯ | √µ_k α_k ] ∈ R^{N×k}.    (7.97)
Remark 7.27. It follows from (7.85), (7.88), and (7.95)–(7.96) that for a new point x, its projection onto the principal components is
      z_j = φ(x)^T v_j = (1/√µ_j) φ(x)^T Σ_{ℓ=1}^{N} α_{ℓj} φ(x^(ℓ)) = (1/√µ_j) Σ_{ℓ=1}^{N} α_{ℓj} φ(x)^T φ(x^(ℓ))
          = (1/√µ_j) Σ_{ℓ=1}^{N} α_{ℓj} K(x, x^(ℓ)) = (1/√µ_j) K(x, X)^T α_j.    (7.98)
• With an appropriate choice of kernel function, the kernel PCA can give a
good re-encoding of the data that lies along a nonlinear manifold.
• The kernel matrix is in (N × N)-dimensions, so the kernel PCA will have
difficulties when we have lots of data points.
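A minimal from-scratch sketch along the lines of (7.91), (7.93), and (7.97), using the Gaussian RBF kernel; the customary centering of the kernel matrix in feature space is included, and X (an N x d array) and gamma are assumed inputs:

import numpy as np
from scipy.spatial.distance import pdist, squareform

def rbf_kernel_pca(X, gamma, k):
    """Return the k leading kernel principal components A_k (N x k), cf. (7.93) and (7.97)."""
    K = np.exp(-gamma * squareform(pdist(X, 'sqeuclidean')))   # RBF kernel matrix, (7.91)
    N = K.shape[0]
    one_n = np.ones((N, N)) / N
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n         # center K in feature space
    mu, alphas = np.linalg.eigh(Kc)                            # eigenpairs, ascending order
    mu, alphas = mu[::-1], alphas[:, ::-1]                     # sort descending
    return alphas[:, :k] * np.sqrt(np.maximum(mu[:k], 0))      # sqrt(mu_j) * alpha_j, (7.97)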
7.1. Read pp. 145–158, Python Machine Learning, 3rd Ed., about the PCA.
(a) Find the optimal number of components k* which produces the best classification accuracy (for logistic regression), by experimenting with the example code using n_components = 1, 2, ⋯, 13.
(b) What is the corresponding cumulative explained variance?
7.2. Let A ∈ R^{m×n}. Prove that ||A||_2 = σ_1, the largest singular value of A. Hint: Use
      ||A v_1||_2 / ||v_1||_2 = σ_1 ||u_1||_2 / ||v_1||_2 = σ_1   ⟹   ||A||_2 ≥ σ_1,
and arguments around Equations (7.35) and (7.36) for the inequality in the opposite direction.
7.3. Recall that the Frobenius matrix norm is defined by
      ||A||_F = ( Σ_{i=1}^{m} Σ_{j=1}^{n} |a_ij|² )^{1/2},    A ∈ R^{m×n}.
Show that ||A||_F = (σ_1² + ⋯ + σ_k²)^{1/2}, where the σ_j are the nonzero singular values of A. Hint: You may use the norm-preserving property of orthogonal matrices. That is, if U is orthogonal, then ||U B||_2 = ||B||_2 and ||U B||_F = ||B||_F.
7.4. Prove Lemma 7.18. Hint: For (b), let A v_i = λ_i v_i, i = 1, 2, with λ_1 ≠ λ_2. Then
      (λ_1 v_1) · v_2 = (A v_1) · v_2 = v_1 · (A v_2) = v_1 · (λ_2 v_2),
where the second equality holds because A is symmetric. For (a), you may use a similar argument, but with the dot product defined for complex values, i.e.,
      u · v = u^T v̄,
where v̄ is the complex conjugate of v.
7.5. Use Matlab to generate a random matrix A ∈ R8×6 with rank 4. For example,
A = randn(8,4);
A(:,5:6) = A(:,1:2)+A(:,3:4);
[Q,R] = qr(randn(6));
A = A*Q;
(a) Print out A on your computer screen. Can you tell by looking if it has (numerical)
rank 4?
(b) Use Matlab’s “svd" command to obtain the singular values of A . How many are
“large?" How many are “tiny?" (You may use the command “format short e" to get
a more accurate view of the singular values.)
(c) Use Matlab’s “rank" command to confirm that the numerical rank is 4.
(d) Use the “rank" command with a small enough threshold that it returns the value
6. (Type “help rank" for information about how to do this.)
7.6. Verify (7.63). Hint: Use the quotient rule for ∂J(w)/∂w and equate the numerator to zero.
7.7. Try to understand the kernel PCA more deeply by experimenting with pp. 175–188 of Python Machine Learning, 3rd Ed. Its implementation is slightly different from (but equivalent to) Summary 7.28.
(a) Modify the code, following Summary 7.28, and test whether it works as expected, as in Python Machine Learning, 3rd Ed.
(b) The datasets considered are transformed via the Gaussian radial basis function
(RBF) kernel only. What happens if you use the following kernels?
K1 (x (i) , x (j) ) = (a1 + b1 x (i) · x (j) )2 (polynomial of degree up to 2)
K2 (x (i) , x (j) ) = tanh(a2 + b2 x (i) · x (j) ) (sigmoid)
Chapter 8
Cluster Analysis
Contents of Chapter 8
8.1. Basics for Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
8.2. K-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
8.3. Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
8.4. DBSCAN: Density-based Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
8.5. Self-Organizing Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
8.6. Cluster Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
Exercises for Chapter 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
• Understanding
– group related documents for browsing,
– group genes/proteins that have similar functionality, or
– group stocks with similar price fluctuations
• Summarization
– reduce the size of large data sets
Center-based Clusters
Contiguity-based Clusters
Density-based Clusters
Conceptual Clusters
Partitional Clustering
Divide data objects into non-overlapping subsets (clusters) such that each
data object is in exactly one subset
Hierarchical Clustering
A set of nested clusters, organized as a hierarchical tree
Figure 8.13: Lloyd's algorithm: the Voronoi diagrams and the centroids (•) of the current points at iterations 1, 2, 3, and 15.
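A minimal NumPy sketch of the Lloyd iteration shown in Figure 8.13 (with no empty-cluster safeguard); X is an assumed (N x d) data array:

import numpy as np

def lloyd_kmeans(X, K, n_iter=15, seed=0):
    """Plain Lloyd iteration: assign each point to its nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                             # assignment step
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])  # update step
    return labels, centroids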
• Pre-processing
– Normalize the data
– Eliminate outliers
• Post-processing
– Eliminate small clusters that may represent outliers
– Split “loose” clusters, i.e., clusters with relatively high SSE
– Merge clusters that are “close” and that have relatively low SSE
• PAM is more robust than K-Means in the presence of noise and out-
liers, because a medoid is less influenced by outliers or other extreme
values than a mean.
• PAM works efficiently for small data sets but does not scale well for
large data sets.
• The run-time complexity of PAM is O(K (N − K )2 ) for each iteration,
where N is the number of data points and K is the number of clusters.
PAM finds the best K medoids among the given data, and CLARA finds the best K medoids among the selected samples.
⊕ CLARANS is more efficient and scalable than both PAM and CLARA; it handles outliers.
⊕ Focusing techniques and spatial access structures may further im-
prove its performance; see (Ng & Han, 2002) [47] and (Schubert &
Rousseeuw, 2018) [60].
Complexity
Limitations
• Single linkage
• Complete linkage
• Group linkage
• Centroid linkage
• Ward’s minimum variance
AGNES
Dendrogram
Figure 8.24: Six observations and a dendrogram showing their hierarchical clustering.
Remark 8.2.
• The height of the dendrogram indicates the order in which the clus-
ters were joined; it reflects the distance between the clusters.
• The greater the difference in height, the greater the dissimilarity.
• Observations are allocated to clusters by drawing a horizontal line
through the dendrogram. Observations that are joined together below
the line are in the same clusters.
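A minimal SciPy sketch that produces such a dendrogram; X is an assumed (N x d) array of observations, and complete linkage is an arbitrary choice:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# X: (N, d) array of observations -- assumed to be given
Z = linkage(X, method='complete')                  # agglomerative (AGNES-style) clustering
dendrogram(Z)                                      # heights show the merge distances
plt.show()
labels = fcluster(Z, t=3, criterion='maxclust')    # cut the tree into 3 clusters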
Single Link
Complete Link
Average Link
DBSCAN
• Given a set of points in some space, it groups together points that are
closely packed together (points with many nearby neighbors), marking
as outliers points that lie alone in low-density regions (whose nearest
neighbors are too far away).
• It is one of the most common clustering algorithms and also most
cited in scientific literature.
• In 2014, the algorithm was awarded the test of time awarda at the
leading data mining conference,
KDD 2014: https://www.kdd.org/kdd2014/.
ᵃThe test of time award is an award given to algorithms which have received substantial attention in theory and practice.
Illustration of DBSCAN
• Point A and 5 other red points are core points. They are all reachable
from one another, so they form a single cluster.
• Points B and C are not core points, but are reachable from A (via
other core points) and thus belong to the cluster as well.
• Point N is a noise point that is neither a core point nor directly-
reachable.
• Reachability is not a symmetric relation since, by definition, no
point may be reachable from a non-core point, regardless of distance
(so a non-core point may be reachable, but nothing can be reached from
it).
DBSCAN: Pseudocode
DBSCAN
1 DBSCAN(D, eps, MinPts)
2 C=0 # Cluster counter
3 for each unvisited point P in dataset D
4 mark P as visited
5 NP = regionQuery(P, eps) # Find neighbors of P
6 if size(NP) < MinPts
7 mark P as NOISE
8 else
9 C = C + 1
10 expandCluster(P, NP, C, eps, MinPts)
11
Figure 8.29: Original points (left) and point types of DBSCAN clustering with eps=10 and
MinPts=4 (right): core (green), border (blue), and noise (red).
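In practice, scikit-learn provides the same algorithm; a minimal usage sketch (X is an assumed (N x d) array, and the parameter values mirror the figure above but are illustrative only):

from sklearn.cluster import DBSCAN

# X: (N, d) data array -- assumed to be given; eps and min_samples are illustrative
db = DBSCAN(eps=10, min_samples=4, metric='euclidean')
labels = db.fit_predict(X)      # cluster labels; noise points are labeled -1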
DBSCAN Clustering
• Strengths
  – Resistant to noise
  – Can handle clusters of different shapes and sizes
• Limitations
  – Eps and MinPts depend on each other and can be hard to specify
  – Varying densities
  – High-dimensional data
• Center-based Clustering
– Fuzzy c-means
• Mixture Models
– Expectation-maximization (EM) algorithm
• Hierarchical
– CURE (Clustering Using Representatives): shrinks points toward
center
– BIRCH (balanced iterative reducing and clustering using hierar-
chies)
• Graph-based Clustering
– Graph partitioning on a sparsified proximity graph
– SNN graph (Shared nearest-neighbor)
• Spectral Clustering
– Reduce the dimensionality using the spectrum of the similarity,
and cluster in this space
• Subspace Clustering
• Data Stream Clustering
SOM Architecture
ᵃA feedforward neural network is an artificial neural network wherein connections between the nodes do not form a cycle. It is the first and simplest type of artificial neural network devised. In this network, the information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any), and to the output nodes.
ᵇOne particularly interesting class of unsupervised system is based on competitive
learning, in which the output neurons compete amongst themselves to be activated,
with the result that only one is activated at any one time. This activated neuron is
called a winner-takes-all neuron or simply the winning neuron. Such competi-
tion can be induced/implemented by having lateral inhibition connections (nega-
tive feedback paths) between the neurons. The result is that the neurons are forced to
organize themselves.
Originally, the SOM algorithm was defined for data described by numer-
ical vectors which belong to a subset X of a Euclidean space (typically
Rd ). For some results, we need to assume that the subset is bounded and
convex. Two different settings have to be considered from theoretical
and practical points of view:
• Continuous setting: the input space X ⊂ Rd is modeled by a proba-
bility distribution with a density function f ,
• Discrete setting: the input space X comprises N data points
x 1 , x 2 , · · · , x N ∈ Rd .
Here the discrete setting means a finite subset of the input space.
Neighborhood Structure
where σ 2 (t) can decrease over time to reduce the intensity and
the scope of the neighborhood relations. This choice is related to
the cooperation process.
Theoretical Issues
• The algorithm is easy to define and to use, and a lot of practical stud-
ies confirm that it works. However, the theoretical study of its conver-
gence when t tends to ∞ remains without complete proof and provides
open problems. The main question is to know if the solution obtained
from a finite sample converges to the true solution that might be ob-
tained from the true data distribution.
• When t tends to ∞, the Rd -valued stochastic processes [mk (t)]k =1,2,··· ,K
can present oscillations, explosion to infinity, convergence in distribu-
tion to an equilibrium process, convergence in distribution or almost surely to a finite set of points in R^d, etc. Some of the open questions are:
– Is the algorithm convergent in distribution or almost surely, when
t tends to ∞?
– What happens when ε(t) is constant? when it decreases?
– If a limit state exists, is it stable?
– How to characterize the organization?
• Two matrices
  – Proximity matrixᵃ (P ∈ R^{N×N})
  – Incidence matrix (I ∈ R^{N×N})
    * One row and one column for each data point
    * An entry is 1 if the associated pair of points belongs to the same cluster
    * An entry is 0 if the associated pair of points belongs to different clusters
• High correlation indicates that points that belong to the same cluster are close to each other.
Example: For K-Means clusterings of two data sets, the correlation coefficients are:
ᵃA proximity matrix is a matrix whose (i, j) entry is a measure of the similarity (or distance) between the items to which row i and column j correspond.
Order the similarity matrix with respect to cluster labels and inspect
visually.
Silhouette Coefficient
• Entropyᵃ
  – For cluster j, let p_ij be the probability that a member of cluster j belongs to class i, defined as
        p_ij = n_ij / N_j,    (8.15)
• Purity
– The purity of cluster j is given by
8.1. We will experiment with the K-Means algorithm, following the first section of Chapter 11 of Python Machine Learning, 3rd Ed., in a slightly different fashion.
(a) Make a dataset of 4 clusters (modifying the code on pp. 354–355).
(b) For K = 1, 2, · · · , 10, run the K-Means clustering algorithm with the initialization
init=’k-means++’.
(c) For each K , compute the within-cluster SSE (distortion) for an elbow analysis
to select an appropriate K . Note: Rather than using inertia_ attribute, imple-
ment a function for the computation of distortion.
(d) Produce silhouette plots for K = 3, 4, 5, 6.
8.2. Now, let’s experiment with DBSCAN, following Python Machine Learning, 3rd Ed., pp. 376–381.
(a) Produce a dataset having three half-moon-shaped structures each of which con-
sists of 100 samples.
(b) Compare performances of K-Means, AGNES, and DBSCAN.
(Set n_clusters=3 for K-Means and AGNES.)
(c) For K-Means and AGNES, what if you choose n_clusters much larger than 3 (for
example, 9, 12, 15)?
(d) Again, for K-Means and AGNES, perform an elbow analysis to select an appro-
priate K .
Chapter 9
Neural Networks and Deep Learning
Contents of Chapter 9
9.1. Basics for Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
9.2. Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
9.3. Back-Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
9.4. Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
Exercises for Chapter 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
Project.2. Development of New Machine Learning Algorithms . . . . . . . . . . . . . . . . . 303
Feature Learning (Representation Learning): data −→ representation −→ algorithm.
or, equivalently,
      φ(z) = 1 if z ≥ 0, and φ(z) = −1 otherwise,   where z = b + w · x.    (9.2)
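A one-line NumPy rendering of the decision rule (9.2); w, b, and x are assumed to be a weight vector, a bias, and an input vector:

import numpy as np

def perceptron_predict(x, w, b):
    """Threshold activation phi(z) in (9.2): +1 if z = b + w.x >= 0, else -1."""
    z = b + np.dot(w, x)
    return 1 if z >= 0 else -1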
• The leftmost layer is called the input layer, and the neurons within
the layer are called input neurons.
• The rightmost layer is the output layer.
• The middle layers are called hidden layers.
• The design of the input and output layers in a network is often straight-
forward. For example, for the classification of handwritten digits:
– If the images are in 28 × 28 grayscale pixels, then we’d have 784(=
28 × 28) input neurons.
– It is heuristic to set 10 neurons in the output layer (rather than 4, even though 2⁴ = 16 ≥ 10).
• There can be quite an art to the design of the hidden layers. In par-
ticular, it is not possible to sum up the design process for the hidden
layers with a few simple rules of thumb. Instead, neural networks re-
searchers have developed many design heuristics for the hidden layers,
which help people get the behavior they want out of their nets. For
example, such heuristics can be used to help determine how to trade
off the number of hidden layers against the time required to train the
network. We’ll meet several such design heuristics later in this chapter.
Figure 9.5: Segmentation.
MNIST data set : A modified subset of two data sets collected by NIST
(US National Institute of Standards and Technology):
• Its first part contains 60,000 images (for training)
• The second part is 10,000 images (for test), each of which is in 28 × 28
grayscale pixels
• Let’s concentrate on the first output neuron, the one that is trying
to decide whether or not the input digit is a 0.
• It does this by weighing up evidence from the hidden layer of neurons.
It can do this by heavily weighting input pixels which overlap with the
image, and only lightly weighting the other inputs.
• Similarly, let’s suppose that the second, third, and fourth neurons in
the hidden layer detect whether or not the following images are present
• As you may have guessed, these four images together make up the 0
image that we saw in the line of digits shown in Figure 9.5:
• So if all four of these hidden neurons are firing, then we can conclude
that the digit is a 0.
where W denotes the collection of all weights in the network, B all the
biases, and a(x (i) ) is the vector of outputs from the network when x (i) is
input.
• Gradient descent method:
      [W ; B] ← [W ; B] + [ΔW ; ΔB],    (9.7)
  where
      [ΔW ; ΔB] = −η [∇_W C ; ∇_B C].
• For classification of handwritten digits for the MNIST data set, you
may choose: batch_size = 10.
13 class Network(object):
14 def __init__(self, sizes):
15 """The list ``sizes`` contains the number of neurons in the
16 respective layers of the network. For example, if the list
17 was [2, 3, 1] then it would be a three-layer network, with the
18 first layer containing 2 neurons, the second layer 3 neurons,
19 and the third layer 1 neuron. """
20
21 self.num_layers = len(sizes)
22 self.sizes = sizes
23 self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
24 self.weights = [np.random.randn(y, x)
25 for x, y in zip(sizes[:-1], sizes[1:])]
26
94 z = zs[-l]
95 sp = sigmoid_prime(z)
96 delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
97 nabla_b[-l] = delta
98 nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
99 return (nabla_b, nabla_w)
100
4 import network
5 n_neurons = 20
6 net = network.Network([784 , n_neurons, 10])
7
Validation Accuracy
1 Epoch 0: 9006 / 10000
2 Epoch 1: 9128 / 10000
3 Epoch 2: 9202 / 10000
4 Epoch 3: 9188 / 10000
5 Epoch 4: 9249 / 10000
6 ...
7 Epoch 25: 9356 / 10000
8 Epoch 26: 9388 / 10000
9 Epoch 27: 9407 / 10000
10 Epoch 28: 9410 / 10000
11 Epoch 29: 9428 / 10000
Accuracy Comparisons
✓ Of course, this is just a rough heuristic, and it suffers from many deficiencies.
✓ Still, the heuristic suggests that if we can solve the sub-problems using neural networks, then we can build a neural network for face-detection, by combining the networks for the sub-problems.
✓ Those questions too can be broken down, further and further, through multiple layers.
Final Remarks
• Researchers in the 1980s and 1990s tried using stochastic gradient de-
scent and back-propagation to train deep networks.
• Unfortunately, except for a few special architectures, they didn’t have
much luck.
• The networks would learn, but very slowly, and in practice often too
slowly to be useful.
• Since 2006, a set of new techniques has been developed that enable
learning in deep neural networks.
9.3. Back-Propagation
• In the previous section, we saw an example of neural networks that
could learn their weights and biases using the stochastic gradient de-
scent algorithm.
• In this section, we will see how to compute the gradient, more pre-
cisely, the derivatives of the cost function with respect to weights and
biases in all layers.
• Back-propagation is a practical application of the chain rule for
the computation of derivatives.
• The back-propagation algorithm was originally introduced in the 1970s,
but its importance was not fully appreciated until a famous 1986 paper
by Rumelhart-Hinton-Williams [55], in Nature.
9.3.1. Notations
Let’s begin with notations which let us refer to weights, biases, and activa-
tions in the network in an unambiguous way.
w_{jk}^ℓ : the weight for the connection from the k-th neuron in the (ℓ−1)-th layer to the j-th neuron in the ℓ-th layer
b_j^ℓ : the bias of the j-th neuron in the ℓ-th layer
a_j^ℓ : the activation of the j-th neuron in the ℓ-th layer
With these notations, the activation a_j^ℓ is related to the activations in the previous layer by
      a_j^ℓ = σ( Σ_k w_{jk}^ℓ a_k^{ℓ−1} + b_j^ℓ ),    (9.10)
where the sum is over all neurons k in the (ℓ−1)-th layer. Denote the weighted input by
      z_j^ℓ := Σ_k w_{jk}^ℓ a_k^{ℓ−1} + b_j^ℓ.    (9.11)
Now, define
      W^ℓ = [w_{jk}^ℓ] : the weight matrix for layer ℓ,
      b^ℓ = [b_j^ℓ] : the bias vector for layer ℓ,
      z^ℓ = [z_j^ℓ] : the weighted input vector for layer ℓ,    (9.12)
      a^ℓ = [a_j^ℓ] : the activation vector for layer ℓ.
Then
      a^ℓ = σ(z^ℓ) = σ(W^ℓ a^{ℓ−1} + b^ℓ).    (9.13)
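A minimal NumPy sketch of the layer-by-layer recursion (9.13); weights and biases are assumed to be lists [W^2, ..., W^L] and [b^2, ..., b^L] of arrays, as in the Network class shown earlier:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    """Apply a^l = sigma(W^l a^{l-1} + b^l) layer by layer; 'a' is the input column vector."""
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a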
Remark 9.1. The reason we need the first assumption is that what back-propagation actually lets us do is compute the partial derivatives ∂C_x/∂w_{jk}^ℓ and ∂C_x/∂b_j^ℓ for a single training example. We can then recover ∂C/∂w_{jk}^ℓ and ∂C/∂b_j^ℓ by averaging over training examples.
Theorem 9.3. Suppose that the cost function C satisfies the two as-
sumptions in Section 9.3.2 so that it represents the cost for a single train-
ing example. Assume the network contains L layers, of which the feed-
forward model is given as in (9.10):
      a_j^ℓ = σ(z_j^ℓ),    z_j^ℓ = Σ_k w_{jk}^ℓ a_k^{ℓ−1} + b_j^ℓ;    ℓ = 2, 3, ⋯, L.
Then,
      (a) δ_j^L = (∂C/∂a_j^L) σ′(z_j^L),
      (b) δ_j^ℓ = Σ_k w_{kj}^{ℓ+1} δ_k^{ℓ+1} σ′(z_j^ℓ),    ℓ = L−1, ⋯, 2,
      (c) ∂C/∂b_j^ℓ = δ_j^ℓ,    ℓ = 2, ⋯, L,    (9.20)
      (d) ∂C/∂w_{jk}^ℓ = a_k^{ℓ−1} δ_j^ℓ,    ℓ = 2, ⋯, L.
Proof. Here, we will prove (b) only; see Exercise 1 for the others. Using the definition (9.19) and the chain rule, we have
      δ_j^ℓ = ∂C/∂z_j^ℓ = Σ_k (∂C/∂z_k^{ℓ+1}) (∂z_k^{ℓ+1}/∂z_j^ℓ) = Σ_k (∂z_k^{ℓ+1}/∂z_j^ℓ) δ_k^{ℓ+1}.    (9.21)
Note
      z_k^{ℓ+1} = Σ_i w_{ki}^{ℓ+1} a_i^ℓ + b_k^{ℓ+1} = Σ_i w_{ki}^{ℓ+1} σ(z_i^ℓ) + b_k^{ℓ+1}.    (9.22)
Remarks 9.4.
• The four fundamental equations hold independently of the choice of the cost function C and the activation σ.
• A consequence of (9.24.d) is that if a^{ℓ−1} is small (in modulus), the gradient term ∂C/∂W^ℓ will also tend to be small. In this case, we'll say the weight learns slowly, meaning that it's not changing much during gradient descent.
• In other words, a consequence of (9.24.d) is that weights output
from low-activation neurons learn slowly.
• The sigmoid function σ becomes very flat when σ (zjL ) is approxi-
mately 0 or 1. When this occurs we will have σ 0 (zjL ) ≈ 0. So, a weight
in the final layer will learn slowly if the output neuron is either low
activation (≈ 0) or high activation (≈ 1). In this case, we usually say
the output neuron has saturated and, as a result, the weight is
learning slowly (or stopped).
• Similar remarks hold also in other layers and for the biases as well.
• Summing up, weights and biases will learn slowly if either the in-
neurons (upwind) are in low-activation or the out-neurons
(downwind) have saturated (either high- or low-activation).
1. Input: set the corresponding activation a^1 for the input layer.
2. Feedforward:
      z^ℓ = W^ℓ a^{ℓ−1} + b^ℓ;    a^ℓ = σ(z^ℓ);    ℓ = 2, 3, ⋯, L
3. Output error δ^L:
      δ^L = ∇_{a^L} C ⊙ σ′(z^L);
4. Back-propagate the error:
      δ^ℓ = ((W^{ℓ+1})^T δ^{ℓ+1}) ⊙ σ′(z^ℓ);    ℓ = L−1, ⋯, 2
5. Output the gradient:
      ∇_{b^ℓ} C = δ^ℓ;    ∇_{W^ℓ} C = δ^ℓ (a^{ℓ−1})^T;    ℓ = 2, ⋯, L
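A minimal NumPy sketch of one back-propagation pass for a single example, following (9.20) with the quadratic cost C = ½||a^L − y||²; the list-of-arrays layout (as in the Network class shown earlier) is assumed:

import numpy as np

def sigmoid(z):        return 1.0 / (1.0 + np.exp(-z))
def sigmoid_prime(z):  return sigmoid(z) * (1.0 - sigmoid(z))

def backprop(x, y, weights, biases):
    """Return (nabla_b, nabla_w) for C = 0.5*||a^L - y||^2 and one training example (x, y)."""
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):                 # feedforward, storing z^l and a^l
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])     # output error, (9.20a)
    nabla_b = [delta]
    nabla_w = [delta @ activations[-2].T]                     # (9.20c), (9.20d) for layer L
    for l in range(2, len(weights) + 1):                      # back-propagate, (9.20b)
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        nabla_b.insert(0, delta)
        nabla_w.insert(0, delta @ activations[-l - 1].T)
    return nabla_b, nabla_w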
Figure 9.13
Remarks 9.6. The neural network exemplified in Figure 9.13 can pro-
duce a classification accuracy better than 98%, for the MNIST hand-
written digit data set. But upon reflection, it’s strange to use net-
works with fully-connected layers to classify images.
• The network architecture does not take into account the spatial
structure of the images.
• For instance, it treats input pixels that are far apart and input pixels that are close together on exactly the same footing.
Figure 9.15: Two of local receptive fields, starting from the top-left corner.
(Geometry of neurons in the first hidden layer is 24 × 24.)
Note: We have seen the local receptive field being moved by one pixel at
a time (stride_length=1). In fact, sometimes a different stride length
is used. For instance, we might move the local receptive field 2 pixels to
the right (or down), in which case we would say a stride length of 2 is
used.
• We sometimes call the map from the input layer to the hidden layer a
feature map.
– Suppose the weights and bias are such that the hidden neuron
can pick out a feature (e.g., a vertical edge) in a particular local
receptive field.
– That ability is also likely to be useful at other places in the image.
– And therefore it is useful to apply the same feature detector
everywhere in the image.
• We call the weights and bias defining the feature map the shared
weights and the shared bias, respectively.
• The shared weights and bias clearly define a kernel or filter.
To put it in slightly more abstract terms, CNNs are well adapted to the
translation invariance of images.2
¹The weighting in (9.25) is just a form of convolution; we may rewrite it as a^1 = σ(b + w ∗ a^0). So the network is called a convolutional network.
²Move a picture of a cat a little ways, and it’s still an image of a cat.
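A minimal NumPy sketch of the shared-weight map in (9.25): one 5×5 kernel slid over a 28×28 input (stride 1) produces a 24×24 feature map. The kernel and bias values are assumed to be given:

import numpy as np

def conv_feature_map(image, kernel, bias):
    """One feature map: sigma(b + w * a^0) with a single shared kernel, stride 1."""
    kh, kw = kernel.shape
    H, W = image.shape[0] - kh + 1, image.shape[1] - kw + 1   # 28x28 input, 5x5 kernel -> 24x24
    z = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            z[i, j] = np.sum(kernel * image[i:i+kh, j:j+kw]) + bias
    return 1.0 / (1.0 + np.exp(-z))                           # sigmoid activation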
Remark 9.7.
• The network structure we have considered so far can detect just a
single localized feature.
• To do more effective image recognition, we’ll need more than
one feature map.
• Thus, a complete convolutional layer consists of several different
feature maps:
Modern CNNs are often built with 10 to 50 feature maps, each associated with an r × r local receptive field, r = 3 ∼ 9.
Figure 9.17: The 20 images corresponding to 20 different feature maps, which are actually
learned when classifying the MNIST data set (r = 5).
(c) Pooling
From Figure 9.16: Since we have 24 × 24 neurons output from the convo-
lutional layer, after pooling we will have 12 × 12 neurons for each feature
map:
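A minimal NumPy sketch of 2×2 max-pooling, which maps a 24×24 feature map to 12×12 (the feature-map dimensions must be even for this reshape trick):

import numpy as np

def max_pool_2x2(fmap):
    """Replace each non-overlapping 2x2 block by its maximum (e.g., 24x24 -> 12x12)."""
    H, W = fmap.shape
    return fmap.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))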
Types of pooling
Figure 9.20: A simple CNN of three feature maps, to classify MNIST digits.
Figure 9.21: The images missed by an ensemble of 5 CNNs. The label in the top right is the correct classification, while the label in the bottom right is the classifier's output.
9.1. Complete proof of Theorem 9.3. Hint : The four fundamental equations in (9.20) can
be obtained by simple applications of the chain rule.
9.2. The core equations of back-propagation in a network with fully-connected layers are
given in (9.20). Suppose we have a network containing a convolutional layer, a max-
pooling layer, and a fully-connected output layer, as in the network shown in Fig-
ure 9.20. How are the core equations of back-propagation modified?
9.3. (Designing a deep network). First, download a CNN code (including the MNIST
data set) by accessing to
https://github.com/mnielsen/neural-networks-and-deep-learning.git
or ‘git clone’ it. In the ‘src’ directory, there are 8 python source files:
conv.py mnist_average_darkness.py mnist_svm.py network2.py
expand_mnist.py mnist_loader.py network.py network3.py
On lines 16–22 in Figure 9.22 below, I put a design of a CNN model, which involved 2
hidden layers, one for a convolution-pooling layer and the other for a fully-connected
layer. Its test accuracy becomes approximately 98.8% in 30 epochs.
(a) Set ‘GPU = False’ in network3.py, if you are NOT using a GPU.
(Default: ‘GPU = True’, set on line 50-some of network3.py)
(b) Modify Run_network3.py appropriately to design a CNN model as accurate as
possible. Can your network achieve an accuracy better than 99.5%? Hint : You
may keep using the SoftmaxLayer for the final layer. ReLU (Rectified Linear
Units) seems comparable with the sigmoid function (default) for activation. The
default p_dropout=0.0. You should add some more meaningful layers and tune all of the hyperparameters well: the number of feature maps for convolutional layers, the number of fully-connected neurons, η, p_dropout, etc.
(c) Design an ensemble of 5 such networks to further improve the accuracy. Hint :
Explore the function ‘ensemble’ defined in conv.py.
Run_network3.py
1 """ Run_network3.py:
2 -------------------
3 A CNN model, for the MNIST data set,
4 which uses network3.py written by Michael Nielsen.
5 The source code can be downloaded from
6 https://github.com/mnielsen/neural-networks-and-deep-learning.git
7 or 'git clone' it
8 """
9 import network3
10 from network3 import Network, ReLU
11 from network3 import ConvPoolLayer, FullyConnectedLayer, SoftmaxLayer
12
15 mini_batch_size = 10
16 net = Network([
17 ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
18 filter_shape=(20, 1, 5, 5),
19 poolsize=(2, 2), activation_fn=ReLU),
20 FullyConnectedLayer(
21 n_in=20*12*12, n_out=100, activation_fn=ReLU, p_dropout=0.0),
22 SoftmaxLayer(n_in=100, n_out=10, p_dropout=0.5)], mini_batch_size)
23
Ĥ q̂ = ∇̂f(x_n).    (9.30)
σ ≅ ||∇̂f(x_n)|| / ||q̂||.    (9.32)
• Comparisons:
– Implement (or download codes for) the original Newton’s method
and one of quasi-Newton methods (e.g., BFGS).
– Let’s call our method the partial Hessian (PH)-Newton method.
Compare the PH-Newton with those known methods for: the num-
ber of iterations, the total elapsed time, convergence behavior, and
stability/robustness.
– Test with e.g. the Rosenbrock function defined on Rd , d ≥ 10, with
various initial points x 0 .
• Preparing data for analysis is one of the most important steps in any
data-mining project – and traditionally, one of the most time consum-
ing.
• Often, it takes up to 80% of the time.
• Data preparation is not a once-off process; it is iterative, as you understand the problem more deeply on each successive pass.
Objectives.
• Algorithm Development:
– For the Wine dataset, for example, erase r% of the data values at random; r = 5, 10, 20.
– Compare your new filling strategy with (1) the simple sample-removal method and (2) the imputation strategy using the mean, median, or mode.
– Perform accuracy analysis for various classifiers, e.g., logistic re-
gression, support vector machine, and random forests.
Objectives.
• Algorithm Development:
– This project will study a new probabilistic classifier, called the mem-
bership score machines (MSM), by introducing membership
score functions of the form
      ϕ(x) = e^{−r²/σ²},    r² = Σ_{j=1}^{d} (µ_j − x_j)² / λ_j².    (9.33)
Bibliography
[5] Y. Bengio, Y. LeCun, and G. Hinton, Deep learning, Nature, 521 (2015), pp. 436–444.
[6] A. Björck, Numerical Methods for Least Squares Problems, SIAM, Philadelphia, 1996.
[10] J. Bunch and L. Kaufman, Some stable methods for calculating inertia and solving symmetric linear systems, Math. Comput., 31 (1977), pp. 163–179.
[11] R. B. Cattell, The description of personality: Basic traits resolved into clusters, Journal of Abnormal and Social Psychology, 38 (1943), pp. 476–506.
[15] A. Fisher, C. Rudin, and F. Dominici, Model class reliance: Variable importance measures for any machine learning model class, from the "Rashomon" perspective, (2018).
[17] R. Fletcher, A new approach to variable metric algorithms, The Computer Journal, 13 (1970), pp. 317–322.
[18] S. Gerschgorin, Über die Abgrenzung der Eigenwerte einer Matrix, Izv. Akad. Nauk SSSR Ser. Mat., 7 (1931), pp. 746–754.
[19] X. Glorot, A. Bordes, and Y. Bengio, Deep sparse rectifier neural networks, in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, G. Gordon, D. Dunson, and M. Dudík, eds., vol. 15 of Proceedings of Machine Learning Research, Fort Lauderdale, FL, USA, 11–13 Apr 2011, PMLR, pp. 315–323.
[21] G. Golub and C. Van Loan, Matrix Computations, 3rd Ed., The Johns Hopkins University Press, Baltimore, 1996.
[22] B. Grosser and B. Lang, An O(n²) algorithm for the bidiagonal SVD, Lin. Alg. Appl., 358 (2003), pp. 45–70.
[24] G. E. Hinton, S. Osindero, and Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Comput., 18 (2006), pp. 1527–1554.
[45] H. Mobahi and J. W. Fisher III, On the link between Gaussian homotopy continuation and convex envelopes, in Energy Minimization Methods in Computer Vision and Pattern Recognition, Lecture Notes in Computer Science, vol. 8932, X.-C. Tai, E. Bae, T. F. Chan, and M. Lysaker, eds., Hong Kong, China, 2015, Springer, pp. 43–56.
[46] R. T. Ng and J. Han, Efficient and effective clustering methods for spatial data mining, in Proceedings of the 20th International Conference on Very Large Data Bases, VLDB '94, San Francisco, CA, USA, 1994, Morgan Kaufmann Publishers Inc., pp. 144–155.
[47] R. T. Ng and J. Han, CLARANS: A method for clustering objects for spatial data mining, IEEE Transactions on Knowledge and Data Engineering, 14 (2002), pp. 1003–1016.
[48] M. Nielsen, Neural networks and deep learning. (The online book can be found at http://neuralnetworksanddeeplearning.com), 2013.
[50] K. Pearson, On lines and planes of closest fit to systems of points in space, Philosophical Magazine, 2 (1901), pp. 559–572.
[53] S. Raschka and V. Mirjalili, Python Machine Learning, 3rd Ed., Packt Publishing Ltd., Birmingham, UK, 2019.
[54] H. Robbins and S. Monro, A stochastic approximation method, Ann. Math. Statist., 22 (1951), pp. 400–407.
[59] B. Schölkopf, A. Smola, and K.-R. Müller, Kernel principal component analysis, in Artificial Neural Networks — ICANN’97, vol. 1327 of Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 1997, pp. 583–588.
[63] O. Taussky, Bounds for characteristic roots of matrices, Duke Math. J., 15 (1948), pp. 1043–1044.
[64] R. Tibshirani, Regression shrinkage and selection via the LASSO, Journal of the Royal Statistical Society, Series B (Methodological), 58 (1996), pp. 267–288.
[65] R. C. Tryon, Cluster Analysis: Correlation Profile and Orthometric (Factor) Analysis for the Isolation of Unities in Mind and Personality, Edwards Brothers, 1939.
[66] R. Varga, Matrix Iterative Analysis, 2nd Ed., Springer-Verlag, Berlin, Heidelberg, 2000.
[67] P. R. Willems, B. Lang, and C. Vömel, Computing the bidiagonal SVD using multiple relatively robust representations, SIAM Journal on Matrix Analysis and Applications, 28 (2006), pp. 907–926.
[70] J. Zubin, A technique for measuring like-mindedness, The Journal of Abnormal and Social Psychology, 33 (1938), pp. 508–516.