Chapter 3
Pierre Paquay
Problem 3.1
First, we generate 2000 examples uniformly over the two semi-circles, which means that we will have approximately
1000 examples for each class.
library(ggplot2)
set.seed(101)
# Double semi-circle generator (one plausible implementation): n points with radius in
# [rad, rad + thk]; the top ring is labeled +1, the bottom ring is shifted right by
# rad + thk / 2 and down by sep and labeled -1.
init_data <- function(n, rad, thk, sep) {
  y <- sample(c(1, -1), n, replace = TRUE)
  r <- runif(n, rad, rad + thk)
  theta <- runif(n, 0, pi)
  D <- data.frame(x1 = r * cos(theta) + (y == -1) * (rad + thk / 2),
                  x2 = y * r * sin(theta) - (y == -1) * sep)
  return(cbind(D, y))
}
rad <- 10
thk <- 8
sep <- 5
D <- init_data(2000, rad, thk, sep)
p <- ggplot(D, aes(x = x1, y = x2, col = as.factor(y + 3))) + geom_point() +
theme(legend.position = "none") +
coord_fixed()
p
[Figure: the generated double semi-circle data, x1 versus x2, colored by class.]
(a) Then, we run the PLA starting from w = 0 until it converges and we plot the data with the final
hypothesis.
h <- function(D, w) {
  scalar_prod <- cbind(1, D$x1, D$x2) %*% w
  return(as.vector(sign(scalar_prod)))
}

iter <- 0
w_PLA <- c(0, 0, 0)
repeat {
  y_pred <- h(D, w_PLA)
  D_mis <- subset(D, y != y_pred)
  if (nrow(D_mis) == 0)
    break
  xt <- D_mis[1, ]
  w_PLA <- w_PLA + c(1, xt$x1, xt$x2) * xt$y
  iter <- iter + 1
}
p + geom_abline(slope = -w_PLA[2] / w_PLA[3], intercept = -w_PLA[1] / w_PLA[3])
[Figure: the data with the final PLA hypothesis.]
(b) Now, we use linear regression for classification to obtain wlin .
X <- as.matrix(cbind(1, D[, c("x1", "x2")]))
y <- D$y
X_cross <- solve(t(X) %*% X) %*% t(X)
w_lin <- as.vector(X_cross %*% y)
p + geom_abline(slope = -w_lin[2] / w_lin[3], intercept = -w_lin[1] / w_lin[3])
[Figure: the data with the linear regression hypothesis.]
As we may see, linear regression can also be used for classification (the values $\text{sign}(w_{lin}^T x)$ will likely make
good classification predictions). The linear regression weights $w_{lin}$ are also an approximate solution for the
perceptron model.
Problem 3.2
In this problem, we consider again the double semi-circle of Problem 3.1 and we vary sep in the range
{0.2, 0.4, · · · , 5}. For each of these values, we generate 2000 examples and run PLA starting with w = 0. Below,
we plot sep versus the number of iterations PLA takes to converge.
set.seed(10)
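The loop that produces the two plots below is not echoed; the following is a minimal sketch of it, assuming PLA is restarted from w = 0 for each value of sep and that the theoretical bound of Problem 1.3 is evaluated with the final (separating) PLA weights playing the role of w*. The names sep_seq, iterations and iterations_max are reused below; R and rho are illustrative.

# Sketch: empirical PLA iteration counts and the theoretical bound R^2 ||w*||^2 / rho^2.
sep_seq <- seq(0.2, 5, by = 0.2)
iterations <- numeric(length(sep_seq))
iterations_max <- numeric(length(sep_seq))
for (i in seq_along(sep_seq)) {
  D <- init_data(2000, rad, thk, sep_seq[i])
  w <- c(0, 0, 0)
  n_iter <- 0
  repeat {
    D_mis <- subset(D, y != h(D, w))
    if (nrow(D_mis) == 0) break
    xt <- D_mis[1, ]
    w <- w + c(1, xt$x1, xt$x2) * xt$y
    n_iter <- n_iter + 1
  }
  iterations[i] <- n_iter
  X <- cbind(1, D$x1, D$x2)
  rho <- min(D$y * as.vector(X %*% w))   # margin of the final separating weights
  R <- max(sqrt(rowSums(X^2)))           # radius of the data
  iterations_max[i] <- R^2 * sum(w^2) / rho^2
}
ggplot(data.frame(sep = sep_seq, iterations = iterations), aes(x = sep, y = iterations)) +
  geom_line(col = "red")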
[Figure: sep versus the number of iterations PLA takes to converge.]
We may see that the number of iterations tends to decrease as sep increases. This trend is confirmed
by the theoretical results (see Problem 1.3); to see this, we plot below sep versus the theoretical maximum
number of iterations.
ggplot(data.frame(sep = sep_seq, iterations = iterations_max), aes(x = sep, y = iterations)) +
geom_line(col = "red") +
ylab("Maximum number of iterations (theoretical)")
[Figure: sep versus the theoretical maximum number of iterations.]
Problem 3.3
Here again, we consider the double semi-circle of Problem 3.1; we set sep = −5 (which makes the data
non-linearly separable) and we generate 2000 examples.
set.seed(101)
sep <- -5
D <- init_data(2000, rad, thk, sep)
p <- ggplot(D, aes(x = x1, y = x2, col = as.factor(y + 3))) + geom_point() +
theme(legend.position = "none") +
coord_fixed()
p
[Figure: the generated data with sep = −5.]
(a) If we run PLA on these examples, it will never stop updating, since the data is not linearly separable and there is always at least one misclassified example.
(b) Now, we run the pocket algorithm for 100000 iterations and we plot Ein versus the iteration number t for
t = 1, · · · , 100.
set.seed(101)
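The pocket run itself is not echoed above; a minimal sketch of it is given below (assuming a random misclassified point is used for the PLA update at each step). The name w_pocket, which keeps the best weights seen so far, and the timing variables are reused later; the rest is illustrative.

# Sketch of the pocket algorithm on the raw inputs.
start_time_pocket <- Sys.time()
w <- c(0, 0, 0)
w_pocket <- w
E_in <- mean(D$y != h(D, w))
E_in_pocket <- E_in
for (iter in 1:100000) {
  D_mis <- subset(D, y != h(D, w))
  if (nrow(D_mis) == 0) break
  xt <- D_mis[sample(nrow(D_mis), 1), ]
  w <- w + c(1, xt$x1, xt$x2) * xt$y
  E_in <- c(E_in, mean(D$y != h(D, w)))
  if (E_in[length(E_in)] < E_in_pocket[length(E_in_pocket)]) w_pocket <- w
  E_in_pocket <- c(E_in_pocket, mean(D$y != h(D, w_pocket)))
}
end_time_pocket <- Sys.time()
ggplot(data.frame(t = 1:100, E_in = E_in_pocket[2:101]), aes(x = t, y = E_in)) +
  geom_line(col = "red")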
[Figure: E_in of the pocket weights versus the iteration number t.]
We clearly see that this Ein is monotonically non-increasing (as opposed to what would happen if we had used
the PLA).
(c) Below, we plot the data and the final hypothesis obtained in (b).
p + geom_abline(slope = -w_pocket[2] / w_pocket[3], intercept = -w_pocket[1] / w_pocket[3])
[Figure: the data with the final pocket hypothesis.]
(d) Here, we use the linear regression algorithm to obtain w and we compare this result with the pocket
algorithm in terms of computation time and quality of the solution.
start_time_lin <- Sys.time()
X <- as.matrix(cbind(1, D[, c("x1", "x2")]))
y <- D$y
X_cross <- solve(t(X) %*% X) %*% t(X)
w_lin <- as.vector(X_cross %*% y)
E_in_lin <- mean(D$y != h(D, w_lin))
end_time_lin <- Sys.time()
p + geom_abline(slope = -w_lin[2] / w_lin[3], intercept = -w_lin[1] / w_lin[3])
[Figure: the data with the linear regression hypothesis.]
For the pocket algorithm (with 100000 iterations), we have a computation time of 1.7378043, and for the
linear regression algorithm, we have a computation time of 0.0074768. The linear regression algorithm is
clearly better than the pocket algorithm in terms of computation time. When we take into account the
quality of the solution, the pocket algorithm has a (final) Ein of 0.086, and the linear regression algorithm
has an Ein of 0.0995. So, regarding the quality of the solution, the pocket algorithm is a little better than the
linear regression algorithm.
(e) Here, we repeat points (b) to (d) with a 3rd order polynomial feature transform.
First, we run the pocket algorithm for 100000 iterations and we plot Ein versus the iteration number t.
set.seed(11)
# 3rd order polynomial feature transform and the corresponding hypothesis (the feature
# order is chosen to match the contour plots below):
# (x1, x2, x1^2, x1 x2, x2^2, x1^3, x1^2 x2, x1 x2^2, x2^3), followed by y.
D_trans <- with(D, data.frame(x1, x2, x1^2, x1 * x2, x2^2,
                              x1^3, x1^2 * x2, x1 * x2^2, x2^3, y))
h_trans <- function(D, w) {
  scalar_prod <- cbind(1, as.matrix(D[, 1:9])) %*% w
  return(as.vector(sign(scalar_prod)))
}
start_time_pocket <- Sys.time()
E_in <- numeric()
E_in_pocket <- numeric()
w <- rep(0, 10)
w_pocket <- w
E_in <- c(E_in, mean(D_trans$y != h_trans(D_trans, w)))
E_in_pocket <- E_in
for (iter in 1:100000) {
  D_mis <- subset(D_trans, y != h_trans(D_trans, w))
  if (nrow(D_mis) == 0)
    break
  xt <- D_mis[sample(nrow(D_mis), 1), ]
  w <- w + c(1, as.numeric(xt[1:9])) * xt$y
  E_in <- c(E_in, mean(D_trans$y != h_trans(D_trans, w)))
  if (E_in[length(E_in)] < E_in_pocket[length(E_in_pocket)]) {
    w_pocket <- w
  }
  E_in_pocket <- c(E_in_pocket, mean(D_trans$y != h_trans(D_trans, w_pocket)))
}
end_time_pocket <- Sys.time()
ggplot(data.frame(t = 1:100000, E_in = E_in_pocket[-1]), aes(x = t, y = E_in)) +
  geom_line(col = "red") +
  coord_cartesian(xlim = c(1, 200))
[Figure: E_in of the pocket weights versus the iteration number t (3rd order polynomial transform).]
Then, we plot the data and the final hypothesis obtained above.
cc <- emdbook::curve3d(1 * w_pocket[1] + x * w_pocket[2] + y * w_pocket[3] + x^2 * w_pocket[4] +
x * y * w_pocket[5] + y^2 * w_pocket[6] + x^3 * w_pocket[7] +
x^2 * y * w_pocket[8] + x * y^2 * w_pocket[9] + y^3 * w_pocket[10],
xlim = c(-20, 35), ylim = c(-15, 20), sys3d = "none")
dimnames(cc$z) <- list(cc$x, cc$y)
mm <- reshape2::melt(cc$z)
p + geom_contour(data = mm, aes(x = Var1, y = Var2, z = value), breaks = 0, colour = "black")
[Figure: the data with the final pocket hypothesis (3rd order polynomial boundary).]
Finally, we use the linear regression algorithm to obtain the weights w.
start_time_lin <- Sys.time()
X <- as.matrix(cbind(1, D_trans[, 1:9]))
y <- D_trans$y
X_cross <- solve(t(X) %*% X) %*% t(X)
w_lin <- as.vector(X_cross %*% y)
E_in_lin <- mean(D_trans$y != h_trans(D_trans, w_lin))
end_time_lin <- Sys.time()
cc <- emdbook::curve3d(1 * w_lin[1] + x * w_lin[2] + y * w_lin[3] + x^2 * w_lin[4] +
x * y * w_lin[5] + y^2 * w_lin[6] + x^3 * w_lin[7] +
x^2 * y * w_lin[8] + x * y^2 * w_lin[9] + y^3 * w_lin[10],
xlim = c(-20, 35), ylim = c(-15, 20), sys3d = "none")
dimnames(cc$z) <- list(cc$x, cc$y)
mm <- reshape2::melt(cc$z)
p + geom_contour(data = mm, aes(x = Var1, y = Var2, z = value), breaks = 0, colour = "black")
[Figure: the data with the linear regression hypothesis (3rd order polynomial boundary).]
For the pocket algorithm (with 100000 iterations), we have a computation time of 2.7625689, and for the
linear regression algorithm, we have a computation time of 0.0071368. The linear regression algorithm is once
again clearly better than the pocket algorithm in terms of computation time. When we take into account the
quality of the solution, the pocket algorithm has a (final) Ein of 0.0445, and the linear regression algorithm
has an Ein of 0.021. In this case, regarding the quality of the solution, the linear regression algorithm is also
better than the pocket algorithm.
Problem 3.4
(a) We may write
$$e_n(w) = \begin{cases} 0 & \text{if } y_n w^T x_n \geq 1 \\ (1 - y_n w^T x_n)^2 & \text{if } y_n w^T x_n < 1 \end{cases};$$
this function is continuous everywhere since
$$\lim_{y_n w^T x_n \to 1^{\pm}} e_n(w) = 0.$$
It is also differentiable everywhere, as
$$\nabla e_n(w) = \begin{cases} 0 & \text{if } y_n w^T x_n \geq 1 \\ -2y_n(1 - y_n w^T x_n)x_n & \text{if } y_n w^T x_n < 1 \end{cases},$$
and
$$\lim_{y_n w^T x_n \to 1^{\pm}} \nabla e_n(w) = 0.$$
(b) Let us consider first the case where $\text{sign}(w^Tx_n) \neq y_n$, which means that $y_n w^T x_n \leq 0 < 1$. In this case,
we have $[\![\text{sign}(w^Tx_n) \neq y_n]\!] = 1$ and $e_n(w) = (1 - y_n w^T x_n)^2 \geq 1$; consequently
$$[\![\text{sign}(w^Tx_n) \neq y_n]\!] \leq e_n(w).$$
Now we consider the second case where $\text{sign}(w^Tx_n) = y_n$, which means that $y_n w^T x_n \geq 0$. In this case, we
have $[\![\text{sign}(w^Tx_n) \neq y_n]\!] = 0$, $e_n(w) = (1 - y_n w^T x_n)^2 \geq 0$ if $0 \leq y_n w^T x_n < 1$, and $e_n(w) = 0$ if $y_n w^T x_n \geq 1$;
consequently
$$[\![\text{sign}(w^Tx_n) \neq y_n]\!] \leq e_n(w).$$
In conclusion, we have that
$$E_{in}(w) = \frac{1}{N}\sum_{n=1}^{N}[\![\text{sign}(w^Tx_n) \neq y_n]\!] \leq \frac{1}{N}\sum_{n=1}^{N}e_n(w).$$
(c) If we apply SGD to our upper bound above, we get the following algorithm.
1. Select an initial w.
2. Repeat until a stopping condition is met:
Select $(x_n, y_n)$ randomly and let $s_n = w^T x_n$. If $y_n s_n = y_n w^T x_n \leq 1$, we have $\nabla e_n(w) = -2y_n(1 - y_n w^T x_n)x_n$,
and we update $w$ as
$$w \leftarrow w - \eta \nabla e_n(w) = w + 2\eta y_n (1 - y_n s_n) x_n.$$
And if $y_n s_n > 1$, we have $\nabla e_n(w) = 0$, and we update $w$ as
$$w \leftarrow w - \eta \nabla e_n(w) = w - \eta \cdot 0 = w.$$
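A minimal sketch of this update rule in code (the step size eta, the number of steps, and the use of a data matrix whose first column is the constant 1 are illustrative choices, not taken from the text):

# SGD on the squared-hinge bound of Problem 3.4; X is N x (d + 1) with a leading column of 1s.
sgd_squared_hinge <- function(X, y, eta = 0.01, n_steps = 10000) {
  w <- rep(0, ncol(X))
  for (step in 1:n_steps) {
    n <- sample(nrow(X), 1)
    s_n <- sum(w * X[n, ])
    if (y[n] * s_n <= 1)
      w <- w + 2 * eta * y[n] * (1 - y[n] * s_n) * X[n, ]
  }
  w
}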
Problem 3.5
(a) We may write
$$e_n(w) = \begin{cases} 0 & \text{if } y_n w^T x_n \geq 1 \\ 1 - y_n w^T x_n & \text{if } y_n w^T x_n < 1 \end{cases};$$
and this function is actually continuous everywhere as we have
$$\lim_{y_n w^T x_n \to 1^{\pm}} e_n(w) = 0.$$
However, this function is not differentiable everywhere; to see this, we note that
$$\nabla e_n(w) = \begin{cases} 0 & \text{if } y_n w^T x_n > 1 \\ -y_n x_n & \text{if } y_n w^T x_n < 1 \end{cases};$$
moreover,
$$\lim_{y_n w^T x_n \to 1^{+}} \nabla e_n(w) = 0 \quad \text{and} \quad \lim_{y_n w^T x_n \to 1^{-}} \nabla e_n(w) = -y_n x_n \neq 0,$$
so the two one-sided limits of the gradient at $y_n w^T x_n = 1$ do not agree.
(b) Let us consider first the case where $\text{sign}(w^Tx_n) \neq y_n$, which means that $y_n w^T x_n \leq 0 < 1$. In this case,
we have $[\![\text{sign}(w^Tx_n) \neq y_n]\!] = 1$ and $e_n(w) = 1 - y_n w^T x_n \geq 1$; consequently
$$[\![\text{sign}(w^Tx_n) \neq y_n]\!] \leq e_n(w).$$
Now we consider the second case where $\text{sign}(w^Tx_n) = y_n$, which means that $y_n w^T x_n \geq 0$. In this case, we
have $[\![\text{sign}(w^Tx_n) \neq y_n]\!] = 0$, $e_n(w) = 1 - y_n w^T x_n \geq 0$ if $0 \leq y_n w^T x_n < 1$, and $e_n(w) = 0$ if $y_n w^T x_n \geq 1$;
consequently
$$[\![\text{sign}(w^Tx_n) \neq y_n]\!] \leq e_n(w).$$
In conclusion, we have that
$$E_{in}(w) = \frac{1}{N}\sum_{n=1}^{N}[\![\text{sign}(w^Tx_n) \neq y_n]\!] \leq \frac{1}{N}\sum_{n=1}^{N}e_n(w).$$
(c) If we apply SGD to our upper bound above, we get the following algorithm.
1. Select an initial w.
2. Repeat until a stopping condition is met:
Select $(x_n, y_n)$ randomly. If $y_n w^T x_n < 1$, we have $\nabla e_n(w) = -y_n x_n$, and we update $w$ as
$$w \leftarrow w - \eta \nabla e_n(w) = w + \eta y_n x_n.$$
And if $y_n w^T x_n > 1$, we have $\nabla e_n(w) = 0$, and we update $w$ as
$$w \leftarrow w - \eta \nabla e_n(w) = w - \eta \cdot 0 = w.$$
Problem 3.6
(a) For linearly separable data, there exists $w^*$ such that $y_n (w^*)^T x_n > 0$ for all
$n = 1, \cdots, N$. Since the data set is finite, we may find $\epsilon > 0$ (for example $\epsilon = \min_n y_n (w^*)^T x_n$) such that
$y_n (w^*)^T x_n \geq \epsilon$ for all $n = 1, \cdots, N$; consequently, if we let $w = w^*/\epsilon$, we get
$$y_n w^T x_n \geq 1$$
for all $n = 1, \cdots, N$.
(b) The task of finding $w$ for separable data may be formulated as the following linear program:
$$\min_w \; c^T w \quad \text{subject to} \quad Aw \leq b,$$
where $c = 0$, $b = (-1, \cdots, -1)^T$, and
$$A = \begin{pmatrix} -y_1 x_1^T \\ \vdots \\ -y_N x_N^T \end{pmatrix} \quad (N \times (d+1)),$$
so that the constraints $Aw \leq b$ are exactly $y_n w^T x_n \geq 1$ for $n = 1, \cdots, N$ (any feasible $w$ will do, hence the trivial objective $c = 0$).
(c) When the data is not separable, the minimization problem may be formulated as a linear program as
follows:
$$\min_{(w, \xi)} \; c^T (w, \xi)^T \quad \text{subject to} \quad A\,(w, \xi)^T \leq b,$$
where $c^T = (0, \cdots, 0, 1, \cdots, 1)$, $b^T = (-1, \cdots, -1, 0, \cdots, 0)$, and
$$A = -\begin{pmatrix} \mathrm{diag}(y)\,X & I_N \\ 0 & I_N \end{pmatrix} \quad (2N \times (d + 1 + N)),$$
where $X$ is the $N \times (d+1)$ data matrix. The first $N$ constraints encode $y_n w^T x_n \geq 1 - \xi_n$ and the last $N$ constraints encode $\xi_n \geq 0$.
(d) The formulation of Problem 3.5 amounts to minimizing $\frac{1}{N}\sum_{n=1}^{N} e_n(w)$ with
$$e_n(w) = \begin{cases} 0 & \text{if } y_n w^T x_n \geq 1 \\ 1 - y_n w^T x_n & \text{if } y_n w^T x_n < 1 \end{cases}.$$
If we take a look at $e_n(w)$, we may note that when $x_n$ is correctly classified by $w$ with a margin of at least
one ($y_n w^T x_n \geq 1$), we get $e_n(w) = 0$, which means that in this case this term does
not contribute to the overall error. However, the deeper $x_n$ is inside the margin of one, the more the term
$e_n(w) = 1 - y_n w^T x_n$ contributes to the overall error. For example, if $x_n$
is correctly classified by $w$ but lies inside the margin of one ($0 < y_n w^T x_n < 1$), then the error term for this point satisfies
$0 < e_n(w) < 1$; and if $x_n$ is not correctly classified by $w$ ($y_n w^T x_n < 0$), then the error term for this point
satisfies $e_n(w) > 1$. In conclusion, the overall error characterizes the amount of violation of the margin, which is
exactly what we seek to minimize in point (c) above.
Problem 3.7
First, we use the linear programming algorithm from Problem 3.6 on the learning task in Problem 3.1 for the
separable case.
library(lpSolve)
set.seed(10)
rad <- 10
thk <- 8
sep <- 5
D <- init_data(2000, rad, thk, sep)
d <- 2
N <- nrow(D)
c_T <- rep(0, d + 1)
b_T <- rep(-1, N)
A <- -diag(D$y) %*% as.matrix(cbind(1, D[, 1:2]))
dir <- rep("<=", N)
linear_prog <- lp("min", c(c_T, -c_T), cbind(A, -A), const.dir = dir, const.rhs = b_T)
w <- linear_prog$solution[1:(d+1)] - linear_prog$solution[(d+2):(2 * (d + 1))]
X <- as.matrix(cbind(1, D[, c("x1", "x2")]))
y <- D$y
X_cross <- solve(t(X) %*% X) %*% t(X)
w_lin <- as.vector(X_cross %*% y)
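The evaluation and the plot below are not echoed; a minimal sketch is given here (the in-sample error of the 3rd order polynomial fit quoted below is obtained in the same way, after applying the transform of Problem 3.3(e) to this data set).

# Sketch of the evaluation behind the figure and the errors quoted below.
E_in_lp <- mean(D$y != h(D, w))        # linear programming solution
E_in_lin <- mean(D$y != h(D, w_lin))   # linear regression solution
ggplot(D, aes(x = x1, y = x2, col = as.factor(y + 3))) + geom_point() +
  theme(legend.position = "none") + coord_fixed() +
  geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3])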
[Figure: the separable data with the linear programming separator.]
As we may see, our linear programming algorithm solution perfectly separates the dataset so we have an Ein
of 0; the linear regression approach gives us an Ein of 0, and the 3rd order polynomial feature transform
gives us an Ein of 0 as well.
Now, we use the linear programming algorithm from Problem 3.6 on the learning task in Problem 3.1 for the
non separable case.
set.seed(10)
rad <- 10
thk <- 8
sep <- -5
D <- init_data(2000, rad, thk, sep)
d <- 2
N <- nrow(D)
c_T <- c(rep(0, d + 1), rep(1, N))
b_T <- c(rep(-1, N), rep(0, N))
A1 <- rbind(diag(D$y) %*% as.matrix(cbind(1, D[, 1:2])), matrix(0, nrow = N, ncol = d + 1))
A2 <- rbind(diag(1, nrow = N, ncol = N), diag(1, nrow = N, ncol = N))
A <- -cbind(A1, A2)
dir <- rep("<=", 2 * N)
linear_prog <- lp("min", c(c_T, -c_T), cbind(A, -A), const.dir = dir, const.rhs = b_T)
w <- linear_prog$solution[1:(d + 1 + N)] - linear_prog$solution[(d + 1 + N + 1):(2 * (d + 1 + N))]
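As above, the evaluation and the plot are not echoed; a minimal sketch follows. Only the first d + 1 entries of the solution are the weights, the remaining N entries are the violations.

# Sketch of the evaluation behind the figure and the errors quoted below.
E_in_lp <- mean(D$y != h(D, w[1:(d + 1)]))
ggplot(D, aes(x = x1, y = x2, col = as.factor(y + 3))) + geom_point() +
  theme(legend.position = "none") + coord_fixed() +
  geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3])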
[Figure: the non-separable data with the linear programming separator.]
Here, our linear programming algorithm solution gives us an Ein of 0.072; the linear regression approach
gives us an Ein of 0.0855, and the 3rd order polynomial feature transform gives us an Ein of 0.0205 which is
the best of the three approaches.
Problem 3.8
First, we will show that $h^*(x) = \mathbb{E}_{y|x}[y|x]$ minimizes the $E_{out}$ expression. For any hypothesis $h$, we have that
$$E_{out}(h) = \mathbb{E}_{(x,y)}[(h(x) - y)^2] = \mathbb{E}_{(x,y)}[(h(x) - h^*(x))^2] + \mathbb{E}_{(x,y)}[(h^*(x) - y)^2] \geq \mathbb{E}_{(x,y)}[(h^*(x) - y)^2],$$
where the cross term $2\,\mathbb{E}_{(x,y)}[(h(x) - h^*(x))(h^*(x) - y)]$ vanishes because $\mathbb{E}_{y|x}[h^*(x) - y] = 0$; this means that $h^*(x)$ is actually the hypothesis that minimizes $E_{out}$.
Now, it is obvious that we are able to write that $y = h^*(x) + \epsilon(x)$, where $\epsilon(x) = y - h^*(x)$ is a noise variable with zero mean, since $\mathbb{E}_{y|x}[\epsilon(x)] = \mathbb{E}_{y|x}[y|x] - h^*(x) = 0$.
Problem 3.9
We may write that
\begin{align*}
(*) &= \frac{1}{N}\left[(w - (X^TX)^{-1}X^Ty)^T(X^TX)(w - (X^TX)^{-1}X^Ty) + y^T(I - X(X^TX)^{-1}X^T)y\right] \\
&= \frac{1}{N}\left[(w^T - y^TX(X^TX)^{-1})(X^TX)(w - (X^TX)^{-1}X^Ty) + y^Ty - y^TX(X^TX)^{-1}X^Ty\right] \\
&= \frac{1}{N}\left[w^T(X^TX)w - 2w^TX^Ty + y^Ty\right] = E_{in}(w).
\end{align*}
Since the second term of $(*)$ does not depend on $w$ and the first term is non-negative (as $X^TX$ is positive semidefinite), $E_{in}(w)$ is minimized when the first term vanishes, that is for
$$w = (X^TX)^{-1}X^Ty = w_{lin}.$$
Problem 3.10
Let $v \neq 0$ be an eigenvector of $H$ with eigenvalue $\lambda$, so that $Hv = \lambda v$. As $H$ is idempotent, we get
$$\lambda v = Hv = H^2 v = H(\lambda v) = \lambda^2 v,$$
so $\lambda^2 = \lambda$ and the eigenvalues of $H$ are either 0 or 1. Moreover,
$$\text{trace}(H) = \text{trace}(X(X^TX)^{-1}X^T) = \text{trace}((X^TX)^{-1}X^TX) = \text{trace}(I_{d+1}) = d + 1,$$
and since the trace is the sum of the eigenvalues, this means that $d + 1$ eigenvalues are equal to 1. As $H$ is symmetric, it is also diagonalizable, so there exists
an invertible matrix $V$ (whose columns are the eigenvectors of $H$) and a diagonal matrix $D$ (whose elements
are the eigenvalues of $H$) such that
$$V^{-1}HV = \text{diag}(\lambda_1, \cdots, \lambda_N) = D;$$
hence $\text{rank}(H) = \text{rank}(D)$, as $V$ and $V^{-1}$ are of maximum rank. As the rank of $D$ is the number of eigenvalues not equal to 0, we finally
get that
$$\text{rank}(H) = \text{trace}(H) = d + 1.$$
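Although not required by the problem, a quick numerical illustration on the data generated earlier (a hypothetical check, reusing the last data set D, so d + 1 = 3) confirms this:

# Hat matrix on the 2000 x 3 data matrix; its trace equals d + 1 = 3.
X <- as.matrix(cbind(1, D[, c("x1", "x2")]))
H_hat <- X %*% solve(t(X) %*% X) %*% t(X)
sum(diag(H_hat))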
Problem 3.11
(b) By taking the expectation with respect to $x_0$ and $\epsilon_0$, we get the following expression for $E_{out}$. We have
$$E_{out}(g) = \underbrace{\mathbb{E}[\epsilon_0^2]}_{(1)} \;-\; \underbrace{2\,\mathbb{E}[\epsilon_0\, x_0^T(X^TX)^{-1}X^T\epsilon]}_{(2)} \;+\; \underbrace{\mathbb{E}[x_0^T(X^TX)^{-1}X^T\epsilon\,\epsilon^T X(X^TX)^{-1}x_0]}_{(3)},$$
where the expression (1) equals $\sigma^2$ and the expression (2) vanishes since $\epsilon_0$ has zero mean and is independent of $x_0$ and $\epsilon$. Finally, we do the same for the expression (3); first, we note that
$$x_0^T(X^TX)^{-1}X^T\epsilon\,\epsilon^T X(X^TX)^{-1}x_0 = \text{trace}(\epsilon^T X(X^TX)^{-1}x_0 x_0^T(X^TX)^{-1}X^T\epsilon) = \text{trace}(x_0 x_0^T(X^TX)^{-1}X^T\epsilon\,\epsilon^T X(X^TX)^{-1})$$
by the cyclic property of the trace. Then, taking the expectation with respect to $x_0$ (with $\Sigma = \mathbb{E}_{x_0}[x_0 x_0^T]$), we get that the expression (3) equals $\text{trace}(\Sigma(X^TX)^{-1}X^T\epsilon\,\epsilon^T X(X^TX)^{-1})$. In conclusion, we have
$$E_{out}(g) = \sigma^2 + \text{trace}(\Sigma(X^TX)^{-1}X^T\epsilon\,\epsilon^T X(X^TX)^{-1}).$$
Moreover, if $\frac{1}{N}X^TX = \Sigma$, and since $\mathbb{E}_\epsilon[\epsilon\epsilon^T] = \sigma^2 I_N$, we get that
$$\mathbb{E}_\epsilon[E_{out}(g)] = \sigma^2 + \frac{\sigma^2}{N}\text{trace}(\Sigma\Sigma^{-1}) = \sigma^2\left(1 + \frac{d+1}{N}\right).$$
(e) As the matrix $\frac{1}{N}X^TX$ is the maximum likelihood estimator of $\Sigma$, we know that $\frac{1}{N}X^TX \to \Sigma$ in probability.
Moreover, by the continuity of the inverse and of the trace functions, we also have that
$$x_N = \text{trace}\left(\Sigma\left(\frac{1}{N}X^TX\right)^{-1}\right) \to \text{trace}(I_{d+1}) = d + 1$$
in probability. This means that, with high probability, we have $|x_N - (d+1)| \leq \eta$ for any $\eta > 0$ provided $N$
is big enough; in this case, we also have that
$$\mathbb{E}[E_{out}(g)] - \left(\sigma^2 + \frac{\sigma^2}{N}(d+1)\right) = \left(\sigma^2 + \frac{\sigma^2}{N}x_N\right) - \left(\sigma^2 + \frac{\sigma^2}{N}(d+1)\right) \leq \frac{\sigma^2}{N}\eta.$$
Problem 3.12
We have already proven that $H$ is a symmetric and idempotent matrix, which makes it a projection matrix
by definition. We may write that
$$\hat{y} = Hy = X\left[(X^TX)^{-1}X^Ty\right] = Xw_{lin} \in \text{span}(X),$$
where $\text{span}(X)$ is the subspace generated by the columns of $X$. So $\hat{y}$ is the projection of $y$ onto the subspace
generated by the columns of $X$.
Problem 3.13
(a) For the linear regression problem, we need d + 1 weights, and for the classification problem we need d + 2
weights (and also 2N violations in the non-separable case).
(b) To solve this problem, we first consider the case where $d = 1$. The weights $w$ obtained from the linear fit
give us the hypothesis
$$h(x) = w_0 + w_1 x,$$
and the weights $w^{class}$ obtained from the classification problem give us the hypothesis
$$h^{class}(x, y) = \text{sign}(w_0^{class} + w_1^{class}x + w_2^{class}y).$$
On the decision boundary $w_0^{class} + w_1^{class}x + w_2^{class}y = 0$, we may solve for $y$ and obtain $y = -\frac{w_0^{class}}{w_2^{class}} - \frac{w_1^{class}}{w_2^{class}}x$, so the linear fit is recovered by taking $w_0 = -\frac{w_0^{class}}{w_2^{class}}$ and $w_1 = -\frac{w_1^{class}}{w_2^{class}}$.
We will now apply these results to the general case; we may write that
$$w = \left(-\frac{w_0^{class}}{w_{d+1}^{class}}, \cdots, -\frac{w_d^{class}}{w_{d+1}^{class}}\right)^T.$$
(c) We generate a data set $y_n = x_n^2 + \sigma\epsilon_n$ with $N = 50$, where $x_n$ is uniform on $[0, 1]$ and $\epsilon_n$ is zero-mean
Gaussian noise (with $\sigma = 0.1$). Below, we plot the positive and negative points $D_+$ and $D_-$ for $a = (0, 0.1)^T$.
set.seed(10)
N <- 50
sigma <- 0.1
epsilon <- rnorm(N, 0, 1)
xn <- runif(N, 0, 1)
yn <- xn^2 + sigma * epsilon
a <- c(0, 0.1)
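The construction of D+ and D− and the plot below are not echoed; a minimal sketch follows (the column names x and y are illustrative, only the class column and the column positions are relied upon later).

# Each (xn, yn) is duplicated: shifted by +a with class +1 (D_plus) and by -a with class -1 (D_minus).
D_plus <- data.frame(x = xn + a[1], y = yn + a[2], class = 1)
D_minus <- data.frame(x = xn - a[1], y = yn - a[2], class = -1)
ggplot(rbind(D_plus, D_minus), aes(x = x, y = y, col = as.factor(class))) +
  geom_point() + theme(legend.position = "none")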
[Figure: the positive points D+ and the negative points D−.]
D_union <- rbind(D_plus, D_minus)
d <- 2
N <- nrow(D_union)
c_T <- c(rep(0, d + 1), rep(1, N))
b_T <- c(rep(-1, N), rep(0, N))
A1 <- rbind(diag(D_union$class) %*%
as.matrix(cbind(1, D_union[, 1:2])), matrix(0, nrow = N, ncol = d + 1))
A2 <- rbind(diag(1, nrow = N, ncol = N), diag(1, nrow = N, ncol = N))
A <- -cbind(A1, A2)
dir <- rep("<=", 2 * N)
linear_prog <- lp("min", c(c_T, -c_T), cbind(A, -A), const.dir = dir, const.rhs = b_T)
w <- linear_prog$solution[1:(d + 1 + N)] - linear_prog$solution[(d + 1 + N + 1):(2 * (d + 1 + N))]
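The plot below is not echoed; a minimal sketch is given here, reading the decision boundary of the classification solution as the regression fit y = -(w[1] + w[2] x) / w[3].

# Sketch of the plot: the original data with the boundary-derived linear fit.
ggplot(data.frame(x = xn, y = yn), aes(x = x, y = y)) + geom_point() +
  geom_abline(slope = -w[2] / w[3], intercept = -w[1] / w[3])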
[Figure: the data with the resulting linear fit.]
Problem 3.14
We may write that
\begin{align*}
\bar{g}(x) &= \mathbb{E}_D[g^{(D)}(x)] \\
&= \mathbb{E}_D[x^T w_{lin}] \\
&= \mathbb{E}_X[\mathbb{E}_{y|X}[x^T(X^TX)^{-1}X^Ty \mid X]] \\
&= \mathbb{E}_X[x^T(X^TX)^{-1}X^T\underbrace{\mathbb{E}_{y|X}[y \mid X]}_{Xw_f + \mathbb{E}[\epsilon] = Xw_f}] \\
&= \mathbb{E}_X[x^T(X^TX)^{-1}X^TXw_f] \\
&= x^T w_f = f(x).
\end{align*}
Now, we examine the expression $(*)$ more closely; we may write that
$$\text{var}(x) = \mathbb{E}_D[(g^{(D)}(x) - \bar{g}(x))^2] = \mathbb{E}_X[\mathbb{E}_{y|X}[(x^T(X^TX)^{-1}X^T\epsilon)^2 \mid X]] = \mathbb{E}_X[x^T(X^TX)^{-1}X^T\,\mathbb{E}[\epsilon\epsilon^T]\,X(X^TX)^{-1}x],$$
and so we get
$$\text{var}(x) = \sigma^2 x^T \mathbb{E}_X[(X^TX)^{-1}]x.$$
And finally, we have that
$$\text{var} = \mathbb{E}_x[\text{var}(x)] = \sigma^2\,\text{trace}(\Sigma\,\mathbb{E}_X[(X^TX)^{-1}]).$$
As we saw in Problem 3.11 that $(\frac{1}{N}X^TX)^{-1}$ converges in probability to $\Sigma^{-1}$, we may assume that
$$\left(\frac{1}{N}X^TX\right)^{-1} \approx \Sigma^{-1}$$
and so that
$$\text{var} \approx \frac{\sigma^2}{N}\text{trace}(\Sigma\Sigma^{-1}) = \frac{\sigma^2(d+1)}{N}.$$
Problem 3.15
(a) We know that $\text{rank}(X^TX) < d + 1$ as $X^TX$ is not invertible; this means that we can find a $y \neq 0$ such
that $X^TXy = 0$ (as the columns of $X^TX$ are not linearly independent). We may now write that
$$0 = y^T X^T X y = \|Xy\|^2,$$
so $Xy = 0$ with $y \neq 0$, which means that the columns of $X$ are not linearly independent either, i.e. $\text{rank}(X) < d + 1$.
Using the SVD $X = U\Gamma V^T$ and $w_{lin} = V\Gamma^{-1}U^Ty$, we may write that
\begin{align*}
X^TX w_{lin} &= X^TX V\Gamma^{-1}U^Ty \\
&= V\Gamma U^T U\Gamma V^T V\Gamma^{-1}U^Ty \\
&= V\Gamma U^Ty \\
&= X^Ty,
\end{align*}
so $w_{lin}$ is a solution of the normal equations.
As $w$ and $w_{lin}$ are both solutions of the normal equations, we have that
\begin{align*}
X^TX(w - w_{lin}) = X^Ty - X^Ty &= 0 \\
\Rightarrow V\Gamma U^T U\Gamma V^T(w - w_{lin}) &= 0 \\
\Rightarrow V\Gamma^2 V^T(w - w_{lin}) &= 0 \\
\Rightarrow \Gamma^{-2}V^T V\Gamma^2 V^T(w - w_{lin}) &= 0 \\
\Rightarrow V^T(w - w_{lin}) &= 0.
\end{align*}
We may now see that $w_{lin}$ and $\delta = w - w_{lin}$ are orthogonal because
$$w_{lin}^T\delta = w_{lin}^T(w - w_{lin}) = y^TU\Gamma^{-1}\underbrace{V^T(w - w_{lin})}_{=0} = 0.$$
Consequently, $\|w\|^2 = \|w_{lin} + \delta\|^2 = \|w_{lin}\|^2 + \|\delta\|^2 \geq \|w_{lin}\|^2$, so $w_{lin}$ is the solution of the normal equations with minimum norm.
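As a small numerical illustration (not part of the original derivation), we can build a rank-deficient X and check that $w_{lin} = V\Gamma^{+}U^Ty$ satisfies the normal equations; here $\Gamma^{+}$, which zeroes the vanishing singular values, plays the role of $\Gamma^{-1}$ above.

# Toy rank-deficient example: the last column of X duplicates the third one.
set.seed(1)
X <- cbind(1, rnorm(5), rnorm(5))
X <- cbind(X, X[, 3])
y <- rnorm(5)
s <- svd(X)
w_lin <- s$v %*% diag(ifelse(s$d > 1e-10, 1 / s$d, 0)) %*% t(s$u) %*% y
max(abs(t(X) %*% X %*% w_lin - t(X) %*% y))  # numerically zero: w_lin solves the normal equations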
Problem 3.16
We have
$$\text{cost(accept)} = \text{cost(reject)} \Leftrightarrow c_a(1 - g(x)) = c_r\,g(x) \Leftrightarrow g(x) = \frac{c_a}{c_a + c_r};$$
hence the threshold is $\kappa = \frac{c_a}{c_a + c_r}$, and we accept if and only if $g(x) \geq \kappa$.
(c) For the Supermarket, we have $\kappa = 1/11$; this makes sense as, in this case, we want to avoid false rejects,
so with $\kappa$ very small we reject only when $g(x) < \kappa$, which is very close to 0. And for the CIA, we have
$\kappa = 1000/1001$; here we want to avoid false accepts, so with $\kappa$ very close to 1 we accept only when $g(x) \geq \kappa$.
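As a quick sanity check of these two thresholds (with $c_a$ the cost of a false accept and $c_r$ the cost of a false reject, as above):

kappa <- function(ca, cr) ca / (ca + cr)
kappa(1, 10)    # Supermarket: 1 / 11
kappa(1000, 1)  # CIA: 1000 / 1001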
Problem 3.17
(a) With the first-order Taylor expansion around $(0, 0)$, we may write that
\begin{align*}
\hat{E}_1(\Delta u, \Delta v) &= E(0, 0) + \nabla E(0, 0)^T \begin{pmatrix} \Delta u \\ \Delta v \end{pmatrix} \\
&= 3 + \left.\left(e^u + e^{uv}v + 2u - 3v - 3,\; 2e^{2v} + e^{uv}u - 3u + 8v - 5\right)\right|_{(0,0)} \begin{pmatrix} \Delta u \\ \Delta v \end{pmatrix} \\
&= a_u \Delta u + a_v \Delta v + a,
\end{align*}
where $a_u = -2$, $a_v = -3$, and $a = 3$.
(b) We may also write that
\begin{align*}
\hat{E}_1(\Delta u, \Delta v) &= E(0, 0) + \nabla E(0, 0)^T \begin{pmatrix} \Delta u \\ \Delta v \end{pmatrix} \\
&= E(0, 0) + \|\nabla E(0, 0)\| \cdot \|(\Delta u, \Delta v)\| \cos(\angle(\nabla E(0, 0), (\Delta u, \Delta v))) \\
&\geq E(0, 0) - \|\nabla E(0, 0)\| \cdot \|(\Delta u, \Delta v)\|,
\end{align*}
with equality when $(\Delta u, \Delta v)$ points in the direction of $-\nabla E(0, 0)$. Thus the optimal vector $(\Delta u^*, \Delta v^*)^T$ such that $\|(\Delta u, \Delta v)\| = 0.5$ is given by
$$\begin{pmatrix} \Delta u^* \\ \Delta v^* \end{pmatrix} = -\frac{\nabla E(0, 0)}{\|\nabla E(0, 0)\|}\,\|(\Delta u, \Delta v)\| = \frac{1}{2\sqrt{13}}\begin{pmatrix} 2 \\ 3 \end{pmatrix}.$$
(c) With the second-order Taylor expansion, we may write that
$$E(u + \Delta u, v + \Delta v) = E(u, v) + \nabla E(u, v)^T\begin{pmatrix} \Delta u \\ \Delta v \end{pmatrix} + \frac{1}{2}(\Delta u, \Delta v)\,\nabla^2 E(u, v)\begin{pmatrix} \Delta u \\ \Delta v \end{pmatrix} + O(\|(\Delta u, \Delta v)\|^3),$$
so that
\begin{align*}
\hat{E}_3(\Delta u, \Delta v) &= E(0, 0) + \nabla E(0, 0)^T\begin{pmatrix} \Delta u \\ \Delta v \end{pmatrix} + \frac{1}{2}(\Delta u, \Delta v)\,\nabla^2 E(0, 0)\begin{pmatrix} \Delta u \\ \Delta v \end{pmatrix} \\
&= 3 + (-2, -3)\begin{pmatrix} \Delta u \\ \Delta v \end{pmatrix} + \frac{1}{2}(\Delta u, \Delta v)\left.\begin{pmatrix} e^u + e^{uv}v^2 + 2 & e^{uv}(uv + 1) - 3 \\ e^{uv}(uv + 1) - 3 & 4e^{2v} + e^{uv}u^2 + 8 \end{pmatrix}\right|_{(0,0)}\begin{pmatrix} \Delta u \\ \Delta v \end{pmatrix} \\
&= 3 - 2\Delta u - 3\Delta v + \frac{1}{2}(\Delta u, \Delta v)\begin{pmatrix} 3 & -2 \\ -2 & 12 \end{pmatrix}\begin{pmatrix} \Delta u \\ \Delta v \end{pmatrix} \\
&= b_{uu}(\Delta u)^2 + b_{vv}(\Delta v)^2 + b_{uv}\Delta u\Delta v + b_u\Delta u + b_v\Delta v + b,
\end{align*}
where $b_{uu} = 3/2$, $b_{vv} = 6$, $b_{uv} = -2$, $b_u = -2$, $b_v = -3$, and $b = 3$.
This quadratic approximation is minimized (setting its gradient with respect to $(\Delta u, \Delta v)$ to zero) if
$$\begin{pmatrix} \Delta u \\ \Delta v \end{pmatrix} = -(\nabla^2 E(0, 0))^{-1}\nabla E(0, 0).$$
However, for the vector above to be a minimum, the matrix $\nabla^2 E(0, 0)$ must be positive semidefinite and
invertible; in our case we have
$$\det(\nabla^2 E(0, 0)) = 32 \neq 0,$$
which means that it is actually invertible, and
$$(u, v)\begin{pmatrix} 3 & -2 \\ -2 & 12 \end{pmatrix}\begin{pmatrix} u \\ v \end{pmatrix} = 2u^2 + 8v^2 + (u - 2v)^2 \geq 0$$
for any $(u, v)$, which means that it is actually positive semidefinite. Thus the optimal vector $(\Delta u^*, \Delta v^*)^T$ is
given by
$$\begin{pmatrix} \Delta u^* \\ \Delta v^* \end{pmatrix} = -(\nabla^2 E(0, 0))^{-1}\nabla E(0, 0) = \frac{1}{32}\begin{pmatrix} 30 \\ 13 \end{pmatrix}.$$
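The gradient, Hessian, and the two optimal steps above can be checked numerically; a small sketch follows, where the formulas are the partial derivatives written out in the expansions above.

E_grad <- function(u, v) {
  c(exp(u) + exp(u * v) * v + 2 * u - 3 * v - 3,
    2 * exp(2 * v) + exp(u * v) * u - 3 * u + 8 * v - 5)
}
E_hess <- function(u, v) {
  matrix(c(exp(u) + exp(u * v) * v^2 + 2, exp(u * v) * (u * v + 1) - 3,
           exp(u * v) * (u * v + 1) - 3, 4 * exp(2 * v) + exp(u * v) * u^2 + 8),
         nrow = 2, byrow = TRUE)
}
E_grad(0, 0)                                       # (-2, -3)
-0.5 * E_grad(0, 0) / sqrt(sum(E_grad(0, 0)^2))    # first-order step: (2, 3) / (2 * sqrt(13))
-solve(E_hess(0, 0), E_grad(0, 0))                 # second-order (Newton) step: (30, 13) / 32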
Problem 3.18
(a) Since $\mathcal{H}_\Phi$ is the perceptron in $\mathcal{Z}$, we know that its VC dimension is equal to $\tilde{d} + 1 = 6$, so $d_{VC}(\mathcal{H}_\Phi) \leq 6$
(the $\leq$ is because some points $z \in \mathcal{Z}$ may not be valid transforms of any $x \in \mathcal{X}$, so some dichotomies may not
be realizable).
(b)
(c) By the same reasoning as in point (a), we get that $d_{VC}(\mathcal{H}_{\Phi_k}) \leq \tilde{d} + 1$. Moreover, as $\tilde{d}$ is the number of
non-constant terms in a polynomial transform of degree $k$ in $d$ variables, it is well known that
$$\tilde{d} = \binom{k + d}{d} - 1.$$
\begin{align*}
\tilde{h}(\tilde{\Phi}_2(x)) &= \tilde{w}_0 + \tilde{w}_1 x_1 + \tilde{w}_2 x_2 + \tilde{w}_3(x_1 + x_2) + \tilde{w}_4(x_1 - x_2) + \tilde{w}_5 x_1^2 + \tilde{w}_6 x_1 x_2 + \tilde{w}_7 x_2 x_1 + \tilde{w}_8 x_2^2 \\
&= \tilde{w}_0 + (\tilde{w}_1 + \tilde{w}_3 + \tilde{w}_4)x_1 + (\tilde{w}_2 + \tilde{w}_3 - \tilde{w}_4)x_2 + \tilde{w}_5 x_1^2 + (\tilde{w}_6 + \tilde{w}_7)x_1 x_2 + \tilde{w}_8 x_2^2 \in \mathcal{H}_{\Phi_2}.
\end{align*}
Problem 3.19
(a) The first problem with this feature transform is that $\mathcal{Z}$ has a very large dimension ($N$), which we know
to be computationally expensive. The second problem is that it maps every point not in the data set to the
origin; it is not unusual for a transformation to be non-injective, but here it is much worse, since all new points share the same image and cannot be distinguished in $\mathcal{Z}$.
(b) This transformation may be a good idea; it is in fact a well-known transformation called a radial basis
function.
(c) This transformation is also another type of radial basis function.