
Optimization in Machine Learning Universität Mannheim

HWS 2024 Prof. Simon Weißmann, Felix Benning

Solution Sheet 2
For the exercise class on 03.10.2024 at 12:00.
Hand in your solutions by 10:15 in the lecture on Tuesday, 01.10.2024.

Exercise 1 (Descent Directions of a Maximum). (1 Point)


Let x∗ ∈ Rd be a strict local maximum of f : Rd → R. Prove that every d ∈ Rd \ {0} is a descent direction
of f in x∗.

Solution. Let ε > 0 be such that x∗ is a strict maximum on Bε(x∗) \ {x∗}; such an ε exists by the strict
local maximum property. Now every direction d ≠ 0 is a descent direction, because with ᾱ := ε/‖d‖ > 0 we have for all α ∈ (0, ᾱ]

f(x∗ + αd) < f(x∗),

since x∗ + αd ∈ Bε(x∗) \ {x∗} as 0 < ‖αd‖ ≤ ᾱ‖d‖ = ε.
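
A quick numerical sanity check (not part of the required solution) of this claim for the concrete strict maximum f(x) = −‖x‖² at x∗ = 0, assuming NumPy is available:

```python
import numpy as np

# f(x) = -||x||^2 has a strict (global) maximum at x_star = 0
f = lambda x: -np.dot(x, x)
x_star = np.zeros(3)

rng = np.random.default_rng(0)
for _ in range(5):
    d = rng.normal(size=3)                  # an arbitrary nonzero direction
    alpha_bar = 0.1 / np.linalg.norm(d)     # alpha_bar = eps / ||d|| with eps = 0.1
    for alpha in np.linspace(1e-6, alpha_bar, 50):
        # every small step along d strictly decreases f: d is a descent direction
        assert f(x_star + alpha * d) < f(x_star)
```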

Exercise 2 (Convergence to Stationary Point). (5 Points)


Let f : Rd → R be a continuously differentiable function.

(i) Let (xk)k∈N be defined by gradient descent

xk+1 = xk − αk ∇f(xk),    x0 ∈ Rd

with diminishing step size αk > 0 such that ∑_{k=1}^{∞} αk = ∞. Suppose that (xk)k∈N converges
to some x∗ ∈ Rd. Prove that x∗ is a stationary point of f, i.e. ∇f(x∗) = 0. (2.5 pts)

Solution. For any ε > 0 there exists n ≥ 0 such that for all i, j ≥ n we have ‖∇f(xi) − ∇f(x∗)‖ ≤ ε and
‖∇f(xj) − ∇f(x∗)‖ ≤ ε (hence also ‖∇f(xi)‖ ≤ ‖∇f(x∗)‖ + ε), since ∇f(xk) → ∇f(x∗) by continuity of ∇f.
By Cauchy-Schwarz this gives

⟨∇f(xi), ∇f(xj)⟩ = ‖∇f(x∗)‖² + ⟨∇f(xi) − ∇f(x∗), ∇f(x∗)⟩ + ⟨∇f(xi), ∇f(xj) − ∇f(x∗)⟩
    ≥ ‖∇f(x∗)‖² − ‖∇f(xi) − ∇f(x∗)‖ ‖∇f(x∗)‖ − ‖∇f(xi)‖ ‖∇f(xj) − ∇f(x∗)‖
    ≥ ‖∇f(x∗)‖² − ε‖∇f(x∗)‖ − (‖∇f(x∗)‖ + ε)ε
    = ‖∇f(x∗)‖² − 2ε‖∇f(x∗)‖ − ε² =: p(ε).

This results in, for m > n,

‖xn − xm‖² = ‖∑_{k=n}^{m−1} αk ∇f(xk)‖² = ∑_{i,j=n}^{m−1} αi αj ⟨∇f(xi), ∇f(xj)⟩ ≥ (∑_{k=n}^{m−1} αk)² p(ε).

Taking the limit over m results in

∞ > ‖xn − x∗‖² ≥ (∑_{k=n}^{∞} αk)² p(ε),   where ∑_{k=n}^{∞} αk = ∞.

So we necessarily need p(ε) ≤ 0. But as ε > 0 was arbitrary, we have

0 ≤ ‖∇f(x∗)‖² = lim_{ε→0} p(ε) ≤ 0,

so ∇f(x∗) = 0.
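
As a small illustration of (i) (not part of the graded solution), consider gradient descent with the diminishing step sizes αk = 1/(k+1), which satisfy ∑ αk = ∞, on the quadratic f(x) = (1/2) xT Ax; the matrix and names below are ad hoc choices for this sketch:

```python
import numpy as np

# f(x) = 0.5 * x^T A x has its only stationary point at x = 0
A = np.diag([3.0, 1.0])
grad_f = lambda x: A @ x

x = np.array([2.0, -1.5])
for k in range(1, 200001):
    alpha_k = 1.0 / (k + 1)        # diminishing step sizes with sum alpha_k = infinity
    x = x - alpha_k * grad_f(x)

# the iterates converge and the gradient at the (numerical) limit vanishes
print(x, np.linalg.norm(grad_f(x)))
```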

(ii) Assume that f is also L-smooth. Prove that for xn generated by gradient descent with constant step
size α ∈ (0, 2/L) we have

∑_{k=n}^{m−1} ‖∇f(xk)‖² ≤ (f(xn) − f(xm)) / (α(1 − (L/2)α)) ≤ (f(xn) − min_x f(x)) / (α(1 − (L/2)α))

for any n < m ∈ N. Deduce for the case min_x f(x) > −∞ that we have

min_{k≤n} ‖∇f(xk)‖² ∈ o(1/n). (2.5 pts)

Solution. By L-smoothness of f and xk+1 − xk = −α∇f(xk), we have

f(xk+1) ≤ f(xk) + ⟨∇f(xk), xk+1 − xk⟩ + (L/2)‖xk+1 − xk‖²
       = f(xk) − (α − (L/2)α²)‖∇f(xk)‖²

and therefore

∑_{k=n}^{m−1} ‖∇f(xk)‖² ≤ ∑_{k=n}^{m−1} (f(xk) − f(xk+1)) / (α(1 − (L/2)α)) = (f(xn) − f(xm)) / (α(1 − (L/2)α)),

where the last equality is the telescoping sum; the second claimed inequality follows from f(xm) ≥ min_x f(x).

Now an := min_{k≤n} ‖∇f(xk)‖² is non-increasing, and since ak ≤ ‖∇f(xk)‖², the bound above together with
min_x f(x) > −∞ shows ∑_{k=0}^{∞} ak < ∞. Therefore

n · a2n ≤ ∑_{k=n}^{2n} ak ≤ ∑_{k=n}^{∞} ak → 0   (n → ∞).

Thus a2n ∈ o(1/n). And we can simply bound the odd elements of the sequence by

a2n+1 ≤ a2n ∈ o(1/n).
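
The bound from (ii) can also be checked numerically. The sketch below runs gradient descent with constant step size α = 1/L on an L-smooth quadratic and verifies min_{k≤n} ‖∇f(xk)‖² ≤ (f(x0) − min_x f(x)) / (n α(1 − (L/2)α)); the concrete matrix and names are assumptions made for this illustration:

```python
import numpy as np

# f(x) = 0.5 * x^T A x is L-smooth with L = lambda_max(A) and min_x f(x) = 0
A = np.diag([10.0, 1.0, 0.1])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

L = 10.0
alpha = 1.0 / L                          # constant step size in (0, 2/L)
denom = alpha * (1.0 - L / 2.0 * alpha)

x = np.array([5.0, -3.0, 2.0])
f0 = f(x)
best_sq_grad = np.inf
for n in range(1, 1001):
    best_sq_grad = min(best_sq_grad, np.linalg.norm(grad_f(x)) ** 2)
    # the minimum over the first n gradients is at most their average
    assert best_sq_grad <= f0 / (n * denom) + 1e-12
    x = x - alpha * grad_f(x)
```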

Exercise 3 (Optimizing Quadratic Functions). (9 Points)


In this exercise we consider functions of the form

f (x) = xT Ax + bT x + c,

where x ∈ Rd , A ∈ Rd×d , b ∈ Rd , c ∈ R.

(i) Let H := AT + A be invertible. Prove that f can be written in the forms

f (x) = (x − x∗ )T A(x − x∗ ) + c̃ (1)


= (1/2) (x − x∗)T (AT + A)(x − x∗) + c̃ = (1/2) (x − x∗)T H (x − x∗) + c̃ (2)

for some x∗ ∈ Rd and c̃ ∈ R. Argue that H is always symmetric. Under which circumstances
is x∗ a minimum? (3 pts)

Solution. We want to find x∗ such that

f(x) = (x − x∗)T A(x − x∗) + c̃ = xT Ax − xT Ax∗ − x∗T Ax + x∗T Ax∗ + c̃,

where the linear part −xT Ax∗ − x∗T Ax = −x∗T (A + AT)x should equal bT x and the constant part
x∗T Ax∗ + c̃ should equal c. So we simply select

x∗ := −(A + AT)−1 b   and   c̃ := c − x∗T Ax∗.

This proves our first representation (1). For (2) we simply note that by symmetry of the inner product

yT Ay = ⟨y, Ay⟩ = ⟨Ay, y⟩ = yT AT y,

so yT Ay = (1/2) yT (A + AT) y. Applying this to y = x − x∗ in (1) we are done.


Symmetry of H follows directly from its definition, as Hij = Aji + Aij = Hji.
Now x∗ is a minimum if and only if H is positive definite. If H is positive definite, then x∗ is a minimum
by ∇2 f(x) = H and the lecture. If H is not positive definite, we show that for every m < c̃ = f(x∗) there
exists some x with f(x) = m, so x∗ is not a minimum (indeed f is unbounded from below). Since H is
symmetric, invertible and not positive definite, it has a negative eigenvalue; let y be a corresponding
eigenvector, so that

yT Ay = (1/2) yT Hy =: −ε < 0.

Define x = x∗ + y √((c̃ − m)/ε). Then by (1)

f(x) = ((c̃ − m)/ε) yT Ay + c̃ = −(c̃ − m) + c̃ = m.
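
A brief numerical check of representation (1) and the formulas for x∗ and c̃ (not part of the graded solution; the random test data below is an ad hoc assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.normal(size=(d, d))            # a general, not necessarily symmetric matrix
b = rng.normal(size=d)
c = 0.7

H = A.T + A                            # assumed invertible for this sketch
x_star = -np.linalg.solve(H, b)        # x* = -(A + A^T)^{-1} b
c_tilde = c - x_star @ A @ x_star      # c~ = c - x*^T A x*

f = lambda x: x @ A @ x + b @ x + c
g = lambda x: (x - x_star) @ A @ (x - x_star) + c_tilde   # representation (1)

for _ in range(5):
    x = rng.normal(size=d)
    assert np.isclose(f(x), g(x))      # both representations agree
```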

(ii) Argue that the Newton Method (with step size αn = 1) applied to f would jump to x∗ in one
step and then stop moving. (1 pt)

Solution. Taking the derivative of (2) we get

∇f (x) = H(x − x∗ ). (3)

So with ∇2 f(x) = H we get

x − [∇2 f(x)]−1 ∇f(x) = x − H−1 H(x − x∗) = x∗,

i.e. the Newton method with step size 1 reaches x∗ in a single step from any starting point x. By (3) we
also get ∇f(x∗) = 0, so the Newton step at x∗ vanishes and the method does not move afterwards.
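
A minimal sketch of (ii) in code (assuming NumPy; the random data is an ad hoc choice): one Newton step with step size 1 lands exactly on x∗, and the step taken at x∗ is zero.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
A = rng.normal(size=(d, d))
b = rng.normal(size=d)
H = A.T + A                                  # Hessian of f, assumed invertible

grad_f = lambda x: H @ x + b                 # gradient of f(x) = x^T A x + b^T x + c
x_star = -np.linalg.solve(H, b)

x0 = rng.normal(size=d)
x1 = x0 - np.linalg.solve(H, grad_f(x0))     # one Newton step with step size 1
assert np.allclose(x1, x_star)               # jumps straight to x*
assert np.allclose(grad_f(x1), 0.0)          # the next Newton step is zero
```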

(iii) Let V = (v1 , . . . , vd ) be an orthonormal basis such that

H = V diag[λ1 , . . . , λd ]V T

with 0 < λ1 ≤ · · · ≤ λd and write


y(i) := ⟨y, vi⟩.

Express (xn − x∗)(i) in terms of (x0 − x∗)(i), where xn is given by the gradient descent recursion

xn+1 = xn − h ∇f(xn).

For which step size h do all the components (xn − x∗)(i) converge to zero? Which component
has the slowest convergence speed? Find the optimal learning rate h∗ and deduce for this
learning rate

‖xn − x∗‖ ≤ (1 − 2/(1 + κ))^n ‖x0 − x∗‖

with the condition number κ = λd/λ1. (5 pts)

Solution. Using the representation (3) of the gradient again and subtracting x∗ from our recursion, we get

xn+1 − x∗ = xn − x∗ − hH(xn − x∗) = [I − hH](xn − x∗).

Therefore

(xn+1 − x∗)(i) = ⟨[I − hH](xn − x∗), vi⟩
             = ⟨xn − x∗, [I − hH]vi⟩          (H symmetric)
             = ⟨xn − x∗, (1 − hλi)vi⟩         (vi eigenvector of H)
             = (1 − hλi)(xn − x∗)(i)
             = (1 − hλi)^{n+1} (x0 − x∗)(i)   (by induction).

For all components to converge we need |1 − hλi| < 1 for all i. Since 1 − hλi < 1 always holds because
h, λi > 0, we only need 1 − hλi > −1, i.e. 2/h > λi for all i. Since the eigenvalues are sorted, this is
equivalent to 2/h > λd, i.e.

h < 2/λd.

Under this condition, all components converge. The component with the slowest convergence is determined by

max_i |1 − hλi| = max_i max{1 − hλi, −(1 − hλi)} = max{1 − hλ1, −(1 − hλd)},

since 1 − hλi ≤ 1 − hλ1 and −(1 − hλi) ≤ −(1 − hλd). To minimize this worst-case contraction factor we
take its derivative in h; the only kink is at

1 − hλ1 = −(1 − hλd)  ⟺  2 = h(λ1 + λd),

so

(d/dh) max{1 − hλ1, −(1 − hλd)} = −λ1 for h ≤ 2/(λ1 + λd)   and   = λd for h ≥ 2/(λ1 + λd).

So the fastest worst-case convergence is achieved by h∗ = 2/(λ1 + λd) with

r(h∗) := max_i |1 − h∗λi| = 1 − 2λ1/(λ1 + λd) = 1 − 2/(1 + κ),

where κ = λd/λ1 is the condition number. Putting things together, we have

‖xn − x∗‖² = ‖∑_{i=1}^{d} (1 − h∗λi)^n (x0 − x∗)(i) vi‖²
           = ∑_{i=1}^{d} (1 − h∗λi)^{2n} ((x0 − x∗)(i))²     (orthonormality)
           ≤ r(h∗)^{2n} ∑_{i=1}^{d} ((x0 − x∗)(i))² = r(h∗)^{2n} ‖x0 − x∗‖²

and therefore

‖xn − x∗‖ ≤ r(h∗)^n ‖x0 − x∗‖.
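
Finally, a short numerical check of the convergence bound in (iii) (again an illustration under ad hoc assumptions, here H built from prescribed eigenvalues so that it is symmetric positive definite):

```python
import numpy as np

rng = np.random.default_rng(3)
lam = np.array([0.5, 1.0, 3.0, 10.0])          # eigenvalues 0 < lam_1 <= ... <= lam_d
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
H = Q @ np.diag(lam) @ Q.T                      # H = V diag(lam) V^T, symmetric positive definite
b = rng.normal(size=4)

x_star = -np.linalg.solve(H, b)                 # minimizer, so grad f(x) = H(x - x*)
grad_f = lambda x: H @ x + b

kappa = lam[-1] / lam[0]                        # condition number lam_d / lam_1
h_opt = 2.0 / (lam[0] + lam[-1])                # optimal step size h* = 2 / (lam_1 + lam_d)
rate = 1.0 - 2.0 / (1.0 + kappa)                # contraction factor r(h*)

x = rng.normal(size=4)
err0 = np.linalg.norm(x - x_star)
for n in range(1, 101):
    x = x - h_opt * grad_f(x)
    # convergence bound from the exercise: ||x_n - x*|| <= r(h*)^n ||x_0 - x*||
    assert np.linalg.norm(x - x_star) <= rate ** n * err0 + 1e-12
```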

Exercise 4 (Programming exercise). (9 Points)


For the Python exercises join the GitHub classroom https://classroom.github.com/a/8yrTMIm1.
If you are new to git, check out https://classroom.github.com/a/dEzm_HGt.
