Solution Sheet 2
For the exercise class on 03.10.2024 at 12:00.
Hand in your solutions by 10:15 in the lecture on Tuesday 01.10.2024.
Solution. Let ε > 0 be such that x∗ is a strict maximum in B_ε(x∗) \ {x∗}, where the existence of such an ε is exactly the strict local maximum property. Every direction d ≠ 0 is then a descent direction, because with ᾱ = ε/‖d‖ > 0 we have for all α ∈ (0, ᾱ] that x∗ + αd ∈ B_ε(x∗) \ {x∗} and hence f(x∗ + αd) < f(x∗).
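This can be sanity-checked numerically. A minimal Python sketch using the illustrative toy function f(x) = −‖x‖², which has a strict (local and global) maximum at x∗ = 0; the sampled directions and step sizes are arbitrary choices:

```python
import numpy as np

# Toy function with a strict maximum at x_* = 0.
f = lambda x: -np.dot(x, x)

rng = np.random.default_rng(0)
x_star = np.zeros(3)

for _ in range(5):
    d = rng.normal(size=3)                        # arbitrary nonzero direction
    for alpha in [1e-1, 1e-2, 1e-3]:              # small positive step sizes
        # every direction decreases f: descent direction at a strict maximum
        assert f(x_star + alpha * d) < f(x_star)
print("all sampled directions are descent directions at the strict maximum")
```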
Consider the gradient descent iteration

x_{k+1} = x_k − α_k ∇f(x_k),   x_0 ∈ ℝ^d.
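For reference, a minimal NumPy sketch of this iteration; the quadratic objective and the constant step size below are illustrative assumptions, not part of the exercise:

```python
import numpy as np

def gradient_descent(grad_f, x0, step_size, n_steps):
    """Run x_{k+1} = x_k - alpha_k * grad_f(x_k) and return all iterates."""
    xs = [np.asarray(x0, dtype=float)]
    for k in range(n_steps):
        xs.append(xs[-1] - step_size(k) * grad_f(xs[-1]))
    return xs

# Illustrative example: f(x) = 0.5 * ||x||^2, so grad f(x) = x and x_* = 0.
grad_f = lambda x: x
iterates = gradient_descent(grad_f, x0=np.ones(2), step_size=lambda k: 0.5, n_steps=50)
print(np.linalg.norm(iterates[-1]))   # essentially 0
```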
Solution. Assume, towards a contradiction, that ∇f(x∗) ≠ 0. For any ε > 0 there exists n ≥ 0 such that for all i, j ≥ n we have, by Cauchy–Schwarz and convergence of ∇f(x_i) to ∇f(x∗) due to continuity of ∇f,

⟨∇f(x_i), ∇f(x_j)⟩ ≥ ‖∇f(x∗)‖² − ε =: p(ε),

which is positive for ε small enough.
This results in, for m > n,

‖x_n − x_m‖² = ‖ Σ_{k=n}^{m−1} α_k ∇f(x_k) ‖² = Σ_{i,j=n}^{m−1} α_i α_j ⟨∇f(x_i), ∇f(x_j)⟩ ≥ ( Σ_{k=n}^{m−1} α_k )² p(ε).
Taking the limit over m and using x_m → x∗ results in

∞ > ‖x_n − x∗‖² ≥ ( Σ_{k=n}^{∞} α_k )² p(ε) = ∞,

since Σ_{k=n}^{∞} α_k = ∞ by assumption. This is a contradiction, so ∇f(x∗) = 0.
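The mechanism behind this contradiction can be seen on a function whose gradient never becomes small: if Σ_k α_k = ∞, the iterates must travel an unbounded distance and therefore cannot converge. A minimal sketch with the hypothetical linear objective f(x) = gᵀx (constant gradient g ≠ 0) and α_k = 1/(k+1):

```python
import numpy as np

g = np.array([1.0, 2.0])           # constant, nonzero gradient of f(x) = g^T x
alpha = lambda k: 1.0 / (k + 1)    # step sizes with sum_k alpha_k = infinity

x = np.zeros(2)
n = 100_000
for k in range(n):
    x = x - alpha(k) * g

# ||x_n - x_0|| = ||g|| * sum_{k<n} alpha_k grows like log(n): no convergence.
print(np.linalg.norm(x), np.linalg.norm(g) * np.log(n))
```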
(ii) Assume that f is also L-smooth. Prove that for x_n generated by gradient descent with constant step size α ∈ (0, 2/L) we have

Σ_{k=n}^{m−1} ‖∇f(x_k)‖² ≤ (f(x_n) − f(x_m)) / (α(1 − Lα/2)) ≤ (f(x_n) − min_x f(x)) / (α(1 − Lα/2))

for any n, m ∈ ℕ with n < m. Deduce for the case min_x f(x) > −∞ that min_{k≤n} ‖∇f(x_k)‖² ∈ o(1/n).
Solution. By L-smoothness we have the descent estimate

f(x_{k+1}) ≤ f(x_k) − α(1 − Lα/2) ‖∇f(x_k)‖²,

i.e. ‖∇f(x_k)‖² ≤ (f(x_k) − f(x_{k+1})) / (α(1 − Lα/2)), and therefore

Σ_{k=n}^{m−1} ‖∇f(x_k)‖² ≤ Σ_{k=n}^{m−1} (f(x_k) − f(x_{k+1})) / (α(1 − Lα/2)) = (f(x_n) − f(x_m)) / (α(1 − Lα/2)),

where the last equality is the telescoping sum. The second inequality of the claim follows from f(x_m) ≥ min_x f(x).
In particular, if min_x f(x) > −∞, the bound is uniform in m, so the series Σ_{k=0}^{∞} ‖∇f(x_k)‖² converges. Writing a_n := min_{k≤n} ‖∇f(x_k)‖², we get

n · a_{2n} ≤ Σ_{k=n+1}^{2n} ‖∇f(x_k)‖² → 0   (n → ∞)

as the tail of a convergent series. Thus a_{2n} ∈ o(1/n). And we can simply bound the odd elements of the sequence by a_{2n+1} ≤ a_{2n}, so a_n ∈ o(1/n) overall.
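Both the summability bound and the o(1/n) decay can be checked numerically. A sketch on the illustrative quadratic f(x) = ½(x_1² + 10 x_2²), which is L-smooth with L = 10, run with the admissible step size α = 1/L:

```python
import numpy as np

lam = np.array([1.0, 10.0])             # f(x) = 0.5 * sum_i lam_i * x_i^2
L = lam.max()                           # smoothness constant
alpha = 1.0 / L                         # constant step size in (0, 2/L)
grad = lambda x: lam * x
f = lambda x: 0.5 * np.dot(lam, x * x)

x = np.array([5.0, -3.0])
f0 = f(x)
grad_sq = []
for _ in range(2000):
    grad_sq.append(float(np.dot(grad(x), grad(x))))
    x = x - alpha * grad(x)

# Summed squared gradient norms vs. the bound (f(x_0) - min f) / (alpha * (1 - L*alpha/2)).
bound = (f0 - 0.0) / (alpha * (1 - L * alpha / 2))
print(sum(grad_sq), "<=", bound)

# a_n = min_{k<=n} ||grad f(x_k)||^2 decays faster than 1/n: n * a_{2n} -> 0.
a = np.minimum.accumulate(grad_sq)
for n in [10, 100, 500]:
    print(n, n * a[2 * n])
```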
Consider the quadratic function

f(x) = xᵀAx + bᵀx + c,

where x ∈ ℝ^d, A ∈ ℝ^{d×d}, b ∈ ℝ^d, c ∈ ℝ.
(i) Let H := Aᵀ + A be invertible. Prove that f can be written in the forms

f(x) = (x − x∗)ᵀ A (x − x∗) + c̃   (1)

f(x) = ½ (x − x∗)ᵀ H (x − x∗) + c̃   (2)

for some x∗ ∈ ℝ^d and c̃ ∈ ℝ. Argue that H is always symmetric. Under which circumstances is x∗ a minimum? (3 pts)
Solution. Write y := x − x∗ for an arbitrary x∗ ∈ ℝ^d. Expanding yields

yᵀAy + c̃ = xᵀAx − (Hx∗)ᵀx + x∗ᵀAx∗ + c̃.

So we simply select x∗ := −H⁻¹b (possible since H is invertible) and c̃ := c − x∗ᵀAx∗ = f(x∗), so that the linear and constant terms match those of f. This proves our first representation (1). For (2) we simply note
yᵀAy = ⟨y, Ay⟩ = ⟨Ay, y⟩ = yᵀAᵀy

by symmetry of the inner product, and hence yᵀAy = ½ yᵀ(A + Aᵀ)y = ½ yᵀHy, which proves (2). Moreover Hᵀ = (Aᵀ + A)ᵀ = A + Aᵀ = H, so H is always symmetric.
Finally, x∗ is a minimum if and only if H is positive definite: in that case f(x) = ½ yᵀHy + c̃ ≥ c̃ = f(x∗) for all x. If instead H has an eigenvector v with eigenvalue −ε < 0, then for any m < c̃ the choice y = √(2(c̃ − m)/ε) · v gives

f(x) = ½ yᵀHy + c̃ = m,

so f is unbounded below and x∗ is not a minimum. From (2) we also obtain the gradient representation

∇f(x) = H(x − x∗).   (3)
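A quick numerical verification of the two representations on random data; A, b, c below are arbitrary, and x∗ = −H⁻¹b, c̃ = f(x∗) are the choices made in the solution above:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.normal(size=(d, d))
b = rng.normal(size=d)
c = rng.normal()

H = A.T + A                          # assumed invertible (true almost surely here)
f = lambda x: x @ A @ x + b @ x + c

x_star = -np.linalg.solve(H, b)      # selected so that the linear term vanishes
c_tilde = f(x_star)

x = rng.normal(size=d)
y = x - x_star
print(f(x))                          # original form
print(y @ A @ y + c_tilde)           # representation (1)
print(0.5 * y @ H @ y + c_tilde)     # representation (2)
# all three values agree up to floating-point error
```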
(ii) Argue that the Newton Method (with step size α_n = 1) applied to f would jump to x∗ in one
step and then stop moving. (1 pt)
Solution. Since ∇f(x) = H(x − x∗) by (3) and ∇²f(x) = H, a Newton step from any x_0 gives

x_1 = x_0 − (∇²f(x_0))⁻¹ ∇f(x_0) = x_0 − H⁻¹H(x_0 − x∗) = x∗,

so we have that the Newton Method finds x∗ in one step. By (3) we also get ∇f(x∗) = 0, which stops the Newton Method afterwards.
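This one-step behaviour is easy to confirm numerically; the random A and b below are illustrative, and the gradient is computed as ∇f(x) = Hx + b:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
A = rng.normal(size=(d, d))
b = rng.normal(size=d)

H = A.T + A                              # Hessian of f; assumed invertible
grad = lambda x: H @ x + b               # gradient of f(x) = x^T A x + b^T x + c
x_star = -np.linalg.solve(H, b)

x0 = rng.normal(size=d)
x1 = x0 - np.linalg.solve(H, grad(x0))   # one Newton step with step size 1
print(np.allclose(x1, x_star))           # True: jumps to x_* in one step
print(np.allclose(grad(x1), 0.0))        # True: gradient vanishes, iteration stops
```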
(iii) Let V = (v_1, . . . , v_d) be an orthonormal basis such that

H = V diag[λ_1, . . . , λ_d] Vᵀ

with eigenvalues 0 < λ_1 ≤ · · · ≤ λ_d, and consider gradient descent x_{n+1} = x_n − h ∇f(x_n) with constant learning rate h > 0.
For which step size h do all the components (x_n − x∗)^(i) converge to zero? Which component
has the slowest convergence speed? Find the optimal learning rate h∗ and deduce for this
learning rate
‖x_n − x∗‖ ≤ (1 − 2/(1+κ))ⁿ ‖x_0 − x∗‖

with the condition number κ = λ_d/λ_1. (5 pts)
Solution. Using the representation (3) of the gradient again and subtracting x∗ from our recursion, we get

x_{n+1} − x∗ = x_n − x∗ − h H(x_n − x∗) = (I − hH)(x_n − x∗).

Therefore, writing (x_n − x∗)^(i) := ⟨v_i, x_n − x∗⟩ for the components in the eigenbasis,

(x_{n+1} − x∗)^(i) = (1 − hλ_i)(x_n − x∗)^(i)   and thus   (x_n − x∗)^(i) = (1 − hλ_i)ⁿ (x_0 − x∗)^(i).
For all components to converge we need |1 − hλ_i| < 1 for all i. Since 1 − hλ_i < 1 always holds because h, λ_i > 0, we only need 1 − hλ_i > −1, i.e. 2/h > λ_i, for all i. Since the eigenvalues are sorted, this is equivalent to 2/h > λ_d, or

h < 2/λ_d.
Under this condition, all components converge. The component with the slowest convergence
is given by
max_i |1 − hλ_i| = max_i max{ 1 − hλ_i, −(1 − hλ_i) } = max{ 1 − hλ_1, −(1 − hλ_d) },

using 1 − hλ_i ≤ 1 − hλ_1 and −(1 − hλ_i) ≤ −(1 − hλ_d).
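The convergence threshold and the worst-case contraction factor can be tabulated directly; the eigenvalues λ = (1, 4) below are an arbitrary example:

```python
import numpy as np

lam = np.array([1.0, 4.0])          # illustrative sorted eigenvalues lambda_1 <= lambda_d
lam1, lamd = lam[0], lam[-1]

def rate(h):
    """Worst-case contraction factor max_i |1 - h*lambda_i| of the error recursion."""
    return float(np.max(np.abs(1.0 - h * lam)))

for h in [0.1, 0.3, 2.0 / (lam1 + lamd), 0.45, 2.0 / lamd, 0.6]:
    print(f"h = {h:.3f}   rate = {rate(h):.3f}   converges: {rate(h) < 1}")
# the rate is below 1 exactly for h < 2/lambda_d and smallest at h = 2/(lambda_1 + lambda_d)
```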
To minimize the slowest convergence, we want to take the derivative. The discontinuity of the derivative is at h = 2/(λ_1 + λ_d), where 1 − hλ_1 = −(1 − hλ_d), and we have

d/dh max_i |1 − hλ_i| = −λ_1 for h ≤ 2/(λ_1 + λ_d)   and   = λ_d for h ≥ 2/(λ_1 + λ_d).

So the maximum is decreasing before the discontinuity and increasing after it, and the optimal learning rate is h∗ = 2/(λ_1 + λ_d) with

r(h∗) := max_i |1 − h∗λ_i| = 1 − 2λ_1/(λ_1 + λ_d) = 1 − 2/(1 + κ),
where κ = λ_d/λ_1 is the condition number. Putting things together, for h = h∗ we have
‖x_n − x∗‖² = ‖ Σ_{i=1}^{d} (1 − h∗λ_i)ⁿ (x_0 − x∗)^(i) v_i ‖²
            = Σ_{i=1}^{d} (1 − h∗λ_i)^{2n} ((x_0 − x∗)^(i))²        (orthonormality)
            ≤ r(h∗)^{2n} Σ_{i=1}^{d} ((x_0 − x∗)^(i))² = r(h∗)^{2n} ‖x_0 − x∗‖²
and therefore
‖x_n − x∗‖ ≤ r(h∗)ⁿ ‖x_0 − x∗‖ = (1 − 2/(1+κ))ⁿ ‖x_0 − x∗‖.
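The whole chain can be verified numerically: run gradient descent with h∗ = 2/(λ_1 + λ_d) on an illustrative quadratic with prescribed Hessian eigenvalues and compare the error with the bound (1 − 2/(1+κ))ⁿ ‖x_0 − x∗‖; the eigenvalues, x∗ and the starting point below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
lam = np.array([1.0, 3.0, 10.0])                 # eigenvalues 0 < lambda_1 <= ... <= lambda_d
V, _ = np.linalg.qr(rng.normal(size=(3, 3)))     # random orthonormal basis
H = V @ np.diag(lam) @ V.T
x_star = rng.normal(size=3)
grad = lambda x: H @ (x - x_star)                # representation (3) of the gradient

kappa = lam[-1] / lam[0]                         # condition number
h_opt = 2.0 / (lam[0] + lam[-1])                 # optimal learning rate h_*
r = 1.0 - 2.0 / (1.0 + kappa)                    # contraction factor r(h_*)

x = rng.normal(size=3)
err0 = np.linalg.norm(x - x_star)
for n in range(1, 51):
    x = x - h_opt * grad(x)
    assert np.linalg.norm(x - x_star) <= r**n * err0 + 1e-12
print("the bound ||x_n - x_*|| <= (1 - 2/(1+kappa))^n ||x_0 - x_*|| holds for n = 1..50")
```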