
Optimization in Machine Learning Universität Mannheim

HWS 2024 Prof. Simon Weißmann, Felix Benning

Solution Sheet 2
For the exercise class on 03.10.2024 at 12:00.
Hand in your solutions by 10:15 in the lecture on Tuesday, 01.10.2024.

Exercise 1 (Descent Directions of a Maximum). (1 Point)


Let x∗ ∈ Rd be a strict local maximum of f : Rd → R. Prove that every d ∈ Rd \ {0} is a descent direction
of f in x∗.

Solution. Let ε > 0 be such that x∗ is a strict maximum on Bε(x∗) \ {x∗}; such an ε exists by the strict
local maximum property. Now every direction d ≠ 0 is a descent direction, because with ᾱ := ε/‖d‖ > 0 we have for all α ∈ (0, ᾱ]

f(x∗ + αd) < f(x∗),

since x∗ + αd ∈ Bε(x∗) \ {x∗} as 0 < ‖αd‖ ≤ ᾱ‖d‖ = ε.
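
A quick numerical sanity check (not part of the required solution) of this claim for the concrete strict maximum f(x) = −‖x‖² at x∗ = 0, assuming NumPy is available:

```python
import numpy as np

# f(x) = -||x||^2 has a strict (global) maximum at x_star = 0
f = lambda x: -np.dot(x, x)
x_star = np.zeros(3)

rng = np.random.default_rng(0)
for _ in range(5):
    d = rng.normal(size=3)                  # an arbitrary nonzero direction
    alpha_bar = 0.1 / np.linalg.norm(d)     # alpha_bar = eps / ||d|| with eps = 0.1
    for alpha in np.linspace(1e-6, alpha_bar, 50):
        # every small step along d strictly decreases f: d is a descent direction
        assert f(x_star + alpha * d) < f(x_star)
```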

Exercise 2 (Convergence to Stationary Point). (5 Points)


Let f : Rd → R be a continuously differentiable function.

(i) Let (xk)k∈N be defined by gradient descent

xk+1 = xk − αk ∇f(xk),    x0 ∈ Rd

with diminishing step size αk > 0 such that ∑_{k=1}^{∞} αk = ∞. Suppose that (xk)k∈N converges
to some x∗ ∈ Rd. Prove that x∗ is a stationary point of f, i.e. ∇f(x∗) = 0. (2.5 pts)

Solution. For any ε > 0 there exists n ≥ 0 such that for all i, j ≥ n we have ‖∇f(xi) − ∇f(x∗)‖ ≤ ε and
‖∇f(xj) − ∇f(x∗)‖ ≤ ε (hence also ‖∇f(xi)‖ ≤ ‖∇f(x∗)‖ + ε), since ∇f(xk) → ∇f(x∗) by continuity of ∇f.
By Cauchy-Schwarz this gives

⟨∇f(xi), ∇f(xj)⟩ = ‖∇f(x∗)‖² + ⟨∇f(xi) − ∇f(x∗), ∇f(x∗)⟩ + ⟨∇f(xi), ∇f(xj) − ∇f(x∗)⟩
    ≥ ‖∇f(x∗)‖² − ‖∇f(xi) − ∇f(x∗)‖ ‖∇f(x∗)‖ − ‖∇f(xi)‖ ‖∇f(xj) − ∇f(x∗)‖
    ≥ ‖∇f(x∗)‖² − ε‖∇f(x∗)‖ − (‖∇f(x∗)‖ + ε)ε
    = ‖∇f(x∗)‖² − 2ε‖∇f(x∗)‖ − ε² =: p(ε).

This results in, for m > n,

‖xn − xm‖² = ‖∑_{k=n}^{m−1} αk ∇f(xk)‖² = ∑_{i,j=n}^{m−1} αi αj ⟨∇f(xi), ∇f(xj)⟩ ≥ (∑_{k=n}^{m−1} αk)² p(ε).

Taking the limit over m results in

∞ > ‖xn − x∗‖² ≥ (∑_{k=n}^{∞} αk)² p(ε),   where ∑_{k=n}^{∞} αk = ∞.

So we necessarily need p(ε) ≤ 0. But as ε > 0 was arbitrary, we have

0 ≤ ‖∇f(x∗)‖² = lim_{ε→0} p(ε) ≤ 0,

so ∇f(x∗) = 0.
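
As a small illustration of (i) (not part of the graded solution), consider gradient descent with the diminishing step sizes αk = 1/(k+1), which satisfy ∑ αk = ∞, on the quadratic f(x) = (1/2) xT Ax; the matrix and names below are ad hoc choices for this sketch:

```python
import numpy as np

# f(x) = 0.5 * x^T A x has its only stationary point at x = 0
A = np.diag([3.0, 1.0])
grad_f = lambda x: A @ x

x = np.array([2.0, -1.5])
for k in range(1, 200001):
    alpha_k = 1.0 / (k + 1)        # diminishing step sizes with sum alpha_k = infinity
    x = x - alpha_k * grad_f(x)

# the iterates converge and the gradient at the (numerical) limit vanishes
print(x, np.linalg.norm(grad_f(x)))
```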

(ii) Assume that f is also L-smooth. Prove that for xn generated by gradient descent with constant step
size α ∈ (0, 2/L) we have

∑_{k=n}^{m−1} ‖∇f(xk)‖² ≤ (f(xn) − f(xm)) / (α(1 − (L/2)α)) ≤ (f(xn) − min_x f(x)) / (α(1 − (L/2)α))

for any n < m ∈ N. Deduce for the case min_x f(x) > −∞ that we have

min_{k≤n} ‖∇f(xk)‖² ∈ o(1/n). (2.5 pts)

Solution. By L-smoothness of f and xk+1 − xk = −α∇f(xk), we have

f(xk+1) ≤ f(xk) + ⟨∇f(xk), xk+1 − xk⟩ + (L/2)‖xk+1 − xk‖²
       = f(xk) − (α − (L/2)α²)‖∇f(xk)‖²

and therefore

∑_{k=n}^{m−1} ‖∇f(xk)‖² ≤ ∑_{k=n}^{m−1} (f(xk) − f(xk+1)) / (α(1 − (L/2)α)) = (f(xn) − f(xm)) / (α(1 − (L/2)α)),

where the last equality is the telescoping sum; the second claimed inequality follows from f(xm) ≥ min_x f(x).

Now an := min_{k≤n} ‖∇f(xk)‖² is non-increasing, and since ak ≤ ‖∇f(xk)‖², the bound above together with
min_x f(x) > −∞ shows ∑_{k=0}^{∞} ak < ∞. Therefore

n · a2n ≤ ∑_{k=n}^{2n} ak ≤ ∑_{k=n}^{∞} ak → 0   (n → ∞).

Thus a2n ∈ o(1/n). And we can simply bound the odd elements of the sequence by

a2n+1 ≤ a2n ∈ o(1/n).
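
The bound from (ii) can also be checked numerically. The sketch below runs gradient descent with constant step size α = 1/L on an L-smooth quadratic and verifies min_{k≤n} ‖∇f(xk)‖² ≤ (f(x0) − min_x f(x)) / (n α(1 − (L/2)α)); the concrete matrix and names are assumptions made for this illustration:

```python
import numpy as np

# f(x) = 0.5 * x^T A x is L-smooth with L = lambda_max(A) and min_x f(x) = 0
A = np.diag([10.0, 1.0, 0.1])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

L = 10.0
alpha = 1.0 / L                          # constant step size in (0, 2/L)
denom = alpha * (1.0 - L / 2.0 * alpha)

x = np.array([5.0, -3.0, 2.0])
f0 = f(x)
best_sq_grad = np.inf
for n in range(1, 1001):
    best_sq_grad = min(best_sq_grad, np.linalg.norm(grad_f(x)) ** 2)
    # the minimum over the first n gradients is at most their average
    assert best_sq_grad <= f0 / (n * denom) + 1e-12
    x = x - alpha * grad_f(x)
```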

Exercise 3 (Optimizing Quadratic Functions). (9 Points)


In this exercise we consider functions of the form

f (x) = xT Ax + bT x + c,

where x ∈ Rd , A ∈ Rd×d , b ∈ Rd , c ∈ R.

(i) Let H := AT + A be invertible. Prove that f can be written in the forms

f (x) = (x − x∗ )T A(x − x∗ ) + c̃ (1)


= (1/2) (x − x∗)T (AT + A)(x − x∗) + c̃ = (1/2) (x − x∗)T H (x − x∗) + c̃ (2)

for some x∗ ∈ Rd and c̃ ∈ R. Argue that H is always symmetric. Under which circumstances
is x∗ a minimum? (3 pts)

Solution. We want to find x∗ such that

f(x) = (x − x∗)T A(x − x∗) + c̃ = xT Ax − xT Ax∗ − x∗T Ax + x∗T Ax∗ + c̃,

where the linear part −xT Ax∗ − x∗T Ax = −x∗T (A + AT)x should equal bT x and the constant part
x∗T Ax∗ + c̃ should equal c. So we simply select

x∗ := −(A + AT)−1 b   and   c̃ := c − x∗T Ax∗.

This proves our first representation (1). For (2) we simply note that by symmetry of the inner product

yT Ay = ⟨y, Ay⟩ = ⟨Ay, y⟩ = yT AT y,

so yT Ay = (1/2) yT (A + AT) y. Applying this to y = x − x∗ in (1) we are done.


Symmetry of H follows directly from its definition, as Hij = Aji + Aij = Hji.
Now x∗ is a minimum if and only if H is positive definite. If H is positive definite, then x∗ is a minimum
by ∇2 f(x) = H and the lecture. If H is not positive definite, we show that for every m < c̃ = f(x∗) there
exists some x with f(x) = m, so x∗ is not a minimum (indeed f is unbounded from below). Since H is
symmetric, invertible and not positive definite, it has a negative eigenvalue; let y be a corresponding
eigenvector, so that

yT Ay = (1/2) yT Hy =: −ε < 0.

Define x = x∗ + y √((c̃ − m)/ε). Then by (1)

f(x) = ((c̃ − m)/ε) yT Ay + c̃ = −(c̃ − m) + c̃ = m.
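
A brief numerical check of representation (1) and the formulas for x∗ and c̃ (not part of the graded solution; the random test data below is an ad hoc assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.normal(size=(d, d))            # a general, not necessarily symmetric matrix
b = rng.normal(size=d)
c = 0.7

H = A.T + A                            # assumed invertible for this sketch
x_star = -np.linalg.solve(H, b)        # x* = -(A + A^T)^{-1} b
c_tilde = c - x_star @ A @ x_star      # c~ = c - x*^T A x*

f = lambda x: x @ A @ x + b @ x + c
g = lambda x: (x - x_star) @ A @ (x - x_star) + c_tilde   # representation (1)

for _ in range(5):
    x = rng.normal(size=d)
    assert np.isclose(f(x), g(x))      # both representations agree
```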

(ii) Argue that the Newton Method (with step size αn = 1) applied to f would jump to x∗ in one
step and then stop moving. (1 pt)

Solution. Taking the derivative of (2) we get

∇f (x) = H(x − x∗ ). (3)

So with ∇2 f(x) = H we get

x − [∇2 f(x)]−1 ∇f(x) = x − H−1 H(x − x∗) = x∗,

i.e. the Newton method with step size 1 reaches x∗ in a single step from any starting point x. By (3) we
also get ∇f(x∗) = 0, so the Newton step at x∗ vanishes and the method does not move afterwards.
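
A minimal sketch of (ii) in code (assuming NumPy; the random data is an ad hoc choice): one Newton step with step size 1 lands exactly on x∗, and the step taken at x∗ is zero.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
A = rng.normal(size=(d, d))
b = rng.normal(size=d)
H = A.T + A                                  # Hessian of f, assumed invertible

grad_f = lambda x: H @ x + b                 # gradient of f(x) = x^T A x + b^T x + c
x_star = -np.linalg.solve(H, b)

x0 = rng.normal(size=d)
x1 = x0 - np.linalg.solve(H, grad_f(x0))     # one Newton step with step size 1
assert np.allclose(x1, x_star)               # jumps straight to x*
assert np.allclose(grad_f(x1), 0.0)          # the next Newton step is zero
```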

(iii) Let V = (v1 , . . . , vd ) be an orthonormal basis such that

H = V diag[λ1 , . . . , λd ]V T

with 0 < λ1 ≤ · · · ≤ λd and write


y(i) := ⟨y, vi⟩.

Express (xn − x∗)(i) in terms of (x0 − x∗)(i), where xn is given by the gradient descent recursion

xn+1 = xn − h ∇f(xn).

For which step size h do all the components (xn − x∗)(i) converge to zero? Which component
has the slowest convergence speed? Find the optimal learning rate h∗ and deduce for this
learning rate

‖xn − x∗‖ ≤ (1 − 2/(1 + κ))^n ‖x0 − x∗‖

with the condition number κ = λd/λ1. (5 pts)

Solution. Using the representation (3) of the gradient again and subtracting x∗ from our recursion, we get

xn+1 − x∗ = xn − x∗ − hH(xn − x∗) = [I − hH](xn − x∗).

Therefore

(xn+1 − x∗)(i) = ⟨[I − hH](xn − x∗), vi⟩
             = ⟨xn − x∗, [I − hH]vi⟩          (H symmetric)
             = ⟨xn − x∗, (1 − hλi)vi⟩         (vi eigenvector of H)
             = (1 − hλi)(xn − x∗)(i)
             = (1 − hλi)^{n+1} (x0 − x∗)(i)   (by induction).

For all components to converge we need |1 − hλi| < 1 for all i. Since 1 − hλi < 1 always holds because
h, λi > 0, we only need 1 − hλi > −1, i.e. 2/h > λi for all i. Since the eigenvalues are sorted, this is
equivalent to 2/h > λd, i.e.

h < 2/λd.

Under this condition, all components converge. The component with the slowest convergence is determined by

max_i |1 − hλi| = max_i max{1 − hλi, −(1 − hλi)} = max{1 − hλ1, −(1 − hλd)},

since 1 − hλi ≤ 1 − hλ1 and −(1 − hλi) ≤ −(1 − hλd). To minimize this worst-case contraction factor we
take its derivative in h; the only kink is at

1 − hλ1 = −(1 − hλd)  ⟺  2 = h(λ1 + λd),

so

(d/dh) max{1 − hλ1, −(1 − hλd)} = −λ1 for h ≤ 2/(λ1 + λd)   and   = λd for h ≥ 2/(λ1 + λd).

So the fastest worst-case convergence is achieved by h∗ = 2/(λ1 + λd) with

r(h∗) := max_i |1 − h∗λi| = 1 − 2λ1/(λ1 + λd) = 1 − 2/(1 + κ),

where κ = λd/λ1 is the condition number. Putting things together, we have

‖xn − x∗‖² = ‖∑_{i=1}^{d} (1 − h∗λi)^n (x0 − x∗)(i) vi‖²
           = ∑_{i=1}^{d} (1 − h∗λi)^{2n} ((x0 − x∗)(i))²     (orthonormality)
           ≤ r(h∗)^{2n} ∑_{i=1}^{d} ((x0 − x∗)(i))² = r(h∗)^{2n} ‖x0 − x∗‖²

and therefore

‖xn − x∗‖ ≤ r(h∗)^n ‖x0 − x∗‖.
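
Finally, a short numerical check of the convergence bound in (iii) (again an illustration under ad hoc assumptions, here H built from prescribed eigenvalues so that it is symmetric positive definite):

```python
import numpy as np

rng = np.random.default_rng(3)
lam = np.array([0.5, 1.0, 3.0, 10.0])          # eigenvalues 0 < lam_1 <= ... <= lam_d
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
H = Q @ np.diag(lam) @ Q.T                      # H = V diag(lam) V^T, symmetric positive definite
b = rng.normal(size=4)

x_star = -np.linalg.solve(H, b)                 # minimizer, so grad f(x) = H(x - x*)
grad_f = lambda x: H @ x + b

kappa = lam[-1] / lam[0]                        # condition number lam_d / lam_1
h_opt = 2.0 / (lam[0] + lam[-1])                # optimal step size h* = 2 / (lam_1 + lam_d)
rate = 1.0 - 2.0 / (1.0 + kappa)                # contraction factor r(h*)

x = rng.normal(size=4)
err0 = np.linalg.norm(x - x_star)
for n in range(1, 101):
    x = x - h_opt * grad_f(x)
    # convergence bound from the exercise: ||x_n - x*|| <= r(h*)^n ||x_0 - x*||
    assert np.linalg.norm(x - x_star) <= rate ** n * err0 + 1e-12
```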

Exercise 4 (Programming exercise). (9 Points)


For the Python exercises join the GitHub classroom https://classroom.github.com/a/8yrTMIm1.
If you are new to git, check out https://classroom.github.com/a/dEzm_HGt.
