
The role of Mathematics and Optimization in

Machine Learning/Deep Learning

Tuyen Trung Truong

Department of Mathematics, University of Oslo


Supported by the Research Council of Norway

University of Technology, Ho Chi Minh City, August 2024

Short bio

High school: Luong Van Chanh, Phu Yen.
Undergraduate degree: Mathematics department, University of Science, Ho Chi Minh City.
PhD degree: Mathematics department, Indiana University, USA.
Postdoc positions: Syracuse University (USA), Korea Institute for Advanced Study (South Korea), and University of Adelaide (Australia).
Work: Professor (since 2023), Associate Professor (2017-2023), Mathematics department, University of Oslo, Norway.

Short bio (cont. 2)
Some research in Optimization and Machine Learning/Deep Learning:
Papers:
Tuyen Trung Truong and Hang-Tuan Nguyen, Backtracking gradient descent method and some applications in large scale optimisation, Part 2: Algorithms and experiments, Applied Mathematics and Optimization, 84 (2021), no. 3, 2557-2586.
Tuyen Trung Truong and Hang-Tuan Nguyen, Backtracking gradient descent method and some applications in large scale optimisation, Part 1: Theory, Minimax Theory and its Applications, 7 (2022), no. 1, 79-108.
Maged Helmy, Tuyen Trung Truong, Anastasiya Dykyy, Paulo Ferreira and Eric Jul, CapillaryNet: an automated system to quantify skin capillary density and red blood cell velocity from handheld vital microscopy, Artificial Intelligence in Medicine, Volume 127, May 2022, 102287.
Short bio (cont. 3)
Maged Helmy, Tuyen Trung Truong, Eric Jul and Paulo Ferreira, Deep learning and computer vision techniques for microcirculation analysis: A review, Patterns, Volume 4, Issue 1, 2023, 100641.
Tuyen Trung Truong, Tat Dat To, Hang-Tuan Nguyen, Thu Hang Nguyen, Hoang Phuong Nguyen and Maged Helmy, A fast and simple modification of Newton's method avoiding saddle points, Journal of Optimization Theory and Applications, published online 18 July 2023.
John Erik Fornæss, Mi Hu, Tuyen Trung Truong and Takayuki Watanabe, Backtracking New Q-Newton's method, Newton's flow, Voronoi's diagram and Stochastic root finding, arXiv:2401.01393. Accepted in Complex Analysis and Operator Theory, special issue for the 70th birthday of Steve Krantz.
Tuyen Trung Truong, Backtracking New Q-Newton's method: a good algorithm for optimization and solving systems of equations, arXiv:2209.05378.
Short bio (cont. 4)
John Erik Fornæss, Mi Hu, Tuyen Trung Truong and Takayuki Watanabe, Backtracking New Q-Newton's method, Schröder's theorem, and Linear conjugacy, arXiv:2312.12126.
Thuan Quang Tran and Tuyen Trung Truong, The Riemann hypothesis and dynamics of Backtracking New Q-Newton's method, arXiv:2405.05834.
Some ongoing work:
Develop an app for Optimization methods. We will run some experiments on this when discussing Stochastic Optimization.
Develop an app using AI to create good mathematical questions for elementary school, supported by the University of Oslo's Growth House.
Collaborate with Torus AI on a project on AI for Spirometry.
Develop an app using AI to help high school students solve mathematical exam questions.
Table of contents

§1 Some motivating applications, some initial ideas
§2 Specific fields in Mathematics and Optimization, with explicit examples of their roles in Machine Learning/Deep Learning
§3 Looking forward - Some open questions
§4 Some references:
(Universal approximation theorem) Allan Pinkus, Approximation theory of the MLP model in neural networks, Acta Numerica (1999), pp. 143–195, https://pinkus.net.technion.ac.il/files/2021/02/acta.pdf
(Online book) Dive into Deep Learning, https://d2l.ai
(Article on the role of formal logic in computers and neural networks) https://nautil.us/the-man-who-tried-to-redeem-the-world-with-logic-235253/

What is Deep Learning?

Slogan: Deep Learning is the ability to run old algorithms with more sophisticated models (= more parameters), utilising new advances in theoretical knowledge and hardware!

§1 Some motivating applications

Image from the website: https://www.zendesk.com/blog/machine-learning-and-deep-learning/
§1.1: PageRank

PageRank is the first search algorithm used by Google.
https://en.wikipedia.org/wiki/PageRank
Original paper by Brin and Page: http://infolab.stanford.edu/pub/papers/google.pdf (see Section 2.1.1)
The explanation of the damping factor d is in the section Damping factor of the Wikipedia page.
The matrix is given in the Wikipedia page in the same section Damping factor, where the vector R appears.
The iterative computation is described in the section Iterative of the Wikipedia page.
The algorithm is all Linear Algebra. A crucial component is to find the principal eigenvector of a matrix. Indeed, a related algorithm was proposed by Edmund Landau in 1895. Recall the slogan!
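As a minimal illustration of the Linear Algebra behind PageRank, the sketch below runs the power iteration R <- d*M*R + (1-d)/N on a tiny hypothetical 3-page link matrix; the matrix and the damping factor d = 0.85 are made-up values for illustration, not Google's.

import numpy as np

# Hypothetical link structure: M[i, j] = 1/outdegree(j) if page j links to page i.
M = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
d = 0.85                      # damping factor
N = M.shape[0]
R = np.full(N, 1.0 / N)       # initial uniform rank vector

for _ in range(100):          # iterate R <- d*M*R + (1-d)/N until it stabilises
    R = d * M @ R + (1.0 - d) / N

print(R / R.sum())            # approximate PageRank scores (direction of the principal eigenvector)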

§1.2: Least squares problem
Classical problem in statistics.
https://en.wikipedia.org/wiki/Least_squares
Have a dataset (x_i, y_i), i = 1, ..., N.
Guess that y_i is approximately F(x_i, β), where β are parameters to be determined.
Decide β by solving the minimization problem: min_β (1/N) ∑_{i=1}^N (y_i − F(x_i, β))^2.
Example: Linear regression F(x, β) = ax + b, where β = (a, b).
Example: Fit to an ellipse: F(x, β) = ±sqrt((1 − ax^2)/b), where β = (a, b) and the ellipse is ax^2 + by^2 = 1.
If F(x, β) comes from a Deep Neural Network, then this is an example of Deep Learning!
The difference is that in the 1960s and earlier, it was hard to solve this unless F(x, β) was simple (e.g. Linear Regression). Nowadays, one can solve it for very complicated F(x, β).
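A minimal sketch of the linear-regression case, assuming synthetic data and numpy: the average squared error is minimized directly by a least squares solver.

import numpy as np

# Synthetic dataset (x_i, y_i): points near the line y = 2x + 1 with noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2 * x + 1 + 0.1 * rng.standard_normal(50)

# Linear regression F(x, beta) = a*x + b. Minimizing the average squared error
# over (a, b) is a least squares problem, solved here by np.linalg.lstsq.
A = np.column_stack([x, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b = coef
print(a, b)   # should be close to 2 and 1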
§1.3: Classification of images
It used to be impossible for a computer to recognise whether there is a cat in a picture or not.
Meanwhile, this is a piece of cake for human children (2 years old or more).
No rigorous theory can do this! Even one published in the Annals of Mathematics.
A specific problem: A picture contains either a dog, a cat or a mouse, and only one of them. Can we design an algorithm for computers to detect which one?
How to proceed?
The first step is to collect a large dataset of the form (x_i, y_i), i = 1, ..., N, where N is "big enough", x_i is an image, and y_i is one of the labels "dog", "cat", "mouse".
A convenient way is to use a "one hot vector": (1, 0, 0) means "dog", (0, 1, 0) means "cat", (0, 0, 1) means "mouse".
§1.3: Classification of images (cont. 2)
Second step: What type of values can F(x, β) take?
A nontrivial question: should F(x, β) just take values in the discrete (indeed, finite) set {(1, 0, 0), (0, 1, 0), (0, 0, 1)}?
Why nontrivial? Since y_i takes values in this discrete set, shouldn't F(x, β) as well?
What is not good about that? Then F(x, β) cannot be a continuous function, and then it is very difficult to work with!
A new, better idea? Indeed, the idea that F(x, β) just takes values in the discrete set is not good.
You can imagine that some pictures look a bit like a dog, but also a bit like a cat, and a bit like a mouse.
A better way is to have F(x, β) = (a, b, c), where 0 ≤ a, b, c and a + b + c = 1.
Here a is the probability that x is a dog, b the probability that x is a cat, and c the probability that x is a mouse.
§1.3: Classification of images (cont. 3)

How to achieve these goals? A good choice of approximation function.
More on this later. Some key words: softmax function, ...
https://en.wikipedia.org/wiki/Softmax_function
A softmax function is a real analytic function.
Plus an appropriate cost/loss function. (A good choice of how to measure the difference between your model and reality.)

§2 Specific fields in Mathematics and Optimization, with usage in Machine Learning/Deep Learning

Image from the website: https://didyouhearaboutworksheet.blogspot.com/2021/09/what-are-four-branches-of-mathematics.html
§2.1 Logic

Image from the website: https://www.proprofs.com/quiz-school/story.php?title=logic-math-quiz
§2.1: Logic (cont. 2)
Logic is at the foundation of all mathematics and science.
Hence, it lies at the foundation of all applications of mathematics and science, including Machine Learning/Deep Learning. Obviously!
Indeed, Logic has a more direct and crucial connection to both Artificial Neural Networks and Computer Science.
Some names: Gottfried Leibniz, Bertrand Russell and Alfred Whitehead, Walter Pitts, Warren McCulloch. "Along the way, they would create the first mechanistic theory of the mind, the first computational approach to neuroscience, the logical design of modern computers, and the pillars of artificial intelligence."
"A neuron's signal is a proposition, and neurons seemed to work like logic gates, taking in multiple inputs and producing a single output." https://nautil.us/the-man-who-tried-to-redeem-the-world-with-logic-235253/
Logic is expected to play an important role in creating true AI.
§2.2 Approximation theory

Image from internet, website cannot be accessed.


§2.2: Approximation Theory (cont. 2)
Linear regression: approximate data by a line in dimension 2
(more generally a linear subspace in higher dimensions).
Classical statistics/Classical Machine Learning. Still useful for
certain datasets.

Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.2: Approximation Theory (cont. 2)
Linear regression: approximate data by a line in dimension 2
(more generally a linear subspace in higher dimensions).
Classical statistics/Classical Machine Learning. Still useful for
certain datasets.
More general: Polynomial interpolation. Approximate data by
polynomial functions.

Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.2: Approximation Theory (cont. 2)
Linear regression: approximate data by a line in dimension 2 (more generally, a linear subspace in higher dimensions).
Classical statistics/Classical Machine Learning. Still useful for certain datasets.
More general: Polynomial interpolation. Approximate data by polynomial functions.
However, this is not enough for the general approximation problem. One must go beyond polynomials to be much more useful.
Exponential-type functions have been preferred (like sigmoid, softmax, hyperbolic functions...). More research and experiments show that even piece-wise linear functions (like ReLU, maxpool, ...) can also be effective.
Indeed, probably our brains work better with these piece-wise linear functions than with highly non-linear functions (even working with polynomials can already be difficult in general, if we do not get assistance from computers). This is why Linear Algebra matters.
§2.2: Approximation Theory (cont. 3)
Universal approximation theorem: see Theorem 3.1 in https://pinkus.net.technion.ac.il/files/2021/02/acta.pdf
Fix a function σ which is continuous and not a polynomial. Any continuous function can be approximated on compact sets by functions of the form ∑_{i=1}^N c_i σ(L_i x + θ_i), where the c_i are real numbers, the L_i are matrices and the θ_i are vectors. The number N depends on the error with which we want to approximate.
In terms of Neural Networks, the above theorem means that any function can be approximated by Neural Networks with 1 layer, provided we are allowed to increase N (the breadth).
In practice, the Deep Neural Network paradigm seems to be more efficient: we use more layers, but each layer has a bounded dimension.
The theorem appeared in the 1960s but could not be used to develop Deep Neural Networks since computer power was not enough; Linear Regression was better fitted then.
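A minimal sketch of an approximation of the form ∑_i c_i σ(L_i x + θ_i): hidden weights are drawn at random, σ is the sigmoid, and the outer coefficients c_i are fitted by least squares to approximate cos(x) on an interval. The target function and all numbers are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
N = 50                                   # breadth of the single hidden layer
L = rng.normal(size=(N, 1))              # here each L_i is a 1x1 "matrix"
theta = rng.normal(size=N)

x = np.linspace(-3, 3, 200).reshape(-1, 1)
target = np.cos(x).ravel()

sigma = lambda t: 1.0 / (1.0 + np.exp(-t))      # sigmoid: continuous and not a polynomial
H = sigma(x @ L.T + theta)                      # hidden features sigma(L_i x + theta_i)
c, *_ = np.linalg.lstsq(H, target, rcond=None)  # choose the c_i by least squares

print(np.max(np.abs(H @ c - target)))           # approximation error on the grid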
§2.2: Approximation Theory (cont. 4)
Deep Neural Networks: A function which is a composition of functions of the form σ_k ∘ (L_k x_k + θ_k), 0 ≤ k ≤ m (the bigger m, the deeper the network).
The x_k are variables.
The σ_k are nonlinear functions, and are fixed. For example, the sigmoid function, the ReLU function, maxpool (meaning the max of a finite set of numbers)...
The L_k are matrices and the θ_k are vectors.
The coefficients of L_k and θ_k are not determined beforehand. We need to decide the best coefficients for our specific application.
The idea is to initialise these coefficients randomly, then use an optimization algorithm to minimize an appropriate cost function. Compare with the ideas of Pitts and McCulloch!
These coefficients are called the parameters of the DNN. For example, GPT-4 is said to have 1.7T parameters.
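A toy sketch of such a composition, assuming numpy: three layers with random (untrained) parameters and ReLU activations; the layer sizes are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
relu = lambda t: np.maximum(t, 0.0)

layer_sizes = [4, 8, 8, 3]                       # input dim 4, output dim 3
L = [rng.normal(size=(m, n)) for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
theta = [rng.normal(size=m) for m in layer_sizes[1:]]

def forward(x):
    # Apply sigma_k(L_k x_k + theta_k) layer by layer; x_k is the output of layer k-1.
    for Lk, tk in zip(L, theta):
        x = relu(Lk @ x + tk)
    return x

print(forward(np.array([1.0, -2.0, 0.5, 3.0])))  # output of the (untrained) network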
§2.2: Approximation Theory (cont. 5)

Question: Can you spot a big difference between the Universal Approximation Theorem and the design of a DNN? Hint: what corresponds to the c_j in the Universal Approximation Theorem?
Moral: however good a theory is, when using it in practice you will need to modify it to fit your purpose. Many times this modification requires a very big effort.

§2.3 Calculus

Image from https://www.mindingthecampus.org/2023/03/02/calculus-without-fear/
§2.3 Calculus (cont. 2)

Crucial/useful notations:
First derivatives:
Gradient: ∇f = (∂f/∂x_1, ..., ∂f/∂x_n). Used in checking if a point is a critical point. Used in Taylor series. Used in gradient descent.
Jacobian: generalisation of the Gradient to a tuple of functions. JF = (∇f_1, ..., ∇f_m), where F = (f_1, ..., f_m). Used in Newton's method and the Levenberg-Marquardt method for the Least Squares Problem (including solving systems of equations).
To solve f_1 = ... = f_m = 0, we consider F = f_1^2 + ... + f_m^2. What is the relation between roots of the system and global minima of F?
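A minimal sketch of that last point, with a made-up two-equation system: roots of the system are exactly the points where F = f_1^2 + f_2^2 attains its global minimum value 0.

import numpy as np

def f1(x, y):
    return x**2 + y**2 - 1      # unit circle

def f2(x, y):
    return x - y                # diagonal line

def F(x, y):
    return f1(x, y)**2 + f2(x, y)**2

root = (np.sqrt(0.5), np.sqrt(0.5))    # a root of the system, found by hand
print(F(*root))                         # 0 (up to rounding): a global minimum of F
print(F(1.0, 0.0))                      # strictly positive at a non-root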

§2.3 Calculus (cont. 3)
Second derivatives:
Hessian matrix: ∇²f = (∂²f/∂x_i ∂x_j), i, j = 1, ..., n. Used to classify critical points (local minimum, local maximum, saddle points). Used in Newton's method for optimization. Represents "curvature".
We have ∇²f = J(∇f). (The Hessian matrix is the Jacobian matrix of the gradient.)
One important property: If f is C², then ∇²f is a symmetric square matrix with real coefficients.
Consequence: all eigenvalues of the Hessian matrix are real, and the Hessian matrix is diagonalisable. https://en.wikipedia.org/wiki/Diagonalizable_matrix
Higher derivatives are rarely used, and have not yet produced algorithms more advantageous compared to first and second derivatives.
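A small sketch of classifying a critical point by the eigenvalues of the Hessian, using the illustrative function f(x, y) = x² − y², whose Hessian is the constant symmetric matrix diag(2, −2).

import numpy as np

H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)   # eigvalsh: eigenvalues of a symmetric matrix, all real
if np.all(eigvals > 0):
    print("local minimum")
elif np.all(eigvals < 0):
    print("local maximum")
else:
    print("saddle point", eigvals)   # mixed signs: the origin is a saddle point of f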
§2.4 Probability & Statistics

Image from https://betterexplained.com/articles/a-brief-introduction-to-probability-statistics/
§2.4 Probability & Statistics (cont. 2)

No purely theoretical model can always predict correctly whether there is a cat in a picture or not. (People tried and failed.)
Machine Learning/Deep Learning aims for high precision (e.g. 90% correct).
Tools from Probability and Statistics help.
Some classical Statistics tools like Decision Trees, Random Forests, ... are good for tabular data.
https://en.wikipedia.org/wiki/Decision_tree
https://en.wikipedia.org/wiki/Random_forest
However, in general, we need more complicated and non-linear tools.

§2.4 Probability & Statistics (cont. 3)
Restating the question for a Deep Learning model to classify images:
Construct a Deep Neural Network F(x, α), where x is the input and α are the parameters.
Construct a cost function G(α) = ∑_{x∈I} d(F(x, α), y_x). Here I is the dataset, y_x is the "correct value" for the data point x, and d is a metric (distance function).
Initialize a random value α_0, choose an optimization method and run it to find a good value α̂, and use F(x, α̂) to provide new predictions.
Not enough; we need to test against reality to make sure!
Each time we get a value α_n from the optimization algorithm, we test it on a so-called validation dataset, and choose the one which gives the best match.
This is close to what is actually done. But there are many other modifications, some of which will be covered next.
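A toy version of this recipe, with all choices made up for illustration: F(x, α) is a linear model, d is the squared distance, G is minimized by gradient descent, and the iterate α_n with the best validation loss is kept.

import numpy as np

rng = np.random.default_rng(0)
x_train, x_val = rng.normal(size=80), rng.normal(size=20)
y_train, y_val = 3.0 * x_train + rng.normal(0, 0.1, 80), 3.0 * x_val + rng.normal(0, 0.1, 20)

F = lambda x, alpha: alpha * x
G = lambda alpha, x, y: np.mean((F(x, alpha) - y) ** 2)

alpha, lr = 0.0, 0.1                      # initial value and step size
best_alpha, best_val = alpha, G(alpha, x_val, y_val)
for _ in range(100):
    grad = np.mean(2 * (F(x_train, alpha) - y_train) * x_train)   # dG/dalpha on the training set
    alpha -= lr * grad
    val = G(alpha, x_val, y_val)          # check each iterate on the validation set
    if val < best_val:
        best_alpha, best_val = alpha, val

print(best_alpha, best_val)               # best_alpha should be close to 3.0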
§2.4 Probability & Statistics (cont. 4)

How to deal with probabilities?
For example, classify an image into 3 classes: dog, cat and mouse.
We construct a DNN whose last layer has dimension 3, so we can write F(x, α) = (y_1, y_2, y_3).
Then we use softmax to create probabilities: (e^{y_1}/N, e^{y_2}/N, e^{y_3}/N) with N = e^{y_1} + e^{y_2} + e^{y_3}.
Each of the coordinates above is > 0, and their sum is 1, hence they can represent probabilities.
Whether these numbers correspond to actual probabilities, we don't know. However, this idea works!
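A minimal softmax sketch over the 3 outputs (y_1, y_2, y_3) of the last layer; the logits used are hypothetical.

import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))   # subtracting the max avoids overflow; the result is unchanged
    return e / e.sum()

logits = np.array([2.0, 1.0, -1.0])    # hypothetical outputs for (dog, cat, mouse)
print(softmax(logits))                  # positive numbers summing to 1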

§2.4 Probability & Statistics (cont. 5)

How to choose the metric d(·, ·)?
In general, you need to do experiments with your data, and find a good metric to work with.
There is, on the other hand, research in Statistics/Probability on which metrics can be good. These can be the initial choice of your metric, and then you can modify it to fit your purpose.
Some common loss functions, and what they are good for: https://www.datacamp.com/tutorial/loss-function-in-machine-learning
A related topic is maximum likelihood estimation in Statistics.

§2.4 Probability & Statistics (cont. 6)
We considered the cost function G(α) = ∑_{x∈I} d(F(x, α), y_x), where x runs over the whole dataset I.
However, practice shows that this is not a smart way.
Better is to use "mini-batches".
We divide the dataset I randomly into smaller subsets I_1, ..., I_m. Each I_j is a mini-batch.
We then run our optimization algorithm with changing cost functions G_1 = ∑_{x∈I_1} d(F(x, α), y_x), G_2 = ∑_{x∈I_2} d(F(x, α), y_x), ..., G_m = ∑_{x∈I_m} d(F(x, α), y_x).
When we are done with G_m, we finish an epoch, redivide the dataset I into mini-batches again, and repeat again and again until we are satisfied.
Why is this a better way? Maybe because we are exposed to many more variations in the dataset, which makes the algorithm more robust to new data.
We then do not work with ordinary optimization, but with stochastic optimization.
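A minimal mini-batch training sketch on a made-up linear model: each epoch shuffles the dataset, splits it into mini-batches I_1, ..., I_m, and takes one gradient step per batch.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(0, 0.1, 100)

alpha, lr, batch_size = 0.0, 0.1, 10
for epoch in range(20):
    order = rng.permutation(len(x))                    # random re-division into mini-batches
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x[idx], y[idx]
        grad = np.mean(2 * (alpha * xb - yb) * xb)     # gradient of G_j on this mini-batch
        alpha -= lr * grad

print(alpha)   # close to 3.0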
§2.5 Linear algebra

Image from https://www.codecademy.com/learn/dsml-math-for-machine-learning/modules/math-ds-linear-algebra/cheatsheet
§2.5 Linear Algebra (cont. 2)
Fast multiplication of matrices is important in applications.
E.g. calculating the gradient of the output of a Deep Neural Network is, by the Chain rule in Calculus, reduced to calculating products of matrices.
Computer chips are more powerful when the matrix multiplication incorporated in the chip is faster.
The textbook algorithm for multiplying two matrices of size n has complexity n^3.
The complexity of multiplying two matrices of size n is O(n^θ), where 2 ≤ θ < 3. Conjecturally θ = 2 (we cannot do better; why?).
The one which is fastest in practice, even though not with the best complexity, is Strassen's algorithm.
https://en.wikipedia.org/wiki/Computational_complexity_of_matrix_multiplication
§2.5 Linear Algebra (cont. 3)
Finding the inverse of a matrix, and finding eigenvalue/eigenvector pairs of a matrix, are important in applications.
Strassen proved that finding the inverse of a matrix, finding its determinant, and doing Gaussian elimination (in solving linear equations) have the same complexity O(n^θ).
Finding eigenvalues is a root-finding problem: the eigenvalues are the roots of the characteristic polynomial.
Numerical Linear Algebra: usually, try to use an iterative method. Example: How to find the principal λ_1/e_1 pair of a matrix (Google's algorithm)?
Another idea for fast computation is to compute just a few significant values, and not all of them.
For example, compute the first 10 eigenvalue/eigenvector pairs, where the eigenvalues are the 10 largest. Lanczos algorithm: https://en.wikipedia.org/wiki/Lanczos_algorithm

§2.5 Linear Algebra (cont. 4)

Linear Algebra is intimately related to optimizing a quadratic function ∑_{i,j} a_{i,j} x_i x_j, where a_{i,j} = a_{j,i}.
By finding eigenvalue/eigenvector pairs, one can write it in the form ∑_j λ_j x_j^2.
An important result: https://en.wikipedia.org/wiki/Sylvester%27s_law_of_inertia
Classical machine learning (in particular, unsupervised algorithms) uses optimization of quadratic functions, hence is directly related to Linear Algebra.
E.g. Linear regression, SVM (Support Vector Machines), PCA analysis.
https://en.wikipedia.org/wiki/Support_vector_machine
https://en.wikipedia.org/wiki/Principal_component_analysis
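A small sketch of the diagonalisation step, with a made-up symmetric coefficient matrix: an orthogonal change of variables turns ∑_{i,j} a_{i,j} x_i x_j into ∑_j λ_j z_j².

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
lam, Q = np.linalg.eigh(A)        # A = Q diag(lam) Q^T with Q orthogonal

x = np.array([0.7, -1.2])         # any test point
z = Q.T @ x                       # new coordinates z = Q^T x
print(x @ A @ x, np.sum(lam * z**2))   # the two expressions agree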

§2.5 Linear Algebra (cont. 5)

Linear Algebra is also important in non-classical Machine Learning.
The inverse of the Hessian matrix is used in Newton's method.
Eigenvalue/eigenvector pairs of the Hessian matrix are used in Backtracking New Q-Newton's method (BNQN), a robust variant of Newton's method.
Estimating the size of the eigenvalues of the Hessian matrix helps to analyse the stability of Gradient descent methods.
Some explanations from the papers:
https://arxiv.org/pdf/2209.05378
https://link.springer.com/article/10.1007/s00245-020-09718-8

§2.5 Linear Algebra (cont. 7)
Linear Algebra is also important in LLMs!
Besides matrix multiplication.
A crucial component is Transformers, which rely on Attention (also positional encoding, see later).
https://en.wikipedia.org/wiki/Attention_(machine_learning)
Simplified statement of the attention problem: We are given a sequence of objects Y_1, ..., Y_m, and another object X. How to decide which of the Y_j is closest to X?
Assume that X and each Y_j are represented by a vector in R^N.
One idea is to say that two vectors are most similar if the angle between them is 0 degrees, and most opposite if the angle is 180 degrees.
This is expressed by a simple statement in Linear Algebra: X and Y are most similar when <X, Y> (the inner product) is biggest. One can combine this with the softmax function to get probabilities. Maybe we should normalise by dividing by the norm of Y?
Example: Is X = (1, 2, 3) closer to Y_1 = (1, -1, 1) or Y_2 = (-1, 1, 1)?
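A small sketch of that example using dot-product similarity and softmax weights:

import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([[1.0, -1.0, 1.0],     # Y_1
              [-1.0, 1.0, 1.0]])    # Y_2

scores = Y @ X                      # inner products <X, Y_j>
weights = np.exp(scores) / np.exp(scores).sum()
print(scores, weights)              # <X, Y_1> = 2, <X, Y_2> = 4, so Y_2 gets the larger weight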
§2.6 Optimization

Image from https://medium.com/analytics-vidhya/optimization-acb996a4623c
§2.6 Optimization (cont. 2)
Main problem (unconstrained): Find x* which minimizes a function F(x).
First clue: Any minimum is a critical point: if F(x*) = min_x F(x) then ∇F(x*) = 0.
Symbolic approach (high school approach!): find all critical points, then check the values to find the minima.
This allows us to find exact minima. (But is this needed in reality? The formulation of the Deep Neural Network is not like that.)
This can work for simple problems, where the set of critical points is easy to determine.
In general, one needs numerical methods.
A popular choice is that of iterative methods. One starts with a random choice of x_0, then has a way to create a new point x_1 from x_0, then a new point x_2 from x_1, and so on. We will discuss some of these methods later.
§2.6 Optimization (cont. 3)

What could be most problematic for an optimization algorithm?
(Generalised) Saddle points: A critical point x* for which the Hessian matrix ∇²F(x*) has at least one negative eigenvalue. Roughly speaking, it is a maximum in at least one direction.
Bray and Dean's theorem: journals.aps.org/prl/abstract/10.1103/PhysRevLett.98.150201
When the dimension increases, there are exponentially more saddle points than minima.
A heuristic: Take a diagonal matrix, and put +1 or -1 on the diagonal at random. There is only one configuration which corresponds to a minimum (where all the numbers are +1). The other configurations always have at least one negative number (-1). The number of all possible configurations is 2^dimension.
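A quick simulation of that heuristic (dimension and trial count are arbitrary illustrative choices): only about 1 in 2^dimension random sign configurations has no negative entry.

import numpy as np

rng = np.random.default_rng(0)
dimension, trials = 10, 100_000

signs = rng.choice([-1.0, 1.0], size=(trials, dimension))   # random diagonal "Hessians"
minima = np.all(signs > 0, axis=1).mean()                    # fraction with no negative eigenvalue
print(minima, 1 / 2**dimension)                              # both are about 1/1024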

§2.6 Optimization (cont. 4)
Iterative methods: IM(F, x), depending on the function F and the current point x.
Choose a random point x_0.
Define a sequence by induction: x_{n+1} = IM(F, x_n).
First order methods: involve only first derivatives of F. Like Gradient Descent.
Second order methods: involve second derivatives of F. Like Newton's method.
Some criteria for a good iterative method:
Any cluster point should be a critical point.
The sequence should converge for good enough functions F. (In the literature, one usually requires for example that F is convex, or that ∇F is Lipschitz continuous ⇒ too restrictive, not applicable for interesting cases.)
Any limit point should not be a saddle point. (Refer to the previous slide.)
§2.6 Optimization (cont. 5)
Gradient descent: IM(F, x) = x − ∇F(x).
Avoids saddle points (Good).
Does not have a global convergence guarantee (Bad).
May be slow if it converges (Bad).
Newton's method: IM(F, x) = x − (∇²F(x))^{-1} ∇F(x).
If x_0 is close to a non-degenerate local minimum, then the convergence rate is quadratic (Good).
Has the tendency to converge to the nearest critical point (even if it is a saddle point): Bad. Example: What happens if we apply Newton's method to a quadratic function?
Must be careful when ∇²F(x) is not invertible (i.e. has 0 as an eigenvalue).
Many popular algorithms in Deep Learning are variants of Gradient descent, like Adam. Second order methods (BFGS...) are expensive to use, and have not shown better results yet.
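A small sketch of the quadratic example hinted at above, assuming F(x, y) = x² − y² (whose only critical point, the origin, is a saddle): Newton's method jumps to the saddle in one step, while gradient descent moves away from it in the y direction.

import numpy as np

# F(x, y) = x^2 - y^2: gradient (2x, -2y), constant Hessian diag(2, -2); the origin is a saddle.
grad = lambda p: np.array([2 * p[0], -2 * p[1]])
H_inv = np.diag([0.5, -0.5])          # inverse of the Hessian

p_newton = np.array([1.0, 0.3])
p_newton = p_newton - H_inv @ grad(p_newton)     # one Newton step lands exactly on the saddle
print(p_newton)                                   # [0. 0.]

p_gd = np.array([1.0, 0.3])
for _ in range(50):
    p_gd = p_gd - 0.1 * grad(p_gd)                # gradient descent with step size 0.1
print(p_gd)                                       # x shrinks toward 0, y grows: it escapes the saddle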
§2.6 Optimization (cont. 6)
Some mechanisms to boost convergence: Assume we are at a point x_n and have already chosen a direction w_n to move in. We want to find a good number α_n > 0 and define x_{n+1} = x_n − α_n w_n.
Exact line search: We find α_n which is a minimum of the function α ↦ F(x_n − α w_n).
Trust region: Choose an r_n > 0 and define w_n to minimize min_{||x_n − w|| ≤ r_n} F(w). (Or some associated cost function at step n.)
Armijo's condition: If <w_n, ∇F(x_n)> is strictly positive, then there is α_0 > 0 so that for all 0 ≤ α_n ≤ α_0, we have F(x_n − α_n w_n) − F(x_n) + <α_n w_n, ∇F(x_n)>/3 ≤ 0 (i.e. the function value at the new point x_n − α_n w_n decreases enough).
Backtracking Armijo's algorithm: explain how (see the sketch below).
Backtracking Armijo's algorithm is easy to implement, while working quite efficiently. Exact line search and trust region may be difficult to implement stably. Wolfe's conditions: more complicated but not better.
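A minimal sketch of backtracking line search under Armijo's condition (with the constant 1/3 used above): start from a trial step size and shrink it until the decrease condition holds. The test function, starting point and shrinking factor are illustrative assumptions.

import numpy as np

def backtracking_armijo(F, gradF, x, w, alpha0=1.0, c=1/3, factor=0.5):
    # Shrink alpha until F(x - alpha*w) - F(x) + c*alpha*<w, gradF(x)> <= 0.
    g = gradF(x)
    alpha = alpha0
    while F(x - alpha * w) - F(x) + c * alpha * np.dot(w, g) > 0:
        alpha *= factor
    return alpha

# Example: gradient descent with backtracking on F(x, y) = x^2 + 10 y^2.
F = lambda p: p[0]**2 + 10 * p[1]**2
gradF = lambda p: np.array([2 * p[0], 20 * p[1]])

p = np.array([5.0, 3.0])
for _ in range(30):
    w = gradF(p)                               # descent direction w_n = gradient, so <w_n, grad F> > 0
    alpha = backtracking_armijo(F, gradF, p, w)
    p = p - alpha * w
print(p, F(p))                                  # close to the minimum at the origin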
§2.6 Optimization (cont. 7)
Gradient boosting: This is a different boosting mechanism from the one described on the previous slide. E.g. Gradient Boosted Trees. Here we do not modify the point x, but modify the model g0!
The idea is that, for example, after training we already have a good model g0. But now we want to modify it into a new model g, which may work better.
Setting: Let our data I consist of x1, . . . , xm. At xi let yi be the correct value, and let the metric be d. For any model g, we look at the cost function F(g) = Σ_{i∈I} d(yi, g(xi)). Note that F(g) depends only on the values g(xi), for i ∈ I. Hence, we can define zi = g(xi). Then our cost function is F(z1, . . . , zm). We can apply any optimization method to this, with the special modification that the initial point is fixed: it is (g0(x1), . . . , g0(xm)).
Explain how to do it if we use Gradient Descent (or
Backtracking Gradient Descent = Backtracking Armijo’s
method for Gradient Descent), and d is Euclidean distance.
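A minimal sketch of this recipe with the squared Euclidean distance, where each Gradient Descent step on the values zi is realised by fitting a small regression tree to the current residuals (the toy data, tree depth, number of rounds, learning rate and use of scikit-learn are illustrative assumptions):

import numpy as np
from sklearn.tree import DecisionTreeRegressor  # assumed available

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))                 # toy 1-d regression data
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

pred = np.full(len(y), y.mean())    # values z_i of the initial model g_0 (a constant model)
learning_rate = 0.1                 # plays the role of alpha_n
for _ in range(100):
    residual = y - pred             # negative gradient of (1/2) * sum_i (y_i - z_i)^2 at z = pred
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += learning_rate * tree.predict(X)  # gradient step on (z_1, ..., z_m), realised by the new tree
print("training MSE:", np.mean((y - pred) ** 2))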
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.6 Optimization (cont. 8)
Stochastic optimization: Here we formalise the mini-batch practice in Deep Neural Networks.
Our cost function F depends on an extra variable ξ, so F(x, ξ). The expectation is E(F(x, ξ)) = G(x).
At each step n, we generate a new value ξn according to the given distribution for ξ, and run our algorithm xn+1 = IM(F(·, ξn), xn). We hope xn "converges" to a minimum of G(x). (We will discuss the delicate issue of convergence in stochastic optimization later.)
If F(x, ξ) = (ξ + 1)x² − 2x, where ξ has a normal distribution (with mean 0), then E(ξ) = 0 and hence G(x) = E(F(x, ξ)) = x² − 2x. G(x) attains its minimum value −1 at x = 1. If we use Gradient Descent with learning rate α, then xn+1 = xn − α[2(ξn + 1)xn − 2].
Test with the Optimization link. Observation?
Similarly for F(x, ξ) = (10ξ + 10)x² − 2x. Other functions, like x² − 2(1 + ξ)x?
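A minimal sketch of this stochastic example in Python (the decreasing learning-rate schedule and number of steps are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
x = 0.0
for n in range(1, 5001):
    xi = rng.standard_normal()               # xi_n sampled from the normal distribution
    grad = 2 * (xi + 1) * x - 2              # gradient in x of F(x, xi) = (xi + 1) x^2 - 2 x
    x -= 0.1 / np.sqrt(n) * grad             # stochastic gradient step with decreasing learning rate
print(x)   # typically settles near 1, the minimizer of G(x) = x^2 - 2x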
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.6 Optimization (cont. 9)
Constrained optimization: minimize a function F (x) under
some constraints.
Projected algorithms: Assume that the constraint set S is convex, and that you already have an iterative method xn+1 = IM(F(x), xn). Then the projected version is xn+1 = prS(IM(F(x), xn)), where prS(x) is the projection of the point x onto the set S, i.e. the point in S which is closest to x.
If the set S is complicated, then it may be difficult to check whether it is convex, or to compute the projection explicitly.
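A minimal sketch of a projected algorithm, where IM is a Gradient Descent step and S is the closed unit ball (a convex set with an explicit projection); the cost function and step size are illustrative assumptions:

import numpy as np

def pr_S(x, radius=1.0):
    # Projection onto the convex set S = { ||x|| <= radius }.
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else radius * x / nrm

target = np.array([2.0, 0.0])                 # unconstrained minimizer, lies outside S
gradF = lambda x: 2 * (x - target)            # gradient of F(x) = ||x - target||^2
x = np.zeros(2)
for _ in range(200):
    x = pr_S(x - 0.1 * gradF(x))              # x_{n+1} = pr_S(IM(F, x_n))
print(x)   # approaches (1, 0), the point of S closest to the unconstrained minimizer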
Lagrange multipliers: If the constraints are equalities, then the Lagrange multiplier method reduces to finding critical points of a new function, i.e. to finding roots of a system of equations. https://en.wikipedia.org/wiki/Lagrange_multiplier
Karush-Kuhn-Tucker (KKT) conditions: a generalization of the Lagrange multiplier method, where inequality constraints are allowed.
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.6 Optimization (cont. 10)
These can be useful when the minima lie on the boundary of the domain. On the other hand, the complexity may be high.
Solving the Lagrange multiplier or KKT conditions = finding roots of a system of equations. Explain why!
On the other hand, root finding can be considered as a special case of the Least Squares Problem. Explain why (minimize the sum of squares of the functions in the system)!
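A minimal sketch of that reduction: the Lagrange conditions for minimizing x + y on the circle x² + y² = 1 form a system of equations, whose roots can be searched for by least squares (the starting point and the use of scipy are illustrative assumptions):

import numpy as np
from scipy.optimize import least_squares  # assumed available

def lagrange_system(v):
    x, y, lam = v
    # Critical-point equations of the Lagrangian x + y + lam * (x^2 + y^2 - 1).
    return [1 + 2 * lam * x,
            1 + 2 * lam * y,
            x ** 2 + y ** 2 - 1]

sol = least_squares(lagrange_system, x0=np.array([0.5, 0.5, 1.0]))
print(sol.x)   # one of the two critical points (±1/sqrt(2), ±1/sqrt(2), ∓1/sqrt(2))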
Replacing the cost function by a new cost function (barrier methods): For example, one can add a term to the cost function which becomes infinite at the boundary of S, for example F(x) − log dist(x, ∂S).
Minima of the new function may not have any relation to those of the original function. Some special designs of the new function may help.
Constrained optimization also appears naturally in optimization on submanifolds of an ambient Euclidean space, like on spheres. In that case, Riemannian geometry may be helpful.
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.7 Dynamical Systems - ODE

Image from https://slideplayer.com/slide/17770451/


§2.7 Dynamical Systems - ODE (cont. 2)

Dynamical Systems helps with understanding the long-term behaviour of an iterative method.
Example: The function F(x, y) = x² − y² has a saddle point at (0, 0). If we apply the Gradient Descent method, then for which initial points (x0, y0) will the constructed sequence (xn, yn) converge to (0, 0)? The same question if we apply Newton's method. Explain. Check with the Optimization link.
Remark: One needs to distinguish between a saddle point of a cost function, and a saddle point (indeed, instability, meaning that the Jacobian matrix has an eigenvalue > 1) of an optimization method applied to the cost function! Analyse this with the above example.
A good iterative method is one for which a saddle point of the
cost function becomes an unstable point of the iterative
method applied to the cost function.
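A minimal numerical check for this example (the learning rate and starting point are illustrative assumptions): Gradient Descent escapes the saddle (0, 0) unless the initial point has y0 = 0, while for this quadratic F Newton's method jumps to the saddle from any starting point.

import numpy as np

gradF = lambda p: np.array([2 * p[0], -2 * p[1]])        # gradient of F(x, y) = x^2 - y^2
hessF = np.array([[2.0, 0.0], [0.0, -2.0]])              # Hessian (constant)

p_gd = np.array([1.0, 1e-3])   # generic initial point (nonzero y-component)
p_nt = np.array([1.0, 1e-3])
for _ in range(60):
    p_gd = p_gd - 0.1 * gradF(p_gd)                      # Gradient Descent
    p_nt = p_nt - np.linalg.solve(hessF, gradF(p_nt))    # Newton's method
print("GD:", p_gd)       # x-part shrinks, y-part grows: GD moves away from the saddle
print("Newton:", p_nt)   # (0, 0) already after one step: Newton converges to the saddle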

Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.7 Dynamical Systems - ODE (cont. 3)

The most useful result in Dynamical Systems for Machine Learning is probably the Stable-Center manifold theorem.
Reference: Mike Shub, Global stability of Dynamical Systems, https://www.amazon.com/Global-Stability-Dynamical-Systems-Michael/dp/0387962956
(Stable-Center manifold theorem) Let H : Rm → Rm be a C¹ map, and let 0 be a fixed point of H. Assume that the Jacobian at 0, i.e. JH(0), has at least one eigenvalue > 1. Then there is a small open ball B(0, r) and a C¹-manifold V ⊂ Rm of dimension ≤ m − 1, so that if x0 ∈ B(0, r) and xn = H^n(x0) converges to 0, then x0 ∈ V.
Consequence: V has measure 0, hence in terms of probability, the probability of picking an x0 for which xn converges to 0 is zero! In other words, if we choose x0 ∈ B(0, r) randomly, then xn (almost surely) does not converge to 0.

Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.7 Dynamical Systems - ODE (cont. 4)

Example: Gradient Descent. H(x) = x − α∇F(x) (here α > 0 is the learning rate).
Compute the Jacobian of this map?
JH(0) = Id − α∇²F(0). If 0 is a saddle point of F, then ∇²F(0) has a negative eigenvalue, and hence JH(0) has an eigenvalue > 1. Therefore, the Stable-Center manifold theorem can be applied to show that there is a ball B(0, r) so that if x0 ∈ B(0, r) is randomly chosen, then the sequence xn created by Gradient Descent (almost surely) does not converge to 0 (in other words, it avoids the saddle point 0).
Note that this is only a local statement. We want to show the same for x0 ∈ Rm randomly chosen. The idea is to take preimages of V under iterates of H, and show that they have zero Lebesgue measure.
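A minimal check of this eigenvalue computation for F(x, y) = x² − y² (the learning rate is an illustrative assumption):

import numpy as np

alpha = 0.1                                     # learning rate of Gradient Descent
hess0 = np.array([[2.0, 0.0], [0.0, -2.0]])     # Hessian of F(x, y) = x^2 - y^2 at the saddle 0
JH0 = np.eye(2) - alpha * hess0                 # Jacobian of H(x) = x - alpha * grad F(x) at 0
print(np.linalg.eigvals(JH0))                   # [0.8, 1.2]: the eigenvalue 1.2 > 1 makes 0 unstable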

Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.7 Dynamical Systems - ODE (cont. 5)
If F has only countably many saddle points, then one can show that we can avoid all saddle points. See: J. D. Lee, M. Simchowitz, M. I. Jordan and B. Recht, Gradient descent only converges to minimizers, JMLR: Workshop and Conference Proceedings, vol 49 (2016), 1–12.
For the general case, an idea is to use Lindelof’s lemma in
analysis, see I. Panageas and G. Piliouras, Gradient descent
only converges to minimizers: Non-isolated critical points and
invariant regions, 8th Innovations in theoretical computer
science conference (ITCS 2017), article no 2, 2:1–2:12.
After this, there have been some additional improvements.
However, it is unknown whether Backtracking Gradient Descent can avoid saddle points.
This is an interesting and useful question, since Backtracking Gradient Descent has much better theoretical guarantees compared to Gradient Descent and other variants.
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.7 Dynamical Systems - ODE (cont. 6)
How about Newton's method? Coming back again to the example F(x, y) = x² − y²: Newton's method H(x) = x − (∇²F(x))⁻¹∇F(x) will converge to the saddle point (0, 0).
What if we change the matrix ∇²F(x) to another matrix which has only positive eigenvalues, by flipping the signs of the negative eigenvalues of ∇²F(x)? Do the calculations to see that this indeed works: we can avoid the saddle point!
In general, one needs some additional ideas. New Q-Newton's method, a variant of Newton's method, keeps the good properties of Newton's method while also avoiding saddle points. See https://link.springer.com/article/10.1007/s10957-023-02270-9
Backtracking New Q-Newton's method is a further variant, which keeps the good properties of New Q-Newton's method while having a better global convergence guarantee, see https://arxiv.org/abs/2209.05378
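A toy Python illustration of the sign-flipping idea on F(x, y) = x² − y²; this is only a sketch of the mechanism, and the actual New Q-Newton's method in the cited paper contains further modifications (e.g. a perturbation to keep the modified matrix invertible):

import numpy as np

gradF = lambda p: np.array([2 * p[0], -2 * p[1]])
hessF = lambda p: np.array([[2.0, 0.0], [0.0, -2.0]])

p = np.array([1.0, 1e-3])
for _ in range(20):
    eigvals, eigvecs = np.linalg.eigh(hessF(p))            # spectral decomposition of the Hessian
    A = eigvecs @ np.diag(np.abs(eigvals)) @ eigvecs.T     # flip the signs of negative eigenvalues
    p = p - np.linalg.solve(A, gradF(p))                   # Newton-type step with the modified matrix
print(p)   # the x-component is killed in one step, while the y-component grows: the saddle is avoided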
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.7 Dynamical Systems - ODE (cont. 7)
If we have an iterative method xn+1 = xn + Q(xn), then replacing xn+1 − xn by dx(t)/dt, and xn by x(t), we obtain the ODE x′(t) = Q(x(t)).
We want to know whether, for a given x0, the solution of x′(t) = Q(x(t)) with x(0) = x0 has a limit limt→∞ x(t).
Gradient descent: We get Gradient flow x ′ (t) = −∇F (x(t)).
Newton’s method: We get Newton’s flow:
x ′ (t) = −(∇2 F (x(t)))−1 .∇F (x(t)).
Usually, the flow has better properties than the discrete version. Therefore, to study the discrete iterative method, one can start by studying its flow, to get some initial ideas.
On the other hand, showing that the flow has good properties does not necessarily mean that the discrete version has the same good properties. Also, in practice one cannot use flows, but only discrete versions (like Euler's scheme for solving ODEs).
Some people use this correspondence (ODE vs. discrete version) to study CNNs.
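A minimal sketch of this correspondence: applying Euler's scheme with step size h to the gradient flow gives back exactly the Gradient Descent recursion with learning rate h (the test function and step size are illustrative assumptions):

gradF = lambda x: 2 * x - 2          # gradient of F(x) = x^2 - 2x
x, h = 5.0, 0.1
for _ in range(100):
    x = x + h * (-gradF(x))          # Euler step for x'(t) = -grad F(x(t)); same as GD with learning rate h
print(x)                              # tends to the minimizer x = 1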
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.8 Metric/Topology

Image from http://mathonline.wikidot.com/the-distance-between-points-and-subsets-in-a-metric-space
§2.8 Metric/Topology (cont. 2)

Metrics are very useful in Machine Learning/Deep Learning, e.g. in defining a cost function for Deep Neural Networks.
Here, we will discuss another application: showing convergence of iterative methods.
There are many different ways to define a metric.
E.g. on Rm one has the usual Euclidean distance d(x, y) = √((x1 − y1)² + . . . + (xm − ym)²).
One can also use the L1 metric: d(x, y) = |x1 − y1| + . . . + |xm − ym| (used in many problems involving matrices).
One can also modify the metric, for example if one wants to penalize the m-th coordinate: d(x, y) = √((x1 − y1)² + . . . + (xm−1 − ym−1)² + 10^5 (xm − ym)²).
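A minimal sketch of these three choices of metric in Python (the weight 10^5 on the last coordinate follows the formula above; the test vectors are illustrative):

import numpy as np

def d_euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))        # usual Euclidean (L2) distance

def d_l1(x, y):
    return np.sum(np.abs(x - y))                # L1 distance

def d_penalized(x, y, weight=1e5):
    sq = (x - y) ** 2
    sq[-1] *= weight                            # penalize the m-th (last) coordinate
    return np.sqrt(np.sum(sq))

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.5, 2.9])
print(d_euclidean(x, y), d_l1(x, y), d_penalized(x, y))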

Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.8 Metric/Topology (cont. 3)

Lp Regularization
A common technique in Machine Learning/Deep Learning is to add an Lp term to the main metric: d(x, y) + ||x − y||p. It is observed in many experiments that this can help improve performance. Can we explain this?
One explanation is that this process makes the cost function a Morse function. There are algorithms which are guaranteed to find local minima of Morse functions. (Recall: a function is Morse if all its critical points are non-degenerate, i.e. if ∇F(x*) = 0, then ∇²F(x*) is an invertible matrix.)

Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.8 Metric/Topology (cont. 4)
An important convergence theorem: M. D. Asic and D. D. Adamovic, Limit points of sequences in metric spaces, The American Mathematical Monthly, vol 77, no 6 (June–July 1970), 613–616.
Let (X , d) be a compact metric space, and let {xn } be a
sequence in X . Assume that limn→∞ d(xn , xn+1 ) = 0. Let S
be the set of cluster points of {xn } (i.e. points x ∗ such that
there is a subsequence {xnk } converging to x ∗ ). Then S is a
connected set.
An application is as follows: In many algorithms, one can show that limn→∞ ||xn+1 − xn|| = 0 for the sequence constructed by the algorithm.
If all cluster points of {xn} are critical points of F, if all critical points of F are isolated (e.g. if F is Morse), and if one can show that the set {xn} is bounded, then one can apply the above theorem to show that if S is non-empty, then the whole sequence {xn} converges to a point x*.
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.8 Metric/Topology (cont. 5)
This is basically a common scheme to prove convergence for
certain optimization algorithms in the literature.
However, the requirement that {xn} is bounded is quite restrictive (for example, a common sufficient condition is that F has compact sublevel sets). Can we overcome this?
Yes, see
Tuyen Trung Truong and Hang-Tuan Nguyen, Backtracking
gradient descent method and some applications in large scale
optimisation, Part 2: Algorithms and experiments, Applied
Mathematics and Optimization, 84 (2021), no 3, 2557-2586.
Tuyen Trung Truong and Hang-Tuan Nguyen, Backtracking
gradient descent method and some applications in large scale
optimisation. Part 1: Theory, Minimax Theory and its
Applications, 7 (2022), no 1, 79-108.
The main idea is to embed Rm as a topological subspace (but not a metric subspace) of a compact metric space, the real projective space PRm.
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.8 Metric/Topology (cont. 6)

Hence, one has an abstract general result on convergence as follows:
Let {xn} be a sequence constructed by an optimization method applied to a cost function F. Assume that every cluster point of {xn} is a critical point of F, and that the set of critical points of F is at most countable (this is the generic case, including Morse functions). Assume moreover that limn→∞ ||xn+1 − xn|| = 0. If the set S of cluster points of {xn} is non-empty, then S is a single point, i.e. the sequence {xn} converges.
The above theorem can be applied to Backtracking Gradient Descent and to BNQN (Backtracking New Q-Newton's method).

Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.9 Semi-algebraic/Semi-analytic geometry

Image from https://www.pas.va/en/academicians/deceased/lojasiewicz.html
§2.9 Semi-algebraic/Semi-analytic geometry (cont. 2)

An important condition is the Lojasiewicz gradient inequality: https://en.wikipedia.org/wiki/Lojasiewicz_inequality
Typical example: real analytic functions.
Many functions in Deep Neural Networks satisfy this: linear
function, sigmoid, softmax, max pool... Also in positional
encoding in Transformers, which uses analytic functions (cos,
sin).
It helps to prove convergence of optimization algorithms. Lojasiewicz proved convergence of the gradient flow dx(t)/dt = −∇F(x(t)).

Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.9 Semi-algebraic/Semi-analytic geometry (cont. 3)

An important result for discrete iterative methods:
P.-A. Absil, R. Mahony and B. Andrews, Convergence of the iterates of descent methods for analytic cost functions, SIAM J. Optim., vol 16 (2005), no 2, 531–547.
Theorem: Assume that F satisfies Lojasiewicz gradient
inequality near x ∗ . Assume that {xn } is a sequence so that
F (xn+1 ) − F (xn ) ≤ −C ||xn+1 − xn || × ||∇F (xn )||, for all n,
where C > 0 is a constant. If x ∗ is a cluster point of {xn },
then indeed limn→∞ xn = x ∗ .
The above result can be applied to Backtracking Gradient Descent and BNQN, for example to systems of polynomial equations or of transcendental functions (e.g. the Riemann xi function from the Riemann hypothesis).

Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.10 Riemannian geometry

Image from https://vccvisualization.org/RiemannianGeometryTutorial/
§2.10 Riemann geometry (cont. 2)
The Netflix prize problem can be connected to matrix completion on a subset of matrices (https://arxiv.org/pdf/0903.1476), for which Riemannian geometry can be utilised.
Fisher information metric, a natural choice of Riemannian
metric on a space of parametrised statistical models (such as
Deep Neural Networks).
S. Amari and H. Nagaoka, Methods of information geometry,
Translations of mathematical monographs, AMS and Oxford
University Press, 2000.
Many constrained optimization problems concern constraints on a submanifold of Rm, for which methods from Riemannian geometry may be useful and better.
Even unconstrained problems can be stated in terms of Riemannian geometry. E.g.: finding eigenvalue/eigenvector pairs of a matrix. Explain how.
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.10 Riemann geometry (cont. 3)

Like in Euclidean space, we can compute the gradient of a real function F : X → R.
https://math.stackexchange.com/questions/2228747/definition-of-gradient-of-a-function-f-in-riemannian-manifold
Similarly, we can compute the Hessian of a real function F : X → R.
https://math.stackexchange.com/questions/81203/definitions-of-hessian-in-riemannian-geometry
It seems we have enough to define Gradient Descent and Newton's method in the Riemannian geometry setting. However, there is an obstacle. What is it?

Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.10 Riemann geometry (cont. 4)
The obstacle is this.
We have the point xn and the gradient grad(F)(xn). Note that the gradient is a vector field, hence grad(F)(xn) is a vector in Txn X.
If X = Rm, then there is a natural identification between Txn Rm and Rm. Then we can subtract the vector from the point to get a new point xn+1. This is how we define Gradient Descent on Rm. Similarly, this is also the reason we can define Newton's method on Rm.
In general, there is no canonical way to combine a point and a tangent vector to get a new point.
Exponential map: An idea from Differential Geometry which makes it possible to combine a point and a tangent vector into a new point is the exponential map.
Any point on a manifold of real dimension m has a small neighbourhood which is diffeomorphic to a small ball in Rm.
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.10 Riemann geometry (cont. 5)
But such an isomorphism may be arbitrary.
The exponential map is an intrinsic way to construct such isomorphisms.
(Definition-Theorem) Let v ∈ Tx X be a tangent vector with ||v|| ≤ r, where r is a small enough positive number. Then there is a unique geodesic γv : [0, 1] → X such that γv(0) = x and γv′(0) = v.
The exponential map is defined as v ↦ γv(1).
How to prove this? Note that a geodesic satisfies a certain ODE, hence we can apply existence and uniqueness results.
How to define Gradient Descent on Riemannian manifolds? xn+1 = expxn(−r · grad(F)(xn)/||grad(F)(xn)||) (here r depends on xn). Why not use expxn(−grad(F)(xn)), which is simpler?
Similarly, one can define Newton's method on Riemannian manifolds.
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.10 Riemann geometry (cont. 6)

Retraction: the exponential map may not be flexible (for example, one needs to solve an ODE, which takes time).
A retraction is more flexible; it is an approximation of the exponential map.
(Definition) A retraction is a smooth map R : E → X, where E ⊂ TX is an open set which contains the zero section, such that for every x ∈ X, if Rx = R|E∩Tx X, then Rx(0x) = x and DRx(0x) = Id.
Note that retractions are easier to construct, and may be
defined on the whole tangent bundle TX , even though
exponential maps may not be defined on the whole tangent
bundle.
Using retractions, one can also define Gradient Descent and
Newton’s method in the Riemannian geometry setting.
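A minimal sketch of Riemannian Gradient Descent with a retraction, for the earlier example of finding an eigenvector of a symmetric matrix: minimize F(x) = xᵀAx on the unit sphere, with the normalization map as retraction (the matrix, step size and iteration count are illustrative assumptions):

import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])          # a symmetric test matrix
x = np.array([1.0, 0.5])
x = x / np.linalg.norm(x)                        # start on the unit sphere
step = 0.1
for _ in range(500):
    egrad = 2 * A @ x                            # Euclidean gradient of F(x) = x^T A x
    rgrad = egrad - (x @ egrad) * x              # Riemannian gradient: project onto the tangent space
    x = x - step * rgrad                         # move in the tangent direction
    x = x / np.linalg.norm(x)                    # retraction: renormalize back to the sphere
print("Rayleigh quotient:", x @ A @ x)           # approaches the smallest eigenvalue of A
print("eigvalsh:", np.linalg.eigvalsh(A))        # compare with numpy's eigenvalues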

Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.10 Riemann geometry (cont. 7)
Over the Euclidean space, Gradient Descent and Newton's method do not have good global convergence guarantees. The same holds for the Riemannian versions. How can we boost the convergence?
One needs stronger properties: "strong local retractions" and "real analytic-like local retractions". For details, see https://arxiv.org/pdf/2008.11091
Example: finding eigenvalues/eigenvectors of symmetric real matrices.
How do we prove convergence in the Riemannian manifold setting?
Nash–Kuiper embedding theorems: https://en.wikipedia.org/wiki/Nash_embedding_theorems. In particular, isometric embeddings allow one to reduce the convergence problem to that of subspaces of Euclidean spaces.
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.11 Mathematical education

Image from https://www.curry.edu/academics/undergraduate-degrees-and-programs/science-and-mathematics-degrees-and-programs/mathematics-education-minor
§2.11 Mathematical Education (cont. 2)
The ultimate dream is to create AGI (Artificial General Intelligence): roughly speaking, machines which are as intelligent as humans.
LLMs (Large Language Models) seem to be a promising direction to go.
However, LLMs, like other Deep Learning models, work on probability.
This is good for tasks that do not require precise answers, e.g. art, image classification, assistants,...
It is not good for mathematics and logic.
E.g. no LLM can always solve linear equations in 1 real variable correctly. E.g. 2(−3(4x + 1) + 10) − 9(2(5(100x + 3) − 27) + 4) = 8(7(6x + 5) − 4) − 73 (if an LLM can solve this equation, add more parentheses and it will fail!). Run this with Mistral AI Chat.
[More example on mathematics and LLM in the final section.]
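For comparison, a one-line computation with a formal tool gets this exactly right (a sketch assuming sympy is available):

from sympy import symbols, Eq, solve

x = symbols('x')
lhs = 2*(-3*(4*x + 1) + 10) - 9*(2*(5*(100*x + 3) - 27) + 4)
rhs = 8*(7*(6*x + 5) - 4) - 73
print(solve(Eq(lhs, rhs), x))   # exact rational solution: [19/9360]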
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.11 Mathematical Education (cont. 3)

The least requirement for an AGI is that it can do logic and simple mathematics.
Hence, with the current rate of technological progress, we need to wait a Looooooooooooooooooooooooooooooong time for AGI to come.
However, reading the wrong answers given by an LLM provides some insights.
Still, the answers show some understanding by the LLM of the question asked.
Prompt engineering is a way to help us teach an LLM to provide correct answers.

Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.11 Mathematical Education (cont. 4)

An LLM is like a student who has some understanding of the subject.
Appropriate prompts to guide an LLM are like appropriate instructions to specific students.
The relation between LLMs and Mathematical Education goes both ways.
Knowledge from teaching can help us write good prompts for an LLM to solve mathematics correctly.
Answers by an LLM can reflect how students make mistakes. Students can learn by studying wrong answers given by an LLM.
One can use LLMs to help teach students.

Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§3 Looking forward - Some open questions

Image from http://bakerpublishinggroup.com/books/3-big-questions-that-shape-your-future/390834
§3 Looking forward - Some open questions (cont. 2)
Can't AI already solve Olympiad math questions, and is it hence not already intelligent?
In the news: AlphaProof and AlphaGeometry 2 correctly solved 4 problems of the 2024 IMO, which is equivalent to a silver medal.
Since the code/papers for AlphaProof and AlphaGeometry 2 are not available, one cannot comment in detail. However, reading the blog post by DeepMind, it seems that AlphaGeometry 2 is similar to AlphaGeometry (from last year), and maybe also to FunSearch.
https://www.nature.com/articles/s41586-023-06747-5
https://deepmind.google/discover/blog/funsearch-making-new-discoveries-in-mathematical-sciences-using-large-language-models/
As we saw, LLMs cannot reliably solve linear equations in 1 variable. How do AlphaGeometry or FunSearch manage?

Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§3 Looking forward - Some open questions (cont. 3)
The trick is that in high school geometry there are several common auxiliary constructions which can help solve a problem. Usually, you may not need more than 10 such auxiliary constructions to solve a problem.
In the AlphaGeometry paper, they have a list of about 50 such constructions.
So if you restrict yourself to at most 10 auxiliary constructions, you have about 10^50 choices. There is no hope of searching all possibilities.
Where an LLM can help is that it can classify into a finite number of categories well, after training on big and representative datasets.
E.g. it can classify whether each sentence in a paragraph has a "positive", "negative" or "neutral" meaning (3 categories). Run an example with Mistral AI Chat.
Now, for AlphaGeometry it is more complicated, but the main idea is similar.
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§3 Looking forward - Some open questions (cont. 4)
A direction in AlphaGeometry is to combine it with formal proof checking: they send their proof proposal to a Proof Assistant to check whether it is correct.
Another direction is to utilise the ability of computers to work in parallel: basically, they have something like 1000 computers, and each computer proposes a different proof proposal.
Both directions are on the right track: AI should be combined with formal computations in order to do mathematics.
However, there are differences from humans.
Humans don't train on millions of problems (and would not have time to do so).
For a fair comparison with AlphaGeometry's use of parallel computation, we should have 1000 people working together. Also, people should then only need to come up with proof proposals and send them to a Proof Assistant to check, as AlphaGeometry does.

Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§3 Looking forward - Some open questions (cont. 5)
Not to mention that computers do computations much more quickly than humans. Also, AlphaGeometry can only do planar geometry questions. If one wants it to do another type of question, one needs to train another AI.
So probably we are on the right track, but we will need significant/radical changes in how we train AI. We need to reduce the required resources, the cost, and so on.
Moreover, if one can state a high school problem in terms of algebraic equations, then formal tools like Grobner bases can solve it efficiently. One does not need to train on millions of problems, and one does not need to propose a thousand different proof proposals, ...
The AlphaGeometry paper said that Grobner bases ... cannot solve these planar geometry problems (in a reasonable time), but that claim does not match what I know about Grobner bases and these planar geometry problems! This needs verification by an independent committee.
⇒ We need a stronger reason for why an LLM is needed.
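A toy illustration of the Grobner basis tool mentioned above, on a simple circle-and-line system rather than on the IMO geometry problems (a sketch assuming sympy is available):

from sympy import groebner, symbols

x, y = symbols('x y')
# Intersect the circle x^2 + y^2 = 4 with the line x - y = 1.
G = groebner([x**2 + y**2 - 4, x - y - 1], x, y, order='lex')
print(G)   # a triangular (lex) basis, from which the solutions follow by back-substitution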
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§3 Looking forward - Some open questions (cont. 6)

Can we theorise the practice in Deep Learning of running an optimization algorithm on a Deep Neural Network architecture, but then taking validation accuracy as the deciding factor?

Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§3 Looking forward - Some open questions (cont. 7)

Can we prove that Backtracking Gradient Descent can avoid saddle points?
This is unknown as yet, even though it is known for Gradient Descent, and for a variant of Backtracking Gradient Descent.
More difficult: can we state a good notion of convergence for stochastic optimization, and prove it?
The current literature mainly considers convergence of {F(xn)} or {||∇F(xn)||}, and not of {xn}. Also, it requires the cost function to be (strongly) convex, and mainly treats Gradient Descent or some of its variants.
This is not yet satisfying (from the theoretical viewpoint) and not useful enough (from an applied viewpoint).

Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§3 Looking forward - Some open questions (cont. 8)

Can we develop a theory which, for each type of problem, can propose a good Deep Neural Network architecture/loss function for solving it?
The current approach is mainly experimental, which takes resources and time.
As a consequence, most (if not all) models in use were developed by big companies like Google, Facebook,...
If we had a good theory, then smaller/less-resourced teams could also contribute to this important issue.
What should the new paradigm be, so that we can build actual AGI?
One thing we know is that we need to go beyond Deep Neural Networks. We need to go beyond Deep Learning.
HOW?

Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
Thank you very much for your attention!

Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
