MathOptimizationRoleInMLDL
Tuyen Trung Truong
Short bio
Short bio (cont. 2)
Some research in Optimization and Machine Learning/Deep Learning:
Papers:
Tuyen Trung Truong and Hang-Tuan Nguyen, Backtracking gradient descent method and some applications in large scale optimisation, Part 2: Algorithms and experiments, Applied Mathematics and Optimization, 84 (2021), no. 3, 2557-2586.
Tuyen Trung Truong and Hang-Tuan Nguyen, Backtracking gradient descent method and some applications in large scale optimisation. Part 1: Theory, Minimax Theory and its Applications, 7 (2022), no. 1, 79-108.
Maged Helmy, Tuyen Trung Truong, Anastasiya Dykyy, Paulo Ferreira and Eric Jul, CapillaryNet: an automated system to quantify skin capillary density and red blood cell velocity from handheld vital microscopy, Artificial Intelligence in Medicine, Volume 127, May 2022, 102287.
Short bio (cont. 3)
Maged Helmy, Tuyen Trung Truong, Eric Jul and Paulo Ferreira, Deep learning and computer vision techniques for microcirculation analysis: A review, Patterns, Volume 4, Issue 1, 2023, 100641.
Tuyen Trung Truong, Tat Dat To, Hang-Tuan Nguyen, Thu Hang Nguyen, Hoang Phuong Nguyen and Maged Helmy, A fast and simple modification of Newton's method avoiding saddle points, Journal of Optimization Theory and Applications, published online 18 July 2023.
John Erik Fornæss, Mi Hu, Tuyen Trung Truong and Takayuki Watanabe, Backtracking New Q-Newton's method, Newton's flow, Voronoi's diagram and Stochastic root finding, arXiv:2401.01393. Accepted in Complex Analysis and Operator Theory, special issue for the 70th birthday of Steve Krantz.
Tuyen Trung Truong, Backtracking New Q-Newton's method: a good algorithm for optimization and solving systems of equations, arXiv:2209.05378.
Short bio (cont. 4)
John Erik Fornæss, Mi Hu, Tuyen Trung Truong and Takayuki Watanabe, Backtracking New Q-Newton's method, Schröder's theorem, and Linear conjugacy, arXiv:2312.12126.
Thuan Quang Tran and Tuyen Trung Truong, The Riemann hypothesis and dynamics of Backtracking New Q-Newton's method, arXiv:2405.05834.
Some ongoing work:
Develop an app for optimization methods. We will run some experiments with it when discussing Stochastic Optimization.
Develop an app using AI to create good mathematical questions for elementary school, supported by the University of Oslo's Growth House.
Collaborate with Torus AI on a project on AI for spirometry.
Develop an app using AI to help high school students solve mathematical exam questions.
Table of contents
What is Deep Learning?
§1 Some motivating applications
§1.1: PageRank
§1.2: Least squares problem
Classical problem in statistics. https://en.wikipedia.org/wiki/Least_squares
Have a dataset $(x_i, y_i)_{i=1,\ldots,N}$.
Guess that $y_i$ is approximately $F(x_i, \beta)$, where $\beta$ are parameters to be determined.
Determine $\beta$ by solving the minimization problem: $\min_\beta \frac{1}{N}\sum_{i=1}^N (y_i - F(x_i, \beta))^2$.
Example: Linear regression $F(x, \beta) = ax + b$, where $\beta = (a, b)$.
Example: Fit to an ellipse: $F(x, \beta) = \pm\sqrt{(1 - ax^2)/b}$, where $\beta = (a, b)$ and the ellipse is $ax^2 + by^2 = 1$.
If $F(x, \beta)$ comes from a Deep Neural Network, then this is an example of Deep Learning!
The difference is that in the 1960s it was hard to solve this unless $F(x, \beta)$ was simple (e.g. Linear Regression). Nowadays, one can solve it for very complicated $F(x, \beta)$.
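To make this concrete, here is a minimal sketch (assuming NumPy) of solving the least squares problem for linear regression by plain gradient descent; the synthetic data, learning rate and function names are illustrative choices, not from the slides.

```python
import numpy as np

# Synthetic dataset (x_i, y_i): y is roughly 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=100)

def F(x, beta):
    # Linear regression model F(x, beta) = a*x + b with beta = (a, b).
    a, b = beta
    return a * x + b

def cost(beta):
    # (1/N) * sum_i (y_i - F(x_i, beta))^2
    return np.mean((y - F(x, beta)) ** 2)

def grad(beta):
    # Gradient of the cost with respect to beta = (a, b).
    r = y - F(x, beta)
    return np.array([-2 * np.mean(r * x), -2 * np.mean(r)])

beta = np.zeros(2)        # initialize the parameters
for _ in range(500):      # plain gradient descent with a fixed learning rate
    beta -= 0.1 * grad(beta)

print(beta, cost(beta))   # beta should be close to (2, 1)
```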
§1.3: Classification of images
For a long time it was impossible for a computer to recognise whether or not there is a cat in a picture.
Yet this is a piece of cake for human children (2 years old or more).
No rigorous theory can do this! Even one published in the Annals of Mathematics.
A specific problem: a picture contains either a dog, a cat or a mouse, and only one of them. Can we design an algorithm for computers to detect which one?
How to proceed?
The first step is to collect a large dataset of the form $(x_i, y_i)_{i=1,\ldots,N}$, where $N$ is "big enough", $x_i$ is an image, and $y_i$ is one of the labels "dog", "cat", "mouse".
A convenient way is to use a "one-hot vector": (1, 0, 0) means "dog", (0, 1, 0) means "cat", (0, 0, 1) means "mouse".
§1.3: Classification of images (cont. 2)
Second step: what type of values can $F(x, \beta)$ take?
A nontrivial question: should $F(x, \beta)$ take values only in the discrete (indeed, finite) set {(1, 0, 0), (0, 1, 0), (0, 0, 1)}?
Why nontrivial? Since $y_i$ takes values in this discrete set, shouldn't $F(x, \beta)$ as well?
What's not good? Then $F(x, \beta)$ cannot be a continuous function, and it becomes very difficult to work with!
A new, better idea? Indeed, the idea that $F(x, \beta)$ takes values only in the discrete set is not good.
You can imagine that some pictures look a bit like a dog, but also a bit like a cat, and a bit like a mouse.
A better way is to have $F(x, \beta) = (a, b, c)$, where $0 \leq a, b, c$ and $a + b + c = 1$.
Here $a$ is the probability that $x$ is a dog, $b$ the probability that $x$ is a cat, and $c$ the probability that $x$ is a mouse.
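A common way to turn raw network outputs into such a probability vector $(a, b, c)$ is the softmax function; the sketch below, with made-up scores, only illustrates this step.

```python
import numpy as np

def softmax(z):
    # Turn arbitrary real scores into (a, b, c) with 0 <= a, b, c and a + b + c = 1.
    z = z - np.max(z)          # for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 0.5, -1.0])   # hypothetical network outputs for (dog, cat, mouse)
print(softmax(scores))                # approx [0.79, 0.18, 0.04]: "mostly dog, a bit cat"
```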
§1.3: Classification of images (cont. 3)
§2 Specific fields in Mathematics and Optimization, with usage in Machine Learning/Deep Learning
§2.1: Logic (cont. 2)
Logic is at the foundation of all mathematics and science. Hence, it lies at the foundation of all applications of mathematics and science, including Machine Learning/Deep Learning. Obviously!
Indeed, Logic has a more direct and crucial role in both Artificial Neural Networks and Computer Science.
Some names: Gottfried Leibniz, Bertrand Russell and Alfred Whitehead, Walter Pitts, Warren McCulloch. "Along the way, they would create the first mechanistic theory of the mind, the first computational approach to neuroscience, the logical design of modern computers, and the pillars of artificial intelligence."
"A neuron's signal is a proposition, and neurons seemed to work like logic gates, taking in multiple inputs and producing a single output." https://nautil.us/the-man-who-tried-to-redeem-the-world-with-logic-235253/
Logic is expected to play an important role in creating true AI.
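To illustrate the "neurons as logic gates" idea of Pitts and McCulloch, here is a toy sketch (my own example, not from the slides) of a threshold unit acting as an AND gate and an OR gate.

```python
def mcculloch_pitts(inputs, weights, threshold):
    # A McCulloch-Pitts style unit: fire (output 1) iff the weighted sum reaches the threshold.
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

# With suitable weights/thresholds the same kind of unit acts as different logic gates.
AND = lambda a, b: mcculloch_pitts([a, b], [1, 1], threshold=2)
OR  = lambda a, b: mcculloch_pitts([a, b], [1, 1], threshold=1)

print([AND(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 0, 0, 1]
print([OR(a, b)  for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 1]
```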
§2.2 Approximation theory
§2.2: Approximation Theory (cont. 2)
Linear regression: approximate data by a line in dimension 2 (more generally, a linear subspace in higher dimensions). Classical statistics/classical Machine Learning. Still useful for certain datasets.
More general: polynomial interpolation. Approximate data by polynomial functions.
However, this is not enough for the general approximation problem. One must go beyond polynomials to be much more useful.
Exponential-type functions have been preferred (like sigmoid, softmax, hyperbolic functions...). More research and experiments show that even piece-wise linear functions (like ReLU, maxpool, ...) can also be effective.
Indeed, our brains probably work better with these piece-wise linear functions than with highly non-linear functions (even working with polynomials can already be difficult in general, if we do not get assistance from computers). This is why Linear Algebra matters.
§2.2: Approximation Theory (cont. 3)
Universal approximation theorem: see Theorem 3.1 in https://pinkus.net.technion.ac.il/files/2021/02/acta.pdf
Fix a function $\sigma$ which is continuous and not a polynomial. Any continuous function can be approximated on compact sets by functions of the form $\sum_{i=1}^N c_i \sigma(L_i x + \theta_i)$, where the $c_i$ are real numbers, the $L_i$ are matrices and the $\theta_i$ are vectors. The number $N$ depends on the approximation error we want to achieve.
In terms of Neural Networks, the above theorem means that any function can be approximated by a Neural Network with 1 layer, provided we are allowed to increase $N$ (the breadth).
In practice, the Deep Neural Network paradigm seems to be more efficient: we use more layers, but each layer has a bounded dimension.
The theorem appeared in the 1960s but could not be used to develop Deep Neural Networks, since computing power was not sufficient; Linear Regression was a better fit then.
§2.2: Approximation Theory (cont. 4)
Deep Neural Networks: a function which is a composition of functions of the form $\sigma_k \circ (L_k x_k + \theta_k)$, $0 \leq k \leq m$ (the bigger $m$, the deeper the network).
The $x_k$ are variables.
The $\sigma_k$ are nonlinear functions, and are fixed. For example, the sigmoid function, the ReLU function, maxpool (meaning the max of a finite set of numbers)...
The $L_k$ are matrices and the $\theta_k$ are vectors.
The coefficients of $L_k$ and $\theta_k$ are not determined beforehand. We need to decide the best coefficients for our specific application.
The idea is to initialize these coefficients randomly, and then use an optimization algorithm to minimize an appropriate cost function. Compare with the ideas of Pitts and McCulloch!
These coefficients are called the parameters of the DNN. For example, GPT-4 is said to have 1.7T parameters.
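As a toy illustration of this composition, here is a minimal NumPy sketch of a forward pass through such a network with randomly initialized parameters; the layer sizes and the choice of ReLU are assumptions made only for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dnn(x, params, sigma=relu):
    # A deep neural network as a composition of maps x -> sigma(L x + theta).
    for L, theta in params:
        x = sigma(L @ x + theta)
    return x

rng = np.random.default_rng(0)
# Randomly initialized parameters (L_k, theta_k) for a 3 -> 4 -> 2 network.
params = [(rng.normal(size=(4, 3)), rng.normal(size=4)),
          (rng.normal(size=(2, 4)), rng.normal(size=2))]

print(dnn(np.array([1.0, 2.0, 3.0]), params))
```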
§2.2: Approximation Theory (cont. 5)
§2.3 Calculus
Image from https://www.mindingthecampus.org/2023/03/02/calculus-without-fear/
§2.3 Calculus (cont. 2)
Crucial/useful notions:
First derivatives:
Gradient: $\nabla f = (\partial f/\partial x_1, \ldots, \partial f/\partial x_n)$. Used in checking whether a point is a critical point. Used in Taylor series. Used in gradient descent.
Jacobian: generalisation of the gradient to a tuple of functions. $JF = (\nabla f_1, \ldots, \nabla f_m)$, where $F = (f_1, \ldots, f_m)$. Used in Newton's method and the Levenberg-Marquardt method for the Least Squares Problem (including solving systems of equations).
To solve $f_1 = \ldots = f_m = 0$, we consider $F = f_1^2 + \ldots + f_m^2$. What is the relation between the roots of the system and the global minima of $F$?
§2.3 Calculus (cont. 3)
Second derivatives:
Hessian matrix: $\nabla^2 f = (\partial^2 f/\partial x_i \partial x_j)_{i,j=1,\ldots,n}$. Used to classify critical points (local minimum, local maximum, saddle point). Used in Newton's method for optimization. Represents "curvature".
We have $\nabla^2 f = J(\nabla f)$ (the Hessian matrix is the Jacobian matrix of the gradient).
One important property: if $f$ is $C^2$, then $\nabla^2 f$ is a symmetric square matrix with real coefficients.
Consequence: all eigenvalues of the Hessian matrix are real, and the Hessian matrix is diagonalisable. https://en.wikipedia.org/wiki/Diagonalizable_matrix
Higher derivatives are rarely used, and have not yet produced algorithms more advantageous than those using first and second derivatives.
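As a small numerical illustration (my own sketch) of these notions, the following code approximates the gradient and the Hessian by finite differences, using $f(x, y) = x^2 - y^2$, the saddle example that appears later in these slides.

```python
import numpy as np

def grad(f, x, h=1e-5):
    # Finite-difference gradient (df/dx_1, ..., df/dx_n).
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def hessian(f, x, h=1e-4):
    # Hessian as the Jacobian of the gradient: column j is the derivative of grad(f) in direction x_j.
    n = len(x)
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros_like(x); e[j] = h
        H[:, j] = (grad(f, x + e) - grad(f, x - e)) / (2 * h)
    return H

f = lambda x: x[0] ** 2 - x[1] ** 2        # the saddle example used later
x0 = np.array([1.0, 2.0])
print(grad(f, x0))                         # approx [ 2, -4]
print(np.linalg.eigvalsh(hessian(f, x0)))  # approx [-2, 2]: one negative eigenvalue => saddle point
```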
§2.4 Probability & Statistics
§2.4 Probability & Statistics (cont. 2)
§2.4 Probability & Statistics (cont. 3)
Restating the question for a Deep Learning model to classify images:
Construct a Deep Neural Network $F(x, \alpha)$, where $x$ is the input and $\alpha$ are the parameters.
Construct a cost function $G(\alpha) = \sum_{x \in I} d(F(x, \alpha), y_x)$. Here $I$ is the dataset, $y_x$ is the "correct value" for the data point $x$, and $d$ is a metric (distance function).
Initialize a random value $\alpha_0$, choose an optimization method and run it to find a good value $\hat{\alpha}$, and use $F(x, \hat{\alpha})$ to provide new predictions.
Not enough, need to test against reality to make sure!
Each time we get a value $\alpha_n$ from the optimization algorithm, we test it on a so-called validation dataset, and choose the one which gives the best match.
This is close to what is actually done. But there are many other modifications, some of which will be covered next.
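The recipe above can be written as a short generic training loop; the sketch below is schematic (the function names, learning rate and toy model are assumptions, not from the slides), but it follows the same steps: optimize the cost, monitor a validation dataset, keep the best parameters.

```python
import numpy as np

def train(F, d, data, val_data, init_alpha, grad_G, lr=0.01, steps=200):
    # Minimize G(alpha) = sum over the dataset of d(F(x, alpha), y_x) by gradient descent,
    # keeping the parameter value that does best on a validation dataset.
    alpha = init_alpha
    best_alpha, best_score = alpha, float("inf")
    for _ in range(steps):
        alpha = alpha - lr * grad_G(alpha, data)              # one optimization step
        score = sum(d(F(x, alpha), y) for x, y in val_data)   # validation error
        if score < best_score:
            best_alpha, best_score = alpha, score
    return best_alpha

# Toy usage: F(x, alpha) = alpha * x, squared distance, data with y = 3x.
F = lambda x, a: a * x
d = lambda u, v: (u - v) ** 2
data = [(x, 3 * x) for x in np.linspace(-1, 1, 20)]
val = [(x, 3 * x) for x in np.linspace(-0.9, 0.9, 7)]
grad_G = lambda a, batch: sum(2 * (F(x, a) - y) * x for x, y in batch)
print(train(F, d, data, val, init_alpha=0.0, grad_G=grad_G))  # close to 3
```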
§2.4 Probability & Statistics (cont. 4)
§2.4 Probability & Statistics (cont. 5)
§2.4 Probability & Statistics (cont. 6)
We considered the cost function $G(\alpha) = \sum_{x \in I} d(F(x, \alpha), y_x)$, where $x$ runs over the whole dataset $I$.
However, practice shows that this is not a smart way. Better is to use "mini-batches".
We divide the dataset $I$ into smaller subsets $I_1, \ldots, I_m$ randomly. Each $I_j$ is a mini-batch.
We then run our optimization algorithm with changing cost functions $G_1 = \sum_{x \in I_1} d(F(x, \alpha), y_x)$, $G_2 = \sum_{x \in I_2} d(F(x, \alpha), y_x)$, ..., $G_m = \sum_{x \in I_m} d(F(x, \alpha), y_x)$.
When we are done with $G_m$, we finish an epoch, redivide the dataset $I$ into mini-batches again, and repeat until we are satisfied.
Why is this a better way? Maybe because we are exposed to many more variations in the dataset, which makes the algorithm more robust to new data.
We do not work with ordinary optimization, but with stochastic optimization.
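A minimal sketch of the mini-batch/epoch loop described above; here `grad_Gj(alpha, batch)` plays the role of $\nabla G_j(\alpha)$, and the batch size, learning rate and number of epochs are illustrative choices.

```python
import numpy as np

def minibatch_epochs(data, alpha, grad_Gj, batch_size=32, epochs=5, lr=0.01, seed=0):
    # Each epoch: shuffle the dataset, split it into mini-batches I_1, ..., I_m,
    # and take one optimization step per mini-batch cost G_j.
    rng = np.random.default_rng(seed)
    data = list(data)
    for _ in range(epochs):
        order = rng.permutation(len(data))
        shuffled = [data[i] for i in order]
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]      # the mini-batch I_j
            alpha = alpha - lr * grad_Gj(alpha, batch)      # step on G_j only
    return alpha
```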
§2.5 Linear algebra
Image from https://www.codecademy.com/learn/dsml-math-for-machine-learning/modules/math-ds-linear-algebra/cheatsheet
§2.5 Linear Algebra (cont. 2)
Fast matrix multiplication is important in applications.
E.g. calculating the gradient of the output of a Deep Neural Network is, by the Chain Rule in Calculus, reduced to calculating products of matrices.
Computer chips are more powerful when the matrix multiplication built into the chip is faster.
The textbook algorithm for multiplying two matrices of size $n$ has complexity $n^3$.
The complexity of multiplying two matrices of size $n$ is $O(n^\theta)$, where $2 \leq \theta < 3$. Conjecturally $\theta = 2$ (we cannot do better. Why?).
The one which is fastest in practice, even though it does not have the best complexity, is Strassen's algorithm. https://en.wikipedia.org/wiki/Computational_complexity_of_matrix_multiplication
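For concreteness, here is a short sketch of Strassen's algorithm for square matrices whose size is a power of two; the cutoff that falls back to ordinary multiplication is a practical choice of mine, not part of the slides.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    # Strassen's algorithm: 7 recursive multiplications instead of 8,
    # giving complexity O(n^{log2 7}) ≈ O(n^{2.81}).
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    k = n // 2
    A11, A12, A21, A22 = A[:k, :k], A[:k, k:], A[k:, :k], A[k:, k:]
    B11, B12, B21, B22 = B[:k, :k], B[:k, k:], B[k:, :k], B[k:, k:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])

A, B = np.random.rand(128, 128), np.random.rand(128, 128)
print(np.allclose(strassen(A, B), A @ B))  # True
```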
§2.5 Linear Algebra (cont. 3)
Finding the inverse of a matrix and finding eigenvalue/eigenvector pairs of a matrix are important in applications.
Strassen proved that finding the inverse of a matrix, finding its determinant, and doing Gaussian elimination (in solving linear equations) all have the same complexity $O(n^\theta)$.
Finding eigenvalues is a root finding problem: the eigenvalues are the roots of the characteristic polynomial.
Numerical Linear Algebra: usually, try to use an iterative method. Example: how to find the principal pair $\lambda_1/e_1$ of a matrix (Google's PageRank algorithm)?
Another idea for fast computation is to compute just a few significant values, and not all.
For example, compute the first 10 eigenvalue/eigenvector pairs, where the eigenvalues are the 10 largest. Lanczos algorithm: https://en.wikipedia.org/wiki/Lanczos_algorithm
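As an illustration of the iterative approach to the principal pair $\lambda_1/e_1$ (the idea behind PageRank), here is a minimal power-iteration sketch; the example matrix is my own.

```python
import numpy as np

def power_iteration(A, iters=1000, seed=0):
    # Repeatedly apply A and renormalize: the component along the principal
    # eigenvector e_1 dominates, and the Rayleigh quotient estimates lambda_1.
    rng = np.random.default_rng(seed)
    v = rng.normal(size=A.shape[0])
    for _ in range(iters):
        w = A @ v
        v = w / np.linalg.norm(w)
    lam = v @ A @ v
    return lam, v

A = np.array([[2.0, 1.0], [1.0, 3.0]])
print(power_iteration(A))   # dominant eigenvalue approx 3.618
```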
§2.5 Linear Algebra (cont. 4)
Linear Algebra is intimately related to optimizing a quadratic function $\sum_{i,j} a_{i,j} x_i x_j$, where $a_{i,j} = a_{j,i}$.
By finding eigenvalue/eigenvector pairs, one can write it in the form $\sum_j \lambda_j x_j^2$.
An important result: https://en.wikipedia.org/wiki/Sylvester%27s_law_of_inertia
Classical machine learning (in particular, unsupervised algorithms) uses optimization of quadratic functions, and hence is directly related to Linear Algebra.
E.g. Linear Regression, SVM (Support Vector Machine), Principal Component Analysis (PCA).
https://en.wikipedia.org/wiki/Support_vector_machine
https://en.wikipedia.org/wiki/Principal_component_analysis
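A quick numerical check (my own example) of the statement that a symmetric quadratic form becomes $\sum_j \lambda_j z_j^2$ after an orthogonal change of coordinates:

```python
import numpy as np

# Quadratic form sum_{i,j} a_ij x_i x_j with a_ij = a_ji, i.e. x^T A x for symmetric A.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
lam, Q = np.linalg.eigh(A)            # A = Q diag(lam) Q^T with Q orthogonal

x = np.array([0.7, -1.2])
z = Q.T @ x                           # change of coordinates z = Q^T x
print(x @ A @ x, np.sum(lam * z**2))  # equal: the form becomes sum_j lambda_j z_j^2
```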
§2.5 Linear Algebra (cont. 5)
§2.5 Linear Algebra (cont. 6)
§2.5 Linear Algebra (cont. 7)
Linear Algebra is also important in LLMs! Besides matrix multiplication.
A crucial component is the Transformer, which relies on Attention (also positional encoding, see later). https://en.wikipedia.org/wiki/Attention_(machine_learning)
Simplified statement of the attention problem: suppose we are given a sequence of objects $Y_1, \ldots, Y_m$, and another object $X$. How do we decide which of the $Y_j$ is closest to $X$?
Assume that $X$ and each $Y_j$ are represented by a vector in $\mathbb{R}^N$.
One idea is to say that two vectors are most similar if their angle is $0^\circ$, and most opposite if their angle is $180^\circ$.
This is expressed by a simple statement in Linear Algebra: $X$ and $Y$ are most similar if $\langle X, Y \rangle$ (the inner product) is largest. One can combine this with the softmax function to get probabilities.
Maybe we should normalise by dividing by the norm of $Y$?
Example: is $X = (1, 2, 3)$ closer to $Y_1 = (1, -1, 1)$ or $Y_2 = (-1, 1, 1)$?
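Here is the example worked out numerically (a minimal sketch); the softmax weighting at the end mirrors how attention turns similarity scores into probabilities.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
Ys = np.array([[1.0, -1.0, 1.0],    # Y1
               [-1.0, 1.0, 1.0]])   # Y2

scores = Ys @ X                                                        # inner products <X, Y_j>
cosine = scores / (np.linalg.norm(X) * np.linalg.norm(Ys, axis=1))     # normalised version
weights = np.exp(scores) / np.exp(scores).sum()                        # softmax, as in attention

print(scores, cosine, weights)
# scores = [2, 4]: by the inner-product criterion, X is closer to Y2.
```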
§2.6 Optimization
§2.6 Optimization (cont. 2)
Main problem (unconstrained): find $x^*$ which minimizes a function $F(x)$.
First clue: any minimum is a critical point: if $F(x^*) = \min_x F(x)$, then $\nabla F(x^*) = 0$.
Symbolic approach (high school approach!): find all critical points, then check the values to find the minima.
This allows one to find exact minima. (But is this needed in reality? The formulation of Deep Neural Networks is not like that.)
This can work for simple problems, where the set of critical points is easy to determine.
In general, one needs numerical methods.
A popular choice is iterative methods. One starts with a random choice of $x_0$, then has a way to create a new point $x_1$ from $x_0$, then a new point $x_2$ from $x_1$, and so on. We will discuss some of these methods later.
§2.6 Optimization (cont. 3)
§2.6 Optimization (cont. 4)
Iterative methods: $IM(F, x)$, depending on the function $F$ and the current point $x$.
Choose a random point $x_0$.
Define a sequence by induction: $x_{n+1} = IM(F, x_n)$.
First order methods: involve only first derivatives of $F$. Like Gradient Descent.
Second order methods: involve second derivatives of $F$. Like Newton's method.
Some criteria for a good iterative method:
Any cluster point should be a critical point.
The sequence should converge for good enough functions $F$. (In the literature, one usually requires, for example, that $F$ is convex or that $\nabla F$ is Lipschitz continuous ⇒ too restrictive, not applicable to interesting cases.)
Any limit point should not be a saddle point. (Refer to the previous slide.)
§2.6 Optimization (cont. 5)
Gradient descent: $IM(F, x) = x - \nabla F(x)$.
Avoids saddle points (good).
No global convergence guarantee (bad).
May be slow when it converges (bad).
Newton's method: $IM(F, x) = x - (\nabla^2 F(x))^{-1} \cdot \nabla F(x)$.
If $x_0$ is close to a non-degenerate local minimum, then the convergence rate is quadratic (good).
Has the tendency to converge to the nearest critical point (even if it is a saddle point): bad. Example: what happens if we apply Newton's method to a quadratic function?
One must be careful when $\nabla^2 F(x)$ is not invertible (i.e. has 0 as an eigenvalue).
Many popular algorithms in Deep Learning are variants of Gradient Descent, like Adam. Second order methods (BFGS, ...) are expensive to use, and have not yet shown better results.
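To see the contrast, and to answer the question about quadratic functions, here is a tiny sketch on $F(x, y) = x^2 - y^2$; the learning rate 0.1 in the gradient step is my own illustrative choice.

```python
import numpy as np

# F(x, y) = x^2 - y^2 has a saddle at the origin: Newton jumps straight to that
# critical point, illustrating its tendency to land on saddle points.
grad = lambda p: np.array([2 * p[0], -2 * p[1]])
hess = lambda p: np.array([[2.0, 0.0], [0.0, -2.0]])

p = np.array([1.0, 0.5])
print(p - 0.1 * grad(p))                       # gradient descent step: y moves away from the saddle
print(p - np.linalg.solve(hess(p), grad(p)))   # Newton step: exactly [0, 0], the saddle point
```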
§2.6 Optimization (cont. 6)
Some mechanisms to boost convergence: assume we are at a point $x_n$ and have already chosen a direction $w_n$ to move in. We want to find a good number $\alpha_n > 0$ and define $x_{n+1} = x_n - \alpha_n w_n$.
Exact line search: we find $\alpha_n$ which is a minimum of the function $\alpha \mapsto F(x_n - \alpha w_n)$.
Trust region: choose an $r_n > 0$ and define $w_n$ to attain $\min_{\|x_n - w\| \leq r_n} F(w)$ (or minimize some associated cost function at step $n$).
Armijo's condition: if $\langle w_n, \nabla F(x_n) \rangle$ is strictly positive, then there is $\alpha_0 > 0$ so that for all $0 \leq \alpha \leq \alpha_0$ we have $F(x_n - \alpha w_n) - F(x_n) + \langle \alpha w_n, \nabla F(x_n) \rangle / 3 \leq 0$ (i.e. the function value at the new point $x_n - \alpha w_n$ decreases enough).
Backtracking Armijo's algorithm: explain how.
Backtracking Armijo's algorithm is easy to implement, while working quite efficiently. Exact line search and trust region may be difficult to implement stably. Wolfe's conditions: more complicated but not better.
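Here is one common way to implement the backtracking Armijo line search (a sketch; the halving factor is my own choice, and $c = 1/3$ matches the constant on the slide).

```python
import numpy as np

def backtracking_armijo(F, gradF, x, w, alpha0=1.0, beta=0.5, c=1.0 / 3.0):
    # Start with alpha0 and shrink alpha by the factor beta until Armijo's condition holds:
    # F(x - alpha*w) <= F(x) - c * alpha * <w, gradF(x)>.
    g = gradF(x)
    alpha = alpha0
    while F(x - alpha * w) > F(x) - c * alpha * np.dot(w, g):
        alpha *= beta
    return alpha

# Toy usage with the gradient descent direction w = gradF(x):
F = lambda x: np.sum(x ** 2)
gradF = lambda x: 2 * x
x = np.array([3.0, -4.0])
w = gradF(x)
alpha = backtracking_armijo(F, gradF, x, w)
print(alpha, x - alpha * w)
```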
§2.6 Optimization (cont. 7)
Gradient boosting: this is a different boosting mechanism from the one described on the previous slide. E.g. Gradient Boosted Trees. Here we do not modify the point $x$, but modify the model $g_0$!
The idea is that, for example, after training we already have a good model $g_0$. But now we want to modify it into a new model $g$, which may work better.
Setting: let our data $I$ consist of $x_1, \ldots, x_m$. At $x_i$ let $y_i$ be the correct value. Let the metric be $d$. For any model $g$, we look at the cost function $F(g) = \sum_{i \in I} d(y_i, g(x_i))$. Note that $F(g)$ depends only on the values $g(x_i)$, for $i \in I$. Hence, we can define $z_i = g(x_i)$. Then our cost function becomes $F(z_1, \ldots, z_m)$. We can apply any optimization method to this, with the special modification that the initial point is fixed: it is $(g_0(x_1), \ldots, g_0(x_m))$.
Explain how to do this if we use Gradient Descent (or Backtracking Gradient Descent = Backtracking Armijo's method for Gradient Descent), and $d$ is the Euclidean distance.
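A minimal sketch of this function-space view with Gradient Descent and the Euclidean distance; the toy data and learning rate are my own, and the final comment indicates how tree-based boosting approximates the same step.

```python
import numpy as np

# Function-space gradient descent for F(z_1, ..., z_m) = sum_i (y_i - z_i)^2,
# with the initial point fixed to the predictions of the already-trained model g0.
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x)                 # "correct values" y_i (toy data)
g0 = lambda x: np.zeros_like(x)           # hypothetical starting model

z = g0(x)                                 # fixed initial point (g0(x_1), ..., g0(x_m))
lr = 0.1
for _ in range(50):
    grad_z = -2 * (y - z)                 # gradient of F with respect to the values z_i
    z = z - lr * grad_z                   # gradient descent step in function space

print(np.max(np.abs(z - y)))              # the boosted values approach the y_i
# In Gradient Boosted Trees, each correction -lr * grad_z (the residuals, up to a factor)
# is approximated by a small regression tree and added to the current model.
```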
§2.6 Optimization (cont. 8)
Stochastic optimization: here we formalise the mini-batch practice in Deep Neural Networks.
Our cost function $F$ depends on an extra variable $\xi$, so $F(x, \xi)$. The expectation is $E(F(x, \xi)) = G(x)$.
At each step $n$, we generate a new value $\xi_n$ based on the given distribution for $\xi$, and run our algorithm $x_{n+1} = IM(F(\cdot, \xi_n), x_n)$. We hope $x_n$ "converges" to a minimum of $G(x)$. (We will discuss later the delicate issue of convergence in stochastic optimization.)
If $F(x, \xi) = (\xi + 1)x^2 - 2x$, where $\xi$ has the standard normal distribution, then $E(\xi) = 0$ and hence $G(x) = E(F(x, \xi)) = x^2 - 2x$. $G(x)$ attains its minimum value $-1$ at $x = 1$. If we use Gradient Descent, then $x_{n+1} = x_n - [2(\xi_n + 1)x_n - 2]$.
Test with the Optimization link. Observation?
Similarly for $F(x, \xi) = (10\xi + 10)x^2 - 2x$. Other functions, like $x^2 - 2(1 + \xi)x$?
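Since we cannot run the Optimization app here, the following small simulation (my own sketch) lets one make the same observation numerically.

```python
import numpy as np

# Simulate x_{n+1} = x_n - lr * dF/dx(x_n, xi_n) for F(x, xi) = (xi + 1) x^2 - 2 x,
# with xi_n standard normal, so G(x) = E F(x, xi) = x^2 - 2x has its minimum at x = 1.
rng = np.random.default_rng(0)

def run(lr, steps, x=0.0):
    for _ in range(steps):
        xi = rng.normal()
        x = x - lr * (2 * (xi + 1) * x - 2)   # stochastic gradient step
    return x

print(run(lr=1.0, steps=50))      # the slide's update (learning rate 1): typically jumps around wildly
print(run(lr=0.01, steps=2000))   # a small learning rate: ends up near the minimizer x = 1 (up to noise)
```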
§2.6 Optimization (cont. 9)
Constrained optimization: minimize a function $F(x)$ under some constraints.
Projected algorithms: assume that the constraint set $S$ is convex, and that you already have an iterative method $x_{n+1} = IM(F, x_n)$. Then the projected version is $x_{n+1} = pr_S(IM(F, x_n))$, where $pr_S(x)$ is the projection of the point $x$ to the set $S$, i.e. the point in $S$ which is closest to $x$.
If the set $S$ is complicated, then it may be difficult to check whether it is convex, or to calculate the projection explicitly.
Lagrange multiplier: if the constraints are equalities, then the Lagrange multiplier method reduces to finding critical points of a new function, i.e. to finding roots of a system of equations. https://en.wikipedia.org/wiki/Lagrange_multiplier
Karush-Kuhn-Tucker conditions: generalization of the Lagrange multiplier method, where inequality constraints are allowed.
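A minimal sketch of a projected algorithm, taking the unit ball as the convex constraint set $S$; the example function and step size are my own.

```python
import numpy as np

# Projected gradient descent: minimize F(x) = ||x - c||^2 over the unit ball S,
# by taking a gradient step and then projecting back onto S.
c = np.array([2.0, 1.0])                      # the unconstrained minimizer, outside S
F_grad = lambda x: 2 * (x - c)

def project_unit_ball(x):
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n           # the point of S closest to x

x = np.zeros(2)
for _ in range(100):
    x = project_unit_ball(x - 0.1 * F_grad(x))

print(x, np.linalg.norm(x))                   # approx c/||c||, on the boundary of S
```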
§2.6 Optimization (cont. 10)
These can be useful when the minima lie on the boundary of the domain. On the other hand, the complexity may be high.
Solving the Lagrange multiplier or KKT conditions = finding roots of a system of equations. Explain why!
On the other hand, root finding can be considered as a special case of the Least Squares Problem. Explain why (sum of squares of the functions in the system)!
Replacing the cost function by a new cost function (barrier methods): for example, one can add a term to the cost function which becomes infinite at the boundary of $S$, for example $F(x) - \log \mathrm{dist}(x, \partial S)$.
Minima of the new function may not have any relation to those of the original function. Some special designs of the new function may help.
Constrained optimization also appears naturally in optimization on submanifolds of an ambient Euclidean space, like spheres. In that case, Riemannian geometry may be helpful.
§2.7 Dynamical Systems - ODE
§2.7 Dynamical Systems - ODE (cont. 2)
§2.7 Dynamical Systems - ODE (cont. 3)
§2.7 Dynamical Systems - ODE (cont. 4)
§2.7 Dynamical Systems - ODE (cont. 5)
If $F$ has only countably many saddle points, then one can show that we can avoid all saddle points. See: J. D. Lee, M. Simchowitz, M. I. Jordan and B. Recht, Gradient descent only converges to minimizers, JMLR: Workshop and Conference Proceedings, vol 49 (2016), 1-12.
For the general case, an idea is to use Lindelöf's lemma in analysis, see I. Panageas and G. Piliouras, Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions, 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), article no 2, 2:1-2:12.
After this, there have been some additional improvements.
However, it is unknown whether Backtracking Gradient Descent can avoid saddle points.
This is an interesting and useful question, since Backtracking Gradient Descent has much better theoretical guarantees compared to Gradient Descent and other variants.
§2.7 Dynamical Systems - ODE (cont. 6)
How about Newton's method? Coming back again to the example $F(x, y) = x^2 - y^2$: Newton's method $H(x) = x - (\nabla^2 F(x))^{-1} \cdot \nabla F(x)$ will converge to the saddle point $(0, 0)$.
How about we change the matrix $\nabla^2 F(x)$ to another matrix which has only positive eigenvalues, by changing the signs of the negative eigenvalues of $\nabla^2 F(x)$? Do the calculations to see that this indeed works: we can avoid the saddle point!
In general, one needs some additional ideas. New Q-Newton's method is a variant of Newton's method which keeps the good properties of Newton's method and also avoids saddle points. See https://link.springer.com/article/10.1007/s10957-023-02270-9
Backtracking New Q-Newton's method is a variant of Newton's method which keeps the good properties of New Q-Newton's method while having a better global convergence guarantee, see https://arxiv.org/abs/2209.05378
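A quick numerical check of the sign-flipping idea on this example (this only illustrates the idea; it is not the full New Q-Newton's method, which involves further modifications):

```python
import numpy as np

# F(x, y) = x^2 - y^2, with a saddle at the origin.
grad = lambda p: np.array([2 * p[0], -2 * p[1]])
hess = lambda p: np.diag([2.0, -2.0])

def modified_newton_step(p):
    # Replace the Hessian's negative eigenvalues by their absolute values before inverting.
    lam, Q = np.linalg.eigh(hess(p))
    H_abs = Q @ np.diag(np.abs(lam)) @ Q.T
    return p - np.linalg.solve(H_abs, grad(p))

p = np.array([1.0, 0.5])
print(p - np.linalg.solve(hess(p), grad(p)))  # plain Newton: jumps to the saddle (0, 0)
for _ in range(5):
    p = modified_newton_step(p)
print(p)  # y grows: the iterates move away from the saddle (F is unbounded below in y)
```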
§2.7 Dynamical Systems - ODE (cont. 7)
If we have an iterative method x_{n+1} = x_n + Q(x_n), then replacing x_{n+1} − x_n by dx(t)/dt, and x_n by x(t), we obtain the ODE x′(t) = Q(x(t)).
We want to know: given x_0, does the solution of x′(t) = Q(x(t)) with x(0) = x_0 have a limit lim_{t→∞} x(t)?
Gradient descent: we get the gradient flow x′(t) = −∇F(x(t)).
Newton’s method: we get Newton’s flow x′(t) = −(∇²F(x(t)))⁻¹∇F(x(t)).
Usually, the flow has better properties than the discrete version. Therefore, to study a discrete iterative method, one can start by studying its flow, to get some initial ideas.
On the other hand, showing that the flow has good properties does not necessarily mean that the discrete version has the same good properties. Also, in practice one cannot use flows, only discrete versions (e.g. Euler’s scheme for solving ODEs); see the sketch after this slide.
Some people use this correspondence (ODE vs. discrete version) to study CNNs.
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
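A small illustration of the correspondence (an illustrative toy; the matrix A and the step sizes are arbitrary choices): gradient descent is exactly the explicit Euler discretisation of the gradient flow x′(t) = −∇F(x(t)), with the learning rate playing the role of the time step.

```python
import numpy as np

# Gradient descent on F(x) = 0.5 * x^T A x versus an Euler discretisation of the
# gradient flow x'(t) = -grad F(x(t)). Both are the same scheme; only the step
# size (learning rate vs. dt) differs.

A = np.array([[3.0, 0.0], [0.0, 1.0]])
grad_F = lambda x: A @ x

def gradient_descent(x0, lr, steps):
    x = x0.copy()
    for _ in range(steps):
        x = x - lr * grad_F(x)            # x_{n+1} = x_n + Q(x_n) with Q = -lr * grad F
    return x

def gradient_flow_euler(x0, dt, n_steps):
    x = x0.copy()
    for _ in range(n_steps):
        x = x + dt * (-grad_F(x))         # Euler step for the gradient flow
    return x

x0 = np.array([1.0, 2.0])
print(gradient_descent(x0, lr=0.1, steps=50))          # total "time" 5 with coarse steps
print(gradient_flow_euler(x0, dt=0.01, n_steps=500))   # total "time" 5 with finer steps
```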
§2.8 Metric/Topology
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.8 Metric/Topology (cont. 2)
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.8 Metric/Topology (cont. 3)
Lp Regularization
A common technique in Machine Learning/Deep Learning is to add an Lp metric to the main metric: d(x, y) + ||x − y||_p.
It is observed in many experiments that this can help improve performance. Can we explain why?
One explanation is that this process makes the cost function a Morse function, and there are algorithms which are guaranteed to find local minima of Morse functions. (Recall: a function is Morse if all its critical points are non-degenerate, i.e. if ∇F(x*) = 0 then ∇²F(x*) is an invertible matrix.) A small illustration is given after this slide.
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
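A toy illustration of the Morse-function explanation (an illustrative example; the weight λ is an arbitrary choice): F(x) = x⁴ has a degenerate critical point at 0, while adding an L2-type penalty makes the critical point non-degenerate.

```python
import sympy as sp

x = sp.symbols('x', real=True)
lam = sp.Rational(1, 10)                 # illustrative regularization weight

F = x**4                                 # F'(0) = F''(0) = 0: degenerate critical point, not Morse
G = x**4 + lam * x**2                    # F plus an L2-type penalty

print(sp.diff(F, x, 2).subs(x, 0))                  # 0  -> degenerate
print(sp.solveset(sp.diff(G, x), x, sp.S.Reals))    # {0}: the only real critical point of G
print(sp.diff(G, x, 2).subs(x, 0))                  # 2*lam = 1/5 > 0 -> non-degenerate, so G is Morse
```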
§2.8 Metric/Topology (cont. 4)
An important convergence theorem: M. D. Asic and D. D. Adamovic, Limit points of sequences in metric spaces, The American Mathematical Monthly, vol. 77, no. 6 (June–July 1970), 613–616.
Let (X, d) be a compact metric space, and let {x_n} be a sequence in X. Assume that lim_{n→∞} d(x_n, x_{n+1}) = 0. Let S be the set of cluster points of {x_n} (i.e. points x* such that there is a subsequence {x_{n_k}} converging to x*). Then S is a connected set. (A numerical illustration is given after this slide.)
An application is as follows. In many algorithms, one can show that lim_{n→∞} ||x_{n+1} − x_n|| = 0 for the sequence constructed by the algorithm.
If all cluster points of {x_n} are critical points of F, all critical points of F are isolated (e.g. if F is Morse), and one can show that the set {x_n} is bounded, then the above theorem shows that if S is non-empty, the whole sequence {x_n} converges to a point x*.
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
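A numerical illustration of the theorem (an illustrative toy sequence, not from the slides): x_n = sin(√n) lives in the compact space [−1, 1], satisfies d(x_n, x_{n+1}) → 0, and its cluster set is the whole connected interval [−1, 1]. This is consistent with the theorem, and shows why the extra hypotheses (e.g. isolated critical points) are needed to conclude convergence to a single point.

```python
import numpy as np

n = np.arange(1, 2_000_000)
x = np.sin(np.sqrt(n))                         # a sequence in the compact metric space [-1, 1]

print(np.abs(np.diff(x[-10_000:])).max())      # consecutive differences are already tiny
print(x[-100_000:].min(), x[-100_000:].max())  # yet the late iterates still sweep out essentially all of [-1, 1]
```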
§2.8 Metric/Topology (cont. 5)
This is basically a common scheme for proving convergence of certain optimization algorithms in the literature.
However, the requirement that {x_n} be bounded is quite restrictive (for example, a common sufficient condition is that F has compact sublevel sets). Can we overcome this?
Yes, see
Tuyen Trung Truong and Hang-Tuan Nguyen, Backtracking
gradient descent method and some applications in large scale
optimisation, Part 2: Algorithms and experiments, Applied
Mathematics and Optimization, 84 (2021), no 3, 2557-2586.
Tuyen Trung Truong and Hang-Tuan Nguyen, Backtracking
gradient descent method and some applications in large scale
optimisation. Part 1: Theory, Minimax Theory and its
Applications, 7 (2022), no 1, 79-108.
The main idea is to embed R^m as a topological subspace (but not a metric subspace) of the compact metric space PR^m (real projective space). A sketch of such a compactification is given after this slide.
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
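A rough sketch of such a compactification (an illustrative choice; consult the cited papers for the precise metric used there): map x ∈ R^m to the line through (x, 1) in R^{m+1}, i.e. a point of PR^m, and measure distances by the angle between lines. The distance is bounded, so sequences escaping to infinity in R^m can still have convergent subsequences in PR^m.

```python
import numpy as np

def to_projective(x):
    v = np.append(x, 1.0)
    return v / np.linalg.norm(v)           # unit representative of the line through (x, 1)

def proj_dist(x, y):
    u, v = to_projective(x), to_projective(y)
    c = np.clip(abs(u @ v), 0.0, 1.0)      # |cos(angle)|: a line is defined only up to sign
    return np.arccos(c)

x = np.array([1.0, 2.0])
print(proj_dist(x, x + 0.01))              # nearby points of R^m stay close
print(proj_dist(1e6 * x, 1e9 * x))         # points running to infinity along a ray become close in PR^m
```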
§2.8 Metric/Topology (cont. 6)
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.9 Semi-algebraic/Semi-analytic geometry
Image from
https://www.pas.va/en/academicians/deceased/lojasiewicz.html
§2.9 Semi-algebraic/Semi-analytic geometry (cont. 2)
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.9 Semi-algebraic/Semi-analytic geometry (cont. 3)
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.10 Riemannian geometry
Image from
https://vccvisualization.org/RiemannianGeometryTutorial/
§2.10 Riemannian geometry (cont. 2)
The Netflix Prize problem can be connected to matrix completion on a subset of matrices (https://arxiv.org/pdf/0903.1476), for which Riemannian geometry can be utilised.
The Fisher information metric is a natural choice of Riemannian metric on a space of parametrised statistical models (such as Deep Neural Networks).
S. Amari and H. Nagaoka, Methods of information geometry, Translations of Mathematical Monographs, AMS and Oxford University Press, 2000.
Many constrained optimization problems have constraints given by a submanifold of R^m, for which methods from Riemannian geometry may be useful, and sometimes better than Euclidean ones.
Even unconstrained problems can be stated in terms of Riemannian geometry. E.g.: finding eigenvalue/eigenvector pairs of a matrix. Explain how. (One formulation is sketched after this slide.)
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
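The slide leaves “Explain how” open; one standard formulation (a sketch, not necessarily the one intended in the lecture) is via the Rayleigh quotient: for a symmetric matrix A, restrict F(x) = xᵀAx to the unit sphere S^{m−1}, viewed as a Riemannian submanifold of R^m. The Riemannian gradient is the tangential projection of the Euclidean gradient, and its zeros are exactly the eigenvectors of A:

```latex
\min_{\|x\|=1} F(x), \quad F(x) = x^{\top} A x, \qquad
\operatorname{grad} F(x) = 2\bigl(Ax - (x^{\top} A x)\,x\bigr), \qquad
\operatorname{grad} F(x) = 0 \iff Ax = (x^{\top} A x)\,x .
```

In particular, the minimum (resp. maximum) of F on the sphere is the smallest (resp. largest) eigenvalue of A.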
§2.10 Riemannian geometry (cont. 3)
https://math.stackexchange.com/questions/2228747/definition-of-gradient-of-a-function-f-in-riemannian-manifold
Similarly, we can compute the Hessian of a real function F : X → R:
https://math.stackexchange.com/questions/81203/definitions-of-hessian-in-riemannian-geometry
It seems we have enough to define Gradient Descent and Newton’s method in the Riemannian geometry setting. However, there is an obstacle. What is it?
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.10 Riemannian geometry (cont. 4)
The obstacle is this.
We have the point x_n and the gradient grad(F)(x_n). Note that the gradient is a vector field, hence grad(F)(x_n) is a vector in the tangent space T_{x_n}X.
If X = R^m, then there is a natural identification between T_{x_n}R^m and R^m, so we can subtract the vector from the point to get a new point x_{n+1}. This is how Gradient Descent is defined on R^m, and it is also the reason we can define Newton’s method on R^m.
On a general manifold, there is no canonical way to combine a point and a tangent vector to get a new point.
Exponential map: an idea from Differential Geometry which makes it possible to combine a point and a tangent vector into a new point is the exponential map.
Any point on a manifold of real dimension m has a small neighbourhood which is isomorphic to a small ball in R^m.
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.10 Riemannian geometry (cont. 5)
But such an isomorphism may be arbitrary.
The exponential map is an intrinsic way to construct such isomorphisms.
(Definition-Theorem) Let v ∈ T_x X be a tangent vector with ||v|| ≤ r, where r is a small enough positive number. Then there is a unique geodesic γ_v : [0, 1] → X such that γ_v(0) = x and γ_v′(0) = v. The exponential map is defined as v ↦ γ_v(1), written exp_x(v).
How to prove this? The geodesic satisfies a certain ODE, so we can apply existence and uniqueness results for ODEs.
How to define Gradient Descent on Riemannian manifolds? x_{n+1} = exp_{x_n}(−r · grad(F)(x_n)/||grad(F)(x_n)||) (here r depends on x_n). Why not use exp_{x_n}(−grad(F)(x_n)), which is simpler? (A sketch on the sphere is given after this slide.)
Similarly, one can define Newton’s method on Riemannian manifolds.
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
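A minimal sketch of Riemannian gradient descent with an explicit exponential map (an illustrative setup: a random symmetric matrix, and for simplicity a fixed small step −step·grad F instead of the normalised step −r·grad F/||grad F|| in the slide): on the unit sphere, exp_x(v) = cos(||v||)x + sin(||v||)v/||v||, and minimising F(x) = xᵀAx recovers the smallest eigenvalue of A.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
A = (A + A.T) / 2                          # symmetric test matrix (illustrative)

def riemannian_grad(x):
    g = 2 * A @ x                          # Euclidean gradient of x^T A x
    return g - (x @ g) * x                 # project onto the tangent space T_x S^{m-1}

def exp_map(x, v):
    nv = np.linalg.norm(v)
    if nv < 1e-14:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)    # exponential map of the round sphere

x = rng.standard_normal(5)
x /= np.linalg.norm(x)
for _ in range(500):
    x = exp_map(x, -0.05 * riemannian_grad(x))       # x_{n+1} = exp_{x_n}(-step * grad F(x_n))

print(x @ A @ x)                           # approximately the smallest eigenvalue of A
print(np.linalg.eigh(A)[0][0])             # reference value from numpy
```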
§2.10 Riemannian geometry (cont. 6)
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.10 Riemannian geometry (cont. 7)
Over Euclidean space, Gradient Descent and Newton’s method do not have good global convergence guarantees. The same holds for the Riemannian versions. How can we boost the convergence?
We need stronger properties: “strong local retractions” and “real analytic-like local retractions”. For details see https://arxiv.org/pdf/2008.11091
Example: finding eigenvalues/eigenvectors of symmetric real matrices.
How do we prove convergence in the Riemannian manifold setting?
Nash-Kuiper embedding theorems: https://en.wikipedia.org/wiki/Nash_embedding_theorems. In particular, isometric embeddings allow one to reduce the convergence problem to one about subsets of Euclidean spaces.
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.11 Mathematical education
Image from https://www.curry.edu/academics/undergraduate-degrees-and-programs/science-and-mathematics-degrees-and-programs/mathematics-education-minor
§2.11 Mathematical Education (cont. 2)
The ultimate dream is to create AGI (Artificial General Intelligence): roughly speaking, machines which are as intelligent as humans.
LLMs (Large Language Models) seem to be a promising direction to pursue.
However, LLMs, like other Deep Learning models, work on probability.
This is good for tasks that do not require precise answers, e.g. art, image classification, assistants, ...
It is not good for mathematics and logic.
E.g. no LLM can always solve linear equations in 1 real variable correctly. E.g. 2(−3(4x + 1) + 10) − 9(2(5(100x + 3) − 27) + 4) = 8(7(6x + 5) − 4) − 73 (if an LLM can solve this equation, add more parentheses and it will fail!). Run with Mistral AI Chat. (An exact symbolic solution is sketched after this slide.)
[More examples on mathematics and LLMs in the final section.]
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
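For contrast, an exact symbolic solution of the equation above takes one line with a computer algebra system (the equation is from the slide; using sympy for it is an illustrative choice):

```python
import sympy as sp

x = sp.symbols('x')
lhs = 2*(-3*(4*x + 1) + 10) - 9*(2*(5*(100*x + 3) - 27) + 4)
rhs = 8*(7*(6*x + 5) - 4) - 73
print(sp.solve(sp.Eq(lhs, rhs), x))   # [19/9360]: exact, no matter how deeply nested the parentheses are
```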
§2.11 Mathematical Education (cont. 3)
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§2.11 Mathematical Education (cont. 4)
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§3 Looking forward - Some open questions
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§3 Looking forward - Some open questions (cont. 2)
Can’t AI already solve Olympiad math questions, and isn’t it therefore intelligent?
In the news: AlphaProof and AlphaGeometry 2 correctly solved 4 problems at the 2024 IMO, which is equivalent to a silver medal.
Since the code/papers for AlphaProof and AlphaGeometry 2 are not available, I cannot comment in detail. However, reading the blog post by DeepMind, it seems that AlphaGeometry 2 is similar to AlphaGeometry (from last year), and perhaps also to FunSearch.
https://www.nature.com/articles/s41586-023-06747-5
https://deepmind.google/discover/blog/funsearch-making-new-discoveries-in-mathematical-sciences-using-large-language-models/
As we have seen, LLMs cannot reliably solve linear equations in 1 variable. How then do AlphaGeometry or FunSearch manage?
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§3 Looking forward - Some open questions (cont. 3)
The trick is that in high school geometry there are several common auxiliary constructions which can help solve a problem. Usually, you do not need more than 10 such auxiliary constructions to solve a problem.
In the AlphaGeometry paper, they have a list of about 50 such constructions.
So if you restrict yourself to at most 10 auxiliary constructions, you have about 10^50 choices. There is no hope of searching all possibilities.
What an LLM can help with is that it can classify into a finite number of categories well, after training on big and representative datasets.
E.g. it can classify whether each sentence in a paragraph has a “positive”, “negative” or “neutral” meaning (3 categories). Run the example with Mistral AI Chat.
Now, AlphaGeometry is more complicated, but the main idea is similar.
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§3 Looking forward - Some open questions (cont. 4)
One direction in AlphaGeometry is to combine it with formal proof checking: they send their proof proposal to a Proof Assistant to check whether it is correct.
Another direction is to utilise the ability of computers to work in parallel: basically, they have something like 1000 computers, and each computer proposes a different proof proposal.
Both directions are on the right track: AI should be combined with formal computations in order to do mathematics.
However, there are differences from humans.
Humans do not train on millions of problems (and do not have the time to do so).
For a fair comparison with AlphaGeometry’s use of parallel computation, we should have 1000 people working together.
Also, the people should only need to come up with proof proposals and send them to a Proof Assistant to check, like AlphaGeometry does.
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§3 Looking forward - Some open questions (cont. 5)
Not to mention that computers do computations much faster than humans. Also, AlphaGeometry can only do planar geometry questions; if one wants it to do another type of question, one needs to train another AI.
So we are probably on the right track, but significant/radical changes in how we train AI will be needed: we need to reduce the required resources, the cost, and so on.
Moreover, if one can state a high school problem in terms of algebraic equations, then formal tools like Gröbner bases can solve it efficiently. One does not need to train on millions of data points, and one does not need to propose a thousand different proof proposals, ... (A small Gröbner-basis sketch is given after this slide.)
The AlphaGeometry paper claims that Gröbner bases cannot solve these planar geometry problems (in a reasonable time), but that claim does not match what I know about Gröbner bases and these planar geometry problems! This needs verification by an independent committee.
⇒ We need a stronger reason for why an LLM is needed.
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
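To illustrate the Gröbner-basis route (an illustrative toy, far simpler than IMO problems): Thales’ theorem, “if C lies on the circle with diameter AB then the angle ACB is right”, reduces to checking that the conclusion polynomial lies in the ideal generated by the hypothesis polynomial, which sympy can do by reducing modulo a Gröbner basis.

```python
import sympy as sp

x, y = sp.symbols('x y')

# Hypothesis: A = (-1, 0), B = (1, 0), and C = (x, y) lies on the circle with diameter AB.
hyp = [x**2 + y**2 - 1]
# Conclusion: CA . CB = 0, i.e. the angle at C is a right angle.
concl = (-1 - x)*(1 - x) + (0 - y)*(0 - y)

G = sp.groebner(hyp, x, y)
q, r = sp.reduced(concl, list(G), x, y)
print(r)   # 0: the conclusion reduces to zero modulo the hypotheses, so the theorem holds
```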
§3 Looking forward - Some open questions (cont. 6)
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§3 Looking forward - Some open questions (cont. 7)
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
§3 Looking forward - Some open questions (cont. 8)
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning
Thank you very much for your attention!
Tuyen Trung Truong Mathematics, optimization & Deep Learning, Machine Learning