DSA5105 Lecture 5
Soufiane Hayou
Department of Mathematics
What we’ve seen so far
So far, we have introduced two types of hypothesis spaces.
Main difference:
1. The feature maps are fixed
Examples: linear models, linear basis models, SVM
2. The feature maps are adapted to the data
Examples: standard/boosted/bagged decision trees
In this lecture, we introduce another class belonging to type 2: neural networks.
Shallow Neural Networks
History
Neural networks originated from an attempt to model the collective interaction of biological neurons.
https://towardsdatascience.com/the-differences-between-artificial-and-biological-neural-networks-a8b46db828b7
Neural Networks for Regression
The neural network (NN) hypothesis space is quite similar to that of linear basis models:
    f(x) = Σ_{j=1}^m a_j σ(w_jᵀ x + b_j)
Trainable variables:
• w_j are the weights of the hidden layer
• b_j are the biases of the hidden layer
• a_j are the weights of the output layer
• σ is the activation function (chosen in advance; a numpy sketch of the full model follows below)
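A minimal numpy sketch of this hypothesis space (the width m = 8, input dimension d = 3, and the choice of tanh as σ are illustrative, not from the lecture):

```python
import numpy as np

def shallow_nn(x, W, b, a, act=np.tanh):
    # f(x) = sum_j a_j * act(w_j . x + b_j)
    return a @ act(W @ x + b)

rng = np.random.default_rng(0)
d, m = 3, 8                       # input dimension and number of hidden neurons
W = rng.normal(size=(m, d))       # hidden-layer weights w_j (rows of W)
b = rng.normal(size=m)            # hidden-layer biases b_j
a = rng.normal(size=m)            # output-layer weights a_j

x = rng.normal(size=d)
print(shallow_nn(x, W, b, a))
```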
Activation Functions
• Sigmoid: σ(z) = 1 / (1 + e^{−z})
• Tanh: σ(z) = tanh(z)
• Leaky-ReLU: σ(z) = z for z ≥ 0 and αz for z < 0, with a small slope α > 0
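The three activations written as numpy functions (the Leaky-ReLU slope α = 0.01 is a common but illustrative default):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def leaky_relu(z, alpha=0.01):
    return np.where(z >= 0, z, alpha * z)
```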
Universal Approximation Theorem
One of the foundational results for neural networks is the universal approximation theorem. Informally: a shallow network with enough hidden neurons can approximate any continuous function on a compact set to arbitrary accuracy.
[Figure: the hypothesis space 𝓗, the oracle f*, and its best approximation f̃ ∈ 𝓗; the theorem says the distance between f* and 𝓗 can be made arbitrarily small]
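For reference, here is a standard formulation of the theorem (sigmoidal case: Cybenko, 1989; general non-polynomial activations: Leshno et al., 1993); the lecture's exact statement may differ:

```latex
% Universal approximation theorem (standard form)
Let $\sigma$ be continuous and non-polynomial, and let $K \subset \mathbb{R}^d$ be compact.
For every continuous $f^\ast : K \to \mathbb{R}$ and every $\varepsilon > 0$ there exist
$m \in \mathbb{N}$ and parameters $a_j, b_j \in \mathbb{R}$, $w_j \in \mathbb{R}^d$ such that
\[
  \sup_{x \in K} \Bigl|\, f^\ast(x) - \sum_{j=1}^{m} a_j\, \sigma\!\bigl(w_j^\top x + b_j\bigr) \Bigr| < \varepsilon .
\]
```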
“Proof” in a Special Case
Let us consider a 1D continuous function f* on the unit interval [0, 1].
Step 1: Approximate f* by a step function
Step 2: Use two sigmoids to make a step
Step 3: Sum the resulting steps (a numerical sketch follows below)
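A minimal numerical sketch of this construction; the target f*(x) = sin(2πx), the number of steps, and the sigmoid steepness are my own illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(x, a, b, k=200.0):
    # Difference of two steep sigmoids ~ indicator of the interval [a, b)
    return sigmoid(k * (x - a)) - sigmoid(k * (x - b))

# Target: a 1D continuous function on [0, 1]
f_star = lambda x: np.sin(2 * np.pi * x)

x = np.linspace(0, 1, 1000)
m = 50                                  # number of steps (pairs of sigmoids)
edges = np.linspace(0, 1, m + 1)

# Steps 1-3: sum "bumps", each scaled by the function value at the midpoint
approx = sum(f_star(0.5 * (a + b)) * bump(x, a, b)
             for a, b in zip(edges[:-1], edges[1:]))

print("max error:", np.max(np.abs(f_star(x) - approx)))
```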
Curse of Dimensionality
Although this idea can be extended to high dimensions, it
introduces an issue.
How many patches of linear size ε are there in the unit cube [0, 1]^d?
• d = 1: 1/ε pieces
• d = 2: 1/ε² pieces
• General d: ε^{-d} pieces
Even if, somehow, we only needed a constant number of neurons to approximate each piece, we would still need on the order of ε^{-d} neurons! (see the small computation below)
This is known as the curse of dimensionality that plagues high
dimensional problems
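To make the growth concrete, a tiny computation with the illustrative choice ε = 0.1:

```python
eps = 0.1
for d in (1, 2, 10, 100):
    print(f"d = {d:>3}: about {eps**-d:.3g} patches")
```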
Do neural networks suffer from the curse of dimensionality?
Linear and Nonlinear Approximation
Linear vs Nonlinear Approximation
Recall:
u = u₁e₁ + u₂e₂ + u₃e₃
[Figure: the vector u ∈ ℝ³ and its components u₁, u₂, u₃ along the basis vectors e₁, e₂, e₃]
Suppose we can only use 2 coordinate axes, say e₁ and e₂.
What is the best approximation of u?
Example:
[Figure: a vector u and its projection û onto span{e₁, e₂}; the error is the discarded component of u along e₃]
• What if we can pick which two basis vectors to use after seeing u?
• What if we can pick two basis vectors from a much larger set? (see the numerical sketch below)
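A minimal sketch of the difference between the two settings (the vector u and the budget k = 2 are illustrative):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])   # coordinates of u in the basis {e1, e2, e3}
k = 2

# Linear approximation: the k axes are fixed in advance (say e1, e2)
u_lin = u.copy()
u_lin[k:] = 0.0

# Nonlinear approximation: pick the k best axes *after* seeing u
keep = np.argsort(np.abs(u))[-k:]
u_nonlin = np.zeros_like(u)
u_nonlin[keep] = u[keep]

print("linear error:   ", np.linalg.norm(u - u_lin))      # 3.0
print("nonlinear error:", np.linalg.norm(u - u_nonlin))   # 1.0
```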
Functions behave just like vectors!
• Each basis function, e.g. each neuron x ↦ σ(wᵀx + b), is like a coordinate axis: they play the role of the e_j.
Important difference: there are infinitely many of them
• The oracle function f* plays the role of u
Writing f̃ for the best approximation of f* in 𝓗:
[Figure: the hypothesis space 𝓗 with functions f₀ and f̂, and the best approximation f̃ ≈ f* of the oracle f*]
Optimization (using gradient-based methods)
Empirical Risk Minimization for Neural Networks
We can parameterize the hypothesis space by the trainable variables θ = (W, b, a) and minimize the empirical risk R̂(θ) = (1/N) Σ_{i=1}^N ℓ(f(x_i; θ), y_i).
Two choices:
• Solve the stationarity condition ∇_θ R̂(θ) = 0 directly (rarely possible in closed form for NNs)
• Use an iterative method, e.g. gradient descent
The Effect of Learning Rate
Look at the GD iteration
θ_{k+1} = θ_k − η ∇R̂(θ_k),   where η > 0 is the learning rate.
[Figure: GD iterates approaching the solution; how quickly (or whether) they converge depends on the learning rate η]
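A minimal sketch of the learning-rate effect on a 1D quadratic (the objective R(θ) = ½θ² and the step sizes are illustrative):

```python
# Objective: R(theta) = 0.5 * theta^2, gradient: theta; minimum at theta* = 0
grad = lambda theta: theta

for eta in (0.1, 1.0, 1.9, 2.1):        # small, ideal, oscillating, diverging
    theta = 5.0
    for _ in range(20):
        theta = theta - eta * grad(theta)   # GD iteration
    print(f"eta = {eta}: theta after 20 steps = {theta:.4g}")
```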
Convergence of GD
Provided the learning rate η is small enough, it can be shown that GD converges to a stationary point, i.e. ∇R̂(θ_k) → 0.
A stationary point is a global minimum of R̂ if R̂ is convex.
Definition:
A function g is convex if
g(λx + (1 − λ)y) ≤ λ g(x) + (1 − λ) g(y)   for all x, y and all λ ∈ [0, 1].
Geometric meaning? The chord joining any two points of the graph lies above the graph.
Examples
Convex:
Non-convex:
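Since the slide's examples are not preserved here, the functions below (x², which is convex, and sin, which is not) are my own illustrative choices; the snippet just samples the defining inequality numerically:

```python
import numpy as np

def looks_convex(g, lo=-3.0, hi=3.0, n=200, seed=0):
    # Sample the defining inequality g(l*x + (1-l)*y) <= l*g(x) + (1-l)*g(y)
    rng = np.random.default_rng(seed)
    x, y = rng.uniform(lo, hi, (2, n))
    lam = rng.uniform(0.0, 1.0, n)
    lhs = g(lam * x + (1 - lam) * y)
    rhs = lam * g(x) + (1 - lam) * g(y)
    return bool(np.all(lhs <= rhs + 1e-12))

print(looks_convex(lambda t: t**2))   # True  (convex)
print(looks_convex(np.sin))           # False (non-convex)
```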
Important Property
If g is convex, then all local minima are also global!
Proof by picture:
[Figure: proof by picture, a local minimum of a convex function must also be global]
GD on Convex Functions
When R̂ is convex, GD finds a global minimum. In fact, there is a rate estimate! (a standard form is given below)
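A standard form of such a rate estimate, assuming R̂ is convex and L-smooth and the step size is η = 1/L (the lecture's constants may differ):

```latex
% GD on a convex, L-smooth objective \hat{R} with constant step size \eta = 1/L
\[
  \hat{R}(\theta_k) - \hat{R}(\theta^\ast)
  \;\le\; \frac{L\,\lVert \theta_0 - \theta^\ast \rVert^2}{2k},
  \qquad k = 1, 2, \dots
\]
```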
When is R̂ convex?
Is R̂(θ) convex in the trainable parameters θ for
• Linear Basis Models?
• SVM?
• Neural Networks?
Stochastic Gradient Descent
GD is an optimization algorithm for general differentiable
functions, but in empirical risk minimization we have some
structure
Challenges to GD?
• ∇R̂(θ) = (1/N) Σ_{i=1}^N ∇_θ ℓ(f(x_i; θ), y_i), so a gradient evaluation requires a summation of N terms
• This is very expensive when N is large
Stochastic gradient descent (SGD) relies on the following idea: at each step, we take a random sub-sample (mini-batch) B ⊂ {1, …, N} of the dataset and use its gradient as an approximation of the full gradient.
Total objective: R̂(θ) = (1/N) Σ_{i=1}^N ℓ(f(x_i; θ), y_i)
Mini-batch approximation: ∇R̂(θ) ≈ (1/|B|) Σ_{i∈B} ∇_θ ℓ(f(x_i; θ), y_i)
SGD vs GD dynamics?
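A minimal sketch contrasting full-gradient GD with mini-batch SGD on a least-squares problem (the data, batch size, and step size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=N)

def full_grad(theta):
    # Gradient of (1/N) * sum_i 0.5 * (x_i . theta - y_i)^2
    return X.T @ (X @ theta - y) / N

def minibatch_grad(theta, batch_size=32):
    idx = rng.choice(N, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ theta - yb) / batch_size

eta, steps = 0.1, 500
theta_gd = np.zeros(d)
theta_sgd = np.zeros(d)
for _ in range(steps):
    theta_gd -= eta * full_grad(theta_gd)          # touches all N points per step
    theta_sgd -= eta * minibatch_grad(theta_sgd)   # touches only 32 points per step

print("GD  error:", np.linalg.norm(theta_gd - theta_true))
print("SGD error:", np.linalg.norm(theta_sgd - theta_true))
```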
Deep Neural Networks
Deep Neural Networks
Deep neural networks are an extension of shallow networks.
Idea: we stack many hidden layers together:
h^(0) = x,   h^(l) = σ(W^(l) h^(l−1) + b^(l)) for l = 1, …, L,   f(x) = W^(L+1) h^(L) + b^(L+1)
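A minimal numpy sketch of the forward pass (the layer sizes and the use of tanh are illustrative):

```python
import numpy as np

def forward(x, weights, biases, act=np.tanh):
    # Forward pass of a deep MLP: h^(l) = act(W^(l) h^(l-1) + b^(l)),
    # followed by a linear output layer.
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = act(W @ h + b)
    return weights[-1] @ h + biases[-1]

rng = np.random.default_rng(0)
sizes = [3, 16, 16, 1]          # input dim 3, two hidden layers, scalar output
weights = [rng.normal(scale=0.5, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

x = rng.normal(size=3)
print(forward(x, weights, biases))
```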
Optimizing Deep Neural Networks
Analogous to shallow NNs, deep NNs can also be optimized with
(stochastic) gradient descent.
We want to compute the gradients ∂R̂/∂W^(l) and ∂R̂/∂b^(l) for every layer l.
Generally, the loss depends on the parameters of layer l only through that layer's output h^(l), giving a backward recursion via the chain rule (backpropagation): the error signal at layer l is obtained from the one at layer l + 1, and the gradient with respect to W^(l) is the product of that error signal with the previous layer's activations h^(l−1).
So all gradients can be computed with one forward pass followed by one backward pass (a sketch follows below).
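A minimal backpropagation sketch for a network like the one above, using a squared loss on a single example; the notation and shapes are my own, and the finite-difference comparison at the end is only a sanity check:

```python
import numpy as np

def forward_backward(x, y, weights, biases):
    # Forward pass, storing pre-activations z and activations h
    hs, zs = [x], []
    for W, b in zip(weights[:-1], biases[:-1]):
        z = W @ hs[-1] + b
        zs.append(z)
        hs.append(np.tanh(z))
    out = weights[-1] @ hs[-1] + biases[-1]
    loss = 0.5 * np.sum((out - y) ** 2)

    # Backward pass: propagate the error signal delta from the output backwards
    grads_W = [None] * len(weights)
    grads_b = [None] * len(biases)
    delta = out - y                                   # dLoss/d(out)
    grads_W[-1] = np.outer(delta, hs[-1])
    grads_b[-1] = delta
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * (1 - np.tanh(zs[l]) ** 2)
        grads_W[l] = np.outer(delta, hs[l])
        grads_b[l] = delta
    return loss, grads_W, grads_b

rng = np.random.default_rng(0)
sizes = [3, 16, 16, 1]
weights = [rng.normal(scale=0.5, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
x, y = rng.normal(size=3), np.array([1.0])

loss, grads_W, _ = forward_backward(x, y, weights, biases)

# Sanity check one weight gradient against a finite difference
eps, (i, j) = 1e-6, (0, 0)
weights[0][i, j] += eps
loss_plus, _, _ = forward_backward(x, y, weights, biases)
print("backprop:", grads_W[0][i, j], " finite diff:", (loss_plus - loss) / eps)
```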
Summary
Approximation Properties of Neural Networks
• Nonlinear approximation: adapted to data
• Universal approximation property; adaptive (nonlinear) approximation can overcome the curse of dimensionality that plagues fixed-basis methods