Foundations of Machine Learning

DSA 5105 • Lecture 5

Soufiane Hayou
Department of Mathematics
What we’ve seen so far
So far, we introduced two types of hypothesis spaces, both containing functions built from feature maps φ_j (e.g. predictors of the form f̂(x) = Σ_j w_j φ_j(x)).

Main difference:
1. The feature maps φ_j are fixed
Examples: linear models, linear basis models, SVM
2. The feature maps φ_j are adapted to the data
Examples: standard/boosted/bagged decision trees
In this lecture, we introduce another class belonging to type 2: neural networks
Shallow Neural Networks
History
Neural networks originated from an attempt to model the collective interaction of biological neurons

https://towardsdatascience.com/the-differences-between-artificial-and-biological-neural-networks-a8b46db828b7
Neural Networks for Regression
The neural network (NN) hypothesis space is quite similar to that of linear basis models: a shallow network with m hidden neurons computes

f̂(x) = Σ_{j=1}^{m} a_j σ(w_j · x + b_j)

Trainable variables:
• w_j are the weights of the hidden layer
• b_j are the biases of the hidden layer
• a_j are the weights of the output layer
• σ is the activation function
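To make this concrete, here is a minimal numpy sketch of the shallow-network hypothesis (my own illustration, not taken from the slides; the function and variable names are assumptions):

import numpy as np

# Shallow network: f_hat(x) = sum_j a_j * sigma(w_j . x + b_j), evaluated on a batch.
def shallow_nn(X, W, b, a, sigma=np.tanh):
    # X: (n, d) inputs; W: (m, d) hidden weights; b: (m,) biases; a: (m,) output weights
    H = sigma(X @ W.T + b)   # hidden-layer activations, shape (n, m)
    return H @ a             # network outputs, shape (n,)

# Example with random parameters: d = 3 inputs, m = 10 hidden neurons
rng = np.random.default_rng(0)
d, m = 3, 10
W, b, a = rng.normal(size=(m, d)), rng.normal(size=m), rng.normal(size=m)
print(shallow_nn(rng.normal(size=(5, d)), W, b, a))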
Activation Functions
• Sigmoid: σ(z) = 1 / (1 + e^(−z))

• Tanh: σ(z) = tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))

• Rectified Linear Unit (ReLU): σ(z) = max(0, z)

• Leaky-ReLU: σ(z) = max(αz, z) for a small slope α > 0 (e.g. α = 0.01)
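For reference, the same four activations written as plain numpy functions (a small sketch; nothing here is specific to the lecture's code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # sigma(z) = 1 / (1 + e^(-z))

def tanh(z):
    return np.tanh(z)                     # tanh(z)

def relu(z):
    return np.maximum(0.0, z)             # max(0, z)

def leaky_relu(z, alpha=0.01):            # alpha: small negative-side slope
    return np.where(z > 0, z, alpha * z)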
Universal Approximation Theorem
One of the foundational results for neural networks is the universal approximation theorem.

In words, it says the following:

Any continuous function on a compact domain can be approximated by neural networks to arbitrary precision, provided there are enough neurons (m large enough).

The neural-network hypothesis space of arbitrarily large width therefore has zero approximation error!
[Figure: the distance between the target function 𝑓 and the hypothesis space 𝓗; with enough neurons, 𝓗 contains an approximant 𝑓̃ arbitrarily close to 𝑓]
“Proof” in a Special Case
Let us consider a 1D continuous function f on the unit interval [0, 1]
Step 1: Approximate f by a step function
Step 2: Use two sigmoids to make a step
Step 3: Sum the resulting functions (a sum of sigmoids is exactly a shallow network)
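A small numerical illustration of Steps 2–3 (my own sketch; the interval endpoints and steepness are arbitrary choices): the difference of two steep sigmoids is approximately the indicator of an interval, and summing such building blocks approximates any step function.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# sigmoid(k*(x - s)) - sigmoid(k*(x - t)) ~ indicator of [s, t] for large steepness k
def soft_step(x, s, t, k=200.0):
    return sigmoid(k * (x - s)) - sigmoid(k * (x - t))

x = np.linspace(0.0, 1.0, 11)
print(np.round(soft_step(x, 0.3, 0.7), 3))   # ~1 inside [0.3, 0.7], ~0 outside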
Curse of Dimensionality
Although this idea can be extended to high dimensions, it introduces an issue.
How many patches of linear size ε are there in [0, 1]^d?
• d = 1: 1/ε pieces
• d = 2: 1/ε² pieces
• General d: (1/ε)^d pieces
Even if, somehow, we only need a constant number of neurons to approximate each piece, we would still need of order (1/ε)^d neurons!
This is known as the curse of dimensionality that plagues high-dimensional problems
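A quick numerical sense of this growth (my own illustration), counting the patches of linear size ε = 0.1 in [0, 1]^d as d increases:

eps = 0.1
for d in (1, 2, 5, 10, 20):
    # number of patches of side eps needed to tile [0, 1]^d
    print(d, round((1 / eps) ** d))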
Do neural networks
suffer from the curse of
dimensionality?
Linear and Nonlinear
Approximation
Linear vs Nonlinear Approximation
Recall:

1. Linear approximation: the feature maps φ_j are fixed in advance

2. Nonlinear approximation: the feature maps φ_j are adapted to the data

What is the difference?


The significance of data-dependent
feature maps
Let us consider some motivating examples

Suppose we want to write a vector 𝑢 in 3D in terms of its coordinate components:

𝑢 = 𝑢₁𝑒₁ + 𝑢₂𝑒₂ + 𝑢₃𝑒₃

[Figure: the vector 𝑢, its components 𝑢₁, 𝑢₂, 𝑢₃, and the coordinate axes 𝑒₁, 𝑒₂, 𝑒₃]

Suppose we can only use 2 coordinate axes, say 𝑒₁ and 𝑒₂.
What is the best approximation of 𝑢?

Example:

[Figure: a concrete vector 𝑢 and its best approximation 𝑢̂ using only 𝑒₁ and 𝑒₂; the error is the size of the dropped component along 𝑒₃]

• What if we can pick which two bases to use after seeing 𝑢? (See the numerical sketch below.)
• What if we can pick two bases from a much larger set?
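The numerical sketch (my own example, with 𝑢 = (1, 2, 3) chosen purely for concreteness): a fixed pair of axes versus a pair chosen after seeing 𝑢.

import numpy as np

u = np.array([1.0, 2.0, 3.0])            # example vector (illustrative choice)

# Fixed basis: always keep the e1, e2 components and drop the e3 component
fixed = u.copy()
fixed[2] = 0.0

# Adapted basis: keep the two largest components, chosen after seeing u
keep = np.argsort(np.abs(u))[-2:]
adapted = np.zeros_like(u)
adapted[keep] = u[keep]

print(np.linalg.norm(u - fixed))          # error 3.0 with the fixed choice
print(np.linalg.norm(u - adapted))        # error 1.0 with the adapted choice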
Functions behave just like vectors!
• Each feature map φ_j is like a coordinate axis: it plays the role of 𝑒_j.
Important difference: there are an infinite number of them
• The oracle function f plays the role of 𝑢

Writing

f̂(x) = Σ_j w_j φ_j(x)   (a finite sum)

is like expanding a vector into its components, but we can't have all components since the number of terms is finite.
If we get to choose which components to keep in the sum after seeing some information on f, we can usually do much better!
Linear Approximation
Basis independent of data
Nonlinear Approximation
Basis depends on data
Overcoming the Curse of
Dimensionality
Under some technical assumptions, for any continuous (+ other conditions) function f, there exists a width-m neural network f̂_m such that

‖f − f̂_m‖² ≤ C_f / m

This result was first proved in [Barron, 1993].

This is a tremendous improvement over linear approximation, where we usually have an error decaying only like m^(−c/d) for some constant c, i.e. a rate that deteriorates as the dimension d grows.

The constant C_f measures the smoothness of f.


Optimizing Neural Networks
Optimization
The universal approximation theorem is an approximation result
• We know there is a good approximator of f in 𝓗
• But, we do not yet know how to find it

[Figure: inside the hypothesis space 𝓗, optimization (using the data) moves from an initial guess 𝑓₀ toward 𝑓̂, which should end up close to the best approximant 𝑓̃ ≈ 𝑓*]
Empirical Risk Minimization for
Neural Networks
We can parameterize the hypothesis space as

𝓗 = { f(·; θ) : θ ∈ Θ }

Then, empirical risk minimization is

min_θ L(θ),   L(θ) = (1/N) Σ_{i=1}^{N} ℓ( f(x_i; θ), y_i )

Here, L is the total loss and ℓ is the sample loss.
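As a small sketch (my own, assuming a squared sample loss ℓ(f, y) = ½(f − y)² and the shallow network from earlier):

import numpy as np

def predict(X, W, b, a):
    # f(x; theta) = sum_j a_j * tanh(w_j . x + b_j), with theta = (W, b, a)
    return np.tanh(X @ W.T + b) @ a

def total_loss(X, y, W, b, a):
    # L(theta) = (1/N) * sum_i 0.5 * (f(x_i; theta) - y_i)^2
    resid = predict(X, W, b, a) - y
    return 0.5 * np.mean(resid ** 2)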


Gradient Descent
Consider minimizing the total loss or objective L(θ)

A necessary first-order optimality condition:

∇L(θ) = 0

Two choices:
• Solve ∇L(θ) = 0 directly (rarely possible in closed form)
• Use an iterative method, e.g. gradient descent (GD): θ_{k+1} = θ_k − η ∇L(θ_k)
The Effect of Learning Rate
Look at the GD iteration

θ_{k+1} = θ_k − η ∇L(θ_k)

• When η is too small, the updates are slow

• When η is too large, the updates may become unstable
Example
One dimension: take, for instance, the quadratic objective L(θ) = (a/2) θ² with a > 0, so ∇L(θ) = aθ

GD iterates

θ_{k+1} = θ_k − η a θ_k = (1 − η a) θ_k

Solution

θ_k = (1 − η a)^k θ_0, which converges to the minimizer 0 if and only if |1 − η a| < 1, i.e. 0 < η < 2/a
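A minimal sketch of this example in code (my own, taking a = 1, so the iteration is θ_{k+1} = (1 − η) θ_k):

# GD on L(theta) = 0.5 * theta^2, so grad L(theta) = theta and theta_{k+1} = (1 - eta) * theta_k
def gd(eta, theta0=1.0, steps=20):
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * theta
    return theta

for eta in (0.1, 1.5, 2.5):
    # eta = 0.1: slow convergence; eta = 1.5: converges since |1 - eta| < 1; eta = 2.5: diverges
    print(eta, gd(eta))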
Convergence of GD
Provided η is small enough, it can be shown that

∇L(θ_k) → 0   as k → ∞,

i.e. GD approaches a stationary point. However, this does not mean that θ_k approaches a global minimizer of L.

Most important problem: there may be local minima


Local vs Global Minima
θ̄ is a local minimum of L if there exists δ > 0 such that

L(θ̄) ≤ L(θ)   for all θ such that ‖θ − θ̄‖ ≤ δ

θ̄ is a global minimum of L if L(θ̄) ≤ L(θ) for all θ

[Figure: a curve with a local minimum (flanked by its δ-neighbourhood) and a lower global minimum]

When does GD find a global minimum?


Convex Functions
A class of objective/loss functions for which local minima are also global is the class of convex functions.

Definition:

A function L is convex if

L(λθ + (1 − λ)θ′) ≤ λ L(θ) + (1 − λ) L(θ′)

for all θ, θ′ and all λ ∈ [0, 1]

Geometric meaning? The chord joining any two points on the graph of L lies above the graph.
Examples

Convex:

Non-convex:
Important Property
If L is convex, then all local minima are also global!

Proof by picture:

[Figure: if a local minimum (on its δ-neighbourhood) were not global, the chord down to a point with smaller value would dip below the graph, contradicting convexity]
GD on Convex Functions
When L is convex, GD finds a global minimum. In fact, there is a rate estimate: for smooth convex L and small enough η, the optimality gap L(θ_k) − min_θ L(θ) decays like O(1/k).

When is L convex?

Is L(θ) convex in θ for
• Linear Basis Models?
• SVM?
• Neural Networks?
Stochastic Gradient Descent
GD is an optimization algorithm for general differentiable functions, but in empirical risk minimization we have some structure:

L(θ) = (1/N) Σ_{i=1}^{N} ℓ_i(θ),   ℓ_i(θ) = ℓ( f(x_i; θ), y_i )

Challenges to GD?
• ∇L(θ) = (1/N) Σ_{i=1}^{N} ∇ℓ_i(θ), so a gradient evaluation requires a summation of N terms
• This is very expensive when N is large

Stochastic gradient descent relies on the following idea: at each step, we use the gradient on a random sub-sample of the dataset as an approximation of the full gradient.

Gradient Descent (GD)

θ_{k+1} = θ_k − (η/N) Σ_{i=1}^{N} ∇ℓ_i(θ_k)

Stochastic Gradient Descent (SGD)

θ_{k+1} = θ_k − (η/B) Σ_{i ∈ I_k} ∇ℓ_i(θ_k)

where I_k is a random sub-sample (mini-batch) of {1, …, N} of size B

This is efficient if B is small and N is large!
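A minimal sketch of the two updates (my own illustration; grad_i(theta, i) stands for ∇ℓ_i(θ) of whatever model is being trained, so it is a placeholder):

import numpy as np

def gd_step(theta, grad_i, N, eta):
    g = np.mean([grad_i(theta, i) for i in range(N)], axis=0)   # full gradient: all N terms
    return theta - eta * g

def sgd_step(theta, grad_i, N, eta, B, rng):
    batch = rng.choice(N, size=B, replace=False)                # random sub-sample I_k of size B
    g = np.mean([grad_i(theta, i) for i in batch], axis=0)      # mini-batch gradient estimate
    return theta - eta * g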


The Dynamics of SGD
Consider the sample objectives ℓ_1(θ), …, ℓ_N(θ)

Total objective: L(θ) = (1/N) Σ_{i=1}^{N} ℓ_i(θ)

SGD vs GD dynamics?
Deep Neural Networks
Deep Neural Networks
Deep neural networks are an extension of shallow networks.
Idea: we stack many hidden layers together, each applying an affine map followed by the activation (see the sketch below)
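A minimal numpy sketch of this stacking (my own illustration, assuming tanh hidden activations and a linear read-out):

import numpy as np

def deep_nn(x, Ws, bs, a, sigma=np.tanh):
    # h^0 = x, h^l = sigma(W^l h^{l-1} + b^l), output = a . h^L
    h = x
    for W, b in zip(Ws, bs):
        h = sigma(W @ h + b)
    return a @ h

# Example: 3 inputs -> two hidden layers of width 8 -> scalar output
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(8, 3)), rng.normal(size=(8, 8))]
bs = [rng.normal(size=8), rng.normal(size=8)]
a = rng.normal(size=8)
print(deep_nn(rng.normal(size=3), Ws, bs, a))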
Optimizing Deep Neural Networks
Analogous to shallow NNs, deep NNs can also be optimized with (stochastic) gradient descent.

However, due to the repeated feed-forward structure, we need an efficient algorithm to compute the gradients.

This is known as the back-propagation algorithm.


Review: Chain Rule
Consider functions g: ℝⁿ → ℝᵐ and f: ℝᵐ → ℝᵖ, and their composition h(x) = f(g(x))

Then, the chain rule of calculus gives

Dh(x) = Df(g(x)) · Dg(x)   (a product of Jacobian matrices)

In component form, we have

∂h_i/∂x_j = Σ_{k=1}^{m} ∂f_i/∂y_k (evaluated at y = g(x)) · ∂g_k/∂x_j


Back-Propagation
Let us consider a network

h^0 = x,   h^l = σ(W^l h^{l-1} + b^l)  for l = 1, …, L,   output f̂(x) = a · h^L

Loss function (just consider one sample)

J = ℓ( f̂(x), y )

We want to compute the gradients ∂J/∂W^l and ∂J/∂b^l for every layer l.
1. Generally, J has the following dependence: W^l enters through h^l, which enters h^{l+1}, …, h^L, and finally J

2. But, given h^l, J no longer depends on the earlier layers h^0, …, h^{l-1} (or their parameters)

3. Use chain rule on J viewed as a function of h^l = σ(W^l h^{l-1} + b^l),

giving

∂J/∂W^l = (∂J/∂h^l)(∂h^l/∂W^l),   ∂J/∂b^l = (∂J/∂h^l)(∂h^l/∂b^l)

4. So, we have defined the layer-wise error variable

δ^l := ∂J/∂h^l

Once we know δ^l, we are done! How to compute δ^l?

For l = L (the output layer), this is easy:

δ^L = ∂J/∂h^L = ℓ′( f̂(x), y ) a

For l < L, we use chain rule again to derive a recursion

δ^l = (∂h^{l+1}/∂h^l)ᵀ δ^{l+1} = (W^{l+1})ᵀ [ σ′(W^{l+1} h^l + b^{l+1}) ⊙ δ^{l+1} ]

and so the δ^l can be computed backwards from l = L down to l = 1 after a single forward pass; the gradients then follow as ∂J/∂b^l = σ′(W^l h^{l-1} + b^l) ⊙ δ^l and ∂J/∂W^l = [σ′(W^l h^{l-1} + b^l) ⊙ δ^l] (h^{l-1})ᵀ
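A compact numpy sketch of the recursion above (my own illustration, in the same notation: tanh activations, linear read-out a · h^L, squared sample loss ½(f̂ − y)²), including a finite-difference sanity check:

import numpy as np

def forward(x, Ws, bs, a):
    hs, zs = [x], []
    for W, b in zip(Ws, bs):
        zs.append(W @ hs[-1] + b)           # z^l = W^l h^{l-1} + b^l
        hs.append(np.tanh(zs[-1]))          # h^l = tanh(z^l)
    return hs, zs, a @ hs[-1]               # f_hat = a . h^L

def backprop(x, y, Ws, bs, a):
    hs, zs, f_hat = forward(x, Ws, bs, a)
    n_layers = len(Ws)
    grads_W, grads_b = [None] * n_layers, [None] * n_layers
    delta = (f_hat - y) * a                        # delta^L = dJ/dh^L for J = 0.5*(f_hat - y)^2
    for l in reversed(range(n_layers)):
        dz = (1.0 - np.tanh(zs[l]) ** 2) * delta   # dJ/dz^l = sigma'(z^l) ⊙ delta^l
        grads_W[l] = np.outer(dz, hs[l])           # dJ/dW^l = dz (h^{l-1})^T
        grads_b[l] = dz                            # dJ/db^l = dz
        delta = Ws[l].T @ dz                       # delta^{l-1} = (W^l)^T dz
    grad_a = (f_hat - y) * hs[-1]                  # dJ/da
    return grads_W, grads_b, grad_a

# Finite-difference check of one weight gradient
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(4, 4))]
bs = [rng.normal(size=4), rng.normal(size=4)]
a, x, y = rng.normal(size=4), rng.normal(size=3), 0.7
gW, gb, ga = backprop(x, y, Ws, bs, a)
eps = 1e-6
Ws[0][0, 0] += eps
J_plus = 0.5 * (forward(x, Ws, bs, a)[2] - y) ** 2
Ws[0][0, 0] -= 2 * eps
J_minus = 0.5 * (forward(x, Ws, bs, a)[2] - y) ** 2
print(gW[0][0, 0], (J_plus - J_minus) / (2 * eps))   # the two numbers should be close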
Summary
Approximation Properties of Neural Networks
• Nonlinear approximation: adapted to data
• Universal approximation property, overcomes curse of
dimensionality

Optimizing Neural Networks


• (Stochastic) Gradient Descent
• For deep NNs, compute gradients using back-propagation
