Foundations of Machine Learning

DSA 5105 • Lecture 5

Soufiane Hayou
Department of Mathematics
What we’ve seen so far
So far, we introduced two types of hypothesis spaces, both containing functions built from feature maps φ_j (e.g. predictors of the form f̂(x) = Σ_j w_j φ_j(x)).

Main difference:
1. The feature maps φ_j are fixed
Examples: linear models, linear basis models, SVM
2. The feature maps φ_j are adapted to the data
Examples: standard/boosted/bagged decision trees
In this lecture, we introduce another class belonging to type 2: neural networks
Shallow Neural Networks
History
Neural networks originated from an attempt to model the collective interaction of biological neurons

https://towardsdatascience.com/the-differences-between-artificial-and-biological-neural-networks-a8b46db828b7
Neural Networks for Regression
The neural network (NN) hypothesis space is quite similar to that of linear basis models: a shallow network with m hidden neurons computes

f̂(x) = Σ_{j=1}^{m} a_j σ(w_j · x + b_j)

Trainable variables:
• w_j are the weights of the hidden layer
• b_j are the biases of the hidden layer
• a_j are the weights of the output layer
• σ is the activation function
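To make this concrete, here is a minimal numpy sketch of the shallow-network hypothesis (my own illustration, not taken from the slides; the function and variable names are assumptions):

import numpy as np

# Shallow network: f_hat(x) = sum_j a_j * sigma(w_j . x + b_j), evaluated on a batch.
def shallow_nn(X, W, b, a, sigma=np.tanh):
    # X: (n, d) inputs; W: (m, d) hidden weights; b: (m,) biases; a: (m,) output weights
    H = sigma(X @ W.T + b)   # hidden-layer activations, shape (n, m)
    return H @ a             # network outputs, shape (n,)

# Example with random parameters: d = 3 inputs, m = 10 hidden neurons
rng = np.random.default_rng(0)
d, m = 3, 10
W, b, a = rng.normal(size=(m, d)), rng.normal(size=m), rng.normal(size=m)
print(shallow_nn(rng.normal(size=(5, d)), W, b, a))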
Activation Functions
• Sigmoid: σ(z) = 1 / (1 + e^(−z))

• Tanh: σ(z) = tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))

• Rectified Linear Unit (ReLU): σ(z) = max(0, z)

• Leaky-ReLU: σ(z) = max(αz, z) for a small slope α > 0 (e.g. α = 0.01)
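For reference, the same four activations written as plain numpy functions (a small sketch; nothing here is specific to the lecture's code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # sigma(z) = 1 / (1 + e^(-z))

def tanh(z):
    return np.tanh(z)                     # tanh(z)

def relu(z):
    return np.maximum(0.0, z)             # max(0, z)

def leaky_relu(z, alpha=0.01):            # alpha: small negative-side slope
    return np.where(z > 0, z, alpha * z)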
Universal Approximation Theorem
One of the foundational results for neural networks is the universal approximation theorem.

In words, it says the following:

Any continuous function on a compact domain can be approximated by neural networks to arbitrary precision, provided there are enough neurons (m large enough).

The neural-network hypothesis space of arbitrarily large width therefore has zero approximation error!
[Figure: the distance between the target function 𝑓 and the hypothesis space 𝓗; with enough neurons, 𝓗 contains an approximant 𝑓̃ arbitrarily close to 𝑓]
“Proof” in a Special Case
Let us consider a 1D continuous function f on the unit interval [0, 1]
Step 1: Approximate f by a step function
Step 2: Use two sigmoids to make a step
Step 3: Sum the resulting functions (a sum of sigmoids is exactly a shallow network)
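A small numerical illustration of Steps 2–3 (my own sketch; the interval endpoints and steepness are arbitrary choices): the difference of two steep sigmoids is approximately the indicator of an interval, and summing such building blocks approximates any step function.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# sigmoid(k*(x - s)) - sigmoid(k*(x - t)) ~ indicator of [s, t] for large steepness k
def soft_step(x, s, t, k=200.0):
    return sigmoid(k * (x - s)) - sigmoid(k * (x - t))

x = np.linspace(0.0, 1.0, 11)
print(np.round(soft_step(x, 0.3, 0.7), 3))   # ~1 inside [0.3, 0.7], ~0 outside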
Curse of Dimensionality
Although this idea can be extended to high dimensions, it introduces an issue.
How many patches of linear size ε are there in [0, 1]^d?
• d = 1: 1/ε pieces
• d = 2: 1/ε² pieces
• General d: (1/ε)^d pieces
Even if, somehow, we only need a constant number of neurons to approximate each piece, we would still need of order (1/ε)^d neurons!
This is known as the curse of dimensionality that plagues high-dimensional problems
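A quick numerical sense of this growth (my own illustration), counting the patches of linear size ε = 0.1 in [0, 1]^d as d increases:

eps = 0.1
for d in (1, 2, 5, 10, 20):
    # number of patches of side eps needed to tile [0, 1]^d
    print(d, round((1 / eps) ** d))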
Do neural networks
suffer from the curse of
dimensionality?
Linear and Nonlinear
Approximation
Linear vs Nonlinear Approximation
Recall:

1. Linear approximation: the feature maps φ_j are fixed in advance

2. Nonlinear approximation: the feature maps φ_j are adapted to the data

What is the difference?


The significance of data-dependent
feature maps
Let us consider some motivating examples

Suppose we want to write a vector 𝑢 in 3D in terms of its coordinate components:

𝑢 = 𝑢₁𝑒₁ + 𝑢₂𝑒₂ + 𝑢₃𝑒₃

[Figure: the vector 𝑢, its components 𝑢₁, 𝑢₂, 𝑢₃, and the coordinate axes 𝑒₁, 𝑒₂, 𝑒₃]

Suppose we can only use 2 coordinate axes, say 𝑒₁ and 𝑒₂.
What is the best approximation of 𝑢?

Example:

[Figure: a concrete vector 𝑢 and its best approximation 𝑢̂ using only 𝑒₁ and 𝑒₂; the error is the size of the dropped component along 𝑒₃]

• What if we can pick which two bases to use after seeing 𝑢? (See the numerical sketch below.)
• What if we can pick two bases from a much larger set?
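The numerical sketch (my own example, with 𝑢 = (1, 2, 3) chosen purely for concreteness): a fixed pair of axes versus a pair chosen after seeing 𝑢.

import numpy as np

u = np.array([1.0, 2.0, 3.0])            # example vector (illustrative choice)

# Fixed basis: always keep the e1, e2 components and drop the e3 component
fixed = u.copy()
fixed[2] = 0.0

# Adapted basis: keep the two largest components, chosen after seeing u
keep = np.argsort(np.abs(u))[-2:]
adapted = np.zeros_like(u)
adapted[keep] = u[keep]

print(np.linalg.norm(u - fixed))          # error 3.0 with the fixed choice
print(np.linalg.norm(u - adapted))        # error 1.0 with the adapted choice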
Functions behave just like vectors!
• Each feature map φ_j is like a coordinate axis: it plays the role of 𝑒_j.
Important difference: there are an infinite number of them
• The oracle function f plays the role of 𝑢

Writing

f̂(x) = Σ_j w_j φ_j(x)   (a finite sum)

is like expanding a vector into its components, but we can't have all components since the number of terms is finite.
If we get to choose which components to keep in the sum after seeing some information on f, we can usually do much better!
Linear Approximation
Basis independent of data
Nonlinear Approximation
Basis depends on data
Overcoming the Curse of
Dimensionality
Under some technical assumptions, for any continuous (+ other conditions) function f, there exists a width-m neural network f̂_m such that

‖f − f̂_m‖² ≤ C_f / m

This result was first proved in [Barron, 1993].

This is a tremendous improvement over linear approximation, where we usually have an error decaying only like m^(−c/d) for some constant c, i.e. a rate that deteriorates as the dimension d grows.

The constant C_f measures the smoothness of f.


Optimizing Neural Networks
Optimization
The universal approximation theorem is an approximation result
• We know there is a good approximator of f in 𝓗
• But, we do not yet know how to find it

[Figure: inside the hypothesis space 𝓗, optimization (using the data) moves from an initial guess 𝑓₀ toward 𝑓̂, which should end up close to the best approximant 𝑓̃ ≈ 𝑓*]
Empirical Risk Minimization for
Neural Networks
We can parameterize the hypothesis space as

𝓗 = { f(·; θ) : θ ∈ Θ }

Then, empirical risk minimization is

min_θ L(θ),   L(θ) = (1/N) Σ_{i=1}^{N} ℓ( f(x_i; θ), y_i )

Here, L is the total loss and ℓ is the sample loss.
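As a small sketch (my own, assuming a squared sample loss ℓ(f, y) = ½(f − y)² and the shallow network from earlier):

import numpy as np

def predict(X, W, b, a):
    # f(x; theta) = sum_j a_j * tanh(w_j . x + b_j), with theta = (W, b, a)
    return np.tanh(X @ W.T + b) @ a

def total_loss(X, y, W, b, a):
    # L(theta) = (1/N) * sum_i 0.5 * (f(x_i; theta) - y_i)^2
    resid = predict(X, W, b, a) - y
    return 0.5 * np.mean(resid ** 2)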


Gradient Descent
Consider minimizing the total loss or objective L(θ)

A necessary first-order optimality condition:

∇L(θ) = 0

Two choices:
• Solve ∇L(θ) = 0 directly (rarely possible in closed form)
• Use an iterative method, e.g. gradient descent (GD): θ_{k+1} = θ_k − η ∇L(θ_k)
The Effect of Learning Rate
Look at the GD iteration

θ_{k+1} = θ_k − η ∇L(θ_k)

• When η is too small, the updates are slow

• When η is too large, the updates may become unstable
Example
One dimension: take, for instance, the quadratic objective L(θ) = (a/2) θ² with a > 0, so ∇L(θ) = aθ

GD iterates

θ_{k+1} = θ_k − η a θ_k = (1 − η a) θ_k

Solution

θ_k = (1 − η a)^k θ_0, which converges to the minimizer 0 if and only if |1 − η a| < 1, i.e. 0 < η < 2/a
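A minimal sketch of this example in code (my own, taking a = 1, so the iteration is θ_{k+1} = (1 − η) θ_k):

# GD on L(theta) = 0.5 * theta^2, so grad L(theta) = theta and theta_{k+1} = (1 - eta) * theta_k
def gd(eta, theta0=1.0, steps=20):
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * theta
    return theta

for eta in (0.1, 1.5, 2.5):
    # eta = 0.1: slow convergence; eta = 1.5: converges since |1 - eta| < 1; eta = 2.5: diverges
    print(eta, gd(eta))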
Convergence of GD
Provided η is small enough, it can be shown that

∇L(θ_k) → 0   as k → ∞,

i.e. GD approaches a stationary point. However, this does not mean that θ_k approaches a global minimizer of L.

Most important problem: there may be local minima


Local vs Global Minima
θ̄ is a local minimum of L if there exists δ > 0 such that

L(θ̄) ≤ L(θ)   for all θ such that ‖θ − θ̄‖ ≤ δ

θ̄ is a global minimum of L if L(θ̄) ≤ L(θ) for all θ

[Figure: a curve with a local minimum (flanked by its δ-neighbourhood) and a lower global minimum]

When does GD find a global minimum?


Convex Functions
A class of objective/loss functions for which local minima are also global is the class of convex functions.

Definition:

A function L is convex if

L(λθ + (1 − λ)θ′) ≤ λ L(θ) + (1 − λ) L(θ′)

for all θ, θ′ and all λ ∈ [0, 1]

Geometric meaning? The chord joining any two points on the graph of L lies above the graph.
Examples

Convex:

Non-convex:
Important Property
If L is convex, then all local minima are also global!

Proof by picture:

[Figure: if a local minimum (on its δ-neighbourhood) were not global, the chord down to a point with smaller value would dip below the graph, contradicting convexity]
GD on Convex Functions
When L is convex, GD finds a global minimum. In fact, there is a rate estimate: for smooth convex L and small enough η, the optimality gap L(θ_k) − min_θ L(θ) decays like O(1/k).

When is L convex?

Is L(θ) convex in θ for
• Linear Basis Models?
• SVM?
• Neural Networks?
Stochastic Gradient Descent
GD is an optimization algorithm for general differentiable functions, but in empirical risk minimization we have some structure:

L(θ) = (1/N) Σ_{i=1}^{N} ℓ_i(θ),   ℓ_i(θ) = ℓ( f(x_i; θ), y_i )

Challenges to GD?
• ∇L(θ) = (1/N) Σ_{i=1}^{N} ∇ℓ_i(θ), so a gradient evaluation requires a summation of N terms
• This is very expensive when N is large

Stochastic gradient descent relies on the following idea: at each step, we use the gradient on a random sub-sample of the dataset as an approximation of the full gradient.

Gradient Descent (GD)

θ_{k+1} = θ_k − (η/N) Σ_{i=1}^{N} ∇ℓ_i(θ_k)

Stochastic Gradient Descent (SGD)

θ_{k+1} = θ_k − (η/B) Σ_{i ∈ I_k} ∇ℓ_i(θ_k)

where I_k is a random sub-sample (mini-batch) of {1, …, N} of size B

This is efficient if B is small and N is large!
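A minimal sketch of the two updates (my own illustration; grad_i(theta, i) stands for ∇ℓ_i(θ) of whatever model is being trained, so it is a placeholder):

import numpy as np

def gd_step(theta, grad_i, N, eta):
    g = np.mean([grad_i(theta, i) for i in range(N)], axis=0)   # full gradient: all N terms
    return theta - eta * g

def sgd_step(theta, grad_i, N, eta, B, rng):
    batch = rng.choice(N, size=B, replace=False)                # random sub-sample I_k of size B
    g = np.mean([grad_i(theta, i) for i in batch], axis=0)      # mini-batch gradient estimate
    return theta - eta * g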


The Dynamics of SGD
Consider the sample objectives ℓ_1(θ), …, ℓ_N(θ)

Total objective: L(θ) = (1/N) Σ_{i=1}^{N} ℓ_i(θ)

SGD vs GD dynamics?
Deep Neural Networks
Deep Neural Networks
Deep neural networks are an extension of shallow networks.
Idea: we stack many hidden layers together, each applying an affine map followed by the activation (see the sketch below)
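A minimal numpy sketch of this stacking (my own illustration, assuming tanh hidden activations and a linear read-out):

import numpy as np

def deep_nn(x, Ws, bs, a, sigma=np.tanh):
    # h^0 = x, h^l = sigma(W^l h^{l-1} + b^l), output = a . h^L
    h = x
    for W, b in zip(Ws, bs):
        h = sigma(W @ h + b)
    return a @ h

# Example: 3 inputs -> two hidden layers of width 8 -> scalar output
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(8, 3)), rng.normal(size=(8, 8))]
bs = [rng.normal(size=8), rng.normal(size=8)]
a = rng.normal(size=8)
print(deep_nn(rng.normal(size=3), Ws, bs, a))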
Optimizing Deep Neural Networks
Analogous to shallow NNs, deep NNs can also be optimized with (stochastic) gradient descent.

However, due to the repeated feed-forward structure, we need an efficient algorithm to compute the gradients.

This is known as the back-propagation algorithm.


Review: Chain Rule
Consider functions g: ℝⁿ → ℝᵐ and f: ℝᵐ → ℝᵖ, and their composition h(x) = f(g(x))

Then, the chain rule of calculus gives

Dh(x) = Df(g(x)) · Dg(x)   (a product of Jacobian matrices)

In component form, we have

∂h_i/∂x_j = Σ_{k=1}^{m} ∂f_i/∂y_k (evaluated at y = g(x)) · ∂g_k/∂x_j


Back-Propagation
Let us consider a network

h^0 = x,   h^l = σ(W^l h^{l-1} + b^l)  for l = 1, …, L,   output f̂(x) = a · h^L

Loss function (just consider one sample)

J = ℓ( f̂(x), y )

We want to compute the gradients ∂J/∂W^l and ∂J/∂b^l for every layer l.
1. Generally, J has the following dependence: W^l enters through h^l, which enters h^{l+1}, …, h^L, and finally J

2. But, given h^l, J no longer depends on the earlier layers h^0, …, h^{l-1} (or their parameters)

3. Use chain rule on J viewed as a function of h^l = σ(W^l h^{l-1} + b^l),

giving

∂J/∂W^l = (∂J/∂h^l)(∂h^l/∂W^l),   ∂J/∂b^l = (∂J/∂h^l)(∂h^l/∂b^l)

4. So, we have defined the layer-wise error variable

δ^l := ∂J/∂h^l

Once we know δ^l, we are done! How to compute δ^l?

For l = L (the output layer), this is easy:

δ^L = ∂J/∂h^L = ℓ′( f̂(x), y ) a

For l < L, we use chain rule again to derive a recursion

δ^l = (∂h^{l+1}/∂h^l)ᵀ δ^{l+1} = (W^{l+1})ᵀ [ σ′(W^{l+1} h^l + b^{l+1}) ⊙ δ^{l+1} ]

and so the δ^l can be computed backwards from l = L down to l = 1 after a single forward pass; the gradients then follow as ∂J/∂b^l = σ′(W^l h^{l-1} + b^l) ⊙ δ^l and ∂J/∂W^l = [σ′(W^l h^{l-1} + b^l) ⊙ δ^l] (h^{l-1})ᵀ
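A compact numpy sketch of the recursion above (my own illustration, in the same notation: tanh activations, linear read-out a · h^L, squared sample loss ½(f̂ − y)²), including a finite-difference sanity check:

import numpy as np

def forward(x, Ws, bs, a):
    hs, zs = [x], []
    for W, b in zip(Ws, bs):
        zs.append(W @ hs[-1] + b)           # z^l = W^l h^{l-1} + b^l
        hs.append(np.tanh(zs[-1]))          # h^l = tanh(z^l)
    return hs, zs, a @ hs[-1]               # f_hat = a . h^L

def backprop(x, y, Ws, bs, a):
    hs, zs, f_hat = forward(x, Ws, bs, a)
    n_layers = len(Ws)
    grads_W, grads_b = [None] * n_layers, [None] * n_layers
    delta = (f_hat - y) * a                        # delta^L = dJ/dh^L for J = 0.5*(f_hat - y)^2
    for l in reversed(range(n_layers)):
        dz = (1.0 - np.tanh(zs[l]) ** 2) * delta   # dJ/dz^l = sigma'(z^l) ⊙ delta^l
        grads_W[l] = np.outer(dz, hs[l])           # dJ/dW^l = dz (h^{l-1})^T
        grads_b[l] = dz                            # dJ/db^l = dz
        delta = Ws[l].T @ dz                       # delta^{l-1} = (W^l)^T dz
    grad_a = (f_hat - y) * hs[-1]                  # dJ/da
    return grads_W, grads_b, grad_a

# Finite-difference check of one weight gradient
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(4, 4))]
bs = [rng.normal(size=4), rng.normal(size=4)]
a, x, y = rng.normal(size=4), rng.normal(size=3), 0.7
gW, gb, ga = backprop(x, y, Ws, bs, a)
eps = 1e-6
Ws[0][0, 0] += eps
J_plus = 0.5 * (forward(x, Ws, bs, a)[2] - y) ** 2
Ws[0][0, 0] -= 2 * eps
J_minus = 0.5 * (forward(x, Ws, bs, a)[2] - y) ** 2
print(gW[0][0, 0], (J_plus - J_minus) / (2 * eps))   # the two numbers should be close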
Summary
Approximation Properties of Neural Networks
• Nonlinear approximation: adapted to data
• Universal approximation property, overcomes curse of
dimensionality

Optimizing Neural Networks


• (Stochastic) Gradient Descent
• For deep NNs, compute gradients using back-propagation
