1.3 Setting Parameters of a Deep Neural Network - Hierarchical Representations
In the previous videos, we learned that we need to set the parameters or weights, but
how do we set the parameters of a deep neural network? Our first goal is to cast the
problem of finding good parameters as an optimization problem.
The catch is that not all optimization problems are created equal. There's an important
class called convex optimization problems, for which we have many good algorithms
and a solid understanding of when and why they work. But life is not so easy in the
case of deep neural networks - the optimization problems that come out are usually
non-convex and unwieldy.
Let’s first define the optimization problem we'll be interested in. Returning to the
ImageNet example, our deep neural networks have n inputs and 1000 outputs. The
image will be assigned to one of the 1000 categories.
[email protected]
ZV0GDF798E
What we're hoping for is that when we feed in a picture as an n-dimensional vector,
for example, an image of a dog, then the output in the last layer contains a ‘1’ in the
correct category and ‘0’s in all the other categories.
This file is meant for personal use by [email protected] only.
Sharing or publishing the contents in part or full is liable for legal action.
Now let's say we have all the parameters of a deep network; how may we evaluate
how good these parameters are? We could take each one of the 1 million examples of
ImageNet, feed each picture into the network, and check whether we got the
correct label or not.
To be concrete, let's discuss what's called the quadratic cost function. Say the
input is some vector x and its true label is category j. Then the output we would
like to get is a 1 in the j-th category of outputs and zeros in every other category.
[email protected]
ZV0GDF798E
For example, in the above image we pass in the dog image as some vector x, so the
true label of the image is "dog". Then, as output from the network, we would like to
get zero for all the other categories and 1 for the dog category.
Now we could penalize the network by how far off the output is from this idealized
output. Let's say that on input x the network outputs 𝑎1, 𝑎2, 𝑎3, ..., 𝑎𝑚, where 𝑎𝑖 is the
output for the i-th category and j is the true category. Then the penalty is the squared
distance to the ideal one-hot output:

𝐶(𝑥) = (𝑎𝑗 − 1)² + Σ𝑖≠𝑗 𝑎𝑖²
This function evaluates to zero if you get exactly the correct output, and to
something non-zero if you get the wrong category. Here j is the correct category the
image belongs to, and 𝑎𝑗 is the network's output for that category.
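As a minimal sketch, the quadratic cost described above can be computed as follows (the function name and the plain-list representation of the output vector are illustrative choices, not from the course):

```python
def quadratic_cost(outputs, true_category):
    """Quadratic cost: squared distance between the network's output
    vector and the one-hot target for the true category."""
    target = [1.0 if i == true_category else 0.0 for i in range(len(outputs))]
    return sum((a - y) ** 2 for a, y in zip(outputs, target))

# Perfect output for category 2 of 5: the cost is exactly zero.
print(quadratic_cost([0, 0, 1, 0, 0], 2))   # 0.0
# Confident but wrong output: the cost is large.
print(quadratic_cost([1, 0, 0, 0, 0], 2))   # 2.0
```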
What we've done is define a cost function: a way of evaluating the performance of our
algorithm or model. Now, given the parameters of a deep neural network, we can
evaluate how good they are by computing the average cost over the examples. We
want this average cost to be as small as possible in order to be confident we've
found a well-performing deep neural network.
Now, to find a good setting of the parameters, we look for the setting that minimizes
this average cost. This is an optimization problem: we have an explicit function,
depending on the parameters and the training data, that we'd like to minimize. It
turns out that parameters which achieve a low average cost on the training examples
typically also achieve a low error rate on a new set of examples, and that is the real
goal: optimal parameters that predict new examples with a minimal error rate.
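Averaging the per-example penalty gives the quantity we minimize during training. A small sketch, assuming the same one-hot quadratic cost as above (the names `average_cost`, `predictions`, and `labels` are illustrative):

```python
def average_cost(predictions, labels):
    """Average quadratic cost over a labelled dataset.
    `predictions` is a list of output vectors; `labels` gives the
    index of the true category for each example."""
    def cost(outputs, j):
        return sum((a - (1.0 if i == j else 0.0)) ** 2
                   for i, a in enumerate(outputs))
    return sum(cost(p, j) for p, j in zip(predictions, labels)) / len(labels)

# Two examples: one perfect (cost 0.0), one wrong (cost 2.0).
print(average_cost([[0, 1], [0, 1]], [1, 0]))  # 1.0
```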
We now have an optimization problem whose solution would give us good parameters.
Is there just some off-the-shelf method we can plug any optimization problem into
and get the best answer? Absolutely not.
Distinguishing which optimization problems are easy to solve and which are hard is
one of the foundational issues in theoretical computer science.
There's an important class of functions called convex functions. There are many ways
to define convex functions, but let's start with an example: 𝑓(𝑥) = 𝑥²
[email protected]
ZV0GDF798E
4
This file is meant for personal use by [email protected] only.
Sharing or publishing the contents in part or full is liable for legal action.
2. The more mathematical definition is: a function is convex if and only if its
second derivative is always non-negative.
In the above example, the function is 𝑓(𝑥) = 𝑥². To calculate the second
derivative, we calculate the first derivative and then differentiate it again.
From calculus, we know that the derivative of 𝑥ⁿ is 𝑛𝑥ⁿ⁻¹, so 𝑓′(𝑥) = 2𝑥 and
𝑓″(𝑥) = 2, which is non-negative everywhere; hence 𝑓 is convex.
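For 𝑓(𝑥) = 𝑥² the second derivative is the constant 2, which we can confirm numerically with a central finite-difference estimate (the helper name and the step size h are arbitrary choices for this sketch):

```python
def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

f = lambda x: x ** 2
# f''(x) = 2 at every sample point, consistent with convexity.
for x in (-3.0, 0.0, 5.0):
    print(round(second_derivative(f, x), 3))
```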
In a convex function, every local minimum is also the global minimum. The point at
which a function takes its absolute minimum value is called the global minimum.
However, when the goal is to minimize the function, it may happen that the function
appears to have a minimum value at several different points. These points appear to
be minima but are not actually where the function takes its absolute minimum value;
they are called the local minima of the function.
There are points on this curve where the derivative is zero but is positive before and
negative after - local maxima. This means the second derivative is negative around
the green region, so this function is not convex.
The real question is, why is this bad? The important point is that convex functions have
no local minima that are not also global minima: for a convex function, the local and
global minima are at the same point. So there is nowhere you can get stuck if
you're greedily following the path of steepest descent, because you will never reach a
minimum in a convex function that isn't actually the globally optimal solution.
But in the non-convex case, you absolutely can get stuck.
In our second example, you could get stuck at x = 1, which achieves f(x) = 0, even
though there is another x that achieves a better (lower) minimum than x = 1 does.
[email protected]
ZV0GDF798E
So for a convex function, the steps carried out to reach the global minimum are:
1. Calculate the derivative at x.
2. If the derivative is negative, you would take a step right or increase the value of
x.
3. If the derivative is positive, you would take a step left or decrease the value of x.
4. Repeat the process until the derivative is zero.
If you choose the step size correctly, this is guaranteed to converge to the global
minimizer of the function. As you may have noticed, you should scale the magnitude
of each step with how large the derivative is, so the steps shrink as you approach the
minimum.
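The steps above can be sketched in a few lines (the learning rate, iteration count, and starting point are arbitrary choices for this illustration):

```python
def gradient_descent_1d(deriv, x, step=0.1, iters=100):
    """Repeatedly step against the sign of the derivative. Because the
    step is scaled by the derivative's magnitude, steps shrink
    automatically as we approach the minimum."""
    for _ in range(iters):
        # Negative derivative -> move right; positive -> move left.
        x = x - step * deriv(x)
    return x

# For f(x) = x**2 the derivative is 2x; start far from the minimum at 0.
x_min = gradient_descent_1d(lambda x: 2 * x, x=10.0)
print(round(x_min, 6))  # converges to 0.0
```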
Now, what happens if you try the same strategy on a non-convex function? All you can
say is that you may reach a local minimum.
But in general, it will not be a global minimum.
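We can see this getting-stuck behaviour concretely. The course's plotted function isn't reproduced here, so the sketch below uses a hypothetical stand-in, f(x) = (x² − 1)² + 0.3x, which has a shallow local minimum near x = +1 and a deeper, better minimum near x = −1; the same descent routine lands in different minima depending on where it starts:

```python
def grad_descent(deriv, x, step=0.01, iters=2000):
    for _ in range(iters):
        x -= step * deriv(x)
    return x

# Illustrative non-convex function and its derivative.
f = lambda x: (x ** 2 - 1) ** 2 + 0.3 * x
df = lambda x: 4 * x ** 3 - 4 * x + 0.3

stuck = grad_descent(df, x=2.0)    # starts on the right: lands near +1
best = grad_descent(df, x=-2.0)    # starts on the left: lands near -1
print(stuck, f(stuck))   # local minimum with a higher cost
print(best, f(best))     # the better minimum, with a lower cost
```

Both runs stop where the derivative is (numerically) zero; only the starting point decides whether we find the good minimum.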
[email protected]
ZV0GDF798E
In higher dimensions, instead of taking the derivative, you take the gradient. The
gradient points in the direction of the largest increase of the local linear
approximation, so wherever you currently are, you take a step in the direction
opposite to the gradient. When you have a convex function in higher dimensions, this
is again guaranteed to converge to the globally optimal solution.
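A minimal two-dimensional sketch of this, using the convex bowl f(x, y) = x² + 4y² as an assumed example (its gradient is (2x, 8y)):

```python
def grad_descent_nd(grad, point, step=0.1, iters=200):
    """Step opposite the gradient: the direction of steepest local decrease."""
    x, y = point
    for _ in range(iters):
        gx, gy = grad(x, y)
        x, y = x - step * gx, y - step * gy
    return x, y

# From any starting point, descent on this convex function converges
# to the unique global minimum at (0, 0).
x, y = grad_descent_nd(lambda x, y: (2 * x, 8 * y), (5.0, -3.0))
print(x, y)  # both effectively 0
```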
When you have a non-convex function, it might only converge to a locally optimal
solution, or even worse, it could get stuck at a saddle point, which is not a local
optimum but at which the gradient is zero.
So what we're going to do is use gradient descent to find the best parameters for our
deep neural network, even though the function we're trying to minimize is non-convex.
We're taking an algorithm that's guaranteed to work in the convex case, that we
know does not always work in the non-convex case, and using it anyway. One of
the greatest mysteries of deep learning is that this still seems to work: what it finds is
not necessarily the globally optimal solution, but even the locally optimal solution
it finds is seemingly still good enough in many cases.
[email protected]
ZV0GDF798E