
Setting Parameters of a Deep Neural Network - Hierarchical Representations

In the previous videos, we learned that we need to set the parameters or weights, but
how do we set the parameters of a deep neural network? Our first goal is to cast the
problem of finding good parameters as an optimization problem.

The catch is that not all optimization problems are created equal. There is an important class of optimization problems called convex optimization problems, for which we have many good algorithms and which we understand quite well: we know when and why they work. But life is not so easy in the case of deep neural networks, where the optimization problems that come out are usually non-convex and unwieldy.

Let's first define the optimization problem we'll be interested in. Returning to the ImageNet example, our deep neural network has n inputs and 1000 outputs, and each image is assigned to one of the 1000 categories.

What we're hoping for is that when we feed in a picture as an n-dimensional vector (for example, an image of a dog), the output in the last layer contains a '1' in the correct category and '0's in all the other categories.
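As a small illustration (a sketch, not from the course material; one_hot is a hypothetical helper name), the idealized output for a given category is a one-hot vector:

```python
def one_hot(true_category, num_categories):
    """Idealized target: 1 in the true category's slot, 0 everywhere else."""
    return [1.0 if i == true_category else 0.0 for i in range(num_categories)]

# With 5 categories, an image whose true label is category 2:
print(one_hot(2, 5))  # [0.0, 0.0, 1.0, 0.0, 0.0]
```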

Now, suppose we have all the parameters of a deep network; how may we evaluate how good these parameters are? We could take each of the 1 million ImageNet examples, feed each picture into the network, and check whether we got the correct label.

To be concrete, let's discuss what's called the quadratic cost function. Say the input is some vector x and its true label is category j. Then the output we would like to get is a 1 in the j-th output and zeros in every other category.


For example, in the above image, we pass the dog image as the input (it's some vector x), and the true label of the image is 'dog'. As output from the network, we would like to get 1 for the dog category and zero for all the other categories.

Now we can penalize the network by how far its output is from this idealized output. Say that on input x the network outputs a₁, a₂, ..., aₘ, where aᵢ is the output for category i. Then the penalty is:

penalty = (1 − aⱼ)² + Σᵢ≠ⱼ aᵢ²

This function evaluates to zero if you get exactly the correct output, and to something non-zero otherwise. Here aⱼ is the network's output for the correct category j.

Let's understand the function with an example.

Suppose we pass in an image x whose true label is 'cat', and the network predicts 'cat'. In this case aⱼ is 1, because the correct category was predicted, and the rest of the aᵢ's are zero.

So the penalty = (1 − 1)² + (0² + 0² + ...) = 0.

Since the category is correctly predicted, there is no penalty.
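This penalty can be sketched directly in code (a minimal sketch; quadratic_cost is a hypothetical name, and the outputs are plain Python lists rather than real network activations):

```python
def quadratic_cost(outputs, true_label):
    """Squared distance between the network's outputs and the
    idealized one-hot target for the true category."""
    return sum((a - (1.0 if i == true_label else 0.0)) ** 2
               for i, a in enumerate(outputs))

# Correct, confident prediction for the "cat" category (index 0): no penalty.
print(quadratic_cost([1.0, 0.0, 0.0], 0))  # 0.0
# Confident prediction of the wrong category: penalty (1-0)^2 + (1-0)^2 = 2.
print(quadratic_cost([0.0, 1.0, 0.0], 0))  # 2.0
```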

What we've done is define a cost function. The cost function is a way of evaluating the performance of our algorithm/model. Now, if we have the parameters of a deep neural network, we can evaluate how good they are by computing the average cost over the training examples. We want the average cost to be as small as possible, so that we can be confident we've found a well-performing deep neural network.

To find a good setting of the parameters, we look for the setting that minimizes this average cost. This is an optimization problem: we have an explicit function, depending on the parameters and the training data, that we'd like to minimize. It turns out that a set of parameters that works well on the training examples, i.e. has a low average cost, typically also achieves a low error on a new set of examples. The key idea is to find parameters with a low average cost that can predict new examples with a minimal error rate.
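The average cost over a training set can be sketched as follows (all names are hypothetical; predict stands in for a full forward pass through the network):

```python
def quadratic_cost(outputs, true_label):
    """Quadratic cost against a one-hot target for the true category."""
    return sum((a - (1.0 if i == true_label else 0.0)) ** 2
               for i, a in enumerate(outputs))

def average_cost(examples, predict):
    """Mean quadratic cost of a model `predict` over (input, label) pairs."""
    return sum(quadratic_cost(predict(x), y) for x, y in examples) / len(examples)

# A toy "model" that always outputs category 0 with full confidence:
always_zero = lambda x: [1.0, 0.0, 0.0]
examples = [("image_a", 0), ("image_b", 1)]  # second label is wrong for this model
print(average_cost(examples, always_zero))  # (0 + 2) / 2 = 1.0
```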

We now have an optimization problem that, if we could solve it, would give us good parameters. Is there some off-the-shelf method we can plug any optimization problem into and get the best answer? Absolutely not.

The difference between which optimization problems are easy to solve and which ones are hard is one of the foundational issues in theoretical computer science.

What types of optimization problems are easy?


Let's start with one dimension to keep things as simple as possible. Say you have some function f(x), where x is a real variable (it can take any real value), and you want to find the x that minimizes f(x).

There's an important class of functions called convex functions. There are many ways to define convex functions, but let's start with an example: f(x) = x².

The above graph is convex because:


1. Whenever you take any two points on the curve and draw the line segment between them, the segment lies entirely on or above the curve.

2. The more mathematical definition is: a function is convex if and only if its second derivative is always non-negative.

In the above example, f(x) = x². To calculate the second derivative, we first calculate the first-order derivative and then differentiate it again.

From calculus, we know that the derivative of xⁿ is n·xⁿ⁻¹. The first derivative of f(x) is written f'(x).

Substituting n = 2: for f(x) = x², the first-order derivative is f'(x) = 2x.

For the second-order derivative, we differentiate the first-order derivative itself:

f''(x) = (2x)' = 2.

Since 2 is non-negative (for every x), f(x) = x² is a convex function.
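Both definitions can be checked numerically for f(x) = x² (a sketch using finite differences; not part of the original material):

```python
import itertools

f = lambda x: x ** 2

# Chord test: the midpoint of any chord lies on or above the curve.
points = [-3.0, -1.0, 0.5, 2.0]
for x1, x2 in itertools.combinations(points, 2):
    assert (f(x1) + f(x2)) / 2 >= f((x1 + x2) / 2)

# Second-derivative test: a central finite-difference estimate of f''.
def second_derivative(f, x, h=1e-4):
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

print(round(second_derivative(f, 1.0), 3))  # 2.0, non-negative everywhere
```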

For a convex function, every local minimum is also the global minimum. The point at which a function takes its absolute minimum value is called the global minimum. When minimizing a function, however, it may appear to attain a minimum at several different points. Points where the function is smaller than anywhere nearby, but which are not where it takes its absolute minimum value, are called local minima.

So what's an example of a non-convex function?

Let's say: f(x) = (x − 1)²(x − 2)(x − 3):

There are points on this curve where the derivative is zero, and it is positive just before and negative just after (the green region in the figure, a local maximum). This means the second derivative is negative around that region, so the function is not convex.
The real question is: why is this bad? The important point is that convex functions have no local minima that are not also global minima; the local and global minima coincide. So there is nowhere you can get stuck while greedily following the path of steepest descent, because you will never reach a minimum of a convex function that isn't the globally optimal solution.

But in the non-convex case, you absolutely can get stuck. In our second example, you could get stuck at x = 1, which achieves f(x) = 0, even though there is another x (between 2 and 3) where f takes a smaller value.

Let’s understand what's going on here.


If you have a convex function, you are at some value x, and you are searching for the value that minimizes f, you take the derivative at x: if it is negative, you take a step to the right (increase x); if it is positive, you take a step to the left (decrease x).

So in a convex function, to reach the global minima the steps carried out are:

1. Calculate the derivative at x.
2. If the derivative is negative, you would take a step right or increase the value of
x.
3. If the derivative is positive, you would take a step left or decrease the value of x.
4. Repeat the process until the derivative is zero.

If you choose the step size correctly, this is guaranteed to converge to the global minimizer of the function. As you may have noticed, the step should shrink along with the magnitude of the derivative, so that you do not overshoot the minimum.
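The four steps above can be condensed into a minimal 1D gradient-descent sketch (hypothetical names; a fixed step size is assumed, which works for this example):

```python
def gradient_descent_1d(df, x0, step=0.1, iters=100):
    """Repeatedly step against the sign of the derivative df."""
    x = x0
    for _ in range(iters):
        x -= step * df(x)  # negative derivative -> move right; positive -> move left
    return x

# f(x) = x^2 has derivative f'(x) = 2x and its global minimum at x = 0.
x_min = gradient_descent_1d(lambda x: 2 * x, x0=5.0)
print(round(x_min, 6))  # 0.0
```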

Now, what happens if you try the same strategy on a non-convex function? All you can say is that you may reach a local minimum; in general, it will not be a global minimum.
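Running the same sketch on the non-convex example f(x) = (x − 1)²(x − 2)(x − 3) shows exactly this: the answer depends on where you start (the derivative is estimated numerically here; step size and iteration count are assumptions):

```python
f = lambda x: (x - 1) ** 2 * (x - 2) * (x - 3)
df = lambda x, h=1e-6: (f(x + h) - f(x - h)) / (2 * h)  # numeric derivative

def descend(x, step=0.01, iters=5000):
    for _ in range(iters):
        x -= step * df(x)
    return x

# Starting on the left gets stuck in the local minimum at x = 1, where f = 0 ...
print(round(descend(0.5), 3))     # 1.0
# ... while starting on the right finds the better minimum between 2 and 3, where f < 0.
print(round(f(descend(3.5)), 2))  # -0.62
```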


All of these ideas can be extended in a straightforward manner to higher-dimensional spaces.

Instead of taking the derivative, you take the gradient. The gradient points in the direction of the largest increase of the local linear approximation, so from wherever you currently are, you take a step in the direction opposite to the gradient. For a convex function in higher dimensions, this is again guaranteed to converge to the globally optimal solution.
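In higher dimensions the sketch barely changes: the scalar derivative becomes a gradient vector (hypothetical names; a convex quadratic is used so convergence to the global optimum is guaranteed):

```python
def gradient_descent(grad, x0, step=0.1, iters=200):
    """Step opposite the gradient vector."""
    x = list(x0)
    for _ in range(iters):
        x = [xi - step * gi for xi, gi in zip(x, grad(x))]
    return x

# Convex example: f(x, y) = x^2 + 2*y^2 with gradient (2x, 4y); minimum at (0, 0).
x_min = gradient_descent(lambda v: [2 * v[0], 4 * v[1]], [3.0, -2.0])
print(all(abs(c) < 1e-6 for c in x_min))  # True: converged to the global optimum
```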

For a non-convex function, it might only converge to a locally optimal solution, or, even worse, it could get stuck at a saddle point: a point that is not a local optimum but at which the gradient is zero.
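A classic saddle point can be seen in f(x, y) = x² − y² (a sketch, not from the source): the gradient vanishes at the origin even though the origin is not a local minimum.

```python
f = lambda x, y: x ** 2 - y ** 2
grad = lambda x, y: (2 * x, -2 * y)

print(grad(0, 0))  # (0, 0): gradient descent would stall here
print(f(0, 0))     # 0
print(f(0, 0.5))   # -0.25: moving along the y-axis strictly decreases f
```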

So what we're going to do is use gradient descent to find the best parameters for our deep neural network, even though the function we're trying to minimize is non-convex. We're taking an algorithm that is guaranteed to work in the convex case, that we know does not always work in the non-convex case, and using it anyway. One of the great mysteries of deep learning is that this still seems to work: what it finds is not necessarily the globally optimal solution, but even the locally optimal solutions it finds are seemingly good enough in many cases.

