1.3 Setting Parameters of a Deep Neural Network - Hierarchical Representations
In the previous videos, we learned that we need to set the parameters or weights, but
how do we set the parameters of a deep neural network? Our first goal is to cast the
problem of finding good parameters as an optimization problem.
The catch is that not all optimization problems are created equal. There's an important
class called convex optimization problems, for which we have many good algorithms
and a solid understanding of when and why they work. But life is not so easy in the
case of deep neural networks - the optimization problems that come out are usually
non-convex and unwieldy.
Let’s first define the optimization problem we'll be interested in. Returning to the
ImageNet example, our deep neural networks have n inputs and 1000 outputs. The
image will be assigned to one of the 1000 categories.
[email protected]
ZV0GDF798E
What we're hoping for is that when we feed in a picture as an n-dimensional vector,
for example, an image of a dog, then the output in the last layer contains a ‘1’ in the
correct category and ‘0’s in all the other categories.
This file is meant for personal use by [email protected] only.
Sharing or publishing the contents in part or full is liable for legal action.
Now let's say we have all the parameters of a deep network; how may we evaluate
how good these parameters are? We could take each one of the 1 million examples of
ImageNet, feed each picture into the network, and check whether we got the
correct label or not.
To be concrete, let's discuss what's called the quadratic cost function. Say the
input is some vector x and its true label is category j. Then the output we would
like to get is a 1 in the j-th category of outputs and zeros in every other category.
[email protected]
ZV0GDF798E
For example, in the above image we pass in the dog image as some vector x, so the
true label of the image is "dog". Then, as output from the network, we would like to
get zero for all the other categories and 1 for the dog category.
Now we could penalize the network by how far off the output is from this idealized
output. Let's say that on input x the network outputs 𝑎1, 𝑎2, 𝑎3, ..., 𝑎𝑚, where 𝑎𝑖 is the
output for the i-th category and j is the true category. Then the penalty is the squared
distance to the ideal one-hot output:

𝐶(𝑥) = (𝑎𝑗 − 1)² + Σ𝑖≠𝑗 𝑎𝑖²
This function evaluates to zero if you get exactly the correct output, and to
something non-zero if you get the wrong category. Here j is the correct category the
image belongs to, and 𝑎𝑗 is the network's output for that category.
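As a minimal sketch, the quadratic cost described above can be computed as follows (the function name and the plain-list representation of the output vector are illustrative choices, not from the course):

```python
def quadratic_cost(outputs, true_category):
    """Quadratic cost: squared distance between the network's output
    vector and the one-hot target for the true category."""
    target = [1.0 if i == true_category else 0.0 for i in range(len(outputs))]
    return sum((a - y) ** 2 for a, y in zip(outputs, target))

# Perfect output for category 2 of 5: the cost is exactly zero.
print(quadratic_cost([0, 0, 1, 0, 0], 2))   # 0.0
# Confident but wrong output: the cost is large.
print(quadratic_cost([1, 0, 0, 0, 0], 2))   # 2.0
```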
What we've done is define a cost function: a way of evaluating the performance of our
algorithm or model. Now, given the parameters of a deep neural network, we can
evaluate how good they are by computing the average cost over the examples. We
want this average cost to be as small as possible in order to be confident we've
found a well-performing deep neural network.
Now, to find a good setting of the parameters, we look for the setting that minimizes
this average cost. This is an optimization problem: we have an explicit function,
depending on the parameters and the training data, that we'd like to minimize. It
turns out that parameters which achieve a low average cost on the training examples
typically also achieve a low error rate on a new set of examples, and that is the real
goal: optimal parameters that predict new examples with a minimal error rate.
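Averaging the per-example penalty gives the quantity we minimize during training. A small sketch, assuming the same one-hot quadratic cost as above (the names `average_cost`, `predictions`, and `labels` are illustrative):

```python
def average_cost(predictions, labels):
    """Average quadratic cost over a labelled dataset.
    `predictions` is a list of output vectors; `labels` gives the
    index of the true category for each example."""
    def cost(outputs, j):
        return sum((a - (1.0 if i == j else 0.0)) ** 2
                   for i, a in enumerate(outputs))
    return sum(cost(p, j) for p, j in zip(predictions, labels)) / len(labels)

# Two examples: one perfect (cost 0.0), one wrong (cost 2.0).
print(average_cost([[0, 1], [0, 1]], [1, 0]))  # 1.0
```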
We now have an optimization problem whose solution would give us good parameters.
Is there just some off-the-shelf method we can plug any optimization problem into
and get the best answer? Absolutely not.
Distinguishing which optimization problems are easy to solve and which are hard is
one of the foundational issues in theoretical computer science.
There's an important class of functions called convex functions. There are many ways
to define convex functions, but let's start with an example: 𝑓(𝑥) = 𝑥²
[email protected]
ZV0GDF798E
4
This file is meant for personal use by [email protected] only.
Sharing or publishing the contents in part or full is liable for legal action.
2. The more mathematical definition is: a function is convex if and only if its
second derivative is always non-negative.
In the above example, the function is 𝑓(𝑥) = 𝑥². To calculate the second
derivative, we calculate the first derivative and then differentiate it again.
From calculus, we know that the derivative of 𝑥ⁿ is 𝑛𝑥ⁿ⁻¹, so 𝑓′(𝑥) = 2𝑥 and
𝑓″(𝑥) = 2, which is non-negative everywhere; hence 𝑓 is convex.
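For 𝑓(𝑥) = 𝑥² the second derivative is the constant 2, which we can confirm numerically with a central finite-difference estimate (the helper name and the step size h are arbitrary choices for this sketch):

```python
def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

f = lambda x: x ** 2
# f''(x) = 2 at every sample point, consistent with convexity.
for x in (-3.0, 0.0, 5.0):
    print(round(second_derivative(f, x), 3))
```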
In a convex function, every local minimum is also the global minimum. The point at
which a function takes its absolute minimum value is called the global minimum.
However, when the goal is to minimize the function, it may happen that the function
appears to have a minimum value at several different points. These points appear to
be minima but are not actually where the function takes its absolute minimum value;
they are called the local minima of the function.
There are points on this curve where the derivative is zero but is positive before and
negative after - local maxima. This means the second derivative is negative around
the green region, so this function is not convex.
The real question is, why is this bad? The important point is that convex functions have
no local minima that are not also global minima: for a convex function, the local and
global minima are at the same point. So there is nowhere you can get stuck if
you're greedily following the path of steepest descent, because you will never reach a
minimum in a convex function that isn't actually the globally optimal solution.
But in the non-convex case, you absolutely can get stuck.
In our second example, you could get stuck at x = 1, which achieves f(x) = 0, even
though there is another x that achieves a better (lower) minimum than x = 1 does.
[email protected]
ZV0GDF798E
So for a convex function, the steps carried out to reach the global minimum are:
1. Calculate the derivative at x.
2. If the derivative is negative, you would take a step right or increase the value of
x.
3. If the derivative is positive, you would take a step left or decrease the value of x.
4. Repeat the process until the derivative is zero.
If you choose the step size correctly, this is guaranteed to converge to the global
minimizer of the function. As you may have noticed, you should scale the magnitude
of each step with how large the derivative is, so the steps shrink as you approach the
minimum.
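The steps above can be sketched in a few lines (the learning rate, iteration count, and starting point are arbitrary choices for this illustration):

```python
def gradient_descent_1d(deriv, x, step=0.1, iters=100):
    """Repeatedly step against the sign of the derivative. Because the
    step is scaled by the derivative's magnitude, steps shrink
    automatically as we approach the minimum."""
    for _ in range(iters):
        # Negative derivative -> move right; positive -> move left.
        x = x - step * deriv(x)
    return x

# For f(x) = x**2 the derivative is 2x; start far from the minimum at 0.
x_min = gradient_descent_1d(lambda x: 2 * x, x=10.0)
print(round(x_min, 6))  # converges to 0.0
```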
Now, what happens if you try the same strategy on a non-convex function? All you can
say is that you may reach a local minimum.
But in general, it will not be a global minimum.
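We can see this getting-stuck behaviour concretely. The course's plotted function isn't reproduced here, so the sketch below uses a hypothetical stand-in, f(x) = (x² − 1)² + 0.3x, which has a shallow local minimum near x = +1 and a deeper, better minimum near x = −1; the same descent routine lands in different minima depending on where it starts:

```python
def grad_descent(deriv, x, step=0.01, iters=2000):
    for _ in range(iters):
        x -= step * deriv(x)
    return x

# Illustrative non-convex function and its derivative.
f = lambda x: (x ** 2 - 1) ** 2 + 0.3 * x
df = lambda x: 4 * x ** 3 - 4 * x + 0.3

stuck = grad_descent(df, x=2.0)    # starts on the right: lands near +1
best = grad_descent(df, x=-2.0)    # starts on the left: lands near -1
print(stuck, f(stuck))   # local minimum with a higher cost
print(best, f(best))     # the better minimum, with a lower cost
```

Both runs stop where the derivative is (numerically) zero; only the starting point decides whether we find the good minimum.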
[email protected]
ZV0GDF798E
In higher dimensions, instead of taking the derivative, you take the gradient. The
gradient points in the direction of the largest increase of the local linear
approximation, so wherever you currently are, you take a step in the direction
opposite to the gradient. When you have a convex function in higher dimensions, this
is again guaranteed to converge to the globally optimal solution.
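A minimal two-dimensional sketch of this, using the convex bowl f(x, y) = x² + 4y² as an assumed example (its gradient is (2x, 8y)):

```python
def grad_descent_nd(grad, point, step=0.1, iters=200):
    """Step opposite the gradient: the direction of steepest local decrease."""
    x, y = point
    for _ in range(iters):
        gx, gy = grad(x, y)
        x, y = x - step * gx, y - step * gy
    return x, y

# From any starting point, descent on this convex function converges
# to the unique global minimum at (0, 0).
x, y = grad_descent_nd(lambda x, y: (2 * x, 8 * y), (5.0, -3.0))
print(x, y)  # both effectively 0
```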
When you have a non-convex function, it might only converge to a locally optimal
solution, or even worse, it could get stuck at a saddle point, which is not a local
optimum but at which the gradient is zero.
So what we're going to do is use gradient descent to find the best parameters for our
deep neural network, even though the function we're trying to minimize is non-convex.
We're taking an algorithm that's guaranteed to work in the convex case, that we
know does not always work in the non-convex case, and using it anyway. One of
the greatest mysteries of deep learning is that this still seems to work: what it finds is
not necessarily the globally optimal solution, but even the locally optimal solution
it finds is seemingly still good enough in many cases.
[email protected]
ZV0GDF798E