
25 October, 2016

Neural Networks
Course 4: Making the neural network more efficient
Overview

 The Problem With Quadratic Cost
 Cross Entropy
 Softmax
 Weight Initialization
 How to adjust hyper-parameters
 Conclusions
The Problem With Quadratic Cost
The Problem With Quadratic Cost

 In the last course we used the Mean Squared Error as our cost function
 Even though we achieved good accuracy with this cost function, it is not the best choice, since learning can be slow
 An important feature that we want from a neuron (and from a neural network) is to learn fast. For this to happen, the weights must be adjusted in proportion to how big the error is:
   If the error is big, then a big adjustment must be made in order to drive the cost down
   If the error is small, then we want to make only a small adjustment, so that we do not overshoot our target
The Problem With Quadratic Cost

 A small experiment:
   We will take a neuron with only one input (one weight) and one bias. The input will always be 1.
   The role of the neuron is to find the weight and bias that make its output zero, i.e. to drive the input 1 towards 0.

[Diagram: a single neuron with input x = 1, weight w, bias b, sigmoid activation σ, and target t = 0]

 We will test the neuron in two variants (a code sketch of the experiment follows below):
   w = 0.6, b = 0.9, so z = 1.5 and σ(1.5) ≈ 0.82
   w = 2, b = 2, so z = 4 and σ(4) ≈ 0.98
 Observe how fast the output drops to 0 in each case
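A minimal Python sketch of this experiment (the learning rate of 0.15 and the 300 gradient-descent steps are illustrative assumptions, not values from the slides), using the quadratic-cost gradients derived on the next slide:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_quadratic(w, b, eta=0.15, steps=300):
    """One sigmoid neuron, input x = 1, target t = 0, quadratic cost C = (a - t)^2 / 2."""
    x, t = 1.0, 0.0
    for step in range(steps):
        a = sigmoid(w * x + b)
        # dC/dw = (a - t) * sigma'(z) * x,   dC/db = (a - t) * sigma'(z)
        grad_common = (a - t) * a * (1.0 - a)
        w -= eta * grad_common * x
        b -= eta * grad_common
        if step % 100 == 0:
            print(f"step {step:3d}: output = {a:.3f}")
    return w, b

train_quadratic(w=0.6, b=0.9)   # starts at sigma(1.5) ~ 0.82 and learns quickly
train_quadratic(w=2.0, b=2.0)   # starts at sigma(4) ~ 0.98 (saturated) and learns slowly
```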
The Problem With Quadratic Cost

 As can be observed, when the error is big (the second case), the learning is slower. This is the opposite of what we want.
 Why is this happening?
 The cost is C = (a − t)² / 2
 The weight and bias are adjusted according to the formulas (η is the learning rate):

    w ← w − η ∂C/∂w,  where  ∂C/∂w = ∂C/∂a · ∂a/∂z · ∂z/∂w = (a − t) σ′(z) x = a σ′(z)  (for our case, since t = 0 and x = 1)

    b ← b − η ∂C/∂b,  where  ∂C/∂b = ∂C/∂a · ∂a/∂z · ∂z/∂b = (a − t) σ′(z) = a σ′(z)  (for our case)
The Problem With Quadratic Cost
 So, how the cost changes with respect to the weight or the bias depends on σ′(z)

[Plot: the graph of σ′(z), which peaks at z = 0 and becomes almost flat for large |z|]

 For large values of |z|, the sigmoid is almost flat, so the derivative σ′(z) is very small. Thus, learning is slow. In this case, we say that the neuron has saturated on the wrong value.
Cross Entropy
Cross Entropy
 One way to solve the slow-learning problem is to change the cost function.
 We want a cost function whose derivative does not contain σ′(z).
 For a neuron with multiple inputs (vector x) and one output (a), the cross entropy is defined as:

    C = −(1/n) Σ_x [ y ln a + (1 − y) ln(1 − a) ]

where:
    n = number of training items
    x = a training item
    a = the activation of the neuron for item x
    y = the expected output for item x
and the sum is over all training items.
Cross Entropy
 Does this function fix our problem?

    C = −(1/n) Σ_x [ y ln a + (1 − y) ln(1 − a) ]

 First, observe that it behaves like a cost function:
   Since a ∈ (0, 1), both ln a and ln(1 − a) are negative, and the bracket is multiplied by a minus sign, so the result is a positive number
   When y = 0 and a ≈ 0, the term is 0 (or very close to 0)
   When y = 1 and a ≈ 1, the term is 0 (or very close to 0)
   When y = 1, the cost depends on ln a, and since ln is monotonic the cost keeps decreasing as a approaches 1
 So this seems to work like a cost function, but what do ∂C/∂w and ∂C/∂b look like?
Cross Entropy
    ∂C/∂w_i = ∂C/∂a · ∂a/∂z · ∂z/∂w_i = −(1/n) Σ_x  ∂[y ln a + (1 − y) ln(1 − a)]/∂a · σ′(z) · x_i

    ∂[y ln a + (1 − y) ln(1 − a)]/∂a = y/a + (1 − y)/(1 − a) · (−1) = y/a − (1 − y)/(1 − a)

    σ′(z) = (1 − a) a
-------------------------------------------------------------------------------
    ∂C/∂w_i = −(1/n) Σ_x ( y/a − (1 − y)/(1 − a) ) · (1 − a) · a · x_i = −(1/n) Σ_x [ y (1 − a) − (1 − y) a ] x_i

 ∂C/∂w_i = −(1/n) Σ_x (y − a) x_i
 ∂C/∂b = −(1/n) Σ_x (y − a)
Cross Entropy
 We will now repeat the previous experiment, but this time with the cross-entropy cost function (see the sketch below).
 One thing that we must also change is the learning rate, since a suitable learning rate depends on the cost function. Changing the learning rate is not cheating: we are interested in how the learning speed changes over time, not in the absolute speed of learning.
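A minimal sketch of the same experiment with the cross-entropy cost, reusing the sigmoid helper from the earlier sketch; the learning rate and step count are again illustrative assumptions. Note that the σ′(z) factor cancels, so the update is driven directly by the error (a − y):

```python
def train_cross_entropy(w, b, eta=0.15, steps=300):
    """One sigmoid neuron, input x = 1, target y = 0, cost C = -[y ln a + (1 - y) ln(1 - a)]."""
    x, y = 1.0, 0.0
    for step in range(steps):
        a = sigmoid(w * x + b)
        # dC/dw = (a - y) * x,   dC/db = (a - y)   -- no sigma'(z) factor
        w -= eta * (a - y) * x
        b -= eta * (a - y)
        if step % 100 == 0:
            print(f"step {step:3d}: output = {a:.3f}")
    return w, b

train_cross_entropy(w=0.6, b=0.9)   # unsaturated start
train_cross_entropy(w=2.0, b=2.0)   # saturated start: now the neuron also learns fast
```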
Cross Entropy
 So far we have used the cost function for only one output. Of course, this can be generalized:

    C = −(1/n) Σ_x Σ_j [ y_j ln a_j^L + (1 − y_j) ln(1 − a_j^L) ]

 The error in the final layer, ∂C/∂z_j^L, becomes:

    ∂C/∂z_j^L = ∂C/∂a_j^L · ∂a_j^L/∂z_j^L = −(1/n) Σ_x ( y_j/a_j^L − (1 − y_j)/(1 − a_j^L) ) (1 − a_j^L) a_j^L = −(1/n) Σ_x (y_j − a_j^L)

    ∂C/∂z_j^L = (1/n) Σ_x (a_j^L − y_j)
Cross Entropy
 Of course, since we will be using Stochastic Gradient Descent, we will not divide the cost of each element by the size of the whole dataset (n), but by the size of the mini-batch (m).
 So, the only thing we actually need to change in the backpropagation algorithm to make this work is how the error in the output layer is computed (a sketch of this change follows below):

    ∂C/∂z^L = a^L − y
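In a backpropagation implementation, this is a change to a single line: the output-layer error no longer contains the σ′(z) factor. A sketch with illustrative names (a_L, z_L and y stand for the output activations, the output net inputs and the target of one example):

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def output_delta(a_L, z_L, y, cost="cross_entropy"):
    """Output-layer error dC/dz^L for one training example."""
    if cost == "quadratic":
        # quadratic cost: (a^L - y) * sigma'(z^L), which vanishes when the neuron saturates
        return (a_L - y) * sigmoid_prime(z_L)
    # cross-entropy cost: a^L - y, the sigma'(z) factor has cancelled out
    return a_L - y
```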
(more on) Cross Entropy

 Where did this function come from?! (it looks very complicated at first sight)
 Consider that we have a system (a neural network) with some configuration, which must classify its inputs into m classes (e.g. digits).
 The probability of classifying an input into class j is a_j, where Σ_j a_j = 1.
 Let's suppose that in our dataset we have k_j elements of each class j. According to the model, the likelihood of this happening is:

    P(data | model) = a_1^{k_1} · a_2^{k_2} · … · a_m^{k_m}
(more on) Cross Entropy

 If we apply the logarithm function, then:

    ln P(data | model) = ln(a_1^{k_1} · a_2^{k_2} · … · a_m^{k_m}) = k_1 ln(a_1) + k_2 ln(a_2) + … + k_m ln(a_m) = Σ_j k_j ln(a_j)

 Using the logarithm gives us some advantages:
• It's a monotonic function, so maximizing ln P is the same as maximizing P
• It transforms the product into a sum
• The logarithm of a very small number is a large negative number, so very small probabilities do not make the whole expression vanish
(more on) Cross Entropy

 Obviously, we want to maximize this probability, but since we're used to minimizing a cost function, we'll minimize the same quantity with the opposite sign (−).
 If we divide by the number of elements in the dataset (n), then k_j / n becomes the true probability p_j of the elements of class j.
 This gives another formula for the cross entropy:  −Σ_j p_j ln(a_j)
(more on) Cross Entropy

 If we divide by the number of elements in the dataset (n), we'll have:

    −(1/n) Σ_j k_j ln(a_j)

 Since the output vector y is one-hot (only one of its elements has value 1, the others are 0; e.g., for digits: 0, 0, 0, 1, 0, 0, 0, 0, 0, 0):

    k_j = Σ_x y_j

 The above equation becomes:

    −(1/n) Σ_j ( Σ_x y_j ) ln(a_j)
(more on) Cross Entropy

 The standard formula for cross entropy:

    C = −(1/n) Σ_x Σ_j y_j ln(a_j)

 If there are only two possible classes, we can use just one output, and the above formula becomes:

    C = −(1/n) Σ_x [ y ln a + (1 − y) ln(1 − a) ]
(more on) Cross Entropy

 Another often-used cost function, when doing online training, is:

    C = −ln(a_j)     (where j is the correct label)

 Of course, this is still the cross entropy, but in a simplified form that takes into account the fact that y_j = 1 for the right label and 0 for the rest (one-hot), and that online training processes one example at a time. (The three forms are sketched in code below.)
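The three forms above can be written compactly; a sketch with illustrative names, where y holds the targets and a holds the corresponding output activations:

```python
import numpy as np

def cross_entropy_binary(a, y):
    """Two-class form: C = -(1/n) * sum_x [ y ln a + (1 - y) ln(1 - a) ]; a, y of shape (n,)."""
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

def cross_entropy(a, y):
    """General form: C = -(1/n) * sum_x sum_j y_j ln(a_j); a, y of shape (n, m), y one-hot."""
    return -np.sum(y * np.log(a)) / y.shape[0]

def cross_entropy_online(a, label):
    """Online / one-hot simplification for a single example: C = -ln(a_label)."""
    return -np.log(a[label])

# In practice a small epsilon is added inside the logarithms to avoid log(0).
```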
Softmax
Softmax
 When we classified the MNIST digits we didn't treat the outputs as probabilities, yet we used cross entropy, which works with probabilities.
 The only thing that must be changed is the output layer.
 Instead of outputting a_j^L = σ(z_j^L), where z_j^L = Σ_k w_jk^L a_k^{L−1} + b_j^L, we'll compute a probability from z. More exactly (a code sketch follows below):

    a_j^L = e^{z_j^L} / Σ_k e^{z_k^L}
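A minimal NumPy sketch of the softmax output; subtracting max(z) before exponentiating is a standard numerical-stability trick (an addition of this sketch, not mentioned in the slides) and does not change the result:

```python
import numpy as np

def softmax(z):
    """a_j = exp(z_j) / sum_k exp(z_k), for a vector of net inputs z."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())     # shift by max(z) for numerical stability
    return e / e.sum()

print(softmax([1.5, 0.2, -0.3]))   # non-negative values that sum to 1
```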
Softmax
 How does ∂C/∂z_j^L look, given:

    a_j^L = e^{z_j^L} / Σ_k e^{z_k^L}

    C = −(1/n) Σ_x Σ_j y_j ln(a_j)
Softmax
    ∂C/∂z_j^L = Σ_i (∂C/∂a_i^L)(∂a_i^L/∂z_j^L) = −(1/n) Σ_x Σ_i ( y_i / a_i^L ) (∂a_i^L/∂z_j^L)

    ∂a_i^L/∂z_j^L = ∂/∂z_j^L [ e^{z_i^L} / Σ_k e^{z_k^L} ]

 If i = j:
    ∂a_i^L/∂z_j^L = [ e^{z_j^L} Σ_k e^{z_k^L} − e^{z_j^L} e^{z_j^L} ] / ( Σ_k e^{z_k^L} )² = a_j − a_j² = a_j (1 − a_j)

 If i ≠ j:
    ∂a_i^L/∂z_j^L = − e^{z_i^L} e^{z_j^L} / ( Σ_k e^{z_k^L} )² = −a_i a_j
Softmax
    ∂C/∂z_j^L = Σ_i (∂C/∂a_i^L)(∂a_i^L/∂z_j^L) = −(1/n) Σ_x [ ( y_j / a_j ) a_j (1 − a_j) + Σ_{i≠j} ( y_i / a_i )(−a_i a_j) ]

    = −(1/n) Σ_x [ y_j − y_j a_j − Σ_{i≠j} y_i a_j ] = −(1/n) Σ_x [ y_j − a_j ( y_j + Σ_{i≠j} y_i ) ]

 Since y is one-hot, Σ_i y_i = 1, so:

    ∂C/∂z_j^L = −(1/n) Σ_x (y_j − a_j)
Softmax
 In order to use the softmax function, the only thing that must be modified, in addition to using the cross entropy, is the activation function in the output layer:

    a_j^L = e^{z_j^L} / Σ_k e^{z_k^L}

 In fact, the reason why the previous version of the network works (the one that doesn't output probabilities in the output layer) is that it has the same gradient as cross entropy + softmax (a quick numerical check is sketched below).
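A small numerical check of this gradient for the softmax + cross-entropy combination: the analytic gradient ∂C/∂z = a − y (for one example) is compared against a finite-difference estimate. The values used are arbitrary illustrations:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=5)
y = np.zeros(5)
y[2] = 1.0                                  # one-hot target

cost = lambda z: -np.sum(y * np.log(softmax(z)))

analytic = softmax(z) - y                   # claimed gradient dC/dz = a - y
numeric = np.zeros_like(z)
eps = 1e-6
for j in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[j] += eps
    zm[j] -= eps
    numeric[j] = (cost(zp) - cost(zm)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # very small, ~1e-9 or below
```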
Softmax
 Usually, the cross-entropy cost function achieves the best results when used together with softmax (if the problem allows it)
Weight Initialization
Weight Initialization
 Until now we've been using random weights drawn from a standard normal distribution (a normal distribution with mean 0 and σ = 1).
 That means that about 68% of the weights have values in the interval [−1, 1], 95% in [−2, 2], and 99.7% in [−3, 3].
Weight Initialization
 The problem with these kinds of values appears when we compute the net input z = Σ_i w_i x_i + b.
 Let's consider a neuron with 1000 inputs, half of which are 0 and the other half 1.

[Diagram: a neuron with 1000 inputs (500 equal to 1 and 500 equal to 0), weights w_1 … w_1000, bias b, and net input z]
Weight Initialization
 That means that the net input is z = Σ_{i=1}^{500} w_i + b.
 Since the w_i and b are drawn from a standard normal distribution:

    μ_z = Σ_{i=1}^{500} μ_{w_i} + μ_b = 0
    var(z) = Σ_{i=1}^{500} var(w_i) + var(b) = 501

 So, in this case, z is a random variable with a normal distribution with mean 0, variance 501, and therefore standard deviation √501 ≈ 22.4.
Weight Initialization
 That means that about 95% of the z values will fall in the interval [−2·√501, 2·√501] ≈ [−45, 45]. That is a very big interval, since a sigmoid neuron usually saturates for |z| greater than about 4 (see the simulation below).
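A quick simulation of this setup (500 active inputs, standard-normal weights and bias); the sample count is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(42)
samples = 10_000

# Only the 500 weights multiplied by an input of 1 contribute to z;
# weights and bias are all drawn from N(0, 1).
z = rng.normal(size=(samples, 500)).sum(axis=1) + rng.normal(size=samples)

print(z.std())                     # close to sqrt(501) ~ 22.4
print(np.mean(np.abs(z) > 4.0))    # fraction of draws in the saturated region, roughly 0.86
```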
Weight Initialization
 The solution is to initialize the weights with values small enough that, when summed into the net input, they do not saturate the neuron.
 Thus, each weight is initialized with a random value drawn from a normal distribution with mean 0 and a standard deviation of 1/√n_in, where n_in is the total number of connections that go into the neuron (a sketch follows below).
 In our case, the standard deviation will be 1/√1000.
 The bias can still be drawn from a standard normal distribution, since it only adds 1 to the variance.
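A minimal sketch of this initialization scheme for one fully connected layer (n_in and n_out are illustrative parameter names):

```python
import numpy as np

def init_layer(n_in, n_out, rng=None):
    """Weights ~ N(0, 1/sqrt(n_in)); biases ~ N(0, 1)."""
    if rng is None:
        rng = np.random.default_rng()
    weights = rng.normal(loc=0.0, scale=1.0 / np.sqrt(n_in), size=(n_out, n_in))
    biases = rng.normal(loc=0.0, scale=1.0, size=(n_out, 1))
    return weights, biases

w, b = init_layer(n_in=1000, n_out=30)
print(w.std())   # close to 1/sqrt(1000) ~ 0.032
```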
Weight Initialization
 By using small initial weights, the network reaches a higher accuracy
 Also, with small initial weights, the network reaches its best accuracy faster
How to adjust hyper-parameters
How to adjust hyper-parameters
 Besides the weights, our network has some hyper-parameters that control how it learns:
   The learning rate η
   The mini-batch size
   The number of epochs
   The number of hidden neurons
 All of these should be tuned on a separate dataset (the validation set) in order to avoid fitting the hyper-parameters to the test set
How to adjust hyper-parameters
 The first, and probably the most difficult, step is to achieve any non-trivial learning: you must obtain results better than you would obtain by random selection.
 In the case of the MNIST digits, this means you should obtain an accuracy greater than 10%.
 Start with a smaller dataset. This increases the speed of experimentation.
   In the case of the MNIST digits, this could mean working with only two digits (0 and 1).
 Start with a smaller network.
   In the case of the MNIST digits, that could mean starting with a network of 784×10 neurons (no hidden layer).
How to adjust hyper-parameters
 Increase the monitoring frequency:
   Work with just a fraction of the validation set
   Monitor the accuracy not only once per epoch, but also after every few mini-batches (for example, every 10 mini-batches)
 All of the above steps are useful because they give you quick feedback from the network. This allows you to test many values for the hyper-parameters.
 Start by adjusting the learning rate until you see that some learning happens.
How to adjust hyper-parameters
 Increase or decrease the learning rate while monitoring the cost on the training data. The training-cost curve looks different for different values of η:
   For large values of η, gradient descent overshoots the minimum and the cost oscillates or even grows
   For small values of η, gradient descent still works, but learning is slow
How to adjust hyper-parameters
 You should start with a value for the learning rate for which the training cost decreases during the first iterations.
 Increase it by an order of magnitude (×10) at a time until the cost starts oscillating. This gives the threshold.
 You can then refine the estimate by slowly increasing the learning rate until the cost starts oscillating again (i.e. it gets close to the threshold). In practice, the value you use should be a factor of two or so below the threshold (a simple search loop is sketched below).
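A sketch of this threshold search, built around a hypothetical train_for_a_few_epochs(eta) helper that trains briefly and returns the sequence of training costs (the helper, the starting value and the try count are assumptions for illustration):

```python
def find_eta(train_for_a_few_epochs, eta=0.01, factor=10.0, max_tries=8):
    """Grow eta by an order of magnitude until the training cost stops decreasing."""
    for _ in range(max_tries):
        costs = train_for_a_few_epochs(eta)
        decreasing = all(c2 <= c1 for c1, c2 in zip(costs, costs[1:]))
        if not decreasing:        # cost oscillates or grows: eta has reached the threshold
            return eta / 2.0      # use a value a factor of ~2 below the threshold
        eta *= factor
    return eta                    # never reached the threshold within max_tries
```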
How to adjust hyper-parameters
 Choosing the number of epochs is simple: just use early stopping. That means, after each epoch, test the network on the validation set; if there has been no improvement in classification accuracy for the last x epochs, stop. x can be, for example, 10 epochs (a sketch follows below).
 At the beginning, however, you should let the network learn for a significant number of epochs, so that a temporary plateau (where accuracy stalls and then starts improving again) does not stop training too early.
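A minimal early-stopping loop, assuming hypothetical train_one_epoch() and validation_accuracy() helpers:

```python
def train_with_early_stopping(train_one_epoch, validation_accuracy,
                              patience=10, max_epochs=1000):
    """Stop when validation accuracy has not improved for `patience` consecutive epochs."""
    best_acc = 0.0
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        acc = validation_accuracy()
        if acc > best_acc:
            best_acc = acc
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"stopping at epoch {epoch}, best accuracy {best_acc:.4f}")
            break
    return best_acc
```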
How to adjust hyper-parameters
 Mini-batch size:
   Mini-batches should be used because modern libraries can compute the gradients for all the elements in a batch at once.
   Plot the validation accuracy against time (real elapsed time, not the number of epochs) and choose the mini-batch size that improves accuracy fastest.
Questions & Discussion
Bibliography
 http://neuralnetworksanddeeplearning.com/
 Chris Bishop, "Neural Networks for Pattern Recognition"
 https://visualstudiomagazine.com/articles/2014/04/01/neural-network-cross-entropy-error.aspx
 https://en.wikipedia.org/wiki/Standard_deviation