Neural Networks
Course 4: Making the neural network more efficient
Overview
The problem with quadratic cost
Cross entropy
Softmax
Weight initialization
How to adjust hyper-parameters
Conclusions
The Problem With Quadratic Cost
In the last course we used the Mean Squared Error as our cost function.
Even though we achieved good accuracy using this cost function, it is not the best choice, since learning can be slow.
A small experiment: we take a neuron with only one input (one weight) and one bias. The input will always be 1.
The role of the neuron is to find the weight and bias that make its output zero, i.e. to drive the 1 towards 0.
[Diagram: a single sigmoid neuron with input x = 1, weight w, bias b, and target t = 0]
As can be observed, when the error is large (the second case), learning is slower. This is the opposite of what we want.
$$b \leftarrow b - \eta \frac{\partial C}{\partial b}, \qquad \frac{\partial C}{\partial b} = \frac{\partial C}{\partial a} \cdot \frac{\partial a}{\partial b} = \frac{\partial C}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial b} = (a - t)\,\sigma'(z) = a\,\sigma'(z) \quad (\text{for our case, } t = 0)$$
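For illustration, here is a minimal sketch of this experiment in Python with NumPy (not the course code; the starting point w = b = 2 and the learning rate 0.15 are arbitrary choices that leave the neuron saturated near 1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single neuron, input x = 1, target t = 0, quadratic cost C = (a - t)^2 / 2
x, t, eta = 1.0, 0.0, 0.15
w, b = 2.0, 2.0                      # saturated start: a = sigmoid(4) ~ 0.98

for epoch in range(301):
    z = w * x + b
    a = sigmoid(z)
    grad = (a - t) * a * (1.0 - a)   # dC/dz = (a - t) * sigma'(z)
    w -= eta * grad * x              # dC/dw = dC/dz * x
    b -= eta * grad                  # dC/db = dC/dz
    if epoch % 100 == 0:
        print(f"epoch {epoch:3d}  a = {a:.4f}")
```

Because the gradient carries the σ′(z) factor, the printed output creeps down very slowly at first, exactly the slow start described above.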
The Problem With Quadratic Cost
So, how the cost changes with respect to the weight or the bias depends on σ′(z).
[Plot: the sigmoid function and its derivative σ′(z)]
For large values of |z| the sigmoid is almost flat, so the derivative is very small and learning is slow. In this case we say that the neuron has saturated on the wrong value.
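For concreteness, a few values of σ′(z) (a quick NumPy check; the chosen z values are arbitrary):

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

for z in (0.0, 4.0, 10.0):
    print(f"z = {z:4.1f}   sigma'(z) = {sigmoid_prime(z):.6f}")
# z =  0.0   sigma'(z) = 0.250000
# z =  4.0   sigma'(z) = 0.017663
# z = 10.0   sigma'(z) = 0.000045
```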
Cross Entropy
One way to solve the slow-learning problem is to change the cost function.
We want a cost function whose derivative does not contain σ′(z).
For a neuron with multiple inputs (vector x) and an output (a), the cross entropy
is defined as:
$$C = -\frac{1}{n} \sum_x \left[\, y \ln a + (1 - y) \ln(1 - a) \,\right]$$
where:
n = number of training items
x = a training item
a = activation for item x
y = expected output for item x
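As a minimal sketch (assuming NumPy arrays a and y holding, for each training item, the activation and the expected output; not the course's implementation), the cost could be computed as:

```python
import numpy as np

def cross_entropy_cost(a, y):
    """Cross-entropy cost for a single-output neuron over n training items."""
    n = len(y)
    # nan_to_num guards against log(0) when an activation is exactly 0 or 1
    return -np.sum(np.nan_to_num(y * np.log(a) + (1 - y) * np.log(1 - a))) / n

print(cross_entropy_cost(np.array([0.9, 0.2]), np.array([1.0, 0.0])))  # ~0.164
```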
So this seems to work as a cost function, but what do $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$ look like?
Cross Entropy
$$\frac{\partial C}{\partial w_i} = \frac{\partial C}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w_i} = -\frac{1}{n} \sum_x \frac{\partial \left[ y \ln a + (1-y)\ln(1-a) \right]}{\partial a} \cdot \sigma'(z) \cdot x_i$$

$$\sigma'(z) = (1 - a)\,a$$

$$\frac{\partial C}{\partial w_i} = -\frac{1}{n} \sum_x \left( \frac{y}{a} - \frac{1-y}{1-a} \right) \cdot (1-a) \cdot a \cdot x_i = -\frac{1}{n} \sum_x \left[ y(1-a) - (1-y)\,a \right] x_i$$

$$\frac{\partial C}{\partial w_i} = -\frac{1}{n} \sum_x (y - a)\, x_i$$

$$\frac{\partial C}{\partial b} = -\frac{1}{n} \sum_x (y - a)$$
Cross Entropy
We will repeat the previous experiment, but this time with the cross-entropy cost function.
One thing we must also change is the learning rate, because a suitable learning rate depends on the cost function. Changing the learning rate is not cheating, since we are interested in how the learning speed changes, not in how fast the network learns in absolute terms.
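Here is the same hypothetical experiment as the earlier sketch (same starting point w = b = 2; the learning rate 0.15 is again an arbitrary choice), but using the cross-entropy gradients derived above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single neuron, input x = 1, target y = 0, cross-entropy cost
x, y, eta = 1.0, 0.0, 0.15
w, b = 2.0, 2.0                  # same saturated start as before

for epoch in range(301):
    a = sigmoid(w * x + b)
    delta = a - y                # no sigma'(z) factor this time
    w -= eta * delta * x
    b -= eta * delta
    if epoch % 100 == 0:
        print(f"epoch {epoch:3d}  a = {a:.4f}")
```

With the σ′(z) factor gone, a large error now produces a large step, so the saturated neuron recovers quickly instead of crawling.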
Cross Entropy
So far we have been using a cost function for only one output. Of course, this can be generalized:

$$C = -\frac{1}{n} \sum_x \sum_j \left[ y_j \ln a_j^L + (1 - y_j) \ln\!\left(1 - a_j^L\right) \right]$$
The error in the final layer, $\frac{\partial C}{\partial z_j^L}$, becomes:

$$\frac{\partial C}{\partial z_j^L} = \frac{\partial C}{\partial a_j^L} \cdot \frac{\partial a_j^L}{\partial z_j^L} = -\frac{1}{n} \sum_x \left( \frac{y_j}{a_j^L} - \frac{1 - y_j}{1 - a_j^L} \right) \left(1 - a_j^L\right) a_j^L = -\frac{1}{n} \sum_x \left( y_j - a_j^L \right)$$

$$\frac{\partial C}{\partial z_j^L} = \frac{1}{n} \sum_x \left( a_j^L - y_j \right)$$
Cross Entropy
In vector form (per training example), the gradient with respect to the output-layer net inputs is simply:

$$\frac{\partial C}{\partial z^L} = a^L - y$$
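A small sketch contrasting the two output-error expressions on made-up activations (the numbers are purely illustrative):

```python
import numpy as np

a = np.array([0.98, 0.02, 0.95])          # hypothetical output activations
y = np.array([0.0,  0.0,  1.0])           # targets

delta_cross_entropy = a - y               # cross entropy + sigmoid output
delta_quadratic = (a - y) * a * (1 - a)   # quadratic cost + sigmoid output

print(delta_cross_entropy)                # [ 0.98  0.02 -0.05]
print(delta_quadratic)                    # [ 0.019208  0.000392 -0.002375]
```

The cross-entropy error stays proportional to how wrong the output is, while the quadratic-cost error is crushed by the a(1 − a) factor on saturated outputs.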
(more on)Cross Entropy
Where did the function come from?! (it looks very complicated at first sight)
Let's suppose that in our dataset we have $k_j$ elements for each class $j$. According to the model, the likelihood of this happening is:

$$P(\text{data} \mid \text{model}) = a_1^{k_1} \, a_2^{k_2} \cdots a_m^{k_m}$$
(more on)Cross Entropy
Obviously, we want to increase this probability. Maximizing $P$ is the same as maximizing $\ln P = \sum_j k_j \ln a_j$, and since we are used to minimizing a cost function, we minimize the same quantity with the opposite sign (−).

If we divide by the number of elements in the dataset ($n$), then $\frac{k_j}{n}$ becomes the true probability of the elements of each class.

Since the output vector ($y$) is one-hot (only one of its elements has value 1, the others have value 0; e.g. for digits: 0, 0, 0, 1, 0, 0, 0, 0, 0, 0),

$$k_j = \sum_x y_j$$
If the number of possible classes is just 2, we can use a single output. The above formula becomes:

$$C = -\frac{1}{n} \sum_x \left[ y \ln a + (1 - y) \ln(1 - a) \right]$$
(more on)Cross Entropy
Another cost function, often used when doing online training, is

$$C = -\ln(a_j)$$

where $j$ is the index of the correct label. Of course, this is still the cross entropy, but in a simplified form that takes into account the fact that $y_j = 1$ for the right label and 0 for the rest (one-hot).
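A one-line sketch of this per-example cost (the activation vector below is made up; label is the index of the correct class):

```python
import numpy as np

def nll_of_correct_class(a, label):
    """Online cross entropy with a one-hot target: C = -ln(a_label)."""
    return -np.log(a[label])

print(nll_of_correct_class(np.array([0.1, 0.7, 0.2]), 1))   # ~0.357
```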
Softmax
When we classified the MNIST digits we did not treat the outputs as probabilities, yet we used cross entropy, which works with probabilities.
The softmax activation fixes this: each output of the final layer becomes a probability computed from the net inputs z. More exactly,
$$a_j^L = \frac{e^{z_j^L}}{\sum_k e^{z_k^L}}$$
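A minimal NumPy sketch of the softmax (subtracting max(z) is a common numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(z):
    """Turn the vector of net inputs z into a probability distribution."""
    e = np.exp(z - np.max(z))    # shift by max(z) to avoid overflow
    return e / np.sum(e)

p = softmax(np.array([1.0, 2.0, 3.0]))
print(p, p.sum())                # ~[0.09 0.245 0.665], sums to 1.0
```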
Softmax
What does $\frac{\partial C}{\partial z_j^L}$ look like?

$$a_j^L = \frac{e^{z_j^L}}{\sum_k e^{z_k^L}}, \qquad C = -\frac{1}{n} \sum_x \sum_j y_j \ln\!\left(a_j^L\right)$$
Softmax
$$\frac{\partial a_i^L}{\partial z_j^L} = \frac{\partial}{\partial z_j^L}\!\left( \frac{e^{z_i^L}}{\sum_k e^{z_k^L}} \right) = \frac{\left(e^{z_i^L}\right)' \sum_k e^{z_k^L} - e^{z_i^L} \left( \sum_k e^{z_k^L} \right)'}{\left( \sum_k e^{z_k^L} \right)^2}$$

$$\text{if } i = j: \quad \frac{\partial a_i^L}{\partial z_j^L} = \frac{e^{z_j^L} \sum_k e^{z_k^L} - \left(e^{z_j^L}\right)^2}{\left( \sum_k e^{z_k^L} \right)^2} = \frac{e^{z_j^L}}{\sum_k e^{z_k^L}} - \left( \frac{e^{z_j^L}}{\sum_k e^{z_k^L}} \right)^{\!2} = a_j - a_j^2 = a_j\,(1 - a_j)$$

$$\text{if } i \neq j: \quad \frac{\partial a_i^L}{\partial z_j^L} = \frac{-\,e^{z_i^L} e^{z_j^L}}{\left( \sum_k e^{z_k^L} \right)^2} = -\,a_i\, a_j$$
Softmax
$$\frac{\partial C}{\partial z_j^L} = \sum_i \frac{\partial C}{\partial a_i^L} \cdot \frac{\partial a_i^L}{\partial z_j^L} = -\frac{1}{n} \sum_x \sum_i \frac{y_i}{a_i} \cdot \frac{\partial a_i^L}{\partial z_j^L} = -\frac{1}{n} \sum_x \left[ \frac{y_j}{a_j}\, a_j (1 - a_j) + \sum_{i \neq j} \frac{y_i}{a_i} \left(-a_i a_j\right) \right] =$$

$$= -\frac{1}{n} \sum_x \left[ y_j - y_j a_j - \sum_{i \neq j} y_i a_j \right] = -\frac{1}{n} \sum_x \left[ y_j - a_j \left( y_j + \sum_{i \neq j} y_i \right) \right]$$

Since $y$ is one-hot, $y_j + \sum_{i \neq j} y_i = 1$, so

$$\frac{\partial C}{\partial z_j^L} = -\frac{1}{n} \sum_x \left( y_j - a_j \right)$$
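This result can be sanity-checked numerically with finite differences on one made-up example (a sketch, not course material):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cost(z, y):
    return -np.sum(y * np.log(softmax(z)))    # per-example cross entropy

z = np.array([1.5, -0.3, 0.8, 2.1])
y = np.array([0.0, 0.0, 1.0, 0.0])            # one-hot target

analytic = softmax(z) - y                     # the derived gradient: a - y
numeric = np.zeros_like(z)
eps = 1e-6
for j in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[j] += eps
    zm[j] -= eps
    numeric[j] = (cost(zp, y) - cost(zm, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))     # tiny (~1e-10): the formulas agree
```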
Softmax
In order to use the softmax function, the only thing that must be modified, in addition to using the cross-entropy cost, is the activation function in the output layer:

$$a_j^L = \frac{e^{z_j^L}}{\sum_k e^{z_k^L}}$$
In fact, the reason why the previous version of the network works (the one that does not use probabilities in the output layer) is that it has the same output-layer gradient as cross entropy + softmax.
Weight Initialization
Until now we have been initializing the weights from a standard normal distribution (a normal distribution with mean 0 and σ = 1).
That means that 68% of the weights have values in the interval [-1, 1], 95% in [-2, 2], and 99.7% in [-3, 3].
Weight Initialization
The problem with this kind of values appears when we compute the net input $z = \sum_j w_j x_j + b$.
Let's consider a neuron with 1000 inputs, half of which are 0 and the other half are 1.

[Diagram: a neuron with inputs $x_1 \dots x_{500} = 1$ and $x_{501} \dots x_{1000} = 0$, weights $w_1 \dots w_{1000}$, bias $b$, and net input $z$]
Weight Initialization
$$\mu_z = \sum_{i=1}^{500} \mu_{w_i} + \mu_b = 0$$

$$\mathrm{var}(z) = \sum_{i=1}^{500} \mathrm{var}(w_i) + \mathrm{var}(b) = 501$$

So, in this case, z is a variable with a normal distribution with mean 0 and a standard deviation of $\sqrt{501} \approx 22.4$.
Weight Initialization
That means that 95% of the z values will be in the interval [-44.8, 44.8] (two standard deviations).
That is a very big interval, since a neuron usually saturates for values of |z| greater than about 4.
Weight Initialization
The solution is to initialize the weights with values such that, when added up in the net input, they do not saturate the neuron.
Thus, all weights will be initialized with random values drawn from a normal distribution with mean 0 and a standard deviation of $\frac{1}{\sqrt{n_{in}}}$, where $n_{in}$ is the total number of connections that go into the neuron.
In our case, the standard deviation will be $\frac{1}{\sqrt{1000}}$.
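A quick simulation comparing the two schemes on the 1000-input example (a sketch; the bias is drawn from a standard normal in both cases, matching the setup above):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in = 1000
x = np.concatenate([np.ones(500), np.zeros(500)])   # 500 inputs at 1, 500 at 0
trials = 10_000

for label, sigma in [("std = 1", 1.0), ("std = 1/sqrt(n_in)", 1.0 / np.sqrt(n_in))]:
    w = rng.normal(0.0, sigma, size=(trials, n_in))
    b = rng.normal(0.0, 1.0, size=trials)
    z = w @ x + b
    print(f"{label}:  std of z ~ {z.std():.2f}")
# std = 1:  std of z ~ 22.4             (the neuron is almost always saturated)
# std = 1/sqrt(n_in):  std of z ~ 1.22  (the net input stays in the sensitive region)
```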
How to adjust hyper-parameters
Besides the weights, our network has some hyper-parameters that control how it learns:
Learning rate η
The mini-batch size
The number of epochs
The number of hidden neurons
The first, and probably the most difficult, step is to achieve any non-trivial learning: you must obtain results better than you would obtain by random selection.
In the case of the MNIST digits, this means you should obtain an accuracy greater than 10%.
All of the above steps are useful because they allow you to receive quick feedback from the network, which lets you test many values for the parameters.
Start by adjusting the learning rate until you see some learning happening.
How to adjust hyper-parameters
You should start with a value of the learning rate for which the training cost decreases in the first iterations.
Increase it by an order of magnitude (×10) at a time until the cost starts oscillating; this is the threshold.
You can then refine the estimate by slowly increasing the learning rate until the cost starts oscillating again (i.e. it gets close to the threshold). In fact, the final value should be a factor of two or so below the threshold.
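A sketch of this coarse search in Python, assuming a hypothetical helper train_for_a_while(eta) that trains briefly with learning rate eta and returns the sequence of training costs it observed (this helper is not part of the course code):

```python
def estimate_threshold(train_for_a_while, eta=0.001, factor=10.0, max_eta=100.0):
    """Increase eta by an order of magnitude until the cost starts oscillating."""
    while eta <= max_eta:
        costs = train_for_a_while(eta)
        oscillating = any(later > earlier for earlier, later in zip(costs, costs[1:]))
        if oscillating:
            return eta               # first rate at which the cost stops decreasing
        eta *= factor                # otherwise try a 10x larger rate
    return None

# A working learning rate would then be picked a factor of two or so below
# the returned threshold, refined by slowly increasing it toward the threshold.
```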
How to adjust hyper-parameters
At the beginning you should let the network train for a significant number of iterations, to avoid being misled when it reaches a plateau and only later continues learning again.
How to adjust hyper-parameters
Plot the validation accuracy against time (real elapsed time, not the number of epochs) and choose the value that achieves the fastest improvement.
Questions & Discussion
Bibliography
http://neuralnetworksanddeeplearning.com/
Chris Bishop, "Neural Networks for Pattern Recognition"
https://visualstudiomagazine.com/articles/2014/04/01/neural-network-cross-entropy-error.aspx
https://en.wikipedia.org/wiki/Standard_deviation