Neural Networks
Course 4: Making the neural network more efficient
Overview
The problem with quadratic cost
Cross entropy
Softmax
Weight initialization
How to adjust hyper-parameters
Conclusions
The Problem With Quadratic Cost
In the last course we used the Mean Squared Error as our cost function.
Even though we achieved good accuracy using this cost function, it is not the best choice, since learning can be slow.
A small experiment: we take a neuron with only one input (one weight) and one bias. The input will always be 1.
The role of the neuron is to find the weight and bias that make its output zero, i.e. to drive the 1 towards 0.
[Diagram: a single sigmoid neuron with input x = 1, weight w, bias b, and target t = 0]
As can be observed, when the error is large (the second case), learning is slower. This is the opposite of what we want.
$$b \leftarrow b - \eta \frac{\partial C}{\partial b}, \qquad \frac{\partial C}{\partial b} = \frac{\partial C}{\partial a} \cdot \frac{\partial a}{\partial b} = \frac{\partial C}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial b} = (a - t)\,\sigma'(z) = a\,\sigma'(z) \quad (\text{for our case, } t = 0)$$
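For illustration, here is a minimal sketch of this experiment in Python with NumPy (not the course code; the starting point w = b = 2 and the learning rate 0.15 are arbitrary choices that leave the neuron saturated near 1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single neuron, input x = 1, target t = 0, quadratic cost C = (a - t)^2 / 2
x, t, eta = 1.0, 0.0, 0.15
w, b = 2.0, 2.0                      # saturated start: a = sigmoid(4) ~ 0.98

for epoch in range(301):
    z = w * x + b
    a = sigmoid(z)
    grad = (a - t) * a * (1.0 - a)   # dC/dz = (a - t) * sigma'(z)
    w -= eta * grad * x              # dC/dw = dC/dz * x
    b -= eta * grad                  # dC/db = dC/dz
    if epoch % 100 == 0:
        print(f"epoch {epoch:3d}  a = {a:.4f}")
```

Because the gradient carries the σ′(z) factor, the printed output creeps down very slowly at first, exactly the slow start described above.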
The Problem With Quadratic Cost
So, how the cost changes with respect to the weight or the bias depends on σ′(z).
[Plot: the sigmoid function and its derivative σ′(z)]
For large values of |z| the sigmoid is almost flat, so the derivative is very small and learning is slow. In this case we say that the neuron has saturated on the wrong value.
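For concreteness, a few values of σ′(z) (a quick NumPy check; the chosen z values are arbitrary):

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

for z in (0.0, 4.0, 10.0):
    print(f"z = {z:4.1f}   sigma'(z) = {sigmoid_prime(z):.6f}")
# z =  0.0   sigma'(z) = 0.250000
# z =  4.0   sigma'(z) = 0.017663
# z = 10.0   sigma'(z) = 0.000045
```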
Cross Entropy
One way to solve the slow-learning problem is to change the cost function.
We want a cost function whose derivative does not contain σ′(z).
For a neuron with multiple inputs (vector x) and an output (a), the cross entropy
is defined as:
$$C = -\frac{1}{n} \sum_x \left[\, y \ln a + (1 - y) \ln(1 - a) \,\right]$$
where:
n = number of training items
x = a training item
a = activation for item x
y = expected output for item x
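As a minimal sketch (assuming NumPy arrays a and y holding, for each training item, the activation and the expected output; not the course's implementation), the cost could be computed as:

```python
import numpy as np

def cross_entropy_cost(a, y):
    """Cross-entropy cost for a single-output neuron over n training items."""
    n = len(y)
    # nan_to_num guards against log(0) when an activation is exactly 0 or 1
    return -np.sum(np.nan_to_num(y * np.log(a) + (1 - y) * np.log(1 - a))) / n

print(cross_entropy_cost(np.array([0.9, 0.2]), np.array([1.0, 0.0])))  # ~0.164
```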
So this seems to work as a cost function, but what do $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$ look like?
Cross Entropy
$$\frac{\partial C}{\partial w_i} = \frac{\partial C}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w_i} = -\frac{1}{n} \sum_x \frac{\partial \left[ y \ln a + (1-y)\ln(1-a) \right]}{\partial a} \cdot \sigma'(z) \cdot x_i$$

$$\sigma'(z) = (1 - a)\,a$$

$$\frac{\partial C}{\partial w_i} = -\frac{1}{n} \sum_x \left( \frac{y}{a} - \frac{1-y}{1-a} \right) \cdot (1-a) \cdot a \cdot x_i = -\frac{1}{n} \sum_x \left[ y(1-a) - (1-y)\,a \right] x_i$$

$$\frac{\partial C}{\partial w_i} = -\frac{1}{n} \sum_x (y - a)\, x_i$$

$$\frac{\partial C}{\partial b} = -\frac{1}{n} \sum_x (y - a)$$
Cross Entropy
We will repeat the previous experiment, but this time with the cross-entropy cost function.
One thing we must also change is the learning rate, because a suitable learning rate depends on the cost function. Changing the learning rate is not cheating, since we are interested in how the learning speed changes, not in how fast the network learns in absolute terms.
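Here is the same hypothetical experiment as the earlier sketch (same starting point w = b = 2; the learning rate 0.15 is again an arbitrary choice), but using the cross-entropy gradients derived above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single neuron, input x = 1, target y = 0, cross-entropy cost
x, y, eta = 1.0, 0.0, 0.15
w, b = 2.0, 2.0                  # same saturated start as before

for epoch in range(301):
    a = sigmoid(w * x + b)
    delta = a - y                # no sigma'(z) factor this time
    w -= eta * delta * x
    b -= eta * delta
    if epoch % 100 == 0:
        print(f"epoch {epoch:3d}  a = {a:.4f}")
```

With the σ′(z) factor gone, a large error now produces a large step, so the saturated neuron recovers quickly instead of crawling.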
Cross Entropy
So far we have been using a cost function for only one output. Of course, this can be generalized:

$$C = -\frac{1}{n} \sum_x \sum_j \left[ y_j \ln a_j^L + (1 - y_j) \ln\!\left(1 - a_j^L\right) \right]$$
The error in the final layer, $\frac{\partial C}{\partial z_j^L}$, becomes:

$$\frac{\partial C}{\partial z_j^L} = \frac{\partial C}{\partial a_j^L} \cdot \frac{\partial a_j^L}{\partial z_j^L} = -\frac{1}{n} \sum_x \left( \frac{y_j}{a_j^L} - \frac{1 - y_j}{1 - a_j^L} \right) \left(1 - a_j^L\right) a_j^L = -\frac{1}{n} \sum_x \left( y_j - a_j^L \right)$$

$$\frac{\partial C}{\partial z_j^L} = \frac{1}{n} \sum_x \left( a_j^L - y_j \right)$$
Cross Entropy
In vector form (per training example), the gradient with respect to the output-layer net inputs is simply:

$$\frac{\partial C}{\partial z^L} = a^L - y$$
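A small sketch contrasting the two output-error expressions on made-up activations (the numbers are purely illustrative):

```python
import numpy as np

a = np.array([0.98, 0.02, 0.95])          # hypothetical output activations
y = np.array([0.0,  0.0,  1.0])           # targets

delta_cross_entropy = a - y               # cross entropy + sigmoid output
delta_quadratic = (a - y) * a * (1 - a)   # quadratic cost + sigmoid output

print(delta_cross_entropy)                # [ 0.98  0.02 -0.05]
print(delta_quadratic)                    # [ 0.019208  0.000392 -0.002375]
```

The cross-entropy error stays proportional to how wrong the output is, while the quadratic-cost error is crushed by the a(1 − a) factor on saturated outputs.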
(more on)Cross Entropy
Where did the function come from?! (it looks very complicated at first sight)
Let's suppose that in our dataset we have $k_j$ elements for each class $j$. According to the model, the likelihood of this happening is:

$$P(\text{data} \mid \text{model}) = a_1^{k_1} \, a_2^{k_2} \cdots a_m^{k_m}$$
(more on)Cross Entropy
Obviously, we want to increase this probability. Maximizing $P$ is the same as maximizing $\ln P = \sum_j k_j \ln a_j$, and since we are used to minimizing a cost function, we minimize the same quantity with the opposite sign (−).

If we divide by the number of elements in the dataset ($n$), then $\frac{k_j}{n}$ becomes the true probability of the elements of each class.

Since the output vector ($y$) is one-hot (only one of its elements has value 1, the others have value 0; e.g. for digits: 0, 0, 0, 1, 0, 0, 0, 0, 0, 0),

$$k_j = \sum_x y_j$$
If the number of possible classes is just 2, we can use a single output. The above formula becomes:

$$C = -\frac{1}{n} \sum_x \left[ y \ln a + (1 - y) \ln(1 - a) \right]$$
(more on)Cross Entropy
Another cost function, often used when doing online training, is

$$C = -\ln(a_j)$$

where $j$ is the index of the correct label. Of course, this is still the cross entropy, but in a simplified form that takes into account the fact that $y_j = 1$ for the right label and 0 for the rest (one-hot).
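A one-line sketch of this per-example cost (the activation vector below is made up; label is the index of the correct class):

```python
import numpy as np

def nll_of_correct_class(a, label):
    """Online cross entropy with a one-hot target: C = -ln(a_label)."""
    return -np.log(a[label])

print(nll_of_correct_class(np.array([0.1, 0.7, 0.2]), 1))   # ~0.357
```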
Softmax
When we classified the MNIST digits we did not treat the outputs as probabilities, yet we used cross entropy, which works with probabilities.
The softmax activation fixes this: each output of the final layer becomes a probability computed from the net inputs z. More exactly,
$$a_j^L = \frac{e^{z_j^L}}{\sum_k e^{z_k^L}}$$
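A minimal NumPy sketch of the softmax (subtracting max(z) is a common numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(z):
    """Turn the vector of net inputs z into a probability distribution."""
    e = np.exp(z - np.max(z))    # shift by max(z) to avoid overflow
    return e / np.sum(e)

p = softmax(np.array([1.0, 2.0, 3.0]))
print(p, p.sum())                # ~[0.09 0.245 0.665], sums to 1.0
```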
Softmax
What does $\frac{\partial C}{\partial z_j^L}$ look like?

$$a_j^L = \frac{e^{z_j^L}}{\sum_k e^{z_k^L}}, \qquad C = -\frac{1}{n} \sum_x \sum_j y_j \ln\!\left(a_j^L\right)$$
Softmax
$$\frac{\partial a_i^L}{\partial z_j^L} = \frac{\partial}{\partial z_j^L}\!\left( \frac{e^{z_i^L}}{\sum_k e^{z_k^L}} \right) = \frac{\left(e^{z_i^L}\right)' \sum_k e^{z_k^L} - e^{z_i^L} \left( \sum_k e^{z_k^L} \right)'}{\left( \sum_k e^{z_k^L} \right)^2}$$

$$\text{if } i = j: \quad \frac{\partial a_i^L}{\partial z_j^L} = \frac{e^{z_j^L} \sum_k e^{z_k^L} - \left(e^{z_j^L}\right)^2}{\left( \sum_k e^{z_k^L} \right)^2} = \frac{e^{z_j^L}}{\sum_k e^{z_k^L}} - \left( \frac{e^{z_j^L}}{\sum_k e^{z_k^L}} \right)^{\!2} = a_j - a_j^2 = a_j\,(1 - a_j)$$

$$\text{if } i \neq j: \quad \frac{\partial a_i^L}{\partial z_j^L} = \frac{-\,e^{z_i^L} e^{z_j^L}}{\left( \sum_k e^{z_k^L} \right)^2} = -\,a_i\, a_j$$
Softmax
$$\frac{\partial C}{\partial z_j^L} = \sum_i \frac{\partial C}{\partial a_i^L} \cdot \frac{\partial a_i^L}{\partial z_j^L} = -\frac{1}{n} \sum_x \sum_i \frac{y_i}{a_i} \cdot \frac{\partial a_i^L}{\partial z_j^L} = -\frac{1}{n} \sum_x \left[ \frac{y_j}{a_j}\, a_j (1 - a_j) + \sum_{i \neq j} \frac{y_i}{a_i} \left(-a_i a_j\right) \right] =$$

$$= -\frac{1}{n} \sum_x \left[ y_j - y_j a_j - \sum_{i \neq j} y_i a_j \right] = -\frac{1}{n} \sum_x \left[ y_j - a_j \left( y_j + \sum_{i \neq j} y_i \right) \right]$$

Since $y$ is one-hot, $y_j + \sum_{i \neq j} y_i = 1$, so

$$\frac{\partial C}{\partial z_j^L} = -\frac{1}{n} \sum_x \left( y_j - a_j \right)$$
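This result can be sanity-checked numerically with finite differences on one made-up example (a sketch, not course material):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cost(z, y):
    return -np.sum(y * np.log(softmax(z)))    # per-example cross entropy

z = np.array([1.5, -0.3, 0.8, 2.1])
y = np.array([0.0, 0.0, 1.0, 0.0])            # one-hot target

analytic = softmax(z) - y                     # the derived gradient: a - y
numeric = np.zeros_like(z)
eps = 1e-6
for j in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[j] += eps
    zm[j] -= eps
    numeric[j] = (cost(zp, y) - cost(zm, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))     # tiny (~1e-10): the formulas agree
```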
Softmax
In order to use the softmax function, the only thing that must be modified, in addition to using the cross-entropy cost, is the activation function in the output layer:

$$a_j^L = \frac{e^{z_j^L}}{\sum_k e^{z_k^L}}$$
In fact, the reason why the previous version of the network works (the one that does not use probabilities in the output layer) is that it has the same output-layer gradient as cross entropy + softmax.
Weight Initialization
Until now we have been initializing the weights from a standard normal distribution (a normal distribution with mean 0 and σ = 1).
That means that 68% of the weights have values in the interval [-1, 1], 95% in [-2, 2], and 99.7% in [-3, 3].
Weight Initialization
The problem with this kind of values appears when we compute the net input $z = \sum_j w_j x_j + b$.
Let's consider a neuron with 1000 inputs, half of which are 0 and the other half are 1.

[Diagram: a neuron with inputs $x_1 \dots x_{500} = 1$ and $x_{501} \dots x_{1000} = 0$, weights $w_1 \dots w_{1000}$, bias $b$, and net input $z$]
Weight Initialization
$$\mu_z = \sum_{i=1}^{500} \mu_{w_i} + \mu_b = 0$$

$$\mathrm{var}(z) = \sum_{i=1}^{500} \mathrm{var}(w_i) + \mathrm{var}(b) = 501$$

So, in this case, z is a variable with a normal distribution with mean 0 and a standard deviation of $\sqrt{501} \approx 22.4$.
Weight Initialization
That means that 95% of the z values will be in the interval [-44.8, 44.8] (two standard deviations).
That is a very big interval, since a neuron usually saturates for values of |z| greater than about 4.
Weight Initialization
The solution is to initialize the weights with values such that, when added up in the net input, they do not saturate the neuron.
Thus, all weights will be initialized with random values drawn from a normal distribution with mean 0 and a standard deviation of $\frac{1}{\sqrt{n_{in}}}$, where $n_{in}$ is the total number of connections that go into the neuron.
In our case, the standard deviation will be $\frac{1}{\sqrt{1000}}$.
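A quick simulation comparing the two schemes on the 1000-input example (a sketch; the bias is drawn from a standard normal in both cases, matching the setup above):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in = 1000
x = np.concatenate([np.ones(500), np.zeros(500)])   # 500 inputs at 1, 500 at 0
trials = 10_000

for label, sigma in [("std = 1", 1.0), ("std = 1/sqrt(n_in)", 1.0 / np.sqrt(n_in))]:
    w = rng.normal(0.0, sigma, size=(trials, n_in))
    b = rng.normal(0.0, 1.0, size=trials)
    z = w @ x + b
    print(f"{label}:  std of z ~ {z.std():.2f}")
# std = 1:  std of z ~ 22.4             (the neuron is almost always saturated)
# std = 1/sqrt(n_in):  std of z ~ 1.22  (the net input stays in the sensitive region)
```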
How to adjust hyper-parameters
Besides the weights, our network has some hyper-parameters that control how it learns:
Learning rate η
The mini-batch size
The number of epochs
The number of hidden neurons
The first, and probably the most difficult, step is to achieve any non-trivial learning: you must obtain results better than you would obtain by random selection.
In the case of the MNIST digits, this means you should obtain an accuracy greater than 10%.
All of the above steps are useful because they allow you to receive quick feedback from the network, which lets you test many values for the parameters.
Start by adjusting the learning rate until you see some learning happening.
How to adjust hyper-parameters
You should start with a value of the learning rate for which the training cost decreases in the first iterations.
Increase it by an order of magnitude (×10) at a time until the cost starts oscillating; this is the threshold.
You can then refine the estimate by slowly increasing the learning rate until the cost starts oscillating again (i.e. it gets close to the threshold). In fact, the final value should be a factor of two or so below the threshold.
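A sketch of this coarse search in Python, assuming a hypothetical helper train_for_a_while(eta) that trains briefly with learning rate eta and returns the sequence of training costs it observed (this helper is not part of the course code):

```python
def estimate_threshold(train_for_a_while, eta=0.001, factor=10.0, max_eta=100.0):
    """Increase eta by an order of magnitude until the cost starts oscillating."""
    while eta <= max_eta:
        costs = train_for_a_while(eta)
        oscillating = any(later > earlier for earlier, later in zip(costs, costs[1:]))
        if oscillating:
            return eta               # first rate at which the cost stops decreasing
        eta *= factor                # otherwise try a 10x larger rate
    return None

# A working learning rate would then be picked a factor of two or so below
# the returned threshold, refined by slowly increasing it toward the threshold.
```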
How to adjust hyper-parameters
At the beginning you should let the network train for a significant number of iterations, to avoid being misled when it reaches a plateau and only later continues learning again.
How to adjust hyper-parameters
Plot the validation accuracy against time (real elapsed time, not the number of epochs) and choose the value that achieves the fastest improvement.
Questions & Discussion
Bibliography
http://neuralnetworksanddeeplearning.com/
Chris Bishop, "Neural Networks for Pattern Recognition"
https://visualstudiomagazine.com/articles/2014/04/01/neural-network-cross-entropy-error.aspx
https://en.wikipedia.org/wiki/Standard_deviation