Neural Networks for Machine Learning
Lecture 6a
Overview of mini-batch gradient descent
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky
Reminder: The error surface for a linear neuron
• The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error.
  – For a linear neuron with a squared error, it is a quadratic bowl.
  – Vertical cross-sections are parabolas.
  – Horizontal cross-sections are ellipses.
• For multi-layer, non-linear nets the error surface is much more complicated.
  – But locally, a piece of a quadratic bowl is usually a very good approximation.
[Figure: a quadratic bowl with a vertical error axis E and horizontal weight axes w1 and w2]
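To see why the surface is a bowl, write out the squared error of a single linear neuron over the training cases (this is the standard expression for the error the slide refers to):

$$E(\mathbf{w}) = \tfrac{1}{2}\sum_{n}\bigl(t_n - \mathbf{w}^{\top}\mathbf{x}_n\bigr)^{2}
= \text{const} \;-\; \Bigl(\sum_n t_n\mathbf{x}_n\Bigr)^{\!\top}\mathbf{w} \;+\; \tfrac{1}{2}\,\mathbf{w}^{\top}\Bigl(\sum_{n}\mathbf{x}_n\mathbf{x}_n^{\top}\Bigr)\mathbf{w}$$

E is quadratic in the weights, so fixing all weights but one gives a parabola, and slicing at a fixed error gives an ellipse whose shape is set by $\sum_n \mathbf{x}_n\mathbf{x}_n^{\top}$.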
Convergence speed of full batch learning when the error surface is a quadratic bowl
• Going downhill reduces the error, but the direction of steepest descent does not point at the minimum unless the ellipse is a circle.
  – The gradient is big in the direction in which we only want to travel a small distance.
  – The gradient is small in the direction in which we want to travel a large distance.
• Even for non-linear multi-layer nets, the error surface is locally quadratic, so the same speed issues apply.
How the learning goes wrong
• If the learning rate is big, the weights slosh to and fro across the ravine.
  – If the learning rate is too big, this oscillation diverges.
• What we would like to achieve:
  – Move quickly in directions with small but consistent gradients.
  – Move slowly in directions with big but inconsistent gradients.
[Figure: an elongated ravine in the error surface (axes E and w), with the weight trajectory oscillating across it]
Stochastic gradient descent
• If the dataset is highly redundant, the gradient on the first half is almost identical to the gradient on the second half.
  – So instead of computing the full gradient, update the weights using the gradient on the first half and then get a gradient for the new weights on the second half.
  – The extreme version of this approach updates the weights after each case. It's called "online" learning.
• Mini-batches are usually better than online.
  – Less computation is used updating the weights.
  – Computing the gradient for many cases simultaneously uses matrix-matrix multiplies, which are very efficient, especially on GPUs (see the sketch below).
• Mini-batches need to be balanced for classes.
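A rough numpy illustration of the matrix-matrix point above: the activities of a layer for a whole mini-batch can be computed with one multiply instead of a loop over cases. The layer and batch sizes here are made-up values for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_in, n_out = 100, 784, 500        # made-up sizes for illustration
X = rng.standard_normal((batch_size, n_in))    # one mini-batch of input cases
W = rng.standard_normal((n_in, n_out)) * 0.01  # weight matrix
b = np.zeros(n_out)

# One matrix-matrix multiply gives the activations for all cases at once.
# On a GPU this is far more efficient than looping over the cases one by one.
A = X @ W + b

# The equivalent per-case ("online") computation:
A_online = np.stack([x @ W + b for x in X])
assert np.allclose(A, A_online)
```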
Two types of learning algorithm
• If we use the full gradient computed from all the training cases, there are many clever ways to speed up learning (e.g. non-linear conjugate gradient).
  – The optimization community has studied the general problem of optimizing smooth non-linear functions for many years.
  – Multilayer neural nets are not typical of the problems they study, so their methods may need a lot of adaptation.
• For large neural networks with very large and highly redundant training sets, it is nearly always best to use mini-batch learning.
  – The mini-batches may need to be quite big when adapting fancy methods.
  – Big mini-batches are more computationally efficient.
A basic mini-batch gradient descent algorithm
• Guess an initial learning rate.
  – If the error keeps getting worse or oscillates wildly, reduce the learning rate.
  – If the error is falling fairly consistently but slowly, increase the learning rate.
• Write a simple program to automate this way of adjusting the learning rate (a sketch follows below).
• Towards the end of mini-batch learning it nearly always helps to turn down the learning rate.
  – This removes fluctuations in the final weights caused by the variations between mini-batches.
• Turn down the learning rate when the error stops decreasing.
  – Use the error on a separate validation set.
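A minimal sketch of the kind of simple program the slide has in mind, applied to a single global learning rate; the thresholds and adjustment factors are arbitrary illustrative choices, not values from the lecture.

```python
def adjust_learning_rate(lr, errors):
    """Crude heuristic for a global learning rate: shrink it when the error is
    getting worse or oscillating wildly, grow it when the error is falling
    consistently but slowly. Thresholds and factors are arbitrary choices."""
    if len(errors) < 3:
        return lr
    e0, e1, e2 = errors[-3:]
    getting_worse = e2 > e1
    oscillating = (e1 - e0) * (e2 - e1) < 0 and abs(e2 - e1) > 0.1 * abs(e1)
    slow_fall = e0 > e1 > e2 and (e0 - e2) < 1e-3 * abs(e0)
    if getting_worse or oscillating:
        return lr * 0.5
    if slow_fall:
        return lr * 1.1
    return lr
```

Towards the end of training one would simply turn the rate down (e.g. multiply it by a fixed factor) once the error on a separate validation set stops decreasing.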
Neural Networks for Machine Learning
Lecture 6b
A bag of tricks for mini-batch gradient descent
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky
Be careful about turning down the learning rate
• Turning down the learning rate reduces the random fluctuations in the error caused by the different gradients on different mini-batches.
  – So we get a quick win.
  – But then we get slower learning.
• Don't turn down the learning rate too soon!
[Figure: training error vs. epoch; the curve drops sharply at the point marked "reduce learning rate", then flattens out]
Initializing the weights
• If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights, they will always get exactly the same gradient.
  – So they can never learn to be different features.
  – We break symmetry by initializing the weights to have small random values.
• If a hidden unit has a big fan-in, small changes on many of its incoming weights can cause the learning to overshoot.
  – We generally want smaller incoming weights when the fan-in is big, so scale the initial weights by 1/sqrt(fan-in), as in the sketch below.
• We can also scale the learning rate the same way.
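A sketch of this initialization: small random weights whose scale shrinks with the square root of the fan-in. The base scale of 1.0 is an arbitrary choice.

```python
import numpy as np

def init_weights(fan_in, fan_out, scale=1.0, seed=0):
    """Break symmetry with small random values, scaled by 1/sqrt(fan_in) so
    that units with a big fan-in get proportionally smaller incoming weights."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((fan_in, fan_out)) * scale / np.sqrt(fan_in)
```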
Shifting the inputs
• When using steepest descent, shifting the input values makes a big difference.
  – It usually helps to transform each component of the input vector so that it has zero mean over the whole training set.
• The hyperbolic tangent (which is 2*logistic - 1) produces hidden activations that are roughly zero mean.
  – In this respect it's better than the logistic.
[Figure: a linear neuron with weights w1, w2; colour indicates the training case. The cases (101, 101) → 2 and (101, 99) → 0 give a very elongated elliptical error surface, while the shifted cases (1, 1) → 2 and (1, -1) → 0 give a circular one.]
Scaling the inputs
• When using steepest descent, scaling the input values makes a big difference.
  – It usually helps to transform each component of the input vector so that it has unit variance over the whole training set (see the sketch below).
[Figure: a linear neuron with weights w1, w2; colour indicates the weight axis. The cases (0.1, 10) → 2 and (0.1, -10) → 0 give a very elongated elliptical error surface, while the rescaled cases (1, 1) → 2 and (1, -1) → 0 give a circular one.]
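A combined sketch of the shifting and scaling recipes from the last two slides: give each input component zero mean and unit variance, using statistics computed on the training set only.

```python
import numpy as np

def standardize(X_train, X_test, eps=1e-8):
    """Shift each input component to zero mean and scale it to unit variance;
    the mean and standard deviation come from the training set only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + eps   # eps guards against constant components
    return (X_train - mean) / std, (X_test - mean) / std
```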
A more thorough method: Decorrelate the input components
• For a linear neuron, we get a big win by decorrelating each component of the
input from the other input components.
• There are several different ways to decorrelate inputs. A reasonable method is
to use Principal Components Analysis.
– Drop the principal components with the smallest eigenvalues.
• This achieves some dimensionality reduction.
– Divide the remaining principal components by the square roots of their
eigenvalues. For a linear neuron, this converts an axis-aligned elliptical
error surface into a circular one.
• For a circular error surface, the gradient points straight towards the minimum.
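A numpy sketch of this recipe: rotate onto the principal components, drop those with the smallest eigenvalues, and divide each remaining component by the square root of its eigenvalue. How many components to keep is left to the user.

```python
import numpy as np

def pca_whiten(X, n_components, eps=1e-8):
    """Decorrelate the input components with PCA, keep the leading components,
    and rescale each one by 1/sqrt(eigenvalue)."""
    Xc = X - X.mean(axis=0)                      # zero-mean the data first
    cov = Xc.T @ Xc / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    keep = np.argsort(eigvals)[::-1][:n_components]
    projected = Xc @ eigvecs[:, keep]            # decorrelated coordinates
    return projected / np.sqrt(eigvals[keep] + eps)
```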
Common problems that occur in multilayer networks
• If we start with a very big learning rate, the weights of each hidden unit will all become very big and positive or very big and negative.
  – The error derivatives for the hidden units will all become tiny and the error will not decrease.
  – This is usually a plateau, but people often mistake it for a local minimum.
• In classification networks that use a squared error or a cross-entropy error, the best guessing strategy is to make each output unit always produce an output equal to the proportion of the time it should be a 1.
  – The network finds this strategy quickly and may take a long time to improve on it by making use of the input.
  – This is another plateau that looks like a local minimum.
Four ways to speed up mini-batch learning
• Use momentum (Lecture 6c).
• Use a separate, adaptive learning rate for each connection (Lecture 6d).
• Use rmsprop: divide the gradient by a running average of its recent magnitude (Lecture 6e).
• Take a fancy full-batch method from the optimization literature (e.g. conjugate gradient, LBFGS) and adapt it to work for neural nets and mini-batches.
Neural Networks for Machine Learning
Lecture 6c
The momentum method
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky
The intuition behind the momentum method
[Figure: brown vector = jump, red vector = correction, green vector = accumulated gradient]
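The momentum equations themselves do not survive in this transcript, so here is a hedged sketch of both the standard update and the jump-then-correct variant that the figure caption refers to (the same jump/correction split discussed later under Nesterov momentum). The learning rate and momentum coefficient are placeholder values.

```python
def momentum_step(w, v, grad, lr=0.01, mu=0.9):
    """Standard momentum: the gradient changes the velocity,
    and the velocity changes the weights."""
    v = mu * v - lr * grad(w)
    return w + v, v

def nesterov_step(w, v, grad, lr=0.01, mu=0.9):
    """Jump-then-correct momentum: first jump in the direction of the
    accumulated gradient (velocity), then measure the gradient at the point
    you jumped to and apply a correction."""
    w_jump = w + mu * v                # brown vector: the jump
    correction = -lr * grad(w_jump)    # red vector: the correction
    v = mu * v + correction            # green vector: accumulated gradient
    return w + v, v
```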
Neural Networks for Machine Learning
Lecture 6d
A separate, adaptive learning rate for each connection
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky
The intuition behind separate adaptive learning rates
• In a multilayer net, the appropriate learning rates can vary widely between weights:
  – The magnitudes of the gradients are often very different for different layers, especially if the initial weights are small. Gradients can get very small in the early layers of very deep nets.
  – The fan-in of a unit determines the size of the "overshoot" effects caused by simultaneously changing many of the incoming weights of a unit to correct the same error. The fan-in often varies widely between layers.
• So use a global learning rate (set by hand) multiplied by an appropriate local gain that is determined empirically for each weight.
One way to determine the individual learning rates
• Start with a local gain $g_{ij}$ of 1 for every weight and use it to scale the update:
  $$\Delta w_{ij} = -\,\varepsilon\, g_{ij}\, \frac{\partial E}{\partial w_{ij}}$$
• Increase the local gain if the gradient for that weight does not change sign:
  $$\text{if } \left(\frac{\partial E}{\partial w_{ij}}(t)\,\frac{\partial E}{\partial w_{ij}}(t-1)\right) > 0
  \quad\text{then } g_{ij}(t) = g_{ij}(t-1) + 0.05
  \quad\text{else } g_{ij}(t) = g_{ij}(t-1) \times 0.95$$
• Use small additive increases and multiplicative decreases (for mini-batch learning; see the sketch below).
  – This ensures that big gains decay rapidly when oscillations start.
  – If the gradient is totally random, the gain will hover around 1: we increase by $+\delta$ half the time and decrease by a factor of $1-\delta$ half the time.
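A minimal numpy sketch of this gain scheme, using the additive increase of 0.05 and the multiplicative decrease factor of 0.95 from the slide; everything is elementwise over arrays of weights.

```python
import numpy as np

def adaptive_gain_step(w, gains, grad, prev_grad, lr=0.01,
                       delta=0.05, decay=0.95):
    """Per-weight gains: additive increase when the gradient keeps its sign,
    multiplicative decrease when it flips (elementwise on numpy arrays)."""
    same_sign = grad * prev_grad > 0
    gains = np.where(same_sign, gains + delta, gains * decay)
    w = w - lr * gains * grad        # delta_w = -epsilon * g_ij * dE/dw_ij
    return w, gains
```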
Tricks for making adaptive learning rates work better
• Limit the gains to lie in some reasonable range, e.g. [0.1, 10] or [0.01, 100].
• Use full batch learning or big mini-batches.
  – This ensures that changes in the sign of the gradient are not mainly due to the sampling error of a mini-batch.
• Adaptive learning rates can be combined with momentum.
  – Use the agreement in sign between the current gradient for a weight and the velocity for that weight (Jacobs, 1989).
• Adaptive learning rates only deal with axis-aligned effects.
  – Momentum does not care about the alignment of the axes.
Neural Networks for Machine Learning
Lecture 6e
rmsprop: Divide the gradient by a running average of its recent magnitude
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky
rprop: Using only the sign of the gradient
• The magnitude of the gradient can be very different for different weights and can change during learning.
  – This makes it hard to choose a single global learning rate.
• For full batch learning, we can deal with this variation by only using the sign of the gradient.
  – The weight updates are all of the same magnitude.
  – This escapes from plateaus with tiny gradients quickly.
• rprop combines the idea of only using the sign of the gradient with the idea of adapting the step size separately for each weight (see the sketch below).
  – Increase the step size for a weight multiplicatively (e.g. times 1.2) if the signs of its last two gradients agree.
  – Otherwise decrease the step size multiplicatively (e.g. times 0.5).
  – Limit the step sizes to be less than 50 and more than a millionth (Mike Shuster's advice).
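A sketch of full-batch rprop as described above, using the multiplicative factors 1.2 and 0.5 and the step-size limits (between one millionth and 50) mentioned in the slide.

```python
import numpy as np

def rprop_step(w, step, prev_grad, grad,
               up=1.2, down=0.5, step_min=1e-6, step_max=50.0):
    """Full-batch rprop: adapt a separate step size for each weight from the
    agreement in sign of its last two gradients, then move each weight by its
    step size in the direction opposite to the current gradient's sign."""
    same_sign = grad * prev_grad > 0
    step = np.clip(np.where(same_sign, step * up, step * down),
                   step_min, step_max)
    w = w - np.sign(grad) * step
    return w, step
```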
Why rprop does not work with mini-batches
• The idea behind stochastic gradient descent is that when the learning rate is small, it effectively averages the gradients over successive mini-batches.
  – Consider a weight that gets a gradient of +0.1 on nine mini-batches and a gradient of -0.9 on the tenth mini-batch.
  – We want this weight to stay roughly where it is.
• rprop would step the weight nine times in one direction and once in the other, each time by about the same amount (assuming any adaptation of the step sizes is small on this time-scale).
  – So the weight would move a long way (see the arithmetic below).
• Is there a way to combine:
  – the robustness of rprop,
  – the efficiency of mini-batches, and
  – the effective averaging of gradients over mini-batches?
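Writing out the arithmetic behind this example (nine gradients of +0.1 and one of -0.9):

$$\text{average gradient} \;=\; \tfrac{1}{10}\bigl(9 \times 0.1 + (-0.9)\bigr) \;=\; 0,$$

so small-learning-rate SGD leaves the weight essentially where it is, whereas rprop takes ten steps of roughly equal size $\Delta$, nine in one direction and one in the other, giving a net drift of about $9\Delta - \Delta = 8\Delta$.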
rmsprop: A mini-batch version of rprop
• rprop is equivalent to using the gradient, but also dividing by the size of the gradient.
  – The problem with mini-batch rprop is that we divide by a different number for each mini-batch. So why not force the number we divide by to be very similar for adjacent mini-batches?
• rmsprop: Keep a moving average of the squared gradient for each weight:
  $$\text{MeanSquare}(w, t) = 0.9\,\text{MeanSquare}(w, t-1) + 0.1\left(\frac{\partial E}{\partial w}(t)\right)^{2}$$
• Dividing the gradient by $\sqrt{\text{MeanSquare}(w, t)}$ makes the learning work much better (Tijmen Tieleman, unpublished).
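A minimal numpy sketch of this update, using the 0.9/0.1 averaging constants from the slide; the small eps added to the divisor is a common numerical safeguard rather than something from the lecture.

```python
import numpy as np

def rmsprop_step(w, mean_square, grad, lr=0.001, decay=0.9, eps=1e-8):
    """rmsprop: keep a moving average of the squared gradient for each weight
    and divide the gradient by the square root of that average."""
    mean_square = decay * mean_square + (1 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(mean_square) + eps)
    return w, mean_square
```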
Further developments of rmsprop
• Combining rmsprop with standard momentum
– Momentum does not help as much as it normally does. Needs more
investigation.
• Combining rmsprop with Nesterov momentum (Sutskever 2012)
– It works best if the RMS of the recent gradients is used to divide the correction
rather than the jump in the direction of accumulated corrections.
• Combining rmsprop with adaptive learning rates for each connection
– Needs more investigation.
• Other methods related to rmsprop
– Yann LeCun’s group has a fancy version in “No more pesky learning rates”
Summary of learning methods for neural networks
• For small datasets (e.g. 10,000 cases) or bigger datasets without much redundancy, use a full-batch method.
  – Conjugate gradient, LBFGS ...
  – Adaptive learning rates, rprop ...
• For big, redundant datasets use mini-batches.
  – Try gradient descent with momentum.
  – Try rmsprop (with momentum?).
  – Try LeCun's latest recipe.
• Why there is no simple recipe:
  – Neural nets differ a lot: very deep nets (especially ones with narrow bottlenecks), recurrent nets, wide shallow nets.
  – Tasks differ a lot: some require very accurate weights, some don't; some have many very rare cases (e.g. words).