
TRAINING NEURAL NETWORKS
Key Points
Cost function:
A cost function measures the performance of a model on given data. It quantifies the error between predicted values and expected values and expresses it as a single real number. After forming a hypothesis with initial parameters, we compute the cost function and then, with the goal of reducing it, modify the parameters using the gradient descent algorithm over the given data.
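As a concrete sketch, the snippet below computes such a cost, assuming a simple linear hypothesis y-pred = w · x and mean squared error as the cost (the same choices used in the example later in these slides); the function names are illustrative, not prescribed here.

```python
import numpy as np

def predict(w, x):
    # Simple linear hypothesis: y_pred = w * x
    return w * x

def mse_cost(w, x, y):
    # Mean squared error between predictions and expected values
    y_pred = predict(w, x)
    return np.mean((y - y_pred) ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(mse_cost(0.5, x, y))  # cost for an assumed initial guess w = 0.5
```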
Steps of Gradient Descent
● The algorithm starts with an initial set of parameters and updates them in small steps to minimize the cost
function.
● In each iteration of the algorithm, the gradient of the cost function with respect to each parameter is computed.
● The gradient points in the direction of steepest ascent, so by moving in the opposite direction we take a step of steepest descent.
● The size of the step is controlled by the learning rate, which determines how quickly the algorithm moves
towards the minimum.
● The process is repeated until the cost function converges to a minimum, indicating that the model has reached the optimal set of parameters (see the sketch after these steps).
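In symbols, each step applies w ← w − η · dJ/dw, where η is the learning rate and J is the cost. Below is a minimal batch gradient descent sketch for the linear model y-pred = w · x with an MSE cost; the initial weight, learning rate, and iteration count are assumptions chosen for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w = 0.0    # assumed initial parameter
lr = 0.05  # assumed learning rate
for step in range(50):
    y_pred = w * x
    # Gradient of MSE = mean((y - w*x)^2) with respect to w
    grad = np.mean(-2.0 * x * (y - y_pred))
    w -= lr * grad  # move against the gradient
print(w)  # converges toward w = 2 for this data
```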
Example

Fit the linear model y-pred = w · x to the data below. For each candidate weight w, the worked table records the prediction y-pred, the squared error (y − y-pred)², the resulting MSE, and the derivative of the MSE with respect to w (one pass is computed in the sketch below).

x    y
1    2
2    4
3    6
4    8
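The sketch below computes one pass of this table for an assumed current weight w = 1 (purely illustrative), producing each prediction, the squared error per sample, the MSE over the four samples, and the derivative of the MSE with respect to w.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
w = 1.0  # assumed current weight for this pass

y_pred = w * x                              # predictions for each x
sq_err = (y - y_pred) ** 2                  # per-sample (y - y_pred)^2 column
mse = np.mean(sq_err)                       # MSE over the four samples
dmse_dw = np.mean(-2.0 * x * (y - y_pred))  # derivative of MSE w.r.t. w

for xi, yi, pi, ei in zip(x, y, y_pred, sq_err):
    print(f"x={xi:.0f}  y={yi:.0f}  y_pred={pi:.1f}  (y - y_pred)^2={ei:.1f}")
print(f"MSE={mse:.2f}, dMSE/dw={dmse_dw:.2f}")  # here: MSE=7.50, dMSE/dw=-15.00
```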
Online Learning
The entire sample is not given to the network; instead, instances arrive one by one, and we would like the network to update its parameters after each instance, adapting itself slowly over time (a minimal sketch of this follows the advantages below).

Advantages

1. It saves us the cost of storing the training sample in an external memory and storing the
intermediate results during optimization.

2. The problem may be changing in time, which means that the sample distribution is not fixed,
and a training set cannot be chosen a priori. For example, we may be implementing a speech
recognition system that adapts itself to its user.

3. There may be physical changes in the system. For example, in a robotic system, the
components of the system may wear out, or sensors may degrade.
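A minimal sketch of this per-instance updating, reusing the linear model and squared-error loss from the earlier example; the simulated stream and the learning rate are illustrative assumptions.

```python
def online_update(w, x_i, y_i, lr=0.01):
    # Update the weight from a single incoming instance (x_i, y_i)
    # using the gradient of the squared error on that instance alone.
    y_pred = w * x_i
    grad = -2.0 * x_i * (y_i - y_pred)
    return w - lr * grad

w = 0.0
stream = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # simulated data stream
for x_i, y_i in stream:
    w = online_update(w, x_i, y_i)  # adapt after every instance, nothing is stored
print(w)
```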
Stochastic Gradient Descent
Ways of Updating Weights

For each input, the actual output is compared with the predicted output, and an update to the weights is computed from that single example.
Key points
● The process of incrementally updating the weights is also called “stochastic” gradient descent
since it approximates the minimization of the cost function.
● Although stochastic gradient descent might sound inferior to gradient descent due to its “stochastic” nature and the “approximated” direction (gradient), it can have certain advantages in practice.
● Often, stochastic gradient descent converges much faster than gradient descent since the updates are applied immediately after each training sample; stochastic gradient descent is also computationally more efficient, especially for very large datasets (see the sketch after these points).
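A sketch of this per-sample updating over several passes through the toy data; the shuffling, learning rate, and epoch count are illustrative assumptions rather than anything prescribed here.

```python
import random

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w, lr = 0.0, 0.05

for epoch in range(20):
    random.shuffle(data)  # visit samples in a random order each pass
    for x_i, y_i in data:
        grad = -2.0 * x_i * (y_i - w * x_i)  # gradient on this one sample
        w -= lr * grad                       # update immediately, before seeing the rest
print(w)  # approaches w = 2
```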
Advantages:
Speed: SGD is faster than other variants of Gradient Descent such as Batch Gradient Descent and Mini-Batch
Gradient Descent since it uses only one example to update the parameters.
Memory Efficiency: Since SGD updates the parameters for each training example one at a time, it is
memory-efficient and can handle large datasets that cannot fit into memory.
Avoidance of Local Minima: Because the updates in SGD are noisy, it can escape shallow local minima, which may help it reach a better (possibly global) minimum.
Disadvantages:
Noisy updates: The updates in SGD are noisy and have a high variance, which can make the optimization process
less stable and lead to oscillations around the minimum.
Slow Convergence: SGD may require more iterations to converge to the minimum since it updates the
parameters for each training example one at a time.
Sensitivity to Learning Rate: The choice of learning rate is critical in SGD: a high learning rate can cause the algorithm to overshoot the minimum, while a low learning rate makes convergence slow (both effects are illustrated in the sketch below).
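To make the learning-rate sensitivity concrete, the sketch below runs the same single-sample update with two assumed learning rates on the toy data; the specific values are chosen only to exhibit overshooting versus slow progress.

```python
def run_sgd(lr, epochs=5):
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
    w = 0.0
    for _ in range(epochs):
        for x_i, y_i in data:
            w -= lr * (-2.0 * x_i * (y_i - w * x_i))  # per-sample update
    return w

print(run_sgd(lr=0.2))    # too high: the weight overshoots and grows far past 2
print(run_sgd(lr=0.001))  # too low: after 5 epochs w is still well below 2
```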
