Gradient Descent - A Quick, Simple Introduction - Built in



Gradient Descent: An Introduction to 1 of Machine Learning’s Most Popular Algorithms

Take a high-level look into gradient descent, one of ML's most popular
algorithms.

Niklas Donges
July 23, 2021
Updated: August 1, 2021

combined with every algorithm and is easy to understand and implement.
Everyone working with machine learning should understand its concept.
We'll walk through how gradient descent works, what types of it are used
today, and its advantages and tradeoffs.

Table of Contents

Introduction
What is a gradient?
How gradient descent works
Learning rate
How to make sure it works properly
Types of gradient descent: batch, stochastic, mini-batch

Introduction to Gradient Descent

Gradient descent is an optimization algorithm that's used when training
a machine learning model. It iteratively tweaks the model's parameters
to move a given function toward a local minimum. When that function is
convex, the local minimum is also the global minimum.

WHAT IS GRADIENT DESCENT?

Gradient descent is an optimization algorithm for finding a local minimum
of a differentiable function. In machine learning, it is used to find the
values of a function's parameters (coefficients) that minimize a cost
function as far as possible.
You start by defining the initial parameter values, and from there
gradient descent uses calculus to iteratively adjust the values so they
minimize the given cost function. To understand this concept fully, it's
important to know about gradients.

What is a Gradient?

"A gradient measures how much the output of a function changes if you
change the inputs a little bit." - Lex Fridman (MIT)

A gradient measures how much the error changes in response to a small
change in each of the weights. You can also think of a gradient as the
slope of a function. The higher the gradient, the steeper the slope and
the faster a model can learn. But if the slope is zero, the model stops
learning. In mathematical terms, the gradient is the vector of partial
derivatives of a function with respect to its inputs.
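
As a rough illustration of the gradient as a collection of partial derivatives, the sketch below approximates each partial derivative numerically with central differences. The function `bowl`, the helper `numerical_gradient` and the step size `eps` are assumptions made for this example and do not come from the article.

```python
def numerical_gradient(f, point, eps=1e-6):
    """Approximate the gradient of f at `point` using central differences."""
    gradient = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += eps   # nudge only the i-th input up a little
        backward[i] -= eps  # and down a little
        gradient.append((f(forward) - f(backward)) / (2 * eps))
    return gradient


def bowl(p):
    # Hypothetical example function f(x, y) = x^2 + 3 * y^2.
    x, y = p
    return x ** 2 + 3 * y ** 2


# The analytic gradient of this function is (2x, 6y).
gradient_at_point = numerical_gradient(bowl, [1.0, 2.0])
print(gradient_at_point)
```

Each entry answers the question from the quote: how much does the output change if only that one input changes a little bit?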

WHAT IS A GRADIENT?

In machine learning, a gradient is the derivative of a function that has
more than one input variable. It generalizes the slope of a function: the
gradient measures the change in error with regard to a change in each of
the weights.
Imagine a blindfolded man who wants to climb to the top of a hill in as
few steps as possible. He might start climbing the hill by taking really
big steps in the steepest direction, which he can do as long as he is not
close to the top. As he comes closer to the top, however, his steps will
get smaller and smaller to avoid overshooting it. This process can be
described mathematically using the gradient.

Imagine the image below illustrates our hill from a top-down view
and the red arrows are the steps of our climber. Think of a gradient in
this context as a vector that contains the direction of the steepest
step the blindfolded man can take and also how long that step should
be.
Note that the gradient ranging from X0 to X1 is much longer than the one
reaching from X3 to X4. This is because the hill is less steep between X3
and X4, and the steepness (slope) of the hill determines the length of
the vector. This matches the example of the hill, which gets less steep
the higher it's climbed. A reduced gradient therefore goes along with a
reduced slope and a reduced step size for the hill climber.
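
The shrinking steps can be sketched with a one-dimensional hill. The hill shape `hill_height`, the starting position and the factor `step_scale` are all hypothetical choices for this illustration.

```python
def hill_height(x):
    # Hypothetical one-dimensional hill with its top at x = 0.
    return -x ** 2


def hill_slope(x):
    # Derivative of hill_height: steep far from the top, flat at the top.
    return -2 * x


position = -4.0
step_scale = 0.25  # assumed factor that turns the slope into a step length
step_lengths = []
for _ in range(4):
    step = step_scale * hill_slope(position)
    step_lengths.append(abs(step))
    position += step  # climbing: move in the direction of the slope

print(step_lengths)
print(position, hill_height(position))
```

Every step is shorter than the one before it because the slope gets smaller as the climber approaches the top.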

How Gradient Descent works

Instead of climbing up a hill, think of gradient descent as hiking down
to the bottom of a valley. This is a better analogy because gradient
descent is a minimization algorithm: it minimizes a given function.

The equation below describes what gradient descent does: b is the next
position of our climber, while a represents his current position. The
minus sign refers to the minimization part of gradient descent. The gamma
in the middle is a weighting factor, and the gradient term ∇f(a) points
in the direction of steepest ascent, so subtracting it moves us in the
direction of steepest descent.

$$b = a - \gamma \nabla f(a)$$
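
A single application of the update rule described above, b = a - γ∇f(a), might look like the following minimal sketch. The function f(x, y) = x² + y², the starting point and the value of `gamma` are assumptions chosen only to make the arithmetic easy to follow.

```python
def gradient_descent_step(a, grad_f, gamma):
    """Return the next position b = a - gamma * grad_f(a)."""
    gradient = grad_f(a)
    return [a_i - gamma * g_i for a_i, g_i in zip(a, gradient)]


def grad_f(p):
    # Gradient of the assumed function f(x, y) = x^2 + y^2.
    return [2 * p[0], 2 * p[1]]


a = [3.0, -4.0]                                  # current position
b = gradient_descent_step(a, grad_f, gamma=0.1)  # next position
print(b)
```

Here the gradient at `a` is (6, -8), so the step moves each coordinate by 0.1 times that amount in the opposite direction, toward the minimum at the origin.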

So this formula basically tells us the next position we need to go to,
which lies in the direction of the steepest descent. Let's look at
another example to really drive the concept home.

Imagine you have a machine learning problem and want to train your
algorithm with gradient descent to minimize your cost function J(w, b)
and reach its local minimum by tweaking its parameters (w and b). In the
image below, the horizontal axes represent the parameters (w and b),
while the cost function J(w, b) is represented on the vertical axis. The
cost function shown here is convex.

We know we want to find the values of w and b that correspond to the
minimum of the cost function (marked with the red arrow). To start
finding the right values, we initialize w and b with some random numbers.
Gradient descent then starts at that point (somewhere around the top of
our illustration) and takes one step after another in the steepest
downhill direction (i.e., from the top to the bottom of the
illustration) until it reaches the point where the cost function is as
small as possible.
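
The procedure just described (random initialization, then repeated downhill steps) can be sketched as follows. The cost `J`, its minimum at w = 2, b = -1, the learning rate and the number of steps are hypothetical; the article does not specify a concrete cost function.

```python
import random


def J(w, b):
    # Hypothetical convex cost with its minimum at w = 2, b = -1.
    return (w - 2.0) ** 2 + 2.0 * (b + 1.0) ** 2


def grad_J(w, b):
    # Partial derivatives of J with respect to w and b.
    return 2.0 * (w - 2.0), 4.0 * (b + 1.0)


rng = random.Random(42)                        # fixed seed for repeatability
w, b = rng.uniform(-5, 5), rng.uniform(-5, 5)  # random starting point
start_cost = J(w, b)

learning_rate = 0.1
for _ in range(200):
    dw, db = grad_J(w, b)
    w -= learning_rate * dw  # step downhill along w
    b -= learning_rate * db  # step downhill along b

end_cost = J(w, b)
print(w, b, start_cost, end_cost)
```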

Importance of the Learning Rate

How big the steps are that gradient descent takes in the direction of
the local minimum is determined by the learning rate, which controls how
fast or slow we move towards the optimal weights.
For gradient descent to reach the local minimum, we must set the
learning rate to an appropriate value, which is neither too low nor too
high. This is important because if the steps it takes are too big, it may
not reach the local minimum because it bounces back and forth across the
valley of the convex cost function (see left image below). If we set the
learning rate to a very small value, gradient descent will eventually
reach the local minimum, but that may take a while (see the right
image).
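
To make the effect of the learning rate concrete, here is a small sketch that runs the same number of steps with three different rates. The cost f(x) = x², the starting point and the three rates are assumptions picked for illustration.

```python
def run_gradient_descent(learning_rate, start=5.0, steps=50):
    """Minimize the assumed cost f(x) = x^2 and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x  # derivative of x^2
        x = x - learning_rate * gradient
    return x


too_small = run_gradient_descent(0.001)  # creeps toward the minimum at 0
reasonable = run_gradient_descent(0.1)   # gets very close to 0
too_large = run_gradient_descent(1.1)    # overshoots further on every step

print(too_small, reasonable, too_large)
```

With the largest rate, every step jumps across the minimum and lands further away than before, which is the bouncing behavior described above.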

So, the learning rate should be neither too high nor too low. You can
check whether your learning rate is doing well by plotting the cost
function on a graph as the optimization runs.

How to make sure it works properly

A good way to make sure gradient descent runs properly is by plotting
the cost function as the optimization runs. Put the number of iterations
on the x-axis and the value of the cost function on the y-axis. This
helps you see the value of your cost function after each iteration of
gradient descent, and provides a way to easily spot how appropriate your
learning rate is. You can just try different values for it and plot them
all together. The left image below shows such a plot, while the image on
the right illustrates the difference between good and bad learning
rates.

If gradient descent is working properly, the cost function should
decrease after every iteration.
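
One way to check this in practice is to record the cost at every iteration. The sketch below does that for an assumed cost (w - 3)² with an assumed learning rate; the names `cost_history` and `decreasing` are illustrative.

```python
def cost(w):
    # Hypothetical cost with its minimum at w = 3.
    return (w - 3.0) ** 2


def cost_gradient(w):
    return 2.0 * (w - 3.0)


w = 0.0
learning_rate = 0.1
cost_history = []
for iteration in range(30):
    cost_history.append(cost(w))  # y-axis value for this iteration
    w -= learning_rate * cost_gradient(w)

# True when every recorded cost is lower than the one before it.
decreasing = all(
    later < earlier
    for earlier, later in zip(cost_history, cost_history[1:])
)

for iteration in range(0, 30, 5):
    print(iteration, cost_history[iteration])
print("cost decreased every iteration:", decreasing)
```

The list index is the iteration number (x-axis) and the stored value is the cost (y-axis), so `cost_history` can be handed directly to a plotting library, for example matplotlib's `plt.plot(cost_history)`.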

When gradient descent can’t decrease the cost function anymore and
remains more or less on the same level, it has converged. The number of
iterations gradient descent needs to converge can sometimes vary a lot.
It can take 50 iterations, 60,000 or maybe even 3 million, making the
number of iterations to convergence hard to estimate in advance.

There are some algorithms that can automatically tell you if gradient
descent has converged, but you must define a threshold for the
convergence beforehand, which is also pretty hard to estimate. For
this reason, simple plots are the preferred convergence test.
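
For comparison, an automatic convergence test of the kind mentioned above could be sketched like this: stop once the cost changes by less than a chosen threshold between iterations. The cost function, the `tolerance` value and the iteration cap are assumptions, and picking a sensible threshold is exactly the hard part the text points out.

```python
def cost(w):
    # Hypothetical cost with its minimum at w = 3.
    return (w - 3.0) * (w - 3.0)


def minimize_until_converged(learning_rate=0.1, tolerance=1e-9,
                             max_iterations=10_000):
    """Run gradient descent until the cost changes by less than tolerance."""
    w = 0.0
    previous_cost = cost(w)
    for iteration in range(1, max_iterations + 1):
        w -= learning_rate * 2.0 * (w - 3.0)
        current_cost = cost(w)
        if abs(previous_cost - current_cost) < tolerance:
            return w, iteration, True   # converged within the threshold
        previous_cost = current_cost
    return w, max_iterations, False     # gave up before converging


final_w, iterations_used, converged = minimize_until_converged()
print(final_w, iterations_used, converged)
```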

Another advantage of monitoring gradient descent via plots is that it
allows us to easily spot if it doesn’t work properly, for example if the
cost function is increasing. Most of the time, the reason for an
increasing cost function when using gradient descent is a learning rate
that's too high.

If the plot shows the learning curve just going up and down, without
really reaching a lower point, try decreasing the learning rate. Also,
when starting out with gradient descent on a given problem, simply try
0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, etc., as the learning rates and
look at which one performs the best.
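
Such a sweep over candidate learning rates might be sketched as follows. The cost (w - 3)², the starting point and the fixed budget of 100 steps are assumptions, and on a different problem a different rate could come out ahead.

```python
def final_cost(learning_rate, steps=100):
    """Run gradient descent on the assumed cost (w - 3)^2, return final cost."""
    w = 0.0
    for _ in range(steps):
        w -= learning_rate * 2.0 * (w - 3.0)
    return (w - 3.0) ** 2


candidate_rates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
results = {rate: final_cost(rate) for rate in candidate_rates}
best_rate = min(results, key=results.get)  # rate with the lowest final cost

for rate in candidate_rates:
    print(rate, results[rate])
print("best:", best_rate)
```

In this toy setup the smallest rates have barely moved after 100 steps, while a rate of 1 keeps jumping between two points without making progress.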


Types of Gradient Descent

There are three popular types of gradient descent that mainly differ
in the amount of data they use:

BATCH GRADIENT DESCENT

Batch gradient descent, also called vanilla gradient descent, calculates
the error for each example within the training dataset, but the model
gets updated only after all training examples have been evaluated. This
whole process is like a cycle and it's called a training epoch.
Some advantages of batch gradient descent are that it's computationally
efficient and that it produces a stable error gradient and a stable
convergence. One disadvantage is that the stable error gradient can
sometimes result in a state of convergence that isn’t the best the model
can achieve. It also requires the entire training dataset to be in memory
and available to the algorithm.
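
A minimal sketch of batch gradient descent is shown below, assuming a tiny made-up dataset that follows y = 2x + 1, a linear model y = w * x + b and a mean squared error cost. None of these choices come from the article. The point to notice is that the error is accumulated over every example and the parameters change exactly once per epoch.

```python
# Hypothetical training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0
learning_rate = 0.05
epochs = 2000
updates = 0

for epoch in range(epochs):
    grad_w, grad_b = 0.0, 0.0
    # Evaluate the error for every training example first...
    for x, y in zip(xs, ys):
        error = (w * x + b) - y
        grad_w += 2.0 * error * x / len(xs)
        grad_b += 2.0 * error / len(xs)
    # ...and only then update the model, once per epoch.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    updates += 1

print(w, b, updates)
```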

STOCHASTIC GRADIENT DESCENT

By contrast, stochastic gradient descent (SGD) updates the parameters
for each training example within the dataset, one by one. Depending on
the problem, this can make SGD faster than batch gradient descent. One
advantage is that the frequent updates give us a pretty detailed view of
the rate of improvement.

The frequent updates, however, are more computationally expensive than
the batch gradient descent approach. Additionally, the frequency of those
updates can result in noisy gradients, which may cause the error rate to
jump around instead of slowly decreasing.
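
A stochastic variant can be sketched as follows, again on a made-up dataset that follows y = 2x + 1 with a linear model y = w * x + b. The dataset, the learning rate, the seed and the per-epoch shuffling are assumptions for this example. The parameters now change after every single training example.

```python
import random

# Hypothetical training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

rng = random.Random(0)  # fixed seed so the run is repeatable
w, b = 0.0, 0.0
learning_rate = 0.02
epochs = 500
updates = 0
indices = list(range(len(xs)))

for epoch in range(epochs):
    rng.shuffle(indices)  # visit the examples in a new order each epoch
    for i in indices:
        error = (w * xs[i] + b) - ys[i]
        # Update immediately, using only this one example.
        w -= learning_rate * 2.0 * error * xs[i]
        b -= learning_rate * 2.0 * error
        updates += 1

print(w, b, updates)
```

Each update follows the gradient of a single example rather than of the whole dataset, which is where the noise in the error curve comes from.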

MINI-BATCH GRADIENT DESCENT

Mini-batch gradient descent is the go-to method since it’s a combination
of the concepts of SGD and batch gradient descent. It simply splits the
training dataset into small batches and performs an update for each of
those batches. This creates a balance between the robustness of
stochastic gradient descent and the efficiency of batch gradient
descent.

Common mini-batch sizes range between 50 and 256, but like any other
machine learning technique, there is no clear rule because it varies for
different applications. This is the go-to algorithm when training a
neural network and it is the most common type of gradient descent within
deep learning.
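
The batching idea can be sketched as below. To keep the example small it uses a made-up dataset of eight points and a batch size of 4, far below the sizes mentioned above. The helper `mini_batches`, the linear model and all hyperparameters are assumptions for illustration.

```python
import random

# Hypothetical training data generated from y = 2x + 1.
xs = [0.5 * i for i in range(8)]
ys = [2.0 * x + 1.0 for x in xs]


def mini_batches(indices, batch_size):
    """Yield consecutive slices of `indices` of length batch_size."""
    for start in range(0, len(indices), batch_size):
        yield indices[start:start + batch_size]


rng = random.Random(0)  # fixed seed so the run is repeatable
w, b = 0.0, 0.0
learning_rate = 0.02
batch_size = 4
epochs = 1000
updates = 0
indices = list(range(len(xs)))

for epoch in range(epochs):
    rng.shuffle(indices)  # new random split into batches each epoch
    for batch in mini_batches(indices, batch_size):
        # Average the gradient over the examples in this batch only.
        grad_w = sum(2.0 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
        grad_b = sum(2.0 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
        w -= learning_rate * grad_w  # one update per mini-batch
        b -= learning_rate * grad_b
        updates += 1

print(w, b, updates)
```

Setting `batch_size` to 1 would turn this loop into stochastic gradient descent, and setting it to the dataset size would turn it into batch gradient descent.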


Niklas Donges is an entrepreneur, technical writer and AI expert. He
worked on an AI team of SAP for 1.5 years, after which he founded Markov
Solutions. The Berlin-based company specializes in artificial
intelligence, machine learning and deep learning, offering customized
AI-powered software solutions and consulting programs to various
companies.

