
Stochastic Gradient Descent, Hyper-parameter Tuning: Optimizing

Learning Efficiency

Course: Machine Learning in Healthcare


Supervisor: MD. MAHFUZ AHMED
Session: 2022-23 (M.Sc)

Department of Biomedical Engineering


Islamic University

November 22, 2024


Outline

1. Gradient Descent

2. Stochastic Gradient Descent

3. Hyper-parameter Tuning Techniques



Gradient Descent

Introduction to Gradient Descent


What is Gradient Descent?
Gradient Descent is a first-order iterative optimization algorithm used to minimize a
cost/loss function by repeatedly moving in the direction of steepest descent, i.e. along
the negative of the gradient. It is widely used in machine learning for adjusting model
parameters.

Figure 1 – Gradient Descent Optimization Example.



Gradient Descent

Steps in Gradient Descent


Steps:
Objective Function: Define the function that needs to be minimized. This function
represents the error or discrepancy between predicted and actual values in a machine
learning model.
Gradient Calculation: Compute the gradient of the objective function. The gradient
points in the direction of steepest ascent, so the parameters are moved in the
opposite direction to approach the minimum.
Learning Rate: Select an appropriate learning rate (α), a hyperparameter that
determines the size of steps taken during the optimization process. The learning
rate significantly influences how quickly the algorithm converges to the minimum.
Update Rule: Apply the update rule to iteratively adjust the model parameters in the
direction opposite to the gradient. This step is repeated until the algorithm reaches
the minimum of the objective function (a minimal sketch of these steps follows).
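
A minimal Python sketch of these four steps, assuming a one-dimensional objective f(x) = (x − 3)² (the function, starting point, and learning rate are illustrative choices, not part of the slides):

    # Gradient descent on f(x) = (x - 3)^2, whose minimum is at x = 3.
    def f(x):
        return (x - 3) ** 2          # 1. objective function

    def grad_f(x):
        return 2 * (x - 3)           # 2. gradient of the objective

    alpha = 0.1                      # 3. learning rate
    x = 0.0                          # initial parameter value
    for step in range(100):
        x = x - alpha * grad_f(x)    # 4. update rule: move against the gradient

    print(x)                         # approaches 3.0, the minimizer of f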
Gradient Descent

Cost/Loss Functions
Commonly Used Loss Functions (MSE, MAE, and cross-entropy are sketched in code after this list):
Mean Squared Error (MSE) / Mean Absolute Error (MAE): Primarily used in
regression tasks to measure the average squared or absolute differences between the
predicted and actual values.
Cross-Entropy Loss (Log Loss): Applied in binary and multi-class classification
tasks to evaluate the accuracy of predicted probabilities.
Hinge Loss: Frequently used for binary classification, especially with Support Vector
Machines (SVMs), to measure the error between predicted and actual class labels.
Huber Loss: A robust alternative to MSE, used in regression tasks to reduce
sensitivity to outliers by combining characteristics of MSE and MAE.
Softmax Cross-Entropy Loss: An extension of Cross-Entropy Loss for multi-class
classification tasks, commonly applied to models that output probability distributions
across multiple classes.
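
As an illustration, a small numpy sketch of a few of these losses (the toy arrays are made up for the example):

    import numpy as np

    y_true = np.array([1.0, 0.0, 1.0, 1.0])   # actual values / labels
    y_pred = np.array([0.9, 0.2, 0.7, 0.6])   # predicted values / probabilities

    mse = np.mean((y_true - y_pred) ** 2)      # Mean Squared Error (regression)
    mae = np.mean(np.abs(y_true - y_pred))     # Mean Absolute Error (regression)

    eps = 1e-12                                # guard against log(0)
    bce = -np.mean(y_true * np.log(y_pred + eps)
                   + (1 - y_true) * np.log(1 - y_pred + eps))  # binary cross-entropy

    print(mse, mae, bce)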
Gradient Descent

Gradient Descent Update Rule

General formula for updating parameters:


The Gradient Descent algorithm updates model parameters iteratively to minimize the
objective function.
The general formula for updating parameters is:
x_next = x − α ∇f(x)
Where:
x_next : The updated parameter value.
x : The current parameter value.
α : The learning rate.
∇f(x) : The gradient of the objective function at x.
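
As a worked example, take f(x) = x², so ∇f(x) = 2x. Starting from x = 1.0 with α = 0.1, one update gives x_next = 1.0 − 0.1 × (2 × 1.0) = 0.8, which moves the parameter closer to the minimum at x = 0.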



Gradient Descent

Learning Rate (α)


The learning rate is a crucial hyperparameter in Gradient Descent, determining the step
size during each iteration of the optimization process.
Impact of Learning Rate:
If Too Large: A high learning rate can cause the algorithm to overshoot the
minimum, leading to divergence or oscillation around the minimum without achieving
convergence.
If Too Small: A low learning rate can result in slow convergence to the minimum,
increasing the training time and risking stagnation in local minima.

Typical Range:
The learning rate typically falls within the range of 0.00001 to 0.1. Hyperparameter
tuning is essential to identify the optimal learning rate for specific training tasks.
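
A small sketch of this effect on f(x) = x², whose gradient is 2x (the specific rates are illustrative, not prescriptive):

    def run_gd(alpha, steps=20, x=1.0):
        # gradient descent on f(x) = x^2
        for _ in range(steps):
            x = x - alpha * 2 * x
        return x

    print(run_gd(0.01))   # too small: still far from 0 after 20 steps (slow convergence)
    print(run_gd(0.1))    # reasonable: close to the minimum at x = 0
    print(run_gd(1.5))    # too large: |x| grows every step (divergence)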
Gradient Descent

Types of Gradient Descent

The three main types of Gradient Descent (contrasted in the sketch after this list) are:


Batch Gradient Descent: Uses the entire dataset to compute the gradient.
Stochastic Gradient Descent (SGD): Uses one data point per iteration, making it
faster but noisier.
Mini-batch Gradient Descent: A middle ground that uses a small batch of data for
each iteration.
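
A sketch of how the gradient is formed in each variant, assuming a linear-regression MSE objective; the helper names and batch size are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)

    def per_sample_grads(theta, X, y):
        # per-sample gradients of the MSE objective, stacked as an (N, d) array
        return (X @ theta - y)[:, None] * X

    def batch_step(theta, X, y, alpha):
        return theta - alpha * per_sample_grads(theta, X, y).mean(axis=0)   # all N samples

    def sgd_step(theta, X, y, alpha):
        i = rng.integers(len(X))                                            # one random sample
        return theta - alpha * per_sample_grads(theta, X[i:i+1], y[i:i+1])[0]

    def minibatch_step(theta, X, y, alpha, m=16):
        idx = rng.choice(len(X), size=m, replace=False)                     # small random batch
        return theta - alpha * per_sample_grads(theta, X[idx], y[idx]).mean(axis=0)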



Stochastic Gradient Descent

Stochastic Gradient Descent

Definition
Stochastic Gradient Descent (SGD) is an optimization algorithm used to minimize
the objective function, typically in machine learning models, by iteratively adjusting
the model parameters (like weights) to reduce the error. It is a variant of the more
general Gradient Descent method but differs in how it updates the parameters.
In standard Gradient Descent, the algorithm calculates the gradient of the error for
the entire dataset at each iteration, which can be computationally expensive for large
datasets. SGD, on the other hand, updates the model parameters based on the
gradient of the error for one training sample (or a small batch) at a time. This makes
it faster and more efficient for large datasets but can introduce more noise into the
optimization process.
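
A minimal sketch of these one-sample-at-a-time updates for a linear model (the dataset, learning rate, and epoch count are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(42)
    X = rng.normal(size=(1000, 3))                 # toy dataset: 1000 samples, 3 features
    true_w = np.array([2.0, -1.0, 0.5])
    y = X @ true_w + 0.1 * rng.normal(size=1000)   # noisy linear targets

    w = np.zeros(3)                                # model parameters (weights)
    alpha = 0.01                                   # learning rate

    for epoch in range(5):
        for i in rng.permutation(len(X)):          # visit samples in random order
            error = X[i] @ w - y[i]                # prediction error on ONE sample
            w -= alpha * error * X[i]              # update from that single sample

    print(w)                                       # close to [2.0, -1.0, 0.5]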



Stochastic Gradient Descent

SGD Cont.

Advantages of SGD:
Faster computation for large datasets.
Allows the model to start learning immediately rather than waiting for the entire
dataset to be processed.
More frequent updates.

Disadvantages of SGD:
Due to the randomness, it can lead to fluctuations in the learning process and may
not converge as smoothly as batch Gradient Descent.
Requires careful tuning of learning rates and often needs more iterations to stabilize.



Stochastic Gradient Descent

Comparison between GD and SGD


Aspect: Gradient Descent (GD) vs. Stochastic Gradient Descent (SGD)

Gradient Computation: GD uses the entire dataset, ∇J(θ) = (1/N) Σ_{i=1}^{N} ∇J(θ; x_i, y_i); SGD uses a single sample, ∇J(θ; x_i, y_i).
Update Frequency: GD updates after processing the entire dataset (batch); SGD updates after processing each sample.
Computation Cost: GD is high (for large datasets); SGD is low (faster updates).
Convergence Stability: GD converges smoothly; SGD converges noisily (fluctuates around the minimum).
Memory Requirements: GD requires the entire dataset in memory; SGD processes one sample at a time.
Efficiency: GD is efficient for small datasets; SGD is efficient for large datasets.
Stochastic Gradient Descent

Hyper-parameters

What are hyper-parameters in machine learning?


Hyper-parameters are configuration variables set before the training process of a
machine learning model that govern how the model is trained. Unlike model
parameters (such as weights and biases), which are learned during training, hyper-
parameters are predefined and control aspects like the learning behaviour, model
structure, and the optimization process. They play a critical role in determining
the performance and efficiency of a model.
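
A small sketch of this distinction using scikit-learn's SGDClassifier (the values and dataset are illustrative; parameter names follow recent scikit-learn versions):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    # Hyper-parameters: chosen BEFORE training and held fixed during it.
    clf = SGDClassifier(loss="log_loss", alpha=1e-4,
                        learning_rate="optimal", max_iter=1000)
    clf.fit(X, y)

    # Model parameters: LEARNED from the data during training.
    print(clf.coef_, clf.intercept_)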



Stochastic Gradient Descent

Types of Hyper-parameters

1. Model-related Hyper-parameters:
These define the architecture of the model or control regularization techniques:
Number of layers in a neural network.
Number of neurons per layer.
Kernel size in a convolutional neural network.
Dropout rate to prevent overfitting.

2. Regularization Hyper-parameters:
These control model complexity and help prevent overfitting (a small sketch after this list illustrates both groups):
L2 regularization (weight decay): Prevents large weights in the model.
Dropout rate: Randomly drops neurons during training to prevent overfitting.
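
A hedged Keras sketch showing where these hyper-parameters appear in a model definition (assumes TensorFlow is available; the architecture itself is an arbitrary illustration):

    import tensorflow as tf
    from tensorflow.keras import layers, regularizers

    model = tf.keras.Sequential([
        layers.Conv2D(32, kernel_size=3, activation="relu",
                      input_shape=(28, 28, 1)),                   # kernel size
        layers.Flatten(),
        layers.Dense(64, activation="relu",                       # neurons per layer
                     kernel_regularizer=regularizers.l2(1e-4)),   # L2 weight decay
        layers.Dropout(0.5),                                      # dropout rate
        layers.Dense(10, activation="softmax"),                   # output layer
    ])
    model.summary()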



Stochastic Gradient Descent

Types of Hyper-parameters cont.

3. Optimization-related Hyper-parameters:
These influence the optimization process and how the model is trained (a small sketch follows this list):
Learning rate: Controls the step size in each iteration during optimization.
Batch size: Number of samples processed before the model is updated.
Number of epochs: The number of times the entire dataset passes through the model.
Momentum: Helps accelerate gradient vectors in the right direction to converge faster.
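
A sketch of where each of these appears in a plain mini-batch training loop with momentum (all values are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(512, 4))
    y = X @ np.array([1.0, 2.0, -1.0, 0.5])        # toy linear targets

    learning_rate = 0.05    # step size per update
    batch_size = 32         # samples per parameter update
    num_epochs = 10         # full passes over the dataset
    momentum = 0.9          # fraction of the previous update carried forward

    w = np.zeros(4)
    velocity = np.zeros(4)
    for epoch in range(num_epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            b = order[start:start + batch_size]
            grad = ((X[b] @ w - y[b])[:, None] * X[b]).mean(axis=0)   # mini-batch gradient
            velocity = momentum * velocity - learning_rate * grad     # momentum update
            w += velocity

    print(w)                # approaches [1.0, 2.0, -1.0, 0.5]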



Hyper-parameter Tuning Techniques

Hyper-parameter Tuning Techniques

How do we tune hyper-parameters effectively?


Common tuning techniques include (a small scikit-learn sketch follows this list):
Grid Search: Exhaustively evaluates every combination in a predefined grid of hyper-parameter values.
Random Search: Samples hyper-parameter values from specified ranges or distributions.
Bayesian Optimization: Builds a probabilistic model of the objective function to make informed
guesses about promising hyper-parameter settings.
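
A sketch of the first two techniques with scikit-learn (the estimator, parameter ranges, and cv value are illustrative choices):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

    X, y = make_classification(n_samples=300, n_features=10, random_state=0)

    param_grid = {
        "alpha": [1e-5, 1e-4, 1e-3],              # regularization strength
        "learning_rate": ["constant", "optimal"],
        "eta0": [0.01, 0.1],                      # initial learning rate
    }

    # Grid search: evaluates every combination in the grid.
    grid = GridSearchCV(SGDClassifier(max_iter=1000), param_grid, cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)

    # Random search: samples a fixed number of combinations from the same ranges.
    rand = RandomizedSearchCV(SGDClassifier(max_iter=1000), param_grid,
                              n_iter=5, cv=5, random_state=0)
    rand.fit(X, y)
    print(rand.best_params_, rand.best_score_)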

