Gradient Descent
Learning Efficiency
1. Gradient Descent
Cost/Loss Functions
Commonly Used Loss Functions:
Mean Squared Error (MSE) / Mean Absolute Error (MAE): Primarily used in
regression tasks to measure the average squared or absolute differences between the
predicted and actual values.
Cross-Entropy Loss (Log Loss): Applied in binary and multi-class classification
tasks to evaluate the accuracy of predicted probabilities.
Hinge Loss: Frequently used for binary classification, especially with Support Vector
Machines (SVMs), to penalize predictions that fall on the wrong side of, or inside, the decision margin.
Huber Loss: A robust alternative to MSE, used in regression tasks to reduce
sensitivity to outliers by combining characteristics of MSE and MAE.
Softmax Cross-Entropy Loss: An extension of Cross-Entropy Loss for multi-class
classification tasks, commonly applied to models that output probability distributions
across multiple classes.
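As an illustrative sketch (not taken from the slides), the NumPy functions below show how MSE, MAE, binary cross-entropy, and hinge loss could be computed on toy data; the function names, the toy values, and the clipping constant for numerical stability are assumptions made here.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average squared difference (regression)
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean Absolute Error: average absolute difference (regression)
    return np.mean(np.abs(y_true - y_pred))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # Cross-Entropy (log loss) for binary classification;
    # p_pred holds predicted probabilities of the positive class.
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def hinge_loss(y_true, scores):
    # Hinge loss for labels in {-1, +1}, as used with SVMs.
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

# Toy usage
y = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.2, 0.7])
print(mse(y, p), mae(y, p), binary_cross_entropy(y, p))
print(hinge_loss(np.array([1, -1, 1]), np.array([0.8, -0.5, 0.3])))
```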
Gradient Descent
Learning Rate: Typical Range
The learning rate typically falls within the range of 0.00001 to 0.1. Hyperparameter
tuning is essential to identify the optimal learning rate for specific training tasks.
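To make the role of the learning rate concrete, here is a minimal sketch, assuming a one-dimensional quadratic loss (the function and values are illustrative, not from the slides): the learning rate scales every parameter update.

```python
def grad(w):
    # Gradient of the illustrative loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w = 0.0
learning_rate = 0.1   # typically chosen between 1e-5 and 1e-1 and tuned
for step in range(50):
    w -= learning_rate * grad(w)  # step size scaled by the learning rate
print(w)  # converges toward the minimum at w = 3
```

A rate that is too large makes the iterates overshoot or diverge, while one that is too small makes convergence needlessly slow, which is why this value is usually tuned.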
Gradient Descent
Definition
Stochastic Gradient Descent (SGD) is an optimization algorithm used to minimize
the objective function, typically in machine learning models, by iteratively adjusting
the model parameters (like weights) to reduce the error. It is a variant of the more
general Gradient Descent method but differs in how it updates the parameters.
In standard Gradient Descent, the algorithm calculates the gradient of the error for
the entire dataset at each iteration, which can be computationally expensive for large
datasets. SGD, on the other hand, updates the model parameters based on the
gradient of the error for one training sample (or a small batch) at a time. This makes
it faster and more efficient for large datasets but can introduce more noise into the
optimization process.
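The contrast can be sketched for linear regression as follows; the toy data, learning rate, and step counts are assumptions for illustration, not the lecture's example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))              # toy dataset: 1000 samples, 3 features
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=1000)

lr = 0.05

# Batch Gradient Descent: one update per pass, using the gradient over ALL samples
w_batch = np.zeros(3)
for epoch in range(200):
    grad = 2.0 * X.T @ (X @ w_batch - y) / len(y)
    w_batch -= lr * grad

# Stochastic Gradient Descent: one (noisier) update per individual sample
w_sgd = np.zeros(3)
for epoch in range(5):
    for i in rng.permutation(len(y)):
        xi, yi = X[i], y[i]
        grad = 2.0 * xi * (xi @ w_sgd - yi)
        w_sgd -= lr * grad

print(w_batch, w_sgd)   # both approach w_true; SGD gets there with noisier steps
```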
SGD Cont.
Advantages of SGD:
Faster computation for large datasets.
Allows the model to start learning immediately rather than waiting for the entire
dataset to be processed.
More frequent updates.
Disadvantages of SGD:
Due to the randomness of the updates, the learning process can fluctuate and may
not converge as smoothly as batch Gradient Descent.
Requires careful tuning of learning rates and often needs more iterations to stabilize.
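One common way to tame SGD's noisy updates is to decay the learning rate as training progresses; the 1/t-style schedule below is a generic illustration, and its base rate and decay constant are assumptions rather than values prescribed by the slides.

```python
def decayed_learning_rate(step, base_lr=0.1, decay=0.01):
    # Simple 1/t-style decay: large steps early on, smaller steps as training stabilizes
    return base_lr / (1.0 + decay * step)

for step in (0, 100, 1000, 10000):
    print(step, decayed_learning_rate(step))
```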
Hyperparameters
Types of Hyperparameters
1. Model-Related Hyperparameters:
These define the architecture of the model or control regularization techniques:
Number of layers in a neural network.
Number of neurons per layer.
Kernel size in a convolutional neural network.
Dropout rate to prevent overfitting.
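As a hedged illustration (this small PyTorch network is an assumption for demonstration, not a model from the lecture), each of these architectural choices appears as an explicit argument when the model is defined.

```python
import torch.nn as nn

# Illustrative CNN for 1x28x28 inputs; the architectural hyperparameters are explicit.
model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1),  # kernel size
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),   # adding layers deepens the stack
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 28 * 28, 128),                  # 128 neurons in this hidden layer
    nn.Dropout(p=0.5),                             # dropout rate to reduce overfitting
    nn.Linear(128, 10),
)
```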
2. Regularization Hyperparameters:
These constrain the model during training to reduce overfitting:
L2 regularization (weight decay): Penalizes large weights in the model.
Dropout rate: Randomly drops neurons during training to prevent overfitting.
3. Optimization-Related Hyperparameters:
These influence the optimization process and how the model is trained:
Learning rate: Controls the step size in each iteration during optimization.
Batch size: Number of samples processed before the model is updated.
Number of epochs: The number of times the entire dataset passes through the model.
Momentum: Accumulates past gradients so updates keep moving along consistent directions, accelerating convergence.
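A hedged sketch of where these optimization hyperparameters (and the weight-decay regularizer) appear in a typical PyTorch training setup; the placeholder data, model, and specific values are assumptions, not the lecture's.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model; real code would use an actual dataset and architecture.
X = torch.randn(512, 20)
y = torch.randint(0, 2, (512,))
dataset = TensorDataset(X, y)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

learning_rate = 0.01   # step size per update
batch_size = 32        # samples processed per parameter update
num_epochs = 10        # full passes over the dataset
momentum = 0.9         # accelerates updates along consistent gradient directions

loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate,
                            momentum=momentum, weight_decay=1e-4)  # weight decay = L2 regularization
loss_fn = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
```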