Deep Learning Unit - I Notes
Let’s get started with linear algebra. An understanding of linear algebra is a must for better comprehension and
effective implementation of neural networks. Getting started (in a black-box way) with deep learning libraries
such as TensorFlow and PyTorch is not difficult after following detailed tutorials, tons of which are readily
available on the internet. However, to understand the inner workings of neural networks and to comprehend
research papers, we need to make ourselves comfortable with the maths (linear algebra, probability, calculus)
that collectively underpins neural networks.
In this article, I’ll discuss the fundamentals of linear algebra: scalars, vectors, matrices and tensors.
Scalars
Scalars are physical quantities that are described by magnitude only. In other words, scalars are those
quantities that are represented just by their numerical value. Examples include volume, density, speed, age, etc.
Scalar variables are denoted by ordinary lower-case letters (e.g. x, y, z). The continuous space of real-valued
scalars is denoted by ℝ. For a scalar variable, the expression x ∈ ℝ denotes that x is a real-valued scalar.
Vectors
A vector is an array of numbers or a list of scalar values. The single values in the array/list of a vector are called
the entries or components of the vector. Vector variables are usually denoted by lower-case letters with a right
arrow on top (x→, y→, z→) or bold-faced lower-case letters (x, y, z).
Mathematical notation: x→ ∈ ℝⁿ, where x→ is a vector. The expression says that the vector x→ has n
real-valued scalar components.
x→ = [x₁, x₂, x₃, …, xₙ]ᵀ   (a column vector, written here as a transposed row)
A list of scalar values, concatenated together in a column (for the sake of representation), becomes a vector.
#loading numpy
import numpy as np
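For example, a vector can be represented as a one-dimensional NumPy array (the component values here are arbitrary):
x = np.array([1, 2, 3, 4])   # a vector with four components
print(x.shape)               # (4,)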
Matrices
Just as vectors generalize scalars from order zero to order one, matrices generalize vectors from order one to order
two. In other words, matrices are 2-D arrays of numbers. They are represented by bold-faced capital letters
(A, B, C).
Mathematical notation: A ∈ ℝ^(m×n) means that the matrix A is of size m×n, where m is the number of rows and
n is the number of columns.
A = ⎡ a₁₁  a₁₂  a₁₃  a₁₄ ⎤
    ⎢ a₂₁  a₂₂  a₂₃  a₂₄ ⎥
    ⎣ a₃₁  a₃₂  a₃₃  a₃₄ ⎦
Real-world use case: Learning algorithms need numbers to work on. Data such as images have to be
converted to arrays of numbers before the data is fed to the algorithms. For example, a grayscale image of
50×50 pixels is first converted to a 50×50 matrix; since the image is grayscale, each pixel value ranges from
0 to 255. The resulting matrix is then unrolled into a vector of dimension 2500. This way an
image is converted to a vector.
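A minimal NumPy sketch of this conversion, with random values standing in for real pixel intensities:
import numpy as np
image = np.random.randint(0, 256, size=(50, 50))   # stand-in 50×50 grayscale image
vector = image.reshape(-1)                         # unrolled into a vector
print(vector.shape)                                # (2500,)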
Tensors
Tensors can be simply understood as an n-dimensional array with n greater than 2. Vectors are first order tensors
and matrices are second order tensors.
The mathematical notation of tensors is similar to that of matrices. Only the number of letters that define the
dimension are greater than two. Ai,j,kAi,j,k defines a tensor AA with i,j,ki,j,k dimensions.
tensor = np.array([
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[10, 11, 12], [13, 14, 15], [16, 17, 18]],
    [[19, 20, 21], [22, 23, 24], [25, 26, 27]],
])
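For this array, tensor.shape is (3, 3, 3): three 3×3 matrices stacked along a new first axis.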
Converting an RGB image to a vector
We’ll be dealing with tensors while working with color image and video data. When a color image is converted
to an array of numbers, the dimensions will be something like (height, width, color channels). The color axis
holds a separate set of numbers for the Red, Green and Blue channels. For each color channel, a matrix is
created based on the pixel values. The resulting matrices are then unrolled to form a vector.
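The same idea, sketched with random stand-in pixel values:
import numpy as np
rgb = np.random.randint(0, 256, size=(50, 50, 3))  # height, width, color channels
vector = rgb.reshape(-1)                           # unrolled into a vector
print(vector.shape)                                # (7500,)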
A probability distribution is an abstract representation of a frequency distribution. While a frequency
distribution pertains to a particular sample or dataset, detailing how often each potential value of a variable
appears within it, the occurrence of each value in the sample is governed by its probability.
A probability distribution not only shows the frequencies of different outcomes but also assigns probabilities to
each outcome. These probabilities indicate the likelihood of each outcome occurring.
In this article, we will learn what a probability distribution is, the types of probability distributions, the
probability distribution function, and the associated formulas.
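As a small illustration, consider a fair six-sided die, where each face has probability 1/6. A minimal sketch comparing a sample's frequency distribution with the underlying probability distribution:
import numpy as np
rolls = np.random.randint(1, 7, size=10000)            # a sample of 10,000 die rolls
values, counts = np.unique(rolls, return_counts=True)
empirical = counts / rolls.size                        # frequency distribution of the sample
theoretical = np.full(6, 1/6)                          # probability distribution of the die
print(empirical, theoretical)                          # empirical values approach 1/6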
Gradient Descent is a widely used optimization algorithm for machine learning models. However, there are
several optimization techniques that can be used to improve the performance of Gradient Descent. Here are some
of the most popular optimization techniques for Gradient Descent:
Learning Rate Scheduling: The learning rate determines the step size of the Gradient Descent algorithm.
Learning Rate Scheduling involves changing the learning rate during the training process, such as decreasing the
learning rate as the number of iterations increases. This technique helps the algorithm to converge faster and
avoid overshooting the minimum.
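For instance, a simple step-decay schedule shrinks the learning rate at fixed intervals; a minimal sketch (the initial rate, decay factor, and interval are illustrative):
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    # Halve the learning rate every `epochs_per_drop` epochs
    return initial_lr * (drop ** (epoch // epochs_per_drop))

print([step_decay(0.1, e) for e in (0, 10, 20)])  # 0.1, 0.05, 0.025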
Momentum-based Updates: The Momentum-based Gradient Descent technique involves adding a fraction of
the previous update to the current update. This technique helps the algorithm to overcome local minima and
accelerates convergence.
Batch Normalization: Batch Normalization is a technique used to normalize the inputs to each layer of the
neural network. This helps the Gradient Descent algorithm to converge faster and avoid vanishing or exploding
gradients.
Weight Decay: Weight Decay is a regularization technique that involves adding a penalty term to the cost
function proportional to the magnitude of the weights. This helps to prevent overfitting and improve the
generalization of the model.
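Concretely, with an L2 penalty (λ/2)·‖W‖² added to the cost, each gradient step also shrinks the weights; a minimal NumPy sketch, where the gradient dW and the coefficient λ are illustrative stand-ins:
import numpy as np
W = np.random.randn(5)                 # current weights
dW = np.random.randn(5)                # stand-in gradient of the unpenalized cost
alpha, lam = 0.1, 0.01                 # learning rate and weight-decay coefficient
W = W - alpha * (dW + lam * W)         # the lam * W term decays the weights each step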
Adaptive Learning Rates: Adaptive Learning Rate techniques involve adjusting the learning rate adaptively
during the training process. Examples include Adagrad, RMSprop, and Adam. These techniques adjust the
learning rate based on the historical gradient information, which can improve the convergence speed and
accuracy of the algorithm.
Second-Order Methods: Second-Order Methods use the second-order derivatives of the cost function to update
the parameters. Examples include Newton’s Method and Quasi-Newton Methods. These methods can converge
faster than Gradient Descent, but require more computation and may be less stable.
Gradient Descent is an iterative optimization algorithm used to find the minimum value of a function. The
general idea is to initialize the parameters to random values, and then take small steps opposite to the direction
of the “slope” (the negative gradient) at each iteration. Gradient descent is widely used in supervised learning to
minimize the error function and find the optimal values for the parameters. Various extensions have been
designed for the gradient descent algorithm. Some of them are discussed below:
Momentum method: This method is used to accelerate the gradient descent algorithm by taking into
consideration the exponentially weighted average of the gradients. Using averages makes the algorithm converge
towards the minima faster, as gradient components in uncommon (oscillating) directions cancel out. The
pseudocode for the momentum method is given below.
V = 0
for each iteration i:
    compute dW
    V = β V + (1 - β) dW
    W = W - α V
V and dW are analogous to velocity and acceleration respectively. α is the learning rate, and β is the momentum
coefficient, normally kept at 0.9. The physics interpretation is that the velocity of a ball rolling downhill builds up
momentum according to the direction of the slope (gradient) of the hill, which helps the ball settle at a minimum
value (in our case, at a minimum loss).
RMSprop: RMSprop was proposed by Geoffrey Hinton of the University of Toronto. The intuition is to apply an
exponentially weighted average to the second moment of the gradients (dW²). The pseudocode for this is
as follows:
S = 0
for each iteration i:
    compute dW
    S = β S + (1 - β) dW²
    W = W - α dW / (√S + ε)
Adam Optimization: Adam optimization algorithm incorporates the momentum method and RMSprop, along
with bias correction. The pseudocode for this approach is as follows:
V = 0
S = 0
for each iteration i:
    compute dW
    V = β₁ V + (1 - β₁) dW
    S = β₂ S + (1 - β₂) dW²
    V = V / (1 - β₁^i)    (bias correction)
    S = S / (1 - β₂^i)    (bias correction)
    W = W - α V / (√S + ε)
Kingma and Ba, the proposers of Adam, recommended the following values for the hyperparameters.
α = 0.001
β₁ = 0.9
β₂ = 0.999
ε = 10⁻⁸
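Putting the pieces together, here is a minimal NumPy sketch of the Adam update applied to an illustrative one-dimensional loss W² (whose gradient is 2W):
import numpy as np
alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
W = np.array([5.0])                              # start away from the minimum at W = 0
V = np.zeros_like(W)
S = np.zeros_like(W)
for i in range(1, 5001):
    dW = 2 * W                                   # gradient of the toy loss W^2
    V = beta1 * V + (1 - beta1) * dW             # first moment (momentum)
    S = beta2 * S + (1 - beta2) * dW**2          # second moment (RMSprop)
    V_hat = V / (1 - beta1**i)                   # bias correction
    S_hat = S / (1 - beta2**i)
    W = W - alpha * V_hat / (np.sqrt(S_hat) + eps)
print(W)                                         # moves steadily toward 0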
Bias: Bias refers to the error due to overly simplistic assumptions in the learning algorithm. These
assumptions make the model easier to comprehend and learn but might not capture the underlying
complexities of the data. It is the error due to the model’s inability to represent the true relationship
between input and output accurately. When a model performs poorly on both the training and testing data, it has
high bias because the model is too simple, indicating underfitting.
Variance: Variance, on the other hand, is the error due to the model’s sensitivity to fluctuations in the
training data. It’s the variability of the model’s predictions for different instances of training data. High
variance occurs when a model learns the training data’s noise and random fluctuations rather than the
underlying pattern. As a result, the model performs well on the training data but poorly on the testing
data, indicating overfitting.
Underfitting in Machine Learning
A statistical model or a machine learning algorithm is said to underfit when the model is too simple to
capture the complexities of the data. Underfitting represents the inability of the model to learn the training data
effectively, resulting in poor performance on both the training and testing data. In simple terms, underfit models
are inaccurate, especially when applied to new, unseen examples. Underfitting mainly happens when we use a
very simple model with overly simplified assumptions. To address underfitting, we need to use more complex
models, with enhanced feature representation and less regularization.
Note: The underfitting model has high bias and low variance.
Reasons for underfitting include:
1. The model is too simple, so it may not be capable of representing the complexities in the data.
2. The input features used to train the model are not adequate representations of the underlying
factors influencing the target variable.
3. Excessive regularization is used to prevent overfitting, which constrains the model from capturing the
data well.
A common technique to reduce underfitting is to increase the number of epochs or the duration of training
to get better results.
A statistical model is said to be overfitted when it does not make accurate predictions on testing data.
When a model is trained on too much data, it starts learning from the noise and inaccurate entries in the
data set, and testing on test data then results in high variance. The model does not categorize the data
correctly, because of too many details and noise. Overfitting is often caused by non-parametric and non-linear
methods, because these types of machine learning algorithms have more freedom in building the model from
the dataset and can therefore build unrealistic models. Solutions to avoid overfitting include using a linear
algorithm if we have linear data, or constraining parameters such as the maximal depth if we are using decision trees.
In a nutshell, overfitting is a problem where the evaluation of a machine learning algorithm on training data
differs from its evaluation on unseen data.
Techniques to reduce overfitting:
1. Improve the quality of the training data, so the model focuses on meaningful patterns; this mitigates
the risk of fitting noise or irrelevant features.
2. Increase the amount of training data, which can improve the model’s ability to generalize to unseen data
and reduce the likelihood of overfitting.
3. Use early stopping during the training phase: keep an eye on the loss over the training period, and stop
training as soon as the validation loss begins to increase (see the sketch below).
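In Keras, for example, early stopping is available as a callback; a minimal sketch on toy data (the model, data, and patience value are purely illustrative):
import numpy as np
import tensorflow as tf
X = np.random.randn(200, 5).astype(np.float32)            # toy inputs
y = np.random.randn(200).astype(np.float32)               # toy targets
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation="relu"),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")
# Stop as soon as the validation loss stops improving
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[early_stop], verbose=0)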
There are several cases of random variables that appear often and have useful properties. Below are the ones we
will explore further in this course. The numbers in parentheses are the parameters of a random variable, which
are constants. Parameters define a random variable’s shape (i.e., distribution) and its values. For this lecture,
we’ll focus more heavily on the bolded random variables and their special properties, but you should familiarize
yourself with all the ones listed below:
Example
Suppose you win cash based on the number of heads you get in a series of 20 coin flips. Let Xᵢ = 1
if the i-th coin flip is heads, and Xᵢ = 0 otherwise. The total number of heads X = X₁ + X₂ + … + X₂₀ then
follows a Binomial(20, 0.5) distribution. Which payout strategy would you choose?
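The candidate payout strategies are not reproduced here, but the head count X itself is easy to simulate; a minimal sketch:
import numpy as np
rng = np.random.default_rng(0)
heads = rng.binomial(n=20, p=0.5, size=100000)   # total heads in each of 100,000 rounds
print(heads.mean())                              # close to the expectation n*p = 10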
Stochastic Gradient Descent (SGD):
Stochastic Gradient Descent (SGD) is a variant of the Gradient Descent algorithm that is used for optimizing
machine learning models. It addresses the computational inefficiency of traditional Gradient Descent methods
when dealing with large datasets in machine learning projects.
In SGD, instead of using the entire dataset for each iteration, only a single random training example (or a small
batch) is selected to calculate the gradient and update the model parameters. This random selection introduces
randomness into the optimization process, hence the term “stochastic” in Stochastic Gradient Descent.
The advantage of using SGD is its computational efficiency, especially when dealing with large datasets. By
using a single example or a small batch, the computational cost per iteration is significantly reduced compared to
traditional Gradient Descent methods that require processing the entire dataset.
Set Parameters: Determine the number of iterations and the learning rate (alpha) for updating the parameters.
Stochastic Gradient Descent Loop: Repeat the following steps until the model converges or reaches the maximum
number of iterations:
o Shuffle the training data, then iterate over each training example (or a small batch) in the shuffled order.
o Compute the gradient of the cost function with respect to the model parameters using the current
training example (or batch).
o Update the model parameters by taking a step in the direction of the negative gradient, scaled by the
learning rate.
o Evaluate the convergence criteria, such as the difference in the cost function between iterations.
Return Optimized Parameters: Once the convergence criteria are met or the maximum number of iterations is
reached, return the optimized model parameters.
In SGD, since only one sample (or mini-batch) from the dataset is chosen at random for each iteration, the path
taken by the algorithm to reach the minima is usually noisier than that of the typical Gradient Descent algorithm.
But that doesn’t matter much, because the path taken by the algorithm does not matter as long as we reach the
minimum, and with a significantly shorter training time.
One thing to note is that, because SGD is generally noisier than typical Gradient Descent, it usually takes a higher
number of iterations to reach the minima, due to the randomness in its descent. Even though it requires a
higher number of iterations to reach the minima than typical Gradient Descent, it is still computationally much
less expensive. Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for optimizing a
learning algorithm.
1. The first line imports the NumPy library, which is used for numerical computations in Python.
2. The class SGD encapsulates the Stochastic Gradient Descent algorithm for training a linear
regression model.
3. We then initialize the SGD optimizer parameters such as the learning rate, number of epochs, batch
size, and tolerance. The constructor also initializes the weights and bias to None.
4. predict function: This function computes the predictions for input data X using the current weights
and bias. It performs matrix multiplication between the input X and the weights, and then adds the bias
term.
5. mean_squared_error function: This function calculates the mean squared error between the true
target values y_true and the predicted values y_pred.
6. gradient function: This computes the gradients of the loss function with respect to the weights and
bias, using the formula for the gradient of the mean squared error loss function.
7. fit method: This method fits the model to the training data using stochastic gradient descent. It
iterates through the specified number of epochs, shuffling the data and updating the weights and
bias in each epoch. It also prints the loss periodically and checks for convergence based on the
tolerance. Within each epoch:
1. It calculates gradients using a mini-batch of data (X_batch, y_batch). This aligns with the
stochastic nature of SGD.
2. After computing the gradients, the method updates the model parameters (self.weights and
self.bias) using the learning rate and the calculated gradients. This is consistent with the
parameter update step in SGD.
3. The fit method iterates through the entire dataset (X, y) for each epoch. However, it
updates the parameters using mini-batches, as indicated by the nested loop that iterates
through the dataset in mini-batches (for i in range(0, n_samples, self.batch_size)). This
reflects the stochastic nature of SGD, where parameters are updated using a subset of the
data.
import numpy as np
class SGD:
    def __init__(self, lr=0.01, epochs=1000, batch_size=32, tol=1e-3):
        # Hyperparameter defaults are illustrative
        self.learning_rate = lr
        self.epochs = epochs
        self.batch_size = batch_size
        self.tolerance = tol
        self.weights = None
        self.bias = None
    def predict(self, X):
        return np.dot(X, self.weights) + self.bias
    def mean_squared_error(self, y_true, y_pred):
        return np.mean((y_true - y_pred) ** 2)
    def gradient(self, X_batch, y_batch):
        # Gradient of the MSE loss w.r.t. weights and bias
        y_pred = self.predict(X_batch)
        error = y_pred - y_batch
        gradient_weights = np.dot(X_batch.T, error) / X_batch.shape[0]
        gradient_bias = np.mean(error)
        return gradient_weights, gradient_bias
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.random.randn(n_features)
        self.bias = np.random.randn()
        for epoch in range(self.epochs):
            # Shuffle the data at the start of every epoch
            indices = np.random.permutation(n_samples)
            X_shuffled = X[indices]
            y_shuffled = y[indices]
            for i in range(0, n_samples, self.batch_size):
                X_batch = X_shuffled[i:i+self.batch_size]
                y_batch = y_shuffled[i:i+self.batch_size]
                gradient_weights, gradient_bias = self.gradient(X_batch, y_batch)
                self.weights -= self.learning_rate * gradient_weights
                self.bias -= self.learning_rate * gradient_bias
            if epoch % 100 == 0:
                y_pred = self.predict(X)
                loss = self.mean_squared_error(y, y_pred)
                print(f"Epoch {epoch}: Loss {loss}")
                if np.linalg.norm(gradient_weights) < self.tolerance:
                    print("Convergence reached.")
                    break
        return self.weights, self.bias
Code 2
X = np.random.randn(100, 5)
# True coefficients and noise scale are illustrative
y = np.dot(X, np.array([1, 2, 3, 4, 5])) + np.random.randn(100) * 0.1
model = SGD(lr=0.01, epochs=1000, batch_size=32, tol=1e-3)
w, b = model.fit(X, y)
y_pred = np.dot(X, w) + b
# y_pred
Output
Epoch 0: Loss 64.66196845798673
Epoch 100: Loss 0.03999940087439455
Epoch 200: Loss 0.008260358272771882
Epoch 300: Loss 0.00823731979566282
Epoch 400: Loss 0.008243022613956992
Epoch 500: Loss 0.008239370268212335
Epoch 600: Loss 0.008236363304624746
Epoch 700: Loss 0.00823205131002819
Epoch 800: Loss 0.00823566681302786
Epoch 900: Loss 0.008237441485197143
Stochastic Gradient Descent (SGD) using TensorFlow
import tensorflow as tf
import numpy as np
class SGD:
    def __init__(self, lr=0.005, epochs=1000, batch_size=32, tol=1e-3):
        # Hyperparameter defaults are illustrative
        self.learning_rate = lr
        self.epochs = epochs
        self.batch_size = batch_size
        self.tolerance = tol
        self.weights = None
        self.bias = None
    def predict(self, X):
        return tf.tensordot(X, self.weights, axes=1) + self.bias
    def mean_squared_error(self, y_true, y_pred):
        return tf.reduce_mean(tf.square(y_true - y_pred))
    def gradient(self, X_batch, y_batch):
        # Use automatic differentiation to get gradients of the MSE loss
        with tf.GradientTape() as tape:
            y_pred = self.predict(X_batch)
            loss = self.mean_squared_error(y_batch, y_pred)
        return tape.gradient(loss, [self.weights, self.bias])
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = tf.Variable(tf.random.normal((n_features,)))
        self.bias = tf.Variable(tf.random.normal(()))
        for epoch in range(self.epochs):
            indices = tf.random.shuffle(tf.range(n_samples))
            X_shuffled = tf.gather(X, indices)
            y_shuffled = tf.gather(y, indices)
            for i in range(0, n_samples, self.batch_size):
                X_batch = X_shuffled[i:i+self.batch_size]
                y_batch = y_shuffled[i:i+self.batch_size]
                gradient_weights, gradient_bias = self.gradient(X_batch, y_batch)
                # Gradient clipping keeps the noisy updates stable
                gradient_weights = tf.clip_by_norm(gradient_weights, 1.0)
                gradient_bias = tf.clip_by_norm(gradient_bias, 1.0)
                self.weights.assign_sub(self.learning_rate * gradient_weights)
                self.bias.assign_sub(self.learning_rate * gradient_bias)
            if epoch % 100 == 0:
                y_pred = self.predict(X)
                loss = self.mean_squared_error(y, y_pred)
                print(f"Epoch {epoch}: Loss {loss.numpy()}")
                if tf.norm(gradient_weights) < self.tolerance:
                    print("Convergence reached.")
                    break
        return self.weights.numpy(), self.bias.numpy()
X = np.random.randn(100, 5).astype(np.float32)
y = (np.dot(X, np.array([1, 2, 3, 4, 5])) + np.random.randn(100) * 0.1).astype(np.float32)
model = SGD(lr=0.005, epochs=1000, batch_size=32, tol=1e-3)
w, b = model.fit(X, y)
y_pred = np.dot(X, w) + b
Output
Epoch 0: Loss 52.73115158081055
Epoch 100: Loss 44.69907760620117
Epoch 200: Loss 44.693603515625
Epoch 300: Loss 44.69377136230469
Epoch 400: Loss 44.67509460449219
Epoch 500: Loss 44.67082595825195
Epoch 600: Loss 44.674285888671875
Epoch 700: Loss 44.666194915771484
Epoch 800: Loss 44.66718292236328
Epoch 900: Loss 44.65559005737305
Advantages of SGD include:
Memory Efficiency: Since SGD updates the parameters using one training example (or small batch) at a time,
it is memory-efficient and can handle large datasets that cannot fit into memory.
Avoidance of Local Minima: Due to the noisy updates in SGD, it can escape shallow local minima, which may
help it reach a better minimum.
Disadvantages of SGD include:
Slow Convergence: SGD may require more iterations to converge to the minimum since it updates the
parameters using one training example (or small batch) at a time.
Sensitivity to Learning Rate: The choice of learning rate can be critical in SGD, since a high
learning rate can cause the algorithm to overshoot the minimum, while a low learning rate can make the
algorithm converge slowly.
Less Accurate: Due to the noisy updates, SGD may not converge to the exact minimum and can
result in a suboptimal solution. This can be mitigated by using techniques such as learning rate scheduling
and momentum-based updates.
Challenges in deep learning:
1. Data availability: Deep learning requires large amounts of data to learn from, and gathering enough
training data is a big concern.
2. Computational resources: Training deep learning models is computationally expensive and requires
specialized hardware like GPUs and TPUs.
3. Time-consuming: Depending on the computational resources, training on sequential data can take a
very long time, even days or months.
4. Interpretability: Deep learning models are complex and work like a black box, so it is very difficult to
interpret their results.
5. Overfitting: When the model is trained again and again on the same data, it becomes too specialized
for the training data, leading to overfitting and poor performance on new data.
A feedforward neural network consists of the following layers:
Input Layer:
The input layer accepts the input data and passes it to the next layer.
Hidden Layers:
One or more hidden layers that process and transform the input data. Each hidden layer has a set of
neurons connected to the neurons of the previous and next layers. These layers use activation functions,
such as ReLU or sigmoid, to introduce non-linearity into the network, allowing it to learn and model
more complex relationships between the inputs and outputs.
Output Layer:
The output layer generates the final output. Depending on the type of problem, the number of neurons in
the output layer may vary. For example, in a binary classification problem, it would only have one
neuron. In contrast, a multi-class classification problem would have as many neurons as the number of
classes.
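As a concrete (hypothetical) illustration of this structure, a small Keras network for binary classification, with an input layer of 10 features, two ReLU hidden layers, and a single sigmoid output neuron, might look like:
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),               # input layer: 10 features
    tf.keras.layers.Dense(16, activation="relu"),     # hidden layer 1
    tf.keras.layers.Dense(16, activation="relu"),     # hidden layer 2
    tf.keras.layers.Dense(1, activation="sigmoid"),   # output layer: one neuron for binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()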