0% found this document useful (0 votes)

36 views20 pages

Deep Learning Unit - I Notes

notes

Uploaded by

athirayanpericse

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

36 views20 pages

Deep Learning Unit - I Notes

notes

Uploaded by

athirayanpericse

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

You are on page 1/ 20

Fundamentals of Linear Algbebra,Scalars,Vectors,Matrices,Tensors

Let’s get started with Linear algebra. The understanding of linear algebra is a must for better comprehension and
effective implementation of Neural Networks. Getting started (in a black-box way) with deep learning libraries
such as TensorFlow and PyTorch should not be difficult after following detailed tutorials. Tons of them are
readily available on the internet. However, to understand the inside working of the neural networks and to
comprehend the research papers we need to make ourselves comfortable with the maths (linear algebra,
probability, calculus) that collectively build the neural networks.

In this article, I’ll discuss the fundamentals of linear algebra: scalars, vector, matrices and tensors.

Scalars

Scalars are the physical quantities that are described by magnitude only. In other words, scalars are those
quantities that are represented just by their numerical value. Examples include volume, density, speed, age, etc.
Scalar variables are denoted by ordinary lower-case letters (eg. x,y,zx,y,z). The continuous space of real-value
scalars is denoted by RR. For a scalar variable, the expression x∈RR denotes that xx is a real value scalar.

Mathematical notation: aϵRaϵR, where aa is learning rate

Vectors

A vector is an array of numbers or a list of scalar values. The single values in the array/list of a vector are called
the entries or components of the vector. Vector variables are usually denoted by lower-case letters with a right
arrow on top (x→,y→,z→x→,y→,z→) or bold-faced lower case letters (x, y, z).

Mathematical notation: x→ϵRnx→ϵRn, where xx is a vector. The expression says that the vector x→x→ has nn
real-value scalars.

x→=[x1x2x3..xn]x→=⎣⎡x1x2x3..xn⎦⎤

where, x1,x2,...,xnx1,x2,...,xn are the entries/components of the vector x→x→.

A list of scalar values, concatenated together in a column (for the sake of representation) become a vector.

Creating vectors in NumPy:

#loading numpy
import numpy as np

# creating a row vector

row_v = np.array([1, 2, 3])

#creating a column vector

column_v = np.array([[1],
[2],
[3]])

Matrices
Like vectors generalize scalars from order zero to order one, matrices generalize vectors from order one to order
two. In other words, Matrices are 2-D array of numbers. They are represented by bold-faced capital letters
(A,B,CA,B,C).

Mathematical notation: AϵRm∗nAϵRm∗n means that the matrix AA is of size m∗nm∗n where mm is rows and
nn is columns.

Visually, a matrix AA with size 3∗43∗4 can be illustrated as:

A=[a11a12a13a14a21a22a23a24a31a32a33a34]A=⎣⎡a11a21a31a12a22a32a13a23a33a14a24a34⎦⎤

Creating matrices in NumPy:

# Creating a 3*3 matrix

matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])

Converting a grayscale image to a vector

Real-world use case: The learning algorithms need numbers to work on. Data such as images have to be
converted to an array of numbers before the data is fed to the algorithms. In the above illustration, an image of
50*50 is first converted to a 50*50 matrix. Here we are supposing that the image is grayscale. Thus the value of a
pixel ranges from 0 to 255. The resulting image is then converted to a vector of dimension 2500. This way an
image is converted to a vector.

Tensors
Tensors can be simply understood as an n-dimensional array with n greater than 2. Vectors are first order tensors
and matrices are second order tensors.

The mathematical notation of tensors is similar to that of matrices. Only the number of letters that define the
dimension are greater than two. Ai,j,kAi,j,k defines a tensor AA with i,j,ki,j,k dimensions.

Creating Tensor in NumPy:

tensor = np.array([
[[1,2,3], [4,5,6], [7,8,9]],
[[10,11,12], [13,14,15], [16,17,18]]
[[19,20,21], [22,23,24], [25,26,27]],
])

Con
verting a RGB image to a vector

We’ll be dealing with tensors while working with color image and video data. When a color image is converted
to an array of numbers the dimension will be something like height, width and color axis. The color axis
basically defines a different set of numbers for Red-Green-Blue channels. For each color channel, we’ll be
creating a matrix based on the pixel value. Then the resulting matrices are unrolled to form a vector.

Probability Distribution – Function, Formula, Table

A probability distribution is an idealized frequency distribution. In statistics, a frequency distribution

represents the number of occurrences of different outcomes in a dataset. It shows how often each different value
appears within a dataset.

Probability distribution represents an abstract representation of the frequency distribution. While a frequency
distribution pertains to a particular sample or dataset, detailing how often each potential value of a variable
appears within it, the occurrence of each value in the sample is dictated by its probability.
A probability distribution, not only shows the frequencies of different outcomes but also assigns probabilities to
each outcome. These probabilities indicate the likelihood of each outcome occurring.

In this article, we will learn what is probability distribution, types of probability distribution, probability
distribution function, and formulas.

Optimization techniques for Gradient Descent

Gradient Descent is a widely used optimization algorithm for machine learning models. However, there are
several optimization techniques that can be used to improve the performance of Gradient Descent. Here are some
of the most popular optimization techniques for Gradient Descent:

Learning Rate Scheduling: The learning rate determines the step size of the Gradient Descent algorithm.
Learning Rate Scheduling involves changing the learning rate during the training process, such as decreasing the
learning rate as the number of iterations increases. This technique helps the algorithm to converge faster and
avoid overshooting the minimum.

Momentum-based Updates: The Momentum-based Gradient Descent technique involves adding a fraction of
the previous update to the current update. This technique helps the algorithm to overcome local minima and
accelerates convergence.

Batch Normalization: Batch Normalization is a technique used to normalize the inputs to each layer of the
neural network. This helps the Gradient Descent algorithm to converge faster and avoid vanishing or exploding
gradients.

Weight Decay: Weight Decay is a regularization technique that involves adding a penalty term to the cost
function proportional to the magnitude of the weights. This helps to prevent overfitting and improve the
generalization of the model.

Adaptive Learning Rates: Adaptive Learning Rate techniques involve adjusting the learning rate adaptively
during the training process. Examples include Adagrad, RMSprop, and Adam. These techniques adjust the
learning rate based on the historical gradient information, which can improve the convergence speed and
accuracy of the algorithm.

Second-Order Methods: Second-Order Methods use the second-order derivatives of the cost function to update
the parameters. Examples include Newton’s Method and Quasi-Newton Methods. These methods can converge
faster than Gradient Descent, but require more computation and may be less stable.

Gradient Descent is an iterative optimization algorithm, used to find the minimum value for a function. The
general idea is to initialize the parameters to random values, and then take small steps in the direction of the
“slope” at each iteration. Gradient descent is highly used in supervised learning to minimize the error function
and find the optimal values for the parameters. Various extensions have been designed for the gradient descent
algorithms. Some of them are discussed below:

Momentum method: This method is used to accelerate the gradient descent algorithm by taking into
consideration the exponentially weighted average of the gradients. Using averages makes the algorithm converge
towards the minima in a faster way, as the gradients towards the uncommon directions are canceled out. The
pseudocode for the momentum method is given below.

V=0
for each iteration i:
compute dW
V = β V + (1 - β) dW
W=W-αV

V and dW are analogous to velocity and acceleration respectively. α is the learning rate, and β is analogous to
momentum normally kept at 0.9. Physics interpretation is that the velocity of a ball rolling downhill builds up
momentum according to the direction of slope(gradient) of the hill and therefore helps in better arrival of the ball
at a minimum value (in our case – at a minimum loss).

RMSprop: RMSprop was proposed by the University of Toronto’s Geoffrey Hinton. The intuition is to apply an
exponentially weighted average method to the second moment of the gradients (dW 2). The pseudocode for this is
as follows:

S=0
for each iteration i
compute dW
S = β S + (1 - β) dW2
W = W - α dW⁄√S + ε

Adam Optimization: Adam optimization algorithm incorporates the momentum method and RMSprop, along
with bias correction. The pseudocode for this approach is as follows:

V=0
S=0
for each iteration i
compute dW
V = β1 S + (1 - β1) dW
S = β2 S + (1 - β2) dW2
V = V⁄{1 - β1i}
S = S⁄{1 - β2i}
W = W - α V⁄√S + ε
Kingma and Ba, the proposers of Adam, recommended the following values for the hyperparameters.

α = 0.001
β1 = 0.9
β2 = 0.999
ε = 10-8

Machine learning Basics

Capacity
The ability of the learner (or called model) to discover a function taken from a family of functions.
Examples:
Linear predictor
y = wx + b
Quadratic predictor
y = w2x2 + w1x + b
Degree-10 polynomial predictor
y=b+
10∑
i=1
wi xi
The latter family is richer, allowing to capture more complex functions Capacity can be measured by the number
of training examples{x(train)i , y(train)i } that the learner could always fit, no matter how to change the values of
x(train)i and y(train).

Bias and Variance in Machine Learning

 Bias: Bias refers to the error due to overly simplistic assumptions in the learning algorithm. These
assumptions make the model easier to comprehend and learn but might not capture the underlying
complexities of the data. It is the error due to the model’s inability to represent the true relationship
between input and output accurately. When a model has poor performance both on the training and
testing data means high bias because of the simple model, indicating underfitting.

 Variance: Variance, on the other hand, is the error due to the model’s sensitivity to fluctuations in the
training data. It’s the variability of the model’s predictions for different instances of training data. High
variance occurs when a model learns the training data’s noise and random fluctuations rather than the
underlying pattern. As a result, the model performs well on the training data but poorly on the testing
data, indicating overfitting.
Under fitting in Machine Learning

A statistical model or a machine learning algorithm is said to have underfitting when a model is too simple to
capture data complexities. It represents the inability of the model to learn the training data effectively result in
poor performance both on the training and testing data. In simple terms, an underfit model’s are inaccurate,
especially when applied to new, unseen examples. It mainly happens when we uses very simple model with
overly simplified assumptions. To address underfitting problem of the model, we need to use more complex
models, with enhanced feature representation, and less regularization.

Note: The underfitting model has High bias and low variance.

Reasons for Underfitting

1. The model is too simple, So it may be not capable to represent the complexities in the data.

2. The input features which is used to train the model is not the adequate representations of underlying
factors influencing the target variable.

3. The size of the training dataset used is not enough.

4. Excessive regularization are used to prevent the overfitting, which constraint the model to capture the
data well.

5. Features are not scaled.

Techniques to Reduce Underfitting

1. Increase model complexity.

2. Increase the number of features, performing feature engineering.

3. Remove noise from the data.

4. Increase the number of epochs or increase the duration of training to get better results.

Overfitting in Machine Learning

A statistical model is said to be overfitted when the model does not make accurate predictions on testing data.
When a model gets trained with so much data, it starts learning from the noise and inaccurate data entries in our
data set. And when testing with test data results in High variance. Then the model does not categorize the data
correctly, because of too many details and noise. The causes of overfitting are the non-parametric and non-linear
methods because these types of machine learning algorithms have more freedom in building the model based on
the dataset and therefore they can really build unrealistic models. A solution to avoid overfitting is using a linear
algorithm if we have linear data or using the parameters like the maximal depth if we are using decision trees.

In a nutshell, Overfitting is a problem where the evaluation of machine learning algorithms on training data is
different from unseen data.
Reasons for Overfitting:

1. High variance and low bias.

2. The model is too complex.

3. The size of the training data.

Techniques to Reduce Overfitting

1. Improving the quality of training data reduces overfitting by focusing on meaningful patterns, mitigate
the risk of fitting the noise or irrelevant features.

2. Increase the training data can improve the model’s ability to generalize to unseen data and reduce the
likelihood of overfitting.

3. Reduce model complexity.

4. Early stopping during the training phase (have an eye over the loss over the training period as soon as loss
begins to increase stop training).

5. Ridge Regularization and Lasso Regularization.

6. Use dropout for neural networks to tackle overfitting.

Estimators, Bias, and Variance

There are several cases of random variables that appear often and have useful properties. Below are the ones we
will explore further in this course. The numbers in parentheses are the parameters of a random variable, which
are constants. Parameters define a random variable’s shape (i.e., distribution) and its values. For this lecture,
we’ll focus more heavily on the bolded random variables and their special properties, but you should familiarize
yourself with all the ones listed below:
Example

Suppose you win cash based on the number of heads you get in a series of 20 coin flips. Let Xi=1

if the -ith coin is heads, otherwise. Which payout strategy would you choose?
Stochastic Gradient Descent (SGD):
Stochastic Gradient Descent (SGD) is a variant of the Gradient Descent algorithm that is used for optimizing
machine learning models. It addresses the computational inefficiency of traditional Gradient Descent methods
when dealing with large datasets in machine learning projects.

In SGD, instead of using the entire dataset for each iteration, only a single random training example (or a small
batch) is selected to calculate the gradient and update the model parameters. This random selection introduces
randomness into the optimization process, hence the term “stochastic” in stochastic Gradient Descent

The advantage of using SGD is its computational efficiency, especially when dealing with large datasets. By
using a single example or a small batch, the computational cost per iteration is significantly reduced compared to
traditional Gradient Descent methods that require processing the entire dataset.

Stochastic Gradient Descent Algorithm

 Initialization: Randomly initialize the parameters of the model.

 Set Parameters: Determine the number of iterations and the learning rate (alpha) for updating the parameters.

 Stochastic Gradient Descent Loop: Repeat the following steps until the model converges or reaches the maximum
number of iterations:

o Shuffle the training dataset to introduce randomness.

o Iterate over each training example (or a small batch) in the shuffled order.

o Compute the gradient of the cost function with respect to the model parameters using the current
training
example (or batch).
o Update the model parameters by taking a step in the direction of the negative gradient, scaled by the
learning rate.

o Evaluate the convergence criteria, such as the difference in the cost function between iterations of the
gradient.

 Return Optimized Parameters: Once the convergence criteria are met or the maximum number of iterations is
reached, return the optimized model parameters.

In SGD, since only one sample from the dataset is chosen at random for each iteration, the path taken by the
algorithm to reach the minima is usually noisier than your typical Gradient Descent algorithm. But that doesn’t
matter all that much because the path taken by the algorithm does not matter, as long as we reach the minimum
and with a significantly shorter training time.

The path taken by Batch Gradient Descent is shown below:

path taken by Stochastic Gradient Descent looks as follows –

One thing to be noted is that, as SGD is generally noisier than typical Gradient Descent, it usually took a higher
number of iterations to reach the minima, because of the randomness in its descent. Even though it requires a
higher number of iterations to reach the minima than typical Gradient Descent, it is still computationally much
less expensive than typical Gradient Descent. Hence, in most scenarios, SGD is preferred over Batch Gradient
Descent for optimizing a learning algorithm.

Difference between Stochastic Gradient Descent & batch Gradient

Descent
The comparison between Stochastic Gradient Descent (SGD) and Batch Gradient Descent are as follows:

Aspect Stochastic Gradient Descent (SGD) Batch Gradient Descent

Uses a single random sample or a small Uses the entire dataset (batch) at each
Dataset Usage
batch of samples at each iteration. iteration.
Computational Computationally less expensive per iteration,
Computationally more expensive per
Efficiency as it processes fewer data points. iteration, as it processes the entire dataset.
Slower convergence due to less frequent
Convergence Faster convergence due to frequent updates.
updates.
High noise due to frequent updates with a Low noise as it updates parameters using all
Noise in Updates
single or few samples. data points.
Less stable as it may oscillate around the More stable as it converges smoothly towards
Stability
optimal solution. the optimum.
Memory Requires less memory as it processes fewer Requires more memory to hold the entire
Requirement data points at a time. dataset in memory.
Frequent updates make it suitable for online Less frequent updates make it suitable for
Update Frequency
learning and large datasets. smaller datasets.
Initialization Less sensitive to initial parameter values due
More sensitive to initial parameter values.
Sensitivity to frequent updates.

Python Code For Stochastic Gradient Descent

We will create an SGD class with methods that we will use while updating the parameters, fitting the training
data set, and predicting the new test data. The methods we will be using are as :

1. The first line imports the NumPy library, which is used for numerical computations in Python.

2. Define the SGD class:

1. The class SGD encapsulates the Stochastic Gradient Descent algorithm for training a linear
regression model.

2. Then we initialize the SGD optimizer parameters such as learning rate, number of epochs, batch
size, and tolerance. It also initializes weights and bias to None.

3. predict function: This function computes the predictions for input data X using the current weights
and bias. It performs matrix multiplication between input X and weights, and then adds the bias
term.
4. mean_squared_error function: This function calculates the mean squared error between the true
target values y_true and the predicted values y_pred.

5. gradient function: This computes the gradients of the loss function with respect to the weights and
bias using the formula for the gradient of the mean squared error loss function.

6. fit method: This method fits the model to the training data using stochastic gradient descent. It
iterates through the specified number of epochs, shuffling the data and updating the weights and
bias in each epoch. It also prints the loss periodically and checks for convergence based on the
tolerance.

1. it calculates gradients using a mini-batch of data (X_batch, y_batch). This aligns with the
stochastic nature of SGD.

2. After computing the gradients, the method updates the model parameters (self.weights and
self.bias) using the learning rate and the calculated gradients. This is consistent with the
parameter update step in SGD.

3. The fit method iterates through the entire dataset (X, y) for each epoch. However, it
updates the parameters using mini-batches, as indicated by the nested loop that iterates
through the dataset in mini-batches (for i in range(0, n_samples, self.batch_size)). This
reflects the stochastic nature of SGD, where parameters are updated using a subset of the
data.

import numpy as np

class SGD:

def init(self, lr=0.01, epochs=1000, batch_size=32, tol=1e-3):

self.learning_rate = lr

self.epochs = epochs

self.batch_size = batch_size

self.tolerance = tol

self.weights = None

self.bias = None

def predict(self, X):

return np.dot(X, self.weights) + self.bias

def mean_squared_error(self, y_true, y_pred):

return np.mean((y_true - y_pred) ** 2)

def gradient(self, X_batch, y_batch):

y_pred = self.predict(X_batch)

error = y_pred - y_batch

gradient_weights = np.dot(X_batch.T, error) / X_batch.shape[0]

gradient_bias = np.mean(error)

return gradient_weights, gradient_bias

def fit(self, X, y):

n_samples, n_features = X.shape

self.weights = np.random.randn(n_features)

self.bias = np.random.randn()

for epoch in range(self.epochs):

indices = np.random.permutation(n_samples)

X_shuffled = X[indices]

y_shuffled = y[indices]

for i in range(0, n_samples, self.batch_size):

X_batch = X_shuffled[i:i+self.batch_size]

y_batch = y_shuffled[i:i+self.batch_size]

gradient_weights, gradient_bias = self.gradient(X_batch, y_batch)

self.weights -= self.learning_rate * gradient_weights

self.bias -= self.learning_rate * gradient_bias

if epoch % 100 == 0:

y_pred = self.predict(X)
loss = self.mean_squared_error(y, y_pred)

print(f"Epoch {epoch}: Loss {loss}")

if np.linalg.norm(gradient_weights) < self.tolerance:

print("Convergence reached.")

break

return self.weights, self.bias

Code 2

# Create random dataset with 100 rows and 5 columns

X = np.random.randn(100, 5)

# create corresponding target value by adding random

# noise in the dataset

y = np.dot(X, np.array([1, 2, 3, 4, 5]))\

+ np.random.randn(100) * 0.1

# Create an instance of the SGD class

model = SGD(lr=0.01, epochs=1000,

batch_size=32, tol=1e-3)

w,b=model.fit(X,y)

# Predict using predict method from model

y_pred = w*X+b

#y_pred

Output
Epoch 0: Loss 64.66196845798673
Epoch 100: Loss 0.03999940087439455
Epoch 200: Loss 0.008260358272771882
Epoch 300: Loss 0.00823731979566282
Epoch 400: Loss 0.008243022613956992
Epoch 500: Loss 0.008239370268212335
Epoch 600: Loss 0.008236363304624746
Epoch 700: Loss 0.00823205131002819
Epoch 800: Loss 0.00823566681302786
Epoch 900: Loss 0.008237441485197143
Stochastic Gradient Descent (SGD) using TensorFLow
import tensorflow as tf

import numpy as np

class SGD:

def init(self, lr=0.001, epochs=2000, batch_size=32, tol=1e-3):

self.learning_rate = lr

self.epochs = epochs

self.batch_size = batch_size

self.tolerance = tol

self.weights = None

self.bias = None

def predict(self, X):

return tf.matmul(X, self.weights) + self.bias

def mean_squared_error(self, y_true, y_pred):

return tf.reduce_mean(tf.square(y_true - y_pred))

def gradient(self, X_batch, y_batch):

with tf.GradientTape() as tape:

y_pred = self.predict(X_batch)

loss = self.mean_squared_error(y_batch, y_pred)

gradient_weights, gradient_bias = tape.gradient(loss, [self.weights, self.bias])

return gradient_weights, gradient_bias

def fit(self, X, y):

n_samples, n_features = X.shape

self.weights = tf.Variable(tf.random.normal((n_features, 1)))

self.bias = tf.Variable(tf.random.normal(()))

for epoch in range(self.epochs):

indices = tf.random.shuffle(tf.range(n_samples))

X_shuffled = tf.gather(X, indices)

y_shuffled = tf.gather(y, indices)

for i in range(0, n_samples, self.batch_size):

X_batch = X_shuffled[i:i+self.batch_size]

y_batch = y_shuffled[i:i+self.batch_size]

gradient_weights, gradient_bias = self.gradient(X_batch, y_batch)

# Gradient clipping

gradient_weights = tf.clip_by_value(gradient_weights, -1, 1)

gradient_bias = tf.clip_by_value(gradient_bias, -1, 1)

self.weights.assign_sub(self.learning_rate * gradient_weights)

self.bias.assign_sub(self.learning_rate * gradient_bias)

if epoch % 100 == 0:

y_pred = self.predict(X)

loss = self.mean_squared_error(y, y_pred)

print(f"Epoch {epoch}: Loss {loss}")

if tf.norm(gradient_weights) < self.tolerance:

print("Convergence reached.")

break
return self.weights.numpy(), self.bias.numpy()

# Create random dataset with 100 rows and 5 columns

X = np.random.randn(100, 5).astype(np.float32)

# Create corresponding target value by adding random

# noise in the dataset

y = np.dot(X, np.array([1, 2, 3, 4, 5], dtype=np.float32)) + np.random.randn(100).astype(np.float32) * 0.1

# Create an instance of the SGD class

model = SGD(lr=0.005, epochs=1000, batch_size=12, tol=1e-3)

w, b = model.fit(X, y)

# Predict using predict method from model

y_pred = np.dot(X, w) + b

Output
Epoch 0: Loss 52.73115158081055
Epoch 100: Loss 44.69907760620117
Epoch 200: Loss 44.693603515625
Epoch 300: Loss 44.69377136230469
Epoch 400: Loss 44.67509460449219
Epoch 500: Loss 44.67082595825195
Epoch 600: Loss 44.674285888671875
Epoch 700: Loss 44.666194915771484
Epoch 800: Loss 44.66718292236328
Epoch 900: Loss 44.65559005737305

Advantages of Stochastic Gradient Descent

 Speed: SGD is faster than other variants of Gradient Descent such as Batch Gradient Descent and Mini-
Batch Gradient Descent since it uses only one example to update the parameters.

 Memory Efficiency: Since SGD updates the parameters for each training example one at a time, it is
memory-efficient and can handle large datasets that cannot fit into memory.

 Avoidance of Local Minima: Due to the noisy updates in SGD, it has the ability to escape from local
minima and converges to a global minimum.

Disadvantages of Stochastic Gradient Descent

 Noisy updates: The updates in SGD are noisy and have a high variance, which can make the optimization
process less stable and lead to oscillations around the minimum.

 Slow Convergence: SGD may require more iterations to converge to the minimum since it updates the
parameters for each training example one at a time.

 Sensitivity to Learning Rate: The choice of learning rate can be critical in SGD since using a high
learning rate can cause the algorithm to overshoot the minimum, while a low learning rate can make the
algorithm converge slowly.

 Less Accurate: Due to the noisy updates, SGD may not converge to the exact global minimum and can
result in a suboptimal solution. This can be mitigated by using techniques such as learning rate scheduling
and momentum-based updates.

Challenges in Deep Learning

Deep learning has made significant advancements in various fields, but there are still some challenges that need
to be addressed. Here are some of the main challenges in deep learning:

1. Data availability: It requires large amounts of data to learn from. For using deep learning it’s a big
concern to gather as much data for training.

2. Computational Resources: For training the deep learning model, it is computationally expensive
because it requires specialized hardware like GPUs and TPUs.

3. Time-consuming: While working on sequential data depending on the computational resource it can
take very large even in days or months.

4. Interpretability: Deep learning models are complex, it works like a black box. it is very difficult to
interpret the result.

5. Overfitting: when the model is trained again and again, it becomes too specialized for the training data,
leading to overfitting and poor performance on new data.

Feed-forward Neural Network

Neural networks feedforward, also known as multi-layered networks of neurons, are called "feedforward,"
where information flows in one direction from the input to the output layer without looping back. It is composed
of three types of layers:

 Input Layer:
The input layer accepts the input data and passes it to the next layer.
 Hidden Layers:
One or more hidden layers that process and transform the input data. Each hidden layer has a set of
neurons connected to the neurons of the previous and next layers. These layers use activation functions,
such as ReLU or sigmoid, to introduce non-linearity into the network, allowing it to learn and model
more complex relationships between the inputs and outputs.
 Output Layer:
The output layer generates the final output. Depending on the type of problem, the number of neurons in
the output layer may vary. For example, in a binary classification problem, it would only have one
neuron. In contrast, a multi-class classification problem would have as many neurons as the number of
classes.

Linear Algebra: Submitted by Ahmad Saeed Submitted To Sir Muzzam Ali BITM-F18-022
No ratings yet
Linear Algebra: Submitted by Ahmad Saeed Submitted To Sir Muzzam Ali BITM-F18-022
5 pages
Deep learning chapter 1
No ratings yet
Deep learning chapter 1
46 pages
Tutorialon Diffusion Modelsfor Imaging and Vision
No ratings yet
Tutorialon Diffusion Modelsfor Imaging and Vision
90 pages
mathophilia
No ratings yet
mathophilia
18 pages
Math 4 AI
No ratings yet
Math 4 AI
25 pages
Tutorial On Diffusion Models For Imaging and Vision: Stanley Chan September 10, 2024
No ratings yet
Tutorial On Diffusion Models For Imaging and Vision: Stanley Chan September 10, 2024
89 pages
WINSEM2024-25_CSE4006_ETH_AP2024254000689_2025-01-03_Reference-Material-I
No ratings yet
WINSEM2024-25_CSE4006_ETH_AP2024254000689_2025-01-03_Reference-Material-I
39 pages
Vertopal.com Math Linear Algebra
No ratings yet
Vertopal.com Math Linear Algebra
54 pages
009-Neural_Networks-Complete
No ratings yet
009-Neural_Networks-Complete
61 pages
Reader 133A
No ratings yet
Reader 133A
164 pages
Math Linear Algebra
No ratings yet
Math Linear Algebra
54 pages
Complete UNIT III DEEP LEARNING PPT (1)
No ratings yet
Complete UNIT III DEEP LEARNING PPT (1)
126 pages
Mathematical Modelling of Continuous Systems
No ratings yet
Mathematical Modelling of Continuous Systems
99 pages
WINSEM2024-25_CSE4006_ETH_AP2024254000693_2024-12-14_Reference-Material-I
No ratings yet
WINSEM2024-25_CSE4006_ETH_AP2024254000693_2024-12-14_Reference-Material-I
36 pages
Mat Lab Tutor Ail
No ratings yet
Mat Lab Tutor Ail
13 pages
MLBasicsBook
No ratings yet
MLBasicsBook
287 pages
reserch papers on deep learning mpgi
No ratings yet
reserch papers on deep learning mpgi
6 pages
1 Linear Algebra Basics 25-07-2024
No ratings yet
1 Linear Algebra Basics 25-07-2024
30 pages
More on Numpy
No ratings yet
More on Numpy
50 pages
Data - Science and - Artificial - Intelligence
No ratings yet
Data - Science and - Artificial - Intelligence
106 pages
Lesson 2 - Background for AI [Autosaved]new
No ratings yet
Lesson 2 - Background for AI [Autosaved]new
37 pages
tensorflow-deep-learning-and-artificial-intelligence-machine-learning
No ratings yet
tensorflow-deep-learning-and-artificial-intelligence-machine-learning
97 pages
Machine Learning: The Basics
No ratings yet
Machine Learning: The Basics
288 pages
DL (Unit I)
No ratings yet
DL (Unit I)
25 pages
(eBook-PDF) - Mathematics - Mathematical Methods For Robotic
No ratings yet
(eBook-PDF) - Mathematics - Mathematical Methods For Robotic
99 pages
Lesson 5 Deep Neural Net Optimization Tuning Interpretability
100% (1)
Lesson 5 Deep Neural Net Optimization Tuning Interpretability
105 pages
ML Lecture 2 2023
No ratings yet
ML Lecture 2 2023
59 pages
Numpy and Scipy: Numerical Computing in Python
No ratings yet
Numpy and Scipy: Numerical Computing in Python
47 pages
Deep Learning
No ratings yet
Deep Learning
21 pages
Learn Basics to become a Generative AI Engineer.pdf
No ratings yet
Learn Basics to become a Generative AI Engineer.pdf
25 pages
Optimization Techniques in Deep Learning
No ratings yet
Optimization Techniques in Deep Learning
14 pages
DL Notes Unit 1
No ratings yet
DL Notes Unit 1
28 pages
Numpy in Visually Appealing Manner
No ratings yet
Numpy in Visually Appealing Manner
12 pages
1 & 2 Linear Algebra and Probability Distribution
No ratings yet
1 & 2 Linear Algebra and Probability Distribution
11 pages
Linear Algebra For Deep Learning. The Math Behind Every Deep Learning - by Vihar Kurama - Towards Data Science
No ratings yet
Linear Algebra For Deep Learning. The Math Behind Every Deep Learning - by Vihar Kurama - Towards Data Science
20 pages
Data Science
No ratings yet
Data Science
74 pages
Machine Learning Notation: 1 Numbers & Arrays 4 Functions
No ratings yet
Machine Learning Notation: 1 Numbers & Arrays 4 Functions
2 pages
Vertopal.com C1 W2 Lab01 Python Numpy Vectorization Soln
No ratings yet
Vertopal.com C1 W2 Lab01 Python Numpy Vectorization Soln
12 pages
HCIP-AI-EI Developer V2.0 Training Material
No ratings yet
HCIP-AI-EI Developer V2.0 Training Material
508 pages
Maths Roadmap For Machine Learning
No ratings yet
Maths Roadmap For Machine Learning
21 pages
Maths For Intelligent Systems
No ratings yet
Maths For Intelligent Systems
76 pages
CG lab week 1- for students
No ratings yet
CG lab week 1- for students
26 pages
optim
No ratings yet
optim
33 pages
A Visual Intro To Numpy and Data Representation
No ratings yet
A Visual Intro To Numpy and Data Representation
16 pages
Lecture 4
No ratings yet
Lecture 4
101 pages
Optimizers
No ratings yet
Optimizers
4 pages
13 - NumPy
No ratings yet
13 - NumPy
46 pages
Machine Learning Interview Questions
No ratings yet
Machine Learning Interview Questions
8 pages
Numpy
No ratings yet
Numpy
18 pages
ML-1
No ratings yet
ML-1
48 pages
[Fall 2024] Deep Learning 2
No ratings yet
[Fall 2024] Deep Learning 2
46 pages
Final2 Math EE
No ratings yet
Final2 Math EE
77 pages
Deep Learning notes
No ratings yet
Deep Learning notes
155 pages
Lec1_mathreview
No ratings yet
Lec1_mathreview
61 pages
Derivative Networks Reducedversion - 2022
No ratings yet
Derivative Networks Reducedversion - 2022
14 pages
Ordered Weighted Averaging Aggregation Operator: Fundamentals and Applications
From Everand
Ordered Weighted Averaging Aggregation Operator: Fundamentals and Applications
Fouad Sabry
No ratings yet
Top Numerical Methods With Matlab For Beginners!
From Everand
Top Numerical Methods With Matlab For Beginners!
Andrei Besedin
No ratings yet
Line Drawing Algorithm: Mastering Techniques for Precision Image Rendering
From Everand
Line Drawing Algorithm: Mastering Techniques for Precision Image Rendering
Fouad Sabry
No ratings yet
K Nearest Neighbor Algorithm: Fundamentals and Applications
From Everand
K Nearest Neighbor Algorithm: Fundamentals and Applications
Fouad Sabry
No ratings yet
Exercises of Numerical Analysis
From Everand
Exercises of Numerical Analysis
Simone Malacrida
No ratings yet
Foundation of Data Science - CS3352 - Hand Written Notes - Unit 4 - Python Libraries For Data Wrangling
No ratings yet
Foundation of Data Science - CS3352 - Hand Written Notes - Unit 4 - Python Libraries For Data Wrangling
42 pages
Deep Learning - AD3501 - Notes - Unit 4 - Model Evaluation
No ratings yet
Deep Learning - AD3501 - Notes - Unit 4 - Model Evaluation
18 pages
CS 3353 C Programming and Data Structure QB
No ratings yet
CS 3353 C Programming and Data Structure QB
7 pages
Network Addressing
No ratings yet
Network Addressing
23 pages
Ad3381 Set1
No ratings yet
Ad3381 Set1
3 pages
AD3501 - Deep Learning University Question
No ratings yet
AD3501 - Deep Learning University Question
2 pages
Unit-III Notes
No ratings yet
Unit-III Notes
33 pages
VANET Review 1
No ratings yet
VANET Review 1
10 pages
Innovation
No ratings yet
Innovation
7 pages
Spanish Gcse Coursework Work Experience
100% (2)
Spanish Gcse Coursework Work Experience
5 pages
Pass
No ratings yet
Pass
102 pages
Unit - 1 FP 3ea10de4
No ratings yet
Unit - 1 FP 3ea10de4
8 pages
https___app.oswaalbooks.com_download_sample-qp_subsolution_184Mathematics SAP-2 Sol (1)
No ratings yet
https___app.oswaalbooks.com_download_sample-qp_subsolution_184Mathematics SAP-2 Sol (1)
9 pages
Inheritance
No ratings yet
Inheritance
36 pages
Hydra
No ratings yet
Hydra
55 pages
Types KBCH 120, 130, 140 Differential Protection For Transformers and Generators
No ratings yet
Types KBCH 120, 130, 140 Differential Protection For Transformers and Generators
20 pages
XFOIL User Guide
No ratings yet
XFOIL User Guide
62 pages
Kedar Py 123
No ratings yet
Kedar Py 123
14 pages
Cs Project 1l Daily Travel Booking
No ratings yet
Cs Project 1l Daily Travel Booking
24 pages
The Almighty Gods of Yan Kor
No ratings yet
The Almighty Gods of Yan Kor
12 pages
Tripura Sundari
No ratings yet
Tripura Sundari
11 pages
Choreography Dances
No ratings yet
Choreography Dances
36 pages
UGC-NET Dec 2018
No ratings yet
UGC-NET Dec 2018
66 pages
Project Report On Email Marketing
No ratings yet
Project Report On Email Marketing
10 pages
XII IP Practical File 1 Complete (1)
No ratings yet
XII IP Practical File 1 Complete (1)
38 pages
1 Introduction to Computers 032617
No ratings yet
1 Introduction to Computers 032617
18 pages
Daily Lesson Log: Las Pinas East National High School Kasoy ST., Verdant Acres Subd., Pamplona III, Las Pinas City
No ratings yet
Daily Lesson Log: Las Pinas East National High School Kasoy ST., Verdant Acres Subd., Pamplona III, Las Pinas City
3 pages
5 6066867401268397343
No ratings yet
5 6066867401268397343
52 pages
The Use of Code Switching in Twitter (A Case Study in English Education Department)
No ratings yet
The Use of Code Switching in Twitter (A Case Study in English Education Department)
10 pages
Mock Cat 6 Sol
No ratings yet
Mock Cat 6 Sol
11 pages
Makalah Simple Past Tense B.inggris KLMPK 6
No ratings yet
Makalah Simple Past Tense B.inggris KLMPK 6
9 pages
Essay Peer Review Checklist
No ratings yet
Essay Peer Review Checklist
1 page
Agenda Medicina
No ratings yet
Agenda Medicina
41 pages
Reflections On Education in Indonesia
No ratings yet
Reflections On Education in Indonesia
9 pages
All About Buddhism - Gongyo Lyrics With Silent Prayers
No ratings yet
All About Buddhism - Gongyo Lyrics With Silent Prayers
32 pages
Compilation of Maths (9 Nov) - 31 PAGES
No ratings yet
Compilation of Maths (9 Nov) - 31 PAGES
31 pages
CSC 220.52 Fall 2024 BD
No ratings yet
CSC 220.52 Fall 2024 BD
11 pages
Jurnal - Fis - Ant.17 19 Amr S
No ratings yet
Jurnal - Fis - Ant.17 19 Amr S
15 pages
Infrastructure as Code 2nd Edition Early Access Kief Morris download
100% (5)
Infrastructure as Code 2nd Edition Early Access Kief Morris download
65 pages

Deep Learning Unit - I Notes

Uploaded by

Deep Learning Unit - I Notes

Uploaded by

Fundamentals of Linear Algbebra,Scalars,Vectors,Matrices,Tensors

Mathematical notation: aϵRaϵR, where aa is learning rate

where, x1,x2,...,xnx1,x2,...,xn are the entries/components of the vector x→x→.

Creating vectors in NumPy:

# creating a row vector

#creating a column vector

Visually, a matrix AA with size 3∗43∗4 can be illustrated as:

Creating matrices in NumPy:

# Creating a 3*3 matrix

Converting a grayscale image to a vector

Creating Tensor in NumPy:

Probability Distribution – Function, Formula, Table

A probability distribution is an idealized frequency distribution. In statistics, a frequency distribution

Optimization techniques for Gradient Descent

Machine learning Basics

Bias and Variance in Machine Learning

Reasons for Underfitting

3. The size of the training dataset used is not enough.

5. Features are not scaled.

Techniques to Reduce Underfitting

1. Increase model complexity.

2. Increase the number of features, performing feature engineering.

3. Remove noise from the data.

Overfitting in Machine Learning

1. High variance and low bias.

2. The model is too complex.

3. The size of the training data.

Techniques to Reduce Overfitting

3. Reduce model complexity.

5. Ridge Regularization and Lasso Regularization.

6. Use dropout for neural networks to tackle overfitting.

Estimators, Bias, and Variance

Stochastic Gradient Descent Algorithm

o Shuffle the training dataset to introduce randomness.

The path taken by Batch Gradient Descent is shown below:

path taken by Stochastic Gradient Descent looks as follows –

Difference between Stochastic Gradient Descent & batch Gradient

Aspect Stochastic Gradient Descent (SGD) Batch Gradient Descent

Python Code For Stochastic Gradient Descent

2. Define the SGD class:

def __init__(self, lr=0.01, epochs=1000, batch_size=32, tol=1e-3):

def predict(self, X):

return np.dot(X, self.weights) + self.bias

def mean_squared_error(self, y_true, y_pred):

return np.mean((y_true - y_pred) ** 2)

error = y_pred - y_batch

gradient_weights = np.dot(X_batch.T, error) / X_batch.shape[0]

return gradient_weights, gradient_bias

def fit(self, X, y):

n_samples, n_features = X.shape

for epoch in range(self.epochs):

for i in range(0, n_samples, self.batch_size):

gradient_weights, gradient_bias = self.gradient(X_batch, y_batch)

self.weights -= self.learning_rate * gradient_weights

self.bias -= self.learning_rate * gradient_bias

print(f"Epoch {epoch}: Loss {loss}")

if np.linalg.norm(gradient_weights) < self.tolerance:

return self.weights, self.bias

# Create random dataset with 100 rows and 5 columns

# create corresponding target value by adding random

# noise in the dataset

y = np.dot(X, np.array([1, 2, 3, 4, 5]))\

# Create an instance of the SGD class

model = SGD(lr=0.01, epochs=1000,

# Predict using predict method from model

def __init__(self, lr=0.001, epochs=2000, batch_size=32, tol=1e-3):

def predict(self, X):

return tf.matmul(X, self.weights) + self.bias

def mean_squared_error(self, y_true, y_pred):

return tf.reduce_mean(tf.square(y_true - y_pred))

def gradient(self, X_batch, y_batch):

with tf.GradientTape() as tape:

loss = self.mean_squared_error(y_batch, y_pred)

gradient_weights, gradient_bias = tape.gradient(loss, [self.weights, self.bias])

return gradient_weights, gradient_bias

def fit(self, X, y):

def init(self, lr=0.01, epochs=1000, batch_size=32, tol=1e-3):

def init(self, lr=0.001, epochs=2000, batch_size=32, tol=1e-3):