Deep Learning Unit - I Notes

Fundamentals of Linear Algebra: Scalars, Vectors, Matrices, Tensors

Let’s get started with linear algebra. An understanding of linear algebra is essential for comprehending and effectively implementing neural networks. Getting started (in a black-box way) with deep learning libraries such as TensorFlow and PyTorch is not difficult; detailed tutorials are readily available on the internet. However, to understand the inner workings of neural networks and to comprehend research papers, we need to become comfortable with the maths (linear algebra, probability, calculus) that collectively underpins neural networks.

In this article, I’ll discuss the fundamentals of linear algebra: scalars, vectors, matrices, and tensors.

Scalars

Scalars are physical quantities that are described by magnitude only. In other words, scalars are quantities represented just by their numerical value. Examples include volume, density, speed, age, etc. Scalar variables are denoted by ordinary lower-case letters (e.g., x, y, z). The continuous space of real-valued scalars is denoted by ℝ. For a scalar variable, the expression x ∈ ℝ denotes that x is a real-valued scalar.

Mathematical notation: a ∈ ℝ, where a is, for example, a learning rate.

Vectors

A vector is an array of numbers or a list of scalar values. The single values in the array/list of a vector are called the entries or components of the vector. Vector variables are usually denoted by lower-case letters with a right arrow on top (x⃗, y⃗, z⃗) or bold-faced lower-case letters (x, y, z).

Mathematical notation: x⃗ ∈ ℝⁿ, where x⃗ is a vector. The expression says that the vector x⃗ has n real-valued scalar components.

x⃗ = [x₁, x₂, x₃, …, xₙ]ᵀ

where x₁, x₂, …, xₙ are the entries/components of the vector x⃗; the transpose (ᵀ) indicates that the entries are stacked in a column.

A list of scalar values, concatenated together in a column (for the sake of representation), becomes a vector.

Creating vectors in NumPy:

# loading numpy
import numpy as np

# creating a row vector
row_v = np.array([1, 2, 3])

# creating a column vector
column_v = np.array([[1],
                     [2],
                     [3]])

Matrices
Just as vectors generalize scalars from order zero to order one, matrices generalize vectors from order one to order two. In other words, matrices are 2-D arrays of numbers. They are represented by bold-faced capital letters (A, B, C).

Mathematical notation: A ∈ ℝ^(m×n) means that the matrix A is of size m×n, where m is the number of rows and n is the number of columns.

Visually, a matrix A of size 3×4 can be illustrated as:

A = [a₁₁ a₁₂ a₁₃ a₁₄
     a₂₁ a₂₂ a₂₃ a₂₄
     a₃₁ a₃₂ a₃₃ a₃₄]

Creating matrices in NumPy:

# Creating a 3x3 matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

[Figure: converting a grayscale image to a vector]

Real-world use case: Learning algorithms need numbers to work on. Data such as images must be converted to an array of numbers before being fed to an algorithm. In the illustration above, a 50×50 image is first converted to a 50×50 matrix. Here we assume the image is grayscale, so each pixel value ranges from 0 to 255. The resulting matrix is then converted to a vector of dimension 2500. This is how an image is converted to a vector.
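
As a minimal sketch (using a synthetic 50×50 array in place of a real image file), the flattening step looks like this in NumPy:

import numpy as np

# A synthetic 50x50 grayscale "image"; pixel values lie in [0, 255]
image = np.random.randint(0, 256, size=(50, 50), dtype=np.uint8)

# Flatten the 2-D matrix into a 1-D vector of dimension 2500
vector = image.reshape(-1)
print(vector.shape)  # (2500,)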

Tensors
Tensors can be understood simply as n-dimensional arrays with n greater than 2. Vectors are first-order tensors and matrices are second-order tensors.

The mathematical notation for tensors is similar to that of matrices, except that the number of indices defining the dimensions is greater than two. A_{i,j,k} denotes a tensor A with three dimensions, indexed by i, j, k.

Creating a tensor in NumPy:

tensor = np.array([
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[10, 11, 12], [13, 14, 15], [16, 17, 18]],
    [[19, 20, 21], [22, 23, 24], [25, 26, 27]],
])

[Figure: converting an RGB image to a vector]

We’ll be dealing with tensors when working with color images and video data. When a color image is converted to an array of numbers, the dimensions are something like (height, width, color channels). The color axis defines a separate set of numbers for the Red, Green, and Blue channels. For each color channel we create a matrix from the pixel values, and the resulting matrices are then unrolled to form a vector.
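
A minimal sketch of that unrolling (again with a synthetic image in place of real data):

import numpy as np

# A synthetic 50x50 RGB "image": one 50x50 matrix per color channel
rgb_image = np.random.randint(0, 256, size=(50, 50, 3), dtype=np.uint8)

# Unroll all three channel matrices into a single vector (50 * 50 * 3 = 7500)
vector = rgb_image.reshape(-1)
print(vector.shape)  # (7500,)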

Probability Distribution – Function, Formula, Table

A probability distribution is an idealized frequency distribution. In statistics, a frequency distribution represents the number of occurrences of different outcomes in a dataset: it shows how often each value appears within the dataset.

A probability distribution is an abstract version of the frequency distribution. While a frequency distribution pertains to a particular sample or dataset, detailing how often each potential value of a variable appears within it, a probability distribution describes the probability with which each value occurs.

A probability distribution therefore not only shows the frequencies of different outcomes but also assigns a probability to each outcome, indicating how likely that outcome is.
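
A quick illustration of the difference between an empirical frequency distribution and its idealized probability distribution (a sketch using a fair six-sided die; the sample size is arbitrary):

import numpy as np

# Empirical frequency distribution: 10,000 rolls of a fair die
rolls = np.random.randint(1, 7, size=10_000)
values, counts = np.unique(rolls, return_counts=True)
empirical = counts / counts.sum()   # relative frequencies in this sample

# Idealized probability distribution: each face has probability 1/6
theoretical = np.full(6, 1 / 6)

for v, emp, theo in zip(values, empirical, theoretical):
    print(f"face {v}: empirical {emp:.3f} vs theoretical {theo:.3f}")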

In this article, we will learn what a probability distribution is, the types of probability distributions, probability distribution functions, and their formulas.

Optimization techniques for Gradient Descent

Gradient Descent is a widely used optimization algorithm for machine learning models, and several techniques can be used to improve its performance. Here are some of the most popular optimization techniques for Gradient Descent:

Learning Rate Scheduling: The learning rate determines the step size of the Gradient Descent algorithm.
Learning Rate Scheduling involves changing the learning rate during the training process, such as decreasing the
learning rate as the number of iterations increases. This technique helps the algorithm to converge faster and
avoid overshooting the minimum.
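
For instance, a simple step-decay schedule might look like this (the function name and constants are illustrative, not from any particular library):

# Step decay: halve the learning rate every 10 epochs (illustrative constants)
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    return initial_lr * (drop ** (epoch // epochs_per_drop))

for epoch in [0, 5, 10, 20, 30]:
    print(epoch, step_decay(0.1, epoch))  # 0.1, 0.1, 0.05, 0.025, 0.0125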

Momentum-based Updates: The Momentum-based Gradient Descent technique involves adding a fraction of
the previous update to the current update. This technique helps the algorithm to overcome local minima and
accelerates convergence.

Batch Normalization: Batch Normalization is a technique used to normalize the inputs to each layer of the
neural network. This helps the Gradient Descent algorithm to converge faster and avoid vanishing or exploding
gradients.
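
As a rough sketch of what normalization does to a mini-batch (in practice gamma and beta are learned parameters, and running statistics are kept for inference):

import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mean = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta              # scale and shift

batch = np.random.randn(32, 4) * 10 + 5      # 32 examples, 4 features
normalized = batch_norm(batch)
print(normalized.mean(axis=0).round(6), normalized.std(axis=0).round(6))  # ~0 and ~1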

Weight Decay: Weight Decay is a regularization technique that involves adding a penalty term to the cost
function proportional to the magnitude of the weights. This helps to prevent overfitting and improve the
generalization of the model.
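
In sketch form, an L2 penalty (λ/2)·‖w‖² added to the cost contributes λ·w to the gradient (all values below are illustrative stand-ins):

import numpy as np

lam, learning_rate = 1e-4, 0.01
weights = np.random.randn(5)
data_gradient = np.random.randn(5)              # stand-in for the loss gradient

total_gradient = data_gradient + lam * weights  # weight-decay term shrinks the weights
weights -= learning_rate * total_gradient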

Adaptive Learning Rates: Adaptive Learning Rate techniques involve adjusting the learning rate adaptively
during the training process. Examples include Adagrad, RMSprop, and Adam. These techniques adjust the
learning rate based on the historical gradient information, which can improve the convergence speed and
accuracy of the algorithm.

Second-Order Methods: Second-Order Methods use the second-order derivatives of the cost function to update
the parameters. Examples include Newton’s Method and Quasi-Newton Methods. These methods can converge
faster than Gradient Descent, but require more computation and may be less stable.
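
A one-dimensional sketch of Newton’s method, which steps by the ratio of the first and second derivatives (the example function is illustrative):

# Newton step: x_new = x - f'(x) / f''(x)
def newton_step(x, f_prime, f_double_prime):
    return x - f_prime(x) / f_double_prime(x)

# Minimize f(x) = (x - 3)^2, so f'(x) = 2(x - 3) and f''(x) = 2
x = newton_step(0.0, lambda t: 2 * (t - 3), lambda t: 2.0)
print(x)  # 3.0 -- a quadratic is minimized in a single Newton step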

Gradient Descent is an iterative optimization algorithm used to find the minimum of a function. The general idea is to initialize the parameters to random values and then take small steps in the direction of the “slope” at each iteration. Gradient descent is widely used in supervised learning to minimize the error function and find the optimal values for the parameters. Various extensions have been designed for the gradient descent algorithm; some of them are discussed below:

Momentum method: This method accelerates the gradient descent algorithm by taking into consideration the exponentially weighted average of the gradients. Using averages makes the algorithm converge towards the minima faster, as the gradients in uncommon directions are canceled out. The pseudocode for the momentum method is given below.

V = 0
for each iteration i:
    compute dW
    V = β V + (1 - β) dW
    W = W - α V

V and dW are analogous to velocity and acceleration respectively. α is the learning rate, and β is analogous to momentum, normally kept at 0.9. The physical interpretation is that the velocity of a ball rolling downhill builds up momentum according to the slope (gradient) of the hill, and therefore helps the ball arrive at a minimum value (in our case, at a minimum loss).

RMSprop: RMSprop was proposed by the University of Toronto’s Geoffrey Hinton. The intuition is to apply an exponentially weighted average to the second moment of the gradients (dW²). The pseudocode for this is as follows:

S = 0
for each iteration i:
    compute dW
    S = β S + (1 - β) dW²
    W = W - α dW / (√S + ε)

Adam Optimization: The Adam optimization algorithm incorporates the momentum method and RMSprop, along with bias correction. The pseudocode for this approach is as follows:

V = 0
S = 0
for each iteration i:
    compute dW
    V = β₁ V + (1 - β₁) dW
    S = β₂ S + (1 - β₂) dW²
    V_corrected = V / (1 - β₁^i)
    S_corrected = S / (1 - β₂^i)
    W = W - α V_corrected / (√S_corrected + ε)

Kingma and Ba, the proposers of Adam, recommended the following values for the hyperparameters:

α = 0.001
β₁ = 0.9
β₂ = 0.999
ε = 10⁻⁸
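
Putting the pseudocode into runnable form, here is a minimal NumPy sketch of a single Adam update (the function name is ours; the hyperparameter defaults follow the recommendations above):

import numpy as np

def adam_update(W, dW, V, S, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    V = beta1 * V + (1 - beta1) * dW        # first moment (momentum)
    S = beta2 * S + (1 - beta2) * dW ** 2   # second moment (RMSprop)
    V_hat = V / (1 - beta1 ** t)            # bias correction
    S_hat = S / (1 - beta2 ** t)
    W = W - alpha * V_hat / (np.sqrt(S_hat) + eps)
    return W, V, S

# Usage: t starts at 1 so the bias correction is well defined
W, V, S = np.zeros(3), np.zeros(3), np.zeros(3)
dW = np.array([0.1, -0.2, 0.3])             # stand-in gradient
W, V, S = adam_update(W, dW, V, S, t=1)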

Machine Learning Basics

Capacity
The ability of the learner (also called the model) to discover a function taken from a family of functions. Examples:

Linear predictor: y = wx + b
Quadratic predictor: y = w₂x² + w₁x + b
Degree-10 polynomial predictor: y = b + Σ_{i=1}^{10} wᵢ xⁱ

The last family is richer, allowing it to capture more complex functions. Capacity can be measured by the number of training examples {x_i^(train), y_i^(train)} that the learner could always fit, no matter how the values of x_i^(train) and y_i^(train) are chosen.
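
A quick demonstration of this notion of capacity (a sketch using NumPy's polynomial fitting; the data are arbitrary): a degree-10 polynomial can fit any 11 training points exactly, while a line generally cannot.

import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=11)
y_train = rng.normal(size=11)                   # arbitrary target values

linear = np.polyfit(x_train, y_train, deg=1)    # degree-1 family
poly10 = np.polyfit(x_train, y_train, deg=10)   # degree-10 family

print(np.abs(np.polyval(linear, x_train) - y_train).max())  # large residual
print(np.abs(np.polyval(poly10, x_train) - y_train).max())  # ~0: exact fit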

Bias and Variance in Machine Learning

 Bias: Bias refers to the error due to overly simplistic assumptions in the learning algorithm. These
assumptions make the model easier to comprehend and learn but might not capture the underlying
complexities of the data. Bias is the error due to the model’s inability to represent the true relationship
between input and output accurately. When a model performs poorly on both the training and testing
data, it has high bias because the model is too simple, indicating underfitting.

 Variance: Variance, on the other hand, is the error due to the model’s sensitivity to fluctuations in the
training data. It’s the variability of the model’s predictions for different instances of training data. High
variance occurs when a model learns the training data’s noise and random fluctuations rather than the
underlying pattern. As a result, the model performs well on the training data but poorly on the testing
data, indicating overfitting.
Under fitting in Machine Learning

A statistical model or a machine learning algorithm is said to underfit when it is too simple to capture the
complexities of the data. Underfitting reflects the model’s inability to learn the training data effectively, resulting
in poor performance on both the training and testing data. In simple terms, underfit models are inaccurate,
especially when applied to new, unseen examples. Underfitting mainly happens when we use a very simple model with
overly simplified assumptions. To address underfitting, we need to use more complex models, enhanced feature
representations, and less regularization.

Note: An underfitting model has high bias and low variance.

Reasons for Underfitting

1. The model is too simple, so it may not be capable of representing the complexities in the data.

2. The input features used to train the model are not adequate representations of the underlying
factors influencing the target variable.

3. The training dataset is too small.

4. Excessive regularization is used to prevent overfitting, which constrains the model from capturing the
data well.

5. Features are not scaled.

Techniques to Reduce Underfitting

1. Increase model complexity.

2. Increase the number of features by performing feature engineering.

3. Remove noise from the data.

4. Increase the number of epochs or the duration of training to get better results.

Overfitting in Machine Learning

A statistical model is said to be overfitted when it does not make accurate predictions on testing data.
When a model is trained on so much data, it starts learning from the noise and inaccurate data entries in the
data set, and testing on unseen data then results in high variance. The model fails to categorize the data
correctly because of too many details and noise. Overfitting is often caused by non-parametric and non-linear
methods, because these types of machine learning algorithms have more freedom in building the model from
the dataset and can therefore build unrealistic models. Ways to avoid overfitting include using a linear
algorithm if we have linear data, or using parameters such as the maximal depth if we are using decision trees.

In a nutshell, overfitting is a problem where a machine learning algorithm’s performance on the training data
differs from its performance on unseen data.
Reasons for Overfitting:

1. High variance and low bias.

2. The model is too complex.

3. The training dataset is too small.

Techniques to Reduce Overfitting

1. Improving the quality of the training data reduces overfitting by focusing on meaningful patterns, mitigating
the risk of fitting noise or irrelevant features.

2. Increasing the amount of training data can improve the model’s ability to generalize to unseen data and reduce the
likelihood of overfitting.

3. Reduce model complexity.

4. Early stopping during the training phase (monitor the loss over the training period and stop training as soon
as the loss begins to increase); see the sketch after this list.

5. Ridge regularization and Lasso regularization.

6. Use dropout for neural networks to tackle overfitting.
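
As an illustration of early stopping (item 4 above), here is a sketch on a synthetic sequence of validation losses (the numbers and the patience value are made up):

losses = [1.0, 0.6, 0.4, 0.35, 0.36, 0.37, 0.38]  # pretend validation losses
best, patience, wait = float("inf"), 2, 0
for epoch, loss in enumerate(losses):
    if loss < best:
        best, wait = loss, 0      # loss improved; reset the counter
    else:
        wait += 1                 # no improvement this epoch
        if wait >= patience:
            print(f"Early stopping at epoch {epoch} (best loss {best})")
            break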

Estimators, Bias, and Variance

There are several cases of random variables that appear often and have useful properties. Below are the ones we
will explore further in this course. The numbers in parentheses are the parameters of a random variable, which
are constants. Parameters define a random variable’s shape (i.e., distribution) and its values. For this lecture,
we’ll focus more heavily on the bolded random variables and their special properties, but you should familiarize
yourself with all the ones listed below:
Example

Suppose you win cash based on the number of heads you get in a series of 20 coin flips. Let X_i = 1
if the i-th coin flip is heads, and X_i = 0 otherwise. Which payout strategy would you choose?
Stochastic Gradient Descent (SGD):
Stochastic Gradient Descent (SGD) is a variant of the Gradient Descent algorithm that is used for optimizing
machine learning models. It addresses the computational inefficiency of traditional Gradient Descent methods
when dealing with large datasets in machine learning projects.

In SGD, instead of using the entire dataset for each iteration, only a single random training example (or a small
batch) is selected to calculate the gradient and update the model parameters. This random selection introduces
randomness into the optimization process, hence the term “stochastic” in Stochastic Gradient Descent.

The advantage of using SGD is its computational efficiency, especially when dealing with large datasets. By
using a single example or a small batch, the computational cost per iteration is significantly reduced compared to
traditional Gradient Descent methods that require processing the entire dataset.

Stochastic Gradient Descent Algorithm


 Initialization: Randomly initialize the parameters of the model.

 Set Parameters: Determine the number of iterations and the learning rate (alpha) for updating the parameters.

 Stochastic Gradient Descent Loop: Repeat the following steps until the model converges or reaches the maximum
number of iterations:

o Shuffle the training dataset to introduce randomness.

o Iterate over each training example (or a small batch) in the shuffled order.

o Compute the gradient of the cost function with respect to the model parameters using the current
training example (or batch).

o Update the model parameters by taking a step in the direction of the negative gradient, scaled by the
learning rate.

o Evaluate the convergence criteria, such as the change in the cost function between iterations.

 Return Optimized Parameters: Once the convergence criteria are met or the maximum number of iterations is
reached, return the optimized model parameters.

In SGD, since only one sample (or a small batch) from the dataset is chosen at random for each iteration, the path taken by the
algorithm to reach the minima is usually noisier than in typical Gradient Descent. This matters little in practice:
the path taken by the algorithm does not matter, as long as we reach the minimum, and SGD does so with a
significantly shorter training time.

[Figures: the relatively smooth path taken by Batch Gradient Descent, and the noisier path taken by Stochastic Gradient Descent]

One thing to note is that, because SGD is generally noisier than typical Gradient Descent, it usually takes a higher
number of iterations to reach the minima, due to the randomness in its descent. Even so, it is still computationally
much less expensive than typical Gradient Descent. Hence, in most scenarios, SGD is preferred over Batch Gradient
Descent for optimizing a learning algorithm.

Difference between Stochastic Gradient Descent & Batch Gradient Descent

The comparison between Stochastic Gradient Descent (SGD) and Batch Gradient Descent is as follows:

| Aspect | Stochastic Gradient Descent (SGD) | Batch Gradient Descent |
|---|---|---|
| Dataset Usage | Uses a single random sample or a small batch of samples at each iteration. | Uses the entire dataset (batch) at each iteration. |
| Computational Efficiency | Computationally less expensive per iteration, as it processes fewer data points. | Computationally more expensive per iteration, as it processes the entire dataset. |
| Convergence | Faster convergence due to frequent updates. | Slower convergence due to less frequent updates. |
| Noise in Updates | High noise due to frequent updates with a single or few samples. | Low noise as it updates parameters using all data points. |
| Stability | Less stable as it may oscillate around the optimal solution. | More stable as it converges smoothly towards the optimum. |
| Memory Requirement | Requires less memory as it processes fewer data points at a time. | Requires more memory to hold the entire dataset in memory. |
| Update Frequency | Frequent updates make it suitable for online learning and large datasets. | Less frequent updates make it suitable for smaller datasets. |
| Initialization Sensitivity | Less sensitive to initial parameter values due to frequent updates. | More sensitive to initial parameter values. |

Python Code For Stochastic Gradient Descent


We will create an SGD class with methods that we will use while updating the parameters, fitting the training
data set, and predicting on new test data. The methods we will use are as follows:

1. The first line imports the NumPy library, which is used for numerical computations in Python.

2. Define the SGD class:

1. The class SGD encapsulates the Stochastic Gradient Descent algorithm for training a linear
regression model.

2. Then we initialize the SGD optimizer parameters such as learning rate, number of epochs, batch
size, and tolerance. It also initializes weights and bias to None.

3. predict function: This function computes the predictions for input data X using the current weights
and bias. It performs matrix multiplication between input X and weights, and then adds the bias
term.
4. mean_squared_error function: This function calculates the mean squared error between the true
target values y_true and the predicted values y_pred.

5. gradient function: This computes the gradients of the loss function with respect to the weights and
bias using the formula for the gradient of the mean squared error loss function.

6. fit method: This method fits the model to the training data using stochastic gradient descent. It
iterates through the specified number of epochs, shuffling the data and updating the weights and
bias in each epoch. It also prints the loss periodically and checks for convergence based on the
tolerance.

1. it calculates gradients using a mini-batch of data (X_batch, y_batch). This aligns with the
stochastic nature of SGD.

2. After computing the gradients, the method updates the model parameters (self.weights and
self.bias) using the learning rate and the calculated gradients. This is consistent with the
parameter update step in SGD.

3. The fit method iterates through the entire dataset (X, y) for each epoch. However, it
updates the parameters using mini-batches, as indicated by the nested loop that iterates
through the dataset in mini-batches (for i in range(0, n_samples, self.batch_size)). This
reflects the stochastic nature of SGD, where parameters are updated using a subset of the
data.

import numpy as np

class SGD:
    def __init__(self, lr=0.01, epochs=1000, batch_size=32, tol=1e-3):
        self.learning_rate = lr
        self.epochs = epochs
        self.batch_size = batch_size
        self.tolerance = tol
        self.weights = None
        self.bias = None

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

    def mean_squared_error(self, y_true, y_pred):
        return np.mean((y_true - y_pred) ** 2)

    def gradient(self, X_batch, y_batch):
        y_pred = self.predict(X_batch)
        error = y_pred - y_batch
        gradient_weights = np.dot(X_batch.T, error) / X_batch.shape[0]
        gradient_bias = np.mean(error)
        return gradient_weights, gradient_bias

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.random.randn(n_features)
        self.bias = np.random.randn()

        for epoch in range(self.epochs):
            # Shuffle the data at the start of each epoch
            indices = np.random.permutation(n_samples)
            X_shuffled = X[indices]
            y_shuffled = y[indices]

            # Update the parameters one mini-batch at a time
            for i in range(0, n_samples, self.batch_size):
                X_batch = X_shuffled[i:i+self.batch_size]
                y_batch = y_shuffled[i:i+self.batch_size]
                gradient_weights, gradient_bias = self.gradient(X_batch, y_batch)
                self.weights -= self.learning_rate * gradient_weights
                self.bias -= self.learning_rate * gradient_bias

            # Print the loss periodically
            if epoch % 100 == 0:
                y_pred = self.predict(X)
                loss = self.mean_squared_error(y, y_pred)
                print(f"Epoch {epoch}: Loss {loss}")

            # Stop early once the last mini-batch gradient is small enough
            if np.linalg.norm(gradient_weights) < self.tolerance:
                print("Convergence reached.")
                break

        return self.weights, self.bias

Code 2

# Create a random dataset with 100 rows and 5 columns
X = np.random.randn(100, 5)

# Create corresponding target values by adding random
# noise to the dataset
y = np.dot(X, np.array([1, 2, 3, 4, 5])) \
    + np.random.randn(100) * 0.1

# Create an instance of the SGD class
model = SGD(lr=0.01, epochs=1000,
            batch_size=32, tol=1e-3)

w, b = model.fit(X, y)

# Predict using the predict method of the model
y_pred = model.predict(X)

Output
Epoch 0: Loss 64.66196845798673
Epoch 100: Loss 0.03999940087439455
Epoch 200: Loss 0.008260358272771882
Epoch 300: Loss 0.00823731979566282
Epoch 400: Loss 0.008243022613956992
Epoch 500: Loss 0.008239370268212335
Epoch 600: Loss 0.008236363304624746
Epoch 700: Loss 0.00823205131002819
Epoch 800: Loss 0.00823566681302786
Epoch 900: Loss 0.008237441485197143
Stochastic Gradient Descent (SGD) using TensorFlow
import tensorflow as tf
import numpy as np

class SGD:
    def __init__(self, lr=0.001, epochs=2000, batch_size=32, tol=1e-3):
        self.learning_rate = lr
        self.epochs = epochs
        self.batch_size = batch_size
        self.tolerance = tol
        self.weights = None
        self.bias = None

    def predict(self, X):
        return tf.matmul(X, self.weights) + self.bias

    def mean_squared_error(self, y_true, y_pred):
        return tf.reduce_mean(tf.square(y_true - y_pred))

    def gradient(self, X_batch, y_batch):
        with tf.GradientTape() as tape:
            y_pred = self.predict(X_batch)
            loss = self.mean_squared_error(y_batch, y_pred)
        gradient_weights, gradient_bias = tape.gradient(loss, [self.weights, self.bias])
        return gradient_weights, gradient_bias

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Reshape y to a column vector so it matches the shape of the
        # predictions; otherwise the loss silently broadcasts to (n, n)
        y = tf.reshape(y, (-1, 1))
        self.weights = tf.Variable(tf.random.normal((n_features, 1)))
        self.bias = tf.Variable(tf.random.normal(()))

        for epoch in range(self.epochs):
            indices = tf.random.shuffle(tf.range(n_samples))
            X_shuffled = tf.gather(X, indices)
            y_shuffled = tf.gather(y, indices)

            for i in range(0, n_samples, self.batch_size):
                X_batch = X_shuffled[i:i+self.batch_size]
                y_batch = y_shuffled[i:i+self.batch_size]
                gradient_weights, gradient_bias = self.gradient(X_batch, y_batch)

                # Gradient clipping
                gradient_weights = tf.clip_by_value(gradient_weights, -1, 1)
                gradient_bias = tf.clip_by_value(gradient_bias, -1, 1)

                self.weights.assign_sub(self.learning_rate * gradient_weights)
                self.bias.assign_sub(self.learning_rate * gradient_bias)

            if epoch % 100 == 0:
                y_pred = self.predict(X)
                loss = self.mean_squared_error(y, y_pred)
                print(f"Epoch {epoch}: Loss {loss}")

            if tf.norm(gradient_weights) < self.tolerance:
                print("Convergence reached.")
                break

        return self.weights.numpy(), self.bias.numpy()

# Create a random dataset with 100 rows and 5 columns
X = np.random.randn(100, 5).astype(np.float32)

# Create corresponding target values by adding random
# noise to the dataset
y = np.dot(X, np.array([1, 2, 3, 4, 5], dtype=np.float32)) + np.random.randn(100).astype(np.float32) * 0.1

# Create an instance of the SGD class
model = SGD(lr=0.005, epochs=1000, batch_size=12, tol=1e-3)
w, b = model.fit(X, y)

# Predict using the learned parameters
y_pred = np.dot(X, w) + b

# Create random dataset with 100 rows and 5 columns

X = np.random.randn(100, 5).astype(np.float32)

# Create corresponding target value by adding random

# noise in the dataset

y = np.dot(X, np.array([1, 2, 3, 4, 5], dtype=np.float32)) + np.random.randn(100).astype(np.float32) * 0.1

# Create an instance of the SGD class

model = SGD(lr=0.005, epochs=1000, batch_size=12, tol=1e-3)

w, b = model.fit(X, y)

# Predict using predict method from model

y_pred = np.dot(X, w) + b

Output
Epoch 0: Loss 52.73115158081055
Epoch 100: Loss 44.69907760620117
Epoch 200: Loss 44.693603515625
Epoch 300: Loss 44.69377136230469
Epoch 400: Loss 44.67509460449219
Epoch 500: Loss 44.67082595825195
Epoch 600: Loss 44.674285888671875
Epoch 700: Loss 44.666194915771484
Epoch 800: Loss 44.66718292236328
Epoch 900: Loss 44.65559005737305

Advantages of Stochastic Gradient Descent


 Speed: SGD is faster than other variants of Gradient Descent such as Batch Gradient Descent and Mini-
Batch Gradient Descent since it uses only one example to update the parameters.

 Memory Efficiency: Since SGD updates the parameters for each training example one at a time, it is
memory-efficient and can handle large datasets that cannot fit into memory.

 Avoidance of Local Minima: Due to the noisy updates in SGD, it has the ability to escape from local
minima and converge towards a global minimum.

Disadvantages of Stochastic Gradient Descent


 Noisy updates: The updates in SGD are noisy and have a high variance, which can make the optimization
process less stable and lead to oscillations around the minimum.

 Slow Convergence: SGD may require more iterations to converge to the minimum since it updates the
parameters for each training example one at a time.

 Sensitivity to Learning Rate: The choice of learning rate can be critical in SGD since using a high
learning rate can cause the algorithm to overshoot the minimum, while a low learning rate can make the
algorithm converge slowly.

 Less Accurate: Due to the noisy updates, SGD may not converge to the exact global minimum and can
result in a suboptimal solution. This can be mitigated by using techniques such as learning rate scheduling
and momentum-based updates.

Challenges in Deep Learning


Deep learning has made significant advancements in various fields, but there are still some challenges that need
to be addressed. Here are some of the main challenges in deep learning:

1. Data availability: Deep learning requires large amounts of data to learn from, and gathering enough
data for training is a major concern.

2. Computational resources: Training deep learning models is computationally expensive and often
requires specialized hardware such as GPUs and TPUs.

3. Time-consuming: Training, especially on sequential data, can take a very long time, even days or
months, depending on the computational resources available.

4. Interpretability: Deep learning models are complex and work like a black box, which makes their
results very difficult to interpret.

5. Overfitting: When a model is trained repeatedly on the same data, it becomes too specialized to the
training data, leading to overfitting and poor performance on new data.

Feed-forward Neural Network


Feedforward neural networks, also known as multi-layered networks of neurons, are called "feedforward"
because information flows in one direction, from the input layer to the output layer, without looping back. A
feedforward network is composed of three types of layers:
 Input Layer:
The input layer accepts the input data and passes it to the next layer.
 Hidden Layers:
One or more hidden layers that process and transform the input data. Each hidden layer has a set of
neurons connected to the neurons of the previous and next layers. These layers use activation functions,
such as ReLU or sigmoid, to introduce non-linearity into the network, allowing it to learn and model
more complex relationships between the inputs and outputs.
 Output Layer:
The output layer generates the final output. Depending on the type of problem, the number of neurons in
the output layer may vary. For example, in a binary classification problem, it would only have one
neuron. In contrast, a multi-class classification problem would have as many neurons as the number of
classes.
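
As a minimal sketch in Keras (the layer sizes and the input dimension of 20 features are arbitrary choices for illustration):

import tensorflow as tf

# A small feedforward network for binary classification
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                     # input layer: 20 features
    tf.keras.layers.Dense(16, activation="relu"),    # hidden layer 1
    tf.keras.layers.Dense(8, activation="relu"),     # hidden layer 2
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output layer: one neuron
])
model.compile(optimizer="sgd", loss="binary_crossentropy")
model.summary()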
