Unit 1 Fundamentals of Deep Learning

The document discusses the fundamentals of deep learning including multilayer perceptrons, feedforward neural networks, backpropagation, activation functions, optimization algorithms, hyperparameters, and regularization techniques. It also defines deep learning, describes key characteristics like neural networks and feature representation, and outlines common deep learning applications and challenges.


Unit I

Fundamentals of Deep Learning

Syllabus: What is Deep Learning?, Multilayer Perceptron, Feed forward neural network,
Back propagation, Gradient descent, Vanishing gradient problem, Activation Functions:
RELU, LRELU, ERELU, Optimization Algorithms, Hyper parameters: Layer size,
Magnitude (momentum, learning rate), Regularization (dropout, drop connect, L1, L2)

What is Deep Learning?

Deep learning is a subfield of machine learning that focuses on building and training
artificial neural networks to perform tasks that typically require human-like
intelligence. It is inspired by the structure and function of the human brain, where
interconnected neurons process and transmit information. Deep learning has led to
significant advancements in various domains, including image and speech
recognition, natural language processing, and even game playing.

Key characteristics:

1. Neural Networks: Deep learning primarily involves the use of artificial neural
networks, which are composed of layers of interconnected nodes (neurons).
These networks attempt to simulate the behavior of neurons in the human
brain to process and learn from data.
2. Depth: The term "deep" in deep learning refers to the depth of the neural
network, which comprises multiple hidden layers between the input and
output layers. Deeper networks can capture more intricate patterns in data but
can also be more challenging to train.
3. Feature Representation: Deep learning excels at automatic feature extraction
and representation learning. Instead of manually designing features for a task,
deep learning models can learn relevant features from raw data, reducing the
need for domain expertise.
4. Training Data: Deep learning models require a large amount of labeled
training data to learn from. The network adjusts its parameters iteratively
based on the discrepancies between its predictions and the ground truth
labels in the training data.
5. Backpropagation: To train a neural network, the backpropagation algorithm is
used. It calculates the gradients of the network's parameters with respect to
the loss function, allowing the network to update its weights to minimize the
error.
6. Activation Functions: Activation functions introduce non-linearity to neural
networks, enabling them to model complex relationships in data. Common
activation functions include ReLU (Rectified Linear Unit), sigmoid, and
tanh.
7. Types of Neural Networks:
a. Convolutional Neural Networks (CNNs): Specialized for image and
video analysis, CNNs use convolutional layers to automatically learn
spatial hierarchies of features.
b. Recurrent Neural Networks (RNNs): Suitable for sequence data, RNNs
maintain a hidden state that captures information from previous steps
in the sequence.
c. Long Short-Term Memory (LSTM) Networks: A type of RNN designed to
handle long-range dependencies in sequences.
d. Gated Recurrent Units (GRUs): Similar to LSTMs, GRUs are designed to
be computationally more efficient.
8. Transfer Learning: Transfer learning involves using a pre-trained model on one
task as a starting point for another related task. This approach leverages the
learned features and can significantly reduce the amount of required training
data.
9. Applications:
a. Image and Video Analysis: Deep learning is widely used in image
classification, object detection, image generation, and facial
recognition.
b. Natural Language Processing (NLP): NLP tasks such as language
translation, sentiment analysis, and chatbots benefit from deep
learning techniques.
c. Speech Recognition: Deep learning is used to convert spoken language
into text, enabling voice assistants and transcription services.
d. Autonomous Vehicles: Deep learning plays a crucial role in enabling
self-driving cars to understand their environment and make real-time
decisions.
10. Challenges:
a. Data Quality and Quantity: Deep learning requires large, high-quality
datasets for effective training.
b. Overfitting: Networks can become too specialized to the training data
and perform poorly on new data.
c. Computational Resources: Training deep networks demands significant
computational power.
d. Interpretability: Deep learning models can be challenging to interpret,
leading to issues in critical applications like healthcare.
Perceptron
● The perceptron was invented in 1957 at the Cornell Aeronautical Laboratory
by Frank Rosenblatt.
● The perceptron is one of the first concepts to learn in Machine Learning and
Deep Learning. It consists of a set of weights, input values (or scores), and
a threshold, and it is a building block of an Artificial Neural Network.
● A perceptron can be viewed as an artificial neuron: it performs a calculation
on its input data to detect features, for example in business intelligence
applications.
● The perceptron is a linear Machine Learning algorithm for supervised
learning of binary classifiers. During training, the neuron learns by
processing the training examples one at a time.
● The perceptron model is also regarded as one of the simplest types of
Artificial Neural Network. Because it is a supervised learning algorithm for
binary classifiers, we can consider it a single-layer neural network with four
main parts: input values, weights and bias, net sum, and an activation
function.

● Input Nodes or Input Layer: This is the component of the perceptron that
accepts the initial data into the system for further processing. Each
input node contains a real numerical value.
● Weight and Bias: The weight parameter represents the strength of the
connection between units. In other words, a weight decides how much
influence an input will have on the output: the larger the magnitude of the
weight, the more strongly the associated input affects the output. The bias
can be thought of as the intercept in a linear equation.
● Activation Function: This is the final component, which determines whether
the neuron will fire or not. In the basic perceptron, the activation function
is a step function.

How does Perceptron work?


● The perceptron model begins by multiplying each input value by its
weight and adding these products together to create the weighted sum.
This weighted sum is then passed to the activation function 'f' to obtain
the output. In the basic perceptron, 'f' is a step function.
● Perceptron model works in two important steps as follows:
○ Step-1 In the first step, multiply all input values with corresponding
weight values and then add them to determine the weighted sum.
Mathematically, we can calculate the weighted sum as follows:

■ ∑wi*xi = w1*x1 + w2*x2 + … + wn*xn

Add a special term called bias 'b' to this weighted sum to
improve the model's performance: ∑wi*xi + b

○ Step-2 In the second step, an activation function is applied to the
above-mentioned weighted sum, which gives us output either in binary
form or as a continuous value, as follows: Y = f(∑wi*xi + b)
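
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the threshold of 0, and the AND-gate weights are assumptions chosen for the example.

```python
def step(z, threshold=0.0):
    """Step activation: output 1 when z exceeds the threshold, else 0."""
    return 1 if z > threshold else 0


def perceptron_output(inputs, weights, bias):
    """Compute Y = f(sum(wi * xi) + b) for one input example."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))  # Step 1
    return step(weighted_sum + bias)                            # Step 2


# Hypothetical weights that make the perceptron act like a logical AND gate.
and_weights = [1.0, 1.0]
and_bias = -1.5

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), "->", perceptron_output([x1, x2], and_weights, and_bias))
```

Only the input (1, 1) gives a weighted sum plus bias above zero, so it is the only input for which the neuron fires.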

Types of Perceptron Models


● Based on the layers, Perceptron models are divided into two types. These are
as follows:
1. Single-layer Perceptron Model
2. Multi-layer Perceptron model

➔ Single Layer Perceptron Model:


◆ This is one of the simplest types of Artificial Neural Network (ANN). A
single-layer perceptron model consists of a feed-forward network and
includes a threshold transfer function inside the model. The main
objective of the single-layer perceptron model is to classify linearly
separable objects with binary outcomes.
◆ After the weighted inputs are added, if their total is more than a
pre-determined threshold value, the model gets activated and shows the
output value as +1.

Multilayer Perceptron

A Multilayer Perceptron (MLP) is a class of artificial neural networks that
consists of multiple layers of interconnected nodes, also known as neurons.
It's designed to handle complex tasks by learning and modeling intricate
patterns in data through a process of forward and backward propagation.
MLPs are a foundational concept in deep learning and serve as the basis for
more sophisticated neural network architectures.

Key Components and Concepts:

​ Layers:
● Input Layer: The input layer receives the raw data. Each neuron
corresponds to a feature in the input data. The number of neurons in
this layer is determined by the dimensionality of the input data.
● Hidden Layers: These intermediate layers process and transform the
data. Each layer consists of multiple neurons that compute their
outputs based on weighted inputs from the previous layer.
● Output Layer: The final layer produces the network's prediction or
output. The number of neurons in this layer depends on the problem
type (e.g., binary classification, regression, multi-class classification).
​ Neurons and Activation Functions:
● Each neuron computes a weighted sum of its inputs (outputs from the
previous layer), adds a bias term, and passes the result through an
activation function.
● Activation functions introduce non-linearity, allowing the network to
capture complex relationships in the data.
● Common activation functions include Rectified Linear Unit (ReLU),
sigmoid, and hyperbolic tangent (tanh).
​ Feedforward Propagation:
● During feedforward propagation, information flows from the input layer
through the hidden layers to the output layer.
● Neurons in each layer compute their outputs using the weighted inputs
and activation functions.
​ Weights and Biases:
● Weights represent the strength of connections between neurons.
● Biases provide an offset to the weighted sum before applying the
activation function.
● Weights and biases are learned during training to minimize prediction
errors.
​ Activation and Loss Functions:
● Activation Functions: Each neuron applies an activation function to its
computed weighted sum.
● Loss Function: Measures the discrepancy between predicted and
actual outputs. Common loss functions include Mean Squared Error
(MSE) for regression and Cross-Entropy for classification.
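
To make the components above concrete, here is a small sketch of one feedforward pass through an MLP followed by a loss computation. The layer sizes, the hand-picked weights, and the helper names (dense, mse, binary_cross_entropy) are assumptions made for illustration only.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: each neuron applies the activation to
    (weighted sum of inputs + bias). `weights` has one row per neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def mse(predicted, target):
    """Mean Squared Error, commonly used for regression."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def binary_cross_entropy(predicted, target):
    """Cross-entropy for binary classification (predictions in (0, 1))."""
    return -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for p, t in zip(predicted, target)
    ) / len(target)


# Hypothetical parameters: 2 inputs -> 2 hidden neurons (ReLU) -> 1 output (sigmoid).
hidden_w = [[0.5, -0.5], [0.25, 0.75]]
hidden_b = [0.0, 0.1]
output_w = [[1.0, -1.0]]
output_b = [0.0]

x = [1.0, 2.0]
hidden = dense(x, hidden_w, hidden_b, relu)          # hidden layer activations
y_hat = dense(hidden, output_w, output_b, sigmoid)   # network prediction
print("hidden:", hidden, "prediction:", y_hat)
print("cross-entropy vs label 0:", binary_cross_entropy(y_hat, [0.0]))
```

Training would adjust hidden_w, hidden_b, output_w, and output_b to reduce the loss; here they are fixed so that the data flow is easy to follow.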

Advantages of Multi-Layer Perceptron:


● A multi-layer perceptron model can be used to solve complex non-linear
problems.
● It can be applied to both small and large input data.
● It gives quick predictions once training is complete.
● It can reach comparable accuracy with large as well as small data.

Disadvantages of Multi-Layer Perceptron:

● In a multi-layer perceptron, computations are difficult and time-consuming.
● In a multi-layer perceptron, it is difficult to predict how much each
independent variable affects the dependent variable.
● The model's functioning depends on the quality of the training.

(Also refer Book)

Feed forward neural network


● A Feed Forward Neural Network is an artificial neural network in which the
connections between nodes do not form a cycle.
● In the feed-forward neural network, there are no feedback loops or
connections in the network. There is simply an input layer, one or more
hidden layers, and an output layer.
● The feed forward model is the simplest form of neural network, as information
is only processed in one direction. While the data may pass through multiple
hidden nodes, it always moves in one direction and never backwards.
Input Layer
➢ It contains the neurons that receive input. The data is subsequently
passed on to the next tier. The input layer’s total number of neurons is
equal to the number of variables in the dataset.
Hidden layer
➢ This is the intermediate layer, which is concealed between the input
and output layers. This layer has a large number of neurons that
perform alterations on the inputs. They then communicate with the
output layer.
Output layer
➢ It is the last layer, and its form depends on the model's construction.
The output layer produces the predicted feature; during training its
output is compared with the desired outcome, which is known.
Neurons weights
➢ Weights are used to describe the strength of a connection between
neurons. A weight can take any real value, positive or negative; weights
are often initialized to small random values.
➢ Depending on the setup of the neural network, the final output may be
a real-valued output (regression) or a set of probabilities
(classification). This is controlled by the type of activation function we
use on the neurons in the output layer
➢ The output layer typically uses either a softmax or sigmoid activation
function for classification
Back propagation
Algorithm intuition
● Backpropagation learning is similar to the perceptron learning algorithm. We
want to compute the input example’s output with a forward pass through the
network. If the output matches the label, we don’t do anything. If the output
does not match the label, we need to adjust the weights on the connections in
the neural network.
● Back-propagation is the essence of neural network training. It is the practice
of fine-tuning the weights of a neural network based on the error rate (i.e.
loss) obtained in the previous epoch (i.e. iteration). Proper tuning of the
weights ensures lower error rates, making the model reliable by increasing its
generalization.
● Backpropagation is a pragmatic approach to dividing the contribution of error
among the weights. With backpropagation, we are trying to minimize the error
between the label (or "actual") output associated with the training input and
the value generated from the network output.
(Book)
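
As an illustration of the idea, the sketch below applies backpropagation to a deliberately tiny network: one input, one sigmoid hidden unit, and one linear output, trained on a single example with squared error. The network shape, starting weights, and learning rate are assumptions chosen to keep the chain rule visible; real networks apply the same steps to whole layers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Forward pass: returns the hidden activation and the output."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)   # hidden unit
    y = w2 * h + b2            # linear output unit
    return h, y


def backward(params, x, target):
    """Return the loss and its gradient for each parameter (chain rule)."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    loss = 0.5 * (y - target) ** 2
    dy = y - target            # dLoss/dy
    dw2, db2 = dy * h, dy      # gradients for the output layer
    dh = dy * w2               # error propagated back to the hidden unit
    dz = dh * h * (1.0 - h)    # through the sigmoid, whose derivative is h(1-h)
    dw1, db1 = dz * x, dz      # gradients for the hidden layer
    return loss, [dw1, db1, dw2, db2]


params = [0.5, 0.0, -0.3, 0.1]   # hypothetical starting weights
x, target, learning_rate = 1.0, 1.0, 0.5
losses = []
for _ in range(50):
    loss, grads = backward(params, x, target)
    losses.append(loss)
    # Gradient descent update: move each weight against its gradient.
    params = [p - learning_rate * g for p, g in zip(params, grads)]

print("first loss:", losses[0], "last loss:", losses[-1])
```

Each weight receives a share of the output error in proportion to how much it contributed to that error, which is the "dividing the contribution of error" described above.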

Gradient descent
Vanishing gradient problem

Activation Functions:
The activation function decides whether a neuron should be activated or not. It
is applied to the weighted sum of the neuron's inputs plus the bias. The purpose
of the activation function is to introduce non-linearity into the output of a neuron.
Why do we need non-linear activation functions? A neural network without an
activation function is essentially just a linear regression model. The activation
function applies a non-linear transformation to the input, making the network
capable of learning and performing more complex tasks.
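
A short sketch can show why this is so. Below, two stacked linear layers (with scalar weights, chosen arbitrarily for the example) compute exactly the same function as a single linear layer, while inserting a ReLU between them produces a function that no single linear layer can match.

```python
def linear(w, b):
    """Return a one-input linear layer x -> w*x + b."""
    return lambda x: w * x + b


def relu(z):
    return max(0.0, z)


layer1 = linear(2.0, -1.0)
layer2 = linear(-3.0, 0.5)


def stacked_linear(x):
    """Two linear layers with no activation in between."""
    return layer2(layer1(x))


# The same function written as ONE linear layer: w = w2*w1, b = w2*b1 + b2.
collapsed = linear(-3.0 * 2.0, -3.0 * -1.0 + 0.5)


def stacked_with_relu(x):
    """Two linear layers with a ReLU in between."""
    return layer2(relu(layer1(x)))


for x in (-1.0, 0.0, 2.0):
    print(x, stacked_linear(x), collapsed(x), stacked_with_relu(x))
```

With the ReLU in place, the outputs for x = -1 and x = 0 are equal while the output for x = 2 differs, which a straight line cannot do unless it is constant.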

Types of Activation Function :-


LRELU
ERELU

Optimization Algorithms
Optimization algorithms are divided into two camps:
- First-order
- Second-order
● First-order optimization algorithms calculate the Jacobian matrix (for a
scalar loss, this is simply the gradient vector). The Jacobian has one
partial derivative per parameter (to calculate partial derivatives, all other
variables are momentarily treated as constants). The algorithm then takes one
step in the direction specified by the Jacobian; when minimizing, this is the
direction of the negative gradient.
● Second-order algorithms calculate the derivative of the Jacobian (i.e., the
derivative of a matrix of derivatives) by approximating the Hessian. Second
order methods take into account interdependencies between parameters
when choosing how much to modify each parameter.
● Gradient descent is a member of the first-order class of algorithms.
Variations of gradient descent exist, but at its core, it finds the next step in the
right direction with respect to an objective at each iteration. Those steps move
us toward a minimum of the error (ideally the global minimum) or a maximum of
the likelihood.
● Stochastic gradient descent (SGD) is machine learning's workhorse
optimization algorithm. On large datasets, SGD can train several orders of
magnitude faster than methods such as batch gradient descent, typically with
little or no loss of model accuracy.
● The strengths of SGD are easy implementation and the quick processing of
large datasets. You can adjust SGD by adapting the learning rate or using
second-order information.
● SGD is also a popular algorithm for training neural networks due to its
robustness in the face of noisy updates.
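
The difference between batch gradient descent and SGD can be sketched on a toy linear regression problem. Everything here is an assumption made for illustration: the synthetic data (a line y = 2x + 1), the learning rate, and the number of epochs. Batch gradient descent makes one update per pass over the data, while SGD makes one update per example.

```python
import random

random.seed(0)
# Synthetic data drawn from the hypothetical line y = 2x + 1.
data = [(i / 99, 2.0 * (i / 99) + 1.0) for i in range(100)]


def mean_squared_error(w, b):
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def batch_gradient_descent(epochs, lr):
    w = b = 0.0
    for _ in range(epochs):
        # One update per epoch, using the gradient averaged over ALL examples.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


def stochastic_gradient_descent(epochs, lr):
    w = b = 0.0
    examples = list(data)
    for _ in range(epochs):
        random.shuffle(examples)
        for x, y in examples:
            # One update per example, using that example's gradient only.
            error = w * x + b - y
            w, b = w - lr * 2 * error * x, b - lr * 2 * error
    return w, b


batch_loss = mean_squared_error(*batch_gradient_descent(epochs=20, lr=0.05))
sgd_loss = mean_squared_error(*stochastic_gradient_descent(epochs=20, lr=0.05))
print("batch GD loss:", batch_loss)
print("SGD loss:     ", sgd_loss)
```

After the same number of passes over the data, SGD has made 100 times as many updates, which is why it reaches a lower loss in this sketch.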

● Second-order methods: All second-order methods calculate or approximate
the Hessian. As described earlier, we can think of the Hessian as the
derivative of the Jacobian. That is, it is a matrix of second-order partial
derivatives, analogous to "tracking acceleration rather than speed."
Second-order methods include:
➔ Limited-memory BFGS (L-BFGS)
➔ Conjugate gradient
➔ Hessian-free

L-BFGS
➢ L-BFGS is an optimization algorithm and a so-called quasi-Newton
method. As its name indicates, it is a variation of the
Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, and it limits how
much gradient history is stored in memory.
➢ By this, we mean the algorithm does not compute the full Hessian
matrix, which is more computationally expensive.
➢ L-BFGS approximates the inverse Hessian matrix to direct the weight
adjustment search toward more promising areas of parameter space.
Whereas BFGS stores the full n × n inverse Hessian matrix,
L-BFGS stores only a few vectors that represent a local approximation
of it.
➢ L-BFGS performs faster because it uses approximated second-order
information. L-BFGS and conjugate gradient in practice can be faster
and more stable than SGD methods.
Conjugate gradient
➢ Conjugate gradient guides the direction of the line search process
based on conjugacy information.
➢ Conjugate gradient methods focus on minimizing the conjugate L2
norm.
➢ Conjugate gradient is very similar to gradient descent in that it performs
line search.
➢ The major difference is that conjugate gradient requires the successive
search directions in the line search process to be conjugate to one
another.

Hessian-free
➢ Hessian-free optimization is related to Newton’s method, but it better
minimizes the quadratic function we get.
➢ It is a powerful optimization method adapted to neural networks by
James Martens in 2010.
➢ We find the minimum of the quadratic function with an iterative method
called conjugate gradient.

Nesterov’s momentum
The “vanilla” version of SGD uses gradient directly, and this can be
problematic because gradient can be nearly zero for any parameter. This
causes SGD to take tiny steps in some cases, and steps that are too big for
situations in which the gradient is too large. To alleviate these issues, we can
use techniques such as the following:
• Nesterov’s momentum
• RMSProp
• Adam
• AdaDelta

Momentum is a factor between 0.0 and 1.0 that is applied to the change rate
of the weights over time. Typically, we see the value for momentum between
0.9 and 0.99.
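
The momentum update can be sketched as below, using the objective f(w) = w**2 as a stand-in for a real loss. The function name, learning rate, and step count are assumptions for the example. The velocity accumulates a decaying sum of past gradients, and the Nesterov variant differs only in evaluating the gradient at a look-ahead point.

```python
def gradient(w):
    """Gradient of the hypothetical objective f(w) = w**2."""
    return 2.0 * w


def sgd_momentum(w, steps, lr=0.1, momentum=0.9, nesterov=False):
    velocity = 0.0
    for _ in range(steps):
        # Nesterov evaluates the gradient where the current velocity is about
        # to carry the weight; classical momentum uses the current weight.
        lookahead = w + momentum * velocity if nesterov else w
        velocity = momentum * velocity - lr * gradient(lookahead)
        w = w + velocity
    return w


print("classical:", sgd_momentum(5.0, steps=300))
print("nesterov: ", sgd_momentum(5.0, steps=300, nesterov=True))
```

Both variants approach the minimum at w = 0; the momentum factor controls how much of the previous weight change carries over into the next step.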
AdaGrad
AdaGrad is one technique that has been developed to help with
finding the "right" learning rate. AdaGrad is named in reference to how
it "adaptively" uses subgradient methods to dynamically control the
learning rate of an optimization algorithm. The effective learning rate
under AdaGrad is monotonically decreasing: it never increases during
training. AdaGrad divides the base learning rate by the square root of
the sum of squares of the history of gradient computations. AdaGrad
speeds our training in the beginning and slows it appropriately toward
convergence, allowing for a smoother training process.
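
A minimal single-parameter sketch of the AdaGrad update follows. The function name, base learning rate, and the small constant eps (added to avoid division by zero) are assumptions for illustration; in a real network the same accumulation is kept separately for every parameter.

```python
import math


def adagrad(gradient, w, steps, base_lr=0.5, eps=1e-8):
    """Minimize a one-parameter objective with an AdaGrad-style update."""
    sum_sq = 0.0
    effective_lrs = []
    for _ in range(steps):
        g = gradient(w)
        sum_sq += g * g                            # history of squared gradients
        lr = base_lr / (math.sqrt(sum_sq) + eps)   # shrinks as history grows
        effective_lrs.append(lr)
        w -= lr * g
    return w, effective_lrs


# Hypothetical objective f(w) = w**2, whose gradient is 2w.
w_final, lrs = adagrad(lambda w: 2.0 * w, w=5.0, steps=100)
print("final w:", w_final)
print("first and last effective learning rate:", lrs[0], lrs[-1])
```

Because the sum of squared gradients can only grow, the effective learning rate can only shrink, which is the monotonic decrease described above.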
RMSProp
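RMSprop is a very effective adaptive learning rate method that was
never formally published; it was introduced in Geoffrey Hinton's
lecture notes. It divides the learning rate by a moving average of
recent squared gradients.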
AdaDelta
AdaDelta is a variant of AdaGrad that keeps only the most recent
history rather than accumulating it like AdaGrad does.
ADAM
ADAM (adaptive moment estimation, a more recently developed updating
technique introduced by Kingma and Ba in 2014) derives learning rates
from estimates of the first and second moments of the gradients.
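
The moment estimates mentioned above can be sketched for a single parameter as follows. The default values for lr, beta1, beta2, and eps are commonly used settings rather than values from these notes, and the quadratic objective is an assumption for the example.

```python
import math


def adam(gradient, w, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a one-parameter objective with an Adam-style update."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = gradient(w)
        m = beta1 * m + (1 - beta1) * g       # first moment: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g   # second moment: running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)          # bias correction for starting at zero
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


def grad(w):
    """Gradient of the hypothetical objective f(w) = (w - 3)**2."""
    return 2.0 * (w - 3.0)


print("after 1 step:   ", adam(grad, 0.0, steps=1))
print("after 500 steps:", adam(grad, 0.0, steps=500))
```

Dividing by the square root of the second moment makes the first step roughly the size of lr regardless of how large the raw gradient is.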

Hyper parameters:
● Hyperparameters are the variables which determine the network
structure (e.g., number of hidden units) and the variables which determine how
the network is trained (e.g., learning rate). Hyperparameters are set before
training (before optimizing the weights and biases).
● These are external to the model: their values are chosen by the
practitioner and are not learned from the data during the training process.
● Hyperparameter selection focuses on ensuring that the model neither
underfits nor overfits the training dataset, while learning the structure of the
data as quickly as possible.
● Some of the hyperparameters are: dropout, network weight
initialization, activation function, learning rate, momentum, number of
epochs, and batch size.
● Number of Hidden Layers and units
○ Hidden layers are the layers between input layer and output layer.
○ “Very simple. Just keep adding layers until the test error does not
improve anymore.”
○ Many hidden units within a layer with regularization techniques can
increase accuracy. Smaller number of units may cause underfitting.

Layer size (book)


Magnitude (momentum, learning rate) (book)

Regularization (dropout, drop connect, L1, L2)


➢ Regularization is a measure taken against overfitting. Overfitting occurs when
a model describes the training set but cannot generalize well over new data.
➢ Regularization helps modify the gradient so that it
doesn't step in directions that lead it to overfit.
➢ Regularization includes the following:
– Dropout
– DropConnect
– L1 penalty
– L2 penalty

Dropout
➢ Dropout is a regularization technique used to avoid overfitting (and so
increase the validation accuracy), thus increasing the generalizing power.
➢ Generally, use a small dropout value of 20%-50% of neurons with 20%
providing a good starting point. A probability too low has minimal effect and a
value too high results in under-learning by the network.
➢ Use a larger network. You are likely to get better performance when dropout is
used on a larger network, giving the model more of an opportunity to learn
independent representations.
DropConnect
➢ It does the same thing as Dropout, but instead of muting a whole hidden
unit, it mutes individual connections (weights) between two neurons.
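
The contrast between the two techniques can be sketched as follows. The function names are hypothetical, and rescaling the surviving activations in dropout (so-called inverted dropout) is an assumption of this sketch; it is a common convention but not the only one. Both masks are applied only during training.

```python
import random


def dropout(activations, drop_prob, rng):
    """Dropout: zero each UNIT with probability drop_prob.
    Survivors are rescaled so the expected activation is unchanged."""
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


def dropconnect(weights, drop_prob, rng):
    """DropConnect: zero each individual CONNECTION (weight) with
    probability drop_prob, leaving the units themselves in place."""
    keep_prob = 1.0 - drop_prob
    return [
        [w if rng.random() < keep_prob else 0.0 for w in row]
        for row in weights
    ]


rng = random.Random(0)
print(dropout([1.0, 1.0, 1.0, 1.0, 1.0], drop_prob=0.2, rng=rng))
print(dropconnect([[0.5, -0.3], [0.8, 0.1]], drop_prob=0.5, rng=rng))
```

Dropping a unit removes all of its outgoing connections at once, whereas DropConnect can remove some of a unit's connections while keeping others.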

L1 and L2 regularization
➢ L1 regularization is considered computationally inefficient in the nonsparse
case, has sparse outputs, and includes built-in feature selection. L1
regularization penalizes the sum of the absolute values of the weights rather
than their squares. This penalty drives many weights to zero while allowing a
few to grow large, making it easier to interpret the weights.
➢ In contrast, L2 regularization is computationally efficient because it has
analytical solutions, and it gives nonsparse outputs, but it does not do feature
selection automatically for us.
➢ The L2 regularization function, a common and simple hyperparameter, adds a
term to the objective function that penalizes large squared weights. You
multiply half the sum of the squared weights by a coefficient called the
weight-cost.
➢ L2 improves generalization, smooths the output of the model as input
changes, and helps the network ignore weights it does not use.
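
The two penalty terms, and the gradients they add to each weight, can be sketched as follows. The function names and example weights are assumptions for illustration; the L2 form follows the description above (half the sum of squared weights times the weight-cost).

```python
def l1_penalty(weights, coefficient):
    """L1 term: coefficient times the sum of absolute weight values."""
    return coefficient * sum(abs(w) for w in weights)


def l2_penalty(weights, weight_cost):
    """L2 term: weight_cost times half the sum of squared weights."""
    return weight_cost * 0.5 * sum(w * w for w in weights)


def l1_gradient(weights, coefficient):
    """Gradient of the L1 term: a constant-size push toward zero, sign(w)."""
    return [coefficient * ((w > 0) - (w < 0)) for w in weights]


def l2_gradient(weights, weight_cost):
    """Gradient of the L2 term: a push toward zero proportional to w."""
    return [weight_cost * w for w in weights]


weights = [1.0, -2.0, 0.5]   # hypothetical weights
print("L1 penalty:", l1_penalty(weights, 0.1), "gradient:", l1_gradient(weights, 0.1))
print("L2 penalty:", l2_penalty(weights, 0.1), "gradient:", l2_gradient(weights, 0.1))
```

The L1 gradient has the same magnitude for every non-zero weight, so small weights are pushed all the way to zero (sparsity), while the L2 gradient shrinks as the weight shrinks, so weights become small but rarely exactly zero.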

Network Weight Initialization


➢ Ideally, it may be better to use different weight initialization schemes
according to the activation function used on each layer.
➢ Mostly, a uniform distribution is used.

Activation functions
➢ Activation functions are used to introduce nonlinearity to models, which allows
deep learning models to learn nonlinear prediction boundaries.
➢ Generally, the ReLU activation function is the most popular. It is used in hidden
layers.
➢ Sigmoid is used in the output layer when making binary predictions. Softmax
is used in the output layer when making multi-class predictions.
Learning Rate
➢ The learning rate defines how quickly a network updates its parameters.
➢ Low learning rate slows down the learning process but converges smoothly.
➢ Larger learning rate speeds up the learning but may not converge.
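
The effect of the learning rate can be seen on a one-parameter example. The objective f(w) = w**2 and the three learning rates below are assumptions chosen to show slow convergence, fast convergence, and divergence.

```python
def gradient_descent(learning_rate, start=1.0, steps=20):
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2.0 * w   # the gradient of w**2 is 2w
    return w


for lr in (0.01, 0.4, 1.1):
    print("learning rate", lr, "-> w after 20 steps:", gradient_descent(lr))
```

With a rate of 0.01 the weight creeps toward the minimum at 0, with 0.4 it gets there almost immediately, and with 1.1 every step overshoots by more than it corrects, so the weight grows instead of converging.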

Momentum
➢ Momentum helps to choose the direction of the next step using knowledge of
the previous steps. It helps to prevent oscillations. A typical choice of
momentum is between 0.5 and 0.9.

Number of epochs
➢ Number of epochs is the number of times the whole training data is shown to
the network while training.
➢ Increase the number of epochs until the validation accuracy starts decreasing
even while the training accuracy is still increasing (a sign of overfitting).
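
This rule is often automated as early stopping. The sketch below is one possible version: the function name, the patience parameter (how many non-improving epochs to tolerate), and the accuracy values are all made up for illustration.

```python
def best_epoch(validation_accuracies, patience=2):
    """Return the 1-based epoch with the best validation accuracy, stopping
    once `patience` epochs in a row fail to improve on the best so far."""
    best, best_at, waited = float("-inf"), 0, 0
    for epoch, accuracy in enumerate(validation_accuracies, start=1):
        if accuracy > best:
            best, best_at, waited = accuracy, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break   # validation accuracy has stopped improving
    return best_at


# Made-up validation accuracy per epoch.
history = [0.70, 0.78, 0.83, 0.85, 0.84, 0.83, 0.86]
print("stop and keep the model from epoch", best_epoch(history))
```

A larger patience tolerates temporary dips in validation accuracy, at the cost of training for longer before stopping.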

Batch size
➢ Mini batch size is the number of samples given to the network before each
parameter update happens.
➢ A good default for batch size might be 32. Also try 64, 128, 256, and so
on.
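
How the batch size determines the number of parameter updates per epoch can be sketched as follows. The helper name and the dataset of 100 placeholder examples are assumptions for the example.

```python
def minibatches(examples, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


examples = list(range(100))   # stand-in for 100 training examples
batch_sizes = [len(batch) for batch in minibatches(examples, 32)]
print("batch sizes:", batch_sizes)
print("parameter updates per epoch:", len(batch_sizes))
```

With 100 examples and a batch size of 32, one epoch consists of four updates, and the last batch is smaller than the rest.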

Methods used to find out Hyperparameters


➔ Manual Search
➔ Grid Search
➔ Random Search
➔ Bayesian Optimization
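
Grid search and random search can be sketched as below. The scoring function is a made-up stand-in for "train a model with these settings and measure validation accuracy", and the search ranges are assumptions for the example.

```python
import itertools
import random


def validation_score(learning_rate, momentum):
    """Hypothetical stand-in for training a model and returning its
    validation accuracy. It peaks at learning_rate=0.01, momentum=0.9."""
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(momentum - 0.9)


def grid_search(learning_rates, momenta):
    """Try every combination of the listed values and keep the best."""
    return max(
        itertools.product(learning_rates, momenta),
        key=lambda config: validation_score(*config),
    )


def random_search(trials, rng):
    """Sample configurations at random from the ranges and keep the best.
    The learning rate is sampled on a log scale between 1e-4 and 1e-1."""
    candidates = [
        (10 ** rng.uniform(-4, -1), rng.uniform(0.5, 0.99))
        for _ in range(trials)
    ]
    return max(candidates, key=lambda config: validation_score(*config))


print("grid search:  ", grid_search([0.001, 0.01, 0.1], [0.5, 0.9, 0.99]))
print("random search:", random_search(20, random.Random(0)))
```

Grid search evaluates 9 fixed combinations here, while random search spends its 20 trials on values that need not lie on any grid, which can help when only a few hyperparameters matter.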
