Unit 1 Fundamentals of Deep Learning
Syllabus: What is Deep Learning?, Multilayer Perceptron, Feed forward neural
network, Back propagation, Gradient descent, Vanishing gradient problem,
Activation Functions: RELU, LRELU, ERELU, Optimization Algorithms,
Hyper parameters: Layer size, Magnitude (momentum, learning rate),
Regularization (dropout, drop connect, L1, L2)
Deep learning is a subfield of machine learning that focuses on building and training
artificial neural networks to perform tasks that typically require human-like
intelligence. It is inspired by the structure and function of the human brain, where
interconnected neurons process and transmit information. Deep learning has led to
significant advancements in various domains, including image and speech
recognition, natural language processing, and even game playing.
Key characteristics:
1. Neural Networks: Deep learning primarily involves the use of artificial neural
networks, which are composed of layers of interconnected nodes (neurons).
These networks attempt to simulate the behavior of neurons in the human
brain to process and learn from data.
2. Depth: The term "deep" in deep learning refers to the depth of the neural
network, which comprises multiple hidden layers between the input and
output layers. Deeper networks can capture more intricate patterns in data but
can also be more challenging to train.
3. Feature Representation: Deep learning excels at automatic feature extraction
and representation learning. Instead of manually designing features for a task,
deep learning models can learn relevant features from raw data, reducing the
need for domain expertise.
4. Training Data: Supervised deep learning models typically require a large
amount of labeled training data to learn from. The network adjusts its
parameters iteratively based on the discrepancies between its predictions and
the ground truth labels in the training data.
5. Backpropagation: To train a neural network, the backpropagation algorithm is
used. It calculates the gradients of the network's parameters with respect to
the loss function, allowing the network to update its weights to minimize the
error.
6. Activation Functions: Activation functions introduce non-linearity to neural
networks, enabling them to model complex relationships in data. Common
activation functions include ReLU (Rectified Linear Unit), sigmoid, and
tanh.
7. Types of Neural Networks:
a. Convolutional Neural Networks (CNNs): Specialized for image and
video analysis, CNNs use convolutional layers to automatically learn
spatial hierarchies of features.
b. Recurrent Neural Networks (RNNs): Suitable for sequence data, RNNs
maintain a hidden state that captures information from previous steps
in the sequence.
c. Long Short-Term Memory (LSTM) Networks: A type of RNN designed to
handle long-range dependencies in sequences.
d. Gated Recurrent Units (GRUs): Similar to LSTMs, GRUs are designed to
be computationally more efficient.
8. Transfer Learning: Transfer learning involves using a pre-trained model on one
task as a starting point for another related task. This approach leverages the
learned features and can significantly reduce the amount of required training
data.
9. Applications:
a. Image and Video Analysis: Deep learning is widely used in image
classification, object detection, image generation, and facial
recognition.
b. Natural Language Processing (NLP): NLP tasks such as language
translation, sentiment analysis, and chatbots benefit from deep
learning techniques.
c. Speech Recognition: Deep learning is used to convert spoken language
into text, enabling voice assistants and transcription services.
d. Autonomous Vehicles: Deep learning plays a crucial role in enabling
self-driving cars to understand their environment and make real-time
decisions.
10. Challenges:
a. Data Quality and Quantity: Deep learning requires large, high-quality
datasets for effective training.
b. Overfitting: Networks can become too specialized to the training data
and perform poorly on new data.
c. Computational Resources: Training deep networks demands significant
computational power.
d. Interpretability: Deep learning models can be challenging to interpret,
leading to issues in critical applications like healthcare.
Perceptron
● The perceptron was invented in 1957 at the Cornell Aeronautical Laboratory
by Frank Rosenblatt.
● The perceptron is one of the first concepts to learn in Machine Learning and
Deep Learning. It consists of a set of weights, input values or scores, and a
threshold, and it is a building block of an Artificial Neural Network.
● A perceptron can be viewed as an artificial neuron that performs a simple
calculation on its inputs to detect features in the input data.
● The perceptron is a linear Machine Learning algorithm used for supervised
learning of binary classifiers. During training, the neuron processes the
training examples one at a time and adjusts its weights after each one.
● The perceptron model is also one of the simplest types of Artificial Neural
Networks. It can be considered a single-layer neural network with four main
components: input values, weights and bias, net sum, and an activation
function.
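The four components above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the step activation with a threshold of 0, the learning rate of 1.0, and the choice of the logical AND problem are all assumptions made for the example.

```python
def predict(weights, bias, x):
    """Net sum of inputs times weights plus bias, passed through a step activation."""
    net_sum = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if net_sum >= 0 else 0


def train_perceptron(samples, labels, lr=1.0, epochs=20):
    """Perceptron learning rule: after each example, shift the weights
    by lr * error * input and the bias by lr * error."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


# Logical AND is linearly separable, so a single perceptron can learn it.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
weights, bias = train_perceptron(X, y)
predictions = [predict(weights, bias, x) for x in X]
```

A problem that is not linearly separable, such as XOR, cannot be learned by this single-layer model, which is one motivation for the multilayer perceptron.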
Multilayer Perceptron
Layers:
● Input Layer: The input layer receives the raw data. Each neuron
corresponds to a feature in the input data. The number of neurons in
this layer is determined by the dimensionality of the input data.
● Hidden Layers: These intermediate layers process and transform the
data. Each layer consists of multiple neurons that compute their
outputs based on weighted inputs from the previous layer.
● Output Layer: The final layer produces the network's prediction or
output. The number of neurons in this layer depends on the problem
type (e.g., binary classification, regression, multi-class classification).
Neurons and Activation Functions:
● Each neuron computes a weighted sum of its inputs (outputs from the
previous layer), adds a bias term, and passes the result through an
activation function.
● Activation functions introduce non-linearity, allowing the network to
capture complex relationships in the data.
● Common activation functions include Rectified Linear Unit (ReLU),
sigmoid, and hyperbolic tangent (tanh).
Feedforward Propagation:
● During feedforward propagation, information flows from the input layer
through the hidden layers to the output layer.
● Neurons in each layer compute their outputs using the weighted inputs
and activation functions.
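As a rough sketch of feedforward propagation, the loop below passes an input vector through a list of layers. The layer sizes, the random weights, the use of ReLU in hidden layers, and the linear output layer are assumptions chosen for illustration.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)


def forward(x, layers):
    """Propagate x through a list of (W, b) pairs, input layer to output layer."""
    a = x
    for i, (W, b) in enumerate(layers):
        z = W @ a + b                      # weighted sum of inputs plus bias
        is_last = i == len(layers) - 1
        a = z if is_last else relu(z)      # hidden layers apply the activation
    return a


rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 3)), np.zeros(4)),   # 3 inputs -> 4 hidden units
    (rng.normal(size=(2, 4)), np.zeros(2)),   # 4 hidden units -> 2 outputs
]
x = np.array([0.5, -1.0, 2.0])
output = forward(x, layers)
```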
Weights and Biases:
● Weights represent the strength of connections between neurons.
● Biases provide an offset to the weighted sum before applying the
activation function.
● Weights and biases are learned during training to minimize prediction
errors.
Activation and Loss Functions:
● Activation Functions: Each neuron applies an activation function to its
computed weighted sum.
● Loss Function: Measures the discrepancy between predicted and
actual outputs. Common loss functions include Mean Squared Error
(MSE) for regression and Cross-Entropy for classification.
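The two loss functions named above can be written as follows. This is a minimal sketch; the function names, the one-hot label format, and the clipping constant `eps` are assumptions for the example.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean Squared Error for regression targets."""
    return float(np.mean((y_true - y_pred) ** 2))


def cross_entropy(y_true, probs, eps=1e-12):
    """Mean categorical cross-entropy.
    y_true is one-hot; probs holds the predicted class probabilities."""
    probs = np.clip(probs, eps, 1.0)       # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(probs), axis=1)))


# Squared errors are 0, 0 and 4, so the mean is 4/3.
mse_value = mse(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0]))

# A model that predicts 50/50 on two classes has a loss of ln(2) per example.
y_onehot = np.array([[1, 0], [0, 1]])
ce_value = cross_entropy(y_onehot, np.array([[0.5, 0.5], [0.5, 0.5]]))
```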
Gradient descent
Vanishing gradient problem
Activation Functions:
An activation function decides whether a neuron should be activated. The neuron
first calculates the weighted sum of its inputs and adds a bias; the activation
function is then applied to that value. The purpose of the activation function is
to introduce non-linearity into the output of a neuron.
Why do we need non-linear activation functions? A neural network without an
activation function is essentially just a linear regression model, because a stack
of linear layers is itself a linear function. The activation function applies a
non-linear transformation to the input, making the network capable of learning
and performing more complex tasks.
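The point about linearity can be checked numerically. In the sketch below (the small hand-picked weights and input are assumptions for the example), two layers without an activation give exactly the same output as one merged linear layer, while inserting ReLU between them changes the result.

```python
import numpy as np

W1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
b1 = np.zeros(2)
W2 = np.array([[1.0, 1.0]])
b2 = np.zeros(1)
x = np.array([2.0, 1.0])

# Two stacked layers with no activation function...
two_linear = W2 @ (W1 @ x + b1) + b2

# ...are equivalent to a single linear layer with merged weights and bias.
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
one_linear = W_merged @ x + b_merged

# With ReLU in between, the negative hidden value is clipped to zero,
# so the network is no longer a single linear map.
with_relu = W2 @ np.maximum(0.0, W1 @ x + b1) + b2
```

Here the hidden values are [1, -1]: the linear network outputs 0, while the ReLU network outputs 1.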
Optimization Algorithms
Optimization algorithms are divided into two camps:
- First-order
- Second-order
● First-order optimization algorithms calculate the Jacobian matrix of the
objective. The Jacobian has one partial derivative per parameter (to calculate
partial derivatives, all other variables are momentarily treated as constants);
for a scalar loss it reduces to the gradient vector. The algorithm then takes
one step in the direction specified by the Jacobian, which for minimization is
the direction of the negative gradient.
● Second-order algorithms calculate the derivative of the Jacobian (i.e., the
derivative of a matrix of derivatives) by approximating the Hessian. Second
order methods take into account interdependencies between parameters
when choosing how much to modify each parameter.
● Gradient descent is a member of the first-order class of algorithms.
Variations of gradient descent exist, but at its core, it finds the next step in the
right direction with respect to an objective at each iteration. Those steps move
us toward a minimum of the error (ideally the global minimum) or a maximum
of the likelihood.
● Stochastic gradient descent (SGD) is machine learning’s workhorse
optimization algorithm. On large datasets, SGD can train several orders of
magnitude faster than methods such as batch gradient descent, typically
with little or no loss of model accuracy.
● The strengths of SGD are easy implementation and the quick processing of
large datasets. You can adjust SGD by adapting the learning rate or using
second-order information.
● SGD is also a popular algorithm for training neural networks due to its
robustness in the face of noisy updates.
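A minimal sketch of mini-batch SGD on a one-variable linear regression is shown below. The synthetic data (y = 3x + 2 plus noise), the learning rate, the batch size, and the number of epochs are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3x + 2 plus a little noise (illustrative only).
X = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * X + 2.0 + rng.normal(scale=0.05, size=200)

w, b = 0.0, 0.0
lr, batch_size = 0.1, 20

for epoch in range(100):
    order = rng.permutation(len(X))            # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        err = (w * X[idx] + b) - y[idx]        # prediction error on the batch
        grad_w = 2.0 * np.mean(err * X[idx])   # d(MSE)/dw estimated on the batch
        grad_b = 2.0 * np.mean(err)            # d(MSE)/db estimated on the batch
        w -= lr * grad_w                       # step against the gradient
        b -= lr * grad_b
```

Each update uses only 20 samples, so the gradient is a noisy estimate of the full-batch gradient, yet w and b still settle close to 3 and 2.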
L-BFGS
➢ L-BFGS is an optimization algorithm and a so-called quasi-Newton
method. As its name indicates, it’s a limited-memory variation of the
Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm: it limits how
much gradient history is stored in memory.
➢ By this, we mean the algorithm does not compute or store the full
Hessian matrix, which would be more computationally expensive.
➢ L-BFGS approximates the inverse Hessian matrix to direct the weight
adjustment search toward more promising areas of parameter space.
Whereas BFGS stores a full n × n approximation of the inverse Hessian,
L-BFGS stores only a few vectors that represent a local approximation
of it.
➢ L-BFGS performs faster because it uses approximated second-order
information. L-BFGS and conjugate gradient in practice can be faster
and more stable than SGD methods.
Conjugate gradient
➢ Conjugate gradient guides the direction of the line search process
based on conjugacy information.
➢ Conjugate gradient methods focus on minimizing the conjugate L2
norm.
➢ Conjugate gradient is very similar to gradient descent in that it performs
line search.
➢ The major difference is that conjugate gradient requires each
successive search direction in the line search process to be conjugate
to the previous ones.
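For a quadratic objective the idea can be sketched compactly. The example below is the linear conjugate gradient method, which minimizes 0.5 * x^T A x - b^T x for a symmetric positive-definite matrix A; training a neural network would need a nonlinear variant, so treat this as an illustration of conjugate search directions only. The matrix A and vector b are assumptions for the example.

```python
import numpy as np


def conjugate_gradient(A, b, tol=1e-10):
    """Minimize 0.5 * x^T A x - b^T x for symmetric positive-definite A."""
    x = np.zeros_like(b)
    r = b - A @ x              # residual, equal to the negative gradient
    d = r.copy()               # first direction is plain steepest descent
    for _ in range(len(b)):    # at most n steps in exact arithmetic
        rr = r @ r
        if np.sqrt(rr) < tol:
            break
        Ad = A @ d
        alpha = rr / (d @ Ad)  # exact line search along d
        x = x + alpha * d
        r = r - alpha * Ad
        beta = (r @ r) / rr    # keeps the new direction conjugate to earlier ones
        d = r + beta * d
    return x


A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_min = conjugate_gradient(A, b)   # also the solution of A x = b
```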
Hessian-free
➢ Hessian-free optimization is related to Newton’s method, but it better
minimizes the quadratic function we get.
➢ It is a powerful optimization method adapted to neural networks by
James Martens in 2010.
➢ We find the minimum of the quadratic function with an iterative method
called conjugate gradient.
Nesterov’s momentum
The “vanilla” version of SGD uses the gradient directly, and this can be
problematic because the gradient can be nearly zero for any parameter. This
causes SGD to take tiny steps in some cases, and steps that are too big in
situations where the gradient is too large. To alleviate these issues, we can
use techniques such as the following:
• Nesterov’s momentum
• RMSProp
• Adam
• AdaDelta
Momentum is a factor between 0.0 and 1.0 that is applied to the change rate
of the weights over time. Typically, we see the value for momentum between
0.9 and 0.99.
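The difference between classical momentum and Nesterov’s momentum can be sketched on a toy objective. The objective f(w) = w**2, the learning rate, the momentum value of 0.9, and the function names below are assumptions for illustration.

```python
def grad(w):
    """Gradient of the toy objective f(w) = w**2 (a stand-in for a real loss)."""
    return 2.0 * w


def momentum_step(w, v, lr=0.1, mu=0.9):
    # The velocity is a decaying sum of past gradient steps.
    v = mu * v - lr * grad(w)
    return w + v, v


def nesterov_step(w, v, lr=0.1, mu=0.9):
    # Look ahead to where the velocity is about to carry the weight,
    # and evaluate the gradient at that point instead of at w.
    v = mu * v - lr * grad(w + mu * v)
    return w + v, v


w_m, v_m = 5.0, 0.0
w_n, v_n = 5.0, 0.0
for _ in range(200):
    w_m, v_m = momentum_step(w_m, v_m)
    w_n, v_n = nesterov_step(w_n, v_n)
```

On this toy problem both variants approach the minimum at 0; the look-ahead gradient damps the oscillation, so the Nesterov iterate ends up closer.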
AdaGrad
AdaGrad is one technique that has been developed to help with
finding the “right” learning rate. AdaGrad is named in reference to how
it “adaptively” uses subgradient methods to dynamically control the
learning rate of an optimization algorithm. The effective learning rate
under AdaGrad is monotonically decreasing: it only shrinks as training
proceeds. For each parameter, the base learning rate is divided by the
square root of the sum of squares of the history of gradient
computations. AdaGrad speeds our training in the beginning and slows
it appropriately toward convergence, allowing for a smoother training
process.
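A minimal sketch of the AdaGrad update follows. The toy objective sum(w**2), the base learning rate, and the small constant `eps` added for numerical safety are assumptions for the example.

```python
import numpy as np


def adagrad_update(w, grad, cache, base_lr=0.1, eps=1e-8):
    cache = cache + grad ** 2                        # running sum of squared gradients
    effective_lr = base_lr / (np.sqrt(cache) + eps)  # per-parameter learning rate
    return w - effective_lr * grad, cache, effective_lr


w = np.array([1.0, -2.0])
cache = np.zeros_like(w)
lr_history = []
for _ in range(5):
    grad = 2.0 * w                                   # gradient of sum(w**2)
    w, cache, effective_lr = adagrad_update(w, grad, cache)
    lr_history.append(effective_lr.copy())
```

Because the cache only grows, each entry of `lr_history` is no larger than the one before it.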
RMSProp
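RMSProp is a very effective adaptive learning rate method that was
introduced in lecture notes rather than in a published paper. It divides
the learning rate by a decaying average of recent squared gradients.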
AdaDelta
AdaDelta is a variant of AdaGrad that keeps only the most recent
history rather than accumulating it like AdaGrad does.
Adam
Adam (adaptive moment estimation, a more recently developed updating
technique introduced by Kingma and Ba) derives per-parameter learning
rates from estimates of the first and second moments of the gradients.
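The moment estimates can be sketched as below. The default values of `beta1`, `beta2`, and `eps` are the commonly quoted ones; the toy objective sum(w**2), the learning rate of 0.01, and the number of steps are assumptions for the example.

```python
import numpy as np


def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * grad          # first moment: decaying mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second moment: decaying mean of squares
    m_hat = m / (1.0 - beta1 ** t)                # bias correction for the zero start
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v


w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 2001):                          # t starts at 1 for bias correction
    grad = 2.0 * w                                # gradient of sum(w**2)
    w, m, v = adam_update(w, grad, m, v, t, lr=0.01)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so every parameter moves by roughly `lr` regardless of how large its gradient is.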
Hyper parameters:
● Hyperparameters are the variables that determine the network
structure (e.g., number of hidden units) and the variables that determine how
the network is trained (e.g., learning rate). Hyperparameters are set before
training (before optimizing the weights and biases).
● These are external to the model: their values are chosen by the practitioner
and are not learned from the data during the training process.
● Hyperparameter selection focuses on ensuring that the model neither
underfits nor overfits the training dataset, while learning the structure of the
data as quickly as possible.
● Some common hyperparameters are dropout, network weight
initialization, activation function, learning rate, momentum, number of
epochs, and batch size.
● Number of Hidden Layers and units
○ Hidden layers are the layers between input layer and output layer.
○ “Very simple. Just keep adding layers until the test error does not
improve anymore.”
○ Many hidden units within a layer, combined with regularization techniques,
can increase accuracy. Too few units may cause underfitting.
Dropout
➢ Dropout is a regularization technique to avoid overfitting (increase the
validation accuracy), thus increasing the generalizing power.
➢ Generally, use a small dropout value of 20%-50% of neurons, with 20%
providing a good starting point. A probability too low has minimal effect and a
value too high results in under-learning by the network.
➢ Use a larger network. You are likely to get better performance when dropout is
used on a larger network, giving the model more of an opportunity to learn
independent representations.
DropConnect
➢ It does the same thing as Dropout, but instead of muting a hidden unit, it
mutes individual connections (weights) between two neurons.
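The difference between the two techniques is where the random mask is applied. The sketch below is illustrative only: the function names are made up for the example, dropout is shown in its common "inverted" form (survivors are rescaled during training), and the same simple rescaling is assumed for DropConnect for symmetry.

```python
import numpy as np

rng = np.random.default_rng(0)


def dropout(activations, drop_prob, rng):
    """Inverted dropout: zero whole units, rescale survivors so the
    expected activation stays the same."""
    keep = rng.random(activations.shape) >= drop_prob
    return activations * keep / (1.0 - drop_prob)


def dropconnect_layer(x, W, b, drop_prob, rng):
    """DropConnect: zero individual weights (connections) rather than units.
    The 1 / (1 - drop_prob) rescaling is an assumption made for this sketch."""
    keep = rng.random(W.shape) >= drop_prob
    return (W * keep) @ x / (1.0 - drop_prob) + b


h = np.ones(10)
h_dropped = dropout(h, drop_prob=0.2, rng=rng)      # entries are 0 or 1.25

W = np.ones((3, 10))
out = dropconnect_layer(h, W, np.zeros(3), drop_prob=0.2, rng=rng)
```

Both masks are applied only during training; at prediction time the full network is used.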
L1 and L2 regularization
➢ L1 regularization is considered computationally inefficient in the nonsparse
case, has sparse outputs, and includes built-in feature selection. L1
regularization penalizes the absolute values of the weights rather than their
squares. This function drives many weights to zero while allowing a few to
grow large, making it easier to interpret the weights.
➢ In contrast, L2 regularization is computationally efficient because it has
analytical solutions, and it gives nonsparse outputs, but it does not do feature
selection automatically for us.
➢ The L2 regularization function, a common and simple hyperparameter, adds a
term to the objective function that penalizes large squared weights. You
multiply half the sum of the squared weights by a coefficient called the
weight-cost.
➢ L2 improves generalization, smooths the output of the model as input
changes, and helps the network ignore weights it does not use.
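The two penalty terms and their gradients can be written as below. The function names and the coefficient value are assumptions for the example; `coeff` plays the role of the weight-cost.

```python
import numpy as np


def l1_penalty(w, coeff):
    return coeff * np.sum(np.abs(w))          # sum of absolute values


def l2_penalty(w, coeff):
    return coeff * 0.5 * np.sum(w ** 2)       # half the sum of squares times the weight-cost


def l1_grad(w, coeff):
    return coeff * np.sign(w)                 # same-size pull toward zero for every nonzero weight


def l2_grad(w, coeff):
    return coeff * w                          # pull proportional to the weight itself


w = np.array([3.0, -0.5, 0.0])
l1_value = l1_penalty(w, coeff=0.1)           # 0.1 * (3 + 0.5 + 0)
l2_value = l2_penalty(w, coeff=0.1)           # 0.1 * 0.5 * (9 + 0.25 + 0)
```

The gradients hint at why L1 gives sparse weights: its pull does not shrink as a weight approaches zero, whereas the L2 pull fades away for small weights.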
Activation functions
➢ Activation functions are used to introduce nonlinearity to models, which allows
deep learning models to learn nonlinear prediction boundaries.
➢ Generally, the ReLU activation function is the most popular. It is used in hidden
layers.
➢ Sigmoid is used in the output layer while making binary predictions. Softmax
is used in the output layer while making multi-class predictions.
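A minimal sketch of the three functions mentioned above is shown below; the input values are arbitrary and chosen only for illustration.

```python
import numpy as np


def relu(z):
    """Hidden-layer activation: passes positive values, zeroes negatives."""
    return np.maximum(0.0, z)


def sigmoid(z):
    """Squashes a score into (0, 1); used for binary predictions."""
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    """Turns a vector of scores into class probabilities that sum to 1."""
    shifted = z - np.max(z)        # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)


hidden = relu(np.array([-2.0, 0.0, 3.0]))
p_binary = sigmoid(0.0)
p_classes = softmax(np.array([2.0, 1.0, 0.1]))
```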
Learning Rate
➢ The learning rate defines how quickly a network updates its parameters.
➢ Low learning rate slows down the learning process but converges smoothly.
➢ Larger learning rate speeds up the learning but may not converge.
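The effect of the learning rate can be seen on a toy objective. In this sketch the objective f(w) = w**2, the starting point, the step count, and the three learning rates are assumptions for illustration.

```python
def run_gradient_descent(lr, steps=50, w=1.0):
    """Minimize f(w) = w**2, whose gradient is 2*w, with a fixed learning rate."""
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w


w_small = run_gradient_descent(lr=0.01)   # converges smoothly but slowly
w_good = run_gradient_descent(lr=0.1)     # converges much faster
w_large = run_gradient_descent(lr=1.1)    # overshoots on every step and diverges
```

With lr=1.1 each update multiplies w by -1.2, so the iterate moves further from the minimum on every step.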
Momentum
➢ Momentum helps to determine the direction of the next step using knowledge
of the previous steps. It helps to prevent oscillations. A typical choice of
momentum is between 0.5 and 0.9, with values up to 0.99 also in use.
Number of epochs
➢ Number of epochs is the number of times the whole training data is shown to
the network while training.
➢ Increase the number of epochs until the validation accuracy starts decreasing
even while training accuracy is still increasing (overfitting).
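This stopping rule is often implemented as early stopping. The sketch below is one simple way to express it; the function name, the `patience` parameter, and the validation accuracies are hypothetical values made up for the example.

```python
def early_stopping_epoch(val_accuracies, patience=2):
    """Return the 1-based epoch with the best validation accuracy, stopping
    once `patience` consecutive epochs pass without improvement."""
    best_acc, best_epoch, waited = float("-inf"), 0, 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            best_acc, best_epoch, waited = acc, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch


# Hypothetical validation accuracy per epoch: improving, then degrading.
history = [0.70, 0.78, 0.83, 0.85, 0.84, 0.82, 0.80]
best = early_stopping_epoch(history)
```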
Batch size
➢ Mini-batch size is the number of training samples given to the network before
each parameter update.
➢ A good default for batch size might be 32. Also try 64, 128, 256, and so
on.
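Batch size and number of epochs together determine how many parameter updates happen during training. A small sketch, assuming a dataset of 1000 samples, a batch size of 32, and 3 epochs (all made-up values):

```python
def iterate_minibatches(n_samples, batch_size):
    """Yield index ranges, one per mini-batch; the last batch may be smaller."""
    for start in range(0, n_samples, batch_size):
        yield range(start, min(start + batch_size, n_samples))


n_samples, batch_size, epochs = 1000, 32, 3
batches = list(iterate_minibatches(n_samples, batch_size))
updates_per_epoch = len(batches)              # one parameter update per mini-batch
total_updates = updates_per_epoch * epochs
```

Here 1000 samples give 31 full batches plus one batch of 8, so there are 32 updates per epoch and 96 in total.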