DL UNIT 1

Deep Learning
Deep learning is a subfield of machine learning that deals with algorithms inspired by an over-simplified model of the human brain, and it is used to solve a vast category of modern-day machine intelligence problems.
Deep learning addresses the major bottleneck of traditional machine learning systems: manual feature engineering. Here, we essentially take the intelligence one step further, in that the machine develops relevant features for the task in an automated way instead of relying on hand-crafted ones. Human beings, too, learn concepts starting from raw data.
Human learning goes from raw data to a conclusion without the explicit step where features
are identified and provided to the learner. In a sense, human beings learn the appropriate
representation of data from the data itself.
The field of deep learning has its primary focus on learning appropriate representations of data such that these can be used to draw conclusions.
The word “deep” in “deep learning” refers to the idea of learning the hierarchy of concepts
directly from raw data. A more technically appropriate term for deep learning would be
representation learning, and a more practical term for the same would be automated feature
engineering.
Neural Networks
Neural networks are a computational model that shares some properties with the animal brain
in which many simple units are working in parallel with no centralized control unit. The
weights between the units are the primary means of long-term information storage in neural
networks. Updating the weights is the primary way the neural network learns new information.
The behavior of neural networks is shaped by their network architecture. A network’s
architecture can be defined (in part) by the following:
• Number of neurons
• Number of layers
• Types of connections between layers
The most well-known and simplest-to-understand neural network is the feedforward multilayer
neural network. It has an input layer, one or many hidden layers, and a single output layer.
Each layer can have a different number of neurons and each layer is fully connected to the
adjacent layer.
Fig. Multilayer neural network
The Perceptron
The perceptron is a linear model used for binary classification. In the field of neural networks,
the perceptron is considered an artificial neuron using the Heaviside step function for the
activation function.
The perceptron is a linear-model binary classifier with a simple input-output relationship: it multiplies its n inputs by their associated weights, sums them, and then sends this "net input" to a step function with a defined threshold. Typically with perceptrons this is a Heaviside step function; with a bias term included, the threshold can be fixed at 0. This function outputs a single binary value (0 or 1), depending on the input.
Fig. Single-layer perceptron

We can model the decision boundary and the classification output with the Heaviside step function, as follows:

f(x) = 1 if w · x + b > 0
f(x) = 0 otherwise
The output of the step function (activation function) is the output of the perceptron and gives us a classification of the input values. If the bias value is negative, it forces the weighted sum of the inputs to be a much greater value to get a classification output of 1. The bias term in this capacity moves the decision boundary around for the model. Input values do not affect the bias term, but the bias term is learned through the perceptron learning algorithm.
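
As a concrete illustration, here is a minimal sketch of a perceptron forward pass in Python; the input values, weights, and bias below are made-up numbers for demonstration:

# A perceptron forward pass: weighted sum of inputs plus bias,
# followed by a Heaviside step activation.
def heaviside(net_input):
    # Step function: outputs 1 if the net input is positive, else 0.
    return 1 if net_input > 0 else 0

def perceptron(inputs, weights, bias):
    # The net input is the dot product of weights and inputs plus the bias term.
    net_input = sum(w * x for w, x in zip(weights, inputs)) + bias
    return heaviside(net_input)

# Hypothetical example: 0.4*1.0 - 0.2*0.5 - 0.1 = 0.2 > 0, so the output is 1.
print(perceptron([1.0, 0.5], [0.4, -0.2], bias=-0.1))
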
Multilayer Perceptron
The multilayer feed-forward network is a neural network with an input layer, one or more
hidden layers, and an output layer. Each layer has one or more artificial neurons.
These artificial neurons are similar to their perceptron precursor yet have a different activation
function depending on the layer’s specific purpose in the network.
Fig. Artificial neuron for a multilayer perceptron
This diagram is similar to the single-layer perceptron, yet we notice a more generalized
activation function. The net input to the activation function is still the dot product of the
weights and input features, yet the flexible activation function allows us to create different
types of output values. This is a major contrast to the earlier perceptron design, which used a
piecewise linear Heaviside step function; the improvement allows the artificial neuron to
express more complex activation output.
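
To make the contrast concrete, here is a small sketch of an artificial neuron with a pluggable activation function; the sigmoid used below is just one possible choice, and the numbers are made up:

import math

def sigmoid(z):
    # A smooth, nonlinear activation, in contrast to the Heaviside step.
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    # The net input is still the dot product of weights and inputs plus the bias;
    # only the activation function applied to it changes.
    net_input = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(net_input)

# The same made-up inputs now yield a graded output (about 0.55) rather than a hard 0/1:
print(neuron([1.0, 0.5], [0.4, -0.2], bias=-0.1))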

Feed Forward Neural Network

What is a Feed Forward Neural Network?

A Feed Forward Neural Network is an artificial neural network in which the connections
between nodes do not form a cycle. The opposite of a feed-forward neural network is
a recurrent neural network, in which certain pathways are cycled. The feed-forward model is
the simplest form of a neural network as information is only processed in one direction. While
the data may pass through multiple hidden nodes, it always moves in one direction and never
backward.
How does a Feed Forward Neural Network work?

A Feed Forward Neural Network is commonly seen in its simplest form as a single
layer perceptron. In this model, a series of inputs enter the layer and are multiplied by the
weights.

The weighted values are then added together to get a sum of the weighted inputs. If the sum is above a specific threshold, usually set at zero, the value produced is often 1, whereas if the sum falls below the threshold, the output value is -1.

The single-layer perceptron is an important model of feed-forward neural networks and is often
used in classification tasks.

Furthermore, single-layer perceptrons can incorporate aspects of machine learning. Using a property known as the delta rule, the neural network can compare the outputs of its nodes with the intended values, thus allowing the network to adjust its weights through training in order to produce more accurate output values.
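
A minimal sketch of a delta-rule update in Python, assuming a learning rate lr; the weights, inputs, and target below are made-up values, not a full training loop:

# Delta rule: adjust each weight in proportion to the error (target - output)
# and to the input value that contributed to it.
def delta_rule_update(weights, inputs, target, output, lr=0.1):
    return [w + lr * (target - output) * x for w, x in zip(weights, inputs)]

weights = [0.4, -0.2]
inputs = [1.0, 0.5]
# If a node output 0 but the intended value was 1, the weights move up:
print(delta_rule_update(weights, inputs, target=1, output=0))  # [0.5, -0.15]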

This process of training and learning produces a form of gradient descent. In multi-layered perceptrons, the process of updating weights is nearly analogous; however, the process is defined more specifically as back-propagation.

In such cases, each hidden layer within the network is adjusted according to the output values
produced by the final layer.
Applications of Feed Forward Neural Networks

While Feed Forward Neural Networks are fairly straightforward, their simplified architecture
can be used as an advantage in particular machine learning applications. For example, one may
set up a series of feed forward neural networks with the intention of running them
independently from each other, but with a mild intermediary for moderation.

Like the human brain, this process relies on many individual neurons in order to handle and
process larger tasks. As the individual networks perform their tasks independently, the results
can be combined at the end to produce a synthesized, and cohesive output.

Backpropagation
Backpropagation is an important part of reducing error in a neural network model.
Backpropagation learning is similar to the perceptron learning algorithm. We want to compute
the input example’s output with a forward pass through the network. If the output matches the
label, we don’t do anything. If the output does not match the label, we need to adjust the weights
on the connections in the neural network.
The key is to distribute the blame for the error and divide it between the contributing weights.
With the perceptron learning algorithm, it's easy because there is only one weight per input to influence the output value. With feed-forward multilayer networks, learning algorithms face a bigger challenge: there are many weights connecting each input to the output, and each weight contributes to more than one output, so our learning algorithm must be more clever.
Backpropagation is a pragmatic approach to dividing the contribution of error for each weight. It is similar to the perceptron learning algorithm. With backpropagation, we're trying to minimize the error between the label (or "actual") output associated with the training input and the value generated from the network output. In the next section, we take a look at the mathematical notation that you will see in most literature on neural networks for backpropagation of feed-forward neural networks.
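
To ground the idea before the notation, here is a minimal numerical sketch of backpropagation for a tiny two-layer network with sigmoid activations and squared error, using NumPy; the architecture, learning rate, and values are all made up for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up tiny network: 2 inputs -> 2 hidden units -> 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(2, 1)), np.zeros(1)
x = np.array([1.0, 0.5])   # one training input
y = np.array([1.0])        # its label ("actual" output)
lr = 0.5

for step in range(100):
    # Forward pass: compute the network's output for the input example.
    h = sigmoid(x @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Error between the label and the value generated by the network.
    err = out - y
    # Backward pass: distribute the blame for the error across the weights.
    d_out = err * out * (1 - out)        # error signal at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)   # error signal at the hidden layer
    # Gradient descent step on every weight and bias.
    W2 -= lr * np.outer(h, d_out); b2 -= lr * d_out
    W1 -= lr * np.outer(x, d_h);   b1 -= lr * d_h

print(float(out[0]))  # moves toward 1.0 as training reduces the error
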
Backpropagation and Mini-Batch Stochastic Gradient Descent
Mini-batch is a variant of SGD in which we train the model on multiple examples at once, as opposed to a single example at a time. We see mini-batch used with backpropagation and SGD in neural networks to improve training. Under the hood, we're computing the
average of the gradient across all the examples inside the mini-batch. Specifically, we compute
the forward pass for all of the examples to get their output scores as a batch linear algebra
matrix operation. During the backward pass for each layer, we are computing the average of
the gradient (for the layer). By doing backpropagation this way, we’re able to get a better
gradient approximation and use our hardware more efficiently at the same time.
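
A sketch of the matrix form described above, assuming NumPy; a whole mini-batch goes through the forward pass as one matrix operation, and the gradient is averaged across the batch (the data and the squared-error model are made up):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(32, 4))   # a mini-batch of 32 examples with 4 features
y = rng.normal(size=(32, 1))   # made-up regression targets
W = np.zeros((4, 1))

# Forward pass for all examples in the batch as a single matrix multiplication.
scores = X @ W
# Average gradient of squared error across all examples inside the mini-batch.
grad = X.T @ (scores - y) / len(X)
W -= 0.1 * grad
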
Gradient descent
Gradient Descent is an optimization algorithm used for minimizing the cost function in various
machine learning algorithms. It is basically used for updating the parameters of the learning
model.
In gradient descent, we can imagine the quality of our network’s predictions (as a function of
the weight/parameter values) as a landscape. The hills represent locations (parameter values or
weights) that give a lot of prediction error; valleys represent locations with less error. We
choose one point on that landscape at which to place our initial weight.
We can select the initial weight based on domain knowledge (if we're training a network to classify a flower species, we might know petal length is important, but color isn't). Or, if we're letting the network do all the work, we might choose the initial weights randomly.
The purpose is to move that weight downhill, to areas of lower error, as quickly as possible.
An optimization algorithm like gradient descent can sense the actual slope of the hills with
regard to each weight; that is, it knows which direction is down.
Gradient descent measures the slope (the change in error caused by a change in the weight) and takes the weight one step toward the bottom of the valley. It does so by taking the derivative of the loss function to produce the gradient, and the gradient gives the algorithm the direction for its next step.
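
For a single weight, the step just described is a one-liner; here is a sketch for a toy loss L(w) = (w - 3)^2, whose derivative we can write by hand:

def loss_gradient(w):
    # Derivative of the toy loss L(w) = (w - 3)**2: the slope at w.
    return 2 * (w - 3)

w, lr = 0.0, 0.1
for _ in range(50):
    # Move the weight one step downhill, against the gradient.
    w -= lr * loss_gradient(w)
print(w)  # converges toward 3, the bottom of the valley
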
Another problem gradient descent encounters is non-normalized features. When we write "non-normalized features," we mean features measured on very different scales. If you have one dimension measured in the millions and another in decimals, gradient descent will have a difficult time finding the steepest slope to minimize error.
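
A common remedy is to standardize each feature before training; a minimal sketch with NumPy, using a hypothetical data matrix X:

import numpy as np

X = np.array([[2_000_000.0, 0.01],
              [3_500_000.0, 0.03],
              [1_200_000.0, 0.02]])  # one feature in the millions, one in decimals

# Standardize: zero mean and unit variance per feature (column).
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_norm.std(axis=0))  # both features now live on the same scale: [1. 1.]
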
Types of Gradient Descent:
1. Batch Gradient Descent:
This is a type of gradient descent that processes all the training examples for each iteration of
gradient descent. But if the number of training examples is large, then batch gradient descent
is computationally very expensive. Hence if the number of training examples is large, then
batch gradient descent is not preferred. Instead, we prefer to use stochastic gradient descent or
mini-batch gradient descent.
2. Stochastic Gradient Descent:
This is a type of gradient descent that processes one training example per iteration, so the parameters are updated after each individual example is processed. This makes it faster than batch gradient descent. But when the number of training examples is large, processing only one example at a time becomes an overhead for the system, because the number of iterations will be quite large.
3. Mini-Batch Gradient Descent:
This is a type of gradient descent that works faster than both batch gradient descent and stochastic gradient descent. Here, b examples (where b < m, the total number of training examples) are processed per iteration. So even if the number of training examples is large, it is processed in batches of b training examples in one go. Thus, it works for larger training sets, and with a smaller number of iterations.
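
The three variants differ only in how many examples feed each parameter update. A sketch, assuming a gradient(Xb, yb, w) helper that returns the average squared-error gradient over the examples it is given; the data is made up:

import numpy as np

def gradient(Xb, yb, w):
    # Average gradient of squared error over the given examples (illustrative).
    return Xb.T @ (Xb @ w - yb) / len(Xb)

rng = np.random.default_rng(2)
X, y = rng.normal(size=(100, 3)), rng.normal(size=(100, 1))
w, lr, b = np.zeros((3, 1)), 0.1, 10   # b examples per mini-batch, with b < m

# Batch gradient descent: all m examples per update.
w -= lr * gradient(X, y, w)
# Stochastic gradient descent: one example per update.
i = rng.integers(len(X))
w -= lr * gradient(X[i:i+1], y[i:i+1], w)
# Mini-batch gradient descent: b examples per update.
idx = rng.choice(len(X), size=b, replace=False)
w -= lr * gradient(X[idx], y[idx], w)
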
Convergence trends in different variants of Gradient Descents:
In the case of Batch Gradient Descent, the algorithm follows a straight path towards the
minimum. If the cost function is convex, then it converges to a global minimum and if the cost
function is not convex, then it converges to a local minimum. Here the learning rate is typically
held constant.
In the case of stochastic gradient descent and mini-batch gradient descent, the algorithm does not converge exactly but keeps fluctuating around the global minimum. Therefore, in order to make it converge, we have to slowly decrease the learning rate. The convergence of stochastic gradient descent is much noisier because each iteration processes only one training example.
Vanishing gradient problem
The vanishing gradient problem is an issue that sometimes arises when training machine
learning algorithms through gradient descent. This most often occurs in neural networks that
have several neuronal layers such as in a deep learning system, but also occurs in recurrent
neural networks. The key point is that the calculated partial derivatives used to compute the gradient become progressively smaller as one goes deeper into the network. Since the gradients control how much the network learns during training, if the gradients are very small or zero, little to no training can take place, leading to poor predictive performance.

Proposed solutions

1. Multi-level hierarchy

This technique pre-trains one layer at a time, and then performs backpropagation for fine-tuning.

2. Residual networks

This technique introduces bypass (skip) connections that feed a given layer from layers further back than its immediate predecessor. These connections allow gradients to propagate faster to deep layers before they are attenuated to small or zero values (a short sketch of both residual connections and ReLUs appears after this list).
3. Rectified linear units (ReLUs)

When using rectified linear units, the typical sigmoidal activation function used for node output is replaced with a new function: f(x) = max(0, x). This activation saturates in only one direction and is thus more resilient to the vanishing of gradients.
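
A sketch of both ideas with NumPy: the residual (bypass) connection adds the layer input back onto the layer output, and ReLU saturates only on the negative side; the weight matrix and input are made up:

import numpy as np

def relu(z):
    # f(x) = max(0, x): the gradient is zero only for negative inputs.
    return np.maximum(0, z)

def residual_block(x, W):
    # The bypass connection adds x back in, so gradients can also flow
    # around the transformation instead of only through it.
    return relu(x @ W) + x

x = np.array([1.0, -0.5, 2.0])
W = np.eye(3) * 0.1
print(residual_block(x, W))  # [1.1, -0.5, 2.2]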

Activation Functions:
We use activation functions to propagate the output of one layer’s nodes forward to the next
layer (up to and including the output layer). An activation function is a scalar-to-scalar function, yielding the neuron's activation.
We use activation functions for hidden neurons in a neural network to introduce nonlinearity into the network's modeling capabilities. Many activation functions belong to a logistic class of transforms that (when graphed) resemble an S. This class of functions is called sigmoidal. The sigmoid family contains several variations, one of which is known as the Sigmoid function. Let's now take a look at some useful activation functions in neural networks.
Linear
A linear transform is basically the identity function, f(x) = Wx, where the dependent variable has a direct, proportional relationship with the independent variable. In practical terms, it means the function passes the signal through unchanged.

Fig. Linear activation function


Sigmoid
Like all logistic transforms, sigmoid can reduce extreme values or outliers in data without
removing them. The vertical line is the decision boundary.
Fig. Sigmoid activation function
A sigmoid function, f(x) = 1 / (1 + e^(-x)), is a machine that converts independent variables of near-infinite range into simple probabilities between 0 and 1, and most of its output will be very close to 0 or 1.

Tanh
Pronounced "tanch," tanh is a hyperbolic trigonometric function. Just as the tangent represents a ratio between the opposite and adjacent sides of a right triangle, tanh represents the ratio of the hyperbolic sine to the hyperbolic cosine: tanh(x) = sinh(x) / cosh(x). Unlike the sigmoid function, the normalized range of tanh is -1 to 1. The advantage of tanh is that it can deal more easily with negative numbers.
Fig.Tanh activation function

Hard Tanh
Similar to tanh, hard tanh simply applies hard caps to the normalized range: anything more than 1 is made into 1, and anything less than -1 is made into -1. This yields a more robust activation function with a limited decision boundary.
Softmax
Softmax is a generalization of logistic regression inasmuch as it can be applied to continuous data (rather than binary classification) and can contain multiple decision boundaries. It handles multinomial labeling systems, and it is the function you will often find at the output layer of a classifier. The softmax activation function returns the probability distribution over mutually exclusive output classes.
RELU
Rectified linear is a more interesting transform that activates a node only if the input is above a certain quantity. While the input is below zero, the output is zero, but when the input rises above the threshold, it has a linear relationship with the dependent variable: f(x) = max(0, x).
Fig. Rectified linear activation function
Rectified linear units (ReLUs) are the current state of the art because they have proven to work in many different situations. Because the gradient of a ReLU is either zero or a constant, it is possible to rein in the vanishing/exploding gradient issue. ReLU activation functions have been shown to train better in practice than sigmoid activation functions. Compared to the sigmoid and tanh activation functions, the ReLU activation function does not suffer from vanishing gradient issues. If we use hard max as the activation function, we can induce sparsity in the activation output from the layer. Research has shown that deep networks using ReLU activation functions train well without pretraining techniques.
LRELU
Leaky ReLUs are a strategy to mitigate the "dying ReLU" issue. As opposed to having the function be zero when x < 0, the leaky ReLU instead has a small negative slope (e.g., around 0.01). Some success has been seen in practice with this ReLU variation, but results are not always consistent. The equation is given here:

f(x) = x if x > 0
f(x) = 0.01x otherwise
Softplus
This activation function is considered to be the "smooth version of the ReLU." Compared to the plot of the ReLU, the softplus activation function, f(x) = ln(1 + exp(x)), has a similar shape. We also notice the differentiability and nonzero derivative of the softplus everywhere on the graph, in contrast to the ReLU.

Fig. Visualizing the ReLU and softplus activation functions
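
For reference, the activation functions discussed above are short enough to write out directly; a sketch with NumPy:

import numpy as np

def linear(z):    return z                     # identity: passes the signal unchanged
def sigmoid(z):   return 1 / (1 + np.exp(-z))  # squashes values into (0, 1)
def tanh(z):      return np.tanh(z)            # sinh(z)/cosh(z), squashes into (-1, 1)
def hard_tanh(z): return np.clip(z, -1, 1)     # hard caps at -1 and 1
def relu(z):      return np.maximum(0, z)      # f(z) = max(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)       # small negative slope below zero

def softplus(z):
    return np.log1p(np.exp(z))                 # smooth version of the ReLU

def softmax(z):
    # Probability distribution over mutually exclusive output classes.
    e = np.exp(z - z.max())                    # shift for numerical stability
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))  # the outputs sum to 1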


Optimization Algorithms
In optimization algorithms, we define the "best set of values" for the parameter vector as the values with the lowest loss function value.
We can divide optimization algorithms into two camps:
• First-order
First-order optimization algorithms calculate the Jacobian matrix.
First-order methods calculate a gradient (Jacobian) at each step to determine which direction
to go in next. This means that at each iteration, or step, we are trying to find the next best
possible direction to go, as defined by our objective function. This is why we consider
optimization algorithms to be a “search.” They are finding a path toward minimal error.
• Second-order
Second-order methods can take "better" steps; however, each step takes longer to calculate. All second-order methods calculate or approximate the Hessian, which can be thought of as the derivative of the Jacobian: it is a matrix of second-order partial derivatives, analogous to tracking acceleration rather than speed. The Hessian's job is to describe the curvature of each point of the Jacobian. Second-order methods include:
• Limited-memory BFGS (L-BFGS)
• Conjugate gradient
• Hessian-free
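
To make the first-order/second-order contrast concrete, here is a toy sketch for a quadratic loss: a first-order step uses only the gradient (Jacobian), while a Newton-style second-order step rescales it by the inverse Hessian. Real second-order methods such as L-BFGS only approximate the Hessian; the matrix below is made up:

import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T A w, with known gradient and Hessian.
A = np.array([[3.0, 0.0],
              [0.0, 0.5]])   # the curvature differs a lot per direction
w = np.array([1.0, 1.0])

grad = A @ w                 # first-order information (the gradient)
hess = A                     # second-order information (the Hessian)

w_first = w - 0.1 * grad                      # plain gradient step
w_second = w - np.linalg.solve(hess, grad)    # Newton step
print(w_first, w_second)  # the Newton step lands at [0, 0], the exact minimum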

Hyperparameters:
Here we define a hyperparameter as any configuration setting that is free to be chosen by the user and that might affect performance. Hyperparameters fall into several categories:
• Layer size
• Magnitude (momentum, learning rate)
• Regularization (dropout, drop connect, L1, L2)
• Activations (and activation function families)
• Weight initialization strategy
• Loss functions
• Settings for epochs during training (mini-batch size)
• Normalization scheme for input data (vectorization)
1. Layer size
Layer size is defined by the number of neurons in a given layer. Input and output layers are relatively easy to figure out because they correspond directly to how our modeling problem handles input and output. For the input layer, this will match up to the number of features in the input vector.
For the output layer, this will either be a single output neuron or a number of neurons matching
the number of classes we are trying to predict. Deciding on neuron counts for each hidden layer
is where hyperparameter tuning becomes a challenge.
We can use an arbitrary number of neurons to define a layer, and there are no rules about how big or small this number can be. However, how complex a problem we can model is directly correlated with how many neurons are in the hidden layers of our networks. This might push you to begin with a large number of neurons, but these neurons come with a cost.
Depending on the deep network architecture, the connection schema between layers can vary. However, the weights on the connections are the parameters we must train. As we include more parameters in our model, we increase the amount of effort needed to train the network. Large parameter counts can lead to long training times and models that struggle to find convergence.
2. Magnitude (momentum, learning rate)
Hyperparameters in the magnitude group involve the learning rate and momentum.
2.1 Momentum
Momentum helps the learning algorithm get out of spots in the search space where it would otherwise become stuck. In the error landscape, it helps the updater find the gulleys that lead toward the minima. Momentum is to the learning rate what the learning rate is to the weights, and it helps us produce better-quality models.
2.2 Learning rate
The learning rate in machine learning is how fast we change the parameter vector as we move through the search space. With a higher learning rate, we can move toward our goal faster (the least amount of error for the function being evaluated), but we might also take a step so large that we shoot right past the best answer to the problem.
If we make our learning rate too small, it might take a lot longer than we’d like for our training
process to complete. A low learning rate can make our learning algorithm inefficient. Learning
rates are tricky because they end up being specific to the dataset and even to other
hyperparameters. This creates a lot of overhead for finding the right setting for
hyperparameters.
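
A sketch of the two hyperparameters working together in a single update rule, using the common velocity formulation of momentum (the coefficient names and values are conventional choices, not from this text):

# Gradient descent with momentum on the toy loss L(w) = (w - 3)**2.
lr, mu = 0.05, 0.9       # learning rate and momentum coefficient
w, velocity = 0.0, 0.0

for _ in range(100):
    grad = 2 * (w - 3)                    # slope of the toy loss at w
    velocity = mu * velocity - lr * grad  # momentum accumulates past gradients
    w += velocity                         # and carries the weight through flat spots
print(w)  # approaches 3
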
3. Regularization (dropout, drop connect, L1, L2)
Regularization is a measure taken against overfitting. Overfitting occurs when a model
describes the training set but cannot generalize well over new inputs. Overfitted models have
no predictive capacity for data that they haven’t seen.
Geoffrey Hinton described the best way to build a neural network model:
“Cause it to overfit, and then regularize it to death.”
Regularization for hyperparameters helps modify the gradient so that it doesn’t step in
directions that lead it to overfit.
Regularization includes the following:
• Dropout
• DropConnect
• L1 penalty
• L2 penalty
Dropout and DropConnect mute parts of the input to each layer, such that the neural network
learns other portions. Zeroing-out parts of the data causes a neural network to learn more
general representations. Regularization works by adding an extra term to the normal gradient
computed.
Dropout
Dropout is a mechanism used to improve the training of neural networks by omitting hidden units; it also speeds up training. Dropout works by randomly dropping neurons so that they do not contribute to the forward pass or to backpropagation.
DropConnect
DropConnect does the same thing as Dropout, but instead of muting a hidden unit, it mutes the connection between two neurons.
L1
The penalty methods L1 and L2, in contrast, are a way of preventing the neural network parameter space from getting too big in one direction; they make large weights smaller. L1 regularization is considered computationally inefficient in the non-sparse case, but it has sparse outputs and includes built-in feature selection. L1 regularization penalizes the absolute value of the weights rather than their squares. This drives many weights to zero while allowing a few to grow large, making the weights easier to interpret.
L2
In contrast, L2 regularization is computationally efficient because it has an analytical solution, and it produces non-sparse outputs, but it does not do feature selection automatically for us. The L2 regularization function, a common and simple hyperparameter choice, adds a term to the objective function that penalizes the squared weights: you multiply half the sum of the squared weights by a coefficient called the weight cost. L2 improves generalization, smooths the output of the model as the input changes, and helps the network ignore weights it does not use.
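
A sketch that puts the pieces together, assuming NumPy: the L1 and L2 terms are added to the ordinary data loss, and dropout randomly mutes hidden activations during training (all values are hypothetical):

import numpy as np

rng = np.random.default_rng(3)
weights = rng.normal(size=(4, 4))
data_loss = 1.0              # placeholder for the usual loss term
l1, l2 = 1e-4, 1e-3          # penalty coefficients (hypothetical values)

# L1 penalizes absolute weights (drives many toward zero); L2 penalizes squared
# weights (keeps all of them small). Both are added to the objective.
loss = data_loss + l1 * np.abs(weights).sum() + 0.5 * l2 * (weights ** 2).sum()

def dropout(activations, p_drop=0.5):
    # Randomly mute units during training; rescale so the expected
    # activation matches what the network sees at test time.
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1 - p_drop)

h = rng.normal(size=(2, 4))
print(dropout(h))  # roughly half the activations are zeroed on each call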
