02 Deep Feedforward Learning - Notes
Deep Learning
Unit – 2
The procedure used to perform the learning process is called a learning algorithm.
Its function is to modify the synaptic weights of the network to attain a desired
design objective.
Human Brain
The brain continually receives information, perceives it, and makes
appropriate decisions.
The arrows pointing from left to right indicate the forward transmission of
information-bearing signals through the system.
The arrows pointing from right to left signify the presence of feedback in the
system.
Synapses are elementary structural and functional units that mediate the
interactions between neurons.
2. Models of a Neuron
The working of deep learning is based on the structure and functioning of the human
brain.
A perceptron is an artificial neuron, or neural network unit, that performs certain
computations to detect features in the input data; it is widely used in real-time applications.
In feedforward networks, the information flows through the function being evaluated
from x, through the intermediate computations used to define f, and finally to the
output y.
There are no feedback connections in which outputs of the model are fed back into
itself.
When feedforward neural networks are extended to include feedback connections, they
are called recurrent neural networks.
For example, the convolutional networks used for object recognition from photos are
a specialized kind of feedforward network.
For example, we might have three functions f(1), f(2), and f(3) connected in a chain, to form
f(x) = f(3)(f(2)(f(1)(x))).
In this case, f(1) is called the first layer of the network, f(2) is called the second layer, and
so on.
The overall length of the chain gives the depth of the model.
These chain structures are the most commonly used structures of neural networks.
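As a small illustration of this chain structure (the three layer functions below are hypothetical placeholders, not an example taken from these notes), the sketch composes three functions so that the output of each layer becomes the input of the next:

import numpy as np

# Hypothetical layer functions; any transformations could play these roles.
def f1(x):
    return 2.0 * x + 1.0        # first layer

def f2(h):
    return np.tanh(h)           # second layer

def f3(h):
    return h ** 2               # third layer (output layer)

def f(x):
    # Depth-3 chain: f(x) = f3(f2(f1(x)))
    return f3(f2(f1(x)))

print(f(np.array([0.5, -1.0])))  # one forward pass through the chain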
The training data provides us with noisy, approximate examples of f*(x) evaluated at
different training points.
The training examples specify directly what the output layer must do at each point x; it
must produce a value that is close to y.
The behavior of the other layers is not directly specified by the training data.
In the NN, the first layer is called the input layer and the last layer is called the output
layer.
The layers between the first and last layers are called hidden layers, because their
outputs are not exposed to the outside world.
4. Learning XOR
The output of the XOR function is true only if the two inputs are different.
If the two inputs are identical, the XOR function returns a false value.
The following table shows the inputs and outputs of the XOR function:
X1 X2 Y
0 0 0
0 1 1
1 0 1
1 1 0
In contrast to the OR problem and AND problem, the XOR problem cannot be
linearly separated by a single boundary line.
Now, we define a Neural Network that is able to solve the XOR problem.
As we know, the XOR function cannot be linearly separated by one boundary line.
This is also the reason why the function cannot be solved by a simple Perceptron.
With the correct choice of functions and weight parameters, a Neural Network with
one hidden layer is able to solve the XOR problem.
So, the threshold value indicates the value at which a neuron "fires".
In addition to the threshold values, we also need to define the weight parameters.
Question?
Why do we choose these specific weights and threshold values for the Network?
(Slide figures: the weights and threshold values of Hidden Unit 1, Hidden Unit 2, and the Output Unit.)
Calculations
Let's check if we really get the outputs of the XOR-problem with these formulas.
Case 1
The first case has the inputs x1 = 0 and x2 = 0 and the output should be y = 0.
Case 2
The second case has the inputs x1 = 0 and x2 = 1, and the output should be y = 1.
Case 3
The third case has the inputs x1 = 1 and x2 = 0, and the output should be y = 1.
Case 4
The fourth case has the inputs x1 = 1 and x2 = 1, and the output should be y = 0.
Conclusion
With the chosen weights and threshold values, the network reproduces the XOR outputs for all four input combinations.
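The slides' exact weights and thresholds are not reproduced above, so the sketch below uses one common choice of values (an assumption, not the notes' own numbers): Hidden Unit 1 acts as an OR gate, Hidden Unit 2 as an AND gate, and the Output Unit fires when the OR unit is active but the AND unit is not.

import numpy as np

def fires(z):
    # Threshold ("step") activation: the unit fires when its weighted
    # sum exceeds its threshold (already subtracted from z below).
    return int(z > 0)

def xor_net(x1, x2):
    x = np.array([x1, x2])
    h1 = fires(x @ np.array([1, 1]) - 0.5)   # Hidden Unit 1: OR,  threshold 0.5
    h2 = fires(x @ np.array([1, 1]) - 1.5)   # Hidden Unit 2: AND, threshold 1.5
    y = fires(1 * h1 - 1 * h2 - 0.5)         # Output Unit: fires for OR and not AND
    return y

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))           # prints 0, 1, 1, 0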
5. Gradient-Based Learning
Gradient descent in deep learning is used to find the values of a model's parameters
that minimize the cost function as far as possible.
The largest difference between the linear models we have seen so far and neural
networks is that the nonlinearity of a neural network causes most interesting loss
functions to become nonconvex.
Neural networks are usually trained with iterative, gradient-based optimizers that merely
drive the cost function to a very low value, rather than with the linear equation solvers
used to train linear regression models.
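As a minimal sketch of such an iterative, gradient-based optimizer (the quadratic cost, learning rate, and iteration count below are illustrative assumptions, not a model from these notes):

def cost(theta):
    # A simple one-parameter cost function J(theta) = (theta - 3)^2.
    return (theta - 3.0) ** 2

def grad(theta):
    # Its gradient dJ/dtheta.
    return 2.0 * (theta - 3.0)

theta = 0.0              # initial parameter value
learning_rate = 0.1

for step in range(100):
    theta -= learning_rate * grad(theta)   # step against the gradient

print(theta, cost(theta))  # theta approaches 3, where the cost is lowest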
Cost Functions
An important aspect of the design of a deep neural network is the choice of the cost
function.
The cost functions for neural networks are more or less the same as those for other
parametric models, such as linear models.
In most cases, our parametric model defines a distribution p(y | x; θ) and we simply use
the principle of maximum likelihood.
This means that the cost function is simply the negative log-likelihood.
The specific form of the cost function changes from model to model, depending on the
specific form of log p_model(y | x; θ).
The expansion of J(θ) typically yields some terms that do not depend on the model
parameters and may be discarded.
An advantage of this approach of deriving the cost function from maximum likelihood is
that it removes the burden of designing cost functions for each model.
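For reference, the standard maximum-likelihood cost (not copied from the slides, but the usual form this paragraph describes) is the negative log-likelihood, equivalently the cross-entropy between the training data and the model distribution:

J(\theta) = -\,\mathbb{E}_{\mathbf{x}, \mathbf{y} \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(\mathbf{y} \mid \mathbf{x}; \theta)

For example, if p_model(y | x) is a Gaussian with mean f(x; θ), this cost reduces to mean squared error, up to a scaling factor and a constant that do not depend on θ.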
Instead of learning a full probability distribution p(y|x; θ), we often want to learn just one
conditional statistic of y given x.
For example, we may have a predictor f(x; θ) that we wish to employ to predict the mean of y.
A sufficiently large neural network is able to represent any function f from a wide class
of functions.
From this point of view, we can view the cost function as being a functional rather than
just a function.
We can thus think of learning as choosing a function rather than merely choosing a set of
parameters.
Now, we can design the cost functional to have its minimum occur at some specific function
we desire.
Output Units
The choice of cost function is tightly coupled with the choice of output unit.
Most of the time, we simply use the cross-entropy between the data distribution and
the model distribution.
The choice of how to represent the output then determines the form of the cross-
entropy function.
Any kind of neural network unit that may be used as an output can also be used as a
hidden unit.
Suppose that the feedforward network provides a set of hidden features defined by
h = f(x; θ).
The role of the output layer is then to provide some additional transformation from
the features to complete the task that the network must perform.
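A brief sketch of this division of labour (the dimensions, random weights, and class label below are arbitrary assumptions): the hidden layers supply features h = f(x; θ), and a softmax output layer transforms h into a distribution over classes whose negative log-likelihood is the cross-entropy cost.

import numpy as np

rng = np.random.default_rng(0)

h = rng.normal(size=4)           # hidden features h = f(x; theta), assumed given
W = rng.normal(size=(3, 4))      # output-layer weights for 3 classes (arbitrary)
b = np.zeros(3)

z = W @ h + b                    # extra transformation applied by the output layer
p = np.exp(z - z.max())
p = p / p.sum()                  # softmax: the model distribution over classes

y = 1                            # assumed true class label
cross_entropy = -np.log(p[y])    # negative log-likelihood of the true class
print(p, cross_entropy)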
6. Hidden Units
The design of hidden units is an extremely active area of research and does not yet have
many definitive guiding theoretical principles.
A hidden unit refers to the components comprising the layers of processors between
input and output units in a connectionist system.
It is usually impossible to predict in advance which kind of hidden unit will work best.
The design process consists of trial and error, intuiting that a kind of hidden unit may
work well, and evaluating its performance on a validation set.
Some hidden unit activation functions are not differentiable at all input points; this may
seem like it invalidates g for use with a gradient-based learning algorithm, but in practice
gradient descent still performs well enough.
Most hidden units are distinguished from each other only by the choice of the form of
the activation function g(z).
Rectified linear units are easy to optimize because of their similarity to linear units.
The only difference from a linear unit is that a rectified linear unit outputs 0 across half
of its domain.
The derivative is 1 everywhere the unit is active, so the gradient direction is far more
useful than it is with activation functions that introduce second-order effects.
A rectified linear unit is typically applied to an affine transformation of its inputs:
h = g(W^T x + b), with g(z) = max(0, z).
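A minimal sketch of one rectified linear hidden layer, h = g(W^T x + b) with g(z) = max(0, z) (the layer sizes and random weights are illustrative assumptions):

import numpy as np

def relu(z):
    # g(z) = max(0, z): outputs 0 across half its domain, and its
    # derivative is 1 wherever the unit is active.
    return np.maximum(0.0, z)

rng = np.random.default_rng(1)
x = rng.normal(size=5)           # input vector
W = rng.normal(size=(5, 3))      # weights for 3 hidden units
b = np.zeros(3)

h = relu(W.T @ x + b)            # h = g(W^T x + b)
print(h)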
7. Architecture Design
Another key design consideration for neural networks is determining the architecture.
Most neural networks are organized into groups of units called layers.
Most neural network architectures arrange these layers in a chain structure, with each
layer being a function of the layer that preceded it.
In this chain, the first layer is given by h1 = g1(W1^T x + b1), the second layer by
h2 = g2(W2^T h1 + b2), and so on.
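A sketch of such a chain with two hidden layers (the layer sizes, random weights, and the choice of rectified linear activations are assumptions made for illustration):

import numpy as np

def g(z):
    return np.maximum(0.0, z)    # activation used by every layer in this sketch

rng = np.random.default_rng(2)
x = rng.normal(size=4)                         # input
W1, b1 = rng.normal(size=(4, 6)), np.zeros(6)  # first-layer parameters
W2, b2 = rng.normal(size=(6, 5)), np.zeros(5)  # second-layer parameters

h1 = g(W1.T @ x + b1)            # first layer: a function of the input x
h2 = g(W2.T @ h1 + b2)           # second layer: a function of h1, and so on
print(h1.shape, h2.shape)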
In some cases, a network with even one hidden layer is sufficient to fit the training
set.
Deeper networks are often able to use far fewer units per layer and far fewer
parameters, and they frequently generalize better to the test set, but they also tend to be
harder to optimize.
The ideal network architecture for a task must be found via experimentation guided
by monitoring the validation set error.
A linear model, mapping from features to outputs via matrix multiplication, can
represent only linear functions.
It has the advantage of being easy to train because many loss functions result in
convex optimization problems when applied to linear models.
The universal approximation theorem means that regardless of what function we are
trying to learn, we know that a large MLP will be able to represent the function.
However, we are not guaranteed that the training algorithm will be able to learn that
function.
Even if the MLP is able to represent the function, learning can fail for two different
reasons.
1. the optimization algorithm used for training may not be able to find the value of
the parameters that corresponds to the desired function.
2. the training algorithm might choose the wrong function as a result of overfitting.
8. Backpropagation
Backpropagation Algorithm
By comparing desired outputs to achieved system outputs, the systems are tuned by
adjusting connection weights to narrow the difference between the two as much as
possible.
The algorithm gets its name because the weights are updated backward, from output to
input.
Advantages of the backpropagation algorithm:
1. It does not have any parameters to tune except for the number of inputs.
2. It is highly adaptable and efficient and does not require any prior knowledge
about the network.
3. It is a standard process that usually works well.
4. It is user-friendly, fast and easy to program.
5. Users do not need to learn any special functions.
These algorithms efficiently compute the gradient of the loss function with respect to
the network weights.
This approach eliminates the inefficient process of directly computing the gradient with
respect to each individual weight.
It enables the use of gradient methods, like gradient descent or stochastic gradient
descent, to train multilayer networks and update weights to minimize loss.
Backpropagation requires a known, desired output for each input value in order to
calculate the loss function gradient (how a prediction differs from the actual result),
which makes it a form of supervised machine learning.
Along with classifiers such as Naïve Bayesian filters and decision trees, the
backpropagation training algorithm has emerged as an important part of machine
learning applications that involve predictive analytics.
The purpose of training is to build a model that performs the exclusive OR (XOR)
functionality with two inputs and three hidden units, such that the training set (truth
table) looks something like the following:
X1 X2 Y
0 0 0
0 1 1
1 0 1
1 1 0
We also need an activation function that determines the activation value at every
node in the neural net.
We also need a hypothesis function that determines the input to the activation
function.
This goes through two steps that happen at every node/unit in the network:
1. Getting the weighted sum of inputs of a particular unit using the h(x) function we
defined earlier.
2. Plugging the value we get from step one into the activation function f(a)=a. Using
the activation value we get the output of the activation function as the input
feature for the connected nodes in the next layer.
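A sketch of these two steps for a single unit (the slides' actual weights are not reproduced, so the numbers below are placeholders; treating X0 as a bias input of 1 is also an assumption):

import numpy as np

def h(x, w):
    # Step 1: the hypothesis function - the weighted sum of the unit's inputs.
    return np.dot(w, x)

def f(a):
    # Step 2: the activation function f(a) = a (identity), so the activation
    # value is passed on unchanged as an input feature for the next layer.
    return a

x = np.array([1.0, 0.0, 0.0])    # inputs X0 (assumed bias of 1), X1, X2
w = np.array([1.0, 1.0, 1.0])    # placeholder weights, not the slides' values

activation = f(h(x, w))          # output of this unit, fed forward to the next layer
print(activation)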
(Slide: worked forward-pass calculations for hidden unit Z3 and output unit D0.)
As we mentioned earlier, the activation value (z) of the final unit (D0) is the output of the
whole model.
Therefore, our XOR model predicts an output of one for the set of inputs {0, 0}.
The actual_y value comes from the training set, while the predicted_y value is what our
model yields.
The loss that the deep learning model arrives at is really the error contributed by all of
the nodes, accumulated into one number.
Therefore, we need to find out which node is responsible for the most loss in every
layer, so that we can penalize it by giving it a smaller weight value and thus lessen the
total loss of the model.
delta_0 = w . delta_1 . f'(z)
where delta_0, w, and f'(z) are values belonging to the same unit, while delta_1 is the loss
of the unit on the other side of the weighted link.
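A sketch of this delta rule for a single weighted link (the numbers match the worked example that follows: weight 1, a downstream delta of -4, and f'(z) = 1 because the activation is the identity):

def delta(w, delta_next, f_prime_z):
    # delta_0 = w * delta_1 * f'(z): the share of the downstream unit's
    # loss (delta_1) that is passed back through the link with weight w.
    return w * delta_next * f_prime_z

print(delta(1.0, -4.0, 1.0))     # -4.0, as in the delta_Z calculations below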
We do the delta calculation step at every unit, backpropagating the loss into the neural
net, and find out what loss every node/unit is responsible for. For example:
delta_D0 = total_loss = -4
delta_Z0 = W.delta_D0.f'(Z0) = 1.(-4).1 = -4
delta_Z1 = W.delta_D0.f'(Z1) = 1.(-4).1 = -4
delta_Z2 = W.delta_D0.f'(Z2) = 1.(-4).1 = -4
delta_Z3 = W.delta_D0.f'(Z3) = 1.(-4).1 = -4
1. The loss of the final unit (i.e. D0) is equal to the loss of the whole model. This is
because it is the output unit, and its loss is the accumulated loss of all the units
together.
2. The function f'(z) will always give the value one, no matter what the input (i.e. z) is
equal to. This is because, as we said earlier, the activation function is the identity f(a) = a,
whose derivative is f'(a) = 1.
3. The input nodes/units (X0, X1 and X2) don’t have delta values, as there is nothing
those nodes control in the neural net. They are only there as a link between the
data set and the neural net. This is why the whole layer is usually not included in
the layer count.