02 Deep Feedforward Learning - Notes

It's a deep learning notes

Uploaded by

pavanibodhireddi
Copyright
© © All Rights Reserved
06-09-2024

Deep Learning
Unit – 2

Deep Feedforward Learning


Topics Covered

1. What is a Neural Network
2. Models of a Neuron
3. Deep Feedforward Networks
4. Learning XOR
5. Gradient-Based Learning
6. Hidden Units
7. Architecture Design
8. Backpropagation

1. What is a Neural Network

The human brain is a highly complex, nonlinear and parallel computer.

It has the capacity to organize its structural constituents, known as neurons, so as to perform computations many times faster than the fastest digital computer in existence today.

A neural network is a massively parallel distributed processor made up of simple processing units, which has a natural propensity for storing experiential knowledge and making it available for use.

It resembles the brain in two respects:

1. Knowledge is acquired by the network from its environment through a learning process.

2. Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge.

The procedure used to perform the learning process is called a learning algorithm.

Its function is to modify the synaptic weights of the network to attain a desired
design objective.

Neural networks are also referred to as neurocomputers, or parallel distributed processors.

Benefits of Neural Networks

A neural network derives its computing power through

1. its massively parallel distributed structure

2. its ability to learn and generalize.

Properties and Capabilities of Neural Networks

1. Nonlinearity: Nonlinearity is a highly important property, particularly if the underlying physical mechanism responsible for generation of the input signal is inherently nonlinear.
2. Input-output mapping: Supervised learning, working through training samples or task examples.
3. Adaptivity: Adapting the synaptic weights to changes in the surrounding environment.
4. Evidential response
5. Contextual information
6. Fault tolerance
7. VLSI implementability
8. Uniformity of analysis and design
9. Neurobiological analogy

Human Brain

Central to the nervous system is the brain.

It is represented by the neural net.

The brain continually receives the information, perceives it, and makes
appropriate decisions.

The arrows pointing from left to right indicate the forward transmission of
information-bearing signals through the system.

The arrows pointing from right to left signify the presence of feedback in the
system.

It is estimated that there are approximately 10 billion neurons and 60 trillion synapses or connections in the human brain.

Synapses are elementary structural and functional units that mediate the
interactions between neurons.

2. Models of a Neuron

A neuron is an information-processing unit that is fundamental to the operation of a neural network.

Its model can be shown in the following block diagram.
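
As a minimal sketch of such a unit, assuming the standard model in which a neuron forms a weighted sum of its inputs, adds a bias, and passes the result through an activation function. The name `neuron_output`, the sigmoid activation, and the sample values are illustrative assumptions, not taken from the original diagram.

```python
import math

def sigmoid(v):
    """Logistic activation, squashing v into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    v = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(v)

# Hypothetical inputs and synaptic weights, chosen only for illustration.
y = neuron_output([1.0, 0.5], [0.4, -0.2], bias=0.1)
```

Learning, in this picture, amounts to adjusting `weights` and `bias`.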

3. Deep Feedforward Networks

Deep learning uses artificial neural networks to perform sophisticated computations on large amounts of data.

The working of deep learning is based on the structure and functioning of the human brain.

Deep feedforward networks, also called feedforward neural networks or multilayer perceptrons (MLPs), are the quintessential deep learning models.

A perceptron is an artificial neuron, the basic unit of a neural network: it computes a weighted sum of its inputs and applies an activation function to produce its output.

In feedforward networks, the information flows through the function being evaluated
from x, through the intermediate computations used to define f, and finally to the
output y.

There are no feedback connections in which outputs of the model are fed back into
itself.

When feedforward neural networks are extended to include feedback connections, they
are called recurrent neural networks.

Feedforward networks form the basis of many important commercial applications.

For example, the convolutional networks used for object recognition from photos are
a specialized kind of feedforward network.

Feedforward networks are a conceptual stepping stone on the path to recurrent networks, which power many natural language applications.

Feedforward neural networks are typically represented by composing together many different functions.

For example, we might have three functions $f^{(1)}$, $f^{(2)}$ and $f^{(3)}$ connected in a chain, to form
$f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$.

In this case, $f^{(1)}$ is called the first layer of the network, $f^{(2)}$ is called the second layer, and so on.

The overall length of the chain gives the depth of the model.

These chain structures are the most commonly used structures of neural networks.
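
The chain structure can be sketched in a few lines. The three functions below are hypothetical stand-ins for the layers (they are not real network layers), and `compose` is a helper name introduced only for this illustration.

```python
from functools import reduce

def compose(*layers):
    """Return f(x) = layers[-1](... layers[1](layers[0](x)))."""
    return lambda x: reduce(lambda value, layer: layer(value), layers, x)

# Hypothetical stand-ins for the first, second and third layers.
f1 = lambda x: 2 * x
f2 = lambda x: x + 1
f3 = lambda x: x * x

layers = [f1, f2, f3]
f = compose(*layers)   # f(x) = f3(f2(f1(x)))
depth = len(layers)    # the length of the chain is the depth of the model
```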

The final layer of a feedforward network is called the output layer.

During neural network training, we drive f(x) to match f*(x).

The training data provides us with noisy, approximate examples of f*(x) evaluated at
different training points.

Each example x is accompanied by a label y ≈ f*(x).

The training examples specify directly what the output layer must do at each point x; it
must produce a value that is close to y.

The behavior of the other layers is not directly specified by the training data.

In the NN, the first layer is called the input layer and the last layer is called the output
layer.

The layers in between the first and last layers are called the hidden layers, as their
results are not shown to the outer world.

4. Learning XOR

XOR stands for 'exclusive or'.

The XOR function returns a true value only if its two inputs are different.

If the two inputs are identical, the XOR function returns a false value.

The following table shows the inputs and outputs of the XOR function.

x1 x2 y
0 0 0
0 1 1
1 0 1
1 1 0

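In contrast to the OR problem and the AND problem, the XOR problem cannot be linearly separated by a single boundary line.

Plotted in the (x1, x2) plane, the true and false outputs of OR, and likewise of AND, can be separated by one straight line, but no single straight line separates the XOR points (0, 1) and (1, 0) from (0, 0) and (1, 1).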

Defining a Neural Network for XOR Problem

Now, we define a Neural Network that is able to solve the XOR problem.

As we know, the XOR function cannot be linearly separated by one boundary line.

This is also the reason why the function cannot be solved by a simple Perceptron.

We need a more complex model.

With the correct choice of functions and weight parameters, a Neural Network with
one hidden layer is able to solve the XOR problem.

Let's define the Neural Network we need.

In our model, the activation function is a simple threshold function.

If a certain threshold value is exceeded, the function returns output 1, otherwise 0.

So, the threshold value indicates from which value a neuron "fires".

In addition to the threshold values, we also need to define the weight parameters.

Question?

Why do we choose these specific weights and threshold values for the Network?

Let's look at what happens in each layer.

Hidden Unit - 1

Hidden Unit - 2

Output Unit

Calculations

Let's check if we really get the outputs of the XOR-problem with these formulas.

Case 1

The first case has the inputs x1 = 0 and x2 = 0 and the output should be y = 0.

Case 2

The second case has the inputs x1 = 0 and x2 = 1 and the output should be y = 1.

Case 3

The third case has the inputs x1 = 1 and x2 = 0 and the output should be y = 1.

Case 4

The fourth case has the inputs x1 = 1 and x2 = 1 and the output should be y = 0.

Conclusion

The XOR function can't be solved by a simple Perceptron.

We need a Neural Network with at least one hidden layer.
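
The four cases can be checked with a small threshold network. The weights and thresholds below are assumptions chosen for this sketch (one hidden unit behaving like OR, the other like AND); they are one workable choice and are not necessarily the values used in the original slides.

```python
def fires(inputs, weights, threshold):
    """Threshold unit: returns 1 if the weighted sum exceeds the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

def xor_net(x1, x2):
    # Assumed parameters, for illustration only.
    h1 = fires([x1, x2], [1, 1], 0.5)      # hidden unit 1 behaves like OR
    h2 = fires([x1, x2], [1, 1], 1.5)      # hidden unit 2 behaves like AND
    return fires([h1, h2], [1, -1], 0.5)   # fires when OR is true and AND is false

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))
```

Each hidden unit draws one boundary line in the input plane, and the output unit combines the two, which is what a single perceptron cannot do.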

5. Gradient-Based Learning

Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable function.

Gradient descent in deep learning is simply used to find the values of a function's
parameters that minimize a cost function as far as possible.
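
A minimal sketch of the update rule, assuming a one-parameter toy cost $J(\theta) = (\theta - 3)^2$. The function name `gradient_descent`, the learning rate, and the step count are illustrative choices, not values from these notes.

```python
def gradient_descent(grad, theta, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from theta."""
    for _ in range(steps):
        theta = theta - learning_rate * grad(theta)
    return theta

# Toy cost J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_min = gradient_descent(lambda t: 2 * (t - 3), theta=0.0)
```

This toy cost is convex, so the iterates approach the minimizer at 3 from any starting point; for the nonconvex losses of neural networks the same loop only finds a point of low cost.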

The largest difference between the linear models we have seen so far and neural
networks is that the nonlinearity of a neural network causes most interesting loss
functions to become nonconvex.

The neural networks are usually trained by using iterative, gradient-based optimizers
that merely drive the cost function to a very low value, rather than the linear equation
solvers used to train linear regression models.

Convex optimization converges starting from any initial parameters.

Stochastic gradient descent applied to nonconvex loss functions has no such convergence guarantee and is sensitive to the values of the initial parameters.

For feedforward neural networks, it is important to initialize all weights to small random values.

The biases may be initialized to zero or to small positive values.
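
One way to follow this advice is sketched below. The helper name `init_layer`, the uniform distribution, and the scale of 0.01 are assumptions made for illustration; the notes do not prescribe a particular distribution or range.

```python
import random

def init_layer(n_in, n_out, scale=0.01, seed=0):
    """Small random weights (so units start out different) and zero biases."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

# A hypothetical layer with 2 inputs and 3 units.
weights, biases = init_layer(n_in=2, n_out=3)
```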

Cost Functions

An important aspect of the design of a deep neural network is the choice of the cost
function.

The cost functions for neural networks are more or less the same as those for other
parametric models, such as linear models.

In most cases, our parametric model defines a distribution p(y | x; θ) and we simply use
the principle of maximum likelihood.

1. Learning Conditional Distributions with Maximum Likelihood

Most modern neural networks are trained using maximum likelihood.

This means that the cost function is simply the negative log-likelihood.

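This cost function is given by

$J(\theta) = -\mathbb{E}_{x, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x)$

The specific form of the cost function changes from model to model, depending on the specific form of $\log p_{\text{model}}$.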

The expansion of J(θ) typically yields some terms that do not depend on the model
parameters and may be discarded.

An advantage of this approach of deriving the cost function from maximum likelihood is
that it removes the burden of designing cost functions for each model.

Specifying a model $p(y \mid x)$ automatically determines a cost function $\log p(y \mid x)$.
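
As an illustration, assume a Bernoulli model in which the network outputs $p(y = 1 \mid x)$ for a binary label. The negative log-likelihood averaged over the training examples is then the cost. The function name and the sample probabilities below are hypothetical.

```python
import math

def negative_log_likelihood(probs, labels):
    """Average of -log p(y | x) under a Bernoulli model.

    probs[i] is the model's p(y = 1 | x_i); labels[i] is 0 or 1.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        likelihood = p if y == 1 else 1.0 - p
        total -= math.log(likelihood)
    return total / len(labels)

# A model that is confidently right costs less than one that is unsure.
confident = negative_log_likelihood([0.9, 0.1], [1, 0])
unsure = negative_log_likelihood([0.5, 0.5], [1, 0])
```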

2. Learning Conditional Statistics

Instead of learning a full probability distribution p(y|x; θ), we often want to learn just one
conditional statistic of y given x.

For example, we may have a predictor f(x; θ) that we wish to employ to predict the mean of y.

If the neural network is sufficiently powerful, we can think of it as being able to represent any function f from a wide class of functions.

From this point of view, we can view the cost function as being a functional rather than
just a function.

A functional is a mapping from functions to real numbers.

We can thus think of learning as choosing a function rather than merely choosing a set of
parameters.

Now, we can design the cost functional to have its minimum occur at some specific function
we desire.

Output Units

The choice of cost function is tightly coupled with the choice of output unit.

Most of the time, we simply use the cross-entropy between the data distribution and
the model distribution.

The choice of how to represent the output then determines the form of the cross-entropy function.

Any kind of neural network unit that may be used as an output can also be used as a
hidden unit.

Suppose that the feedforward network provides a set of hidden features defined by
h = f(x; θ).

The role of the output layer is then to provide some additional transformation from
the features to complete the task that the network must perform.

6. Hidden Units

The design of hidden units is an extremely active area of research and does not yet have
many definitive guiding theoretical principles.

A hidden unit refers to the components comprising the layers of processors between
input and output units in a connectionist system.

It is usually impossible to predict in advance which type of hidden unit will work best.

The design process consists of trial and error, intuiting that a kind of hidden unit may
work well, and evaluating its performance on a validation set.

Some hidden units are not differentiable at all input points.

For example, the rectified linear function g(z)=max{0, z} is not differentiable at z = 0.

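This may seem like it invalidates g for use with a gradient-based learning algorithm. In practice, gradient descent still performs well enough, because training rarely lands exactly on z = 0 and implementations simply return one of the one-sided derivatives there.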

Most hidden units can be described as accepting a vector of inputs x, computing an affine transformation $z = W^\top x + b$, and then applying an element-wise nonlinear function g(z).

Most hidden units are distinguished from each other only by the choice of the form of
the activation function g(z).

Rectified Linear Units and Their Generalizations

Rectified linear units use the activation function g(z)=max{0, z}.

Rectified linear units are easy to optimize due to similarity with linear units.

• The only difference from a linear unit is that a rectified linear unit outputs 0 across half its domain
• The derivative is 1 everywhere that the unit is active
• Thus the gradient direction is far more useful for learning than it would be with activation functions that introduce second-order effects

Rectified linear units are typically used on top of an affine transformation:

$h = g(W^\top x + b)$.
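
A small sketch of a rectified linear layer and its derivative. The weights are arbitrary illustration values, storing the weights of hidden unit j in column j of `W` is an assumption of this sketch, and returning 0 for the derivative at z = 0 is one common convention rather than a mathematical fact.

```python
def relu(z):
    """Rectified linear activation g(z) = max{0, z}."""
    return max(0.0, z)

def relu_grad(z):
    """Derivative of g: 1 where the unit is active, 0 where it is not.

    At z = 0 the derivative is undefined; this sketch returns 0 there.
    """
    return 1.0 if z > 0 else 0.0

def relu_layer(x, W, b):
    """h = g(W^T x + b); column j of W holds the weights of hidden unit j."""
    return [relu(sum(W[i][j] * x[i] for i in range(len(x))) + b[j])
            for j in range(len(b))]

# Hypothetical layer with 2 inputs and 2 hidden units.
h = relu_layer([1.0, 2.0], W=[[1.0, -1.0], [0.5, -0.5]], b=[0.0, 0.0])
```

Here the first unit is active and passes its input through unchanged, while the second is inactive and outputs 0.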

7. Architecture Design

Another key design consideration for neural networks is determining the architecture.

The word architecture refers to the overall structure of the network:

• how many units the network should have
• how these units should be connected to each other.

Most neural networks are organized into groups of units called layers.

Most neural network architectures arrange these layers in a chain structure, with each
layer being a function of the layer that preceded it.

In this structure, the first layer is given by

$h^{(1)} = g^{(1)}(W^{(1)\top} x + b^{(1)})$

The second layer is given by

$h^{(2)} = g^{(2)}(W^{(2)\top} h^{(1)} + b^{(2)})$

and so on.
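
The chain of layers can be sketched as a loop in which each layer consumes the output of the one before it. The 2-2-1 network below is hypothetical: its weights are illustration values, chosen so that the network happens to reproduce the XOR table, and the helper names `dense` and `forward` are introduced only for this sketch.

```python
def dense(x, W, b, g):
    """One layer: g(W^T x + b), with column j of W feeding unit j."""
    return [g(sum(W[i][j] * x[i] for i in range(len(x))) + b[j])
            for j in range(len(b))]

def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def forward(x, layers):
    """Feed x through the chain; each layer is a function of the previous one."""
    h = x
    for W, b, g in layers:
        h = dense(h, W, b, g)
    return h

# Depth 2: one hidden layer of width 2, one output layer of width 1.
layers = [
    ([[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0], relu),  # first layer
    ([[1.0], [-2.0]], [0.0], identity),             # second layer
]
outputs = [forward([a, b], layers)[0]
           for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

The depth of the model is `len(layers)` and the width of each layer is the length of its bias vector.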

In these chain-based (or multi-layer) architectures, the main architectural considerations are choosing the depth of the network and the width of each layer.

In some cases, a network with even one hidden layer is sufficient to fit the training
set.

Deeper networks are often able to use far fewer units per layer and far fewer
parameters, as well as frequently generalizing to the test set, but they also tend to be
harder to optimize.

The ideal network architecture for a task must be found via experimentation guided
by monitoring the validation set error.

Universal Approximation Properties and Depth

A linear model, mapping from features to outputs via matrix multiplication, can
represent only linear functions.

It has the advantage of being easy to train because many loss functions result in
convex optimization problems when applied to linear models.

Unfortunately, we often want our systems to learn nonlinear functions.

The universal approximation theorem means that regardless of what function we are
trying to learn, we know that a large MLP will be able to represent the function.

However, we are not guaranteed that the training algorithm will be able to learn that
function.

Even if the MLP is able to represent the function, learning can fail for two different
reasons.

1. the optimization algorithm used for training may not be able to find the value of
the parameters that corresponds to the desired function.

2. the training algorithm might choose the wrong function as a result of overfitting.

Feedforward networks provide a universal system for representing functions in the sense that, given a function, there exists a feedforward network that approximates the function.

8. Backpropagation

Backpropagation, or backward propagation of errors, is an algorithm that propagates the output error backward, from the output nodes toward the input nodes, to determine how much each weight contributed to that error.

It is an important mathematical tool for improving the accuracy of predictions in data mining and machine learning.

Types of backpropagation networks

1. Static backpropagation: Static backpropagation is a network developed to map static inputs to static outputs. Static backpropagation networks can solve static classification problems, such as optical character recognition (OCR).

2. Recurrent backpropagation: The recurrent backpropagation network is used for fixed-point learning. Recurrent backpropagation activation feeds forward until it reaches a fixed value.

Backpropagation Algorithm

Artificial neural networks use backpropagation as a learning algorithm to compute the gradient of the loss with respect to the weights; gradient descent then uses this gradient to update the weights.

By comparing desired outputs to achieved system outputs, the systems are tuned by
adjusting connection weights to narrow the difference between the two as much as
possible.

The algorithm gets its name because the weights are updated backward, from output to
input.

The advantages of using a backpropagation algorithm are as follows:

1. It does not have any parameters to tune except for the number of inputs.
2. It is highly adaptable and efficient and does not require any prior knowledge
about the network.
3. It is a standard process that usually works well.
4. It is user-friendly, fast and easy to program.
5. Users do not need to learn any special functions.

The disadvantages of using a backpropagation algorithm are as follows:

1. It prefers a matrix-based approach over a mini-batch approach.
2. Data mining is sensitive to noise and irregularities.
3. Performance is highly dependent on input data.
4. Training is time- and resource-intensive.

What is the objective of a backpropagation algorithm?

Backpropagation algorithms are used extensively to train feedforward neural networks in areas such as deep learning.

These algorithms efficiently compute the gradient of the loss function with respect to
the network weights.

This approach eliminates the inefficient process of directly computing the gradient with
respect to each individual weight.

It enables the use of gradient methods, like gradient descent or stochastic gradient
descent, to train multilayer networks and update weights to minimize loss.

What is a backpropagation algorithm in machine learning?

Backpropagation is used in supervised machine learning: it requires a known, desired output for each input value in order to calculate the gradient of the loss function, which measures how a prediction differs from the actual result.

Along with classifiers such as Naïve Bayesian filters and decision trees, the
backpropagation training algorithm has emerged as an important part of machine
learning applications that involve predictive analytics.

Setting the Model Components for a Backpropagation NN

Imagine that we have a deep neural network that we need to train.

The purpose of training is to build a model that performs the exclusive OR (XOR)
functionality with two inputs and three hidden units, such that the training set (truth
table) looks something like the following:

X1 X2 Y
0 0 0
0 1 1
1 0 1
1 1 0

We also need an activation function that determines the activation value at every
node in the neural net.

For simplicity, let’s choose an identity activation function: f(a) = a.

We also need a hypothesis function that determines the input to the activation
function.

h(X) = W0·X0 + W1·X1 + W2·X2

Here, X0 is the bias and W0 is its weight.

Building a Neural Network

The neural network for learning XOR can be drawn as follows:

The leftmost layer is the input layer, which takes X0 as the bias term of value one, and X1 and X2 as input features.

The layer in the middle is the first hidden layer, which also takes a bias term Z0 of value one.

Finally, the output layer has only one output unit D0 whose activation value is the actual output of the model (i.e. h(x)).

How Forward Propagation Works

This goes through two steps that happen at every node/unit in the network:

1. Getting the weighted sum of inputs of a particular unit using the h(x) function we
defined earlier.

2. Plugging the value we get from step one into the activation function f(a)=a. The output of the activation function is then used as the input feature for the connected nodes in the next layer.

Units X0, X1, X2 and Z0 do not have any units connected to them providing inputs.

Therefore, the steps mentioned above do not occur in those nodes.

However, for the rest of the nodes/units, this is how it all happens throughout the neural net for the first input sample in the training set:

Unit Z1:

h(x) = W0·X0 + W1·X1 + W2·X2
     = 1·1 + 1·0 + 1·0
     = 1 = a

z = f(a) = a => z = f(1) = 1

Unit Z2:

h(x) = W0·X0 + W1·X1 + W2·X2
     = 1·1 + 1·0 + 1·0
     = 1 = a

z = f(a) = a => z = f(1) = 1

Unit Z3:

h(x) = W0·X0 + W1·X1 + W2·X2
     = 1·1 + 1·0 + 1·0
     = 1 = a

z = f(a) = a => z = f(1) = 1

Unit D0:

h(x) = W0·Z0 + W1·Z1 + W2·Z2 + W3·Z3
     = 1·1 + 1·1 + 1·1 + 1·1
     = 4 = a

z = f(a) = a => z = f(4) = 4
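
The same forward pass can be written as a short sketch. It assumes, as the arithmetic above implies, that every weight starts at 1; the function name `forward` is introduced here for illustration.

```python
def h(weights, inputs):
    """Hypothesis function: weighted sum of the inputs (bias term included)."""
    return sum(w * x for w, x in zip(weights, inputs))

def f(a):
    """Identity activation."""
    return a

def forward(x1, x2):
    x = [1, x1, x2]                                    # X0 = 1 is the bias input
    z = [1] + [f(h([1, 1, 1], x)) for _ in range(3)]   # Z0 = 1, then Z1, Z2, Z3
    return f(h([1, 1, 1, 1], z))                       # output unit D0

predicted_y = forward(0, 0)
```

For the first training sample (0, 0), this evaluates to 4, matching the value computed for D0.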

As we mentioned earlier, the activation value (z) of the final unit (D0) is that of the whole
model.

Therefore, our XOR model predicts an output of 4 for the set of inputs {0, 0}.

Calculating the loss/cost of the current iteration would follow:

Loss = actual_y - predicted_y
     = 0 - 4
     = -4

The actual_y value comes from the training set, while the predicted_y value is what our
model yields.

So the cost at this iteration is equal to -4.

When do you use Backpropagation in NN

Backpropagation is used after each forward pass through the network.

The loss computed at the output is propagated backward, layer by layer, and each weight is adjusted according to how much it contributed to that loss.

Calculating Deltas in Backpropagation NN

We need to find the loss at every unit/node in the neural net.

The loss the model arrives at is a single number that accumulates the errors contributed by all the nodes.

Therefore, we need to find out how much of the loss each node in every layer is responsible for, so that we can penalize the nodes responsible for the most loss by giving them smaller weight values, thus lessening the total loss of the model.

Calculating the delta for every unit can be problematic.

However, we have a shortcut formula for the whole thing:

delta_0 = w · delta_1 · f'(z)

Here delta_0 and f'(z) belong to the unit itself, w is the weight of the link, and delta_1 is the loss of the unit on the other side of that weighted link.

For example: in order to get the loss of a node (e.g. Z0), we multiply the value of its corresponding f'(z) by the loss of the node it is connected to in the next layer (delta_1) and by the weight of the link connecting both nodes.

We do the delta calculation step at every unit, backpropagating the loss into the neural net, to find out what loss every node/unit is responsible for.

Let’s calculate those deltas.

delta_D0 = total_loss = -4
delta_Z0 = W · delta_D0 · f'(Z0) = 1 · (-4) · 1 = -4
delta_Z1 = W · delta_D0 · f'(Z1) = 1 · (-4) · 1 = -4
delta_Z2 = W · delta_D0 · f'(Z2) = 1 · (-4) · 1 = -4
delta_Z3 = W · delta_D0 · f'(Z3) = 1 · (-4) · 1 = -4
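
These deltas can be reproduced with a short sketch of the shortcut formula. It assumes the identity activation and the all-ones weights of the worked example; the helper names `f_prime` and `delta` are introduced only for this illustration.

```python
def f_prime(a):
    """Derivative of the identity activation f(a) = a."""
    return 1

def delta(weight, delta_next, a):
    """Shortcut formula: delta_0 = w * delta_1 * f'(z) for one unit."""
    return weight * delta_next * f_prime(a)

actual_y, predicted_y = 0, 4
delta_D0 = actual_y - predicted_y   # the loss of the output unit

# Every link into D0 has weight 1, and Z0..Z3 all have activation 1.
hidden_deltas = {name: delta(1, delta_D0, 1)
                 for name in ("Z0", "Z1", "Z2", "Z3")}
```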

There are a few things to note here:

1. The loss of the final unit (i.e. D0) is equal to the loss of the whole model. This is
because it is the output unit, and its loss is the accumulated loss of all the units
together.

2. The function f'(z) will always give the value one, no matter what the input (i.e. z) is equal to. This is because the activation function is the identity, f(a) = a, whose derivative is f'(a) = 1.

3. The input nodes/units (X0, X1 and X2) don’t have delta values, as there is nothing
those nodes control in the neural net. They are only there as a link between the
data set and the neural net. This is why the whole layer is usually not included in
the layer count.
