WINSEM2023-24 BITE410L TH VL2023240503970 2024-03-11 Reference-Material-I
By
Dr.Ramkumar.T
[email protected]
What is a Neural Network?
• Neural network: information processing
paradigm inspired by biological nervous
systems, such as our brain
Biological Neurons
• Our sense organs interact with the outside world
• They relay information to the neurons
• The neurons (may) get activated and produce a response (laughter in this case)
• Of course, in reality, it is not just a single neuron that does all this
• There is a massively parallel interconnected network of neurons
• Some of these neurons may fire (in red) in response to this information and in turn relay information to the other neurons they are connected to
• These neurons may also fire (again, in red) and the process continues, eventually resulting in a response (laughter in this case)
Definition of ANN
“Data processing system consisting of a large
number of simple, highly interconnected
processing elements (artificial neurons) in an
architecture inspired by the structure of the
cerebral cortex of the brain”
Biological Neuron vs. Artificial Neuron
• Four basic components of a human biological neuron
• The components of a basic artificial neuron
McCulloch – Pitts Neuron
• McCulloch (neuroscientist) and Pitts (logician) proposed a
highly simplified computational model of the neuron (1943)
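A minimal sketch of the idea, assuming Boolean inputs and a hand-chosen firing threshold (the function name and threshold values are illustrative, not from the slides):

```python
def mcculloch_pitts(inputs, threshold):
    """Fire (output 1) only if the number of active Boolean inputs reaches the threshold."""
    return 1 if sum(inputs) >= threshold else 0

# AND of two inputs: fire only when both are 1 (threshold = 2)
print(mcculloch_pitts([1, 1], threshold=2))   # 1
print(mcculloch_pitts([1, 0], threshold=2))   # 0

# OR of two inputs: fire when at least one input is 1 (threshold = 1)
print(mcculloch_pitts([0, 1], threshold=1))   # 1
print(mcculloch_pitts([0, 0], threshold=1))   # 0
```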
Perceptron model (1958)
• Inputs are no longer limited to Boolean values
• The bias helps the model to fit the given input data and controls the triggering of the output function
Perceptron model (1958)
• AND
• OR
• NOT
• NAND
• NOR
• A single perceptron cannot handle the XOR gate data because it is not linearly separable (see the sketch below)
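A minimal sketch with hand-picked (not learned) weights, showing that a single perceptron realizes AND and OR, while no single line separates the XOR data:

```python
import numpy as np

def perceptron(x, w, b):
    """Step-activation perceptron: output 1 if w.x + b > 0, else 0."""
    return int(np.dot(w, x) + b > 0)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Hand-picked weights and biases that realize AND and OR
for name, w, b in [("AND", np.array([1.0, 1.0]), -1.5),
                   ("OR",  np.array([1.0, 1.0]), -0.5)]:
    print(name, [perceptron(np.array(x), w, b) for x in inputs])
    # AND -> [0, 0, 0, 1], OR -> [0, 1, 1, 1]

# XOR targets are [0, 1, 1, 0]: the points are not linearly separable,
# so no choice of (w, b) lets a single perceptron reproduce this pattern.
```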
Feed-forward Neural Network
Multi-Layered Feed-forward Networks
• Feed-forward neural networks, also known as multi-layered networks of neurons, are called "feed-forward" because information flows in one direction, from the input layer to the output layer, without looping back.
• Multi-layered networks have one input layer, one output layer and one or more hidden layer(s)
• The information only flows in one direction – from input x to output y. Hence the term feed-forward
• Training aims to minimize the loss function (a forward-pass sketch follows)
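A minimal sketch of one forward pass through such a network, assuming one hidden layer, sigmoid activations and small illustrative layer sizes (none of this is prescribed by the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Layer sizes: 3 inputs -> 4 hidden units -> 1 output
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

def forward(x):
    """Information flows one way only: input -> hidden -> output."""
    h = sigmoid(W1 @ x + b1)      # hidden layer activations
    y = sigmoid(W2 @ h + b2)      # output layer activation
    return y

x = np.array([0.5, -1.0, 2.0])
print(forward(x))                 # a single value between 0 and 1
```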
Multi-layered Feed-forward Neural Network
Feed-forward Networks
Need for Sigmoid activation function
• The thresholding logic used by the step function is harsh
• There will always be a sudden change in the decision (from 0 to 1) whenever the weighted sum crosses the threshold value
• For most real-world applications we would expect a smoother decision function which gradually changes from 0 to 1
Sigmoid Activation Function
• The output of a sigmoid activation function ranges between 0 and 1.
• The output of a neuron that has a sigmoid activation function is very smooth and gives nice continuous derivatives, which works well when training a neural network.
• Because of its capability to provide continuous values in the range of 0 to 1, the sigmoid function is generally used to output the probability with respect to a given class in binary classification
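A minimal sketch of the sigmoid and its derivative, assuming the standard form sigma(z) = 1 / (1 + e^(-z)) (the sample values below are illustrative):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z)); the output always lies between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Smooth, continuous derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-6, 6, 5)
print(sigmoid(z))             # values change smoothly from near 0 to near 1
print(sigmoid_derivative(z))  # largest at z = 0, small but non-zero elsewhere
```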
Tanh Activation Function
• The input–output relationship for a tanh activation function is expressed as tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z)), where z is the weighted sum of the neuron's inputs
• The output of tanh ranges between −1 and 1
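A short sketch relating tanh to the sigmoid above; it uses the standard identity tanh(z) = 2·sigma(2z) − 1 (not stated on the slides, but easy to verify):

```python
import numpy as np

def tanh_manual(z):
    """tanh(z) = (e^z - e^-z) / (e^z + e^-z); the output lies between -1 and 1."""
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

z = np.linspace(-3, 3, 7)
print(np.allclose(tanh_manual(z), np.tanh(z)))                           # True
print(np.allclose(tanh_manual(z), 2.0 / (1.0 + np.exp(-2 * z)) - 1.0))  # tanh as a rescaled sigmoid
```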
How does ANN learn?
• The learning objective of ANN is to minimize the cost
function for better prediction
• How can we minimize the cost function?
• The neural network makes predictions using forward propagation.
• If we can change some values in the forward propagation, we can predict the correct output and minimize the loss.
• But what values can we change in the forward propagation? Obviously, we can't change the input and output.
• Gradient descent – an optimization technique used to learn the optimal values of the randomly initialized weight matrices
• With the optimal values of weights, ANN can predict the correct output and minimize the loss.
Gradient descent
• Plot the values of ‘J(w)’ – i.e., the cost – against ‘w’ – i.e., the weight
• Gradients are derivatives; they are the slope of a tangent line to the cost curve.
• There exists a value of the parameter ‘w’ which gives the minimum cost ‘J(w)’
• The gradient of the cost function is calculated as the partial derivative of the cost function ‘J’ with respect to each model parameter ‘wj’; in notation, ∂J/∂wj
• In the gradient descent algorithm, we start with random model parameters and calculate the cost
• We keep updating the model parameters to move closer to the values that result in minimum cost.
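A minimal numerical sketch of this loop on a one-dimensional cost, assuming J(w) = (w − 3)^2 so that dJ/dw = 2(w − 3); the cost function, starting point and learning rate are illustrative:

```python
def cost(w):
    return (w - 3) ** 2          # J(w), minimized at w = 3

def gradient(w):
    return 2 * (w - 3)           # dJ/dw, the slope of the tangent line at w

w = 10.0                         # start from an arbitrary (random-like) parameter
alpha = 0.1                      # learning rate
for step in range(50):
    w = w - alpha * gradient(w)  # move against the gradient

print(w, cost(w))                # w is now close to 3 and the cost close to 0
```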
Back propagation of Network
• With gradient descent, we move our weights to a position where the cost is minimum. But how do we update the weights?
• We back propagate the network from the output layer to the input layer and calculate the gradient of the cost function with respect to all the weights between the output and the input layer
• After calculating the gradients, the old weights can be updated using the weight update rule: w_new = w_old − α · ∂J/∂w
• α (alpha) is called the learning rate – it is used to control the amount of weight adjustment at each step of training.
• This whole process of back propagating the network from the output layer to the input layer and updating the weights of the network using gradient descent to minimize the loss is called back propagation.
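A minimal sketch of this forward/backward cycle for a single sigmoid neuron with squared-error loss; the chain rule gives ∂J/∂w, which then feeds the update w ← w − α·∂J/∂w (the data, sizes and learning rate are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, target = np.array([0.5, -1.0]), 1.0          # one training example
w, b, alpha = np.array([0.1, 0.2]), 0.0, 0.5    # initial weights, bias, learning rate

for epoch in range(100):
    # Forward pass
    z = w @ x + b
    y = sigmoid(z)
    loss = 0.5 * (y - target) ** 2

    # Backward pass: chain rule from the loss back to the weights
    dJ_dy = y - target                # dJ/dy
    dy_dz = y * (1 - y)               # derivative of the sigmoid
    dJ_dw = dJ_dy * dy_dz * x         # dJ/dw for each weight
    dJ_db = dJ_dy * dy_dz             # dJ/db

    # Weight update rule: new value = old value - alpha * gradient
    w = w - alpha * dJ_dw
    b = b - alpha * dJ_db

print(loss)                            # the loss shrinks as training proceeds
```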
Role of Learning Rate
• If the learning rate is large, then we take a large step and our gradient
descent will be fast, but we might fail to reach the global minimum and
become stuck at a local minimum.
Ways of doing Gradient descent
• Batch gradient descent
• Uses all of the training instances to update the model parameters in each iteration (all examples at once)
• Converges slowly with accurate estimates of the error gradient
• Stochastic Gradient Descent (SGD)
• Updates the parameters using only a single training instance in each iteration; the training instance is usually selected randomly (one sample at a time)
• Converges fast, with noisy estimates of the error gradient.
• Mini-batch Gradient Descent
• Divides the training set into smaller subsets called mini-batches, denoted by ‘b’; a mini-batch of size ‘b’ is used to update the model parameters in each iteration.
• Mini-batch gradient descent is the most common
implementation of gradient descent used in the field of deep
learning
• The smaller batch size makes the learning process faster, but the
variance of the validation dataset accuracy is higher.
• A bigger batch size has a slower learning process, but the variance of the validation dataset accuracy is lower.
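A minimal sketch contrasting the three variants on a toy linear model; the data, the compute_gradient helper and the batch size are all illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))                    # 12 training instances, 3 features
y = X @ np.array([1.0, -2.0, 0.5])              # toy regression targets
w, alpha = np.zeros(3), 0.1

def compute_gradient(X_part, y_part, w):
    """Gradient of the mean squared error over the given instances."""
    error = X_part @ w - y_part
    return 2.0 * X_part.T @ error / len(y_part)

# Batch gradient descent: every update uses all training instances
w_batch = w - alpha * compute_gradient(X, y, w)

# Stochastic gradient descent: every update uses one randomly chosen instance
i = rng.integers(len(y))
w_sgd = w - alpha * compute_gradient(X[i:i + 1], y[i:i + 1], w)

# Mini-batch gradient descent: every update uses a small batch of size b
b = 4
idx = rng.choice(len(y), size=b, replace=False)
w_mini = w - alpha * compute_gradient(X[idx], y[idx], w)
```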
Epoch
• The number of times a whole dataset is passed through the
neural network model is called an epoch
• One epoch means that the training dataset is passed forward
and backward through the neural network once
• Too small a number of epochs results in underfitting because the neural network has not learned enough
• On the other hand, too many epochs will lead to overfitting, where the model can predict the training data very well but cannot predict new, unseen data (test data) well enough
• Assume that the total number of training records is 12000; if the batch size is 6000, then 2 iterations are needed to complete 1 epoch
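The arithmetic from the example, written out as a small check (assuming each iteration processes one mini-batch):

```python
import math

total_records = 12000
batch_size = 6000

iterations_per_epoch = math.ceil(total_records / batch_size)
print(iterations_per_epoch)   # 2 iterations are needed to complete 1 epoch
```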