
Neural Network (Perceptrons)

Classification dataset
Hours Study   Pass   Pass (encoded)
2             No     0
3             No     0
4             Yes    1
5             No     0
6             Yes    1
7             Yes    1
8             Yes    1
9             Yes    1
10            ?      ?

● Here, let’s say we are trying to predict whether a student is going to pass based on hours studied
● Hours Study, in this case, is the feature and Pass is the label
● Pass is a categorical variable consisting of two values: {Yes, No}
● Predicting such a categorical outcome is referred to as a classification problem
● It is a supervised learning task

● Let’s try to plot the dataset

● A linear model struggles to fit such data

● Thus, we convert the linear hypothesis into a non-linear one to fit the data

● One such non-linear activation function used here is the sigmoid

● Such an activation function converts a linear hypothesis into a non-linear one
● A linear model combined with the sigmoid in this way is known as logistic regression
● Other activation functions are also available (ReLU, Tanh)
Sigmoid function
● If we are trying to fit a linear equation z = w·x + b to a non-linear pattern, we convert it
into a non-linear equation instead
● The conversion is done through non-linear activation functions
● One such function is the sigmoid function:

σ(z) = 1 / (1 + e^(-z))

Where, z = w·x + b [The linear function that we are trying to convert]

w = weight [Learnable parameter]

b = bias [Learnable parameter]

x = feature value [For multiple features, it is a feature matrix multiplied by a weight matrix]
Let’s assume,

Hours Study = x

Pass = y

● We’ll have to compute z = w·x + b and pass it through the non-linear function
● The challenge is to find appropriate values for w and b, which we can do through
gradient descent
● For gradient descent, we calculate the derivative of the loss function with respect to each
parameter and subtract it from that parameter to update it
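
Concretely, for a loss function L and a learning rate α (α is introduced again in the worked example later), each gradient descent step is:

\[
w \leftarrow w - \alpha\,\frac{\partial L}{\partial w}, \qquad b \leftarrow b - \alpha\,\frac{\partial L}{\partial b}
\]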
Loss function
Logistic regression uses the following loss function:

loss = -y·log(a) - (1-y)·log(1-a)

Where,

y = Ground truth
a = Predicted value

Intuition behind the loss function (the worked values below use base-10 logarithms, and 0·log(0) is treated as 0):

If y = 1 and a = 0.000005 [Extreme case of misclassification]
loss = -1·log(0.000005) - (1-1)·log(1-0.000005) = 5.301 [High loss value]

If y = 1 and a = 1 [Best case of correct classification]
loss = -1·log(1) - (1-1)·log(1-1) = 0 [Low loss value]

If y = 0 and a = 0.999905 [Extreme case of misclassification]
loss = -0·log(0.999905) - (1-0)·log(1-0.999905) = 4.022 [High loss value]

If y = 0 and a = 0 [Best case of correct classification]
loss = -0·log(0) - (1-0)·log(1-0) = 0 [Low loss value]
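
As a quick numerical check of these intuition examples, here is a minimal sketch (base-10 logs to match the slide’s values of 5.301 and 4.022; predictions are clipped so log(0) never occurs):

```python
import numpy as np

def bce_loss(y, a, eps=1e-12):
    """Binary cross-entropy for a single prediction, using base-10 logs
    to match the slide's numbers. a is clipped so log(0) never occurs."""
    a = np.clip(a, eps, 1 - eps)
    return -y * np.log10(a) - (1 - y) * np.log10(1 - a)

print(bce_loss(1, 0.000005))   # ~5.301  (y=1, badly misclassified)
print(bce_loss(1, 1.0))        # ~0      (y=1, correct)
print(bce_loss(0, 0.999905))   # ~4.022  (y=0, badly misclassified)
print(bce_loss(0, 0.0))        # ~0      (y=0, correct)
```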


Steps of logistic regression

● Randomly initialize w and b for z = w·x + b
● Calculate z (and the prediction a = σ(z)) for the initial w and b
● Calculate the loss
● Apply gradient descent to update w and b
● Repeat the process until convergence (a minimal sketch of these steps follows below)
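
A minimal sketch of these steps on the Hours Study dataset above. The learning rate, iteration count, and random seed are illustrative choices, and the loss here uses natural logs:

```python
import numpy as np

# Hours Study (feature) and Pass encoded as 0/1 (label), from the dataset above
x = np.array([2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([0, 0, 1, 0, 1, 1, 1, 1], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: randomly initialize w and b
rng = np.random.default_rng(0)
w, b = rng.normal(), rng.normal()

alpha = 0.1                           # learning rate (illustrative value)
for step in range(5000):              # fixed loop instead of a real convergence test
    a = sigmoid(w * x + b)            # Step 2: forward pass, z = w*x + b, a = sigmoid(z)
    loss = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))  # Step 3: average loss
    dw = np.mean((a - y) * x)         # Step 4: gradients of the loss
    db = np.mean(a - y)
    w -= alpha * dw                   # gradient descent update
    b -= alpha * db

print(w, b, loss)
print(sigmoid(w * 10 + b))            # predicted probability of passing with 10 hours of study
```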
Logistic regression to neural networks
● Logistic regression can be expressed as a single-neuron neural network
● For a single feature and a single neuron, the network looks like:

● For multiple features feeding into a single neuron, the architecture may look like:

● Single-layer neural networks are referred to as perceptrons


Logistic regression to neural networks
● A network can involve multiple neurons

● And it can become fairly large and complex


Neural networks
● A method of processing based on multiple connected processing units
● Each neuron is a linear function followed by a non-linear activation function
● The connectivity of the neurons builds up a nested function
● Can learn complex patterns

● In the neural network above, the output is: a = σ(σ(x1·w1 + b1)·w2 + b2) (see the sketch below)

● In a larger network, the function becomes much more nested and complex, with many
learnable parameters, giving it the ability to learn complex patterns
● The learning process stays the same. After initialization, the weights {w1, w2, w3, …, wn} and
biases {b1, b2, b3, …, bm} have to be updated through gradient descent
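
A minimal sketch of that nested forward pass for the two-neuron chain above; the input value and parameters are made-up numbers for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, w1, b1, w2, b2):
    """Two neurons chained: the output of the first becomes the input of the second."""
    a1 = sigmoid(x1 * w1 + b1)   # first neuron: linear function + activation
    a2 = sigmoid(a1 * w2 + b2)   # second neuron takes a1 as its input
    return a2                    # a2 = sigmoid(sigmoid(x1*w1 + b1)*w2 + b2)

print(forward(x1=0.5, w1=0.7, b1=0.1, w2=-1.2, b2=0.3))  # illustrative values
```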
Neural networks
● The type of neural network that we have talked about so far is known as a
feedforward neural network
● There are also other forms of neural networks
● Some of the basic types are:
○ Feedforward Neural Networks
○ Convolutional Neural Networks
○ Recurrent Neural Networks
○ Generative Adversarial Networks
○ Transformers
○ … and Many More!
● For this lecture, we are going to talk about the feedforward networks only
Training a Neural Network
● Collect dataset
● Define the architecture of the Neural Network model that is to be trained on the
data
● Initialize all the weights and biases
● Calculate the output based on the initialized weights and biases, which is called forward
propagation
● Calculate the loss using an appropriate loss function
● Update the last-layer weights and biases using the gradient of the loss
● Propagate the gradient backwards to update the earlier layers, which is known as
backpropagation
● Run in a loop until convergence
Forward Propagation
Back to the Single Neuron Perceptron

● Here, there are two input feature values (x1, x2); the output is a
● The loss function discussed for logistic regression can be used here, sometimes referred to
as the sigmoid (binary cross-entropy) loss function
Update process

● Calculate the output a for all the data points in the dataset

● Calculate the loss for n data points using the average cross-entropy:

  L = -(1/n) Σᵢ [ yᵢ·log(aᵢ) + (1 - yᵢ)·log(1 - aᵢ) ]

● To update a weight wᵢ, calculate the derivative of the loss, ∂L/∂wᵢ

● Update wᵢ using wᵢ := wᵢ - α·(∂L/∂wᵢ)
● Where, α is the learning rate
● Update all the weights and biases in the network in the same way
Gradient Calculation
Let’s calculate the derivatives (the chain-rule steps are written out below).
So, for a single-neuron perceptron,

The gradient with respect to w1 is ∂L/∂w1 = (a - y)·x1

The gradient with respect to w2 would be ∂L/∂w2 = (a - y)·x2

This is for a single data point. For n samples, all the gradients have to be summed
up and then averaged.
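
For reference, here is the chain-rule derivation behind those expressions, assuming the natural-log form of the loss (the per-sample gradient values used in the simulation below, 0.013 and 0.23, are consistent with this form):

\[
\begin{aligned}
L &= -y\log a - (1-y)\log(1-a), \qquad a = \sigma(z), \qquad z = w_1 x_1 + w_2 x_2 + b \\
\frac{\partial L}{\partial a} &= -\frac{y}{a} + \frac{1-y}{1-a}, \qquad
\frac{\partial a}{\partial z} = \sigma(z)\bigl(1-\sigma(z)\bigr) = a(1-a) \\
\frac{\partial L}{\partial z} &= \frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z} = a - y \\
\frac{\partial L}{\partial w_1} &= \frac{\partial L}{\partial z}\cdot\frac{\partial z}{\partial w_1} = (a - y)\,x_1, \qquad
\frac{\partial L}{\partial w_2} = (a - y)\,x_2, \qquad
\frac{\partial L}{\partial b} = a - y
\end{aligned}
\]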
Some Basic Simulations
Let’s have a toy dataset:

X      Y
0.1    0
0.2    0
0.3    1
0.4    ?

We are trying to fit it with the following perceptron

w and b are the unknowns, which have to be figured out

For the three labeled data points, the loss function is the average cross-entropy:

L = -(1/3) Σᵢ [ yᵢ·log(aᵢ) + (1 - yᵢ)·log(1 - aᵢ) ],  where aᵢ = σ(w·xᵢ + b)


Initializing,

w = 0.7

b = 0.1

Thus, z = 0.7x + 0.1

● As the weights are randomly initialized, the model should not fit our data points well, resulting in a large error.

● Let’s calculate the error through the loss function

(Please note that we should use normalized versions of all the values for faster and more
appropriate convergence. For simplicity, we are ignoring that step here.)
● Now, calculating the loss for each of the three data points and summing them up:

Average loss across the three data points is 0.928/3 ≈ 0.309
● Let’s try to reduce the loss value

● We’ll have to update both w and b using:

  w := w - α·(∂L/∂w)    b := b - α·(∂L/∂b)

● Where, α is the learning rate

● With the averaged gradients ∂L/∂w ≈ 0.013 and ∂L/∂b ≈ 0.23, if α = 1 the updated w and b are:

● w = 0.7 - 1 × 0.013 = 0.687

● b = 0.1 - 1 × 0.23 = -0.13
Recalculating the sum of loss:

Average = 0.86 / 3 ≈ 0.287

Slightly better than before!

We need to run this many times in a loop to get further improvements. (Depending on the nature of the
data, good enough convergence might not be possible.)
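
A minimal sketch reproducing this simulation (base-10 logs for the reported loss, natural-log gradients (a - y)·x as derived earlier; small rounding differences from the slide values are expected):

```python
import numpy as np

x = np.array([0.1, 0.2, 0.3])
y = np.array([0.0, 0.0, 1.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def avg_loss(w, b):
    a = sigmoid(w * x + b)
    # base-10 cross-entropy, matching the slide's reported loss numbers
    return -np.mean(y * np.log10(a) + (1 - y) * np.log10(1 - a))

w, b, alpha = 0.7, 0.1, 1.0
print(avg_loss(w, b))            # ~0.31 (the slide reports 0.928/3 ≈ 0.309)

a = sigmoid(w * x + b)
dw = np.mean((a - y) * x)        # ~0.013
db = np.mean(a - y)              # ~0.23
w, b = w - alpha * dw, b - alpha * db
print(w, b)                      # ~0.687, ~-0.13
print(avg_loss(w, b))            # ~0.29, slightly better than before
```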
But what about a larger network?
● This is a fairly large differentiation and, with increasing layer count, can very quickly
get out of hand. But wait…

● While calculating the derivative w.r.t. w2, the red-marked portion has already been
calculated. We can just take its numerical value and plug it in.
● For each layer’s gradient, we only need to store the gradient already computed for the later
layer and plug it in, which makes the derivative calculation much simpler.
● In such a case, we are just propagating the gradient back from the later layers, hence the
name ‘backpropagation’.
● No matter how complex an expression is, it can be broken down into simple arithmetic
operations.
● It is easy to calculate the derivative of each of these simple operations, evaluate it,
apply the chain rule repeatedly, and pass the value from each later operation back into the
earlier one.
● This method is known as automatic differentiation and is capable of calculating all the
gradients in a single backward pass (a small sketch follows below).
● This is how deep learning libraries are able to calculate derivatives even in very
complex networks.
● Some other differentiation alternatives are symbolic differentiation (which can become
too complex in a large network) and numerical differentiation (which is not very accurate).
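
A minimal sketch of this idea on the small two-neuron chain from earlier, a = σ(σ(x·w1 + b1)·w2 + b2), with made-up values; the gradient flowing out of the second neuron is computed once and reused for every earlier parameter:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up values for one training example and the parameters
x, y = 0.5, 1.0
w1, b1, w2, b2 = 0.7, 0.1, -1.2, 0.3

# Forward pass, keeping the intermediate values
z1 = x * w1 + b1
a1 = sigmoid(z1)
z2 = a1 * w2 + b2
a2 = sigmoid(z2)
loss = -(y * math.log(a2) + (1 - y) * math.log(1 - a2))

# Backward pass: each gradient reuses the one computed just after it
dz2 = a2 - y                 # dL/dz2 for sigmoid + cross-entropy
dw2 = dz2 * a1               # dL/dw2
db2 = dz2                    # dL/db2
da1 = dz2 * w2               # propagate the gradient back to the first neuron
dz1 = da1 * a1 * (1 - a1)    # chain rule through the first sigmoid
dw1 = dz1 * x                # dL/dw1
db1 = dz1                    # dL/db1

print(loss, dw1, db1, dw2, db2)
```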
