Neural Network (Perceptrons)
Classification dataset

Hours Study | Pass | Pass (encoded)
----------- | ---- | --------------
2           | No   | 0
3           | No   | 0
4           | Yes  | 1
5           | No   | 0
6           | Yes  | 1
7           | Yes  | 1
8           | Yes  | 1
9           | Yes  | 1
10          | ?    | ?
● Here, we are trying to predict whether a student will pass based on hours studied
● Hours Study, in this case, is the feature, and Pass is the label
● Pass is a categorical variable consisting of two values, {Yes, No}
● Predicting such a categorical outcome is referred to as a classification problem
● It is a supervised learning task
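Before training, the Yes/No labels are encoded as 1/0, exactly as in the right-hand column of the table. A minimal sketch in Python (the lists simply transcribe the table):

```python
# Encode the categorical Pass label {Yes, No} as {1, 0} for training.
hours = [2, 3, 4, 5, 6, 7, 8, 9]  # feature: hours studied
passed = ["No", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes"]  # label

encoding = {"No": 0, "Yes": 1}
y = [encoding[label] for label in passed]
print(y)  # [0, 0, 1, 0, 1, 1, 1, 1]
```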
x = feature value [for multiple features, it is a feature matrix multiplied by a weight matrix]
Let's assume:
Hours Study = x
Pass = y
● For multiple features on a single neuron, the architecture may look like:
● Here, there are two input feature values (x1, x2), output is (a)
● The loss function discussed in logistic regression can be used here; it is the
log loss applied to the sigmoid output (referred to here as the sigmoid loss function)
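A single neuron with two input features and a sigmoid output, together with the logistic-regression (log) loss, can be sketched as follows (the function names and sample values are illustrative):

```python
import math

def neuron(x1, x2, w1, w2, b):
    """Single neuron: weighted sum of two input features passed through a sigmoid."""
    z = w1 * x1 + w2 * x2 + b
    return 1.0 / (1.0 + math.exp(-z))  # activation a in (0, 1)

def log_loss(a, y):
    """Logistic-regression loss for one sample: -[y*ln(a) + (1-y)*ln(1-a)]."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

a = neuron(0.5, 0.2, w1=0.4, w2=-0.3, b=0.1)
print(round(a, 4), round(log_loss(a, 1), 4))
```

The loss shrinks toward 0 as the activation approaches the true label, and grows without bound as it approaches the wrong one.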
Update process
This is for a single data point. For n samples, the gradients of all the points have to be
summed up and then averaged.
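The batched update described above can be sketched as follows. With a sigmoid output and log loss, the per-sample gradients are dL/dw = (a − y)·x and dL/db = (a − y), a standard result; they are summed over the n samples and averaged before the step (function and variable names are illustrative):

```python
import math

def gradient_step(xs, ys, w, b, lr):
    """One gradient-descent step: sum per-sample gradients, average, then update."""
    n = len(xs)
    dw = db = 0.0
    for x, y in zip(xs, ys):
        a = 1.0 / (1.0 + math.exp(-(w * x + b)))  # sigmoid activation
        dw += (a - y) * x  # dL/dw for this sample (log loss + sigmoid)
        db += (a - y)      # dL/db for this sample
    return w - lr * dw / n, b - lr * db / n
```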
Some Basic Simulations

Let's have a toy dataset:

X    Y
0.1  0
0.2  0
0.3  1
0.4  ?

With initial parameters:
w = 0.7
b = 0.1

Thus, z = 0.7x + 0.1
● As the weights are randomly initialized, the model should not fit our data points well, resulting in a large error.
(Note that we should use the normalized version of all the values for faster
and proper convergence. For simplicity, we are ignoring that step here.)
● Now, calculating the sum of loss for the three points in the dataset:
Average loss value across the three data points is 0.928/3 = 0.309
● Let's try to reduce the loss value with one gradient step (learning rate = 1)
● w = 0.7 − 1 × 0.013 = 0.687
● b = 0.1 − 1 × 0.23 = −0.13
Recalculating the sum of loss:
We need to run this many times in a loop to get further improvement. (Depending on the nature of the data, good
enough convergence might not be possible.)
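The hand-computed step above can be checked in code. Assuming log loss with a sigmoid output and a learning rate of 1, the averaged gradients come out to about 0.013 for w and 0.23 for b, matching the update shown:

```python
import math

xs, ys = [0.1, 0.2, 0.3], [0, 0, 1]  # toy dataset (the 0.4 row has no label)
w, b, lr = 0.7, 0.1, 1.0

dw = db = 0.0
for x, y in zip(xs, ys):
    a = 1.0 / (1.0 + math.exp(-(w * x + b)))  # a = sigmoid(0.7x + 0.1)
    dw += (a - y) * x
    db += (a - y)
dw /= len(xs)  # averaged gradient, about 0.013
db /= len(xs)  # averaged gradient, about 0.23

w, b = w - lr * dw, b - lr * db
print(round(w, 3), round(b, 2))  # 0.687 -0.13
```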
But what about a larger network?
● This is a fairly large differentiation and, with an increasing layer count, can very quickly
get out of hand. But wait…
● While calculating the derivative w.r.t. w2, the red-marked portion has already been
calculated. We can just get the numerical values and plug them in.
● For each layer's gradient, we just need to store the gradient values already computed for the
later layers and plug them in, making the derivative calculation process much simpler.
● In such a case, we are just propagating the gradient back from the later layer, hence the
name ‘backpropagation’.
● No matter how complex an expression is, it can be broken down into simple arithmetic
expressions.
● It is possible to easily calculate the derivatives of these individual simple expressions,
obtain their values, apply the chain rule repeatedly, and propagate the values from the later
operations back into the earlier ones.
● This method is known as automatic differentiation and is capable of calculating all the
gradients in a single backward pass.
● This is how deep learning libraries are capable of calculating derivatives even in very
complex networks.
● Some other differentiation alternatives would be symbolic differentiation (can become
too complex in a large network) and numerical differentiation (not very accurate).
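The idea of breaking an expression into simple operations and propagating gradients backwards can be sketched with a tiny reverse-mode autodiff class. This is a toy version of what deep learning libraries do, not their actual implementation; class and method names are illustrative:

```python
class Value:
    """A scalar that records the operations that produced it, so gradients
    can be propagated back through the chain rule."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self, upstream=1.0):
        # Chain rule: multiply the upstream gradient by each local gradient
        # and push it back to the parent (later operation -> earlier operation).
        self.grad += upstream
        for parent, local in zip(self._parents, self._local_grads):
            parent.backward(upstream * local)

x, w, b = Value(2.0), Value(3.0), Value(1.0)
z = w * x + b  # z = 3*2 + 1 = 7
z.backward()
print(x.grad, w.grad, b.grad)  # 3.0 2.0 1.0
```

This recursive version is only correct for expression trees; real libraries process the operations in a topological order so that nodes used in several places accumulate their gradient exactly once.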