CS217_2024_lec11
Disclaimer: These notes aggregate content from several texts and have not been subjected to the usual
scrutiny deserved by formal publications. If you find errors, please bring to the notice of the Instructor.
Consider the example of points shown in the diagram below: some points are marked with crosses (×) and others with circles (◦), and we have to classify them into two different classes.
If we use logistic regression for classification, the decision boundary in the 2-D case is linear, so it is not possible to classify the points perfectly: no single line can separate the × and ◦ points into two different regions.
\[
f(X, w) = \frac{1}{1 + e^{-w^{T}X}} \qquad \text{(Logistic Regression)}
\]
So, to solve this problem of the linear decision boundary, we can use the concept of the basis function, which
gives us a non-linear decision boundary.
Let us take the basis function as:
\[
\Phi(X) = \begin{bmatrix} 1 & x_1 & x_2 & x_1^2 & x_2^2 \end{bmatrix}^{T}
\]
Now the decision boundary can be circular, and by properly adjusting the weights we can find a circular decision boundary that contains all the ◦ points.
\[
f(\Phi(X), w) = \frac{1}{1 + e^{-w^{T}\Phi(X)}} \qquad \text{(Logistic Regression with basis function)}
\]
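As an illustrative sketch (not from the lecture), hand-picked weights on this basis expansion produce a circular decision boundary; the data points and weights below are hypothetical:

```python
import numpy as np

def phi(X):
    """Quadratic basis function Phi(X) = [1, x1, x2, x1^2, x2^2]^T, row-wise."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([np.ones_like(x1), x1, x2, x1**2, x2**2], axis=1)

def f(X, w):
    """Logistic regression on the transformed features."""
    return 1.0 / (1.0 + np.exp(-phi(X) @ w))

# Hand-picked weights encoding the circle x1^2 + x2^2 = 1:
# w^T Phi(x) = 1 - x1^2 - x2^2 is positive inside the unit circle.
w = np.array([1.0, 0.0, 0.0, -1.0, -1.0])

inside = np.array([[0.0, 0.0], [0.5, 0.5]])    # circle (o) points
outside = np.array([[2.0, 0.0], [-1.5, 1.5]])  # cross (x) points
print(f(inside, w) > 0.5)   # both classified as class 1
print(f(outside, w) > 0.5)  # both classified as class 0
```

The linear model in the transformed feature space thus realizes a non-linear boundary in the original space.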
In Neural Networks, our goal is to attain non-linear behaviour without the need for explicit programming
dedicated to non-linearity; this is the fundamental principle behind Neural Networks.
11-2 Lecture 11: Introduction to Neural Networks
• 1950s- Nathaniel Rochester from the IBM research laboratories led the first effort to simulate a neural
network. Unfortunately for him, the first attempt to do so failed.
• 1957- The first hardware implementation of perceptron was Mark I Perceptron machine built in 1957
at the Cornell Aeronautical Laboratory by psychologist Frank Rosenblatt, funded by the Information
Systems Branch of the United States Office of Naval Research and the Rome Air Development Center.
• 1982- Interest in the field was renewed. John Hopfield of Caltech presented a paper to the National
Academy of Sciences. His approach was to create more useful machines by using bidirectional lines.
Previously, the connections between neurons were only one way.
• 1982- At the US-Japan Joint Conference on Cooperative/Competitive Neural Networks, Japan announced its Fifth-Generation effort, which left the US worried about being left behind.
• 1997- A recurrent neural network (RNN) framework, Long Short-Term Memory (LSTM), was proposed
by Schmidhuber & Hochreiter.
First, let us understand why neural networks are called neural networks. The way an actual neuron
works involves the accumulation of electric potential, which, when exceeding a particular value, causes the
pre-synaptic neuron to discharge across the axon and stimulate the post-synaptic neuron. The human brain’s
capabilities are incredible compared to what we can do even with state-of-the-art neural networks.
In the following diagram, we illustrate the analogy between the neuron structure and the artificial neurons
in a neural network.
If we have multiple features, each is passed through an affine transformation, which is basically a weighted
sum of the input features plus a bias term, giving us something resembling a regression equation. We then
pass this result through our activation function, which gives us some form of probability. This probability
determines whether the neuron will fire — our result can then be plugged into our loss function to assess
the algorithm’s performance.
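A single artificial neuron as described above can be sketched as follows (the feature values, weights, and bias are illustrative):

```python
import numpy as np

def neuron(x, w, b):
    """One artificial neuron: affine transformation, then sigmoid activation."""
    z = np.dot(w, x) + b             # weighted sum of inputs plus bias
    return 1.0 / (1.0 + np.exp(-z))  # activation: the firing "probability"

x = np.array([0.5, -1.0, 2.0])  # illustrative feature values
w = np.array([0.4, 0.3, 0.1])   # illustrative weights
b = 0.1
print(neuron(x, w, b))  # a value in (0, 1)
```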
Here is a multi-layer neural network; our goal is to learn the weights W and biases b of all layers so as to minimize the loss J.
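A minimal sketch of the forward pass through such a network, assuming two layers with sigmoid activations and randomly initialized parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Forward pass through a 2-layer network: each layer applies W, b, then sigmoid."""
    (W1, b1), (W2, b2) = params
    h = sigmoid(W1 @ x + b1)     # hidden layer
    return sigmoid(W2 @ h + b2)  # output layer

rng = np.random.default_rng(0)
params = [(rng.standard_normal((4, 3)), np.zeros(4)),   # layer 1: 3 inputs -> 4 hidden
          (rng.standard_normal((1, 4)), np.zeros(1))]   # layer 2: 4 hidden -> 1 output
x = np.array([1.0, 2.0, -1.0])
print(forward(x, params))  # a value in (0, 1)
```

Training then consists of adjusting every W and b so that this output matches the labels under the loss J.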
The activation function is analogous to the build-up of electrical potential in biological neurons, which fires
once a certain activation potential is reached. This activation potential is mimicked in artificial neural
networks using probability. The activation function should do two things:-
1. Ensure non-linearity to capture complex features that are not linear.
2. Ensure gradients remain large through the hidden layers in Deep Neural Networks; otherwise, we may
encounter the vanishing gradient problem.
Following are some of the popular activation functions:-
1. Sigmoid (σ) :-
\[
\sigma(x) = \frac{1}{1 + e^{-x}}
\]
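A quick numeric sketch of the sigmoid and its derivative; note how the gradient saturates for large inputs, which is exactly the vanishing-gradient concern mentioned above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)  # derivative of the sigmoid; peaks at 0.25 when x = 0

print(sigmoid(0.0))       # 0.5
print(sigmoid_grad(0.0))  # 0.25, the largest the gradient ever gets
print(sigmoid_grad(10.0)) # ~4.5e-5: saturation, the source of vanishing gradients
```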
Feedforward Neural Network: It is one of the broad types of Artificial Neural Networks, where the
flow of information is unidirectional and is from the inputs to outputs through hidden layers without any
cycles or loops, in contrast to recurrent neural networks, which have a bi-directional flow. Following are the
steps involved in training a neural network:
\[
J(\theta) = \frac{1}{N} \sum_{i=1}^{N} l(NN(x_i, \theta), y_i)
\]
\[
l(NN(x_i, \theta), y_i) = -\left[ y_i \log(NN(x_i, \theta)) + (1 - y_i) \log(1 - NN(x_i, \theta)) \right]
\]
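These two quantities can be computed directly, assuming the network outputs a probability p = NN(x_i, θ) for each example (the predictions and labels below are made up):

```python
import numpy as np

def bce_loss(p, y):
    """Binary cross-entropy for one example: prediction p = NN(x, theta), label y."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def total_loss(preds, labels):
    """J(theta): the average per-example loss over the N training examples."""
    return np.mean([bce_loss(p, y) for p, y in zip(preds, labels)])

preds = np.array([0.9, 0.2, 0.7])
labels = np.array([1, 0, 1])
print(total_loss(preds, labels))  # small, since the predictions match the labels
```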
• Stochastic Gradient Descent (SGD)
– Pick Random Data Point: Randomly select a data point (x_i, y_i) from the dataset.
– Compute Gradient of the Loss: Calculate the gradient of the loss function with respect to
the parameters θ for the selected data point.
\[
\nabla_{\theta}\, l(x_i, y_i)
\]
– Update Parameters: Update the parameters θ using the gradient descent update rule:
\[
\theta_{t+1} = \theta_t - \eta \nabla_{\theta}\, l(x_i, y_i)
\]
Where η (eta) is the learning rate, controlling the size of the steps taken during optimization.
– Mini-Batch Variant: Instead of updating parameters with single data points, one can use mini-
batches of data (B) to compute gradients and update parameters. This often leads to more stable
convergence.
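The SGD steps above can be sketched for a one-neuron (logistic regression) model on a hypothetical toy dataset:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Toy linearly separable data: the label is 1 when x1 + x2 > 0.
X = rng.standard_normal((200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

theta = np.zeros(2)
eta = 0.1  # learning rate
for step in range(2000):
    i = rng.integers(len(X))      # pick a random data point
    p = sigmoid(X[i] @ theta)     # forward pass
    grad = (p - y[i]) * X[i]      # gradient of the BCE loss w.r.t. theta
    theta = theta - eta * grad    # parameter update rule

preds = (sigmoid(X @ theta) > 0.5).astype(float)
print(np.mean(preds == y))  # training accuracy, close to 1.0
```

Replacing the single index `i` by a random batch of indices and averaging the per-example gradients gives the mini-batch variant.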
11.6 Backpropagation
The whole training of neural networks rests on backpropagation: given a loss l and the set of all parameters θ, we must be able to compute dl/dθ, so that each parameter can be trained via the update θ₁ ← θ₁ − (dl/dθ₁) × learning rate.
Backpropagation efficiently computes gradients in neural networks by utilizing the chain rule of differentiation. For example, suppose b = f(a) and l = g(b), and we want dl/da. Instead of differentiating the composition directly, we first compute db/da, then dl/db, and finally dl/da = (dl/db) × (db/da).
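A tiny numeric instance of this chain-rule computation, with hypothetical functions f(a) = a² and g(b) = 3b:

```python
# Chain rule on a tiny computation graph: b = f(a) = a**2, l = g(b) = 3*b.
a = 2.0
b = a ** 2   # forward: b = 4
l = 3 * b    # forward: l = 12

dl_db = 3.0            # dl/db, from g
db_da = 2 * a          # db/da, from f
dl_da = dl_db * db_da  # chain rule: dl/da = (dl/db)(db/da) = 12
print(dl_da)
```

Backpropagation organizes exactly these local-derivative products, layer by layer, from the loss back to every parameter.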
[Figure: a fully connected computation graph with input nodes x₁, x₂, x₃, intermediate nodes u₁, …, u₄, and output nodes g₁, g₂, g₃.]
In a multi-layer neural network, each layer consists of nodes and connections between nodes carry weights.
During both forward pass (computing the output) and backward pass (computing gradients for training),
derivatives play a crucial role. Here, we discuss the computation of derivatives with respect to inputs and
the matrix representation of these derivatives.
The derivative ∂u⃗/∂x⃗ represents the sensitivity of the intermediate layer u⃗ to changes in the input x⃗. It can be represented as a matrix whose entry in row i, column j is the partial derivative of u_j with respect to x_i. This matrix is commonly known as the Jacobian matrix.
\[
\frac{\partial \vec{u}}{\partial \vec{x}} =
\begin{bmatrix}
\frac{\partial u_1}{\partial x_1} & \frac{\partial u_2}{\partial x_1} & \cdots & \frac{\partial u_k}{\partial x_1} \\
\frac{\partial u_1}{\partial x_2} & \frac{\partial u_2}{\partial x_2} & \cdots & \frac{\partial u_k}{\partial x_2} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial u_1}{\partial x_d} & \frac{\partial u_2}{\partial x_d} & \cdots & \frac{\partial u_k}{\partial x_d}
\end{bmatrix}
\]
Similarly, the matrix ∂g⃗/∂u⃗ has entries representing the partial derivatives of each element in g⃗ with respect to each element in u⃗.
\[
\frac{\partial \vec{g}}{\partial \vec{u}} =
\begin{bmatrix}
\frac{\partial g_1}{\partial u_1} & \frac{\partial g_2}{\partial u_1} & \cdots & \frac{\partial g_m}{\partial u_1} \\
\frac{\partial g_1}{\partial u_2} & \frac{\partial g_2}{\partial u_2} & \cdots & \frac{\partial g_m}{\partial u_2} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial g_1}{\partial u_k} & \frac{\partial g_2}{\partial u_k} & \cdots & \frac{\partial g_m}{\partial u_k}
\end{bmatrix}
\]
Consider vectors x⃗, u⃗, and g⃗, where every node in x⃗ is connected to every node in u⃗, and every node in u⃗ is connected to every node in g⃗. The derivative ∂g_i/∂x_j represents the sensitivity of each element g_i in g⃗ to changes in each element x_j in x⃗. This derivative can be computed using the chain rule:
\[
\frac{\partial g_i}{\partial x_j} = \sum_{z=1}^{k} \frac{\partial u_z}{\partial x_j} \cdot \frac{\partial g_i}{\partial u_z}
\]
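A numeric check of this summed chain rule, using hypothetical linear maps so each Jacobian is just a constant matrix (written here with one row per output component):

```python
import numpy as np

# Hypothetical linear maps, so the Jacobians are the matrices themselves:
# u = A x  (Jacobian dg/dx components: du/dx = A, k=3 outputs, d=2 inputs)
# g = B u  (Jacobian dg/du = B, m=2 outputs, k=3 inputs)
A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, -1.0]])
B = np.array([[1.0, 1.0, 0.0],
              [2.0, 0.0, 1.0]])

# dg_i/dx_j = sum_z (du_z/dx_j)(dg_i/du_z): the matrix product of the Jacobians.
dg_dx = B @ A

x = np.array([1.0, -1.0])
g_direct = B @ (A @ x)       # the composed map evaluated directly
print(dg_dx)                 # the 2x2 Jacobian of g with respect to x
print(dg_dx @ x - g_direct)  # zero for linear maps: the Jacobian reproduces the map
```

The sum over z in the formula is exactly the inner product performed by the matrix multiplication.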
Here, we sum over all elements u_z, where each term in the sum involves the product of two partial derivatives.