Lecture 2
1.1 Introduction
Course Overview
• Objectives:
o Understand the foundational principles of neural networks and deep
learning.
o Gain hands-on experience with state-of-the-art deep learning
frameworks.
o Explore applications in fields like healthcare, computer vision, and
natural language processing.
• Structure:
o Weekly lectures and labs.
o Assignments, projects, and exams to assess understanding.
o Two Main Textbooks:
"Deep Learning with Python" by Francois Chollet
"Fundamentals of Neural Networks – Architectures, Algorithms,
and Applications" by Laurene Fausett
• Historical Context:
o Brief history of AI and machine learning leading to the development of
deep learning.
o Why deep learning has become prominent in recent years (availability of
data, computational power).
• Real-World Applications:
o Computer Vision: Facial recognition, object detection.
o Natural Language Processing: Machine translation, sentiment
analysis.
o Healthcare: Disease prediction, patient outcome prediction, medical
imaging analysis.
o Self-Driving Cars: Autonomous navigation and decision-making.
Biological Inspiration
• Dendrites (Input Layer):
o Branch-like structures that receive signals from other neurons. These
signals are chemical in nature and are converted into electrical impulses
as they move toward the neuron's cell body (soma).
o In a biological neuron, dendrites receive signals from other neurons.
Similarly, in an artificial neural network, the input layer receives data
(e.g., features of an image or text) from the external environment.
• Axon (Output Layer):
o A long, slender projection that carries electrical impulses away from the
cell body. The axon transmits these impulses to other neurons, muscles,
or glands.
o The axon in a biological neuron transmits the processed signal to other
neurons. In an ANN, the output layer sends the final processed signal
(e.g., a classification decision or a prediction) to the next layer or to the
external environment.
• Synapse (Weights):
o The small gap between the axon terminal of one neuron and the
dendrites or cell body of another neuron. When an electrical impulse
reaches the end of an axon, it triggers the release of neurotransmitters,
which cross the synapse and bind to receptors on the next neuron,
allowing the signal to continue.
o The synapse is the point of connection between two neurons where
signals are transmitted. The strength of this transmission is influenced
by the synaptic weights. In an ANN, the synapse is represented by
weights that determine how much influence an input has on the output.
These weights are adjusted during training to minimize the error in
predictions.
• Activation Function (Neuron Firing):
o Just as a biological neuron "fires" (transmits a signal) if the incoming
signals are strong enough, an artificial neuron in an ANN activates and
passes on a signal based on an activation function. This function
introduces non-linearity into the model, enabling it to learn complex
patterns.
Figure 2:
Structure and Functionality:
o Neural Networks Mimic Neuronal Processing:
o Artificial neural networks are designed to mimic the way biological
neurons process information. The architecture of an ANN—comprising
input layers, hidden layers, and output layers—parallels the structure of
interconnected neurons in the brain.
o Learning Process:
o In biological neurons, learning occurs through the strengthening or
weakening of synaptic connections, a process known as synaptic
plasticity. In ANNs, learning occurs through the adjustment of weights
and biases during the training process, typically using algorithms like
backpropagation.
of neurons. Similarly, deep neural networks, with many layers, can learn
hierarchical representations of data, enabling them to identify complex
patterns in images, text, and other types of data.
o Parallel Processing:
o Just as the brain processes information in parallel across many neurons,
ANNs process data in parallel across multiple nodes, making them highly
efficient for tasks like image recognition, language processing, and
more.
• 1943: Warren McCulloch and Walter Pitts propose the first mathematical model
of a neuron.
• 1958: Frank Rosenblatt develops the Perceptron, the first algorithmically
described neural network.
• 1980s-1990s: Development of backpropagation and the rise of multilayer
perceptrons.
• 2010s: The resurgence of neural networks, particularly deep learning, driven by
advances in computational power and large datasets.
• Machine Learning:
o Requires feature engineering.
o Works well with structured data (e.g., tables of data).
• Deep Learning:
o Automatically extracts features.
o Excels with unstructured data (e.g., images, text, audio).
• Large Datasets:
o Deep learning requires vast amounts of data to train effective models.
• Computational Power:
o Advances in hardware, particularly GPUs, have enabled the training of
deep networks.
• Backpropagation and Gradient Descent:
o Backpropagation: Algorithm for computing the gradient of the loss
function with respect to the network’s weights.
o Gradient Descent: Optimization algorithm used to minimize the loss
function.
• Discuss the rise of deep learning and its impact on fields such as computer
vision, natural language processing, and healthcare.
• Real-world applications: self-driving cars, speech recognition, image
classification, etc.
• The importance of understanding the theory behind neural networks to apply
them effectively.
• Machine learning is a subset of AI that uses computer algorithms to analyze
data and make intelligent decisions based on what it has learned. Machine
learning algorithms are trained with large sets of data and they learn from
examples.
• Deep learning is a specialized subset of Machine Learning that uses layered
neural networks to simulate human decision-making. Deep learning algorithms
can label and categorize information and identify patterns. It is what enables AI
systems to continuously learn on the job, and improve the quality and accuracy
of results by determining whether decisions were correct.
Foundations of AI Learning
• What is Learning in AI?
o Learning in AI refers to the process by which algorithms adjust and
improve their performance based on data. This mimics human learning,
where experiences shape future actions and decisions.
• Types of Learning
o Supervised Learning:
The model learns from labeled data, which means the input data
comes with the correct output.
Example: Image classification where each image is labeled with
the correct category.
o Unsupervised Learning:
The model learns from unlabeled data, finding hidden patterns or
intrinsic structures.
Example: Clustering customers into different groups based on
purchasing behavior.
o Reinforcement Learning:
The model learns by interacting with an environment, receiving
rewards or penalties.
Example: Training a robot to navigate a maze.
What is Data Science?
• Data science is the process and method for extracting knowledge and insights
from large volumes of disparate data.
• Data Science can use many of the AI techniques to derive insight from data.
Deep Learning
• Deep learning algorithms do not directly map input to output. Instead, they rely
on several layers of processing units. Each layer passes its output to the next
layer, which processes it and passes it to the next. The many layers are why
it’s called deep learning. When creating deep learning algorithms, developers
and engineers configure the number of layers and the type of functions that
connect the outputs of each layer to the inputs of the next. Then they train the
model by providing it with lots of annotated examples.
$\varphi(\cdot)$: an activation function
This is a 3D visualization of the activation of a neuron for two inputs $(x_1, x_2)$, with
output values $y = h(\mathbf{x})$ of -1 and 1.
• The range of the output is determined by the activation function $\varphi(\cdot)$; here it lies
between -1 and 1.
• The neuron can be viewed as a binary classifier that separates the input space into two
regions. Which region a point falls in depends on the value of the input $\mathbf{X}$.
• The vector $\mathbf{W}$ is perpendicular to the hyperplane that separates the two regions
(e.g., the regions where the neuron output is -1 and 1). This follows from the geometric
interpretation of the equation $\mathbf{W} \cdot \mathbf{X} + b = 0$, which defines the
hyperplane in the input space. For any point on the hyperplane, the dot product
$\mathbf{W} \cdot \mathbf{X}$ is equal to $-b$, which is constant.
• The vector $\mathbf{W}$ is the gradient of the linear combination $\mathbf{W} \cdot \mathbf{X} + b$ with respect to
$\mathbf{X}$. The gradient points in the direction of the steepest increase of the function.
• The set of all points $\mathbf{X}$ that satisfy $\mathbf{W} \cdot \mathbf{X} + b = 0$ forms a plane perpendicular to
$\mathbf{W}$.
• The orientation of the hyperplane, therefore, is determined by $\mathbf{W}$.
• The bias $b$ shifts the hyperplane parallel to itself.
• The bias $b$ determines the position of the hyperplane relative to the origin in the
input space.
• When $b = 0$, the hyperplane passes through the origin.
• When $b > 0$, the hyperplane shifts away from the origin in the direction opposite
to $\mathbf{W}$. Increasing $b$ moves the hyperplane further along the direction where $\mathbf{W} \cdot \mathbf{X}$
is negative. This lowers the threshold $-b$ that $\mathbf{W} \cdot \mathbf{X}$ must exceed for positive
classification.
• When $b < 0$, the hyperplane shifts away from the origin in the direction of $\mathbf{W}$.
Decreasing $b$ (making it more negative) moves the hyperplane further in the
direction where $\mathbf{W} \cdot \mathbf{X}$ is positive, effectively raising the threshold for positive
classification.
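The geometry described above can be checked numerically. The following is a minimal sketch, assuming a two-input neuron with a hard bipolar threshold; the values of `W` and `b` are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical weights and bias, chosen only for illustration.
W = np.array([3.0, 4.0])
b = -5.0

def neuron_output(X):
    """Bipolar neuron: +1 on the positive side of the hyperplane, -1 otherwise."""
    return 1 if np.dot(W, X) + b > 0 else -1

# Signed offset of the hyperplane from the origin, measured along W.
# It is positive here because b < 0 shifts the hyperplane in the direction of W.
offset = -b / np.linalg.norm(W)

# The point of the hyperplane closest to the origin lies along W.
closest = offset * W / np.linalg.norm(W)

print(offset)                            # distance of the hyperplane from the origin
print(np.dot(W, closest) + b)            # approximately 0: the point is on the hyperplane
print(neuron_output(np.array([2.0, 2.0])), neuron_output(np.array([0.0, 0.0])))
```

With $b < 0$ the origin falls on the negative side, which matches the description of the threshold for positive classification being raised.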
3. Hyperbolic Tangent (Tanh) Function
The tanh function is another S-shaped activation function, similar to the
sigmoid, but it outputs values between -1 and 1.
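$$\varphi(a(\mathbf{x})) = \frac{e^{a(\mathbf{x})} - e^{-a(\mathbf{x})}}{e^{a(\mathbf{x})} + e^{-a(\mathbf{x})}} = \frac{1 - e^{-2a(\mathbf{x})}}{1 + e^{-2a(\mathbf{x})}} \;\rightarrow\; \varphi'(\mathbf{x}) = 1 - \varphi^2(\mathbf{x})$$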
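As a quick sanity check of the tanh formula and its derivative, the sketch below compares the second form of the formula with NumPy's built-in `np.tanh` and compares $1 - \varphi^2$ with a central finite difference. The function names and the test grid are illustrative choices.

```python
import numpy as np

def tanh(a):
    # Second form of the formula: (1 - e^{-2a}) / (1 + e^{-2a})
    return (1.0 - np.exp(-2.0 * a)) / (1.0 + np.exp(-2.0 * a))

def tanh_grad(a):
    # Derivative expressed through the function value: 1 - tanh(a)^2
    return 1.0 - tanh(a) ** 2

a = np.linspace(-3.0, 3.0, 13)
eps = 1e-6
numeric = (tanh(a + eps) - tanh(a - eps)) / (2 * eps)   # central difference

print(np.allclose(tanh(a), np.tanh(a)))                 # formula agrees with np.tanh
print(np.allclose(tanh_grad(a), numeric, atol=1e-6))    # derivative agrees numerically
```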
5. Leaky ReLU
Leaky ReLU is a variation of the ReLU function that allows a small, non-zero
gradient when the input is negative, which helps to keep the network learning
even for negative inputs.
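$$\varphi(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} \;\rightarrow\; \varphi'(x) = \begin{cases} 1, & x > 0 \\ \alpha, & x \le 0 \end{cases}$$
where $\alpha$ is a small constant (usually $\alpha = 0.01$).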
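A minimal NumPy sketch of Leaky ReLU and its gradient follows. It assumes the common convention of using the slope $\alpha$ at $x = 0$; the sample inputs are arbitrary. The gradient for negative inputs is the small non-zero value $\alpha$, which is what keeps the network learning there.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # x for positive inputs, alpha * x otherwise
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # slope 1 for positive inputs, slope alpha otherwise
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(x))
print(leaky_relu_grad(x))
```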
6. Softmax Function
Softmax is often used in the output layer of a neural network for multi-class
classification and returns a vector of probability scores. It converts logits (raw
output of the network) into probabilities. Let $\mathbf{z} = \mathbf{a}(\mathbf{x})$. Then
$$\varphi(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}},$$
where $\mathbf{a}(\mathbf{x})$ is the vector of raw outputs from the NN, and $n$ is the number of
classes.
Example: You’re given a dataset containing images of seals (class 0), pandas
(class 1), and ducks (class 2). You’d like to train a neural network to predict
whether a previously unseen image is that of a seal, a panda, or a duck. Thus
in this example $n = 3$. Suppose you are given the vector $\mathbf{z} = [0.25, 1.23, -0.8]$ of
raw outputs from the NN. Then,
$$P(y_i = \text{seal}) = \frac{e^{0.25}}{e^{0.25} + e^{1.23} + e^{-0.8}} = 0.249$$
$$P(y_i = \text{panda}) = \frac{e^{1.23}}{e^{0.25} + e^{1.23} + e^{-0.8}} = 0.664$$
$$P(y_i = \text{duck}) = \frac{e^{-0.8}}{e^{0.25} + e^{1.23} + e^{-0.8}} = 0.087$$
In a multiclass classification problem, where the classes are mutually exclusive,
notice how the entries of the softmax output sum up to 1: 0.664 + 0.249 +
0.087 = 1.
Therefore, we conclude that there’s a 66.4% chance that the given image
belongs to class 1 (panda), a 24.9% chance that it is a seal, and around an
8.7% chance that it is a duck.
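The seal/panda/duck numbers can be reproduced with a few lines of NumPy. This is a sketch; subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the result and is not part of the formula in the notes.

```python
import numpy as np

def softmax(z):
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

z = np.array([0.25, 1.23, -0.8])     # raw outputs for seal, panda, duck
probs = softmax(z)

print(np.round(probs, 3))            # approximately [0.249, 0.664, 0.087]
print(int(np.argmax(probs)))         # 1, i.e. panda
```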
Example: Forward Propagation
Hebb net
If data are represented in bipolar form, the desired weight update would be
• Algorithm:
Example: A Hebb net for the AND function: binary inputs and targets
After the first input pattern, no further learning occurs because the target is 0 for
the remaining patterns. Thus, the Hebb net fails to classify the AND function with
binary inputs and binary targets.
Example: A Hebb net for the AND function: bipolar inputs, bipolar targets
Now the decision boundary is correct.
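The two AND examples can be compared in code. The sketch below assumes the standard Hebb update $w_i \leftarrow w_i + x_i t$, $b \leftarrow b + t$ (as presented in Fausett's textbook), zero initial weights, a single pass over the four patterns, and a step activation that fires when the net input is positive. The helper names `train_hebb` and `classify` are hypothetical.

```python
import numpy as np

def train_hebb(inputs, targets):
    """One pass of the Hebb rule: w += x * t, b += t for each pattern."""
    w = np.zeros(inputs.shape[1])
    b = 0.0
    for x, t in zip(inputs, targets):
        w += x * t
        b += t
    return w, b

def classify(inputs, w, b, low):
    """Step activation: 1 if the net input is positive, otherwise `low` (0 or -1)."""
    return np.where(inputs @ w + b > 0, 1, low)

binary_x = np.array([[1, 1], [1, 0], [0, 1], [0, 0]], dtype=float)
binary_t = np.array([1, 0, 0, 0], dtype=float)
bipolar_x = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
bipolar_t = np.array([1, -1, -1, -1], dtype=float)

w_bin, b_bin = train_hebb(binary_x, binary_t)
w_bip, b_bip = train_hebb(bipolar_x, bipolar_t)

# Binary case: only the first pattern changes the weights, and AND is misclassified.
print(w_bin, b_bin, classify(binary_x, w_bin, b_bin, low=0))
# Bipolar case: every pattern contributes, and all four patterns are classified correctly.
print(w_bip, b_bip, classify(bipolar_x, w_bip, b_bip, low=-1))
```

Under these assumptions the bipolar run ends with weights (2, 2) and bias -2, while the binary run stops at weights (1, 1) and bias 1 after the first pattern.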
Feedforward NN – Multilayer NN
Consider the XOR (exclusive OR) function, which outputs true (1) only when the two
inputs are different.
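Figure: a network with one hidden layer, with inputs $x_j$, hidden units $h(\mathbf{x})_i$, output
$y = f(\mathbf{x})$, weights $w_{i,j}^{(1)}$ and $w_i^{(2)}$, and biases $b_i^{(1)}$ and $b^{(2)}$.
$$\mathbf{a}(\mathbf{x}) = \mathbf{b}^{(1)} + \mathbf{W}^{(1)}\mathbf{x}; \quad a(\mathbf{x})_i = b_i^{(1)} + \sum_j w_{i,j}^{(1)} x_j$$
$$\mathbf{h}(\mathbf{x}) = \mathbf{g}(\mathbf{a}(\mathbf{x}))$$
$$f(\mathbf{x}) = \underbrace{o}_{\text{output activation function}}\left(b^{(2)} + \mathbf{w}^{(2)T}\mathbf{h}(\mathbf{x})\right)$$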
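To illustrate how one hidden layer makes XOR representable, the sketch below evaluates the single-hidden-layer equations for all four inputs. The parameters are hand-picked rather than learned, ReLU is assumed for $g$, and the identity is assumed for the output activation $o$; none of these choices comes from the notes.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

# Hand-picked (not learned) parameters; one of many settings that realise XOR.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])
b2 = 0.0

def f(x):
    a = b1 + W1 @ x          # hidden pre-activation a(x)
    h = relu(a)              # hidden activation h(x) = g(a(x))
    return b2 + w2 @ h       # output, with identity output activation o

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, f(np.array(x, dtype=float)))
```

The output is 1 for the two inputs that differ and 0 otherwise, which no single neuron with a linear decision boundary can achieve.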
o Strictly positive and sum to one
• Predicted class is the one with the highest estimated probability
Multilayer NN
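Figure: a multilayer network with hidden layers $\mathbf{h}^{(1)}(\mathbf{x})$ and $\mathbf{h}^{(2)}(\mathbf{x})$, weights
$\mathbf{W}^{(1)}$, $\mathbf{W}^{(2)}$, $\mathbf{W}^{(3)}$, and biases $\mathbf{b}^{(1)}$, $\mathbf{b}^{(2)}$, $\mathbf{b}^{(3)}$.
• Could have $L$ hidden layers
o Layer pre-activation for $k > 0$ (with $\mathbf{h}^{(0)}(\mathbf{x}) = \mathbf{x}$):
$\mathbf{a}^{(k)}(\mathbf{x}) = \mathbf{b}^{(k)} + \mathbf{W}^{(k)}\mathbf{h}^{(k-1)}(\mathbf{x})$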
• These techniques help prevent the model from overfitting to the training data.
• Empirical risk minimization (ERM)
o The objective of ERM is to minimize the error (or loss) on the training
dataset.
o Minimize the risk (error) that the model incurs on the training dataset.
This is calculated using a loss function (e.g., Mean Squared Error for
regression, Cross-Entropy Loss for classification).
o Framework to design learning algorithms
$$\arg\min_{\boldsymbol{\theta}} \; \underbrace{\frac{1}{T}\sum_t \underbrace{\ell\left(f(\mathbf{x}^{(t)}; \boldsymbol{\theta}), y^{(t)}\right)}_{\text{loss fn.}}}_{\text{avg. loss fn.}} + \lambda\Omega(\boldsymbol{\theta}),$$
$f(\mathbf{x}^{(t)}; \boldsymbol{\theta})$: the prediction made by the model for input $\mathbf{x}^{(t)}$
$\mathbf{x}^{(t)}$: training data
$y^{(t)}$: the true label for input $\mathbf{x}^{(t)}$
$\Omega(\boldsymbol{\theta})$: a regularizer (penalizes certain values of $\boldsymbol{\theta}$)
$T$: the number of samples in the training set
$\lambda$: regularization hyperparameter that controls the strength of the
penalty
o The goal is to find a function $f$ (i.e., a model, typically a neural network)
that minimizes the empirical risk. This means finding model parameters
(weights and biases) that minimize the loss over the training data.
o In multilayer neural networks, this is done through gradient-based
optimization methods like Stochastic Gradient Descent (SGD). The model
parameters are updated iteratively to reduce the empirical risk, based on
the gradient of the loss function with respect to the parameters.
• Regularization
o Regularization is used to prevent overfitting, which occurs when a model
performs well on the training data but fails to generalize to unseen data.
Regularization introduces a penalty for large or overly complex model
parameters, encouraging the model to find simpler patterns that
generalize better.
o Types of Regularization:
L2 Regularization (Ridge)
• L2 Regularization adds a penalty based on the squared
values of the model’s weights.
• $\Omega(\boldsymbol{\theta}) = \sum_j w_j^2$
L1 Regularization (Lasso)
• L1 Regularization adds a penalty based on the absolute
values of the model’s weights.
• $\Omega(\boldsymbol{\theta}) = \sum_j |w_j|$
• L1 regularization tends to drive some weights to exactly zero,
promoting sparsity in the model.
Dropout Regularization
• Dropout is a technique specific to neural networks where a
random subset of neurons is "dropped" (set to zero) during
training.
• Dropout forces the network to rely on different subsets of
neurons for each forward pass, reducing the likelihood that
the network becomes overly dependent on a particular
neuron or set of neurons.
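The three regularizers above can be sketched in a few lines. The weights, per-example losses, and drop probability below are hypothetical values for illustration. The dropout function only zeroes units, as described in the text; deep learning frameworks commonly also rescale the surviving activations, which is omitted here.

```python
import numpy as np

def l2_penalty(w):
    return np.sum(w ** 2)              # sum of squared weights

def l1_penalty(w):
    return np.sum(np.abs(w))           # sum of absolute weights

def regularized_risk(losses, w, lam, penalty):
    """Average per-example loss plus lambda times the regularizer."""
    return np.mean(losses) + lam * penalty(w)

def dropout(h, p_drop, rng):
    """Zero each activation independently with probability p_drop (training time only)."""
    keep = rng.random(h.shape) >= p_drop
    return h * keep

w = np.array([0.5, -1.0, 0.0, 2.0])    # hypothetical weights
losses = np.array([0.2, 0.4, 0.6])     # hypothetical per-example losses

print(l2_penalty(w))                   # 5.25
print(l1_penalty(w))                   # 3.5
print(regularized_risk(losses, w, lam=0.1, penalty=l2_penalty))

rng = np.random.default_rng(0)
h = np.array([0.3, 1.2, 0.7, 2.0])     # hypothetical hidden activations
h_dropped = dropout(h, p_drop=0.5, rng=rng)
print(h_dropped)
```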
• Gradient descent aims to minimize the loss function, which measures the
difference between the predicted outputs and the actual labels.
• By iteratively adjusting the model parameters (weights and biases), gradient
descent seeks to find the values that minimize this loss.
• Stochastic Gradient Descent (SGD) is an optimization algorithm that updates
the weights of the neural network using a single randomly selected sample (or
training example) at each iteration.
• SGD computes the gradient of the loss function for only one sample at a time.
• Key idea of SGD:
o Random Sampling: Instead of using the entire dataset to compute the
gradient, only one training sample is randomly selected at each iteration.
• Algorithm that performs updates after each example
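o Initialize $\boldsymbol{\theta}$ ($\boldsymbol{\theta} \equiv \{\mathbf{W}^{(1)}, \mathbf{b}^{(1)}, \ldots, \mathbf{W}^{(L+1)}, \mathbf{b}^{(L+1)}\}$)
o for $N$ iterations (each iteration is one pass over all training examples)
for each training example $(\mathbf{x}^{(t)}, y^{(t)})$
• $\boldsymbol{\Delta} = -\nabla_{\boldsymbol{\theta}}\, \ell\left(f(\mathbf{x}^{(t)}; \boldsymbol{\theta}), y^{(t)}\right) - \lambda\nabla_{\boldsymbol{\theta}}\Omega(\boldsymbol{\theta})$,
where $\nabla_{\boldsymbol{\theta}}\, \ell\left(f(\mathbf{x}^{(t)}; \boldsymbol{\theta}), y^{(t)}\right)$ is the gradient of the loss function
with respect to the weights $\mathbf{w}$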
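The per-example update loop can be sketched on a toy problem. The example below makes several assumptions that are not in the notes: a linear model with squared loss instead of a neural network, an L2 regularizer, a learning rate `alpha`, and the update step $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha\boldsymbol{\Delta}$ applied after each example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: y = 2*x1 - 3*x2 + 1, without noise.
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.0]) + 1.0

w = np.zeros(2)      # theta = (w, b)
b = 0.0
alpha = 0.05         # learning rate (assumed; not specified in the notes)
lam = 0.0            # regularization strength; 0 disables the L2 penalty

for epoch in range(20):                       # N passes over the training set
    for t in rng.permutation(len(X)):         # visit the examples in random order
        err = (X[t] @ w + b) - y[t]           # residual of the squared loss 0.5 * err^2
        delta_w = -err * X[t] - lam * 2 * w   # Delta = -grad(loss) - lam * grad(Omega)
        delta_b = -err                        # the bias is not regularized here
        w += alpha * delta_w                  # theta <- theta + alpha * Delta
        b += alpha * delta_b

print(np.round(w, 3), round(float(b), 3))     # close to the generating values
```

Each update uses the gradient of a single example, so the parameters move after every sample rather than once per pass over the dataset.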
1. Binary Cross-Entropy:
• The goal is to minimize the cross-entropy, making the predicted
probabilities as close as possible to the actual labels.
• Use case: Binary classification (two classes, e.g., 0 and 1).
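$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right],$$
where $y_i$ is the true label and $\hat{y}_i$ the predicted probability for sample $i$, and $N$ is
the number of samples.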
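A short sketch of the binary cross-entropy follows. The labels and predictions are hypothetical, and the clipping constant `eps` is an implementation detail added to avoid evaluating $\log(0)$.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_good = np.array([0.9, 0.1, 0.8, 0.7])        # confident and mostly correct
y_bad = np.array([0.4, 0.6, 0.3, 0.2])         # leaning toward the wrong class

loss_good = binary_cross_entropy(y_true, y_good)
loss_bad = binary_cross_entropy(y_true, y_bad)
print(loss_good, loss_bad)                     # the better predictions give the lower loss
```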
2. Categorical Cross-Entropy:
• Use case: Multi-class classification (more than two classes).
• It measures the dissimilarity between the predicted probability
distribution (output of the model) and the actual probability distribution
(ground truth label).
• We minimize the negative log-likelihood:
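$$\text{Loss} = \underbrace{-\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\log\hat{y}_{i,c}}_{\text{cross-entropy}},$$
which for a single example $(\mathbf{x}, y)$ reads
$$\ell(\mathbf{f}(\mathbf{x}), y) = -\sum_c 1_{(y=c)}\log f(\mathbf{x})_c = -\log f(\mathbf{x})_y.$$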
• Gradient computation
o Loss Gradient at output
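• Partial derivative:
$$\frac{\partial}{\partial f(\mathbf{x})_c}\left(-\log f(\mathbf{x})_y\right) = \frac{-1_{(y=c)}}{f(\mathbf{x})_y},$$
where $c$ indexes any of the output neurons.
o Gradient:
$$\nabla_{\mathbf{f}(\mathbf{x})}\left(-\log f(\mathbf{x})_y\right) = -\frac{1}{f(\mathbf{x})_y}\begin{bmatrix} 1_{(y=0)} \\ \vdots \\ 1_{(y=C-1)} \end{bmatrix} = -\frac{\mathbf{e}(y)}{f(\mathbf{x})_y},$$
where $\mathbf{e}(y)$ is a one-hot encoded vector in which the correct class $y$ has a
value of 1 and all other classes are 0.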
Example: Assume you are working with a classification problem that has 4 classes
(i.e., $C = 4$, indexed 0 to 3), and the true class label is $y = 2$.
The softmax output of your neural network for a given input $x$ gives the predicted
probability of each class. In this example, the probability assigned to the true class is
$f(x)_2 = 0.15$.
For the true class $y = 2$, the one-hot encoded vector is $e(y) = [0, 0, 1, 0]^T$.
In this example, the vector $e(y)$ has a 1 at the index of the correct class (class 2) and
0 elsewhere.
The gradient of the loss with respect to the output is then
$$\nabla_{f(x)}\left(-\log f(x)_y\right) = -\frac{e(y)}{f(x)_y} = -\left[0, 0, \frac{1}{f(x)_2}, 0\right]^T = -\left[0, 0, \frac{1}{0.15}, 0\right]^T \approx [0, 0, -6.67, 0]^T.$$
This gradient vector will be used during backpropagation to adjust the network's
weights to make the model's prediction for the correct class (class 2) more confident.
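The same gradient can be computed in NumPy. In the sketch below only the probability of the true class, 0.15, comes from the example; the other three entries of the softmax output are hypothetical values chosen so that the vector sums to 1.

```python
import numpy as np

# Hypothetical softmax output; only f(x)_2 = 0.15 is taken from the example.
f_x = np.array([0.10, 0.70, 0.15, 0.05])
y = 2                                  # true class (0-based index)

e_y = np.zeros_like(f_x)
e_y[y] = 1.0                           # one-hot vector e(y)

loss = -np.log(f_x[y])                 # cross-entropy loss for this single example
grad_f = -e_y / f_x[y]                 # gradient of the loss w.r.t. the output f(x)

print(round(float(loss), 3))           # 1.897
print(grad_f)                          # only the entry of the true class is non-zero
```

The low probability assigned to the true class gives a large gradient magnitude (about 6.67), which translates into a strong correction during backpropagation.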
o Loss gradient at output pre-activation
• Partial derivative:
$$\frac{\partial}{\partial a^{(L+1)}(\mathbf{x})_c}\left(-\log f(\mathbf{x})_y\right) = -\left(1_{(y=c)} - f(\mathbf{x})_c\right),$$
where $f(\mathbf{x})_y$ is the predicted probability for the true class $y$, i.e., the
output of the softmax function for class $y$.
• The softmax function outputs probabilities for each class $c$ based on the
pre-activation $a_c^{(L+1)}$, where
$$f(x)_c = \frac{\exp\left(a_c^{(L+1)}\right)}{\sum_k \exp\left(a_k^{(L+1)}\right)}.$$
This function takes in the pre-activation scores $a_c^{(L+1)}$ and converts them
into probabilities that sum to 1.
• Derivative of the loss with respect to pre-activation
We want to compute the gradient of the loss function with respect to the
pre-activation $a_c^{(L+1)}$ for class $c$.
• Case 1: $y = c$ (the true class)
By considering $\frac{\partial}{\partial a^{(L+1)}(\mathbf{x})_c}\left(-\log f(\mathbf{x})_y\right)$:
In this case, the loss is directly influenced by the probability $f(\mathbf{x})_c$ for
the correct class, and the gradient reflects how the model should
adjust this probability. For the cross-entropy loss,
$$\frac{\partial}{\partial f(x)_c}\left(-\log f(x)_y\right) = \frac{-1_{(y=c)}}{f(x)_y} = \frac{-1}{f(x)_y}.$$
Now, using the derivative of the softmax function with respect to its
pre-activation $a_c^{(L+1)}$, we get:
$$\frac{\partial f(x)_c}{\partial a_c^{(L+1)}} = \frac{\partial}{\partial a_c^{(L+1)}}\left(\frac{\exp\left(a_c^{(L+1)}\right)}{\sum_k \exp\left(a_k^{(L+1)}\right)}\right) = \frac{1_{(y=c)}\exp\left(a_c^{(L+1)}\right)\cdot\sum_k \exp\left(a_k^{(L+1)}\right) - \exp\left(a_c^{(L+1)}\right)\cdot\exp\left(a_c^{(L+1)}\right)}{\left(\sum_k \exp\left(a_k^{(L+1)}\right)\right)^2}.$$
Then, by the chain rule,
$$\frac{\partial}{\partial a_c^{(L+1)}}\left(-\log f(x)_y\right) = -\frac{1}{f(x)_y}\cdot\frac{\partial}{\partial a_c^{(L+1)}} f(x)_y$$
$$= -\frac{1}{f(x)_y}\cdot\frac{\partial}{\partial a_c^{(L+1)}}\,\mathrm{softmax}\left(\mathbf{a}^{(L+1)}\right)_y$$
$$= -\frac{1}{f(x)_y}\cdot\frac{\partial}{\partial a_c^{(L+1)}}\left(\frac{\exp\left(a_y^{(L+1)}\right)}{\sum_k \exp\left(a_k^{(L+1)}\right)}\right)$$
$$\underset{y=c}{=} -\frac{1}{f(x)_c}\cdot f(x)_c\cdot\left(1 - f(x)_c\right) = f(x)_c - 1,$$
where
$$\frac{\partial L}{\partial f(x)_c} = \frac{\partial}{\partial f(x)_c}\left(-\log f(x)_c\right) = -\frac{1}{f(x)_c}.$$
In general, for any class $c$,
$$\frac{\partial}{\partial a_c^{(L+1)}}\left(-\log f(x)_y\right) = -\left(1_{(y=c)} - f(x)_c\right) = f(x)_c - e(y)_c.$$
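The derivation above works through Case 1 only. For completeness, Case 2 ($y \neq c$) can be sketched with the same quotient-rule steps. This is a standard computation written in the notation used above, not a reproduction of the original slide.

```latex
% Case 2: y \neq c. The numerator exp(a_y) does not depend on a_c,
% so only the denominator contributes to the derivative.
\frac{\partial f(x)_y}{\partial a_c^{(L+1)}}
  = \frac{0 \cdot \sum_k \exp\left(a_k^{(L+1)}\right)
          - \exp\left(a_y^{(L+1)}\right)\exp\left(a_c^{(L+1)}\right)}
         {\left(\sum_k \exp\left(a_k^{(L+1)}\right)\right)^2}
  = -f(x)_y \, f(x)_c

% Chain rule with the derivative of -log f(x)_y:
\frac{\partial}{\partial a_c^{(L+1)}}\left(-\log f(x)_y\right)
  = -\frac{1}{f(x)_y}\cdot\left(-f(x)_y \, f(x)_c\right)
  = f(x)_c
  = -\left(1_{(y=c)} - f(x)_c\right) \quad \text{with } 1_{(y=c)} = 0
```

Both cases therefore agree with the single expression $f(x)_c - e(y)_c$.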
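o Gradient of the softmax and cross-entropy loss function for all cases
simultaneously:
o Let $\mathbf{a}^{(L+1)}(\mathbf{x})$ represent the vector of pre-activations for all classes:
$$\mathbf{a}^{(L+1)}(\mathbf{x}) = \left[a_1^{(L+1)}, a_2^{(L+1)}, \ldots, a_C^{(L+1)}\right]$$
• Let $\mathbf{e}(y)$ represent a one-hot encoded vector where the entry
corresponding to the true class $y$ is 1, and all other entries are 0:
$$\mathbf{e}(y) = [e(y)_1, e(y)_2, \ldots, e(y)_C].$$
• Stacking the partial derivatives $f(x)_c - e(y)_c$ over all classes $c$ gives
$$\nabla_{\mathbf{a}^{(L+1)}(\mathbf{x})}\left(-\log f(\mathbf{x})_y\right) = \mathbf{f}(\mathbf{x}) - \mathbf{e}(y).$$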
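The combined softmax and cross-entropy gradient, $\mathbf{f}(\mathbf{x}) - \mathbf{e}(y)$, can be verified against a finite-difference approximation. The sketch below reuses the pre-activations from the seal/panda/duck example and assumes, for illustration only, that the true class is 1 (0-based).

```python
import numpy as np

def softmax(a):
    exp_a = np.exp(a - np.max(a))      # shift by the maximum for numerical stability
    return exp_a / exp_a.sum()

def loss(a, y):
    return -np.log(softmax(a)[y])      # cross-entropy for a single example

a = np.array([0.25, 1.23, -0.8])       # pre-activations of the output layer
y = 1                                  # assumed true class

e_y = np.zeros_like(a)
e_y[y] = 1.0
analytic = softmax(a) - e_y            # f(x) - e(y)

eps = 1e-6
numeric = np.zeros_like(a)
for c in range(len(a)):
    step = np.zeros_like(a)
    step[c] = eps
    numeric[c] = (loss(a + step, y) - loss(a - step, y)) / (2 * eps)

print(np.round(analytic, 3))           # approximately [0.249, -0.336, 0.087]
print(np.allclose(analytic, numeric, atol=1e-6))
```

The entry of the true class is negative and the others are positive, so a gradient descent step raises the pre-activation of the true class and lowers the rest.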
Example: Neural Network with One Hidden Layer Using Stochastic Gradient Descent
Initialization
We start by initializing the weights and biases for each layer. Assume we use random
initialization for simplicity.
Let:
• $W_1$ be the weights matrix for the input to hidden layer (shape: 2x3).
• $W_2$ be the weights matrix for the hidden layer to the output layer (shape: 3x1).
Forward Propagation
Apply an activation function (let’s use ReLU): $a_1 = \mathrm{ReLU}(z_1)$
• $\nabla b_2 = \delta_2$
Why?
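By the chain rule, if the loss function $\phi(a)$ can be written as a function of the
pre-activations $q_i(a)$ in the layer above, then
$$\frac{\partial \phi(a)}{\partial a} = \sum_i \frac{\partial \phi(a)}{\partial q_i(a)}\frac{\partial q_i(a)}{\partial a}.$$
Considering $a^{(k)}(x)_i = b_i^{(k)} + \sum_j W_{i,j}^{(k)} h^{(k-1)}(x)_j$, the partial derivative of the
loss with respect to the $j$th unit of hidden layer $k$ is
$$\frac{\partial}{\partial h^{(k)}(x)_j}\left(-\log f(\mathbf{x})_y\right) = \sum_i \frac{\partial\left(-\log f(\mathbf{x})_y\right)}{\partial a^{(k+1)}(x)_i}\cdot\frac{\partial a^{(k+1)}(x)_i}{\partial h^{(k)}(x)_j} = \sum_i \frac{\partial\left(-\log f(\mathbf{x})_y\right)}{\partial a^{(k+1)}(x)_i}\, W_{i,j}^{(k+1)}$$
$$= \left(\mathbf{W}_{\cdot,j}^{(k+1)}\right)^T\left(\nabla_{a^{(k+1)}(x)}\left(-\log f(\mathbf{x})_y\right)\right)$$
Gradient:
$$\nabla_{\mathbf{h}^{(k)}(\mathbf{x})}\left(-\log f(\mathbf{x})_y\right) = \mathbf{W}^{(k+1)T}\left(\nabla_{\mathbf{a}^{(k+1)}(\mathbf{x})}\left(-\log f(\mathbf{x})_y\right)\right)$$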
Loss gradient at hidden layers pre-activation
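Since the $j$th activation $h^{(k)}(\mathbf{x})_j = g\left(a^{(k)}(\mathbf{x})_j\right)$ depends only on the pre-activation
$a^{(k)}(\mathbf{x})_j$, there is no sum:
$$\frac{\partial L}{\partial a^{(k)}(\mathbf{x})_j} = \frac{\partial}{\partial a^{(k)}(\mathbf{x})_j}\left(-\log f(\mathbf{x})_y\right) = \frac{\partial\left(-\log f(\mathbf{x})_y\right)}{\partial h^{(k)}(\mathbf{x})_j}\cdot\underbrace{\frac{\partial h^{(k)}(\mathbf{x})_j}{\partial a^{(k)}(\mathbf{x})_j}}_{= g'\left(a^{(k)}(\mathbf{x})_j\right)}$$
Gradient:
$$\nabla_{\mathbf{a}^{(k)}(\mathbf{x})}\left(-\log f(\mathbf{x})_y\right) = \left(\nabla_{\mathbf{h}^{(k)}(\mathbf{x})}\left(-\log f(\mathbf{x})_y\right)\right)^T \underbrace{\nabla_{\mathbf{a}^{(k)}(\mathbf{x})}\,\mathbf{h}^{(k)}(\mathbf{x})}_{\text{Jacobian (diagonal matrix)}}$$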
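Now we want to calculate the error signal $\delta_1$, which tells us how much the loss
depends on the pre-activation $z_1$ of the hidden layer. By the chain rule,
$\delta_1 = \frac{\partial L}{\partial a_1} \odot \mathrm{ReLU}'(z_1)$, where
$$\frac{\partial L}{\partial a_1} = \underbrace{\frac{\partial L}{\partial z_2}}_{= \delta_2}\cdot\underbrace{\frac{\partial z_2}{\partial a_1}}_{= W_2} = \delta_2 W_2^T,$$
since the error signal $\delta_2$ is the gradient of the loss w.r.t. $z_2$:
$$\delta_2 = \frac{\partial L}{\partial z_2}, \qquad \frac{\partial z_2}{\partial a_1} = W_2.$$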
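$$\frac{\partial L}{\partial W_{i,j}^{(k)}} = \frac{\partial}{\partial W_{i,j}^{(k)}}\left(-\log f(\mathbf{x})_y\right) = \frac{\partial\left(-\log f(\mathbf{x})_y\right)}{\partial a^{(k)}(\mathbf{x})_i}\cdot\frac{\partial a^{(k)}(\mathbf{x})_i}{\partial W_{i,j}^{(k)}} = \frac{\partial\left(-\log f(\mathbf{x})_y\right)}{\partial a^{(k)}(\mathbf{x})_i}\cdot h_j^{(k-1)}(\mathbf{x})$$
where $a^{(k)}(\mathbf{x})_i = b_i^{(k)} + \sum_j W_{i,j}^{(k)} h^{(k-1)}(\mathbf{x})_j$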
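The pieces of the one-hidden-layer example can be put together as a single SGD step. The sketch below keeps the shapes from the example ($W_1$: 2x3, $W_2$: 3x1) and the ReLU hidden activation, but several parts are assumptions for illustration: fixed hand-picked weights instead of random initialization, a sigmoid output with binary cross-entropy (so that $\delta_2 = \hat{y} - y$), a row-vector input, and a learning rate of 0.1.

```python
import numpy as np

def forward(x, y, W1, b1, W2, b2):
    z1 = x @ W1 + b1                      # hidden pre-activation, shape (1, 3)
    a1 = np.maximum(0.0, z1)              # ReLU
    z2 = a1 @ W2 + b2                     # output pre-activation, shape (1, 1)
    y_hat = 1.0 / (1.0 + np.exp(-z2))     # sigmoid output (assumption)
    loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)).item()
    return z1, a1, y_hat, loss

# One hypothetical training example and hand-picked parameters.
x = np.array([[1.0, 2.0]])
y = np.array([[1.0]])
W1 = np.array([[0.2, -0.4, 0.6],
               [0.1, 0.3, -0.5]])
b1 = np.zeros((1, 3))
W2 = np.array([[0.5], [-0.3], [0.8]])
b2 = np.zeros((1, 1))

z1, a1, y_hat, loss = forward(x, y, W1, b1, W2, b2)

# Backward pass
delta2 = y_hat - y                        # dL/dz2 for sigmoid + binary cross-entropy
grad_W2 = a1.T @ delta2                   # dL/dW2, shape (3, 1)
grad_b2 = delta2                          # dL/db2 = delta2
delta1 = (delta2 @ W2.T) * (z1 > 0)       # dL/dz1 = (delta2 W2^T) * ReLU'(z1)
grad_W1 = x.T @ delta1                    # dL/dW1, shape (2, 3)
grad_b1 = delta1                          # dL/db1 = delta1

# One SGD step on this example
lr = 0.1                                  # learning rate (assumed)
W1_new, b1_new = W1 - lr * grad_W1, b1 - lr * grad_b1
W2_new, b2_new = W2 - lr * grad_W2, b2 - lr * grad_b2
new_loss = forward(x, y, W1_new, b1_new, W2_new, b2_new)[3]

print(loss, new_loss)                     # the loss on this example decreases
```

The bias gradient equals the error signal because the pre-activation depends on the bias with derivative 1, which is why $\nabla b_2 = \delta_2$.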