
Lecture 2

Uploaded by

pill.pine6731
Copyright
© © All Rights Reserved

Lecture 1: Introduction to Neural Networks and Deep Learning

1.1 Introduction

Course Overview

• Objectives:
o Understand the foundational principles of neural networks and deep
learning.
o Gain hands-on experience with state-of-the-art deep learning
frameworks.
o Explore applications in fields like healthcare, computer vision, and
natural language processing.

• Structure:
o Weekly lectures and labs.
o Assignments, projects, and exams to assess understanding.
o Two Main Textbooks:
 "Deep Learning with Python" by Francois Chollet
 "Fundamentals of Neural Networks – Architectures, Algorithms,
and Applications" by Laurene Fausett

Importance of Neural Networks and Deep Learning

• Historical Context:
o Brief history of AI and machine learning leading to the development of
deep learning.
o Why deep learning has become prominent in recent years (availability of
data, computational power).

• Real-World Applications:
o Computer Vision: Facial recognition, object detection.

o Natural Language Processing: Machine translation, sentiment
analysis.
o Healthcare: Disease prediction, Patient outcome prediction, medical
imaging analysis.
o Self-Driving Cars: Autonomous navigation and decision-making.

1.2 What is a Neural Network?

Biological Inspiration

• Neurons in the Brain:


o Structure of a biological neuron: Dendrites, axon, synapse.
o Information processing: How neurons communicate via electrical
impulses.

Figure 1: Structure of a Biological Neuron

Figure 1 depicts the structure of a biological neuron, which serves as the
foundational inspiration for artificial neural networks (ANNs). The relationship between
biological neurons and artificial neural networks can be understood as follows:
• Dendrites (Input Layer):

o Branch-like structures that receive signals from other neurons. These
signals are chemical in nature and are converted into electrical impulses
as they move toward the neuron's cell body (soma).
o In a biological neuron, dendrites receive signals from other neurons.
Similarly, in an artificial neural network, the input layer receives data
(e.g., features of an image or text) from the external environment.
• Axon (Output Layer):
o A long, slender projection that carries electrical impulses away from the
cell body. The axon transmits these impulses to other neurons, muscles,
or glands.
o The axon in a biological neuron transmits the processed signal to other
neurons. In an ANN, the output layer sends the final processed signal
(e.g., a classification decision or a prediction) to the next layer or to the
external environment.
• Synapse (Weights):
o The small gap between the axon terminal of one neuron and the
dendrites or cell body of another neuron. When an electrical impulse
reaches the end of an axon, it triggers the release of neurotransmitters,
which cross the synapse and bind to receptors on the next neuron,
allowing the signal to continue.
o The synapse is the point of connection between two neurons where
signals are transmitted. The strength of this transmission is influenced
by the synaptic weights. In an ANN, the synapse is represented by
weights that determine how much influence an input has on the output.
These weights are adjusted during training to minimize the error in
predictions.
• Activation Function (Neuron Firing):
o Just as a biological neuron "fires" (transmits a signal) if the incoming
signals are strong enough, an artificial neuron in an ANN activates and
passes on a signal based on an activation function. This function

introduces non-linearity into the model, enabling it to learn complex
patterns.

Figure 2:
Structure and Functionality:
o Neural Networks Mimic Neuronal Processing:
o Artificial neural networks are designed to mimic the way biological
neurons process information. The architecture of an ANN—comprising
input layers, hidden layers, and output layers—parallels the structure of
interconnected neurons in the brain.
o Learning Process:
o In biological neurons, learning occurs through the strengthening or
weakening of synaptic connections, a process known as synaptic
plasticity. In ANNs, learning occurs through the adjustment of weights
and biases during the training process, typically using algorithms like
backpropagation.

Application of Biological Principles:


o Hierarchical Learning:
o Biological neural networks are capable of hierarchical learning, where
more complex patterns are learned as information passes through layers

of neurons. Similarly, deep neural networks, with many layers, can learn
hierarchical representations of data, enabling them to identify complex
patterns in images, text, and other types of data.
o Parallel Processing:
o Just as the brain processes information in parallel across many neurons,
ANNs process data in parallel across multiple nodes, making them highly
efficient for tasks like image recognition, language processing, and
more.

History of Neural Networks:

• 1943: Warren McCulloch and Walter Pitts propose the first mathematical model
of a neuron.
• 1958: Frank Rosenblatt develops the Perceptron, the first algorithmically
described neural network.
• 1980s-1990s: Development of backpropagation and the rise of multilayer
perceptrons.
• 2010s: The resurgence of neural networks, particularly deep learning, driven by
advances in computational power and large datasets.

1.3 Introduction to Deep Learning

What is Deep Learning?

• A subset of machine learning that focuses on learning representations from
data through multiple layers of abstraction.

Difference Between Machine Learning and Deep Learning

• Machine Learning:
o Requires feature engineering.
o Works well with structured data (e.g., tables of data).
• Deep Learning:
o Automatically extracts features.

o Excels with unstructured data (e.g., images, text, audio).

Why Deep Learning Works

• Large Datasets:
o Deep learning requires vast amounts of data to train effective models.
• Computational Power:
o Advances in hardware, particularly GPUs, have enabled the training of
deep networks.
• Backpropagation and Gradient Descent:
o Backpropagation: Algorithm for computing the gradient of the loss
function with respect to the network’s weights.
o Gradient Descent: Optimization algorithm used to minimize the loss
function.

Why Study Neural Networks and Deep Learning?

• Discuss the rise of deep learning and its impact on fields such as computer
vision, natural language processing, and healthcare.
• Real-world applications: self-driving cars, speech recognition, image
classification, etc.
• The importance of understanding the theory behind neural networks to apply
them effectively.

1.4 Machine Learning, Deep Learning, Neural Networks

Artificial intelligence (AI):

• AI is a branch of computer science dealing with the simulation of intelligent
behavior. AI systems typically demonstrate behaviors associated with
human intelligence such as planning, learning, reasoning, problem-solving,
knowledge representation, perception, motion, and manipulation, and to a
lesser extent social intelligence and creativity.

• Machine learning is a subset of AI that uses computer algorithms to analyze
data and make intelligent decisions based on what it has learned. Machine
learning algorithms are trained with large sets of data and they learn from
examples.
• Deep learning is a specialized subset of Machine Learning that uses layered
neural networks to simulate human decision-making. Deep learning algorithms
can label and categorize information and identify patterns. It is what enables AI
systems to continuously learn on the job, and improve the quality and accuracy
of results by determining whether decisions were correct.

Foundations of AI Learning
• What is Learning in AI?
o Learning in AI refers to the process by which algorithms adjust and
improve their performance based on data. This mimics human learning,
where experiences shape future actions and decisions.
• Types of Learning
o Supervised Learning:
 The model learns from labeled data, which means the input data
comes with the correct output.
 Example: Image classification where each image is labeled with
the correct category.
o Unsupervised Learning:
 The model learns from unlabeled data, finding hidden patterns or
intrinsic structures.
 Example: Clustering customers into different groups based on
purchasing behavior.
o Reinforcement Learning:
 The model learns by interacting with an environment, receiving
rewards or penalties.
 Example: Training a robot to navigate a maze.

What is Data Science?

• Data science is the process and method for extracting knowledge and insights
from large volumes of disparate data.
• Data Science can use many of the AI techniques to derive insight from data.

What is Machine Learning?

o Machine Learning, a subset of AI, uses computer algorithms to analyze data
and make intelligent decisions based on what it has learned.

Deep Learning

• While Machine Learning is a subset of Artificial Intelligence, Deep Learning is a
specialized subset of Machine Learning.

• Deep learning algorithms do not directly map input to output. Instead, they rely
on several layers of processing units. Each layer passes its output to the next
layer, which processes it and passes it to the next. The many layers are why
it’s called deep learning. When creating deep learning algorithms, developers
and engineers configure the number of layers and the type of functions that
connect the outputs of each layer to the inputs of the next. Then they train the
model by providing it with lots of annotated examples.

Neural Networks – Feedforward neural network (FNN)


Artificial Neuron
• An artificial neuron is simply a computational unit.

• Consider a single artificial neuron

• Consider an input vector $\mathbf{x} = (x_1, \dots, x_m)$.

• The computation in an artificial neuron can be decomposed into two steps:
o Neuron pre-activation (or input activation):

$$a(\mathbf{x}) = b + \sum_i w_i x_i = b + \mathbf{w}^T \mathbf{x}$$

• $\mathbf{w}$ are the connection weights
• $b$ is the neuron bias (scalar)
• It is a bias because if we have no input, $b$ would be the pre-activation.
• By observing a particular input, we move away from the initial
value of the neuron's pre-activation.
o Neuron (output) activation:

$$h(\mathbf{x}) = \varphi(a(\mathbf{x})) = \varphi\left(b + \sum_i w_i x_i\right)$$

 $\varphi(\cdot)$: an activation function
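The two steps above can be sketched in a few lines of Python. This is only an illustration: the weights, bias, and input values are made up, and tanh is assumed as the activation function because its output lies between -1 and 1.

```python
import math

def pre_activation(w, x, b):
    """Pre-activation a(x) = b + sum_i w_i * x_i."""
    return b + sum(wi * xi for wi, xi in zip(w, x))

def neuron(w, x, b, phi=math.tanh):
    """Output activation h(x) = phi(a(x)); tanh is an assumed choice of phi."""
    return phi(pre_activation(w, x, b))

# Illustrative values (assumptions, not taken from the lecture).
w = [0.5, -1.0]
b = 0.25
x = [2.0, 1.0]

a = pre_activation(w, x, b)                    # 0.25 + 0.5*2.0 - 1.0*1.0
h = neuron(w, x, b)                            # tanh(a)
a_no_input = pre_activation(w, [0.0, 0.0], b)  # with no input, a(x) is the bias
```

With a zero input the pre-activation reduces to `b`, which is the sense in which `b` acts as a bias.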

This is a 3D visualization of the activation of a neuron for two inputs $(x_1, x_2)$,
with output $y = h(\mathbf{x})$ ranging between -1 and 1.

[Figure: surface plot of $h(\mathbf{x})$ over $(x_1, x_2)$. Annotation: the bias $b$ only changes the position of the ridge.]

• The range of the output is determined by the activation function $\varphi(\cdot)$, which here
lies between -1 and 1.
• The neuron can be considered a binary classifier that separates the input space into
two regions. Which region a point falls in depends on the value of the input $\mathbf{X}$.
• The vector $\mathbf{W}$ is perpendicular to the hyperplane that separates the two regions
(e.g., the regions where the neuron output is -1 and 1). This follows from the geometric
interpretation of the equation $\mathbf{W} \cdot \mathbf{X} + b = 0$, which defines the
hyperplane in the input space. For any point on the hyperplane, the dot product
$\mathbf{W} \cdot \mathbf{X}$ is equal to $-b$, which is constant.
• The vector $\mathbf{W}$ is the gradient of the linear combination $\mathbf{W} \cdot \mathbf{X} + b$ with respect to
$\mathbf{X}$. The gradient points in the direction of the steepest increase of the function.
• The set of all points $\mathbf{X}$ that satisfy $\mathbf{W} \cdot \mathbf{X} + b = 0$ forms a plane perpendicular to
$\mathbf{W}$.
• The orientation of the hyperplane, therefore, is determined by $\mathbf{W}$.
• The bias $b$ shifts the hyperplane parallel to itself.
• The bias $b$ determines the position of the hyperplane relative to the origin in the
input space.
• When $b = 0$, the hyperplane passes through the origin.
• When $b > 0$, the hyperplane shifts away from the origin in the direction opposite
to $\mathbf{W}$. Increasing $b$ moves the hyperplane further along the direction where $\mathbf{W} \cdot \mathbf{X}$
is negative. This lowers the threshold for positive classification, because the
positive region $\mathbf{W} \cdot \mathbf{X} > -b$ grows.
• When $b < 0$, the hyperplane shifts away from the origin in the direction of $\mathbf{W}$.
Decreasing $b$ (making it more negative) moves the hyperplane further in the
direction where $\mathbf{W} \cdot \mathbf{X}$ is positive, effectively raising the threshold for positive
classification.
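A small numeric sketch may help to see how the bias moves the hyperplane. The point on the hyperplane closest to the origin is $-b\,\mathbf{W}/\lVert\mathbf{W}\rVert^2$, so the signed offset of the hyperplane along the direction of $\mathbf{W}$ is $-b/\lVert\mathbf{W}\rVert$. The weight vector and bias values below are illustrative assumptions.

```python
import math

def closest_point_to_origin(w, b):
    """Point on the hyperplane w.x + b = 0 that is nearest to the origin."""
    norm_sq = sum(wi * wi for wi in w)
    return [-b * wi / norm_sq for wi in w]

def signed_offset(w, b):
    """Signed distance from the origin to the hyperplane, measured along w."""
    return -b / math.sqrt(sum(wi * wi for wi in w))

w = [3.0, 4.0]  # illustrative weight vector with norm 5
offsets = {b: signed_offset(w, b) for b in (0.0, 5.0, -5.0)}

# For b = 5 the hyperplane sits one unit from the origin, opposite to w.
p = closest_point_to_origin(w, 5.0)
residual = sum(wi * pi for wi, pi in zip(w, p)) + 5.0  # should be ~0
```

A positive bias gives a negative offset (the hyperplane moves against $\mathbf{W}$), a negative bias gives a positive offset, and a zero bias puts the hyperplane through the origin.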

Introduction to Neural Network Activation Functions


1. Linear (identity) function
$$\varphi(a(\mathbf{x})) = a(\mathbf{x}) \;\rightarrow\; \varphi'(a(\mathbf{x})) = 1$$
o This function outputs the input directly without any modification.
o It is usually not used in hidden layers because it doesn't introduce non-
linearity, which is necessary for the network to learn complex patterns.
2. Sigmoid Function
$$\varphi(a(\mathbf{x})) = \frac{1}{1 + e^{-a(\mathbf{x})}} \;\rightarrow\; \varphi'(a(\mathbf{x})) = \varphi(a(\mathbf{x}))\left(1 - \varphi(a(\mathbf{x}))\right)$$
o The sigmoid function was one of the most commonly used activation
functions in the past, particularly in binary classification problems.
o It maps any input to a value between 0 and 1, making it useful when
outputs need to represent probabilities.
o The sigmoid function has an S-shaped curve, which asymptotically
approaches 0 and 1 but never reaches these values.

3. Hyperbolic Tangent (Tanh) Function
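The tanh function is another S-shaped activation function, similar to the
sigmoid, but it outputs values between -1 and 1.
$$\varphi(a(\mathbf{x})) = \frac{e^{a(\mathbf{x})} - e^{-a(\mathbf{x})}}{e^{a(\mathbf{x})} + e^{-a(\mathbf{x})}} = \frac{1 - e^{-2a(\mathbf{x})}}{1 + e^{-2a(\mathbf{x})}} \;\rightarrow\; \varphi'(a(\mathbf{x})) = 1 - \varphi^2(a(\mathbf{x}))$$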

4. Rectified Linear Unit (ReLU)


ReLU is perhaps the most popular activation function in modern neural
networks due to its simplicity and effectiveness.
$$\varphi(a(\mathbf{x})) = \max(0, a(\mathbf{x})) \;\rightarrow\; \varphi'(a(\mathbf{x})) = \begin{cases} 1, & a(\mathbf{x}) > 0 \\ 0, & a(\mathbf{x}) \le 0 \end{cases}$$
o The ReLU function outputs the input directly if it is positive; otherwise, it
outputs zero.

5. Leaky ReLU
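Leaky ReLU is a variation of the ReLU function that allows a small, non-zero
gradient when the input is negative, which helps to keep the network learning
even for negative inputs.
$$\varphi(a(\mathbf{x})) = \begin{cases} a(\mathbf{x}), & a(\mathbf{x}) > 0 \\ \alpha\, a(\mathbf{x}), & a(\mathbf{x}) \le 0 \end{cases} \;\rightarrow\; \varphi'(a(\mathbf{x})) = \begin{cases} 1, & a(\mathbf{x}) > 0 \\ \alpha, & a(\mathbf{x}) \le 0 \end{cases}$$
where $\alpha$ is a small constant (usually $\alpha = 0.01$).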

6. Softmax Function
Softmax is often used in the output layer of a neural network for multi-class
classification and returns a vector of probability scores. It converts logits (raw
output of the network) into probabilities. Let $\mathbf{z} = \mathbf{a}(\mathbf{x})$. Then
$$\varphi(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}},$$
where $\mathbf{a}(\mathbf{x})$ is the vector of raw outputs from the NN, and $n$ is the number of
classes.

Example: You’re given a dataset containing images of seals (class 0), pandas
(class 1), and ducks (class 2). You’d like to train a neural network to predict
whether a previously unseen image is that of a seal, a panda, or a duck. Thus
in this example $n = 3$. Suppose you are given the vector $\mathbf{z} = [0.25, 1.23, -0.8]$ of
raw outputs from the NN. Then,
$$P(y_i = \text{seal}) = \frac{e^{0.25}}{e^{0.25} + e^{1.23} + e^{-0.8}} = 0.249$$
$$P(y_i = \text{panda}) = \frac{e^{1.23}}{e^{0.25} + e^{1.23} + e^{-0.8}} = 0.664$$
$$P(y_i = \text{duck}) = \frac{e^{-0.8}}{e^{0.25} + e^{1.23} + e^{-0.8}} = 0.087$$
In a multiclass classification problem, where the classes are mutually exclusive,
notice how the entries of the softmax output sum to 1: 0.664 + 0.249 +
0.087 = 1.
Therefore, we conclude that there’s a 66.4% chance that the given image
belongs to class 1 (panda), a 24.9% chance that it is a seal, and around an
8.7% chance that it is a duck.
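The activation functions in this section can be written as short Python functions and checked numerically. The sketch below uses only the standard library. The finite-difference helper `numeric_derivative` and the max-subtraction inside `softmax` (a common numerical-stability trick that leaves the result unchanged) are additions for illustration, not part of the lecture. The Leaky ReLU derivative on the negative side is taken to be $\alpha$.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def d_sigmoid(a):
    s = sigmoid(a)
    return s * (1.0 - s)

def d_tanh(a):
    return 1.0 - math.tanh(a) ** 2

def relu(a):
    return max(0.0, a)

def d_relu(a):
    return 1.0 if a > 0 else 0.0

def leaky_relu(a, alpha=0.01):
    return a if a > 0 else alpha * a

def d_leaky_relu(a, alpha=0.01):
    return 1.0 if a > 0 else alpha

def softmax(z):
    m = max(z)                              # shift for numerical stability
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def numeric_derivative(f, a, eps=1e-6):
    """Central finite difference, used only to sanity-check the formulas."""
    return (f(a + eps) - f(a - eps)) / (2.0 * eps)

# Seal / panda / duck example from the text.
probs = softmax([0.25, 1.23, -0.8])
rounded = [round(p, 3) for p in probs]      # [0.249, 0.664, 0.087]
```

The rounded probabilities reproduce the values in the example, and the analytic derivatives agree with the finite-difference estimates away from the kink at zero.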

Artificial Neuron – Logistic Regression for binary classification


• If the output of the neuron $p(y = 1 \mid \mathbf{x})$ is greater than 0.5, predict class 1;
otherwise, predict class 0.

Example: Forward Propagation

Example: The role of a bias or threshold

Linear Separability via an example

Hebb net

If data are represented in bipolar form, the desired weight update would be

$$w_i(\text{new}) = w_i(\text{old}) + x_i\, y.$$

• Algorithm:

Example: A Hebb net for the AND function: binary inputs and targets

After the first input pattern, no further learning occurs because the remaining targets
are 0. Thus, the Hebb net fails to classify the AND function with binary inputs and binary targets.

Example: A Hebb net for the AND function: bipolar inputs, bipolar targets

Now the decision boundary is correct.
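The two Hebb net examples can be reproduced with a short sketch. It applies the update $w_i(\text{new}) = w_i(\text{old}) + x_i y$ once per pattern. The bias update `b += y` is an assumption here (the bias is treated as a weight on a constant input of 1), and the function names are hypothetical.

```python
def hebb_train(patterns):
    """One pass of the Hebb rule over (input, target) pairs."""
    w = [0.0, 0.0]
    b = 0.0
    for x, y in patterns:
        w = [wi + xi * y for wi, xi in zip(w, x)]
        b += y  # assumed bias rule: bias acts as a weight on a constant input 1
    return w, b

def classify(w, b, x, low):
    """Threshold unit: 1 if the net input is positive, else `low` (0 or -1)."""
    net = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if net > 0 else low

binary_and = [((1, 1), 1), ((1, 0), 0), ((0, 1), 0), ((0, 0), 0)]
bipolar_and = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]

w_bin, b_bin = hebb_train(binary_and)    # targets of 0 cause no update
w_bip, b_bip = hebb_train(bipolar_and)

binary_ok = all(classify(w_bin, b_bin, x, 0) == y for x, y in binary_and)
bipolar_ok = all(classify(w_bip, b_bip, x, -1) == y for x, y in bipolar_and)
```

With binary data only the first pattern changes the weights, so the resulting boundary misclassifies the other patterns. With bipolar data every pattern contributes and all four patterns are classified correctly.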

Feedforward NN – Multilayer NN

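What are non-linearly separable problems?

Consider the XOR (exclusive OR) function, which outputs true (1) only when the inputs are
different. No single straight line separates its two classes in the $(X_1, X_2)$ plane.

• Using a transformation of $X_1$ and $X_2$, we can make the classes linearly separable.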
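One possible transformation is sketched below. The lecture does not say which transformation it has in mind, so the choice here is an assumption: the inputs are mapped to $z_1 = \text{OR}(X_1, X_2)$ and $z_2 = \text{AND}(X_1, X_2)$, after which a single threshold unit on $(z_1, z_2)$ computes XOR.

```python
def step(a):
    """Threshold unit: 1 if a > 0, else 0."""
    return 1 if a > 0 else 0

def transform(x1, x2):
    """Assumed feature map: (OR, AND) of the two binary inputs."""
    z1 = step(x1 + x2 - 0.5)  # OR
    z2 = step(x1 + x2 - 1.5)  # AND
    return z1, z2

def xor_via_transform(x1, x2):
    """A single linear threshold on the transformed features."""
    z1, z2 = transform(x1, x2)
    return step(z1 - z2 - 0.5)

truth_table = {(x1, x2): xor_via_transform(x1, x2)
               for x1 in (0, 1) for x2 in (0, 1)}
```

In the transformed space the points are (0, 0), (1, 0), and (1, 1), and the line $z_1 - z_2 = 0.5$ separates the XOR-true point (1, 0) from the other two. The two threshold units in `transform` play the role of a hidden layer.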

Single hidden layer NN

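[Figure: single-hidden-layer network with inputs $x_j$, hidden units $h(\mathbf{x})_i$, first-layer weights $w^{(1)}_{i,j}$ and biases $b^{(1)}_i$, second-layer weights $w^{(2)}_i$ and bias $b^{(2)}$, and output $y = f(\mathbf{x})$]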

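• Hidden layer pre-activation:

$$\mathbf{a}(\mathbf{x}) = \mathbf{b}^{(1)} + \mathbf{W}^{(1)} \mathbf{x}; \qquad a(\mathbf{x})_i = b^{(1)}_i + \sum_j w^{(1)}_{i,j}\, x_j$$

• Hidden layer activation:

$$\mathbf{h}(\mathbf{x}) = \mathbf{g}(\mathbf{a}(\mathbf{x}))$$

• Output layer activation:

$$f(\mathbf{x}) = o\left(b^{(2)} + {\mathbf{w}^{(2)}}^T \mathbf{h}(\mathbf{x})\right),$$

where $o(\cdot)$ is the output activation function.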

Softmax activation function

• For multi-class classification:


o We would like to estimate the conditional probability $p(y = c \mid \mathbf{x})$

• We use the softmax activation function at the output:

$$\mathbf{o}(\mathbf{a}) = \mathrm{softmax}(\mathbf{a}) = \left[\frac{e^{a_1}}{\sum_{j=1}^{C} e^{a_j}} \;\dots\; \frac{e^{a_C}}{\sum_{j=1}^{C} e^{a_j}}\right]^T$$

o Strictly positive and sum to one
• Predicted class is the one with the highest estimated probability

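Multilayer NN

[Figure: multilayer network with hidden layers $\mathbf{h}^{(1)}(\mathbf{x})$ and $\mathbf{h}^{(2)}(\mathbf{x})$, weight matrices $\mathbf{W}^{(1)}, \mathbf{W}^{(2)}, \mathbf{W}^{(3)}$, and biases $\mathbf{b}^{(1)}, \mathbf{b}^{(2)}, \mathbf{b}^{(3)}$]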
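• Could have $L$ hidden layers
o Layer pre-activation for $k > 0$ (with $\mathbf{h}^{(0)}(\mathbf{x}) = \mathbf{x}$)

$$\mathbf{a}^{(k)}(\mathbf{x}) = \mathbf{b}^{(k)} + \mathbf{W}^{(k)} \mathbf{h}^{(k-1)}(\mathbf{x})$$

o Hidden layer activation ($k$ from 1 to $L$)

$$\mathbf{h}^{(k)}(\mathbf{x}) = \mathbf{g}\left(\mathbf{a}^{(k)}(\mathbf{x})\right)$$

o Output layer activation ($k = L + 1$)

$$\mathbf{h}^{(L+1)}(\mathbf{x}) = \mathbf{o}\left(\mathbf{a}^{(L+1)}(\mathbf{x})\right) = \mathbf{f}(\mathbf{x})$$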
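The forward pass defined by these three equations can be sketched as a loop over layers. The version below uses plain Python lists. The layer sizes and weight values are illustrative assumptions, tanh is assumed for the hidden activation $\mathbf{g}$, and softmax is assumed for the output activation $\mathbf{o}$.

```python
import math

def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def softmax(a):
    m = max(a)
    exps = [math.exp(ai - m) for ai in a]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, layers, g=math.tanh):
    """layers is a list of (W, b) pairs for k = 1, ..., L+1; returns f(x)."""
    h = x                                                   # h^(0)(x) = x
    for k, (W, b) in enumerate(layers, start=1):
        a = [bi + si for bi, si in zip(b, matvec(W, h))]    # a^(k) = b^(k) + W^(k) h^(k-1)
        if k < len(layers):
            h = [g(ai) for ai in a]                         # hidden layer activation
        else:
            h = softmax(a)                                  # output layer activation
    return h

# Illustrative network: 2 inputs, hidden layers of 3 and 2 units, 3 classes.
layers = [
    ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1]),
    ([[0.3, -0.1, 0.2], [0.0, 0.5, -0.4]], [0.05, -0.05]),
    ([[0.2, -0.3], [0.1, 0.4], [-0.6, 0.2]], [0.0, 0.0, 0.0]),
]
f = forward([1.0, 2.0], layers)
```

The output `f` is a vector of class probabilities, so its entries are positive and sum to one.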

Empirical risk minimization (ERM), Regularization

• ERM defines the training objective, and regularization helps prevent the model from overfitting to the training data.
• Empirical risk minimization (ERM)

o The objective of ERM is to minimize the error (or loss) on the training
dataset.
o Minimize the risk (error) that the model incurs on the training dataset.
This is calculated using a loss function (e.g., Mean Squared Error for
regression, Cross-Entropy Loss for classification).
o Framework to design learning algorithms:

$$\underset{\boldsymbol{\theta}}{\arg\min}\; \underbrace{\frac{1}{T} \sum_t \underbrace{\ell\left(f(\mathbf{x}^{(t)}; \boldsymbol{\theta}),\, y^{(t)}\right)}_{\text{loss fn.}}}_{\text{avg. loss fn.}} + \lambda\, \Omega(\boldsymbol{\theta}),$$

 $f(\mathbf{x}^{(t)}; \boldsymbol{\theta})$: the prediction made by the model for input $\mathbf{x}^{(t)}$
 $\mathbf{x}^{(t)}$: training data
 $y^{(t)}$: the true label for input $\mathbf{x}^{(t)}$
 $\Omega(\boldsymbol{\theta})$: a regularizer (penalizes certain values of $\boldsymbol{\theta}$)
 $T$: the number of samples in the training set
 $\lambda$: regularization hyperparameter that controls the strength of the
penalty
o The goal is to find a function 𝑓𝑓 (i.e., a model, typically a neural network)
that minimizes the empirical risk. This means finding model parameters
(weights and biases) that minimize the loss over the training data.
o In multilayer neural networks, this is done through gradient-based
optimization methods like Stochastic Gradient Descent (SGD). The model
parameters are updated iteratively to reduce the empirical risk, based on
the gradient of the loss function with respect to the parameters.
• Regularization
o Regularization is used to prevent overfitting, which occurs when a model
performs well on the training data but fails to generalize to unseen data.
Regularization introduces a penalty for large or overly complex model
parameters, encouraging the model to find simpler patterns that
generalize better.
o Types of Regularization:

 L2 Regularization (Ridge)
• L2 Regularization adds a penalty based on the squared
values of the model’s weights.
• $\Omega(\boldsymbol{\theta}) = \sum_j w_j^2$
 L1 Regularization (Lasso)
• L1 Regularization adds a penalty based on the absolute
values of the model’s weights.
• $\Omega(\boldsymbol{\theta}) = \sum_j |w_j|$
• L1 regularization drives some weights to exactly zero,
promoting sparsity in the model.
 Dropout Regularization
• Dropout is a technique specific to neural networks where a
random subset of neurons is "dropped" (set to zero) during
training.
• Dropout forces the network to rely on different subsets of
neurons for each forward pass, reducing the likelihood that
the network becomes overly dependent on a particular
neuron or set of neurons.
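The three regularizers can be sketched as follows. The function names are hypothetical. The dropout helper only zeroes activations during training, as described above, and does not rescale the surviving activations, which many frameworks additionally do.

```python
import random

def l2_penalty(weights):
    """Omega(theta) = sum_j w_j^2."""
    return sum(w * w for w in weights)

def l1_penalty(weights):
    """Omega(theta) = sum_j |w_j|."""
    return sum(abs(w) for w in weights)

def dropout(activations, p_drop, rng):
    """Zero each activation independently with probability p_drop (training only)."""
    return [0.0 if rng.random() < p_drop else a for a in activations]

weights = [1.0, -2.0, 3.0]
omega_l2 = l2_penalty(weights)   # 1 + 4 + 9
omega_l1 = l1_penalty(weights)   # 1 + 2 + 3

rng = random.Random(0)           # fixed seed so the sketch is repeatable
hidden = [0.5, -1.2, 0.8, 2.0]
dropped = dropout(hidden, 0.5, rng)
```

Each call to `dropout` draws a fresh random mask, so successive forward passes rely on different subsets of neurons.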

Stochastic Gradient Descent (SGD)

• Gradient descent aims to minimize the loss function, which measures the
difference between the predicted outputs and the actual labels.
• By iteratively adjusting the model parameters (weights and biases), gradient
descent seeks to find the values that minimize this loss.
• Stochastic Gradient Descent (SGD) is an optimization algorithm that updates
the weights of the neural network using a single randomly selected sample (or
training example) at each iteration.
• SGD computes the gradient of the loss function for only one sample at a time.
• Key idea of SGD:

o Random Sampling: Instead of using the entire dataset to compute the
gradient, only one training sample is randomly selected at each iteration.
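• Algorithm that performs updates after each example
o Initialize $\boldsymbol{\theta}$ (where $\boldsymbol{\theta} \equiv \{\mathbf{W}^{(1)}, \mathbf{b}^{(1)}, \dots, \mathbf{W}^{(L+1)}, \mathbf{b}^{(L+1)}\}$)
o for $N$ iterations (each iteration is a pass over all examples)
 for each training example $(\mathbf{x}^{(t)}, y^{(t)})$
• $\boldsymbol{\Delta} = -\nabla_{\boldsymbol{\theta}}\, \ell\left(\mathbf{f}(\mathbf{x}^{(t)}; \boldsymbol{\theta}),\, y^{(t)}\right) - \lambda \nabla_{\boldsymbol{\theta}}\, \Omega(\boldsymbol{\theta})$, where the first term is the gradient of the loss function with respect to the weights
• $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha \boldsymbol{\Delta}$, where $\alpha$ is a hyperparameter (learning rate),
which controls how large the updates are at each step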
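• To apply this algorithm to NN training, we need
o The loss function $\ell\left(\mathbf{f}(\mathbf{x}^{(t)}; \boldsymbol{\theta}),\, y^{(t)}\right)$
o A procedure to compute the parameter gradients $\nabla_{\boldsymbol{\theta}}\, \ell\left(\mathbf{f}(\mathbf{x}^{(t)}; \boldsymbol{\theta}),\, y^{(t)}\right)$
o The regularizer $\Omega(\boldsymbol{\theta})$ (and the gradient $\nabla_{\boldsymbol{\theta}}\, \Omega(\boldsymbol{\theta})$)
o Initialization method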
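As a minimal sketch of the update loop, the code below runs SGD on a single sigmoid neuron (logistic regression) that learns the binary AND function. The model, the data, the zero initialization, and the values of `alpha` and `lam` are all assumptions chosen to keep the example small. The regularizer is taken to be L2 on the weights, and the bias is left unregularized.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def predict(theta, x):
    w, b = theta
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

def average_loss(theta, data):
    """Average binary cross-entropy over the data set."""
    total = 0.0
    for x, y in data:
        p = predict(theta, x)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(data)

def sgd(data, epochs=1000, alpha=0.5, lam=0.001, seed=0):
    rng = random.Random(seed)
    w, b = [0.0, 0.0], 0.0                     # initialization (assumed: zeros)
    for _ in range(epochs):                    # one iteration = one pass over all examples
        order = list(data)
        rng.shuffle(order)                     # visit examples in random order
        for x, y in order:
            err = predict((w, b), x) - y       # d(loss)/d(pre-activation) for sigmoid + cross-entropy
            # Delta = -grad(loss) - lam * grad(Omega), with Omega = sum_j w_j^2
            delta_w = [-(err * xi) - lam * 2.0 * wi for xi, wi in zip(x, w)]
            delta_b = -err
            w = [wi + alpha * dwi for wi, dwi in zip(w, delta_w)]   # theta <- theta + alpha * Delta
            b = b + alpha * delta_b
    return w, b

data = [((1.0, 1.0), 1), ((1.0, 0.0), 0), ((0.0, 1.0), 0), ((0.0, 0.0), 0)]
initial_loss = average_loss(([0.0, 0.0], 0.0), data)
theta = sgd(data)
final_loss = average_loss(theta, data)
predictions = [1 if predict(theta, x) > 0.5 else 0 for x, _ in data]
```

Each inner-loop step uses the gradient of a single example, which is what distinguishes SGD from full-batch gradient descent.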

Cross-Entropy Loss function for classification


• Neural network estimates $\hat{y}_{i,c} = f(\mathbf{x})_c = p(y = c \mid \mathbf{x})$
o We could maximize the probabilities of $y^{(t)}$ given $\mathbf{x}^{(t)}$ in the training set
• It measures the difference between the predicted probability distribution
(output of the neural network) and the actual distribution (the true labels).
• The goal during training is to minimize this difference, which means the
network's predictions become more accurate.
• In classification tasks, the output of a neural network is typically a probability
distribution over the possible classes.
o For binary classification, this is often achieved using a sigmoid activation
function in the output layer.
o For multi-class classification, a softmax function is used.
1. Binary Cross-Entropy (a.k.a. log loss):

• The goal is to minimize the cross-entropy, making the predicted
probabilities as close as possible to the actual labels.
• Use case: Binary classification (two classes, e.g., 0 and 1).

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right],$$

o $L$ is the loss (cross-entropy)
o $N$: the number of samples (a batch of data)
o $y_i$: the true label for the $i$th sample (0 or 1) (e.g., 1 if the patient has
diabetes)
o $\hat{y}_i$: the predicted probability that the $i$th sample belongs to class 1
(say $\hat{y}_i = 0.8$: the model predicts an 80% chance that patient $i$
has diabetes)

2. Categorical Cross-Entropy:
• Use case: Multi-class classification (more than two classes).
• It measures the dissimilarity between the predicted probability
distribution (output of the model) and the actual probability distribution
(ground truth label).
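• We minimize the negative log-likelihood (cross-entropy):

$$\text{Loss} = -\sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c},$$

and for a single example $(\mathbf{x}, y)$,

$$\ell(\mathbf{f}(\mathbf{x}), y) = -\sum_{c} 1_{(y=c)} \log f(\mathbf{x})_c = -\log f(\mathbf{x})_y,$$

• where $N$ is the number of samples, $C$ is the number of classes, and $1_{(y=c)}$ is 1 if $y = c$ and 0 otherwise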
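Both losses are short to write down in code. The sketch below uses hypothetical function names and the 80% diabetes prediction from the text as a check. It also confirms that the one-hot sum and the shortcut $-\log f(\mathbf{x})_y$ give the same value for a single example.

```python
import math

def binary_cross_entropy(y_true, y_pred):
    """L = -(1/N) * sum_i [y_i log(p_i) + (1 - y_i) log(1 - p_i)]."""
    n = len(y_true)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred)) / n

def categorical_cross_entropy(one_hot, probs):
    """-sum_c y_c log(p_c) for one example with a one-hot label."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def negative_log_likelihood(label, probs):
    """-log f(x)_y for one example with an integer class label."""
    return -math.log(probs[label])

# Patient with diabetes (y = 1) and a predicted probability of 0.8.
bce_single = binary_cross_entropy([1], [0.8])

# Illustrative 3-class prediction whose true class is index 1.
probs = [0.249, 0.664, 0.087]
cce = categorical_cross_entropy([0, 1, 0], probs)
nll = negative_log_likelihood(1, probs)
```

A confident correct prediction gives a loss near zero, while a confident wrong prediction is penalized heavily because $-\log p$ grows without bound as $p$ approaches 0.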

Training NN – Output layer gradient

• Gradient computation
o Loss Gradient at output
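• Partial derivative:

$$\frac{\partial}{\partial f(\mathbf{x})_c} \left(-\log f(\mathbf{x})_y\right) = \frac{-1_{(y=c)}}{f(\mathbf{x})_y},$$

where $c$ indexes any of the output neurons.
o Gradient:

$$\nabla_{\mathbf{f}(\mathbf{x})} \left(-\log f(\mathbf{x})_y\right) = -\frac{1}{f(\mathbf{x})_y} \begin{bmatrix} 1_{(y=0)} \\ \vdots \\ 1_{(y=C-1)} \end{bmatrix} = -\frac{\mathbf{e}(y)}{f(\mathbf{x})_y},$$

where $\mathbf{e}(y)$ is a one-hot encoded vector in which the correct class $y$ has a
value of 1 and all other classes are 0.

[Figure: network diagram with the loss $\ell(f(\mathbf{x}), y)$ at the output]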

Example: Assume you are working with a classification problem that has 4 classes
(i.e., 𝐶𝐶 = 4), and the true class label 𝑦𝑦 is class 2.

The softmax output of your neural network might look like this for a given input $x$:

$$f(x) = [0.1, 0.7, 0.15, 0.05]^T$$

This means the network is predicting the following probabilities for each class:

$$P(\text{class } 0) = 0.1; \quad P(\text{class } 1) = 0.7; \quad P(\text{class } 2) = 0.15; \quad P(\text{class } 3) = 0.05.$$

For the true class $y = 2$, the one-hot encoded vector $e(y)$ will look like this:

$$e(y) = [0, 0, 1, 0]^T$$

In this example, the vector $e(y)$ has a 1 at the index of the correct class (class 2) and
0 elsewhere.

Now, from the gradient formula,

$$\nabla_{f(x)} \left(-\log f(x)_y\right) = -\frac{e(y)}{f(x)_y} = -\left[0, 0, \frac{1}{f(x)_2}, 0\right]^T = -\left[0, 0, \frac{1}{0.15}, 0\right]^T = -[0, 0, 6.67, 0]^T.$$

This gradient vector will be used during backpropagation to adjust the network's
weights to make the model's prediction for the correct class (class 2) more confident.
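The numbers in this example can be reproduced with a few lines of Python. The function name is hypothetical. The gradient is $-\mathbf{e}(y)/f(x)_y$, so only the entry of the true class is non-zero.

```python
def grad_wrt_output(f, y):
    """Gradient of -log f[y] with respect to the output vector f: -e(y) / f[y]."""
    return [-(1.0 if c == y else 0.0) / f[y] for c in range(len(f))]

f = [0.1, 0.7, 0.15, 0.05]   # softmax output from the example
y = 2                        # true class

grad = grad_wrt_output(f, y)
rounded = [round(g, 2) for g in grad]   # third entry is -6.67, the rest are zero
```

The large magnitude of the non-zero entry reflects the low probability (0.15) that the network assigned to the true class.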

• Loss Gradient at output pre-activation

[Figure: network diagram with the loss $\ell(f(\mathbf{x}), y)$ at the output]

• Partial derivative:

$$\frac{\partial}{\partial a^{(L+1)}(\mathbf{x})_c} \left(-\log f(\mathbf{x})_y\right) = -\left(1_{(y=c)} - f(\mathbf{x})_c\right),$$

where $f(\mathbf{x})_y$ is the predicted probability for the true class $y$, i.e., the
output of the softmax function for class $y$.
• The softmax function outputs probabilities for each class $c$ based on the
pre-activation $a_c^{(L+1)}$, where

$$f(x)_c = \frac{\exp\left(a_c^{(L+1)}\right)}{\sum_k \exp\left(a_k^{(L+1)}\right)}.$$

This function takes in the pre-activation scores $a_c^{(L+1)}$ and converts them
into probabilities that sum to 1.
• Derivative of the loss with respect to pre-activation
We want to compute the gradient of the loss function with respect to the
pre-activation $a_c^{(L+1)}$ for class $c$.
• Case 1: $y = c$ (the true class)
Consider $\frac{\partial}{\partial a^{(L+1)}(\mathbf{x})_c} \left(-\log f(\mathbf{x})_y\right)$.

In this case, the loss is directly influenced by the probability $f(\mathbf{x})_c$ for
the correct class, and the gradient reflects how the model should
adjust this probability. For the cross-entropy loss,

$$\frac{\partial}{\partial f(x)_c} \left(-\log f(x)_y\right) = \frac{-1_{(y=c)}}{f(x)_y} = \frac{-1}{f(x)_y}.$$

Now, using the derivative of the softmax function with respect to its
pre-activation $a_c^{(L+1)}$, we get:

$$\frac{\partial f(x)_c}{\partial a_c^{(L+1)}} = \frac{\partial}{\partial a_c^{(L+1)}} \left( \frac{\exp\left(a_c^{(L+1)}\right)}{\sum_k \exp\left(a_k^{(L+1)}\right)} \right) = \frac{1_{(y=c)} \exp\left(a_c^{(L+1)}\right) \cdot \sum_k \exp\left(a_k^{(L+1)}\right) - \exp\left(a_c^{(L+1)}\right) \cdot \exp\left(a_c^{(L+1)}\right)}{\left(\sum_k \exp\left(a_k^{(L+1)}\right)\right)^2}$$

$$= f(x)_c - f(x)_c^2 = f(x)_c \cdot (1 - f(x)_c).$$
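So, for $y = c$, the total derivative is

$$\frac{\partial}{\partial a_c^{(L+1)}} \left(-\log f(x)_y\right) = -\frac{1}{f(x)_y} \cdot \frac{\partial}{\partial a_c^{(L+1)}} f(x)_y = -\frac{1}{f(x)_y} \cdot \frac{\partial}{\partial a_c^{(L+1)}}\, \mathrm{softmax}\left(\mathbf{a}^{(L+1)}\right)_y$$

$$= -\frac{1}{f(x)_y} \cdot \frac{\partial}{\partial a_c^{(L+1)}} \left( \frac{\exp\left(a_y^{(L+1)}\right)}{\sum_k \exp\left(a_k^{(L+1)}\right)} \right) \underset{y=c}{=} -\frac{1}{f(x)_c} \cdot f(x)_c \cdot (1 - f(x)_c) = f(x)_c - 1.$$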

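• Case 2: $y \ne c$ (incorrect class)

The loss depends only on the true-class probability $f(x)_y$, so we need the derivative
of $f(x)_y$ with respect to $a_c^{(L+1)}$, $y \ne c$:

$$\frac{\partial f(x)_y}{\partial a_c^{(L+1)}} = \frac{0 - \exp\left(a_y^{(L+1)}\right) \cdot \exp\left(a_c^{(L+1)}\right)}{\left(\sum_k \exp\left(a_k^{(L+1)}\right)\right)^2} = -f(x)_y \cdot f(x)_c$$

By the chain rule,

$$\frac{\partial L}{\partial a_c^{(L+1)}} = \frac{\partial L}{\partial f(x)_y} \cdot \frac{\partial f(x)_y}{\partial a_c^{(L+1)}} = -\frac{1}{f(x)_y} \cdot \left(-f(x)_y \cdot f(x)_c\right) = f(x)_c$$

where

$$\frac{\partial L}{\partial f(x)_y} = \frac{\partial}{\partial f(x)_y} \left(-\log f(x)_y\right) = -\frac{1}{f(x)_y}$$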

Thus, by combining cases 1 and 2,

$$\frac{\partial}{\partial a_c^{(L+1)}} \left(-\log f(x)_y\right) = -\left(1_{(y=c)} - f(x)_c\right) = f(x)_c - e(y)_c$$
o Gradient of the softmax and cross-entropy loss function for all cases
simultaneously:
o Let $\mathbf{a}^{(L+1)}(\mathbf{x})$ represent the vector of pre-activations for all classes:

$$\mathbf{a}^{(L+1)}(\mathbf{x}) = \left[a_1^{(L+1)}, a_2^{(L+1)}, \dots, a_C^{(L+1)}\right]$$

• Let $\mathbf{f}(\mathbf{x})$ represent the vector of softmax outputs (class probabilities):

$$\mathbf{f}(\mathbf{x}) = [f(x)_1, f(x)_2, \dots, f(x)_C]$$

• Let $\mathbf{e}(y)$ represent a one-hot encoded vector where the entry
corresponding to the true class $y$ is 1, and all other entries are 0:

$$\mathbf{e}(y) = [e(y)_1, e(y)_2, \dots, e(y)_C].$$

o For example, if $y = 2$ in a 3-class classification problem,

$$\mathbf{e}(y) = [0, 1, 0].$$

• Vector form of the gradient

$$\nabla_{\mathbf{a}^{(L+1)}(\mathbf{x})} \left(-\log f(\mathbf{x})_y\right) = \mathbf{f}(\mathbf{x}) - \mathbf{e}(y).$$
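The vector form $\mathbf{f}(\mathbf{x}) - \mathbf{e}(y)$ can be checked against a finite-difference estimate of the loss gradient. The sketch below is a sanity check under assumed values: the pre-activation vector and the true class are illustrative, class indices are 0-based as in Python, and the function names are hypothetical.

```python
import math

def softmax(a):
    m = max(a)
    exps = [math.exp(ai - m) for ai in a]
    total = sum(exps)
    return [e / total for e in exps]

def loss(a, y):
    """Cross-entropy loss -log f(x)_y as a function of the pre-activations."""
    return -math.log(softmax(a)[y])

def analytic_grad(a, y):
    """f(x) - e(y)."""
    f = softmax(a)
    return [f[c] - (1.0 if c == y else 0.0) for c in range(len(a))]

def numeric_grad(a, y, eps=1e-6):
    """Central finite differences, one pre-activation at a time."""
    grad = []
    for c in range(len(a)):
        plus = list(a)
        minus = list(a)
        plus[c] += eps
        minus[c] -= eps
        grad.append((loss(plus, y) - loss(minus, y)) / (2.0 * eps))
    return grad

a = [0.25, 1.23, -0.8]   # illustrative pre-activations
y = 0                    # illustrative true class (0-based index)

g_analytic = analytic_grad(a, y)
g_numeric = numeric_grad(a, y)
max_error = max(abs(p - q) for p, q in zip(g_analytic, g_numeric))
```

The entries of the gradient sum to zero, because both $\mathbf{f}(\mathbf{x})$ and $\mathbf{e}(y)$ sum to one.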

Example: Neural Network with One Hidden Layer Using Stochastic Gradient Descent

Neural Network Structure:

• Input layer: 2 features

• Hidden layer: 3 neurons

• Output layer: 1 neuron (for binary classification)

Initialization

We start by initializing the weights and biases for each layer. Assume we use random
initialization for simplicity.

Let:

• $W_1$ be the weight matrix for the input to hidden layer (shape: 2x3).

• $b_1$ be the biases for the hidden layer (shape: 1x3).

• $W_2$ be the weight matrix for the hidden layer to the output layer (shape: 3x1).

• $b_2$ be the bias for the output layer (shape: 1x1).

Forward Propagation

Given an input vector $x = (x_1, x_2)$:

1. Hidden Layer Computation: $z_1 = x W_1 + b_1$

Apply an activation function (let’s use ReLU): $a_1 = \mathrm{ReLU}(z_1)$

2. Output layer computation: $z_2 = a_1 W_2 + b_2$

Since this is a binary classification, apply the sigmoid activation function to get
the predicted output:
$$\hat{y} = \sigma(z_2) = \frac{1}{1 + e^{-z_2}}$$
3. Loss computation
Use the binary cross-entropy loss function:
$$L = -\left[y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right]$$
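4. Backward Propagation (Gradient computation)
a. Gradient of the output layer: $\delta_2 = \hat{y} - y$
Compute the gradients of $W_2$ and $b_2$
• $\nabla W_2 = a_1^T \delta_2$
Why?
The output of the network is $\hat{y} = \sigma(z_2)$, where $z_2 = a_1 W_2 + b_2$.
The error signal $\delta_2$ is the gradient of the loss w.r.t. $z_2$:
$$\delta_2 = \frac{\partial L}{\partial z_2}.$$
Now, the gradient of the loss w.r.t. $W_2$ is:
$$\nabla W_2 = \frac{\partial L}{\partial W_2} = \underbrace{\frac{\partial L}{\partial z_2}}_{=\delta_2} \cdot \underbrace{\frac{\partial z_2}{\partial W_2}}_{=a_1} = a_1^T \delta_2$$

• $\nabla b_2 = \delta_2$

Why?

$$\nabla b_2 = \frac{\partial L}{\partial b_2} = \underbrace{\frac{\partial L}{\partial z_2}}_{=\delta_2} \cdot \underbrace{\frac{\partial z_2}{\partial b_2}}_{=1} = \delta_2$$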

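Hidden layer gradient

[Figure: network diagram highlighting the $j$th unit of a hidden layer and the loss $\ell(f(x), y)$]

If the loss function $\phi(a)$ can be written as a function of intermediate results $q_i(a)$
(here, the pre-activations in the layer above), then

$$\frac{\partial \phi(a)}{\partial a} = \sum_i \frac{\partial \phi(a)}{\partial q_i(a)} \cdot \frac{\partial q_i(a)}{\partial a},$$

where $a$ is a unit in the current layer.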

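Loss gradient at hidden layers

Considering $a^{(k)}(x)_i = b_i^{(k)} + \sum_j W_{i,j}^{(k)}\, h^{(k-1)}(x)_j$ and summing over each path through the layer above,

$$\frac{\partial L}{\partial h^{(k)}(\mathbf{x})_j} = \frac{\partial}{\partial h^{(k)}(\mathbf{x})_j} \left(-\log f(\mathbf{x})_y\right) = \sum_i \frac{\partial \left(-\log f(\mathbf{x})_y\right)}{\partial a^{(k+1)}(\mathbf{x})_i} \cdot \underbrace{\frac{\partial a^{(k+1)}(\mathbf{x})_i}{\partial h^{(k)}(\mathbf{x})_j}}_{=W_{i,j}^{(k+1)}}$$

$$= \left(\mathbf{W}_{\cdot,j}^{(k+1)}\right)^T \left(\nabla_{\mathbf{a}^{(k+1)}(\mathbf{x})} \left(-\log f(\mathbf{x})_y\right)\right)$$

Gradient:

$$\nabla_{\mathbf{h}^{(k)}(\mathbf{x})} \left(-\log f(\mathbf{x})_y\right) = {\mathbf{W}^{(k+1)}}^T \left(\nabla_{\mathbf{a}^{(k+1)}(\mathbf{x})} \left(-\log f(\mathbf{x})_y\right)\right)$$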

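Loss gradient at hidden layers pre-activation

The $j$th activation $h^{(k)}(\mathbf{x})_j = g\left(a^{(k)}(\mathbf{x})_j\right)$ depends only on the pre-activation $a^{(k)}(\mathbf{x})_j$,
so there is no sum:

$$\frac{\partial L}{\partial a^{(k)}(\mathbf{x})_j} = \frac{\partial}{\partial a^{(k)}(\mathbf{x})_j} \left(-\log f(\mathbf{x})_y\right) = \frac{\partial \left(-\log f(\mathbf{x})_y\right)}{\partial h^{(k)}(\mathbf{x})_j} \cdot \underbrace{\frac{\partial h^{(k)}(\mathbf{x})_j}{\partial a^{(k)}(\mathbf{x})_j}}_{=g'\left(a^{(k)}(\mathbf{x})_j\right)}$$

Gradient:

$$\nabla_{\mathbf{a}^{(k)}(\mathbf{x})} \left(-\log f(\mathbf{x})_y\right) = \left(\nabla_{\mathbf{h}^{(k)}(\mathbf{x})} \left(-\log f(\mathbf{x})_y\right)\right)^T \underbrace{\nabla_{\mathbf{a}^{(k)}(\mathbf{x})}\, \mathbf{h}^{(k)}(\mathbf{x})}_{\text{Jacobian (diagonal matrix)}}$$

$$= \nabla_{\mathbf{h}^{(k)}(\mathbf{x})} \left(-\log f(\mathbf{x})_y\right) \underbrace{\odot}_{\text{element-wise product}} \left[\dots, g'\left(a^{(k)}(\mathbf{x})_j\right), \dots\right]$$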

Revisit the previous example:

• Gradient of the hidden layer:

Backpropagate the error to the hidden layer:
$$\delta_1 = \delta_2 W_2^T \odot \mathrm{ReLU}'(z_1),$$
where $\delta_1$ is the error signal for the hidden layer, $\delta_2$ is the error signal for the output
layer, $W_2$ is the weight matrix between the hidden layer and the output layer, and $\odot$ is the element-wise product.
The gradients of $W_1$ and $b_1$:
$$\nabla W_1 = x^T \delta_1; \qquad \nabla b_1 = \delta_1$$
Proof:
• The hidden layer outputs $a_1 = \mathrm{ReLU}(z_1)$, where $z_1 = x W_1 + b_1$.
• The output of the network is computed from $z_2 = a_1 W_2 + b_2$.

Now we want to calculate the error signal $\delta_1$, which tells us how much the loss
depends on the pre-activation $z_1$ of the hidden layer.

$$\delta_1 = \frac{\partial L}{\partial z_1} = \underbrace{\frac{\partial L}{\partial a_1}}_{=\delta_2 W_2^T} \odot \underbrace{\frac{\partial a_1}{\partial z_1}}_{=\mathrm{ReLU}'(z_1)} = \delta_2 W_2^T \odot \mathrm{ReLU}'(z_1),$$

where

$$\frac{\partial L}{\partial a_1} = \underbrace{\frac{\partial L}{\partial z_2}}_{=\delta_2} \cdot \underbrace{\frac{\partial z_2}{\partial a_1}}_{=W_2} = \delta_2 W_2^T.$$

This holds because the error signal $\delta_2$ is the gradient of the loss w.r.t. $z_2$:

$$\delta_2 = \frac{\partial L}{\partial z_2}$$

and, since $z_2 = a_1 W_2 + b_2$,

$$\frac{\partial z_2}{\partial a_1} = W_2.$$

Next, the derivative of $a_1 = \mathrm{ReLU}(z_1)$ w.r.t. $z_1$ is

$$\frac{\partial a_1}{\partial z_1} = \mathrm{ReLU}'(z_1) = \begin{cases} 1, & z_1 > 0 \\ 0, & z_1 \le 0 \end{cases}$$
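The complete 2-3-1 example (forward pass, loss, backpropagation, and one SGD step) can be sketched in plain Python. The parameter values, the input, the label, and the learning rate are illustrative assumptions, not values from the lecture. $W_2$ (shape 3x1) is stored as a flat list of three numbers. The finite-difference helper is only there to sanity-check the backpropagated gradients.

```python
import copy
import math

def forward(x, p):
    """Row-vector convention: z1 = x W1 + b1, a1 = ReLU(z1), z2 = a1 W2 + b2."""
    z1 = [sum(x[i] * p["W1"][i][j] for i in range(2)) + p["b1"][j] for j in range(3)]
    a1 = [max(0.0, z) for z in z1]
    z2 = sum(a1[j] * p["W2"][j] for j in range(3)) + p["b2"]
    y_hat = 1.0 / (1.0 + math.exp(-z2))
    return z1, a1, y_hat

def loss(x, y, p):
    y_hat = forward(x, p)[2]
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def backward(x, y, p):
    z1, a1, y_hat = forward(x, p)
    d2 = y_hat - y                                                          # delta_2
    d1 = [d2 * p["W2"][j] * (1.0 if z1[j] > 0 else 0.0) for j in range(3)]  # delta_1
    return {
        "W1": [[x[i] * d1[j] for j in range(3)] for i in range(2)],         # x^T delta_1
        "b1": d1,
        "W2": [a1[j] * d2 for j in range(3)],                               # a1^T delta_2
        "b2": d2,
    }

def numeric_grad_W1(x, y, p, i, j, eps=1e-6):
    plus, minus = copy.deepcopy(p), copy.deepcopy(p)
    plus["W1"][i][j] += eps
    minus["W1"][i][j] -= eps
    return (loss(x, y, plus) - loss(x, y, minus)) / (2.0 * eps)

# Illustrative parameters and training example (assumptions).
params = {
    "W1": [[0.2, -0.4, 0.1], [0.5, 0.3, -0.6]],
    "b1": [0.1, 0.0, 0.2],
    "W2": [0.7, -0.3, 0.5],
    "b2": -0.1,
}
x, y = [1.0, 2.0], 1

grads = backward(x, y, params)
max_error = max(abs(numeric_grad_W1(x, y, params, i, j) - grads["W1"][i][j])
                for i in range(2) for j in range(3))

# One SGD step: theta <- theta - alpha * gradient.
alpha = 0.1
updated = {
    "W1": [[params["W1"][i][j] - alpha * grads["W1"][i][j] for j in range(3)] for i in range(2)],
    "b1": [params["b1"][j] - alpha * grads["b1"][j] for j in range(3)],
    "W2": [params["W2"][j] - alpha * grads["W2"][j] for j in range(3)],
    "b2": params["b2"] - alpha * grads["b2"],
}
loss_before = loss(x, y, params)
loss_after = loss(x, y, updated)
```

With these values the third hidden unit has a negative pre-activation, so its ReLU derivative is zero and no gradient flows into its incoming weights. A single update step lowers the loss on the training example.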

Parameter Gradient – Loss gradient of parameters

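Partial derivative (weights)

$$\frac{\partial L}{\partial W_{i,j}^{(k)}} = \frac{\partial}{\partial W_{i,j}^{(k)}} \left(-\log f(\mathbf{x})_y\right) = \frac{\partial \left(-\log f(\mathbf{x})_y\right)}{\partial a^{(k)}(\mathbf{x})_i} \cdot \frac{\partial a^{(k)}(\mathbf{x})_i}{\partial W_{i,j}^{(k)}} = \frac{\partial \left(-\log f(\mathbf{x})_y\right)}{\partial a^{(k)}(\mathbf{x})_i} \cdot h^{(k-1)}(\mathbf{x})_j$$

where $a^{(k)}(\mathbf{x})_i = b_i^{(k)} + \sum_j W_{i,j}^{(k)}\, h^{(k-1)}(\mathbf{x})_j$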
