
CE6146

Introduction to Deep Learning


Feedforward Neural Networks
Chia-Ru Chung
Department of Computer Science and Information Engineering
National Central University
2024/9/26
Outline

• Introduction to Neural Networks

• Architecture and Operation of Feedforward Neural Networks

• Mathematical Foundations of Feedforward Neural Networks

• Forward and Backward Propagation in Neural Networks

• Loss Functions and Model Evaluation Metrics

2
Intended Learning Outcomes

By the end of this lecture, you will be able to:


• Understand neural network principles and architecture.
• Describe the architecture and operation of Feedforward Neural Networks (FNNs).
• Mathematically describe neurons, weights, and activation functions.
• Explain forward and backward propagation in FNNs.
• Apply loss functions and evaluate models using appropriate metrics.

3
Introduction to Neural Networks
What is a Neural Network

• A neural network is a computational model that is loosely inspired by the


biological neural networks in the brain.
• It consists of layers of interconnected units (neurons) that process input data
and learn patterns to perform specific tasks like classification, regression, and
more.
• In artificial neural networks, neurons are connected by weights, and signals
(input data) flow through the network.
5
Basic Components of a Neural Network

• Neuron (Node): The fundamental unit of a neural network that performs
computations. [Note: the most basic computational unit of a neural network.]

• Weights (w): Parameters that determine the strength of the input signals.
• Bias (b): A value added to the weighted sum to adjust the output.
• Activation Function (σ): A non-linear function applied to the output of a
neuron to introduce non-linearity and help the network learn complex patterns.

6
Structure of Neural Networks

• Input Layer: Receives the raw input data (e.g., pixels for image recognition or
features for classification).
• Hidden Layers: Layers where the network processes data by applying
transformations to extract features.
• Output Layer: Produces the final output (e.g., a predicted class label for
classification or a numerical value for regression).
[Note: the output layer also includes an activation function.]

7
Structure of Neural Networks
[Note: the number of features determines the number of input-layer neurons; the hidden layers extract features.]

• Input Layer: Receives the raw input data (e.g., pixels for image recognition or
features for classification).
• Hidden Layers: Layers where the network processes data by applying
transformations to extract features.
• Output Layer: Produces the final output (e.g., a predicted class label for
classification or a numerical value for regression). [Note: includes an activation function.]

Source: https://learnopencv.com/understanding-feedforward-neural-networks/

8
Neuron Operation

• Each neuron computes a weighted sum of its inputs plus a bias:

z = w_1·x_1 + w_2·x_2 + ⋯ + w_n·x_n + b


• The result is passed through an activation function to generate the output:
𝑎 = 𝜎(𝑧)

where 𝜎 is the activation function (e.g., Sigmoid, ReLU, Tanh).
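As a small illustration (not from the slides), the neuron computation above can be written directly in Python; the input values, weights, and bias below are arbitrary example numbers.

```python
import math

def sigmoid(z):
    # Sigmoid activation: squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    # Weighted sum of the inputs plus the bias: z = w1*x1 + ... + wn*xn + b
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Pass the pre-activation value through the activation function: a = sigma(z)
    return sigmoid(z)

# Arbitrary example values, for illustration only
print(neuron_output(inputs=[1.0, 2.0], weights=[0.5, -0.3], bias=0.1))
```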

9
Activation Functions (1/3)
[Note: non-linear functions let the network learn more complex patterns.]

• Purpose: Introduce non-linearity to help the network learn complex patterns.


Without non-linearity, the network could only model linear relationships.
• Common Activation Functions:
‐ Sigmoid: Outputs values between 0 and 1. [Note: behaves like a probability, restricted to the range 0–1.]

‐ ReLU (Rectified Linear Unit): Outputs 0 for negative values and the input itself for
positive values. [Note: negative values become 0; positive values pass through unchanged.]

‐ Hyperbolic tangent (tanh): Outputs values between -1 and 1, centered around 0.

10
Activation Functions (2/3)

• Sigmoid [Note: outputs between 0 and 1]:
  f(x) = 1 / (1 + e^(-x))
• Hyperbolic tangent [Note: outputs between -1 and 1]:
  f(x) = (e^x − e^(-x)) / (e^x + e^(-x))
• Rectified Linear Unit [Note: outputs are never negative]:
  f(x) = max(0, x)
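These three formulas translate directly into code; a minimal NumPy sketch (added here for illustration, not part of the slides):

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)); outputs lie in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # f(x) = (e^x - e^(-x)) / (e^x + e^(-x)); outputs lie in (-1, 1)
    return np.tanh(x)

def relu(x):
    # f(x) = max(0, x); outputs are never negative
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x), sep="\n")
```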

11
Activation Functions (3/3)
• Linear
  ‐ Formula: f(x) = x
  ‐ Pros: Simple, computationally efficient
  ‐ Cons: Cannot model complex functions
  ‐ Suitable problems: Regression problems
• Sigmoid
  ‐ Formula: f(x) = 1 / (1 + e^(-x))
  ‐ Pros: Outputs between 0 and 1
  ‐ Cons: Vanishing gradient problem, not zero-centered
  ‐ Suitable problems: Binary classification, output layer in some cases
• tanh
  ‐ Formula: f(x) = (e^x − e^(-x)) / (e^x + e^(-x))
  ‐ Pros: Outputs between -1 and 1, zero-centered
  ‐ Cons: Vanishing gradient problem
  ‐ Suitable problems: Hidden layers where zero-centered outputs are desired
• ReLU
  ‐ Formula: f(x) = max(0, x)
  ‐ Pros: Computationally efficient, helps with the vanishing gradient problem
  ‐ Cons: Dying ReLU problem (some units never activate), not zero-centered
  ‐ Suitable problems: Most common, especially in CNNs and FNNs
12
Why Non-Linearity is Important

• Without non-linearity, neural networks would be no more powerful than a


linear model (e.g., a simple regression).
• Real-world data often exhibits complex, non-linear relationships that cannot
be captured by linear models.
• Activation functions like ReLU and Sigmoid help neural networks model
these complex relationships.

13
Learning in Neural Networks

• The process of learning involves adjusting the weights and biases in the
network based on the error between the predicted output and the actual target.
[Note: the network learns from its errors, which are used to go back and adjust the weights and biases.]
• Key Steps:
1) Forward Pass: Input data is passed through the network to compute the output.
   [Note: the data must first be run through the network; only then can we judge how far the result is from the correct answer.]
2) Error Calculation: The difference (error) between the predicted output and the
actual target is computed using a loss function.
3) Weight Updates: The network’s weights and biases are adjusted to minimize the
error using an optimization algorithm (such as gradient descent).
   [Note: the error from the previous step tells us how much to update the weights and biases.]
14
Perceptron: The Simplest Neural Network

• A perceptron is a single-layer neural network used for binary classification. It


is the simplest form of a neural network.
• Mathematical Operation:
z = w·x + b
where w: weight, x: input, b: bias.
If z > 0, the perceptron outputs 1; otherwise, it outputs 0 (step function).
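A minimal sketch of this perceptron rule in Python (the AND weights below are a classic textbook choice, not taken from the slides):

```python
import numpy as np

def perceptron_predict(x, w, b):
    # Weighted sum plus bias: z = w . x + b
    z = np.dot(w, x) + b
    # Step function: output 1 if z > 0, otherwise 0
    return 1 if z > 0 else 0

# Hypothetical weights and bias realizing a logical AND of two binary inputs
w, b = np.array([1.0, 1.0]), -1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, perceptron_predict(np.array(x, dtype=float), w, b))
```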

15
Limitations of Perceptrons

• Perceptrons can only model linearly separable functions.


• They cannot handle problems like XOR, which are not
linearly separable.
• This led to the development of multi-layer perceptrons
(MLP), shallow neural networks (SNNs), and other
advanced neural network architectures.
[Note: because a plain perceptron can only separate linearly separable data, multiple layers are needed for broader applications.]

Figure 6.1 in Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
16
Shallow Neural Networks
[Note: "single" means shallow, so a network with one hidden layer is called shallow.]

• Shallow neural networks (SNNs) typically refer to neural networks with one
hidden layer (but they can have two or three layers in some contexts). These
networks are considered shallow because they don’t have a large number of
hidden layers. [Note: SNN = single hidden layer; DNN = multiple hidden layers; see also MLP.]
• The term “shallow” is often used to distinguish these networks from deep
neural networks (DNNs), which have multiple hidden layers.
• Introducing non-linearity with activation functions (e.g., ReLU, Sigmoid)
allows the network to handle more complex data.
[Note: a single hidden layer of purely linear units is not enough for non-linear data, so non-linear activation functions are needed.] 17
Multi-Layer Perceptron

• A Multi-Layer Perceptron (MLP) is a type of feedforward neural network


with one or more hidden layers. [Note: see slide 21.]
• MLPs are typically fully connected, meaning each neuron in one layer is
connected to every neuron in the next layer.
• An MLP with only one hidden layer is considered shallow, while MLPs with
many hidden layers are considered deep neural networks (DNNs).
18
Summary of Key Concepts in NNs

• Neuron: Basic computational unit that processes input and generates output.
• Weight (w): The coefficient that determines the strength of the input's
contribution.
• Bias (b): Added to the weighted sum to adjust the output.
• Activation Function: Introduces non-linearity to enable the network to learn
complex patterns.

19
Architecture and Operation of FNNs
Feedforward Neural Networks (FNNs)
[Note: the flow is one-directional; no loops are formed.]

• A Feedforward Neural Network (FNN) is a type of neural network where data


flows in one direction—from the input layer, through the hidden layers, to the
output layer.
• Key Characteristics:
‐ No feedback loops or cycles.
‐ Typically used for tasks like classification, regression, and pattern recognition.
‐ Fully Connected: Every neuron in one layer is connected to every neuron in the next
layer (though they can be sparsely connected as well).
21
Layers of a FNN (1/2)
Z = wx + b
• Input Layer: [Note: the input layer only receives the raw data.]

‐ Receives raw input data (e.g., features of a dataset, pixel values in an image). Does not
perform any computations, only passes the data to the next layer.
• Hidden Layers: [Note: where the more complex modelling happens.]
‐ These layers process the data using weights, biases, and activation functions.
‐ Each neuron in a hidden layer performs a weighted sum of its inputs and applies a non-
linear activation function. [Note: this is what an SNN does; in an FNN, every hidden neuron repeats this computation.]

‐ There can be multiple hidden layers in an FNN (depending on the depth of the
network).
22
Layers of a FNN (2/2)

• Output Layer:
‐ Produces the final prediction or result (e.g., class label, regression value). The output
neurons are typically associated with the task (e.g., classification probabilities for each
class).

Source: https://learnopencv.com/understanding-feedforward-neural-networks/

23
Fully Connected FNN
[Figure: a fully connected FNN with input neurons x1 ... xN, hidden layers (Layer 1, Layer 2, ..., Layer L),
and output neurons y1 ... yM. Note: the input layer performs no computation and simply passes the
data on to the hidden layers.]
24
Example of FNN Architecture

• Problem: Predict whether an email is spam or not based on features (word


count, sender, subject, etc.). [Note: this is a classification task.]

• Architecture:
‐ Input Layer: Features of the email (e.g., word count, sender).
‐ Hidden Layers: Two hidden layers, each applying a non-linear activation function.
‐ Output Layer: Single output neuron with a Sigmoid activation function for binary
classification (spam vs. not spam).
[Note: the output lies between 0 and 1, similar to a probability.]

25
Feedforward Process in FNN (1/2)

• Input Layer: Data enters the network.


• Hidden Layers: Each neuron computes a weighted sum of inputs:
z^(l) = W^(l)·a^(l−1) + b^(l)   [Note: the previous layer’s output is the next layer’s input.]

where z^(l): weighted sum; W^(l): weights; a^(l−1): output from the previous layer; b^(l): bias term.

• Activation: Apply an activation function (e.g., ReLU, Sigmoid) to compute


the neuron’s output.
• Output Layer: Compute the final output.

26
Feedforward Process in FNN (2/2)
[Figure: a worked numerical example of the feedforward computation. Each neuron multiplies its inputs
by the connection weights and adds its bias (giving pre-activation values such as -5 and 3), then applies
the Sigmoid function σ(z) = 1 / (1 + e^(-z)), producing outputs of roughly 0.01 and 0.95.]
Source: https://speech.ee.ntu.edu.tw/~hylee/ml/2017-spring.php 27
Activation Functions in Hidden Layers

• Why Use Activation Functions?


Activation functions introduce non-linearity, enabling FNNs to model complex, non-
linear relationships.

• Common Activation Functions:


‐ ReLU (Rectified Linear Unit): Most commonly used in hidden layers due to its
simplicity and efficiency.
‐ Sigmoid: Used in binary classification problems.
‐ Tanh: Used in certain hidden layers when outputs need to be centered around zero.
28
Advantages of FNNs

• Simplicity: FNNs are easy to understand and implement.


• Versatility: FNNs can be used for a wide range of tasks, including
classification, regression, and pattern recognition.
• Flexibility: By adding more hidden layers, FNNs can model increasingly
complex data. [Note: more layers means more capacity for complex data, but also a greater risk of overfitting; see slide 30.]

• Deterministic Flow: Data flows strictly from input to output, with no


feedback loops or cycles.
29
Limitations of FNNs
[Note: the more layers, the more expensive training becomes.]
• Computational Complexity: Deep FNNs with many layers and neurons can be
computationally expensive to train, especially for large datasets.
• Overfitting: FNNs can overfit the training data if they have too many
parameters (e.g., too many hidden layers or too many neurons in each layer).
• Not Well-Suited for Sequential Data: FNNs do not have memory, making
them unsuitable for tasks like time series prediction or natural language
processing (better handled by Recurrent Neural Networks).
30
Mathematical Foundations of FNNs
Purpose of Mathematical Foundations

• Goal: To understand the mathematical principles behind both forward and


backward propagation in FNNs.
• Key Topics:
‐ Forward Propagation: How data flows through the network to generate
outputs.
‐ Backward Propagation: How the network learns by adjusting weights and
biases using gradient descent.
32
Forward Propagation

• Objective: Compute the output of the network given an input.


• Mathematical Operation: Each layer in the FNN applies a linear transformation
followed by a non-linear activation function:
Z^(l) = W^(l)·a^(l−1) + b^(l)   [Note: the previous layer’s output is the next layer’s input, applied iteratively.]

W^(l): Weight matrix at layer l.
a^(l−1): Output from the previous layer.
b^(l): Bias for the current layer.
Z^(l): Weighted sum before applying activation.
[Note: because the activation function sits in each hidden layer, one hidden layer’s output becomes the next hidden layer’s input.]

• The activation function σ(Z^(l)) is then applied: a^(l) = σ(Z^(l)) 33


Matrix Form for Multiple Neurons (1/2)

• Matrix Multiplication: For multiple neurons in a layer, forward propagation


uses matrix multiplication to efficiently compute outputs:
𝑍(𝑙) = 𝑊(𝑙)⋅𝑎(𝑙−1) + 𝑏(𝑙)
‐ This operation applies to all neurons in layer 𝑙 at once.
‐ Efficiently computes the weighted sum for the entire layer using matrix
algebra.
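A minimal NumPy sketch of this matrix-form forward pass (the layer sizes and random weights below are hypothetical, chosen only for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # weights[l] has shape (n_l, n_{l-1}) and biases[l] has shape (n_l,),
    # matching Z^(l) = W^(l) . a^(l-1) + b^(l) and a^(l) = sigma(Z^(l)).
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b      # weighted sums for every neuron in the layer at once
        a = sigmoid(z)     # non-linear activation
    return a

# Hypothetical tiny network: 3 inputs -> 4 hidden neurons -> 2 outputs
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
print(forward(np.array([1.0, 2.0, -1.0]), weights, biases))
```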

34
Matrix Form for Multiple Neurons (2/2)

[Figure: the network output is computed layer by layer as
y = σ(W_L · ... σ(W_2 · σ(W_1 · x + b_1) + b_2) ... + b_L),
for inputs x = (x1, ..., xN), per-layer weights W_1, ..., W_L and biases b_1, ..., b_L, and outputs y = (y1, ..., yM).]
Source: https://speech.ee.ntu.edu.tw/~hylee/ml/2017-spring.php 35
Backward Propagation in FNNs

• Goal: To compute the gradients of the loss function with respect to each
weight and bias, and update them to minimize the error.
• Why Backpropagation?
Forward propagation computes the output, but backward propagation tells us
[Note: the key purpose is to work out how to update the weights and biases.]
how to adjust the weights and biases to improve the network’s performance.
• Loss Function: Measures the difference between the network’s prediction and
the true target. [Note: the loss function is computed after the output layer, and only then does backpropagation run.]

36
Chain Rule and Backpropagation
[Note: backpropagation is driven by the chain rule.]
• The Chain Rule: Backpropagation leverages the chain rule from calculus to
efficiently compute the gradients of the loss function with respect to each
weight in the network. With Z^(l) = W^(l)·a^(l−1) + b^(l) (so ∂Z^(l)/∂b^(l) = 1):

∂L/∂W^(l) = ∂L/∂a^(l) · ∂a^(l)/∂Z^(l) · ∂Z^(l)/∂W^(l),   ∂L/∂b^(l) = ∂L/∂Z^(l) · ∂Z^(l)/∂b^(l)

• The chain rule is applied layer-by-layer, moving from the output layer back to
the input layer, which is why the process is called backward propagation.
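To make the factors concrete, here is a sketch of the chain rule applied to a single sigmoid neuron with a squared-error loss; the numbers and the choice of loss are illustrative, not taken from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a_prev = np.array([1.0, 2.0])      # a^(l-1): output of the previous layer
W = np.array([[0.5, -0.3]])        # W^(l)
b = np.array([0.1])                # b^(l)
y = np.array([1.0])                # target, with loss L = 0.5 * (a - y)^2

z = W @ a_prev + b                 # Z^(l) = W^(l) . a^(l-1) + b^(l)
a = sigmoid(z)                     # a^(l) = sigma(Z^(l))

# Chain rule, one factor at a time:
dL_da = a - y                      # dL/da^(l) for the squared-error loss
da_dz = a * (1.0 - a)              # dsigma/dZ^(l) for the sigmoid
dL_dz = dL_da * da_dz              # dL/dZ^(l)
dL_dW = np.outer(dL_dz, a_prev)    # dL/dW^(l) = dL/dZ^(l) . dZ^(l)/dW^(l)
dL_db = dL_dz                      # dL/db^(l), since dZ^(l)/db^(l) = 1
print(dL_dW, dL_db)
```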

37
Gradient Descent

• Objective: Use the computed gradients from backward propagation to update


the weights and biases, minimizing the loss function.
• Weight Update Rule (η: learning rate, controls the step size for weight updates):

W^(l) = W^(l) − η · ∂L/∂W^(l)
[Note: the gradient ∂L/∂W^(l) is obtained via the chain rule; gradient descent is applied hand in hand with the chain rule.]

• Bias Update Rule:

b^(l) = b^(l) − η · ∂L/∂b^(l)
38
Full Training Cycle

• Forward Propagation: Compute the network’s output by passing the inputs


forward through the layers.
• Loss Function: Calculate the error between the predicted and actual output
using a loss function.
• Backward Propagation via Chain Rule: Compute the gradients of the loss
function with respect to the weights and biases, starting from the output layer.
• Gradient Descent: Update the weights and biases to reduce the error.
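To make the full cycle concrete, here is a self-contained sketch (not taken from the lecture) of a tiny two-layer FNN trained with all four steps; the XOR data, layer sizes, loss, and learning rate are arbitrary illustrative choices, and convergence depends on the random initialization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 samples, 2 features, binary XOR targets (illustrative only)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(2, 2)), np.zeros((1, 2))   # input -> hidden
W2, b2 = rng.normal(size=(2, 1)), np.zeros((1, 1))   # hidden -> output
eta = 1.0                                            # learning rate

for epoch in range(10_000):
    # 1) Forward propagation
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    # 2) Loss function (mean squared error, for simplicity)
    loss = np.mean((a2 - y) ** 2)
    # 3) Backward propagation via the chain rule
    d2 = (2 * (a2 - y) / len(X)) * a2 * (1 - a2)
    d1 = (d2 @ W2.T) * a1 * (1 - a1)
    # 4) Gradient descent updates
    W2 -= eta * (a1.T @ d2); b2 -= eta * d2.sum(axis=0, keepdims=True)
    W1 -= eta * (X.T @ d1);  b1 -= eta * d1.sum(axis=0, keepdims=True)

print(loss, a2.round(2))
```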
39
Forward and Backward Propagation in NNs
What is Forward Propagation

• Purpose: Forward propagation is the process of computing the output of a


neural network for a given input by passing data through each layer of the
network.
• Flow of Data:
1) Input features are multiplied by weights and added to biases to compute the pre-
activation value 𝑍(𝑙).
[Note: data goes from the input layer to the first hidden layer, then on to the next layer, and so on.]
2) An activation function is applied to produce the activation 𝑎(𝑙) , which is passed to
the next layer.
41
What is Backward Propagation (1/2)

• Purpose: Backward propagation is the process of calculating the gradient of


the loss function with respect to each weight in the network, using the chain
rule. This allows the network to update its parameters to reduce error.
• Flow of Gradients: [Note: propagate backward from the output layer to the input layer.]
1) Compute the loss 𝐿 using a loss function (e.g., cross-entropy).
2) Calculate the gradient of the loss with respect to the output layer.
3) Propagate the gradients backward through each layer, adjusting weights using
gradient descent.
42
What is Backward Propagation (2/2)

• Backward Propagation is a way to figure out:


1) How much each weight and bias contributed to the error (using the chain
rule). [Note: because the chain rule is used, the gradients can then be handed to gradient descent.]

2) How to adjust those weights and biases to improve the network’s


performance. [Note: via gradient descent.]

43
Why Backpropagation Works
[Note: run forward propagation, use the loss function to compare the prediction with the actual value, apply backward propagation to update the parameters, then run the data through again from the input layer.]
• After forward propagation, the network compares its prediction to the actual
value and calculates the error (using a loss function).
• The error is sent backward through the network to adjust the weights and
biases layer by layer.
• Key Insight: By adjusting the weights according to the contribution of each
layer to the error, the network gradually improves its predictions.

44
Concept of Gradient Descent
[Note: the gradients come from the chain rule.]

• Gradient Descent is how we adjust the weights and biases.


• After computing the gradients (how much each weight or bias contributed to
the error), we update the weights by taking a small step in the direction that
reduces the error.
• The learning rate controls how big a step we take.
If 𝜂 is too small, learning is slow; if 𝜂 is too large, we might overshoot the
optimal solution.
45
Hyperparameters in Neural Networks

• Hyperparameters are values that control the training process of the neural
network but are not learned from the data. Instead, they are set before training
begins.
• Examples of Hyperparameters:
‐ Learning Rate (𝜂): Controls how big a step we take when updating weights.
‐ Number of Layers: Determines the depth of the network.
‐ Batch Size: The number of samples processed before updating weights.
‐ Epochs: The number of complete passes through the training data.
46
What is Batch Size
[Note: batching makes overfitting less likely; the batch size is often dictated by memory, and noisy, unhelpful gradients can appear especially when the data in a batch is skewed.]

• The number of training examples used to calculate the gradient before
updating the model’s weights.

‐ Small Batch Size: [Note: parameters are updated more frequently.]
Provides faster updates but can result in noisier gradients.
Commonly used when memory is limited.
‐ Large Batch Size: [Note: parameters are updated much less frequently.]
Provides more stable gradients, but updates are slower.
Requires more memory. 47
What is an Epoch

• One complete pass through the entire training dataset.


‐ Each epoch trains the model on every sample in the dataset once.
‐ The network needs multiple epochs to fully learn the patterns in the data.

• Key Insight:
‐ Training typically involves many epochs to allow the model to converge on the optimal
solution.
‐ The number of epochs needs to be balanced to avoid underfitting or overfitting.
[Note: training too many times makes the model overly familiar with the training data.]

48
How Batch Size and Epochs Affect Training
[Note: the batch size is how much data is fed in per update; larger batches mean fewer updates, smaller batches mean more updates.]

• Batch Size determines how frequently the model updates the weights:
‐ Smaller batches lead to more frequent updates.
‐ Larger batches require more memory but lead to smoother updates.

• Epochs determine how many complete passes the model makes through the
data:
‐ Too few epochs lead to underfitting. [Note: not familiar enough with the training data.]

‐ Too many epochs can cause overfitting. [Note: overly familiar with the training data.]

49
Example: Batch Size and Epochs in Training
[Note: so the total number of weight updates is (10000 / 100) × 10 = 1000.]
• Training a neural network with a dataset of 10,000 samples.
‐ Batch Size = 100: The network will update its weights after every 100 samples.
‐ Epoch = 10: The entire dataset will be passed through the network 10 times.

• Trade-offs:
‐ Small batch sizes can introduce variability in updates, leading to a more exploratory
learning process.
‐ Larger batch sizes make the learning process smoother but slower.
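The arithmetic and the batching loop can be sketched as follows (the feature matrix is a placeholder; the numbers match the example above):

```python
import numpy as np

n_samples, batch_size, epochs = 10_000, 100, 10

updates_per_epoch = n_samples // batch_size   # 100 weight updates per epoch
total_updates = updates_per_epoch * epochs    # 1,000 updates in total
print(updates_per_epoch, total_updates)

# Sketch of how mini-batches would be drawn in each epoch (shuffling is a common choice)
X = np.zeros((n_samples, 8))                  # placeholder feature matrix
for epoch in range(epochs):
    order = np.random.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        batch = X[order[start:start + batch_size]]
        # ... forward pass, loss, backward pass, and weight update on `batch` ...
```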

50
Balancing Batch Size and Epochs

• Choosing the Right Batch Size:


‐ Start with a batch size of 32 or 64 and adjust based on memory and performance.

• Choosing the Right Number of Epochs:


‐ Monitor the training loss and validation loss over time.
‐ Use early stopping to halt training when validation performance stops improving.
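The early-stopping idea above can be sketched as a small wrapper; `train_step` and `validate` are placeholders for whatever training and validation routines are actually used:

```python
def train_with_early_stopping(model, train_step, validate, max_epochs=100, patience=5):
    # Stop once the validation loss has not improved for `patience` consecutive epochs.
    best_loss, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        train_step(model)                 # one epoch of training on the training set
        val_loss = validate(model)        # loss on the held-out validation set
        if val_loss < best_loss:
            best_loss, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Early stopping at epoch {epoch}, best validation loss {best_loss:.4f}")
                break
    return model
```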

51
Practical Guidelines for Batch Size and Epochs

• Batch Size:
‐ Typically, a batch size between 32 and 256 is used in practice.
‐ For larger datasets, you can increase batch size to balance computation time and
memory usage.

• Epochs:
‐ The number of epochs is generally determined by early stopping or validation loss
behavior.
‐ You might start with 10–50 epochs and adjust based on performance.
52
Learning Curve

• A learning curve is a plot that shows how the model's performance improves
over time during training.
• Training Loss: Measures the model's performance on the training data after
each epoch.
• Validation Loss: Measures the model's performance on unseen validation data
to assess generalization.

53
How to Interpret a Learning Curve

• Underfitting: Both training and validation loss are high and do not decrease
significantly over time, indicating that the model is too simple to capture the
underlying patterns.
• Overfitting: The training loss decreases significantly, but the validation loss
starts to increase after a certain point, indicating that the model is memorizing
the training data.
• Good Fit: Both training and validation loss decrease and stabilize, indicating
that the model is learning well and generalizing to new data. 54
Using the Learning Curve to Optimize Training

• Key Insights:
‐ Early Stopping: Stop training when the validation loss stops improving to prevent
overfitting.
‐ Monitoring Performance: Regularly plot the learning curve to check if the model is
underfitting or overfitting.

• Practical Use: Use the learning curve to decide when to adjust model
complexity (e.g., adding layers, adjusting neurons, regularization).

55
Forward and Backward Propagation in Practice
• Forward Propagation: Move forward through the network, computing the
output based on the inputs. [Note: the data must go in first before there is any result.]

• Compute the Loss: Calculate the error using a loss function (e.g., cross-
entropy, mean squared error). [Note: only once there is an output can the error be measured.]
• Backward Propagation: Move backward through the network, using the error
to adjust the weights and biases at each layer. [Note: knowing the error lets us trace back how much each parameter contributed.]

• Update Weights: Use gradient descent to update the parameters of the


network. [Note: knowing each parameter’s contribution lets us update it.]
56
Choosing the Number of Layers

• Simple Tasks: Use 1–2 hidden layers for simpler tasks, like basic
classification or regression.
• Complex Tasks: Use 3 or more hidden layers for tasks that require
hierarchical feature learning.
• Guiding Principle:
‐ Start with fewer layers and increase complexity based on validation performance.
‐ Deeper networks can learn more complex patterns, but are prone to overfitting if not
regularized.
57
Choosing the Number of Neurons per Layer

• Input Layer: The number of neurons in the input layer equals the number of
features in your dataset. [Note: however many features you extract, that is how many input-layer neurons you need.]

• Output Layer: The number of neurons in the output layer equals the number of
target classes or the number of outputs. [Note: however many output classes there are, that is how many output neurons you need.]

• Hidden Layers: Common heuristics to start with:


‐ Between input and output size: The number of hidden neurons should be somewhere between
the size of the input and output layers.
‐ 2/3 Rule: A common heuristic is to start with (2/3) × input layer size + output layer size
‐ Sqrt Rule: Another starting point is √(input layer size × output layer size)
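As a quick illustration of these heuristics (the feature and class counts below are hypothetical):

```python
import math

def hidden_neuron_heuristics(n_inputs, n_outputs):
    # Two common starting points mentioned above; rough guides, not hard rules
    two_thirds_rule = (2 / 3) * n_inputs + n_outputs
    sqrt_rule = math.sqrt(n_inputs * n_outputs)
    return round(two_thirds_rule), round(sqrt_rule)

# Hypothetical example: 30 input features, 2 output classes
print(hidden_neuron_heuristics(30, 2))   # roughly (22, 8)
```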
58
Step-by-Step Guide: Optimizing Layers and Neurons (1/2)

1) Start with a Simple Architecture


Begin with 1–2 hidden layers and a moderate number of neurons (e.g., between the size of the
input and output layer).
2) Use a Train/Validation Split
Split the dataset into training, validation, and test sets. Common practice is an 80/20 split for
training and validation (or 70/15/15 for train/validation/test). Train the model on the training set
and evaluate it on the validation set to monitor generalization performance.
3) Monitor Performance on Validation Data
Track training loss, validation loss, accuracy, or other relevant metrics during training. Use early
stopping to halt training when validation loss stops improving, which prevents overfitting. 59
Step-by-Step Guide: Optimizing Layers and Neurons (2/2)

4) Experiment with Increasing Layers and Neurons


Gradually increase the number of neurons in hidden layers and add more layers if necessary.
Empirical Rules:
Start with the input/output size heuristic (e.g., number of hidden neurons between the size of the input and
output layers). Increase neurons incrementally, doubling the number and adding more hidden layers if
underfitting is observed (i.e., both training and validation loss are high).

5) Apply Regularization to Avoid Overfitting


Use dropout, L2 regularization, and early stopping to control overfitting as the network grows
deeper or has more neurons. Regularization helps when the validation loss diverges from the
training loss, indicating overfitting. 60
Example Workflow

1) Split the dataset into 80% training and 20% validation.


2) Start with a network having 1 hidden layer with 64 neurons.
3) Train the model using early stopping, monitoring validation loss to stop
training if validation loss doesn't improve for 5 epochs.
4) If the model underfits, increase the number of neurons and hidden layers
incrementally (e.g., 128 neurons, 2 hidden layers).
5) Regularize the model using dropout (0.5) and L2 regularization to control
overfitting.
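One way this workflow might look in code, as a sketch using Keras and scikit-learn (the library choice and the placeholder data are assumptions, not part of the lecture):

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Placeholder data standing in for a real dataset
X, y = np.random.rand(1000, 20), np.random.randint(0, 2, size=1000)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)   # 1) 80/20 split

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",                          # 2) 1 hidden layer, 64 neurons
                          kernel_regularizer=tf.keras.regularizers.l2(0.01)),  # 5) L2 regularization
    tf.keras.layers.Dropout(0.5),                                         # 5) dropout 0.5
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,   # 3) early stopping
                                               restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=100, batch_size=32, callbacks=[early_stop])
# 4) If the model underfits, widen or deepen it (e.g., 128 neurons, 2 hidden layers).
```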
61
Loss Functions and Model Evaluation Metrics
What is a Loss Function

• A function that measures how well a neural network’s prediction ŷ matches
the actual target value y.
• Purpose: During training, the network aims to minimize the loss function by
adjusting its weights and biases through gradient descent.

63
Common Loss Functions for Regression

• Mean Squared Error (MSE): Commonly used for regression tasks, it


measures the average squared difference between the predicted values and the
actual values.
MSE penalizes larger errors more than smaller ones, making it sensitive to outliers.

• Mean Absolute Error (MAE): Another regression loss function that measures
the average absolute difference between predictions and actual values.
Less sensitive to outliers than MSE, but does not emphasize large errors as much.

64
Common Loss Functions for Classification

• Binary Cross-Entropy (BCE): Used for binary classification tasks, where


there are two possible outcomes.
Cross-entropy measures the “distance” between the predicted probabilities and the actual
labels. A lower value indicates better predictions.

• Categorical Cross-Entropy: Used for multi-class classification where there are


more than two classes.
[Note: anything with "cross-entropy" in its name is used for classification.]
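For concreteness, the loss functions named in the last two slides can be written straight from their definitions; a NumPy sketch (the example values are made up):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average squared difference (regression)
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean Absolute Error: average absolute difference (regression)
    return np.mean(np.abs(y_true - y_pred))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # BCE: distance between predicted probabilities and 0/1 labels (binary classification)
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_true_onehot, p_pred, eps=1e-12):
    # Categorical CE: multi-class version; y_true is one-hot encoded
    p = np.clip(p_pred, eps, 1.0)
    return -np.mean(np.sum(y_true_onehot * np.log(p), axis=1))

# Made-up example values
print(mse(np.array([3.0, 5.0]), np.array([2.5, 5.5])))
print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.2])))
```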

65
Loss Functions in Practice

• Choosing the Right Loss Function:


‐ For regression tasks: Use MSE or MAE.
‐ For binary classification: Use Binary Cross-Entropy.
‐ For multi-class classification: Use Categorical Cross-Entropy.

66
Model Evaluation Metrics

• Model evaluation metrics are used to assess how well a trained model
performs on unseen (test) data. While loss functions measure how well a
model fits the training data, evaluation metrics measure generalization to new
data.
• Purpose: To ensure the model isn’t just memorizing the training data but can
generalize to new, unseen examples.

67
Metrics for Regression Models

• Mean Squared Error (MSE): Measures the average squared difference


between predicted and true values.
• Mean Absolute Error (MAE): Measures the average absolute difference
between predicted and true values.
• R-squared (𝑅2): Represents the proportion of variance in the target variable
that is predictable from the input features. A higher value indicates a better
model.
𝑅2 tells how well the model captures the variability of the target.
68
Metrics for Classification Models (1/2)

• Accuracy: Measures the proportion of correctly classified examples out of the


total examples.
Accuracy is useful when the dataset is balanced between classes but can be misleading
for imbalanced datasets.

• Precision, Recall, and F1-Score: These are especially important for


imbalanced classification problems.
[Note: if the data is too skewed, it easily leads to class imbalance.]

69
Metrics for Classification Models (2/2)

• Precision: Measures how many of the predicted positive instances are actually
positive.
• Recall (Sensitivity): Measures how many of the actual positive instances were
correctly identified.
• F1-Score: The harmonic mean of precision and recall, providing a balanced
measure.
Precision, recall, and F1-score are critical when false positives and false negatives have
different consequences (e.g., in medical diagnosis). 70
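These definitions translate into a few lines of code; a sketch computing accuracy, precision, recall, and F1 from raw binary predictions (the toy labels are made up for illustration):

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    # Confusion-matrix counts for a binary problem (1 = positive class)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # of predicted positives, how many are truly positive
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # of actual positives, how many were found
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Toy, slightly imbalanced labels (made up)
y_true = np.array([1, 0, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 0, 0, 1])
print(classification_metrics(y_true, y_pred))
```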
Choosing the Right Evaluation Metric

• For Regression:
‐ Use MSE or MAE to measure the accuracy of the predictions.
‐ Use 𝑅2 to understand how much variance the model explains.

• For Classification:
‐ Use Accuracy when the dataset is balanced.
‐ Use Precision, Recall, and F1-Score for imbalanced datasets.
‐ Use ROC-AUC for binary classification to measure how well the model separates
classes.
71
Q&A
