
CE6146

Introduction to Deep Learning


Feedforward Neural Networks
Chia-Ru Chung
Department of Computer Science and Information Engineering
National Central University
2024/9/26
Outline

• Introduction to Neural Networks

• Architecture and Operation of Feedforward Neural Networks

• Mathematical Foundations of Feedforward Neural Networks

• Forward and Backward Propagation in Neural Networks

• Loss Functions and Model Evaluation Metrics

2
Intended Learning Outcomes

By the end of this lecture, you will be able to:


• Understand neural network principles and architecture.
• Describe the architecture and operation of Feedforward Neural Networks (FNNs).
• Mathematically describe neurons, weights, and activation functions.
• Explain forward and backward propagation in FNNs.
• Apply loss functions and evaluate models using appropriate metrics.

3
Introduction to Neural Networks
What is a Neural Network

• A neural network is a computational model that is loosely inspired by the


biological neural networks in the brain.
• It consists of layers of interconnected units (neurons) that process input data
and learn patterns to perform specific tasks like classification, regression, and
more.
• In artificial neural networks, neurons are connected by weights, and signals
(input data) flow through the network.
5
Basic Components of a Neural Network

• Neuron (Node): The fundamental unit of a neural network that performs
computations. [Note: the most basic computational unit of a neural network.]

• Weights (w): Parameters that determine the strength of the input signals.
• Bias (b): A value added to the weighted sum to adjust the output.
• Activation Function (σ): A non-linear function applied to the output of a
neuron to introduce non-linearity and help the network learn complex patterns.

6
Structure of Neural Networks

• Input Layer: Receives the raw input data (e.g., pixels for image recognition or
features for classification).
• Hidden Layers: Layers where the network processes data by applying
transformations to extract features.
• Output Layer: Produces the final output (e.g., a predicted class label for
classification or a numerical value for regression).
[Note: the output layer also includes an activation function.]

7
Structure of Neural Networks
[Note: the number of features determines the number of input-layer neurons; the hidden layers extract features.]

• Input Layer: Receives the raw input data (e.g., pixels for image recognition or
features for classification).
• Hidden Layers: Layers where the network processes data by applying
transformations to extract features.
• Output Layer: Produces the final output (e.g., a predicted class label for
classification or a numerical value for regression). [Note: includes an activation function.]

Source: https://learnopencv.com/understanding-feedforward-neural-networks/

8
Neuron Operation

• Each neuron computes a weighted sum of its inputs plus a bias:

z = w_1·x_1 + w_2·x_2 + ⋯ + w_n·x_n + b


• The result is passed through an activation function to generate the output:
𝑎 = 𝜎(𝑧)

where 𝜎 is the activation function (e.g., Sigmoid, ReLU, Tanh).
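As a small illustration (not from the slides), the neuron computation above can be written directly in Python; the input values, weights, and bias below are arbitrary example numbers.

```python
import math

def sigmoid(z):
    # Sigmoid activation: squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    # Weighted sum of the inputs plus the bias: z = w1*x1 + ... + wn*xn + b
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Pass the pre-activation value through the activation function: a = sigma(z)
    return sigmoid(z)

# Arbitrary example values, for illustration only
print(neuron_output(inputs=[1.0, 2.0], weights=[0.5, -0.3], bias=0.1))
```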

9
Activation Functions (1/3)
[Note: non-linear functions let the network learn more complex patterns.]

• Purpose: Introduce non-linearity to help the network learn complex patterns.


Without non-linearity, the network could only model linear relationships.
• Common Activation Functions:
‐ Sigmoid: Outputs values between 0 and 1. [Note: behaves like a probability, restricted to the range 0–1.]

‐ ReLU (Rectified Linear Unit): Outputs 0 for negative values and the input itself for
positive values. [Note: negative values become 0; positive values pass through unchanged.]

‐ Hyperbolic tangent (tanh): Outputs values between -1 and 1, centered around 0.

10
Activation Functions (2/3)

• Sigmoid [Note: outputs between 0 and 1]:
  f(x) = 1 / (1 + e^(-x))
• Hyperbolic tangent [Note: outputs between -1 and 1]:
  f(x) = (e^x − e^(-x)) / (e^x + e^(-x))
• Rectified Linear Unit [Note: outputs are never negative]:
  f(x) = max(0, x)
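These three formulas translate directly into code; a minimal NumPy sketch (added here for illustration, not part of the slides):

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)); outputs lie in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # f(x) = (e^x - e^(-x)) / (e^x + e^(-x)); outputs lie in (-1, 1)
    return np.tanh(x)

def relu(x):
    # f(x) = max(0, x); outputs are never negative
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x), sep="\n")
```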

11
Activation Functions (3/3)
• Linear
  ‐ Formula: f(x) = x
  ‐ Pros: Simple, computationally efficient
  ‐ Cons: Cannot model complex functions
  ‐ Suitable problems: Regression problems
• Sigmoid
  ‐ Formula: f(x) = 1 / (1 + e^(-x))
  ‐ Pros: Outputs between 0 and 1
  ‐ Cons: Vanishing gradient problem, not zero-centered
  ‐ Suitable problems: Binary classification, output layer in some cases
• tanh
  ‐ Formula: f(x) = (e^x − e^(-x)) / (e^x + e^(-x))
  ‐ Pros: Outputs between -1 and 1, zero-centered
  ‐ Cons: Vanishing gradient problem
  ‐ Suitable problems: Hidden layers where zero-centered outputs are desired
• ReLU
  ‐ Formula: f(x) = max(0, x)
  ‐ Pros: Computationally efficient, helps with the vanishing gradient problem
  ‐ Cons: Dying ReLU problem (some units never activate), not zero-centered
  ‐ Suitable problems: Most common, especially in CNNs and FNNs
12
Why Non-Linearity is Important

• Without non-linearity, neural networks would be no more powerful than a


linear model (e.g., a simple regression).
• Real-world data often exhibits complex, non-linear relationships that cannot
be captured by linear models.
• Activation functions like ReLU and Sigmoid help neural networks model
these complex relationships.

13
Learning in Neural Networks

• The process of learning involves adjusting the weights and biases in the
network based on the error between the predicted output and the actual target.
[Note: the network learns from its errors, which are used to go back and adjust the weights and biases.]
• Key Steps:
1) Forward Pass: Input data is passed through the network to compute the output.
   [Note: the data must first be run through the network; only then can we judge how far the result is from the correct answer.]
2) Error Calculation: The difference (error) between the predicted output and the
actual target is computed using a loss function.
3) Weight Updates: The network’s weights and biases are adjusted to minimize the
error using an optimization algorithm (such as gradient descent).
   [Note: the error from the previous step tells us how much to update the weights and biases.]
14
Perceptron: The Simplest Neural Network

• A perceptron is a single-layer neural network used for binary classification. It


is the simplest form of a neural network.
• Mathematical Operation:
z = w·x + b
where w: weight, x: input, b: bias.
If z > 0, the perceptron outputs 1; otherwise, it outputs 0 (step function).
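A minimal sketch of this perceptron rule in Python (the AND weights below are a classic textbook choice, not taken from the slides):

```python
import numpy as np

def perceptron_predict(x, w, b):
    # Weighted sum plus bias: z = w . x + b
    z = np.dot(w, x) + b
    # Step function: output 1 if z > 0, otherwise 0
    return 1 if z > 0 else 0

# Hypothetical weights and bias realizing a logical AND of two binary inputs
w, b = np.array([1.0, 1.0]), -1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, perceptron_predict(np.array(x, dtype=float), w, b))
```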

15
Limitations of Perceptrons

• Perceptrons can only model linearly separable functions.


• They cannot handle problems like XOR, which are not
linearly separable.
• This led to the development of multi-layer perceptrons
(MLP), shallow neural networks (SNNs), and other
advanced neural network architectures.
[Note: because a plain perceptron can only separate linearly separable data, multiple layers are needed for broader applications.]

Figure 6.1 in Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
16
Shallow Neural Networks
[Note: "single" means shallow, so a network with one hidden layer is called shallow.]

• Shallow neural networks (SNNs) typically refer to neural networks with one
hidden layer (but they can have two or three layers in some contexts). These
networks are considered shallow because they don’t have a large number of
hidden layers. [Note: SNN = single hidden layer; DNN = multiple hidden layers; see also MLP.]
• The term “shallow” is often used to distinguish these networks from deep
neural networks (DNNs), which have multiple hidden layers.
• Introducing non-linearity with activation functions (e.g., ReLU, Sigmoid)
allows the network to handle more complex data.
[Note: a single hidden layer of purely linear units is not enough for non-linear data, so non-linear activation functions are needed.] 17
Multi-Layer Perceptron

• A Multi-Layer Perceptron (MLP) is a type of feedforward neural network


with one or more hidden layers. [Note: see slide 21.]
• MLPs are typically fully connected, meaning each neuron in one layer is
connected to every neuron in the next layer.
• An MLP with only one hidden layer is considered shallow, while MLPs with
many hidden layers are considered deep neural networks (DNNs).
18
Summary of Key Concepts in NNs

• Neuron: Basic computational unit that processes input and generates output.
• Weight (w): The coefficient that determines the strength of the input's
contribution.
• Bias (b): Added to the weighted sum to adjust the output.
• Activation Function: Introduces non-linearity to enable the network to learn
complex patterns.

19
Architecture and Operation of FNNs
Feedforward Neural Networks (FNNs)
[Note: the flow is one-directional; no loops are formed.]

• A Feedforward Neural Network (FNN) is a type of neural network where data


flows in one direction—from the input layer, through the hidden layers, to the
output layer.
• Key Characteristics:
‐ No feedback loops or cycles.
‐ Typically used for tasks like classification, regression, and pattern recognition.
‐ Fully Connected: Every neuron in one layer is connected to every neuron in the next
layer (though they can be sparsely connected as well).
21
Layers of a FNN (1/2)
Z = wx + b
• Input Layer: [Note: the input layer only receives the raw data.]

‐ Receives raw input data (e.g., features of a dataset, pixel values in an image). Does not
perform any computations, only passes the data to the next layer.
• Hidden Layers: [Note: where the more complex modelling happens.]
‐ These layers process the data using weights, biases, and activation functions.
‐ Each neuron in a hidden layer performs a weighted sum of its inputs and applies a non-
linear activation function. [Note: this is what an SNN does; in an FNN, every hidden neuron repeats this computation.]

‐ There can be multiple hidden layers in an FNN (depending on the depth of the
network).
22
Layers of a FNN (2/2)

• Output Layer:
‐ Produces the final prediction or result (e.g., class label, regression value). The output
neurons are typically associated with the task (e.g., classification probabilities for each
class).

Source: https://learnopencv.com/understanding-feedforward-neural-networks/

23
Fully Connected FNN
[Figure: a fully connected FNN with input neurons x1 ... xN, hidden layers (Layer 1, Layer 2, ..., Layer L),
and output neurons y1 ... yM. Note: the input layer performs no computation and simply passes the
data on to the hidden layers.]
24
Example of FNN Architecture

• Problem: Predict whether an email is spam or not based on features (word


count, sender, subject, etc.). [Note: this is a classification task.]

• Architecture:
‐ Input Layer: Features of the email (e.g., word count, sender).
‐ Hidden Layers: Two hidden layers, each applying a non-linear activation function.
‐ Output Layer: Single output neuron with a Sigmoid activation function for binary
classification (spam vs. not spam).
[Note: the output lies between 0 and 1, similar to a probability.]

25
Feedforward Process in FNN (1/2)

• Input Layer: Data enters the network.


• Hidden Layers: Each neuron computes a weighted sum of inputs:
z^(l) = W^(l)·a^(l−1) + b^(l)   [Note: the previous layer’s output is the next layer’s input.]

where z^(l): weighted sum; W^(l): weights; a^(l−1): output from the previous layer; b^(l): bias term.

• Activation: Apply an activation function (e.g., ReLU, Sigmoid) to compute


the neuron’s output.
• Output Layer: Compute the final output.

26
Feedforward Process in FNN (2/2)
[Figure: a worked numerical example of the feedforward computation. Each neuron multiplies its inputs
by the connection weights and adds its bias (giving pre-activation values such as -5 and 3), then applies
the Sigmoid function σ(z) = 1 / (1 + e^(-z)), producing outputs of roughly 0.01 and 0.95.]
Source: https://speech.ee.ntu.edu.tw/~hylee/ml/2017-spring.php 27
Activation Functions in Hidden Layers

• Why Use Activation Functions?


Activation functions introduce non-linearity, enabling FNNs to model complex, non-
linear relationships.

• Common Activation Functions:


‐ ReLU (Rectified Linear Unit): Most commonly used in hidden layers due to its
simplicity and efficiency.
‐ Sigmoid: Used in binary classification problems.
‐ Tanh: Used in certain hidden layers when outputs need to be centered around zero.
28
Advantages of FNNs

• Simplicity: FNNs are easy to understand and implement.


• Versatility: FNNs can be used for a wide range of tasks, including
classification, regression, and pattern recognition.
• Flexibility: By adding more hidden layers, FNNs can model increasingly
complex data. [Note: more layers means more capacity for complex data, but also a greater risk of overfitting; see slide 30.]

• Deterministic Flow: Data flows strictly from input to output, with no


feedback loops or cycles.
29
Limitations of FNNs
[Note: the more layers, the more expensive training becomes.]
• Computational Complexity: Deep FNNs with many layers and neurons can be
computationally expensive to train, especially for large datasets.
• Overfitting: FNNs can overfit the training data if they have too many
parameters (e.g., too many hidden layers or too many neurons in each layer).
• Not Well-Suited for Sequential Data: FNNs do not have memory, making
them unsuitable for tasks like time series prediction or natural language
processing (better handled by Recurrent Neural Networks).
30
Mathematical Foundations of FNNs
Purpose of Mathematical Foundations

• Goal: To understand the mathematical principles behind both forward and


backward propagation in FNNs.
• Key Topics:
‐ Forward Propagation: How data flows through the network to generate
outputs.
‐ Backward Propagation: How the network learns by adjusting weights and
biases using gradient descent.
32
Forward Propagation

• Objective: Compute the output of the network given an input.


• Mathematical Operation: Each layer in the FNN applies a linear transformation
followed by a non-linear activation function:
Z^(l) = W^(l)·a^(l−1) + b^(l)   [Note: the previous layer’s output is the next layer’s input, applied iteratively.]

W^(l): Weight matrix at layer l.
a^(l−1): Output from the previous layer.
b^(l): Bias for the current layer.
Z^(l): Weighted sum before applying activation.
[Note: because the activation function sits in each hidden layer, one hidden layer’s output becomes the next hidden layer’s input.]

• The activation function σ(Z^(l)) is then applied: a^(l) = σ(Z^(l)) 33


Matrix Form for Multiple Neurons (1/2)

• Matrix Multiplication: For multiple neurons in a layer, forward propagation


uses matrix multiplication to efficiently compute outputs:
𝑍(𝑙) = 𝑊(𝑙)⋅𝑎(𝑙−1) + 𝑏(𝑙)
‐ This operation applies to all neurons in layer 𝑙 at once.
‐ Efficiently computes the weighted sum for the entire layer using matrix
algebra.
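A minimal NumPy sketch of this matrix-form forward pass (the layer sizes and random weights below are hypothetical, chosen only for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # weights[l] has shape (n_l, n_{l-1}) and biases[l] has shape (n_l,),
    # matching Z^(l) = W^(l) . a^(l-1) + b^(l) and a^(l) = sigma(Z^(l)).
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b      # weighted sums for every neuron in the layer at once
        a = sigmoid(z)     # non-linear activation
    return a

# Hypothetical tiny network: 3 inputs -> 4 hidden neurons -> 2 outputs
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
print(forward(np.array([1.0, 2.0, -1.0]), weights, biases))
```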

34
Matrix Form for Multiple Neurons (2/2)

[Figure: the network output is computed layer by layer as
y = σ(W_L · ... σ(W_2 · σ(W_1 · x + b_1) + b_2) ... + b_L),
for inputs x = (x1, ..., xN), per-layer weights W_1, ..., W_L and biases b_1, ..., b_L, and outputs y = (y1, ..., yM).]
Source: https://speech.ee.ntu.edu.tw/~hylee/ml/2017-spring.php 35
Backward Propagation in FNNs

• Goal: To compute the gradients of the loss function with respect to each
weight and bias, and update them to minimize the error.
• Why Backpropagation?
Forward propagation computes the output, but backward propagation tells us
[Note: the key purpose is to work out how to update the weights and biases.]
how to adjust the weights and biases to improve the network’s performance.
• Loss Function: Measures the difference between the network’s prediction and
the true target. [Note: the loss function is computed after the output layer, and only then does backpropagation run.]

36
Chain Rule and Backpropagation
[Note: backpropagation is driven by the chain rule.]
• The Chain Rule: Backpropagation leverages the chain rule from calculus to
efficiently compute the gradients of the loss function with respect to each
weight in the network. With Z^(l) = W^(l)·a^(l−1) + b^(l) (so ∂Z^(l)/∂b^(l) = 1):

∂L/∂W^(l) = ∂L/∂a^(l) · ∂a^(l)/∂Z^(l) · ∂Z^(l)/∂W^(l),   ∂L/∂b^(l) = ∂L/∂Z^(l) · ∂Z^(l)/∂b^(l)

• The chain rule is applied layer-by-layer, moving from the output layer back to
the input layer, which is why the process is called backward propagation.
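To make the factors concrete, here is a sketch of the chain rule applied to a single sigmoid neuron with a squared-error loss; the numbers and the choice of loss are illustrative, not taken from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a_prev = np.array([1.0, 2.0])      # a^(l-1): output of the previous layer
W = np.array([[0.5, -0.3]])        # W^(l)
b = np.array([0.1])                # b^(l)
y = np.array([1.0])                # target, with loss L = 0.5 * (a - y)^2

z = W @ a_prev + b                 # Z^(l) = W^(l) . a^(l-1) + b^(l)
a = sigmoid(z)                     # a^(l) = sigma(Z^(l))

# Chain rule, one factor at a time:
dL_da = a - y                      # dL/da^(l) for the squared-error loss
da_dz = a * (1.0 - a)              # dsigma/dZ^(l) for the sigmoid
dL_dz = dL_da * da_dz              # dL/dZ^(l)
dL_dW = np.outer(dL_dz, a_prev)    # dL/dW^(l) = dL/dZ^(l) . dZ^(l)/dW^(l)
dL_db = dL_dz                      # dL/db^(l), since dZ^(l)/db^(l) = 1
print(dL_dW, dL_db)
```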

37
Gradient Descent

• Objective: Use the computed gradients from backward propagation to update


the weights and biases, minimizing the loss function.
• Weight Update Rule (η: learning rate, controls the step size for weight updates):

W^(l) = W^(l) − η · ∂L/∂W^(l)
[Note: the gradient ∂L/∂W^(l) is obtained via the chain rule; gradient descent is applied hand in hand with the chain rule.]

• Bias Update Rule:

b^(l) = b^(l) − η · ∂L/∂b^(l)
38
Full Training Cycle

• Forward Propagation: Compute the network’s output by passing the inputs


forward through the layers.
• Loss Function: Calculate the error between the predicted and actual output
using a loss function.
• Backward Propagation via Chain Rule: Compute the gradients of the loss
function with respect to the weights and biases, starting from the output layer.
• Gradient Descent: Update the weights and biases to reduce the error.
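To make the full cycle concrete, here is a self-contained sketch (not taken from the lecture) of a tiny two-layer FNN trained with all four steps; the XOR data, layer sizes, loss, and learning rate are arbitrary illustrative choices, and convergence depends on the random initialization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 samples, 2 features, binary XOR targets (illustrative only)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(2, 2)), np.zeros((1, 2))   # input -> hidden
W2, b2 = rng.normal(size=(2, 1)), np.zeros((1, 1))   # hidden -> output
eta = 1.0                                            # learning rate

for epoch in range(10_000):
    # 1) Forward propagation
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    # 2) Loss function (mean squared error, for simplicity)
    loss = np.mean((a2 - y) ** 2)
    # 3) Backward propagation via the chain rule
    d2 = (2 * (a2 - y) / len(X)) * a2 * (1 - a2)
    d1 = (d2 @ W2.T) * a1 * (1 - a1)
    # 4) Gradient descent updates
    W2 -= eta * (a1.T @ d2); b2 -= eta * d2.sum(axis=0, keepdims=True)
    W1 -= eta * (X.T @ d1);  b1 -= eta * d1.sum(axis=0, keepdims=True)

print(loss, a2.round(2))
```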
39
Forward and Backward Propagation in NNs
What is Forward Propagation

• Purpose: Forward propagation is the process of computing the output of a


neural network for a given input by passing data through each layer of the
network.
• Flow of Data:
1) Input features are multiplied by weights and added to biases to compute the pre-
activation value 𝑍(𝑙).
[Note: data goes from the input layer to the first hidden layer, then on to the next layer, and so on.]
2) An activation function is applied to produce the activation 𝑎(𝑙) , which is passed to
the next layer.
41
What is Backward Propagation (1/2)

• Purpose: Backward propagation is the process of calculating the gradient of


the loss function with respect to each weight in the network, using the chain
rule. This allows the network to update its parameters to reduce error.
• Flow of Gradients: [Note: propagate backward from the output layer to the input layer.]
1) Compute the loss 𝐿 using a loss function (e.g., cross-entropy).
2) Calculate the gradient of the loss with respect to the output layer.
3) Propagate the gradients backward through each layer, adjusting weights using
gradient descent.
42
What is Backward Propagation (2/2)

• Backward Propagation is a way to figure out:


1) How much each weight and bias contributed to the error (using the chain
rule). [Note: because the chain rule is used, the gradients can then be handed to gradient descent.]

2) How to adjust those weights and biases to improve the network’s


performance. [Note: via gradient descent.]

43
Why Backpropagation Works
[Note: run forward propagation, use the loss function to compare the prediction with the actual value, apply backward propagation to update the parameters, then run the data through again from the input layer.]
• After forward propagation, the network compares its prediction to the actual
value and calculates the error (using a loss function).
• The error is sent backward through the network to adjust the weights and
biases layer by layer.
• Key Insight: By adjusting the weights according to the contribution of each
layer to the error, the network gradually improves its predictions.

44
Concept of Gradient Descent
[Note: the gradients come from the chain rule.]

• Gradient Descent is how we adjust the weights and biases.


• After computing the gradients (how much each weight or bias contributed to
the error), we update the weights by taking a small step in the direction that
reduces the error.
• The learning rate controls how big a step we take.
If 𝜂 is too small, learning is slow; if 𝜂 is too large, we might overshoot the
optimal solution.
45
Hyperparameters in Neural Networks

• Hyperparameters are values that control the training process of the neural
network but are not learned from the data. Instead, they are set before training
begins.
• Examples of Hyperparameters:
‐ Learning Rate (𝜂): Controls how big a step we take when updating weights.
‐ Number of Layers: Determines the depth of the network.
‐ Batch Size: The number of samples processed before updating weights.
‐ Epochs: The number of complete passes through the training data.
46
What is Batch Size
[Note: batching makes overfitting less likely; the batch size is often dictated by memory, and noisy, unhelpful gradients can appear especially when the data in a batch is skewed.]

• The number of training examples used to calculate the gradient before
updating the model’s weights.

‐ Small Batch Size: [Note: parameters are updated more frequently.]
Provides faster updates but can result in noisier gradients.
Commonly used when memory is limited.
‐ Large Batch Size: [Note: parameters are updated much less frequently.]
Provides more stable gradients, but updates are slower.
Requires more memory. 47
What is an Epoch

• One complete pass through the entire training dataset.


‐ Each epoch trains the model on every sample in the dataset once.
‐ The network needs multiple epochs to fully learn the patterns in the data.

• Key Insight:
‐ Training typically involves many epochs to allow the model to converge on the optimal
solution.
‐ The number of epochs needs to be balanced to avoid underfitting or overfitting.
[Note: training too many times makes the model overly familiar with the training data.]

48
How Batch Size and Epochs Affect Training
[Note: the batch size is how much data is fed in per update; larger batches mean fewer updates, smaller batches mean more updates.]

• Batch Size determines how frequently the model updates the weights:
‐ Smaller batches lead to more frequent updates.
‐ Larger batches require more memory but lead to smoother updates.

• Epochs determine how many complete passes the model makes through the
data:
‐ Too few epochs lead to underfitting. [Note: not familiar enough with the training data.]

‐ Too many epochs can cause overfitting. [Note: overly familiar with the training data.]

49
Example: Batch Size and Epochs in Training
[Note: so the total number of weight updates is (10000 / 100) × 10 = 1000.]
• Training a neural network with a dataset of 10,000 samples.
‐ Batch Size = 100: The network will update its weights after every 100 samples.
‐ Epoch = 10: The entire dataset will be passed through the network 10 times.

• Trade-offs:
‐ Small batch sizes can introduce variability in updates, leading to a more exploratory
learning process.
‐ Larger batch sizes make the learning process smoother but slower.
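The arithmetic and the batching loop can be sketched as follows (the feature matrix is a placeholder; the numbers match the example above):

```python
import numpy as np

n_samples, batch_size, epochs = 10_000, 100, 10

updates_per_epoch = n_samples // batch_size   # 100 weight updates per epoch
total_updates = updates_per_epoch * epochs    # 1,000 updates in total
print(updates_per_epoch, total_updates)

# Sketch of how mini-batches would be drawn in each epoch (shuffling is a common choice)
X = np.zeros((n_samples, 8))                  # placeholder feature matrix
for epoch in range(epochs):
    order = np.random.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        batch = X[order[start:start + batch_size]]
        # ... forward pass, loss, backward pass, and weight update on `batch` ...
```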

50
Balancing Batch Size and Epochs

• Choosing the Right Batch Size:


‐ Start with a batch size of 32 or 64 and adjust based on memory and performance.

• Choosing the Right Number of Epochs:


‐ Monitor the training loss and validation loss over time.
‐ Use early stopping to halt training when validation performance stops improving.
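The early-stopping idea above can be sketched as a small wrapper; `train_step` and `validate` are placeholders for whatever training and validation routines are actually used:

```python
def train_with_early_stopping(model, train_step, validate, max_epochs=100, patience=5):
    # Stop once the validation loss has not improved for `patience` consecutive epochs.
    best_loss, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        train_step(model)                 # one epoch of training on the training set
        val_loss = validate(model)        # loss on the held-out validation set
        if val_loss < best_loss:
            best_loss, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Early stopping at epoch {epoch}, best validation loss {best_loss:.4f}")
                break
    return model
```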

51
Practical Guidelines for Batch Size and Epochs

• Batch Size:
‐ Typically, a batch size between 32 and 256 is used in practice.
‐ For larger datasets, you can increase batch size to balance computation time and
memory usage.

• Epochs:
‐ The number of epochs is generally determined by early stopping or validation loss
behavior.
‐ You might start with 10–50 epochs and adjust based on performance.
52
Learning Curve

• A learning curve is a plot that shows how the model's performance improves
over time during training.
• Training Loss: Measures the model's performance on the training data after
each epoch.
• Validation Loss: Measures the model's performance on unseen validation data
to assess generalization.

53
How to Interpret a Learning Curve

• Underfitting: Both training and validation loss are high and do not decrease
significantly over time, indicating that the model is too simple to capture the
underlying patterns.
• Overfitting: The training loss decreases significantly, but the validation loss
starts to increase after a certain point, indicating that the model is memorizing
the training data.
• Good Fit: Both training and validation loss decrease and stabilize, indicating
that the model is learning well and generalizing to new data. 54
Using the Learning Curve to Optimize Training

• Key Insights:
‐ Early Stopping: Stop training when the validation loss stops improving to prevent
overfitting.
‐ Monitoring Performance: Regularly plot the learning curve to check if the model is
underfitting or overfitting.

• Practical Use: Use the learning curve to decide when to adjust model
complexity (e.g., adding layers, adjusting neurons, regularization).

55
Forward and Backward Propagation in Practice
• Forward Propagation: Move forward through the network, computing the
output based on the inputs. [Note: the data must go in first before there is any result.]

• Compute the Loss: Calculate the error using a loss function (e.g., cross-
entropy, mean squared error). [Note: only once there is an output can the error be measured.]
• Backward Propagation: Move backward through the network, using the error
to adjust the weights and biases at each layer. [Note: knowing the error lets us trace back how much each parameter contributed.]

• Update Weights: Use gradient descent to update the parameters of the


network. [Note: knowing each parameter’s contribution lets us update it.]
56
Choosing the Number of Layers

• Simple Tasks: Use 1–2 hidden layers for simpler tasks, like basic
classification or regression.
• Complex Tasks: Use 3 or more hidden layers for tasks that require
hierarchical feature learning.
• Guiding Principle:
‐ Start with fewer layers and increase complexity based on validation performance.
‐ Deeper networks can learn more complex patterns, but are prone to overfitting if not
regularized.
57
Choosing the Number of Neurons per Layer

• Input Layer: The number of neurons in the input layer equals the number of
features in your dataset. [Note: however many features you extract, that is how many input-layer neurons you need.]

• Output Layer: The number of neurons in the output layer equals the number of
target classes or the number of outputs. [Note: however many output classes there are, that is how many output neurons you need.]

• Hidden Layers: Common heuristics to start with:


‐ Between input and output size: The number of hidden neurons should be somewhere between
the size of the input and output layers.
‐ 2/3 Rule: A common heuristic is to start with (2/3) × input layer size + output layer size
‐ Sqrt Rule: Another starting point is √(input layer size × output layer size)
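As a quick illustration of these heuristics (the feature and class counts below are hypothetical):

```python
import math

def hidden_neuron_heuristics(n_inputs, n_outputs):
    # Two common starting points mentioned above; rough guides, not hard rules
    two_thirds_rule = (2 / 3) * n_inputs + n_outputs
    sqrt_rule = math.sqrt(n_inputs * n_outputs)
    return round(two_thirds_rule), round(sqrt_rule)

# Hypothetical example: 30 input features, 2 output classes
print(hidden_neuron_heuristics(30, 2))   # roughly (22, 8)
```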
58
Step-by-Step Guide: Optimizing Layers and Neurons (1/2)

1) Start with a Simple Architecture


Begin with 1–2 hidden layers and a moderate number of neurons (e.g., between the size of the
input and output layer).
2) Use a Train/Validation Split
Split the dataset into training, validation, and test sets. Common practice is an 80/20 split for
training and validation (or 70/15/15 for train/validation/test). Train the model on the training set
and evaluate it on the validation set to monitor generalization performance.
3) Monitor Performance on Validation Data
Track training loss, validation loss, accuracy, or other relevant metrics during training. Use early
stopping to halt training when validation loss stops improving, which prevents overfitting. 59
Step-by-Step Guide: Optimizing Layers and Neurons (2/2)

4) Experiment with Increasing Layers and Neurons


Gradually increase the number of neurons in hidden layers and add more layers if necessary.
Empirical Rules:
Start with the input/output size heuristic (e.g., number of hidden neurons between the size of the input and
output layers). Increase neurons incrementally, doubling the number and adding more hidden layers if
underfitting is observed (i.e., both training and validation loss are high).

5) Apply Regularization to Avoid Overfitting


Use dropout, L2 regularization, and early stopping to control overfitting as the network grows
deeper or has more neurons. Regularization helps when the validation loss diverges from the
training loss, indicating overfitting. 60
Example Workflow

1) Split the dataset into 80% training and 20% validation.


2) Start with a network having 1 hidden layer with 64 neurons.
3) Train the model using early stopping, monitoring validation loss to stop
training if validation loss doesn't improve for 5 epochs.
4) If the model underfits, increase the number of neurons and hidden layers
incrementally (e.g., 128 neurons, 2 hidden layers).
5) Regularize the model using dropout (0.5) and L2 regularization to control
overfitting.
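One way this workflow might look in code, as a sketch using Keras and scikit-learn (the library choice and the placeholder data are assumptions, not part of the lecture):

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Placeholder data standing in for a real dataset
X, y = np.random.rand(1000, 20), np.random.randint(0, 2, size=1000)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)   # 1) 80/20 split

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",                          # 2) 1 hidden layer, 64 neurons
                          kernel_regularizer=tf.keras.regularizers.l2(0.01)),  # 5) L2 regularization
    tf.keras.layers.Dropout(0.5),                                         # 5) dropout 0.5
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,   # 3) early stopping
                                               restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=100, batch_size=32, callbacks=[early_stop])
# 4) If the model underfits, widen or deepen it (e.g., 128 neurons, 2 hidden layers).
```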
61
Loss Functions and Model Evaluation Metrics
What is a Loss Function

• A function that measures how well a neural network’s prediction ŷ matches
the actual target value y.
• Purpose: During training, the network aims to minimize the loss function by
adjusting its weights and biases through gradient descent.

63
Common Loss Functions for Regression

• Mean Squared Error (MSE): Commonly used for regression tasks, it


measures the average squared difference between the predicted values and the
actual values.
MSE penalizes larger errors more than smaller ones, making it sensitive to outliers.

• Mean Absolute Error (MAE): Another regression loss function that measures
the average absolute difference between predictions and actual values.
Less sensitive to outliers than MSE, but does not emphasize large errors as much.

64
Common Loss Functions for Classification

• Binary Cross-Entropy (BCE): Used for binary classification tasks, where


there are two possible outcomes.
Cross-entropy measures the “distance” between the predicted probabilities and the actual
labels. A lower value indicates better predictions.

• Categorical Cross-Entropy: Used for multi-class classification where there are


more than two classes.
[Note: anything with "cross-entropy" in its name is used for classification.]
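For concreteness, the loss functions named in the last two slides can be written straight from their definitions; a NumPy sketch (the example values are made up):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average squared difference (regression)
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean Absolute Error: average absolute difference (regression)
    return np.mean(np.abs(y_true - y_pred))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # BCE: distance between predicted probabilities and 0/1 labels (binary classification)
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_true_onehot, p_pred, eps=1e-12):
    # Categorical CE: multi-class version; y_true is one-hot encoded
    p = np.clip(p_pred, eps, 1.0)
    return -np.mean(np.sum(y_true_onehot * np.log(p), axis=1))

# Made-up example values
print(mse(np.array([3.0, 5.0]), np.array([2.5, 5.5])))
print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.2])))
```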

65
Loss Functions in Practice

• Choosing the Right Loss Function:


‐ For regression tasks: Use MSE or MAE.
‐ For binary classification: Use Binary Cross-Entropy.
‐ For multi-class classification: Use Categorical Cross-Entropy.

66
Model Evaluation Metrics

• Model evaluation metrics are used to assess how well a trained model
performs on unseen (test) data. While loss functions measure how well a
model fits the training data, evaluation metrics measure generalization to new
data.
• Purpose: To ensure the model isn’t just memorizing the training data but can
generalize to new, unseen examples.

67
Metrics for Regression Models

• Mean Squared Error (MSE): Measures the average squared difference


between predicted and true values.
• Mean Absolute Error (MAE): Measures the average absolute difference
between predicted and true values.
• R-squared (𝑅2): Represents the proportion of variance in the target variable
that is predictable from the input features. A higher value indicates a better
model.
𝑅2 tells how well the model captures the variability of the target.
68
Metrics for Classification Models (1/2)

• Accuracy: Measures the proportion of correctly classified examples out of the


total examples.
Accuracy is useful when the dataset is balanced between classes but can be misleading
for imbalanced datasets.

• Precision, Recall, and F1-Score: These are especially important for


imbalanced classification problems.
[Note: if the data is too skewed, it easily leads to class imbalance.]

69
Metrics for Classification Models (2/2)

• Precision: Measures how many of the predicted positive instances are actually
positive.
• Recall (Sensitivity): Measures how many of the actual positive instances were
correctly identified.
• F1-Score: The harmonic mean of precision and recall, providing a balanced
measure.
Precision, recall, and F1-score are critical when false positives and false negatives have
different consequences (e.g., in medical diagnosis). 70
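These definitions translate into a few lines of code; a sketch computing accuracy, precision, recall, and F1 from raw binary predictions (the toy labels are made up for illustration):

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    # Confusion-matrix counts for a binary problem (1 = positive class)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # of predicted positives, how many are truly positive
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # of actual positives, how many were found
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Toy, slightly imbalanced labels (made up)
y_true = np.array([1, 0, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 0, 0, 1])
print(classification_metrics(y_true, y_pred))
```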
Choosing the Right Evaluation Metric

• For Regression:
‐ Use MSE or MAE to measure the accuracy of the predictions.
‐ Use 𝑅2 to understand how much variance the model explains.

• For Classification:
‐ Use Accuracy when the dataset is balanced.
‐ Use Precision, Recall, and F1-Score for imbalanced datasets.
‐ Use ROC-AUC for binary classification to measure how well the model separates
classes.
71
Q&A
