CE6146_Lecture_2
Intended Learning Outcomes
Introduction to Neural Networks
What is a Neural Network
• Weights (w): Parameters that determine the strength of the input signals.
• Bias (b): A value added to the weighted sum to adjust the output.
• Activation Function (σ): A non-linear function applied to the output of a
neuron to introduce non-linearity and help the network learn complex patterns.
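To make this concrete, here is a minimal NumPy sketch of one neuron; the input values, weights, and bias are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    # Non-linear activation: squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical inputs and parameters for one neuron
x = np.array([0.5, -1.2, 3.0])   # input signals
w = np.array([0.8,  0.1, -0.4])  # weights: strength of each input
b = 0.2                          # bias: shifts the weighted sum

z = np.dot(w, x) + b             # weighted sum plus bias
a = sigmoid(z)                   # neuron output after the activation function
print(z, a)
```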
Structure of Neural Networks
• Input Layer: Receives the raw input data (e.g., pixels for image recognition or
features for classification).
• Hidden Layers: Layers where the network processes data by applying
transformations to extract features.
• Output Layer: Produces the final output (e.g., a predicted class label for
classification or a numerical value for regression).
(Notes: the number of input-layer neurons equals the number of input features; hidden layers extract features; the output layer also applies an activation function.)
Source: https://learnopencv.com/understanding-feedforward-neural-networks/
Neuron Operation
Activation Functions (1/3)
Non-linear functions allow the network to learn more complex patterns.
‐ ReLU (Rectified Linear Unit): Outputs 0 for negative values and the input itself for positive values. (Negative inputs become 0; positive inputs pass through unchanged.)
Activation Functions (2/3)
Activation Functions (3/3)
• Linear
‐ Formula: f(x) = x
‐ Pros: Simple, computationally efficient
‐ Cons: Cannot model complex functions, not zero-centered
‐ Suitable problems: Regression problems
• Sigmoid
‐ Formula: f(x) = 1 / (1 + e^(−x))
‐ Pros: Outputs between 0 and 1
‐ Cons: Vanishing gradient problem, not zero-centered
‐ Suitable problems: Binary classification; output layer in some cases
• tanh
‐ Formula: f(x) = (e^x − e^(−x)) / (e^x + e^(−x))
‐ Pros: Outputs between −1 and 1, zero-centered
‐ Cons: Vanishing gradient problem
‐ Suitable problems: Hidden layers where zero-centered outputs are desired
• ReLU
‐ Formula: f(x) = max(0, x)
‐ Pros: Computationally efficient, helps with the vanishing gradient problem
‐ Cons: Dying ReLU problem (some units never activate)
‐ Suitable problems: Most common, especially in CNNs and FNNs
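For reference, a small NumPy sketch of the four activation functions listed above (purely illustrative):

```python
import numpy as np

def linear(x):
    return x                         # f(x) = x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # outputs in (0, 1)

def tanh(x):
    return np.tanh(x)                # outputs in (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)        # 0 for negative inputs, x otherwise

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (linear, sigmoid, tanh, relu):
    print(f.__name__, f(x))
```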
Why Non-Linearity is Important
Learning in Neural Networks
• The process of learning involves adjusting the weights and biases in the
network based on the error between the predicted output and the actual target.
從錯誤中學習,藉此回來修改weight 跟bias
• Key Steps:
東⻄⼀定要先送到神經網路跑
1) Forward Pass: Input data is passed through the network to compute the output.
才能從結果去判斷跟正確答案相差多少,藉此修正weight跟bias
2) Error Calculation: The difference (error) between the predicted output and the
actual target is computed using a loss function.
3) Weight Updates: The network’s weights and biases are adjusted to minimize the
從Step了解相差多少,更新bias 跟weight
error using an optimization algorithm (such as gradient descent).
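A minimal sketch of this forward-pass / error / update cycle for a single linear neuron; the toy data and learning rate are invented for illustration, not taken from the lecture:

```python
import numpy as np

# Toy data: y is roughly 2*x + 1 (made-up example)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.1, 4.9, 7.2])

w, b, lr = 0.0, 0.0, 0.05            # initial weight, bias, learning rate

for step in range(200):
    y_pred = w * x + b               # 1) Forward pass
    error = y_pred - y
    loss = np.mean(error ** 2)       # 2) Error calculation (MSE loss)
    grad_w = np.mean(2 * error * x)  # 3) Gradients of the loss
    grad_b = np.mean(2 * error)
    w -= lr * grad_w                 #    Weight update (gradient descent)
    b -= lr * grad_b

print(w, b, loss)
```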
Perceptron: The Simplest Neural Network
Limitations of Perceptrons
Figure 6.1 in Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
Shallow Neural Networks
(Single = shallow: a network with one hidden layer is called shallow.)
• Shallow neural networks (SNNs) typically refer to neural networks with one hidden layer (but they can have two or three layers in some contexts). These networks are considered shallow because they don't have a large number of hidden layers. (SNN = single hidden layer; DNN = multiple hidden layers.)
• The term "shallow" is often used to distinguish these networks from deep neural networks (DNNs), which have multiple hidden layers.
• Introducing non-linearity with activation functions (e.g., ReLU, Sigmoid) allows the network to handle more complex data. (A single hidden layer alone cannot model non-linear data without a non-linear activation function.)
Multi-Layer Perceptron
(An MLP with multiple hidden layers is a deep neural network, i.e., a DNN.)
Summary of Key Concepts in NNs
• Neuron: Basic computational unit that processes input and generates output.
• Weight (w): The coefficient that determines the strength of the input's
contribution.
• Bias (b): Added to the weighted sum to adjust the output.
• Activation Function: Introduces non-linearity to enable the network to learn complex patterns.
Architecture and Operation of FNNs
Feedforward Neural Networks (FNNs)
(The data flows in one direction only; no loops or cycles are formed.)
Layers of a FNN (1/2)
• Input Layer:
‐ Receives raw input data (e.g., features of a dataset, pixel values in an image).
‐ Does not perform any computations; it only passes the data to the next layer.
• Hidden Layers: (These handle the more complex processing in the model.)
‐ These layers process the data using weights, biases, and activation functions.
‐ Each neuron in a hidden layer performs a weighted sum of its inputs and applies a non-linear activation function. (Every neuron in an FNN repeats the same operation a shallow network's units perform.)
‐ There can be multiple hidden layers in an FNN (depending on the depth of the network).
Layers of a FNN (2/2)
• Output Layer:
‐ Produces the final prediction or result (e.g., class label, regression value).
‐ The output neurons are typically associated with the task (e.g., classification probabilities for each class).
Source: https://learnopencv.com/understanding-feedforward-neural-networks/
Fully Connected FNN
[Figure: A fully connected FNN with an input layer (x1 … xN), hidden Layers 1 through L, and an output layer (y1 … yM). The input layer performs no computation; it simply passes the data to the hidden layers.]
Example of FNN Architecture
• Architecture:
‐ Input Layer: Features of the email (e.g., word count, sender).
‐ Hidden Layers: Two hidden layers, each applying a non-linear activation function.
‐ Output Layer: Single output neuron with a Sigmoid activation function for binary
classification (spam vs. not spam).
(The Sigmoid output lies between 0 and 1 and can be interpreted as a probability.)
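A hedged sketch of such an architecture, assuming TensorFlow/Keras is available; the feature count and hidden-layer sizes are hypothetical choices, not values given in the slides:

```python
import tensorflow as tf

# Hypothetical spam classifier: 20 email features -> 2 hidden layers -> 1 sigmoid output
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),              # input layer: email features
    tf.keras.layers.Dense(16, activation="relu"),    # hidden layer 1
    tf.keras.layers.Dense(8, activation="relu"),     # hidden layer 2
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output in (0, 1): spam probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```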
Feedforward Process in FNN (1/2)
z^(l) = W^(l) · a^(l−1) + b^(l), where z: weighted sum, W: weights, a^(l−1): output from the previous layer, b: bias term.
Feedforward Process in FNN (2/2)
[Figure: A worked numerical example (from Hung-yi Lee's ML course slides) of one feedforward step: each neuron computes the weighted sum of its inputs plus the bias, then applies the Sigmoid function σ(z) = 1 / (1 + e^(−z)) to produce its output.]
Source: https://speech.ee.ntu.edu.tw/~hylee/ml/2017-spring.php
Activation Functions in Hidden Layers
Matrix Form for Multiple Neurons (2/2)
[Figure: A fully connected network in matrix form, with input vector x = (x1, …, xN), weight matrices W1, …, WL, bias vectors b1, …, bL, and output vector y = (y1, …, yM).]
a1 = σ(W1 · x + b1)
a2 = σ(W2 · a1 + b2)
…
y = σ(WL · aL−1 + bL)
Source: https://speech.ee.ntu.edu.tw/~hylee/ml/2017-spring.php
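A minimal NumPy sketch of this layer-by-layer matrix computation; the layer sizes and random initialization are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
layer_sizes = [4, 5, 3, 2]           # N inputs, two hidden layers, M outputs (hypothetical)

# One weight matrix W_l and bias vector b_l per layer
Ws = [rng.normal(size=(n_out, n_in))
      for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
bs = [np.zeros(n_out) for n_out in layer_sizes[1:]]

x = rng.normal(size=layer_sizes[0])  # input vector
a = x
for W, b in zip(Ws, bs):
    a = sigmoid(W @ a + b)           # a_l = sigma(W_l a_{l-1} + b_l)
y = a                                # network output
print(y)
```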
Backward Propagation in FNNs
• Goal: To compute the gradients of the loss function with respect to each weight and bias, and update them to minimize the error.
• Why Backpropagation?
Forward propagation computes the output, but backward propagation tells us how to adjust the weights and biases to improve the network's performance. (Its most important purpose is updating the weights and biases.)
• Loss Function: Measures the difference between the network's prediction and the true target. (The loss is computed after the output layer; only then does backpropagation begin.)
Chain Rule and Backpropagation
(Backpropagation is driven by the chain rule.)
• The Chain Rule: Backpropagation leverages the chain rule from calculus to efficiently compute the gradients of the loss function with respect to each weight in the network. With z^(l) = W^(l) · a^(l−1) + b^(l) (so ∂z^(l)/∂b^(l) = 1):
∂L/∂W^(l) = (∂L/∂a^(l)) · (∂a^(l)/∂z^(l)) · (∂z^(l)/∂W^(l)),   ∂L/∂b^(l) = (∂L/∂z^(l)) · (∂z^(l)/∂b^(l))
• The chain rule is applied layer-by-layer, moving from the output layer back to
the input layer, which is why the process is called backward propagation.
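A sketch of this chain-rule computation for a single sigmoid layer with a squared-error loss; the notation follows the slide, but the numbers and the choice of loss are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One layer: z = W a_prev + b, a = sigmoid(z), L = 0.5 * ||a - t||^2  (illustrative loss)
a_prev = np.array([0.5, -1.0])           # output of the previous layer
W = np.array([[0.2, -0.3], [0.7, 0.1]])  # weights of this layer
b = np.array([0.0, 0.1])
t = np.array([1.0, 0.0])                 # target

z = W @ a_prev + b
a = sigmoid(z)

# Chain rule, factor by factor
dL_da = a - t                    # dL/da
da_dz = a * (1.0 - a)            # da/dz for the sigmoid
dL_dz = dL_da * da_dz            # dL/dz
dL_dW = np.outer(dL_dz, a_prev)  # dL/dW = dL/dz * dz/dW  (dz/dW = a_prev)
dL_db = dL_dz                    # dL/db = dL/dz * 1      (dz/db = 1)
print(dL_dW, dL_db)
```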
Gradient Descent
(The gradients below are obtained through the chain rule; each gradient-descent update relies on gradients computed via the chain rule.)
• Weight Update Rule:
W^(l) = W^(l) − η · ∂L/∂W^(l)
• Bias Update Rule:
b^(l) = b^(l) − η · ∂L/∂b^(l)
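In code, the two update rules are one line each; η and the gradient values below are placeholder assumptions, with the gradients normally coming from backpropagation:

```python
import numpy as np

eta = 0.1                                         # learning rate (hypothetical value)

W = np.array([[0.2, -0.3], [0.7, 0.1]])           # current weights of layer l
b = np.array([0.0, 0.1])                          # current biases of layer l
dL_dW = np.array([[0.05, -0.02], [0.10, 0.04]])   # gradients from backpropagation (illustrative)
dL_db = np.array([0.03, -0.01])

# Gradient descent update: move each parameter against its gradient
W = W - eta * dL_dW
b = b - eta * dL_db
print(W, b)
```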
Full Training Cycle
Why Backpropagation Works
(Forward propagation produces a prediction; the loss function measures the gap between the predicted and actual values; backward propagation then updates the parameters, and the process runs again from the input layer.)
• After forward propagation, the network compares its prediction to the actual
value and calculates the error (using a loss function).
• The error is sent backward through the network to adjust the weights and
biases layer by layer.
• Key Insight: By adjusting the weights according to the contribution of each
layer to the error, the network gradually improves its predictions.
Concept of Gradient Descent
(The gradient used in each update is computed via the chain rule.)
Hyperparameters
• Hyperparameters are values that control the training process of the neural
network but are not learned from the data. Instead, they are set before training
begins.
• Examples of Hyperparameters:
‐ Learning Rate (𝜂): Controls how big a step we take when updating weights.
‐ Number of Layers: Determines the depth of the network.
‐ Batch Size: The number of samples processed before updating weights.
‐ Epochs: The number of complete passes through the training data.
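A small sketch of how such hyperparameters might be fixed before training; the specific values are arbitrary examples, not recommendations from the lecture:

```python
# Hyperparameters are chosen before training; they are not learned from the data.
hyperparams = {
    "learning_rate": 0.01,    # step size for weight updates
    "num_hidden_layers": 2,   # depth of the network
    "batch_size": 64,         # samples processed per weight update
    "epochs": 20,             # complete passes through the training data
}
print(hyperparams)
```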
What is Batch Size
(Using mini-batches tends to reduce overfitting; the batch size is often limited by available memory.)
• Key Insight:
‐ Training typically involves many epochs to allow the model to converge on the optimal solution.
‐ The number of epochs needs to be balanced to avoid underfitting or overfitting. (Training for too many epochs makes the model overly familiar with the training data.)
How Batch Size and Epochs Affect Training
(Batch size is how much data is fed in per update: larger batches mean fewer updates per epoch; smaller batches mean more updates.)
• Batch Size determines how frequently the model updates the weights:
‐ Smaller batches lead to more frequent updates.
‐ Larger batches require more memory but lead to smoother updates.
• Epochs determine how many complete passes the model makes through the
data:
‐ Too few epochs lead to underfitting. (The model has not yet seen the training data enough.)
Example: Batch Size and Epochs in Training
(Total number of weight updates: (10,000 / 100) × 10 = 1,000.)
• Training a neural network with a dataset of 10,000 samples.
‐ Batch Size = 100: The network will update its weights after every 100 samples.
‐ Epoch = 10: The entire dataset will be passed through the network 10 times.
• Trade-offs:
‐ Small batch sizes can introduce variability in updates, leading to a more exploratory
learning process.
‐ Larger batch sizes make the learning process smoother but slower.
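A quick check of that update count, plus a generic sketch of how the batch and epoch loops are usually nested (not code from the course):

```python
num_samples, batch_size, epochs = 10_000, 100, 10

updates_per_epoch = num_samples // batch_size   # 100 updates per epoch
total_updates = updates_per_epoch * epochs      # (10,000 / 100) * 10 = 1,000
print(updates_per_epoch, total_updates)

# Typical loop structure:
for epoch in range(epochs):
    for start in range(0, num_samples, batch_size):
        batch = range(start, start + batch_size)  # indices of one mini-batch
        # forward pass, loss, backprop, and weight update happen here
        pass
```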
Balancing Batch Size and Epochs
Practical Guidelines for Batch Size and Epochs
• Batch Size:
‐ Typically, a batch size between 32 and 256 is used in practice.
‐ For larger datasets, you can increase batch size to balance computation time and
memory usage.
• Epochs:
‐ The number of epochs is generally determined by early stopping or validation loss
behavior.
‐ You might start with 10–50 epochs and adjust based on performance.
Learning Curve
• A learning curve is a plot that shows how the model's performance improves
over time during training.
• Training Loss: Measures the model's performance on the training data after
each epoch.
• Validation Loss: Measures the model's performance on unseen validation data
to assess generalization.
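A sketch of plotting a learning curve with matplotlib; the per-epoch loss values below are made up purely to illustrate the shape of such a curve:

```python
import matplotlib.pyplot as plt

# Illustrative loss values per epoch (made up, not from a real training run)
train_loss = [0.90, 0.62, 0.45, 0.36, 0.30, 0.27, 0.25, 0.24, 0.23, 0.22]
val_loss   = [0.92, 0.68, 0.52, 0.45, 0.42, 0.41, 0.41, 0.42, 0.44, 0.47]

epochs = range(1, len(train_loss) + 1)
plt.plot(epochs, train_loss, label="Training loss")
plt.plot(epochs, val_loss, label="Validation loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Learning curve")
plt.legend()
plt.show()
```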
How to Interpret a Learning Curve
• Underfitting: Both training and validation loss are high and do not decrease
significantly over time, indicating that the model is too simple to capture the
underlying patterns.
• Overfitting: The training loss decreases significantly, but the validation loss
starts to increase after a certain point, indicating that the model is memorizing
the training data.
• Good Fit: Both training and validation loss decrease and stabilize, indicating
that the model is learning well and generalizing to new data.
Using the Learning Curve to Optimize Training
• Key Insights:
‐ Early Stopping: Stop training when the validation loss stops improving to prevent
overfitting.
‐ Monitoring Performance: Regularly plot the learning curve to check if the model is
underfitting or overfitting.
• Practical Use: Use the learning curve to decide when to adjust model
complexity (e.g., adding layers, adjusting neurons, regularization).
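A sketch of early stopping using a Keras callback, assuming TensorFlow is available; the synthetic data, model, and patience value are all illustrative assumptions:

```python
import numpy as np
import tensorflow as tf

# Synthetic data just to make the example runnable
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 10))
y = (x.sum(axis=1) > 0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Stop when the validation loss has not improved for 5 consecutive epochs
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
history = model.fit(x, y, validation_split=0.2, epochs=100,
                    batch_size=32, callbacks=[early_stop], verbose=0)
print(len(history.history["loss"]), "epochs actually run")
```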
Forward and Backward Propagation in Practice
• Forward Propagation: Move forward through the network, computing the output based on the inputs. (The data must be fed in first to obtain a result.)
• Compute the Loss: Calculate the error using a loss function (e.g., cross-entropy, mean squared error). (Only with a result can we measure how far off the prediction is.)
• Backward Propagation: Move backward through the network, using the error to adjust the weights and biases at each layer. (Knowing the error lets us trace back how much each parameter contributed to it.)
Choosing the Number of Hidden Layers
• Simple Tasks: Use 1–2 hidden layers for simpler tasks, like basic classification or regression.
• Complex Tasks: Use 3 or more hidden layers for tasks that require
hierarchical feature learning.
• Guiding Principle:
‐ Start with fewer layers and increase complexity based on validation performance.
‐ Deeper networks can learn more complex patterns, but are prone to overfitting if not
regularized.
Choosing the Number of Neurons per Layer
• Input Layer: The number of neurons in the input layer equals the number of features in your dataset. (As many features as you extract, that many input neurons.)
• Output Layer: The number of neurons in the output layer equals the number of target classes or the number of outputs. (As many output classes as you need, that many output neurons.)
What is a Loss Function
• A function that measures how well a neural network's prediction ŷ matches the actual target value y.
• Purpose: During training, the network aims to minimize the loss function by
adjusting its weights and biases through gradient descent.
Common Loss Functions for Regression
• Mean Absolute Error (MAE): Another regression loss function that measures
the average absolute difference between predictions and actual values.
Less sensitive to outliers than MSE, but does not emphasize large errors as much.
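A minimal NumPy sketch of both regression losses; the sample values are made up:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)   # Mean Squared Error: penalizes large errors more
mae = np.mean(np.abs(y_true - y_pred))  # Mean Absolute Error: less sensitive to outliers
print(mse, mae)
```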
Common Loss Functions for Classification
Loss Functions in Practice
Model Evaluation Metrics
• Model evaluation metrics are used to assess how well a trained model
performs on unseen (test) data. While loss functions measure how well a
model fits the training data, evaluation metrics measure generalization to new
data.
• Purpose: To ensure the model isn’t just memorizing the training data but can
generalize to new, unseen examples.
Metrics for Regression Models
Metrics for Classification Models (2/2)
• Precision: Measures how many of the predicted positive instances are actually
positive.
• Recall (Sensitivity): Measures how many of the actual positive instances were
correctly identified.
• F1-Score: The harmonic mean of precision and recall, providing a balanced
measure.
Precision, recall, and F1-score are critical when false positives and false negatives have different consequences (e.g., in medical diagnosis).
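A sketch of computing these metrics directly from binary labels; the label arrays are made up for illustration:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives

precision = tp / (tp + fp)                   # of predicted positives, how many are correct
recall = tp / (tp + fn)                      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
print(precision, recall, f1)
```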
Choosing the Right Evaluation Metric
• For Regression:
‐ Use MSE or MAE to measure the accuracy of the predictions.
‐ Use R² to understand how much variance the model explains.
• For Classification:
‐ Use Accuracy when the dataset is balanced.
‐ Use Precision, Recall, and F1-Score for imbalanced datasets.
‐ Use ROC-AUC for binary classification to measure how well the model separates
classes.
Q&A