Lecture05-DeepLearningCNN

The document provides a comprehensive overview of Convolutional Neural Networks (CNNs), detailing their structure, including layers such as input, convolutional, activation, pooling, batch normalization, and various element-wise operations. It explains the functionality of each layer, their mathematical representations, and key use cases, highlighting the importance of CNNs in processing structured grid data like images. Additionally, it discusses the benefits of techniques like residual connections and concatenation in enhancing network performance and learning capabilities.

Convolutional Neural Networks

Thien Huynh-The
Department of Computer and Communications Engineering
HCMC University of Technology and Education

February 10, 2025

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 1 / 84


Introduction
• Convolutional Neural Networks (CNNs) are specialized neural networks for processing
structured grid data (e.g., images).
• They leverage the spatial hierarchy of features.
• Key building blocks are layers with specific functionalities.

Convolutional neural networks


Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 2 / 84
Input Layer
• Represents the input data.
• For images: a tensor of shape (height, width, channels).
• Example:
• A 256 × 256 RGB image has shape (256, 256, 3).
• A 28 × 28 gray-scale image has shape (28, 28, 1).

• Often preprocessed (e.g., value scaling and normalization).

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 3 / 84


Convolutional Layer
• Performs convolution using learnable filters (kernels).
• Filters slide over the input, producing feature maps.
• Captures local patterns and spatial hierarchies.
• Output size depends on padding and stride.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 4 / 84


Activation Layer
• Introduces non-linearity, crucial for learning complex patterns.
• Common activation functions:
• Sigmoid: σ(x) = 1 / (1 + e^{-x})
• Output between 0 and 1 (useful for probabilities).
• Suffers from vanishing gradients, especially for very large or very small inputs.
• Tanh (Hyperbolic Tangent): tanh(x) = (e^x − e^{-x}) / (e^x + e^{-x})
• Output between -1 and 1.
• Also suffers from vanishing gradients, but less severely than Sigmoid.
• Zero-centered output.
• ReLU (Rectified Linear Unit): f(x) = max(0, x)
• Simple and computationally efficient.
• Addresses the vanishing gradient problem for positive inputs.
• Can suffer from the "dying ReLU" problem (neurons become inactive).
• Leaky ReLU: f(x) = x if x > 0, αx otherwise (where α is a small constant, e.g., 0.01)
• Addresses the "dying ReLU" problem by allowing a small gradient for negative inputs.
Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 5 / 84
Activation Layer
• Common activation functions:
• Parametric ReLU (PReLU): f(x) = x if x > 0, αx otherwise (where α is a learnable parameter)
• Similar to Leaky ReLU, but α becomes a learnable parameter, offering more flexibility.
• ELU (Exponential Linear Unit): f(x) = x if x > 0, α(e^x − 1) otherwise (where α is a positive constant)
• Similar to ReLU but with a smooth transition for negative values.
• Can push the mean activation closer to zero, which can speed up learning.
• Swish: f(x) = x · σ(βx), where σ is the sigmoid function and β is a constant or a learnable parameter.
• It is a gated activation function.
• It often performs better than ReLU in deeper models.
• GELU (Gaussian Error Linear Unit): f(x) = x · Φ(x), where Φ(x) is the cumulative distribution function of the standard Gaussian distribution.
• Smooth approximation of ReLU.
• Performs well in Transformers and other NLP models.
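As a small illustration (not part of the original slides), the listed activations can be written in a few lines of NumPy; the constants (α = 0.01 for Leaky ReLU, α = 1 for ELU, β = 1 for Swish) are assumed defaults.

import numpy as np
from scipy.stats import norm  # Gaussian CDF used by GELU

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)

def gelu(x):
    return x * norm.cdf(x)  # x * Phi(x)

x = np.linspace(-3, 3, 7)
print(relu(x), leaky_relu(x), gelu(x), sep="\n")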
Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 6 / 84
Pooling Layer - Part 1: Local Pooling
• Downsamples feature maps, reducing spatial dimensions.
• Reduces computational complexity and overfitting.
• Introduces translational invariance (to some extent).
• Local Pooling: Applies pooling within small, non-overlapping windows (e.g., 2x2).
• Max Pooling: Takes the maximum value within the window.

Output(i, j) = max_{(x,y) ∈ W} Input(i + x, j + y)

where W is the pooling window. Preserves important features and is robust to small variations.
• Average Pooling: Takes the average value within the window.

Output(i, j) = (1 / |W|) Σ_{(x,y) ∈ W} Input(i + x, j + y)

where W is the pooling window and |W| is the number of elements in W. Smoother output, less sensitive to noise.
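A minimal NumPy sketch of non-overlapping local pooling, added here for illustration; the helper name pool2d and the 4×4 example map are assumptions, not from the slides.

import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping local pooling (stride = window size) on a 2D feature map."""
    H, W = feature_map.shape
    x = feature_map[:H - H % size, :W - W % size]      # crop to a multiple of the window
    x = x.reshape(H // size, size, W // size, size)    # split into size x size blocks
    return x.max(axis=(1, 3)) if mode == "max" else x.mean(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 6, 1, 1],
               [0, 2, 5, 7],
               [1, 1, 8, 3]], dtype=float)
print(pool2d(fm, mode="max"))   # [[6. 2.] [2. 8.]]
print(pool2d(fm, mode="avg"))   # [[3.5 1.  ] [1.   5.75]]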
Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 7 / 84
Pooling Layer - Part 2: Global Pooling and Usage

• Global Pooling: Applies pooling over the entire feature map.


• Global Average Pooling (GAP): Takes the average of all values in each feature map. Reduces the number of parameters significantly. Commonly used at the end of CNNs for classification.

Output(c) = (1 / (H × W)) Σ_{i=1..H} Σ_{j=1..W} Input(i, j, c)

where H and W are the height and width of the feature map and c is the channel.
• Global Max Pooling (GMP): Takes the maximum value of all values in each feature map.

Output(c) = max_{i,j} Input(i, j, c)

where H and W are the height and width of the feature map and c is the channel.
• More robust to spatial translations compared to local pooling.
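For illustration, global pooling reduces to a mean or max over the spatial axes of an (H, W, C) array; the 7×7×512 map below is an assumed example, not from the slides.

import numpy as np

fmap = np.random.rand(7, 7, 512)      # (H, W, C) feature map from the last conv block
gap = fmap.mean(axis=(0, 1))          # Global Average Pooling -> shape (512,)
gmp = fmap.max(axis=(0, 1))           # Global Max Pooling     -> shape (512,)
print(gap.shape, gmp.shape)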

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 8 / 84


Pooling Layer - Part 3: Use Case

• When to use which?


• Max Pooling: Often preferred for object detection and image classification tasks where
identifying the presence of a feature is more important than its precise location.
• Average Pooling: Can be useful for tasks where smoothing the feature map is beneficial,
such as image segmentation or when dealing with noisy data.
• Global Average Pooling: Widely used in classification tasks as a replacement for fully
connected layers, reducing the number of parameters and improving generalization.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 9 / 84


Batch Normalization Layer
• Problem: Internal Covariate Shift - the distribution of network activations changes during training
as the network’s parameters are updated, slowing down training process.
• Solution: Batch normalization (bn) normalizes the activations of each layer for each mini-batch.
• Normalization Process:
1. Calculate the mean and variance of the activations within the mini-batch:

µ_B = (1/m) Σ_{i=1..m} x_i,   σ_B² = (1/m) Σ_{i=1..m} (x_i − µ_B)²

where x_i are the activations in the mini-batch B of size m.
2. Normalize the activations (where ϵ is a small constant for numerical stability):

x̂_i = (x_i − µ_B) / sqrt(σ_B² + ϵ)

3. Scale and shift the normalized activations using learnable parameters γ and β:

y_i = γ x̂_i + β
This allows the network to learn the optimal scale and shift for each activation.
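A minimal NumPy sketch of the training-time normalization steps above (the inference path with moving averages is omitted); the function name and the toy mini-batch are illustrative assumptions.

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (m, d) mini-batch of activations; gamma, beta: learnable (d,) parameters."""
    mu = x.mean(axis=0)                       # per-feature mini-batch mean
    var = x.var(axis=0)                       # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize
    return gamma * x_hat + beta               # scale and shift

x = np.random.randn(32, 4) * 5 + 3            # mini-batch with non-zero mean, large variance
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 mean, ~1 std per feature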
Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 10 / 84
Batch Normalization Layer

• Benefits:
• Reduces Internal Covariate Shift: Makes training more stable and faster.
• Allows for Higher Learning Rates: As the activations are normalized, the network is less
sensitive to the choice of learning rate.
• Regularization Effect: Reduces overfitting to some extent.
• Smoother Optimization Landscape: Makes optimization easier.
• Inference: During inference, the population mean and variance (estimated during training using
moving averages) are used instead of the mini-batch statistics.
• Placement: Typically placed after convolutional/fully connected layers and before the activation
function.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 11 / 84


Element-wise Addition Layer
• Functionality: Performs element-wise addition of two tensors (or feature maps) of the same
shape.
• Mathematical Representation: If A and B are two tensors of shape (H, W , C ), then the
element-wise addition C = A + B is defined as:

C (i, j, k) = A(i, j, k) + B(i, j, k)

for all i ∈ {1, . . . , H}, j ∈ {1, . . . , W }, and k ∈ {1, . . . , C }.


• Key Use Case: Residual Connections (ResNets):
• Residual connections (also known as skip connections or shortcuts) add the input of a block
to its output.
• This allows the network to learn residual mappings F (x) = H(x) − x instead of directly
learning H(x), where H(x) is the desired underlying mapping.
• The element-wise addition performs the H(x) = F (x) + x operation.
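A minimal Keras sketch of a residual block that realizes H(x) = F(x) + x with an element-wise Add layer; this is an illustrative addition (the two-convolution block layout is assumed), not code from the slides.

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """F(x) = two 3x3 convolutions; output H(x) = F(x) + x via element-wise addition."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([y, shortcut])            # element-wise addition (skip connection)
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(32, 32, 64))    # channel count matches `filters` so shapes agree
outputs = residual_block(inputs)
tf.keras.Model(inputs, outputs).summary()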

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 12 / 84


Element-wise Addition Layer

• Benefits of Residual Connections:


• Addresses the Vanishing Gradient Problem: By providing a direct path for gradients to
flow through, residual connections mitigate the vanishing gradient problem, enabling training
of very deep networks.
• Enables Learning Identity Mappings: It becomes easier for the network to learn an
identity mapping (F (x) = 0) if it’s optimal, as the residual connection provides that path
directly. This is crucial for training very deep networks, as the network can easily choose to
skip layers if they are not needed.
• Improves Information Flow: By directly adding the input to the output, information flows
more easily through the network.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 13 / 84


Element-wise Multiplication Layer

• Functionality: Performs element-wise multiplication (also known as Hadamard product) of two


tensors (or feature maps) of the same shape.
• Mathematical Representation: If A and B are two tensors of shape (H, W , C ), then the
element-wise multiplication C = A ⊙ B is defined as:

C (i, j, k) = A(i, j, k) × B(i, j, k)

for all i ∈ {1, . . . , H}, j ∈ {1, . . . , W }, and k ∈ {1, . . . , C }. The symbol “⊙” represents the
Hadamard product.
• Key Use Cases:
• Attention and Gating: Element-wise multiplication is central to attention mechanisms and feature
gating. One tensor acts as input features, while the other provides weights (attention maps or gates)
to selectively emphasize (values near 1) or suppress (values near 0) parts of the input.
• Modulation/Scaling: More generally, it can modulate or scale feature map activations.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 14 / 84


Element-wise Multiplication Layer
• Example in an Attention Mechanism:
• Let X be the input feature map and A be the attention map.
• The attended feature map X ′ is calculated as:

X′ = X ⊙ A

• If A(i, j, k) is close to 1, the corresponding feature X (i, j, k) is preserved.


• If A(i, j, k) is close to 0, the corresponding feature X (i, j, k) is suppressed.
• Benefits:
• Selective Feature Emphasis: Allows the network to focus on relevant features and ignore
irrelevant ones.
• Dynamic Feature Modulation: Enables the network to dynamically adjust the importance
of different features based on the input.
• Computational Efficiency: Element-wise multiplication is computationally efficient
compared to other operations like matrix multiplication.
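For illustration, the gating described above can be written directly with NumPy; the random feature map and sigmoid-valued attention map are assumed placeholders.

import numpy as np

X = np.random.rand(8, 8, 16)                           # input feature map (H, W, C)
A = 1.0 / (1.0 + np.exp(-np.random.randn(8, 8, 16)))   # attention map in (0, 1) via sigmoid
X_att = X * A                                          # element-wise (Hadamard) product X' = X ⊙ A
print(X_att.shape)                                     # (8, 8, 16): features scaled by their weights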

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 15 / 84


Concatenation Layer

• Functionality: Concatenation joins tensors along a specified dimension, effectively "stacking" them. Unlike element-wise operations, it increases the size of the output along that dimension.
• Mathematical Representation: Let's consider concatenating two tensors A and B along the channel (depth) dimension. If A has shape (H, W, C_A) and B has shape (H, W, C_B), the concatenation C = concat(A, B) along the channel dimension will have shape (H, W, C_A + C_B). The elements of C are defined as:

C(i, j, k) = A(i, j, k) if 1 ≤ k ≤ C_A, and B(i, j, k − C_A) if C_A + 1 ≤ k ≤ C_A + C_B

for all i ∈ {1, . . . , H} and j ∈ {1, . . . , W}.
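A short illustrative snippet (assuming TensorFlow, as in the course code later) showing depth-wise concatenation of two feature maps; the shapes are assumed examples.

import tensorflow as tf

A = tf.random.normal([1, 28, 28, 64])     # (batch, H, W, C_A)
B = tf.random.normal([1, 28, 28, 32])     # (batch, H, W, C_B)
C = tf.concat([A, B], axis=-1)            # depth-wise concatenation
print(C.shape)                            # (1, 28, 28, 96): C_A + C_B channels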

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 16 / 84


Concatenation Layer
• Key Use Cases:
• Inception Modules: In Inception networks, concatenation is used to combine feature maps
produced by different convolutional filters (e.g., 1x1, 3x3, 5x5) operating on the same input.
This allows the network to capture features at multiple scales.
• Feature Fusion: Concatenation can fuse features extracted from different parts of a network
or from different modalities (e.g., image and text).
• Skip Connections (Alternative to Addition): While residual connections primarily use
element-wise addition, concatenation can also be used in skip connections, especially when
the shapes of the feature maps to be combined are significantly different.
• Benefits:
• Increased Feature Diversity: Combining features from different sources or filter sizes
enhances feature representation.
• Efficient Information Aggregation: Provides a simple and efficient way to combine
information from multiple pathways.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 17 / 84


Operation Illustration

Element-wise addition

Element-wise multiplication
Depth-wise concatenation

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 18 / 84


Flattening Layer

• Purpose: The flattening layer transforms the multi-dimensional output of the


convolutional/pooling layers into a 1D vector.
• Operation: It simply reshapes the input tensor without learning any parameters.
• Example: Consider a tensor with shape (H, W , C ) (Height, Width, Channels). Flattening
converts this to a vector of size H × W × C .
• Necessity: Fully connected layers expect 1D input. Flattening bridges the gap between the spatial
feature maps produced by convolutional layers and the input requirements of fully connected layers.
• Mathematical Representation: If the input tensor is X with dimensions (H, W , C ), the
flattened vector Y has elements:
Y[n] = X(i, j, k)
where n = k + C(j + W·i) and 0 ≤ i < H, 0 ≤ j < W, 0 ≤ k < C. This essentially linearizes the tensor in a row-major fashion.
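A quick NumPy check of the row-major index formula above; the toy 2×3×4 tensor and chosen indices are assumed examples.

import numpy as np

H, W, C = 2, 3, 4
X = np.arange(H * W * C).reshape(H, W, C)   # toy (H, W, C) tensor
Y = X.reshape(-1)                           # row-major flattening, length H*W*C

i, j, k = 1, 2, 3
n = k + C * (j + W * i)                     # index formula from the slide
print(Y[n] == X[i, j, k])                   # True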

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 19 / 84


Fully Connected Layers (Dense Layers)
• Functionality: Fully connected layers connect each neuron in the layer to every neuron in the
previous layer.
• Operation: Performs an affine transformation:

y = Wx + b

where:
• x is the input vector (from the flattening layer).
• W is the weight matrix.
• b is the bias vector.
• y is the output vector.
• Purpose in CNNs: After feature extraction by convolutional layers, fully connected layers are
typically used for high-level reasoning and decision-making (e.g., classification).
• Number of Parameters: A fully connected layer with M input neurons and N output neurons
has M × N + N parameters (including biases).
Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 20 / 84
Fully Connected Layers (Dense Layers)
• Computational Cost: The computational cost is proportional to M × N.
• Drawbacks:
• Fully connected layers are prone to overfitting due to the large number of parameters,
especially when the input vector is high-dimensional.
• Global Average Pooling is often used as a replacement for fully connected layers to mitigate
this issue.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 21 / 84


Flattening and Fully Connected Layers: Combined
• In a typical CNN architecture, the flattening layer is placed immediately before the fully
connected layers.
• The output of the last convolutional/pooling layer, which is a multi-dimensional tensor
representing the extracted features, is flattened into a 1D vector.
• This 1D vector is then fed as input to one or more fully connected layers, which perform
the final classification or regression task.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 22 / 84


Output Layer

• The output layer produces the final prediction of the network, transforming the high-level features
learned by the preceding layers into a desired output format.
• The choice of activation function and the structure of the output layer depend on the task:
• Classification (Multi-class):
• Structure: Typically a fully connected layer with a number of neurons equal to the number of
classes.
• Activation: Softmax is used to produce a probability distribution over the classes:

Softmax(z)_i = e^{z_i} / Σ_{j=1..K} e^{z_j}

where z is the vector of logits (raw outputs of the fully connected layer), z_i is the logit for class i, and K is the number of classes.
• Output: A vector of probabilities, where each element represents the probability of the input
belonging to a specific class. The sum of these probabilities is 1.
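A minimal NumPy softmax, added for illustration; subtracting the maximum logit is a standard numerical-stability trick not mentioned on the slide.

import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())    # probabilities over the classes, summing to 1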

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 23 / 84


Output Layer

• The choice of activation function and the structure of the output layer depend on the task:
• Binary Classification:
• Structure: A single neuron.
• Activation: Sigmoid is often used to output a probability between 0 and 1:

Sigmoid(x) = 1 / (1 + e^{-x})
• Output: A single value representing the probability of the input belonging to the positive class.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 24 / 84


Output Layer

• The choice of activation function and the structure of the output layer depend on the task:
• Regression:
• Structure: Typically a fully connected layer with a number of neurons equal to the number of
output variables.
• Activation: Linear activation (identity function) is commonly used:

f (x) = x

Sometimes other activations are used, depending on the range of the target variable.
• Output: A vector of continuous values representing the predicted output variables.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 25 / 84


Output Layer

Other Output Types: Other output types exist for tasks like object detection (bounding boxes, class
probabilities), semantic segmentation (pixel-wise classification), etc., and they use specialized structures
and loss functions.

Loss Functions: The output layer is closely tied to the loss function used to train the network.
Common loss functions include:
• Cross-Entropy Loss (Classification): Measures the difference between the predicted probability
distribution and the true distribution.
• Mean Squared Error (MSE) (Regression): Measures the average squared difference between
the predicted and true values.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 26 / 84


Convolutional Layers: Introduction

• Core Building Block of CNNs: Convolutional layers are fundamental to Convolutional Neural
Networks (CNNs), specialized for processing data with a grid-like topology, such as images, videos,
and time-series data.
• Local Receptive Fields: Unlike fully connected layers where each neuron is connected to all
neurons in the previous layer, convolutional layers exploit the spatial (or temporal) structure of the
input by connecting each neuron only to a local region of the input. This local region is called the
receptive field.
• Filters (Kernels): Convolutional layers use learnable filters (also called kernels) to detect local
patterns. A filter is a small matrix of weights that slides (convolves) across the input.
• Each filter learns to detect a specific feature (e.g., edges, corners, textures).
• Multiple filters are used in a convolutional layer to learn multiple features.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 27 / 84


Convolutional Layers: Introduction

• Convolution and Feature Maps: Convolution involves element-wise multiplication and


summation between the filter and input within the receptive field, producing feature maps that
highlight detected features. Each filter generates a distinct feature map.
• Parameter Sharing: Using the same filter across the input (parameter sharing) reduces
parameters, improves efficiency, and mitigates overfitting.
• Translation Equivariance: Shifting the input results in a corresponding shift in the output
feature maps, crucial for location-invariant tasks like image recognition.
• Hierarchical Features: Stacking convolutional layers enables learning hierarchical features, with
lower layers detecting simple features (e.g., edges) and higher layers detecting complex features
(e.g., objects).

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 28 / 84


Convolution Operation

• 1D Convolution (Signal Processing):


• Given a 1D input signal x[n] and a 1D filter (kernel) w[n], the discrete convolution is defined as:

(x ∗ w)[n] = Σ_{m=−∞..∞} x[m] w[n − m]

• In practice, the filter has finite length, so the summation limits are adjusted accordingly. If the filter has length K, the practical convolution is:

(x ∗ w)[n] = Σ_{m=0..K−1} x[n − m] w[m]

• Example (1-based indexing): Let x = [1, 2, 3, 4] and w = [0.5, 0.5]. Then (x ∗ w)[2] = (1 × 0.5) + (2 × 0.5) = 1.5.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 29 / 84


Convolution Operation
• 2D Convolution (Image Processing):
• Given a 2D input image I(i, j) and a 2D filter (kernel) K(i, j), the 2D discrete convolution is defined as:

(I ∗ K)(i, j) = Σ_{m=−∞..∞} Σ_{n=−∞..∞} I(m, n) K(i − m, j − n)

• In practice, the filter has finite dimensions K_h × K_w, so the practical convolution is:

(I ∗ K)(i, j) = Σ_{m=0..K_h−1} Σ_{n=0..K_w−1} I(i − m, j − n) K(m, n)

• This can be visualized as sliding the kernel over the image and computing the element-wise multiplication and sum at each location.
• Cross-Correlation vs. Convolution: In deep learning, what is often referred to as "convolution" is actually cross-correlation, where the kernel is not flipped. The equation for 2D cross-correlation is:

(I ⋆ K)(i, j) = Σ_{m=0..K_h−1} Σ_{n=0..K_w−1} I(i + m, j + n) K(m, n)

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 30 / 84
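A minimal NumPy sketch of "valid" 2D cross-correlation as defined above; this is an illustrative addition (the loop-based implementation favors clarity over speed, and the edge-detector kernel is an assumed example).

import numpy as np

def cross_correlate2d(image, kernel):
    """'Valid' 2D cross-correlation (kernel not flipped), as used in deep learning."""
    Kh, Kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - Kh + 1, W - Kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + Kh, j:j + Kw] * kernel)
    return out

img = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)   # simple vertical-edge detector
print(cross_correlate2d(img, edge_kernel))        # 3x3 output feature map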
Convolutional Hyperparameters

• Convolutional layers have several key hyperparameters that control their behavior and
computational cost.
• Understanding these parameters is crucial for designing effective CNN architectures.
• We will discuss the following:
• Filter/Kernel Size
• Stride
• Dilation Rate
• Padding

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 31 / 84


Filter/Kernel Size (K )
• The filter or kernel defines the spatial dimensions of the convolution operation (e.g., 1x1, 3x3).
• It determines the receptive field: the region of the input that influences a single output value.
• Larger kernels:
• Capture larger spatial contexts.
• More computationally expensive (more parameters and FLOPs).
• Smaller kernels:
• Capture finer details.
• Less computationally expensive.
• Common choices: 3x3, 1x1 (for channel-wise operations).

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 32 / 84


Stride (S)
• Stride defines the number of pixels the filter shifts between consecutive convolution operations.
• A stride of 1 means the filter moves one pixel at a time.
• Larger strides:
• Reduce the output size (downsampling).
• Reduce computational cost.
• Smaller strides:
• Preserve more spatial information.
• Increase computational cost.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 33 / 84


Dilation Rate (D)
• Dilation introduces spacing between the kernel elements, effectively expanding the receptive field
without increasing the number of parameters.
• A dilation rate of 1 is a standard convolution.
• Larger dilation rates:
• Increase the receptive field exponentially.
• Useful for capturing long-range dependencies.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 34 / 84


Padding (P)
• Padding adds extra pixels around the input boundary to control the output size.
• Types of Padding:
• Valid Padding (No Padding): The output size is smaller than the input size.
• Same Padding: Padding is added so that the output size is the same as the input size (when stride is 1). The amount of padding is usually calculated as P = (K − 1)/2.
• Padding helps to:
• Prevent excessive shrinking of the output and preserve information at the input’s borders.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 35 / 84


Output Size Calculation

• Let:
• I : Input size (height/width).
• K : Kernel size.
• P: Padding.
• S: Stride.
• D: Dilation rate.
• The output size O is calculated as:
 
O = ⌊(I + 2P − D(K − 1) − 1) / S⌋ + 1

• Example: I = 7, K = 3, P = 1, S = 1, D = 1: O = ⌊(7 + 2·1 − 1·(3 − 1) − 1)/1⌋ + 1 = 7
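A one-line helper implementing the output-size formula, added for illustration; the example calls (including an AlexNet-style 227/11/stride-4 layer) are assumptions for checking the arithmetic.

def conv_output_size(I, K, P=0, S=1, D=1):
    """Output spatial size for kernel K, padding P, stride S, dilation D."""
    return (I + 2 * P - D * (K - 1) - 1) // S + 1

print(conv_output_size(7, 3, P=1, S=1, D=1))    # 7   (same padding, stride 1)
print(conv_output_size(224, 3, P=1, S=2))       # 112 (downsampling by stride 2)
print(conv_output_size(227, 11, P=0, S=4))      # 55  (AlexNet's first convolutional layer)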

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 36 / 84


Number of Trainable Parameters (Weights)

• Let:
• K: Kernel size (assuming square kernel).
• C_in: Number of input channels.
• C_out: Number of output channels (number of filters).
• The number of trainable parameters (weights) in a convolutional layer is:

Parameters = K² × C_in × C_out + C_out (bias)

• The C_out term accounts for the bias term for each filter.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 37 / 84


Computational Cost (FLOPs)

• FLOPs (Floating Point Operations) measure the computational cost.


• Let:
• O: Output size (height/width).
• K: Kernel size.
• C_in: Number of input channels.
• C_out: Number of output channels.
• The number of FLOPs is approximately:

FLOPs ≈ O² × K² × C_in × C_out

• This formula counts multiplications and additions.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 38 / 84


Example: Parameters and FLOPs

• Input: 224x224x3 image


• Convolutional layer: 3x3 kernel, 64 output channels, stride 1, padding 1
• Output size: ⌊(224 + 2·1 − 1·(3 − 1) − 1)/1⌋ + 1 = 224
• Parameters: 3² × 3 × 64 + 64 = 1792
• FLOPs: 224² × 3² × 3 × 64 ≈ 87 MFLOPs
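A quick sanity check of these numbers with Keras, added for illustration; the FLOP count follows the slide's formula (one multiply-accumulate per term).

import tensorflow as tf

layer = tf.keras.layers.Conv2D(filters=64, kernel_size=3, strides=1, padding="same")
layer.build(input_shape=(None, 224, 224, 3))
print(layer.count_params())                  # 1792 = 3*3*3*64 + 64

O, K, C_in, C_out = 224, 3, 3, 64
print(O**2 * K**2 * C_in * C_out)            # 86,704,128 multiply-accumulates (~87 M)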

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 39 / 84


Specialized Convolutional Layers: Motivation
• Computational Cost of Standard Convolutions: Standard convolutions can be computationally
demanding, especially with many channels, large kernels, or high-resolution inputs.
• Need for Efficiency: Resource-constrained applications (e.g., mobile, embedded systems) require
more efficient alternatives.
• Benefits of Specialized Convolutions: These offer:
• Reduced computational cost (fewer parameters and FLOPs).
• Improved efficiency (faster inference/training).
• Potential performance gains.
• Types of Specialized Convolutions (covered next):
• Depthwise Convolution
• Grouped Convolution
• Pointwise Convolution (1x1)
• Depthwise Separable Convolution

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 40 / 84


Standard Convolution: Recap

• Input: H × W × Cin
• Kernel: K × K × Cin
• Number of filters: Cout
• Output: H ′ × W ′ × Cout
• Number of parameters: K × K × Cin × Cout
• FLOPs: H ′ × W ′ × K × K × Cin × Cout

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 41 / 84


Depthwise Convolution
• Applies a single filter to each input channel independently.
• Input: H × W × Cin
• Kernel: K × K × 1 (one filter per channel)
• Output: H ′ × W ′ × Cin (same number of channels as input)
• Number of parameters: K × K × Cin
• FLOPs: H ′ × W ′ × K × K × Cin
• Much more efficient than standard convolution, especially when Cout >> 1.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 42 / 84


Pointwise Convolution (1x1 Convolution)
• Input: H × W × Cin
• Kernel: 1 × 1 × Cin - 1x1 kernel to perform a linear combination of the input channels.
• Number of filters: Cout
• Output: H × W × Cout (spatial dimensions remain the same)
• Number of parameters: 1 × 1 × Cin × Cout = Cin × Cout
• FLOPs: H × W × Cin × Cout
• Used for:
• Reducing/increasing the number of channels.
• Adding non-linearity after depthwise convolution.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 43 / 84


Grouped Convolution
• Divides the input channels into G groups and applies a standard convolution independently within each group (depthwise convolution is a special case of grouped convolution where G = C_in).
• Input: H × W × C_in
• Kernel: K × K × (C_in / G)
• Number of filters per group: C_out / G
• Output: H′ × W′ × C_out
• Number of parameters: K × K × (C_in / G) × (C_out / G) × G = K × K × C_in × (C_out / G)
• FLOPs: H′ × W′ × K × K × C_in × (C_out / G)

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 44 / 84


Depthwise Separable Convolution
• Combines depthwise and pointwise convolutions.
• First, a depthwise convolution is applied.
• Then, a pointwise convolution is used to combine the output channels.
• Significantly reduces computational cost compared to standard convolution.
• Used in MobileNet and Xception.
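An illustrative Keras comparison of parameter counts between a standard Conv2D and a SeparableConv2D (depthwise followed by pointwise); the 112×112×64 input and 128 output channels are assumed numbers, not from the slides.

import tensorflow as tf
from tensorflow.keras import layers

inp = tf.keras.Input(shape=(112, 112, 64))

standard = layers.Conv2D(128, 3, padding="same")             # standard convolution
separable = layers.SeparableConv2D(128, 3, padding="same")   # depthwise + pointwise

_ = standard(inp), separable(inp)                            # build both layers
print("standard: ", standard.count_params())                 # 3*3*64*128 + 128      = 73,856
print("separable:", separable.count_params())                # 3*3*64 + 64*128 + 128 =  8,896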

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 45 / 84


Convolutional Layers: Animated Explanation

Groups, Depthwise, and Depthwise-Separable Convolution

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 46 / 84


Backbone CNN Models: Review

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 47 / 84


Introduction to LeNet-5
• Historical Significance: LeNet-5, developed by Yann LeCun et al. in the 1990s, is one of the
earliest and most influential Convolutional Neural Network (CNN) architectures.
• Purpose: Designed for handwritten and machine-printed character recognition (e.g., MNIST
dataset).
• Key Innovations: Introduced fundamental CNN concepts:
• Convolutional layers with learnable weights.
• Local receptive fields.
• Spatial subsampling (pooling).
• Shared weights (parameter sharing).

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 48 / 84


LeNet-5 Architecture

• LeNet-5 consists of seven layers (excluding the input):


• Input Layer: 32x32 grayscale image.
• Convolutional Layer C1: 6 5x5 filters, stride 1, no padding. Output: 28x28x6.
• Subsampling Layer S2 (Average Pooling): 2x2 pooling, stride 2. Output: 14x14x6.
• Convolutional Layer C3: 16 5x5 filters. Output: 10x10x16. Note: the connections between feature
maps in S2 and C3 are not fully connected in the original LeNet-5 paper.
• Subsampling Layer S4 (Average Pooling): 2x2 pooling, stride 2. Output: 5x5x16.
• Fully Connected Layer F5: 120 neurons.
• Fully Connected Layer F6: 84 neurons.
• Output Layer: 10 neurons (one for each digit 0-9) with RBF (Radial Basis Function) or Softmax
activation.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 49 / 84


Key Concepts and Impact

Key concepts:
• Convolutional Layers: Local receptive fields, feature extraction.
• Subsampling (Pooling): Reducing spatial resolution, increasing robustness to small shifts and distortions.
• Parameter Sharing: Reducing the number of parameters and improving generalization.
• Hierarchical Feature Learning: Lower layers detect simple features (edges, lines), higher layers detect more
complex features (combinations of edges, shapes).
Impacts:
• LeNet-5 laid the foundation for modern CNN architectures.
• Its key concepts are still used in many state-of-the-art models.
• It demonstrated the power of CNNs for image recognition and other tasks involving structured data.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 50 / 84


Introduction to AlexNet
• Revolutionary Impact: AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey
Hinton, won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 by a
significant margin, marking a turning point in DL for computer vision.
• Key Contributions:
• Deeper architecture than previous CNNs.
• Use of ReLU activation functions.
• Training on GPUs for faster training.
• Local response normalization (LRN).
• Overlapping pooling and data augmentation

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 51 / 84


AlexNet Architecture
AlexNet consists of eight layers (excluding the input):
• Input Layer: 227x227x3 RGB image.
• Convolutional Layer 1: 96 11x11 filters, stride 4, no padding. Output: 55x55x96.
• Max Pooling Layer 1: 3x3 pooling, stride 2. Output: 27x27x96.
• Convolutional Layer 2: 256 5x5 filters, stride 1, padding 2. Output: 27x27x256.
• Max Pooling Layer 2: 3x3 pooling, stride 2. Output: 13x13x256.
• Convolutional Layer 3: 384 3x3 filters, stride 1, padding 1. Output: 13x13x384.
• Convolutional Layer 4: 384 3x3 filters, stride 1, padding 1. Output: 13x13x384.
• Convolutional Layer 5: 256 3x3 filters, stride 1, padding 1. Output: 13x13x256.
• Max Pooling Layer 3: 3x3 pooling, stride 2. Output: 6x6x256.
• Fully Connected Layer 1: 4096 neurons.
• Fully Connected Layer 2: 4096 neurons.
• Output Layer: 1000 neurons (for 1000 ImageNet classes) with Softmax activation.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 52 / 84


Key Innovations and Impact

Key Innovations:
• ReLU Activations: Accelerated training by mitigating vanishing gradients.
• GPU Training: Enabled training of larger models on larger datasets.
• Local Response Normalization (LRN): Local channel normalization (minor impact).
• Overlapping Pooling: Reduced overfitting.
• Data Augmentation: Improved generalization by increasing training data diversity.
Impact:
• Deep Learning Resurgence in CV: Sparked renewed interest and rapid progress in deep learning for
computer vision.
• Foundation for Modern CNNs: Influenced many subsequent CNN architectures.
• Influence on Other Fields: Impacted other areas of deep learning like NLP and speech recognition.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 53 / 84


Introduction to VGG-16
• Visual Geometry Group (VGG): Developed by the VGG at the University of Oxford.
• Key Insight: Demonstrated the importance of network depth in achieving better performance in
image classification.
• Uniform Architecture: Used very small (3x3) convolutional filters throughout the entire network,
leading to a much deeper architecture than AlexNet.
• ILSVRC 2014: Achieved top performance in the ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) 2014.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 54 / 84


VGG-16 Architecture
• Key Characteristics:
• Only 3x3 convolutional filters with stride 1 and padding 1 are used.
• 2x2 max pooling with stride 2 is used for downsampling.
• Multiple convolutional layers are stacked before each pooling layer.
• Layers (simplified): VGG-16 refers to 16 layers with weights (convolutional or fully connected):
• Input: 224x224x3 RGB image.
• Conv1 (2 layers): 64 filters. Output: 224x224x64
• Max Pool 1: Output: 112x112x64
• Conv2 (2 layers): 128 filters. Output: 112x112x128
• Max Pool 2: Output: 56x56x128
• Conv3 (3 layers): 256 filters. Output: 56x56x256
• Max Pool 3: Output: 28x28x256
• Conv4 (3 layers): 512 filters. Output: 28x28x512
• Max Pool 4: Output: 14x14x512
• Conv5 (3 layers): 512 filters. Output: 14x14x512
• Max Pool 5: Output: 7x7x512
• FC1: 4096 neurons
• FC2: 4096 neurons
• Output (FC3): 1000 neurons (ImageNet classes) with Softmax

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 55 / 84


VGG-16 vs VGG-19

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 56 / 84


Advantages of Small 3x3 Convolutions

• Deeper Network: Stacking multiple 3x3 convolutions allows for a deeper network, which
can learn more complex features.
• Reduced Number of Parameters: Two stacked 3x3 convolutions have the same
receptive field as one 5x5 convolution but with fewer parameters:
• One 5x5: 5 × 5 = 25 parameters
• Two 3x3: (3 × 3) + (3 × 3) = 18 parameters
• More Non-linearities: Stacking more layers increases the number of non-linear
activations (ReLU), which makes the network more expressive.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 57 / 84


Impact of VGG Networks

• Emphasis on Depth: Solidified the importance of network depth for achieving high
performance.
• Simple and Effective Design: The uniform architecture with small filters made VGG
networks easy to understand and implement.
• Transfer Learning: VGG models pretrained on ImageNet became widely used for transfer
learning in various computer vision tasks.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 58 / 84


Introduction to GoogLeNet

• ILSVRC 2014 Winner: GoogLeNet, developed by Google, won the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC) 2014, achieving a significant improvement over previous
architectures (including VGG).
• Key Innovation: Inception Module: Introduced the Inception module, a novel building block
that significantly improved efficiency and performance.
• Depth and Efficiency: Achieved greater depth than previous networks while maintaining
manageable computational cost.
• Reduced Parameters: Significantly fewer parameters than AlexNet, making it more efficient and
less prone to overfitting.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 59 / 84


The Inception Module
• Motivation: To capture features at multiple scales simultaneously.
• Structure: Consists of parallel branches with different convolutional filter sizes (1x1, 3x3, 5x5)
and max pooling.
• 1x1 Convolutions: Used 1x1 convolutions for dimensionality reduction before the more expensive
3x3 and 5x5 convolutions, significantly reducing computational cost.
• Concatenation: The outputs of all branches are concatenated along the channel dimension.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 60 / 84


GoogLeNet Architecture
• Stacking Inception Modules: GoogLeNet consists of multiple Inception modules stacked on top
of each other.
• Auxiliary Classifiers: Included auxiliary classifiers at intermediate layers to improve gradient flow
during training and prevent vanishing gradients.
• No Fully Connected Layers at the End: Used Global Average Pooling (GAP) at the end
instead of fully connected layers, further reducing the number of parameters.
• Simplified Structure (Conceptual): Input - Initial Convolutional Layers - Stacked Inception Modules - Global Average Pooling - Softmax Output

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 61 / 84


Advantages of GoogLeNet

• Increased Depth and Width: The Inception module allows for increasing both the
depth and width of the network without a significant increase in computational cost.
• Computational Efficiency: Using 1x1 convolutions for dimensionality reduction
significantly reduces the number of parameters and FLOPs.
• Improved Performance: Achieved state-of-the-art performance on ImageNet with
significantly fewer parameters than previous models.
• Reduced Overfitting: The reduced number of parameters and the use of auxiliary
classifiers helped to reduce overfitting.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 62 / 84


Impact of GoogLeNet

• Shift Towards Efficient Architectures: Influenced the development of more efficient


CNN architectures.
• Inception Module as a Building Block: The Inception module became a popular
building block in many subsequent CNNs.
• Focus on Computational Cost: Highlighted the importance of considering
computational cost in deep learning model design.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 63 / 84


Introduction to Inception-v3

• Evolution of Inception: Inception-v3 is the third iteration of the Inception architecture,


building upon the ideas introduced in GoogLeNet (Inception-v1).
• Focus on Efficiency and Performance: Aimed to further improve both computational
efficiency and classification performance.
• Key Improvements: Introduced several architectural refinements:
• Factorization of larger convolutions into smaller ones.
• Asymmetric convolutions.
• Auxiliary classifiers with improved loss.
• Batch Normalization in auxiliary classifiers.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 64 / 84


Factorization of Convolutions
• Factorizing 5x5 Convolutions: A 5x5 convolution can be factorized into two consecutive 3x3
convolutions, reducing the number of parameters and computations:
• One 5x5: 5 × 5 = 25 parameters
• Two 3x3: (3 × 3) + (3 × 3) = 18 parameters
This increases depth, adding more non-linearities (ReLU activations) and thus increasing the
network’s expressiveness.
• Factorizing n × n Convolutions: More generally, any n × n convolution can be factorized into a
sequence of 1 × n and n × 1 convolutions. For example, a 3x3 convolution can be factorized into a
1x3 followed by a 3x1.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 65 / 84


Asymmetric Convolutions

• Further Factorization: Inception-v3 further factorizes convolutions by using asymmetric


convolutions, such as 1xn followed by nx1.
• Example: Instead of a 3x3 convolution, Inception-v3 uses a 1x3 convolution followed by a 3x1
convolution.
• Benefits: This further reduces the number of parameters and computations compared to using
two 3x3 convolutions.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 66 / 84


Improved Auxiliary Classifiers
• Purpose of Auxiliary Classifiers: To improve gradient flow during training, especially in very
deep networks, and prevent vanishing gradients.
• Improvements in v3: In Inception-v3, the auxiliary classifiers were improved by:
• Using batch normalization in the auxiliary classifiers.
• Using a different loss function (softmax cross-entropy) for the auxiliary classifiers.
• Contribution to Final Loss: The loss from the auxiliary classifiers is added to the main loss with
a smaller weight (e.g., 0.3).

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 67 / 84


Overall Impact of Inception-v3

• State-of-the-Art Performance: Achieved even better performance on ImageNet


compared to its predecessors.
• Emphasis on Efficient Design: Further emphasized the importance of efficient network
design.
• Influence on Subsequent Architectures: Influenced the design of many subsequent
CNN architectures by demonstrating the effectiveness of factorization and asymmetric
convolutions.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 68 / 84


Introduction to ResNet

• Challenge of Deep Networks: Training very deep neural networks was a major challenge due to
the vanishing gradient problem.
• Key Innovation: Residual Connections (Skip Connections): ResNet, introduced by He et al.,
addressed this problem with the concept of residual connections (also known as skip connections
or shortcuts).
• ILSVRC 2015 Winner: Achieved state-of-the-art results on ImageNet in 2015, surpassing
human-level performance on the classification task.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 69 / 84


The Vanishing Gradient Problem
• Gradient Propagation: During backpropagation, gradients are multiplied as they are passed
through multiple layers.
• Vanishing Gradients: In very deep networks, these repeated multiplications can cause the
gradients to become extremely small, effectively preventing the earlier layers from learning.
• Impact: This makes it difficult to train very deep networks effectively.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 70 / 84


Residual Connections
• Concept: Instead of directly learning a mapping H(x), ResNet learns a residual mapping
F (x) = H(x) − x, where x is the input to the layer.
• Residual Block: The output of a residual block is then H(x) = F (x) + x. The addition is
performed using element-wise addition.
• Identity Mapping: If the identity mapping is optimal, the network can easily learn it by setting
F (x) = 0.
• Gradient Flow: Residual connections provide a direct path for gradients to flow through,
mitigating the vanishing gradient problem.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 71 / 84


ResNet Architectures
• Different Depths: ResNet comes in various depths (e.g., ResNet-18, ResNet-34, ResNet-50,
ResNet-101, ResNet-152), with the number indicating the number of layers.
• Bottleneck Layers: Deeper ResNet architectures (e.g., ResNet-50 and above) use bottleneck
layers to reduce computational cost. A bottleneck layer consists of a 1x1 convolution, a 3x3
convolution, and another 1x1 convolution.
• Overall Structure (General):
1. Input Convolution and Pooling
2. Several Blocks of Residual Layers (repeated)
3. Global Average Pooling
4. Fully Connected Layer (for classification)
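A minimal Keras sketch of the bottleneck residual block described above, added for illustration; the projection shortcut used when input and output shapes differ is omitted, and the 56×56×256 input is an assumed example.

import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_block(x, filters=64, expansion=4):
    """1x1 reduce -> 3x3 -> 1x1 expand, plus an identity skip connection."""
    shortcut = x
    y = layers.Conv2D(filters, 1, activation="relu")(x)                  # reduce channels
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(y)
    y = layers.Conv2D(filters * expansion, 1)(y)                         # expand channels
    y = layers.Add()([y, shortcut])                                      # H(x) = F(x) + x
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(56, 56, 256))   # input channels match filters * expansion
outputs = bottleneck_block(inputs)
tf.keras.Model(inputs, outputs).summary()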

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 72 / 84


Benefits and Impact of ResNet

• Training Very Deep Networks: Enabled the training of significantly deeper networks
than previously possible.
• Improved Performance: Achieved state-of-the-art results on various computer vision
tasks.
• Foundation for Future Architectures: The concept of residual connections has become
a fundamental building block in many subsequent CNN architectures.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 73 / 84


MobileNet: Efficient Mobile-First CNNs

• Key Idea: Focuses on extreme computational


efficiency for mobile and embedded devices.
• Key Components:
• Depthwise Separable Convolutions:
Factorizes standard convolutions into
depthwise and pointwise convolutions to
significantly reduce computation.
• Width Multiplier: A hyperparameter to
control the number of channels, further
reducing computation.
• Resolution Multiplier: A hyperparameter to
control the input image resolution, also
impacting computation.
• Goal: Achieve a good balance between accuracy
and latency/model size.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 74 / 84


DenseNet: Dense Connections for Feature Reuse
• Idea: Maximizes information flow between layers by connecting each layer to all preceding layers.
• Dense Blocks: Each layer receives feature maps from all preceding layers as input and passes its
own feature maps to all subsequent layers.
• Benefits:
• Strong feature reuse, leading to more compact models.
• Mitigates vanishing gradient problem.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 75 / 84


SENet (Squeeze-and-Excitation Networks): Channel Attention

• Key Idea: Introduces channel-wise attention mechanisms to dynamically recalibrate channel-wise


feature responses.
• Key Component: Squeeze-and-Excitation (SE) Block:
• Squeeze: Global average pooling to obtain channel-wise statistics.
• Excitation: Two fully connected layers with a sigmoid activation to learn channel-wise weights.
• Scale: Element-wise multiplication of the channel weights with the original feature maps.

• Benefit: Improves feature discrimination by emphasizing important channels and suppressing less
important ones.
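A minimal Keras sketch of the Squeeze-and-Excitation block described above, added for illustration; the reduction ratio of 16 and the input shape are assumed values.

import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, reduction=16):
    """Squeeze-and-Excitation: GAP -> two FC layers -> sigmoid gates -> channel-wise scaling."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                          # squeeze: (B, C)
    s = layers.Dense(channels // reduction, activation="relu")(s)   # excitation: bottleneck FC
    s = layers.Dense(channels, activation="sigmoid")(s)             # per-channel weights in (0, 1)
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                                # scale: element-wise multiplication

inputs = tf.keras.Input(shape=(28, 28, 64))
outputs = se_block(inputs)
tf.keras.Model(inputs, outputs).summary()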

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 76 / 84


ResNeXt: Aggregated Residual Transformations
• Key Idea: Extends ResNet by replicating multiple parallel paths (transformations) within each
residual block, aggregating their outputs.
• Key Component: Cardinality: The number of parallel paths, acting as a new dimension besides
depth and width.
• Benefit: Improves performance by exploring a richer set of transformations while maintaining
computational efficiency.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 77 / 84


Recent Cutting-Edge Models (Brief Overview)

• EfficientNet: Focuses on compound scaling of network width, depth, and resolution using a
principled approach.
• RegNet: Explores network design space using a population-based search to find optimal
architectures.
• Vision Transformers (ViT): Applies the Transformer architecture from NLP to image
classification, treating images as sequences of patches.
• ConvNeXt: A modern take on the classical ConvNet design inspired by the Transformer
architecture, showing the strong potential of carefully designed ConvNets.

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 78 / 84


Performance: Accuracy vs Complexity

A good neural network has a high accuracy and is fast.


Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 79 / 84
Python Code - Image Classification (Part 01)
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.applications import ResNet50V2 # Example: ResNet50V2
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
import matplotlib.pyplot as plt
import numpy as np
# Data paths (adjust these for your data)
train_dir = '/content/drive/MyDrive/Colab Notebooks/final_project_dataset/training_set'
validation_dir = '/content/drive/MyDrive/Colab Notebooks/final_project_dataset/test_set'
IMG_SIZE = (224, 224) # ResNet50V2 input size
# Data augmentation and preprocessing
train_datagen = ImageDataGenerator(
rescale=1./255,
rotation_range=20,
width_shift_range=0.2,
height_shift_range=0.2,
shear_range=0.2,
zoom_range=0.2,
horizontal_flip=True
)
Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 80 / 84
Python Code - Image Classification (Part 02)
validation_datagen = ImageDataGenerator(rescale=1./255)

try:
    # Attempt to create data generators
    train_generator = train_datagen.flow_from_directory(
        train_dir,
        target_size=IMG_SIZE,
        batch_size=32,
        class_mode='categorical'  # or 'binary' if you have two classes
    )

    validation_generator = validation_datagen.flow_from_directory(
        validation_dir,
        target_size=IMG_SIZE,
        batch_size=32,
        class_mode='categorical'  # or 'binary' if you have two classes
    )
except OSError as e:
    print(f"Error creating data generators: {e}")
    raise  # Re-raise to stop execution on data generator errors

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 81 / 84


Python Code - Image Classification (Part 03)

# Load pre-trained model (ResNet50V2 in this example)


base_model = ResNet50V2(
weights='imagenet',
include_top=False, # Exclude the classification layer
input_shape=IMG_SIZE + (3,)
)

# Freeze the base model layers


base_model.trainable = False

# Add custom classification head


x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)  # Add a dense layer
predictions = Dense(train_generator.num_classes, activation='softmax')(x)  # Output layer

model = Model(inputs=base_model.input, outputs=predictions)

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 82 / 84


Python Code - Image Classification (Part 04)
# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])  # Adjust loss

# Train the model


epochs = 10 # Adjust as needed
try:
    history = model.fit(
        train_generator,
        steps_per_epoch=train_generator.samples // train_generator.batch_size,
        epochs=epochs,
        validation_data=validation_generator,
        validation_steps=validation_generator.samples // validation_generator.batch_size
    )
except Exception as e:  # Catch any training errors
    print(f"Error during training: {e}")
    raise  # Re-raise so the history access below is not reached if training failed

# Access training history (outside the try block)

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 83 / 84


Python Code - Image Classification (Part 05)

epochs_range = range(epochs)

plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

# Save the model


model.save('image_classifier_model.h5')

Thien Huynh-The - HCMUTE Convolutional Neural Networks February 10, 2025 84 / 84
