Lecture05-DeepLearningCNN
Thien Huynh-The
Department of Computer and Communications Engineering
HCMC University of Technology and Education
where W is the pooling window. Max pooling preserves important features and is robust to small variations.
• Average Pooling: Takes the average value within the window.
Output(i, j) = (1 / |W|) Σ_{(x,y) ∈ W} Input(i + x, j + y)
where W is the pooling window and |W| is the number of elements in W. Average pooling produces a smoother output that is less sensitive to noise.
Pooling Layer - Part 2: Global Pooling and Usage
• Global Average Pooling (GAP): Takes the average of all values in each feature map:
GAP(c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} Input(i, j, c)
where H and W are the height and width of the feature map and c is the channel index.
• Global Max Pooling (GMP): Takes the maximum of all values in each feature map:
GMP(c) = max_{i, j} Input(i, j, c)
where H and W are the height and width of the feature map and c is the channel index.
• More robust to spatial translations compared to local pooling.
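A minimal Keras sketch (TensorFlow 2.x assumed) of the local and global pooling layers above, applied to a hypothetical 8x8x16 feature map; shapes and layer choices are illustrative.

import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical 8x8 feature map with 16 channels (batch size 1)
x = tf.random.normal((1, 8, 8, 16))

# Local pooling: 2x2 window, stride 2 -> 4x4x16
max_pooled = layers.MaxPooling2D(pool_size=2)(x)
avg_pooled = layers.AveragePooling2D(pool_size=2)(x)

# Global pooling: one value per channel -> shape (1, 16)
gap = layers.GlobalAveragePooling2D()(x)
gmp = layers.GlobalMaxPooling2D()(x)

print(max_pooled.shape, avg_pooled.shape, gap.shape, gmp.shape)
# (1, 4, 4, 16) (1, 4, 4, 16) (1, 16) (1, 16)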
• Benefits of Batch Normalization:
• Reduces Internal Covariate Shift: Makes training more stable and faster.
• Allows for Higher Learning Rates: As the activations are normalized, the network is less
sensitive to the choice of learning rate.
• Regularization Effect: Reduces overfitting to some extent.
• Smoother Optimization Landscape: Makes optimization easier.
• Inference: During inference, the population mean and variance (estimated during training using
moving averages) are used instead of the mini-batch statistics.
• Placement: Typically placed after convolutional/fully connected layers and before the activation
function.
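A minimal Keras sketch (TensorFlow 2.x assumed) of the placement described above: batch normalization after the convolution and before the activation. The input shape and filter counts are illustrative.

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    # Conv -> BatchNorm -> Activation, as described above
    layers.Conv2D(32, kernel_size=3, padding='same', use_bias=False),
    layers.BatchNormalization(),   # mini-batch statistics in training, moving averages at inference
    layers.Activation('relu'),
    layers.Conv2D(64, kernel_size=3, padding='same', use_bias=False),
    layers.BatchNormalization(),
    layers.Activation('relu'),
])
model.summary()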
for all i ∈ {1, ..., H}, j ∈ {1, ..., W}, and k ∈ {1, ..., C}. The symbol "⊙" denotes the Hadamard (element-wise) product.
• Key Use Cases:
• Attention and Gating: Element-wise multiplication is central to attention mechanisms and feature
gating. One tensor acts as input features, while the other provides weights (attention maps or gates)
to selectively emphasize (values near 1) or suppress (values near 0) parts of the input.
• Modulation/Scaling: More generally, it can modulate or scale feature map activations.
X′ = X ⊙ A
Common operations for combining feature maps from multiple branches:
• Element-wise addition
• Element-wise multiplication
• Depth-wise concatenation
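A minimal Keras sketch (TensorFlow 2.x assumed) of these three combination operations, including a simple sigmoid gate for the attention/gating use case; the shapes and the 1x1 gating convolution are illustrative choices.

import tensorflow as tf
from tensorflow.keras import layers

x1 = tf.random.normal((1, 16, 16, 32))
x2 = tf.random.normal((1, 16, 16, 32))

added = layers.Add()([x1, x2])                         # element-wise addition
multiplied = layers.Multiply()([x1, x2])               # element-wise (Hadamard) product
concatenated = layers.Concatenate(axis=-1)([x1, x2])   # depth-wise concatenation -> 64 channels

# Simple gating: a sigmoid attention map A in [0, 1] scales the features, X' = X ⊙ A
attention = layers.Conv2D(32, kernel_size=1, activation='sigmoid')(x1)
gated = layers.Multiply()([x1, attention])

print(added.shape, multiplied.shape, concatenated.shape, gated.shape)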
y = Wx + b
where:
• x is the input vector (from the flattening layer).
• W is the weight matrix.
• b is the bias vector.
• y is the output vector.
• Purpose in CNNs: After feature extraction by convolutional layers, fully connected layers are
typically used for high-level reasoning and decision-making (e.g., classification).
• Number of Parameters: A fully connected layer with M input neurons and N output neurons
has M × N + N parameters (including biases).
Fully Connected Layers (Dense Layers)
• Computational Cost: The computational cost is proportional to M × N.
• Drawbacks:
• Fully connected layers are prone to overfitting due to the large number of parameters,
especially when the input vector is high-dimensional.
• Global Average Pooling is often used as a replacement for fully connected layers to mitigate
this issue.
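A short sketch (TensorFlow 2.x assumed) checking the M × N + N parameter count for a hypothetical layer with M = 512 inputs and N = 10 outputs.

import tensorflow as tf
from tensorflow.keras import layers, models

M, N = 512, 10
model = models.Sequential([
    layers.Input(shape=(M,)),
    layers.Dense(N),   # y = Wx + b
])

# Expected: M*N weights + N biases = 512*10 + 10 = 5130
print(model.count_params())  # 5130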
• The output layer produces the final prediction of the network, transforming the high-level features
learned by the preceding layers into a desired output format.
• The choice of activation function and the structure of the output layer depend on the task:
• Classification (Multi-class):
• Structure: Typically a fully connected layer with a number of neurons equal to the number of
classes.
• Activation: Softmax is used to produce a probability distribution over the classes:
Softmax(z)_i = e^{z_i} / Σ_{j=1}^{K} e^{z_j}
where z is the vector of logits (raw outputs of the fully connected layer), z_i is the logit for class i, and K is the number of classes.
• Output: A vector of probabilities, where each element represents the probability of the input
belonging to a specific class. The sum of these probabilities is 1.
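A small NumPy check of the softmax formula on hypothetical logits; subtracting the maximum logit is a standard numerical-stability trick that leaves the result unchanged.

import numpy as np

def softmax(z):
    # Subtract the max logit for numerical stability (the result is unchanged)
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # hypothetical raw outputs for K = 3 classes
probs = softmax(logits)
print(probs)         # approximately [0.659 0.242 0.099]
print(probs.sum())   # 1.0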
• Binary Classification:
• Structure: A single neuron.
• Activation: Sigmoid is often used to output a probability between 0 and 1:
Sigmoid(x) = 1 / (1 + e^{−x})
• Output: A single value representing the probability of the input belonging to the positive class.
• Regression:
• Structure: Typically a fully connected layer with a number of neurons equal to the number of
output variables.
• Activation: Linear activation (identity function) is commonly used:
f(x) = x
Sometimes other activations are used, depending on the range of the target variable.
• Output: A vector of continuous values representing the predicted output variables.
Other Output Types: Other output types exist for tasks like object detection (bounding boxes, class
probabilities), semantic segmentation (pixel-wise classification), etc., and they use specialized structures
and loss functions.
Loss Functions: The output layer is closely tied to the loss function used to train the network.
Common loss functions include:
• Cross-Entropy Loss (Classification): Measures the difference between the predicted probability
distribution and the true distribution.
• Mean Squared Error (MSE) (Regression): Measures the average squared difference between
the predicted and true values.
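A minimal Keras sketch (TensorFlow 2.x assumed) pairing each output-layer design above with its usual loss function; the 64-dimensional input and the hidden layer are placeholders for a real feature extractor.

import tensorflow as tf
from tensorflow.keras import layers, models

def head(units, activation, loss):
    # Placeholder backbone: one hidden layer, then the task-specific output layer
    model = models.Sequential([
        layers.Input(shape=(64,)),
        layers.Dense(128, activation='relu'),
        layers.Dense(units, activation=activation),
    ])
    model.compile(optimizer='adam', loss=loss)
    return model

multiclass = head(10, 'softmax', 'categorical_crossentropy')  # K = 10 classes, one-hot labels
binary     = head(1, 'sigmoid', 'binary_crossentropy')        # positive-class probability
regression = head(3, 'linear', 'mse')                         # 3 continuous outputs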
• Core Building Block of CNNs: Convolutional layers are fundamental to Convolutional Neural
Networks (CNNs), specialized for processing data with a grid-like topology, such as images, videos,
and time-series data.
• Local Receptive Fields: Unlike fully connected layers where each neuron is connected to all
neurons in the previous layer, convolutional layers exploit the spatial (or temporal) structure of the
input by connecting each neuron only to a local region of the input. This local region is called the
receptive field.
• Filters (Kernels): Convolutional layers use learnable filters (also called kernels) to detect local
patterns. A filter is a small matrix of weights that slides (convolves) across the input.
• Each filter learns to detect a specific feature (e.g., edges, corners, textures).
• Multiple filters are used in a convolutional layer to learn multiple features.
• In practice, the filter has finite length, so the summation limits are adjusted accordingly. If the filter has length K, the practical convolution is:
(x ∗ w)[n] = Σ_{m=0}^{K−1} x[n − m] w[m]
• Example: Let x = [1, 2, 3, 4] and w = [0.5, 0.5]. Then the second output sample (1-based indexing) is (x ∗ w)[2] = (1 × 0.5) + (2 × 0.5) = 1.5
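The same example checked with NumPy; np.convolve computes the full convolution, and with 0-based indexing the value 1.5 from the worked example appears at index 1.

import numpy as np

x = np.array([1, 2, 3, 4])
w = np.array([0.5, 0.5])

y = np.convolve(x, w)   # full convolution
print(y)                # [0.5 1.5 2.5 3.5 2. ]
print(y[1])             # 1.5, matching the worked example above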
• Convolutional layers have several key hyperparameters that control their behavior and
computational cost.
• Understanding these parameters is crucial for designing effective CNN architectures.
• We will discuss the following:
• Filter/Kernel Size
• Stride
• Dilation Rate
• Padding
• Let:
• I : Input size (height/width).
• K : Kernel size.
• P: Padding.
• S: Stride.
• D: Dilation rate.
• The output size O is calculated as:
O = ⌊(I + 2P − D(K − 1) − 1) / S⌋ + 1
• Example: I = 7, K = 3, P = 1, S = 1, D = 1: O = ⌊(7 + 2(1) − 1(3 − 1) − 1)/1⌋ + 1 = 6 + 1 = 7
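A small helper reproducing the formula above (integer floor division), checked against the example; the argument names mirror the symbols I, K, P, S, D.

def conv_output_size(i, k, p=0, s=1, d=1):
    """Spatial output size of a convolution: floor((I + 2P - D(K-1) - 1) / S) + 1."""
    return (i + 2 * p - d * (k - 1) - 1) // s + 1

print(conv_output_size(7, 3, p=1, s=1, d=1))  # 7, the example above
print(conv_output_size(7, 3, s=2))            # 3, no padding, stride 2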
• Let:
• K : Kernel size (assuming square kernel).
• Cin : Number of input channels.
• Cout : Number of output channels (number of filters).
• The number of trainable parameters in a convolutional layer is:
Params = K × K × Cin × Cout + Cout
• The + Cout term accounts for the bias of each filter.
• Input: H × W × Cin
• Kernel: K × K × Cin
• Number of filters: Cout
• Output: H ′ × W ′ × Cout
• Number of parameters: K × K × Cin × Cout (plus Cout biases)
• FLOPs: H ′ × W ′ × K × K × Cin × Cout
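A small helper for the counts above; here "FLOPs" counts multiply-accumulate operations (one per weight per output position), and biases are omitted from the FLOP count as in the summary above.

def conv2d_cost(h_out, w_out, k, c_in, c_out, bias=True):
    """Parameter and multiply-accumulate counts for a KxK convolutional layer."""
    params = k * k * c_in * c_out + (c_out if bias else 0)
    macs = h_out * w_out * k * k * c_in * c_out
    return params, macs

# Example: 224x224 output, 3x3 kernel, 64 -> 128 channels
params, macs = conv2d_cost(224, 224, 3, 64, 128)
print(params)  # 73856 (3*3*64*128 + 128)
print(macs)    # about 3.7e9 multiply-accumulates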
Key concepts:
• Convolutional Layers: Local receptive fields, feature extraction.
• Subsampling (Pooling): Reducing spatial resolution, increasing robustness to small shifts and distortions.
• Parameter Sharing: Reducing the number of parameters and improving generalization.
• Hierarchical Feature Learning: Lower layers detect simple features (edges, lines), higher layers detect more
complex features (combinations of edges, shapes).
Impacts:
• LeNet-5 laid the foundation for modern CNN architectures.
• Its key concepts are still used in many state-of-the-art models.
• It demonstrated the power of CNNs for image recognition and other tasks involving structured data.
Key Innovations (AlexNet):
• ReLU Activations: Accelerated training by mitigating vanishing gradients.
• GPU Training: Enabled training of larger models on larger datasets.
• Local Response Normalization (LRN): Local channel normalization (minor impact).
• Overlapping Pooling: Reduced overfitting.
• Data Augmentation: Improved generalization by increasing training data diversity.
Impact:
• Deep Learning Resurgence in CV: Sparked renewed interest and rapid progress in deep learning for
computer vision.
• Foundation for Modern CNNs: Influenced many subsequent CNN architectures.
• Influence on Other Fields: Impacted other areas of deep learning like NLP and speech recognition.
• Deeper Network: Stacking multiple 3x3 convolutions allows for a deeper network, which
can learn more complex features.
• Reduced Number of Parameters: Two stacked 3x3 convolutions have the same
receptive field as one 5x5 convolution but with fewer weights (counted per input
channel, ignoring biases; a channel-aware comparison follows this list):
• One 5x5: 5 × 5 = 25 weights
• Two 3x3: (3 × 3) + (3 × 3) = 18 weights
• More Non-linearities: Stacking more layers increases the number of non-linear
activations (ReLU), which makes the network more expressive.
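The same comparison with channels included, as a small sketch assuming C input and C output channels and ignoring biases; the ratio stays at 18/25, i.e. about 28% fewer weights.

C = 64  # hypothetical channel count (inputs = outputs)

one_5x5 = 5 * 5 * C * C          # 25 * C^2 weights for a single 5x5 layer
two_3x3 = 2 * (3 * 3 * C * C)    # 18 * C^2 weights for two stacked 3x3 layers

print(one_5x5, two_3x3)          # 102400 73728
print(two_3x3 / one_5x5)         # 0.72 -> about 28% fewer weights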
• Emphasis on Depth: Solidified the importance of network depth for achieving high
performance.
• Simple and Effective Design: The uniform architecture with small filters made VGG
networks easy to understand and implement.
• Transfer Learning: VGG models pretrained on ImageNet became widely used for transfer
learning in various computer vision tasks.
• ILSVRC 2014 Winner: GoogLeNet, developed by Google, won the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC) 2014, achieving a significant improvement over previous
architectures (including VGG).
• Key Innovation: Inception Module: Introduced the Inception module, a novel building block
that significantly improved efficiency and performance.
• Depth and Efficiency: Achieved greater depth than previous networks while maintaining
manageable computational cost.
• Reduced Parameters: Significantly fewer parameters than AlexNet, making it more efficient and
less prone to overfitting.
• Increased Depth and Width: The Inception module allows for increasing both the
depth and width of the network without a significant increase in computational cost.
• Computational Efficiency: Using 1x1 convolutions for dimensionality reduction
significantly reduces the number of parameters and FLOPs (see the sketch after this list).
• Improved Performance: Achieved state-of-the-art performance on ImageNet with
significantly fewer parameters than previous models.
• Reduced Overfitting: The reduced number of parameters and the use of auxiliary
classifiers helped to reduce overfitting.
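A minimal Keras sketch (TensorFlow 2.x assumed) of the 1x1 dimensionality-reduction idea behind the Inception module: reducing 256 channels to 64 before a 3x3 convolution cuts the weight count by roughly 3.6x. The channel sizes here are illustrative, not GoogLeNet's actual configuration.

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 28, 28, 256))   # hypothetical input feature map

# Direct 3x3 convolution: 3*3*256*256 = 589,824 weights
direct = layers.Conv2D(256, 3, padding='same')(x)

# 1x1 bottleneck (256 -> 64), then 3x3 (64 -> 256):
# 1*1*256*64 + 3*3*64*256 = 16,384 + 147,456 = 163,840 weights
reduced = layers.Conv2D(64, 1, padding='same')(x)
bottleneck = layers.Conv2D(256, 3, padding='same')(reduced)

print(direct.shape, bottleneck.shape)    # both (1, 28, 28, 256)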
• Challenge of Deep Networks: Training very deep neural networks was a major challenge due to
the vanishing gradient problem.
• Key Innovation: Residual Connections (Skip Connections): ResNet, introduced by He et al.,
addressed this problem with the concept of residual connections (also known as skip connections
or shortcuts).
• ILSVRC 2015 Winner: Achieved state-of-the-art results on ImageNet in 2015, surpassing
human-level performance on the classification task.
• Training Very Deep Networks: Enabled the training of significantly deeper networks
than previously possible.
• Improved Performance: Achieved state-of-the-art results on various computer vision
tasks.
• Foundation for Future Architectures: The concept of residual connections has become
a fundamental building block in many subsequent CNN architectures.
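A minimal Keras sketch (TensorFlow 2.x assumed) of a basic residual block, output = F(x) + x, with the skip connection implemented via an Add layer; the filter counts and block layout are illustrative, not the exact ResNet configuration.

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Basic residual block: two 3x3 convolutions plus an identity skip connection."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding='same', use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters, 3, padding='same', use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])      # skip connection: output = F(x) + x
    return layers.Activation('relu')(y)

x = tf.random.normal((1, 32, 32, 64))
print(residual_block(x, 64).shape)       # (1, 32, 32, 64)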
• Benefit: Improves feature discrimination by emphasizing important channels and suppressing less
important ones.
• EfficientNet: Focuses on compound scaling of network width, depth, and resolution using a
principled approach.
• RegNet: Explores network design spaces by sampling and analyzing populations of models to find simple, regular, high-performing architectures.
• Vision Transformers (ViT): Applies the Transformer architecture from NLP to image
classification, treating images as sequences of patches.
• ConvNeXt: A modern take on the classical ConvNet design inspired by the Transformer
architecture, showing the strong potential of carefully designed ConvNets.
# train_datagen, validation_datagen, IMG_SIZE, train_dir and validation_dir are defined earlier.
try:
    # Attempt to create data generators from the image directories
    train_generator = train_datagen.flow_from_directory(
        train_dir,
        target_size=IMG_SIZE,
        batch_size=32,
        class_mode='categorical'  # or 'binary' if you have two classes
    )
    validation_generator = validation_datagen.flow_from_directory(
        validation_dir,
        target_size=IMG_SIZE,
        batch_size=32,
        class_mode='categorical'  # or 'binary' if you have two classes
    )
except OSError as e:
    print(f"Error creating data generators: {e}")
    raise  # Re-raise to stop execution on data generator errors
# acc, val_acc, loss, val_loss and epochs are taken from the History object
# returned by model.fit earlier in the example.
epochs_range = range(epochs)

plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')

plt.show()