This paper presents a study on detecting Deepfakes using adaptable convolutional neural network (CNN) models, specifically evaluating DenseNet121, ResNet18, SqueezeNet, and VGG11 on the OpenForensics dataset. The VGG11 model achieved the highest accuracy of 94.46% for Deepfake detection, with Grad-CAM techniques employed to visualize the decision-making process of the models. The findings highlight the effectiveness of different CNN architectures in identifying manipulated images and provide insights for selecting appropriate models for specific applications.


2023 5th International Conference on Computer Communication and the Internet

In-The-Wild Deepfake Detection using Adaptable CNN Models with Visual Class Activation Mapping for Improved Accuracy
2023 5th International Conference on Computer Communication and the Internet (ICCCI) | 979-8-3503-2695-6/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICCCI59363.2023.10210096

Muhammad Salihin Saealal
Faculty of Electric and Electronic Engineering Technology
Universiti Teknikal Malaysia Melaka
Melaka, Malaysia
[email protected]

Mohd Zamri Ibrahim*
Faculty of Electrical and Electronics Engineering Technology
Universiti Malaysia Pahang
Pekan, Pahang, Malaysia
[email protected]

Mohd Ibrahim Shapiai
Centre for Artificial Intelligence and Robotics
Malaysia-Japan International Institute of Technology, UTM
Kuala Lumpur, Malaysia
[email protected]

Norasyikin Fadilah
Faculty of Electrical and Electronics Engineering Technology
Universiti Malaysia Pahang
Pekan, Pahang, Malaysia
[email protected]

Abstract— Deepfake technology has become increasingly sophisticated in recent years, making the detection of fake images and videos challenging. This paper investigates the performance of adaptable convolutional neural network (CNN) models for detecting Deepfakes. The in-the-wild OpenForensics dataset was used to evaluate four different CNN models (DenseNet121, ResNet18, SqueezeNet, and VGG11) at different batch sizes and with various performance metrics. Results show that the adapted VGG11 model with a batch size of 32 achieved the highest accuracy of 94.46% in detecting Deepfakes, outperforming the other models, with DenseNet121 as the second-best performer, achieving an accuracy of 93.89% with the same batch size. Grad-CAM techniques are utilized to visualize the decision-making process within the models, aiding in understanding the Deepfake classification process. These findings provide valuable insights into the performance of different deep learning models and can guide the selection of an appropriate model for a specific application.

Keywords—deepfake, deep learning, convolutional neural network, batch size, Grad-CAM visualization.

I. INTRODUCTION

The proliferation of Deepfakes, which are artificially generated images or videos that appear genuine, is a significant concern in today's world. These Deepfakes can be used to spread false information, manipulate public opinion, and even harm individuals or institutions. While there have been several efforts to identify fake images and videos, less attention has been paid to detecting manipulated faces in-the-wild. As a result, there is a growing need for robust tamper detection methods.

Deep learning algorithms have facilitated the creation and distribution of Deepfakes, increasing the risk of security breaches. The detection of Deepfakes is crucial to minimize their potential impact, and researchers have proposed various methods, including CNNs with temporal analysis. However, current techniques have limitations in accuracy and effectiveness. To improve Deepfake detection, new architectures and techniques are being explored. Deepfakes pose a threat to public safety, national security, and democratic processes, highlighting the need for robust tamper detection methods with enhanced accuracy and effectiveness.

Several Deepfake detection approaches have been developed to address these issues [1]–[3]. In general, methods are divided into two types [4]: 1) detection based on visual artifacts within the video image, which examines frames one by one rather than focusing on temporal consistency and identifies abnormal features that may arise during Deepfake synthesis (e.g., teeth appear as a single white blob instead of individual teeth), and 2) detection based on temporal features captured across video frames.

It has been shown that the type of alteration to which an image has been subjected leaves a trail of evidence. Forensic algorithms have been developed that extract attributes associated with these traces and use them to detect targeted image alterations. Resizing and resampling, median filtering, contrast enhancement, multiple JPEG compression, etc., can be detected using this approach. There are numerous approaches to extracting these features, the most common of which employs a convolutional neural network (CNN) as a feature extractor [5].

Guo et al. proposed the adaptive manipulation traces extraction network (AMTEN), which uses a CNN and a fully connected network (FCN) for feature extraction and categorization to learn manipulation traces on facial photos [6]. Saealal et al. used a similar strategy but collected different feature information by using the eye blink state within temporal video images. They employed the VGG16 network to extract spatial features and used a Long Short-Term Memory (LSTM) network to retrieve temporal characteristics [7]. Furthermore, they expanded their investigation by combining CNNs with temporal analysis to detect Deepfakes using 3-D CNN models [8].

To address the vanishing gradient problem [9], [10], He et al. introduced the ResNet architecture, which employs residual learning [11]. ResNet uses a residual block, consisting of two or more convolutional layers and a shortcut connection,

to directly learn the residual mapping between the input and
output of the block.
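
The shortcut connection can be illustrated with a minimal, framework-free sketch. It assumes a block computes y = F(x) + x, as in the ResNet formulation [11]. The function name `residual_block` and the two toy residual functions are hypothetical stand-ins for stacked convolutional layers; they are not the models trained in this paper.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, residual_fn: Callable[[Vector], Vector]) -> Vector:
    """Return F(x) + x, so the stacked layers F only have to model the residual."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("the shortcut needs F(x) and x to have the same size")
    return [f + xi for f, xi in zip(fx, x)]


# Hypothetical stand-ins for the stacked convolutional layers F.
def zero_residual(x: Vector) -> Vector:
    return [0.0 for _ in x]


def half_residual(x: Vector) -> Vector:
    return [0.5 * v for v in x]


x = [1.0, -2.0, 4.0]
identity_out = residual_block(x, zero_residual)  # block reduces to the identity
shifted_out = residual_block(x, half_residual)   # input plus a correction term
print(identity_out, shifted_out)
```

When F outputs zeros, the block passes its input through unchanged, which is one way to see why stacking such blocks need not degrade a shallower network.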
Huang et al. introduced the Dense Convolutional Network (DenseNet), which establishes shorter connections between layers to address challenges faced by traditional convolutional networks, such as vanishing gradients and weak feature propagation and reuse [12]. SqueezeNet, designed by Iandola et al., is a CNN architecture that maintains high accuracy while minimizing the number of parameters in the model by replacing 3x3 filters with 1x1 filters and using "squeeze layers" to reduce the number of input channels to the 3x3 filters. They also delayed downsampling until later in the network to produce larger activation maps [13].

GradCAM++ is an advanced version of the Grad-CAM method, introduced by Aditya Chattopadhyay et al. in 2018. It is a technique for visualizing the regions of an input image that contribute most strongly to the output of a convolutional neural network. GradCAM++ uses a weighted combination of the first- and second-order gradients of the output with respect to the feature maps, resulting in a more fine-grained heatmap visualization. It has been used in object localization, saliency detection, and feature attribution, providing valuable insights into the decision-making process of deep neural networks [14].

This paper investigates four deep learning models employing convolutional neural networks (CNN) for Deepfake image identification by learning spatial features from the OpenForensics dataset. The performance of these models is assessed under different batch sizes, and a detailed description of the method is provided, along with an analysis of each model's results. Key contributions of this paper are as follows:

• Evaluation of various CNN architectures that extract spatial features from in-the-wild images to effectively detect Deepfake images.
• An extensive study of each model's performance with three distinct batch sizes and the implementation of an early stopping technique.
• Insights into the class activation heatmap of the output model, shedding light on the impact of salient features on the models' overall performance.

II. METHODOLOGY

A. Deepfake Detection System

The proposed Deepfake detection system comprises four primary steps, illustrated in Fig. 1. Firstly, the system crops the face region from the images in the OpenForensics database. Secondly, features are extracted from the cropped images using four deep learning models: DenseNet121, ResNet18, SqueezeNet, and VGG11. These models are initialized with pre-trained weights from the ImageNet database, which contains millions of labelled images from various categories, to expedite the training process. Thirdly, the extracted features are used as input to the classification model. The binary classification model is trained on the OpenForensics dataset to update all model parameters, including the CNN layers. Finally, the model's performance is evaluated using four standard performance metrics: accuracy, precision, recall, and F1-score.

In addition, visual mapping techniques such as GradCAM++ and GradCAMElementWise are utilized to evaluate each model's performance by visualizing the regions of the image that are significant for the classification decision. This allows for a better understanding of how each model makes its decisions and identifies suspected Deepfake areas.

Fig. 1. Proposed system architecture for Deepfake detection

B. Deep Learning Model

DenseNet121 is a neural network architecture that features dense connectivity patterns, enabling improved feature extraction and gradient flow. Its building blocks, Dense Blocks, consist of multiple layers connected to each other in a feedforward manner, allowing for the reuse of features and easier network training. The network includes Bottleneck Layers that use 1x1 convolution followed by 3x3 convolution to reduce feature maps and improve efficiency. It also has Transition Layers that combine 1x1 convolution and average pooling to maintain the number of feature maps while reducing spatial dimensions, leading to a more compact network. The architecture is illustrated in Fig. 2.

Fig. 2. DenseNet model

The output of layer $l$ in a DenseNet is computed from the concatenation of the feature maps produced by all preceding layers:

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}]) \tag{1}$$

where $x_l$ is the output of layer $l$, $H_l$ is the composite function applied at layer $l$ (a sequence of operations such as batch normalization, ReLU, and convolution [12]), and $[x_0, x_1, \ldots, x_{l-1}]$ represents the concatenation of the feature maps produced by all preceding layers.
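
The dense connectivity in Eq. (1) can be sketched without any deep learning framework. The example below is a toy illustration under stated assumptions: feature maps are flat lists of numbers, and `toy_layer` is a hypothetical stand-in for the composite function H_l that emits a single new channel. It only shows how the input to each layer grows by concatenation and is not the DenseNet121 implementation used in the experiments.

```python
from typing import Callable, List

FeatureMap = List[float]


def dense_block(x0: FeatureMap,
                layer_fns: List[Callable[[FeatureMap], FeatureMap]]) -> List[FeatureMap]:
    """Apply Eq. (1): layer l receives the concatenation [x_0, x_1, ..., x_{l-1}]."""
    features = [x0]
    for h in layer_fns:
        # Concatenate every earlier output along the channel axis.
        concatenated = [v for fmap in features for v in fmap]
        features.append(h(concatenated))
    return features


def toy_layer(z: FeatureMap) -> FeatureMap:
    """Hypothetical H_l: produces one new channel from all inputs."""
    return [sum(z)]


outputs = dense_block([1.0, 2.0], [toy_layer, toy_layer, toy_layer])
# Number of input channels seen by layers 1, 2 and 3.
input_widths = [sum(len(f) for f in outputs[:l]) for l in range(1, len(outputs))]
print(outputs, input_widths)
```

With two input channels and one new channel per layer, the layer inputs have 2, 3 and 4 channels. This linear growth is the reason DenseNet inserts bottleneck and transition layers to keep the channel count manageable.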
ResNet was proposed as a novel technique to overcome the difficulty of training deep neural networks. By reformulating the network layers to learn residual functions with reference to the layer inputs, it became easier to train deeper networks with greater accuracy. While increasing the depth of a neural network typically improves performance, training such a network is challenging, and merely stacking layers to learn the desired mapping often leads to accuracy degradation and poorer convergence. Residual learning addresses these issues by using stacked layers to learn a residual mapping and has been shown to be effective in achieving superior performance. In this paper, ResNet18 is employed, which comprises 18 layers consisting of convolutional, pooling, and fully connected layers.

SqueezeNet is a convolutional neural network architecture that employs a unique design approach to achieve high accuracy in image classification tasks while using significantly fewer parameters compared to traditional networks. The architecture comprises convolutional and pooling layers, followed by a fully connected layer for classification. The convolutional layers utilize 1x1 filters to reduce input channels and 3x3 filters to capture spatial information. Moreover, SqueezeNet employs "fire modules," which are made up of a squeeze layer and an expand layer, to enhance network non-linearity.

The VGG model is a convolutional neural network architecture designed for image classification tasks. To detect Deepfake images, the VGG model can be utilized as a feature extractor, and the extracted features can be used as input to the classifier. VGG11, illustrated in Fig. 3, has 11 layers, including eight convolutional layers and three fully connected layers. Each convolutional layer uses 3x3 filters with a stride of 1 and padding of 1, enabling the model to learn a hierarchical representation of the input image. The use of small filters, along with the network's depth, significantly improves its accuracy. The output of each convolutional layer passes through a rectified linear unit (ReLU) activation function, which enhances the model's ability to learn non-linear features.

After feature extraction, the extracted features are fed into the fully connected layers of the VGG11 model. The final three fully connected layers perform the classification task, outputting a vector of probabilities indicating the likelihood of the input image belonging to each pre-defined class.

Fig. 3. VGG11 Model

C. Performance Evaluations

In classification models, performance is assessed using metrics such as accuracy, precision, recall, and F1-score. Accuracy represents the overall correctness of predictions, while precision measures the model's ability to avoid false positives. Recall evaluates the model's ability to identify positive instances correctly. The F1-score, the harmonic mean of precision and recall, provides a balanced overall measure of model performance that gives more weight to the lower of the two values.

D. Visual Class Activation Mapping

The Grad-CAM technique provides interpretability to deep neural network predictions by generating visual explanations. It involves computing the gradient of the class score with respect to the feature maps of the final convolutional layer and generating a class activation map highlighting the relevant regions of the input image. This method is applicable to any CNN architecture: the channel weights are determined through global average pooling of the gradient values, and the class activation map is post-processed with normalization and thresholding.

The critical equation in Grad-CAM is the computation of the class activation map, which is given by:

$$L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left(\sum_k \alpha^c_k A^k\right) \tag{2}$$

$$\alpha^c_k = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{i,j}} \tag{3}$$

$$Z = \sum_k \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{i,j}} \tag{4}$$

where $L^c_{\text{Grad-CAM}}$ is the class activation map for the target class $c$, $A^k$ is the feature map for the $k$-th channel of the final convolutional layer, $\alpha^c_k$ is the weight assigned to the $k$-th channel for the target class $c$, and ReLU is the rectified linear unit function. The weights $\alpha^c_k$ are computed by taking the global average pooling of the gradient values for each feature map, and $y^c$ is the output score for the target class $c$.
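
The weighting and ReLU steps of Eqs. (2) and (3) can be traced on a tiny hand-made example. The sketch below is illustrative only: the feature maps and gradients are invented numbers rather than values obtained by backpropagation through a trained network, and the helper names `grad_cam` and `normalize` are hypothetical. It also assumes that Z is the number of spatial positions in a feature map, which is what global average pooling implies and what the original Grad-CAM formulation uses.

```python
from typing import List, Tuple

Map2D = List[List[float]]


def grad_cam(feature_maps: List[Map2D],
             gradients: List[Map2D]) -> Tuple[List[float], Map2D]:
    """feature_maps[k][i][j] is A^k_ij; gradients[k][i][j] is dy^c / dA^k_ij."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    z = height * width  # assumption: Z counts the spatial positions
    # Eq. (3): one weight per channel, the spatial mean of its gradients.
    alphas = [sum(sum(row) for row in grad) / z for grad in gradients]
    # Eq. (2): ReLU of the alpha-weighted sum of the feature maps.
    cam = [
        [
            max(0.0, sum(a * fmap[i][j] for a, fmap in zip(alphas, feature_maps)))
            for j in range(width)
        ]
        for i in range(height)
    ]
    return alphas, cam


def normalize(cam: Map2D) -> Map2D:
    """Scale the map to [0, 1] so it can be rendered as a heatmap."""
    peak = max(max(row) for row in cam)
    return [[v / peak if peak > 0 else 0.0 for v in row] for row in cam]


# Invented 2-channel, 2x2 example (not taken from the experiments).
A = [[[1.0, 2.0], [3.0, 4.0]],
     [[4.0, 3.0], [2.0, 1.0]]]
dY_dA = [[[0.5, 1.5], [1.0, 1.0]],
         [[-1.0, -3.0], [-2.0, -2.0]]]

alphas, cam = grad_cam(A, dY_dA)
heatmap = normalize(cam)
print(alphas, cam, heatmap)
```

Here the second channel has a negative weight, so positions where it dominates are clipped to zero by the ReLU and only the bottom-right position stays active in the heatmap. In a real pipeline the gradients would come from automatic differentiation of the class score.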

GradCAM++ and GradCAMElementWise are both extensions of Grad-CAM, a visualization technique that highlights important regions of an input image for a neural network's prediction. GradCAM++ uses a weighted combination of gradients and second-order derivatives for more fine-grained attention maps, useful in object detection and localization. GradCAMElementWise differs in how it computes the class activation map, using element-wise multiplication for potentially more accurate localization of small objects in the input image.

III. RESULTS AND DISCUSSION

A. Dataset

This paper uses the OpenForensics dataset [15] to compare model performance. This dataset contains 115,325 images with 334,136 human faces of various sizes and resolutions. The dataset includes several types of forgeries, such as color manipulation, edge manipulation, block-wise distortion, image corruption, convolution mask transformation, and external effects to simulate context in real-world scenes. For the experiments, the dataset was processed to create 70,000 real human faces and 70,000 fake samples for training and validation with a random 90-10 split. Additionally, 5,500 real human faces and 5,500 fake samples were generated for testing.

B. Experiment Setup

PyTorch v1.12 was chosen as the programming framework, and the models were trained on an NVIDIA RTX3090 GPU under the Ubuntu 20.04 operating system. Additionally, Scikit-learn was employed as a data analysis tool to evaluate the performance of the models. The models were trained and tested with different batch sizes (32, 64, and 128) and a learning rate of 0.0001. The Adam optimizer was used to update the parameters of the models during the training process. The early stopping method was employed to prevent overfitting, stopping the training process when the model performance ceased to improve on the validation dataset, with a patience value of 4. Model performance was evaluated using accuracy, precision, recall, and F1-score metrics. Additionally, Grad-CAM techniques were applied to visualize the features learned by the model and aid in understanding the Deepfake classification process.

C. Overall Performance with Early Stopping Implementation

Fig. 4 provides a detailed comparison of the performance of the four models (DenseNet121, ResNet18, SqueezeNet, and VGG11) in terms of their accuracy, the number of epochs required to reach maximum accuracy, and the time taken to train the model. The models were trained using different batch sizes with the early stopping technique.
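
The early stopping rule with a patience of 4 can be sketched as follows. The paper does not state which validation quantity was monitored, so this sketch assumes validation loss (lower is better). The function name `early_stopping_epoch` and the loss values are hypothetical and serve only to show the mechanism.

```python
from typing import List, Tuple


def early_stopping_epoch(val_losses: List[float], patience: int = 4) -> Tuple[int, int]:
    """Return (epoch at which training stops, epoch of the best validation loss)."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran out of epochs before the patience budget was used up.
    return len(val_losses) - 1, best_epoch


# Invented validation losses: the best value is at epoch 2, then four epochs
# pass without improvement, so training stops at epoch 6.
losses = [0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.44, 0.39]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=4)
print(stop_epoch, best_epoch)
```

Note that the last value in the list would have been a new best, but it is never reached. This illustrates why the number of epochs reported for each model depends on the patience setting.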

Fig. 4. The training time, epoch and accuracy of different models on the test dataset using different batch sizes: (a) DenseNet121, (b) ResNet18, (c) SqueezeNet, (d) VGG11.

One noteworthy finding from the figure is that DenseNet121 achieved higher accuracy than the ResNet18 and SqueezeNet models across all batch sizes, and VGG11 did so at batch sizes of 32 and 64 (see Table I), indicating that these two models may be more suitable for applications that prioritize high accuracy. However, ResNet18 required the fewest epochs to reach its maximum accuracy score, which could be beneficial in time-critical applications. DenseNet121 and VGG11 required more epochs, suggesting they may need longer training times, but may also be more accurate.

Early stopping techniques were used, which may have impacted the number of epochs required to reach maximum accuracy scores. The time taken to train the models varied significantly, with DenseNet121 taking the longest and ResNet18 taking the shortest time. Therefore, the choice of model may depend on available computational resources and desired training speed. These findings demonstrate the importance of considering both accuracy and training time when selecting a deep learning model for a specific task.

D. Performance Metrics on Model Architecture using Different Batch Sizes

Table I shows the performance metrics of the four models investigated at various batch sizes, together with their parameter sizes. For every model, the largest batch size (128) gave lower accuracy and recall but higher precision than the smallest (32), indicating the trade-offs between different evaluation metrics. VGG11 achieved the highest accuracy, recall, and F1-score at batch sizes of 32 and 64, while ResNet18 had the highest precision at batch size 128. In the context of Deepfake classification, a high precision means that the model is good at avoiding false positives, i.e., cases where the model predicts a Deepfake, but the sample is actually genuine. This is important to prevent false accusations and unnecessary investigations. A high recall means that the model is good at detecting all the Deepfakes in the dataset, which is crucial for the model's effectiveness in real-world scenarios.

TABLE I. PERFORMANCE METRICS OF FOUR DIFFERENT MODELS INVESTIGATED AT DIFFERENT BATCH SIZES AND MODEL PARAMETER SIZES

Model architecture   Model size   Batch size   Accuracy   Precision   Recall   F1-score
DenseNet121          7M           32           0.9389     0.9567      0.9247   0.9404
DenseNet121          7M           64           0.9124     0.9377      0.8936   0.9151
DenseNet121          7M           128          0.9196     0.9629      0.8871   0.9234
ResNet18             11.2M        32           0.9200     0.9663      0.8854   0.9241
ResNet18             11.2M        64           0.8959     0.9454      0.8615   0.9015
ResNet18             11.2M        128          0.8405     0.9774      0.7687   0.8606
SqueezeNet           736K         32           0.9185     0.9397      0.9024   0.9207
SqueezeNet           736K         64           0.9084     0.9663      0.8670   0.9140
SqueezeNet           736K         128          0.9042     0.9696      0.8585   0.9106
VGG11                128M         32           0.9446     0.9512      0.9396   0.9453
VGG11                128M         64           0.9298     0.9452      0.9178   0.9313
VGG11                128M         128          0.8947     0.9541      0.8540   0.9013

Therefore, precision and recall are essential in Deepfake detection to ensure that the model accurately identifies Deepfakes while minimizing the risk of false positives. The F1-score, which is a measure of the balance between precision and recall, indicates the ability of the model to achieve high precision and recall simultaneously.

Notably, a larger model parameter size did not always correspond to higher accuracy, as DenseNet121, which had a smaller parameter size than VGG11, performed similarly well in terms of accuracy across all batch sizes. This suggests that other factors, such as the architecture and the training process, may play a more critical role in determining model performance.

E. Deepfake Visualization

The four models investigated in this paper are often considered 'black boxes' because they are highly complex and challenging to interpret. To address this, heatmaps are commonly used to visualize the attention or activation of specific regions in an image that are most critical in making a prediction. The Grad-CAM techniques were used to gain insights into the decision-making process within the last convolutional layer of the VGG11 model and identify the most critical regions of an input image for making a prediction. The VGG11 model was selected because it had the highest performance metric, based on the results in Table I.

Fig. 5 displays the heatmaps generated using two extensions of the Grad-CAM technique, GradCAM++ and GradCAMElementWise, using pre-trained weights from the ImageNet dataset. It was observed that the heatmap emphasized the Deepfake area but lacked precision. In Fig. 6, the heatmap shifted focus to the center of the Deepfake area when the model trained on the OpenForensics dataset was used. GradCAM++ provided stable heatmaps across different batch sizes, while the GradCAMElementWise heatmap became narrower when the batch size increased.

It can be seen that when the batch size is increased, the heatmap can become more stable and robust because the model processes more input samples and captures a more representative set of activations in the last convolutional layer.

Fig. 5. (a) GradCAM++ and (b) GradCAMElementWise heatmap on Deepfake image (example 1) using pre-trained weights from the ImageNet dataset.

Fig. 7 presents another example, showing a tilted head wearing a cap. Based on the generated heatmap, it can be concluded that GradCAM++ with a batch size of 128 produced more stable and detailed saliency, localizing the area of the Deepfake.
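
As a cross-check on the metrics reported in Table I, the standard definitions can be written out directly. The sketch below assumes the usual confusion-matrix formulas with the Deepfake class treated as positive; the paper computed its metrics with Scikit-learn, so this is an illustration and not the authors' evaluation code. The confusion-matrix counts are invented, while the precision and recall pairs passed to `f1_from` are copied from Table I.

```python
from typing import Tuple


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float, float, float]:
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)   # how many predicted Deepfakes are real Deepfakes
    recall = tp / (tp + fn)      # how many Deepfakes were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Invented counts, for illustration only.
accuracy, precision, recall, f1 = classification_metrics(tp=90, fp=10, tn=80, fn=20)

# Recompute the F1-score for two rows of Table I from their precision and recall.
f1_densenet_bs32 = f1_from(0.9567, 0.9247)  # Table I reports 0.9404
f1_vgg11_bs32 = f1_from(0.9512, 0.9396)     # Table I reports 0.9453
print(round(f1_densenet_bs32, 4), round(f1_vgg11_bs32, 4))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model with very high precision but low recall, such as ResNet18 at a batch size of 128, still ends up with a comparatively low F1-score.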

Fig. 6. (a) GradCAM++ and (b) GradCAMElementWise heatmap on Deepfake image (example 1) at batch sizes of 32, 64, and 128, using trained data from the OpenForensics dataset.

Fig. 7. (a) GradCAM++ and (b) GradCAMElementWise heatmap on Deepfake image (example 2) at batch sizes of 32, 64, and 128, using trained data from the OpenForensics dataset.

IV. CONCLUSION

In conclusion, this study provides significant insights into the performance of various deep learning models based on CNN architecture, including the impact of batch size on their accuracy and training time. These findings can help guide the selection of an appropriate model not only for Deepfake scenarios but also for other applications. It is important to note that the results may have been influenced by early stopping techniques, particularly with respect to the number of epochs required for achieving maximum accuracy.

Future research directions may include exploring the use of ensemble models and examining other hyperparameters to optimize model performance. Overall, these directions have the potential to enhance the accuracy and interpretability of Deepfake detection models and make them more practical for real-world applications.

ACKNOWLEDGMENT

This research is financially supported by the Fundamental Research Grant Scheme (FRGS/1/2021/ICT07/UMP/02/1) with the RDU number RDU210136, which is awarded by the Ministry of Higher Education (MOHE) via the Research and Innovation Department, Universiti Malaysia Pahang (UMP), Malaysia.

REFERENCES

[1] Y. Li, M.-C. Chang, and S. Lyu, “In Ictu Oculi: Exposing AI Created Fake Videos by Detecting Eye Blinking,” in 2018 IEEE International Workshop on Information Forensics and Security (WIFS), Dec. 2018. doi: 10.1109/wifs.2018.8630787.
[2] F. Matern, C. Riess, and M. Stamminger, “Exploiting Visual Artifacts to Expose Deepfakes and Face Manipulations,” in 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), Jan. 2019. doi: 10.1109/wacvw.2019.00020.
[3] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, and P. Natarajan, “Recurrent Convolutional Strategies for Face Manipulation Detection in Videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Jun. 2019.
[4] T. T. Nguyen et al., “Deep Learning for Deepfakes Creation and Detection: A Survey,” SSRN Electronic Journal, 2022, doi: 10.2139/ssrn.4030341.
[5] P. Kumar, M. Vatsa, and R. Singh, “Detecting Face2Face Facial Reenactment in Videos,” in 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Mar. 2020. doi: 10.1109/wacv45572.2020.9093628.
[6] Z. Guo, G. Yang, J. Chen, and X. Sun, “Fake face detection via adaptive manipulation traces extraction network,” Computer Vision and Image Understanding, vol. 204, p. 103170, Mar. 2021, doi: 10.1016/j.cviu.2021.103170.
[7] M. S. Saealal, M. Z. Ibrahim, D. J. Mulvaney, M. I. Shapiai, and N. Fadilah, “Using cascade CNN-LSTM-FCNs to identify AI-altered video based on eye state sequence,” PLoS One, vol. 17, no. 12, p. e0278989, Dec. 2022, doi: 10.1371/journal.pone.0278989.
[8] M. S. Saealal, M. Z. Ibrahim, M. Yakno, and N. W. Arshad, “Three-Dimensional Convolutional Approaches for the Verification of Deepfake Videos: The Effect of Image Depth Size on Authentication Performance,” Journal of Advances in Information Technology, vol. 14, no. 3, pp. 488–494, 2023, doi: 10.12720/jait.14.3.488-494.
[9] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans Neural Netw, vol. 5, no. 2, pp. 157–166, Mar. 1994, doi: 10.1109/72.279181.
[10] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, 2010, vol. 9, pp. 249–256. [Online]. Available: http://proceedings.mlr.press/v9/glorot10a.html
[11] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016. doi: 10.1109/cvpr.2016.90.
[12] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely Connected Convolutional Networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017. doi: 10.1109/cvpr.2017.243.
[13] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and 1MB model size,” Computing Research Repository (CoRR), vol. abs/1602.07360, 2016, [Online]. Available: http://arxiv.org/abs/1602.07360
[14] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, “Grad-CAM++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Mar. 2018. doi: 10.1109/wacv.2018.00097.
[15] T.-N. Le, H. H. Nguyen, J. Yamagishi, and I. Echizen, “OpenForensics: Large-Scale Challenging Dataset For Multi-Face Forgery Detection And Segmentation In-The-Wild,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2021. doi: 10.1109/iccv48922.2021.00996.
