In-The-Wild Deepfake Detection using Adaptable CNN Models with Visual Class Activation Mapping for Improved Accuracy
Muhammad Salihin Saealal, Faculty of Electrical and Electronic Engineering Technology, Universiti Teknikal Malaysia Melaka, Melaka, Malaysia ([email protected])
Mohd Zamri Ibrahim*, Faculty of Electrical and Electronics Engineering Technology, Universiti Malaysia Pahang, Pekan, Pahang, Malaysia ([email protected])
Mohd Ibrahim Shapiai, Centre for Artificial Intelligence and Robotics, Malaysia-Japan International Institute of Technology, UTM, Kuala Lumpur, Malaysia ([email protected])
Norasyikin Fadilah, Faculty of Electrical and Electronics Engineering Technology, Universiti Malaysia Pahang, Pekan, Pahang, Malaysia ([email protected])
Abstract— Deepfake technology has become increasingly sophisticated in recent years, making the detection of fake images and videos challenging. This paper investigates the performance of adaptable convolutional neural network (CNN) models for detecting Deepfakes. The in-the-wild OpenForensics dataset was used to evaluate four different CNN models (DenseNet121, ResNet18, SqueezeNet, and VGG11) at different batch sizes and with various performance metrics. Results show that the adapted VGG11 model with a batch size of 32 achieved the highest accuracy of 94.46% in detecting Deepfakes, outperforming the other models, with DenseNet121 as the second-best performer, achieving an accuracy of 93.89% with the same batch size. Grad-CAM techniques are utilized to visualize the decision-making process within the models, aiding in understanding the Deepfake classification process. These findings provide valuable insights into the performance of different deep learning models and can guide the selection of an appropriate model for a specific application.

Keywords—deepfake, deep learning, convolution neural network, batch size, Grad-CAM visualization.

I. INTRODUCTION

The proliferation of Deepfakes, which are artificially generated images or videos that appear genuine, is a significant concern in today's world. These Deepfakes can be used to spread false information, manipulate public opinion, and even harm individuals or institutions. While there have been several efforts to identify fake images and videos, less attention has been paid to detecting manipulated faces in-the-wild. As a result, there is a growing need for robust tamper detection methods.

Deep learning algorithms have facilitated the creation and distribution of Deepfakes, increasing the risk of security breaches. The detection of Deepfakes is crucial to minimize their potential impact, and researchers have proposed various methods, including CNNs with temporal analysis. However, current techniques have limitations in accuracy and effectiveness. To improve Deepfake detection, new architectures and techniques are being explored. Deepfakes pose a threat to public safety, national security, and democratic processes, highlighting the need for robust tamper detection methods with enhanced accuracy and effectiveness.

Several Deepfake detection approaches have been developed to address these issues [1]–[3]. In general, methods are divided into two types [4]: 1) detection based on visual artifacts within the video image, which identifies abnormal features that may arise during Deepfake synthesis (e.g., teeth appearing as a single white blob instead of individual teeth), and 2) detection based on temporal features across video frames, exploiting the fact that Deepfakes are typically synthesized frame by frame without enforcing temporal consistency.

It has been shown that the type of alteration to which an image has been subjected leaves a trail of evidence. Forensic algorithms have been developed that extract attributes associated with these traces and use them to detect targeted image alterations. Alterations such as resizing and resampling, median filtering, contrast enhancement, and multiple JPEG compression can be detected in this way. There are numerous approaches to extracting these features, the most common of which employs a convolutional neural network (CNN) as a feature extractor [5].

Guo et al. proposed the adaptive manipulation traces extraction network (AMTEN), which uses a CNN and a fully connected network (FCN) for feature extraction and categorization to learn manipulation traces on facial photos [6]. Saealal et al. used a similar strategy but captured different feature information by exploiting the eye-blink state across temporal video frames. They employed the VGG16 network to extract spatial features and used a Long Short-Term Memory (LSTM) network to retrieve temporal characteristics [7]. Furthermore, they expanded their investigation by combining CNNs with temporal analysis to detect Deepfakes using 3-D CNN models [8].

To address the vanishing gradient problem, He et al. introduced the ResNet architecture, which employs residual learning [9], [10], [11]. ResNet uses a residual block, consisting of two or more convolutional layers and a shortcut connection that allows the block's input to bypass the convolutions and be added directly to their output.
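To make the shortcut idea concrete, the following is a minimal sketch of a residual block in PyTorch (the framework used later in this paper); the layer sizes and use of batch normalization are illustrative assumptions, not the exact configuration of [11]:

```python
# Minimal sketch of a ResNet-style residual block. The shortcut connection
# adds the block's input to its output, letting gradients bypass the
# convolutions and mitigating the vanishing gradient problem.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x                          # shortcut connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)      # residual addition
```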
The output of a layer in a DenseNet is calculated from the concatenation of the feature maps produced by all previous layers [12]:
$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$ (1)
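A minimal PyTorch sketch of this connectivity pattern follows; the growth rate and layer count are illustrative assumptions rather than DenseNet121's actual configuration:

```python
# Sketch of the DenseNet connectivity in Eq. (1): each layer H_l operates on
# the concatenation of all earlier feature maps [x_0, x_1, ..., x_{l-1}].
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_channels: int, growth_rate: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                # H_l: BN -> ReLU -> 3x3 conv, as in the DenseNet paper [12]
                nn.BatchNorm2d(in_channels + l * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + l * growth_rate, growth_rate,
                          kernel_size=3, padding=1),
            )
            for l in range(num_layers)
        ])

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        features = [x0]
        for layer in self.layers:
            xl = layer(torch.cat(features, dim=1))  # x_l = H_l([x_0, ..., x_{l-1}])
            features.append(xl)
        return torch.cat(features, dim=1)
```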
GradCAM++ and GradCAMElementWise are both extensions of Grad-CAM, a visualization technique that highlights the regions of an input image that are most important for a neural network's prediction. GradCAM++ uses a weighted combination of gradients and second-order derivatives to produce more fine-grained attention maps, which is useful in object detection and localization [14]. GradCAMElementWise differs in how it computes the class activation map, using element-wise multiplication for potentially more accurate localization of small objects in the input image.

III. RESULT AND DISCUSSION

A. Dataset

This paper uses the OpenForensics [15] dataset to compare model performance. The dataset contains 115,325 images with 334,136 human faces of various sizes and resolutions. It includes several types of forgeries, such as color manipulation, edge manipulation, block-wise distortion, image corruption, convolution mask transformation, and external effects that simulate the context of real-world scenes. For the experiments, the dataset was processed to create 70,000 real human faces and 70,000 fake samples for training and validation, with a random 90-10 split. Additionally, 5,500 real human faces and 5,500 fake samples were generated for testing.
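A minimal sketch of this preparation step follows; the placeholder dataset, image size, and random seed are illustrative assumptions, as the paper does not specify them:

```python
# Sketch of the 90-10 random training/validation split. The TensorDataset is
# a small stand-in for the 140,000 processed OpenForensics face crops
# (70,000 real and 70,000 fake).
import torch
from torch.utils.data import TensorDataset, random_split

full_dataset = TensorDataset(torch.randn(140, 3, 64, 64),      # placeholder images
                             torch.randint(0, 2, (140,)))      # 0 = real, 1 = fake

train_size = int(0.9 * len(full_dataset))        # 90% for training
val_size = len(full_dataset) - train_size        # 10% for validation
train_set, val_set = random_split(
    full_dataset, [train_size, val_size],
    generator=torch.Generator().manual_seed(42)) # assumed seed for reproducibility
```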
B. Experiment Setup

PyTorch v1.12 was chosen as the programming framework, and the models were trained on an NVIDIA RTX3090 GPU running the Ubuntu 20.04 operating system. Additionally, Scikit-learn was employed as a data analysis tool to evaluate the performance of the models. The models were trained and tested with different batch sizes (32, 64, and 128) and a learning rate of 0.0001. The Adam optimizer was used as the optimization algorithm to update the parameters of the models during the training process. The early stopping method was employed to prevent overfitting, halting the training process when the model performance ceased to improve on the validation dataset with a patience value of 4. Model performance was evaluated using accuracy, precision, recall, and F1-score metrics. Additionally, Grad-CAM techniques were applied to visualize the features learned by the model and aid in understanding the Deepfake classification process.
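The following sketch outlines this training procedure as a standard PyTorch loop; only the hyperparameters (Adam, learning rate 0.0001, patience 4) come from the paper, while the loop structure, the data loaders, and the evaluate helper are assumptions for illustration:

```python
# Sketch of the reported setup: Adam (lr=1e-4) with early stopping at
# patience 4 on validation performance. train_loader/val_loader are assumed
# to yield batches of size 32, 64, or 128 as in the experiments.
import copy
import torch

def evaluate(model, loader, device):
    """Classification accuracy over a data loader."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1)
            correct += (preds == labels.to(device)).sum().item()
            total += labels.size(0)
    return correct / total

def train(model, train_loader, val_loader, max_epochs=100, patience=4):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = torch.nn.CrossEntropyLoss()   # binary real-vs-fake classification

    best_acc, best_state, stale_epochs = 0.0, None, 0
    for epoch in range(max_epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

        val_acc = evaluate(model, val_loader, device)
        if val_acc > best_acc:                 # improvement: reset the counter
            best_acc, best_state = val_acc, copy.deepcopy(model.state_dict())
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:       # early stopping after 4 stale epochs
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return model
```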
C. Overall Performance with Early Stopping Implementation

Fig. 4 provides a detailed comparison of the performance of four different models (DenseNet121, ResNet18, SqueezeNet, and VGG11) in terms of their accuracy, the number of epochs required to reach maximum accuracy, and the time taken to train the model. The models were trained on the dataset using different batch sizes with an early stopping technique.
Fig. 4. The training time, number of epochs, and accuracy of the different models on the test dataset using different batch sizes.
One noteworthy finding from the figure is that the DenseNet121 and VGG11 models consistently achieved higher accuracy scores than the ResNet18 and SqueezeNet models across all batch sizes, indicating that they may be more suitable for applications that prioritize high accuracy. However, ResNet18 required the fewest epochs to reach its maximum accuracy score, which could be beneficial in time-critical applications. DenseNet121 and VGG11 required more epochs, suggesting they may need longer training times, but may also be more accurate.

Early stopping techniques were used, which may have impacted the number of epochs required to reach maximum accuracy scores. The time taken to train the models varied significantly, with DenseNet121 taking the longest and ResNet18 taking the shortest time. Therefore, the choice of model may depend on the available computational resources and the desired training speed. These findings demonstrate the importance of considering both accuracy and training time when selecting a deep learning model for a specific task.

D. Performance Metrics on Model Architecture using Different Batch Sizes

Table I shows the performance metrics of the four models investigated at various batch sizes and model parameter sizes. Across all models, increasing the batch size led to decreased accuracy and recall and increased precision, indicating the trade-offs between different evaluation metrics. VGG11 consistently achieved the highest accuracy, recall, and F1-score, while ResNet18 had the highest precision at batch size 128. In the context of Deepfake classification, a high precision means that the model is good at avoiding false positives, i.e., cases where the model predicts a Deepfake but the sample is actually genuine. This is important to prevent false accusations and unnecessary investigations. A high recall means that the model is good at detecting all the Deepfakes in the dataset, which is crucial for the model's effectiveness in real-world scenarios.

Therefore, precision and recall are essential in Deepfake detection to ensure that the model accurately identifies Deepfakes while minimizing the risk of false positives. The F1-score, which is a measure of the balance between precision and recall, indicates the ability of the model to achieve high precision and recall simultaneously.
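Since Scikit-learn was reported as the evaluation tool, the metrics in Table I can be computed along the following lines; y_true and y_pred are illustrative placeholders for the test labels and model predictions:

```python
# Computing the four reported metrics with scikit-learn. Label 1 is assumed
# to denote a Deepfake; y_true/y_pred are small illustrative arrays.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # ground-truth labels (placeholder)
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # model predictions (placeholder)

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))  # avoiding false positives
print("Recall:   ", recall_score(y_true, y_pred))     # catching all Deepfakes
print("F1-score: ", f1_score(y_true, y_pred))         # balance of the two
```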
TABLE I. PERFORMANCE METRICS OF THE FOUR MODELS INVESTIGATED AT DIFFERENT BATCH SIZES AND MODEL PARAMETER SIZES

Model        | Size  | Batch size | Accuracy | Precision | Recall | F1-score
DenseNet121  | 7M    | 32         | 0.9389   | 0.9567    | 0.9247 | 0.9404
             |       | 64         | 0.9124   | 0.9377    | 0.8936 | 0.9151
             |       | 128        | 0.9196   | 0.9629    | 0.8871 | 0.9234
ResNet18     | 11.2M | 32         | 0.9200   | 0.9663    | 0.8854 | 0.9241
             |       | 64         | 0.8959   | 0.9454    | 0.8615 | 0.9015
             |       | 128        | 0.8405   | 0.9774    | 0.7687 | 0.8606
SqueezeNet   | 736K  | 32         | 0.9185   | 0.9397    | 0.9024 | 0.9207
             |       | 64         | 0.9084   | 0.9663    | 0.8670 | 0.9140
             |       | 128        | 0.9042   | 0.9696    | 0.8585 | 0.9106
VGG11        | 128M  | 32         | 0.9446   | 0.9512    | 0.9396 | 0.9453
             |       | 64         | 0.9298   | 0.9452    | 0.9178 | 0.9313
             |       | 128        | 0.8947   | 0.9541    | 0.8540 | 0.9013

Notably, the model parameter size did not always correspond with higher accuracy scores, as DenseNet121, which has a smaller parameter size than VGG11, performed similarly well in terms of accuracy across all batch sizes. This suggests that other factors, such as the architecture and the training process, may play a more critical role in determining model performance.

E. Deepfake Visualization

The four models investigated in this paper are often considered 'black boxes' because they are highly complex and challenging to interpret. To address this, heatmaps are commonly used to visualize the attention or activation of the specific regions in an image that are most critical in making a prediction. The Grad-CAM techniques were used to gain insights into the decision-making process within the last convolutional layer of the VGG11 model and to identify the most critical regions of an input image for making a prediction. The VGG11 model was selected because it had the highest performance metrics, based on the results in Table I.

Fig. 5 displays the heatmaps generated using two extensions of the Grad-CAM technique, GradCAM++ and GradCAMElementWise, with pre-trained weights from the ImageNet dataset. It was observed that the heatmap emphasized the Deepfake area but lacked precision. In Fig. 6, the heatmap shifted focus to the center of the Deepfake area when the model was trained on data from the OpenForensics dataset. GradCAM++ provided stable heatmaps across different batch sizes, while GradCAMElementWise became narrower as the batch size increased.

It can be seen that when the batch size is increased, the heatmap can become more stable and robust because the model processes more input samples and captures a more representative set of activations in the last convolutional layer.

Fig. 5. (a) GradCAM++ and (b) GradCAMElementWise heatmap on a Deepfake image (example 1) using pre-trained weights from the ImageNet dataset.
Fig. 7 presents another example, of a tilted head while wearing a cap. Based on the generated heatmap, it can be concluded that GradCAM++ with a batch size of 128 produced more stable and detailed saliency, localizing the area of the Deepfake.
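As a concrete illustration of this visualization procedure, the sketch below generates both kinds of heatmaps for a VGG11 classifier. The paper does not name its Grad-CAM implementation, so the use of the open-source pytorch-grad-cam package, the torchvision VGG11 target layer, and the assumption that class 1 denotes 'fake' are all illustrative choices:

```python
# Sketch of GradCAM++ / GradCAMElementWise heatmaps on VGG11's last
# convolutional block, using the jacobgil/pytorch-grad-cam package
# (an assumed tool; recent versions provide both CAM variants).
import numpy as np
import torch
from torchvision.models import vgg11
from pytorch_grad_cam import GradCAMPlusPlus, GradCAMElementWise
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

model = vgg11(weights="IMAGENET1K_V1").eval()  # or weights fine-tuned on OpenForensics
target_layers = [model.features[-1]]           # end of the convolutional feature extractor

input_tensor = torch.randn(1, 3, 224, 224)         # placeholder preprocessed face crop
rgb_img = np.float32(np.random.rand(224, 224, 3))  # placeholder original image in [0, 1]

for cam_class in (GradCAMPlusPlus, GradCAMElementWise):
    cam = cam_class(model=model, target_layers=target_layers)
    grayscale_cam = cam(input_tensor=input_tensor,
                        targets=[ClassifierOutputTarget(1)])  # assumed 'fake' class index
    overlay = show_cam_on_image(rgb_img, grayscale_cam[0], use_rgb=True)
```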
Fig. 6. (a) GradCAM++ and (b) GradCAMElementWise heatmaps on a Deepfake image (example 1) using a model trained on the OpenForensics dataset. Columns correspond to batch sizes 32, 64, and 128.

Fig. 7. (a) GradCAM++ and (b) GradCAMElementWise heatmaps on a Deepfake image (example 2) using a model trained on the OpenForensics dataset. Columns correspond to batch sizes 32, 64, and 128.

IV. CONCLUSION

In conclusion, this study provides significant insights into the performance of various CNN-based deep learning models, including the impact of batch size on their accuracy and training time. These findings can help guide the selection of an appropriate model not only for Deepfake scenarios but also for other applications. It is important to note that the results may have been influenced by the early stopping technique, particularly with respect to the number of epochs required to achieve maximum accuracy.

Future research directions may include exploring the use of ensemble models and examining other hyperparameters to optimize model performance. Overall, these directions have the potential to enhance the accuracy and interpretability of Deepfake detection models and make them more practical for real-world applications.

ACKNOWLEDGMENT

This research is financially supported by the Fundamental Research Grant Scheme (FRGS/1/2021/ICT07/UMP/02/1) with the RDU number RDU210136, awarded by the Ministry of Higher Education (MOHE) via the Research and Innovation Department, Universiti Malaysia Pahang (UMP), Malaysia.

REFERENCES

[1] Y. Li, M.-C. Chang, and S. Lyu, "In Ictu Oculi: Exposing AI Created Fake Videos by Detecting Eye Blinking," in 2018 IEEE International Workshop on Information Forensics and Security (WIFS), Dec. 2018, doi: 10.1109/WIFS.2018.8630787.
[2] F. Matern, C. Riess, and M. Stamminger, "Exploiting Visual Artifacts to Expose Deepfakes and Face Manipulations," in 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), Jan. 2019, doi: 10.1109/WACVW.2019.00020.
[3] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, and P. Natarajan, "Recurrent Convolutional Strategies for Face Manipulation Detection in Videos," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Jun. 2019.
[4] T. T. Nguyen et al., "Deep Learning for Deepfakes Creation and Detection: A Survey," SSRN Electronic Journal, 2022, doi: 10.2139/ssrn.4030341.
[5] P. Kumar, M. Vatsa, and R. Singh, "Detecting Face2Face Facial Reenactment in Videos," in 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Mar. 2020, doi: 10.1109/WACV45572.2020.9093628.
[6] Z. Guo, G. Yang, J. Chen, and X. Sun, "Fake face detection via adaptive manipulation traces extraction network," Computer Vision and Image Understanding, vol. 204, p. 103170, Mar. 2021, doi: 10.1016/j.cviu.2021.103170.
[7] M. S. Saealal, M. Z. Ibrahim, D. J. Mulvaney, M. I. Shapiai, and N. Fadilah, "Using cascade CNN-LSTM-FCNs to identify AI-altered video based on eye state sequence," PLoS One, vol. 17, no. 12, p. e0278989, Dec. 2022, doi: 10.1371/journal.pone.0278989.
[8] M. S. Saealal, M. Z. Ibrahim, M. Yakno, and N. W. Arshad, "Three-Dimensional Convolutional Approaches for the Verification of Deepfake Videos: The Effect of Image Depth Size on Authentication Performance," Journal of Advances in Information Technology, vol. 14, no. 3, pp. 488–494, 2023, doi: 10.12720/jait.14.3.488-494.
[9] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, Mar. 1994, doi: 10.1109/72.279181.
[10] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, vol. 9, pp. 249–256. [Online]. Available: http://proceedings.mlr.press/v9/glorot10a.html
[11] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, doi: 10.1109/CVPR.2016.90.
[12] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely Connected Convolutional Networks," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, doi: 10.1109/CVPR.2017.243.
[13] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," Computing Research Repository (CoRR), vol. abs/1602.07360, 2016. [Online]. Available: http://arxiv.org/abs/1602.07360
[14] A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, "Grad-CAM++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks," in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Mar. 2018, doi: 10.1109/WACV.2018.00097.
[15] T.-N. Le, H. H. Nguyen, J. Yamagishi, and I. Echizen, "OpenForensics: Large-Scale Challenging Dataset For Multi-Face Forgery Detection And Segmentation In-The-Wild," in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2021, doi: 10.1109/ICCV48922.2021.00996.