An Adversarial Attack Approach for Explainable AI Evaluation on Deepfake Detection Models

Keywords: Deepfake; Explainable AI; Evaluation; Adversarial attack; Image forensics

Abstract

With the rising concern on model interpretability, the application of eXplainable AI (XAI) tools on deepfake detection models has been a topic of interest recently. In image classification tasks, XAI tools highlight pixels influencing the decision given by a model. This helps in troubleshooting the model and determining areas that may require further tuning of parameters. With a wide range of tools available in the market, choosing the right tool for a model becomes necessary as each one may highlight different sets of pixels for a given image. There is a need to evaluate different tools and decide the best performing ones among them. Generic XAI evaluation methods like insertion or removal of salient pixels/segments are applicable for general image classification tasks but may produce less meaningful results when applied on deepfake detection models due to their functionality. In this paper, we perform experiments to show that generic removal/insertion XAI evaluation methods are not suitable for deepfake detection models. We also propose and implement an XAI evaluation approach specifically suited for deepfake detection models.
Fig. 3. Illustration of changes done to images for various evaluation methods (a) Fake image (b) Explanation of fake image (c) Deletion: Top 20 % pixels replaced
with per-channel mean (d) IAUC: Top 20 % pixels inserted on a blurred image (e) IROF: Top 20 segments replaced with per-channel mean.
OOD arises when a model is shown testing data that does not belong to the training data distribution. The prediction given by the model on such data should not be trusted. By inserting pixels using different replacement methods, the resulting data may not lie within the boundary of the original training distribution. As a result, it cannot be ascertained whether the drop/rise in the prediction of the new image is due to removal of important features or because of the shift in distribution. Gomez et al. (2022) proved that distortions to an image like the one shown in Fig. 3(d) do introduce OOD issues and lead to unexpected behaviour of the model. Further, there may exist a local correlation among pixels even after removal/insertion of important features, which will still allow the model to guess the prediction correctly. In the case of Fig. 3(c), if the actual value of the removed pixel is very close to the average value, then not much information is removed from the image. While segmentation removes the problem of local correlation in IROF, the issue of OOD still persists. In our work, only a small magnitude of noise is added to an image to ensure that it remains visually imperceptible from the original one. We also make use of segmentation in our approach to prevent local correlation among pixels. In other words, noise will be added to all pixels that represent similar features.

Pointing game (Zhang et al., 2018) measures how well an explanation tool can highlight pixels in a pre-annotated area such as a bounding box. The pointing game score is defined as the ratio of the total number of hits to the total number of samples. It is useful in a domain where the ground truth for the salient area is known beforehand (for example, object detection tasks). This method cannot be applied on deepfakes. This is because we as humans cannot ascertain which areas contribute to the real/fake class and thus cannot pre-annotate any area in a real/fake image to check if the salient pixels fall within that area.

RemOve And Retrain (ROAR) (Hooker et al., 2019) is an extension of Deletion and overcomes the OOD limitation. The authors of ROAR argue that it is necessary to retrain the model after deletion to judge whether the reduction in model performance is due to the distribution shift or because of removal of important features. Their approach is to retrain the model using the images created after deletion and compare the accuracy between the original and retrained model. Ideally, a good tool should result in the retrained model having a lower accuracy than the original one. However, there are a few issues when it comes to adopting ROAR for practical purposes. Retraining consumes a lot of time and may demand more computational resources. The authors of Rieger and Hansen (2020) note that it is not feasible for research groups to use ROAR, as an evaluation of eight tools based on ResNet50 may take around 241 days even with eight GPUs. Unlike ROAR, our work does not involve model retraining and hence is computationally inexpensive.

Max-Sensitivity (Yeh et al., 2019) measures the degree to which an explanation changes on introducing random perturbations to the data. A small amount of noise is added to the image and the attribution map is recomputed on the perturbed image. The difference between the attribution maps of the original and perturbed image is measured. In the ideal case, an explanation should have low sensitivity. Having a higher sensitivity could possibly mean that the explanation is susceptible to adversarial attacks. While this metric may be useful in measuring the robustness of an explanation, it cannot be used to validate the faithfulness of an explanation.

Infidelity (Yeh et al., 2019) compares the difference in model output after an arbitrary perturbation with the dot product of the perturbation vector and the attribution map. The perturbation vector can be of two types: Noisy Baseline and Square Removal. The former deals with using a Gaussian random vector as the perturbation vector. Square Removal captures spatial information in the images by removing square patches of predefined size from an image. Once again, the perturbations used in this case are not constrained by any limit, thus making them perceptible.

Impact Coverage (Lin et al., 2019) adds an adversarial patch to the image such that it is misclassified. The explanation for the adversarial image is then computed. Intuitively, a good tool must highlight the adversarial patch in its explanation since it led to the misclassification. However, the adversarial patches used in this metric are image-independent and visually perceptible. Similar to Max-Sensitivity, this metric also requires the explanation method to be available at the time of evaluation. Our work focuses on creating adversarial images that are visually imperceptible from the original ones and does not require re-computation of the explanation on the adversarial images.

2.2. Saliency-based adversarial attacks

Works that confine the adversarial attack perturbation to the salient regions of an input have also been published in recent literature. However, unlike our work, their primary motivation was not to evaluate the salient regions. Dong et al. (2020) showed that a superpixel based adversarial attack strategy guided by the results of CAM (Zhou et al., 2016) was robust to image processing based defense and steganalysis based detection. Xiang et al. (2021) formulated a local black-box adversarial attack technique to improve query efficiency and the transferability of adversarial samples. Instead of attacking the original image directly, they first perform a white-box attack on a surrogate model where the noise is confined to the regions highlighted by Grad-CAM (Selvaraju et al., 2017). The resulting image is then used as a starting point for the actual black-box attack. Dai et al. (2023) introduced a saliency-based black-box attack method with the main intention of creating imperceptible adversarial samples. All these works utilize the salient regions of the same image for which they intend to produce an adversarial image. However, our work differs in the aspect that we bring into play the correlation of salient regions of images belonging to the opposite class. In other words, we identify the salient regions of a real image and attack those regions in the corresponding fake image to generate an adversarial fake image. This correlation is possible in deepfake images due to the similar orientation of faces in a real-fake image pair.

3. Background

This section gives an insight into the XAI tools and the adversarial attack method used in this paper.

3.1. XAI
XAI tools are divided into two types: model-specific and model-agnostic. Model-specific tools consider the internal structure and working of a model while making their calculations. Such tools can only be applied to a specific class of models like, for example, Convolutional Neural Networks (CNN). Model-agnostic tools, on the other hand, do not require any knowledge about the structure of the model. They work by randomly perturbing the input data features and observing the model's performance for different perturbations. The following tools were selected for our experiments based on their compatibility with the chosen deepfake detection models:

(a) SOBOL (Fel et al., 2021)
It is a model-agnostic tool which leverages a mathematical concept called SOBOL indices to identify the contribution of input variables to the variance of the output. A set of real-valued masks is drawn from a Quasi-Monte Carlo (QMC) sequence and then applied to an input image through a perturbation function (e.g. blurring) to form perturbed inputs. These inputs are forwarded to the model to obtain prediction scores. Using the masks and the associated prediction scores, an explanation is produced which characterizes the importance of each region by estimating the total order of the SOBOL indices. One drawback of SOBOL is that it can be applied to image data only.

(b) eXplanation with Ranked Area Integrals (XRAI) (Kapishnikov et al., 2019)
XRAI combines Integrated Gradients (IG) (Sundararajan et al., 2017) with additional steps to determine the regions of an image contributing the most to a decision. It performs pixel-level attribution for an input image using the IG method with a black baseline and a white baseline. It then over-segments the image to create a patchwork of small regions. XRAI aggregates the pixel-level attribution within each segment to determine its attribution density. Using these values, the segments are ranked and ordered from most to least positive. This determines the most salient regions of an image. Unlike SOBOL, XRAI is not model-agnostic since it makes use of the gradients of a model.

(c) Randomized Input Sampling for Explanation (RISE) (Petsiuk et al., 2018)
Like SOBOL, RISE works by randomly perturbing the input and observing the changes in the model's predictions. Image perturbation is done by generating binary masks and occluding the corresponding regions of the image. The prediction scores of the perturbed images are used as weights for the importance of each mask. The intuition here is that masks that contain important pixels will be weighted more when compared to other masks. The weighted masks are then added up to produce a saliency map (a minimal sketch of this procedure is given after this list). The computation time for RISE is heavy since a lot of masks are required to explain a single image.

(d) Local Interpretable Model-agnostic Explanations (LIME) (Ribeiro et al., 2016)
The authors of LIME proposed that it is not necessary to look at the entire decision boundary to explain a prediction; rather, zooming into the local area of the individual data instance should be enough to provide a reasonable explanation. Numerous data samples are generated from the original input by randomly perturbing the features. These data samples are then weighted based on their distance from the original input. They are then fed to the model and their corresponding predictions are retrieved. The explanation provided by LIME lies around finding a simple linear model that fits the new data samples and their corresponding predictions.

(e) Gradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al., 2017)
This tool is a slight modification of Class Activation Mapping (CAM) (Zhou et al., 2016). The CAM approach had a major drawback since the existing network had to be modified. Selvaraju et al. proposed a slight modification of CAM where the gradients of the classification score with respect to the last convolutional layer's output can be used to identify salient parts of an image. The input image is run through the model and the last convolution layer's output and the loss are retrieved. This is followed by calculating the gradient of the model loss with respect to that output. The average of the gradients is then multiplied with the last CNN layer's output to get the final saliency map. Grad-CAM is a model-specific tool as it can be applied to CNNs only.
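To make the perturbation-and-reweighting idea used by the model-agnostic tools concrete, the following is a minimal RISE-style sketch, assuming a predict_fake_prob function that returns the detector's fake-class probability for a single image; the function name, the grid size and the nearest-neighbour upsampling are illustrative assumptions and this is not the reference RISE implementation.

```python
import numpy as np

def rise_saliency(image, predict_fake_prob, n_masks=2000, grid=7, p_keep=0.5, seed=0):
    """RISE-style saliency map: random binary masks weighted by prediction scores.

    image: float array of shape (H, W, 3) scaled to [0, 1].
    predict_fake_prob: callable mapping one image -> fake-class probability.
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    saliency = np.zeros((h, w))
    for _ in range(n_masks):
        # Coarse binary mask, upsampled to image size (nearest neighbour).
        coarse = (rng.random((grid, grid)) < p_keep).astype(float)
        mask = np.kron(coarse, np.ones((h // grid + 1, w // grid + 1)))[:h, :w]
        score = predict_fake_prob(image * mask[..., None])   # occlude zeroed regions
        saliency += score * mask                             # weight mask by score
    return saliency / n_masks
```

The mask count is the main cost driver of such perturbation-based tools, which is why our experiments later cap RISE at 2000 masks.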
3.2. Adversarial attacks

An adversarial attack on a model is the process of adding noise to the data such that the model reverses/changes its original prediction. Creating visually imperceptible adversarial samples is a challenging task as the amount of noise added cannot exceed a certain limit in order to maintain visual similarity. Adversarial samples can be created in two ways:

(a) White-box method
In this method, an adversary has complete access to the model parameters and gradients. Popular white-box methods to generate adversarial samples include the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015) and the Iterative Fast Gradient Sign Method (I-FGSM) (Kurakin et al., 2017); a minimal FGSM sketch is shown after this list. However, such attacks are not practical in real-world scenarios since an adversary may not have complete access to the model. The adversarial samples created using such attacks are also not transferable, meaning that they can bypass only the model using which they were created.

(b) Black-box method
In this case, an adversary can only query the model for output and knows nothing else about the model. Since gradients cannot be directly retrieved from the model, gradient estimation techniques like Differential Evolution (Storn and Price, 1997) and Natural Evolution Strategies (NES) (Wierstra et al., 2014) are a few black-box methods used to create adversarial samples. This attack is more challenging to execute when compared to its white-box counterpart due to the lack of information about the model. Since we believe that a good XAI tool should help to create better fake images without knowing a model's parameters, we will be dealing with this method in our evaluation approach. More details about its implementation are discussed in Section 4.2.
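As an illustration of the white-box setting in (a), here is a minimal FGSM sketch, assuming a Keras/TensorFlow binary classifier whose output is the fake-class probability; the helper name fgsm_attack and the distortion budget are illustrative, and this is not the attack used in our evaluation approach, which is black-box.

```python
import tensorflow as tf

def fgsm_attack(model, image, label, eps=8 / 255):
    """One-step FGSM: move the input in the direction that increases the loss.

    image: tensor of shape (1, H, W, 3) in [0, 1]; label: 0 (real) or 1 (fake).
    """
    x = tf.convert_to_tensor(image)
    y = tf.constant([[float(label)]])
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = tf.keras.losses.binary_crossentropy(y, model(x))
    grad = tape.gradient(loss, x)            # requires white-box access to gradients
    adv = x + eps * tf.sign(grad)
    return tf.clip_by_value(adv, 0.0, 1.0)
```

I-FGSM repeats this step with a smaller step size while clipping the total perturbation to the allowed budget.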
4. Methodology

In this section, we present the details of our proposed evaluation framework. Our approach is to add noise to fake images in those visual concepts that contribute highly to the classification of real images as indicated by an explanation tool. For a given real-fake image pair, we first compute the explanation of the real image and identify the important visual concept. Next, we need to find a way to manipulate the same visual concept in the corresponding fake image. The orientation of faces in real-fake image pairs is similar. Hence, if we can derive the pixel indices of the salient visual concept of real images, those indices will provide us the regions for manipulation in the corresponding fake image. To implement this, we make use of image segmentation through the slic algorithm of the scikit-image package. The fake image is divided into a total of 100 segments. We then rank the segments based on the explanation of the real image to select the region for distortion. This is followed by the generation of the adversarial sample using the selected segment. We present the implementation details of the ranking and adversarial image generation below. An illustration of our evaluation approach is shown in Fig. 4.

4.1. Ranking segments

The fake image segments are ranked based on the mean importance
approach followed by Rieger and Hansen (2020). Rieger and Hansen (2020) showed that two segments can be compared by computing the mean importance of each segment according to a given explanation method. For a given image i, saliency map E_i and a set of segments {S_i^l}, l = 1, ..., L, where L is the total number of segments, the importance of a segment S_i^l can be represented as:

importance(S_i^l) = ||E_i(S_i^l)||_1 / |S_i^l|

In the above equation, ||x||_1 represents the L1 norm, or the sum of the absolute values of a vector x, and |S_i^l| is the number of pixels in the segment. The segments are ranked in descending order based on their mean values and the indices ind of the topmost ranked segment are retrieved. These indices correspond to the pixels to which noise will be added during adversarial image generation.

4.2. Adversarial image generation

NES estimates the gradient of the model output by sampling points around an image, that is, θ is set as θ + σδ where δ ~ N(0, I). However, instead of setting n values δ_i ~ N(0, I), Hussain et al. used antithetic sampling: Gaussian noise is sampled for i ∈ {1, ..., n/2} and δ_j = −δ_{n−j+1} is set for j ∈ {n/2 + 1, ..., n}, since this was shown to empirically improve the performance of NES. Estimating the gradient with a population of n samples yields the following variance-reduced gradient estimate:

∇E[F(θ)] ≈ (1 / (σn)) ∑_{i=1}^{n} δ_i F(θ + σδ_i)

Since our approach deals with adding noise to only a particular segment instead of the entire image, we modify the above equation as follows:

∇E[F(θ)] ≈ (1 / (σn)) ∑_{i=1}^{n} δ_i[ind] F(θ[ind] + σδ_i[ind])
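The segment selection that produces ind above can be sketched in a few lines of Python using the slic segmentation from scikit-image, which is what divides the fake image into 100 segments; the saliency map is assumed to be the absolute attribution produced for the real image by the tool under evaluation, and the helper name top_segment_indices is illustrative.

```python
import numpy as np
from skimage.segmentation import slic

def top_segment_indices(fake_image, real_saliency, n_segments=100):
    """Rank slic segments of the fake image by the mean absolute attribution of the
    corresponding pixels in the real image's saliency map (the L1-norm formula above)
    and return the pixel indices of the top-ranked segment."""
    segments = slic(fake_image, n_segments=n_segments, start_label=0)
    best_label, best_score = None, -np.inf
    for label in np.unique(segments):
        mask = segments == label
        score = np.abs(real_saliency[mask]).sum() / mask.sum()  # ||E_i(S_i^l)||_1 / |S_i^l|
        if score > best_score:
            best_label, best_score = label, score
    ind = np.argwhere(segments == best_label)   # (row, col) pairs that will receive noise
    return ind
```

The returned indices correspond to ind in the modified NES estimate above.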
All the experiments were done using Python 3.9.12 on a 64 GB RAM Ubuntu 22.04 OS machine with an NVIDIA GeForce RTX 3070 GPU.

5.1. Dataset and models

FaceForensics++ (Rössler et al., 2019) and Celeb-DF (Li et al., 2020) datasets were used for our experiments. FaceForensics++ is a collection of fake videos created using four manipulation methods: Deepfake (DF), Face2Face (F2F), FaceSwap (FS) and NeuralTextures (NT). There are 1000 real videos and each manipulation method consists of 1000 videos. The videos are offered in different compression modes. We chose the raw format for our experiments. The dataset is divided into 720 videos for training and 140 videos each for validation and testing. For the 140 videos in the test set, we sampled 10 frames from each video. Thus, a total of 1400 original and fake images were used to carry out our proposed method. Celeb-DF consists of 590 real videos and 5639 fake videos. The test set consists of 518 videos in total. We sampled 5 frames from this larger test set. Kindly note that while Celeb-DF is open-sourced and publicly available, we cannot distribute any derived or manipulated data from the dataset according to their terms of use as listed in (Terms to use Celeb-DF, 2023).
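As a small illustration of the frame sampling described above, the sketch below picks a fixed number of evenly spaced frames from a video with OpenCV; the helper name sample_frames and the use of OpenCV are illustrative assumptions, as the exact extraction routine is not specified here.

```python
import cv2
import numpy as np

def sample_frames(video_path, n_frames=10):
    """Return up to n_frames evenly spaced RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num=n_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```

Here n_frames = 10 matches the FaceForensics++ setting and n_frames = 5 the Celeb-DF setting.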
Two factors had an impact on our choice of the number of sampled frames from the datasets. The datasets have a frame rate of 30 fps (frames per second). Given that there are 140 and 518 videos in the test sets of FaceForensics++ and Celeb-DF respectively, working with every frame will increase computational and time complexity, as highlighted by recent works (Xu et al., 2023) and (Shiohara and Yamasaki, 2022). To reduce computational complexity, (Xu et al., 2023) samples only 4 frames for training models while (Shiohara and Yamasaki, 2022) uses 8 frames. The second reason is the fact that recent deepfake detectors rely on finding spatial inconsistency among frames (Gu et al., 2021). Thus, an attacker would effectively need to modify the same spatial region across multiple frames to bypass detection. Given that the datasets have a frame rate of 30 fps with an average video length of 13 s (Ismail et al., 2021), the sampling of 5-10 frames across the video should be sufficient to demonstrate that the entire video is vulnerable.

We implemented our approach on MesoNet (Afchar et al., 2018) and XceptionNet. We have highlighted a few issues and the reasons for our experimental choices below:

• XceptionNet accepts images of any size whereas MesoNet accepts images of size 256 × 256 × 3 only. While XceptionNet's publicly available model was originally trained on images of size 299 × 299 × 3, we have carried out our experiments for size 256 × 256 × 3 to maintain consistency between both models.
• Out of the four manipulation methods provided by FaceForensics++, we have tested out our evaluation proposal on DF, F2F and FS only. The accuracy of XceptionNet on 1400 real images for NT was just 20 %. We believe that this could be due to the fact that NT manipulates only a few frames in the source video and thus, 1400 frames may not be enough to get a good accuracy on NT.
• The pre-trained models of MesoNet and XceptionNet were available in different versions. In the case of MesoNet, three models trained individually on DF, F2F and FS and one model trained on all manipulation methods were available. However, in the case of XceptionNet, only the model trained on all manipulation methods was available. The accuracy of 1400 real images on the MesoNet model that was trained on all manipulation methods was only 38.36 %. In such a case, the explanations retrieved for real images cannot be trusted upon since the prediction probability itself will point more towards the fake class. Hence for MesoNet, we have used the three model versions individually trained on DF, F2F and FS, as opposed to XceptionNet where the model trained on all manipulation methods was used.

As each tool has various parameters, we would like to highlight the configuration used for each tool since different configurations can produce varying results.

1) SOBOL: For both models, the default values were used for each parameter.
2) XRAI: For both models, the default values were used for each parameter.
3) RISE: There is no default value for the number of masks in RISE. Since RISE is computation heavy, we limited the number of masks to 2000.
4) LIME: The default value of 1000 perturbations was used to create explanations for both models. LIME makes use of segmentation algorithms built on scikit-image to perturb different segments and ranks them based on their importance. We made use of the slic algorithm for LIME as well.
5) Grad-CAM: Any CNN layer can be utilized to view Grad-CAM's explanation, but in practice it is most preferred to use the last CNN layer of a model. Hence, the 'conv2d_15' and 'conv4.pointwise' layers were used to visualize MesoNet and XceptionNet respectively.

5.3. Configuration of NES

1) Maximum iterations itr: The number of iterations was set as 50 for both models. This number was chosen to keep the computation time of the evaluation process as low as possible while at the same time providing a sufficient number of trials for NES to reach an adversarial solution.
2) Learning rate α: The learning rate was set as 1/255. The reason for this is to make sure that NES does not add too much noise at each step. If too much noise is added at each step, then there could be a possibility where the prediction probability of the adversarial fake image never comes close to 0 but rather moves in the opposite direction.
3) Maximum distortion ϵ: The maximum amount of noise added to an image was capped at 16/255 to maintain visual imperceptibility. It might be easier to spot the distortion done to pixels if this number is increased.
4) Search variance σ: The search variance was set as 0.001 following the implementation of (Hussain et al., 2021).
5) Number of samples n: This was set as 20 for MesoNet and 80 for XceptionNet. The number of samples had to be increased for XceptionNet due to its complexity when compared to MesoNet.
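Putting the configuration above together with the segment-restricted NES estimate from Section 4, the following is a minimal sketch of the adversarial image generation loop; predict_fake_prob is an assumed query-only helper returning the detector's fake-class probability for one image, and the loop omits details of the actual implementation.

```python
import numpy as np

def nes_segment_attack(fake_image, ind, predict_fake_prob,
                       itr=50, alpha=1 / 255, eps=16 / 255, sigma=0.001, n=20):
    """Black-box NES attack that perturbs only the pixels of the selected segment.

    fake_image: float array (H, W, 3) in [0, 1]; ind: (K, 2) array of (row, col) indices.
    """
    x = fake_image.astype(np.float64).copy()
    original = fake_image.astype(np.float64)
    rows, cols = ind[:, 0], ind[:, 1]
    rng = np.random.default_rng(0)
    for _ in range(itr):
        # NES gradient estimate with antithetic sampling, restricted to the segment.
        half = rng.standard_normal((n // 2,) + x[rows, cols].shape)
        deltas = np.concatenate([half, -half], axis=0)
        grad = np.zeros_like(x[rows, cols])
        for d in deltas:
            probe = x.copy()
            probe[rows, cols] += sigma * d
            grad += d * predict_fake_prob(np.clip(probe, 0.0, 1.0))
        grad /= sigma * len(deltas)
        # Step down the fake-class probability and stay within the distortion budget.
        x[rows, cols] = np.clip(x[rows, cols] - alpha * np.sign(grad),
                                original[rows, cols] - eps, original[rows, cols] + eps)
        x = np.clip(x, 0.0, 1.0)
        if predict_fake_prob(x) < 0.5:   # the detector now labels the image as real
            break
    return x
```

With these settings, each iteration issues n model queries (20 for MesoNet, 80 for XceptionNet).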
Fig. 5. Model accuracy results for removal XAI evaluation metrics on XceptionNet for FaceForensics++ DF (a) Deletion: Top 15 % of salient pixels replaced with
zero, random and blurred values (b) IROF: Top 15 salient segments replaced with zero, random and blurred values.
Fig. 6. Model accuracy results for removal XAI evaluation metrics on XceptionNet for Celeb-DF (a) Deletion: Top 15 % of salient pixels replaced with zero, random
and blurred values (b) IROF: Top 15 salient segments replaced with zero, random and blurred values.
6. Results

In this section, we first demonstrate the limitations of the applicability of existing removal/insertion XAI evaluation methods on deepfake detectors. We then show the results of our implemented approach.

6.1. Analysis of generic removal/insertion XAI evaluation methods

We investigated the results of different removal/insertion XAI evaluation methods on XceptionNet for FaceForensics++ DF and Celeb-DF datasets. The chosen metrics were Deletion, IROF and IAUC.

In Deletion, the top 15 % of salient pixels highlighted by an XAI tool are replaced with either a zero value, a uniform random value or blurred using neighbouring pixels. The model accuracy is expected to drop after replacement and the magnitude of the drop determines the effectiveness of the tool. IROF is similar to Deletion, with the only difference being the replacement of 15 salient segments instead of 15 % of salient pixels. In IAUC, the top 15 % of salient pixels are inserted on a completely blurred version of the original image. The model accuracy is expected to rise after the insertion and the magnitude of the rise determines the effectiveness of the tool.
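For reference, a minimal version of the Deletion computation is sketched below; predict_is_fake is an assumed helper returning the detector's hard decision for one image, and the per-channel mean replacement mirrors Fig. 3(c) (the zero, random or blurred variants can be substituted in the same place).

```python
import numpy as np

def deletion_accuracy(fake_images, saliency_maps, predict_is_fake, top_frac=0.15):
    """Replace the top `top_frac` most salient pixels of each fake image with the
    image's per-channel mean and report how often the detector still says 'fake'."""
    correct = 0
    for img, sal in zip(fake_images, saliency_maps):
        k = int(top_frac * sal.size)
        flat = np.argsort(sal.ravel())[::-1][:k]            # most salient pixels first
        ys, xs = np.unravel_index(flat, sal.shape)
        deleted = img.copy()
        deleted[ys, xs] = img.reshape(-1, img.shape[-1]).mean(axis=0)  # per-channel mean
        correct += int(predict_is_fake(deleted))
    return correct / len(fake_images)
```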
Table 1
IAUC results on XceptionNet. The second row shows the accuracy of the model on the blurred version of the dataset. Rows 3-7 show the accuracy of the model on inserting the top 15 % of salient pixels onto the blurred images. The values in rows 3-7 should be greater than the values in row 2.

Dataset                      FaceForensics++ DF    Celeb-DF
Accuracy (Blurred images)    0 %                   74.82 %
SOBOL                        0 %                   20.58 %
XRAI                         0 %                   21.29 %
RISE                         0 %                   97.17 %
LIME                         0 %                   78.23 %
GradCAM                      0 %                   62.28 %

Fig. 5(a) and 5(b) show the results of Deletion and IROF for XceptionNet on FaceForensics++ DF. It can be observed that none of the tools show a drop in model accuracy on computation of Deletion and IROF. Rather interestingly, the numbers are greater than the actual accuracy of 84 %, which is represented by the dashed line. This shows that replacement of pixels on fake images is not suitable to evaluate XAI tools since the detectors rely on face artifacts to perform detection. Replacement of pixels/segments results in distorting those face artifacts and can produce unexpected results as can be seen in this case.

Fig. 6(a) and 6(b) show the results of removal metrics for XceptionNet on Celeb-DF. While the results of some tools fall below the actual accuracy of 80.11 %, there are still some abnormal values which make the results difficult to compare. For instance, the results for RISE and GradCAM are well above 80.11 % in most of the cases. This raises questions on the validity of these metrics since the model accuracy is expected to drop on removal of pixels/segments. Since the results of RISE and GradCAM do not show a drop in model accuracy, it also makes us question whether the drop in the results of other tools was genuine or random.
Table 2
Implementation of our approach on FaceForensics++ dataset. The original accuracy of the models on actual fake images is reported in the third row. The accuracy of the models on adversarial fake images created using the respective tools is reported in rows 4-8. A faithful tool should reduce the accuracy of a model on adversarial fake samples.

           MesoNet                          XceptionNet
           DF        F2F       FS           DF        F2F       FS
SOBOL      5.64 %    29.35 %   74.28 %      53.14 %   65.57 %   88 %
XRAI       5.5 %     38.28 %   72.07 %      57.07 %   75.64 %   89.64 %
RISE       88.07 %   29.85 %   20.71 %      58.5 %    72.28 %   88.42 %
LIME       71.71 %   19.92 %   45.85 %      61.57 %   80 %      92.14 %
GradCAM    54.07 %   44.28 %   46.78 %      32.28 %   42.78 %   83 %

Table 1 shows the results of IAUC for XceptionNet on the two datasets. The second row shows the accuracy of XceptionNet on the blurred versions of the respective datasets. When the top 15 % of salient pixels are inserted onto these blurred images, the accuracy is expected to increase. However, the results were not as expected. For FaceForensics++ DF, the results of all tools remained at 0 % while for Celeb-DF, the results of SOBOL, XRAI and GradCAM showed a decrease in accuracy. The results are not meaningful enough to evaluate the tools since they deviate from the expected behaviour of an increase in accuracy. This also shows that blurring may not be an effective replacement method to evaluate XAI tools on deepfake images.

Overall, the results of our experiments show that generic removal/insertion XAI evaluation methods may or may not work well for specific image processing tasks. This warrants the need to research and develop new XAI evaluation methods specific to the task that the model has been trained for.

6.2. Analysis of our proposed approach

Table 2 shows the results of our implementation on FaceForensics++ dataset. In the case of MesoNet, we can see that XRAI is the most faithful tool for DF whereas LIME and RISE perform well on F2F and FS respectively. Note that since we have used three different MesoNet models trained individually on DF, F2F and FS, the faithfulness of tools will not be similar for all of them. For instance, XRAI, which performed well on the MesoNet model trained on DF, had the least drop in model accuracy for adversarial F2F images. However, this is not the case with XceptionNet. Since the evaluation is carried out on a single model trained on all manipulation methods, the faithfulness of tools should be the same regardless of whichever dataset is used. Our experiment results prove the same. GradCAM showed the best results for DF, F2F and FS whereas LIME had the least drop in model accuracy for adversarial images of all datasets. Fig. 7 shows an example of the evaluation process for a real-fake image pair on MesoNet.

Table 3
Implementation of our approach on XceptionNet for Celeb-DF dataset.

Original Accuracy    80.11 %
SOBOL                27.29 %
XRAI                 19.64 %
RISE                 28.58 %
LIME                 25.17 %
GradCAM              21.94 %
Table 3 shows the results of our approach for XceptionNet on Celeb-DF. We could not use Celeb-DF with MesoNet as a pre-trained model was not available publicly. XRAI showed the best results while the adversarial images created using RISE showed the highest adversarial accuracy.

We have also outlined a comparison of the evaluation results on XceptionNet between images of size 299 × 299 × 3 and 256 × 256 × 3 in Table 4. We can observe that the ability to generate adversarial images reduces when the image resolution is increased.

Table 4
Accuracy of adversarial DF images created using respective tools on XceptionNet.

Image Size           256 × 256 × 3    299 × 299 × 3
Original Accuracy    84 %             94.2 %
SOBOL                53.14 %          71.07 %
XRAI                 57.07 %          76.57 %
RISE                 58.5 %           73.71 %
LIME                 61.57 %          75.28 %
GradCAM              32.28 %          64.5 %

Do note that we are manipulating only one segment in each image. Since we are dividing a total of 65,536 pixels into 100 segments, roughly around 1 % of the pixels gets manipulated to generate an adversarial image. The accuracy of the models on adversarial fake samples may decrease even further if more segments are manipulated. Our main objective was to find out the most faithful tool by keeping the amount of distortion on an image as small and imperceptible as possible.

Tables 5 and 6 show the average time taken for computation of explanation and evaluation by different tools respectively. As far as time complexity of explanations is concerned, we can observe that Grad-CAM seems to provide quick responses in general. This can be attributed to the fact that it involves relatively simple computations when compared to the other tools. While the time taken by model-specific tools like Grad-CAM and XRAI may depend on the model's structure, the time taken by model-agnostic tools on different models should be similar for a given set of parameters since they do not depend upon the model's structure. However, a significant increase in the amount of explanation time for SOBOL and RISE can be observed on XceptionNet. We had to reduce the batch size of images while applying SOBOL and RISE on XceptionNet to make them fit our GPU. Thus, based on our observation, lowering the batch size of images may lead to an increased computation time for perturbation based tools.

Table 5
Average time taken (in seconds) by respective tools for explaining one prediction by the models.

          MesoNet    XceptionNet
SOBOL     1.17       5.07
XRAI      3.24       32.6
RISE      12         23.54
LIME      5.73       5.56
GradCAM   4.94       2.52

Table 6
Average time taken (in seconds) for generating one adversarial fake image using the segment highlighted by the respective tools.

          MesoNet              XceptionNet
          DF    F2F    FS      DF    F2F    FS

7. Conclusion

Due to the rapid spread of deepfake content across the internet, numerous deep learning models are being proposed to distinguish between real and fake content. While a model may claim a high detection accuracy, it is also important to gain the trust of the person deploying it. XAI tools have played a tremendous role in the last few years by helping humans understand the working of a model. However, blindly using an XAI tool's result to trust/mistrust the model is also not recommended given that different XAI tools work using different strategies. It is important to ensure that an XAI tool remains faithful to a model. In this paper, we demonstrated the limitations of existing removal/insertion XAI evaluation methods on deepfake detectors. We also presented a novel approach to evaluate the faithfulness of an XAI tool on a deepfake detection model. We proposed to evaluate tools based on their ability to generate adversarial fake images using the explanation of corresponding real images. We believe that this approach will aid researchers and developers to deploy the right tools on their models.

One limitation of our work is that it requires the presence of a corresponding real video for a fake video, meaning that fake videos that do not have a real video counterpart cannot be used with our approach. This drawback hindered us from exploring other deepfake datasets like UADFV and DFDC. Another major limitation of our approach is that it may not work on deepfake detectors that are robust to adversarial attacks. Adversarially robust detectors cleverly neglect the noise added to images and continue to give the same prediction regardless of the distortion. Developing evaluation methods that work on adversarially robust detectors can be explored as future work.

CRediT authorship contribution statement

Balachandar Gowrisankar: Conceptualization, Investigation, Methodology, Software, Validation, Writing - original draft. Vrizlynn L. L. Thing: Conceptualization, Methodology, Writing - original draft.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The data that has been used is confidential.

References

Afchar, D., Nozick, V., Yamagishi, J., Echizen, I., 2018. MesoNet: a compact facial video forgery detection network. 2018 IEEE International Workshop on Information Forensics and Security (WIFS), Hong Kong, China, pp. 1-7. doi:10.1109/WIFS.2018.863076.

Bach, S., Binder, A., Montavon, G., Klauschen, F., Muller, K.R., Samek, W., 2015. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10 (7), 1-46. https://doi.org/10.1371/journal.pone.0130140.

Cai, Z., Stefanov, K., Dhall, A., Hayat, M., 2022. Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization. [Online]. Available: https://arxiv.org/abs/2204.06228.

Chollet, F., 2017. Xception: deep learning with depthwise separable convolutions. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, pp. 1800-1807. doi:10.1109/CVPR.2017.195.

Coccomini, D.A., Messina, N., Gennaro, C., Falchi, F., 2022. Combining EfficientNet and Vision Transformers for Video Deepfake Detection. In: Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F. (Eds.), Image Analysis and Processing - ICIAP 2022. Lecture Notes in Computer Science, vol. 13233. Springer, Cham. https://doi.org/10.1007/978-3-031-06433-3_19.

Dai, Z., Liu, S., Li, Q., Tang, K., 2023. Saliency attack: towards imperceptible black-box adversarial attack. ACM Trans. Intell. Syst. Technol. 14 (3), Article 45. https://doi.org/10.1145/3582563.
Dong, X., et al., 2020. Robust superpixel-guided attentional adversarial attack. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pp. 12892-12901. doi:10.1109/CVPR42600.2020.01291.

Dong, S., Wang, J., Liang, J., Fan, H., Ji, R., 2022. Explaining Deepfake Detection by Analysing Image Matching. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (Eds.), Computer Vision - ECCV 2022. Lecture Notes in Computer Science, vol. 13674. Springer, Cham. https://doi.org/10.1007/978-3-031-19781-9_2.

Fel, T., Cadene, R., Chalvidal, M., Cord, M., Vigouroux, D., Serre, T., 2021. Look at the Variance! Efficient Black-box Explanations with Sobol-based Sensitivity Analysis. arXiv:abs/2111.04138.

Gomez, T., Fréour, T., Mouchère, H., 2022. Metrics for Saliency Map Evaluation of Deep Learning Explanation Methods. In: El Yacoubi, M., Granger, E., Yuen, P.C., Pal, U., Vincent, N. (Eds.), Pattern Recognition and Artificial Intelligence. ICPRAI 2022. Lecture Notes in Computer Science, vol. 13363. Springer, Cham. https://doi.org/10.1007/978-3-031-09037-0_8.

Goodfellow, I.J., Shlens, J., Szegedy, C., 2015. Explaining and harnessing adversarial examples. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. [Online]. Available: http://arxiv.org/abs/1412.6572.

Gu, Z., Chen, Y., Yao, T., Ding, S., Li, J., Huang, F., Ma, L., 2021. Spatiotemporal inconsistency learning for DeepFake video detection. In: Proceedings of the 29th ACM International Conference on Multimedia (MM '21). Association for Computing Machinery, New York, NY, USA, pp. 3473-3481. https://doi.org/10.1145/3474085.3475508.

Hooker, S., Erhan, D., Kindermans, P.J., Kim, B., 2019. A benchmark for interpretability methods in deep neural networks. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alche-Buc, F., Fox, E., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 32, pp. 9737-9748. [Online]. Available: http://papers.nips.cc/paper/9167-a-benchmark-forinterpretability-methods-in-deep-neural-networks.pdf.

Hussain, S., Neekhara, P., Jere, M., Koushanfar, F., McAuley, J., 2021. Adversarial deepfakes: evaluating vulnerability of deepfake detectors to adversarial examples. 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, pp. 3347-3356. doi:10.1109/WACV48630.2021.00339.

Ismail, A., Elpeltagy, M., Zaki, M., ElDahshan, K.A., 2021. Deepfake video detection: YOLO-face convolution recurrent approach. PeerJ Comput. Sci. 7, e730. https://doi.org/10.7717/peerj-cs.730.

Kapishnikov, A., Bolukbasi, T., Viegas, F., Terry, M., 2019. XRAI: better attributions through regions. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), pp. 4947-4956. doi:10.1109/ICCV.2019.00505.

Kurakin, A., Goodfellow, I.J., Bengio, S., 2017. Adversarial machine learning at scale. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net. [Online]. Available: https://openreview.net/forum?id=BJm4T4Kgx.

Li, Y., Yang, X., Sun, P., Qi, H., Lyu, S., 2020. Celeb-DF: a large-scale challenging dataset for DeepFake forensics. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pp. 3204-3213. doi:10.1109/CVPR42600.2020.00327.

Lin, Z.Q., Shafiee, M.J., Bochkarev, S., Jules, M.S., Wang, X., Wong, A., 2019. Do explanations reflect decisions? A machine-centric strategy to quantify the performance of explainability algorithms. CoRR abs/1910.07387. [Online]. Available: http://arxiv.org/abs/1910.07387.

Lundberg, S.M., Lee, S.I., 2017. A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems 30. Curran Associates, Inc., pp. 4765-4774. [Online]. Available: http://papers.nips.cc/paper/7062-aunified-approach-to-interpreting-model-predictions.pdf.

Neekhara, P., Dolhansky, B., Bitton, J., Ferrer, C., 2021. Adversarial threats to deepfake detection: a practical perspective. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE Computer Society, Los Alamitos, CA, USA, pp. 923-932. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/CVPRW53098.2021.0010.

Petsiuk, V., Das, A., Saenko, K., 2018. RISE: randomized input sampling for explanation of black-box models. arXiv:abs/1806.07421.

Rössler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Niessner, M., 2019. FaceForensics++: learning to detect manipulated facial images. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), pp. 1-11. doi:10.1109/ICCV.2019.00009.

Ribeiro, M.T., Singh, S., Guestrin, C., 2016. "Why should I trust you?": explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). Association for Computing Machinery, New York, NY, USA, pp. 1135-1144. https://doi.org/10.1145/2939672.2939778.

Rieger, L., Hansen, L.K., 2020. IROF: a low resource evaluation metric for explanation methods. [Online]. Available: https://arxiv.org/abs/2003.08747.

Samek, W., Binder, A., Montavon, G., Lapuschkin, S., Muller, K.R., 2017. Evaluating the visualization of what a deep neural network has learned. IEEE Trans. Neural Netw. Learn. Syst. 28 (11), 2660-2673.

Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D., 2017. Grad-CAM: visual explanations from deep networks via gradient-based localization. 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, pp. 618-626. doi:10.1109/ICCV.2017.74.

Shiohara, K., Yamasaki, T., 2022. Detecting deepfakes with self-blended images. In: Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, pp. 18699-18708. doi:10.1109/CVPR52688.2022.01816.

Storn, R., Price, K., 1997. Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces. J. Glob. Optim. 11 (4), 341.

Sundararajan, M., Taly, A., Yan, Q., 2017. Axiomatic attribution for deep networks. In: Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML '17. JMLR.org, pp. 3319-3328.

Terms to use Celeb-DF. https://docs.google.com/forms/d/e/1FAIpQLScoXint8ndZXyJi2Rcy4MvDHkkZLyBFKN43lTeyiG88wrG0rA/viewform (accessed Nov. 10, 2023).

Wierstra, D., Schaul, T., Glasmachers, T., Sun, Y., Peters, J., Schmidhuber, J., 2014. Natural evolution strategies. J. Mach. Learn. Res. 15 (27), 949-980. [Online]. Available: http://jmlr.org/papers/v15/wierstra14a.html.

Xiang, T., Liu, H., Guo, S., Zhang, T., Liao, X., 2021. Local black-box adversarial attacks: a query efficient approach. arXiv:abs/2101.01032.

Xu, Y., Liang, J., Jia, G., Yang, Z., Zhang, Y., He, R., 2023. TALL: thumbnail layout for deepfake video detection. arXiv:abs/2307.07494.

Yeh, C.-K., Hsieh, C.-Y., Suggala, A.S., Inouye, D.I., Ravikumar, P., 2019. On the (in)fidelity and sensitivity of explanations. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. Curran Associates Inc., Red Hook, NY, USA, Article 984, pp. 10967-10978.

Zhang, J., Bargal, S.A., Lin, Z., Brandt, J., Shen, X., Sclaroff, S., 2018. Top-down neural attention by excitation backprop. Int. J. Comput. Vis. 126 (10), 1084-1102.

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A., 2016. Learning deep features for discriminative localization. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 2921-2929. doi:10.1109/CVPR.2016.319.

Balachandar Gowrisankar is a research engineer at the Cybersecurity Strategic Technology Centre, ST Engineering, Singapore. He received his Master of Computing degree in Infocomm Security from the National University of Singapore and his B.E. degree in Computer Engineering from the College of Engineering Guindy, India. His research interests include multimedia forensics and machine learning security.

Dr Vrizlynn L. L. Thing is the SVP, Head of Cybersecurity Strategic Technology Centre, at ST Engineering, where she provides thought leadership and oversees cybersecurity technology innovation. She has over two decades of cybersecurity and digital forensics R&D and programme management experience. She is also actively involved in national level initiatives where she contributes to standards development and technology innovation roadmaps shaping.