
DeepFeatureX Net: Deep Features eXtractors based Network for discriminating synthetic from real images

Orazio Pontorno[0009-0009-0381-9971], Luca Guarnera[0000-0001-8315-351X], and Sebastiano Battiato[0000-0001-6127-2470]

Department of Mathematics and Computer Science, University of Catania, Italy
[email protected], {luca.guarnera,sebastiano.battiato}@unict.it

Abstract. Deepfakes, synthetic images generated by deep learning algorithms, represent one of the biggest challenges in the field of Digital Forensics. The scientific community is working to develop approaches that can discriminate the origin of digital images (real or AI-generated). However, these methodologies face the challenge of generalization, that is, the ability to discern the nature of an image even if it is generated by an architecture not seen during training. This usually leads to a drop in performance. In this context, we propose a novel approach based on three blocks called Base Models, each of which is responsible for extracting the discriminative features of a specific image class (Diffusion Model-generated, GAN-generated, or real), as it is trained on a deliberately unbalanced dataset. The features extracted from each block are then concatenated and processed to discriminate the origin of the input image. Experimental results show that this approach not only demonstrates good robustness to JPEG compression but also outperforms state-of-the-art methods in several generalization tests. Code, models and dataset are available at https://github.com/opontorno/block-based_deepfake-detection.

Keywords: Deepfake Detection · Multimedia Forensics · Generative Models.

1 Introduction

Generative models have achieved a high degree of fidelity in content generation, producing increasingly realistic and convincing results. Thanks to the vast amount of data available today and the continuous development of complex architectures, such as Generative Adversarial Networks (GANs) [17] and Diffusion Models (DMs) [25,46], these models are able to produce images, text, sound and video with an astonishing quality that can hardly be distinguished from those created by human beings. This ability to generate high-fidelity content has opened up new opportunities in a wide range of fields, from art and entertainment to scientific research and multimedia content production. However,
along with their powerful creative capabilities, generative models also have sev-
eral negative aspects. One of the main problems is the possibility of abuse, as
such models can be used to generate fake or convincingly manipulated content,
fuelling the spread of misinformation and fraud [45,48]. Moreover, they can raise
ethical concerns regarding intellectual property and privacy [33], especially when
they are used to create content based on personal data without the consent of
the involved people. The proper and preventive detection of AI-generated con-
tent therefore becomes a critical priority to combat the spread of deepfakes and
maintain the integrity of online information.
The scientific community is striving to develop new and increasingly effective techniques and methods that can discern the nature (real or generated) of digital images. These techniques can be based on the analysis and processing of statistics extracted from images (e.g. analytical traces) or on deep learning engines. Among others, we recall the analysis of image frequencies, such as the Discrete Cosine Transform (DCT) and the Fourier Transform, which map image pixels from the spatial domain to the frequency domain, facilitating greater interpretability in the task of deepfake recognition [2,22]. Deep learning-based methodologies involve the construction of neural models that generally achieve better results than the previous techniques [1,16], but at the expense of lower generalization.
In this paper we propose a deep learning-based architecture that exploits three backbones, called "Base Models" (BM), trained and specialized for specific classification tasks, with special emphasis on DM-generated data, GAN-generated data, and real data. The fundamental concept is to exploit the inherent capabilities of these Base Models, each of which is dedicated to extracting the discriminative features that a specific generating architecture leaves behind during the image generation process. This approach aims at making the final model more resilient and robust to JPEG compression attacks, commonly employed by social networks, and more effective in generalizing the acquired knowledge. Focusing on the specific distinctive features associated with different image generation technologies allows the model to develop a deeper and more focused understanding of the peculiarities of each image category, thus improving its ability to distinguish between genuine and synthetic images in real and variable contexts. With this work we address a difficulty often encountered in the state-of-the-art: generalizing the recognition capabilities acquired in the training phase both to images generated by AIs not belonging to the training dataset and to synthetic images from generating architectures other than those taken into consideration.
The main contributions of this paper are:
– A new approach for extracting main features from digital images using Base Models.
– A model capable of retaining its discriminative ability even under JPEG compression attacks.
The paper is structured as follows: Section 2 provides an overview of the main deepfake detection methods currently present in the state-of-the-art; Section 3 gives a detailed description of the dataset of images used to conduct the experiments; Section 4 presents the proposed architecture and the stages of the training method in detail; Section 5 reports the experimental results obtained during the testing phase; finally, Section 6 summarizes the main findings and outlines future directions of research.

2 Related Works

Most deepfake detection methods are based on intrinsic trace analysis to distinguish real content from synthetic content. The Expectation-Maximization algorithm was used in [19] to capture the correlation between pixels, resulting in a discriminative trace able to distinguish deepfake images from pristine ones. McCloskey et al. [38] showed that generative models create synthetic content with color channel curve statistics different from those of real data, resulting in another discriminative trace. In the frequency domain [21,37], researchers highlighted the possibility of identifying abnormal traces left by generative models, mainly analyzing features extracted from the DCT [3,9,15]. Liu et al. [35] proposed a method called Spatial-Phase Shallow Learning (SPSL) that combines the spatial image and the phase spectrum to capture artifacts from up-sampling on synthetic data, improving
deepfake detection. Corvi et al. [11] analyzed a large number of images gener-
ated by different families of generative models (GAN, DM, and VQ-GAN (Vector
Quantized Generative Adversarial Networks)) in the Fourier domain to discover
the most discriminative features between real and synthetic images. The exper-
iments showed that regular anomalous patterns are present in each category of architecture involved. Another category of detectors is that of deep neural network-
based approaches. Wang et al. [50] used a ResNet-50 model trained with images
generated by ProGAN [28] to differentiate real from synthesized images. Their
study demonstrated the model’s ability to generalize beyond ProGAN-generated
Deepfakes. Wang et al. [49] introduced FakeSpotter, a new approach that relies
on monitoring the behaviors of neurons (counting which and how many activate
on the input image) within a dedicated CNN to identify Deepfake-generated
faces. Many researchers have focused on investigating whether it is possible to detect images created by diffusion models. Corvi et al. [10] were
among the first to address this issue, exploring the difficulties in distinguishing
images generated by diffusion models from real ones and evaluating the suitabil-
ity of current detectors. Sha et al. [44] proposed DE-FAKE, a machine learning
classifier designed for detecting diffusion model-generated images across four
prominent text-to-image architectures. The authors then proposed a pioneering
study on the detection and attribution of fake images generated by diffusion
models, demonstrating the feasibility of distinguishing such images from real
ones and attributing them to the source models, and also discovering the in-
fluence of prompts on the authenticity of images. Recently, Guarnera et al. [20]
proposed a method based on the attribution of images generated by generative
adversarial networks (GANs) and diffusion models (DMs) through a multi-level
hierarchical strategy. At each level, a distinct and specific task is addressed: the first level (the most generic) allows discerning between real and AI-generated im-
ages (either created by GAN or DM architectures); the second level determines
whether the images come from GAN or DM technologies; and the third level
addresses the attribution of the specific model used to generate the images.
The limitations of these methods mainly concern experimental results obtained only under ideal conditions and, consequently, the almost total absence of generalization tests: the classification performance of most state-of-the-art methods drops drastically when tested on images generated by architectures never considered during the training procedure.

3 Dataset details

The dataset comprises a total of 72,334 images, distributed as follows: 19,334 real images collected from CelebA [36], FFHQ [31], and other sources [33,10]; 37,572 images generated by the GAN architectures GauGAN [40], BigGAN [4], ProGAN [29], StarGAN [6], AttGAN [24], GDWCT [5], CycleGAN [54], StyleGAN [31], StyleGAN2 [32], and StyleGAN3 [30]; and 15,423 images produced by the DM architectures DALL-E MINI¹, DALL-E 2 [41], Latent Diffusion [42], and Stable Diffusion² (Figure 1 (a) shows some examples of the used images). All images are in PNG format.
Initially, the dataset was divided into three parts: a first 40% was used for training and validation of the Base Models (refer to Section 4.1); another 40% was used for training and validation of the complete models (refer to Section 4.2); finally, the remaining 20% was used as the test set for both phases. Since our only goal is to discern the nature of the images, regardless of semantics, resolution, and size, the images were collected with as much variety in these parameters as possible. The objective is to underscore the dataset's varied composition, incorporating images from different sources, each marked by unique tasks and approaches to image creation.
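A minimal sketch of this 40/40/20 split is shown below; the directory layout and the random seed are illustrative assumptions, not the authors' exact procedure.

```python
import glob
import random

random.seed(0)  # illustrative seed
paths = sorted(glob.glob("dataset/**/*.png", recursive=True))  # illustrative layout
random.shuffle(paths)

n = len(paths)
base_split = paths[:int(0.4 * n)]               # 40%: Base Models training/validation
full_split = paths[int(0.4 * n):int(0.8 * n)]   # 40%: complete model training/validation
test_split = paths[int(0.8 * n):]               # 20%: test set shared by both phases
```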

4 Proposed Method

The model proposed in this paper exploits three CNN backbones as feature extractors, whose outputs are concatenated and processed to solve the classification task. The key idea of the model lies in the training of the three backbones, each of which is trained using a deliberately unbalanced dataset of images (as detailed below). The purpose of this procedure is to force each backbone to focus on finding the discriminative features, left by each type of generative model during the generation phase, contained in the images belonging to a specific class (real, GAN-generated, DM-generated). We give the name "Base Model" to a backbone trained on such a highly unbalanced dataset and later used as a feature extractor in the complete model.
¹ github.com/borisdayma/dalle-mini
² github.com/CompVis/stable-diffusion
Fig. 1. Entire pipeline of the proposed method. (a) shows the process of dividing the training dataset into three unbalanced subsets, each with respect to a specific class (DM, GAN, real), used for training a specific Base Model. (b) illustrates the architecture of the final model, which takes the three Base Models ϕ_c trained in the previous phase with frozen weights, and uses them to extract the features ϕ_c(I) from a digital image I, where c ∈ C = {DM, GAN, REAL}. These are then concatenated in the channel dimension, ϕ(I) = ϕ_DM(I) ⊕ ϕ_GAN(I) ⊕ ϕ_REAL(I), and processed to solve the classification task.

Figure 1 shows the entire pipeline of the proposed method.

4.1 Training of Base Models


As mentioned above, each of the three Base Models was trained using a subset of the training dataset. In particular, three subsets are extracted from the original training set, each deliberately unbalanced towards one of the classes following a 90:10 ratio. In this training phase, a standard pre-trained Convolutional Neural Network (CNN) is adapted by performing a binary classification between the predominant class and the one named 'others', composed of images taken randomly from the other two remaining classes.
Backbone | DM Base Model: Acc Rec Pre F1 | GAN Base Model: Acc Rec Pre F1 | REAL Base Model: Acc Rec Pre F1
DenseNet 121 76.34 99.00 47.60 64.29 92.34 99.25 87.64 93.08 73.17 99.08 49.73 66.22
DenseNet 161 83.96 98.64 57.40 72.57 94.62 99.45 91.03 95.05 77.39 99.21 54.04 69.97
DenseNet 169 78.83 99.16 50.40 66.83 93.86 99.32 89.91 94.38 71.20 99.29 47.95 64.67
DenseNet 201 79.11 98.90 50.75 67.08 92.61 99.37 87.96 93.32 72.97 99.29 49.54 66.10
EfficientNet b0 85.74 97.09 60.50 74.55 88.77 98.81 82.87 90.14 77.91 97.56 54.70 70.10
EfficientNet b4 78.00 97.47 49.43 65.59 87.14 98.74 80.78 88.86 74.31 97.27 50.84 66.78
ResNet 18 76.64 98.03 47.90 64.36 84.28 99.06 77.16 86.75 65.14 99.03 43.17 60.13
ResNet 34 77.29 98.12 48.62 65.02 83.33 99.40 75.93 86.10 66.75 98.82 44.33 61.21
ResNet 50 78.14 98.61 49.59 66.00 90.34 99.14 84.82 91.42 70.79 99.13 47.59 64.31
ResNet 101 77.20 99.00 48.53 65.13 90.85 98.87 85.71 91.82 69.10 99.08 46.17 62.99
ResNet 152 76.48 99.00 47.75 64.43 93.42 99.32 89.23 94.00 70.59 98.92 47.41 64.10
ResNeXt 101 75.38 98.58 46.60 63.28 93.82 98.40 90.53 94.30 66.26 98.66 43.96 60.82
ViT b16 76.53 98.58 47.80 64.38 83.58 99.69 76.11 86.32 68.45 99.24 45.67 62.55
ViT b32 74.05 96.92 45.20 61.65 80.83 98.99 73.39 84.29 60.44 99.61 40.13 57.21
Table 1. Percentage values of the metrics Accuracy, Recall, Precision, and F1 Score obtained by testing each Base Model on the binary classification between the predominant class and the class 'others'.

Figure 1 (a) summarizes the
overall process. Once the training is completed, the three Base Models are frozen and the last linear layer (delegated to the binary classification) is removed, so that the feature maps of the last convolutional layer are returned as output. Our hypothesis, verified during the test phase, is that, following this training procedure, each backbone focuses on searching for the main characteristics of its predominant class in order to recognize their presence or absence during inference. In conducting the experiments, the following CNNs were used as
backbone: DenseNet 121, DenseNet 161, DenseNet 169, DenseNet 201 [26], Effi-
cientNet b0, EfficientNet b4 [47], ResNet 18, ResNet 34, ResNet 50, ResNet 101,
ResNet 152 [23], ResNeXt 101 [53], ViT b16, ViT b32 [13]. All backbones have
been pretrained on the ImageNet [12] dataset. All experiments were conducted on an NVIDIA RTX A6000 GPU. The parameters of each model were selected by choosing those that obtained the minimum loss value during model validation.
Table 1 shows the accuracy, recall, precision, and F1 score values obtained by evaluating all backbones on the test images. From the results we can observe that this training led to maximizing the recall value, which indicates that each classification model is able to correctly identify almost all the positive examples of its class of interest (the predominant one). In other words, the model tends to minimize false negatives; that is, it very rarely misclassifies a positive example as negative. This confirms our initial hypothesis that, following the training procedure described above, the Base Models are able to capture the discriminative features of each generating architecture.
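As a concrete illustration, the sketch below (PyTorch/torchvision, with DenseNet 161 as an example backbone) shows how a Base Model can be adapted for the binary task and then frozen as a feature extractor; it is a minimal sketch under these assumptions, not the authors' exact training code.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_base_model():
    # Pretrained backbone adapted to the binary task: predominant class vs. 'others'.
    backbone = models.densenet161(weights=models.DenseNet161_Weights.IMAGENET1K_V1)
    backbone.classifier = nn.Linear(backbone.classifier.in_features, 2)
    return backbone

def freeze_as_extractor(trained_model):
    # Drop the binary classifier and freeze all weights; the Base Model now returns
    # its final feature representation (here DenseNet's pooled features, standing in
    # for the feature maps of the last convolutional layer described in the text).
    trained_model.classifier = nn.Identity()
    for p in trained_model.parameters():
        p.requires_grad = False
    return trained_model.eval()

# Usage: train build_base_model() on the 90:10 unbalanced subset, then
# extractor = freeze_as_extractor(trained_model)
# phi_c = extractor(torch.randn(1, 3, 224, 224))   # feature vector phi_c(I)
```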
4.2 Overall architecture


The final model uses the three Base Models trained as described in Section 4.1 as feature extractors; at this stage they are no longer trained, as their weights have been frozen. Each Base Model receives the same digital image as input and is tasked with identifying and extracting the discriminative features of its class. The extracted features are then concatenated to obtain a three-channel tensor, which is processed by a custom CNN consisting of a sequence of five 1D convolutions with kernel sizes of 7, 5, 3, 3, and 3 respectively, all with padding 1 and stride 1, followed by a Global Average Pooling operation and a three-node linear output classifier. Figure 1 (b) presents both the entire pipeline and a graphical representation of the model.
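The following minimal PyTorch sketch illustrates this classification head; the hidden channel widths and the flattened feature length L are assumptions for illustration, since the paper does not specify them.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, num_classes: int = 3):
        super().__init__()
        # Five 1D convolutions with kernel sizes 7, 5, 3, 3, 3 (padding 1, stride 1).
        kernels = [7, 5, 3, 3, 3]
        channels = [3, 16, 32, 64, 64, 64]   # assumed widths; the input has 3 channels
        layers = []
        for k, c_in, c_out in zip(kernels, channels[:-1], channels[1:]):
            layers += [nn.Conv1d(c_in, c_out, kernel_size=k, padding=1, stride=1),
                       nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool1d(1)   # Global Average Pooling
        self.fc = nn.Linear(channels[-1], num_classes)

    def forward(self, feat_dm, feat_gan, feat_real):
        # Each phi_c(I) is a flattened feature vector from a frozen Base Model;
        # stacking them yields the three-channel tensor phi(I).
        x = torch.stack([feat_dm, feat_gan, feat_real], dim=1)  # [B, 3, L]
        x = self.convs(x)
        x = self.pool(x).squeeze(-1)
        return self.fc(x)

# Example with an assumed feature length L = 1024:
head = ClassificationHead()
f = torch.randn(2, 1024)
logits = head(f, f, f)   # [2, 3] -> DM / GAN / real
```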
For the training phase of the complete models we used the Cross Entropy Loss weighted with respect to the frequency of each class in the dataset (Equation 1). This choice was necessary to prevent the models from being overly influenced by the class imbalance present in the dataset.
Weighted Cross Entropy Loss = −(1/N) Σ_{i=1}^{N} Σ_{c∈C} w_c y_{i,c} log(ŷ_{i,c})    (1)

where N is the number of samples, C = {GAN, DM, REAL} is the set of classes, y_{i,c} is the ground-truth label for sample i and class c, ŷ_{i,c} is the predicted probability for sample i and class c, and w_c is the weight for class c. In particular:

w_c = 1 / (#images of class c)    ∀c ∈ C.
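A minimal PyTorch sketch of this weighted loss is given below; the class counts are the dataset totals from Section 3 and are only illustrative (in practice they would be computed on the training split), and dividing a summed loss by N reproduces the normalization of Equation (1).

```python
import torch
import torch.nn.functional as F

# w_c = 1 / (#images of class c); counts taken from Section 3 for illustration.
class_counts = torch.tensor([37572.0, 15423.0, 19334.0])  # assumed order: GAN, DM, REAL
class_weights = 1.0 / class_counts

logits = torch.randn(8, 3)           # model outputs for a batch of 8 images
labels = torch.randint(0, 3, (8,))   # ground-truth class indices

# reduction='sum' followed by division by N matches Equation (1).
loss = F.cross_entropy(logits, labels, weight=class_weights, reduction="sum") / logits.size(0)
```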

5 Experimental results
Two types of experiments were conducted: inference and robustness tests to assess the effectiveness and robustness of the classification models, and a comparison with the state-of-the-art in generalization tests.

5.1 Inference and robustness tests


In this first testing phase, we tested the proposed architecture by varying the
backbone of the Base Model in order to choose the best model. For this testing
phase, 20% of the images of the original dataset (Sec. 3) were used. Furthermore,
in order to make the accuracy metric more meaningful, both validation and
testing datasets were balanced so as to have the same number of images for each
class (DM-generated, GAN-generated, real).
Initially, the models were tested using just raw images. Subsequently, the
test images were compressed into JPEG format using different Quality Factors
(QF): 90, 80, 70, 60, 50. On these new compressed image sets, we again tested the models, analyzing their robustness to JPEG compression. Figure 2 shows
Fig. 2. Image variation as the JPEG compression Quality Factor decreases. On the left, the raw image; in the center, the JPEG-compressed image at Quality Factor 80; on the right, the image at QF 50. Image generated by StyleGAN2 [32].

Multi-class - Accuracy / F1-Score (%)
Base Model backbone | RAW | JPEG-Compression: QF90 | QF80 | QF70 | QF60 | QF50
DenseNet 121 91.39/91.35 89.34/89.26 87.19/87.11 84.31/84.26 81.17/81.24 79.19/79.33
DenseNet 161 93.30/93.21 91.72/92.07 88.21/88.04 83.88/84.06 77.78/78.21 74.06/74.53
DenseNet 169 91.11/91.14 89.36/89.36 86.99/87.04 82.99/83.17 77.92/78.31 75.46/75.92
DenseNet 201 92.31/92.30 90.70/90.67 87.78/87.76 84.76/84.76 81.98/82.01 79.78/79.87
EfficientNet b0 88.62/88.66 86.19/86.26 84.74/84.83 82.95/83.05 80.45/80.56 78.98/79.11
EfficientNet b4 84.39/84.51 83.00/83.16 81.59/81.75 80.28/80.45 78.48/78.66 78.42/78.61
ResNet 18 85.22/85.23 84.64/84.64 83.61/83.61 82.72/82.71 80.65/80.68 79.77/79.77
ResNet 34 85.98/86.00 85.37/85.37 84.25/84.23 82.82/82.75 80.37/80.30 79.12/79.02
ResNet 50 89.63/89.62 88.01/87.99 86.07/86.06 83.94/83.95 80.90/80.96 78.85/78.93
ResNet 101 90.96/90.97 89.85/89.84 87.90/87.88 86.53/86.52 83.27/84.32 82.97/83.05
ResNet 152 91.10/91.11 89.57/89.57 87.53/87.54 85.26/85.30 82.18/82.26 80.63/80.72
ResNeXt 101 89.26/89.28 87.89/87.91 87.02/87.04 85.60/85.67 83.71/84.83 82.76/82.61
ViT b16 88.09/88.11 86.96/87.03 85.91/85.99 84.33/84.45 82.80/82.69 79.48/79.69
ViT b32 81.25/81.24 81.23/81.22 81.12/81.07 81.15/81.11 80.73/80.68 80.14/80.08
Table 2. Percentage values of Accuracy and F1 Score obtained during the testing phase in three-class classification (GAN vs. DM vs. real) as the backbone varies.

the main differences between images with and without JPEG compression. It can be observed that as the QF decreases, high-frequency components are removed and JPEG block artifacts become visible. This operation could lead to the removal of those (potentially) discriminative features identified by the various classifiers.
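As an illustration of this robustness test, the snippet below re-encodes a raw test image at the Quality Factors listed above using Pillow; the file names are placeholders.

```python
from PIL import Image

quality_factors = [90, 80, 70, 60, 50]
img = Image.open("raw_image.png").convert("RGB")   # placeholder path
for qf in quality_factors:
    # Re-encode the raw image as JPEG at the given Quality Factor.
    img.save(f"compressed_qf{qf}.jpg", format="JPEG", quality=qf)
```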
Table 2 shows the performance for both tests in the three-class classification. From the results obtained, we can see that, regardless of the backbone used in the Base Models, this approach in general succeeds in achieving accuracy values in excess of 85%. In particular, the use of a model belonging to the DenseNet family as a backbone gives a boost to the overall performance of the models.
To gain a better understanding of the model's ability to distinguish between real and AI-generated images (from GAN or DM), we recalculated the previous performance values in a binary setting: the predicted classes GAN and DM were both considered as deepfake, while the predictions of the real class were kept unchanged, and the metrics were then recomputed.
Binary - Accuracy / F1-Score (%)
Base Model backbone | RAW | JPEG-Compression: QF90 | QF80 | QF70 | QF60 | QF50
DenseNet 121 92.50/92.23 91.64/91.35 90.67/90.48 87.44/88.55 85.86/86.44 83.85/84.89
DenseNet 161 93.83/93.67 92.74/92.56 90.17/90.16 88.34/88.84 82.75/84.28 79.65/82.15
DenseNet 169 92.10/91.92 90.95/90.69 89.01/88.92 86.21/86.58 82.64/83.79 80.48/82.12
DenseNet 201 93.30/93.03 91.30/90.97 89.45/89.06 88.09/87.81 85.78/85.75 83.78/83.89
EfficientNet b0 89.36/89.01 87.72/87.35 85.86/85.48 84.96/84.57 82.69/82.59 81.57/81.39
EfficientNet b4 86.67/86.56 86.03/86.11 84.58/84.91 84.32/84.67 82.14/82.73 82.26/82.74
ResNet 18 85.62/84.89 85.71/84.96 84.35/83.49 84.05/82.99 81.92/80.84 81.55/80.21
ResNet 34 87.00/86.37 86.99/86.28 85.25/84.28 83.56/82.16 81.58/79.89 79.57/77.48
ResNet 50 90.27/89.83 89.10/88.53 87.12/86.41 85.48/84.71 82.92/82.21 80.24/79.24
ResNet 101 91.59/91.37 91.01/90.68 89.10/88.60 87.72/87.17 85.33/84.89 84.02/83.63
ResNet 152 91.71/91.38 90.44/89.93 88.57/88.03 86.79/86.21 84.60/84.16 82.72/82.20
ResNeXt 101 90.86/90.67 90.50/90.25 89.12/88.93 88.09/87.93 86.54/86.61 86.11/86.13
ViT b16 89.49/89.27 89.75/89.68 88.40/88.43 87.19/87.38 85.36/85.71 82.25/82.59
ViT b32 82.31/81.15 82.63/81.45 82.55/81.38 82.24/81.09 82.33/80.96 81.66/80.20
Table 3. Percentage values of Accuracy and F1 Score obtained during the testing phase in binary classification (deepfake vs. real) as the backbone varies. Values were obtained considering DM and GAN predictions as deepfake.

Table 3 shows the metrics obtained from this recalculation. Looking at the new values, we can see that accuracy increases in both the inference test and, above all, the JPEG compression robustness test. From the obtained results, DenseNet 161 stands out as the best Base Model backbone, as it leads to the best classification results and demonstrates good robustness to JPEG compression: despite the fact that the model was trained using only raw images, the accuracy and F1 score values tend not to decrease drastically as the compression QF decreases.
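A minimal sketch of this recalculation is shown below; the integer label encoding is an assumption for illustration.

```python
import numpy as np

# assumed encoding: 0 = DM-generated, 1 = GAN-generated, 2 = real
preds = np.array([0, 1, 2, 1, 0, 2])    # three-class predictions (illustrative)
labels = np.array([1, 1, 2, 0, 2, 2])   # ground-truth classes (illustrative)

# Map GAN and DM predictions to "deepfake" (1), keep real as 0, then recompute accuracy.
binary_preds = (preds != 2).astype(int)
binary_labels = (labels != 2).astype(int)
binary_accuracy = (binary_preds == binary_labels).mean()
print(f"binary accuracy: {binary_accuracy:.2%}")
```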

5.2 Comparison with S.O.T.A. in generalization

In this section, we examine the generalization capacity of our approach. The selected final model uses DenseNet 121 as the backbone of the Base Models, chosen for its excellent performance in the tests described in Section 5.1. Initially, we conducted an analysis of the baselines: the models used as backbones of the Base Models were trained under the same conditions as our method and subsequently evaluated in terms of generalization. This process allowed us to compare the effectiveness of our model with that of standard architectures. Next, we compared our model with state-of-the-art models trained on similar tasks, namely the distinction between AI-generated and real images.
In order to assess the generalization capability of the models, we used different test sets. These test sets were divided into two categories: images generated by generative models previously observed during the training phase, but with different semantic variations and initial conditions, factors that often complicate classification, and images generated by models not included in the training phase. In addition, we conducted further tests distinguishing between images generated exclusively by GAN technologies, images generated exclusively by DM technologies, and images generated by both technologies. We use the following notation: T^i_* denotes a dataset containing images generated by models already considered in the training phase; T^o_* a dataset containing images generated by architectures not considered in the training phase; T^{i/o}_* a dataset containing images generated by both types of architectures; T^*_G a dataset containing only images generated by GANs as fakes; T^*_D a dataset containing only images generated by DMs as fakes; and T^*_{D/G} a dataset containing images generated by both GAN and DM architectures. Explicitly:

– T^i_G contains a sample of 2000 fake images divided equally among images generated by GauGAN [40], BigGAN [4], ProGAN [29], and CycleGAN [54].
– T^o_G contains a sample of 2000 fake images divided equally among images generated by Generative Adversarial Transformers (GANformer) [27], Denoising DiffusionGANs [52], DiffusionGANs [51], ProjectedGANs [43], and Taming Transformers [14].
– T^{i/o}_G contains a sample of 2000 fake images divided equally among images generated by the same generative models of T^i_G and T^o_G.
– T^i_D contains a sample of 2000 fake images divided equally between images generated by Diffusion and images taken randomly from the COCOFake dataset [8], generated by Stable Diffusion³.
– T^o_D contains a sample of 2000 fake images divided equally among images generated by the Vector Quantized Diffusion Model (VQ Diffusion) [18], the Denoising Diffusion Probabilistic Model (DDPM) [25], and images taken randomly from the COCOGlide dataset, generated by Glide [39].
– T^{i/o}_D contains a sample of 2000 fake images divided equally among images generated by the same generative models of T^i_D and T^o_D.
– T^i_{D/G} contains a sample of 2000 fake images divided equally among images generated by the same generative models of T^i_D and T^i_G.
– T^o_{D/G} contains a sample of 2000 fake images divided equally among images generated by the same generative models of T^o_D and T^o_G.
– T^{i/o}_{D/G} contains a sample of 2000 fake images divided equally among images generated by all the same previous generative models.

We also specify that each of the datasets listed above contains a sample of 2000 real images taken randomly in equal numbers from the AFHQ [7], ImageNet [12] and COCO [34] datasets.
Table 4 shows the percentage accuracy values obtained by the various models in the different contexts T. When reading the results, it is important to consider that all images in the test sets are compressed in JPEG format which, given that our model was trained using only raw images, may have lowered its performance, as demonstrated in Section 5.1. The state-of-the-art approaches used for comparison are [1,16,20,50]. This choice is due to the fact that almost all these methods were trained using generative architectures considered
³ github.com/CompVis/stable-diffusion
Model | T^i_G | T^o_G | T^{i/o}_G | T^i_D | T^o_D | T^{i/o}_D | T^i_{G/D} | T^o_{G/D} | T^{i/o}_{G/D}
Baselines:
DenseNet 121 56.57 74.02 66.96 72.07 48.20 58.68 60.58 64.23 63.13
DenseNet 161 53.56 73.97 66.21 69.19 48.12 56.93 58.55 64.51 61.98
DenseNet 169 52.03 66.73 61.10 65.25 43.98 52.61 56.80 57.93 57.51
DenseNet 201 55.93 70.03 64.14 67.39 48.70 56.10 60.71 62.35 62.72
EfficientNet b0 49.58 71.81 62.99 69.22 45.12 55.19 55.78 61.67 59.85
EfficientNet b4 50.37 68.64 61.36 69.42 46.60 55.92 56.75 60.98 60.00
ResNet 18 63.77 68.89 66.65 66.23 55.03 58.30 63.70 62.30 62.82
ResNet 34 53.87 70.03 63.25 65.48 48.55 54.76 57.31 61.84 60.98
ResNet 50 59.58 73.13 67.95 67.89 53.38 58.50 62.10 65.01 63.90
ResNet 101 60.35 68.08 65.40 72.12 56.45 59.68 64.21 63.62 63.20
ResNet 152 53.94 68.61 61.84 63.90 50.15 55.27 55.27 61.51 60.00
ResNeXt 101 54.35 67.42 62.86 74.18 50.23 59.57 61.19 61.06 61.87
ViT b16 65.81 73.31 69.46 68.59 52.69 58.30 66.62 64.53 62.72
ViT b32 54.07 61.91 58.87 60.34 41.87 47.92 56.22 54.25 57.38
SOTA:
Gandhi2020 [16] 52.30 50.79 51.71 49.91 50.86 50.34 51.54 50.57 51.06
Wang2020 [50] 62.41 53.18 57.87 50.13 50.93 50.44 58.26 52.14 54.86
Arshed2024 [1] 47.46 47.65 48.54 52.69 50.00 51.04 49.89 48.94 52.20
Guarnera2024 [20] 55.00 55.63 56.23 54.11 45.98 49.97 56.07 52.21 57.17
Our 64.74 72.47 69.89 68.09 60.82 59.96 66.06 65.02 64.39
Table 4. Percentage accuracy values obtained in the generalization phase. The tests distinguished between images generated by architectures seen in the training phase, but with different initial conditions (superscript i), images generated by architectures never seen before (superscript o), and mixed (superscript i/o). Furthermore, the tests distinguished between using only images generated by GANs (subscript G), only images generated by DMs (subscript D), and mixed (subscript G/D).

in our experiments. Wang et al. [50] and Gandhi et al. [16] used only images generated by GAN models and represent some of the best approaches in the literature for the deepfake detection task (in the specific domain of GAN-generated images). Despite this, the experimental results reported in Table 4 show that these approaches achieve classification results similar to those of methods trained considering images generated by DM engines as well. However, these results show little ability to generalize. Our approach generalizes better, outperforming such state-of-the-art methods by over 10% in classification accuracy in every context. Arshed et al. [1] and Guarnera et al. [20] used a single architecture to extract features from images generated by GAN and DM engines. Their main limitation compared to our approach regards the feature-extraction strategy, since we use three dedicated models to better extract the most discriminative characteristics of the input data for each involved image category (GAN-generated, DM-generated, real).
In summary, from the obtained results (Table 4), our approach on average generalizes better in most of the performed tests. Although the baselines generalize well when the dataset is composed of deepfake images generated by a single technology, they encounter difficulties when the dataset contains images from multiple generating architectures, both seen and unseen (column T^{i/o}_{G/D}). Moreover, the proposed model outperforms all other state-of-the-art methods, confirming its good generalization ability in different contexts.
6 Conclusion and future works


The challenge of generalization emerges as a major obstacle in the context of deepfake detection. The ability to accurately distinguish between AI-generated and real images is crucial to monitor the ongoing development of generative models. In this article we proposed a new approach that ensures robustness to JPEG compression attacks, typically applied by social networks, and contributes a small step forward in solving the problem of detector generalization. The use of three different blocks, each specialized in the extraction of the discriminative features of a specific image category (GAN-generated, DM-generated, real), allows our approach to develop a deeper understanding of the intrinsic characteristics that separate real from synthetic images. This approach aims to provide a solid basis for the accurate identification of images even in the presence of the variations and complexity introduced by different image generation techniques. This is the starting point for our future research: we want to strengthen the capabilities of the three discriminative feature extractors, analyze their outputs spatially, and design new high-performance and structure-independent architectures.

Acknowledgements. Orazio Pontorno is a PhD candidate enrolled in the National PhD in Artificial Intelligence, XXXIX cycle, course on Health and Life Sciences, organized by Università Campus Bio-Medico di Roma. This research is supported by Azione IV.4 - "Dottorati e contratti di ricerca su tematiche dell'innovazione" of the new Axis IV of the PON Ricerca e Innovazione 2014-2020 "Istruzione e ricerca per il recupero - REACT-EU" - CUP: E65F21002580005.

References
1. Arshed, M.A., Mumtaz, S., Ibrahim, M., Dewi, C., Tanveer, M., Ahmed, S.: Mul-
ticlass AI-Generated Deepfake Face Detection Using Patch-Wise Deep Learning
Model. Computers 13(1), 31 (2024)
2. Asnani, V., Yin, X., Hassner, T., Liu, X.: Reverse Engineering of Generative Mod-
els: Inferring Model Hyperparameters from Generated Images. IEEE Transactions
on Pattern Analysis and Machine Intelligence (2023)
3. Bergmann, S., Moussa, D., Brand, F., Kaup, A., Riess, C.: Forensic analysis of AI-
compression traces in spatial and frequency domain. Pattern Recognition Letters
(2024)
4. Brock, A., Donahue, J., Simonyan, K.: Large Scale GAN Training for High Fidelity
Natural Image Synthesis. In: International Conference on Learning Representations
(2018)
5. Cho, W., Choi, S., Park, D.K., Shin, I., Choo, J.: Image-To-Image Translation via
Group-Wise Deep Whitening-and-Coloring Transformation. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10639–
10647 (2019)
6. Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: StarGAN: Unified Gen-
erative Adversarial Networks for Multi-Domain Image-to-Image Translation. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni-
tion. pp. 8789–8797 (2018)
7. Choi, Y., Uh, Y., Yoo, J., Ha, J.W.: StarGAN v2: Diverse Image Synthesis for
Multiple Domains. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. pp. 8188–8197 (2020)
8. Cocchi, F., Baraldi, L., Poppi, S., Cornia, M., Cucchiara, R.: Unveiling the Impact
of Image Transformations on Deepfake Detection: An Experimental Analysis. In:
International Conference on Image Analysis and Processing. pp. 345–356. Springer
(2023)
9. Concas, S., Perelli, G., Marcialis, G.L., Puglisi, G.: Tensor-Based Deepfake Detec-
tion In Scaled And Compressed Images. In: 2022 IEEE International Conference
on Image Processing (ICIP). pp. 3121–3125. IEEE (2022)
10. Corvi, R., Cozzolino, D., Zingarini, G., Poggi, G., Nagano, K., Verdoliva, L.: On the
Detection of Synthetic Images Generated by Diffusion Models. In: IEEE Interna-
tional Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1–5.
IEEE (2023)
11. Corvi, R., Cozzolino, D., Poggi, G., Nagano, K., Verdoliva, L.: Intriguing Properties
of Synthetic Images: from Generative Adversarial Networks to Diffusion Models.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. pp. 973–982 (2023)
12. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255 (2009). https://doi.org/10.1109/CVPR.2009.5206848
13. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner,
T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An Image is Worth
16x16 Words: Transformers for Image Recognition at Scale. In: International Con-
ference on Learning Representations (2020)
14. Esser, P., Rombach, R., Ommer, B.: Taming Transformers for High-Resolution Im-
age Synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition. pp. 12873–12883 (2021)
15. Frank, J., Eisenhofer, T., Schönherr, L., Fischer, A., Kolossa, D., Holz, T.: Lever-
aging Frequency Analysis for Deep Fake Image Recognition. In: Proceedings of the
37th International Conference on Machine Learning, ICML. pp. 3247–3258. PMLR
(2020)
16. Gandhi, A., Jain, S.: Adversarial Perturbations Fool Deepfake Detectors. In: 2020
International Joint Conference on Neural Networks (IJCNN). pp. 1–8. IEEE (2020)
17. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair,
S., Courville, A., Bengio, Y.: Generative Adversarial Nets. Advances in Neural
Information Processing Systems 27 (2014)
18. Gu, S., Chen, D., Bao, J., Wen, F., Zhang, B., Chen, D., Yuan, L., Guo, B.:
Vector Quantized Diffusion Model for Text-to-Image Synthesis. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp.
10696–10706 (2022)
19. Guarnera, L., Giudice, O., Battiato, S.: Fighting Deepfake by Exposing the Con-
volutional Traces on Images. IEEE Access 8, 165085–165098 (2020)
20. Guarnera, L., Giudice, O., Battiato, S.: Mastering Deepfake Detection: A Cutting-
Edge Approach to Distinguish GAN and Diffusion-Model Images. ACM Trans-
actions on Multimedia Computing, Communications and Applications (2024).
https://doi.org/10.1145/3652027
21. Guarnera, L., Giudice, O., Nastasi, C., Battiato, S.: Preliminary Forensics Analy-
sis of Deepfake Images. In: 2020 AEIT International Annual Conference (AEIT).
pp. 1–6. IEEE (2020). https://doi.org/10.23919/AEIT50178.2020.9241108
22. Guarnera, L., Giudice, O., Nießner, M., Battiato, S.: On the Exploitation of Deep-
fake Model Recognition. In: Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition. pp. 61–70 (2022)
23. He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recogni-
tion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. pp. 770–778 (2016)
24. He, Z., Zuo, W., Kan, M., Shan, S., Chen, X.: AttGAN: Facial Attribute Editing by Only Changing What You Want. IEEE Transactions on Image Processing 28(11), 5464–5478 (2019)
25. Ho, J., Jain, A., Abbeel, P.: Denoising Diffusion Probabilistic Models. Advances in
Neural Information Processing Systems 33, 6840–6851 (2020)
26. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely Connected
Convolutional Networks. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. pp. 4700–4708 (2017)
27. Hudson, D.A., Zitnick, L.: Generative Adversarial Transformers. In: International
Conference on Machine Learning. pp. 4487–4499. PMLR (2021)
28. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive Growing of GANs for Im-
proved Quality, Stability, and Variation. In: International Conference on Learning
Representations (ICLR) 2018 (2018)
29. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive Growing of GANs for Im-
proved Quality, Stability, and Variation. In: International Conference on Learning
Representations (2018)
30. Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J., Lehtinen, J., Aila,
T.: Alias-Free Generative Adversarial Networks. Advances in Neural Information
Processing Systems 34, 852–863 (2021)
31. Karras, T., Laine, S., Aila, T.: A Style-Based Generator Architecture for Gen-
erative Adversarial Networks. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. pp. 4401–4410 (2019)
32. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing
and Improving the Image Quality of StyleGAN. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. pp. 8110–8119 (2020)
33. Leotta, R., Giudice, O., Guarnera, L., Battiato, S.: Not with My Name! Inferring
Artists’ Names of Input Strings Employed by Diffusion Models. In: International
Conference on Image Analysis and Processing. pp. 364–375. Springer (2023)
34. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár,
P., Zitnick, C.L.: Microsoft Coco: Common Objects in Context. In: Computer
Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September
6-12, 2014, Proceedings, Part V 13. pp. 740–755. Springer (2014)
35. Liu, H., Li, X., Zhou, W., Chen, Y., He, Y., Xue, H., Zhang, W., Yu, N.: Spatial-
Phase Shallow Learning: Rethinking Face Forgery Detection in Frequency Domain.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. pp. 772–781 (2021)
36. Liu, Z., Luo, P., Wang, X., Tang, X.: Deep Learning Face Attributes in the Wild. In:
Proceedings of International Conference on Computer Vision (ICCV) (December
2015)
37. Marra, F., Gragnaniello, D., Verdoliva, L., Poggi, G.: Do GANs Leave Artificial
Fingerprints? 2019 IEEE Conference on Multimedia Information Processing and
Retrieval (MIPR) pp. 506–511 (2019)
38. McCloskey, S., Albright, M.: Detecting GAN-Generated Imagery Using Saturation
Cues. In: 2019 IEEE International Conference on Image Processing (ICIP). pp.
4584–4588. IEEE (2019)
39. Nichol, A.Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., Mcgrew, B.,
Sutskever, I., Chen, M.: GLIDE: Towards Photorealistic Image Generation and
Editing with Text-Guided Diffusion Models. In: International Conference on Ma-
chine Learning. pp. 16784–16804. PMLR (2022)
40. Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y.: GauGAN: Semantic Image Synthesis
with Spatially Adaptive Normalization. In: ACM SIGGRAPH 2019 Real-Time
Live! pp. 1–1 (2019)
41. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv preprint arXiv:2204.06125 (2022)
42. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-Resolution
Image Synthesis with Latent Diffusion Models. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)
43. Sauer, A., Chitta, K., Müller, J., Geiger, A.: Projected GANs Converge Faster.
Advances in Neural Information Processing Systems 34, 17480–17492 (2021)
44. Sha, Z., Li, Z., Yu, N., Zhang, Y.: De-fake: Detection and Attribution of Fake
Images Generated by Text-to-Image Generation Models. In: Proceedings of the
2023 ACM SIGSAC Conference on Computer and Communications Security. pp.
3418–3432 (2023)
45. Shan, S., Cryan, J., Wenger, E., Zheng, H., Hanocka, R., Zhao, B.Y.: Glaze: Protecting Artists from Style Mimicry by Text-to-Image Models. In: 32nd USENIX Security Symposium (USENIX Security 23). pp. 2187–2204 (2023)
46. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsuper-
vised Learning Using Nonequilibrium Thermodynamics. In: International Confer-
ence on Machine Learning. pp. 2256–2265. PMLR (2015)
47. Tan, M., Le, Q.: Efficientnet: Rethinking Model Scaling for Convolutional Neu-
ral Networks. In: International Conference on Machine Learning. pp. 6105–6114.
PMLR (2019)
48. Vyas, N., Kakade, S.M., Barak, B.: On Provable Copyright Protection for Genera-
tive Models. In: International Conference on Machine Learning. pp. 35277–35299.
PMLR (2023)
49. Wang, R., Juefei-Xu, F., Ma, L., Xie, X., Huang, Y., Wang, J., Liu, Y.: FakeSpot-
ter: a Simple Yet Robust Baseline for Spotting AI-Synthesized Fake Faces. In:
Proceedings of the Twenty-Ninth International Conference on International Joint
Conferences on Artificial Intelligence. pp. 3444–3451 (2021)
50. Wang, S.Y., Wang, O., Zhang, R., Owens, A., Efros, A.A.: CNN-Generated Im-
ages are Surprisingly Easy to Spot... for Now. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. pp. 8695–8704 (2020)
51. Wang, Z., Zheng, H., He, P., Chen, W., Zhou, M.: Diffusion-GAN: Training GANs
with Diffusion. arXiv preprint arXiv:2206.02262 (2022)
52. Xiao, Z., Kreis, K., Vahdat, A.: Tackling the Generative Learning Trilemma with
Denoising Diffusion GANs. arXiv preprint arXiv:2112.07804 (2021)
53. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated Residual Transfor-
mations for Deep Neural Networks. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. pp. 1492–1500 (2017)
54. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired Image-To-Image Translation
Using Cycle-Consistent Adversarial Networks. In: Proceedings of the IEEE Inter-
national Conference on Computer Vision. pp. 2223–2232 (2017)
