Tan Rethinking the Up-Sampling Operations in CNN-based Generative Network for Generalizable CVPR 2024 Paper

This CVPR paper presents a novel approach to deepfake detection by introducing the concept of Neighboring Pixel Relationships (NPR) to capture generalized artifacts from up-sampling operations in CNN-based generative networks. The study reveals that these up-sampling operators can produce detectable forgery artifacts, leading to a significant 12.8% improvement in detection performance over existing methods. The findings emphasize the importance of understanding generator architectures in enhancing the generalization capabilities of deepfake detection systems.


This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.

Rethinking the Up-Sampling Operations in CNN-based Generative Network for Generalizable Deepfake Detection

Chuangchuang Tan1,2, Huan Liu1,2, Yao Zhao1,2*, Shikui Wei1,2, Guanghua Gu3,4, Ping Liu5, Yunchao Wei1,2
1 Institute of Information Science, Beijing Jiaotong University
2 Beijing Key Laboratory of Advanced Information Science and Network Technology
3 School of Information Science and Engineering, Yanshan University
4 Hebei Key Laboratory of Information Transmission and Signal Processing
5 CSE department, University of Nevada, Reno, USA
[email protected], [email protected]

Abstract

Recently, the proliferation of highly realistic synthetic images, facilitated through a variety of GANs and Diffusions, has significantly heightened the susceptibility to misuse. While the primary focus of deepfake detection has traditionally centered on the design of detection algorithms, an investigative inquiry into the generator architectures has remained conspicuously absent in recent years. This paper addresses this lacuna by rethinking the architectures of CNN-based generators, thereby establishing a generalized representation of synthetic artifacts. Our findings illuminate that the up-sampling operator can, beyond frequency-based artifacts, produce generalized forgery artifacts. In particular, the local interdependence among image pixels caused by up-sampling operators is significantly demonstrated in synthetic images generated by GAN or diffusion. Building upon this observation, we introduce the concept of Neighboring Pixel Relationships (NPR) as a means to capture and characterize the generalized structural artifacts stemming from up-sampling operations. A comprehensive analysis is conducted on an open-world dataset, comprising samples generated by 28 distinct generative models. This analysis culminates in the establishment of a novel state-of-the-art performance, showcasing a remarkable 12.8% improvement over existing methods. The code is available at https://github.com/chuangchuangtan/NPR-DeepfakeDetection.

[Figure 1: Neighboring Pixel Relationship (NPR). Rows: Real, Fake. Panels: (a) Images, (b) NPR-R, (c) NPR-G, (d) NPR-B, (e) Different NPRs between Real and Fake (subtraction).]

Figure 1. The visualization of the Neighboring Pixel Relationship (NPR) of a real image and its inversion [72]. To fully understand the NPR, (a) we invert the real image by [72], and (b-d) present NPR heatmaps for the R, G, B channels of the images. In addition, to show that NPR can be used as an artifact representation, (e) the differential NPRs between real and fake images are shown. The proposed NPR effectively reveals the differences between real and fake images.

1. Introduction

With the rapid evolution of image synthesis technologies, such as GAN [14, 25, 26] and Diffusion [20, 53], AI-generated images have reached a level of realism that makes them virtually indistinguishable from authentic images to human observers. Nevertheless, the misuse of these capabilities poses potential threats in political and economic domains. Addressing this issue requires the development of generalizable deepfake detection methods. In recent years, notable strides [9, 12, 13, 34–36] have been made in forgery detection, particularly in face forgery detection.
* Corresponding author

In the realm of deepfake detection, a significant challenge for detectors is to generalize effectively to unseen deepfake sources in real-world scenarios. Recent advancements aimed at enhancing this generalization ability include the refinement of detection algorithms [21, 51], the augmentation of datasets [6, 22, 23], and the development of pre-trained models [49, 60]. Despite these efforts, a conspicuous gap remains: no source-invariant representation has been derived from the generator pipeline for forged image detection. This deficiency leads to failures in detecting unknown forgery domains. Intriguingly, there has been a scarcity of investigative inquiry into generator architectures in recent years.

In addressing this challenge, our work centers on analyzing generator architectures to extract generalized artifact representations. Previous studies [12, 13, 69] have demonstrated the ubiquity of up-sampling components in common GAN pipelines. Simultaneously, given the widespread adoption of U-Net in diffusion models, such as DDPM [20], ADM [11], and LDM [53], the up-sampling layer emerges as a crucial module in diffusion models. The up-sampling cue holds significant potential for advancing generalizable deepfake detection. Building on these insights, in this paper, our focus is on achieving source-invariant forgery detection by rethinking artifacts stemming from the up-sampling component of common generation models. Existing works predominantly consider its impact on the entire image in the frequency domain. In contrast, our approach explores the trace of the up-sampling layer at the level of local image pixels, providing a more nuanced understanding of its influence.

Specifically, in the pipelines of common generation models, up-sampling is employed to transform the low-resolution latent space into high resolution. Within the scaled feature, local pixels exhibit a strong relationship. For instance, employing nearest neighbor interpolation results in the local 2 × 2 pixels sharing the same value. Subsequent to the up-sampling operation, the scaled features are further processed through convolutional layers to generate images. During this process, a relationship is established among local pixels through the combination of the up-sampling operation and the translation invariance of CNN layers. This, in turn, manifests as discernible relationships among local pixels in the generated images.

Building upon these insights, we propose a simple but effective artifact representation, termed Neighboring Pixel Relationships (NPR), aimed at achieving generalized deepfake detection. NPR serves as the artifact representation for training the detection model. The primary innovation of our approach lies in introducing a simple yet versatile artifact representation derived from the common up-sampling component of generation pipelines. In Fig. 1, we showcase NPR heatmaps for a real face and its inversion. Significantly, NPR effectively captures artifacts related to image details such as hair, eyes, and beard. Despite the generator's tendency to enhance details for realism, traces of the up-sampling layer persist in the local image pixels.

To comprehensively evaluate the generalization ability of our proposed NPR, we conduct experiments on a large database of images generated by 28 distinct models¹. Our extensive experiments demonstrate the effectiveness and versatility of the artifact representation generated by the NPR across diverse and unseen sources.

Our paper makes the following contributions:
• We propose a simple yet effective artifact representation, Neighboring Pixel Relationships (NPR), designed to capture local up-sampling artifacts from image pixels. Thanks to the widespread use of up-sampling operations in existing generation models, NPR demonstrates the ability to generalize to unseen sources.
• We demonstrate that up-sampling operators can cause generalized forgery artifacts beyond frequency-based artifacts. For deepfake detection, the trace of the up-sampling layer in local image pixels generalizes better than its influence on the whole image in the frequency domain.
• Our experiments validate the effectiveness of the proposed NPR, showcasing strong generalization capabilities across 28 different generation models used for forgery image synthesis.

¹ ProGAN, StyleGAN, StyleGAN2, BigGAN, CycleGAN, StarGAN, GauGAN, Deepfake, AttGAN, BEGAN, CramerGAN, InfoMaxGAN, MMDGAN, RelGAN, S3GAN, SNGAN, STGAN, DDPM, IDDPM, ADM, LDM, PNDM, VQDiffusion, Glide, Stable Diffusion v1, Stable Diffusion v2, DALLE, and Midjourney.

2. Related Work

In this section, we present a concise survey of deepfake detection approaches, categorizing them into two main groups: image-based and frequency-based detection.

2.1. Image-based Fake Detection

Some studies [54] utilize images as input data to train binary classification models for forgery detection. Rossler et al. [54] employ images to train a straightforward Xception [9] for detecting fake face images. Other works concentrate on specific regions, such as eyes and lips, to discern fake face media [16, 32]. Yu et al. [67] and Marra et al. [43] extract the unique fingerprints of the GAN model from generated images to perform detection. Chai et al. [5] employ limited receptive fields to identify patches that render images detectable. Some works enhance the generalization of detectors to unseen sources by diversifying training data through augmentation methods [61, 62], adversarial training [7], reconstruction techniques [4, 18], fingerprint generators [22], and blending images [57]. Additionally, Ju et al. [24] integrate global spatial information and local informative features to train a two-branch model. AltFreezing [63] adopts both spatial and temporal artifacts to achieve face forgery detection. Li et al. [30] utilize continual learning [68, 71] to solve the continual deepfake detection problem. Ojha et al. [49] and Tan et al. [60] employ feature maps and gradients, respectively, as general representations. DIO [58] utilizes training-free filters to extract artifact representations.

2.2. Frequency-based Fake Detection

Given that GAN architectures heavily rely on up-scaling operations, some studies [12, 13] delve into the impact of up-sampling across the entire image, developing the frequency spectrum as a representation of up-sampling artifacts. LOG [44] integrates information from both color and frequency domains to detect manipulated face images and videos. F3-Net [51] introduces frequency components partition and the discrepancy of frequency statistics between real and forged images into face forgery detection. Luo et al. [41] utilize multiple high-frequency features of images to enhance generalization performance. ADD [65] develops two distillation modules for detecting highly compressed deepfakes, including frequency attention distillation and multi-view attention distillation. BiHPF [21] amplifies the magnitudes of artifacts through two high-pass filters. FrePGAN [23] observes that unique frequency-level artifacts in generated images can lead to overfitting to training sources. Consequently, FrePGAN mitigates the impact of frequency-level artifacts through frequency-level perturbation maps. FreqNet [59] aims to enhance frequency space learning for deepfake detection.

3. Methodology

Our work is dedicated to designing a generalizable artifact representation through an analysis of common up-sampling operations in popular generators. We introduce a form of local up-sampling artifacts, named Neighboring Pixel Relationships (NPR); the details are presented in this section.

3.1. Problem setup

The overarching objective of Generalizable Deepfake Detection is to develop a universal detector capable of accurately identifying deepfake images, even when faced with limitations in the availability of diverse training sources.

In the given context, we consider a real-world image scenario denoted as $X$, which is sampled from $n$ different sources:

\begin{equation}
\begin{split}
& X = \{ X_{1}, X_{2}, \ldots, X_{i}, \ldots, X_{n} \}, \\
& X_{i} = \{ x_{j}^{i}, y_{j} \}_{j=1}^{N_{i}},
\end{split}
\tag{1}
\end{equation}

where $N_{i}$ represents the number of images originating from the $i$th source $X_{i}$, and $x_{j}^{i}$ is the $j$th image of $X_{i}$. Each image is labeled with $y$, indicating whether it belongs to the category of "real" ($y = 0$) or "fake" ($y = 1$).

Here, we train a binary classifier $D(\cdot)$, utilizing the training source $X_{i}$:

\begin{equation}
\begin{split}
& P_{i} = f(X_{i}), \\
& D^{i} = \mathop{\arg\min}_{\theta} \ loss(D(P_{i}; \theta),\ y),
\end{split}
\tag{2}
\end{equation}

where $f(\cdot)$ is the representation extractor and $P_{i}$ is the artifact representation of $X_{i}$.

Our overarching goal is to design a good extractor $f(\cdot)$, which extracts a generalized artifact $P_{i}$ from the training source $X_{i}$. Subsequently, the generalizable detector $D^{i}$ can be obtained by training on the artifact $P_{i}$ originating from $X_{i}$, yet it demonstrates robust performance when faced with images from previously unseen sources denoted as $X_{t}$. The ability to generalize across unseen sources is a crucial objective of our representation extractor $f(\cdot)$.

[Figure 2 panels: (a) GAN pipelines (Condition: Semantic Map, Text Prompt, ...; Noise Vector; Generator with up-sampling; Discriminator); (b) Latent Diffusion pipelines (Pixel Space: Image; Latent Space: z, Diffusion Process, z_T, Condition, Denoising U-Net).]

Figure 2. In the pipelines of common generation models, GAN and Diffusion, up-sampling is employed to transform the low-resolution latent space into high resolution.

[Figure 3 diagram: Generator (up-sampling) → Grid Image → NPR → Classifier. Each 2 × 2 grid (w_1, w_2; w_3, w_4) is mapped to (w_1 − w_j, w_2 − w_j; w_3 − w_j, w_4 − w_j) with w_j = w_1, giving relative and local artifacts. Example values: grid (0.93, 1.01; 0.89, 0.91) → NPR (0.00, 0.08; −0.04, −0.02); grids (0.93, 0.91; 0.74, 0.87) and (0.84, 0.81; 0.79, 0.85) → NPR (0.00, −0.02; −0.19, −0.06) and (0.00, −0.03; −0.05, 0.01).]

Figure 3. The overview of Neighboring Pixel Relationships. We rethink artifacts stemming from the up-sampling component of common generation models. The proposed Neighboring Pixel Relationships focus on the local interdependence between image pixels caused by up-sampling operators. The NPR is employed as the artifact representation to train the detector.
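As a rough illustration of the subtraction shown in Figure 3, the following sketch computes NPR on a single-channel image stored as nested Python lists. It is not the authors' implementation: the function names `nearest_upsample` and `npr` are hypothetical, the index `j` is 0-based here (so `j=0` corresponds to the paper's choice of the first pixel in each grid), and the real detector operates on RGB tensors. The first call reproduces the example values of Figure 3; the second shows that nearest neighbor up-sampling, which makes each 2 × 2 block share one value, yields an all-zero NPR before any convolution is applied.

```python
def nearest_upsample(feature, scale=2):
    """Nearest-neighbor up-sampling: every value becomes a scale x scale block."""
    out = []
    for row in feature:
        wide = [v for v in row for _ in range(scale)]
        out.extend([list(wide) for _ in range(scale)])
    return out


def npr(image, l=2, j=0):
    """Subtract, inside every l x l grid, the grid's j-th element (row-major, 0-based)."""
    height, width = len(image), len(image[0])
    if height % l or width % l:
        raise ValueError("image sides must be multiples of l")
    out = [[0.0] * width for _ in range(height)]
    for top in range(0, height, l):
        for left in range(0, width, l):
            # Reference pixel w_j of this grid.
            ref = image[top + j // l][left + j % l]
            for dy in range(l):
                for dx in range(l):
                    out[top + dy][left + dx] = image[top + dy][left + dx] - ref
    return out


# Example grid from Figure 3.
grid = [[0.93, 1.01],
        [0.89, 0.91]]
print([[round(v, 2) for v in row] for row in npr(grid)])
# [[0.0, 0.08], [-0.04, -0.02]]

# Nearest-neighbor up-sampled features have identical values per 2 x 2 block.
print(npr(nearest_upsample([[0.3, 0.7]])))
# [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
```

In a generated image the convolutional layers after up-sampling change these values, so the NPR is no longer exactly zero; the paper's claim is that a detectable local correlation nevertheless remains.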
3.2. Up-sampling operations in generator pipeline

Before we dive into the details of the method, let's briefly explore the up-sampling operations commonly used in generator pipelines, such as those in GANs and Diffusions.

GAN pipelines: We present an overview of the fundamental pipeline of Generative Adversarial Networks (GANs), as depicted in Figure 2 (a). It comprises two primary constituents, namely the discriminator and the generator. In a GAN, the generator establishes a mapping from a lower-dimensional latent space to the image space. The generator architecture typically incorporates two predominant components: convolutional layers and up-sampling layers. The up-sampling layers accept low-resolution features as input and generate high-resolution features as output. It is worth emphasizing that while the architectural configurations of GAN models exhibit substantial diversity, the adoption of an up-sampling module remains consistent.

Diffusion pipelines: Additionally, the Diffusion pipeline is illustrated in Figure 2 (b). Recent diffusion models fall into two structures: diffusion with U-Net and latent diffusion models. In diffusion with U-Net, the U-Net model is employed to estimate the noise component from a noisy image. At inference time, diffusion models sample noise and gradually reduce the noise level until obtaining a clean image. Latent diffusion models include an encoder, a denoising U-Net, and a decoder. They use a U-Net to perform diffusion in a latent domain and then decode the latent signal with a decoder to generate an image. Although the generation processes of diffusion models and GANs are different, the decoder of the diffusion model also widely adopts up-sampling layers to generate images.

3.3. Neighboring pixel relationships

Building upon the above analysis of generation pipelines, we observe that up-sampling operations are commonly employed in current image generation techniques, including GANs and Diffusions. While existing research has studied global up-sampling artifacts in the frequency domain [12, 13], Jeong et al. [23] have discovered that frequency-based artifacts are insufficient for achieving generalizable detection, given the diverse patterns in the frequency domain of GANs. In this context, we reconsider the up-sampling layer in popular generation models and introduce the concept of local up-sampling artifacts in the spatial domain.

We focus on the portion of the generator near the output images, consisting of an up-sampling layer $up$ with scale $l$, convolutional layers $conv$ with activation functions, input feature maps $x \in \mathbb{R}^{W \times H \times C}$, and the output images $I \in \mathbb{R}^{(l \times W) \times (l \times H) \times 3}$:

\begin{equation}
\begin{split}
& \hat{x} = up(x), \\
& I = conv(\hat{x}),
\end{split}
\tag{3}
\end{equation}

where $\hat{x} \in \mathbb{R}^{(l \times W) \times (l \times H) \times C}$ is the up-scaled feature map. We then divide the image $I$ and $\hat{x}$ into $W \times H$ grids. Each grid is an $l \times l$ patch. Let $V_{I}$ and $V_{\hat{x}}$ denote the grid sets of $I$ and $\hat{x}$, respectively. The $v_{I}^{c} \in V_{I}$ and $v_{\hat{x}}^{c} \in V_{\hat{x}}$ indicate a grid of $I$ and $\hat{x}$, respectively. Most generators employ an up-sampling layer with scale $l = 2$.

The elements of $v_{\hat{x}}^{c}$ exhibit a strong correlation generated by the up-sampling layer. For instance, when adopting nearest neighbor interpolation as the up-sampling layer, the elements of $v_{\hat{x}}^{c}$ share the same value. Here are some key characteristics: 1) the elements of $v_{\hat{x}}^{c}$ have a strong correlation generated by the up-sampling layer, 2) the function $conv$ is fixed during inference, 3) the function $conv$ is translation invariant. Consequently, the correlation of elements is present in $v_{I}^{c}$. We capture the correlation of local pixels in $v_{I}^{c}$ as the up-sampling artifacts.

Specifically, the differences in each $v_{I}^{c}$ are extracted as the artifact representation, as follows:

\begin{equation}
\begin{split}
& v_{I}^{c} = \{ w_1, \ldots, w_i, \ldots, w_n \}, \quad n = l \times l, \\
& \hat{v}_{I}^{c} = \{ w_1 - w_j, \ldots, w_i - w_j, \ldots, w_n - w_j \}, \quad 1 \leq j \leq n,
\end{split}
\tag{4}
\end{equation}

where $w_i$ denotes the elements of $v_{I}^{c}$, and $\hat{v}_{I}^{c}$ denotes the neighboring pixel relationships of $v_{I}^{c}$. We adopt subtraction to capture the relative relationship of pixels in $v_{I}^{c}$. The reference $w_j$ can be any element of $v_{I}^{c}$. The NPR of the whole image is the set of all grids $\hat{v}_{I}^{c}$. Our NPR sets $l$ and $j$ to 2 and 1, respectively. In Section 4, we will discuss the effect of $l$ and $w_j$, and explore the possibility of replacing $w_j$ with the max or mean of $v_{I}^{c}$.

We employ the proposed neighboring pixel relationships $\hat{v}_{I}^{c}$ as the artifact representation to train the classifier for deepfake detection. The NPR captures the local relative correlation between pixels in local patches. This correlation, present in the image domain, derives from the up-sampling layer and benefits from the translation invariance of the convolutional layer. The relative and local nature of the proposed up-sampling artifacts allows the neighboring pixel relationship to be generalized to unknown sources.

4. Experiments

4.1. Settings

Training Dataset:
To ensure a consistent basis for comparison, we employ the training set of ForenSynths [62] to train the detectors, following baselines [21, 23, 62]. The training set consists of 20 distinct categories, each comprising 18,000 synthetic
images generated using ProGAN, alongside an equal num- Baselines: We perform comparisons the proposed NPR
ber of real images sourced from the LSUN dataset. In line with existing deepfake detection works, including CN-
with previous research [21, 23], we adopt specific 4-class NDetection(CVPR2020) [62], Frank(PRML 2020) [13],
training settings, denoted as (car, cat, chair, horse). Durall(CVPR 2020) [12], Patchfor(ECCV 2020) [5],
Testing Dataset: F3Net(ECCV 2020) [51], SelfBland(CVPR 2022)[57],
To assess the generalization ability of the proposed GANDetection(ICIP 2022) [42], BiHPF(WACV 2022)
method on the real-world scenarios, we adopt various real [21], FrePGAN(AAAI 2022)[23], LGrad(CVPR 2023)
images and diverse GAN and Diffusions models. The eval- [60], Ojha(CVPR 2023) [49]. We re-implement baselines
uation dataset consists of five datasets containing 28 gen- [5, 12, 13, 49, 51, 62] with the official codes using 4-classes
eration models. training setting, and adopt the official pretrained models of
• 8 models from ForenSynths[62] : The test set in- baselines[42, 57, 60].
cludes fake images generated by 8 generation models 2 .
Real images are sampled from 6 datasets (LSUN[66], 4.2. Generalization capability evaluation
ImageNet[55], CelebA[39], CelebA-HQ[25], COCO[33], In this section, we demonstrate that the local artifacts repre-
and FaceForensics++[54]). sentation, Neighboring Pixel Relationships, induced by the
• 9 GANs from GANGen[10]: To replicate the unpre- up-sampling operations in common generation pipelines,
dictability of wild scenes, we extend our evaluation by col- can be easily employed for identifying generated image
lecting images generated by 9 additional GANs 3 . There are data. Even a detector trained on a GAN model exhibits the
4K test images for each model, with equal numbers of real ability to generalize to recently generated diffusion images.
and fake images. To analyze if the proposed local up-sampling artifacts is
• 8 Diffusions from DIRE [64]: To expand the testing a common occurrence for different generation models, we
scope, we adopt the diffusions dataset of DIRE [64] for perform the evaluation on a cross-sources dataset compris-
evaluation, including ADM [11], DDPM [20], IDDPM [47], ing images from 28 distinct generation models. The details
LDM [53], PNDM [37], Vqdiffusion [15], Stable Diffusion of test set are given in the Section 4.1 and the supplementary
v1 [53], Stable Diffusion v2 [53]. The real images are sam- material. The detectors of NPR are trained by the images
pled from LSUN [66] and ImageNet[55] datasets. from ProGAN and subsequently evaluated on 16 GANs, 1
• 4 Diffusions from Ojha [49]: This test set contains Deepfake, and 11 Diffusion models. We adopt specific 4-
images generated from ADM [11], Glide [46], DALL-E- classes training settings for all experiments in this paper,
mini [52], LDM [53]. It adopts images of LAION[56] and denoted as P roGAN -(car, cat, chair, horse).
ImageNet[55] datasets as the real data.
• 5 Diffusions from Diffusion1kStep: Moreover,
we sample test images generated from diffusion mod- 4.2.1 GAN-Sources Evaluation
els using 1000 diffusion steps, namely DDPM[20], In order to valid the generalization ability on images of
IDDPM[47], ADM[11], collect images of Midjourney4 , and GAN sources, two test sets, ForenSynths[62] and self-
DALLE[52]5 from social platform Discord. synthesis GAN datasets, are employed for evaluation.
More detailed information on the test set is given in the These datasets encompass 17 distinct generation models
supplementary material. used to test the detection performance of the NPR detec-
Implementation Details: We design a lightweights CNN tor trained on ProGAN images. The results are presented in
network using convolutional layer and Resnet[17] block as Table 1 and Table 2.
the classifiers for NPR with 1.44 million parameters. The Table 1 provides a comprehensive overview of the per-
detector is trained using the Adam optimizer[28] with a formance of detectors on the test set of ForenSynths[62].
learning rate of 2 \times 10^{-4} , a batch size of 32. Our method The Neighboring Pixel Relationships (NPR) outperforms its
is implemented using the PyTorch on Nvidia GeForce counterparts, showcasing higher mean accuracy (Acc.) and
RTX 3090 GPU. To assess the performance of the pro- comparable mean average precision (A.P.) metrics. Partic-
posed method, we follow the evaluation metrics used in the ularly noteworthy are the mean accuracy values of NPR,
baselines[21, 23, 49], which include the average precision which reach 92.5%. It is worth emphasizing the remarkable
score (A.P.) and accuracy (Acc.). superiority of NPR over the current state-of-the-art meth-
2 ProGAN[25], StyleGAN[26], StyleGAN2[27], BigGAN[3], Cycle- ods, LGrad and Ojha. In terms of mean accuracy, NPR sur-
GAN [73], StarGAN [8], GauGAN[50] and Deepfake [54] passes LGrad and Ojha by 6.4% and 3.4%, respectively, un-
3 AttGAN[19], BEGAN[2], CramerGAN[1], InfoMaxGAN[29], derscoring its efficacy in generalizable deepfake detection.
MMDGAN[31], RelGAN[48], S3GAN[40], SNGAN[45], and
STGAN[38]
To further assess the generalization ability of Neigh-
4 discord.com/channels/662267976984297473 boring Pixel Relationships (NPR) across GAN-sources, we
5 discord.com/channels/974519864045756446 expanded the evaluation to include results from 9 addi-

28134
ProGAN StyleGAN StyleGAN2 BigGAN CycleGAN StarGAN GauGAN Deepfake Mean
Method
Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P.
CNNDetection[62] 91.4 99.4 63.8 91.4 76.4 97.5 52.9 73.3 72.7 88.6 63.8 90.8 63.9 92.2 51.7 62.3 67.1 86.9
Frank[13] 90.3 85.2 74.5 72.0 73.1 71.4 88.7 86.0 75.5 71.2 99.5 99.5 69.2 77.4 60.7 49.1 78.9 76.5
Durall[12] 81.1 74.4 54.4 52.6 66.8 62.0 60.1 56.3 69.0 64.0 98.1 98.1 61.9 57.4 50.2 50.0 67.7 64.4
Patchfor[5] 97.8 100.0 82.6 93.1 83.6 98.5 64.7 69.5 74.5 87.2 100.0 100.0 57.2 55.4 85.0 93.2 80.7 87.1
F3Net[51] 99.4 100.0 92.6 99.7 88.0 99.8 65.3 69.9 76.4 84.3 100.0 100.0 58.1 56.7 63.5 78.8 80.4 86.2
SelfBland[57] 58.8 65.2 50.1 47.7 48.6 47.4 51.1 51.9 59.2 65.3 74.5 89.2 59.2 65.5 93.8 99.3 61.9 66.4
GANDetection[42] 82.7 95.1 74.4 92.9 69.9 87.9 76.3 89.9 85.2 95.5 68.8 99.7 61.4 75.8 60.0 83.9 72.3 90.1
BiHPF[21] 90.7 86.2 76.9 75.1 76.2 74.7 84.9 81.7 81.9 78.9 94.4 94.4 69.5 78.1 54.4 54.6 78.6 77.9
FrePGAN[23] 99.0 99.9 80.7 89.6 84.1 98.6 69.2 71.1 71.1 74.4 99.9 100.0 60.3 71.7 70.9 91.9 79.4 87.2
LGrad [60] 99.9 100.0 94.8 99.9 96.0 99.9 82.9 90.7 85.3 94.0 99.6 100.0 72.4 79.3 58.0 67.9 86.1 91.5
Ojha [49] 99.7 100.0 89.0 98.7 83.9 98.4 90.5 99.1 87.9 99.8 91.4 100.0 89.9 100.0 80.2 90.2 89.1 98.3
NPR(our) 99.8 100.0 96.3 99.8 97.3 100.0 87.5 94.5 95.0 99.5 99.7 100.0 86.6 88.8 77.4 86.2 92.5 96.1

Table 1. Cross-GAN-Sources Evaluation on the test set of ForenSynths[62]. The results of [12, 13, 21, 23, 62] are from [21, 23]. Red
and Blue represent the best and second-best performance, respectively.
AttGAN BEGAN CramerGAN InfoMaxGAN MMDGAN RelGAN S3GAN SNGAN STGAN Mean
Method
Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P.
CNNDetection[62] 51.1 83.7 50.2 44.9 81.5 97.5 71.1 94.7 72.9 94.4 53.3 82.1 55.2 66.1 62.7 90.4 63.0 92.7 62.3 82.9
Frank[13] 65.0 74.4 39.4 39.9 31.0 36.0 41.1 41.0 38.4 40.5 69.2 96.2 69.7 81.9 48.4 47.9 25.4 34.0 47.5 54.7
Durall[12] 39.9 38.2 48.2 30.9 60.9 67.2 50.1 51.7 59.5 65.5 80.0 88.2 87.3 97.0 54.8 58.9 62.1 72.5 60.3 63.3
Patchfor[5] 68.0 92.9 97.1 100.0 97.8 99.9 93.6 98.2 97.9 100.0 99.6 100.0 66.8 68.1 97.6 99.8 92.7 99.8 90.1 95.4
F3Net[51] 85.2 94.8 87.1 97.5 89.5 99.8 67.1 83.1 73.7 99.6 98.8 100.0 65.4 70.0 51.6 93.6 60.3 99.9 75.4 93.1
SelfBland[57] 63.1 66.1 56.4 59.0 75.1 82.4 79.0 82.5 68.6 74.0 73.6 77.8 53.2 53.9 61.6 65.0 61.2 66.7 65.8 69.7
GANDetection[42] 57.4 75.1 67.9 100.0 67.8 99.7 67.6 92.4 67.7 99.3 60.9 86.2 69.6 83.5 66.7 90.6 69.6 97.2 66.1 91.6
LGrad [60] 68.6 93.8 69.9 89.2 50.3 54.0 71.1 82.0 57.5 67.3 89.1 99.1 78.5 86.0 78.0 87.4 54.8 68.0 68.6 80.8
Ojha [49] 78.5 98.3 72.0 98.9 77.6 99.8 77.6 98.9 77.6 99.7 78.2 98.7 85.2 98.1 77.6 98.7 74.2 97.8 77.6 98.8
NPR(our) 83.0 96.2 99.0 99.8 98.7 99.0 51.8 70.7 94.5 98.3 98.6 99.0 99.6 100.0 79.0 80.0 88.8 97.4 98.0 100.0

Table 2. Cross-GAN-Sources Evaluation on the Self-Synthesis 9 GANs dataset.

Stable Stable
ADM DDPM IDDPM LDM PNDM VQ-Diffusion Mean
Method Diffusion v1 Diffusion v2
Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P.
CNNDetection[62] 53.9 71.8 62.7 76.6 50.2 82.7 50.4 78.7 50.8 90.3 50.0 71.0 38.0 76.7 52.0 90.3 51.0 79.8
Frank[13] 58.9 65.9 37.0 27.6 51.4 65.0 51.7 48.5 44.0 38.2 51.7 66.7 32.8 52.3 40.8 37.5 46.0 50.2
Durall[12] 39.8 42.1 52.9 49.8 55.3 56.7 43.1 39.9 44.5 47.3 38.6 38.3 39.5 56.3 62.1 55.8 47.0 48.3
Patchfor[5] 77.5 93.9 62.3 97.1 50.0 91.6 99.5 100.0 50.2 99.9 100.0 100.0 90.7 99.8 94.8 100.0 78.1 97.8
F3Net[51] 80.9 96.9 84.7 99.4 74.7 98.9 100.0 100.0 72.8 99.5 100.0 100.0 73.4 97.2 99.8 100.0 85.8 99.0
SelfBland[57] 57.0 59.0 61.9 49.6 63.2 66.9 83.3 92.2 48.2 48.2 77.2 82.7 46.2 68.0 71.2 73.9 63.5 67.6
GANDetection[42] 51.1 53.1 62.3 46.4 50.2 63.0 51.6 48.1 50.6 79.0 51.1 51.2 39.8 65.6 50.1 36.9 50.8 55.4
LGrad [60] 86.4 97.5 99.9 100.0 66.1 92.8 99.7 100.0 69.5 98.5 96.2 100.0 90.4 99.4 97.1 100.0 88.2 98.5
Ojha [49] 78.4 92.1 72.9 78.8 75.0 92.8 82.2 97.1 75.3 92.5 83.5 97.7 56.4 90.4 71.5 92.4 74.4 91.7
NPR (our) 88.6 98.9 99.8 100.0 91.8 99.8 100.0 100.0 91.2 100.0 100.0 100.0 97.4 99.8 93.8 100.0 95.3 99.8
Table 3. Cross-Diffusion-Sources Evaluation on the test of DiffusionForensics [64].

DALLE Glide 100 10 Glide 100 27 Glide 50 27 ADM LDM 100 LDM 200 LDM 200 cfg Mean
Method
Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P.
CNNDetection[62] 51.8 61.3 53.3 72.9 53.0 71.3 54.2 76.0 54.9 66.6 51.9 63.7 52.0 64.5 51.6 63.1 52.8 67.4
Frank[13] 57.0 62.5 53.6 44.3 50.4 40.8 52.0 42.3 53.4 52.5 56.6 51.3 56.4 50.9 56.5 52.1 54.5 49.6
Durall[12] 55.9 58.0 54.9 52.3 48.9 46.9 51.7 49.9 40.6 42.3 62.0 62.6 61.7 61.7 58.4 58.5 54.3 54.0
Patchfor[5] 79.8 99.1 87.3 99.7 82.8 99.1 84.9 98.8 74.2 81.4 95.8 99.8 95.6 99.9 94.0 99.8 86.8 97.2
F3Net[51] 71.6 79.9 88.3 95.4 87.0 94.5 88.5 95.4 69.2 70.8 74.1 84.0 73.4 83.3 80.7 89.1 79.1 86.5
SelfBland[57] 52.4 51.6 58.8 63.2 59.4 64.1 64.2 68.3 58.3 63.4 53.0 54.0 52.6 51.9 51.9 52.6 56.3 58.7
GANDetection[42] 67.2 83.0 51.2 52.6 51.1 51.9 51.7 53.5 49.6 49.0 54.7 65.8 54.9 65.9 53.8 58.9 54.3 60.1
LGrad [60] 88.5 97.3 89.4 94.9 87.4 93.2 90.7 95.1 86.6 100.0 94.8 99.2 94.2 99.1 95.9 99.2 90.9 97.2
Ojha [49] 89.5 96.8 90.1 97.0 90.7 97.2 91.1 97.4 75.7 85.1 90.5 97.0 90.2 97.1 77.3 88.6 86.9 94.5
NPR (our) 94.5 99.5 98.2 99.8 97.8 99.7 98.2 99.8 75.8 81.0 99.3 99.9 99.1 99.9 99.0 99.8 95.2 97.4

Table 4. Cross-Diffusion-Sources Evaluation on the diffusion test set of Ojha [49].

tional GAN models, as presented in Table 2. The results demonstrate the consistent outperformance of NPR in terms of generalization performance on GAN-sources. NPR achieves an impressive average accuracy of 98.0%, substantially surpassing the best-performing baselines, Ojha [49] and LGrad [60], which attain accuracy values of 77.6% and 75.4%, respectively.
The results obtained across 17 diverse generation models underscore the remarkable generalization capability of the proposed artifacts representation derived from up-sampling

DDPM IDDPM ADM Midjourney DALLE Mean
Method
Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P.
CNNDetection[62] 50.0 63.3 48.3 52.68 53.4 64.4 48.6 38.5 49.3 44.7 49.9 52.7
Frank[13] 47.6 43.1 70.5 85.7 67.3 72.2 39.7 40.8 68.7 65.2 58.8 61.4
Durall[12] 54.1 53.6 63.2 71.7 39.1 40.8 45.7 47.2 53.9 52.2 51.2 53.1
Patchfor[5] 54.1 66.3 35.8 34.2 68.6 73.7 66.3 68.8 60.8 65.1 57.1 61.6
F3Net[51] 59.4 71.9 42.2 44.7 73.4 80.3 73.2 80.4 79.6 87.3 65.5 72.9
SelfBland[57] 55.3 57.7 63.5 62.5 57.1 60.1 54.3 56.4 48.8 47.4 55.8 56.8
GANDetection[42] 47.3 45.5 47.9 57.0 51.0 56.1 50.0 44.7 49.8 49.7 49.2 50.6
LGrad [60] 59.8 88.5 45.2 46.9 72.7 79.3 68.3 76.0 75.1 80.9 64.2 74.3
Ojha [49] 69.5 80.0 64.9 74.2 81.3 90.8 50.0 49.8 66.3 74.6 66.4 73.9
NPR (our) 88.5 95.1 77.9 84.8 75.8 79.3 77.4 81.9 80.7 83.0 80.1 84.8
Table 5. Cross-Diffusion-Sources Evaluation on the Self-Synthesis Diffusion dataset. The images of DALLE and Midjourney are collected from the official channel of Discord. The images of other diffusion models are sampled from official pre-trained models with 1000 diffusion steps.

Method Mean Acc. of 38 sub-testsets
CNNDetection[62] 57.3
Frank[13] 56.8
Durall[12] 56.6
Patchfor[5] 80.6
F3Net[51] 78.1
SelfBland[57] 61.2
GANDetection[42] 59.5
LGrad [60] 80.5
Ojha [49] 79.8
NPR (our) 93.3
Table 6. The mean accuracy of all 28 generation models on five datasets.

ProGAN StyleGAN StyleGAN2 BigGAN CycleGAN StarGAN GauGAN Deepfake Mean
Size (l × l) wj
Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P. Acc. A.P.
2×2 w1 99.8 100.0 96.3 99.8 97.3 100.0 87.5 94.5 95.0 99.5 99.7 100.0 86.6 88.8 77.4 86.2 92.5 96.1
2×2 w2 99.1 100.0 94.1 98.5 90.5 98.9 76.7 83.9 91.7 99.1 98.0 100.0 75.9 78.2 73.7 83.4 87.5 92.7
2×2 w3 99.9 100.0 97.1 99.9 95.6 99.9 84.2 91.3 89.3 98.6 99.2 100.0 89.7 92.5 72.8 79.8 91.0 95.3
2×2 w4 99.9 100.0 96.4 99.0 98.9 100.0 85.9 93.3 89.3 99.0 99.5 100.0 88.6 93.6 72.6 81.3 91.4 95.8
2×2 avg(vIc ) 99.9 100.0 95.3 98.8 98.9 100.0 80.1 86.8 89.9 95.1 100.0 100.0 73.6 72.0 68.8 77.0 88.3 91.2
2×2 max(vIc ) 99.9 100.0 98.8 100.0 94.9 99.9 84.2 91.8 93.1 95.8 91.3 98.6 80.0 86.0 84.9 93.0 90.9 95.6
3×3 w1 99.9 100.00 95.8 99.9 97.8 100.0 82.5 88.1 81.7 93.7 96.7 99.6 81.4 87.9 79.8 83.2 89.4 94.1
3×3 w2 99.9 100.0 92.6 99.3 95.0 99.8 83.1 92.3 86.0 97.5 99.5 100.0 83.3 88.9 81.7 90.6 90.1 96.1
3×3 w3 99.6 100.0 90.8 98.5 96.1 99.8 75.2 84.0 88.0 94.2 100.0 100.0 78.7 82.5 84.0 93.0 89.0 94.0
3×3 w4 99.6 100.0 94.7 99.8 95.3 99.9 82.7 89.7 81.9 96.7 95.4 99.8 78.9 81.7 88.3 94.5 89.6 95.3
3×3 avg(vIc ) 99.9 100.0 97.6 99.9 95.3 99.8 81.8 90.6 86.9 96.6 98.5 99.9 79.9 89.0 57.3 92.4 87.1 96.0
3×3 max(vIc ) 99.9 100.0 99.3 100.0 99.3 100.0 76.3 85.2 84.5 94.1 99.0 100.0 77.8 84.8 87.9 94.7 90.5 94.8

Table 7. Effect of the hyperparameters of Neighboring Pixel Relationships.
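The variants compared in Table 7 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: it takes Equation 4 to mean that each pixel of an l × l block has a reference value from the same block subtracted from it, where the reference is either one pixel w_j or the block average or maximum. The function name `npr`, the row-major numbering of w_j, and the cropping to a multiple of l are hypothetical choices, not the authors' implementation.

```python
import numpy as np

def npr(image, l=2, ref="w1"):
    """Subtract a per-block reference value from every pixel of each l x l block.

    ref is "w1".."w{l*l}" (row-major pixel index inside the block, an assumed
    convention), or "avg" / "max" for the block statistics used as ablations.
    """
    img = np.asarray(image, dtype=np.float64)
    h, w = img.shape[0] // l * l, img.shape[1] // l * l  # crop to a multiple of l
    img = img[:h, :w]
    # View the image as a grid of l x l blocks: (H/l, l, W/l, l, [channels]).
    blocks = img.reshape(h // l, l, w // l, l, *img.shape[2:])
    if ref == "avg":
        reference = blocks.mean(axis=(1, 3), keepdims=True)
    elif ref == "max":
        reference = blocks.max(axis=(1, 3), keepdims=True)
    else:
        j = int(ref[1:]) - 1  # 0-based position of w_j inside the block
        r, c = divmod(j, l)
        reference = blocks[:, r:r + 1, :, c:c + 1]
    return (blocks - reference).reshape(img.shape)


# Idealized check: pixels copied by nearest-neighbor up-sampling are identical
# inside each 2 x 2 block, so their NPR is zero everywhere.
low_res = np.arange(6, dtype=np.float64).reshape(2, 3)
upsampled = low_res.repeat(2, axis=0).repeat(2, axis=1)
print(np.abs(npr(upsampled, l=2, ref="w1")).max())  # 0.0
```

The final check is deliberately idealized. In real generators the up-sampling acts on feature maps and is followed by further layers, so the trace left in image space is weaker than an exact zero and has to be learned by the detector.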

operations. Notably, training the detector on ProGAN images enables Neighboring Pixel Relationships (NPR) to generalize effectively to previously unseen GAN sources. This success can be attributed to NPR’s unique ability to capture and analyze the distinctive traces left by the up-sampling component within common GAN pipelines. The insights gained from this localized analysis contribute to NPR’s effectiveness across a spectrum of GAN images.

4.2.2 Diffusion-Sources Evaluation

To present a more challenging evaluation scenario, we devise a comprehensive experiment where the detector is trained on images generated by ProGAN and subsequently tested on images produced by a diverse array of diffusion models. Three diffusion datasets are employed to perform evaluation on the diffusion-sources. It’s important to note that the detectors are trained using the ProGAN 4-classes setting to ensure consistency. This evaluation setup aims to assess the detector’s adaptability and performance when faced with the inherent challenges posed by diffusion-generated images, providing valuable insights into the generalization capability of Neighboring Pixel Relationships (NPR) across different image generation techniques.
The detection performance on DiffusionForensics [64] is presented in Table 3. Despite being trained on images generated by ProGAN, NPR exhibits strong generalization capabilities across various diffusion models. Our method achieves 95.3% and 99.8% in terms of mean Accuracy (Acc.) and mean Average Precision (A.P.), respectively. NPR outperforms the current state-of-the-art methods LGrad [60] and Ojha [49] by 7.1% and 20.9%, respectively, in terms of the mean Acc. metric. Additionally, when compared to DIRE [64], which specifically focuses on detection in the diffusion domain, NPR demonstrates comparable results, particularly noteworthy considering our training set comprises ProGAN images while DIRE relies on diffusion models for training.
Given that a significant portion of images in DiffusionForensics [64] belongs to the bedroom class, we further evaluate the performance on the diffusion dataset from Ojha [49]. In this dataset, diffusion models are utilized to generate images with 100 or 200 steps. The results on the Ojha diffusion dataset are presented in Table 4. The proposed NPR achieves a mean Accuracy (Acc.) value of 95.2%, demonstrating its robust performance. When compared to the current state-of-the-art methods LGrad [60] and Ojha [49], our NPR exhibits substantial improvements, surpassing these methods by 4.3% and 8.3% in mean accuracy. This comparison underscores NPR’s ability to maintain satisfactory generalization across unseen diffusion datasets.

The diffusion dataset from Ojha [49] adopts only 100 or 200 steps to generate images, which may lack clarity and realism. To address this limitation, we further collect images of DALLE and Midjourney from the official Discord channels and sample images from other models with 1000 diffusion steps. The results are reported in Table 5. NPR achieves gains of 15.9% and 13.7% compared to LGrad and Ojha, obtaining a mean accuracy value of 80.1%. This result suggests that NPR maintains strong generalization performance even when faced with diffusion datasets generated with a more extended diffusion process (1000 steps).
In terms of the mean accuracy across 28 generation models, our NPR achieves 93.3%, as shown in Table 6, outperforming Ojha and LGrad by 13.5% and 12.8%, respectively. This result demonstrates that the proposed local up-sampling artifact, Neighboring Pixel Relationships, is capable of generalizing to both unseen GAN sources and diffusion sources, even when trained on ProGAN. This success can be attributed to NPR’s approach of rethinking generator architectures and exploring the trace of up-sampling from the perspective of local spatial information.
Different Upsampling Techniques. The performance of Neighboring Pixel Relationships on 28 generation techniques indicates a strong generalization ability to unseen sources. Despite being trained on ProGAN, which uses nearest-neighbor up-sampling, the detector performs well on generation models with other up-sampling operations, such as bilinear. This phenomenon can be attributed to several factors: 1) Up-sampling operations are applied to feature maps, while NPR is exploited in image space. 2) NPR captures implicit artifact representations caused by up-sampling operations. 3) The proposed NPR focuses on local and relative information, which enhances generalization ability across different up-sampling techniques.
Effect of choice of NPR’s hyperparameters. We evaluate the impact of NPR’s size l and index j in Equation 4 on generalization ability. Simultaneously, to validate the effectiveness of the subtraction between elements in Equation 4, we replace wj with avg(vIc) and max(vIc) to implement NPR. We employ the (car, cat, chair, horse) classes of ProGAN as the training set and apply the test set of ForenSynths [62] for evaluation. The results are shown in Table 7. Observations: 1) When l = 2, NPR achieves better performance, likely due to most generators employing up-sampling layers with a scale factor of 2. 2) NPR with avg(vIc) and max(vIc) show similar detection performance. This suggests that information in the 2×2 block of images can effectively reveal differences between real and fake images.
Qualitative Analysis of NPR. The above quantitative experiments have indicated the effectiveness of the proposed Neighboring Pixel Relationships. To obtain a more profound understanding of its intrinsic properties, we conduct a qualitative analysis of NPR, employing the visualization of Class Activation Map (CAM) [70]. Figure 4 illustrates the Class Activation Maps for images sourced from Midjourney, DALLE, and ImageNet. Notably, the CAMs for real images highlight a broader portion of the image, whereas the CAMs for fake images tend to emphasize localized regions. Intriguingly, despite the detector being primarily trained on a dataset encompassing cars, cats, chairs, and horses, it demonstrates the capacity to recognize these diverse images. This emphasizes the generalization ability of our detector in identifying various deepfake signatures, showcasing its capacity to extend recognition capabilities beyond the training classes.

(a) Fake Image (b) NPR of fake (c) CAM of fake (d) Real Image (e) NPR of real (f) CAM of real
Figure 4. The visualization of CAM [70] extracted from the detector on images of Midjourney, DALLE, and ImageNet. Warmer color indicates a higher probability.

5. Conclusion

This work focuses on developing a generalizable artifacts representation for detecting both GAN- and diffusion-generated images. We reconsider the architectures of CNN-based generators, aiming to establish source-invariant forgery detection. Our findings reveal that the up-sampling operator, beyond frequency-based artifacts, can produce generalized forgery artifacts. Existing works typically consider its influence on the whole image in the frequency domain. In contrast, we explore the trace of the up-sampling layer from the local image pixels. We present a simple but effective artifact representation, named Neighboring Pixel Relationships (NPR), to achieve generalized deepfake detection. Extensive experiments on 28 generation models indicate that the proposed NPR contributes to a strong AI-generated image detector.

6. Acknowledgments

This work was supported in part by the National Key R&D Program of China (No.2021ZD0112100), National NSF of China (No.U1936212, No.62120106009, No.U23A20314), National Natural Science Foundation of China under Grants 62072394, Natural Science Foundation of Hebei province under Grant F2021203019, and A*STAR Career Development Funding Award (Grant No:222D800031).

References

[1] Marc G Bellemare et al. The cramer distance as a solution to biased wasserstein gradients. arXiv preprint arXiv:1705.10743, 2017. 5
[2] David Berthelot et al. Began: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017. 5
[3] Andrew Brock et al. Large scale gan training for high fidelity natural image synthesis. In ICLR, 2018. 5
[4] Junyi Cao et al. End-to-end reconstruction-classification learning for face forgery detection. In CVPR, pages 4113–4122, 2022. 2
[5] Lucy Chai et al. What makes fake images detectable? understanding properties that generalize. In ECCV, pages 103–120. Springer, 2020. 2, 5, 6, 7
[6] Liang Chen et al. Ost: Improving generalization of deepfake detection via one-shot test-time training. In Advances in Neural Information Processing Systems, 2022. 2
[7] Liang Chen et al. Self-supervised learning of adversarial example: Towards good generalizations for deepfake detection. In CVPR, pages 18710–18719, 2022. 2
[8] Yunjey Choi et al. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, pages 8789–8797, 2018. 5
[9] François Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, pages 1251–1258, 2017. 1, 2
[10] Tan Chuangchuang, Tao Renshuai, Liu Huan, and Yao Zhao. Gangen-detection: A dataset generated by gans for generalizable deepfake detection. https://github.com/chuangchuangtan/GANGen-Detection, 2024. 5
[11] Prafulla Dhariwal et al. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021. 2, 5
[12] Ricard Durall et al. Watch your up-convolution: Cnn based generative deep neural networks are failing to reproduce spectral distributions. In CVPR, pages 7890–7899, 2020. 1, 2, 3, 4, 5, 6, 7
[13] Joel Frank et al. Leveraging frequency analysis for deep fake image recognition. In ICML, pages 3247–3258. PMLR, 2020. 1, 2, 3, 4, 5, 6, 7
[14] Ian J Goodfellow et al. Generative adversarial nets. In NIPS, 2014. 1
[15] Shuyang Gu et al. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10696–10706, 2022. 5
[16] Alexandros Haliassos et al. Lips don’t lie: A generalisable and robust approach to face forgery detection. In CVPR, pages 5039–5049, 2021. 2
[17] Kaiming He et al. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016. 5
[18] Yang He et al. Beyond the spectrum: Detecting deepfakes via re-synthesis. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 2534–2541. International Joint Conferences on Artificial Intelligence Organization, 2021. 2
[19] Zhenliang He et al. Attgan: Facial attribute editing by only changing what you want. IEEE Transactions on Image Processing, 28(11):5464–5478, 2019. 5
[20] Jonathan Ho et al. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020. 1, 2, 5
[21] Yonghyun Jeong et al. Bihpf: Bilateral high-pass filters for robust deepfake detection. In WACV, pages 48–57, 2022. 2, 3, 4, 5, 6
[22] Yonghyun Jeong et al. Fingerprintnet: Synthesized fingerprints for generated image detection. In ECCV, pages 76–94. Springer, 2022. 2
[23] Yonghyun Jeong et al. Frepgan: robust deepfake detection using frequency-level perturbations. In AAAI, pages 1060–1068, 2022. 2, 3, 4, 5, 6
[24] Yan Ju et al. Fusing global and local features for generalized ai-synthesized image detection. In 2022 IEEE ICIP, pages 3465–3469. IEEE, 2022. 2
[25] Tero Karras et al. Progressive growing of gans for improved quality, stability, and variation. In ICLR, 2018. 1, 5
[26] Tero Karras et al. A style-based generator architecture for generative adversarial networks. In CVPR, pages 4401–4410, 2019. 1, 5
[27] Tero Karras et al. Analyzing and improving the image quality of stylegan. In CVPR, pages 8110–8119, 2020. 5
[28] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015. 5
[29] Kwot Sin Lee et al. Infomax-gan: Improved adversarial image generation via information maximization and contrastive learning. In WACV, pages 3942–3952, 2021. 5
[30] Chuqiao Li, Zhiwu Huang, Danda Pani Paudel, Yabin Wang, Mohamad Shahbazi, Xiaopeng Hong, and Luc Van Gool. A continual deepfake detection benchmark: Dataset, methods, and essentials. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1339–1349, 2023. 3
[31] Chun-Liang Li et al. Mmd gan: Towards deeper understanding of moment matching network. Advances in neural information processing systems, 30, 2017. 5
[32] Yuezun Li et al. In ictu oculi: Exposing ai created fake videos by detecting eye blinking. In 2018 IEEE International workshop on information forensics and security (WIFS), pages 1–7. IEEE, 2018. 2
[33] Tsung-Yi Lin et al. Microsoft coco: Common objects in context. In ECCV, pages 740–755. Springer, 2014. 5
[34] Huan Liu, Zichang Tan, Qiang Chen, Yunchao Wei, Yao Zhao, and Jingdong Wang. Unified frequency-assisted transformer framework for detecting and grounding multi-modal manipulation. arXiv preprint arXiv:2309.09667, 2023. 1
[35] Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Yao Zhao, and Jingdong Wang. Forgery-aware adaptive transformer for generalizable synthetic image detection. arXiv preprint arXiv:2312.16649, 2023.
[36] Huan Liu, Xiaolong Liu, et al. Padvg: A simple baseline of active protection for audio-driven video generation. ACM Transactions on Multimedia Computing, Communications and Applications, 2024. 1

[37] Luping Liu et al. Pseudo numerical methods for diffusion models on manifolds. In ICLR, 2022. 5
[38] Ming Liu et al. Stgan: A unified selective transfer network for arbitrary image attribute editing. In CVPR, pages 3673–3682, 2019. 5
[39] Ziwei Liu et al. Deep learning face attributes in the wild. In ICCV, pages 3730–3738, 2015. 5
[40] Mario Lučić et al. High-fidelity image generation with fewer labels. In ICML, pages 4183–4192. PMLR, 2019. 5
[41] Yuchen Luo et al. Generalizing face forgery detection with high-frequency features. In CVPR, pages 16317–16326, 2021. 3
[42] Sara Mandelli et al. Detecting gan-generated images by orthogonal training of multiple cnns. In 2022 IEEE ICIP, pages 3091–3095. IEEE, 2022. 5, 6, 7
[43] Francesco Marra et al. Do gans leave artificial fingerprints? In 2019 IEEE conference on multimedia information processing and retrieval (MIPR), pages 506–511. IEEE, 2019. 2
[44] Iacopo Masi et al. Two-branch recurrent network for isolating deepfakes in videos. In ECCV, pages 667–684. Springer, 2020. 3
[45] Takeru Miyato et al. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018. 5
[46] Alex Nichol et al. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021. 5
[47] Alexander Quinn Nichol et al. Improved denoising diffusion probabilistic models. In ICML, pages 8162–8171. PMLR, 2021. 5
[48] Weili Nie et al. Relgan: Relational generative adversarial networks for text generation. In ICLR, 2019. 5
[49] Utkarsh Ojha et al. Towards universal fake image detectors that generalize across generative models. In CVPR, pages 24480–24489, 2023. 2, 3, 5, 6, 7, 8
[50] Taesung Park et al. Semantic image synthesis with spatially-adaptive normalization. In CVPR, pages 2337–2346, 2019. 5
[51] Yuyang Qian et al. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In ECCV, pages 86–103. Springer, 2020. 2, 3, 5, 6, 7
[52] Aditya Ramesh et al. Zero-shot text-to-image generation. In ICML, pages 8821–8831. PMLR, 2021. 5
[53] Robin Rombach et al. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1, 2, 5
[54] Andreas Rossler et al. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the ICCV, pages 1–11, 2019. 2, 5
[55] Olga Russakovsky et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015. 5
[56] Christoph Schuhmann et al. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. In NeurIPS Workshop Datacentric AI, number FZJ-2022-00923. Jülich Supercomputing Center, 2021. 5
[57] Kaede Shiohara et al. Detecting deepfakes with self-blended images. In CVPR, pages 18720–18729, 2022. 2, 5, 6, 7
[58] Chuangchuang Tan, Ping Liu, RenShuai Tao, Huan Liu, Yao Zhao, Baoyuan Wu, and Yunchao Wei. Data-independent operator: A training-free artifact representation extractor for generalizable deepfake detection, 2024. 3
[59] Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Frequency-aware deepfake detection: Improving generalizability through frequency space learning. arXiv preprint arXiv:2403.07240, 2024. 3
[60] Chuangchuang Tan et al. Learning on gradients: Generalized artifacts representation for gan-generated images detection. In CVPR, pages 12105–12114, 2023. 2, 3, 5, 6, 7
[61] Chengrui Wang et al. Representative forgery mining for fake face detection. In CVPR, pages 14923–14932, 2021. 2
[62] Sheng-Yu Wang et al. Cnn-generated images are surprisingly easy to spot... for now. In CVPR, pages 8695–8704, 2020. 2, 4, 5, 6, 7, 8
[63] Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, and Houqiang Li. Altfreezing for more general video face forgery detection. In CVPR, pages 4129–4138, 2023. 3
[64] Zhendong Wang et al. Dire for diffusion-generated image detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22445–22455, 2023. 5, 6, 7
[65] Simon Woo et al. Add: Frequency attention and multi-view based knowledge distillation to detect low-quality compressed deepfake images. In AAAI, pages 122–130, 2022. 3
[66] Fisher Yu et al. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015. 5
[67] Ning Yu et al. Attributing fake images to gans: Learning and analyzing gan fingerprints. In Proceedings of the ICCV, pages 7556–7566, 2019. 2
[68] Gengwei Zhang, Liyuan Wang, Guoliang Kang, Ling Chen, and Yunchao Wei. Slca: Slow learner with classifier alignment for continual learning on a pre-trained model. In ICCV, pages 19148–19158, 2023. 3
[69] Xu Zhang et al. Detecting and simulating artifacts in gan fake images. In 2019 IEEE international workshop on information forensics and security (WIFS), pages 1–6. IEEE, 2019. 2
[70] Bolei Zhou et al. Learning deep features for discriminative localization. In CVPR, pages 2921–2929, 2016. 8
[71] Hongguang Zhu, Yunchao Wei, et al. Ctp: Towards vision-language continual pretraining via compatible momentum contrast and topology preservation. In ICCV, pages 22257–22267, 2023. 3
[72] Jiapeng Zhu et al. In-domain gan inversion for real image editing. In ECCV, pages 592–608. Springer, 2020. 1
[73] Jun-Yan Zhu et al. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, pages 2223–2232, 2017. 5
