Towards Generalizable Deepfake Detection with Locality-Aware AutoEncoder
Mengnan Du*, Shiva Pentyala*, Yuening Li, Xia Hu
Department of Computer Science and Engineering, Texas A&M University
{dumengnan,pk123,liyuening,xiahu}@tamu.edu

ABSTRACT

With advancements in deep learning techniques, it is now possible to generate super-realistic images and videos, i.e., deepfakes. These deepfakes could reach a mass audience and result in adverse impacts on our society. Although many efforts have been devoted to detecting deepfakes, their performance drops significantly on previously unseen but related manipulations, and the detection generalization capability remains a problem. Motivated by the fine-grained nature and spatial locality characteristics of deepfakes, we propose the Locality-Aware AutoEncoder (LAE) to bridge the generalization gap. In the training process, we use a pixel-wise mask to regularize the local interpretation of LAE, to enforce the model to learn intrinsic representations from the forgery region, instead of capturing artifacts in the training set and learning superficial correlations to perform detection. We further propose an active learning framework to select the challenging candidates for labeling, which requires human masks for less than 3% of the training data, dramatically reducing the annotation effort needed to regularize interpretations. Experimental results on three deepfake detection tasks indicate that LAE could focus on the forgery regions to make decisions. The analysis further shows that LAE outperforms the state of the art by 6.52%, 12.03%, and 3.08% respectively on three deepfake detection tasks in terms of generalization accuracy on previously unseen manipulations.

CCS CONCEPTS

• Computing methodologies → Object recognition.

KEYWORDS

Deepfake Detection; GAN; Generalization; Interpretation

ACM Reference Format:
Mengnan Du*, Shiva Pentyala*, Yuening Li, Xia Hu. 2020. Towards Generalizable Deepfake Detection with Locality-Aware AutoEncoder. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20), October 19–23, 2020, Virtual Event, Ireland. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3340531.3411892

*These authors contributed equally.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CIKM '20, October 19–23, 2020, Virtual Event, Ireland
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-6859-9/20/10...$15.00
https://doi.org/10.1145/3340531.3411892

1 INTRODUCTION

Recently, advanced deep learning and computer vision techniques, e.g., generative adversarial networks (GANs), have enabled the generation of super-realistic fake images and videos, known as deepfakes. These techniques enable attackers, or even lay users of machine learning, to manipulate an image/video by swapping its content with alternative content and synthesizing a new image/video. For instance, FaceSwap could generate forged videos of real people performing fictional things, where even humans have difficulty distinguishing these forgeries from real ones [9, 14]. In this paper, we employ the broad definition of deepfakes and do not limit it to facial manipulations. We also consider general GAN-based inpainting manipulation as deepfakes, since it is also fake content generated by deep learning techniques. Deepfakes could further be shared on social media for malicious purposes, such as spreading fake news, influencing elections, or manipulating stock prices, and thus could cause serious negative impact on our society [8]. To help mitigate these adverse effects, it is essential that we develop methods to detect the manipulated forgeries [2, 3].

Current deepfake detection approaches are usually formulated as a binary classification problem and roughly fall into two categories: hand-crafted feature based methods and deep learning based methods. The first category is based on the philosophy that deepfakes are generated by algorithms rather than a real camera, so many clues and artifacts can be detected. It relies on hypotheses about artifacts or inconsistencies in a video or image, such as lack of realistic eye blinking [22], face warping artifacts [23], and lack of self-consistency [16]. Hand-crafted features are thereafter created to detect forgeries. In contrast, the second category develops deep learning models to automatically extract discriminative features to detect forgeries. These methods take either the whole or a partial image as input and classify it as fake or not by designing various architectures of convolutional networks [1, 26].

Despite the abundant efforts devoted to forensics, it remains a challenging problem to develop deepfake detection methods with high generalization capability on previously unseen forgeries. Firstly, as evidenced by recent work [8, 18], although current methods could achieve 99% accuracy on the hold-out test set for most tasks, the performance drops to around 50% random-guess accuracy on previously unseen forgery images/videos. Secondly, in our preliminary experiments, leveraging heatmaps produced by local interpretation methods [32, 39], we observed that these models fail to focus on the forgery regions when making detections. Instead, they concentrate on non-forgery parts and learn superficial, easier-to-learn correlations to separate true and fake images. Due to the independent and identically distributed (i.i.d.) training-test split of the data, these superficial patterns happen to be predictive on the hold-out test set. In contrast, forgeries generated by alternative methods may not contain these superficial correlations. This can possibly explain their high accuracy on the hold-out test set and low accuracy on an alternative test set. Thirdly, to some extent these methods have solved the training dataset, but that is still far away from really solving the deepfake detection problem. As new types of deepfake manipulations emerge quickly, forensic methods without sufficient generalization capability unfortunately cannot be readily applied to real-world data [18, 37].


To bridge the generalization gap, we propose to investigate the following two distinct characteristics of deepfakes. Firstly, deepfake detection is a fine-grained classification task: the difference between true and fake images is so subtle that even human eyes have difficulty distinguishing them. Secondly, deepfakes usually have spatial locality, where the forgery occupies a certain ratio of the whole image input. For instance, DeepFake videos [9] use GAN-based technology to replace one person's face with another's. This manipulation changes human faces while leaving the background unchanged. Considering these two properties, a desirable detection model should be able to concentrate on the forgery region to learn effective representations. Towards this end, the detection model needs to possess local interpretability, which could indicate which region is attended to by the model when making decisions. The benefit is that we can control the local interpretation explicitly, by imposing extra supervision on instance interpretations in the learning process, in order to enforce the model to focus on the forgery region to learn representations.

In this work, based on the aforementioned observations, we develop a Locality-Aware AutoEncoder (LAE) for better generalization of deepfake detection. LAE considers both fine-grained representation learning and enforcing locality in a single framework for image forensics. To guarantee fine-grained representation learning, our work builds upon an autoencoder, which employs reconstruction losses and a latent space loss to capture the distribution of the training images. To suppress the superficial correlations learned by the autoencoder, we augment the autoencoder with local interpretability and use extra pixel-wise forgery ground truth to regularize the local interpretation. As such, LAE is enforced to capture discriminative representations from the forgery region. We further employ an active learning framework to reduce the effort of creating pixel-wise forgery masks. The major contributions of this paper are summarized as follows:

• We propose a deepfake detection method, called LAE, which makes predictions relying on correct evidence in order to boost generalization accuracy.
• We present an active learning framework to reduce the annotation efforts, where less than 3% annotations are needed to regularize LAE during training.
• Experimental results on three deepfake detection tasks validate that LAE could push models to learn intrinsic representations from forgery regions. The proposed LAE achieves state-of-the-art generalization accuracy on previously unseen manipulations.

2 THE PROPOSED LAE FRAMEWORK

In this section, we introduce the proposed framework for generalizable deepfake detection. The pipeline of the proposed framework is illustrated in Fig. 1. The key idea is to regularize the detection model with external annotations, which could guide the model to learn intrinsic representations (Sec. 2.2) from the right and justified forgery region (Sec. 2.3). Besides, an active learning framework is utilized to reduce human annotation efforts for model regularization (Sec. 2.4).

2.1 Problem Statement

In this section, we first introduce the basic notations used in this paper. Then we present the generalizable deepfake detection problem that we aim to tackle.

Notations: In this paper, we employ a general definition of deepfake, which denotes fake content generated by advanced deep learning techniques. Representative examples include face swap, facial attributes manipulation, and inpainting manipulations. We are given a seen dataset D containing both true images X_T and fake images X_F generated by a forgery method. D is split into a training set D_trn = {(x_i, l_i)}_{i=1}^{N_trn}, a validation set D_val = {(x_i, l_i)}_{i=1}^{N_val}, and a test set D_tst = {(x_i, l_i)}_{i=1}^{N_tst}, where l_i ∈ {0, 1} denotes the fake and true class labels respectively. A detection model f(x) is learned from the training set D_trn. Besides the seen set D, there is also an unseen dataset D_unseen = {(x_i, l_i)}_{i=1}^{N_unseen}, which is used to test the generalization of the model f(x) on unseen manipulations. Fake images in D_unseen and in D belong to the same kind of forgery task, but are not generated by the same forgery methods. Take face swap for example: D contains forgery images created by FaceSwap [14], while fake images in D_unseen are created by an alternative forgery method, such as Face2Face [34]. Besides, the unseen dataset D_unseen only serves the testing purpose, and none of its images is used for model training, hyperparameter tuning, or validation purposes.

Generalizable Deepfake Detection: Our objective is to train a model which could generalize across a large variety of possible manipulations, as long as they belong to the same detection task. For instance, for the face manipulation detection task, we expect a model trained on FaceSwap [14] to be able to generalize to alternative manipulation methods, such as Face2Face [34] and other FaceSwap implementations. This is significant in the real-world scenario, since new manipulation methods emerge day by day, and retraining the detector is difficult and even impractical due to the lack of sufficient labeled data from the new manipulation methods.

2.2 AutoEncoder for Deepfake Detection

A key characteristic of deepfake detection lies in its fine-grained nature; thus an effective representation is needed for both true and fake images in order to ensure high detection accuracy. As such, we use an autoencoder to learn more distinguishable representations which could separate true and fake images in the latent space. The autoencoder is denoted by f and consists of a sub-network encoder f_e(·) and a decoder f_d(·). The encoder maps the input image x ∈ R^{w×h×3} to the low-dimensional latent encoding z ∈ R^{d_z}, where d_z is the dimension of the latent vector z. The decoder then remaps the latent vector z back to the input space, x̂ ∈ R^{w×h×3}:

$$z = f_e(x, \theta_e), \qquad \hat{x} = f_d(z, \theta_d), \tag{1}$$

where θ_e and θ_d are the parameters of the encoder and decoder respectively. To force our model to learn meaningful and intrinsic features, we introduce the latent space loss as well as the reconstruction loss:

$$\mathcal{L}_1(\theta_e, \theta_d, x, l) = \alpha_1 \mathcal{L}_{rec} + \alpha_2 \mathcal{L}_{latent}. \tag{2}$$

These two losses are elaborated in the following sections.
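To make the setup concrete, the following is a minimal PyTorch sketch of the encoder/decoder pair in Eq. (1). It is an illustrative reimplementation under our own naming, with layer shapes loosely following Tab. 2, not the authors' released code.

```python
# Minimal PyTorch sketch of the encoder f_e and decoder f_d in Eq. (1).
# Layer shapes loosely follow Tab. 2; names/structure are illustrative.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, d_z=128):
        super().__init__()
        chs = [3, 64, 128, 256, 512]
        layers = []
        for c_in, c_out in zip(chs[:-1], chs[1:]):
            # Strided convs: 3x256x256 -> 512x16x16 (cf. Tab. 2).
            layers += [nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                       nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True)]
        self.features = nn.Sequential(*layers)  # last conv layer, used for CAM later
        self.gap = nn.AdaptiveAvgPool2d(1)      # global average pooling (GAP)
        self.fc = nn.Linear(512, d_z)           # 128-d latent vector z

    def forward(self, x):
        feat = self.features(x)                 # [B, 512, 16, 16]
        z = self.fc(self.gap(feat).flatten(1))  # [B, d_z]
        return z, feat

class Decoder(nn.Module):
    def __init__(self, d_z=128):
        super().__init__()
        self.fc = nn.Linear(d_z, 256 * 4 * 4)
        chs = [256, 128, 64, 32, 16, 8]
        layers = []
        for c_in, c_out in zip(chs[:-1], chs[1:]):
            layers += [nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                       nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
        layers += [nn.ConvTranspose2d(8, 3, 4, stride=2, padding=1), nn.Tanh()]
        self.deconv = nn.Sequential(*layers)

    def forward(self, z):
        h = self.fc(z).view(-1, 256, 4, 4)      # latent -> 256x4x4
        return self.deconv(h)                   # x_hat: [B, 3, 256, 256]
```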



Figure 1: Schematic of LAE training for generalizable deepfake detection. Latent space and reconstruction losses are used to force LAE to learn effective representations. Extra supervision is utilized to regularize local interpretation to boost generalization accuracy. Active learning is exploited to reduce forgery mask annotation efforts.

Latent Space Loss. We make use of the latent space representation to distinguish the forgery images from the true ones [8]. The latent vector is first split into two parts: T = {1, ..., d_z/2} and F = {d_z/2 + 1, ..., d_z}. The total activations of x_i for the true and the fake category are respectively denoted as:

$$a_{i,T} = \frac{2}{d_z}\,\|z_{i,c}\|_1,\; c \in T; \qquad a_{i,F} = \frac{2}{d_z}\,\|z_{i,c}\|_1,\; c \in F. \tag{3}$$

The final latent space loss is defined as follows:

$$\mathcal{L}_{latent} = \sum_i \big( |a_{i,T} - l_i| + |a_{i,F} - (1 - l_i)| \big), \tag{4}$$

where l_i is the ground-truth label of input image x_i. The key idea of this loss is to enforce the true-part activations {z_{i,c}}, c ∈ T, to be maximally activated if the input x_i is a true image, and similarly to increase the fake-part activations {z_{i,c}}, c ∈ F, for fake image inputs. At the testing stage, deepfake detection is based on the activation values of the latent space partitions: the input image x_i is considered to be true if a_{i,T} > a_{i,F}, and vice versa.

Reconstruction Loss. To make the fake and true images more distinguishable in the latent space, it is essential to learn effective representations. Specifically, we use a reconstruction loss which contains three parts: a pixel-wise loss, a perceptual loss, and an adversarial loss, to learn intrinsic representations for all training samples. The overall reconstruction loss L_rec is defined as follows:

$$\mathcal{L}_{rec} = \sum_i \Big[ \underbrace{\beta_1 \|x_i - \hat{x}_i\|_2^2}_{\text{Pixel Loss}} + \underbrace{\beta_2 \|C(x_i) - C(\hat{x}_i)\|_2^2}_{\text{Perceptual Loss}} + \underbrace{\beta_3 \big(-\log D(\hat{x}_i)\big)}_{\text{Adversarial Loss}} \Big]. \tag{5}$$

The pixel-wise loss is measured between the original input image x_i and the reconstructed image x̂_i. For the perceptual loss, a pretrained comparator C(·) (e.g., VGGNet [33]) is used to map the input image to a feature space, R^{w×h×3} → R^{w_1×h_1×d_1}; the difference in this feature space is then calculated, which represents the high-level semantic difference between x_i and x̂_i. In terms of the adversarial loss, a discriminator D(·) is introduced, aiming to discriminate the generated images x̂_i from the real ones x_i. This subnetwork D(·) is the standard discriminator network introduced in DCGAN [28], and is trained concurrently with our autoencoder. The autoencoder is trained to trick the discriminator network into classifying the generated images as real. The discriminator D is trained using the following objective:

$$\mathcal{L}_D = -\big[ \mathbb{E}_{x \sim P_X}[\log D(x)] + \mathbb{E}_{\hat{x} \sim P_{\hat{X}}}[\log(1 - D(\hat{x}))] \big]. \tag{6}$$

The parameters β_1, β_2, β_3 are employed to adjust the impact of the individual losses. The three losses serve the purpose of ensuring the reconstructed image to: 1) be sound in pixel space, 2) be reliable in the high-level feature space, and 3) look realistic, respectively. The implicit effect is to force the vector z to learn an intrinsic representation which could better separate fake and true images. Besides, using three losses instead of only the pixel-wise loss helps stabilize the training within a smaller number of epochs [10].
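As a concrete illustration, the sketch below computes the latent space loss of Eqs. (3)-(4), the composite reconstruction loss following Eq. (5)'s squared-error form, and the test-time decision rule. The function names are ours; the comparator C (e.g., VGG features) and discriminator D (DCGAN-style, assumed to return one probability per image) are defined elsewhere.

```python
# Illustrative losses for Sec. 2.2; helper names are ours, not the paper's code.
import torch

def latent_space_loss(z, labels):
    """Eqs. (3)-(4). z: [B, d_z] latents; labels: [B] floats, 1 = true, 0 = fake."""
    d_z = z.size(1)
    a_T = (2.0 / d_z) * z[:, :d_z // 2].abs().sum(dim=1)  # true-part activation
    a_F = (2.0 / d_z) * z[:, d_z // 2:].abs().sum(dim=1)  # fake-part activation
    # Push a_T toward l_i and a_F toward 1 - l_i.
    return ((a_T - labels).abs() + (a_F - (1.0 - labels)).abs()).sum()

def reconstruction_loss(x, x_hat, C, D, b1=1.0, b2=1.0, b3=0.01):
    """Eq. (5): pixel + perceptual + adversarial terms per image, summed."""
    pixel = ((x - x_hat) ** 2).sum(dim=(1, 2, 3))
    perceptual = ((C(x) - C(x_hat)) ** 2).sum(dim=(1, 2, 3))
    adversarial = -torch.log(D(x_hat).view(-1).clamp_min(1e-8))
    return (b1 * pixel + b2 * perceptual + b3 * adversarial).sum()

def predict_is_true(z):
    """Test-time rule: classify as true iff a_T > a_F."""
    d_z = z.size(1)
    return z[:, :d_z // 2].abs().sum(1) > z[:, d_z // 2:].abs().sum(1)
```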


2.3 Locality-Aware AutoEncoder (LAE)

The key idea of LAE is that the model should focus on correct regions and exploit reasonable evidence, rather than capture dataset biases, to make predictions. Due to the purely data-driven training paradigm, the autoencoder developed in the last section is not guaranteed to focus on the forgery region when making predictions. Instead, the autoencoder may capture superficial correlations which happen to be predictive in the current dataset. This would lead to decreased generalization accuracy on unseen data generated by alternative forgery methods. In LAE (see Fig. 1), we explicitly enforce the model to rely on the forgery region to make detection predictions, by augmenting the model with local interpretability and regularizing the interpretation attention map with extra supervision.

Augmenting Local Interpretability. The goal of local interpretation is to identify the contribution of each pixel in the input image towards a specific model prediction. The interpretation is illustrated in the format of a heatmap (or attention map). Inspired by the CNN local interpretation method Class Activation Map (CAM) [39], we use a global average pooling (GAP) layer as an ingredient of the encoder, as illustrated in Fig. 1. This enables the encoder to output an attention map for each input. Let layer l denote the last convolutional layer of the encoder, and let f_{l,k}(x_i) represent the activation matrix at channel k of layer l for input image x_i. Let w_k^c correspond to the weight of channel k for unit c of the latent vector z. The CAM attention map for unit c is defined as follows:

$$M_c(x_i) = \sum_{k=1}^{d_l} w_k^c \cdot f_{l,k}(x_i). \tag{7}$$

We then upsample M_c(x_i) to the same dimension as the input image x_i using bilinear interpolation. Each entry within M_c(x_i) directly indicates the importance of the value at that spatial grid of image x_i in producing the activation z_c. The final attention map M̂(x_i) for an input image x_i is denoted as:

$$\hat{M}(x_i) = \sum_{c=1}^{d_F} |z_{i,c}| \cdot M_c(x_i) = \sum_{c=1}^{d_F} |z_{i,c}| \cdot \sum_{k=1}^{d_l} w_k^c \cdot f_{l,k}(x_i), \tag{8}$$

where z_{i,c} denotes the c-th unit of the latent vector z for x_i. The heatmap is end-to-end differentiable, and thus amenable to training with backpropagation and updating the model parameters.
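A sketch of the CAM-style attention computation follows. The matrix fc_weight plays the role of w_k^c (the Linear layer that maps pooled features to z); for simplicity the sum runs over all latent units, whereas Eq. (8) writes the upper limit as d_F. The masked squared-error loss anticipates Eq. (9) below. This is our own reading of the formulation, not reference code.

```python
# CAM attention map (Eqs. (7)-(8)) and attention loss (Eq. (9)); illustrative.
import torch
import torch.nn.functional as F

def attention_map(feat, z, fc_weight, out_size=256):
    """feat: [B, K, H, W] last-conv activations f_{l,k};
    z: [B, d_z]; fc_weight: [d_z, K], i.e., w_k^c."""
    # Eq. (7): M_c = sum_k w_k^c * f_{l,k}, batched as one contraction.
    M_c = torch.einsum('ck,bkhw->bchw', fc_weight, feat)       # [B, d_z, H, W]
    # Eq. (8): weight each unit's map by |z_c| and sum over units.
    M_hat = (z.abs()[:, :, None, None] * M_c).sum(dim=1, keepdim=True)
    # Upsample to input resolution with bilinear interpolation.
    M_hat = F.interpolate(M_hat, size=(out_size, out_size),
                          mode='bilinear', align_corners=False)
    return M_hat.squeeze(1)                                    # [B, 256, 256]

def attention_loss(M_hat, masks):
    """Eq. (9)/(11): squared distance to pixel-wise forgery masks G(x_i)."""
    return ((M_hat - masks) ** 2).mean(dim=(1, 2)).sum()
```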
Regularizing Local Interpretation. To enforce the network to focus on the correct forgery region when making detections, a straightforward way is to use instance-level forgery ground truth to regularize the local interpretation. Specifically, the regularization is achieved by minimizing the distance between the individual interpretation map M̂(x_i) and the extra supervision, for all the N_F forgery images. The attention loss is defined as follows:

$$\mathcal{L}_{attention}(\theta_e, x, G) = \sum_{i=1}^{N_F} [\hat{M}(x_i) - G(x_i)]^2, \tag{9}$$

where G(x_i) denotes the extra supervision, i.e., the annotated ground truth for the forgery. This ground truth is given in the format of a pixel-wise binary segmentation mask (see Fig. 1 for an illustrative example). The attention loss is end-to-end trainable and can be utilized to update the model parameters. Ultimately the trained model could focus on the manipulated regions to make decisions.

2.4 Active Learning for Regularization

In the last section, we introduced regularizing LAE with pixel-wise segmentation masks. However, generating these masks is extremely time consuming, especially if we plan to label all N_F forgery images. We are interested in employing only a small ratio of data with extra supervision. In this section, we propose an active learning framework to select challenging candidates for annotation. We describe below how the active learning works in three steps; an illustrative sketch follows at the end of this subsection.

Channels Concept Ranking. Due to the hierarchical structure of the encoder, the last convolutional layer is most likely to capture high-level semantic concepts. In our case, we have 512 channels at this layer. A desirable detector could possess some channels which are responsive to specific and semantically meaningful natural parts (e.g., face, mouth, or eye), while other channels may capture concepts related to forgery (e.g., warping artifacts or contextual inconsistency). Nevertheless, in practice the detector may rely on some superficial patterns which only exist in the training set to make forgery predictions. The samples driving this behavior are considered the most challenging cases, since they cause the model to overfit to dataset-specific bias and artifacts.

We intend to select a subset of channels in the last convolutional layer deemed most influential to the forgery classification decision. The contribution of a channel towards a decision is defined as the channel's average activation score for an image. Specifically, the contributions of the channels towards image x_i are denoted as {u_{i,k}}_{k=1}^{d_c}, where d_c is the number of channels. We learn a linear model based on the d_c concepts to predict the possibility of image x_i being fake: p(u_i) = exp(w · u_i) / (1 + exp(w · u_i)). The loss function is defined as:

$$\mathcal{L}_w = \sum_i \big[ l_i \cdot \log(p(u_i)) + (1 - l_i) \cdot \log(1 - p(u_i)) \big]. \tag{10}$$

After this training, we select the 10 highest components of the optimized linear weight vector w, and the corresponding channels are considered more relevant to the forgery decision.

Active Candidate Selection. After locating the channels most likely corresponding to the forgery prediction, we feed all the N_F fake images to the LAE model. Those with the highest activation values on these top-10 channels are deemed the challenging cases. The key idea behind this choice is that these highest-activation images are most likely to contain easy patterns which can be captured by the model to separate true and fake images, and which do not generalize beyond the training and hold-out test sets. Thus we would like to request their pixel-wise forgery masks and then regularize on them. Based on this criterion, we select N_active images as active candidates. The candidate number N_active is less than 3% of the total images and empirically yields significant improvement of generalization accuracy. Compared to the number of total training samples, which is larger than 10k, we have dramatically reduced the labelling efforts.

Local Interpretation Loss. Equipped with the active image candidates, we request labeling of those images with pixel-wise forgery masks {G(x_i)}_{i=1}^{N_active}. The attention loss is calculated using the distance between the attention map and the annotated forgery mask for the N_active candidates, rather than all N_F forgery images as in Eq. (9):

$$\mathcal{L}_{attention}(\theta_e, x, G) = \sum_{i=1}^{N_{active}} [\hat{M}(x_i) - G(x_i)]^2. \tag{11}$$


The attention loss in Eq. (11) is further combined with the latent space loss in Eq. (4) to update the model parameters:

$$\mathcal{L}_2(\theta_e, x, l, G) = \lambda_1 \mathcal{L}_{latent} + \lambda_2 \mathcal{L}_{attention}. \tag{12}$$

The overall learning algorithm of LAE is presented in Algorithm 1.

Algorithm 1: Locality-Aware AutoEncoder (LAE).
Input: Training data D = {(x_i, l_i)}_{i=1}^N.
1: Set hyperparameters α_1, α_2, β_1, β_2, β_3, λ_1, λ_2, learning rate η, iteration numbers max_iter1, max_iter2, epoch index t = 0;
2: Initialize autoencoder parameters θ_e, θ_d;
3: while t ≤ max_iter1 do
4:     L_1(θ_e, θ_d, x, l) = α_1 L_rec + α_2 L_latent;
5:     θ_{e,t+1}, θ_{d,t+1} = Adam(L_1(θ_e, θ_d, x, l), η);
6:     t = t + 1;
7: Reduce the learning rate: η ← η/10, t ← 0;
8: while t ≤ max_iter2 do
9:     L_w = Σ_i [l_i · log(p(u_i)) + (1 − l_i) · log(1 − p(u_i))];
10:    Select N_active images as active candidates;
11:    Request labeling of pixel-wise masks {G(x_i)}_{i=1}^{N_active};
12:    L_attention(θ_e, x, G) = Σ_{i=1}^{N_active} [M̂(x_i) − G(x_i)]^2;
13:    L_2(θ_e, x, l, G) = λ_1 L_latent + λ_2 L_attention;
14:    θ_{e,t+1} = Adam(L_2(θ_e, x, l, G), η);
15:    t = t + 1; η ← η/10 if t mod 3 = 0;
Output: LAE makes right predictions via right reasons.

We apply a two-stage optimization to derive a generalizable forgery detector. In the first stage, we use the L_1 loss in Eq. (2) to learn an effective representation. In the second stage, we need the model to focus on forgery regions to learn better representations. We therefore exploit the active learning framework to select challenging candidates and obtain their pixel-wise forgery masks. We then reduce the learning rate to one-tenth every 3 epochs and fine-tune the parameters of the encoder using the L_2 loss in Eq. (12). Note that during training we also add random noise to the input image, in order to prevent the model from learning low-level, training-set-specific statistics which are bad for generalization. After training the model, at the testing stage we use the latent space activations in Eq. (3) to distinguish forgeries from true ones: the test image is considered to be true if a_{i,T} > a_{i,F}, and vice versa.
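Stage two of Algorithm 1 can then be sketched as below, reusing the helper functions from the earlier sketches: the decoder and discriminator are frozen, only the encoder is finetuned with L_2, and the learning rate decays by 10x every 3 epochs. The data loader over active candidates is assumed; this is our illustration, not the authors' training script.

```python
# Illustrative stage-2 finetuning loop of Algorithm 1 (our naming).
import torch

def finetune_encoder(encoder, active_loader, lam1=0.5, lam2=1.0,
                     lr=1e-3, epochs=7):
    # Decoder/discriminator stay frozen; only encoder parameters update.
    opt = torch.optim.Adam(encoder.parameters(), lr=lr,
                           betas=(0.9, 0.999), eps=1e-8)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=3, gamma=0.1)
    for _ in range(epochs):
        for x, labels, masks in active_loader:          # active candidates only
            z, feat = encoder(x)
            M_hat = attention_map(feat, z, encoder.fc.weight)
            loss = (lam1 * latent_space_loss(z, labels)
                    + lam2 * attention_loss(M_hat, masks))   # Eq. (12)
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()                                    # eta <- eta/10 every 3 epochs
    return encoder
```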
unseen dataset respectively. All images are 256×256 pixels.
• Inpainting-based Manipulation Inpainting is also referred
3 EXPERIMENTS as image completion. In this task we consider fake images by two
In this section, we conduct experiments to evaluate performance inpainting methods, G&L [17] and ContextAtten [38], consisting
of LAE and answer the following research questions (RQs). seen and unseen dataset respectively. The inpainting is performed
• RQ1 - Does LAE increase the generalization accuracy when to central 128×128 pixels of the original images.
processing unseen instances, especially for those produced by
3.1.2 Baseline Methods. We compare LAE with six baselines,
alternative methods?
where all baselines are trained using the same loss optimizations
• RQ2 - Does LAE provide better attention maps after augmenting
mentioned in the original papers.
extra supervision in the training process?
• RQ3 - How do different components and hyperparameters affect • SuppressNet: A generic manipulation detector [4]. An architec-
the performance of LAE? ture is specifically designed to adaptively suppress the high-level
• RQ4 - Does LAE provide some insights about the detection model content of the image. This model uses a constrained convolu-
as well as training dataset? tional layer followed by two convolutional, two max-pooling and
three fully-connected layers. Constrained convolution layer is
designed to suppress the high-level contents of the image. We
3.1 Experimental Setup
use a learning rate = 10−5 with batch size of 64.
In this section, we introduce the overall experimental setups, includ- • ResidualNet: Residual-based descriptors are used for forgery
ing tasks and datasets, baseline methods, networks architectures detection [7]. This model recasts the hand-crafted Steganalysis
and implementation details. features used in the forensic community to a CNN-based network.
Basically, these features are extracted as co-occurrences on 4
3.1.1 Tasks and Datasets. The overall empirical evaluation is per-
pixels patterns along horizontal and vertical direction on the
formed on three types of deepfake detection tasks. For each task,
residual image, which is obtained after high-pass filtering of the
we use two datasets: seen dataset and unseen dataset. The seen
original input image. We set the learning rate and batch size as
dataset is split into training, validation and test set, which are used
10−5 and 16 respectively.
to train the model, tune the hyperparameters and test the model
• StatsNet: This method integrates the computation of statistical
accuracy respectively. In contrast, unseen dataset contains forgery
feature extraction within a CNN framework [29]. To optimize
images generated by an alternative method, and is only utilized
the feature extraction scheme, this model uses CNN framework
to assess the true generalization ability of the detection models.
with a global pooling layer that computes four statistics (mean,
Corresponding dataset statistics are given in Tab. 1. All subsets
variance, maximum, minimum). We consider the Stats-2L net-
of the three tasks are balanced, where the ratio of true and fake
work since this model has the best performance. We use batch
images are 1:1.
size of 64 with learning rates of 10−4 .
• Face Swap Manipulation This task explores human face ma- • MesoInception: An inception module based deepfakes detector,
nipulations, where face of a person is swapped by face of another where mean square error instead of cross-entropy is used as loss
person in the video. We use videos from Faceforensics++ [31]. function [1]. This is a CNN-based network specifically designed
The seen dataset is generated using the manipulation method to detect face manipulations in videos. It uses two inception
Face2Face [34], while the unseen one is obtained via the manipu-
lation method FaceSwap [14]. The videos are compressed using 1 https://www.h264encoder.com/


Table 2: Network architecture and output shapes.

Encoder layer | Output shape   | Decoder layer      | Output shape
Conv2d        | [64, 128, 128] | ConvTranspose2d    | [256, 4, 4]
Relu          | [64, 128, 128] | BatchNorm2d & Relu | [256, 4, 4]
Conv2d        | [128, 64, 64]  | ConvTranspose2d    | [128, 8, 8]
BatchNorm2d   | [128, 64, 64]  | BatchNorm2d & Relu | [128, 8, 8]
Relu          | [128, 64, 64]  | ConvTranspose2d    | [64, 16, 16]
Conv2d        | [256, 32, 32]  | BatchNorm2d & Relu | [64, 16, 16]
BatchNorm2d   | [256, 32, 32]  | ConvTranspose2d    | [32, 32, 32]
Relu          | [256, 32, 32]  | BatchNorm2d & Relu | [32, 32, 32]
Conv2d        | [512, 16, 16]  | ConvTranspose2d    | [16, 64, 64]
BatchNorm2d   | [512, 16, 16]  | BatchNorm2d & Relu | [16, 64, 64]
Relu          | [512, 16, 16]  | ConvTranspose2d    | [8, 128, 128]
Conv2d        | [512, 16, 16]  | BatchNorm2d & Relu | [8, 128, 128]
Relu          | [512, 16, 16]  | ConvTranspose2d    | [3, 256, 256]
AvgPool2d     | [512, 1, 1]    | Tanh               | [3, 256, 256]
Linear        | [128]          |                    |

• MesoInception: An inception-module-based deepfake detector, where mean squared error instead of cross-entropy is used as the loss function [1]. This is a CNN-based network specifically designed to detect face manipulations in videos. It uses two inception modules and two convolutional layers with max-pooling, followed by two fully-connected layers at the end. We use a batch size of 64 with a learning rate of 10^-3.
• XceptionNet: This is a CNN based on separable convolutions with residual connections [6]. We use a network pretrained on ImageNet, replacing the last fully-connected layer with one with two outputs in order to match our use case. We use the ImageNet weights to initialize all other layers. XceptionNet is trained with a batch size of 32 and a learning rate of 0.0002. To set up the newly inserted fully-connected layer, we fix all weights up to this new layer and pre-train the network for 3 epochs. Finally, we train the network for an additional 20 epochs and choose the one with the best accuracy on the validation set.
• ForensicTransfer: An autoencoder-based detector designed to adapt well to novel manipulation methods [8]. This is an encoder-decoder architecture with 5 convolutional layers in each sub-network. The decoder additionally uses a 2×2 nearest-neighbor up-sampling before each convolution (except the last one) to recover the original size. The latent space (encoder output) has 128 feature maps, among which 64 are associated with the real class and 64 with the fake class. We use a learning rate of 0.001 and a batch size of 64. For a fair comparison with the others, we use their version that is not fine-tuned on the unseen dataset.

For a fair comparison, all models are trained on the same training data and tested on the same hold-out test set and unseen test data. We train the models for a maximum of 40 epochs with early stopping (patience=10). Validation loss is used as the criterion for early stopping.

3.1.3 Network Architectures. For the encoder and decoder, we use a structure similar to U-net [30]. Details about the layers and corresponding output shapes are given in Tab. 2. The AvgPool2d corresponds to the global average pooling layer, which transforms the [512,16,16] activation tensor into a 512-dimensional vector. After that, we use a Linear layer to turn it into the 128-dimensional latent space vector z (see Fig. 1). For the comparator C(·), we use the 16-layer version of VGGNet [33], and the activation after the 10-th convolutional layer, with output shape [512,28,28], is used to calculate the perceptual loss. For the discriminator D(·), we use the standard discriminator network introduced in DCGAN [28].
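As an illustration of the comparator just described, the sketch below cuts a pretrained VGG-16 after its 10th convolutional layer, whose output is [512, 28, 28] per the text. The cut-point logic and class name are our own, not the authors' code.

```python
# Sketch of the perceptual comparator C(.) from Sec. 3.1.3 (illustrative).
import torch.nn as nn
from torchvision.models import vgg16

class Comparator(nn.Module):
    def __init__(self):
        super().__init__()
        feats = vgg16(pretrained=True).features
        # Keep everything up to (and including) the ReLU after the 10th conv.
        n_conv, cut = 0, None
        for i, m in enumerate(feats):
            if isinstance(m, nn.Conv2d):
                n_conv += 1
            if n_conv == 10 and isinstance(m, nn.ReLU):
                cut = i
                break
        self.trunk = feats[:cut + 1]
        for p in self.trunk.parameters():
            p.requires_grad = False          # pretrained and frozen

    def forward(self, x):
        return self.trunk(x)                 # feature map used in Eq. (5)
```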
3.1.4 Implementation Details of LAE. For image pre-processing, we add Gaussian noise to the input image, to prevent the model from learning low-level statistics which are bad for generalization. The standard deviation is randomly set to a value between 0 and 5 for each batch. Besides, the noise is only added at training time. The Adam optimizer [19] is utilized to optimize these models, with betas of 0.9 and 0.999, epsilon set to 10^-8, and batch size set to 64. For all tasks, the learning rate is fixed at 0.001 for the first stage of Algorithm 1. Later, during finetuning, we freeze the parameters of the decoder and discriminator, and only finetune the encoder network parameters in the second learning stage of Algorithm 1. We reduce the learning rate by one-tenth every 3 epochs during the finetuning stage. The number of finetuning epochs depends on the number of active fake images; for instance, 4 and 7 epochs work well for 100 and 500 active images respectively. For the first two hyperparameters (β_1, β_2) in Eq. (5), we tuned values between 0 and 1 with 0.1 as the interval, and for the third (β_3), we tried {0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 1.0}. During finetuning, we tried values between 0 and 1 with 0.1 as the interval for λ_1, λ_2 in Eq. (12). Ultimately, the following values work well for all three tasks: α_1=1.0, α_2=1.0, β_1=1.0, β_2=1.0, β_3=0.01, λ_1=0.5, λ_2=1.0. We apply normalization with mean (0.485, 0.456, 0.406) and standard deviation (0.229, 0.224, 0.225). Besides, the unseen dataset only serves testing purposes, and none of its images is used to train the model or tune hyperparameters.
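A possible reading of this preprocessing, as a sketch: one noise standard deviation drawn per batch, applied only at training time, followed by the stated normalization. Whether the sigma range [0, 5] refers to the 0-255 pixel scale is our assumption.

```python
# Illustrative input pipeline for Sec. 3.1.4 (our interpretation).
import torch
import torchvision.transforms as T

normalize = T.Normalize(mean=(0.485, 0.456, 0.406),
                        std=(0.229, 0.224, 0.225))

def add_batch_noise(x, training=True):
    """x: [B, 3, H, W] in [0, 1]. One sigma per batch, training only."""
    if not training:
        return x
    sigma = torch.empty(1).uniform_(0.0, 5.0) / 255.0   # assumes 0-255 scale
    return x + sigma * torch.randn_like(x)

# Usage: inputs = normalize(add_batch_noise(images))
```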
Figure 2: Pixel-wise ground truth masks. The first row displays original face images, the second row shows Face2face manipulations, and the third row represents pixel-wise ground truth masks.

3.1.5 Linear Model in Active Learning. For the linear model mentioned in Eq. (10), we use the flattened output of the encoder's AvgPool2d layer as input features. Thus every image input to the linear model is represented with 512 features. We train this linear model for 5 epochs with the SGD optimizer and 0.001 as the learning rate. The linear model accuracy on the hold-out test sets is 96.92%, 100.0%, and 99.74% respectively for the Face2face, StarGAN, and G&L datasets.

3.1.6 Pixel-wise Masks. We illustrate some examples of pixel-wise forgery masks in Fig. 2. These masks are for the training Face2face dataset of the face swap manipulation task, and give the detailed manipulated regions. Specifically, the white regions denote manipulated parts, while the black regions represent pristine parts.


Table 3: Detection accuracy on the hold-out test set of the seen dataset and generalization accuracy on the unseen dataset.

                 |        Swap         |      Attribute     |    Inpainting
Models           | Face2face  FaceSwap | StarGAN    Glow    | G&L    ContextAtten
SuppressNet      | 93.86      50.92    | 99.98      49.94   | 99.08  49.98
ResidualNet      | 86.67      61.54    | 99.98      49.86   | 98.96  58.45
StatsNet         | 92.94      57.74    | 99.98      50.04   | 96.17  50.12
MesoInception    | 94.38      47.32    | 100.0      50.01   | 86.90  61.34
XceptionNet      | 98.02      49.94    | 100.0      49.67   | 99.86  50.16
ForensicTransfer | 93.91      52.81    | 100.0      50.08   | 99.65  50.05
LAE_100          | 96.84      61.09    | 99.91      59.05   | 99.04  57.64
LAE_400          | 96.82      65.24    | 99.75      60.08   | 98.95  60.67
LAE_800          | 96.80      68.06    | 99.67      62.11   | 98.94  64.42

3.2 Generalization Accuracy Evaluation

In this section, we evaluate the generalization performance of the detection models. For the three tasks, detection accuracy on the hold-out test set and on data generated by alternative methods (unseen) is given in Tab. 3. We summarize some key findings below.

3.2.1 Generalization Gap. There is a dramatic accuracy gap between the seen and unseen datasets. All baseline methods have relatively high accuracy on the hold-out test set (most of them over 90%), while performing at random (around 50%) on the unseen dataset. On one hand, this indicates strong overfitting of these models to the superficial patterns in the training set. On the other hand, it reveals a limitation of the evaluation schemes in the existing literature. Usually the detection performance is calculated using the prediction accuracy on the test set. Due to the independent and identically distributed (i.i.d.) training-test split of the data, especially in the presence of strong priors, a detection model can succeed by simply recognizing patterns that happen to be predictive on instances of the test set. This is problematic, and thus the test set might fail to adequately measure how well detectors perform on new and previously unseen inputs. As new types of forgery emerge quickly, detectors are recommended to adopt stronger evaluation protocols.

3.2.2 LAE Reduces the Generalization Gap. There are three observations regarding the performance of LAE. Firstly, LAE reduces the generalization gap using a small ratio of extra supervision. LAE_100, LAE_400, and LAE_800 mean the number N_active is set to 100, 400, and 800 respectively. When using 400 annotations (less than 2% of the total number of training data in Tab. 1), we achieve state-of-the-art performance on the face swap and attribute manipulation tasks. LAE outperforms the best baselines by 3.7% and 10% respectively on the unseen datasets of the two tasks. Secondly, using more annotations brings better generalization enhancement. Compared to 400 annotations, using 800 annotations boosts the detection accuracy on all three tasks. The generalization accuracy on the unseen sets is improved by 6.52%, 12.03%, and 3.08% respectively compared to the best baselines on the three deepfake detection tasks. This indicates that LAE has the potential to further promote generalization accuracy with more annotations. Thirdly, the increase of generalization accuracy does not sacrifice the accuracy on the hold-out test set: our accuracy on Face2face, StarGAN, and G&L is comparable to the baseline methods.

Figure 3: Attention map comparison with baselines. Rows show face swap (top), inpainting (middle), and facial attribute (bottom) examples; columns show (a) true input, (b) deepfake, (c) LAE, (d) MesoInception, (e) XceptionNet.

3.3 Attention Maps Evaluation

In this section, we provide case studies to qualitatively illustrate the effectiveness of the generated explanations, using attention maps for all three deepfake detection tasks.

3.3.1 Comparison with Baselines. LAE attention maps are compared with two baselines: MesoInception and XceptionNet (see Fig. 3). The heatmaps for both baselines are generated using Grad-CAM [32]. The visualization indicates that LAE has truly grasped the intrinsic patterns encoded in the forgery part, instead of picking up superficial and undesirable correlations during the training process. For the first two rows (face swap manipulation), LAE focuses attention on eyes, noses, mouths, and beards. In contrast, the two baselines mistakenly highlight some background regions, e.g., collar and forehead. For the third-fourth rows (inpainting) and fifth-sixth rows (facial attribute), LAE correctly focuses on the inpainted eagle neck, the central part of the dog, and the modified hair and beard regions respectively. By comparison, the baselines depend more on non-forgery parts, e.g., wings and eyes, to make detections.


Figure 4: Effectiveness of attention loss. (a) Face swap manipulation, (b) Inpainting, (c) Facial attribute manipulation. For each task, left: before active learning with attention loss; right: after active learning with attention loss.

3.3.2 Effectiveness of Attention Loss. To qualitatively evaluate the effectiveness of the attention loss and active learning, we provide ablation visualizations in Fig. 4. Specifically, we compare LAE_no_atten (without attention loss and active learning) and LAE. For the face swap task, before using the attention loss we can observe that the model does not accurately rely on the forgery region to make decisions. For the female case, LAE_no_atten focuses mostly on brow areas, while LAE concentrates correctly on the modified eyes and nose parts. For the male case, LAE focuses mostly on the eyes and beard parts, which can best distinguish the forgery from the real ones. Similarly, for the inpainting and facial attribute manipulation tasks, LAE focuses mostly on the inpainted eagle neck and the hair part, compared to the attention of LAE_no_atten on the eagle head and the female's left eye. This validates the effectiveness of the attention loss. After finetuning with the attention loss on a small ratio of samples provided by active learning, the model eventually concentrates on the forgery part to make predictions.

3.4 Ablation and Hyperparameter Analysis

In this section, we utilize models trained on the face swap manipulation task to conduct ablation and hyperparameter analysis, to study the contribution of different components of LAE.

3.4.1 Ablation Analysis. We compare LAE with its ablations to identify the contributions of different components. The four ablations include: AE_rec, trained only with the reconstruction loss of Eq. (5); AE_latent, using only the latent space loss in Eq. (4); AE_latent_pixel, using both the latent space loss and the pixel loss in Eq. (5); and AE_latent_rec, using the latent space loss and the whole reconstruction loss. Note that no attention loss is used in the ablations. The comparison results are given in Tab. 4. There are several key findings. Firstly, the latent space loss is the most important part, without which even the hold-out test set accuracy drops to 50.39%. Secondly, all of the pixel-wise, perceptual, and adversarial losses contribute to performance on the hold-out test set. At the same time, no significant increase is observed on the unseen dataset with any combination of these losses. Thirdly, the attention loss, based on candidates selected via active learning, significantly increases accuracy on the unseen dataset (by around 17.5%).

Table 4: Ablation analysis of LAE for face swap detection.

          | AE_rec | AE_latent | AE_latent_pixel | AE_latent_rec | LAE
Face2face | 50.39  | 95.82     | 96.57           | 96.92         | 96.80
FaceSwap  | 49.46  | 50.70     | 50.58           | 50.54         | 68.06

3.4.2 Hyperparameter Analysis. We evaluate the effect of different hyperparameters on model performance by altering the values of β_1, β_2, β_3 in Eq. (5) and λ_1, λ_2 in Eq. (12). The corresponding results are reported in Tab. 5 (without attention loss and active learning) and Tab. 6 (with attention loss and active learning) respectively. Firstly, the results in Tab. 5 indicate that increasing the weights of the pixel loss and perceptual loss enhances model performance on the test set. In contrast, a small weight for the adversarial loss is beneficial for accuracy. Secondly, as shown in Tab. 6, fixing λ_1 and reducing λ_2 from 1.0 to 0.5 and then to 0.1 significantly decreases the accuracy on the unseen dataset. This accuracy drop confirms the significance of the attention loss in improving generalization accuracy.

Table 5: Hyperparameter analysis for β_1, β_2, β_3.

                  | β_1  β_2  β_3  | Face2face | FaceSwap
Alter pixel       | 1.0  1.0  0.01 | 96.92     | 50.54
                  | 0.5  1.0  0.01 | 96.01     | 50.86
                  | 0.1  1.0  0.01 | 95.55     | 50.86
Alter perceptual  | 1.0  1.0  0.01 | 96.92     | 50.54
                  | 1.0  0.5  0.01 | 96.74     | 50.82
                  | 1.0  0.1  0.01 | 95.84     | 50.50
Alter adversarial | 1.0  1.0  0.1  | 54.16     | 49.92
                  | 1.0  1.0  0.05 | 58.28     | 50.01
                  | 1.0  1.0  0.01 | 96.92     | 50.54

Table 6: Hyperparameter analysis for λ_1, λ_2.

            | λ_1  λ_2 | Face2face | FaceSwap
Fix λ_1=1.0 | 1.0  1.0 | 96.85     | 63.14
            | 1.0  0.5 | 96.88     | 55.12
            | 1.0  0.1 | 96.90     | 50.61
Fix λ_1=0.5 | 0.5  1.0 | 96.80     | 68.06
            | 0.5  0.5 | 96.83     | 62.67
            | 0.5  0.1 | 96.88     | 53.05
Fix λ_1=0.1 | 0.1  1.0 | 95.97     | 64.92
            | 0.1  0.5 | 96.42     | 58.01
            | 0.1  0.1 | 96.85     | 51.06

3.4.3 Random vs. Active Learning. We evaluate the active-learning-based challenging candidate selection by comparing it with random selection. The generalization comparison on the unseen dataset (FaceSwap) is illustrated in Fig. 5. There is a dramatic gap between random selection and active learning. For instance, active learning increases unseen dataset accuracy by 7.11% when the annotation number is 100 (< 0.2% of the training data). From 100 to 800 annotations, active learning on average increases accuracy by 11.67% compared to random selection. This indicates that active learning is effective at selecting challenging candidates.


Figure 5: Comparison of random selection and active learning. The x-axis denotes the number of annotations with pixel-wise masks.

3.4.4 Forgery Ground Truth Number Analysis. We study the effect of attention regularization by altering the number of challenging candidates (N_active) selected by active learning (see Fig. 5). There are two main observations. First, increasing the number of annotations typically improves model generalization. In particular, using 800 samples increases the accuracy by 17.5% over using zero samples, indicating the benefit of extra supervision. Second, using forgery masks for less than 0.2% of the training data (100 annotations) increases accuracy by 7.6%. Considering the annotation effort of pixel-wise masks, this advantage of requiring only a small ratio of forgery mask annotations is significant.

3.5 Debugging Model and Dataset

In this section, we provide further analysis of the detection task, to gain some insights about how to further improve generalization. Specifically, we use interpretability as a debugging tool to analyze our LAE as well as the datasets. The statistical analysis of the attention maps provides some clues about the weaknesses of our model, and thus possible solutions for improvement.

3.5.1 Superficial Patterns Captured by the Model. Since we only use a small ratio of annotated samples (less than 3%) to regularize LAE, there is still a generalization gap on the unseen dataset (Tab. 3). Deepfake detection is a fine-grained classification task, and thus is prone to capturing superficial patterns existing in the training set. The analysis of attention maps for failure cases of LAE provides some insights about these superficial patterns. Firstly, the model focuses on some semantically meaningful parts of the object of interest; however, these patterns usually fall in spuriously correlated background regions, rather than the true forgery region. Secondly, the model has captured some low-level patterns, such as textures. Both categories happen to be predictive on the hold-out test set while performing poorly on unseen sets. This indicates that more inductive bias is needed in the architectures, and stronger regularization is needed for model training.

3.5.2 Seen and Unseen Difference. Through attention map visualizations, we also observe the distribution difference between the seen and unseen datasets. For example, in the face swap detection task, Face2face mainly changes lips and eyebrows, while FaceSwap changes mostly the nose and eyes. This validates the distribution difference between the seen and unseen datasets, which brings challenges to generalization accuracy. The upper bound on LAE's accuracy increase depends on the distribution difference between the seen and unseen datasets. Towards this end, using a small amount of unseen dataset data to finetune the model could possibly further reduce the generalization gap, and this direction will be explored in our future research.

4 RELATED WORK

In this section, we briefly review three lines of research which are most relevant to ours.

Traditional Manipulated Forgery Detection. Techniques for traditional manipulated forgery detection have a long history [15]. These forgeries are usually created using traditional image processing techniques. Two typical examples are image splicing and copy-move, which is the process of cutting one part of a source image, such as the face regions, and inserting it into the target image. State-of-the-art methods for traditional forgery detection usually rely on artifacts introduced by the forgery generation pipeline to differentiate between forgeries and real ones, and at the same time could localize forgery regions [16, 36].

Generalization of Deepfake Detection. A deepfake is defined as a forgery generated by advanced deep learning and computer vision techniques [35]. Note that traditional manipulated forgeries typically have some artifacts, which enable detection algorithms to easily discriminate forgeries from real ones. In contrast, deepfakes usually employ advanced techniques, such as GANs, perceptual loss, and adversarial loss, to smooth out artifacts [21, 27]. This generates forgeries which leave almost no clues for detection. Detecting deepfakes is more challenging than detecting traditional forgeries, and is thus the focus of this paper. There are generally two categories of methods for deepfake detection: hand-crafted feature based methods and deep learning based methods [25]. The first relies on high-level visual artifacts of the forgeries, such as lack of realistic eye blinking [22] and face warping artifacts [23]. However, these models fail when these hypotheses do not hold. The second category typically formulates deepfake detection as a supervised binary classification problem, designing diverse CNN architectures and using end-to-end training for detection. However, these methods tend to perform poorly on new unseen manipulations, where the performance reduces to around 50% random-guess accuracy on previously unseen data [8].

The work most similar to ours is the ForensicTransfer method [8]. Both methods focus on enhancing the generalization of detection methods on previously unseen forgeries. However, their method requires samples from unseen forgeries to finetune the detection model, while our method does not require any data from unseen forgeries.

CNN Local Interpretation. Local interpretation aims to identify the contributions of each feature in the input towards a specific DNN prediction [11]. The final interpretation is illustrated in the format of a feature importance visualization [12, 13]. Most current research focuses on providing interpretations for opaque DNN models, and this work has exposed some limitations of DNN models. For instance, some models are heavily driven by superficial patterns in the data, rather than capturing useful representations. This motivates us to make use of DNN interpretation as an ingredient to promote the model's generalization. We use the decomposition-based explanation method CAM [39], since it is end-to-end differentiable and amenable to training with backpropagation and updating CNN parameters.


5 CONCLUSIONS AND FUTURE WORK

We propose a new deepfake detection method, called Locality-Aware AutoEncoder (LAE), to boost generalization accuracy by making predictions that rely on correct forgery evidence. A key characteristic of LAE is its augmented local interpretability, which can be regularized using extra pixel-wise forgery masks, in order to learn intrinsic and meaningful forgery representations. We also present an active learning framework to reduce the effort of obtaining forgery masks (less than 3% of the training data). Extensive experiments conducted on three deepfake detection tasks show that our resulting models have a higher probability of looking at the forgery region, rather than unwanted bias and artifacts, to make predictions. Empirical analysis further demonstrates that LAE outperforms the best baselines by 6.52%, 12.03%, and 3.08% respectively on three deepfake detection tasks on previously unseen manipulated forgeries.

Due to the inherent difficulty of the detection problem, we could still observe a generalization gap between the test set and an unseen dataset generated by alternative methods. Although they are related and belong to the same task, there remain slight distribution differences between them. To further reduce this generalization gap, we will explore a more comprehensive framework in future research, combining transfer learning and other techniques. Besides, we currently focus on deepfake image detection, and will explore deepfake video and audio detection in future research.

ACKNOWLEDGMENTS

The authors thank the anonymous reviewers for their helpful comments. The work is in part supported by NSF grants CNS-1816497, IIS-1900990 and DARPA grant N66001-17-2-4031.

REFERENCES

[1] Darius Afchar, Vincent Nozick, Junichi Yamagishi, and Isao Echizen. 2018. MesoNet: A compact facial video forgery detection network. In 2018 IEEE International Workshop on Information Forensics and Security (WIFS).
[2] Shruti Agarwal, Hany Farid, Yuming Gu, Mingming He, Koki Nagano, and Hao Li. 2019. Protecting World Leaders Against Deep Fakes. In The Conference on Computer Vision and Pattern Recognition (CVPR) Workshop.
[3] AWS, Facebook, Microsoft, and academics. 2019. Deepfake Detection Challenge (DFDC).
[4] Belhassen Bayar and Matthew C. Stamm. 2016. A deep learning approach to universal image manipulation detection using a new convolutional layer. In ACM Workshop on Information Hiding and Multimedia Security.
[5] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. 2018. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In The Conference on Computer Vision and Pattern Recognition (CVPR).
[6] François Chollet. 2017. Xception: Deep learning with depthwise separable convolutions. In The Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Davide Cozzolino, Giovanni Poggi, and Luisa Verdoliva. 2017. Recasting residual-based local descriptors as convolutional neural networks: An application to image forgery detection. In ACM Workshop on Information Hiding and Multimedia Security.
[8] Davide Cozzolino, Justus Thies, Andreas Rössler, Christian Riess, Matthias Nießner, and Luisa Verdoliva. 2018. ForensicTransfer: Weakly-supervised domain adaptation for forgery detection. arXiv preprint arXiv:1812.02510 (2018).
[9] DeepFake. 2019. https://github.com/iperov/DeepFaceLab.
[10] Alexey Dosovitskiy and Thomas Brox. 2016. Generating images with perceptual similarity metrics based on deep networks. In Thirtieth Conference on Neural Information Processing Systems (NIPS).
[11] Mengnan Du, Ninghao Liu, and Xia Hu. 2020. Techniques for interpretable machine learning. Communications of the ACM (CACM) (2020).
[12] Mengnan Du, Ninghao Liu, Qingquan Song, and Xia Hu. 2018. Towards explanation of DNN-based prediction with guided feature inversion. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) (2018).
[13] Mengnan Du, Ninghao Liu, Fan Yang, and Xia Hu. 2019. On attribution of recurrent neural network predictions via additive decomposition. The Web Conference (WWW) (2019).
[14] Faceswap. 2019. https://github.com/shaoanlu/faceswap-GAN.
[15] Hany Farid. 2009. Image forgery detection. IEEE Signal Processing Magazine (2009).
[16] Minyoung Huh, Andrew Liu, Andrew Owens, and Alexei A. Efros. 2018. Fighting fake news: Image splice detection via learned self-consistency. In European Conference on Computer Vision (ECCV).
[17] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. 2017. Globally and locally consistent image completion. ACM Transactions on Graphics (ToG) (2017).
[18] Ali Khodabakhsh, Raghavendra Ramachandra, Kiran Raja, Pankaj Wasnik, and Christoph Busch. 2018. Fake face detection methods: Can they be generalized? In 2018 International Conference of the Biometrics Special Interest Group (BIOSIG).
[19] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[20] Durk P. Kingma and Prafulla Dhariwal. 2018. Glow: Generative flow with invertible 1x1 convolutions. In Thirty-second Conference on Neural Information Processing Systems (NeurIPS).
[21] Lingzhi Li, Jianmin Bao, Ting Zhang, Hao Yang, Dong Chen, Fang Wen, and Baining Guo. 2020. Face X-ray for more general face forgery detection. The Conference on Computer Vision and Pattern Recognition (CVPR) (2020).
[22] Yuezun Li, Ming-Ching Chang, Hany Farid, and Siwei Lyu. 2018. In ictu oculi: Exposing AI generated fake face videos by detecting eye blinking. IEEE Workshop on Information Forensics and Security (WIFS) (2018).
[23] Yuezun Li and Siwei Lyu. 2019. Exposing deepfake videos by detecting face warping artifacts. Workshop on Media Forensics (in conjunction with CVPR) (2019).
[24] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep learning face attributes in the wild. In International Conference on Computer Vision (ICCV).
[25] Falko Matern, Christian Riess, and Marc Stamminger. 2019. Exploiting visual artifacts to expose deepfakes and face manipulations. In 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW).
[26] Huy H. Nguyen, Junichi Yamagishi, and Isao Echizen. 2019. Capsule-Forensics: Using capsule networks to detect forged images and videos. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
[27] Thanh Thi Nguyen, Cuong M. Nguyen, Dung Tien Nguyen, Duc Thanh Nguyen, and Saeid Nahavandi. 2019. Deep learning for deepfakes creation and detection. arXiv preprint arXiv:1909.11573 (2019).
[28] Alec Radford, Luke Metz, and Soumith Chintala. 2016. Unsupervised representation learning with deep convolutional generative adversarial networks. The International Conference on Learning Representations (ICLR) (2016).
[29] Nicolas Rahmouni, Vincent Nozick, Junichi Yamagishi, and Isao Echizen. 2017. Distinguishing computer graphics from natural images using convolution neural networks. In 2017 IEEE Workshop on Information Forensics and Security (WIFS).
[30] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer.
[31] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. 2019. FaceForensics++: Learning to detect manipulated facial images. International Conference on Computer Vision (ICCV) (2019).
[32] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In International Conference on Computer Vision (ICCV).
[33] Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. The International Conference on Learning Representations (ICLR) (2015).
[34] Justus Thies, Michael Zollhofer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. 2016. Face2Face: Real-time face capture and reenactment of RGB videos. In The Conference on Computer Vision and Pattern Recognition (CVPR).
[35] Ruben Tolosana, Ruben Vera-Rodriguez, Julian Fierrez, Aythami Morales, and Javier Ortega-Garcia. 2020. DeepFakes and beyond: A survey of face manipulation and fake detection. arXiv preprint arXiv:2001.00179 (2020).
[36] Yue Wu, Wael AbdAlmageed, and Premkumar Natarajan. 2019. ManTra-Net: Manipulation tracing network for detection and localization of image forgeries with anomalous features. In The Conference on Computer Vision and Pattern Recognition (CVPR).
[37] Xinsheng Xuan, Bo Peng, Jing Dong, and Wei Wang. 2019. On the generalization of GAN image forensics. arXiv preprint arXiv:1902.11153 (2019).
[38] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. 2018. Generative image inpainting with contextual attention. In The Conference on Computer Vision and Pattern Recognition (CVPR).
[39] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2016. Learning deep features for discriminative localization. In The Conference on Computer Vision and Pattern Recognition (CVPR).
