
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 44, NO. 10, OCTOBER 2022

DeepFake Detection Based on Discrepancies Between Faces and Their Context

Yuval Nirkin, Lior Wolf, Yosi Keller, and Tal Hassner

Abstract—We propose a method for detecting face swapping and other identity manipulations in single images. Face swapping
methods, such as DeepFake, manipulate the face region, aiming to adjust the face to the appearance of its context, while leaving the
context unchanged. We show that this modus operandi produces discrepancies between the two regions (e.g., Fig. 1). These
discrepancies offer exploitable telltale signs of manipulation. Our approach involves two networks: (i) a face identification network that
considers the face region bounded by a tight semantic segmentation, and (ii) a context recognition network that considers the face
context (e.g., hair, ears, neck). We describe a method which uses the recognition signals from our two networks to detect such
discrepancies, providing a complementary detection signal that improves conventional real versus fake classifiers commonly used for
detecting fake images. Our method achieves state of the art results on the FaceForensics++ and Celeb-DF-v2 benchmarks for face
manipulation detection, and even generalizes to detect fakes produced by unseen methods.

Index Terms—Image forensics, deep learning, deep fake, face swapping, fake image detection

1 INTRODUCTION

Photography is widely perceived as offering authentic evidence of actual events, including, in particular, the presence and actions of human subjects in images and videos. Although this perception is slowly shifting, contemporary technology allows far easier and more accessible manipulation of images than many realize. This gap represents a societal threat whenever manipulated media is released over social networks and consumed by a public that is ill-equipped to question its authenticity.

For instance, existing technology makes it easier for an actor to speak a given text, and then change her facial appearance and voice to imitate those of someone else. Alternatively, the face of a person captured in a crime scene can be manipulated and replaced by another. Both of these examples are referred to as face swapping. A third scenario involves the reenactment of a person's face to change expression or lip motion (aka face reenactment). We note, however, that the third scenario differs from the first two, as it does not involve a change in identity.

Most contemporary approaches for detecting such manipulations treat these three scenarios similarly: by training a classifier to distinguish between real and fake images or videos [8], [9], [10], [11], [12]. Recently, detection methods have been proposed that focus on liveliness and other specific authenticity signals such as heartbeat [13], [14] and specular highlights [15].

The Face X-ray method [16] focuses on the blending step, which is a common post-processing step for methods that manipulate faces in videos. This model detects the boundaries of the blending mask, which is then classified as real or fake. Focusing on a generic step in the manipulation pipeline makes the approach better suited for unseen manipulation methods. Similar to Face X-ray, we also focus on a common trait shared by most face swapping methods. While Face X-ray focuses on the seam between real and fake content, we focus on the discrepancy in identities between the two.

Application-wise, swapping is of particular interest, as many of the existing face manipulation methods are designed for such identity modifying use cases. To this end we make two assumptions: (A1) Facial manipulation methods only manipulate the internal part of the face. (A2) The context of the face, which includes the head, neck, and hair regions outside the internal part of the face, provides a significant identity signal for the subject.

We verify assumption A2 in Section 3.2. Our findings are consistent with previous reports, showing that context alone indeed provides strong identity cues [17], [18]. To support assumption A1, Fig. 2 shows the affected regions of six different state of the art facial manipulation methods. Figs. 2a and 2b present two reenactment methods by Thies et al. [2], [3]. Both methods manipulate the regions corresponding to a 3D morphable model (3DMM) [19], [20], covering a facial region that contains part of the forehead at the top and most of the jaw at the bottom. Figs. 2c and 2d show two deepfake variants, sampled from the FaceForensics++ [5] and DFD [1] datasets, both affecting a square region in the middle of the face. Fig. 2e is another 3DMM-based face swapping method, affecting similar regions as the reenactment methods, excluding the internal part of the mouth (sample obtained from previous work [5]).

Yuval Nirkin and Yosi Keller are with the Faculty of Engineering, Bar Ilan University, Ramat Gan 5290002, Israel. E-mail: {yuval.nirkin, yosi.keller}@gmail.com.
Lior Wolf is with Tel Aviv University, Tel Aviv 6997801, Israel. E-mail: [email protected].
Tal Hassner is with Facebook AI, Menlo Park, CA 94025 USA. E-mail: [email protected].
Manuscript received 20 Aug. 2020; revised 14 Apr. 2021; accepted 14 June 2021. Date of publication 29 June 2021; date of current version 9 Sept. 2022. (Corresponding author: Yuval Nirkin.) Recommended for acceptance by K. Sunkavalli. Digital Object Identifier no. 10.1109/TPAMI.2021.3093446

0162-8828 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Modern Education Society's College of Engineering. Downloaded on September 26,2023 at 09:48:05 UTC from IEEE Xplore. Restrictions apply.

Fig. 1. Detecting swapped faces by comparing faces and their context. Two example fake (swapped) faces from DFD [1]. Left: The arm of the eyeglasses does not extend from face to context. Right: An apparent identity mismatch between face and context. We show how these and similar discrepancies can be used as powerful signals for automatic detection of swapped faces.

Fig. 2. Affected regions of different manipulation methods. (a) + (b) Face2Face [2] and NeuralTextures [3]; (c) + (d) Deepfake [4] variants of FaceForensics++ [5] and DFD [1]; (e) FaceSwap [6]; (f) FSGAN [7]. In all cases, faces are manipulated but their context is left unchanged.

Fig. 2f is the output of FSGAN [7], which uses face segmentation to manipulate entire face regions.

We claim that it is no coincidence that all face manipulation methods we know of do not affect the entire head: While human faces have simple, easily modeled geometries, their context (neck, ears, hair, etc.) is highly irregular and therefore difficult to consistently reconstruct and manipulate, especially when considering the temporal constraints in video.

We present a novel signal for identifying fake images based on comparing the inner face region – the one that is directly manipulated – with its outer context, which is left unaltered by all face manipulation methods we are aware of. We do this by representing these two regions, faces and their context, with two separate identity vectors. The two vectors are obtained by training two separate face recognition networks: one trained to identify a person based on the face region and the other trained to identify the person based on the face context. We compare these two vectors, seeking identity-to-identity discrepancies.

Importantly, we do not assume prior knowledge of the identity of the person appearing in the image (source or target subject identities). Instead, given an image, we compare the representations for the one or two (unknown) identities, obtained from the face and its context using our two specially trained networks.

The cue we derive using these two networks differs from those obtained by methods that search for artifacts caused by particular face manipulation techniques. Compared to other methods, our cue has three distinct advantages: First, our cue is based on the inherent design of face swap schemes and so is expected to hold even if future approaches produce photo-realistic, artifact-free results. Second, this cue generalizes well to different manipulation methods, whereas artifact detecting methods rely on algorithm-specific flaws. Finally, since the proposed cue is largely unrelated to artifact detection methods, it is complementary, and can thus be readily combined with such approaches to improve accuracy.

To summarize, we make the following contributions: (1) We propose a novel approach to identifying the results of face swapping methods. (2) Our method is based on a novel fake detection cue that compares two image-derived identity embeddings. (3) The proposed approach is shown to outperform existing state-of-the-art schemes when applied to FaceForensics++ [5], Celeb-DF-v2 [21], and DFDC [22]. (4) We show further results on two additional face swapping benchmarks, created using the FaceForensics++ data and additional swapping techniques not included in FaceForensics++.

2 RELATED WORK

Face Swapping Techniques. Semi- and fully-automatic face swapping methods were introduced nearly two decades ago [23], [24]. These early methods were proposed as a means for preserving privacy [24], [25], [26], recreation [27], and entertainment (e.g., [28], [29]); a far cry from some of their less appealing applications today in misinformation and fake news. Nearly all pre-deep learning approaches relied to some extent on 3D face representations, notably 3DMM [19], [20]. Some of the more recent examples of such methods are the Face2Face approach for expression transfer [2], face reenactment [30], expression manipulation [31], and face swapping methods [18].

Public awareness of face manipulation methods began following the introduction of deep learning–based swapping

and reenactment, particularly through the use of generative adversarial networks (GANs). A few notable examples of such techniques are GANimation [32], GANnotation [33], and others [34], [35], [36], [37]. Unlike earlier, 3D-based methods, GAN-based approaches are able to produce near photo-realistic results, not only in still photos, but also in videos. The quality of these results, along with the availability of public software, led to the use of what is now collectively known as DeepFakes, for undesirable applications, including porn and fake news.

More recently, FSGAN [7] showed convincing swapping results without requiring a dedicated training procedure for each source or target person, i.e., it is trained to replace any face with any other face. The FaceShifter state of the art swapping method [38] first merges the source identity with the features from the target face using multi-scale attention blocks, and then refines the result, handling occlusions in an unsupervised manner.

2.1 Detecting Manipulated Faces

Over the years, many methods were proposed for detecting generic, copy-move, and splicing manipulations in images and videos [39], [40], [41], [42]. Faces, however, received far less attention, likely because until recently, it was far harder to produce photo-realistic face manipulations.

The elevated threat posed by recent face manipulation methods is now being answered by increased efforts to develop automatic fake detection methods. Early methods for detecting manipulated visual media relied on hand-crafted features [11]. A more modern, deep learning–based implementation of this approach was recently described by Cozzolino et al. [10], followed by other deep learning–based methods [8], [9], [12], [43], [44], [45], [46], [47], [48], as well as approaches utilizing multiple cues [42], [49], [50], [51], [52], [53], [54].

Sabir et al. [48] recently proposed a recurrent neural network which uses temporal cues to detect Deepfake manipulations in videos. Stehouwer et al. [55] applied an attention mechanism to intermediate feature maps of different backbone classifiers, to improve manipulated region detection accuracy. Songsri et al. [56] showed that using additional facial landmarks improves both detection and localization of Deepfakes. Finally, Nguyen et al. [51] suggested a fake detection architecture based on capsule networks. Their work achieves results equivalent to previous methods, while utilizing significantly fewer parameters.

2.2 Benchmarking Face Manipulation

A number of recent efforts try to provide the research community with standard, high quality, fake detection benchmarks. These efforts include FaceForensics [47], DeepFake-TIMIT [57], Celeb-DF [21], the VTD dataset [58], the FaceForensics++ challenge [5], and the DFD dataset [1]. Several industry research labs have also recently contributed to these efforts, leading to the announcement of the DeepFake Detection Challenge (DFDC) [22].

These benchmarks represent multiple manipulation techniques – not just face swapping. By using a single (or few) synthesis methods, biases can be inadvertently introduced into these challenges: artifacts that are unique to a particular fake generation method, or to the use of particular training data. These sets, therefore, include media generated with a variety of synthesis methods. Our approach is designed to be invariant to such incidental biases: Rather than seeking particular artifacts, we consider a perceptual effect shared by swapping techniques in general and show that our method can detect fakes produced by previously unseen face manipulation techniques.

3 RECOGNITION OF FACES AND THEIR CONTEXT

We describe the two complementary face recognition networks used to obtain identity cues for the face and its context. We further explain how we use these two networks in our proposed fake detection method. Deep neural networks are extensively used for face identification, and we focus on the contributions of two very specific facial regions, dictated by the desired application: the segmented face and its surrounding context.

3.1 Detecting and Segmenting Faces

We begin by applying the dual shot face detector (DSFD) [59]. We then increase detected bounding box sizes by 20 percent, relative to their height, to expose more of the context around the face, as DSFD is trained to return tight facial bounding boxes. Face crops are then resized to 299×299 pixels, the input resolution of the Xception architecture [60] which we use for our face/context cues (Section 3.2). To determine which parts of the crop are processed by the face network and which by the context network, we segment the crop into foreground (face) and background (context) using a face segmentation network. The exact architecture and training details for the segmentation network are provided in the supplementary material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2021.3093446. Given the cropped face I and its corresponding face segmentation mask S, we generate image If and its complementary image Ic, representing the face and its context, respectively.

3.2 Recognition Networks

Recognition Network Architecture. Our networks are based on the Xception architecture [60], following its success in detecting other DeepFake cues [5]. We train the network using a vanilla cross entropy loss, although other loss functions could presumably also be used. Xception is based on the Inception architecture [61], but with Inception modules replaced by depth-wise separable convolutions. As far as we know, it had never before been used for face recognition.

In our implementation, the Xception network consists of a strided convolution block, followed by twelve depth-wise separable convolution blocks with residual connections, except for the last one. The network is terminated by two depth-wise separable convolutions, a pooling operation, and a fully connected layer.

We train two identification networks: Ef, which maps an image of size 299×299 containing pixels from the face region to a vector of pseudo-probabilities associated with the dataset faces, and, similarly, network Ec, which maps the remaining pixels from the detection bounding box (the context) to a vector of pseudo-probabilities of the same classes.
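The preprocessing of Section 3.1, which produces the inputs If and Ic for Ef and Ec, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: `expand_box` and `split_face_context` are hypothetical helper names, and the DSFD detector, resize backend, and segmentation network are stood in for by plain arrays.

```python
import numpy as np

CROP_SIZE = 299  # input resolution of the Xception networks (Section 3.1)

def expand_box(x1, y1, x2, y2, scale=0.2):
    """Enlarge a tight detector box by `scale` of its height on every side,
    exposing more context around the face."""
    pad = scale * (y2 - y1)
    return x1 - pad, y1 - pad, x2 + pad, y2 + pad

def split_face_context(crop, mask):
    """Split a cropped image into face (I_f) and context (I_c) using a binary
    face segmentation mask; pixels outside each region are zeroed."""
    mask3 = mask[..., None].astype(crop.dtype)  # H x W -> H x W x 1
    face = crop * mask3           # I_f: face pixels only
    context = crop * (1 - mask3)  # I_c: everything else in the box
    return face, context
```

In a full pipeline, `crop` would be the expanded DSFD detection resized to 299×299 and `mask` the output of the face segmentation network; here any image array and mask of matching spatial shape work.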

TABLE 1
Face Recognition Accuracy on VGGFace2

Method          Train set   Validation set
Context         99.90       87.06
Face            99.89       95.10
Entire region   99.98       96.98

Results reported for three face identification Xception networks, each applied to a different part of the face. As expected, the entire region, containing both face and context, is the most accurate. Even context alone, however, provides a strong cue for identification, as previously observed by others [17], [18].

We train both Ef and Ec on images from the standard, publicly available VGGFace2 dataset [62]. VGGFace2 contains 9,131 subjects, from which we filtered images with a resolution lower than 128×128, resulting in 8,631 identities. The output of these two networks is, therefore, in R^8,631.

Validating Recognition Capabilities. To validate and compare the recognition accuracy of these networks, we test their performance on both the VGGFace2 [62] test set and the test set of the Labeled Faces in the Wild (LFW) [63] benchmark (no additional training or fine tuning was applied to the networks before being tested on LFW images).

Unsurprisingly, addressing the internal appearance of the face, network Ef outperforms Ec in terms of accuracy, though both accuracies are high. These results are evident from Table 1 for VGGFace2 and Fig. 3 for LFW. We note that the accuracy demonstrated by Ec – its ability to recognize faces despite only seeing the context – is unsurprising: similar results were reported by others, showing that faces can be recognized even when only their context is visible [17], [18].

Importantly, Fig. 3b shows that the representations typically used for face recognition – the activations of the penultimate layer of the face recognition network – do not match well for the same person, since the two networks were trained independently. When combining the responses from these two networks, we therefore use their final output: the per-subject pseudo-probabilities (Section 4.1).

Fig. 3. LFW verification accuracy for identification networks trained on different face regions. (a) Results obtained by representing faces with the final layers of the Xception architectures. (b) Faces represented using the activations of the penultimate layers of Xception. In the latter case, face versus context representations do not match well for the same person, since the two networks were trained independently. Our approach, therefore, uses the final layers of the networks, representing subject pseudo-probabilities, when comparing the two (top).

4 FAKE DETECTION USING FACES VERSUS CONTEXT

We illustrate our proposed fake detection approach in Fig. 4. Our method combines multiple Xception networks: the recognition networks, Ef and Ec, described in Section 3; a binary Xception net, Es, trained to distinguish between real images and images manipulated by face swapping methods; and another, optional, binary Xception net, Er (not shown in Fig. 4), which we train to differentiate real images from those manipulated by face reenactment methods. We next describe these components in detail.

Fig. 4. Method overview. Following initial preprocessing, we obtain regions for the face, If, and its context, Ic. The two are processed by the face identification networks, Ef and Ec, respectively. A separate network, Es, considers the input image, I, seeking apparent swapping artifacts to decide if it is a face swapping result. The pseudo-probability vectors of the two face identification networks are subtracted and, jointly with the representations obtained from the method type network, Es, are passed to the final classifier, D.

4.1 Face Discrepancy Component

We train the face discrepancy network to predict whether a face and its context share the same identity. It uses the output of the two recognition networks, Ef and Ec, described in Section 3. We pre-train these two networks and do not change their weights after they are combined, in order to ensure that the identity cues remain the dominant ones. In Section 5.3 we show that training with the recognition networks' weights unfrozen leads to reduced accuracy when generalizing to unseen methods. We process the face and context images, If and Ic, with the two separate identity classifiers, Ef and Ec, respectively, to compute a discrepancy feature vector vd:

    vd = Ef(If) - Ec(Ic) = vf - vc.    (1)

4.2 Manipulation Specific Networks

Previous approaches trained classifiers to distinguish between real and fake faces, without considering the particular manipulation applied to the faces – swapping or reenactment. These two manipulation types differ significantly: Swapping manipulates the identity of the face, whereas reenactment manipulates facial pose and expression. While the latter is not the focus of our work, it is required by the FaceForensics++ benchmark used in our tests (Section 5.2). Our approach, therefore, also includes a component for detecting face reenactment.

Specifically, we decouple swapping and reenactment by training a separate, dedicated classifier for each: Network Es is trained to detect swapping artifacts and network Er (not shown in Fig. 4) is trained to detect reenactment. We use Xception networks, similar to those described in Section 3.2 for recognition, and train these networks to classify genuine versus manipulated. Our training process first pre-trains both networks on examples of their particular manipulation versus pristine images. Our reenactment network, Er, is used in cases where the task is to detect both face swapping and face reenactment methods. Otherwise, we use a three network solution, where Er is omitted.

4.3 Combining All Detection Cues

We chose the simplest method for combining the various signals: concatenating the three vectors vd, vs, and vr, where vd ∈ R^8,631 is defined in Eq. (1), and vs = Es^p(I) and vr = Er^p(I), both in R^2,048, denote the activations of the penultimate layers of the binary Es and Er, respectively.

The concatenated vector is passed to classifier D, which outputs a real versus fake binary signal, trained using a logistic loss function. The classifier D consists of an initial linear layer, followed by batch normalization, ReLU, and a final linear layer.

4.4 Training

We first pre-train the four classifiers, Es, Er, Ef, and Ec, each on its own task. We train network Es on the subset of videos in FaceForensics++ [5] consisting of pristine videos and videos manipulated by the face swapping methods: FaceSwap and Deepfakes. Network Er is trained on the face reenactment methods: Face2Face and NeuralTextures. Note that we only use the compressed versions of these videos for training, with C23 (HQ) and C40 (LQ) compressions. We chose not to use the raw videos for training because there is little difference between them and the C23 compressed videos. The FaceForensics++ benchmark used to test our method does contain all three versions. The training process applied to Ef and Ec is detailed in Section 3.

Once the four networks are trained, we freeze the weights of Ef and Ec, and train the final classification network, D, using the three output vectors (vs, vr, vd), while only fine-tuning Er and Es. The final training is done on the same split of the FaceForensics++ videos. For more technical details, please see the supplementary material, available online.

4.5 Inference on Full Images

During inference, we often process images containing multiple faces. In such cases, we only classify detected faces having a height larger than 64 pixels, and discard the rest as background faces. The only exceptions are images where the largest face does not comply with this criterion, in which case we process the largest detected face.

We further remove false detections by applying a threshold on the number of face pixels in the face segmentation mask, S, for each detection. We start with a threshold of 15 percent face pixels, relative to the number of pixels in the cropped region. If this step filters out all our detections, we reduce the threshold by half. If no detection passes the 7.5 percent threshold, we simply consider the one face patch with the maximal number of detected pixels.

Finally, we apply the compound network, including Es, Ef, Ec, and D, to the remaining face patches (one or more) and obtain one score per face patch as the output of D. We take the minimal output of these scores – the face patch predicted as most likely to be fake – in cases where only a single face is manipulated.

5 EXPERIMENTAL RESULTS

We evaluated our proposed scheme using three recent, challenging benchmarks: FaceForensics++ [5], DFDC [22], and Celeb-DF-v2 [21]. In order to evaluate our method using additional face swapping techniques and test its generalization abilities, we further create our own test set, using two more swapping methods.
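The discrepancy and fusion computation of Sections 4.1 and 4.3 can be sketched numerically as follows. This is an illustrative sketch, not the trained model: the networks Ef, Ec, Es, and Er are stood in for by random vectors with the dimensions stated in the text (vd in R^8,631; vs, vr in R^2,048), and classifier D is reduced to untrained linear layers with a ReLU; the hidden width of 256 is an assumption for illustration.

```python
import numpy as np

N_IDS, N_PEN = 8631, 2048  # identity classes; penultimate widths of Es / Er
rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Stand-ins for the network outputs on one image (assumption: random logits).
v_f = softmax(rng.normal(size=N_IDS))  # Ef(If): face pseudo-probabilities
v_c = softmax(rng.normal(size=N_IDS))  # Ec(Ic): context pseudo-probabilities
v_d = v_f - v_c                        # Eq. (1): discrepancy vector
v_s = rng.normal(size=N_PEN)           # penultimate activations of Es
v_r = rng.normal(size=N_PEN)           # penultimate activations of Er

# Classifier D: linear -> (batch norm, folded into scale/shift here) -> ReLU -> linear.
x = np.concatenate([v_d, v_s, v_r])            # concatenated input to D
W1, b1 = rng.normal(size=(256, x.size)) * 0.01, np.zeros(256)
W2, b2 = rng.normal(size=(1, 256)) * 0.01, np.zeros(1)
h = np.maximum(W1 @ x + b1, 0.0)               # ReLU hidden layer
score = (W2 @ h + b2).item()                   # real-vs-fake logit
print(x.shape)  # (12727,)
```

Note that because vf and vc are both probability vectors, vd sums to zero; a matching face/context pair yields small entries, while an identity mismatch produces large positive and negative excursions.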

TABLE 2 TABLE 3
Face Swap Detection Results FaceForensics++ Image Benchmark Results

Methods FF-DF Celeb-DF-v2 Methods DF F2F FS NT Pristine Total


Two-stream [54] 70.1 53.8 Steg. Features [11] 73.6 73.7 68.9 63.3 34.0 51.8
Meso4 [8] 84.7 54.8 Cozzolino et al. [10] 85.4 67.8 73.7 78.0 34.4 55.2
MesoInception4 [8] 83.0 53.6 Rahmouni et al. [12] 85.4 64.2 56.3 60.0 50.0 58.1
Bayar and Stamm [9] 84.5 73.7 82.5 70.6 46.2 61.6
HeadPose [53] 47.3 54.6 MesoNet [8] 87.2 56.2 61.1 40.6 72.6 66.0
FWA [45] 80.1 56.9 Xception [5] 96.3 86.8 90.3 80.7 52.4 71.0
DSP-FWA [45] 93.0 64.0 Ours 94.5 80.3 84.5 74.0 67.6 75.0
VA-MLP [49] 66.4 55.0
Columns are: DeepFakes (DF), Face2Face (F2F), FaceSwap (FS), Neural-
VA-LogReg [49] 78.0 55.1 Textures (NT), and Pristine categories. It is hard to compare specific col-
XceptionNet-raw [5] 99.7 48.2 umns, since there is a threshold-based trade-off between real and fake. These
XceptionNet-c23 [5] 99.7 65.3 columns are therefore provided only for completeness. Our method leads in the
XceptionNet-c40 [5] 95.5 65.5 Total score, which is the meaningful metric for this benchmark.

Multi-task [64] 76.3 54.3 results reported for our method on Celeb-DF-v2 testify to its
Capsule [51] 96.6 57.5 improved generalization abilities compared to the baseline
Ours 99.7 66.0 methods.

Comparison of our approach and leading state of the art methods on two bench-
marks using frame-level AUC (%). 5.2 Experiments on FaceForensics++
The full FaceForensics++ dataset [5] contains 1,000 videos
5.1 Face Swapping Detection Experiments obtained from the web, from which 1,000 video pairs were
We use the following three datasets containing only face randomly selected and used to generate additional 1,000
swapping examples: manipulated videos representing four face manipulation
FF-DF. FF-DF [21] is a subset of the FaceForensics++ schemes. Two of these methods perform face swapping: a 3D-
benchmark [5], which includes only faces swapped using based face swapping method [6] using a traditional graphics
the Deepfakes method [4]. These tests therefore include pipeline and blending, and a GAN-based method [4], trained
1,000 videos from the pristine subset and 1,000 videos from using the images of pairs of subjects to compute a mapping
the Deepfakes subset (the full FaceForensics++ is described between them. Two additional methods perform face reenact-
in Section 5.2). ment: Face2Face [2], a 3DMM-based method that manipulates
DFDC. The recently announced, industry-backed, pre- facial expressions by changing the expression-coefficients esti-
view of the DFDC benchmark [22] offers a total of 5,244 vid- mated for the face, and NeuralTextures [3] which learns a face
eos of 66 actors: 4,464 training videos and 780 test videos, neural texture from a video and uses it to realistically render a
1,131 of them are real videos and 4,113 are fakes generated 3D reconstructed face model.
by two different, unknown, face swapping methods.

Celeb-DF-v2. Another recent dataset, containing 590 real videos and 5,639 DeepFake videos of 59 celebrities [21]. This set is especially challenging, as most state-of-the-art methods tested on it report near-chance accuracies.

Training and Evaluation. In these tests, we do not use our reenactment network, Er. We train on FaceForensics++, as described in Section 3. Results for all baseline methods were previously reported [21]. These methods were trained mainly on FaceForensics++, sometimes with additional self-collected data. None of these methods was trained on DFDC or Celeb-DF-v2, so these experiments also compare the generalization of the different methods.

All methods were compared using the area under the curve (AUC), at the frame level, on all frames in which faces were detected. This metric is convenient for comparing methods that output per-frame classifications, as there is no need to set thresholds.

Face Swap Detection Results. We report our results in Table 2. Our method achieves the best AUC scores on all the benchmarks. On the FaceForensics++ DeepFakes subset [5], our method achieves results similar to the current state of the art, since accuracy on this subset is saturated. On the more challenging Celeb-DF-v2 benchmark, small improvements in the AUC scores are significant.

Results on FaceForensics++ Image Benchmark. In this benchmark, the results are calculated on a private server by uploading binary predictions. It is therefore required to select a threshold for the model's prediction scores, which we selected by optimizing on the validation set. Table 3 shows that our total accuracy outperforms all previous methods by a large margin. Importantly, the accuracy in each of the different categories, on its own, is not a direct indication of detection performance, since there is a threshold-dependent trade-off between the accuracy on real and fake images. These results hint at the relative detection difficulty of each class and are provided for completeness.

5.3 Ablation Study and Generalization Experiment
Face manipulation methods sometimes leave behind artifacts, possibly imperceptible, that can be leveraged for detection. Different manipulation methods, however, can produce different artifacts, as shown in Fig. 5. There is, therefore, no guarantee that a fake detection method will perform well when presented with fakes generated by unseen schemes which do not leave such known, recognizable artifacts. We next verify the accuracy of our proposed scheme in detecting fakes produced by methods that were not part of its training set.
Authorized licensed use limited to: Modern Education Society's College of Engineering. Downloaded on September 26,2023 at 09:48:05 UTC from IEEE Xplore. Restrictions apply.
NIRKIN ET AL.: DEEPFAKE DETECTION BASED ON DISCREPANCIES BETWEEN FACES AND THEIR CONTEXT 6117

Fig. 5. Extending FaceForensics++ with unseen methods. Examples shown for the same source / target face pair, using the 3D-based methods,
FaceSwap [6] and Nirkin et al. [18], and the GAN-based methods, Deepfakes [4] and FSGAN [7]. Despite using the same image pairs in all four
examples, the results are different, each exhibiting its own artifacts.
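When extending the benchmark with the image-to-image method of Nirkin et al. [18] (Section 5.3), each target frame is paired with the source frame whose estimated head pose is closest. A minimal sketch of this pairing rule, assuming hypothetical (yaw, pitch, roll) triplets in degrees rather than the paper's actual pose estimator output:

```python
def match_frames(target_poses, source_poses):
    """For each target frame, return the index of the source frame with
    the closest estimated head pose. Poses are illustrative
    (yaw, pitch, roll) triplets; distance is plain squared difference."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [min(range(len(source_poses)), key=lambda i: dist(t, source_poses[i]))
            for t in target_poses]
```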

We conduct these tests by extending the FaceForensics++ set, applying two additional face swapping methods to its videos: (1) FSGAN [7] and (2) Nirkin et al. [18], a 3D-based face swapping method that uses single-image 3D face reconstruction and segmentation; both have publicly available implementations. Examples of the four face swapping methods, using the same source and target, can be seen in Fig. 5. Each method generates face swaps with distinct artifacts, with the exception of FSGAN, which produces images with fewer apparent artifacts.

The extended version of the benchmark follows the pair selections prescribed by the original FaceForensics++ dataset. Because Nirkin et al. [18] was designed for image-to-image face swapping, for each frame in the target video we select its closest frame in the source video, in terms of estimated head pose.

In all our generalization experiments, we train the variants of our method and its XceptionNet baseline on the pristine and face swapping manipulations, using the official training and validation subsets of FaceForensics++. In these experiments, we do not use the reenactment detection network Er.

5.3.1 Generalization and Ablation Results
We studied the effect of our face versus context discrepancy approach by comparing it to a naive classifier. Three additional variants of our method were also considered: (i) a version where all classifiers are frozen in the training process, (ii) an end-to-end version of our method, where all the classifiers are unfrozen in the training process, and finally, (iii) a variant where, instead of subtracting vf and vc, we concatenate them.

We report our generalization results in Table 4 (ROC curves are provided in Fig. 6). For the results appearing at the top of Table 4, we fix the thresholds for XceptionNet and our method at zero. In the bottom of Table 4, we optimize both thresholds on the test set. The threshold of the face identity difference in the first experiment is optimized using the VGGFace2 test set.

Our results show that our method significantly outperforms the baseline on both unseen methods. The performance gap is greater on FSGAN-generated faces, where artifacts are rarer. Artifacts produced by the 3DMM-based method are more similar to the ones we encounter in other methods, and so the gap is smaller.

As evident from the ROC curves in Fig. 6, the frozen version of our method, in which the method-specific classifier is not given the option to adjust to the identity signal, is the worst performing variant. The end-to-end version of our method is also less able to generalize. This result is due to the end-to-end training process sullying the face and context classifiers' roles of extracting aligned identity representations. The concatenation variant performed slightly worse than our method. This could be a result of the increase in the capacity of D.

Finally, note that the face discrepancy signal by itself is not competitive with networks trained to detect fakes. However, it is indicative of fake videos, and its contribution to the overall method is seen by comparing our method with the baseline XceptionNet.

TABLE 4
Generalization Ablation

                             3D-based swap              FSGAN
Methods                    Fake   Real   Total    Fake   Real   Total
Face identity difference   47.33  77.66  62.50    34.66  80.50  57.58
Binary XceptionNet [10]    55.38  97.72  76.55    24.80  94.68  59.74
Ours (frozen)              52.79  96.44  74.62    34.76  92.46  63.61
Ours (end-to-end)          54.74  97.70  76.22    31.66  95.38  63.52
Ours (concat)              55.42  96.54  75.98    41.64  93.30  67.47
Ours                       68.20  95.10  81.65    47.14  90.56  68.85
Face identity difference   60.20  66.12  63.16    38.96  77.50  58.23
Binary XceptionNet [10]    89.03  81.36  85.20    73.92  64.04  68.98
Ours (frozen)              85.52  86.92  86.22    67.66  76.20  71.93
Ours (end-to-end)          90.77  83.54  87.16    79.58  71.40  75.49
Ours (concat)              91.41  84.34  87.87    71.92  78.00  74.96
Ours                       90.52  88.20  89.36    78.72  71.66  75.19

Generalization results of variants of our method on our extended version of the FaceForensics++ [5] test set. Top: results with a fixed threshold at zero. Bottom: upper-bound results, obtained with a fixed threshold maximizing total accuracy on the test set. See Section 5.3 for more details.

5.3.2 Image Laundering Ablation
We demonstrate our method's generalization performance under different image laundering attacks, on three face swapping methods, from older to newer: the 3D-based swap [18], FSGAN [7], and FaceShifter [38]. The image laundering operations include JPEG compressions of 25, 50, and 75 percent, where a higher percentage means stronger
6118 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 44, NO. 10, OCTOBER 2022

Fig. 6. Results on our two variations of FaceForensics++ videos. (a) Generalization results with FSGAN generated swaps [7]. (b) Generalization
results with swaps generated by Nirkin et al. [18]. See Section 5.3 for more details.
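The variants compared in Fig. 6 differ in how the face and context identity vectors (vf and vc) are combined before the discrepancy classifier D. A minimal sketch of the default subtraction head versus the concatenation ablation; the function name and toy vectors are illustrative, not the paper's implementation:

```python
def combine_identity_vectors(v_face, v_context, mode="subtract"):
    """Combine face and context identity embeddings into the input of
    the discrepancy classifier. 'subtract' mirrors the paper's default
    (vf - vc); 'concat' is the ablation variant, which doubles the
    input dimension and hence the capacity of the classifier."""
    if mode == "subtract":
        return [f - c for f, c in zip(v_face, v_context)]
    if mode == "concat":
        return list(v_face) + list(v_context)
    raise ValueError(f"unknown mode: {mode}")
```

Note that for d-dimensional embeddings the subtraction head keeps a d-dimensional input while concatenation produces 2d, which is consistent with the capacity increase discussed for the concat variant.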

TABLE 5
Image Laundering Ablation

Generalization results on three face swapping methods, using the videos from the FaceForensics++ [5] test set: 3D-based swap [18], FSGAN [7], and FaceShifter [38], where the images are subject to different resizings and compressions. 'RAW': the image is unaltered; 'C##': a JPEG compression operation (a higher percentage means stronger compression); 'S##': a scaling operation (percentage relative to the original resolution).
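Both the bottom of Table 4 and the laundering comparisons rely on a fixed decision threshold. The upper-bound protocol (a threshold maximizing total accuracy on the test set) can be sketched as follows; the paper does not spell out the formula for "total", so the mean of the per-class accuracies is an assumption here:

```python
def best_threshold(scores, labels):
    """Pick the score threshold maximizing total accuracy.

    Sketch of the upper-bound protocol described for the bottom of
    Table 4. 'Total' is assumed to be the mean of the fake and real
    per-class accuracies (labels: 1 = fake, 0 = real); frames with
    score >= threshold are declared fake.
    """
    n_fake = sum(labels)
    n_real = len(labels) - n_fake
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        fake_acc = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t) / n_fake
        real_acc = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t) / n_real
        total = (fake_acc + real_acc) / 2.0
        if total > best_acc:
            best_t, best_acc = t, total
    return best_t, best_acc
```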

compression, and scaling relative to the original resolution, also 25, 50, and 75 percent.

The results are detailed in Table 5. As expected, when applying compression stronger than 25 percent or scaling more aggressive than 75 percent, the laundering attacks reduce the accuracy of all the detection methods, and the larger the compression or the scaling, the larger the drop in accuracy. Our method consistently outperforms XceptionNet [10] and the face identity difference baseline, by a margin, under all the different laundering attacks.

Finally, the results indicate that the face identity difference becomes less effective on the more recent methods. Recent face swapping methods improve the estimated pose and expression of the target face. These methods therefore allow more of the target face's identity signal to remain, and hence reduce the effectiveness of the face identity difference.

5.4 Qualitative Results
Fig. 7 presents qualitative examples of detected and missed fake faces from the DFDC collection. Fig. 7a shows example fakes detected by our method but undetected by the state-of-the-art XceptionNet fake detector [60]. Fig. 7b offers example fakes which were detected by XceptionNet but missed by our method. Finally, Fig. 7c shows fakes missed by both approaches.

Clearly, our method excels in cases where swapping artifacts are hard to detect (Fig. 7a). Examining Fig. 7b shows that fake images detected by XceptionNet often exhibit visible artifacts, which that method was optimized to detect. Our method includes a face swapping component, Es (Section 4.2), trained to detect similar method-specific artifacts, but it does not provide the same detection accuracy as the baseline when such artifacts are present. Our overall approach still outperforms the baseline by a wide margin, as reported in Sections 5.1 and 5.2. Finally, the fakes missed by both methods are typically challenging images with low contrast or blurry features, as in Fig. 7c.

6 DISCUSSION AND LIMITATIONS
Some of the most recent methods perform face manipulation by generating the entire head [65], [66]. These methods usually employ a pretrained StyleGAN2 [67] network, or adopt its architecture. The generation is controlled by manipulating StyleGAN2's latent code to maintain the source identity and preserve the attributes of the target face. While these methods are successful in maintaining the appearance of the source face and incorporating the attributes of the target face, the pose and expression are currently less accurate. As a result, the methods lack temporal coherence when applied to videos.

In the future, these methods might overcome the current limitation and be able to perform full-head face swapping in videos. This would create a new class of methods for which the assumptions underlying our method will not hold.

Fig. 7. Qualitative detection results. Examples taken from the DFDC collection. (a) Fakes detected by our method, but undetected by a leading baseline, the XceptionNet fake detector [60]. (b) Fakes detected by XceptionNet but missed by our approach. (c) Fakes missed by both methods. See Section 5.4 for more details.
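The scaling ('S##') laundering operation of Section 5.3.2 can be illustrated with a toy nearest-neighbour rescale on a grid of pixel values; the real experiments would use proper image resampling (e.g., a library such as Pillow), so this is only a stand-in sketch:

```python
def rescale_attack(img, percent):
    """Toy stand-in for the S## laundering operation: downscale a
    2D grid of pixel values to `percent` of its size with
    nearest-neighbour sampling, then upscale back, discarding detail
    in the process. `img` is a list of equal-length rows."""
    h, w = len(img), len(img[0])
    nh = max(1, h * percent // 100)
    nw = max(1, w * percent // 100)
    # Downscale, then restore the original resolution.
    small = [[img[y * h // nh][x * w // nw] for x in range(nw)] for y in range(nh)]
    return [[small[y * nh // h][x * nw // w] for x in range(w)] for y in range(h)]
```

A 50-percent rescale of a 2x2 grid collapses it to a single value and back, mimicking how aggressive scaling destroys the fine artifacts that detectors rely on.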

6.1 Face Reenactment Detection Cues
Face reenactments are detected by the Er network (see Section 4.2), which is specifically trained to differentiate real images from those manipulated by face reenactment methods. Moreover, considering Figs. 2a and 2b, it seems that the manipulated regions of face reenactment methods resemble those created by face swapping schemes. Thus, Ef and Ec might utilize cues other than the subject identity, such as those due to the sensor and lenses. Those marks might be overridden in the synthesis process, and might improve the detection of face swapping manipulations as well.

7 CONCLUSION
While the ability to manipulate faces in images and video has increased dramatically in the last few years, most recent methods follow similar patterns. In this work, we propose a novel detection cue which utilizes the commonalities of all recent face identity manipulation methods. It is complementary to conventional real/fake classifiers and can be used alongside them. Overcoming this approach would require a much broader integration of the new identity into the image, making our contribution hard to circumvent without additional technological breakthroughs. This is in contrast to artifact detection methods, which are susceptible to the constant progress in the visual quality of generated images. It is our hope that by further analyzing the design principles of face swapping techniques, additional methods of identifying fake images and videos will be discovered, leading to effective mitigation of the societal risks of such media.

ACKNOWLEDGMENTS
This work was supported by the European Research Council (ERC) through the European Union's Horizon 2020 research and innovation programme under Grant ERC CoG 725974. Lior Wolf, Yosi Keller, and Tal Hassner have equally contributed.

REFERENCES
[1] Google AI, "Contributing data to deepfake detection research." [Online]. Available: https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html
[2] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner, "Face2Face: Real-time face capture and reenactment of RGB videos," in Proc. Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2387-2395.
[3] J. Thies, M. Zollhöfer, and M. Nießner, "Deferred neural rendering: Image synthesis using neural textures," 2019, arXiv:1904.12356.
[4] Deepfakes, "Deepfakes." Accessed: Nov. 15, 2019. [Online]. Available: https://github.com/deepfakes/faceswap
[5] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, "FaceForensics++: Learning to detect manipulated facial images," 2019, arXiv:1901.08971.
[6] FaceSwap, "FaceSwap." Accessed: Nov. 15, 2019. [Online]. Available: https://github.com/MarekKowalski/FaceSwap/
[7] Y. Nirkin, Y. Keller, and T. Hassner, "FSGAN: Subject agnostic face swapping and reenactment," in Proc. Int. Conf. Comput. Vis., 2019, pp. 7184-7193.
[8] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, "MesoNet: A compact facial video forgery detection network," in Proc. Int. Workshop Inf. Forensics Secur., 2018, pp. 1-7.
[9] B. Bayar and M. C. Stamm, "A deep learning approach to universal image manipulation detection using a new convolutional layer," in Proc. Int. Workshop Inf. Hiding Multimedia Secur., 2016, pp. 5-10.
[10] D. Cozzolino, G. Poggi, and L. Verdoliva, "Recasting residual-based local descriptors as convolutional neural networks: An application to image forgery detection," in Proc. Int. Workshop Inf. Hiding Multimedia Secur., 2017, pp. 159-164.
[11] J. Fridrich and J. Kodovsky, "Rich models for steganalysis of digital images," IEEE Trans. Inf. Forensics Secur., vol. 7, no. 3, pp. 868-882, Jun. 2012.
[12] N. Rahmouni, V. Nozick, J. Yamagishi, and I. Echizen, "Distinguishing computer graphics from natural images using convolution neural networks," in Proc. Int. Workshop Inf. Forensics Secur., 2017, pp. 1-6.

[13] U. A. Ciftci, I. Demir, and L. Yin, "How do the hearts of deep fakes beat? Deep fake source detection via interpreting residuals with biological signals," in Proc. IEEE Int. Joint Conf. Biometrics (IJCB), 2020, pp. 1-10.
[14] H. Qi et al., "DeepRhythm: Exposing deepfakes with attentional visual heartbeat rhythms," in Proc. 28th ACM Int. Conf. Multimedia, 2020, pp. 4318-4327.
[15] S. Hu, Y. Li, and S. Lyu, "Exposing GAN-generated faces using inconsistent corneal specular highlights," 2020, arXiv:2009.11924.
[16] L. Li et al., "Face X-ray for more general face forgery detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 5001-5010.
[17] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, "Attribute and simile classifiers for face verification," in Proc. Conf. Comput. Vis. Pattern Recognit., 2009, pp. 365-372.
[18] Y. Nirkin, I. Masi, A. T. Tuan, T. Hassner, and G. Medioni, "On face segmentation, face swapping, and face perception," in Proc. Int. Conf. Autom. Face Gesture Recognit., 2018, pp. 98-105.
[19] V. Blanz, S. Romdhani, and T. Vetter, "Face identification across different poses and illuminations with a 3D morphable model," in Proc. Int. Conf. Autom. Face Gesture Recognit., 2002, pp. 192-197.
[20] V. Blanz and T. Vetter, "Face recognition based on fitting a 3D morphable model," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 9, pp. 1063-1074, Sep. 2003.
[21] Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu, "Celeb-DF: A new dataset for deepfake forensics," 2019, arXiv:1909.12962.
[22] B. Dolhansky, R. Howes, B. Pflaum, N. Baram, and C. C. Ferrer, "The deepfake detection challenge (DFDC) preview dataset," 2019, arXiv:1910.08854.
[23] D. Bitouk, N. Kumar, S. Dhillon, P. Belhumeur, and S. K. Nayar, "Face swapping: Automatically replacing faces in photographs," ACM Trans. Graph., vol. 27, no. 3, 2008, Art. no. 39.
[24] V. Blanz, K. Scherbaum, T. Vetter, and H.-P. Seidel, "Exchanging faces in images," Comput. Graph. Forum, vol. 23, no. 3, pp. 669-676, 2004.
[25] Y. Lin, S. Wang, Q. Lin, and F. Tang, "Face swapping under large pose variations: A 3D model based approach," in Proc. Int. Conf. Multimedia Expo, 2012, pp. 333-338.
[26] S. Mosaddegh, L. Simon, and F. Jurie, "Photorealistic face de-identification by aggregating donors' face components," in Proc. Asian Conf. Comput. Vis., 2014, pp. 159-174.
[27] I. Kemelmacher-Shlizerman, "Transfiguring portraits," ACM Trans. Graph., vol. 35, no. 4, 2016, Art. no. 94.
[28] O. Alexander, M. Rogers, W. Lambeth, M. Chiang, and P. Debevec, "Creating a photoreal digital actor: The Digital Emily project," in Proc. Conf. Vis. Media Prod., 2009, pp. 176-187.
[29] L. Wolf, Z. Freund, and S. Avidan, "An eye for an eye: A single camera gaze-replacement method," in Proc. Conf. Comput. Vis. Pattern Recognit., 2010, pp. 817-824.
[30] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman, "Synthesizing Obama: Learning lip sync from audio," ACM Trans. Graph., vol. 36, no. 4, 2017, Art. no. 95.
[31] H. Averbuch-Elor, D. Cohen-Or, J. Kopf, and M. F. Cohen, "Bringing portraits to life," ACM Trans. Graph., vol. 36, no. 6, 2017, Art. no. 196.
[32] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer, "GANimation: Anatomically-aware facial animation from a single image," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 818-833.
[33] E. Sanchez and M. Valstar, "Triple consistency loss for pairing distributions in GAN-based face synthesis," 2018, arXiv:1811.03492.
[34] H. Kim et al., "Deep video portraits," ACM Trans. Graph., vol. 37, no. 4, 2018, Art. no. 163.
[35] R. Natsume, T. Yatagawa, and S. Morishima, "FSNet: An identity-aware generative model for image-based face swapping," in Proc. Asian Conf. Comput. Vis., 2018, pp. 117-132.
[36] R. Natsume, T. Yatagawa, and S. Morishima, "RSGAN: Face swapping and editing using face and hair representation in latent spaces," 2018, arXiv:1804.03447.
[37] K. Nagano et al., "paGAN: Real-time avatars using dynamic textures," ACM Trans. Graph., vol. 37, no. 6, pp. 1-12, 2018.
[38] L. Li, J. Bao, H. Yang, D. Chen, and F. Wen, "FaceShifter: Towards high fidelity and occlusion aware face swapping," 2019, arXiv:1912.13457.
[39] S. Jia, Z. Xu, H. Wang, C. Feng, and T. Wang, "Coarse-to-fine copy-move forgery detection for video forensics," IEEE Access, vol. 6, pp. 25323-25335, 2018.
[40] Y. Wu, W. Abd-Almageed, and P. Natarajan, "BusterNet: Detecting copy-move image forgery with source/target localization," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 168-184.
[41] Y. Wu, W. Abd-Almageed, and P. Natarajan, "Image copy-move forgery detection via an end-to-end deep neural network," in Proc. Winter Conf. Appl. Comput. Vis., 2018, pp. 1907-1915.
[42] Y. Wu, W. AbdAlmageed, and P. Natarajan, "ManTra-Net: Manipulation tracing network for detection and localization of image forgeries with anomalous features," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 9543-9552.
[43] P. Korshunov and S. Marcel, "Speaker inconsistency detection in tampered video," in Proc. Eur. Signal Process. Conf., 2018, pp. 2375-2379.
[44] Y. Li, M.-C. Chang, and S. Lyu, "In ictu oculi: Exposing AI generated fake face videos by detecting eye blinking," 2018, arXiv:1806.02877.
[45] Y. Li and S. Lyu, "Exposing deepfake videos by detecting face warping artifacts," 2018, arXiv:1811.00656.
[46] W. Quan, K. Wang, D.-M. Yan, and X. Zhang, "Distinguishing between natural and computer-generated images using convolutional neural networks," IEEE Trans. Inf. Forensics Secur., vol. 13, no. 11, pp. 2772-2787, 2018.
[47] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, "FaceForensics: A large-scale video dataset for forgery detection in human faces," 2018, arXiv:1803.09179.
[48] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, and P. Natarajan, "Recurrent convolutional strategies for face manipulation detection in videos," in Proc. Conf. Comput. Vis. Pattern Recognit. Workshops, 2019, pp. 80-87.
[49] F. Matern, C. Riess, and M. Stamminger, "Exploiting visual artifacts to expose deepfakes and face manipulations," in Proc. Winter Conf. Appl. Comput. Vis. Workshops, 2019, pp. 83-92.
[50] H. H. Nguyen, T. Tieu, H.-Q. Nguyen-Son, V. Nozick, J. Yamagishi, and I. Echizen, "Modular convolutional neural network for discriminating between computer-generated images and photographic images," in Proc. Int. Conf. Availability, Rel. Secur., 2018, pp. 1-10.
[51] H. H. Nguyen, J. Yamagishi, and I. Echizen, "Use of a capsule network to detect fake images and videos," 2019, arXiv:1910.12467.
[52] S.-Y. Wang, O. Wang, A. Owens, R. Zhang, and A. A. Efros, "Detecting photoshopped faces by scripting Photoshop," in Proc. Int. Conf. Comput. Vis., 2019, pp. 10072-10081.
[53] X. Yang, Y. Li, and S. Lyu, "Exposing deep fakes using inconsistent head poses," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), 2019, pp. 8261-8265.
[54] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis, "Two-stream neural networks for tampered face detection," in Proc. Conf. Comput. Vis. Pattern Recognit. Workshops, 2017, pp. 1831-1839.
[55] J. Stehouwer, H. Dang, F. Liu, X. Liu, and A. Jain, "On the detection of digital face manipulation," 2019, arXiv:1910.01717.
[56] K. Songsri-in and S. Zafeiriou, "Complement face forensic detection and localization with facial landmarks," 2019, arXiv:1910.05455.
[57] P. Korshunov and S. Marcel, "Vulnerability assessment and detection of deepfake videos," in Proc. Int. Conf. Biometrics, 2019, pp. 1-6.
[58] O. I. Al-Sanjary, A. A. Ahmed, and G. Sulong, "Development of a video tampering dataset for forensic investigation," Forensic Sci. Int., vol. 266, pp. 565-572, 2016.
[59] J. Li et al., "DSFD: Dual shot face detector," in Proc. Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5060-5069.
[60] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proc. Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1251-1258.
[61] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in Proc. AAAI Conf. Artif. Intell., 2017, pp. 4278-4284.
[62] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, "VGGFace2: A dataset for recognising faces across pose and age," in Proc. Int. Conf. Autom. Face Gesture Recognit., 2018, pp. 67-74.
[63] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," Univ. Massachusetts, Amherst, Tech. Rep. 07-49, 2007.
[64] H. H. Nguyen, F. Fang, J. Yamagishi, and I. Echizen, "Multi-task learning for detecting and segmenting manipulated facial images and videos," 2019, arXiv:1906.06876.
[65] Y. Shen and B. Zhou, "Closed-form factorization of latent semantics in GANs," 2020, arXiv:2007.06600.


[66] E. Härkönen, A. Hertzmann, J. Lehtinen, and S. Paris, "GANSpace: Discovering interpretable GAN controls," 2020, arXiv:2004.02546.
[67] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, "Analyzing and improving the image quality of StyleGAN," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 8110-8119.
[68] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980.
[69] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Interv., 2015, pp. 234-241.
[70] T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L.-P. Morency, "OpenFace 2.0: Facial behavior analysis toolkit," in Proc. Int. Conf. Autom. Face Gesture Recognit., 2018, pp. 59-66.

Yuval Nirkin received the BSc degree in computer engineering from the Technion Israel Institute of Technology, Haifa, in 2011 and the MSc degree in computer science from The Open University of Israel, Ra'anana, Israel, in 2017. He is currently working toward the PhD degree with the Faculty of Electrical Engineering, Bar-Ilan University, Ramat-Gan, Israel. His research interests include deep learning, computer vision, and computer graphics. He was a reviewer of ECCV, ICCV, and CVPR, and was recognized as a high quality reviewer in ECCV'20.

Lior Wolf received the PhD degree from the Hebrew University, under the supervision of Prof. Shashua. He is currently a full professor with the School of Computer Science, Tel-Aviv University, Israel. He was a postdoctoral researcher with Prof. Poggio's lab, Massachusetts Institute of Technology. He is an ERC grantee and was the recipient of ICCV 2001 and ICCV 2019 honorable mentions, and the best paper awards at ECCV 2000 and ICANN 2016.

Yosi Keller received the BSc degree in electrical engineering from the Technion Israel Institute of Technology, Haifa, in 1994, and the MSc and PhD degrees, summa cum laude, in electrical engineering from Tel Aviv University in 1998 and 2003, respectively. From 2003 to 2006, he was a Gibbs assistant professor with the Department of Mathematics, Yale University, New Haven, CT, USA. He is currently an associate professor with the Faculty of Engineering, Bar Ilan University, Ramat-Gan, Israel. His research interests include computer vision, machine and deep learning, and biometrics.

Tal Hassner received the MSc and PhD degrees in applied mathematics and computer science from the Weizmann Institute of Science in 2002 and 2006, respectively. In 2008 he joined the Department of Mathematics and Computer Science, The Open University of Israel, where he was an associate professor until 2018. From 2015 to 2018, he was a senior computer scientist with the Information Sciences Institute (ISI) and a visiting research associate professor with the Institute for Robotics and Intelligent Systems, USC Viterbi School of Engineering, CA, USA. From 2018 to 2019, he was a principal applied scientist with AWS Rekognition, where he designed the latest AWS face recognition pipelines. Since 2019 he has been an applied research lead with Facebook AI, supporting both text (OCR) and people (faces) photo understanding teams. He has been a program chair at WACV'18 and ICCV'21. He was also a workshop chair at CVPR'20, a tutorial chair at ICCV'17 and ECCV'22, and an area chair for CVPR, ECCV, and AAAI. He is an associate editor for the IEEE Transactions on Pattern Analysis and Machine Intelligence and the IEEE Transactions on Biometrics, Behavior, and Identity Science.

