
Deepfakes Generation and Detection: State-of-the-art,

open challenges, countermeasures, and way forward


Momina Masood (1), Mariam Nawaz (2), Khalid Mahmood Malik (3), Ali Javed (4), Aun Irtaza (5)
(1, 2) Department of Computer Science, University of Engineering and Technology-Taxila, Pakistan
(3, 4) Department of Computer Science and Engineering, Oakland University, Rochester, MI, USA
(5) Electrical and Computer Engineering Department, University of Michigan-Dearborn, MI, USA

Abstract
Easy access to audio-visual content on social media, together with the availability of modern tools such as TensorFlow and Keras, open-source trained models, and economical computing infrastructure, and the rapid evolution of deep-learning (DL) methods, especially Generative Adversarial Networks (GANs), has made it possible to generate deepfakes that disseminate disinformation, revenge porn, financial fraud, and hoaxes, and that disrupt government functioning. Existing surveys have mainly focused on the detection of deepfake images and videos. This paper provides a comprehensive review and detailed analysis of existing tools and machine learning (ML) based approaches for deepfake generation, as well as the methodologies used to detect such manipulations, for both audio and visual deepfakes. For each category of deepfake, we discuss the manipulation approaches, current public datasets, and key standards for the performance evaluation of deepfake detection techniques, along with their reported results. Additionally, we discuss open challenges and enumerate future directions to guide researchers on issues that need to be considered to improve both deepfake generation and detection. This work is expected to assist readers in understanding the creation and detection mechanisms of deepfakes, along with their current limitations and future directions.

Keywords Artificial intelligence, Deepfakes, Deep learning, Face swap, Lip-synching, Puppetmaster, Speech
synthesis, Voice conversion.

1 Introduction
The availability of economical digital smart devices like cellphones, tablets, laptops, and digital cameras has resulted
in the exponential growth of multimedia content (e.g. images and videos) in cyberspace. Additionally, the evolution
of social media over the last decade has allowed people to share captured multimedia content rapidly, leading to a
significant increase in multimedia content generation and ease of access to it. At the same time, we have witnessed
tremendous advancement in the field of ML with the introduction of sophisticated algorithms that can easily
manipulate multimedia content to spread disinformation online through social media platforms. Given the ease with
which false information may be created and spread, it has become increasingly difficult to know the truth and trust
the information, which may result in harmful consequences. Moreover, today we live in a “post-truth” era, where a
piece of information or disinformation is utilized by malevolent actors to manipulate public opinion. Disinformation
is an active measure that has the potential to cause severe damage: election manipulation, creation of warmongering
situations, defaming any person, etc. Deepfake generation has advanced significantly in recent years; it can be used to propagate disinformation around the globe and may pose a severe threat, in the form of fake news, in the future. Deepfakes are synthetic, AI-generated videos and audio. The use of video as evidence is currently the norm in every sector of litigation and criminal justice proceedings, and a video admitted as evidence must be authentic and its integrity must be verified. Most multimedia forensic examiners, however, face the challenge of analyzing evidence files that originate from social networks and sharing websites, e.g., YouTube and Facebook. Satisfying these authentication and integrity requirements and flagging manipulated videos on social media is a challenging task, especially as deepfake generation becomes more sophisticated. Once a deepfake has been created, the further use of powerful, sophisticated, and easy-to-use manipulation tools (e.g., Zao [1], REFACE [2], FaceApp [3], Audacity [4], Sound Forge [5]) can make authentication and integrity verification of the generated video an even more difficult task.
Deepfake videos can be categorized into the following types: i) face-swap, ii) lip-syncing, iii) puppet-master, iv) face synthesis and attribute manipulation, and v) audio deepfakes. In face-swap deepfakes, the face of the source person is replaced with that of the target person to generate a fake video in which the target person appears to perform actions that, in reality, the source person performed. Face-swap deepfakes are usually generated to attack the popularity or reputation of famous personalities by showing them in scenarios in which they never appeared [6], or to damage their reputation in the eyes of the public, for example through non-consensual pornography [7]. In lip-syncing-based deepfakes, the movements of the target person's lips are altered to make them consistent with a specific audio recording; the aim is to show an individual saying whatever the attacker wants the victim to appear to say. In puppet-master deepfakes, the expressions of the target person, such as eye movement, facial expressions, and head movement, are mimicked; the aim is to hijack the expression, or even the full body [8], of the person in a video and animate it according to the impersonator's desire. Face synthesis and attribute manipulation involve the generation of photo-realistic face images and facial
desire. Face synthesis and attribute manipulation involve the generation of photo-realistic face images and facial
attribute editing. This manipulation is generated to spread disinformation on social media using fake profiles. Lastly,
audio deepfakes focus on the generation of the target speaker’s voice using deep learning techniques to portray the
speaker saying something they have not said [9, 10]. The fake voices can be generated using either text-to-speech
synthesis (TTS) or voice conversion (VC).
Compared with deepfake videos, less attention has been paid to the detection of audio deepfakes. In the last few years, voice
manipulation has also become very sophisticated. Synthetic voices are not only a threat to automated speaker
verification systems, but also to voice-controlled systems deployed in the Internet of Things (IoT) settings [11, 12].
Voice cloning has tremendous potential to destroy public trust and to empower criminals to manipulate business
dealings or private phone calls. For example, a case was recently reported in which fraudsters used a cloned voice of a company executive to dupe a subordinate into transferring hundreds of thousands of dollars into a secret account [13]. The integration of voice cloning into deepfakes is expected to become a unique challenge for
deepfake detection. Therefore, it is important that, unlike current approaches that focus only on detecting video signal
manipulations, audio forgeries should also be examined.
No recently published survey on deepfake generation and detection covers both the audio and video modalities of deepfakes. Most existing surveys focus only on reviewing deepfake image and video detection. In [14], the main focus was on generic image manipulation and multimedia forensic techniques, but deepfake generation techniques were not discussed. In [15], an overview of face manipulation and detection techniques was presented. Another survey [16] covered visual deepfake detection approaches but did not discuss audio cloning and its detection. The latest work, presented by Mirsky et al. [17], gives an in-depth analysis of visual deepfake creation techniques; however, detection approaches are only briefly discussed and audio deepfakes are not covered. To the best of our knowledge, this paper is the first attempt to provide a detailed analysis and review of both audio and visual deepfake detection techniques, as well as generative approaches. The following are the main contributions of our work:
i. To give the research community an insight into various types of video and audio-based deepfake
generation and detection methods.
ii. To provide the reader with the latest improvements, trends, limitations, and challenges in the field of
audio-visual deepfakes.
iii. To give an understanding to the reader about the possible implications of audio-visual deepfakes.
iv. To act as a guide to the reader to understand the future trends of audio and visual deepfakes.
The rest of the paper is organized as follows. Section 2 presents a discussion of deepfakes as a source of disinformation.
In Section 3, the history and evolution of deepfakes are briefly discussed. Section 4 presents an overview of state-of-the-art audio and visual deepfake generation and detection techniques. We also discuss open challenges for
both audio-visual deepfake generation and detection in Section 4. Section 5 presents the available datasets used for
both audio and video deepfakes detection. In Section 6, we discuss the possible future trends of both deepfakes
generation and detection, and finally, we conclude our work in Section 7.
Methodology
In this paper, we have reviewed the existing approaches, used for the generation and detection of audio and visual
manipulations, published in various authentic venues. A detailed description of the approach and protocol employed
for the review is given in Table 1.
Table 1: Literature collection and preparation protocol

Purpose:
• To provide a brief overview of existing state-of-the-art techniques and identify potential gaps in both audio-visual deepfake generation and detection.
• To provide a systematic review of, and structure for, the existing state-of-the-art techniques with respect to each category of audio-visual deepfake generation and detection.

Data sources: Google Scholar, SpringerLink, ACM Digital Library, IEEE Xplore, and DBLP.

Query: A methodological search was designed over the data sources mentioned above using the following query strings: Deepfakes / Faceswap / Face reenactment / Lip-syncing / Deepfakes AND Faceswap / Deepfakes AND Face reenactment / Deepfakes AND Lip-syncing / GAN synthesized / Face manipulation / Attribute manipulation / GAN AND Puppet mastery / GAN AND Expression manipulation / Video synthesis / Audio synthesis / Deep learning AND TTS / Deep learning AND Voice conversion / Deep learning AND Voice cloning / Deepfakes AND Dataset / Deepfakes AND Audio / Deepfakes AND Video / Deepfakes AND Image.

Method: We systematically categorized the literature on video and audio deepfakes as follows (see Figure 1):
a) Video deepfake generation and detection: face swap, lip-syncing, puppet-mastery, entire face synthesis, and facial attribute manipulation.
b) Audio deepfake generation and detection: text-to-speech synthesis and voice conversion.

Size: A total of 350 papers were retrieved from the listed data sources using the method and queries mentioned above, up to 07-08-2021. Only studies that were relevant and met the criteria for 'deepfakes' were included in the positive set; other relevant studies that did not meet these criteria were placed in the negative set. All remaining items, i.e., white papers and non-peer-reviewed articles, were excluded from the final selection.

Study types / inclusion and exclusion: Peer-reviewed journal papers and conference proceedings were given more importance. Additionally, a few articles from preprint archives were also considered.

Figure 1: Categorization of Audio and Visual Deepfakes

2 Disinformation and Misinformation using Deepfakes


Misinformation is false or inaccurate information that is communicated regardless of any intention to deceive, whereas disinformation comprises the strategies employed by influential actors to fabricate information in order to achieve planned political or financial objectives. Disinformation is expected to become the main vehicle for intentionally spreading manipulated news to affect public opinion or obscure reality. Because of the extensive use
of social media platforms, it is now very easy to spread false news [18]. Although all categories of fake multimedia
(i.e. fake news, fake images, and fake audio) could be sources of disinformation and misinformation, audiovisual-
based deepfakes are expected to be much more devastating. Historically, deepfakes were created to make famous
personalities controversial among their fans. For example, in 2017 a celebrity faced such a situation when a fake
pornographic video was circulated in cyberspace. This is evidence that deepfakes can be used to damage reputations, i.e., character assassination of renowned people to defame them [16], to blackmail individuals for
monetary benefits, or to create political or religious unrest by targeting politicians or religious scholars with fake
videos/speeches [19], etc. This damage is not limited to targeting individuals; rather deepfakes can be used to
manipulate elections, create warmongering situations by showing fake videos of missiles launched to destroy the
enemy state or used to deceive military analysts by portraying fake information, like showing a fake bridge across the
river, to mislead troop deployment, and so on.
The deepfakes are expected to advance the following current sources of disinformation and misinformation to the next
level.
Trolls: Independent Trolls are hobbyists who spread inflammatory information to cause disorder and reactions in
society by playing with the emotions of people [20]. For example, posting manipulated racist or sexist audio-visual content can infuriate individuals and promote hatred. Similarly, during the 2020 US presidential election campaign, conflicting narratives about Trump and Biden were circulated on social media, contributing to an environment of fear [21]. As opposed to independent trolls who spread false information for
their own satisfaction, hired trolls will perform the same job for monetary benefits. Different actors, like political
parties, businessmen, and companies routinely hire people to forge news related to their competitors and spread it in
the market [22]. For example, according to a report published by Western intelligence [23], Russia is running “troll
farms,” where trolls are trained to affect conversations related to national or international issues. According to these
reports, deepfake videos generated by hired trolls are the newest weapon in the ongoing fabricated news war that can
bring a more devastating effect on society.
Bots: Bots are automated software or algorithms used to spread fabricated or misleading content among people. A
study published in [24, 25] concludes that during the 2016 US election campaign, bots generated one-fifth of the tweets in the last month of the campaign. The emergence of deepfakes has amplified the negative impact of bots; for example, bots on the messaging app Telegram [26] were recently used to produce fake nude pictures of women, a case now under investigation by the Italian authorities.
Conspiracy Theorists: Conspiracy Theorists can range from nonprofessional filmmakers to Reddit agents who spread
vague and doubtful claims on the internet either through “documentaries” or by posting stories and memes [27]. They
believe that certain prominent communities are running the public while concealing their activities, like conspiracy
theories about a Jewish plan to control the world [27, 28]. Moreover, recently, several conspiracy theorists have
connected the current COVID pandemic with the USA and China. In such a situation, the use of fabricated audio-
visual deepfake content by these theorists can increase controversy in global politics.
Hyper-partisan Media: Hyper-partisan media includes fake news websites and blogs which intentionally spread false
information. Because of the extensive usage of social media, hyper-partisan media is one of the biggest potential incubators for spreading fabricated news among the people [29]. Convincing AI-generated fake content can help these outlets spread disinformation easily to attract visitors or increase views. As social platforms are largely
independent and ad-driven mediums, spreading fabricated information may purely be a profit-making strategy.
Politicians: One of the main sources of disinformation is the political parties themselves, which may spread
manipulated information for point-scoring. Due to a large number of followers on social platforms, politicians are
central nodes in online networks. So, politicians can use their fame and public support to spread false news among
their followers. To defame opponent parties, politicians can use deepfakes to post controversial content about their
competitors on conventional media [27].
Foreign Governments: As the Internet has converted the world into a “Global Village,” it is easy for conflicting
countries to spread false news to advance their agendas abroad. Their motive is to target the reputation of a country in
the rest of the world. Many countries are running government‐sponsored social media accounts, websites, and
applications, contributing to political propaganda globally. In particular, the governments of China, Israel, Turkey, Russia, the UK, Ukraine, India, and North Korea are believed to employ 'digital foot soldiers' to smear opponents, spread disinformation, and post fake texts for 'pocket money' [30]. These countries run numerous official accounts across online platforms such as Twitter, Instagram, and Facebook [31]. Doctoring multimedia content has become so easy that private actors may be able to initiate such foreign attacks on their own, increasing tension among countries.

3 DeepFakes Evolution
The earliest example of manipulated multimedia content occurred in 1860 when a portrait of southern politician John
Calhoun was skillfully manipulated by replacing his head with that of US President Abraham Lincoln [32]. Usually,
such manipulation is accomplished by adding (splicing), removing (inpainting), and replicating (copy-move) the
objects within or between two images [14]. Then, suitable post-processing steps like scaling, rotating, and color
adjustment are applied to improve the visual appearance, scale, and perspective coherence.
Aside from these traditional manipulation methods, advancements in Computer Graphics and DL techniques now
offer a variety of different automated approaches for digital manipulation with better semantic consistency. The recent
trend involves the synthesis of videos from scratch using autoencoders, or Generative Adversarial Network (GAN),
for different applications [33] and, more specifically, photorealistic human face generation based on any attribute [34-
37]. Other pervasive manipulations, called "shallow fakes" or "cheap fakes," are audio-visual manipulations created
using cheaper and more accessible software. Shallow fakes involve basic editing of a video utilizing slowing, speeding,
cutting, and selectively splicing together unaltered existing footage that can alter the whole context of the information
delivered. In May 2019, a video of US Speaker Nancy Pelosi was selectively edited to make it appear that she was
slurring her words and was drunk or confused [38]. The video was shared on Facebook and received more than 2.2
million views within 48 hours. Video manipulation for the entertainment industry, specifically in film production, has
been done for decades. Fig. 2 shows the evolution of deepfakes over the years. An early notable academic project was the Video Rewrite program [39], published in 1997 and intended for applications in movie dubbing. It was the first software to automatically reanimate facial movements in an existing video to match a different audio track, and it achieved surprisingly convincing results.
The first true deepfake appeared online in September 2017 when a Reddit user named “deepfake” posted a series of
computer-generated videos of famous actresses with their faces swapped onto pornographic content [16]. Another
notorious deepfake case was the release of the DeepNude application, which allowed users to generate fake nude images [40]. This marked the point at which deepfakes gained wider recognition within a large community. Today, deepfake applications, e.g., FakeApp [41], FaceSwap [42], and ZAO [1], are easily accessible, and users without a
computer engineering background can create a fake video within seconds. Moreover, open-source projects on GitHub,
such as DeepFaceLab [43] and related tutorials, are easily available on YouTube. A list of other available deepfake
creation applications, software, and open-source projects is given in Table 2. Contemporary academic projects that led to the development of deepfake technology are Face2Face [36] and Synthesizing Obama [35], published in 2016
and 2017 respectively. Face2Face [36] captures the real-time facial expressions of the source person as they talk into
a commodity webcam. It modifies the target person’s face in the original video to depict them, mimicking source
facial expressions. Synthesizing Obama [35] is, in effect, a Video Rewrite 2.0 program that modifies the mouth movement in video footage of a person to depict that person saying the words contained in an arbitrary audio clip. These works [35, 36] focus on the manipulation of the head and facial region only. Recent developments expand the application of deepfakes to the entire body [8, 44] and to the generation of deepfakes from a single image [45-47].

Figure 2: The timeline of Deepfakes evolution


Most deepfakes currently present on social platforms like YouTube, Facebook or Twitter may be regarded as harmless,
entertaining, or artistic. However, there are also some examples where deepfakes have been used for revenge porn,
hoaxes, political or non-political influence, and financial fraud [48, 49]. In 2018, a deepfake video went viral online in which former U.S. President Barack Obama appeared to insult then-President Donald Trump [50]. In June 2019, a fake video of Facebook CEO Mark Zuckerberg was posted to Instagram by the Israeli advertising company "Canny" [48]. Recently, extremely realistic deepfake videos of Tom Cruise posted on the TikTok platform gained 1.4 million views within a few days [51].
Table 2: An overview of Audio-visual deepfakes generation software, applications, and open-source projects
Tool | Type | Reference/Developer | Technique

Cheap fakes
Adobe Premiere | Commercial desktop software | Adobe | Audio/video editing, AI-powered video reframing
Corel VideoStudio | Commercial desktop software | Corel | Proprietary AI

Lip-sync
Dynalips | Commercial web app | www.dynalips.com/ | Proprietary
CrazyTalk | Commercial web app | www.reallusion.com/crazytalk/ | Proprietary
Wav2Lip | Open-source implementation | github.com/Rudrabha/Wav2Lip | GAN with a pre-trained discriminator network and a visual quality loss function

Facial Attribute Manipulation
FaceApp | Mobile app | FaceApp Inc | Deep generative CNNs
Adobe | Commercial desktop software | Adobe | DNNs + filters
Rosebud | Commercial web app | www.rosebud.ai/ | Proprietary AI

Face Swap
ZAO | Mobile app | Momo Inc | Proprietary
REFACE | Mobile app | Neocortext, Inc | Proprietary
Reflect | Mobile app | Neocortext, Inc | Proprietary
Impressions | Mobile app | Synthesized Media, Inc. | Proprietary
FakeApp | Desktop app | www.malavida.com/en/soft/fakeapp/ | GAN
FaceSwap | Open-source implementation | faceswapweb.com/ | Two encoder-decoder pairs with shared encoder parameters
DFaker | Open-source implementation | github.com/dfaker/df | DSSIM loss function [34] for face reconstruction; Keras-based implementation
DeepFaceLab | Open-source implementation | github.com/iperov/DeepFaceLab | Provides several face extraction methods (e.g., dlib, MTCNN, S3FD); extends different Faceswap models, i.e., H64, H128, LIAEF128, SAE [33]
FaceSwapGAN | Open-source implementation | github.com/shaoanlu/faceswap-GAN | Adds two loss functions, adversarial loss and perceptual loss, to the auto-encoder
DeepFake-tf | Open-source implementation | github.com/StromWine/DeepFake-tf | Same as DFaker but implemented in TensorFlow
Faceswapweb | Commercial web app | faceswapweb.com/ | GAN

Face Reenactment
Face2Face | Open-source implementation | web.stanford.edu/~zollhoef/papers/CVPR2016_Face2Face/page.html | 3DMM and ML techniques
Dynamixyz | Commercial desktop software | www.dynamixyz.com/ | Machine learning
FaceIT3 | Open-source implementation | github.com/alew3/faceit_live3 | GAN

Face Generation
Generated Photos | Commercial web app | generated.photos/ | StyleGAN

Voice Synthesis
Overdub | Commercial web app | www.descript.com/overdub | Proprietary (AI-based)
Respeecher | Commercial web app | www.respeecher.com/ | Combines traditional digital signal processing algorithms with proprietary deep generative modeling techniques
SV2TTS | Open-source implementation | github.com/CorentinJ/Real-Time-Voice-Cloning | LSTM with generalized end-to-end loss
ResembleAI | Commercial web app | www.resemble.ai/ | Proprietary (AI-based)
Voicery | Commercial web app | www.voicery.com/ | Proprietary AI and deep learning
VoiceApp | Mobile app | Zoezi AB | Proprietary (AI-based)

Apart from visual manipulation, audio deepfakes are a new form of cyber-attack, with the potential to cause severe
damage to individuals due to highly sophisticated speech synthesis techniques, e.g., WaveNet [52], Tacotron [53], and Deep Voice 1 [54]. Fake-audio-assisted financial scams increased significantly in 2019 due to the progression of speech synthesis technology. In August 2019, a European company's chief executive officer, tricked by an audio deepfake, made a fraudulent transfer of $243,000 [13]. Voice-mimicking AI software was used to clone the voice patterns of the victim by training ML algorithms on audio recordings obtained from the internet. If such techniques
can be used to imitate the voice of a top government official or a military leader and applied at scale, it could have
serious national security implications [55].

4 Audio Visual Deepfakes Types and Categorization of Literature


This section provides an in-depth analysis of existing state-of-the-art methods for audio and visual deepfakes. A review
for each category of deepfake in terms of creation and detection is provided to give a deeper understanding of the
various approaches. We provide a critical investigation of existing literature which includes the technologies, their
capabilities, limitations, challenges, and future trends for both deepfake creation and detection. Deepfakes are broadly
categorized into two groups namely visual and audio manipulations depending on the targeted forged modality (Fig.
1). The visual deepfakes are further grouped into the following types based on manipulation level (i) face swap/identity
swap, (ii) lip-syncing, (iii) face-reenactment/puppet-mastery, iv) entire face synthesis and v) facial attribute
manipulation. The audio deepfakes are further classified as i) text-to-speech synthesis and ii) voice conversion.

4.1 Face-swap
Generation
Visual manipulation is nothing new; images and videos have been forged for as long as they have existed. In face-swap [56], or face replacement, the face of the person in the source video is automatically replaced by the face in the target video, as shown in Fig. 3. Traditional face-swap approaches [57-59] generally take three steps to perform a face-swap operation. First, these tools detect the face in the source images and then select a candidate face image from a facial library that is similar to the input face in appearance and pose. Second, the method replaces the eyes, nose, and mouth of the face, adjusts the lighting and color of the candidate face image to match the appearance of the input images, and seamlessly blends the two faces. Finally, the third step ranks the blended candidate replacements by computing a match distance over the overlap region. These approaches may offer good results under certain conditions but have two major limitations. First, they completely replace the input face with the target face, so the expressions of the input face image are lost. Second, the synthetic result is very rigid and the replaced face looks unnatural; e.g., a matching pose is required to generate good results.

Figure 3: A visual representation of Face-Swap based deepfake


Recently, DL-based approaches have become popular for synthetic media creation due to their realistic results. At the
same time, deepfakes have shown how these approaches can be applied to automated digital multimedia manipulation.
In 2017, the first deepfake video that appeared online was created using a face-swap approach, where the face of a
celebrity was shown in pornographic content [16]. This approach used a neural network to morph a victim’s face onto
someone else's features while preserving the original facial expression. As time went on, face-swap software such as FakeApp [41] and FaceSwap [42] made it both easier and quicker to produce deepfakes with more convincing results by replacing the face in a video. These approaches typically use two encoder-decoder pairs: an encoder extracts the latent features of a face from the image, and the decoder then reconstructs the face. To swap faces between the source and target images, two encoder-decoder pairs are required, where each encoder is first trained on the source and then on the target images. Once training is complete, the decoders are swapped, so that the original encoder of the source image and the decoder of the target image are used to regenerate the target image with the features of the source image. The resulting image has the source's face on the target's face, while keeping the target's facial expressions. Fig. 4 is an example of deepfake creation where the feature set of face A is connected
with the decoder B to reconstruct face B from the original face A. The recently launched ZAO [1], REFACE [2], and
FakeApp [41] applications are more popular due to their effectiveness in producing realistic face swap-based
deepfakes. FakeApp allows the selective modification of facial parts. ZAO and REFACE have gone viral lately as less
tech-savvy users can swap their faces with movie stars and embed themselves into well-known movies and TV clips.
There are many publicly available implementations of face-swap technology using deep neural networks, such as
FaceSwap [42], DFaker [60], DeepFaceLab [43], DeepFake-tf [61], and FaceSwapGAN [62], leading to the creation
of a growing number of synthesized media clips.

Figure 4: Creation of a Deepfake using an auto-encoder and decoder. The same encoder-decoder pair is
used to learn the latent features of the faces during training, while during generation decoders are
swapped, such that latent face A is subjected to decoder B to generate face A with the features of face B
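
To make the swapping mechanism in Fig. 4 concrete, the following is a minimal, illustrative sketch (in PyTorch) of a shared-encoder, two-decoder autoencoder of the kind used by FaceSwap-style tools; the layer sizes, the 64x64 resolution, and the training settings are assumptions made for illustration and do not reproduce any specific implementation listed in Table 2.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Shared encoder: maps a 64x64 RGB face crop to a latent vector."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.1),    # 64 -> 32
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.1),  # 32 -> 16
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.1), # 16 -> 8
            nn.Flatten(),
            nn.Linear(256 * 8 * 8, latent_dim),
        )
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Identity-specific decoder: reconstructs a face from the shared latent code."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 256 * 8 * 8)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 16
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 16 -> 32
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 32 -> 64
        )
    def forward(self, z):
        return self.net(self.fc(z).view(-1, 256, 8, 8))

encoder = Encoder()
decoder_a, decoder_b = Decoder(), Decoder()   # one decoder per identity
l1 = nn.L1Loss()
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder_a.parameters()) + list(decoder_b.parameters()),
    lr=5e-5)

def train_step(faces_a, faces_b):
    """Each decoder learns to reconstruct its own identity from the shared latent space."""
    loss = l1(decoder_a(encoder(faces_a)), faces_a) + l1(decoder_b(encoder(faces_b)), faces_b)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def swap_a_to_b(faces_a):
    """At inference, route identity A's latent code through decoder B to produce the swap."""
    with torch.no_grad():
        return decoder_b(encoder(faces_a))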

Until recently, most research focused on advances in face-swapping technology, using either a reconstructed 3D morphable model (3DMM) [56, 63] or a GAN-based model [62, 64]. Korshunova et al. [63] proposed a convolutional neural network (CNN) based approach that transfers the semantic content of the input image, e.g., face posture, facial expression, and illumination conditions, to impose that style on another image. They introduced a loss function that
was a weighted combination of style loss, content loss, light loss, and total variation regularization. This method [63]
generates more realistic deepfakes compared to [57], however, it requires a large amount of training data. Moreover,
the trained model can be used to transform only one image at a time. Nirkin et al. [56] presented a method that used a fully convolutional network (FCN) for face segmentation and replacement, while a 3DMM was used to estimate facial geometry and the corresponding texture. Face reconstruction was then performed on the target image by adjusting
the model parameters. These approaches [56, 63] have the limitation of subject-specific or pair-specific training.
Recently subject agnostic approaches have been proposed to address this limitation.
In [62], an improved GAN-based deepfake generation approach was proposed, which adds an adversarial loss and a VGGFace-based perceptual loss to the auto-encoder architecture of [42]. The addition of the VGGFace perceptual loss made the eyeball direction appear more realistic and consistent with the input, and also helped smooth the artifacts introduced by the segmentation mask, resulting in a high-quality output video. FSGAN [64] allowed face
swapping and reenactment in real-time by following the reenact and blend strategy. This method simultaneously
manipulates pose, expression, and identity while producing high-quality and temporally coherent results. These GAN-
based approaches [62, 64] outperform several existing autoencoder-decoder methods [41, 42] as they work without
being explicitly trained on subject-specific images. Moreover, the iterative nature makes them well-suited for face
manipulation tasks such as generating realistic images of fake faces.
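
As an illustration of how such loss terms can be combined, the sketch below (PyTorch) mixes a pixel reconstruction loss, an adversarial loss, and a perceptual loss computed from a frozen face-recognition backbone standing in for VGGFace. The weights and the helper names (generator, discriminator, feature_net) are assumptions for illustration, not the exact formulation used in [62] or [64].

import torch
import torch.nn as nn

# Hypothetical building blocks: `generator` produces the swapped face, `discriminator`
# scores realism, and `feature_net` is a frozen face-recognition backbone whose
# activations define the perceptual loss. None of these names come from the cited code.
bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()

def generator_loss(generator, discriminator, feature_net, source, target,
                   w_rec=1.0, w_adv=0.1, w_perc=0.5):
    """Weighted sum of reconstruction, adversarial, and perceptual terms."""
    fake = generator(source)

    # Reconstruction: pixel-level similarity to the target frame.
    rec = l1(fake, target)

    # Adversarial: the generator tries to make the discriminator output "real".
    logits = discriminator(fake)
    adv = bce(logits, torch.ones_like(logits))

    # Perceptual: match deep identity/texture features of the target.
    with torch.no_grad():
        feat_target = feature_net(target)
    perc = l1(feature_net(fake), feat_target)

    return w_rec * rec + w_adv * adv + w_perc * perc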
Some works have used a disentanglement concept for face swapping based on VAEs. RSGAN [65] employed two separate
VAEs to encode the latent representation of facial and hair regions respectively. Both encoders were conditioned to
predict the attributes that describe the target identity. Another approach, FSNet [66], presented a framework to achieve
face-swapping using a latent space, to separately encode the face region of the source identity and landmarks of the
target identity, which were later combined to generate the swapped face. However, these approaches [65, 66] hardly preserve target attributes such as occlusion and illumination conditions.
Facial occlusions are always challenging to handle in face-swapping methods. In many cases, the facial region in the
source or target is partially covered with hair, glasses, a hand, or some other object. This results in visual artifacts and
inconsistencies in the resultant image. FaceShifter [67] generates a swapped face with high-fidelity and preserves the
target attributes such as pose, expression, and occlusion. The last layer of a facial recognition classifier was used to
encode the source identity and the target attributes, with feature maps being obtained via the U-Net decoder. These
encoded features were passed to a novel generator with cascaded Adaptive Attentional Denormalization layers inside
residual blocks which adaptively adjusted the identity region and target attributes. Finally, another network was used
to fix occlusion inconsistencies and refine the results. Table 3 presents the details of face-swap-based deepfake creation approaches.
Table 3: An overview of Face-swap based Deepfake generation techniques
Reference | Technique | Features | Dataset | Output Quality | Limitations
Faceswap [42] | Encoder-decoder | Facial landmarks | Private | 256×256 | Blurry results due to lossy compression; lack of pose, facial expression, gaze direction, hairstyle, and lighting; requires a massive number of target images
FaceSwapGAN [62] | GAN | VGGFace | VGGFace | 256×256 | Lack of texture details; generates overly smooth results
DeepFaceLab [68] | Encoder-decoder | Facial landmarks | Private | 256×256 | Fails to blend very different facial hues; requires target training data
Fast Face-swap [63] | CNN | VGGFace | CelebA (200,000 images); Yale Face Database B (different pose and lighting conditions) | 256×256 | Works for a single person only; gives better results for frontal face views; lack of skin texture details (overly smooth results) and facial expression transfer; lack of occluding objects, i.e., glasses
Nirkin et al. [56] | FCN-8s-VGG architecture | Basel Face Model to represent faces; 3DDFA model for expression | IARPA Janus CS2 (1275 face videos) | 256×256 | Poor results for differing image resolutions; fails to blend very different facial hues
Chen et al. [69] | VGG-16 net | 68 facial landmarks | Helen (2330 images) | 256×256 | Provides more realistic results but is sensitive to variation in posture and gaze
FSNet [66] | GAN | Facial landmarks | CelebA | 128×128 | Sensitive to variation in angle
RSGAN [65] | GAN | Facial landmarks, segmentation mask | CelebA | 128×128 | Sensitive to variation in angle, occlusion, and lighting; limited output resolution
FaceShifter [67] | GAN | Attributes (face, occlusions, lighting, or styles) | VGG Face; CelebA-HQ; FFHQ | 256×256 | Stripped artifacts

Detection
As shown in Table 4, attempts have been made to detect face-swap-based deepfakes using both handcrafted and deep features.
Techniques based on handcrafted Features: Zhang et al. [70] proposed a technique to detect swapped faces by
using the Speeded Up Robust Features (SURF) descriptor for feature extraction; the features were then used to train an SVM for classification. The technique was tested on a set of Gaussian-blurred images. This approach improved deepfake image detection performance; however, it cannot detect manipulated videos. Yang et al. [71] introduced an
approach to detect deepfakes by estimating the 3D head position from 2D facial landmarks. The computed difference
among the head poses was used as a feature vector to train the SVM classifier and was later used to differentiate
between original and forged content. This technique exhibits good performance for deepfake detection but has a
limitation in estimating landmark orientation in the blurred images, which degrades the performance of this method
under such scenarios. Guera et al. [72] presented a method for detecting synthesized faces from videos. Multimedia
stream descriptors [73] were used to extract the features that were then used to train the SVM, and random forest
classifiers to differentiate between real and manipulated faces in the videos. This technique gives an effective solution for deepfake detection; however, it does not perform well against video re-encoding attacks. Ciftci et al. [74]
introduced an approach to detect forensic changes within videos by computing the biological signals (e.g. heart rate)
from the face portion of the videos. Temporal and spatial characteristics of facial features were computed to train the
SVM and CNN model to differentiate between bonafide and fake videos. This technique has improved deepfake
detection accuracy, however, it has a large feature vector space and its detection accuracy drops significantly when
dimensionality reduction techniques are applied. Jung et al. [75] proposed a technique to detect deepfakes by
identifying an anomaly based on the time, repetition, and intervened eye-blinking duration within videos. This method
combined the Fast-HyperFace [76] and EAR technique (eye detect) [77] to detect eye blinking. An integrity
authentication method was employed by tracking the fluctuation of eye blinks based on gender, age, behavior, and
time factor to spot real and fake videos. The approach in [75] exhibits better deepfake detection performance; however, it is not appropriate if the subject in the video suffers from a mental illness, as abnormal eye-blinking patterns are often observed in such cases. Furthermore, the works in [78, 79] presented ML-based approaches for face-swap detection; however, they still require performance improvements in the presence of post-processing attacks.
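
The handcrafted-feature detectors above share a common pipeline: a fixed-length feature vector is extracted per image or video (e.g., head-pose differences [71] or eye-blink statistics [75]) and fed to a conventional classifier. The sketch below (scikit-learn) illustrates this pipeline with placeholder features; the feature extraction step and the array shapes are illustrative assumptions, not those of any cited method.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

# X: one row per video, columns = handcrafted features (e.g., mean/std of the
#    difference between head poses estimated from different landmark subsets).
# y: 1 = deepfake, 0 = genuine. Random placeholders stand in for real data here.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
clf.fit(X_tr, y_tr)

scores = clf.predict_proba(X_te)[:, 1]
print("ROC-AUC:", roc_auc_score(y_te, scores))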

Techniques based on Deep Features: Several studies have employed the DL-based method for Face-swap
manipulation detection. Li et al. [80] proposed a method of detecting the forensic modifications made within the
videos. First, the facial landmarks were extracted using the dlib software package [81]. Next, CNN-based models
named ResNet152, ResNet101, ResNet50, and VGG16 were trained to detect forged content from videos. This
approach is more robust in detecting forensic changes; however, it exhibits low performance on videos that have been compressed multiple times. Guera et al. [82] proposed a novel CNN to extract features at the frame level. Then an RNN
was trained on the set of extracted features to detect deepfakes from the input videos. This work achieves good
detection performance but only on videos of short duration i.e. videos of 2 seconds or less. Li et al. [83] proposed a
technique to detect deepfakes by using the fact that the manipulated videos lack accurate eye blinking in synthesized
faces. CNN/RNN approach was used to detect the lack of eye blinking in the videos to expose the forged content. This
technique shows better deepfake detection performance, however, it only uses the lack of eye blinking as a clue to
detect the deepfakes. This approach has the following potential limitations: i) it is unable to detect the forgeries in
videos with frequent eye blinking, ii) it is unable to detect manipulated faces with closed eyes in training, and iii) it is
inapplicable in scenarios where forgers can create realistic eye blinking in synthesized faces. Montserrat et al. [84]
introduced a method for detecting visual manipulations in a video. Initially, a Multi-task convolutional neural network
(MTCNN) [85] was employed to detect the faces from all video frames on which CNN was applied, to compute the
features. In the next step, the Automatic Face Weighting (AFW) mechanism, along with a Gated Recurrent Unit, was
used to discard the false-detected faces. Finally, an RNN was employed to combine the features from all steps and
locate the manipulated content in the videos. The approach in [84] works well for deepfake detection, however, it is
unable to obtain the prediction from the features in multiple frames. Lima et al. [86] introduced a technique to detect
video manipulation by learning the temporal information of frames. Initially, VGG-11 was employed to compute the
features from video frames, on which LSTM was applied for temporal sequence analysis. Several CNN frameworks,
named R3D, ResNet, I3D, were trained on the temporal sequence descriptors outputted by the LSTM, to identify
original and manipulated videos. This approach [86] improves deepfake detection accuracy but at the expense of high
computational cost. Agarwal et al. [87] presented an approach to locate face-swap-based manipulations by combining
both facial and behavioral biometrics. The behavioral biometric was recognized with an encoder-decoder network (Facial Attributes-Net, FAb-Net) [88], whereas VGG-16 was employed for facial feature computation. Finally, by merging both metrics, inconsistencies in the matched identities were revealed to locate face-swap deepfakes. This
approach [87] works well for unseen cases, however, it may not generalize well to lip-synch-based deepfakes.
Fernandes et al. [89] introduced a technique to locate visual manipulation by measuring the heart-rate of the subjects.
Initially, three techniques: skin color variation [90], average optical intensity [91], and Eulerian video magnification
[92], were used to measure heart rate. The computed heart-rate was used to train a Neural Ordinary Differential
Equations (Neural-ODE) model [93] to differentiate the original and altered content. This technique [89] works well
for deepfake detection but has increased computational complexity. Other works [94-98] have explored CNN-based methods for the detection of swapped faces; however, more robust approaches are still needed.
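
Many of the deep-feature detectors above follow the same frame-level-CNN plus recurrent-aggregation pattern used in [82, 86]. The following is a minimal, illustrative PyTorch sketch of that pattern; the ResNet-18 backbone, feature size, and clip length are assumptions rather than the exact configurations of the cited works.

import torch
import torch.nn as nn
from torchvision import models

class FrameSequenceDetector(nn.Module):
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        backbone = models.resnet18(weights=None)      # load pretrained weights in practice
        backbone.fc = nn.Identity()                   # keep the 512-d pooled features
        self.backbone = backbone
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)              # one logit: fake vs. real

    def forward(self, clips):
        # clips: (batch, time, 3, H, W) face crops from consecutive frames
        b, t, c, h, w = clips.shape
        feats = self.backbone(clips.reshape(b * t, c, h, w)).reshape(b, t, -1)
        _, (h_n, _) = self.lstm(feats)                # last hidden state summarizes the clip
        return self.head(h_n[-1]).squeeze(-1)         # per-clip logit

model = FrameSequenceDetector()
logits = model(torch.randn(2, 8, 3, 224, 224))        # two clips of 8 frames each
loss = nn.BCEWithLogitsLoss()(logits, torch.tensor([1.0, 0.0]))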
Table 4: An overview of face swap deepfake detection techniques and their limitations
Author | Technique | Features | Best performance | Evaluation dataset | Limitations

Handcrafted features
Zhang et al. [70] | SURF + SVM | 64-D SURF features | Precision = 97%, Recall = 88%, Accuracy = 92% | Deepfake dataset generated from the LFW face database | Unable to preserve facial expressions; works with static images only
Yang et al. [71] | SVM classifier | 68-D facial landmarks using DLib | ROC = 89% (UADFV), 84% (DARPA MediFor) | UADFV; DARPA MediFor GAN Image/Video Challenge | Degraded performance for blurry images
Guera et al. [72] | SVM, RF classifiers | Multimedia stream descriptors [73] | AUC = 93% (SVM), 96% (RF) | Custom dataset | Fails on video re-encoding attacks
Ciftci et al. [74] | CNN | Biological (medical) signal features | Accuracy = 96% | FaceForensics dataset | Large feature vector space
Jung et al. [75] | Fast-HyperFace [76], EAR [77] | Landmark features | Accuracy = 87.5% | Eye Blinking Prediction dataset | Inappropriate for people with mental illness
Matern et al. [78] | MLP, LogReg | 16-D texture-energy-based features of eyes and teeth [99] | AUC = 0.851 (MLP), 0.784 (LogReg) | FF++ | Only applicable to face images with open eyes and clear teeth
Agarwal et al. [79] | SVM classifier | 16 AUs using the OpenFace2 toolkit | AUC = 93% | Own dataset | Degraded performance when the person is looking off-camera

Deep learning-based features
Li et al. [80] | VGG16, ResNet50, ResNet101, ResNet152 | DLib facial landmarks | AUC = 84.5 (VGG16), 97.4 (ResNet50), 95.4 (ResNet101), 93.8 (ResNet152) | DeepFake-TIMIT | Not robust to multiple video compressions
Guera et al. [82] | CNN/RNN | Deep features | Accuracy = 97.1% | Custom dataset | Applicable to short videos only (2 s)
Li et al. [83] | CNN/RNN | DLib facial landmarks | TPR = 99% | Custom dataset | Fails on frequent or closed-eye blinking
Montserrat et al. [84] | CNN + RNN | Deep features | Accuracy = 92.61% | DFDC | Performance needs improvement
Lima et al. [86] | VGG11 + LSTM | Deep features | Accuracy = 98.26%, AUC = 99.73% | Celeb-DF | Computationally complex
Agarwal et al. [87] | VGG-16 + encoder-decoder network | Deep features + behavioral biometrics | AUC = 99% (WLDR), 99% (FF), 93% (DFD), 99% (Celeb-DF) | WLDR; FF; DFD; Celeb-DF | Unable to generalize well to unseen deepfakes
Fernandes et al. [89] | Neural-ODE model | Heart rate | Loss = 0.0215 (custom), 0.0327 (DeepfakeTIMIT) | Custom; DeepfakeTIMIT | Computationally expensive
Sabir et al. [94] | CNN/RNN | CNN features | Accuracy = 96.3% | FF++ | Results are reported for static images only
Afchar et al. [95] | MesoInception-4 | Deep features | TPR = 81.3% (DF) | FF++ | Performance degrades on low-quality videos
Nguyen et al. [96] | CNN | Deep features | Accuracy = 83.71% | FF++ | Degraded detection performance for unseen cases
Stehouwer et al. [97] | CNN | Deep features | Accuracy = 99.43% | Diverse Fake Face Dataset (DFFD) | Computationally expensive due to the large feature vector space
Rossle et al. [98] | SVM + CNN | Co-occurrence matrix + deep features | Accuracy = 90.29% | FF++ | Low performance on compressed videos

4.2 Lip-syncing
Generation
The Lip-syncing approach involves synthesizing a video of a target identity such that the mouth region in the
manipulated video is consistent with an arbitrary audio input [35] (Fig. 5). A key aspect of synthesizing visual speech is the movement and appearance of the lower portion of the mouth and its surrounding region. To convey a message more effectively and naturally, it is important to generate proper lip movements along with expressions. From a practical point of view, lip-syncing has many applications in the entertainment industry, such as creating audio-driven photorealistic digital characters in films or games, voice bots, and dubbing films into foreign languages. Moreover, it
can also help hearing-impaired persons understand a scenario by lip-reading from a video created using the genuine
audio.

Figure 5: A visual representation of lip-syncing of an existing video to an arbitrary audio clip

Existing works on lip-syncing [100, 101] require the reselection of frames from a video or a transcription, along with target emotions, to synthesize the lip motion. These approaches are limited to a dedicated emotional state or do not generalize well to unseen faces. DL models, however, are capable of learning and predicting the movements directly from audio features. A detailed analysis of existing methods used for lip-sync-based deepfake generation is presented in Table 5. Suwajanakorn et al. [35] proposed an approach to generate a photo-realistic lip-synced video using a target's
video and an arbitrary audio clip as input. The recurrent neural network (RNN) based model was employed to learn
the mapping between audio features and mouth shape for every frame, and later used frame reselection to fill in the
texture around the mouth based on the landmarks. This synthesis was performed on the lower facial regions i.e. mouth,
chin, nose, and cheeks. This approach applied a series of post-processing steps, such as smoothing jaw location and
re-timing the video to align vocal pauses, or talking head motion, to produce videos that appear more natural and
realistic. In this work, Barack Obama was considered as a case study due to the sufficient availability of online video footage. However, this model must be retrained for each individual. The Speech2Vid [102] model took an audio clip
and a static image of a target subject as input and generated a video that is lip-synced with the audio clip. This model
used the Mel Frequency Cepstral Coefficients (MFCC) features extracted from the audio input and fed them into a
CNN-based encoder-decoder. As a post-processing step, a separate CNN was used for frame deblurring and
sharpening to preserve the quality of visual content. This model generalizes well to unseen faces and thus does not
need retraining for new identities. However, this work is unable to synthesize a variety of emotions on facial
expression.
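
To illustrate the kind of architecture Speech2Vid-style models use, the sketch below (PyTorch) combines an identity encoder for a still face image with an audio encoder for a short MFCC window and decodes a lip-synced frame. All layer sizes, the 64x64 output resolution, and the MFCC window shape are illustrative assumptions and not the published Speech2Vid configuration.

import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 4, stride=2, padding=1), nn.ReLU())

class LipSyncGenerator(nn.Module):
    def __init__(self, audio_dim=12 * 20, emb=256):
        super().__init__()
        # Identity encoder: 64x64 RGB still image -> identity embedding
        self.face_enc = nn.Sequential(
            conv_block(3, 64), conv_block(64, 128), conv_block(128, 256),  # 64 -> 8
            nn.Flatten(), nn.Linear(256 * 8 * 8, emb))
        # Audio encoder: flattened MFCC window (e.g., 12 coefficients x 20 frames)
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, emb), nn.ReLU(), nn.Linear(emb, emb))
        # Decoder: joint embedding -> synthesized 64x64 frame
        self.fc = nn.Linear(2 * emb, 256 * 8 * 8)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, still_face, mfcc_window):
        z = torch.cat([self.face_enc(still_face),
                       self.audio_enc(mfcc_window.flatten(1))], dim=1)
        return self.dec(self.fc(z).view(-1, 256, 8, 8))

gen = LipSyncGenerator()
frame = gen(torch.randn(1, 3, 64, 64), torch.randn(1, 12, 20))  # -> (1, 3, 64, 64)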
The GAN-based manipulations such as [103] employed a temporal GAN, consisting of an RNN, to generate a
photorealistic video directly from a still image and speech signal. The resulting video included synchronized lip
movements, eye-blinking, and natural facial expression without relying on manually handcrafted audio-visual
features. Multiple discriminators were employed to control frame quality, audio-visual synchronization, and overall
video quality. This model can generate lip-syncing for any individual in real-time. In [104], an adversarial learning
method was employed to learn the disentangled audio-visual representation. The speech encoder was trained to project
both the audio and visual representations into the same latent space. The advantage of using a disentangled
representation was that both the audio and video could serve as a source of speech information during the generation
process. As a result, it was possible to generate realistic talking face sequences on an arbitrary identity with
synchronized lip movement. Garrido et al. [105] presented a Vdub system that captures the high-quality 3D facial
model of both the source and the target actor. The computed facial model was used to photo-realistically reconstruct
a 3D mouth model of the dubber to be applied on the target actor. An audio channel analysis was performed to better
align the synthesized visual content with the audio. This approach renders a coarse-textured teeth proxy well; however, it fails to synthesize a high-quality interior mouth region. In [106], a face-to-face translation method, LipGAN, was
proposed to synthesize a talking face video of any individual utilizing a given single image and audio segment as
input. LipGAN consists of a generator network to synthesize portrait video frames with a modified mouth and jaw
area from the given audio and target frames and uses a discriminator network to decide whether the synthesized face
is synchronized with the given audio. This approach is unable to ensure temporal consistency in the synthesized
content, as blurriness and jitter can be observed in the resultant video. Recently, Prajwal et al. [107] proposed Wav2Lip, a speaker-independent model that can accurately synchronize the lip movement in a video recording with a given audio clip. This approach employs a lip-sync discriminator that is pre-trained in the absence of a generator and is not further trained on the noisy generated videos. The model uses several consecutive frames instead of a single frame in the discriminator and employs a visual quality loss along with a contrastive loss, thus increasing the visual quality by considering temporal correlation.
The recent approaches can synthesize photo-realistic fake videos from speech (audio-to-video) or text (text-to-video)
with convincing video results. The methods proposed in [35, 108] can edit existing video of a person to the desired
speech to be spoken from text input by modifying the mouth movement and speech accordingly. These approaches
are more focused on synchronizing lip movements by synthesizing the region around the mouth. In [109], a VAE-based framework was proposed to synthesize full-pose video, including facial expressions, gestures, and body posture movements, from a given audio input.
Table 5: An overview of Lip sync-based Deepfake generation techniques
Reference | Technique | Features | Dataset | Output Quality | Limitations
Suwajanakorn et al. [35] | RNN (single-layer unidirectional LSTM) | Mouth landmarks (36-D); MFCC audio features (28-D) | YouTube videos (17 hours) | 2048×1024 | Requires a large amount of training data for the target person; requires retraining for each identity; sensitive to 3D head movement; no direct control over facial expressions
Speech2Vid [102] | Encoder-decoder CNN | VGG-M network; MFCC audio features | VGG Face; LRS2 (41.3 hours of video); VoxCeleb2 (test) | 109×109 | Lacks the synthesis of emotional facial expressions
Vougioukas et al. [103] | Temporal GAN | MFCC audio features | GRID; TCD TIMIT | 96×128 | Lacks the synthesis of emotional facial expressions; flickering and jitter; sensitive to large facial motions
Zhou et al. [104] | Temporal GAN | Deep audio-video features | LRW; MS-Celeb-1M | 256×256 | Lacks the synthesis of emotional facial expressions
Vdub [105] | 3DMM | 66 facial feature points; MFCC features | Private | 1024×1024 | Requires video of the target
LipGAN [106] | GAN | VGG-M network; MFCC features | LRS2 | 1280×720 | Visual artifacts and temporal inconsistency; unable to preserve source lip-region characteristics
Wav2Lip [107] | GAN | Mel-spectrogram representation | LRS2 | 1280×720 | Lacks the synthesis of emotional facial expressions

Detection
Techniques based on handcrafted Features: Initially, ML-based methods were employed for the detection of lip-sync visual deepfakes. Korshunov et al. [110] proposed a technique employing 40-D MFCC features comprising 13-D
MFCC, 13-D delta, and 13-D double-delta, along with the energy, in combination with mouth landmarks to train the
four classifiers, i.e. SVM, LSTM, multilayer perceptron (MLP), and Gaussian mixture model (GMM). Three publicly
available datasets, named VidTIMIT[111], AMI corpus [112], and GRID corpus [113] were used to evaluate the
performance of this technique. From the results, it was concluded in [110] that LSTM achieves better performance
over the other techniques. However, the lip-syncing deepfake detection performance of the LSTM method drops for the VidTIMIT [111] and AMI [112] datasets because they contain fewer training samples per person than the GRID dataset. In [113], MFCC features were substituted with DNN embeddings, i.e., language-specific phonetic features used for automatic speaker recognition. The evaluations showed improved performance compared to [110]; however, performance was not evaluated on large-scale realistic datasets or on GAN-based manipulations.
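
The 40-D audio feature vector used in [110] (13 MFCC, 13 delta, 13 double-delta, plus frame energy) can be approximated as in the sketch below, which uses librosa. The sampling rate, frame parameters, and the RMS energy proxy are assumptions for illustration; in [110] these audio features are additionally paired with mouth-landmark features before classification.

import numpy as np
import librosa

def audio_features(wav_path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (13, T)
    d1 = librosa.feature.delta(mfcc)                          # first derivative
    d2 = librosa.feature.delta(mfcc, order=2)                 # second derivative
    energy = librosa.feature.rms(y=y)                         # (1, T) frame energy proxy
    feats = np.vstack([mfcc, d1, d2, energy])                 # (40, T)
    return feats.T                                            # one 40-D vector per frame

# The per-frame vectors can then be combined with mouth-landmark features and fed
# to an SVM, LSTM, MLP, or GMM classifier as in [110], e.g.:
# X = audio_features("clip.wav")   # shape (num_frames, 40)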
Techniques based on Deep Features: Other DL-based techniques, such as [114], proposed detection approaches that exploit inconsistencies between phoneme-viseme pairs. The authors of [114] observed that the mouth must be completely closed to pronounce specific phonemes such as M, B, or P, whereas deepfake videos often fail to reproduce this lip shape. They analyzed the performance by creating deepfakes using Audio-to-Video (A2V) [35] and Text-to-Video (T2V) [108] synthesis techniques. However, the approach fails to generalize well to samples not seen during training. Haliassos et al. [115] proposed a lip-sync deepfake detection approach, namely LipForensics,
using a spatio-temporal network. Initially, a feature extractor 3D-CNN ResNet18 and a multiscale temporal
convolutional network (MS-TCN) are trained on lip-reading dataset such as Lipreading in the Wild (LRW). Then, the
model is fine-tuned on deepfake videos using FaceForensics++ (FF++) dataset. The method also performed well over
different post-processing operations such as blur, noise, compression etc., however, the performance substantially
decreases when there is a limited mouth movement such as pauses in speech or less movement in lips in videos. Chugh
et al. [116] proposed a deepfake detection mechanism by finding a lack of synchronization between the audio and
visual channels. They computed a modality dissimilarity score (MDS) between the audio and visual modalities. A
sub-network based on 3D-ResNet architecture is used for feature computation and employed two loss functions, a
cross-entropy loss at the output layer for robust feature learning, and a contrastive loss is computed over segment-
level audiovisual features. The MDS is calculated as the total audiovisual dissonance over all segments of the video
and is used for the classification of a video as real or fake. Mittal et al. [117] proposed a siamese network architecture
for audio-visual deepfake detection. This approach compares the correlation between emotion-based differences in
facial movements and speech to distinguish between real and fake. However, this approach requires a real-fake video
pair for the training of the network and fails to classify correctly if only a few frames in the video have been
manipulated. Chintha et al. [118] proposed a framework based on XceptionNet CNN for facial feature extraction and
then passed it to a bidirectional LSTM network for the detection of temporal inconsistencies. The network is trained
via two loss functions, i.e. cross-entropy and KL-divergence to discriminate the feature distribution of real video from
that of manipulated video. Table 6 presents a comparison of handcrafted and deep learning techniques employed for
detection of lip sync-based deepfakes.
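The MDS idea of [116] can be summarized as a per-segment distance between audio and visual embeddings, trained with a contrastive objective and aggregated at test time. The following PyTorch sketch is a hedged illustration; the encoder modules are hypothetical stand-ins for the 3D-ResNet sub-networks.

# Hedged sketch of a modality dissimilarity score (MDS), in the spirit of [116].
# audio_encoder / visual_encoder are hypothetical stand-ins for the 3D-ResNet sub-networks.
import torch
import torch.nn.functional as F

def contrastive_loss(a_emb, v_emb, same_source, margin=1.0):
    # a_emb, v_emb: (num_segments, d) segment embeddings; same_source: 1 for genuine pairs, 0 for fakes
    d = F.pairwise_distance(a_emb, v_emb)                            # per-segment audio-visual distance
    loss_match = same_source * d.pow(2)                              # pull matched (real) pairs together
    loss_mismatch = (1 - same_source) * F.relu(margin - d).pow(2)    # push mismatched (fake) pairs apart
    return (loss_match + loss_mismatch).mean()

def modality_dissimilarity_score(a_emb, v_emb):
    # MDS: total audio-visual dissonance over all segments; a higher score suggests a fake video
    return F.pairwise_distance(a_emb, v_emb).sum().item()

# Usage: score = modality_dissimilarity_score(audio_encoder(a), visual_encoder(v)),
# then classify as fake if the score exceeds a validation-tuned threshold.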
Table 6: An overview of Lip sync-based Deepfake detection techniques
Author Technique Performance reported Dataset used Limitations
Handcrafted features
Korshunov et al. SVM, LSTM, MLP, GMM EER=24.74 (LSTM), VidTIMIT LSTM performs better than others
[110] 53.45 (MLP), but its performance degrades as the
56.18(SVM), 56.09(GMM) training samples decrease.
EER=33.86 (LSTM), AMI
41.21(MLP), 48.39(SVM),
47.84 (GMM)
EER=14.12 (LSTM), GRID
28.58(MLP), 30.06
(SVM), 46.81(GMM)
Agarwal et al. SVM Accuracy=99.6% Custom dataset Performance degrades for unseen
[114] samples
Deep Learning based features
Haliassos et al. 3D-ResNet18, multi-scale AUC=97.1% FF++ Performance degrades in cases
[115] temporal convolutional when there is limited lip movement
network
Mittal et al. siamese network Accuracy =84.4% DFDC Requires a real–fake video pair for
[117] architecture training.
AUC=96.3%(LQ), DF-TIMIT
94.9%(HQ)
Chintha et al. XceptionNet CNN with Accuracy=97.83% Celeb-Df Performance degrades on
[118] bidirectional LSTM compressed samples
network Accuracy=96.89% FF++
4.3 Puppet-master
Generation
Puppet-master, also known as face reenactment, is another common variation of deepfakes that manipulates the facial expressions of a person, e.g., transferring the facial gestures, eye, and head movements of a source actor to an output video [119], as shown in Fig. 6. Puppet-mastery aims to deform the person's facial movements to fabricate content. Facial reenactment has various applications, e.g. altering the facial expression and mouth movement of a participant speaking a foreign language in an online multilingual video conference, dubbing or editing an actor's head and facial expressions in film industry post-production systems, or creating photorealistic animation for movies and games.
Figure 6: A visual representation of face-reenactment based deepfake
Initially, 3D facial modeling-based approaches for facial reenactment were proposed because of their ability to
accurately capture the geometry and movement, and for improved photorealism in reenacted faces. Thies et al. [120,
121] presented the first real-time facial expressions transfer method from an actor to a target person. A commodity
RGB-D sensor was used to track and reconstruct the 3D model of a source and target actor. For each frame, the tracked
deformations of the source face were applied to the target face model, and later the altered face was blended onto the
original target face while preserving the facial appearance of the target face model. Face2Face [36] is an advanced form of the facial reenactment technique presented in [120]. This method works in real-time and is capable of altering
the facial movements of generic RGB video streams e.g., YouTube videos, using a standard webcam. The 3D model
reconstruction approach was combined with image rendering techniques to generate the output. This creates a
convincing and instantaneous re-rendering of the target actor with a relatively simple home setup. This work was
further extended to control the facial expressions of a person in a target video based on intuitive hand gestures [122]
using an inertial measurement unit [123].
Later, GANs were successfully applied to facial reenactment due to their ability to generate photo-realistic
images. Pix2pixHD [124] produces high-resolution images with better fidelity by combining multi-scale conditional
GANs (cGAN) architecture using a perceptual loss. Kim et al. [125] proposed an approach that allows the full
reanimation of portrait videos by an actor, such as changing head pose, eye gaze, and blinking, rather than just
modifying the facial expression of the target identity and thus produced photorealistic dubbing results. At first, a face
reconstruction approach was used to obtain a parametric representation of the face and illumination information from
each video frame to produce a synthetic rendering of the target identity. This representation was then fed to a render-
to-video translation network based on the cGAN to convert the synthetic rendering into photo-realistic video frames. This approach requires training on videos of the target identity. Wu et al. [126] proposed ReenactGAN, which encodes
the input facial features into a boundary latent space. A target-specific transformer was used to adapt the source
boundary space according to the specified target, and later the latent space was decoded onto the target face.
GANimation [127] employed a dual cGAN generator conditioned on emotion action units (AU) to transfer facial
expressions. The AU-based generator used an attention map to interpolate between the reenacted and original images.
Instead of relying on AU estimations, GANnotation [128] used facial landmarks along with the self-attention
mechanism for facial reenactment. This approach introduced a triple consistency loss to minimize visual artifacts but
requires the images to be synthesized with a frontal facial view for further processing. These models [89-90] require a large amount of training data for the target identity to perform well at oblique angles, and they lack the ability to generate photo-realistic reenactment for unknown identities.
Recently, few-shot or one-shot face reenactment approaches have been proposed to achieve reenactment using a few or even a single source image. In [37], X2face, a self-supervised learning model, was proposed that uses multiple modalities such as a driving frame, facial landmarks, or audio to transfer the pose and expression of a driving input to the source identity. X2face uses two encoder-decoder networks: an embedding network and a driving network. The embedding network learns a face representation from the source frame, while the driving network maps pose and expression information from the driving frame to a vector map. The driving network is designed to interpolate the face representation from the embedding network to produce the target expressions. Zakharov et al. [129] presented a meta-
transfer learning approach where the network was first trained on multiple identities and then fine-tuned on the target
identity. First, target identity encoding was obtained by averaging the target’s expressions and associated landmarks
from different frames. Then a pix2pixHD [124] GAN was used to generate the target identity using source landmarks
as input, and identity encoding via adaptive instance normalization (AdaIN) layers. This approach works well at
oblique angles and directly transfers the expression without requiring intermediate boundary latent space or
interpolation map, as in [37]. Zhang et al. [130] proposed an auto-encoder-based structure to learn the latent
representation of the target’s facial appearance and source’s face shape. These features were used as input to SPADE
residual blocks for the face reenactment task, which preserved the spatial information and concatenated the feature
map in a multi-scale manner from the face reconstruction decoder. This approach can better handle large pose changes
and exaggerated facial actions. In FaR-GAN [131], learnable features from convolution layers were used as input to
the SPADE module instead of using multi-scale landmark masks, as in [130]. Usually, few-shot learning fails to
completely preserve the source identity in the generated results for cases where there is a large pose difference between
the reference and target image. MarioNETte [46] was proposed to mitigate identity leakage by employing attention
block and target feature alignment. This helped the model to accommodate the variations between face structures
better. Finally, the identity was retained by using a novel landmark transformer, influenced by the 3DMM facial model
[132].
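Several of the above approaches inject identity or style information through adaptive instance normalization (AdaIN) layers. The snippet below is a generic PyTorch sketch of such a layer, not the implementation of any specific paper.

# Generic sketch of adaptive instance normalization (AdaIN), as used in [129, 144]
# to modulate generator feature maps with an identity/style embedding.
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    def __init__(self, style_dim, num_features):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        self.fc = nn.Linear(style_dim, num_features * 2)   # predicts per-channel scale and bias

    def forward(self, x, style):
        # x: (N, C, H, W) feature map; style: (N, style_dim) identity/style embedding
        h = self.fc(style)
        gamma, beta = h.chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(x) + beta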
Real-time face reenactment approaches such as FSGAN [64] perform both face replacement and reenactment with
occlusion handling. For reenactment, a pix2pixHD [124] generator takes the target’s image and source’s 3D facial
landmark as input and outputs a reenacted image and 3-channel (hair, face, and background) encoded segmentation
mask. The recurrent generator was trained recursively where output was iterated multiple times for incremental
interpolation from source to target landmarks. The results were further improved by applying Delaunay Triangulation
and barycentric coordinate interpolation to generate output similar to the target’s pose. This method achieves real-
time facial reenactment at 30 fps and can be applied to any face without requiring identity-specific training. Table 7 summarizes the facial expression manipulation techniques discussed above.
In the next few years, photo-realistic full-body reenactment [8] videos will also become viable, where the target's expressions and mannerisms are manipulated to create realistic deepfakes. Videos generated using the above-mentioned techniques can further be merged with fake audio to produce completely fabricated content [133]. These advances enable the real-time manipulation of facial expressions and motion in videos while making it increasingly difficult to distinguish real from synthesized video.
Table 7: An overview of face reenactment based Deepfake generation techniques
Reference Technique Features Dataset Output Limitations
Quality
Face2Face [36] 3DMM ▪ parametric model customized 1024×1024 ▪ Sensitive to facial occlusions
▪ Facial landmark
features
Kim et al. [125] cGAN parametric model of customized 1024×1024 ▪ 1-3 min. video of target
the face (261 ▪ Sensitive to facial occlusions
parameters/frame)
ReenactGAN GAN Facial landmark ▪ CelebV dataset 256×256 ▪ 30 min. video of target
[126] features ▪ WFLW Dataset ▪ Lack of gaze adaption
▪ Helen, DISFA
GANimation GAN (2 Encoder- 2 AUs ▪ EmotioNet dataset 128×128 ▪ Lack of pose and gaze adaption
[127] Decoder) ▪ RaFD dataset
GANnotation GAN Facial landmark ▪ 300- 128x128 ▪ Lack of gaze adaption
[128] features VWChallenge
dataset
▪ BP4D dataset
▪ Helen, LFPW,
AFW, IBUG, and
a subset of
multiple datasets
X2face [37] 2Encoder- ▪ Facial landmark ▪ VGG Face dataset 256×256 ▪ Wrinkle artifacts
2Decoder features ▪ VoxCeleb dataset ▪ Lack of gaze adaption
▪ 256-D audio ▪ AFLW dataset
features
Zakharov et al. GAN (1Encoder- Facial landmark VoxCeleb dataset 256×256 ▪ Sensitive to source identity
[129] 2Decoder) features leakage
▪ Lack of gaze adaption
Zhang et al. GAN (1Encoder- Appearance and ▪ VGG Face dataset 256×256 ▪ Low visual quality output
[130] 2Decoder) shape feature Map ▪ WFLW (256×256)
▪ EOTT dataset
▪ CelebA-HQ
dataset
▪ LRW dataset.
FaR-GAN [131] GAN Facial landmark and ▪ VGG Face dataset 256×256 ▪ Sensitive to source identity
Boundary features ▪ VoxCeleb1 leakage
dataset ▪ Lack of gaze adaption
MarioNETte GAN (2Encoder- Facial landmark ▪ VoxCeleb1 256×256 ▪ Fails to preserve source facial
[46] 1Decoder) features characteristics completely
FSGAN[64] GAN+RNN ▪ Facial landmarks ▪ IJB-C dataset 256×256 ▪ The identity and texture quality
▪ LFW parts label set (5500 face videos) degrade in case of large angular
▪ VGGFace2 differences
▪ CelebA ▪ Fail to fully capture facial
▪ Figaro dataset expressions
▪ blurriness in image texture
▪ limited to the resolution of
training data
Detection
Techniques based on handcrafted Features: Matern et al. [78] presented an approach for classifying forged content
by employing simple handcrafted facial features such as eye color, missing detail in the eyes and teeth, and missing reflections. These features were used to train two models, i.e. logistic regression and MLP, to
distinguish the manipulated content from the original data. This technique has a low computational cost; however, it
applies only to the visual content with open eyes or visible teeth. Amerini et al. [134] proposed an approach based on
optical flow fields to detect synthesized faces in digital videos. The optical flow fields [135] of each video frame were
computed using PWC-Net [136]. The estimated optical flow fields of frames were used to train the VGG16 and
ResNet50 to classify bonafide and fake content. This method [134] exhibits better deepfake detection performance; however, only initial results have been reported. Agarwal et al. [79] presented a user-specific technique for deepfake detection. First, GANs were used to generate all three types of deepfakes for former US president Barack Obama. Then the OpenFace2 [137] toolkit was used to estimate facial and head movements. The estimated differences between the 2D and 3D facial and head landmarks were used to train a binary SVM to classify between the original and synthesized face of Barack Obama. This technique provides good detection accuracy; however, it is vulnerable in scenarios where the person is looking off-camera.
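A simplified version of the optical-flow cue exploited in [134] can be prototyped by replacing PWC-Net with a classical flow estimator and adapting a standard CNN to two-channel flow input; the sketch below is an assumption-level illustration, not the original pipeline.

# Hedged sketch of optical-flow-based detection in the spirit of [134].
# Farneback flow (OpenCV) stands in for PWC-Net; ResNet50 is adapted to 2-channel flow input.
import cv2
import torch
import torch.nn as nn
from torchvision.models import resnet50

def flow_field(frame1, frame2):
    g1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g1, g2, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    return torch.from_numpy(flow).permute(2, 0, 1).float()   # (2, H, W): horizontal and vertical flow

model = resnet50(num_classes=2)                              # binary real/fake classifier
model.conv1 = nn.Conv2d(2, 64, kernel_size=7, stride=2, padding=3, bias=False)  # accept flow input
# The training loop (cross-entropy on real/fake labels over per-frame flow fields) is omitted for brevity.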
Techniques based on Deep Features: Several research works have focused on employing DL-based methods for
puppet-mastery deepfake detection. Sabir et al. [94] observed that, while generating manipulated content, forgers often do not impose temporal coherence in the synthesis process. So, in [94], a recurrent convolutional model was used to investigate temporal artifacts to identify synthesized faces. This technique [94] achieves better detection performance; however, results are reported for static frames only. Rossler et al. [98] employed both handcrafted (co-occurrence matrix) and learned features for detecting manipulated content. It was concluded in [98] that the detection performance of both networks, whether employing handcrafted or deep features, degrades when evaluated on compressed videos. To analyze the mesoscopic properties of manipulated content, Afchar et al. [95] proposed an approach employing two CNN variants with a small number of layers, named Meso-4 and MesoInception-4. This method manages to reduce the computational cost by downsampling the frames, but at the expense of a decrease in deepfake detection accuracy. Nguyen et al. [96] proposed a multi-task learning-based CNN to simultaneously detect and localize manipulated content in videos. An autoencoder was used for the classification of forged content, while a y-shaped decoder was applied to share the extracted information between the segmentation and reconstruction steps. This model is robust for deepfake detection; however, its accuracy degrades in unseen scenarios. To overcome the performance degradation observed in [96], Stehouwer et al. [97] proposed a Forensic Transfer (FT) based CNN approach for deepfake detection. This work [97], however, suffers from high computational cost due to a large feature space. The comparison of the handcrafted and deep feature-based face reenactment deepfake detection techniques mentioned above is presented in Table 8.
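The mesoscopic networks of [95] are deliberately shallow. The following Meso-4-style sketch conveys the idea; layer sizes are indicative only and do not reproduce the published architecture exactly.

# Meso-4-style shallow CNN sketch, in the spirit of [95]; layer sizes are indicative only.
import torch.nn as nn

class MesoStyleNet(nn.Module):
    def __init__(self):
        super().__init__()
        def block(c_in, c_out, k, pool):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, k, padding=k // 2),
                nn.BatchNorm2d(c_out), nn.ReLU(),
                nn.MaxPool2d(pool))
        self.features = nn.Sequential(
            block(3, 8, 3, 2), block(8, 8, 5, 2),
            block(8, 16, 5, 2), block(16, 16, 5, 4))
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Dropout(0.5),
            nn.Linear(16 * 8 * 8, 16), nn.LeakyReLU(0.1),
            nn.Linear(16, 1))                      # logit; sigmoid applied via BCEWithLogitsLoss

    def forward(self, x):                          # x: (N, 3, 256, 256) aligned face crops
        return self.classifier(self.features(x))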
Table 8: An overview of face reenactment based Deepfake detection techniques
Author Technique Features Best performance Evaluation Dataset Limitations
Handcrafted
Matern et al. [78] MLP, Logreg 16-D texture ▪ AUC=.823 (MLP) FF++ ▪ Only applicable to face
energy based ▪ AUC=.866 (LogReg) images with open eyes and
features of clear teeth.
eyes and
teeth [99]
Agarwal et al. SVM Classifier 16 AU’s ▪ AUC= 98% Own dataset. ▪ Degraded performance in
[79] using cases where a person is
OpenFace2 looking off-camera.
toolkit
Amerini et al. VGG16, Optical flow Accuracy= 81.61% FF++ ▪ Very few results are
[134] ResNet fields (VGG16), 75.46% reported
(ResNet)
Deep Learning
Sabir et al. [94] CNN/RNN CNN Accuracy= 94.35 % FF++ ▪ Results are reported for
features static images only.
Afchar et al. [95] MesoInception- Deep TPR= 81.3% FF++ ▪ Performance degrades on
4 features (DF) low quality videos.
Nguyen et al. CNN Deep Accuracy=92.50% FF++ ▪ Degraded detection
[96] features performance for unseen
cases.
Stehouwer et al. CNN Deep Accuracy=99.4% Diverse Fake ▪ Computationally expensive
[97] features Face Dataset due to large feature vector
(DFFD) space.
Rossle et al. [98] SVM + CNN Co- Accuracy= 86.86% FF++ ▪ Low performance on
Occurance compressed videos.
matrix + DF
4.4 Face Synthesis
Generation
Facial editing in digital images has been heavily explored for decades. It has been widely adopted in the art, animation,
and entertainment industry. However, lately, it has been exploited to create deepfakes for identity impersonation. Face
generation involves the synthesis of photorealistic images of a human face that may or may not exist in real life. The
tremendous evolution in deep generative models has made them widely adopted tools for face image synthesis and
editing. Generative deep learning models, i.e. GAN [139] and VAE [140], have been successfully used to generate
photo-realistic fake human face images. In facial synthesis, the objective is to generate non-existent but realistic-
looking faces. Face synthesis has enabled a wide range of beneficial applications, like automatic character creation
for video games and the 3D face modeling industry. AI-based face synthesis can also be used for malicious purposes, such as synthesizing photorealistic fake profile images for social network accounts to spread misinformation. Several approaches have been proposed to generate realistic-looking facial images that humans cannot reliably recognize as real or synthesized. Fig. 7 shows synthetic facial images, nearly indistinguishable from real photographs, and the improvement in their quality between 2014 and 2019. Table 9 provides a summary of works presented for the generation of entire synthetic faces.
Since the emergence of GAN [139] in 2014, significant efforts have been made to improve the quality of synthesized
images. The images generated using the first GAN model [139] were low-resolution and not very convincing. DCGAN
[141] was the first approach that introduced a deconvolution layer in the generator to replace the fully connected layer,
which achieved better performance in synthetic image generation. Liu et al. [142] proposed CoGAN, a coupled GAN framework, for learning a joint distribution of two-domain images. This model trains a pair of GANs rather than a single one, each responsible for synthesizing images in one domain. The size of the generated images still remained relatively small, e.g. 64×64 or 128×128 pixels.
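The deconvolution-based generator popularized by DCGAN [141] can be sketched in a few lines. The example below is a generic 64×64 face generator written for illustration, not the exact published model.

# Generic DCGAN-style generator sketch (64x64 output); indicative of [141], not its exact code.
import torch
import torch.nn as nn

class DCGANGenerator(nn.Module):
    def __init__(self, z_dim=100, ngf=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, ngf * 8, 4, 1, 0), nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1), nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1), nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1), nn.BatchNorm2d(ngf), nn.ReLU(True),
            nn.ConvTranspose2d(ngf, 3, 4, 2, 1), nn.Tanh())     # 3x64x64 image in [-1, 1]

    def forward(self, z):                                        # z: (N, z_dim, 1, 1) noise vector
        return self.net(z)

fake = DCGANGenerator()(torch.randn(4, 100, 1, 1))               # -> (4, 3, 64, 64) synthetic faces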
Figure 7: Progressive improvement in the quality of synthetic faces generated by variations of GANs. In order, the images are from Goodfellow et al. (2014) [139], Radford et al. (2015) [141], Liu et al. (2016) [142], Karras et al. (2017) [143], and style-based GANs (2018 [144], 2019 [145]).
The generation of high-resolution images was previously limited by memory constraints. Karras et al. [143] presented ProGAN, a training methodology for GANs that progressively increases the output resolution by adding layers to the networks during training, with a mini-batch size adapted to the current resolution. StyleGAN [144] is an improved version of ProGAN [143]. Instead of feeding the latent code z directly to the generator, a mapping network is employed that learns to map the input latent vector (Z) to an intermediate latent vector (W), which controls different visual features. The benefit is that the intermediate latent vector is not restricted to a particular distribution, which reduces the correlation between features (disentanglement). The layers of the generator network are controlled via an AdaIN operation, which determines the features at each output layer. Compared to [139,
141, 142], StyleGAN [144] achieved state-of-the-art high resolution in the generated images i.e., 1024 × 1024, with
fine details. StyleGAN2 [145] further improved the perceived image quality by removing unwanted artifacts, such as the gaze direction and teeth alignment failing to follow the facial pose. Huang et al. [146] presented a Two-Pathway
Generative Adversarial Network (TP-GAN) that could simultaneously perceive global structures and local details,
like humans, and synthesize a high-resolution frontal view facial image from a single ill-posed face image. Image
synthesis using this approach preserves the identity under large pose variations and illumination. Zhang et al. [147]
introduced a self-attention module in convolutional GANs (SAGAN) to handle global dependencies, and thus ensured
that the discriminator can accurately determine the related features in distant regions of the image. This work further
improved the semantic quality of the generated image. In [148], the authors proposed BigGAN architecture, which
uses residual networks to improve image fidelity and the variety of generated samples by increasing the batch size and
varying latent distribution. In BigGAN, the latent distribution is embedded in multiple layers of the generator to
influence features at different resolutions and levels of the hierarchy rather than just adding to the initial layer. Thus,
the generated images were photo-realistic and very close to real-world images from the ImageNet dataset. Zhang et
al. [149] proposed a stacked GAN (StackGAN) model to generate high-resolution images (e.g., 256×256) with details
based on a given textual description.
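The central idea of StyleGAN [144], mapping the input latent z to an intermediate, less entangled latent w that then modulates the generator through AdaIN, can be illustrated with a minimal mapping network; the sketch below is schematic and not the official implementation.

# Schematic sketch of a StyleGAN-style mapping network (z -> w), per the idea in [144].
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    def __init__(self, z_dim=512, w_dim=512, num_layers=8):
        super().__init__()
        layers, dim = [], z_dim
        for _ in range(num_layers):
            layers += [nn.Linear(dim, w_dim), nn.LeakyReLU(0.2)]
            dim = w_dim
        self.mlp = nn.Sequential(*layers)

    def forward(self, z):
        z = z / (z.pow(2).mean(dim=1, keepdim=True) + 1e-8).sqrt()   # pixel-norm of the input latent
        return self.mlp(z)                                            # w then modulates AdaIN layers

w = MappingNetwork()(torch.randn(4, 512))                             # -> (4, 512) intermediate latents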
Table 9: An overview of synthetic facial deepfake generation techniques
Reference Technique Features Dataset Output Quality Limitations
Liu et al. [142] CoGAN Deep Features CelebA 64×64 or 128×128 ▪ Generate low-quality samples
Karras et al. ProGAN Deep Features CelebA 1024×1024 ▪ Limited control on the generated
[143] output
Karras et al. StyleGAN Deep Features ▪ ImageNet 1024×1024 ▪ Blob-like artifacts
[145]
Huang et al. TP-GAN Deep Features ▪ LFW 256x256 ▪ Lack fine details
[146] ▪ Lack semantic consistency
Zhang et al. SAGAN Deep Features ▪ ImageNet201 128×128 ▪ Unwanted visible artifacts
[147] 2
Brock et al. BigGAN Deep Features ▪ ImageNet 512×512 ▪ Class-conditional image synthesis
[148] ▪ Class leakage
Zhang et al. StackGAN Deep Features ▪ CUB 256×256 ▪ Lack semantic consistency
[149] ▪ Oxford
▪ MS-COCO
Detection
Techniques based on handcrafted Features: A lot of literature is available on image forgery detection [150-153].
As AI-manipulated data is a new phenomenon, there are a small number of forensic techniques that work well for
deepfake detection. Recently, some researchers [70, 154] have adopted traditional image forgery identification methods to detect synthesized faces; however, these approaches are often unable to identify GAN-generated facial images. More recent ML-based techniques address this gap: McCloskey et al. [155] presented an approach to identify fake images by exploiting the fact that color information differs between real camera images and synthesized samples. Color-based features from the input samples were used to train an SVM for classification. This approach [155] exhibits good fake sample detection accuracy; however, it may not perform well on blurred images. Guarnera et al. [156] proposed a method to identify fake images in which the EM algorithm was first used to compute image features. The computed features were then used to train three types of classifiers: KNN, SVM, and LDA. The approach in [156] performs well for synthesized image identification, but may not perform well on compressed images.
Techniques based on Deep Features: Among DL-based works, Guarnera et al. [156] presented an approach to detect image manipulation in which the Expectation-Maximization (EM) technique was first applied to obtain image features, on which a naive classifier was trained to discriminate between original and fake images. This approach shows good deepfake identification accuracy; however, it is only applicable to static images. Nataraj et al. [138] proposed a method to detect forged images by calculating pixel co-occurrence matrices on the three color channels of the image. A CNN model was then trained to learn discriminative features from the co-occurrence matrices to differentiate manipulated from non-manipulated content. Yu et al. [157] presented an attribution network architecture to map an input sample to its related fingerprint image. The correlation index between each sample fingerprint and the model fingerprint acts as a softmax logit for classification. This approach [157] exhibits good detection accuracy; however, it may not perform well under post-processing operations such as noise, compression, and blurring. Marra et al. [158] proposed a study to identify GAN-generated fake images. In particular, [158] introduced a multi-task incremental learning detection approach to locate and classify new types of GAN-generated samples without affecting the detection accuracy on previous ones. Two solutions related to the position of the classifier, named Multi-Task Multi-Classifier and Multi-Task Single Classifier, were introduced by employing the iCaRL algorithm for incremental learning [159]. This approach [158] is robust to unseen GAN-generated samples but does not perform well if information on the fake content generation method is unavailable. Table 10 presents the comparison of the face synthesis deepfake detection techniques mentioned above.
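The co-occurrence representation used in [138] can be approximated with standard image-processing tooling. The sketch below (an illustration under assumptions, not the authors' code) computes one gray-level co-occurrence matrix per color channel and stacks them as a CNN input tensor.

# Hedged sketch of co-occurrence-based input tensors in the spirit of [138].
# One gray-level co-occurrence matrix (GLCM) per RGB channel is stacked as a 3-channel "image"
# that a small CNN can then classify as real or GAN-generated.
import numpy as np
from skimage.feature import graycomatrix

def cooccurrence_tensor(rgb_image):
    # rgb_image: (H, W, 3) uint8 array
    channels = []
    for c in range(3):
        glcm = graycomatrix(rgb_image[:, :, c], distances=[1], angles=[0],
                            levels=256, symmetric=False, normed=True)
        channels.append(glcm[:, :, 0, 0])                      # (256, 256) co-occurrence matrix
    return np.stack(channels, axis=0).astype(np.float32)       # (3, 256, 256) CNN input tensor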
Table 10: An overview of synthetic facial deepfake detection techniques
Author Technique Features Best performance Evaluation Dataset Limitations
Handcrafted
Guarnera et al. EM + Deep features ▪ Accuracy=99.22 CelebA Not robust to compressed
[156] (KNN, SVM, (KNN) images.
LDA) ▪ Accuracy=
99.81(SVM)
▪ Accuracy= 99.61
(LDA)
McCloskey et al. SVM Color channels ▪ AUC=70% MFC2018 Performance degrades over
[155] blurry samples.
Deep Learning
Nataraj et al [138] CNN Deep features + Accuracy = 99.49% ▪ cycleGAN ▪ Works with static images
co-occurrence only.
matrices Accuracy = 93.42% ▪ StarGAN ▪ Low performance for jpeg
compressed images.
Yu et al. [157] CNN Deep features Accuracy = 99.43% CelebA ▪ Poor performance on post-
processing operations.
Marra et al. [158] CNN + Deep features Accuracy = 99.3% Customized ▪ Needs source
Incremental manipulation technique
Learning information
4.5 Facial Attribute Manipulation
Generation
Face attribute editing involves altering the facial appearance of an existing sample by modifying the attribute-specific
region while keeping the irrelevant regions unchanged. Face attribute editing includes removing/wearing eyeglasses,
changing viewpoint, skin retouching (e.g., smoothing skin, removing scars, and minimizing wrinkles), and even some
higher-level modifications, such as age and gender, etc. Increasingly, people have been using commercially available
AI-based face editing and mobile applications such as FaceApp [3] to automatically alter the appearance of an input
image.
Recently, several GAN-based approaches have been proposed to edit facial attributes, such as the color of the skin,
hairstyle, age, and gender by adding/removing glasses and facial expression, etc., of the given face. In this
manipulation, the GAN takes the original face image as input and generates the edited face image with the given
attribute, as shown in Fig. 8. A summary of face attribute manipulation approaches is presented in Table 11. Perarnau
et al. [160] introduced the Invertible Conditional GAN (IcGAN), which uses an encoder in combination with cGANs
for face attribute editing. The encoder maps the input face image into a latent representation and an attribute manipulation vector, and the cGAN reconstructs the face image with new attributes, given the altered attribute vector as the condition. This approach suffers from information loss and alters the original face identity in the synthesized image. In [161], a Fader Network was presented, in which an encoder-decoder architecture was trained end-to-end to generate an image by disentangling the salient information of the image and the attribute values directly in the latent space. This
approach, however, adds unexpected distortion and blurriness, and thus fails to preserve the original fine details in the
generated image.
Figure 8: Examples of different face manipulations: original sample (Input) and manipulated samples
Prior studies [160, 161] focused on handling image-to-image translation between two domains. These methods required different generators to be trained independently to handle translation between each pair of image domains, thus limiting their practical usage. StarGAN [34], an enhanced approach, is capable of translating
images among multiple domains using a single generator. A conditional facial attribute transfer network was trained
via attribute classification loss and cycle consistency loss. StarGAN achieved promising visual results in terms of
attribute manipulation and expression synthesis. However, this approach adds some undesired visible artifacts in the
facial skin such as the uneven color tone in the output image. The recently proposed StarGAN-v2 [162] achieved state-
of-the-art visual quality of the generated images as compared to [34] by adding a random Gaussian noise vector into
the generator. In AttGAN [163], an encoder-decoder architecture was proposed that considers the relationship between
attributes and latent representation. Instead of imposing an attribute independent constraint on latent representation
like in [160, 161], an attribute classification constraint was applied to the generated image to guarantee the correct
change of the desired attributes. AttGAN provided improved facial attribute editing results, with other facial details
well preserved. However, the bottleneck layer i.e., down-sampling in the encoder-decoder architecture, adds unwanted
changes and blurriness and generates low-quality edited results. Liu et al. [164] proposed the STGAN model that
incorporated an attribute difference indicator and a selective transfer unit with an encoder-decoder to adaptively select
and modify the encoded features. STGAN only focuses on the attribute-specific region and does not guarantee good
preservation of the details in attribute-irrelevant regions.
Other works introduce the attention mechanism for attribute manipulation. SAGAN [165] introduced a GAN-based
attribute manipulation network to perform alteration and a global spatial attention mechanism to localize and explicitly
constrain editing within a specified region. This approach preserves the irrelevant details well but at the cost of
attribute correctness in the case of multiple attribute manipulation. PA-GAN [166] employed a progressive attention
mechanism in GAN to progressively blend the attribute features into the encoder features constrained inside a proper
attribute area by employing an attention mask from high to low feature level. As the feature level gets lower (higher
resolution), the attention mask becomes more precise and the attribute editing becomes finer. This approach successfully performs multiple attribute manipulations and preserves attribute-irrelevant regions well within a single model. However, some undesired artifacts appear in cases where significant modifications are required, such as baldness or an open mouth.
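The training objective behind multi-domain attribute editors such as StarGAN [34] combines an adversarial term, an attribute-classification term, and a cycle-consistency (reconstruction) term. The simplified sketch below is a hedged summary; G, D_adv, and D_cls are hypothetical generator and discriminator heads.

# Simplified sketch of StarGAN-style generator losses [34]; G, D_adv, D_cls are hypothetical modules.
import torch.nn.functional as F

def generator_loss(G, D_adv, D_cls, x_real, c_orig, c_target,
                   lambda_cls=1.0, lambda_rec=10.0):
    x_fake = G(x_real, c_target)                      # translate image to the target attribute vector
    adv = -D_adv(x_fake).mean()                       # adversarial term (hinge/WGAN style)
    cls = F.binary_cross_entropy_with_logits(D_cls(x_fake), c_target)  # attribute classification term
    rec = F.l1_loss(G(x_fake, c_orig), x_real)        # cycle-consistency reconstruction term
    return adv + lambda_cls * cls + lambda_rec * rec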
Table 11: An overview of facial attribute manipulation based Deepfake generation techniques
Author Technique Features Dataset Output Quality Limitations
Perarnau et IcGAN ▪ Deep Features ▪ CelebA 64×64 ▪ Fails to preserve original face
al. [160] ▪ MNIST identity
Fader Encoder-decoder ▪ Deep Features ▪ CelebA 256×256 ▪ Unwanted distortion and blurriness
Network ▪ Fails to preserve fine details
[161]
Choi et al. StarGAN ▪ Deep Features ▪ CelebA 512×512 ▪ Undesired visible artifacts in the
[162] ▪ RaFD facial skin e.g., the uneven color tone
He et al. AttGAN ▪ Deep Features ▪ CelebA 384 × 384 ▪ Generates low-quality results and
[163] ▪ LFW adds unwanted changes, blurriness
Liu et al. STGAN ▪ Deep Features ▪ CelebA 384×384 ▪ Poor performance for multiple
[164] attribute manipulation
Zhang et al. SAGAN ▪ Deep Features ▪ CelebA 256×256 ▪ Lack of details in the attribute-
[165] irrelevant region
He et al. PA-GAN ▪ Deep Features ▪ CelebA 256×256 ▪ undesired artifacts in case of
[166] baldness and open mouth etc.
Detection
Techniques based on handcrafted Features: Researchers have employed traditional ML-based approaches for the detection of facial attribute manipulation. In [167], the authors used pixel co-occurrence matrices to compute features from the suspected samples. The extracted features were used to train a CNN classifier to differentiate original from manipulated faces. The method in [167] shows good facial attribute manipulation detection accuracy; however, it may not perform well on noisy samples. An identification approach using features computed in the frequency domain, instead of raw sample pixels, was introduced in [168]. For each input sample, a 2D DFT was applied to transform the image to the frequency domain and acquire one frequency spectrum per RGB channel. For predicting real and fake samples, the work in [168] used the AutoGAN classifier. The generalization ability of [168] was evaluated on unseen GAN frameworks, specifically StarGAN [34] and GauGAN [169]. The work shows good prediction accuracy for the StarGAN model; however, in the case of GauGAN the technique suffers a serious performance drop.
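The frequency-domain pre-processing described in [168] amounts to one 2D DFT spectrum per color channel. The following minimal sketch illustrates that step under standard assumptions.

# Minimal sketch of the frequency-domain pre-processing described in [168]:
# one log-magnitude 2D DFT spectrum per RGB channel, used as classifier input.
import numpy as np

def dft_features(rgb_image):
    # rgb_image: (H, W, 3) array
    spectra = []
    for c in range(3):
        f = np.fft.fftshift(np.fft.fft2(rgb_image[:, :, c]))
        spectra.append(np.log(np.abs(f) + 1e-8))       # log-magnitude spectrum
    return np.stack(spectra, axis=0)                    # (3, H, W) frequency-domain input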
Techniques based on Deep Features: The research community has presented several methods to detect facial manipulations by analyzing the internal GAN pipeline. One such work was presented in [170], where the authors proposed that analyzing internal neuron behaviors can assist in identifying manipulated faces, as layer-by-layer neuron activation patterns capture a more representative set of image features that are significant for distinguishing original from fake faces. The proposed solution in [170], namely FakeSpotter, computed deep features by employing several DL-based face recognition frameworks, i.e. VGG-Face [171], OpenFace [172], and FaceNet [173]. The extracted features were used to train an SVM classifier to categorize fake and real faces. The work in [170] performs well for facial attribute manipulation detection; however, it may not perform well for samples with intense light variations.
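The neuron-coverage idea of [170] can be prototyped by recording layer-wise activation statistics of a pretrained face network and feeding them to an SVM. In the hedged sketch below, a torchvision ResNet stands in for the VGG-Face/OpenFace/FaceNet backbones used in the original work.

# Hedged sketch of layer-wise neuron activation features, in the spirit of FakeSpotter [170].
# A torchvision ResNet is used purely as a stand-in backbone.
import torch
import torch.nn as nn
from torchvision.models import resnet18

backbone = resnet18()
activations = []

def hook(_module, _inp, out):
    # record the fraction of activated neurons in this layer for the current image
    activations.append((out > 0).float().mean().item())

for m in backbone.modules():
    if isinstance(m, nn.ReLU):
        m.register_forward_hook(hook)

def neuron_features(face_tensor):                 # face_tensor: (1, 3, 224, 224) aligned face
    activations.clear()
    with torch.no_grad():
        backbone(face_tensor)
    return list(activations)                      # one activation-rate value per ReLU application

# The per-image feature vectors are then used to train a binary SVM, e.g. sklearn's SVC(kernel="rbf").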
Existing works on facial attribute manipulation detection either employ entire faces or face patches to spot real and manipulated content. A face patch-based technique was presented in [174], where a Restricted Boltzmann Machine (RBM) was used to compute deep features. The extracted features were then used to train a two-class SVM classifier to classify real and forged faces. The method in [174] is robust for manipulated face detection, however, at the expense of increased computational cost. A similar approach was proposed in [175], where a CNN-based feature extractor was presented. The CNN comprises 6 convolutional layers along with 2 fully connected layers. Additionally, residual connections, inspired by the ResNet framework, were introduced to compute deep features from the input samples. Finally, the calculated features were used to train an SVM classifier to predict real and manipulated faces. The approach in [175] shows good manipulation identification performance; however, results are not reported for various post-processing attacks, i.e. noise, blurring, intensity variations, and color changes. Some researchers have employed entire faces instead of face patches to detect facial attribute manipulation in visual content. One such work was presented by Tariq et al. [176], where several DL-based frameworks, i.e. VGG-16, VGG-19, ResNet, and XceptionNet, were trained on the suspected samples to locate digital facial attribute forgeries. The work in [176] shows good face attribute manipulation detection performance; however, its performance degrades in real-world scenarios. Some works use attention mechanisms to further enhance the training of attribute manipulation detection systems. Dang et al. [177] introduced a framework to locate several types of facial manipulations. This work employed attention mechanisms to enhance the feature map computation of CNN frameworks. For face attribute manipulation recognition, two different methods of attribute manipulation generation were considered: i) fake samples generated using the public FaceApp software with its various available filters, and ii) fake samples generated with the StarGAN network. The work in [177] is robust for face forgery detection, however, at the expense of increased computational cost.
Wang et al. [164] proposed a framework to detect manipulated faces. The proposed solution comprises two classification steps, namely local and global predictors. For global estimation, a Dilated Residual Network (DRN) was used to predict real and fake samples, while optical flow fields were utilized for local estimation. The approach in [164] works well for face attribute manipulation identification; however, it requires extensive training data. Similarly, the work in [158] employed a DL-based framework, namely XceptionNet, for face attribute forgery detection and showed robust performance. However, the method in [158] suffers from high computational cost. Rathgeb et al. [178] introduced a face attribute manipulation recognition framework based on Photo Response Non-Uniformity (PRNU). More precisely, scores obtained from the analysis of spatial and spectral features computed from the PRNU patterns of entire image samples were fused. The approach in [178] reliably differentiates between bonafide and retouched facial samples; however, its detection accuracy needs further improvement.
To conclude the face attribute manipulation detection section, most of the existing detection work employs DL-based approaches and shows robust performance, close to 100%, as shown in Table 12. The main reason for this high detection accuracy is the presence of GAN fingerprint information in the manipulated samples. However, researchers have recently presented approaches that remove these fingerprints from forged samples while maintaining image realism, which poses a new challenge even for high-performing attribute manipulation detection frameworks.
Table 12: An overview of facial attribute manipulation based deepfake detection techniques
Author Technique Features Best performance Evaluation Dataset Limitations
Hand-crafted
[167] co-occurrence Co-Occurrence Accuracy = 99.4% Private dataset. ▪ Its evaluation performance reduces over
matrices along matrix noisy images.
with CNN
[168] GAN Frequency Accuracy =100% Private dataset. ▪ The technique faces serious
Discriminator domain performance degradation for GauGAN
features framework-based face attribute
manipulations.
Deep Learning
[170] FakeSpoter Deep features Accuracy = 84.7% Private dataset ▪ Its detection performance degrades over
the samples with intense light
variations.
[174] RBM along Deep features Accuracy = 96.2% Private dataset ▪ This method is suffering from the high
with the SVM Accuracy = 87.1% Private dataset computational cost.
classifier (Celebrity
Retouching, ND-
IIITD
Retouching)
[175] CNN + SVM Deep features Accuracy =99.7% Private dataset ▪ Results are not reported for post-processing
attacks.
[176] CNNs Deep features AUC=74.9% Private dataset ▪ Performance degrades for real-world
scenarios.
[177] Attention Deep features AUC=99.9% DFFD ▪ This work is computationally complex.
Mechanism
along with CNN
[164] DRN Deep features Average Private dataset ▪ The approach should be evaluated over
precision=99.8% a standard dataset.
[158] Incremental Deep features Accuracy =99.3% Private dataset ▪ This work is economically inefficient.
Learning along
with the CNN
[178] Score-Level PRNU EER=13.7% Private dataset ▪ The work needs to improve the
Fusion Features classification accuracy.
4.6 Audio Deepfakes Generation
AI-synthesized audio manipulation is a type of deepfake that can clone a person's voice and depict that voice saying something outrageous that the person never said. Recent advancements in AI-synthesized algorithms for speech synthesis and voice cloning have shown the potential to produce realistic fake voices that are nearly indistinguishable from genuine speech. These algorithms can generate synthetic speech that sounds like the target speaker based on text or utterances of the target speaker, with highly convincing results [55, 179]. Synthetic voices are widely adopted in different applications, such as automated dubbing for TV and film, chatbots, AI assistants, text readers, and personalized synthetic voices for vocally handicapped people. At the same time, synthetic/fake voices have become an increasing threat to voice biometric systems [180] and can be used for malicious purposes, such as political manipulation, fake news, and fraudulent scams. More sophisticated audio manipulations can combine the power of AI with manual editing. For example, neural network-powered voice synthesis models, such as Google's Tacotron [53], WaveNet [52], or Adobe VoCo [181], can generate realistic and convincing fake voices that resemble the victim's voice as a first step. Audio editing software, e.g. Audacity [4], can then be used to combine pieces of original and synthesized audio into more convincing forgeries.
AI-based impersonation is not limited to visual content; recent advancements in AI-synthesized fake voices are
assisting the creation of highly realistic deepfakes videos [35]. These developments in speech synthesis have shown
their potential to produce realistic and natural audio deepfakes, exhibiting real threats to society [182]. Combining
synthetic audio content with visual manipulation can significantly make deepfake videos more convincing and
increase their harmful impact [35]. Despite much progress, these synthesized speeches lack some aspects of voice
quality, such as expressiveness, roughness, breathiness, stress, and emotion, specific to a target identity [183]. The AI research community is making efforts to overcome these challenges and produce human-like voice quality with high speaker similarity.
The different modalities for audio deepfakes are TTS synthesis and VC. TTS synthesis is a technology that can
synthesize the natural-sounding voice of any speaker based on the given input text [184]. VC is a technique that
modifies the audio waveform of a source speaker to a sound similar to the target speaker’s voice [185]. A VC system
takes an audio-recorded file of an individual as a source and creates a deepfake audio of the target individual. It
preserves the linguistic and phonetic characteristics of the source utterance and emphasizes the naturalness and similarity
to that of the target speaker. TTS synthesis and VC represent a genuine threat as both generate completely synthetic
computer-generated voices that are nearly indistinguishable from genuine speech. Moreover, the cloned replay attacks
[12] impose a potential risk for voice biometric devices because the latest speech synthesis techniques can produce
voice with high speaker similarity [186]. This section lists the latest progress in speech synthesis including TTS and
voice conversion techniques as well as detection techniques.
TTS Voice Synthesis
TTS is a decades-old technology that can synthesize the natural-sounding voice of a speaker from a given input text, and thus enables voice-based human-computer interaction. Initial research on TTS synthesis relied on speech concatenation or parameter estimation. Concatenative TTS systems separate high-quality recorded speech into small fragments that are then concatenated into new speech. In recent years, this method has become outdated and unpopular as it is neither scalable nor consistent. In contrast, parametric models map the text to salient speech parameters and convert them into an audio signal using vocoders. Later, deep neural networks gradually became the dominant method for speech synthesis and achieved much better voice quality. These methods, including neural vocoders [57-62], GANs [63, 64], autoencoders [65], autoregressive models [52, 53, 187], and other emerging techniques [188-192], have promoted the rapid development of the speech synthesis industry. Fig. 9 shows the principal design of modern TTS methods.
Figure 9: Workflow diagram of the latest TTS systems
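The two-stage design in Fig. 9, a text-to-spectrogram acoustic model followed by a neural vocoder that renders the waveform, can be summarized with the schematic sketch below; both modules are deliberately simplified placeholders and do not correspond to any specific published system.

# Schematic sketch of the modern two-stage TTS pipeline shown in Fig. 9.
# AcousticModel and Vocoder are simplified placeholder modules, not any specific published system.
import torch
import torch.nn as nn

class AcousticModel(nn.Module):                      # text/phonemes -> mel-spectrogram frames
    def __init__(self, vocab_size=100, d=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.encoder = nn.GRU(d, d, batch_first=True)
        self.to_mel = nn.Linear(d, n_mels)

    def forward(self, phoneme_ids):                  # (N, T_text) integer phoneme ids
        h, _ = self.encoder(self.embed(phoneme_ids))
        return self.to_mel(h)                        # (N, T_text, 80) coarse mel frames

class Vocoder(nn.Module):                            # mel-spectrogram -> raw waveform samples
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.upsample = nn.Linear(n_mels, hop)

    def forward(self, mel):
        return self.upsample(mel).flatten(1)         # (N, T_text * hop) waveform sketch

mel = AcousticModel()(torch.randint(0, 100, (1, 20)))
wav = Vocoder()(mel)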
The significant developments in voice/speech synthesis include WaveNet [52], Tacotron [53], and Deep Voice 3 [193], which can generate realistic-sounding synthetic speech from a text input to provide an enhanced interaction experience between humans and machines. Table 13 presents an overview of state-of-the-art speech synthesis methods. WaveNet [52] was developed by DeepMind in 2016 and evolved from PixelCNN [194]. WaveNet models raw audio waveforms conditioned on acoustic features, i.e. spectrograms, through a generative framework trained on actual recorded speech. Parallel WaveNet was later introduced to enhance sampling efficiency and produce high-fidelity audio signals [195]. Another DL-based system, Deep Voice 1 [54], a variant of WaveNet, replaces each module of a traditional pipeline, such as the text analysis front-end, the voice generator, and the audio synthesis module, with a corresponding NN model. Due to the independent training of each module, however, it is not a true end-to-end speech synthesis system. In 2017, Google introduced Tacotron [53], an end-to-end speech synthesis model. Tacotron can synthesize speech from given <text, audio> pairs and thus generalizes well to other datasets. Similar to WaveNet, Tacotron is a generative framework comprised of a seq2seq model that contains an encoder, an attention-based decoder, and a post-processing network. Even though the Tacotron model attains better performance, it has one potential limitation: it must employ multiple recurrent components. The inclusion of these units makes it computationally expensive, requiring high-performance systems for model training. Deep Voice 2 [196] combines the capabilities of both the Tacotron and WaveNet models for voice synthesis. Initially, Tacotron is employed to convert the input text to a linear-scale spectrogram, which is later converted to voice through the WaveNet model. In [197], Tacotron 2 was introduced for vocal synthesis; it exhibits an impressively high mean opinion score, very close to that of human speech. Tacotron 2 consists of a recurrent sequence-to-sequence feature prediction framework that maps character embeddings to mel-scale spectrograms. To deal with the time complexity of recurrent-unit-based speech synthesis models, a fully-convolutional character-to-spectrogram model named Deep Voice 3 [193] was presented. The Deep Voice 3 model is faster than its peers as it performs fully parallel computations. Deep Voice 3 comprises three main modules: i) an encoder that accepts text as input and transforms it into an internal learned representation, ii) a decoder that converts the learned representations in an autoregressive manner, and iii) a post-processing, fully convolutional network that predicts the final vocoder parameters.
Another model for voice synthesis is VoiceLoop [187], which uses a memory framework to generate speech from voices unseen during training. VoiceLoop builds a phonological store by implementing a shifting buffer as a matrix. Input text is represented as a list of phonemes that are then encoded as short vectors, and a new context vector is produced by weighting the encodings of these phonemes and summing them together. The above-mentioned powerful end-to-end speech synthesizer models [193, 197] have enabled large-scale commercial products, such as Google Cloud TTS, Amazon AWS Polly, and Baidu TTS. All these products aim to attain a high similarity between synthesized and human voices.
The latest TTS systems can convert given text to human speech with a particular voice identity. Using generative models, researchers have built voice-imitating TTS models that can clone the voice of a particular speaker in real time using only a few reference speech samples [188, 189]. The key distinction between voice cloning and generic speech synthesis systems is that the former focuses on preserving the speech attributes of a specific identity, while the latter lacks this feature and focuses on the quality of the generated speech [190]. Various AI-enabled online voice cloning platforms are available, such as Overdub1, VoiceApp2, and iSpeech3, which can produce synthesized fake voices that closely resemble target speech and give the public access to this technology. Jia et al. [188] proposed a Tacotron 2 based TTS system capable of producing multi-speaker speech, including speakers unseen during training. The framework consists of three independently trained neural networks. The findings show that although the synthetic speech resembles a target speaker's voice, it does not fully isolate the voice of the speaker from the prosody of the
audio reference. Arik et al. [55] proposed a Deep Voice 3 based technique comprised of two modules: speaker
adaptation and speaker encoding. For speaker adaptation, a multi-speaker generative framework is fine-tuned. For
speaker encoding, an independent model is trained to directly infer a new speaker embedding, which is applied to the
multi-speaker generative model.
Luong et al. [190] proposed a voice cloning framework that can synthesize a target-specific voice either from input text or from a reference raw audio waveform of a source speaker. The framework consists of separate encoders and decoders for text and speech and a neural vocoder. The model is jointly trained with linguistic latent features, and the speech generation model learns a speaker-disentangled representation. The obtained results achieve good quality and speaker similarity to the target speaker; however, it takes almost 5 minutes to generate the cloned speech. Chen et al. [191] proposed a meta-learning approach using the WaveNet model for voice adaptation with limited data. Initially, speaker adaptation is performed by fine-tuning the speaker embedding. Then a text-independent parametric approach is applied, whereby an auxiliary encoder network is trained to predict the embedding vector of new speakers. This approach performs well on clean and high-quality training data; the presence of noise degrades the speaker encoding and directly affects the quality of the synthesized speech. In [192], the authors proposed a seq2seq multi-speaker framework with domain adversarial training to produce a target speaker's voice from only a few available noisy samples. The results showed improved naturalness of the synthetic speech. However, similarity remains challenging to achieve because target accents and prosody are difficult to transfer to synthesized speech with a limited amount of low-quality speech data.
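Most systems in Table 13 condition on mel-spectrogram features. A minimal extraction sketch using librosa defaults (not the exact configuration of any particular paper) is shown below.

# Minimal mel-spectrogram extraction sketch (librosa defaults; not any paper's exact setup).
import librosa
import numpy as np

def mel_spectrogram(wav_path, sr=22050, n_mels=80, n_fft=1024, hop_length=256):
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return np.log(mel + 1e-6)                        # (n_mels, T) log-mel features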
Table 13: An overview of the state-of-the-art speech synthesis techniques
Methods Technique Features Dataset Limitations
1 https://www.descript.com/overdub
2 https://apps.apple.com/us/app/voiceapp/id1122985291
3 https://www.ispeech.org/apps
WaveNet [52] Deep neural ▪ linguistic features ▪ VCTK (44 hrs.) ▪ Computationally expensive
network ▪ fundamental
frequency (log F0)
Tacotron [53] Encoder-Decoder ▪ Deep features Private (24.6 hrs.) ▪ Costly to train the model
with RNN
Deep Voice 1[54] Deep neural ▪ linguistic features Private (20 hrs.) ▪ Independent training of each module
networks leads to a cumulative error in
synthesized speech
Deep Voice 2 [196] RNN ▪ Deep features VCTK (44 hrs.) ▪ Costly to train the model

DeepVoice3 [193] Encoder-decoder ▪ Deep features ▪ Private (20 hrs.) ▪ Does not generalize well to unseen
▪ VCTK (44 hrs.) samples.
▪ LibriSpeech ASR
(820 hrs.)
Parallel WaveNet Feed-forward neural ▪ linguistic features Private ▪ Requires a large amount of target’s
[195] network with dilated speech training data.
causal convolutions
VoiceLoop [187] Fully-connected ▪ 63-dimensional ▪ VCTK (44 hrs.) ▪ Low ecological validity
neural network audio features ▪ Private
Tacotron2[197] ▪ Encoder-decoder ▪ linguistic features ▪ Japanese speech ▪ Lack of real time speech synthesis
corpus from the
ATR Ximera
dataset (46.9 hrs.)
Arik et al. [55] Encoder- decoder ▪ Mel spectrograms ▪ LibriSpeech (820 ▪ Low performance for multi-speaker
hrs.) speech generation in the case of low-
▪ VCTK (44 hrs.) quality audio
Jia et al. [188] Encoder-decoder ▪ Mel spectrograms ▪ LibriSpeech (436 ▪ Fails to attain human-level naturalness
hrs.) ▪ Lacks in transferring the target accent,
▪ VCTK (44 hrs.) prosody to synthesized speech
Luong et al. [190] Encoder-decoder ▪ Mel spectrograms ▪ LibriSpeech (245 ▪ Low performance in the case of noisy
hrs.) audio samples
▪ VCTK (44 hrs.)
Chen et al. [191] Encoder + deep ▪ Mel spectrograms ▪ LibriSpeech (820 ▪ Low performance in the case of a low-
neural network hrs.) quality audio sample
▪ private
Cong et al. [192] Encoder-decoder ▪ Mel spectrograms ▪ MULTI-SPK ▪ Lacks in synthesizing utterances of a
▪ CHiME-4 target speaker
Voice Conversion
VC is a speech-to-speech synthesis technology that manipulates the input voice to sound like a target voice identity while the linguistic content of the source speech remains unchanged. VC has various real-life applications, including expressive voice synthesis, personalized speaking assistance for vocally impaired people, voice dubbing for the entertainment industry, and many others [185]. Recent anti-spoofing developments for automated speaker verification [180] have used VC systems for the generation of spoofing data [198-200].
In general, VC relies on high-level features of the speech such as voice timbre and prosody characteristics. Voice timbre is concerned with the spectral properties of the vocal tract during phonation, whereas prosody relates to suprasegmental characteristics, i.e. pitch, amplitude, stress, and duration. Various voice conversion challenges (VCC) have been held to encourage the development of VC techniques and improve the quality of converted speech [198-200]. The earlier VCCs aimed to convert source speech to target speech using non-parallel and parallel data [198, 199], whereas the latest [200] focused on the development of cross-lingual VC techniques, where the source speech is converted to sound like the target speech using non-parallel training data across different languages.
Earlier VC techniques are based on spectrum mapping using paired training data, where speech samples of both the source and target speaker uttering the same linguistic content are required for conversion. Methods using GMMs [201, 202], partial least squares regression [203], exemplar-based techniques [204], and others [205-207] have been proposed for parallel spectral modeling. The approaches in [201-204] are "shallow" VC methods that transform source spectral features directly in the original feature space. Nakashika et al. [205] proposed a speaker-dependent sequence modeling method based on RNNs to capture temporal correlations in an acoustic sequence. In [206, 207], deep bidirectional LSTMs (DBLSTM) are employed to capture long-range contextual information and generate high-quality converted speech. DNN-based methods [205-207] efficiently learn feature representations for feature mapping in parallel VC; however, they require large-scale paired source and target speaker utterances for training, which is not feasible for practical real-world applications.
VC methods for non-parallel (unpaired) training data have been proposed to achieve VC for multiple speakers and across different languages. Powerful VC techniques based on neural networks [208], vocoders [209, 210], GANs [211-217], and VAEs [218-220] have been introduced for non-parallel spectral modeling. Auto-encoder-based approaches attempt to disentangle speaker information from linguistic content and convert the speaker identity independently. The work in [220] investigates the quality of the learned representation by comparing different auto-encoding methods. It showed that a combination of a vector-quantized VAE and a WaveNet [52] decoder better preserves speaker-invariant linguistic content and retrieves information discarded by the encoder. However, VAE/GAN-based methods over-smooth the transformed features: because of the dimensionality-reduction bottleneck, certain low-level information, e.g., pitch contour, noise, and channel characteristics, is lost, which results in buzzy-sounding converted voices.
Recently GAN-based approaches, such as CycleGAN [211-214], VAW-GAN [215], and StarGAN [216] attempt to
achieve high-quality transformed speech using non-parallel training data. Studies [212, 216] demonstrate state-of-the-
art performance for multilingual VC in terms of both naturalness and similarity. However, performance is speaker-
dependent and degrades for unseen speakers. Neural vocoders have rapidly become the most popular vocoding
approach for speech synthesis due to their ability to generate human-like speech [193]. The vocoder learns to generate
audio waveform from acoustic features. The study [210] analyzed the performance of different vocoders and showed
that parallel-WaveGAN [221] can effectively simulate the data distribution of human speech with acoustic
characteristics for VC. However, the performance is still restricted for unseen speaker identity and noisy samples
[179]. The recent VC methods based on TTS like AttS2S-VC [222], Cotatron [223], and VTN [224] use text labels to
synthesize speech directly by extracting aligned linguistic characteristics from the input voice. This assures that the
converted speaker and the target speaker identity are the same. However, these methods necessitate the use of text
labels, which are not always readily accessible.
Recently, one-shot VC techniques [225, 226] have been presented. In contrast to earlier techniques, the data samples of the source and target speakers are not required to be seen during training, and just one utterance from the source and target speakers is required for conversion. A speaker embedding extracted from the target speech controls the speaker identity of the converted speech independently. Despite these advancements, the performance of few-shot VC techniques for unseen speakers is not stable [227]. This is primarily due to the inadequacy of a speaker embedding extracted from a single utterance of an unseen speaker [228], which significantly impacts the reliability of one-shot conversions. Other works [229-231] adopt zero-shot VC, in which the source and target speakers are unseen during training and the model is not re-trained, by employing an encoder-decoder architecture: the encoders extract style and content information into a style embedding and a content embedding, and the decoder constructs the speech sample by combining the two. The zero-shot scenario is attractive because no adaptation data or parameters are required. However, the adaptation quality is insufficient, especially when the target and source speakers are unseen, very diverse, or noisy [227]. A summary of the voice conversion techniques discussed above is presented in Table 14.
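Before turning to Table 14, a minimal sketch of the style/content disentanglement idea behind zero-shot VC is given below. The narrow content bottleneck, the separate style encoder, and the swap of embeddings at inference follow the description above; the layer sizes and module names are illustrative assumptions rather than any specific cited architecture.

```python
# Simplified sketch of zero-shot VC by style/content disentanglement.
# Layer sizes and names are illustrative assumptions, not a cited system.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    def __init__(self, n_mels=80, bottleneck=32):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 256, batch_first=True, bidirectional=True)
        self.down = nn.Linear(512, bottleneck)    # narrow bottleneck squeezes out speaker info

    def forward(self, mel):                        # (B, T, n_mels)
        h, _ = self.rnn(mel)
        return self.down(h)                        # (B, T, bottleneck) content sequence

class StyleEncoder(nn.Module):
    def __init__(self, n_mels=80, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, emb_dim, batch_first=True)

    def forward(self, mel):
        _, h = self.rnn(mel)
        return h[-1]                               # (B, emb_dim) speaker-style embedding

class Decoder(nn.Module):
    def __init__(self, bottleneck=32, emb_dim=128, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(bottleneck + emb_dim, 512, batch_first=True)
        self.out = nn.Linear(512, n_mels)

    def forward(self, content, style):
        style = style.unsqueeze(1).expand(-1, content.size(1), -1)
        h, _ = self.rnn(torch.cat([content, style], dim=-1))
        return self.out(h)                         # converted mel-spectrogram

# Inference-time conversion: content of the source, style of an unseen target.
c_enc, s_enc, dec = ContentEncoder(), StyleEncoder(), Decoder()
src_mel, tgt_mel = torch.randn(1, 200, 80), torch.randn(1, 150, 80)
converted = dec(c_enc(src_mel), s_enc(tgt_mel))    # (1, 200, 80), then fed to a vocoder
```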

Table 14: An overview of the state-of-the-art voice conversion techniques


Methods | Technique | Features | Dataset | Limitations
Ming et al. [206] | DBLSTM | F0 and energy contour | CMU-ARCTIC [232] | Requires parallel training data
Nakashika et al. [205] | Recurrent temporal restricted Boltzmann machines (RTRBMs) | MCC, F0, and aperiodicity | ATR Japanese speech database [233] | Lacks temporal dependencies of speech sequences
Sun et al. [207] | DBLSTM-RNN | MCC, F0, and aperiodicity | CMU-ARCTIC [232] | Requires parallel training data
Wu et al. [14] | DBLSTM + i-vectors | 19D-MCCs, delta and delta-delta, F0, 400-D i-vector | VCTK corpus | Computationally expensive
Liu et al. [209] | WaveNet vocoder | MCC and F0 | VCC 2018 | Performance degrades on inter-gender conversions
Kaneko et al. [213] | Encoder-decoder with GAN | 34D-MCC, F0, and aperiodicity | VCC 2018 | Computationally expensive; domain-specific voice
Kameoka et al. [216] | Encoder-decoder with GAN | 36D-MCC, F0, and aperiodicity | VCC 2018 | Performance degrades on cross-gender conversion; limited performance for unseen speakers
Zhang et al. [217] | VAW-GAN | STRAIGHT spectra [234], F0, and aperiodicity | VCC 2016 | Lacks target speaker similarity
Huang et al. [218] | Encoder-decoder | STRAIGHT spectra [234], MCCs | VCC 2018 | Lacks multi-target VC; introduces abnormal fluctuations in generated speech
Chorowski et al. [220] | VQ-VAE, WaveNet decoder | 13D-MFCC | LibriSpeech, ZeroSpeech 2017 | Over-smooth and low naturalness in generated speech; increased training complexity
Tanaka et al. [222] | BiLSTM encoder, LSTM decoder | Acoustic features | CMU ARCTIC database | Requires extensive training data
Park et al. [223] | Encoder-decoder | Mel-spectrogram | LibriTTS, VCTK dataset | Requires transcribed data; lacks target speaker similarity
Huang et al. [224] | VAE-vocoder | MCCs, log F0, and aperiodicity | CMU ARCTIC dataset, VCTK corpus | Requires parallel training data
Lu et al. [225] | Attention mechanism in encoder-decoder | 13D-MFCCs, PPGs, and log F0 | VCTK corpus | Low target similarity and naturalness in generated speech
Liu et al. [226] | Encoder and DBLSTM | 19 MFCCs, log F0, and PPG | VCTK corpus | Low target similarity and naturalness in generated speech
Chou et al. [229] | Attention mechanism in encoder-decoder | 19 MFCCs, log F0, and PPG | VCTK corpus | Low quality of converted voices in case of noisy samples
Qian et al. [230] | Encoder-decoder | Speech spectrogram | VCTK corpus | Prosody flipping between the source and the target; not well generalized to unseen data

Audio Deepfake Detection


Due to recent advances in TTS [52, 193] and VC [227] techniques, audio deepfakes have become an increasing threat to voice biometric interfaces and to society [13]. In the field of audio forensics, there are several approaches for identifying various types of audio spoofing; however, existing works fail to fully tackle the detection of synthetic speech [235]. In this section, we review the approaches proposed for the detection of audio deepfakes.
Techniques based on handcrafted features: Yi et al. [236] presented an approach to identify TTS-based manipulated audio content. In [236], hand-crafted constant-Q cepstral coefficient (CQCC) features were used to train GMM and LCNN classifiers to detect TTS-synthesized speech. The approach exhibits good detection performance for fully synthesized audio; however, performance degrades rapidly for partially synthesized audio clips. Li et al. [237]
proposed a modified ResNet model Res2Net. They evaluated the model using different acoustic features and obtained
the best performance with CQT features. This model exhibits better audio manipulation detection performance
however generalization ability needs further improvement. In [238] Mel-spectrogram features with ResNet-34 were
employed to detect spoofed speech. This approach works well however performance needs improvement. Monteiro
et al. [239] proposed an ensemble-based model for the detection of synthetic speech. Deep learning models LCNNs
and ResNets were used to compute the deep features which were later fused to differentiate between real and spoofed
speech. This model is robust for fake speech detection; however, it needs evaluation on a standard dataset. Gao et al.
[240] proposed a synthetic speech detection approach based on inconsistencies. They employed a global 2D-DCT
feature to train a residual network to detect the manipulated speech. The model has better generalization ability,
however, the performance degrades on noisy samples. Zhang et al. [241] proposed a model to detect fake speech by
using a ResNet model with a transformer encoder (TEResNet). Initially, a transformer encoder was employed to
compute contextual representations of the acoustic keypoints by considering correlation between audio signal frames.
The computed keypoints were then used to train a residual network to differentiate real and manipulated speech. This
work shows better fake audio detection performance, however, requires extensive training data. Das et al. [242]
proposed a method to detect manipulated speeches. Initially, a signal companding technique for data augmentation
was used to increase the diversity of training data. Then CQT features were computed from the obtained data which
were later used to train the LCNN classifier. The method improves the fake audio detection accuracy however requires
extensive training data.
Aljasem et al. [12] proposed a hand-crafted features-based approach to detect cloned speeches. Initially, sign-modified
acoustic local ternary pattern features were extracted from input samples. Then the computed keypoints were used to
train an asymmetric bagging-based classifier to categorize the bonafide and fake speeches. The work is robust to noisy
cloned voice replay attacks, however, performance needs further improvement. Ma et al. [243] presented a continual
learning-based technique to enhance the generalization ability of manipulated speech detection system. A knowledge
distillation loss function was introduced in the framework to enhance the learning ability of the model. The approach
is computationally efficient and can detect unseen fake spoofing manipulations, however performance is not evaluated
on noisy samples. Borrelli et al. [244] employed bicoherence features together with long-term short-term features. The
extracted features were used to train three different types of classifiers i.e., random forest, a linear SVM, and radial
basis function (RBF) SVM. The method obtains the best accuracy with the SVM classifier. However, due to
handcrafted features, this work is not generalized to unseen manipulations. In [245] bispectral analysis was performed
to identify specific and unusual spectral correlations present in GAN generated speech samples. Similarly in [246]
bispectral and Mel-cepstral analysis was performed to detect missing durable power components in synthesized
speech. The computed features were used to train several ML-based classifiers and attained the best performance using
Quadratic SVM. These approaches [245, 246] are robust to TTS synthesized audio, however may not detect high-
quality synthesized speech. Malik et al. [247] proposed a CNN for cloned speech detection. Initially, audio samples
were converted to spectrograms on which a CNN framework was used to compute deep features and classify real and
fake speech samples. This approach shows better fake audio detection accuracy however, performance degrades on
noisy samples. Chen et al. [248] proposed a DL-based framework for audio deepfakes detection. The 60-dimensional
linear filter banks (LFB) were extracted from speech samples that were later used to train a modified ResNet model.
This work improves the fake audio detection performance, however, suffers from high computational cost. Huang et
al. [249] presented an approach for audio spoofing detection. Initially, short-term zero-crossing rate and energy were
utilized to identify the silent segments from each speech signal. In the next step, the linear filter bank (LFBank) key-
points were computed from the nominated segments in the relatively high-frequency domain. Lastly, an attention-
enhanced DenseNet-BiLSTM framework was built to locate audio manipulations. This method [249] can avoid over-
fitting, however, at the expense of high computational cost. Wu et al. [250] introduced a novel feature genuinization-based light convolutional neural network (LCNN) framework for the identification of synthetic speech manipulation. A CNN-based transformer was first trained on the characteristics of genuine speech; input features were then transformed toward a distribution closer to that of genuine speech, and the transformed features were fed to an LCNN to separate genuine and altered speech. This approach [250] is robust for synthetic speech manipulation detection; it is, however, unable to deal with replay attack detection.
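Most of the handcrafted-feature pipelines reviewed above follow the same recipe: extract a time-frequency representation (e.g., CQT, CQCC, LFCC, or mel-spectrogram), summarize it into a fixed-length vector, and train a conventional classifier. The sketch below illustrates this generic recipe with librosa's CQT and a scikit-learn SVM; the file names and the simple feature summary are placeholders, and it does not reproduce any specific method cited above.

```python
# Generic handcrafted-feature baseline for synthetic-speech detection:
# constant-Q transform statistics + an RBF-kernel SVM. Paths/labels are placeholders.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def cqt_stats(path, sr=16000):
    """Log-CQT summarized by per-bin mean and std -> fixed-length feature vector."""
    y, _ = librosa.load(path, sr=sr)
    cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=256, n_bins=84))
    log_cqt = librosa.amplitude_to_db(cqt, ref=np.max)          # (84, frames)
    return np.concatenate([log_cqt.mean(axis=1), log_cqt.std(axis=1)])

# train_files / train_labels are assumed lists: 0 = bona fide, 1 = synthetic.
train_files = ["bonafide_001.wav", "spoof_001.wav"]             # placeholder paths
train_labels = [0, 1]

X = np.stack([cqt_stats(f) for f in train_files])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, train_labels)

# Score a new utterance: a positive decision value leans toward the synthetic class.
score = clf.decision_function(cqt_stats("test_utt.wav").reshape(1, -1))[0]
```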
Techniques based on Deep Features: Zhang et al. [251] proposed a DL-based approach using ResNet-18 and one-
class (OC) softmax. They trained the model to learn feature space in which real speech can be discriminated from
manipulated samples by a certain margin. This method improves generalization against unseen attacks; however, performance degrades on VC attacks generated using waveform filtering. In [252], the authors proposed a Light Convolutional Gated RNN (LCGRNN) model to compute deep features and classify real and fake speech. This model is computationally efficient; however, it does not generalize well to real-world examples. Hua et al. [253] proposed an end-to-end synthetic speech detection model, Res-TSSDNet, for the computation of deep features and classification. This model generalizes well to unseen samples, however at the expense of increased computational cost. Wang et al. [254] proposed a DNN-based approach with a layer-wise neuron activation mechanism to
differentiate between real and synthetic speech. This approach performs well for fake audio detection, however, the
framework requires evaluation on challenging datasets. Jiang et al. [255] proposed a self-supervised learning-based
approach comprising eight convolutional layers to compute the deep features and classify the original and fake
speeches. This work is computationally efficient, however detection accuracy needs enhancement.
Most of the above-mentioned fake speech detection methods have been evaluated on ASVspoof 2019 [180]; however, the recently launched ASVspoof 2021 [256] has opened a new challenge for the research community. This dataset introduces a separate speech deepfake category that includes highly compressed TTS and VC samples without speaker verification.
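Since nearly all of the detectors summarized in Table 15 report the equal error rate (EER), a short sketch of how EER is commonly computed from detector scores is given below (using scikit-learn's ROC utilities; the scores and labels are toy values).

```python
# Equal error rate (EER): the operating point where the false-acceptance rate
# equals the false-rejection rate. Scores and labels below are toy values.
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """labels: 1 = spoof class, 0 = bona fide; scores: higher = more spoof-like."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))        # point where the two error rates cross
    return (fpr[idx] + fnr[idx]) / 2

labels = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])
print(f"EER = {compute_eer(labels, scores):.3f}")
```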
Table 15: An overview of Audio deepfake detection techniques
Author | Technique | Features | Best performance | Evaluation dataset | Limitations
Hand-crafted features
Li et al. [237] | Res2Net | CQT | EER=2.502 | ASVspoof2019 | Needs generalization improvement
Yi et al. [236] | GMM / LCNN | CQCC | EER=19.22 (GMM), EER=6.99 (LCNN) | Proprietary | Performance degrades for partially synthesized audio clips
Das et al. [242] | LCNN | CQT | EER=3.13 | ASVspoof2019 | Requires extensive training data
Aljasem et al. [12] | Asymmetric bagging | Combination of MFCC, GTCC, ALTP, and spectral features | EER=5.22 | ASVspoof2019 | Performance needs further improvement
Ma et al. [243] | CNN | 60-D LFCC | EER=9.25 | ASVspoof2019 | Performance degrades on noisy samples
AlBadawy et al. [245] | Logistic regression classifier | Bispectral features | AUC=0.99 | Proprietary | Performance may degrade on high-quality speech samples
Singh et al. [246] | Quadratic SVM | Bispectral and mel-cepstral features | Acc=96.1% | Proprietary | Needs evaluation on a large-scale dataset
Gao et al. [240] | ResNet | 2D-DCT features | EER=4.03 | ASVspoof2019 | Performance degrades on noisy samples
Aravind et al. [238] | ResNet-34 | Mel-spectrogram features | EER=5.87 | ASVspoof2019 | Performance needs improvement
Monteiro et al. [239] | LCNN / ResNet | Spectral features | EER=6.38 | Proprietary | Results should be evaluated on a standard dataset
Chen et al. [248] | ResNet | 60-dimensional LFB | EER=1.81 | ASVspoof2019 | Computationally expensive approach
Huang et al. [249] | DenseNet-BiLSTM | LFBank | EER=0.53 | ASVspoof2019 | Computationally complex approach
Wu et al. [250] | LCNN | Genuine speech features | EER=4.07 | ASVspoof2019 | Cannot deal with replay attack detection
Zhang et al. [241] | TEResNet | Spectrum features | EER=5.89 (ASVspoof2019), EER=3.99 (Fake-or-Real [257]) | ASVspoof2019, Fake-or-Real dataset [257] | Requires extensive training data
Deep learning features
Zhang et al. [251] | ResNet-18 + OC-softmax | Deep features | EER=2.19 | ASVspoof2019 | Performance degrades on VC attacks
Gomez-Alanis et al. [252] | LCGRNN | Deep features | EER=6.28 | ASVspoof2019 | Fails to generalize for unseen attacks
Hua et al. [253] | Res-TSSDNet | Deep features | EER=1.64 | ASVspoof2019 | Computationally complex
Jiang et al. [255] | CNN | Deep features | EER=5.31 | ASVspoof2019 | Performance needs further improvement
Wang et al. [254] | DNN | Deep features | EER=0.021 | Fake-or-Real dataset [257] | Requires evaluation on challenging datasets

4.7 Discussion
This section provides a summary of recent significant advancements in audio-visual deepfake creation and detection
techniques.

Generation
In recent years, deepfake generation has advanced significantly. The high quality of generated images across different visual manipulation categories (face-swap, face-reenactment, lip-sync, entire face synthesis, and attribute manipulation) has made it increasingly difficult for human eyes to differentiate between fake and genuine content. Among the significant advances are: (1) unpaired self-supervised training strategies that avoid the requirement for extensive labeled training data; (2) the addition of AdaIN layers, the pix2pixHD network, self-attention modules, and feature disentanglement for improved synthesized faces; (3) one/few-shot learning strategies that enable identity theft with limited target training data; (4) the use of temporal discriminators and optical flow estimation to improve coherence in the synthesized videos; (5) the introduction of secondary networks for seamless blending of composites to reduce boundary artifacts; (6) the use of multiple loss functions to handle different tasks such as conversion, blending, occlusion, pose, and illumination for improved final output; and (7) the adoption of perceptual loss with a pre-trained VGG-Face network, which dramatically enhances synthesized facial quality. Current deepfake systems still have limitations. For instance, facial reenactment generation techniques almost always use frontal poses to drive and create the content; as a result, the reenactment is restricted to a somewhat static performance. Currently, face-swapping onto the body of a lookalike is performed to achieve facial reenactment; however, this approach has limited flexibility because a good match is not always achievable. Moreover, face reenactment depends on the driver's performance to portray the target identity's personality. Recently, there has been a trend towards identity-independent deepfake generation models. Another development is real-time deepfakes that allow swapping faces in video chats; real-time deepfakes at 30 fps have been achieved in works such as [64, 102]. The next generation of deepfakes is expected to utilize video stylization techniques to generate manipulated target content with projected expressions and mannerisms. Although existing deepfakes are not perfect, the rapid development of high-quality real/fake image datasets promotes deepfake generation research.
In recent years, the quality of synthetic voices has improved significantly through deep learning techniques. Notable improvements include voice adaptation [55, 191], one/few-shot learning [225, 226], self-attention networks [229], and cross-lingual voice transfer [212, 216]. However, producing human-like, natural-sounding utterances from limited training samples under varying conditions remains challenging [258].
Detection
In this subsection, we summarize the work performed for audiovisual deepfake detection. Based on an in-depth analysis of various detection approaches, we conclude that most existing detection work employs DL-based approaches and reports robust performance, often close to 100%. The main reason for this high detection accuracy is the presence of fingerprint information and visible artifacts in the manipulated audiovisual samples. However, researchers have now presented approaches that remove this information from the forged samples while maintaining their realism, which poses a new challenge even for high-performing attribute manipulation detection frameworks. It has been observed that most existing detection techniques perform well on face-swap detection, which is relatively easy to identify because the entire face is swapped with the target identity, usually leaving artifacts. In contrast, expression swap and lip-sync are more challenging to detect, as these manipulations tamper with the soft biometrics of the same person's identity. In the case of visual manipulation detection, most of the research work has utilized ACC and AUC for the evaluation of results, while audio deepfake detection has used the EER metric. For visual deepfakes, it has been observed that image-based manipulations are easier to detect than video-based deepfakes, while for audio manipulations VC detection is more challenging than TTS detection. For both audio and visual deepfakes, most research has used publicly available datasets instead of synthesizing its own. The existing work has reported robust performance for audiovisual deepfake detection; however, it faces a serious performance drop on unseen cases, which reflects a lack of generalization ability. Moreover, these approaches are unable to provide evidence differentiating real and manipulated content, so they lack explainability. Several deepfake detection methods have been presented in previous years; however, due to implementation complexities such as variation in datasets, configuration environments, and complicated architectures, it is difficult to implement and use them. Different software and online platforms, such as DeepFake-o-meter [259], FakeBuster [260], and Video Authenticator (not publicly available) [261], have now been introduced to ease audio-visual detection and make it accessible to a general audience. However, these platforms are at an infancy stage and need further development to handle emerging deepfakes.
We have used a figure representation to group the existing work performed for audio and visual deepfake detection
(Fig. 10). Table 16 presents the detailed description of each category. Existing approaches have either targeted spatial
and temporal artifacts left during the generation, or data-driven classification. The spatial artifacts include
inconsistencies [75, 78, 110, 245, 262-264], abnormalities in background [155, 265, 266], and GAN fingerprints [71,
157, 267]. The temporal artifacts involve detecting variation in a person’s behavior [79, 87, 268], physiological signals
[74, 75, 83, 89], coherence [269, 270], and video frame synchronization [72, 82, 94, 134]. Instead of focusing on a
specific artifact, some approaches are data-driven, which detect manipulations by classification [70, 80, 84, 86, 95-
98, 114, 118, 138, 156, 158, 250, 254, 271-274] or anomaly identification [116, 117, 275-277]. Moreover, in Fig. 10, references marked with an asterisk denote DL-based approaches employed for deepfake detection, while the others are hand-crafted methods.
Table 16: Description of classification categories for existing deepfake detection methods
Inconsistencies | Visible artifacts within the frame, such as inconsistent head poses and landmarks.
Environment | Abnormalities in the background, such as lighting and other details.
Forensics | GAN fingerprints left during the generation process.
Behavioral | Monitoring abnormal gestures and facial expressions.
Synchronization | Temporal consistency, such as inconsistencies between adjacent frames/modalities.
Physiology | Lack of biological signals, such as eye-blinking patterns and heart rate.
Coherence | Missing optical flow field and artifacts such as flickering and jitter between frames.
Classification | End-to-end CNN-based data-driven models.
Anomaly detection | Identification of outliers, e.g., by reconstructing real images and comparing them to the encoded image; used to detect unknown creation methods.
Figure 10: Categorization of deepfake detection techniques (The red color shows Face-Swap detection
approaches, purple for Face-Reenactment, Orange for lip-syncing, Blue for facial image synthesis, and
pink for facial attribute manipulation detection techniques, where * shows deep-learning based
approaches)
4.8 Open Challenges in Deepfakes Generation
Although extensive efforts have been made to improve the visual quality of generated deepfakes, there are still several challenges that need to be addressed. A few of them are discussed below.
Generalization: The generative models are data-driven, and therefore they reflect the learned features during training
in the output. To generate high-quality deepfakes a large amount of data is required for training. Moreover, the training
process itself requires hours to produce convincing deepfake audiovisual content. Usually, it is easier to obtain a
dataset of the driving content but the availability of sufficient data for a specific victim is a challenging task. Also
retraining the model for each specific target identity is computationally complex. Because of this, a generalized model
is required to enable the execution of a trained model for multiple target identities unseen during training or with few
training samples available.
Identity Leakage: The preservation of target identity is a problem when there is a significant mismatch between the
target identity and the driving identity, specifically in face reenactment tasks where target expressions are driven by
some source identity. The facial data of the driving identity is partially transferred to the generated face. This occurs
when training is performed on single or multiple identities, but data pairing is accomplished for the same identity.
Paired Training: A trained supervised model can generate high-quality output but at the expense of data pairing.
Data pairing is concerned with generating the desired output by identifying similar input examples from the training
data. This process is laborious and inapplicable to those scenarios where different facial behaviors and multiple
identities are involved in the training stage.
Pose Variations and Distance from camera: Existing deepfake techniques generate good results of the target for
frontal facial view. However, the quality of manipulated content degrades significantly for scenarios where a person
is looking off camera. This results in undesired visual artifacts around the facial region. Furthermore, another big
challenge for convincing deepfake generation is the facial distance of the target from the camera, as an increase in
distance from capturing devices results in low-quality face synthesis.
Illumination Conditions: Current deepfake generation approaches produce fake information in a controlled
environment with consistent lighting conditions. However, an abrupt change in illumination conditions such as in
indoor/outdoor scenes results in color inconsistencies and strange artifacts in the resultant videos.
Occlusions: One of the main challenges in deepfake generation is the occurrence of occlusion, which results when
the face region of the source and victim are obscured with a hand, hair, glasses, or any other items. Moreover, occlusion
can be the result of the hidden face or eye portion which eventually causes inconsistent facial features in the
manipulated content.
Temporal Coherence: Another drawback of generated deepfakes is the presence of evident artifacts like flickering
and jitter among frames. These effects occur because the deepfake generation frameworks work on each frame without
taking into account the temporal consistency. To overcome this limitation, some works either provide this context to
generator or discriminator, consider temporal coherence losses, employ RNNs, or take a combination of all these
approaches.
Lack of realism in synthetic audio: Though the quality is certainly getting much better, there is still a need for
improvement. The main challenges of audio-based deepfakes are the lack of natural emotions, pauses, breathiness,
and the pace at which the target speaks.
Based on the above-mentioned limitations we can argue that there exists a need to develop effective deepfake
generation methods that are robust to variations in illumination conditions, temporal coherence, occlusions, pose
variations, camera distance, identity leakage, and paired training.
4.9 Challenges in Deepfakes detection methods
Although remarkable advancements have been made in the performance of deepfake detectors there are numerous
concerns about current detection techniques that need attention. Some of the challenges of deepfake detection
approaches are discussed in this section.
Quality of Deepfake Datasets: The accessibility of large databases of deepfakes is an important factor in the development of deepfake detection techniques. However, analyzing the quality of videos from these datasets reveals
several ambiguities in comparison to actual manipulated content found on the internet. Different visual artifacts that
can be visualized in these databases are: i) temporal flickering in some cases during the speech, ii) blurriness around
the facial regions, iii) over smoothness in facial texture/lack of facial texture details, iv) lack of head pose movement
or rotation, v) lack of face occluding objects such as glasses, lightning effect, etc., vi) sensitive to variations in input
posture or gaze, skin color inconsistency, and identity leakage, and vii) limited availability of a combined high-quality
audio-visual deepfake dataset. The aforementioned dataset ambiguities are due to imperfect steps in the manipulation
techniques. Furthermore, manipulated content of low quality can be barely convincing or create a real impression.
Therefore, even if detection approaches exhibit better performance over such videos it is not guaranteed that these
methods will perform well when employed in the wild.
Performance Evaluation: Presently, deepfake detection methods are formulated as a binary classification problem,
where each sample can be either real or fake. Such classification is easier to build in a controlled environment, where
we generate and verify deepfake detection techniques by utilizing audio-visual content that is either original or
fabricated. However, for real-world scenarios, videos can be altered in ways other than deepfakes, so content not
detected as manipulated does not guarantee the video is an original one. Furthermore, deepfake content can be the
subject of multiple types of alteration i.e. audio/visual, and therefore a single label may not be completely accurate.
Moreover, in visual content with multiple people’s faces, usually, one or more of them are manipulated with deepfakes
over a segment of frames. Therefore, the binary classification scheme should be enhanced to multiclass/multi-label
and local classification/detection at the frame level, to cope with the challenges of real-world scenarios.
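As a concrete illustration of the multi-label formulation suggested above, the sketch below replaces the single real/fake output with independent sigmoid outputs, one per manipulation type, so a sample can be flagged for several alterations at once. The label set and feature dimensions are assumptions for illustration only.

```python
# Multi-label deepfake head: one sigmoid output per manipulation type,
# so a sample can be flagged for several alterations at once.
import torch
import torch.nn as nn

MANIPULATIONS = ["face_swap", "face_reenactment", "lip_sync", "audio_cloning"]  # assumed label set

class MultiLabelHead(nn.Module):
    def __init__(self, feat_dim=512, n_labels=len(MANIPULATIONS)):
        super().__init__()
        self.fc = nn.Linear(feat_dim, n_labels)

    def forward(self, features):                  # features from any backbone
        return self.fc(features)                  # raw logits, one per label

head = MultiLabelHead()
features = torch.randn(4, 512)                    # toy batch of backbone features
targets = torch.tensor([[1., 0., 1., 0.],         # e.g., face swap + lip-sync
                        [0., 0., 0., 0.],         # pristine sample
                        [0., 1., 0., 1.],
                        [1., 0., 0., 0.]])
loss = nn.BCEWithLogitsLoss()(head(features), targets)
probs = torch.sigmoid(head(features))             # independent per-label probabilities
```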
Lack of Explainability in Detection Methods: Existing deepfake detection approaches are typically designed to
perform batch analysis over a large dataset. However, when these techniques are employed in the field by journalists
or law enforcement, there may only be a small set of videos available for analysis. A numerical score parallel to the
probability of an audio or video being real or fake is not as valuable to the practitioners if it cannot be confirmed with
appropriate proof of the score. In those situations, it is very common to demand an explanation for the numerical score
for the analysis to be believed before publication or utilization in a court of law. Most deepfakes detection methods
lack such an explanation, however, particularly those which are based on DL approaches due to their black-box nature.
Lack of fairness and Trust: It has been observed that existing audio and visual deepfakes datasets are biased and
contain imbalanced data of different races and genders. Furthermore, the employed detection techniques can be biased
as well. Although researchers have started working in this area to fill this gap, very little work is available [278]. Hence, there is an urgent need to introduce approaches that improve data balance and fairness in detection
algorithms.
Temporal Aggregation: Existing deepfake detection methods are based on binary classification at the frame level,
i.e. checking the probability of each video frame as real or manipulated. However, these approaches do not consider
temporal consistency between frames, and suffer from two potential problems: (i) deepfake content shows temporal
artifacts, and (ii) real or fake frames could appear in sequential intervals. Furthermore, these techniques require an
extra step to compute the integrity score at the video level, as these methods need to combine the score from each
frame to generate a final value.
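A simple way to handle the aggregation step described above is sketched below: frame-level fake probabilities are smoothed over a sliding window to enforce temporal consistency, and the video-level score is taken as a high percentile rather than the mean so that short manipulated segments are not averaged away. The window size and percentile are illustrative choices, not tuned values.

```python
# Aggregating frame-level fake probabilities into a video-level score.
# Window length and percentile are illustrative, not tuned values.
import numpy as np

def video_score(frame_probs, window=15, percentile=90):
    """frame_probs: per-frame probability of being fake, in temporal order."""
    p = np.asarray(frame_probs, dtype=float)
    if len(p) >= window:                                  # moving average enforces
        kernel = np.ones(window) / window                 # temporal consistency
        p = np.convolve(p, kernel, mode="valid")
    return float(np.percentile(p, percentile))            # sensitive to short fake bursts

# Toy example: a mostly-real video with a manipulated segment in the middle.
frame_probs = [0.1] * 100 + [0.9] * 40 + [0.15] * 100
print(video_score(frame_probs))      # high score despite most frames looking real
```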
Social Media Laundering: Social platforms like Twitter, Facebook, or Instagram are the main online networks used
to spread audio-visual content among the population. To save the bandwidth of the network or to secure the user’s
privacy, such content is stripped of meta-data, down-sampled, and substantially compressed before uploading. These
manipulations, normally known as social media laundering, remove clues with respect to underlying forgeries and
eventually increase false positive detection rates. Most deepfake detection approaches employing signal level key-
points are more affected by social media laundering. One measure to increase the accuracy of deepfake identification approaches under social media laundering is to include simulations of these effects in the training data, and to extend evaluation databases to contain social-media-laundered visual content.
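One practical way to apply this recommendation is to simulate laundering during training; the sketch below re-encodes a frame with downsampling and aggressive JPEG recompression before it is passed to the detector. The quality and scale ranges are illustrative assumptions.

```python
# Simulating social-media laundering (downsampling + heavy recompression)
# as a training-time augmentation. Quality/scale ranges are illustrative.
import io
import random
from PIL import Image

def launder(frame: Image.Image) -> Image.Image:
    scale = random.uniform(0.4, 0.9)                       # random down-/re-upsampling
    w, h = frame.size
    small = frame.resize((max(1, int(w * scale)), max(1, int(h * scale))), Image.BILINEAR)
    small = small.resize((w, h), Image.BILINEAR)

    quality = random.randint(25, 60)                       # aggressive JPEG recompression
    buf = io.BytesIO()
    small.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

# Usage: apply to each training frame before feature extraction / the CNN.
frame = Image.new("RGB", (256, 256), color=(120, 90, 200))  # placeholder frame
laundered = launder(frame)
```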
DeepFake Detection Evasion: Most deepfake detection methods are concerned with locating missing information and artifacts left behind during the generation process. However, detection techniques may fail when such clues are unavailable, as attackers attempt to remove these traces during manipulation generation. Such evasion techniques fall into three types: adversarial perturbation attacks, elimination of manipulation traces in the frequency domain, and image filtering to mislead detectors. In the case of visual adversarial attacks,
different perturbations such as random cropping, noise, and JPEG compression, etc., are added to the training data,
which ultimately results in high false alarms for detection methods. Different works [279, 280] have evaluated the
performance of state-of-the-art visual deepfake detectors under the presence of adversarial attack and showed the
intense reduction in accuracy. While in the case of audio, studies such as [281, 282] showed that several adversarial
pre/post-processing operations can be used to evade spoof detection. Similarly, the second method is concerned with
improving the quality of GAN-generated samples by enhancing spectral distributions [283]. Such methods ultimately
result in removing fake traces in the frequency domain and complicates the detection process [284, 285]. The third
method uses advanced image filtering techniques to improve generation quality such as removal of fingerprints left
during generation and the addition of noise to remove fake signs [286-288]. The aforementioned methods pose a real challenge for deepfake detection methods; thus, the research community needs to propose techniques that are robust and reliable against such attacks.
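For the adversarial perturbation attacks mentioned above, a common starting point for robustness evaluation is the fast gradient sign method (FGSM). A hedged PyTorch sketch is given below, where the detector stands for any differentiable deepfake classifier; the toy model, input size, and epsilon are assumptions for illustration.

```python
# FGSM-style perturbation for stress-testing a differentiable deepfake detector.
# The toy detector, input sizes, and epsilon are illustrative assumptions.
import torch
import torch.nn as nn

def fgsm_perturb(detector, x, label, epsilon=4 / 255):
    """Returns an adversarially perturbed copy of x (pixel values in [0, 1])."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = nn.CrossEntropyLoss()(detector(x_adv), label)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + epsilon * x_adv.grad.sign()        # one-step gradient sign attack
        return x_adv.clamp(0.0, 1.0)

# Toy detector and input; in practice x would be a face crop and detector a trained CNN.
detector = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2))
x = torch.rand(1, 3, 64, 64)
label = torch.tensor([1])                                   # 1 = fake
x_adv = fgsm_perturb(detector, x, label)
flipped = detector(x_adv).argmax(1) != detector(x).argmax(1)  # did the attack flip the decision?
```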

5 Deepfake Datasets
To analyze the detection accuracy of proposed methods it is of utmost importance to have a good and representative
dataset for performance evaluation. Moreover, the techniques should be validated over cross datasets to show their
generalization power. Therefore, researchers have put in significant effort over the years by preparing the standard
datasets for manipulated visual and audio content. In this section, we have presented a detailed review of the standard
datasets that are currently used to evaluate the performance of audio and video deepfake detection techniques. Tables
17 and 18 show a comparison of available video and audio deepfake datasets respectively.
5.1 Video Datasets
UADFV: The first dataset released for deepfake detection was UADFV [71]. It consists of a total of 98 videos, where
49 are real videos collected from YouTube and manipulated by using the FakeApp application [41] to generate 49
fake videos. The average length of videos is 11.14 sec with an average resolution of 294×500 pixels. However, the
visual quality of videos is very low, and the resultant alteration is obvious and thus easy to detect.
DeepfakeTIMIT: DeepfakeTIMIT [271] is another standard dataset for deepfake detection which was introduced in
2018. This dataset consists of a total of 620 videos of 32 subjects. For each subject, there are 20 deepfake videos of
two quality levels, where 10 videos belong to DeepFake-TIMIT-LQ and the remaining 10 belong to DeepFake-TIMIT-
HQ. In DeepFake-TIMIT-LQ, the resolution of the output image is 64×64, whereas, in DeepFake-TIMIT-HQ, the
resolution of output size is 128×128. The fake content is generated by employing face swap-GAN [62], however, the
generated videos are only 4 seconds long and the dataset contains no audio channel manipulation. Moreover, the
resultant videos are often blurry and people in actual videos are mostly presented in full frontal face view with a
monochrome color background.
FaceForensics++: One of the most famous datasets for deepfake detection is FF++ [98]. This dataset was presented
in 2019 as an extended form of the FaceForensics dataset [289], which contains videos with facial expressions
manipulation only, and which was released in 2018. The FF++ dataset has four subsets named FaceSwap [290],
DeepFake [42], Face2Face [36], and NeuralTextures [291]. It contains 1000 original videos collected from the
YouTube-8M dataset [292] and 3,000 manipulated videos generated using the computer graphics and deepfake
approaches specified in [289]. This dataset is also available in two quality levels i.e. uncompressed and H264
compressed format, which can be used to evaluate the performance of deepfake detection approaches on both
compressed and uncompressed videos. The FF++ dataset does not cover lip-sync deepfakes, however, and some videos exhibit color inconsistencies around the manipulated faces.
Celeb-DF: Another popular dataset used for evaluating deepfake detection techniques is Celeb-DF [265]. This dataset
presents videos of higher quality and tries to overcome the problem of visible source artifacts found in previous
databases. The CelebDF dataset contains 408 original videos and 795 fake videos. The original content was collected
from Youtube, which is divided into two parts named Real1 and Real2 respectively. In Real1, there are a total of 158
videos of 13 subjects with different gender and skin color. Real2 comprises 250 videos, each having a different subject,
and the synthesized videos are generated from these original videos through the refinement of existing deepfake
algorithms [293, 294].
Deepfake Detection Challenge (DFDC): Recently, the Facebook community launched a challenge, aptly named the
Deepfake Detection Challenge (DFDC)-preview [295], and released a new dataset that contains 1131 original videos
and 4119 manipulated videos. The altered content is generated using two unknown techniques. The final version of
the DFDC database is publicly available on [296]. It contains 100,000 fake videos along with 19,000 original samples.
The dataset is created using various face-swap-based methods with different augmentations (i.e., geometric and color
transformations, varying frame rate, etc.) and distractors (overlaying different types of objects) in a video.
DeeperForensics (DF): Another large-scale dataset for deepfake detection, containing 50,000 original and 10,000 manipulated videos, is presented in [297]. A novel conditional autoencoder, namely DF-VAE, is used to create the manipulated videos. The dataset comprises highly diverse samples in terms of the actors' appearance. Further, a mixture of distortions and perturbations, such as compression, blur, and noise, is added to better represent real-world scenarios. Compared to previous datasets [71, 265, 271], the quality of the generated samples is significantly improved.
WildDeepfake: WildDeepfake (WDF) [298] is considered one of the most challenging deepfake detection datasets. Unlike existing datasets, it contains both real and deepfake samples collected from the internet.
All of the above-mentioned datasets contain synthesized face regions only and lack upper-/full-body deepfakes. A more comprehensive dataset is needed that also covers synthesis of the entire body of the source person.

Table 17: Comparison of Deepfakes detection datasets


Attribute | UADFV [71] | DF-TIMIT [271] | FF++ [98] | Celeb-DF [265] | DFDC-preview [296] | DF [297] | WDF [298]
Released | Nov 2018 | Dec 2018 | Jan 2019 | Nov 2019 | Oct 2019 | Jun 2020 | Oct 2020
Total videos | 98 | 620 | 4000 | 1203 | 5250 | 60,000 | 7,314
Real content | 49 | Nil | 1000 | 408 | 1131 | 50,000 | 3,805
Fake content | 49 | 620 | 3000 | 795 | 4119 | 10,000 | 3,509
Tool/technology used for fake content generation | FakeApp application [41] | faceswap-GAN [62] | deepfake, CG manipulations | deepfake | Unknown | DF-VAE | Unknown
Avg. duration | 11.4 sec | 4 sec | 18 sec | 13 sec | 30 sec | - | -
Resolution | 294×500 | 64×64 (LQ), 128×128 (HQ) | 480p, 720p, 1080p | Various | 180p-2160p | 1920×1080 | Various
Format | - | JPG | H.264, CRF=0, 23, 40 | mp4 | H.264 | mp4 | mp4
Visual quality | Low | Low | Low | High | High | High | High
Temporal flickering | Yes | Yes | Yes | Improved | Improved | Significantly improved | -
Modality | Visual | Audio/visual | Visual | Visual | Audio/visual | Visual | Visual

5.2 Audio Datasets


LJ Speech and M-AILabs datasets: LJSpeech [299] and M-AILabs [300] are well-known real-speech datasets employed in numerous TTS applications, e.g., Deep Voice 3 [193]. The LJSpeech database comprises 13,100 clips totaling 24 hours in length; all utterances are recorded by a single female speaker. The M-AILABS dataset consists of a total of 999 hours and 32 minutes of audio and was created with multiple speakers in 9 different languages.
Mozilla TTS: Mozilla, the maker of the well-known publicly available Firefox browser, released the largest open-source database of recorded human speech [301]. Initially, in 2019, the database included 1,400 hours of recorded voices in 18 different languages. It was later extended to 7,226 hours of recorded voices in 54 diverse languages. This dataset contains 5.5 million audio clips and was employed by Mozilla's DeepSpeech toolkit.
ASVspoof 2019: Another well-known dataset for fake audio detection is ASVspoof 2019 [180], which comprises two parts covering the logical access (LA) and physical access (PA) scenarios. Both LA and PA are created from the VCTK base corpus, which comprises audio clips taken from 107 speakers (46 male, 61 female). LA consists of both voice cloning and voice conversion samples, whereas PA consists of replay samples along with bona fide ones. Both parts are further divided into three subsets, named training, development, and evaluation, which contain clips from 20 (8 male, 12 female), 10 (4 male, 6 female), and 48 (21 male, 27 female) speakers, respectively. The three subsets are disjoint in terms of speakers, while the recording conditions are the same for all source samples. The training and development sets contain spoofing samples created with the same methods/conditions (labeled as known attacks), while the evaluation set contains samples with unknown attacks.
Fake-or-Real (FOR) dataset: The FOR database [257] is another dataset widely employed for synthetic voice detection. It consists of over 195,000 samples of both human and AI-synthesized speech. The database groups samples from recent TTS methods (i.e., Deep Voice 3 [193] and Google WaveNet [52]) together with diverse human speech samples (i.e., the Arctic, LJSpeech, and VoxForge datasets). The FOR database has four versions, namely for-original (FO), for-norm (FN), for-2sec (F2S), and for-rerec (FR). FO contains unbalanced voices without alterations, while FN comprises balanced, unaltered samples in terms of gender, class, volume, etc. F2S contains the data from FN with samples trimmed to 2 seconds, and FR is a re-recorded version of the F2S database that simulates a condition in which an attacker passes a sample through a voice channel (i.e., a cellphone call or a voice message).
Baidu Dataset: The Baidu Silicon Valley AI Lab cloned audio dataset is another database employed for cloned
speech detection [55]. This database is comprised of 10 ground truth speech recordings, 120 cloned samples, and 4
morphed samples.
Table 18: Comparison of audio fakes detection datasets
Attribute | LJ Speech dataset [299] | M-AILabs dataset [300] | Mozilla TTS [301] | FOR dataset [257] | Baidu dataset [55] | ASVspoof 2019 [180]
Released | 2017 | 2019 | 2019 | 2019 | 2018 | 2019
Total samples | 13,100 | - | 5.5 million | 195,000 | 120 | 122,157
Length (hrs.) | 24 | 999 hrs 32 min | 7,226 | - | 0.6 | -
Speaker accent | Native English | Native | 24% US English, 8% British English | Native | US English, British English | Native English
Languages | 1 | 9 | 54 | 1 | 1 | 1
Speaker gender | 100% female | Male, female | 47% male, 15% female | 50% male, 50% female | 50% male, 50% female | 43% male, 57% female
Format | wav | wav | mp3 | mp3 | mp3 | mp3
Tool/technology used for generation | Recorded | Recorded | Recorded | Deep Voice 3, Google WaveNet, etc. [257] | Neural voice cloning [55] | Tacotron 2 [9] and WaveNet [10]
6 Future Directions
Synthetic media is gaining a lot of attention because of its potential positive and negative impact on our society. The
competition between deepfake generation and detection will not end in the foreseeable future, although impressive
work has been presented for the generation and detection of deepfakes. There is still, however, room for improvement.
In this section, we discuss the current state of deepfakes, their limitations, and future trends.
6.1 Creation
Visual media has more influence compared to text-based disinformation. Recently, the research community has
focused more on the generation of identity agnostic models and high-quality deepfakes. A few distinguished
improvements are i) a reduction in the amount of training data due to the introduction of un-paired self-supervised
methods [302], ii) quick learning, which allows identity stealing using a single image [129, 131], iii) enhancements
in visual details [124, 145], iv) improved temporal coherence in generated videos by employing optical flow estimation
and GAN based temporal discriminators [103], v) the alleviation of visible artifacts around face boundary by adding
secondary networks for seamless blending [66], and vi) improvements in synthesized face quality by adding multiple
losses with different responsibilities, such as occlusion, creation, conversion, and blending [108]. Several approaches
have been proposed to boost the visual quality and realism of deepfake generation, however, there are a few
limitations. Most of the current synthetic media generation focuses on a frontal face pose. In facial reenactment, for
good results the face is swapped with a lookalike identity. However, it is not possible to always have the best match,
which ultimately results in identity leakage.
AI-based manipulations are not restricted to the creation of visual content only; they have also led to the generation of highly genuine-sounding audio deepfakes. The quality of audio deepfakes has significantly improved, and less training data is now required to generate realistic synthetic audio of a target speaker. The use of synthesized speech to impersonate targets can produce highly convincing deepfakes with a marked adverse impact on society.
The current audio-visual content is generated separately using multiple disconnected steps, which ultimately results
in the generation of asynchronous content. Present deepfake generation focuses on the face region only, however the
next generation of deepfakes is expected to target full body manipulations, such as a change in body pose, along with
convincing expressions. Target-specific joint audio-visual synthesis with more naturalness and realism in speech is a
new cutting-edge application of the technology in the context of persona appropriation [104, 303]. Another possible
trend is the creation of real-time deepfakes. Some researchers have already reported attaining real-time deepfakes at
30fps [64]. Such alterations will result in the generation of more believable deepfakes.
6.2 Detection
To prevent deepfakes misinformation and disinformation, some authors presented approaches to identify forensic
changes made within visual content by employing the concept of blockchain and smart contracts [304, 305]. In [305]
the authors utilized Ethereum smart contracts to locate and track the origin and history of manipulated information
and its source, even in the presence of multiple manipulation attacks. This smart contract uses InterPlanetary File System (IPFS) hashes to store videos together with their metadata. This method may perform well for deepfake identification; however, it is applicable only if the metadata of the videos exist. Thus, development and adoption of
such techniques could be useful for the newswires, however, the vast majority of content created by normal citizens
won’t be protected by such techniques.
Recent automated deepfake identification approaches typically deal with face swapping videos, and the majority of
uploaded fake videos belong in this category. Major improvements in detection algorithms include i) identification of
artifacts left during the generation process, such as inconsistencies in head pose [71], lack of eye blinking [77], color
variations in facial texture [155] and teeth alignment, ii) detection of unseen GAN generated samples, iii) spatial-
temporal features, and iv) psychological signals like heart rate [89], and an individual’s behavior patterns [79].
Although extensive work has been presented on automated detection, these methods are expected to be short-lived and require improvements on multiple fronts. The following are some of the unresolved challenges in the domain of deepfake detection.
• The existing methods are not robust to post-processing operations like compression, noisy effects, light
variations, etc. Moreover, limited work has been presented that can detect both audio and visual deepfakes.
• Recently, most of the techniques have focused on face-swap detection by exploiting its limitations, like visible
artifacts. However, with immense developments in technology, the near future will produce more sophisticated
face-swaps, such as impersonating someone, with the target having a similar face shape, personality, and
hairstyle. Aside from this, other types of deepfake, like face-reenactment and lip-synching are getting stronger
day by day.
• Existing deepfake detectors have mainly relied on the signatures of existing deepfakes by using ML techniques,
including unsupervised clustering and supervised classification methods, and therefore they are less likely to
detect unknown deepfakes. Both anomaly-based and signature-based detection methods have their own pros and
cons. For example, anomaly detection-based approaches show a high false alarm rate because they may classify
a bona fide multimedia artifact whose patterns are rare in the dataset as an anomaly. On the other hand, signature-
based approaches cannot discover unknown attacks [310]. Therefore, the hybrid approach of using both anomaly
and signature-based detection needs to be tried out to identify known and unknown attacks. Furthermore, a
collaboration with the RL method could be added to the hybrid signature and anomaly approach. More
specifically, RL can give a reward (or penalty) to the system when it selects frames of deepfakes that contain (or
do not contain) anomalies, or any signs of manipulation. Additionally, in the future, deep reinforcement active
learning approaches [313,314] could play a pivotal role in the detection of deepfakes.
• Anti-forensic or adversarial ML techniques can be employed to reduce the classification accuracy of automated
detection methods. The game theoretic approaches could be employed to mitigate the adversarial attacks on
deepfake detectors. Additionally, Reinforcement Learning (RL) and particularly deep reinforcement learning
(DRL) is extremely efficient in solving intricate cyber-defense problems. Thus, DRL could offer great potential
for not only deepfake detection but also to counter antiforensic attacks on the detectors. Since RL can model an
autonomous agent to take sequential actions optimally with limited or without prior knowledge of the
environment, thus it could be used to meet a need for developing algorithms to capture traces of anti-forensic
processing, and to design attack-aware deepfake detectors. The defense of the deepfake detector against
adversarial input could be modeled as a two-player zero-sum game with which player utilities sum to zero at
each time step. The defender here is represented by an actor-critic DRL algorithm [306].
• The current deepfake detectors face challenges, particularly due to incomplete, sparse, and noisy data in training
phases. There is a need to explore innovative AI architectures, algorithms, and approaches that “bake in” physics,
mathematics, and prior knowledge relevant to deepfakes. Embedding physics and prior knowledge using
knowledge-infused learning into AI will help to overcome the challenges of sparse data and will facilitate the
development of generative models that are causal and explanative.
• Most of the existing approaches have focused on one specific type of feature, such as landmark features.
However, as the complexity of deepfakes is increasing, it is important to fuse landmark, photoplethysmography
(PPG) and audio-based features. Likewise, it is important to evaluate the fusion of classifiers. Particularly, the
fusion of anomaly and signature-based ensemble learning will assist to improve the accuracy of deepfakes
detectors.
• Existing research on deepfakes has mainly focused on detecting manipulation in the visual content of the video.
However, audio manipulation, an integral component of deepfakes, is mostly ignored by the research
community. There exists a need to develop unified deepfake detectors that are capable of effectively detecting
both audio (i.e., TTS synthesis, voice conversion, cloned-replay) and visual forgeries (face-swap, lip-sync, and
puppet-master) simultaneously.
• Existing deepfake datasets lack the attributes (e.g., multiple visual and audio forgeries) required to evaluate
the performance of more robust deepfake detection methods. The research community has hardly explored the fact
that deepfake videos may contain not only visual forgeries but audio manipulation as well; existing datasets
ignore audio forgery and focus only on visual forgeries. In the near future, the role of voice cloning (TTS
synthesis, VC) and replay spoofing in deepfake video generation is likely to grow, and shallow audio forgeries
can easily be combined with deep audio forgeries in deepfake videos. We have already developed a voice spoofing
detection corpus [311] for single- and multi-order replay attacks, and we are currently developing a robust
voice cloning and audio-visual deepfake dataset that can be used to evaluate the performance of future
audio-visual deepfake detection methods.
• A unified method is needed to address the variations of cloned attacks, such as cloned replay. The majority of
voice spoofing detectors target either replay or cloning attacks [159-161, 196]. These two-class (genuine vs.
spoofed) countermeasures are not ready to counter multiple spoofing attacks on automatic speaker verification
(ASV) systems. A study on presentation attack detection indicated that countermeasures trained on a specific
type of spoofing attack hardly generalize to other types of spoofing attacks [312]. Moreover, no unified
countermeasure exists that can detect both replay and cloning attacks in multi-hop scenarios, where multiple
microphones and smart speakers are chained together. We addressed spoofing attack detection in multi-hop
scenarios in our prior work [10], but only for voice replay attacks. Therefore, there is an urgent need for a
unified countermeasure that can effectively detect a variety of spoofing attacks (i.e., replay, cloning, and
cloned replay) in a multi-hop scenario (a multi-class sketch appears after this list).
• With the exponential growth of smart speakers and other voice-enabled devices, ASV has become a fundamental
component of these systems. However, optimal utilization of ASV in critical domains, such as financial services
and health care, is not possible unless the threats of multiple voice spoofing attacks on ASV are countered.
This vulnerability therefore also calls for a robust and unified spoofing countermeasure.
• There is a crucial need for federated-learning-based, lightweight approaches that detect manipulation at the
source, so that an attack does not traverse a network of smart speakers or other IoT devices [9,10] (a federated
averaging sketch appears after this list).
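The sketches below illustrate several of the directions above; they are minimal, hypothetical examples under stated assumptions rather than implementations of any cited system. The first sketch shows the RL-assisted frame selection idea: the agent is rewarded when the frame it picks carries manipulation cues. The anomaly_score callable and the 0.5 threshold are placeholders for whichever anomaly or signature scorer a hybrid detector would expose.

import numpy as np

def frame_reward(frame, anomaly_score, threshold=0.5):
    # Reward +1 when the selected frame carries manipulation cues, -1 otherwise.
    return 1.0 if anomaly_score(frame) >= threshold else -1.0

def epsilon_greedy_frame_selector(frames, anomaly_score, episodes=100, eps=0.1, lr=0.1):
    # Tabular value estimate over frame indices, updated from the reward signal.
    q = np.zeros(len(frames))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        idx = int(rng.integers(len(frames))) if rng.random() < eps else int(np.argmax(q))
        q[idx] += lr * (frame_reward(frames[idx], anomaly_score) - q[idx])
    return int(np.argmax(q))  # index of the most informative frame found so far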
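The second sketch outlines the two-player zero-sum formulation for attack-aware detection: the defender's utility is the negative of the attacker's payoff at every time step, and the defender is an actor-critic learner. The state dimension, the two-action set (accept vs. flag), and the network sizes are illustrative assumptions.

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 64, 2  # actions: 0 = accept sample, 1 = flag as fake

actor = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
critic = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def defender_step(state, attacker_payoff):
    # One actor-critic update; zero-sum game: defender utility = -attacker utility.
    defender_reward = -attacker_payoff
    dist = torch.distributions.Categorical(logits=actor(state))
    action = dist.sample()
    value = critic(state).squeeze(-1)
    advantage = defender_reward - value
    loss = -(dist.log_prob(action) * advantage.detach()) + advantage.pow(2)
    opt.zero_grad(); loss.mean().backward(); opt.step()
    return action

defender_step(torch.randn(STATE_DIM), attacker_payoff=1.0)  # attacker gained 1, defender lost 1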
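The third sketch expresses knowledge-infused learning as a composite loss: a standard data term plus a penalty for violating a prior. Here the assumed prior is physiological, namely that the pulse rate estimated from a bona fide face video should lie within a plausible band; the 40-180 bpm range, the loss weight, and the assumption that the model also outputs a pulse estimate are all illustrative.

import torch
import torch.nn.functional as F

def knowledge_infused_loss(logits, labels, est_bpm, lam=0.1, low=40.0, high=180.0):
    # Standard supervised term (labels are float 0/1: bona fide vs. fake).
    data_loss = F.binary_cross_entropy_with_logits(logits, labels)
    # Prior-knowledge term: penalize pulse estimates outside the plausible band.
    prior_violation = F.relu(low - est_bpm) + F.relu(est_bpm - high)
    return data_loss + lam * prior_violation.mean()

logits = torch.randn(8)
labels = torch.randint(0, 2, (8,)).float()
est_bpm = torch.full((8,), 72.0)
loss = knowledge_infused_loss(logits, labels, est_bpm)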
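The fourth sketch combines feature-level and score-level fusion: landmark, PPG, and audio features are concatenated, an anomaly branch (an isolation forest fit on bona fide samples only) is paired with a signature branch (a supervised classifier trained on known forgeries), and their scores are averaged. The feature matrices, the choice of learners, and the equal fusion weight are assumptions made for illustration only.

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

def fit_fused_detector(landmark_f, ppg_f, audio_f, labels):
    # Feature-level fusion: one row per video; labels are 1 (fake) / 0 (bona fide).
    x = np.concatenate([landmark_f, ppg_f, audio_f], axis=1)
    anomaly = IsolationForest(random_state=0).fit(x[labels == 0])   # bona fide only
    signature = LogisticRegression(max_iter=1000).fit(x, labels)    # known forgeries
    return anomaly, signature

def predict_fused(anomaly, signature, landmark_f, ppg_f, audio_f, w=0.5):
    x = np.concatenate([landmark_f, ppg_f, audio_f], axis=1)
    # score_samples: higher = more normal, so negate and rescale to a "fakeness" score.
    a = -anomaly.score_samples(x)
    a = (a - a.min()) / (np.ptp(a) + 1e-9)
    s = signature.predict_proba(x)[:, 1]
    return w * a + (1 - w) * s                                      # score-level fusion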
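The fifth sketch is a unified audio-visual detector with separate heads for visual and audio forgeries, so a single model can flag face-swap/lip-sync/puppet-master manipulation and TTS/voice-conversion/cloned-replay audio in the same clip. The use of precomputed stream embeddings, the dimensions, and the layer sizes are illustrative assumptions.

import torch
import torch.nn as nn

class UnifiedAVDetector(nn.Module):
    def __init__(self, vis_dim=512, aud_dim=256, hidden=128):
        super().__init__()
        self.vis = nn.Sequential(nn.Linear(vis_dim, hidden), nn.ReLU())
        self.aud = nn.Sequential(nn.Linear(aud_dim, hidden), nn.ReLU())
        self.fuse = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
        self.visual_head = nn.Linear(hidden, 1)   # visual forgery logit
        self.audio_head = nn.Linear(hidden, 1)    # audio forgery logit

    def forward(self, vis_emb, aud_emb):
        z = self.fuse(torch.cat([self.vis(vis_emb), self.aud(aud_emb)], dim=-1))
        return self.visual_head(z), self.audio_head(z)

model = UnifiedAVDetector()
visual_logit, audio_logit = model(torch.randn(4, 512), torch.randn(4, 256))
# Train with two binary cross-entropy losses, one per forgery type.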
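The sixth sketch treats bona fide, replay, cloned, and cloned-replay speech as separate classes of a single countermeasure, which is one way to move beyond two-class genuine-vs-spoof detectors. The log-mel input representation, the class set, and the network sizes are assumptions, not taken from any cited system.

import torch
import torch.nn as nn

CLASSES = ["bonafide", "replay", "cloned", "cloned_replay"]

class UnifiedCountermeasure(nn.Module):
    def __init__(self, n_mels=80, n_classes=len(CLASSES)):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),              # utterance-level embedding
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, logmel):                    # (batch, n_mels, frames)
        emb = self.encoder(logmel).squeeze(-1)
        return self.classifier(emb)               # one logit per attack type

cm = UnifiedCountermeasure()
logits = cm(torch.randn(2, 80, 300))              # train with cross-entropy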
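The final sketch is plain federated averaging for lightweight, at-the-source detection: each smart speaker or IoT node fine-tunes the shared detector on its own data, and only the model weights, never the raw audio, are sent for aggregation. The per-client data loaders yielding (features, label) batches and the assumption that the detector has only floating-point parameters are simplifications.

import copy
import torch
import torch.nn as nn

def local_update(global_model, loader, epochs=1, lr=1e-3):
    # Train a private copy of the shared detector on one device's local data.
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x).squeeze(-1), y.float()).backward()
            opt.step()
    return model.state_dict()

def federated_average(global_model, client_loaders):
    # Aggregate client weights on the server; raw audio never leaves the devices.
    states = [local_update(global_model, dl) for dl in client_loaders]
    avg = copy.deepcopy(states[0])
    for key in avg:
        avg[key] = torch.stack([s[key].float() for s in states]).mean(dim=0)
    global_model.load_state_dict(avg)
    return global_model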
7 Conclusion
This survey paper presents a comprehensive review of existing deepfake generation and detection methods. Not all
digital manipulations are harmful; however, given the immense technological advancements of recent years, it is now
very easy to produce realistic fabricated content, and malicious users can exploit it to spread disinformation,
attack individuals, and cause social, psychological, religious, mental, and political stress. In the future, we
expect to see fabricated content affecting many other modalities and industries. Deepfake generation and detection
methods are locked in an arms race: improvements on one side create new challenges for the other. We have provided
a detailed analysis of existing audio and video deepfake generation and detection techniques, along with their
strengths and weaknesses, and have discussed the open challenges and future directions of both deepfake creation
and identification methods.
Acknowledgement
This material is based upon work supported by the National Science Foundation (NSF) under Grant numbers
1815724 and 1816019. Any opinions, findings, and conclusions or recommendations expressed in this material are
those of the author(s) and do not necessarily reflect the views of the NSF.
References
[1] ZAO, Available at: https://apps.apple.com/cn/app/zao/id1465199127. Accessed: September 09,
2020.
[2] Reface App, Available at: https://reface.app/. Accessed: September 11, 2020.
[3] FaceApp, Available at: https://www.faceapp.com/. Accessed: September 17, 2020.
[4] Audacity, Available at: https://www.audacityteam.org/. Accessed: September 09, 2020.
[5] Sound Forge, Available at: https://www.magix.com/gb/music/sound-forge/. Accessed: January 11,
2021.
[6] J. F. Boylan, "Will deep-fake technology destroy democracy?," The New York Times, Oct. 17,
2018.
[7] D. Harwell, "Scarlett Johansson on fake AI-generated sex videos: ‘Nothing can stop someone from
cutting and pasting my image," Washington Post, 2018.
[8] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros, "Everybody Dance Now," in Proceedings of the
IEEE International Conference on Computer Vision, 2019, pp. 5933-5942.
[9] K. M. Malik, H. Malik, and R. Baumann, "Towards vulnerability analysis of voice-driven interfaces
and countermeasures for replay attacks," in 2019 IEEE Conference on Multimedia Information
Processing and Retrieval (MIPR), 2019, pp. 523-528: IEEE.
[10] K. M. Malik, A. Javed, H. Malik, and A. Irtaza, "A light-weight replay detection framework for
voice controlled iot devices," IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 5,
pp. 982-996, 2020.
[11] A. Javed, K. M. Malik, A. Irtaza, and H. Malik, "Towards protecting cyber-physical and IoT
systems from single-and multi-order voice spoofing attacks," Applied Acoustics, vol. 183, p.
108283, 2021.
[12] M. Aljasem et al., "Secure Automatic Speaker Verification (SASV) System through sm-ALTP
Features and Asymmetric Bagging," IEEE Transactions on Information Forensics Security, 2021.
[13] D. Harwell, An artificial-intelligence first: Voice-mimicking software reportedly used in a major
theft, Available at: https://www.washingtonpost.com/technology/2019/09/04/an-artificial-
intelligence-first-voice-mimicking-software-reportedly-used-major-theft/. Accessed: July 18,
2021.
[14] L. Verdoliva, "Media forensics and deepfakes: an overview," arXiv preprint arXiv:2001.06564,
2020.
[15] R. Tolosana, R. Vera-Rodriguez, J. Fierrez, A. Morales, and J. Ortega-Garcia, "Deepfakes and
beyond: A survey of face manipulation and fake detection," arXiv preprint arXiv:2001.00179,
2020.
[16] T. T. Nguyen, C. M. Nguyen, D. T. Nguyen, D. T. Nguyen, and S. Nahavandi, "Deep Learning for
Deepfakes Creation and Detection," arXiv preprint arXiv:1909.11573, 2019.
[17] Y. Mirsky and W. Lee, "The Creation and Detection of Deepfakes: A Survey," arXiv preprint
arXiv:2004.11138, 2020.
[18] L. Oliveira, "The current state of fake news," Procedia Computer Science, vol. 121, no. C, pp. 817-
825, 2017.
[19] R. Chesney and D. Citron, "Deepfakes and the New Disinformation War: The Coming Age of Post-
Truth Geopolitics," Foreign Aff., vol. 98, p. 147, 2019.
[20] W. Phillips, This is why we can't have nice things: Mapping the relationship between online trolling
and mainstream culture. MIT Press, 2015.
[21] T. Higgin, "FCJ-159/b/lack up: What trolls can teach us about race," The Fibreculture Journal, no.
22 2013: Trolls and The Negative Space of the Internet, 2013.
[22] T. Mihaylov, G. Georgiev, and P. Nakov, "Finding opinion manipulation trolls in news community
forums," in Proceedings of the nineteenth conference on computational natural language learning,
2015, pp. 310-314.
[23] T. P. Gerber and J. Zavisca, "Does Russian propaganda work?," The Washington Quarterly, vol.
39, no. 2, pp. 79-98, 2016.
[24] P. N. Howard and B. Kollanyi, "Bots, StrongerIn, and Brexit: computational propaganda during
the UK-EU referendum," Available at SSRN 2798311, 2016.
[25] O. Varol, E. Ferrara, C. A. Davis, F. Menczer, and A. Flammini, "Online human-bot interactions:
Detection, estimation, and characterization," in Eleventh international AAAI conference on web
and social media, 2017.
[26] H. Setiaji and I. V. Paputungan, "Design of telegram bots for campus information sharing," in IOP
Conference Series: Materials Science and Engineering, 2018, vol. 325, no. 1, p. 012005: Institute
of Physics Publishing.
[27] A. Marwick and R. Lewis, "Media manipulation and disinformation online," New York: Data
Society Research Institute, 2017.
[28] C. R. Sunstein and A. Vermeule, "Conspiracy theories: Causes and cures," Journal of Political
Philosophy, vol. 17, no. 2, pp. 202-227, 2009.
[29] R. Faris, H. Roberts, B. Etling, N. Bourassa, E. Zuckerman, and Y. Benkler, "Partisanship,
propaganda, and disinformation: Online media and the 2016 US presidential election," Berkman
Klein Center Research Publication, vol. 6, 2017.
[30] A. Hussain and S. Menon, The dead professor and the vast pro-India disinformation campaign,
Available at: https://www.bbc.com/news/world-asia-india-55232432. Accessed: August 11, 2021.
[31] L. Benedictus, "Invasion of the troll armies: from Russian Trump supporters to Turkish state
stooges," The Guardian, vol. 6, p. 2016, 2016.
[32] N. A. Mhiripiri and T. Chari, Media law, ethics, and policy in the digital age. IGI Global, 2017.
[33] H. Huang, P. S. Yu, and C. Wang, "An introduction to image synthesis with generative adversarial
nets," arXiv preprint arXiv:1803.04469, 2018.
[34] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, "Stargan: Unified generative adversarial
networks for multi-domain image-to-image translation," in Proceedings of the IEEE conference on
computer vision and pattern recognition, 2018, pp. 8789-8797.
[35] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman, "Synthesizing Obama: learning lip
sync from audio," ACM Trans. Graph., vol. 36, no. 4, pp. 95:1-95:13, 2017.
[36] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner, "Face2face: Real-time face
capture and reenactment of rgb videos," in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2016, pp. 2387-2395.
[37] O. Wiles, A. Sophia Koepke, and A. Zisserman, "X2face: A network for controlling face generation
using images, audio, and pose codes," in Proceedings of the European Conference on Computer
Vision (ECCV), 2018, pp. 670-686.
[38] B. Paris and J. Donovan, "Deepfakes and Cheap Fakes," United States of America: Data & Society,
2019.
[39] C. Bregler, M. Covell, and M. Slaney, "Video rewrite: Driving visual speech with audio," in
Proceedings of the 24th annual conference on Computer graphics and interactive techniques, 1997,
pp. 353-360.
[40] J. Vincent, New AI deepfake app creates nude images of women, Available at:
https://www.theverge.com/2019/6/27/18760896/deepfake-nude-ai-app-women-deepnude-non-
consensual-pornography. Accessed: September 18, 2020.
[41] FakeApp 2.2.0, Available at: https://www.malavida.com/en/soft/fakeapp/. Accessed: September
18, 2020.
[42] Faceswap: Deepfakes software for all, Available at: https://github.com/deepfakes/faceswap.
Accessed: September 08, 2020.
[43] DeepFaceLab, Available at: https://github.com/iperov/DeepFaceLab. Accessed: August 18, 2020.
[44] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe, "First order motion model for image
animation," in Advances in Neural Information Processing Systems, 2019, pp. 7137-7147.
[45] H. Kim et al., "Deep video portraits," ACM Trans. Graph., vol. 37, no. 4, pp. 163:1-163:14, 2018.
[46] S. Ha, M. Kersner, B. Kim, S. Seo, and D. Kim, "Marionette: Few-shot face reenactment preserving
identity of unseen targets," in Proceedings of the AAAI Conference on Artificial Intelligence, 2020,
vol. 34, no. 07, pp. 10893-10900.
[47] Y. Wang, P. Bilinski, F. Bremond, and A. Dantcheva, "ImaGINator: Conditional Spatio-Temporal
GAN for Video Generation," in The IEEE Winter Conference on Applications of Computer Vision,
2020, pp. 1160-1169.
[48] M. Westerlund, "The emergence of deepfake technology: A review," Technology Innovation
Management Review, vol. 9, no. 11, 2019.
[49] M. Borak, Chinese government-run facial recognition system hacked by tax fraudsters, Available
at: https://www.scmp.com/tech/tech-trends/article/3127645/chinese-government-run-facial-
recognition-system-hacked-tax. Accessed: July 26, 2021.
[50] S. Greengard, "Will deepfakes do deep damage?," ed: ACM New York, NY, USA, 2019.
[51] A. Hern, 'I don't want to upset people': Tom Cruise deepfake creator speaks out, Available at:
https://www.theguardian.com/technology/2021/mar/05/how-started-tom-cruise-deepfake-tiktok-
videos. Accessed: July 22, 2021.
[52] A. v. d. Oord et al., "Wavenet: A generative model for raw audio," arXiv preprint
arXiv:1609.03499, 2016.
[53] Y. Wang et al., "Tacotron: Towards end-to-end speech synthesis," arXiv preprint
arXiv:1703.10135, 2017.
[54] S. O. Arik et al., "Deep voice: Real-time neural text-to-speech," arXiv preprint arXiv:1702.07825,
2017.
[55] S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, "Neural voice cloning with a few samples," in
Advances in Neural Information Processing Systems, 2018, pp. 10019-10029.
[56] Y. Nirkin, I. Masi, A. T. Tuan, T. Hassner, and G. Medioni, "On face segmentation, face swapping,
and face perception," in 2018 13th IEEE International Conference on Automatic Face & Gesture
Recognition (FG 2018), 2018, pp. 98-105: IEEE.
[57] D. Bitouk, N. Kumar, S. Dhillon, P. Belhumeur, and S. K. Nayar, "Face swapping: automatically
replacing faces in photographs," in ACM Transactions on Graphics (TOG), 2008, vol. 27, no. 3, p.
39: ACM.
[58] Y. Lin, Q. Lin, F. Tang, and S. Wang, "Face replacement with large-pose differences," in
Proceedings of the 20th ACM international conference on Multimedia, 2012, pp. 1249-1250: ACM.
[59] B. M. Smith and L. Zhang, "Joint face alignment with non-parametric shape models," in European
Conference on Computer Vision, 2012, pp. 43-56: Springer.
[60] DFaker, Available at: https://github.com/dfaker/df. Accessed: September 08, 2020.
[61] DeepFake-tf: Deepfake based on tensorflow, Available at:
https://github.com/StromWine/DeepFake-tf. Accessed: September 08, 2020.
[62] Faceswap-GAN, Available at: https://github.com/shaoanlu/faceswap-GAN. Accessed: September
18, 2020.
[63] I. Korshunova, W. Shi, J. Dambre, and L. Theis, "Fast face-swap using convolutional neural
networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp.
3677-3685.
[64] Y. Nirkin, Y. Keller, and T. Hassner, "FSGAN: Subject Agnostic Face Swapping and
Reenactment," in Proceedings of the IEEE International Conference on Computer Vision, 2019,
pp. 7184-7193.
[65] R. Natsume, T. Yatagawa, and S. Morishima, "Rsgan: face swapping and editing using face and
hair representation in latent spaces," arXiv preprint arXiv:1804.03447, 2018.
[66] R. Natsume, T. Yatagawa, and S. Morishima, "Fsnet: An identity-aware generative model for
image-based face swapping," in Asian Conference on Computer Vision, 2018, pp. 117-132:
Springer.
[67] L. Li, J. Bao, H. Yang, D. Chen, and F. Wen, "Faceshifter: Towards high fidelity and occlusion
aware face swapping," arXiv preprint arXiv:1912.13457, 2019.
[68] I. Petrov et al., "DeepFaceLab: A simple, flexible and extensible face swapping framework," arXiv
preprint arXiv:2005.05535, 2020.
[69] D. Chen, Q. Chen, J. Wu, X. Yu, and T. Jia, "Face Swapping: Realistic Image Synthesis Based on
Facial Landmarks Alignment," Mathematical Problems in Engineering, vol. 2019, 2019.
[70] Y. Zhang, L. Zheng, and V. L. Thing, "Automated face swapping and its detection," in 2017 IEEE
2nd International Conference on Signal and Image Processing (ICSIP), 2017, pp. 15-19: IEEE.
[71] X. Yang, Y. Li, and S. Lyu, "Exposing deep fakes using inconsistent head poses," in ICASSP 2019-
2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019,
pp. 8261-8265: IEEE.
[72] D. Güera, S. Baireddy, P. Bestagini, S. Tubaro, and E. J. Delp, "We Need No Pixels: Video
Manipulation Detection Using Stream Descriptors," arXiv preprint arXiv:1906.08743, 2019.
[73] K. Jack, "Chapter 13-MPEG-2," Video Demystified: A Handbook for the Digital Engineer, pp. 577-
737.
[74] U. A. Ciftci and I. Demir, "FakeCatcher: Detection of Synthetic Portrait Videos using Biological
Signals," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[75] T. Jung, S. Kim, and K. Kim, "DeepVision: Deepfakes Detection Using Human Eye Blinking
Pattern," IEEE Access, vol. 8, pp. 83144-83154, 2020.
[76] R. Ranjan, V. M. Patel, and R. Chellappa, "Hyperface: A deep multi-task learning framework for
face detection, landmark localization, pose estimation, and gender recognition," IEEE Transactions
on Pattern Analysis Machine Intelligence, vol. 41, no. 1, pp. 121-135, 2017.
[77] T. Soukupova and J. Cech, "Eye blink detection using facial landmarks," in 21st computer vision
winter workshop, Rimske Toplice, Slovenia, 2016.
[78] F. Matern, C. Riess, and M. Stamminger, "Exploiting visual artifacts to expose deepfakes and face
manipulations," in 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW),
2019, pp. 83-92: IEEE.
[79] S. Agarwal, H. Farid, Y. Gu, M. He, K. Nagano, and H. Li, "Protecting World Leaders Against
Deep Fakes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, 2019, pp. 38-45.
[80] Y. Li and S. Lyu, "Exposing deepfake videos by detecting face warping artifacts," arXiv preprint
arXiv:1811.00656, vol. 2, 2018.
[81] D. E. King, "Dlib-ml: A machine learning toolkit," The Journal of Machine Learning Research,
vol. 10, pp. 1755-1758, 2009.
[82] D. Güera and E. J. Delp, "Deepfake video detection using recurrent neural networks," in 2018 15th
IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2018,
pp. 1-6: IEEE.
[83] Y. Li, M.-C. Chang, and S. Lyu, "In ictu oculi: Exposing ai generated fake face videos by detecting
eye blinking," arXiv preprint arXiv:1806.02877, 2018.
[84] D. M. Montserrat et al., "Deepfakes Detection with Automatic Face Weighting," in Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp.
668-669.
[85] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multitask
cascaded convolutional networks," IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499-1503,
2016.
[86] O. de Lima, S. Franklin, S. Basu, B. Karwoski, and A. George, "Deepfake Detection using
Spatiotemporal Convolutional Networks," arXiv preprint arXiv:.14749, 2020.
[87] S. Agarwal, T. El-Gaaly, H. Farid, and S.-N. Lim, "Detecting Deep-Fake Videos from Appearance
and Behavior," in 2020 IEEE International Workshop on Information Forensics and Security
(WIFS), 2020, pp. 1-6: IEEE.
[88] O. Wiles, A. Koepke, and A. Zisserman, "Self-supervised learning of a facial attribute embedding
from video," arXiv preprint arXiv:.06882, 2018.
[89] S. Fernandes et al., "Predicting Heart Rate Variations of Deepfake Videos using Neural ODE," in
Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019, pp. 0-0.
[90] D. J. Rezende, S. Mohamed, and D. Wierstra, "Stochastic backpropagation and approximate
inference in deep generative models," arXiv preprint arXiv:. 2014.
[91] H. Rahman, M. U. Ahmed, S. Begum, and P. Funk, "Real time heart rate monitoring from facial
RGB color video using webcam," in The 29th Annual Workshop of the Swedish Artificial
Intelligence Society (SAIS), 2–3 June 2016, Malmö, Sweden, 2016, no. 129: Linköping University
Electronic Press.
[92] H.-Y. Wu, M. Rubinstein, E. Shih, J. Guttag, F. Durand, and W. Freeman, "Eulerian video
magnification for revealing subtle changes in the world," ACM transactions on graphics, vol. 31,
no. 4, pp. 1-8, 2012.
[93] R. T. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, "Neural ordinary differential
equations," in Advances in neural information processing systems, 2018, pp. 6571-6583.
[94] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, and P. Natarajan, "Recurrent
Convolutional Strategies for Face Manipulation Detection in Videos," Interfaces (GUI), vol. 3, p.
1, 2019.
[95] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, "Mesonet: a compact facial video forgery
detection network," in 2018 IEEE International Workshop on Information Forensics and Security
(WIFS), 2018, pp. 1-7: IEEE.
[96] H. H. Nguyen, F. Fang, J. Yamagishi, and I. Echizen, "Multi-task learning for detecting and
segmenting manipulated facial images and videos," in 2019 IEEE 10th International Conference
on Biometrics Theory, Applications and Systems (BTAS), 2019, pp. 1-8.
[97] D. Cozzolino, J. Thies, A. Rössler, C. Riess, M. Nießner, and L. Verdoliva, "Forensictransfer:
Weakly-supervised domain adaptation for forgery detection," arXiv preprint arXiv:1812.02510,
2018.
[98] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, "Faceforensics++:
Learning to detect manipulated facial images," in Proceedings of the IEEE International
Conference on Computer Vision, 2019, pp. 1-11.
[99] K. I. Laws, "Textured image segmentation," University of Southern California, Los Angeles, Image
Processing Institute, 1980.
[100] B. Fan, L. Wang, F. K. Soong, and L. Xie, "Photo-real talking head with deep bidirectional LSTM,"
in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
2015, pp. 4884-4888: IEEE.
[101] J. Charles, D. Magee, and D. Hogg, "Virtual immortality: Reanimating characters from tv shows,"
in European Conference on Computer Vision, 2016, pp. 879-886: Springer.
[102] A. Jamaludin, J. S. Chung, and A. Zisserman, "You said that?: Synthesising talking faces from
audio," International Journal of Computer Vision, pp. 1-13, 2019.
[103] K. Vougioukas, S. Petridis, and M. Pantic, "End-to-End Speech-Driven Realistic Facial Animation
with Temporal GANs," in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition Workshops, 2019, pp. 37-40.
[104] H. Zhou, Y. Liu, Z. Liu, P. Luo, and X. Wang, "Talking face generation by adversarially
disentangled audio-visual representation," in Proceedings of the AAAI Conference on Artificial
Intelligence, 2019, vol. 33, pp. 9299-9306.
[105] P. Garrido et al., "Vdub: Modifying face video of actors for plausible visual alignment to a dubbed
audio track," in Computer graphics forum, 2015, vol. 34, no. 2, pp. 193-204: Wiley Online Library.
[106] P. KR, R. Mukhopadhyay, J. Philip, A. Jha, V. Namboodiri, and C. Jawahar, "Towards automatic
face-to-face translation," in Proceedings of the 27th ACM International Conference on Multimedia,
2019, pp. 1428-1436.
[107] K. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, and C. Jawahar, "A Lip Sync Expert Is All You
Need for Speech to Lip Generation In The Wild," in Proceedings of the 28th ACM International
Conference on Multimedia, 2020, pp. 484-492.
[108] O. Fried et al., "Text-based editing of talking-head video," ACM Transactions on Graphics (TOG),
vol. 38, no. 4, pp. 1-14, 2019.
[109] B.-H. Kim and V. Ganapathi, "LumiereNet: Lecture Video Synthesis from Audio," arXiv preprint
arXiv:1907.02253, 2019.
[110] P. Korshunov and S. Marcel, "Speaker inconsistency detection in tampered video," in 2018 26th
European Signal Processing Conference (EUSIPCO), 2018, pp. 2375-2379: IEEE.
[111] C. Sanderson, "The VidTIMIT database," IDIAP, 2002.
[112] A. Anand, R. D. Labati, A. Genovese, E. Muñoz, V. Piuri, and F. Scotti, "Age estimation based on
face images and pre-trained convolutional neural networks," in 2017 IEEE Symposium Series on
Computational Intelligence (SSCI), 2017, pp. 1-7: IEEE.
[113] E. Boutellaa, Z. Boulkenafet, J. Komulainen, and A. Hadid, "Audiovisual synchrony assessment
for replay attack detection in talking face biometrics," Multimedia Tools Applications, vol. 75, no.
9, pp. 5329-5343, 2016.
[114] S. Agarwal, H. Farid, O. Fried, and M. Agrawala, "Detecting deep-fake videos from phoneme-
viseme mismatches," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition Workshops, 2020, pp. 660-661.
[115] A. Haliassos, K. Vougioukas, S. Petridis, and M. Pantic, "Lips Don't Lie: A Generalisable and
Robust Approach to Face Forgery Detection," arXiv preprint arXiv:.07657, 2020.
[116] K. Chugh, P. Gupta, A. Dhall, and R. Subramanian, "Not made for each other-Audio-Visual
Dissonance-based Deepfake Detection and Localization," in Proceedings of the 28th ACM
International Conference on Multimedia, 2020, pp. 439-447.
[117] T. Mittal, U. Bhattacharya, R. Chandra, A. Bera, and D. Manocha, "Emotions Don't Lie: An Audio-
Visual Deepfake Detection Method using Affective Cues," in Proceedings of the 28th ACM
International Conference on Multimedia, 2020, pp. 2823-2832.
[118] A. Chintha et al., "Recurrent convolutional structures for audio spoof and video deepfake
detection," IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 5, pp. 1024-1037,
2020.
[119] J. Thies, M. Zollhöfer, C. Theobalt, M. Stamminger, and M. Nießner, "Real-time reenactment of
human portrait videos," ACM Trans. Graph., vol. 37, no. 4, pp. 164:1-164:13, 2018.
[120] J. Thies, M. Zollhöfer, M. Nießner, L. Valgaerts, M. Stamminger, and C. Theobalt, "Real-time
expression transfer for facial reenactment," ACM Trans. Graph., vol. 34, no. 6, pp. 183:1-183:14,
2015.
[121] M. Zollhöfer et al., "Real-time non-rigid reconstruction using an RGB-D camera," ACM
Transactions on Graphics (ToG), vol. 33, no. 4, pp. 1-12, 2014.
[122] J. Thies, M. Zollhöfer, and M. Nießner, "IMU2Face: Real-time Gesture-driven Facial
Reenactment," arXiv preprint arXiv:1801.01446, 2017.
[123] J. Thies, M. Zollhöfer, C. Theobalt, M. Stamminger, and M. Nießner, "Headon: Real-time
reenactment of human portrait videos," ACM Transactions on Graphics (TOG), vol. 37, no. 4, pp.
1-13, 2018.
[124] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, "High-resolution image
synthesis and semantic manipulation with conditional gans," in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2018, pp. 8798-8807.
[125] H. Kim et al., "Deep video portraits," ACM Transactions on Graphics (TOG), vol. 37, no. 4, p.
163, 2018.
[126] W. Wu, Y. Zhang, C. Li, C. Qian, and C. Change Loy, "Reenactgan: Learning to reenact faces via
boundary transfer," in Proceedings of the European Conference on Computer Vision (ECCV),
2018, pp. 603-619.
[127] A. Pumarola, A. Agudo, A. M. Martínez, A. Sanfeliu, and F. Moreno-Noguer, "GANimation:
Anatomically-Aware Facial Animation from a Single Image," in Proceedings of the European
Conference on Computer Vision (ECCV), 2018, pp. 818-833.
[128] E. Sanchez and M. Valstar, "Triple consistency loss for pairing distributions in GAN-based face
synthesis," arXiv preprint arXiv:1811.03492, 2018.
[129] E. Zakharov, A. Shysheya, E. Burkov, and V. Lempitsky, "Few-shot adversarial learning of
realistic neural talking head models," in Proceedings of the IEEE International Conference on
Computer Vision, 2019, pp. 9459-9468.
[130] Y. Zhang, S. Zhang, Y. He, C. Li, C. C. Loy, and Z. Liu, "One-shot face reenactment," arXiv
preprint arXiv:1908.03251, 2019.
[131] H. Hao, S. Baireddy, A. R. Reibman, and E. J. Delp, "FaR-GAN for One-Shot Face Reenactment,"
arXiv preprint arXiv:2005.06402, 2020.
[132] V. Blanz and T. Vetter, "A morphable model for the synthesis of 3D faces," in Proceedings of the
26th annual conference on Computer graphics and interactive techniques, 1999, pp. 187-194.
[133] J. Lorenzo-Trueba et al., "The voice conversion challenge 2018: Promoting development of parallel
and nonparallel methods," arXiv preprint arXiv:1804.04262, 2018.
[134] I. Amerini, L. Galteri, R. Caldelli, and A. Del Bimbo, "Deepfake Video Detection through Optical
Flow based CNN," in Proceedings of the IEEE International Conference on Computer Vision
Workshops, 2019, pp. 0-0.
[135] L. Alparone, M. Barni, F. Bartolini, and R. Caldelli, "Regularization of optic flow estimates by
means of weighted vector median filtering," IEEE transactions on image processing, vol. 8, no. 10,
pp. 1462-1467, 1999.
[136] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, "Pwc-net: Cnns for optical flow using pyramid, warping,
and cost volume," in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2018, pp. 8934-8943.
[137] T. Baltrušaitis, P. Robinson, and L.-P. Morency, "Openface: an open source facial behavior analysis
toolkit," in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), 2016, pp.
1-10: IEEE.
[138] L. Nataraj et al., "Detecting GAN generated fake images using co-occurrence matrices," arXiv
preprint arXiv:1903.06836, 2019.
[139] I. Goodfellow et al., "Generative adversarial nets," in Advances in neural information processing
systems, 2014, pp. 2672-2680.
[140] D. P. Kingma and M. Welling, "Auto-encoding variational bayes," arXiv preprint
arXiv:1312.6114, 2013.
[141] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep
convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.
[142] M.-Y. Liu and O. Tuzel, "Coupled generative adversarial networks," in Advances in neural
information processing systems, 2016, pp. 469-477.
[143] T. Karras, T. Aila, S. Laine, and J. Lehtinen, "Progressive growing of gans for improved quality,
stability, and variation," arXiv preprint arXiv:1710.10196, 2017.
[144] T. Karras, S. Laine, and T. Aila, "A style-based generator architecture for generative adversarial
networks," in Proceedings of the IEEE conference on computer vision and pattern recognition,
2019, pp. 4401-4410.
[145] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, "Analyzing and improving the
image quality of stylegan," in Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 2020, pp. 8110-8119.
[146] R. Huang, S. Zhang, T. Li, and R. He, "Beyond face rotation: Global and local perception gan for
photorealistic and identity preserving frontal view synthesis," in Proceedings of the IEEE
International Conference on Computer Vision, 2017, pp. 2439-2448.
[147] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, "Self-attention generative adversarial
networks," in International Conference on Machine Learning, 2019, pp. 7354-7363: PMLR.
[148] A. Brock, J. Donahue, and K. Simonyan, "Large scale gan training for high fidelity natural image
synthesis," arXiv preprint arXiv:1809.11096, 2018.
[149] H. Zhang et al., "Stackgan: Text to photo-realistic image synthesis with stacked generative
adversarial networks," in Proceedings of the IEEE international conference on computer vision,
2017, pp. 5907-5915.
[150] J.-L. Zhong, C.-M. Pun, and Y.-F. Gan, "Dense Moment Feature Index and Best Match Algorithms
for Video Copy-Move Forgery Detection," Information Sciences, 2020.
[151] X. Ding, Y. Huang, Y. Li, and J. He, "Forgery detection of motion compensation interpolated
frames based on discontinuity of optical flow," Multimedia Tools Applications, pp. 1-26, 2020.
[152] P. Niyishaka and C. Bhagvati, "Copy-move forgery detection using image blobs and BRISK
feature," Multimedia Tools Applications, pp. 1-15, 2020.
[153] M. Abdel-Basset, G. Manogaran, A. E. Fakhry, and I. El-Henawy, "2-Levels of clustering strategy
to detect and locate copy-move forgery in digital images," Multimedia Tools Applications, vol. 79,
no. 7, pp. 5419-5437, 2020.
[154] Z. Akhtar and D. Dasgupta, "A comparative evaluation of local feature descriptors for deepfakes
detection," in 2019 IEEE International Symposium on Technologies for Homeland Security (HST),
2019, pp. 1-5: IEEE.
[155] S. McCloskey and M. Albright, "Detecting gan-generated imagery using color cues," arXiv
preprint arXiv:.08247, 2018.
[156] L. Guarnera, O. Giudice, and S. Battiato, "DeepFake Detection by Analyzing Convolutional
Traces," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Workshops, 2020, pp. 666-667.
[157] N. Yu, L. S. Davis, and M. Fritz, "Attributing fake images to GANs: Learning and analyzing GAN
fingerprints," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp.
7556-7566.
[158] F. Marra, C. Saltori, G. Boato, and L. Verdoliva, "Incremental learning for the detection and
classification of GAN-generated images," in 2019 IEEE International Workshop on Information
Forensics and Security (WIFS), 2019, pp. 1-6: IEEE.
[159] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, "ICARL: Incremental Classifier and
Representation Learning," in Proceedings of the IEEE conference on Computer Vision and Pattern
Recognition, 2017, pp. 2001-2010.
[160] G. Perarnau, J. Van De Weijer, B. Raducanu, and J. M. Álvarez, "Invertible conditional gans for
image editing," arXiv preprint arXiv:1611.06355, 2016.
[161] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, and M. A. Ranzato, "Fader networks:
Manipulating images by sliding attributes," in Advances in neural information processing systems,
2017, pp. 5967-5976.
[162] Y. Choi, Y. Uh, J. Yoo, and J.-W. Ha, "Stargan v2: Diverse image synthesis for multiple domains,"
in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020,
pp. 8188-8197.
[163] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, "Attgan: Facial attribute editing by only changing
what you want," IEEE Transactions on Image Processing, vol. 28, no. 11, pp. 5464-5478, 2019.
[164] M. Liu et al., "Stgan: A unified selective transfer network for arbitrary image attribute editing," in
Proceedings of the IEEE conference on computer vision and pattern recognition, 2019, pp. 3673-
3682.
[165] G. Zhang, M. Kan, S. Shan, and X. Chen, "Generative adversarial network with spatial attention
for face attribute editing," in Proceedings of the European conference on computer vision (ECCV),
2018, pp. 417-432.
[166] Z. He, M. Kan, J. Zhang, and S. Shan, "PA-GAN: Progressive Attention Generative Adversarial
Network for Facial Attribute Editing," arXiv preprint arXiv:2007.05892, 2020.
[167] L. Nataraj et al., "Detecting GAN generated fake images using co-occurrence matrices," Electronic
Imaging, vol. 2019, no. 5, pp. 532-1-532-7, 2019.
[168] X. Zhang, S. Karaman, and S.-F. Chang, "Detecting and simulating artifacts in gan fake images,"
in 2019 IEEE International Workshop on Information Forensics and Security (WIFS), 2019, pp. 1-
6: IEEE.
[169] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional
adversarial networks," in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2017, pp. 1125-1134.
[170] R. Wang et al., "Fakespotter: A simple yet robust baseline for spotting ai-synthesized fake faces,"
arXiv preprint arXiv:.06122, 2019.
[171] O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," 2015.
[172] B. Amos, B. Ludwiczuk, and M. Satyanarayanan, "Openface: A general-purpose face recognition
library with mobile applications," CMU School of Computer Science, vol. 6, no. 2, 2016.
[173] F. Schroff, D. Kalenichenko, and J. Philbin, "Facenet: A unified embedding for face recognition
and clustering," in Proceedings of the IEEE conference on computer vision and pattern recognition,
2015, pp. 815-823.
[174] A. Bharati, R. Singh, M. Vatsa, and K. W. Bowyer, "Detecting facial retouching using supervised
deep learning," IEEE Transactions on Information Forensics Security, vol. 11, no. 9, pp. 1903-
1913, 2016.
[175] A. Jain, R. Singh, and M. Vatsa, "On detecting gans and retouching based synthetic alterations," in
2018 IEEE 9th International Conference on Biometrics Theory, Applications and Systems (BTAS),
2018, pp. 1-7: IEEE.
[176] S. Tariq, S. Lee, H. Kim, Y. Shin, and S. S. Woo, "Detecting both machine and human created fake
face images in the wild," in Proceedings of the 2nd international workshop on multimedia privacy
and security, 2018, pp. 81-87.
[177] H. Dang, F. Liu, J. Stehouwer, X. Liu, and A. K. Jain, "On the detection of digital face
manipulation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
recognition, 2020, pp. 5781-5790.
[178] C. Rathgeb et al., "PRNU-based detection of facial retouching," IET Biometrics, vol. 9, no. 4, pp.
154-164, 2020.
[179] J. Lorenzo-Trueba, F. Fang, X. Wang, I. Echizen, J. Yamagishi, and T. Kinnunen, "Can we steal
your vocal identity from the Internet?: Initial investigation of cloning Obama's voice using GAN,
WaveNet and low-quality found data," arXiv preprint arXiv:1803.00860, 2018.
[180] X. Wang et al., "ASVspoof 2019: a large-scale public database of synthetized, converted and
replayed speech," Computer Speech & Language, p. 101114, 2020.
[181] Z. Jin, G. J. Mysore, S. Diverdi, J. Lu, and A. Finkelstein, "Voco: Text-based insertion and
replacement in audio narration," ACM Transactions on Graphics (TOG), vol. 36, no. 4, pp. 1-13,
2017.
[182] J. Damiani, A Voice Deepfake Was Used To Scam A CEO Out Of $243,000, Available at:
https://www.forbes.com/sites/jessedamiani/2019/09/03/a-voice-deepfake-was-used-to-scam-a-
ceo-out-of-243000/. Accessed: September 6, 2020.
[183] A. Leung, NVIDIA Reveals That Part of Its CEO's Keynote Presentation Was Deepfaked, Available
at: https://hypebeast.com/2021/8/nvidia-deepfake-jensen-huang-omniverse-keynote-video.
Accessed: August 29, 2021.
[184] J. Sotelo et al., "Char2wav: End-to-end speech synthesis," 2017.
[185] B. Sisman, J. Yamagishi, S. King, and H. Li, "An overview of voice conversion and its challenges:
From statistical modeling to deep learning," IEEE/ACM Transactions on Audio, Speech, Language
Processing, 2020.
[186] P. Partila, J. Tovarek, G. H. Ilk, J. Rozhon, and M. Voznak, "Deep Learning Serves Voice Cloning:
How Vulnerable Are Automatic Speaker Verification Systems to Spoofing Trials?," IEEE
Communications Magazine, vol. 58, no. 2, pp. 100-105, 2020.
[187] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, "Voiceloop: Voice fitting and synthesis via a
phonological loop," arXiv preprint arXiv:1707.06588, 2017.
[188] Y. Jia et al., "Transfer learning from speaker verification to multispeaker text-to-speech synthesis,"
in Advances in neural information processing systems, 2018, pp. 4480-4490.
[189] Y. Lee, T. Kim, and S.-Y. Lee, "Voice imitating text-to-speech neural networks," arXiv preprint
arXiv:.00927, 2018.
[190] H.-T. Luong and J. Yamagishi, "NAUTILUS: a Versatile Voice Cloning System," arXiv preprint
arXiv:2005.11004, 2020.
[191] Y. Chen et al., "Sample efficient adaptive text-to-speech," arXiv preprint arXiv:1809.10460, 2018.
[192] J. Cong, S. Yang, L. Xie, G. Yu, and G. Wan, "Data Efficient Voice Cloning from Noisy Samples
with Domain Adversarial Training," arXiv preprint arXiv:2008.04265, 2020.
[193] W. Ping et al., "Deep voice 3: 2000-speaker neural text-to-speech," Proc. ICLR, pp. 214-217, 2018.
[194] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, and A. Graves, "Conditional image
generation with pixelcnn decoders," in Advances in neural information processing systems, 2016,
pp. 4790-4798.
[195] A. Oord et al., "Parallel wavenet: Fast high-fidelity speech synthesis," in International conference
on machine learning, 2018, pp. 3918-3926: PMLR.
[196] A. Gibiansky et al., "Deep voice 2: Multi-speaker neural text-to-speech," in Advances in neural
information processing systems, 2017, pp. 2962-2970.
[197] Y. Yasuda, X. Wang, S. Takaki, and J. Yamagishi, "Investigation of enhanced Tacotron text-to-
speech synthesis systems with self-attention for pitch accent language," in ICASSP 2019-2019
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp.
6905-6909: IEEE.
[198] T. Toda et al., "The Voice Conversion Challenge 2016," in Interspeech, 2016, pp. 1632-1636.
[199] J. Lorenzo-Trueba et al., "The voice conversion challenge 2018: Promoting development of parallel
and nonparallel methods," arXiv preprint arXiv:.04262, 2018.
[200] Y. Zhao et al., "Voice conversion challenge 2020: Intra-lingual semi-parallel and cross-lingual
voice conversion," arXiv preprint arXiv:.12527, 2020.
[201] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice
conversion," IEEE Transactions on speech audio processing, vol. 6, no. 2, pp. 131-142, 1998.
[202] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation
of spectral parameter trajectory," IEEE Transactions on speech and audio processing, vol. 15, no.
8, pp. 2222-2235, 2007.
[203] E. Helander, H. Silén, T. Virtanen, and M. Gabbouj, "Voice conversion using dynamic kernel
partial least squares regression," IEEE transactions on audio, speech, language processing, vol.
20, no. 3, pp. 806-817, 2011.
[204] Z. Wu, T. Virtanen, E. S. Chng, and H. Li, "Exemplar-based sparse representation with residual
compensation for voice conversion," IEEE/ACM Transactions on Audio, Speech, Language
Processing, vol. 22, no. 10, pp. 1506-1521, 2014.
[205] T. Nakashika, T. Takiguchi, and Y. Ariki, "High-order sequence modeling using speaker-
dependent recurrent temporal restricted Boltzmann machines for voice conversion," in Fifteenth
annual conference of the international speech communication association, 2014.
[206] H. Ming, D.-Y. Huang, L. Xie, J. Wu, M. Dong, and H. Li, "Deep Bidirectional LSTM Modeling
of Timbre and Prosody for Emotional Voice Conversion," in Interspeech, 2016, pp. 2453-2457.
[207] L. Sun, S. Kang, K. Li, and H. Meng, "Voice conversion using deep bidirectional long short-term
memory based recurrent neural networks," in 2015 IEEE international conference on acoustics,
speech and signal processing (ICASSP), 2015, pp. 4869-4873: IEEE.
[208] J. Wu, Z. Wu, and L. Xie, "On the use of i-vectors and average voice model for voice conversion
without parallel data," in 2016 Asia-Pacific Signal and Information Processing Association Annual
Summit and Conference (APSIPA), 2016, pp. 1-6: IEEE.
[209] L.-J. Liu, Z.-H. Ling, Y. Jiang, M. Zhou, and L.-R. Dai, "WaveNet Vocoder with Limited Training
Data for Voice Conversion," in Interspeech, 2018, pp. 1983-1987.
[210] P.-c. Hsu, C.-h. Wang, A. T. Liu, and H.-y. Lee, "Towards robust neural vocoding for speech
generation: A survey," arXiv preprint arXiv:.02461, 2019.
[211] T. Kaneko and H. Kameoka, "Cyclegan-vc: Non-parallel voice conversion using cycle-consistent
adversarial networks," in 2018 26th European Signal Processing Conference (EUSIPCO), 2018,
pp. 2100-2104: IEEE.
[212] J.-c. Chou, C.-c. Yeh, H.-y. Lee, and L.-s. Lee, "Multi-target voice conversion without parallel data
by adversarially learning disentangled audio representations," arXiv preprint arXiv:.02812, 2018.
[213] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, "Cyclegan-vc2: Improved cyclegan-based non-
parallel voice conversion," in ICASSP 2019-2019 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), 2019, pp. 6820-6824: IEEE.
[214] F. Fang, J. Yamagishi, I. Echizen, and J. Lorenzo-Trueba, "High-quality nonparallel voice
conversion based on cycle-consistent adversarial network," in 2018 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5279-5283: IEEE.
[215] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, "Voice conversion from unaligned
corpora using variational autoencoding wasserstein generative adversarial networks," arXiv
preprint arXiv:.00849, 2017.
[216] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "Stargan-vc: Non-parallel many-to-many voice
conversion using star generative adversarial networks," in 2018 IEEE Spoken Language
Technology Workshop (SLT), 2018, pp. 266-273: IEEE.
[217] M. Zhang, B. Sisman, L. Zhao, and H. Li, "DeepConversion: Voice conversion with limited parallel
training data," Speech Communication, 2020.
[218] W.-C. Huang et al., "Unsupervised representation disentanglement using cross domain features and
adversarial learning in variational autoencoder based voice conversion," IEEE Transactions on
Emerging Topics in Computational Intelligence, vol. 4, no. 4, pp. 468-479, 2020.
[219] K. Qian, Z. Jin, M. Hasegawa-Johnson, and G. J. Mysore, "F0-consistent many-to-many non-
parallel voice conversion via conditional autoencoder," in ICASSP 2020-2020 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6284-6288: IEEE.
[220] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, "Unsupervised speech representation
learning using wavenet autoencoders," IEEE/ACM transactions on audio, speech, language
processing, vol. 27, no. 12, pp. 2041-2053, 2019.
[221] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model
based on generative adversarial networks with multi-resolution spectrogram," in ICASSP 2020-
2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020,
pp. 6199-6203: IEEE.
[222] K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo, "AttS2S-VC: Sequence-to-sequence voice
conversion with attention and context preservation mechanisms," in ICASSP 2019-2019 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6805-
6809: IEEE.
[223] S.-w. Park, D.-y. Kim, and M.-c. Joe, "Cotatron: Transcription-guided speech encoder for any-to-
many voice conversion without parallel data," arXiv preprint arXiv:.03295, 2020.
[224] W.-C. Huang, T. Hayashi, Y.-C. Wu, H. Kameoka, and T. Toda, "Voice transformer network:
Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining," arXiv
preprint arXiv:.06813, 2019.
[225] H. Lu et al., "One-Shot Voice Conversion with Global Speaker Embeddings," in INTERSPEECH,
2019, pp. 669-673.
[226] S. Liu, J. Zhong, L. Sun, X. Wu, X. Liu, and H. Meng, "Voice Conversion Across Arbitrary
Speakers Based on a Single Target-Speaker Utterance," in Interspeech, 2018, pp. 496-500.
[227] T.-h. Huang, J.-h. Lin, and H.-y. Lee, "How Far Are We from Robust Voice Conversion: A
Survey," in 2021 IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 514-521: IEEE.
[228] N. Li, D. Tuo, D. Su, Z. Li, D. Yu, and A. Tencent, "Deep Discriminative Embeddings for Duration
Robust Speaker Verification," in Interspeech, 2018, pp. 2262-2266.
[229] J.-c. Chou, C.-c. Yeh, and H.-y. Lee, "One-shot voice conversion by separating speaker and content
representations with instance normalization," arXiv preprint arXiv:.05742, 2019.
[230] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, "Autovc: Zero-shot voice style
transfer with only autoencoder loss," in International Conference on Machine Learning, 2019, pp.
5210-5219: PMLR.
[231] Y. Rebryk and S. Beliaev, "ConVoice: Real-Time Zero-Shot Voice Style Transfer with
Convolutional Network," arXiv preprint arXiv:.07815, 2020.
[232] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Fifth ISCA workshop on
speech synthesis, 2004.
[233] A. Kurematsu, K. Takeda, Y. Sagisaka, S. Katagiri, H. Kuwabara, and K. Shikano, "ATR Japanese
speech database as a tool of speech recognition and synthesis," Speech communication, vol. 9, no.
4, pp. 357-363, 1990.
[234] H. Kawahara, I. Masuda-Katsuse, and A. De Cheveigne, "Restructuring speech representations
using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0
extraction: Possible role of a repetitive structure in sounds," Speech communication, vol. 27, no. 3-
4, pp. 187-207, 1999.
[235] M. R. Kamble, H. B. Sailor, H. A. Patil, and H. Li, "Advances in anti-spoofing: from the perspective
of ASVspoof challenges," APSIPA Transactions on Signal Information Processing, vol. 9, 2020.
[236] J. Yi et al., "Half-Truth: A Partially Fake Audio Detection Dataset," arXiv preprint arXiv:.03617,
2021.
[237] X. Li et al., "Replay and synthetic speech detection with res2net architecture," in ICASSP 2021-
2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021,
pp. 6354-6358: IEEE.
[238] P. Aravind, U. Nechiyil, and N. Paramparambath, "Audio spoofing verification using deep
convolutional neural networks by transfer learning," arXiv preprint arXiv:.03464, 2020.
[239] J. Monteiro, J. Alam, and T. H. J. C. S. Falk, "Generalized end-to-end detection of spoofing attacks
to automatic speaker recognizers," Computer Speech Language, vol. 63, p. 101096, 2020.
[240] Y. Gao, T. Vuong, M. Elyasi, G. Bharaj, and R. Singh, "Generalized Spoofing Detection Inspired
from Audio Generation Artifacts," arXiv preprint arXiv:.04111, 2021.
[241] Z. Zhang, X. Yi, and X. Zhao, "Fake Speech Detection Using Residual Network with Transformer
Encoder," in Proceedings of the 2021 ACM Workshop on Information Hiding and Multimedia
Security, 2021, pp. 13-22.
[242] R. K. Das, J. Yang, and H. Li, "Data Augmentation with Signal Companding for Detection of
Logical Access Attacks," in ICASSP 2021-2021 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), 2021, pp. 6349-6353: IEEE.
[243] H. Ma, J. Yi, J. Tao, Y. Bai, Z. Tian, and C. Wang, "Continual Learning for Fake Audio Detection,"
arXiv preprint arXiv:.07286, 2021.
[244] C. Borrelli, P. Bestagini, F. Antonacci, A. Sarti, and S. Tubaro, "Synthetic speech detection through
short-term and long-term prediction traces," EURASIP Journal on Information Security, vol. 2021,
no. 1, pp. 1-14, 2021.
[245] E. A. AlBadawy, S. Lyu, and H. Farid, "Detecting AI-Synthesized Speech Using Bispectral
Analysis," in CVPR Workshops, 2019, pp. 104-109.
[246] A. K. Singh and P. Singh, "Detection of AI-Synthesized Speech Using Cepstral & Bispectral
Statistics," arXiv preprint arXiv:.01934, 2020.
[247] H. Malik, "Fighting AI with AI: Fake Speech Detection Using Deep Learning," in 2019 AES
INTERNATIONAL CONFERENCE ON AUDIO FORENSICS (June 2019), 2019.
[248] T. Chen, A. Kumar, P. Nagarsheth, G. Sivaraman, and E. Khoury, "Generalization of audio
deepfake detection," in Proc. Odyssey 2020 The Speaker and Language Recognition Workshop,
2020, pp. 132-137.
[249] L. Huang and C.-M. Pun, "Audio Replay Spoof Attack Detection by Joint Segment-Based Linear
Filter Bank Feature Extraction and Attention-Enhanced DenseNet-BiLSTM Network," IEEE/ACM
Transactions on Audio, Speech, Language Processing, vol. 28, pp. 1813-1825, 2020.
[250] Z. Wu, R. K. Das, J. Yang, and H. Li, "Light Convolutional Neural Network with Feature
Genuinization for Detection of Synthetic Speech Attacks," arXiv preprint arXiv:.09637, 2020.
[251] Y. Zhang, F. Jiang, and Z. Duan, "One-class learning towards synthetic voice spoofing detection,"
IEEE Signal Processing Letters, vol. 28, pp. 937-941, 2021.
[252] A. Gomez-Alanis, A. M. Peinado, J. A. Gonzalez, and A. M. Gomez, "A light convolutional GRU-
RNN deep feature extractor for ASV spoofing detection," in Proc. Interspeech, 2019, vol. 2019,
pp. 1068-1072.
[253] G. Hua, A. Bengjinteoh, and H. Zhang, "Towards End-to-End Synthetic Speech Detection," IEEE
Signal Processing Letters, 2021.
[254] R. Wang et al., "DeepSonar: Towards Effective and Robust Detection of AI-Synthesized Fake
Voices," in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 1207-
1216.
[255] Z. Jiang, H. Zhu, L. Peng, W. Ding, and Y. Ren, "Self-Supervised Spoofing Audio Detection
Scheme," in INTERSPEECH, 2020, pp. 4223-4227.
[256] H. Delgado et al., ASVspoof 2021, Available at: https://www.asvspoof.org/. Accessed: August
6, 2021.
[257] R. Reimao and V. Tzerpos, "FoR: A Dataset for Synthetic Speech Detection," in 2019 International
Conference on Speech Technology and Human-Computer Dialogue (SpeD), 2019, pp. 1-10: IEEE.
[258] R. Thurairatnam, Raising Awareness About The Dangers of Synthetic Media, Available at:
https://www.dessa.com/post/why-we-made-the-worlds-most-realistic-deepfake. Accessed: August
15, 2021.
[259] Y. Li et al., "DeepFake-o-meter: An Open Platform for DeepFake Detection," in 2021 IEEE
Security and Privacy Workshops (SPW), 2021, pp. 277-281: IEEE.
[260] V. Mehta, P. Gupta, R. Subramanian, and A. Dhall, "FakeBuster: A DeepFakes Detection Tool for
Video Conferencing Scenarios," in 26th International Conference on Intelligent User Interfaces,
2021, pp. 61-63.
[261] Reality Defender 2020: A FORCE AGAINST DEEPFAKES, Available at:
https://rd2020.org/index.html. Accessed: August 03, 2021.
[262] R. Durall, M. Keuper, F.-J. Pfreundt, and J. Keuper, "Unmasking deepfakes with simple features,"
arXiv preprint arXiv:.00686, 2019.
[263] T. Zhao, X. Xu, M. Xu, H. Ding, Y. Xiong, and W. Xia, "Learning to Recognize Patch-Wise
Consistency for Deepfake Detection," arXiv preprint arXiv:.09311, 2020.
[264] H. Malik, "Securing voice-driven interfaces against fake (cloned) audio attacks," in 2019 IEEE
Conference on Multimedia Information Processing and Retrieval (MIPR), 2019, pp. 512-517:
IEEE.
[265] Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu, "Celeb-df: A new dataset for deepfake forensics," arXiv
preprint arXiv:1909.12962, 2019.
[266] X. Li, K. Yu, S. Ji, Y. Wang, C. Wu, and H. Xue, "Fighting against deepfake: Patch & pair
convolutional neural networks (ppcnn)," in Companion Proceedings of the Web Conference 2020,
2020, pp. 88-89.
[267] Z. Guo, G. Yang, J. Chen, and X. Sun, "Fake face detection via adaptive residuals extraction
network," arXiv preprint arXiv:.04945, 2020.
[268] B. Hosler et al., "Do Deepfakes Feel Emotions? A Semantic Approach to Detecting Deepfakes via
Emotional Inconsistencies," in Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 2021, pp. 1013-1022.
[269] I. Amerini and R. Caldelli, "Exploiting prediction error inconsistencies through LSTM-based
classifiers to detect deepfake videos," in Proceedings of the 2020 ACM Workshop on Information
Hiding and Multimedia Security, 2020, pp. 97-102.
[270] J. Fei, Z. Xia, P. Yu, and F. Xiao, "Exposing AI-generated videos with motion magnification,"
Multimedia Tools Applications, pp. 1-14, 2020.
[271] P. Korshunov and S. Marcel, "Deepfakes: a new threat to face recognition? assessment and
detection," arXiv preprint arXiv:1812.08685, 2018.
[272] S.-Y. Wang, O. Wang, R. Zhang, A. Owens, and A. A. Efros, "CNN-generated images are
surprisingly easy to spot... for now," in Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2020, pp. 8695-8704.
[273] F. Marra, D. Gragnaniello, D. Cozzolino, and L. Verdoliva, "Detection of gan-generated fake
images over social networks," in 2018 IEEE Conference on Multimedia Information Processing
and Retrieval (MIPR), 2018, pp. 384-389: IEEE.
[274] M. S. Rana and A. H. Sung, "Deepfakestack: A deep ensemble-based learning technique for
deepfake detection," in 2020 7th IEEE International Conference on Cyber Security and Cloud
Computing (CSCloud)/2020 6th IEEE International Conference on Edge Computing and Scalable
Cloud (EdgeCom), 2020, pp. 70-75: IEEE.
[275] R. Wang, L. Ma, F. Juefei-Xu, X. Xie, J. Wang, and Y. Liu, "Fakespotter: A simple baseline for
spotting ai-synthesized fake faces," arXiv preprint arXiv:.06122, 2019.
[276] H. Khalid and S. S. Woo, "OC-FakeDect: Classifying deepfakes using one-class variational
autoencoder," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition Workshops, 2020, pp. 656-657.
[277] D. Cozzolino, A. Rössler, J. Thies, M. Nießner, and L. Verdoliva, "ID-Reveal: Identity-aware
DeepFake Video Detection," arXiv preprint arXiv:.02512, 2020.
[278] L. Trinh and Y. Liu, "An Examination of Fairness of AI Models for Deepfake Detection," arXiv
preprint arXiv:.00558, 2021.
[279] N. Carlini and H. Farid, "Evading deepfake-image detectors with white-and black-box attacks," in
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Workshops, 2020, pp. 658-659.
[280] P. Neekhara, B. Dolhansky, J. Bitton, and C. C. Ferrer, "Adversarial threats to deepfake detection:
A practical perspective," in Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 2021, pp. 923-932.
[281] C.-y. Huang, Y. Y. Lin, H.-y. Lee, and L.-s. Lee, "Defending your voice: Adversarial attack on
voice conversion," in 2021 IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 552-
559: IEEE.
[282] Y.-Y. Ding, J.-X. Zhang, L.-J. Liu, Y. Jiang, Y. Hu, and Z.-H. Ling, "Adversarial Post-Processing
of Voice Conversion against Spoofing Detection," in 2020 Asia-Pacific Signal and Information
Processing Association Annual Summit and Conference (APSIPA ASC), 2020, pp. 556-560: IEEE.
[283] R. Durall, M. Keuper, and J. Keuper, "Watch your up-convolution: Cnn based generative deep
neural networks are failing to reproduce spectral distributions," in Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, 2020, pp. 7890-7899.
[284] S. Jung and M. Keuper, "Spectral distribution aware image generation," arXiv preprint
arXiv:.03110, 2020.
[285] Y. Huang et al., "FakeRetouch: Evading DeepFakes Detection via the Guidance of Deliberate
Noise," arXiv preprint arXiv:.09213, 2020.
[286] J. C. Neves, R. Tolosana, R. Vera-Rodriguez, V. Lopes, H. Proença, and J. Fierrez, "Ganprintr:
Improved fakes and evaluation of the state of the art in face manipulation detection," IEEE Journal
of Selected Topics in Signal Processing, vol. 14, no. 5, pp. 1038-1048, 2020.
[287] T. Osakabe, M. Tanaka, Y. Kinoshita, and H. Kiya, "CycleGAN without checkerboard artifacts for
counter-forensics of fake-image detection," in International Workshop on Advanced Imaging
Technology (IWAIT) 2021, 2021, vol. 11766, p. 1176609: International Society for Optics and
Photonics.
[288] Y. Huang et al., "Fakepolisher: Making deepfakes more detection-evasive by shallow
reconstruction," in Proceedings of the 28th ACM International Conference on Multimedia, 2020,
pp. 1217-1226.
[289] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, "Faceforensics: A large-
scale video dataset for forgery detection in human faces," arXiv preprint arXiv:1803.09179, 2018.
[290] Faceswap, Available at: https://github.com/MarekKowalski/FaceSwap/. Accessed: August 14,
2020.
[291] J. Thies, M. Zollhöfer, and M. Nießner, "Deferred neural rendering: Image synthesis using neural
textures," ACM Transactions on Graphics (TOG), vol. 38, no. 4, pp. 1-12, 2019.
[292] S. Abu-El-Haija et al., "Youtube-8m: A large-scale video classification benchmark," arXiv preprint
arXiv:1609.08675, 2016.
[293] A. Aravkin, J. V. Burke, L. Ljung, A. Lozano, and G. Pillonetto, "Generalized Kalman smoothing:
Modeling and algorithms," Automatica, vol. 86, pp. 63-86, 2017.
[294] E. Reinhard, M. Adhikhmin, B. Gooch, and P. Shirley, "Color transfer between images," IEEE
Computer Graphics and Applications, vol. 21, no. 5, pp. 34-41, 2001.
[295] B. Dolhansky, R. Howes, B. Pflaum, N. Baram, and C. C. Ferrer, "The deepfake detection challenge
(dfdc) preview dataset," arXiv preprint arXiv:1910.08854, 2019.
[296] B. Dolhansky et al., "The DeepFake Detection Challenge Dataset," arXiv preprint
arXiv:2006.07397, 2020.
[297] L. Jiang, R. Li, W. Wu, C. Qian, and C. C. Loy, "Deeperforensics-1.0: A large-scale dataset for
real-world face forgery detection," in Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2020, pp. 2889-2898.
[298] B. Zi, M. Chang, J. Chen, X. Ma, and Y.-G. Jiang, "Wilddeepfake: A challenging real-world dataset
for deepfake detection," in Proceedings of the 28th ACM International Conference on Multimedia,
2020, pp. 2382-2390.
[299] K. Ito, The LJ speech dataset, Available at: https://keithito.com/LJ-Speech-Dataset. Accessed:
December 22, 2020.
[300] The M-AILABS speech dataset, Available at: https://www.caito.de/2019/01/the-m-ailabs-speech-
dataset/. Accessed: Feb 25, 2021.
[301] R. Ardila et al., "Common voice: A massively-multilingual speech corpus," arXiv preprint
arXiv:1912.06670, 2019.
[302] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh, "Recycle-gan: Unsupervised video retargeting," in
Proceedings of the European conference on computer vision (ECCV), 2018, pp. 119-135.
[303] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector
quantization," Journal of the Acoustical Society of Japan (E), vol. 11, no. 2, pp. 71-76, 1990.
[304] P. Fraga-Lamas and T. M. Fernández-Caramés, "Fake News, Disinformation, and Deepfakes:
Leveraging Distributed Ledger Technologies and Blockchain to Combat Digital Deception and
Counterfeit Reality," IT Professional, vol. 22, no. 2, pp. 53-59, 2020.
[305] H. R. Hasan and K. Salah, "Combating deepfake videos using blockchain and smart contracts,"
IEEE Access, vol. 7, pp. 41596-41606, 2019.
[306] M. Feng and H. Xu, "Deep reinforcement learning based optimal defense for cyber-physical
system in presence of unknown cyber-attack," in 2017 IEEE Symposium Series on Computational
Intelligence (SSCI), 2017, pp. 1-8: IEEE.
[307] A. R. Gonçalves, R. P. Violato, P. Korshunov, S. Marcel, and F. O. Simoes, "On the generalization
of fused systems in voice presentation attack detection," in 2017 International Conference of the
Biometrics Special Interest Group (BIOSIG), 2017, pp. 1-5: IEEE.
[308] M. Fang, Y. Li, and T. Cohn, "Learning how to active learn: A deep reinforcement learning
approach," arXiv preprint arXiv:1708.02383, 2017.
[309] Z. Liu, J. Wang, S. Gong, H. Lu, and D. Tao, "Deep reinforcement active learning for human-in-
the-loop person re-identification," in Proceedings of the IEEE/CVF International Conference on
Computer Vision, 2019, pp. 6122-6131.
[310] B. Deokar and A. Hazarnis, "Intrusion detection system using log files and reinforcement learning,"
International Journal of Computer Applications, vol. 45, no. 19, pp. 28-35, 2012.
[311] R. Baumann, K. M. Malik, A. Javed, A. Ball, B. Kujawa, and H. Malik, "Voice spoofing detection
corpus for single and multi-order audio replays," Computer Speech & Language, vol. 65, p. 101132,
2021.