Review
A Review of Modern Audio Deepfake Detection Methods: Challenges and Future Directions
Zaynab Almutairi 1, * and Hebah Elgibreen 1,2
1 Information Technology Department, College of Computer and Information Sciences, King Saud University,
Riyadh P.O. Box 145111, Saudi Arabia; [email protected]
2 Artificial Intelligence Center of Advanced Studies (Thakaa), King Saud University,
Riyadh P.O. Box 145111, Saudi Arabia
* Correspondence: [email protected]
Abstract: A number of AI-generated tools are used today to clone human voices, leading to a new technology known as Audio Deepfakes (ADs). Despite being introduced to enhance human lives as audiobooks, ADs have been used to disrupt public safety. ADs have thus recently come to the attention of researchers, with Machine Learning (ML) and Deep Learning (DL) methods being developed to detect them. In this article, a review of existing AD detection methods was conducted, along with a comparative description of the available faked audio datasets. The article introduces types of AD attacks and then outlines and analyzes the detection methods and datasets for imitation- and synthetic-based Deepfakes. To the best of the authors' knowledge, this is the first review targeting imitated and synthetically generated audio detection methods. The similarities and differences of AD detection methods are summarized by providing a quantitative comparison that finds that the method type affects the performance more than the audio features themselves, in which a substantial tradeoff between the accuracy and scalability exists. Moreover, at the end of this article, the potential research directions and challenges of Deepfake detection methods are discussed to discover that, even though AD detection is an active area of research, further research is still needed to address the existing gaps. This article can be a starting point for researchers to understand the current state of the AD literature and investigate more robust detection models that can detect fakeness even if the target audio contains accented voices or real-world noises.

Citation: Almutairi, Z.; Elgibreen, H. A Review of Modern Audio Deepfake Detection Methods: Challenges and Future Directions. Algorithms 2022, 15, 155. https://doi.org/10.3390/a15050155
Keywords: Audio Deepfakes (ADs); Machine Learning (ML); Deep Learning (DL); imitated audio
types of AD have emerged, increasing the challenge in detection; they are imitation-based, synthetic-based, and replay-based, as will be explained in the following section.
With regard to Deepfakes, many detection methods have been introduced to discern fake audio files from real speech. A number of ML and DL models have been developed that use different strategies to detect fake audio. The following steps describe the AD detection process in general, as illustrated in Figure 1. First, each audio clip should be preprocessed and transformed into suitable audio features, such as Mel-spectrograms. These features are input into the detection model, which then performs the necessary operations, such as the training process. The output is fed into a fully connected layer with an activation function (for a nonlinear task) to produce a prediction probability of class 0 as fake or class 1 as real. However, there is a trade-off between accuracy and computational complexity. Further work is therefore required to improve the performance of AD detection and overcome the gaps identified in the literature.
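This general pipeline can be sketched in a few lines of code. The following pure-Python illustration is a toy under stated assumptions: log-energy frames stand in for real features such as Mel-spectrograms, and the fixed weights stand in for a trained model; it is not any specific detector from the literature.

```python
import math

def extract_features(waveform, frame_size=256):
    """Toy stand-in for feature extraction (e.g., a Mel-spectrogram):
    the log-energy of fixed-size frames of the preprocessed clip."""
    feats = []
    for i in range(0, len(waveform) - frame_size + 1, frame_size):
        frame = waveform[i:i + frame_size]
        energy = sum(x * x for x in frame) / frame_size
        feats.append(math.log(energy + 1e-9))
    return feats

def detect(features, weights, bias):
    """Linear 'fully connected' layer plus sigmoid activation, producing
    a prediction probability: class 0 is fake, class 1 is real."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 1-second clip at 8 kHz (a pure tone) and untrained weights.
clip = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
feats = extract_features(clip)
p_real = detect(feats, weights=[0.1] * len(feats), bias=0.0)
assert 0.0 <= p_real <= 1.0
```

In a real system, the weights would be learned during the training step described above, and the feature extractor would be a proper audio front end.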
Figure 1. An illustration of the AD detection process.
AD detection has therefore become an active area of research with the development of advanced techniques and DL methods. However, with such advancements, current DL methods are struggling, and further investigation is necessary to understand what area of AD detection needs further development. Moreover, a comparative analysis of current methods is also important, and to the best of the authors' knowledge, a review of imitated and synthetically generated audio detection methods is missing from the literature. Thus, this article introduces the following significant contributions to the literature:
• A review of state-of-the-art AD detection methods that target imitated and synthetically generated voices;
• provision of a brief description of current AD datasets;
• a comparative analysis of existing methods and datasets to highlight the strengths and weaknesses of each AD detection family;
• a quantitative comparison of recent state-of-the-art AD detection methods; and
• a discussion of the challenges and potential future research directions in this area.
The rest of this article is organized as follows. An AD definition and its types are presented in Section 2. Section 3 discusses and summarizes the current methods developed for AD detection. Section 4 presents the generated audio dataset used for AD detection and highlights its characteristics. Section 5 presents a quantitative comparison of recent state-of-the-art AD detection methods. Section 6 presents the challenges involved in detecting AD and discusses potential future research directions for the detection methods. Finally, this article concludes with Section 7, which summarizes our findings.
2. Types of Audio Deepfake Attacks
AD technology is a recent invention that allows users to create audio clips that sound like specific people saying things they did not say [2]. This technology was initially developed for a variety of applications intended to improve human life, such as audiobooks, where it could be used to imitate soothing voices [8]. As defined from the AD literature, there are three main types of audio fakeness: imitation-based, synthetic-based, and replay-based Deepfakes.
Imitation-based Deepfakes are "a way of transforming speech (secret audio) so that it sounds like another speech (target audio) with the primary purpose of protecting the privacy of the secret audio" [3]. Voices can be imitated in different ways, for example, by using humans with similar voices who are able to imitate the original speaker. However, masking algorithms, such as Efficient Wavelet Mask (EWM), have been introduced to imitate audio and Deepfake speech. In particular, an original and target audio will be recorded with similar characteristics. Then, as illustrated in Figure 2, the signal of the original audio (Figure 2a) will be transformed to say the speech in the target audio (Figure 2b) using an imitation generation method that will generate a new speech, shown in Figure 2c, which is the fake one. It is thus difficult for humans to discern between the fake and real audio generated by this method [3].
Figure 2. Imitation-based Deepfake.
Synthetic-based or Text-To-Speech (TTS) aims to transform text into acceptable and natural speech in real time [9] and consists of three modules: a text analysis model, an acoustic model, and a vocoder. To generate synthetic Deepfake audio, two crucial steps should be followed. First, clean and structured raw audio should be collected, with a transcript text of the audio speech. Second, the TTS model must be trained using the collected data to build a synthetic audio generation model. Tacotron 2, Deep Voice 3, and FastSpeech 2 are well-known model generation techniques and are able to produce the highest level of natural-sounding audio [10,11]. Tacotron 2 creates Mel-spectrograms with a modified WaveNet vocoder [12]. Deep Voice 3 is a neural text-to-speech model that uses a position-augmented attention mechanism for an attention-based decoder [13]. FastSpeech 2 produces high-quality results with the fastest training time [11]. In the synthetic technique, the transcript text with the voice of the target speaker will be fed into the generation model. The text analysis module then processes the incoming text and converts it into linguistic characteristics. Then, the acoustic module extracts the parameters of the target speaker from the dataset depending on the linguistic features generated from the text analysis module. Last, the vocoder will learn to create speech waveforms based on the acoustic feature parameters, and the final audio file will be generated, which includes the synthetic fake audio in a waveform format. Figure 3 illustrates the process of synthetic-based voice generation.
Figure 3. The Synthetic-based Deepfake Process.
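The three-module flow described above can be sketched schematically. Every function below is a hypothetical stand-in: real systems such as Tacotron 2 implement each stage with trained neural networks, whereas here the "linguistic features", "acoustic frames", and "waveform" are dummy values that only trace the data flow text → linguistic features → acoustic frames → waveform.

```python
# Schematic sketch of the TTS pipeline: text analysis -> acoustic model -> vocoder.

def text_analysis(text):
    """Convert incoming text into linguistic features (here: letter tokens)."""
    return [ch.lower() for ch in text if ch.isalpha()]

def acoustic_model(linguistic_feats, speaker_params):
    """Map linguistic features to acoustic feature frames (e.g., Mel-spectrogram
    frames), conditioned on the target speaker's parameters."""
    return [[speaker_params["pitch"] * (ord(tok) % 7) for _ in range(4)]
            for tok in linguistic_feats]

def vocoder(acoustic_frames):
    """Turn acoustic frames into a waveform; a real vocoder (e.g., WaveNet)
    learns this mapping, here we just emit one dummy sample per frame value."""
    return [v / 100.0 for frame in acoustic_frames for v in frame]

speaker = {"pitch": 1.2}  # hypothetical target-speaker parameters
wave = vocoder(acoustic_model(text_analysis("Hello world"), speaker))
assert len(wave) > 0  # final output: synthetic fake audio in waveform format
```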
Replay-based Deepfakes are a type of malicious work that aims to replay a recording of the target speaker's voice [14]. There are two types: far-field detection and cut-and-paste detection. In far-field detection, a microphone recording of the victim recording is played as a test segment on a telephone handset with a loudspeaker [15]. Meanwhile, cutting and pasting involves faking the sentence required by a text-dependent system [15]. This article will focus on Deepfake methods spoofing real voices rather than approaches that use edited recordings. This review will thus cover the detection methods used to identify synthetic and imitation Deepfakes, and replay-based attacks will be considered out of scope.
3. Fake Audio Detection Methods

The wide range of accessible tools and methods capable of generating fake audio has led to significant recent attention to AD detection with different languages. This section will therefore present the latest work on detecting imitated and synthetically produced voices. In general, the current methods can be divided into two main types: ML and DL methods.
Classical ML models have been widely adopted in AD detection. Rodríguez-Ortega et al. [3] contributed to the literature on detecting fake audio in two aspects. They first developed a fake audio dataset based on the imitation method by extracting the entropy features of real and fake audio. Using the created H-Voice dataset [16], the researchers were able to build an ML model using Logistic Regression (LR) to detect fake audio. The model achieved a 98% success rate in detection tasks, but the data needed to be pre-processed manually to extract the relevant features.

Kumar-Singh and Singh [17] proposed a Quadratic Support Vector Machine (Q-SVM) model to distinguish synthetic audio from natural human voices. When adopting the model for binary classification, the authors divided the audio into two classes, human and AI-generated. This model was compared to other ML methods, such as Linear Discriminant, Quadratic Discriminant, Linear SVM, weighted K-Nearest Neighbors (KNN), boosted tree ensemble, and LR. As a result, they found that Q-SVM outperformed the other classical methods, achieving an accuracy of 97.56% with a misclassification rate of 2.43%. Moreover, Borrelli et al. [18] developed an SVM model with Random Forest (RF) to predict synthetic voices based on a new audio feature called Short-Term Long-Term (STLT). The models were trained using the Automatic Speaker Verification (ASV) spoof challenge 2019 [19] dataset. Experiments found that the performance of SVM was higher than that of RF by 71%. Liu et al. [20] compared
the robustness of SVM with the DL method called Convolutional Neural Network (CNN)
to detect the faked stereo audio from the real ones. From that comparison, it was found
that CNN is more robust than SVM even though both achieved a high accuracy of 99% in
the detection. However, SVM suffered from what the LR model had faced in the feature
extraction process.
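As an illustration of the kind of hand-crafted feature these classical models rely on, the entropy of an audio signal's amplitude distribution can be computed as below. This is a sketch only: the 16-bin histogram and pure-Python implementation are assumptions for clarity, not the exact feature pipeline used with H-Voice.

```python
import math
from collections import Counter

def signal_entropy(samples, n_bins=16):
    """Shannon entropy (in bits) of the amplitude histogram of a signal --
    the sort of entropy feature extracted manually before training an
    ML detector. Binning scheme here is a hypothetical choice."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant signal
    bins = Counter(min(int((s - lo) / width), n_bins - 1) for s in samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in bins.values())

flat = [0.5] * 1000                           # constant signal: no uncertainty
varied = [math.sin(0.01 * i) for i in range(1000)]
assert signal_entropy(flat) == 0.0
assert signal_entropy(varied) > 0.0
```

Features like this would then be assembled into a vector per clip and fed to an LR or SVM classifier, which is exactly the manual step the DL methods below try to avoid.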
According to the works discussed thus far, the features in the ML models need to be manually extracted, and intensive preprocessing is needed before training to ensure good performance. However, this is time-consuming and can lead to inconsistencies, which has led the research community to develop high-level DL methods. To address this, Subramani and Rao [21] created a novel approach for detecting synthetic audio based on two CNN models, EfficientCNN and RES-EfficientCNN. As a result, RES-EfficientCNN achieved a higher F1-score of 97.61 than EfficientCNN (94.14 F1-score) when
tested over the ASV spoof challenge 2019 dataset [19]. M. Ballesteros et al. [5] developed
a classification model named Deep4SNet that visualized the audio dataset based on a 2D
CNN model (histogram) to classify imitation and synthetic audio. Deep4SNet showed
an accuracy of 98.5% in detecting imitation and synthetic audio. However, Deep4SNet’s
performance was not scalable and was affected by the data transformation process. E.R.
Bartusiak and E.J. Delp [22] compared the performance of the CNN model against the
random method in detecting synthetic audio signals. Although the CNN achieved an accuracy of 85.99%, higher than that of the baseline classifier, it suffered from an overfitting
problem. The Lataifeh et al. [23] experimental study compared CNN and Bidirectional
Long Short-Term Memory (BiLSTM) performance with ML models. The proposed method
targeted the imitation-based fakeness of the Quranic audio clips dataset named Arabic Di-
versified Audio (AR-DAD) [24]. They tested the ability of CNN and BiLSTM to distinguish
real voices from imitators. In addition, ML methods such as SVM, SVM-Linear, Radial
Basis Function (SVMRBF), LR, Decision Tree (DT), RF, and Gradient Boosting (XGBoost)
were also tested. Ultimately, the study found that SVM had the highest accuracy with 99%,
while the lowest was DT with 73.33%. Meanwhile, CNN achieved a detection rate higher
than BiLSTM with 94.33%. Although the accuracy of the CNN method was lower than that
of the ML models, it was better in capturing spurious correlations. It was also effective
in extracting features that could be achieved automatically with generalization abilities.
However, the main limitation of the CNN models that are used thus far for AD is that
they can only handle images as input, and thus the audio needs to be preprocessed and
transformed to a spectrogram or 2D figure to be able to provide it as input to the network.
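The spectrogram transformation that these image-based CNNs require can be sketched as a short-time Fourier analysis of the waveform. The naive DFT below is an illustrative assumption chosen for self-containment; production pipelines use an FFT (e.g., via librosa or torchaudio).

```python
import cmath, math

def spectrogram(samples, frame_len=64, hop=32):
    """Transform a waveform into a magnitude spectrogram: a 2-D, image-like
    time-frequency representation suitable as CNN input."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]                      # Hamming window
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [s * w for s, w in zip(samples[start:start + frame_len], window)]
        spectrum = [abs(sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                            for n, x in enumerate(frame)))
                    for k in range(frame_len // 2 + 1)]       # one-sided magnitudes
        frames.append(spectrum)
    return frames  # rows = time frames, columns = frequency bins

tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]  # tone at bin 8
spec = spectrogram(tone)
assert len(spec) == 7 and len(spec[0]) == 33
# each frame's energy should peak at frequency bin 8
assert max(range(33), key=spec[0].__getitem__) == 8
```

The resulting time-by-frequency grid is what gets rendered as the "2D figure" that the CNN consumes.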
Zhenchun Lei et al. [25] proposed a 1-D CNN and Siamese CNN to detect fake audio.
In the case of the 1-D CNN, the input to the model was the speech log-probabilities, while
the Siamese CNN was based on two trained GMM models. The Siamese CNN contained two identical subnetworks, each the same as the 1-D CNN, whose outputs were concatenated through a fully connected layer with a softmax output layer. The two models were tested over the
ASVspoof 2019 dataset to find that the proposed Siamese CNN outperformed the GMM
and 1-D CNN by improving the min-tDCF and Equal Error Rate (EER) (EER is the error
rate where the false-negative rate and the false-positive rate are equal [26]) by ~55% when
using the LFCC features. However, the performance was slightly lower when using the
CQCC features. It was also found that the model is not sufficiently robust and works with
a specific type of feature.
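The EER used above can be computed directly from its definition (the operating point where the false-negative and false-positive rates coincide). The threshold-sweeping sketch below is a simple approximation, assuming higher scores mean "more likely genuine"; evaluation toolkits typically interpolate the ROC instead.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Sweep a decision threshold over all observed scores and return the
    error rate at the point where FPR (spoof accepted) and FNR (genuine
    rejected) are closest -- an approximation of the EER."""
    best_gap, eer = float("inf"), 1.0
    for thr in sorted(set(genuine_scores + spoof_scores)):
        fnr = sum(s < thr for s in genuine_scores) / len(genuine_scores)
        fpr = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        if abs(fpr - fnr) < best_gap:
            best_gap, eer = abs(fpr - fnr), (fpr + fnr) / 2
    return eer

# Toy scores: one genuine clip scores below one spoof clip, so the best
# threshold still confuses 1 of 4 in each class -> EER = 0.25.
genuine = [0.9, 0.8, 0.7, 0.35]
spoof = [0.6, 0.3, 0.2, 0.1]
assert equal_error_rate(genuine, spoof) == 0.25
```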
Another CNN model was proposed in [27], where the audio was transferred to scatter
plot images of neighboring samples before giving it as input to the CNN model. The
developed model was trained over a dataset called the Fake or Real (FoR) dataset [28]
to evaluate the model, and the model accuracy reached 88.9%. Although the proposed
model addressed the generalization problem of DL-based models by training with data
from different generation algorithms, its performance was not as good as the others in
the literature. The accuracy (88%) and EER (11%) were worse than those of the other DL
models tested in the experiment. Hence, the model needs further improvement, and more
data transformers need to be included.
On the other hand, Yu et al. [29] proposed a new scoring method named Human
Log-Likelihoods (HLLs) based on the Deep Neural Network (DNN) classifier to enhance
the detection rate. They compared this with a classical scoring method called the Log-
Likelihood Ratios (LLRs) that depends on the Gaussian Mixture Model (GMM). DNN-HLLs
and GMM-LLRs have been tested with the ASV spoof challenge 2015 dataset [30] and
extracted features automatically. These tests confirmed that DNN-HLLs produced better
detection results than GMM-LLRs since they achieved an EER of 12.24.
Wang et al. [31] therefore developed a DNN model named DeepSonar that captured
the neuron behaviors of speaker recognition (SR) systems against AI-synthesized fake
audio. Their model depends on Layer-wise neuron behaviors in the classification task. The
proposed model achieved a detection rate of 98.1% with an EER of approximately 2% on the
voices of English speakers from the FoR dataset [28]. However, DeepSonar’s performance
was highly affected by real-world noise. Wijethunga et al.’s [32] research used DNNs to
differentiate synthetic and real voices and combined two DL models, CNNs and Recurrent
Neural Network (RNN). This is because CNN is efficient at extracting features, while
RNN is effective at detecting long-term dependencies in time variances. Interestingly, this
combination achieved a 94% success rate in detecting audio generated by AI synthesizers.
Nevertheless, the DNN model does not carry much artifact information from the feature
representation perspective.
Chintha et al. [33] developed two novel models that depend on a convolution RNN
for audio Deepfake classification. First, the Convolution Recurrent Neural Network Spoof
(CRNN-Spoof) model contains five layers of extracted audio signals that are fed into
a bidirectional LSTM network for predicting fake audio. Second, the Wide Inception
Residual Network Spoof (WIRE-Net-Spoof) model has a different training process and uses
a function named weighted negative log-likelihood. The CRNN-Spoof method obtained
higher results than the WIRE-Net-Spoof approach by 0.132% of the Tandem Decision Cost
Function (t-DCF) (t-DCF is a single scalar that measures the reliability of decisions made
by the systems [34]) with a 4.27% EER in the ASV spoof challenge 2019 dataset [19]. One
limitation of this study is that it used many layers and convolutional networks, which
caused it to suffer from management complexities. To address this limitation, Shan and
Tsai [35] proposed an alignment technique based on the classification models: Long Short-
Term Memory (LSTM), bidirectional LSTM, and transformer architectures. The technique
classifies each audio frame as matching or nonmatching from 50 recordings. The results
reported that bidirectional LSTM outperforms the other models with a 99.7% accuracy and
0.43% EER. However, the training process took a long time, and the dataset used in the
study was small, which led to overfitting.
In regard to transfer learning and unimodal methods, P. RahulT et al. [36] proposed a
new framework based on transfer learning and the ResNet-34 method for detecting faked
English-speaking voices. The transfer learning model was pretrained on the CNN network.
The ResNet-34 method was used for solving the vanishing gradient problem that always occurs
in any DL model. The results showed that the proposed framework achieved the best results
measured by the EER and t-DCF metrics with results of 5.32% and 0.1514%, respectively.
Although ResNet-34 solves the vanishing gradient issue, training takes a long time because
of its deep architecture. Similarly, Khochare et al. [37] investigated feature-based and image-
based approaches for classifying faked audio generated synthetically. New DL models
called the Temporal Convolutional Network (TCN) and Spatial Transformer Network (STN)
were used in this work. TCN achieved promising outcomes in distinguishing between fake
and real audio with 92% accuracy, while STN obtained an accuracy of 80%. Although the
TCN works well with sequential data, it does not work with inputs converted to Short-Time
Fourier Transform (STFT) and Mel Frequency Cepstral Coefficients (MFCC) features.
Khalid et al. [38] contributed a new Deepfake dataset named FakeAVCeleb [39]. The
authors investigated unimodal methods that contain five classifiers to evaluate their effi-
ciency in detection; the classifiers were MesoInception-4, Meso-4, Xception, EfficientNet-B0,
and VGG16. The Xception classifier was found to achieve the highest performance with a
result of 76%, while EfficientNet-B0 had the worst performance with a result of 50%. They
concluded that none of the unimodal classifiers were effective for detecting fake audio.
Alzantot et al. [40] highlighted the need to develop a system for AD detection based on
residual CNN. The main idea of this system is to extract three crucial features from the
input, MFCC, constant Q cepstral coefficients (CQCC), and STFT, to determine the Counter
Major (CM) score of the faked audio. A high CM score proves that the audio is real speech,
while a low CM score suggests that it is fake. The proposed system showed promising
results, improving the CM rate by 71% and 75% in the two metrics of t-DCF (0.1569) and EER
(6.02), respectively. However, further investigation is still needed due to the generalization
errors in the proposed system.
T. Arif et al. [41] developed a new audio feature descriptor called ELTP-LFCC based
on a Local Ternary Pattern (ELTP) and Linear Frequency Cepstral Coefficients (LFCC).
This descriptor was used with a Deep Bidirectional Long Short-Term Memory (DBiLSTM)
network to increase the robustness of the model and to detect fake audio in diverse indoor
and outdoor environmental conditions. The model created was tested over the ASVspoof
2019 dataset with synthetic and imitated-based fake audio. From the experiment, it was
found that the model performed better over the audio synthetic dataset (with 0.74% EER)
but not as well with imitated-based samples (with 33.28% EER).
An anti-Spoofing with Squeeze-Excitation and Residual neTworks (ASSERT) method
was proposed in [42] based on variants of the Squeeze-Excitation Network (SENet) and
ResNet. This method uses log power magnitude spectra (logspec) and CQCC acoustic
features to train the DNN. The model was tested with the ASVspoof 2019 dataset to find
that ASSERT obtained more than a 17% relative improvement in synthetic audio. However,
the model had zero t-DCF cost and zero EER with a logical access scenario during the test,
which indicates that the model is highly overfitting.
Based on the literature discussed thus far, we can say that although DL methods avoid
manual feature extraction and excessive training, they still require special transformations
for audio data. Consequently, self-supervised DL methods have recently been introduced
into the AD detection literature. In particular, Jiang et al. [43] proposed a self-supervised
spoofing audio detection (SSAD) model inspired by an existing self-supervised DL method
named PASE+. The proposed model depends on multilayer convolutional blocks to extract
context features from the audio stream. It was tested over the dataset with a 5.31% EER.
While the SSAD did well in terms of efficiency and scalability, its performance was not
as good as other DL methods. Future research could thus focus on the advantages of
self-supervised learning and improving its performance.
Ultimately, the literature discussed thus far is summarized in Table 1, which shows
that the method type affects the performance more than the feature used. It is very clear that
ML methods are more accurate than DL methods regardless of the features used. However,
due to excessive training and manual feature extraction, the scalability of the ML methods
is not confirmed, especially with large numbers of audio files. On the other hand, when DL
algorithms were used, specific transformations were required on the audio files to ensure
that the algorithms could manage them. In conclusion, although AD detection is an active
area of study, further research is still needed to address the existing gaps. These challenges
and potential future research directions will be highlighted in Section 6.
Table 1. Summary of the AD detection methods reviewed in the literature.

| Year | Ref. | Speech Language | Fakeness Type | Technique | Audio Feature Used | Dataset | Drawbacks |
|---|---|---|---|---|---|---|---|
| 2018 | Yu et al. [29] | English | Synthetic | DNN-HLL; GMM-LLR | MFCC, LFCC, CQCC (DNN-HLL); IMFCC, GFCC, IGFCC (GMM-LLR) | ASVspoof 2015 [30] | The error rate of the DNN is zero, indicating that the proposed DNN is overfitting; the GMM-LLR does not carry much artifact information from the feature-representation perspective. |
| 2019 | Alzantot et al. [40] | English | Synthetic | Residual CNN | MFCC, CQCC, STFT | ASVspoof 2019 [19] | The model is highly overfitting with synthetic data and cannot be generalized over unknown attacks. |
| 2019 | Lai et al. [42] | English | Synthetic | ASSERT (SENet + ResNet) | Logspec, CQCC | ASVspoof 2019 [19] | The model is highly overfitting with synthetic data. |
| 2020 | P. Rahul et al. [36] | English | Synthetic | ResNet-34 | Spectrogram | ASVspoof 2019 [19] | Requires transforming the input into a 2-D feature map before the detection process, which increases the training time and affects its speed. |
| 2020 | Lataifeh et al. [23] | Classical Arabic | Imitation | Classical classifiers (SVM-Linear, SVM-RBF, LR, DT, RF, XGBoost); DL classifiers (CNN, BiLSTM) | MFCC; spectrogram | Arabic Diversified Audio (AR-DAD) [24] | The classical classifiers failed to capture spurious correlations, and features are extracted manually, so they are not scalable and need extensive manual labor to prepare the data; the DL accuracy was not as good as the classical methods, and they are an image-based approach that requires special transformation of the data. |
| 2020 | Rodríguez-Ortega et al. [3] | Spanish, English, Portuguese, French, Tagalog | Imitation | LR | Time-domain waveform | H-Voice [16] | Failed to capture spurious correlations, and features are extracted manually, so it is not scalable and needs extensive manual labor to prepare the data. |
| 2020 | Wang et al. [31] | English, Chinese | Synthetic | DeepSonar | High-dimensional data visualization of MFCC, raw neurons, activated neurons | FoR [28] | Highly affected by real-world noises. |
| 2020 | Subramani and Rao [21] | English | Synthetic | EfficientCNN and RES-EfficientCNN | Spectrogram | ASVspoof 2019 [19] | An image-based approach that requires special transformation of the data to turn audio files into images. |
| 2020 | Shan and Tsai [35] | English | Synthetic | Bidirectional LSTM | MFCC | – | The method did not perform well over long 5 s edits. |
| 2020 | Wijethunga et al. [32] | English | Synthetic | DNN | MFCC, Mel-spectrogram, STFT | Urban-Sound8K, Conversational, AMI-Corpus, FoR | The proposed model does not carry much artifact information from the feature-representation perspective. |
| 2020 | Jiang et al. [43] | English | Synthetic | SSAD | LPS, LFCC, CQCC | ASVspoof 2019 [19] | Needs extensive computing, since it uses a temporal convolutional network (TCN) to capture the context features plus another three regression workers and one binary worker to predict the target features. |
| 2020 | Chintha et al. [33] | English | Synthetic | CRNN-Spoof; WIRE-Net-Spoof | CQCC; MFCC | ASVspoof 2019 [19] | CRNN-Spoof is complex and contains many layers and convolutional networks, so it needs extensive computing; WIRE-Net-Spoof did not perform well compared to CRNN-Spoof. |
| 2020 | Kumar-Singh and Singh [17] | English | Synthetic | Q-SVM | MFCC, Mel-spectrogram | – | Features are extracted manually, so it is not scalable and needs extensive manual labor to prepare the data. |
| 2020 | Zhenchun Lei et al. [25] | English | Synthetic | CNN and Siamese CNN | CQCC, LFCC | ASVspoof 2019 [19] | The models are not robust to different features and work best with LFCC only. |
| 2021 | M. Ballesteros et al. [5] | Spanish, English, Portuguese, French, Tagalog | Synthetic, imitation | Deep4SNet | Histogram, spectrogram, time-domain waveform | H-Voice [16] | The model was not scalable and was affected by the data transformation process. |
| 2021 | E.R. Bartusiak and E.J. Delp [22] | English | Synthetic | CNN | Spectrogram | ASVspoof 2019 [19] | An image-based approach that requires special transformation of the data; the model failed to correctly classify new audio signals, indicating that it is not general enough. |
| 2021 | Borrelli et al. [18] | English | Synthetic | RF, SVM | STLT | ASVspoof 2019 [19] | Features are extracted manually, so they are not scalable and need extensive manual labor to prepare the data. |
| 2021 | Khalid et al. [38] | English | Synthetic | MesoInception-4, Meso-4, Xception, EfficientNet-B0, VGG16 | Three-channel image of MFCC | FakeAVCeleb [39] | Meso-4 overfits the real class and MesoInception-4 overfits the fake class, and none of the methods provided a satisfactory performance, indicating that they are not suitable for fake audio detection. |
| 2021 | Khochare et al. [37] | English | Synthetic | Feature-based (SVM, RF, KNN, XGBoost, LGBM); image-based (CNN, TCN, STN) | Vector of 37 audio features; Mel-spectrogram | FoR [28] | Feature-based: features extracted manually, so not scalable and needing extensive manual labor to prepare the data; image-based: could not work with inputs converted to STFT and MFCC features. |
| 2021 | Liu et al. [20] | Chinese | Synthetic | SVM; CNN | MFCC; – | – | SVM: features extracted manually, so not scalable and needing extensive manual labor to prepare the data; CNN: the error rate is zero, indicating that the proposed CNN is overfitting. |
| 2021 | S. Camacho et al. [27] | English | Synthetic | CNN | Scatter plots | FoR [28] | Did not perform as well as the traditional DL methods, and the model needed more training. |
| 2021 | T. Arif et al. [41] | English | Synthetic, imitated | DBiLSTM | ELTP-LFCC | ASVspoof 2019 [19] | Does not perform well over an imitation-based dataset. |
Figure 4. The Structure of the H-Voice Dataset.
Table 2. Comparison of the available fake audio datasets.

| Year | Dataset | Total Size | Real Samples | Fake Samples | Sample Length (s) | Fakeness Type | Format | Speech Language | Accessibility | Dataset URL |
|---|---|---|---|---|---|---|---|---|---|---|
| 2018 | The M-AILABS Speech [44] | 18.7 h | 9265 | 806 | 1–20 | Synthetic | WAV | German | Public | https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/ (accessed 3 March 2022) |
| 2018 | Baidu Silicon Valley AI Lab cloned audio [45] | 6 h | 10 | 120 | 2 | Synthetic | MP3 | English | Public | https://audiodemos.github.io/ (accessed 3 March 2022) |
| 2019 | Fake or Real (FoR) [28] | 198,000 files | 111,000 | 87,000 | 2 | Synthetic | MP3, WAV | English | Public | https://bil.eecs.yorku.ca/datasets/ (accessed 20 November 2021) |
| 2020 | AR-DAD: Arabic Diversified Audio [24] | 16,209 files | 15,810 | 397 | 10 | Imitation | WAV | Classical Arabic | Public | https://data.mendeley.com/datasets/3kndp5vs6b/3 (accessed 20 November 2021) |
| 2020 | H-Voice [16] | 6672 files | Imitation: 3332; synthetic: 4 | Imitation: 3264; synthetic: 72 | 2–10 | Imitation, synthetic | PNG | Spanish, English, Portuguese, French, Tagalog | Public | https://data.mendeley.com/datasets/k47yd3m28w/4 (accessed 20 November 2021) |
| 2021 | ASVspoof 2021 Challenge | – | – | – | 2 | Synthetic | MP3 | English | Only older versions available thus far | https://datashare.ed.ac.uk/handle/10283/3336 (accessed 20 November 2021) |
| 2021 | FakeAVCeleb [39] | 20,490 files | 490 | 20,000 | 7 | Synthetic | MP3 | English | Restricted | https://sites.google.com/view/fakeavcelebdash-lab/ (accessed 20 November 2021) |
| 2022 | ADD [46] | 85 h | LF: 300; PF: 0 | LF: 700; PF: 1052 | 2–10 | Synthetic | WAV | Chinese | Public | https://sites.google.com/view/fakeavcelebdash-lab/ (accessed 3 May 2022) |
Moreover, the FakeAVCeleb dataset is a new restricted dataset of English speakers that
has been synthetically generated by the SV2TTS tool. It contains a total of 20,490 samples
divided between 490 real samples and 20,000 fakes, each being 7 s long in MP3 format. Lastly,
the ASVspoof 2021 challenge dataset also consists of two fake scenarios: a logical and a
physical scenario. The logical scenario contains fake audio made using synthetic software,
while the physical scenario is fake audio made by reproducing prerecorded audio using
parts of real speaker data. While this dataset has yet to be published, older versions are
available to the public (2015 [30], 2017 [47], and 2019 [19]).
However, the ASVspoof challenge has one limitation: it does not consider noise, which
is a crucial factor in the AD area. A new synthetic-based dataset, the Audio Deep synthesis
Detection challenge (ADD), was therefore developed in the current year to fill this gap.
This dataset consists of three tracks: low-quality fake audio detection (LF), partially fake
audio detection (PF), and a fake audio game (FG), which is outside the scope of the current
article. LF contains 300 real voices and 700 fully faked utterances with real-world noises,
while PF has 1052 partially fake audio samples. The language of the ADD dataset is Chinese,
and it is publicly available.
From Table 2, it can be concluded that most datasets have been developed for English.
While one dataset was found for the Classical Arabic (CA) language, it covered only imitation
fakeness, and the other types of the Arabic language were not covered. There is thus still a
need to generate a new dataset based on the synthetic fakeness of the Arabic language. Such a
dataset could then complement an AD detection model that detects both imitation and synthetic
Deepfakes with minimal preprocessing and training delays.
5. Discussion
From the literature, it was clear that the methods proposed thus far require special
data processing to perform well, where classical ML methods require extensive amounts
of manual labor to prepare the data, while the DL-based methods use an image-based
approach to understand the audio features. The preprocessing approach used can affect
the performance of the method, and thus new research is recommended to develop new
solutions that allow the models to understand the audio data as it is. Nevertheless, it was
crucial to analyze the status of the current AD detection methods based on previously
published experiments. Thus, from the experimental results of the cited studies, a quantitative
comparison was conducted based on three criteria (EER, t-DCF, and accuracy), as illustrated
in Table 3.
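Since much of this comparison hinges on the EER, a minimal sketch of how this criterion can be computed from a set of detection scores may help. The scores and labels below are toy values chosen for illustration, not results from any cited study.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: the operating point where the false-acceptance rate (FAR) and
    false-rejection rate (FRR) meet. Higher score = more likely genuine."""
    best_gap, best_eer = 1.0, None
    for th in np.sort(np.unique(scores)):
        accept = scores >= th
        far = np.mean(accept[labels == 0])    # fake audio accepted as genuine
        frr = np.mean(~accept[labels == 1])   # genuine audio rejected as fake
        gap = abs(far - frr)
        if gap < best_gap:                    # keep the threshold where FAR ~ FRR
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer

labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # 1 = genuine, 0 = fake
scores = np.array([0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.1, 0.05])
print(equal_error_rate(scores, labels))       # 0.25, i.e., a 25% EER
```

The t-DCF extends this idea by weighting misses and false alarms with application-dependent costs and priors in tandem with an ASV system [34], so a low EER does not automatically imply a low t-DCF.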
Starting with the EER and t-DCF, as shown in Figure 5, it can be concluded that there is
no clear pattern in performance with respect to the approach or dataset used. Each method
performs differently depending on the technique used. For instance, the Bidirectional LSTM
method provides the best EER and t-DCF compared to the other methods, but the dataset
information was not clarified in the study, and overfitting was a concern. Another example
is GMM-LLR, which provides the worst EER even though it used the same features and was
trained on the same dataset as DNN-HLLs. In regard to the CNN methods highlighted in the
orange box, regardless of the dataset used, all versions have a similar performance with
respect to the EER and t-DCF. However, one interesting observation that can be highlighted
is the fact that the type of fakeness can have an effect on the performance of the method.
For instance, DBiLSTM provides a very low EER and t-DCF compared to the other methods when
applied to synthetic AD, while it is one of the worst when applied to the imitation-based
datasets.
Figure 5. Quantitative comparison of AD detection methods measured by EER and t-DCF.
Figure 6. Quantitative comparison between recent AD detection methods measured by accuracy on multiple datasets.
6.1. Limited AD Detection Methods with Respect to Non-English Languages
Almost all existing studies focus on developing detection methods to detect fake
voices speaking English, although six official languages are included in the United Nations'
list of official languages [48]. For example, the authors are aware of no existing studies
focusing on Arabic. Indeed, Arabic is the world's fourth most widely spoken language
behind Chinese, Spanish, and English, with over 230 million native speakers [49]. It consists
of three core types: Classical Arabic (CA), Modern Standard Arabic (MSA), and Dialect
Arabic (DA) [50]. CA is the official language of the Quran; MSA is the official language
of the present era; and DA is the spoken language of everyday life, which differs between
regions [50]. The reason for highlighting the Arabic language in this section is that it
poses a unique challenge in alphabet pronunciation, which the traditional techniques of
audio processing and ML models cannot deal with [51]. It contains three crucial long vowels
named Fatha, Damma, and Kasra [52], which, if pronounced incorrectly, will change the
meaning of the sentence [51]. The authors of [53] therefore pointed out that the performance
of any model in a specific language will not be the same in other languages, especially in
languages that have limited available data, such as Arabic and Chinese. There was only one
attempt, by [24], in which the authors collected CA data based on imitation fakes. For this
reason, we can directly see the lack of detection methods for non-English AD. We therefore
encourage the research community to fill this research gap by proposing new detection
methods for other languages, such as Arabic.
6.2. Lack of Accent Assessment in Existing AD Detection Methods
The majority of detection methods rely on detecting the type of fake itself without
considering other factors that could affect the accuracy of the detection. One such factor
is accents, which are defined as the way a specific group of people, particularly the citizens
or natives of a particular country, typically speak [54]. Research on this subject is still
missing from the AD literature, and it is presently unclear whether accents can affect
detection accuracy. In other audio fields, such as speaker recognition, accents have affected
the performance of the methods proposed [55]. Thus, it is expected that accents can be a
challenge in the AD area. To address this challenge, further study is needed on languages
that use many different accents, such as Arabic. One country will often contain speakers
using many different accents, and Saudi Arabic is no exception, as it contains Najdi,
Hijazi, Qaseemi, and many other accents. Further research is necessary because, when the
number of accents increases, the chance of the classifier learning a more generalized model
for the detection task will increase [28]. We therefore suggest that future research focus
on covering the AD detection area and measuring the effect of accents, especially
Saudi accents.
7. Conclusions
This review article has discussed the field of AD, carefully surveying a number of
studies exploring detection methods with respect to current datasets. It began by presenting
a broad overview of AD, along with their definitions and types. Then, it reviewed the
relevant articles that have addressed the subject over the last four years and examined the
limitations covered in the literature on classical ML and DL detection methods. Following
this, the available faked audio datasets were summarized, and the discussed methods were
also compared. Moreover, a quantitative comparison of recent state-of-the-art AD detection
methods was also provided. Finally, the research challenges and opportunities of the field
were discussed. From this analysis, it can be concluded that further advancements are
still needed in the literature of fake audio detection to develop a method that can detect
fakeness with different accents or real-world noises. Moreover, the self-supervised learning (SSL) approach can be
one future research direction to help solve the current issues affecting the existing AD
methods. Imitation-based AD detection is an important part of the AD field that also needs
further development in comparison to the synthesis-based methods.
Author Contributions: Conceptualization, Z.A.; formal analysis, Z.A.; resources, Z.A.; writing—
original draft preparation, Z.A.; writing—review and editing, H.E.; visualization, Z.A.; supervision,
H.E. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Informed consent was obtained from all subjects involved in
the study.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Lyu, S. Deepfake detection: Current challenges and next steps. IEEE Comput. Soc. 2020, 1–6. [CrossRef]
2. Diakopoulos, N.; Johnson, D. Anticipating and addressing the ethical implications of deepfakes in the context of elections. New
Media Soc. 2021, 23, 2072–2098. [CrossRef]
3. Rodríguez-Ortega, Y.; Ballesteros, D.M.; Renza, D. A machine learning model to detect fake voice. In Applied Informatics; Florez,
H., Misra, S., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 3–13.
4. Chen, T.; Kumar, A.; Nagarsheth, P.; Sivaraman, G.; Khoury, E. Generalization of audio deepfake detection. In Proceedings of the
Odyssey 2020 The Speaker and Language Recognition Workshop, Tokyo, Japan, 1–5 November 2020; pp. 132–137.
5. Ballesteros, D.M.; Rodriguez-Ortega, Y.; Renza, D.; Arce, G. Deep4SNet: Deep learning for fake speech classification. Expert Syst.
Appl. 2021, 184, 115465. [CrossRef]
6. Suwajanakorn, S.; Seitz, S.M.; Kemelmacher-Shlizerman, I. Synthesizing obama: Learning lip sync from audio. ACM Trans. Graph.
ToG 2017, 36, 1–13. [CrossRef]
7. Stupp, C. Fraudsters Used AI to Mimic CEO's Voice in Unusual Cybercrime Case. Available online: https://www.wsj.
com/articles/fraudsters-use-ai-to-mimic-ceos-voice-in-unusual-cybercrime-case-11567157402 (accessed on 29 January 2022).
8. Chadha, A.; Kumar, V.; Kashyap, S.; Gupta, M. Deepfake: An overview. In Proceedings of Second International Conference on
Computing, Communications, and Cyber-Security; Singh, P.K., Wierzchoń, S.T., Tanwar, S., Ganzha, M., Rodrigues, J.J.P.C., Eds.;
Springer: Singapore, 2021; pp. 557–566.
9. Tan, X.; Qin, T.; Soong, F.; Liu, T.-Y. A survey on neural speech synthesis. arXiv 2021, arXiv:2106.15561.
10. Ning, Y.; He, S.; Wu, Z.; Xing, C.; Zhang, L.-J. A Review of Deep Learning Based Speech Synthesis. Appl. Sci. 2019, 9, 4050.
[CrossRef]
11. Ren, Y.; Hu, C.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; Liu, T.-Y. Fastspeech 2: Fast and High-Quality End-to-End Text to Speech.
arXiv 2020, arXiv:2006.04558.
12. Shen, J.; Pang, R.; Weiss, R.J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerrv-Ryan, R. Natural Tts Synthesis
by Conditioning Wavenet on Mel Spectrogram Predictions; IEEE: Piscataway, NJ, USA, 2018; pp. 4779–4783.
13. Ping, W.; Peng, K.; Gibiansky, A.; Arik, S.O.; Kannan, A.; Narang, S.; Raiman, J.; Miller, J. Deep voice 3: Scaling text-to-speech
with convolutional sequence learning. arXiv 2017, arXiv:1710.07654.
14. Khanjani, Z.; Watson, G.; Janeja, V.P. How deep are the fakes? Focusing on audio deepfake: A survey. arXiv 2021, arXiv:2111.14203.
15. Pradhan, S.; Sun, W.; Baig, G.; Qiu, L. Combating replay attacks against voice assistants. Proc. ACM Interact. Mob. Wearable
Ubiquitous Technol. 2019, 3, 1–26. [CrossRef]
16. Ballesteros, D.M.; Rodriguez, Y.; Renza, D. A dataset of histograms of original and fake voice recordings (H-voice). Data Brief
2020, 29, 105331. [CrossRef]
17. Singh, A.K.; Singh, P. Detection of ai-synthesized speech using cepstral & bispectral statistics. In Proceedings of the 2021 IEEE
4th International Conference on Multimedia Information Processing and Retrieval (MIPR), Tokyo, Japan, 8–10 September 2021;
pp. 412–417.
18. Borrelli, C.; Bestagini, P.; Antonacci, F.; Sarti, A.; Tubaro, S. Synthetic speech detection through short-term and long-term
prediction traces. EURASIP J. Inf. Secur. 2021, 2021, 2. [CrossRef]
19. Todisco, M.; Wang, X.; Vestman, V.; Sahidullah, M.; Delgado, H.; Nautsch, A.; Yamagishi, J.; Evans, N.; Kinnunen, T.; Lee, K.A.
ASVspoof 2019: Future horizons in spoofed and fake audio detection. arXiv 2019, arXiv:1904.05441.
20. Liu, T.; Yan, D.; Wang, R.; Yan, N.; Chen, G. Identification of fake stereo audio using SVM and CNN. Information 2021, 12, 263.
[CrossRef]
21. Subramani, N.; Rao, D. Learning efficient representations for fake speech detection. In Proceedings of the The Thirty-Fourth
AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence
Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York,
NY, USA, 7–12 February 2020; pp. 5859–5866.
22. Bartusiak, E.R.; Delp, E.J. Frequency domain-based detection of generated audio. In Proceedings of the Electronic Imaging;
Society for Imaging Science and Technology, New York, NY, USA, 11–15 January 2021; Volume 2021, pp. 273–281.
23. Lataifeh, M.; Elnagar, A.; Shahin, I.; Nassif, A.B. Arabic audio clips: Identification and discrimination of authentic cantillations
from imitations. Neurocomputing 2020, 418, 162–177. [CrossRef]
24. Lataifeh, M.; Elnagar, A. Ar-DAD: Arabic diversified audio dataset. Data Brief 2020, 33, 106503. [CrossRef]
25. Lei, Z.; Yang, Y.; Liu, C.; Ye, J. Siamese convolutional neural network using gaussian probability feature for spoofing speech
detection. In Proceedings of the INTERSPEECH, Shanghai, China, 25–29 October 2020; pp. 1116–1120.
26. Hofbauer, H.; Uhl, A. Calculating a boundary for the significance from the equal-error rate. In Proceedings of the 2016 International
Conference on Biometrics (ICB), Halmstad, Sweden, 13 June 2016; pp. 1–4.
27. Camacho, S.; Ballesteros, D.M.; Renza, D. Fake speech recognition using deep learning. In Applied Computer Sciences in Engineering;
Figueroa-García, J.C., Díaz-Gutierrez, Y., Gaona-García, E.E., Orjuela-Cañón, A.D., Eds.; Springer International Publishing: Cham,
Switzerland, 2021; pp. 38–48.
28. Reimao, R.; Tzerpos, V. FoR: A dataset for synthetic speech detection. In Proceedings of the 2019 International Conference on
Speech Technology and Human-Computer Dialogue (SpeD), Timisoara, Romania, 10 October 2019; pp. 1–10.
29. Yu, H.; Tan, Z.-H.; Ma, Z.; Martin, R.; Guo, J. Spoofing detection in automatic speaker verification systems using DNN
classifiers and dynamic acoustic features. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 4633–4644. [CrossRef]
30. Wu, Z.; Kinnunen, T.; Evans, N.; Yamagishi, J.; Hanilçi, C.; Sahidullah, M.; Sizov, A. ASVspoof 2015: The first automatic speaker
verification spoofing and countermeasures challenge. In Proceedings of the Interspeech 2015, Dresden, Germany, 6–10 September
2015; p. 5.
31. Wang, R.; Juefei-Xu, F.; Huang, Y.; Guo, Q.; Xie, X.; Ma, L.; Liu, Y. Deepsonar: Towards effective and robust detection of
ai-synthesized fake voices. In Proceedings of the the 28th ACM International Conference on Multimedia, Seattle, WA, USA,
12–16 October 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1207–1216.
32. Wijethunga, R.L.M.A.P.C.; Matheesha, D.M.K.; Al Noman, A.; De Silva, K.H.V.T.A.; Tissera, M.; Rupasinghe, L. Deepfake
audio detection: A deep learning based solution for group conversations. In Proceedings of the 2020 2nd International
Conference on Advancements in Computing (ICAC), Malabe, Sri Lanka, 10–11 December 2020; Volume 1, pp. 192–197.
33. Chintha, A.; Thai, B.; Sohrawardi, S.J.; Bhatt, K.M.; Hickerson, A.; Wright, M.; Ptucha, R. Recurrent convolutional structures
for audio spoof and video deepfake detection. IEEE J. Sel. Top. Signal Process. 2020, 14, 1024–1037. [CrossRef]
34. Kinnunen, T.; Lee, K.A.; Delgado, H.; Evans, N.; Todisco, M.; Sahidullah, M.; Yamagishi, J.; Reynolds, D.A. T-DCF: A detection cost
function for the tandem assessment of spoofing countermeasures and automatic speaker verification. arXiv 2018, arXiv:1804.09618.
35. Shan, M.; Tsai, T. A cross-verification approach for protecting world leaders from fake and tampered audio. arXiv 2020,
arXiv:2010.12173.
36. Aravind, P.R.; Nechiyil, U.; Paramparambath, N. Audio spoofing verification using deep convolutional neural networks by
transfer learning. arXiv 2020, arXiv:2008.03464.
37. Khochare, J.; Joshi, C.; Yenarkar, B.; Suratkar, S.; Kazi, F. A deep learning framework for audio deepfake detection. Arab. J. Sci.
Eng. 2021, 47, 3447–3458. [CrossRef]
38. Khalid, H.; Kim, M.; Tariq, S.; Woo, S.S. Evaluation of an audio-video multimodal deepfake dataset using unimodal and
multimodal detectors. In Proceedings of the 1st Workshop on Synthetic Multimedia, ACM Association for Computing Machinery,
New York, NY, USA, 20 October 2021; pp. 7–15.
39. Khalid, H.; Tariq, S.; Kim, M.; Woo, S.S. FakeAVCeleb: A novel audio-video multimodal deepfake dataset. In Proceedings of the
35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks, Virtual, 6–14
December 2021; p. 14.
40. Alzantot, M.; Wang, Z.; Srivastava, M.B. Deep residual neural networks for audio spoofing detection. arXiv 2019,
arXiv:1907.00501.
41. Arif, T.; Javed, A.; Alhameed, M.; Jeribi, F.; Tahir, A. Voice spoofing countermeasure for logical access attacks detection. IEEE
Access 2021, 9, 162857–162868. [CrossRef]
42. Lai, C.-I.; Chen, N.; Villalba, J.; Dehak, N. ASSERT: Anti-spoofing with squeeze-excitation and residual networks. arXiv 2019,
arXiv:1904.01120.
43. Jiang, Z.; Zhu, H.; Peng, L.; Ding, W.; Ren, Y. Self-supervised spoofing audio detection scheme. In Proceedings of the INTER-
SPEECH 2020, Shanghai, China, 25–29 October 2020; pp. 4223–4227.
44. Solak, I. The M-AILABS Speech Dataset. Available online: https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/
(accessed on 10 March 2022).
45. Arik, S.O.; Chen, J.; Peng, K.; Ping, W.; Zhou, Y. Neural voice cloning with a few samples. In Proceedings of the 32nd Conference
on Neural Information Processing Systems (NeurIPS 2018), Montréal, QC, Canada, 2–8 December 2018; p. 11.
46. Yi, J.; Fu, R.; Tao, J.; Nie, S.; Ma, H.; Wang, C.; Wang, T.; Tian, Z.; Bai, Y.; Fan, C. Add 2022: The first audio deep synthesis detection
challenge. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Singapore, 23–27
May 2022; p. 5.
47. Kinnunen, T.; Sahidullah, M.; Delgado, H.; Todisco, M.; Evans, N.; Yamagishi, J.; Lee, K.A. The 2nd Automatic Speaker Verification
Spoofing and Countermeasures Challenge (ASVspoof 2017) Database, Version 2. Available online: https://datashare.ed.ac.uk/
handle/10283/3055 (accessed on 5 November 2021).
48. United Nations. Official Languages. Available online: https://www.un.org/en/our-work/official-languages (accessed on 5
March 2022).
49. Almeman, K.; Lee, M. A comparison of arabic speech recognition for multi-dialect vs. specific dialects. In Proceedings of the
Seventh International Conference on Speech Technology and Human-Computer Dialogue (SpeD 2013), Cluj-Napoca, Romania,
16–19 October 2013; pp. 16–19.
50. Elgibreen, H.; Faisal, M.; Al Sulaiman, M.; Abdou, S.; Mekhtiche, M.A.; Moussa, A.M.; Alohali, Y.A.; Abdul, W.; Muhammad, G.;
Rashwan, M.; et al. An Incremental Approach to Corpus Design and Construction: Application to a Large Contemporary Saudi
Corpus. IEEE Access 2021, 9, 88405–88428. [CrossRef]
51. Asif, A.; Mukhtar, H.; Alqadheeb, F.; Ahmad, H.F.; Alhumam, A. An approach for pronunciation classification of classical arabic
phonemes using deep learning. Appl. Sci. 2022, 12, 238. [CrossRef]
52. Ibrahim, A.B.; Seddiq, Y.M.; Meftah, A.H.; Alghamdi, M.; Selouani, S.-A.; Qamhan, M.A.; Alotaibi, Y.A.; Alshebeili, S.A.
Optimizing Arabic Speech Distinctive Phonetic Features and Phoneme Recognition Using Genetic Algorithm. IEEE Access 2020,
8, 200395–200411. [CrossRef]
53. Maw, M.; Balakrishnan, V.; Rana, O.; Ravana, S.D. Trends and patterns of text classification techniques: A systematic mapping
study. Malays. J. Comput. Sci. 2020, 33, 102–117.
54. Rizwan, M.; Odelowo, B.O.; Anderson, D.V. Word based dialect classification using extreme learning machines. In Proceedings of
the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24 July 2016; pp. 2625–2629.
55. Najafian, M. Modeling accents for automatic speech recognition. In Proceedings of the 23rd European Signal Proceedings
(EUSIPCO), Nice, France, 31 August–4 September 2015; University of Birmingham: Birmingham, UK, 2013; Volume 1568, p. 1.
56. Liu, X.; Zhang, F.; Hou, Z.; Mian, L.; Wang, Z.; Zhang, J.; Tang, J. Self-supervised learning: Generative or contrastive. IEEE Trans.
Knowl. Data Eng. 2021. [CrossRef]
57. Jain, D.; Beniwal, D.P. Review paper on noise cancellation using adaptive filters. Int. J. Eng. Res. Technol. 2022, 11, 241–244.