Cell-Phone Identification From Recompressed Audio Recordings

Abstract—Many audio forensic applications would benefit from the ability to classify audio recordings based on characteristics of the originating device, particularly on social media platforms where an enormous amount of data is posted every day. This paper utilizes passive signatures associated with the recording devices, extracted from the recorded audio itself in the absence of any extrinsic security mechanism such as digital watermarking, to identify the source cell-phone of a recorded audio. It uses device-specific information present in the low- as well as high-frequency regions of the recorded audio. On the only publicly available dataset in this field, MOBIPHONE, the proposed system gives a closed-set accuracy of 97.2%, which matches the state-of-the-art accuracy reported for this dataset. On audio recordings which have undergone double compression, as typically happens for a recording posted on social media, the proposed system outperforms the existing methods (4% improvement in average accuracy).

This research was supported by a grant from the Department of Science and Technology (DST), New Delhi, India, under Award Number ECR/2015/000583. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Department of Science and Technology. Address all correspondence to Nitin Khanna at [email protected].

I. INTRODUCTION

There has been a continuous evolution in research related to various aspects of automated audio processing, such as speech, speaker, and language recognition, speech-based human-machine interaction, and biometrics. Simultaneously, there is tremendous growth of electronic devices in the consumer market, which has made hand-held devices such as digital cameras, portable scanners, tablets, and smartphones key components of our daily lives. Due to the widespread usage of these hand-held devices, the amount of user-generated multimedia content is escalating. Thus, its study is an essential aspect of multimedia forensics, with numerous applications such as drawing meaningful conclusions about a subject in a court of law.

Audio forensics, part of the broader field of multimedia forensics, pertains to the acquisition, analysis, and evaluation of audio recordings that can be produced as evidence in a court of law. Evidence to be analyzed using audio forensic methods may come from a criminal investigation by law enforcement agencies or from an official inquiry into an accident, fraud, an accusation of slander, or some other civil incident. A few decades back, most of these audio recordings had to be created using a dedicated and costly setup. Now, however, these recordings are mainly obtained using hand-held multipurpose devices, especially cell-phones, due to their ubiquitous presence. During the last couple of years, there has been a tremendous increase in the usage of smartphones, driven by increasing functionality from more capable and less costly hardware components. A report by the International Telecommunication Union (ITU) says that in 2015 there were more than seven billion mobile cellular subscriptions in the world [1]. Moreover, the ease of manipulating user-generated audio content with readily available, user-friendly software, and of spreading this content through social media platforms, has resulted in serious concerns for law enforcement agencies around the world.

Audio data is acquired by an acquisition device, in particular a cell-phone, and authenticating the cell-phone from audio that has gone through some compression, whether by audio editing software or by the compression mechanisms that social media platforms apply while sharing recordings (possibly due to storage requirements), becomes a challenging problem. If the source cell-phone can be identified even from such compressed audio recordings, it could help the forensic examiner answer forensic questions such as verifying a claim about the authenticity of the audio or the ownership of the acquisition device.

The performance of the proposed system is evaluated on audio recordings in their original format as well as after the recordings have undergone a second compression (compressed with a different bit rate and sampling rate). Following are the main contributions of this paper:

• This paper addresses the importance of the high-frequency region of the audio in identifying the source cell-phone. It utilizes device-specific information present in the low- as well as high-frequency regions of the recorded audio. Most of the existing systems for source cell-phone identification from recorded audio have focused on utilizing features from low-frequency regions.
• For extracting features from the high-frequency region of the audio, our method uses inverted Mel frequency cepstral coefficients (IMFCC) [2]. IMFCCs have not been previously explored for source cell-phone classification.
• The performance of our algorithm is also tested when the recorded audio has undergone a second compression by popular audio editing software such as Adobe Audition. To the best of our knowledge, such a study of source identification from doubly compressed audio has not yet been reported, although this is precisely the situation when recorded audio is posted on social media.
An MFCC (Mel frequency cepstral coefficient) feature representation with a generalized linear discriminant sequence kernel (GLDS kernel [3], [4]) of order 2 has been used for comparison purposes. Some works for the cell-phone identification task using MFCC were proposed in [5]–[7]. These state-of-the-art handcrafted features for the cell-phone identification task are well suited for comparison with the proposed method. The primary motive of this study is to reveal that the high-frequency information of the audio also plays a significant part, and our experiments confirm this hypothesis.

The rest of the paper is organized as follows. In Section II, previous work on source cell-phone identification is described. In Section III, the proposed method is discussed. Experimental results are presented in Section IV, while conclusions and future work are discussed in Section V.

II. RELATED WORKS

Any device-specific distortions in the data acquired by a device are referred to as intrinsic signatures of the acquisition device [8]. In audio forensics, the concept of a unique fingerprint of the recording device, extracted from the captured audio, has previously been used for microphone identification [9]–[13]. In recent years, the idea of cell-phone identification from recorded audio has been explored, keeping in view the increased number of cell-phone users [1]. Hanilçi et al. [5] first addressed the problem of cell-phone recognition from recorded audio. They proposed a cell-phone recognition system analogous to a speaker recognition system and used MFCC features, which are well known and state of the art for speech and speaker recognition. A maximum closed-set accuracy of 96.42% was reported using a support vector machine (SVM) classifier on a dataset with 14 cell-phones of different makes and models. Further, they extended their work by extracting MFCC and linear frequency cepstral coefficient (LFCC) features from the silence regions of audio recordings [6]. This approach was tested on the dataset used in their earlier work [5] and resulted in an improvement in classification accuracies. In another work, Pandey et al. [14] estimated the power spectral density from the speech-free regions of the recordings for source cell-phone classification. With twenty-six cell-phones of different makes and models, an average classification accuracy of 88% was obtained for classifying cell-phones according to their five different manufacturers. Aggarwal et al. [15] used only the noisy part of the speech, with MFCC features extracted from the estimated noisy part. An average accuracy of 90% was reported for classifying cell-phones belonging to different manufacturers when the speech content of the recorded files varied. The authors in [7] used Gaussian supervectors (GSVs) for characterizing the intrinsic signatures of the recording device. This work also released the dataset MOBIPHONE. Source cell-phone identification through sparse representation is presented in [16]–[18]. Zou et al. [16] used GSVs for characterizing the intrinsic signatures of the recording device. Moreover, they used an exemplar-based dictionary and a dictionary learned by the K-SVD learning algorithm for the sparse representation of the GSVs. As a similarity measure, the correlation between two sparse representations is estimated. Further, a KISS-metric-based similarity between pairs of intrinsic signatures represented by sparse representations was used in [17]; a dataset consisting of audio recordings from 14 cell-phones was used for evaluation. Zou et al. [18] extended their previous work [16] by proposing a new supervised learning method based on discriminative K-SVD (D-KSVD). Li et al. [19] used a deep neural network to learn the intrinsic signatures left by the recording device in the recorded audio, and then applied spectral clustering on the learned features to form a single cluster for each recording device.

III. PROPOSED SYSTEM

Every electronic component has some tolerance values associated with it, and thus the practical realization of different instances of even the same electronic circuitry is unique. This also holds for the hardware associated with different parts of a cell-phone, such as its microphone, and thus no two recorders belonging to two different cell-phones are expected to be exactly the same. This difference in realization becomes even more prominent for cell-phones manufactured by different manufacturers, as they may use their own specific microphone designs [5]. An audio signal recorded by a cell-phone can be approximated as the convolution of the original audio signal with the impulse response of the cell-phone. This implies that for the same input audio signal, the recorded signal will be slightly different for different cell-phones (evident from Figure 2 in [5], which shows the spectrum of the same input audio recorded by different cell-phones).
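As a minimal numerical sketch of this convolutional model (the two impulse responses below are hypothetical stand-ins invented purely for illustration, not measured cell-phone responses):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)        # 1 s of placeholder input audio at 16 kHz

# Two hypothetical device impulse responses: same nominal design,
# slightly different realizations due to component tolerances.
h_a = np.array([1.0, 0.35, 0.10, 0.020])
h_b = np.array([1.0, 0.33, 0.12, 0.015])

y_a = np.convolve(x, h_a)[:len(x)]    # "recording" by phone A
y_b = np.convolve(x, h_b)[:len(x)]    # "recording" by phone B

# The same input yields systematically different spectra on the two
# devices; such differences are the intrinsic signature exploited here.
Ya = np.abs(np.fft.rfft(y_a))
Yb = np.abs(np.fft.rfft(y_b))
print(np.mean(np.abs(20 * np.log10(Ya / Yb))))  # mean spectral deviation (dB)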
Thus, cell-phones leave traces of their specific convolutional distortion in the recorded audio signal, and we propose to use such device-specific distortions as intrinsic signatures of the cell-phones. One way to recognize the recording device is to find its impulse response from the recorded audio and classify it accordingly. Estimating this impulse response may not always be possible, as it generally requires access to the device and a specific experimental setup for response measurement. An alternative approach, taken in this paper, is to estimate some device-specific information from the recorded audio only and then use this information to model a particular cell-phone. A test recording can be compared with this model to recognize the recording cell-phone. This device-specific information may not be the system's impulse response directly, but the generated model captures intra-class similarities and inter-class dissimilarities, and thus enables identification of the cell-phone used for a recording from the recorded audio alone.

The aim of the proposed cell-phone recognition system is to capture device-specific discriminatory information using a suitable representation. The device-specific signature associated with the recording circuitry of a cell-phone is not limited to low-frequency regions and might contain substantial information in high-frequency regions as well. Thus, this paper proposes a combination of MFCC- and IMFCC-based features (the latter emphasizing the high-frequency region of the signal) for capturing the device-specific distortions present in a recorded sample. IMFCC features have previously been used for speaker identification [2] and synthetic speech detection [20]. Using the concatenation of MFCC and IMFCC features as the device-specific footprints/signatures, classification is performed using an SVM [21] classifier. The effect on the performance of the proposed system when the recorded audio undergoes a second compression is also evaluated.

A. IMFCC Feature Extraction

In IMFCC feature extraction (Figure 3), the audio is pre-emphasized and split into windowed frames; the squared magnitude of the DFT of each frame is passed through the inverted Mel filter bank, and the energy is summed up in each of the filters of the filter bank. This gives the amount of energy contained in each filter. The logarithm of each energy value is taken, and the discrete cosine transform (DCT) of the log energies is calculated. These DCT coefficients are termed inverted Mel frequency cepstral coefficients (IMFCCs). Generally, for most applications, the first 12-13 coefficients are chosen, excluding the DC component.

The inverted Mel filter bank (Figure 2) is generated using Equation (1), which describes the mapping from the Hz frequency scale to the inverted Mel frequency scale, with Equation (2) as its inverse:

$m_b = 2889.22 - 2595 \log_{10}\left(1 + \dfrac{8031.25 - f}{700}\right)$   (1)

$f = 8031.25 - 700\left(10^{(2889.22 - m_b)/2595} - 1\right)$,   (2)

where $m_b$ is the subjective pitch in inverted Mels corresponding to the actual frequency $f$ in Hz. The constant 8031.25 corresponds to taking a 512-point discrete Fourier transform (DFT) of each frame for audio with a 16 kHz sampling rate. A detailed description and a more general form of the equation can be found in [2].
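A minimal Python sketch of this mapping and of the IMFCC pipeline of Figure 3 is given below. It assumes triangular filters with center frequencies uniformly spaced on the inverted Mel scale; the exact filter shape and edge handling in [2] may differ. The parameter choices mirror those stated in Section IV (512-point DFT, 16 kHz audio, 26 filters, 13 coefficients):

import numpy as np
from scipy.fftpack import dct

def hz_to_imel(f):
    # Equation (1): Hz -> inverted Mel (fs = 16 kHz, 512-point DFT)
    return 2889.22 - 2595.0 * np.log10(1.0 + (8031.25 - f) / 700.0)

def imel_to_hz(m):
    # Equation (2): inverted Mel -> Hz
    return 8031.25 - 700.0 * (10.0 ** ((2889.22 - m) / 2595.0) - 1.0)

def inverted_mel_filterbank(n_filters=26, n_fft=512, fs=16000):
    # Triangular filters, centers equally spaced on the inverted Mel scale,
    # so frequency resolution is finest in the high-frequency region.
    m_lo, m_hi = hz_to_imel(0.0), hz_to_imel(fs / 2.0)
    hz = imel_to_hz(np.linspace(m_lo, m_hi, n_filters + 2))
    bins = np.floor((n_fft + 1) * hz / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def imfcc(frame, fb, n_ceps=13):
    spec = np.abs(np.fft.rfft(frame, 512)) ** 2              # |DFT|^2
    energies = np.maximum(fb @ spec, 1e-12)                  # filter-bank energies
    return dct(np.log(energies), norm='ortho')[1:n_ceps + 1] # drop DC term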
Fig. 2. IMFCC filter bank (filter amplitude versus frequency in Hz, 0-8000 Hz).

Fig. 3. An overview of IMFCC feature extraction: audio, pre-emphasis, windowing, |DFT|^2, inverted Mel filter bank, log, DCT, IMFCC.

B. Segment-Level Features Using the GLDS Kernel

Frame-level feature vectors are combined into a single vector per audio segment in two steps. First, each d-dimensional feature vector x_i is mapped into y_i, which lies in a higher-dimensional feature space (say R^p), using a polynomial expansion function φ(·), where φ : R^d → R^p; so y_i = φ(x_i), ∀i = 1, 2, ..., n. Details about the polynomial expansion function φ(·) can be found in [4]. Polynomial expansion of each feature vector results in a new feature matrix Y_{n×p} = [y_1 y_2 ... y_n]^T. In the second step, the final feature vector z_{1×p} is obtained as z = (1/n) e^T Y, where e^T = [1 1 ... 1]. Hence, for a given audio segment of t seconds (3 seconds in our case), if the final MFCC and IMFCC feature vectors are z_m and z_im, the proposed feature vector is the concatenation [z_m z_im].
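A minimal sketch of the two steps under the definitions above; the exact monomial ordering of φ(·) in [4] may differ, so this order-2 expansion should be read as illustrative:

import numpy as np
from itertools import combinations_with_replacement

def poly_expand_order2(x):
    # phi(x): all monomials of degree <= 2 in the entries of x
    terms = [1.0] + list(x)
    terms += [x[i] * x[j]
              for i, j in combinations_with_replacement(range(len(x)), 2)]
    return np.array(terms)

def glds_segment_vector(X):
    # X: n x d matrix of frame-level features (e.g., MFCCs of one 3-s segment).
    # Step 1: expand every frame; step 2: average, z = (1/n) e^T Y.
    Y = np.vstack([poly_expand_order2(x) for x in X])
    return Y.mean(axis=0)

# Proposed representation: concatenation of the MFCC- and IMFCC-based vectors,
# e.g. z = np.concatenate([glds_segment_vector(mfcc_frames),
#                          glds_segment_vector(imfcc_frames)])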
C. Classification

A multi-class SVM classifier, as available in the LibSVM package [21], is used for classification.
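The paper uses LibSVM [21] directly; the sketch below instead uses scikit-learn's SVC, which is built on LibSVM, with placeholder data. The linear kernel and C value are assumptions, as the text does not specify them:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
Z_train = rng.standard_normal((84, 120))  # placeholder supervectors [z_m z_im]
y_train = np.repeat(np.arange(21), 4)     # 21 cell-phone classes

clf = SVC(kernel='linear', C=1.0)         # multi-class via one-vs-one, as in LibSVM
clf.fit(Z_train, y_train)
print(clf.predict(Z_train[:5]))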
the classifier as well as testing. These original record-
IV. E XPERIMENTAL R ESULTS AND D ISCUSSIONS ings are available in lossless compressed wav file
To facilitate easier comparison with the existing format and contain audio sampled at 16 kHz sampling
methods, the performance of the proposed system is rate. For this task, an average classification accuracy
evaluated on a standard, publicly available dataset in of 91.2% and 93.2% is obtained when only MFCC
this field, MOBIPHONE [7]. This dataset consists of features and a combination of MFCC and IMFCC
21 cell-phones (Table I in [7]). Each of the cell- features respectively are used for classifying audio
phones is used to record 12 male and 12 female segments of length 3-seconds (Second column in Ta-
speakers randomly chosen from TIMIT dataset (Table ble V). These results demonstrate the complementary
II in [7]). Each of the speakers speaks ten sentences information captured in the features corresponding to
of approximately 3-seconds each. First two sentences the high-frequency region of the audio. Moreover,
are same for all the speakers while remaining eight are when the decisions are obtained for one complete
different. recording that is of approximately 30 seconds, an
For the results presented in this paper, MFCC [23] average classification accuracy of the 97.2% is ob-
and IMFCC feature estimations are done on a frame tained by the proposed system which is similar to the
size of 30ms with 15ms overlap, using the Hamming best accuracy obtained by MFCC based features and
window, and 26 filters in the filter bank; as these classifiers proposed in [7]. Classification accuracy for
are the most commonly used settings for frame size, each of the 21 cell-phone is shown in Figure 4, and
window, and overlap respectively. For generating a Figure 5 for the decision made on 3-seconds of audio
single feature vector for each subpart of 3-seconds, segments and on the complete recording respectively.
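A minimal sketch of this majority-vote fusion over per-segment decisions:

from collections import Counter

def recording_decision(segment_labels):
    # Majority vote over the predicted labels of the ~10 3-second segments
    return Counter(segment_labels).most_common(1)[0][0]

print(recording_decision(['P3', 'P3', 'P5', 'P3']))  # -> 'P3'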
To assess the effect of different kinds of compression on the proposed system, the experiments were conducted on two variants of the MOBIPHONE dataset, viz. original and mp3 compressed.
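The recompressed variants in this paper were produced with Adobe Audition; an equivalent mp3 recompression can be scripted with FFmpeg [24], e.g. for the 16 kHz, 40 kbps setting of Table VI (file names below are placeholders):

import subprocess

# Recompress a MOBIPHONE wav to mp3 (16 kHz sampling rate, 40 kbps bitrate).
subprocess.run(['ffmpeg', '-y', '-i', 'input.wav',
                '-ar', '16000', '-b:a', '40k', 'output.mp3'], check=True)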
A. Experiments Using Original Audio Recordings

For comparison with the existing results reported on the MOBIPHONE dataset, this experiment uses the original recordings present in the dataset for both training and testing the classifier. These original recordings are available in lossless wav file format and contain audio sampled at a 16 kHz sampling rate. For this task, average classification accuracies of 91.2% and 93.2% are obtained when only MFCC features and a combination of MFCC and IMFCC features, respectively, are used for classifying audio segments of length 3 seconds (second column in Table V). These results demonstrate the complementary information captured by the features corresponding to the high-frequency region of the audio. Moreover, when the decisions are obtained for one complete recording of approximately 30 seconds, an average classification accuracy of 97.2% is obtained by the proposed system, which is similar to the best accuracy obtained by the MFCC-based features and classifiers proposed in [7]. The classification accuracy for each of the 21 cell-phones is shown in Figure 4 and Figure 5, for decisions made on 3-second audio segments and on complete recordings, respectively. Results in terms of the percentage of correct and incorrect classification accuracies with the proposed method and the method in [5], with decisions on the complete recording, are shown in Table I and Table II, respectively. In Tables I, II, III, and IV, the columns P1, P2, ..., P21 represent cell-phone 1, cell-phone 2, ..., cell-phone 21, respectively. The actual brand and model corresponding to each label are listed in Table I of [7].
TABLE I
Results (% of correct and incorrect classification accuracies) using the proposed method on the original recordings

Cell-Phones             P1   P2    P3    P4   P5   P6    P7   P8   P9   P10  P11  P12  P13   P14   P15  P16  P17  P18  P19  P20  P21
Correct classification  100  91.7  91.7  100  100  91.7  100  100  100  100  100  100  83.3  83.3  100  100  100  100  100  100  100

TABLE II
Results (% of correct and incorrect classification accuracies) using the MFCC [5] based method on the original recordings

Cell-Phones             P1   P2    P3    P4   P5    P6   P7   P8   P9   P10  P11  P12  P13   P14   P15   P16  P17  P18   P19  P20  P21
Correct classification  100  91.7  91.7  100  91.7  75   100  100  100  100  100  100  66.7  83.3  91.7  100  100  91.7  100  100  100
Incorrect classifications (confused cell-phone, %): (P8, 8.33), (P21, 8.33), (P14, 16.7), (P5, 16.7), (P14, 8.33), (P15, 8.33), (P13, 16.7), (P13, 8.33), (P5, 8.33), (P4, 8.33), (P4, 8.33)
TABLE III
Results (% of correct and incorrect classification accuracies) using the proposed method on mp3 compressed recordings (sampling rate = 16 kHz and bitrate = 40 kbps)

Cell-Phones             P1   P2    P3    P4   P5    P6    P7   P8   P9   P10  P11  P12  P13  P14   P15  P16  P17  P18  P19  P20  P21
Correct classification  100  91.7  91.7  100  83.3  91.7  100  100  100  100  100  100  100  91.7  100  100  100  100  100  100  100

TABLE IV
Results (% of correct and incorrect classification accuracies) using the MFCC [5] based method on mp3 compressed recordings (sampling rate = 16 kHz and bitrate = 40 kbps)

Cell-Phones             P1   P2    P3    P4   P5   P6    P7   P8   P9   P10  P11  P12  P13   P14   P15  P16  P17  P18   P19  P20  P21
Correct classification  100  83.3  91.7  100  100  83.3  100  100  100  100  100  100  58.3  83.3  100  100  100  91.7  100  100  100
Incorrect classifications (confused cell-phone, %): (P14, 16.7), (P5, 8.33), (P8, 16.7), (P21, 8.33), (P15, 8.33), (P13, 16.7), (P5, 8.33), (P13, 8.33), (P5, 16.7)
Fig. 4. Classification accuracy (%) for each of the 21 cell-phones, with decisions on 3-second audio segments.

Fig. 5. Classification accuracy (%) for each of the 21 cell-phones, with decisions on complete recordings.

... class has been correctly classified with a classification accuracy of 83.3% and misclassified with cell-phone ...

TABLE V
Average accuracies (%) for original recordings in the dataset

                  3s Segments   Complete Recording
MFCC [5]          91.2          -
Proposed System   93.2          97.2
B. Experiments Using MP3 Compressed Recordings

... bit rates and sampling frequencies. Results in terms of the percentage of correct and incorrect classification accuracies with the proposed method and the method in [5], with decisions on the complete recording, are shown in Table III and Table IV, respectively.

TABLE VI
Average accuracy (%) with different MP3 compression parameters (sampling rate and bitrate), recompressed with Adobe Audition

Sampling rate, Bitrate   Method            3s Segments   Complete Recording
11 kHz, 24 kbps          MFCC [5]          91.6          95.6
                         Proposed System   93.0          96.4
11 kHz, 32 kbps          MFCC [5]          90.1          93.7
                         Proposed System   93.2          97.2
12 kHz, 24 kbps          MFCC [5]          92.2          96.4
                         Proposed System   92.7          96.4
12 kHz, 32 kbps          MFCC [5]          90.3          93.7
                         Proposed System   93.6          97.6
16 kHz, 32 kbps          MFCC [5]          92.5          95.6
                         Proposed System   93.1          97.2
16 kHz, 40 kbps          MFCC [5]          91.9          94.8
                         Proposed System   93.4          97.6

V. CONCLUSION

This paper has explored the characteristics of the microphones used in cell-phones, and the experimental results support the conclusion that, in addition to the low-frequency spectrum, there is significant device-signature information in the high-frequency spectrum as well. The proposed system achieved an average classification accuracy of 97.2% on the publicly available MOBIPHONE dataset, and similar accuracy when the audio recordings undergo different amounts of compression. On audio recordings that have undergone double compression, which commonly happens when a recording is uploaded to a social media platform such as WhatsApp or is edited in Adobe Audition and re-saved, the proposed features perform much better than the existing state-of-the-art features (Table VI). For example, when the double compression happens at 16 kHz, 40 kbps, the proposed features give an accuracy of 97.6% while MFCC gives an accuracy of 93.4% (Tables IV and VI). Further, the proposed system's lowest per-phone accuracy is 83.3% (for P3), while the existing MFCC features give a lowest accuracy of 58.3% (for P13) (Tables III and IV). Future work will include more extensive evaluation of audio that has gone through several compression stages while being transmitted through social media platforms.

REFERENCES

[1] International Telecommunication Union (ITU), "ICT Facts and Figures 2015," Tech. Rep., May 2015.
[2] S. Chakroborty, A. Roy, and G. Saha, "Improved Closed Set Text-independent Speaker Identification by Combining MFCC with Evidence from Flipped Filter Banks," International Journal of Signal Processing, vol. 4, no. 2, pp. 114–122, 2007.
[3] W. M. Campbell, "Generalized Linear Discriminant Sequence Kernels for Speaker Recognition," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, 2002, pp. 161–164.
[4] W. M. Campbell, K. T. Assaleh, and C. C. Broun, "Speaker Recognition with Polynomial Classifiers," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 4, pp. 205–212, 2002.
[5] C. Hanilçi, F. Ertaş, T. Ertaş, and Ö. Eskidere, "Recognition of Brand and Models of Cell-Phones from Recorded Speech Signals," IEEE Transactions on Information Forensics and Security, vol. 7, no. 2, pp. 625–634, 2012.
[6] C. Hanilçi and T. Kinnunen, "Source Cell-phone Recognition from Recorded Speech using Non-speech Segments," Digital Signal Processing: A Review Journal, vol. 35, pp. 75–85, 2014.
[7] C. Kotropoulos and S. Samaras, "Mobile Phone Identification using Recorded Speech Signals," in 19th IEEE International Conference on Digital Signal Processing (DSP), 2014, pp. 586–591.
[8] N. Khanna, A. K. Mikkilineni, A. F. Martone, G. N. Ali, G. T.-C. Chiu, J. P. Allebach, and E. J. Delp, "A Survey of Forensic Characterization Methods for Physical Devices," Digital Investigation, vol. 3, pp. 17–28, September 2006.
[9] C. Kraetzer, A. Oermann, J. Dittmann, and A. Lang, "Digital Audio Forensics: A First Practical Evaluation on Microphone and Environment Classification," in Proceedings of the 9th Workshop on Multimedia & Security, 2007, pp. 63–74.
[10] R. Buchholz, C. Kraetzer, and J. Dittmann, "Microphone Classification using Fourier Coefficients," in International Workshop on Information Hiding, Springer, 2009, pp. 235–246.
[11] D. Garcia-Romero and C. Y. Espy-Wilson, "Automatic Acquisition Device Identification from Speech Recordings," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010, pp. 1806–1809.
[12] Y. Jiang and F. H. F. Leung, "Mobile Phone Identification from Speech Recordings using Weighted Support Vector Machine," in 42nd Annual Conference of the IEEE Industrial Electronics Society (IECON), Oct 2016, pp. 963–968.
[13] H. Malik and J. Miller, "Microphone Identification using Higher-order Statistics," in Proc. AES International Conference on Audio Forensics, 2012, pp. 2–5.
[14] V. Pandey, V. K. Verma, and N. Khanna, "Cell-phone Identification from Audio Recordings using PSD of Speech-free Regions," in IEEE Students' Conference on Electrical, Electronics and Computer Science (SCEECS), March 2014, pp. 1–6.
[15] R. Aggarwal, S. Singh, A. K. Roul, and N. Khanna, "Cellphone Identification using Noise Estimates from Recorded Audio," in International Conference on Communications and Signal Processing (ICCSP), April 2014, pp. 1218–1222.
[16] L. Zou, Q. He, and X. Feng, "Cell Phone Verification from Speech Recordings using Sparse Representation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2015, pp. 1787–1791.
[17] L. Zou, Q. He, J. Yang, and Y. Li, "Source Cell Phone Matching from Speech Recordings by Sparse Representation and KISS Metric," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 2079–2083.
[18] L. Zou, Q. He, and J. Wu, "Source Cell Phone Verification from Speech Recordings using Sparse Representation," Digital Signal Processing, vol. 62, pp. 125–136, 2017.
[19] Y. Li, X. Zhang, X. Li, X. Feng, J. Yang, A. Chen, and Q. He, "Mobile Phone Clustering from Acquired Speech Recordings using Deep Gaussian Supervector and Spectral Clustering," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 2137–2141.
[20] D. Paul, M. Pal, and G. Saha, "Spectral Features for Synthetic Speech Detection," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 605–617, 2017.
[21] C.-C. Chang and C.-J. Lin, "LIBSVM: A Library for Support Vector Machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.
[22] B. Logan, "Mel Frequency Cepstral Coefficients for Music Modeling," in ISMIR, 2000.
[23] K. Wojcicki, "HTK MFCC MATLAB," MATLAB Central File Exchange, June 2011. http://in.mathworks.com/matlabcentral/fileexchange/32849-htk-mfcc-matlab
[24] F. Bellard, M. Niedermayer et al., "FFmpeg," 2017. Available: http://ffmpeg.org