Audio Authenticity Detecting ENF Discontinuity Wit
Abstract—This paper addresses a forensic tool used to assess audio authenticity. The proposed method is based on detecting phase discontinuity of the power grid signal; this signal, referred to as electric network frequency (ENF), is sometimes embedded in audio signals when the recording is carried out with the equipment connected to an electrical outlet or when certain microphones are in an ENF magnetic field. After down-sampling and band-filtering the audio around the nominal value of the ENF, the result can be considered a single tone such that a high-precision Fourier analysis can be used to estimate its phase. The estimated phase provides a visual aid to locating editing points (signalled by abrupt phase changes) and inferring the type of audio editing (insertion or removal of audio segments). From the estimated values, a feature is used to quantify the discontinuity of the ENF phase, allowing an automatic decision concerning the authenticity of the audio evidence. The theoretical background is presented along with practical implementation issues related to the proposed technique, whose performance is evaluated on digitally edited audio signals.

Index Terms—Audio authenticity, discrete Fourier transform (DFT), electric network frequency (ENF), forensic analysis, phase estimation.

Manuscript received February 26, 2010; revised April 16, 2010; accepted April 20, 2010. Date of publication June 01, 2010; date of current version August 13, 2010. This work was supported in part by the Brazilian Agencies CAPES, CNPq, and FAPERJ. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Darko Kirovski.
D. P. Nicolalde Rodríguez and J. A. Apolinário, Jr. are with the Department of Electrical Engineering, Military Institute of Engineering (IME), Rio de Janeiro, RJ, Brazil (e-mail: [email protected]; [email protected]).
L. W. P. Biscainho is with the Program of Electrical Engineering, COPPE/Federal University of Rio de Janeiro (UFRJ), Rio de Janeiro, RJ, Brazil (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIFS.2010.2051270
1556-6013/$26.00 © 2010 IEEE

I. INTRODUCTION

FORENSIC audio authenticity, a branch of audio forensics, has developed remarkably over the last years due to advances in digital signal processing (DSP) and a growing availability of technology [1]. It uses DSP methods to perform signal analysis of recorded audio evidence in legal and law enforcement contexts.

Like any other forensic science, authenticity examinations analyze and interpret physical evidence using natural sciences. The goal of this paper is to detail a technique that uses a high-precision phase analysis to detect electric network frequency (ENF) discontinuities and thus provide some degree of audio authentication [2], [3]. The proposed technique is, therefore, based on the presence of a small portion of the power grid signal, sometimes embedded in audio recordings.

The importance of this topic is enhanced by the advent of personal computers and all sorts of digital technology: we may say that, today, editing digital audio has become a simple task [4]. Moreover, if a good job is carried out, it is hard, even for well-trained ears, to detect this type of fraud; hence the importance of this subject in the field of audio authenticity.

To tackle the digital audio authenticity problem, this paper resorts to modern DSP techniques which, to some extent, can be quite effective in detecting subtle changes in the phase of the ENF, provided it is present in the recorded material.

The paper is organized as follows. Section II provides some background about the power grid signal: its generation, behavior of the ENF and its phase, and how it is embedded in audio signals. Section III deals with estimating the phase of a sinusoidal signal. We start from a simple concept, the use of the discrete Fourier transform (DFT), and discuss a high-precision Fourier analysis technique for which we propose an efficient phase estimation scheme. Section IV details the proposed method for audio authenticity based on the phase estimate of the power grid signal. The method includes a visual characterization as well as an automatic discrimination. Section V evaluates the proposed method with real audio signals. The signals belong to two public corpora. Examples of the two types of editing (insertion and removal of a signal fragment) are also shown in this section. Finally, after a few practical issues discussed in Section VI, conclusions are summarized in Section VII.

II. THE POWER GRID SIGNAL

The electric power system, as an important element for modern society, constitutes a fundamental factor for the development of countries and can be defined as a group of apparatuses, wires, and machines that links the power plants to customers and their needs. Power plants may generate energy in different ways, including thermal (coal, oil, nuclear, geothermal), hydroelectric, solar, and wind. The public power grid signal may be viewed as a single sinusoidal waveform with a fixed frequency (the so-called ENF).

Most of the power provided by the power grid comes from turbines that work as generators of alternating current. The rotation velocity of these turbines determines the ENF, whose standard nominal values are 50 and 60 Hz. The first value is adopted in European countries, Asian countries (except Saudi Arabia), African countries (except Liberia), Australia, and in some South American countries like Argentina, Bolivia, Chile, Uruguay, and Paraguay. Meanwhile, 60 Hz is used in Central and North America and in some other South American countries including Ecuador, Venezuela, Peru, Colombia, and Brazil.
Authorized licensed use limited to: INSTITUTO MILITAR DE ENGENHARIA. Downloaded on August 13,2010 at 11:22:34 UTC from IEEE Xplore. Restrictions apply.
NICOLALDE RODRÍGUEZ et al.: AUDIO AUTHENTICITY: DETECTING ENF DISCONTINUITY 535
Japan is a peculiar case that adopts both 50 and 60 Hz as ENF nominal values.

It is important to mention that, for a correct operation of the power system, frequency and phase of all power generation units should remain synchronous within narrow limits. It is, therefore, of paramount importance that the ENF remains stable. If, for example, a generator drops 2 Hz below the nominal ENF, it will rapidly build up enough heat to destroy itself [5]. Therefore, in the majority of cities, especially those in the most developed regions, a tight control is kept over operator units.

Every type of electric equipment operating connected to the power grid emits an electromagnetic field. This fact causes the power grid signal to be embedded in some recorded signals when a recording device is connected to an electrical outlet or when certain microphones are in an ENF magnetic field [6]. Its presence in recorded signals and its expected frequency and phase stability make the ENF useful in some audio authenticity examinations [7], [8].

In [4] and [9], the ENF is used for the task of audio authenticity; the method therein is based on comparing the pattern of the ENF embedded in a recorded signal with the patterns of the power grid signals from a few (suspect) regions, which have been previously stored in a database. It is then possible to obtain, besides audio authentication, information about the place where and the time when the recording was carried out. The Forensic Speech and Audio Analysis Working Group of the European Network of Forensic Science Institutes recently published a document giving guidelines for the use of ENF analysis in forensic authentication of audio recordings [10], attesting to the importance of this subject.

The present work is based on estimating the phase of the power grid signal embedded in the recorded audio signal, assuming that a database with ENF information is not available. We use abrupt changes in the estimated phase to infer whether or not the signal has been digitally edited.

A. Phase Estimation Using the DFT

Let $x(n)$ be an $N$-sample single tone sequence, whose frequency and phase are to be estimated. The application of a smoothing window $w(n)$ (e.g., Hann) yields the signal $x_w(n) = x(n)w(n)$. The $N_{\mathrm{DFT}}$-point DFT of $x_w(n)$, with $N_{\mathrm{DFT}} \ge N$, will be called $X(k)$.

Let $k_{\mathrm{peak}}$ be the integer index associated with the maximum value of $|X(k)|$. Then, the estimated value of the tone frequency is

$$f_{\mathrm{DFT}} = k_{\mathrm{peak}} \frac{f_s}{N_{\mathrm{DFT}}} \qquad (1)$$

where $f_s$ is the sampling frequency of $x(n)$. The resolution of $f_{\mathrm{DFT}}$, which can only assume discrete values, is $f_s/N_{\mathrm{DFT}}$. This means that the greater the value of $N_{\mathrm{DFT}}$, the better the accuracy of $f_{\mathrm{DFT}}$, at the expense of increased computational burden. The tone phase is simply the argument (or angle) of $X(k_{\mathrm{peak}})$:

$$\phi_{\mathrm{DFT}} = \arg[X(k_{\mathrm{peak}})] \qquad (2)$$

B. The Novel Phase Estimation Method

The method in [12], named DFT$^1$, refines the DFT-based frequency estimation of a single tone, and is commonly used to extract spectral modeling parameters from audio signals. It uses the short-time DFT of the first-order signal derivative. Practical experiments show that DFT$^1$ attains an improved accuracy in finding the peak of the signal spectrum (i.e., the actual value of its frequency) compared to the DFT method, even for small values of $N_{\mathrm{DFT}}$.

The basic steps to estimate the frequency, as presented in [12], are the following:
1) Compute the approximate first derivative $x'(n)$ of the signal at instant $n$.
2) Obtain the windowed versions of $x(n)$ and $x'(n)$.
3) Obtain the $N_{\mathrm{DFT}}$-point DFT of the windowed $x(n)$ and $x'(n)$. They will be denoted as $X(k)$ and $X'(k)$, respectively.
4) Compute $|X(k)|$ and $|X'(k)|$ as well as $k_{\mathrm{peak}}$, obtained as in Section III-A.
5) Multiply $|X'(k)|$ by the scaling factor.

According to [12], $k_{\mathrm{peak}}$ is expected to be the closest integer to $f_{\mathrm{DFT}^1} N_{\mathrm{DFT}}/f_s$, where $f_{\mathrm{DFT}^1}$ is the resulting frequency estimate; then, in order for $f_{\mathrm{DFT}^1}$ to be considered a valid solution,

$$\left(k_{\mathrm{peak}} - \tfrac{1}{2}\right)\frac{f_s}{N_{\mathrm{DFT}}} \le f_{\mathrm{DFT}^1} \le \left(k_{\mathrm{peak}} + \tfrac{1}{2}\right)\frac{f_s}{N_{\mathrm{DFT}}}$$

must be satisfied, otherwise the method has failed for this frequency. If we define $k_{\mathrm{DFT}^1} = f_{\mathrm{DFT}^1} N_{\mathrm{DFT}}/f_s$, the validation condition can be rewritten as

$$k_{\mathrm{peak}} - \tfrac{1}{2} \le k_{\mathrm{DFT}^1} \le k_{\mathrm{peak}} + \tfrac{1}{2}$$
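The two frequency estimators of Sections III-A and III-B can be sketched as follows. This is a minimal standard-library sketch, not the authors' implementation. The backward difference `fs * (x[i] - x[i - 1])`, the scaling factor `(pi*k/n_dft) / sin(pi*k/n_dft)`, the frequency formula (scaled derivative-to-signal magnitude ratio divided by 2π), and all function names are assumptions made for illustration. A direct DFT sum stands in for an FFT routine.

```python
import cmath
import math


def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def dft_half(x, n_dft):
    """Zero-padded n_dft-point DFT of x, bins 0..n_dft//2, by direct summation."""
    return [
        sum(v * cmath.exp(-2j * math.pi * k * i / n_dft) for i, v in enumerate(x))
        for k in range(n_dft // 2 + 1)
    ]


def estimate_dft(x, fs, n_dft):
    """Peak-picking estimate: (frequency in Hz, phase in rad, peak bin)."""
    w = hann(len(x))
    spectrum = dft_half([a * b for a, b in zip(x, w)], n_dft)
    k_peak = max(range(len(spectrum)), key=lambda k: abs(spectrum[k]))
    return k_peak * fs / n_dft, cmath.phase(spectrum[k_peak]), k_peak


def estimate_dft1_frequency(x, fs, n_dft):
    """Derivative-based frequency estimate; returns (frequency in Hz, is_valid)."""
    # Assumed discretisation of the derivative: backward difference scaled by fs.
    dx = [fs * (x[i] - x[i - 1]) for i in range(1, len(x))]
    xs = x[1:]  # align the signal with its difference
    w = hann(len(xs))
    spec_x = dft_half([a * b for a, b in zip(xs, w)], n_dft)
    spec_dx = dft_half([a * b for a, b in zip(dx, w)], n_dft)
    k_peak = max(range(len(spec_x)), key=lambda k: abs(spec_x[k]))
    # Assumed scaling factor compensating the gain error of the difference.
    arg = math.pi * k_peak / n_dft
    scale = arg / math.sin(arg) if k_peak > 0 else 1.0
    freq = scale * abs(spec_dx[k_peak]) / (2.0 * math.pi * abs(spec_x[k_peak]))
    # Validation: k_peak must be the closest integer to freq * n_dft / fs.
    is_valid = abs(freq * n_dft / fs - k_peak) <= 0.5
    return freq, is_valid


fs = 1200.0
tone = [math.cos(2.0 * math.pi * 60.98 * n / fs + 0.3) for n in range(200)]
results = {}
for n_dft in (200, 2000):
    f_dft, _, _ = estimate_dft(tone, fs, n_dft)
    f_dft1, ok = estimate_dft1_frequency(tone, fs, n_dft)
    results[n_dft] = (f_dft, f_dft1, ok)
    print(n_dft, round(f_dft, 2), round(f_dft1, 3), ok)
```

The test tone (60.98 Hz sampled at 1200 Hz, 200-sample frame) mirrors the preliminary experiment of Section III-C, so the printed values can be compared against the figures reported there.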
536 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 5, NO. 3, SEPTEMBER 2010
The mechanism introduced in [12] is intended to estimate the value of the frequencies of single tones present in an audio signal, based on the use of the Fourier transform of signal derivatives. The method proposed below extends this result to estimate the phase of a single tone.

Considering a signal model given by , the signal phase corresponds to , where is the phase at . An estimation of such a value would be restricted to the interval between $-\pi$ and $\pi$, and a plot of would be a sawtooth-like curve (wrapped phase). This model of is of a narrowband signal, which would be deterministic were a constant. In practice, is assumed to evolve slowly over time, and thus can be taken as approximately constant within a small analysis frame or “window.” The model does not include any stochastic part (or broadband component), but can be applied to the target problem of this work, since, as will be seen in the next section, all frequency components outside a small bandwidth defined around the ENF nominal value are carefully filtered out.

Therefore, the signal can be expressed as

(3)

where , and is the actual value of the tone frequency.

Consequently, , as computed in the first step of the DFT$^1$ frequency estimation procedure, can be expressed as

(5)

where is a constant and is the phase of . Comparing (4) to (5), we can write

(6)

and

(7)

Dividing (7) by (6), we obtain

(8)

Dividing both numerator and denominator of (8) by and isolating , the next expression is obtained

(9)

The value of represents the initial phase of ; since it is being estimated from the DFT$^1$, we write it as

(10)

where the value of is approximated as .

For the value of , we carry out a linear interpolation in the argument of . Let and be defined as

and

where rounds the value of to the nearest integer less than or equal to and rounds the value of to the nearest integer greater than or equal to .

Recalling that , a linear interpolation between points and can yield point , whose argument corresponds to the value of used in (10), i.e.,

(11)

C. Preliminary Experiments

In order to better understand and evaluate the proposed method, we provide the results of a few preliminary computer experiments.

We have initially considered a 60.98-Hz sinusoidal tone sampled at 1200 Hz. In Fig. 1, the true spectrum of this signal, zoomed around the nominal frequency of the tone, is shown together with its associated discrete spectra computed via 200- and 2000-point DFTs.

In this experiment, we obtained the first 100 estimated frequencies and phases for consecutive frames of the test tone delimited by a 200-sample sliding window (i.e., advancing sample by sample). For this particular signal, the DFT procedure provided a constant estimated frequency value of 60 Hz for $N_{\mathrm{DFT}} = 200$, and 61.20 Hz for $N_{\mathrm{DFT}} = 2000$. Meanwhile, when using the DFT$^1$ method, the values of the estimated frequency had a mean of 60.9719 Hz with a standard deviation of 0.0025 Hz for $N_{\mathrm{DFT}} = 200$, and a mean of 60.9818 Hz with a standard deviation of 0.0032 Hz for $N_{\mathrm{DFT}} = 2000$.
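The relation between the phase of the differenced tone and the initial phase of the tone itself, on which the phase estimator of Section III-B rests, can be worked out in closed form. The lines below are a sketch under stated assumptions rather than a transcription of the paper's numbered equations: the tone is taken as $x(n) = a\cos(\omega_0 n + \phi_0)$ with $\omega_0 = 2\pi f/f_s$, the derivative is assumed to be the backward difference $x'(n) = f_s[x(n) - x(n-1)]$, and the symbols $a$, $A$, $\theta$, $\omega_0$, and $\phi_0$ are chosen here for illustration. A forward difference would flip the sign of the $\sin\omega_0$ terms.

```latex
\begin{align*}
x'(n) &= f_s a\,\bigl[\cos(\omega_0 n + \phi_0) - \cos(\omega_0 n + \phi_0 - \omega_0)\bigr] \\
      &= \underbrace{2 f_s a \sin(\omega_0/2)}_{A}\,
         \cos\bigl(\omega_0 n + \underbrace{\phi_0 - \omega_0/2 + \pi/2}_{\theta}\bigr), \\[4pt]
A\cos\theta &= f_s a\,\bigl[\cos\phi_0\,(1-\cos\omega_0) - \sin\phi_0 \sin\omega_0\bigr], \\
A\sin\theta &= f_s a\,\bigl[\sin\phi_0\,(1-\cos\omega_0) + \cos\phi_0 \sin\omega_0\bigr], \\[4pt]
\tan\theta &= \frac{\tan\phi_0\,(1-\cos\omega_0) + \sin\omega_0}
                   {(1-\cos\omega_0) - \tan\phi_0 \sin\omega_0}, \\[4pt]
\tan\phi_0 &= \frac{\tan\theta\,(1-\cos\omega_0) - \sin\omega_0}
                   {(1-\cos\omega_0) + \tan\theta \sin\omega_0}
\quad\Longleftrightarrow\quad
\phi_0 = \theta + \frac{\omega_0}{2} - \frac{\pi}{2} \pmod{\pi}.
\end{align*}
```

Under the same assumption, the amplitude ratio $A/a = 2 f_s \sin(\omega_0/2)$ falls short of the ideal $2\pi f = f_s\omega_0$ by the factor $(\omega_0/2)/\sin(\omega_0/2)$, which is the kind of gain error a scaling factor such as the one in step 5 would compensate.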
TABLE I
EVALUATION OF FREQUENCY AND PHASE ESTIMATIONS USING DFT AND DFT$^1$. THE EXPERIMENT WAS CARRIED OUT WITH 1000 TONES WITH FREQUENCIES VARYING RANDOMLY BETWEEN 59.0 AND 61.0 HZ. $e_f$ AND $e_\phi$ REPRESENT THE MEAN ERRORS IN FREQUENCY AND PHASE, RESPECTIVELY
A. Visual Method

The steps of the visual method are detailed below:
1) Down-sample the audio signal to a frequency which, as a suggestion, could be 1000 or 1200 Hz, depending on the value of the nominal ENF being 50 or 60 Hz, respectively. This synchronous sampling, besides reducing the analysis computational burden, allows working with an exact number of samples per cycle of the nominal ENF (or, in the frequency domain, locating one DFT bin exactly on the nominal ENF).
2) Use a very sharp linear-phase FIR filter to bandpass the down-sampled signal. This filter should be centered on the nominal ENF value, and have a passband width between 0.6 and 1.4 Hz, depending on the ENF tolerance guaranteed by the electrical company. In the experiments carried out in this work, a 10 000-coefficient zero-phase filter has been employed (using Matlab function filtfilt to avoid delay).
3) Divide the filtered signal into blocks of cycles of the nominal ENF, each block overlapping the former by cycles. The signal is then segmented in blocks. In Fig. 3, blocks of cycles of the nominal ENF are shown.
4) Estimate the phase of every segmented block using DFT or DFT$^1$. Let be the corresponding phase estimate for the block index .
5) Plot phase values in degrees versus cycles of nominal ENF for visual inspection.

Fig. 3. Block fragmentation of an audio signal. N is the number of ENF cycles in the audio signal; N is the number of fragment blocks; N is the number of ENF cycles in each block.

For the detection process, the hypothesis group $\{H_0, H_1\}$ is defined; $H_0$ and $H_1$ represent the hypothesis for an audio signal being original and edited, respectively. A decision ratio for the automatic detection can be expressed as

$$F \underset{H_0}{\overset{H_1}{\gtrless}} \gamma \qquad (14)$$

where $\gamma$ is a threshold. For $F$ greater than $\gamma$, it is decided that the audio signal has been edited, i.e., hypothesis $H_1$. Otherwise, hypothesis $H_0$ is favored.

Let $P_D$ be the probability of detection, or hit (i.e., the audio signal is considered as edited when it indeed has been edited), $P_{FA}$ be the probability of false alarm (i.e., the audio signal is considered as edited when it has actually not been edited), and $P_M$ be the probability of a miss (i.e., the audio signal is considered as not edited when it indeed has been edited). The expressions for $P_D$, $P_{FA}$, and $P_M$ are
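In practice, the probabilities of detection, false alarm, and miss described for the automatic method are estimated as relative frequencies over a labelled corpus. The sketch below shows one way to do that for a given threshold, and how an equal error rate (EER, the figure of merit quoted in Section VI) could be located by sweeping the threshold. The function names, the threshold sweep, and the feature values are hypothetical and serve only as illustration; they are not data or code from the paper.

```python
def decide(feature, threshold):
    """Return 'H1' (edited) if the feature exceeds the threshold, else 'H0'."""
    return "H1" if feature > threshold else "H0"


def detection_rates(features_original, features_edited, threshold):
    """Empirical (P_D, P_FA, P_M) for one threshold value."""
    hits = sum(decide(f, threshold) == "H1" for f in features_edited)
    false_alarms = sum(decide(f, threshold) == "H1" for f in features_original)
    p_d = hits / len(features_edited)
    p_fa = false_alarms / len(features_original)
    return p_d, p_fa, 1.0 - p_d


def equal_error_rate(features_original, features_edited):
    """Sweep candidate thresholds; return (EER, threshold) where P_FA is closest to P_M."""
    best = None
    for t in sorted(set(features_original) | set(features_edited)):
        _, p_fa, p_m = detection_rates(features_original, features_edited, t)
        gap = abs(p_fa - p_m)
        if best is None or gap < best[0]:
            best = (gap, (p_fa + p_m) / 2.0, t)
    return best[1], best[2]


# Hypothetical feature values, for illustration only (not data from the paper).
features_original = [0.1, 0.2, 0.3, 0.9]
features_edited = [0.5, 0.8, 1.2, 0.15]
p_d, p_fa, p_m = detection_rates(features_original, features_edited, threshold=0.4)
eer, eer_threshold = equal_error_rate(features_original, features_edited)
print(p_d, p_fa, p_m, eer, eer_threshold)
```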
TABLE II
EVALUATION OF AUDIO AUTHENTICITY FOR THE TEST AUDIO CORPUS (100 ORIGINAL AND 100 EDITED SIGNALS). N REPRESENTS THE ANALYSIS WINDOW SIZE IN CYCLES OF THE NOMINAL ENF
Fig. 4. Histogram of phase change for the signals forming the corpus of edited
speech.
Fig. 5. Histograms of feature F for the test audio corpus. The DFT$^1$-based phase estimation method was used, with a window size of 10 cycles of the nominal ENF and N = 2000 points.
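The block-wise phase track that underlies the visual method and the feature F can be illustrated on a synthetic signal. The sketch below is a simplified stand-in, not the authors' pipeline. It assumes the signal is already down-sampled and band-limited to a clean 60-Hz tone at 1200 Hz, uses 10-cycle blocks advancing by one cycle (an assumed hop), evaluates only the DFT bin located on the nominal ENF, and simulates an edit as a 70-degree phase jump. All names and values are hypothetical.

```python
import cmath
import math


def block_phases(x, fs, f_nominal, cycles_per_block):
    """Phase (degrees) of the nominal-ENF DFT bin for blocks hopping by one cycle."""
    spc = round(fs / f_nominal)      # samples per nominal cycle (synchronous sampling)
    n = cycles_per_block * spc       # samples per block
    w = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]
    # Hann-windowed complex exponential at the nominal ENF, reused for every block.
    kernel = [w[i] * cmath.exp(-2j * math.pi * f_nominal * i / fs) for i in range(n)]
    phases = []
    for start in range(0, len(x) - n + 1, spc):
        acc = sum(x[start + i] * kernel[i] for i in range(n))
        phases.append(math.degrees(cmath.phase(acc)))
    return phases


fs, f_enf = 1200.0, 60.0
phi0, jump = math.radians(30.0), math.radians(70.0)
cut = 1800  # sample index of the simulated edit
x = [
    math.cos(2.0 * math.pi * f_enf * n / fs + phi0 + (jump if n >= cut else 0.0))
    for n in range(3600)
]
phases = block_phases(x, fs, f_enf, cycles_per_block=10)
steps = [abs(b - a) for a, b in zip(phases, phases[1:])]
print(round(phases[0], 1), round(phases[-1], 1), steps.index(max(steps)))
```

Plotting `phases` against the block index gives a flat line that moves to a new level around the blocks that straddle the edit, which is the kind of abrupt change the visual method looks for.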
VI. PRACTICAL ISSUES

The well-behaved ENF variation of the Spanish corpus (signals from AHUMADA and GAUDI edited) yielded very good results. But what if the proposed method is to be used in real-life situations where signals are degraded in a number of ways? To answer this question, two additional local corpora were prepared, containing recordings in Portuguese as spoken in Rio de Janeiro, Brazil: Carioca 1 (digitized with 16-bit quantization and a sampling rate of 44 100 Hz) and Carioca 2 (16 bits and 11 050 Hz), both with the same structure as the edited Spanish corpus: a total of 100 original and 100 edited signals. As Brazilian speech databases, their nominal ENF is 60 Hz.

Carioca 1 speech signals were recorded with low background noise and without saturation. The EER obtained for this corpus was 7%. This result, only slightly worse than the one obtained for the Spanish corpus, could be due to the slightly faster variation of the ENF contained in Carioca 1 recordings. Although in both cases the ENF has a similar deviation around its nominal value, the ENF in Spain seems to vary more slowly than the ENF recorded in the city of Rio de Janeiro. Since the difference in EER was very small (1%), no further investigation was carried out; nevertheless, this result reinforces the expectation that the performance of the proposed method would degrade in a region without a tight control over the ENF.

The Carioca 2 corpus was prepared under unfavorable conditions: among its 100 signals, 21 exhibited a moderate degree of saturation, and the corpus average SNR was around 30 dB (the average SNR of the Spanish corpus was estimated at 35 dB). The resulting EER for this corpus was 15%. In the following subsections, both effects will be addressed individually.

A. Effect of Background Noise

The Spanish corpus was used to carry out this study. Defining $s(n)$ as the clean speech and $v(n)$ as the background noise, both mutually uncorrelated by assumption, the original SNR is given as

(15)
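A study of background noise of this kind typically needs two operations: measuring the SNR of a recording and rescaling a noise sequence so that the mixture reaches a chosen SNR. The sketch below assumes the usual energy-ratio definition of SNR in decibels, which may differ in form from (15). The function names and the toy sequences are hypothetical.

```python
import math


def snr_db(signal, noise):
    """SNR in dB as the ratio of signal energy to noise energy (assumed definition)."""
    p_s = sum(v * v for v in signal)
    p_n = sum(v * v for v in noise)
    return 10.0 * math.log10(p_s / p_n)


def scale_noise_to_snr(signal, noise, target_db):
    """Return the noise rescaled so that snr_db(signal, scaled) equals target_db."""
    gain = 10.0 ** ((snr_db(signal, noise) - target_db) / 20.0)
    return [gain * v for v in noise]


# Toy sequences, orthogonal to each other, for illustration only.
s = [1.0, -1.0] * 100
v = [0.1, 0.1, -0.1, -0.1] * 50
original_snr = snr_db(s, v)
v_scaled = scale_noise_to_snr(s, v, target_db=30.0)
noisy = [a + b for a, b in zip(s, v_scaled)]
print(round(original_snr, 2), round(snr_db(s, v_scaled), 2))
```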
[13] D. P. Nicolalde R. and J. A. Apolinário, Jr., “Evaluating digital audio authenticity with spectral distances and ENF phase change,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP’09), Taipei, Taiwan, Apr. 2009.
[14] D. P. Nicolalde R., J. A. Apolinário, Jr., and L. W. Biscainho, “Autenticação de áudio digital com base na mudança de fase da frequência da rede elétrica,” in Proc. XXVII Brazilian Telecommunications Symp. (SBrT’09) (in Portuguese), Blumenau, Brazil, Sep. 2009.
[15] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, “The DET curve in assessment of detection task performance,” in Proc. Eur. Conf. Speech Communication and Technology, Rhodes, Greece, Sep. 1997.
[16] J. Ortega-García, J. González-Rodríguez, and V. Marrero-Aguiar, “AHUMADA, a large speech corpus in Spanish for speaker characterization and identification,” Elsevier Speech Commun., vol. 31, pp. 255–264, Jun. 2000.
[17] A. Benyassine et al., “ITU-T recommendation G.729 Annex B: A silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications,” IEEE Commun. Mag., vol. 35, no. 9, pp. 64–73, Sep. 1997.

José Antonio Apolinário, Jr. (S’95–M’99–SM’04) was born in Taubaté, Brazil, in 1960. He graduated from the Military Academy of Agulhas Negras (AMAN), Resende, Brazil, in 1981 and received the B.Sc. degree from the Military Institute of Engineering (IME), Rio de Janeiro, Brazil, in 1988, the M.Sc. degree from the University of Brasília (UnB), Brasília, Brazil, in 1993, and the D.Sc. degree from the Federal University of Rio de Janeiro (COPPE/UFRJ), Rio de Janeiro, Brazil, in 1998, all in electrical engineering.
He is currently an Adjoint Professor with the Department of Electrical Engineering, IME, where he has already served as the Head of Department and as the Vice-Rector for Study and Research. He was a Visiting Professor at the Escuela Politécnica del Ejército (ESPE), Quito, Ecuador, from 1999 to 2000 and a Visiting Researcher and twice a Visiting Professor at Helsinki University of Technology (HUT), Finland, in 1997, 2004, and 2006, respectively. His research interests comprise many aspects of linear and nonlinear digital signal processing, including adaptive filtering, speech, and array processing. He has recently edited the book QRD-RLS Adaptive Filtering (New York: Springer, 2009).
Dr. Apolinário has organized and been the first Chair of the Rio de Janeiro Chapter of the IEEE Communications Society.