Audio Authenticity Detecting ENF Discontinuity Wit

The paper presents a forensic tool for assessing audio authenticity by detecting electric network frequency (ENF) discontinuities using high precision phase analysis. It employs digital signal processing techniques to identify phase changes in audio recordings, which can indicate editing. The method is evaluated on digitally edited audio signals, demonstrating its effectiveness in distinguishing between authentic and manipulated audio.


Audio Authenticity: Detecting ENF Discontinuity With High Precision Phase Analysis

Article in IEEE Transactions on Information Forensics and Security · October 2010


DOI: 10.1109/TIFS.2010.2051270 · Source: IEEE Xplore

3 authors:

Daniel Nicolalde, Nokia
Jose Apolinario, Military Institute of Engineering
Luiz Biscainho, Federal University of Rio de Janeiro

All content following this page was uploaded by Jose Apolinario on 30 June 2014.



534 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 5, NO. 3, SEPTEMBER 2010

Audio Authenticity: Detecting ENF Discontinuity With High Precision Phase Analysis
Daniel Patricio Nicolalde Rodríguez, José Antonio Apolinário, Jr., Senior Member, IEEE, and
Luiz Wagner Pereira Biscainho, Member, IEEE

Abstract: This paper addresses a forensic tool used to assess audio authenticity. The proposed method is based on detecting phase discontinuity of the power grid signal; this signal, referred to as electric network frequency (ENF), is sometimes embedded in audio signals when the recording is carried out with the equipment connected to an electrical outlet or when certain microphones are in an ENF magnetic field. After down-sampling and band-filtering the audio around the nominal value of the ENF, the result can be considered a single tone such that a high-precision Fourier analysis can be used to estimate its phase. The estimated phase provides a visual aid to locating editing points (signalled by abrupt phase changes) and inferring the type of audio editing (insertion or removal of audio segments). From the estimated values, a feature is used to quantify the discontinuity of the ENF phase, allowing an automatic decision concerning the authenticity of the audio evidence. The theoretical background is presented along with practical implementation issues related to the proposed technique, whose performance is evaluated on digitally edited audio signals.

Index Terms: Audio authenticity, discrete Fourier transform (DFT), electric network frequency (ENF), forensic analysis, phase estimation.

I. INTRODUCTION

FORENSIC audio authenticity, a branch of audio forensics, has developed remarkably over the last years due to advances in digital signal processing (DSP) and a growing availability of technology [1]. It uses DSP methods to perform signal analysis of recorded audio evidence in legal and law enforcement contexts.

Like any other forensic science, authenticity examinations analyze and interpret physical evidence using natural sciences. The goal of this paper is to detail a technique that uses a high-precision phase analysis to detect electric network frequency (ENF) discontinuities and thus provide some degree of audio authentication [2], [3]. The proposed technique is, therefore, based on the presence of a small portion of the power grid signal, sometimes embedded in audio recordings.

The importance of this topic is enhanced by the advent of personal computers and all sorts of digital technology: we may say that, today, editing digital audio has become a simple task [4]. Moreover, if a good job is carried out, it is hard, even for well-trained ears, to detect this type of fraud; hence the importance of this subject in the field of audio authenticity.

To tackle the digital audio authenticity problem, this paper resorts to modern DSP techniques which, to some extent, can be quite effective in detecting subtle changes in the phase of the ENF, provided it is present in the recorded material.

The paper is organized as follows. Section II provides some background about the power grid signal: its generation, behavior of the ENF and its phase, and how it is embedded in audio signals. Section III deals with estimating the phase of a sinusoidal signal. We start from a simple concept, the use of the discrete Fourier transform (DFT), and discuss a high-precision Fourier analysis technique for which we propose an efficient phase estimation scheme. Section IV details the proposed method for audio authenticity based on the phase estimate of the power grid signal. The method includes a visual characterization as well as an automatic discrimination. Section V evaluates the proposed method with real audio signals. The signals belong to two public corpora. Examples of the two types of editing (insertion and removal of a signal fragment) are also shown in this section. Finally, after a few practical issues discussed in Section VI, conclusions are summarized in Section VII.

II. THE POWER GRID SIGNAL

The electric power system, as an important element for modern society, constitutes a fundamental factor for the development of countries and can be defined as a group of apparatuses, wires, and machines that links the power plants to customers and their needs. Power plants may generate energy in different ways, including thermal (coal, oil, nuclear, geothermal), hydroelectric, solar, and wind. The public power grid signal may be viewed as a single sinusoidal waveform with a fixed frequency (the so-called ENF).

Most of the power provided by the power grid comes from turbines that work as generators of alternating current. The rotation velocity of these turbines determines the ENF, whose standard nominal values are 50 and 60 Hz. The first value is adopted in European countries, Asian countries (except Saudi Arabia), African countries (except Liberia), Australia, and in some South American countries like Argentina, Bolivia, Chile, Uruguay, and Paraguay. Meanwhile, 60 Hz is used in Central and North America and in some other South American countries including Ecuador, Venezuela, Peru, Colombia, and Brazil.

Manuscript received February 26, 2010; revised April 16, 2010; accepted April 20, 2010. Date of publication June 01, 2010; date of current version August 13, 2010. This work was supported in part by the Brazilian Agencies CAPES, CNPq, and FAPERJ. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Darko Kirovski.
D. P. Nicolalde Rodríguez and J. A. Apolinário, Jr. are with the Department of Electrical Engineering, Military Institute of Engineering (IME), Rio de Janeiro, RJ, Brazil (e-mail: [email protected]; [email protected]).
L. W. P. Biscainho is with the Program of Electrical Engineering, COPPE/Federal University of Rio de Janeiro (UFRJ), Rio de Janeiro, RJ, Brazil (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIFS.2010.2051270
1556-6013/$26.00 © 2010 IEEE

Authorized licensed use limited to: INSTITUTO MILITAR DE ENGENHARIA. Downloaded on August 13,2010 at 11:22:34 UTC from IEEE Xplore. Restrictions apply.
NICOLALDE RODRÍGUEZ et al.: AUDIO AUTHENTICITY: DETECTING ENF DISCONTINUITY 535

Japan is a peculiar case that adopts both 50 and 60 Hz as ENF nominal values.

It is important to mention that, for a correct operation of the power system, frequency and phase of all power generation units should remain synchronous within narrow limits. It is, therefore, of paramount importance that the ENF remains stable. If, for example, a generator drops 2 Hz below the nominal ENF, it will rapidly build up enough heat to destroy itself [5]. Therefore, in the majority of cities, especially those in the most developed regions, a tight control is kept over operator units.

Every type of electric equipment operating connected to the power grid emits an electromagnetic field. This fact causes the power grid signal to be embedded in some recorded signals when a recording device is connected to an electrical outlet or to certain microphones in an ENF magnetic field [6]. Its presence in recorded signals and its expected frequency and phase stability make the ENF useful in some audio authenticity examinations [7], [8].

In [4] and [9], the ENF is used for the task of audio authenticity; the method therein is based on comparing the pattern of the ENF embedded in a recorded signal with the patterns of the power grid signals from a few (suspect) regions, which have been previously stored in a database. It is then possible to obtain, besides audio authentication, information about the place where and the time when the recording was carried out. The Forensic Speech and Audio Analysis Working Group of the European Network of Forensic Science Institutes recently published a document giving guidelines for the use of ENF analysis in forensic authentication of audio recordings [10], attesting to the importance of this subject.

The present work is based on estimating the phase of the power grid signal embedded in the recorded audio signal assuming that a database with ENF information is not available. We use abrupt changes in the estimated phase to infer whether or not the signal has been digitally edited.

III. ESTIMATING FREQUENCY AND PHASE OF A SINGLE TONE

The power grid signal may be viewed as a single tone whose frequency and phase can be estimated. This section starts with the short-time DFT [11] and proceeds to a high-precision Fourier analysis method named DFT$^1$ (the term DFT$^p$ was coined in [12] denoting the DFT of the $p$th derivative of a signal, DFT$^0$ representing its regular DFT).

A. Phase Estimation Using the DFT

Let $x(n)$ be an $N$-sample single tone sequence, whose frequency and phase are to be estimated. The application of a smoothing window $w(n)$ (e.g., Hann) yields the signal $x_N(n) = x(n)\,w(n)$. The $N_{\mathrm{DFT}}$-point DFT of $x_N(n)$, with $N_{\mathrm{DFT}} \ge N$, will be called $X(k)$.

Let $k_{\mathrm{peak}}$ be the integer index associated with the maximum value of $|X(k)|$. Then, the estimated value of the tone frequency is

$$f_{\mathrm{DFT}} = k_{\mathrm{peak}}\,\frac{f_s}{N_{\mathrm{DFT}}} \qquad (1)$$

where $f_s$ is the sampling frequency of $x(n)$. The resolution of $f_{\mathrm{DFT}}$, which can only assume discrete values, is $f_s/N_{\mathrm{DFT}}$. This means that the greater the value of $N_{\mathrm{DFT}}$, the better the accuracy of $f_{\mathrm{DFT}}$, at the expense of increased computational burden. The tone phase is simply the argument (or angle) of $X(k_{\mathrm{peak}})$, i.e.,

$$\phi_{\mathrm{DFT}} = \arg\left[X(k_{\mathrm{peak}})\right] \qquad (2)$$

B. The Novel Phase Estimation Method

The method in [12], named DFT$^1$, refines the DFT-based frequency estimation of a single tone, and is commonly used to extract spectral modeling parameters from audio signals. It uses the short-time DFT of the first-order signal derivative. Practical experiments show that DFT$^1$ attains an improved accuracy in finding the peak of the signal spectrum (i.e., the actual value of its frequency) compared to the DFT method, even for small values of $N_{\mathrm{DFT}}$.

The basic steps to estimate the frequency, as presented in [12], are the following:
1) Compute the approximate first derivative of the signal at instant $n$:
$$x'(n) = f_s\left[x(n) - x(n-1)\right]$$
2) Obtain the windowed version of $x(n)$ and $x'(n)$:
$$x_N(n) = x(n)\,w(n), \qquad x'_N(n) = x'(n)\,w(n)$$
3) Obtain the $N_{\mathrm{DFT}}$-point DFT of $x_N(n)$ and $x'_N(n)$. They will be denoted as $X(k)$ and $X'(k)$, respectively.
4) Compute $|X(k)|$ and $|X'(k)|$ as well as $k_{\mathrm{peak}}$, obtained as in Section III-A.
5) Multiply $|X'(k)|$ by the scaling factor
$$F(k) = \frac{\pi k}{N_{\mathrm{DFT}}\,\sin\left(\pi k / N_{\mathrm{DFT}}\right)}$$
At this point, we have DFT$^0[k] = |X(k)|$ and DFT$^1[k] = F(k)\,|X'(k)|$.
6) Finally, the value of the estimated frequency is
$$f_{\mathrm{DFT}^1} = \frac{1}{2\pi}\,\frac{\mathrm{DFT}^1[k_{\mathrm{peak}}]}{\mathrm{DFT}^0[k_{\mathrm{peak}}]}$$

According to [12], $k_{\mathrm{peak}}$ is expected to be the closest integer to $N_{\mathrm{DFT}}\,f_{\mathrm{DFT}^1}/f_s$; then, in order for $f_{\mathrm{DFT}^1}$ to be considered a valid solution,

$$\left(k_{\mathrm{peak}} - \tfrac{1}{2}\right) \le \frac{N_{\mathrm{DFT}}\,f_{\mathrm{DFT}^1}}{f_s} < \left(k_{\mathrm{peak}} + \tfrac{1}{2}\right)$$

must be satisfied, otherwise the method has failed for this frequency. If we define $k_{\mathrm{DFT}^1} = N_{\mathrm{DFT}}\,f_{\mathrm{DFT}^1}/f_s$, the validation condition can be rewritten as

$$\left(k_{\mathrm{peak}} - \tfrac{1}{2}\right) \le k_{\mathrm{DFT}^1} < \left(k_{\mathrm{peak}} + \tfrac{1}{2}\right)$$
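As an illustration of the two frequency estimators discussed in this section, the following Python sketch applies peak picking on the DFT and a DFT$^1$-style refinement to one frame of a synthetic tone. It is a minimal sketch under stated assumptions, not the authors' implementation: the backward difference, the symmetric Hann window, the form of the scaling factor, and the helper names (`hann`, `dft_bin`, `estimate_frequency`) are choices made here for illustration, and only the Python standard library is used.

```python
import cmath
import math

def hann(n_samples):
    """Symmetric Hann window (an assumed choice of smoothing window)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def dft_bin(frame, k, n_dft):
    """Single DFT coefficient of a (zero-padded) frame, by direct summation."""
    return sum(v * cmath.exp(-2j * math.pi * k * n / n_dft)
               for n, v in enumerate(frame))

def estimate_frequency(s, fs, n_dft):
    """s holds N+1 samples; the analysis frame is s[1:], so that the
    backward difference is defined for every frame sample.
    Returns (f_dft, f_dft1, k_peak)."""
    n_samples = len(s) - 1
    w = hann(n_samples)
    xw = [s[n + 1] * w[n] for n in range(n_samples)]                 # windowed signal
    dw = [fs * (s[n + 1] - s[n]) * w[n] for n in range(n_samples)]   # windowed difference

    # Peak picking over the positive-frequency bins (plain DFT estimate).
    mags = [abs(dft_bin(xw, k, n_dft)) for k in range(1, n_dft // 2)]
    k_peak = 1 + max(range(len(mags)), key=mags.__getitem__)
    f_dft = k_peak * fs / n_dft

    # The scaling factor compensates the gain of the first difference
    # relative to an ideal differentiator at bin k_peak.
    arg = math.pi * k_peak / n_dft
    scale = arg / math.sin(arg)
    dft0 = mags[k_peak - 1]
    dft1 = scale * abs(dft_bin(dw, k_peak, n_dft))
    f_dft1 = dft1 / (2 * math.pi * dft0)
    return f_dft, f_dft1, k_peak

# One 200-sample frame of a 60.98-Hz tone sampled at 1200 Hz.
fs, f0 = 1200.0, 60.98
s = [math.cos(2 * math.pi * f0 * m / fs + 0.3) for m in range(201)]
f_dft, f_dft1, k_peak = estimate_frequency(s, fs, n_dft=200)
k_dft1 = 200 * f_dft1 / fs
is_valid = (k_peak - 0.5) <= k_dft1 < (k_peak + 0.5)   # validation condition
print(f_dft, round(f_dft1, 3), is_valid)
```

With a 200-point DFT the plain estimate is locked to the 6-Hz bin grid, whereas the ratio of the two spectra at the peak bin is not quantized, which is why the refined estimate can land much closer to the true frequency.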


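The mechanism introduced in [12] is intended to estimate the value of the frequencies of single tones present in an audio signal, based on the use of the Fourier transform of signal derivatives. The method proposed below extends this result to estimate the phase of a single tone.

Considering a signal model given by $x(t) = a\cos(2\pi f t + \phi_0)$, the signal phase corresponds to $\phi(t) = 2\pi f t + \phi_0$, where $\phi_0$ is the phase at $t = 0$. An estimation of such a value would be restricted to the interval between $-\pi$ and $\pi$, and a plot of $\phi(t)$ would be a sawtooth-like curve (wrapped phase). This model of $x(t)$ is of a narrowband signal, which would be deterministic were $f$ a constant. In practice, $f$ is assumed to evolve slowly over time, and thus can be taken as approximately constant within a small analysis frame or "window." The model does not include any stochastic part (or broadband component), but can be applied to the target problem of this work, since, as will be seen in the next section, all frequency components outside a small bandwidth defined around the ENF nominal value are carefully filtered out.

Therefore, the signal can be expressed as

$$x(n) = a\cos(\omega_0 n + \phi_0) \qquad (3)$$

where $\omega_0 = 2\pi f/f_s$, and $f$ is the actual value of the tone frequency.

Consequently, $x'(n)$, as computed in the first step of the DFT$^1$ frequency estimation procedure, can be expressed as

$$x'(n) = f_s a\left[\cos(\omega_0 n + \phi_0) - \cos(\omega_0 n - \omega_0 + \phi_0)\right]$$
$$\phantom{x'(n)} = f_s a\left\{\cos(\omega_0 n)\left[\cos\phi_0 - \cos(\phi_0 - \omega_0)\right] - \sin(\omega_0 n)\left[\sin\phi_0 - \sin(\phi_0 - \omega_0)\right]\right\} \qquad (4)$$

Additionally, since the first difference of a sinusoid (tone) is in fact another sinusoid with the same frequency, (4) can be represented by

$$x'(n) = a'\cos(\omega_0 n + \phi') = a'\cos\phi'\cos(\omega_0 n) - a'\sin\phi'\sin(\omega_0 n) \qquad (5)$$

where $a'$ is a constant and $\phi'$ is the phase of $x'(n)$.

Comparing (4) to (5), we can write

$$a'\cos\phi' = f_s a\left[\cos\phi_0 - \cos(\phi_0 - \omega_0)\right] \qquad (6)$$

and

$$a'\sin\phi' = f_s a\left[\sin\phi_0 - \sin(\phi_0 - \omega_0)\right] \qquad (7)$$

Dividing (7) by (6), we obtain

$$\tan\phi' = \frac{\sin\phi_0 - \sin(\phi_0 - \omega_0)}{\cos\phi_0 - \cos(\phi_0 - \omega_0)} \qquad (8)$$

Dividing both numerator and denominator of (8) by $\cos\phi_0$ and isolating $\tan\phi_0$, the next expression is obtained

$$\tan\phi_0 = \frac{\tan\phi'\left(1 - \cos\omega_0\right) - \sin\omega_0}{\left(1 - \cos\omega_0\right) + \tan\phi'\,\sin\omega_0} \qquad (9)$$

The value of $\phi_0$ represents the initial phase of $x(n)$; since it is being estimated from the DFT$^1$, we write it as

$$\phi_{\mathrm{DFT}^1} = \arctan\left\{\frac{\tan\theta\left(1 - \cos\omega_0\right) - \sin\omega_0}{\left(1 - \cos\omega_0\right) + \tan\theta\,\sin\omega_0}\right\} \qquad (10)$$

where the value of $\omega_0$ is approximated as $2\pi f_{\mathrm{DFT}^1}/f_s$. For the value of $\theta$, we carry out a linear interpolation in the argument of $X'(k)$. Let $k_{\mathrm{low}}$ and $k_{\mathrm{high}}$ be defined as

$$k_{\mathrm{low}} = \mathrm{floor}\left[k_{\mathrm{DFT}^1}\right]$$

and

$$k_{\mathrm{high}} = \mathrm{ceil}\left[k_{\mathrm{DFT}^1}\right]$$

where $\mathrm{floor}[a]$ rounds the value of $a$ to the nearest integer less than or equal to $a$ and $\mathrm{ceil}[a]$ rounds the value of $a$ to the nearest integer greater than or equal to $a$.

Recalling that $k_{\mathrm{DFT}^1} = N_{\mathrm{DFT}}\,f_{\mathrm{DFT}^1}/f_s$, a linear interpolation between points $\left(k_{\mathrm{low}}, \arg[X'(k_{\mathrm{low}})]\right)$ and $\left(k_{\mathrm{high}}, \arg[X'(k_{\mathrm{high}})]\right)$ can yield point $\left(k_{\mathrm{DFT}^1}, \theta\right)$, whose argument corresponds to the value of $\theta$ used in (10), i.e.,

$$\theta \approx \arg[X'(k_{\mathrm{low}})] + \frac{k_{\mathrm{DFT}^1} - k_{\mathrm{low}}}{k_{\mathrm{high}} - k_{\mathrm{low}}}\left\{\arg[X'(k_{\mathrm{high}})] - \arg[X'(k_{\mathrm{low}})]\right\} \qquad (11)$$

From (10), it is worth mentioning that $\phi_{\mathrm{DFT}^1}$ can have two possible values. If $\tan(\phi_{\mathrm{DFT}^1})$ has a positive value, $\phi_{\mathrm{DFT}^1}$ could be in the first or in the third quadrant of a two-dimensional Cartesian system; if, on the other hand, $\tan(\phi_{\mathrm{DFT}^1})$ has a negative value, $\phi_{\mathrm{DFT}^1}$ could be in the second or in the fourth quadrant. A simple decision can be taken by using the value of $\phi_{\mathrm{DFT}}$ as a reference: choose the value of $\phi_{\mathrm{DFT}^1}$ closer to $\phi_{\mathrm{DFT}}$.

C. Preliminary Experiments

In order to better understand and evaluate the proposed method, we provide the results of a few preliminary computer experiments.

We have initially considered a 60.98-Hz sinusoidal tone sampled at 1200 Hz. In Fig. 1, the true spectrum of this signal, zoomed around the nominal frequency of the tone, is shown together with its associated discrete spectra computed via 200- and 2000-point DFTs.

In this experiment, we obtained the first 100 estimated frequencies and phases for consecutive frames of the test tone delimited by a 200-sample sliding window (i.e., advancing sample by sample). For this particular signal, the DFT procedure provided a constant estimated frequency value of 60 Hz for $N_{\mathrm{DFT}} = 200$, and 61.20 Hz for $N_{\mathrm{DFT}} = 2000$. Meanwhile, when using the DFT$^1$ method, the values of the estimated frequency had a mean of 60.9719 Hz with a standard deviation of 0.0025 Hz for $N_{\mathrm{DFT}} = 200$, and a mean of 60.9818 Hz with a standard deviation of 0.0032 Hz for $N_{\mathrm{DFT}} = 2000$.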
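The phase-recovery idea of Section III-B can be sketched in Python as follows. This is a hedged illustration, not the authors' code: it assumes a backward difference for the derivative, a symmetric Hann window, a phase referred to the first sample of the frame, and a peak search restricted to a hypothetical band (`f_lo`, `f_hi`) around the nominal frequency. The inversion formula in the code is the one that follows from those assumptions; a different difference convention would change its signs. All function names are illustrative.

```python
import cmath
import math

def wrap(angle):
    """Wrap an angle in radians to the interval (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def dft_bin(frame, k, n_dft):
    """Single DFT coefficient of a (zero-padded) frame, by direct summation."""
    return sum(v * cmath.exp(-2j * math.pi * k * n / n_dft)
               for n, v in enumerate(frame))

def estimate_phase(s, fs, n_dft, f_lo, f_hi):
    """s holds N+1 samples; the frame is s[1:] and phases refer to its
    first sample. Returns (phi_dft, phi_dft1, f_dft1) in radians and hertz."""
    n_samples = len(s) - 1
    w = [0.5 - 0.5 * math.cos(2 * math.pi * n / (n_samples - 1))
         for n in range(n_samples)]
    xw = [s[n + 1] * w[n] for n in range(n_samples)]
    dw = [fs * (s[n + 1] - s[n]) * w[n] for n in range(n_samples)]

    # Plain DFT estimate: phase of the peak bin inside the search band.
    bins = range(math.ceil(f_lo * n_dft / fs), math.floor(f_hi * n_dft / fs) + 1)
    spec = {k: dft_bin(xw, k, n_dft) for k in bins}
    k_peak = max(spec, key=lambda k: abs(spec[k]))
    phi_dft = cmath.phase(spec[k_peak])

    # Refined frequency from the ratio of derivative and signal spectra.
    arg = math.pi * k_peak / n_dft
    f_dft1 = (arg / math.sin(arg)) * abs(dft_bin(dw, k_peak, n_dft)) \
        / (2 * math.pi * abs(spec[k_peak]))

    # Linear interpolation of arg X'(k) at the fractional bin k_frac.
    k_frac = n_dft * f_dft1 / fs
    k_low, k_high = math.floor(k_frac), math.ceil(k_frac)
    a_low = cmath.phase(dft_bin(dw, k_low, n_dft))
    a_high = cmath.phase(dft_bin(dw, k_high, n_dft))
    theta = a_low + (k_frac - k_low) * wrap(a_high - a_low)

    # Invert the phase shift introduced by the backward difference.
    w0 = 2 * math.pi * f_dft1 / fs
    t = math.tan(theta)
    base = math.atan2(t * (1 - math.cos(w0)) - math.sin(w0),
                      (1 - math.cos(w0)) + t * math.sin(w0))
    # The tangent leaves a 180-degree ambiguity: keep the candidate
    # closer to the coarse DFT phase.
    phi_dft1 = min((base, base + math.pi), key=lambda p: abs(wrap(p - phi_dft)))
    return phi_dft, wrap(phi_dft1), f_dft1

fs, f0, phi_s = 1200.0, 60.98, 0.3
s = [math.cos(2 * math.pi * f0 * m / fs + phi_s) for m in range(201)]
phi_true = wrap(2 * math.pi * f0 / fs + phi_s)   # phase at the first frame sample
phi_dft, phi_dft1, f_dft1 = estimate_phase(s, fs, 2000, 55.0, 65.0)
err_dft = math.degrees(abs(wrap(phi_dft - phi_true)))
err_dft1 = math.degrees(abs(wrap(phi_dft1 - phi_true)))
print(err_dft, err_dft1)
```

The plain DFT phase is biased because the peak bin does not coincide with the tone frequency; for a symmetric window the argument of the spectrum varies linearly with the bin index, so interpolating it at the refined frequency removes most of that bias.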


Fig. 1. Spectra of 500 windowed samples of a single 60.98-Hz tone sampled at 1200 Hz: continuous spectrum; 200-point DFT; and 2000-point DFT.

Fig. 2. Phase estimation of an artificial 60.98-Hz tone: (a) signal; (b) phase estimate.

TABLE I
EVALUATION OF FREQUENCY AND PHASE ESTIMATIONS USING DFT AND DFT$^1$. THE EXPERIMENT WAS CARRIED OUT WITH 1000 TONES WITH FREQUENCIES VARYING RANDOMLY BETWEEN 59.0 AND 61.0 HZ. $e_f$ AND $e_\phi$ REPRESENT THE MEAN ERRORS IN FREQUENCY AND PHASE, RESPECTIVELY

A mean relative error$^1$ averaged over the frames has been computed for both DFT and DFT$^1$. The values of $e_f$ obtained were: 1.61% with $N_{\mathrm{DFT}} = 200$, and 0.36% with $N_{\mathrm{DFT}} = 2000$ for the DFT method; 0.014% with $N_{\mathrm{DFT}} = 200$, and 0.005% with $N_{\mathrm{DFT}} = 2000$ for the DFT$^1$ method. The errors attained with the DFT$^1$ method are substantially lower than those with the DFT method.

The respective 100 estimations of phase using DFT and DFT$^1$ are shown in Fig. 2. For both methods, the same number of points $N_{\mathrm{DFT}}$ and window size $N$ have been used. Analogously to the case of frequency estimation, a mean error $e_\phi$ has been obtained as an average over the phase error.$^2$ The mean phase errors obtained were: 29.25° with $N_{\mathrm{DFT}} = 200$, and 6.57° with $N_{\mathrm{DFT}} = 2000$ for the DFT method; 0.25° with $N_{\mathrm{DFT}} = 200$, and 0.0912° with $N_{\mathrm{DFT}} = 2000$ for the DFT$^1$ method. A considerable improvement in phase estimation has been obtained using the new method.

A statistical evaluation of frequency and phase estimation for both methods, DFT and DFT$^1$, was performed. For that, 1000 tones with frequencies randomly varying (with uniform distribution) between 59.0 and 61.0 Hz were synthesized. Subsequently, the errors in the estimates of frequency and phase for different DFT lengths $N_{\mathrm{DFT}}$ and window sizes $N$ were computed. Table I summarizes the results.

It can be seen that when the DFT length $N_{\mathrm{DFT}}$ increases, given a constant window size $N$, frequency and phase estimates improve in both methods; this is due to the fact that the signal spectrum is sampled with higher resolution. Additionally, the DFT$^1$ method provides a substantial improvement in frequency and phase estimation when compared to the DFT method, for the same $N_{\mathrm{DFT}}$. This effect can be seen in Table I: the DFT$^1$ estimates with the lowest resolution are better than the DFT estimates with the highest resolution.

For stationary signals, as in the present experiment, increasing the window size improves frequency estimation in both methods. However, this parameter cannot grow unbounded if one needs to detect abrupt phase changes; thus it should be kept low for the target application.

$^1$The relative frequency error at the $n$th frame is defined as the ratio between the absolute value of the difference between the estimated and the correct values and the correct value, i.e., $e_f(n) = \left(|\hat{f}(n) - f(n)| / f(n)\right) \times 100\%$.
$^2$The phase error at the $n$th frame is defined as the absolute value of the difference between the estimated and the correct phase, i.e., $e_\phi(n) = |\hat{\phi}(n) - \phi(n)|$.

IV. THE PROPOSED METHOD

As mentioned before, the power grid signal may be embedded in recorded signals. Consequently, considering that audio editing means removal or insertion of a portion of audio, the same action is carried out in the embedded power grid


signal. Following this reasoning, a method that attempts to detect abrupt changes in the phase of the embedded ENF signal is proposed here.

The method can be divided into two parts. The first part comprises a visual mechanism that allows the observation of the behavior of the estimated phase of the power grid signal. The other part automatically discriminates between original and edited signals by means of a decision ratio. The basic idea of this method, without employing high-precision phase analysis, can be found in [13] and [14].

A. Visual Method

The steps of the visual method are detailed below:
1) Down-sample the audio signal to a lower sampling frequency which, as a suggestion, could be 1000 or 1200 Hz, depending on the value of the nominal ENF being 50 or 60 Hz, respectively. This synchronous sampling, besides reducing the analysis computational burden, allows working with an exact number of samples per cycle of the nominal ENF (or, in the frequency domain, locating one DFT bin exactly on the nominal ENF).
2) Use a very sharp linear-phase FIR filter to bandpass the down-sampled signal. This filter should be centered at the nominal ENF value, and have a passband width between 0.6 and 1.4 Hz, depending on the ENF tolerance guaranteed by the electrical company. In the experiments carried out in this work, a 10 000-coefficient zero-phase filter has been employed (using Matlab function filtfilt to avoid delay).
3) Divide the filtered signal into blocks of $N_C$ cycles of the nominal ENF, each block overlapping the former by $N_C - 1$ cycles. The signal is then segmented into $N_{\mathrm{Block}} = N_{\mathrm{Cycles}} - N_C + 1$ blocks. Fig. 3 illustrates this block fragmentation.
4) Estimate the phase of every segmented block using DFT or DFT$^1$. Let $\phi(n_b)$ be the corresponding phase estimate for the block index $n_b$.
5) Plot phase values in degrees versus cycles of nominal ENF for visual inspection.

Fig. 3. Block fragmentation of an audio signal. $N_{\mathrm{Cycles}}$ is the number of ENF cycles in the audio signal; $N_{\mathrm{Block}}$ is the number of fragment blocks; $N_C$ is the number of ENF cycles in each block.

B. Automatic Method

The automatic discrimination between edited and original signals requires a feature that characterizes the detection of abrupt phase changes in the power grid signal embedded in the recorded audio, related to audio editing. The variation of the estimated ENF phase for the $n_b$th block under analysis

$$\phi'(n_b) = \phi(n_b) - \phi(n_b - 1) \qquad (12)$$

for $2 \le n_b \le N_{\mathrm{Block}}$, is chosen for this purpose. Taking $m_{\phi'}$ as the average of $\phi'(n_b)$ from $n_b = 2$ to $N_{\mathrm{Block}}$, the proposed feature $F$ is then defined as

(13)

For the detection process, the hypothesis group $\{H_0, H_1\}$ is defined; $H_0$ and $H_1$ represent the hypothesis for an audio signal being original and edited, respectively. A decision ratio for the automatic detection can be expressed as

$$F \underset{H_0}{\overset{H_1}{\gtrless}} \gamma \qquad (14)$$

where $\gamma$ is a threshold. For $F$ greater than $\gamma$, it is decided that the audio signal has been edited, i.e., hypothesis $H_1$. Otherwise, hypothesis $H_0$ is favored.

Let $P_D$ be the probability of detection, or hit (i.e., the audio signal is considered as edited when it indeed has been edited), $P_{FA}$ be the probability of false alarm (i.e., the audio signal is considered as edited when it has actually not been edited), and $P_M$ be the probability of a miss (i.e., the audio signal is considered as not edited when it indeed has been edited). The expressions for $P_D$, $P_{FA}$, and $P_M$ are

$$P_D = P(F > \gamma \mid H_1), \qquad P_{FA} = P(F > \gamma \mid H_0)$$

and

$$P_M = P(F \le \gamma \mid H_1)$$

Additionally, $P_M = 1 - P_D$. For optimal detection, the goal is to obtain a value of $\gamma$ that maximizes the value of $P_D$. To establish this threshold, it is necessary to prepare a corpus of audio signals including their original and edited (in a controlled way) versions, and evaluate this database with the proposed method for an extended range of $\gamma$ values; with the corresponding values of $P_M$ and $P_{FA}$, the so-called detection error tradeoff (DET) curve [15] ($P_M$ as a function of $P_{FA}$) is constructed. The point in the curve where $P_M = P_{FA}$ is known as the equal error rate (EER). The value of $\gamma$ that corresponds to the EER point will be taken as the decision threshold in (14). The EER allows the characterization of the detection system error by a single parameter, since the lower this value, the better the system performance.
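The automatic decision described in this section can be sketched as follows. This is an illustrative sketch only: the feature used here (the base-10 logarithm of the variance of the block-to-block phase variation) is a hypothetical stand-in chosen for this example and is not claimed to be the paper's definition in (13); the function names and the toy scores are likewise assumptions.

```python
import math

def phase_variation(phases_deg):
    """Block-to-block phase differences, wrapped to [-180, 180) degrees."""
    return [((b - a + 180.0) % 360.0) - 180.0
            for a, b in zip(phases_deg, phases_deg[1:])]

def discontinuity_feature(phases_deg):
    """Hypothetical feature: log-variance of the phase variation.
    A smooth phase track gives a low value, an abrupt jump a high one."""
    d = phase_variation(phases_deg)
    mean = sum(d) / len(d)
    var = sum((v - mean) ** 2 for v in d) / len(d)
    return math.log10(var + 1e-12)   # small offset avoids log(0)

def error_rates(originals, edited, gamma):
    """False-alarm and miss rates for the rule 'edited if feature > gamma'."""
    p_fa = sum(f > gamma for f in originals) / len(originals)
    p_m = sum(f <= gamma for f in edited) / len(edited)
    return p_fa, p_m

def equal_error_rate(originals, edited):
    """Sweep candidate thresholds and keep the one where the false-alarm
    and miss rates are closest; returns (gamma, approximate EER)."""
    best = None
    for gamma in sorted(set(originals) | set(edited)):
        p_fa, p_m = error_rates(originals, edited, gamma)
        gap = abs(p_fa - p_m)
        if best is None or gap < best[0]:
            best = (gap, gamma, (p_fa + p_m) / 2)
    return best[1], best[2]

# A smooth phase track versus the same track with a 90-degree jump.
smooth = [0.5 * i for i in range(50)]
jumped = smooth[:25] + [p + 90.0 for p in smooth[25:]]
f_smooth = discontinuity_feature(smooth)
f_jumped = discontinuity_feature(jumped)

# Toy feature values for a small labeled corpus (made-up numbers).
gamma, eer = equal_error_rate([0.1, 0.2, 0.3, 0.4], [0.35, 0.5, 0.6, 0.7])
print(f_smooth < f_jumped, gamma, eer)
```

Sweeping the threshold over the observed feature values is one simple way to trace the trade-off between misses and false alarms; with a larger corpus the sweep yields the points of a DET curve.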


V. ASSESSING THE METHOD

In order to evaluate the proposed method, it was employed to check the authenticity of digitally edited recordings. The original recordings [digitized with 16-bit quantization and sampling rates of 8 kHz in the case of telephone and 16 kHz in the case of microphone signals (see additional details in [16])] for the main evaluation were taken from two public databases in Spanish, AHUMADA and GAUDI, obtained from the website http://atvs.ii.uam.es/databases.jsp. The signals chosen for the evaluation were checked and determined to be neither digitally saturated nor having a low signal-to-noise ratio (SNR). Since both databases came from Spain, the nominal ENF associated with all signals was 50 Hz. Overall, 100 signals (speech utterances) were used: 50 by female and 50 by male speakers. They were edited in such a way that half of the speech files had an audio portion deleted, while the other half had a portion of audio inserted. Insertions were carried out with a fragment of audio belonging to the same file in order to avoid strong short-time spectral changes (due to, for example, a difference in sampling rate), which could make the detection easier. In [13], the authors address the problem of detecting audio discontinuities from spectral distances.

It is important to mention that the editing was carried out without regard to the phase changes occurring in the power grid signal, in an attempt to emulate the way most files are digitally edited. Therefore, the phase changes resulted in a random distribution among the speech files, as depicted in Fig. 4. This histogram shows a relatively uniform distribution between −180° and 180°. That means that all degrees of difficulty are covered by the edited databases, as would happen in real life.

Fig. 4. Histogram of phase change for the signals forming the corpus of edited speech.

In the following, the values of the different variables used in the proposed method to detect audio editing from ENF discontinuity are detailed. Because of the ENF nominal value, 50 Hz for this evaluation, the sampling frequency after decimation was set to 1000 Hz. The passband width of the filter tuned around the nominal ENF was chosen to be 0.8 Hz. Additionally, window size values $N_C$ were chosen as 3, 5, and 10 cycles of the nominal ENF (1 cycle = 20 ms of signal); and DFT length values $N_{\mathrm{DFT}}$ were chosen as 200, 2000, and 20 000 points.

TABLE II
EVALUATION OF AUDIO AUTHENTICITY FOR THE TEST AUDIO CORPUS (100 ORIGINAL AND 100 EDITED SIGNALS). $N_C$ REPRESENTS THE ANALYSIS WINDOW SIZE IN CYCLES OF THE NOMINAL ENF

Table II summarizes the results obtained by the automatic discrimination method according to the decision rule expressed in (14). It can be seen that the use of DFT$^1$ yields more stable, almost constant results: around 6% EER in the audio editing decision, regardless of $N_C$ and $N_{\mathrm{DFT}}$ values (the exception occurs for the only case where the DFT length is not greater than the window size).

After examination of the results in Table II, allowing for some safety margin in both parameters without increasing the computational load too much, a cautious choice is to use the DFT$^1$ method with a window size of 5 to 10 ENF cycles and $N_{\mathrm{DFT}} = 2000$ points. Next, a particular case using the DFT$^1$ method with this choice ($N_C = 10$ cycles and $N_{\mathrm{DFT}} = 2000$) is further detailed. Fig. 5 presents the histograms of feature $F$ for both edited and original signals in the test audio corpus as well as the location of $\gamma$, the optimum decision threshold for the detection process. It can be seen that the distribution of original signals is reasonably separated from the distribution of edited signals.

The DET curve ($P_M$ versus $P_{FA}$) as well as the location of the EER point (6%) for this particular case are shown in Fig. 6.

In an attempt to show results of the visual aid provided by the proposed method, two examples of audio editing of signals from the test audio corpus are presented. The settings are the same as those used for the particular case previously detailed.

Fig. 7 presents an example where an audio portion has been deleted from a speech file. The phase estimation for the original ENF signal has rectilinear behavior, whereas the edited signal has an abrupt phase change at the edit point (in this case, a positive change).

Fig. 8 presents an example where an audio portion has been inserted into a speech file. There are, in this case, two edit points: $P_1$ and $P_2$. Consequently, we can observe two phase changes in the edited signal (the first one negative, the second one positive).

Phase estimation using the DFT$^1$-based method exhibits better resolution than using the DFT-based method (especially in the regions where phase transitions occur). This greater accuracy in phase estimation improves the visual aid.


Fig. 5. Histograms of feature F for the test audio corpus. The DFT -based
phase estimation method was used, with a window size of 10 cycles of the nom-
inal ENF and N = 2000 points.

Fig. 7. Visualization of a fragment deletion. Points P and P bound the por-


tion eliminated from the original signal. Consequently, P is the edited point
within the signal. The nominal ENF is 50 Hz and the passband width of the
bandpass filter is 0.8 Hz. The phase estimation methods used a window size of
10 cycles of the nominal ENF and N = 2000 points. (a) Original signal.
Fig. 6. DET curve: P versus P for the test audio corpus. The DFT -based (b) Edited signal. (c) Phase estimation using DFT. (d) Phase estimation using
phase estimation method was used, with a window size of 10 cycles of the nom- DFT .
inal ENF and N = 2000 points.

VI. PRACTICAL ISSUES

The well-behaved ENF variation of the Spanish corpus (signals from AHUMADA and GAUDI edited) yielded very good results. But what if the proposed method is to be used in real-life situations where signals are degraded in a number of ways? To answer this question, two additional local corpora were prepared, containing recordings in Portuguese as spoken in Rio de Janeiro, Brazil: Carioca 1 (digitized with 16-bit quantization and a sampling rate of 44 100 Hz) and Carioca 2 (16 bits and 11 050 Hz), both with the same structure as the edited Spanish corpus: a total of 100 original and 100 edited signals. As Brazilian speech databases, their nominal ENF is 60 Hz.

Carioca 1 speech signals were recorded with low background noise and without saturation. The EER obtained for this corpus was 7%. This result, only slightly worse than the one obtained for the Spanish corpus, could be due to the slightly faster variation of the ENF contained in Carioca 1 recordings. Although in both cases the ENF has a similar deviation around its nominal value, the ENF in Spain seems to vary more slowly than the ENF recorded in the city of Rio de Janeiro. Since the difference in EER was very small (1%), no further investigation was carried out; nevertheless, this result reinforces the expectation that the performance of the proposed method would degrade in a region without tight control over the ENF.

The Carioca 2 corpus was prepared under unfavorable conditions: among its 100 signals, 21 exhibited a moderate degree of saturation, and the corpus average SNR was around 30 dB (the average SNR of the Spanish corpus was estimated at 35 dB). The resulting EER for this corpus was 15%. In the following subsections, both effects will be addressed individually.

A. Effect of Background Noise

The Spanish corpus was used to carry out this study. Taking the clean speech and the background noise to be mutually uncorrelated by assumption, the original SNR is given as

(15)
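As an illustration of this kind of SNR bookkeeping, the sketch below estimates the SNR of a recording from a speech-activity mask and then computes how much extra white noise would lower it to a target value. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the boolean mask `active` (standing in for the output of a voice activity detector), and the power-subtraction step are hypothetical choices made here. Because speech and noise are assumed uncorrelated, the power measured in active regions is treated as speech power plus noise power.

```python
import math
import random


def region_powers(x, active):
    """Return (speech_power, noise_power) estimated from an activity mask."""
    active_sq = [v * v for v, a in zip(x, active) if a]
    idle_sq = [v * v for v, a in zip(x, active) if not a]
    if not active_sq or not idle_sq:
        raise ValueError("need both active and inactive samples")
    noise_power = sum(idle_sq) / len(idle_sq)        # inactive: noise only
    active_power = sum(active_sq) / len(active_sq)   # active: speech + noise
    # Uncorrelated speech and noise: their powers add, so subtract the noise.
    speech_power = active_power - noise_power
    if speech_power <= 0.0 or noise_power <= 0.0:
        raise ValueError("cannot estimate SNR from these regions")
    return speech_power, noise_power


def estimate_snr_db(x, active):
    """SNR in dB as the ratio of estimated speech power to noise power."""
    speech_power, noise_power = region_powers(x, active)
    return 10.0 * math.log10(speech_power / noise_power)


def add_noise_to_target_snr(x, active, target_snr_db, rng):
    """Add zero-mean white Gaussian noise so the SNR drops to target_snr_db."""
    speech_power, noise_power = region_powers(x, active)
    # Total noise power required for the target, minus what is already there.
    extra_power = speech_power / 10.0 ** (target_snr_db / 10.0) - noise_power
    if extra_power <= 0.0:
        raise ValueError("target SNR must be below the current SNR")
    sigma = math.sqrt(extra_power)
    return [v + rng.gauss(0.0, sigma) for v in x]
```

A call such as `add_noise_to_target_snr(x, active, 25.0, random.Random(0))` would return a noisier copy of `x`; repeating it for several targets gives one way to trace an error rate as a function of the SNR.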


In order to estimate the original SNR, a voice activity detector (VAD) algorithm [17] was employed to separate active speech regions from background noise.

In order to alter the SNR, zero-mean uncorrelated noise has been added to the signals, such that

(16)

The next step is to obtain an error rate as a function of the SNR. For that, the level of the added noise was varied and the proposed method for audio authenticity (using the DFT¹-based phase estimation technique) was applied to the Spanish corpus.

Three types of noise were employed:
• NOISE 1: White Gaussian noise.
• NOISE 2: Low-frequency colored noise obtained by filtering white noise.
• NOISE 3: High-frequency colored noise obtained by filtering white noise.

Assuming a sampling frequency of 8000 Hz, the frequency responses of both filters are shown in Fig. 9.

Fig. 8. Visualization of a fragment insertion. The fragment is inserted at one point of the original signal, so the edited signal contains two edited points, one at each end of the inserted fragment. The nominal ENF is 50 Hz and the passband width of the bandpass filter is 0.8 Hz. The phase estimation methods used a window size of 10 cycles of the nominal ENF and N = 2000 points. (a) Original signal. (b) Edited signal. (c) Phase estimation using DFT. (d) Phase estimation using DFT¹.

Fig. 9. Frequency response curves of the two filters H(z) used to generate NOISE 2 and NOISE 3, respectively. The curves were plotted as a function of the frequency f, in hertz, assuming a sampling rate of 8000 Hz.

Fig. 10. Effect of background noise: EER in the task of audio authentication as a function of the SNR. The system error without adding extra noise is 6%, and the original SNR is 35 dB.

Fig. 10 presents the (equal) error rates as a function of the SNR when the proposed method is evaluated with additive NOISES 1, 2, and 3. Note that, when the noise has high energy in low frequencies (those components that directly affect the ENF, 50 Hz in this case), the background noise effect is much stronger. Therefore, NOISE 2 (whose curve has a logarithmic shape) is the most harmful to the authentication method, followed by NOISE 1 (linear shape) and NOISE 3 (exponential shape).

B. Effect of Saturation

In order to analyze the effect of the nonlinearity caused by saturation on the audio authenticity method proposed here, the Spanish corpus was used once more. Initially, a VAD algorithm [17] was applied to all signals. Then, a percentage of the active voice samples (referred to here as saturation level) was clipped to a suitably chosen maximum value. Fig. 11 presents an example of a signal with 3% saturation level.

Fig. 11. Example of a signal with 3% saturation level, i.e., with 3% of the samples in active (shaded) regions clipped to a maximum level.

By varying the saturation level and applying the audio authentication method to the Spanish corpus, the curve EER versus saturation shown in Fig. 12 was obtained. It can be observed that, considering the active voice intervals defined by the shaded regions, saturation levels above 0.5% considerably affected the performance of the authentication method. Also, perceptually speaking, this value corresponds to a high level of saturation.

Fig. 12. Effect of saturation: EER in the task of audio authentication as a function of the saturation level. The system error without artificial nonlinearity is 6%.

From what has been studied in this section, assuming that the effects of the unfavorable conditions were linear, one could expect to have, for the Carioca 2 corpus, an overall EER given by the sum of the EER for the Spanish corpus (6%) with 3 extra terms:
• 1% due to ENF variation, recalling that the experiment with the Carioca 1 database, which shares its ENF characteristics with the Carioca 2 database, resulted in an additional error of 1% over the Spanish database results;
• 5% due to background noise, computed as the difference between the EERs for 30- and 35-dB SNR in the NOISE 1 plot of Fig. 10;
• the remaining 3% due to saturation in some signals, a reasonable speculation that is not contradicted by inspection of Fig. 12.

VII. CONCLUSION

The proposed technique to detect audio editing has yielded favorable results. The idea of finding abrupt phase changes in the power grid signal provides an accurate visual characterization. This visual aid helps in determining the editing points and inferring the type of editing (whether insertion or deletion of audio segments). Additionally, the use of a decision feature allows an automatic discrimination between original and edited signals. In the computer experiments, the error attained by the detection process over clean audio signals was 6%. This small value of EER was probably due to those cases where the editing process caused insignificant ENF phase changes (around 0 in the histogram of Fig. 4).

The use of the DFT¹ method, here adapted to estimate phase with high accuracy, yielded improved resolution in the visual characterization (especially in the regions where the phase transitions are located) as well as good regularity in the automatic discrimination. The use of the DFT¹-based technique instead of the traditional DFT-based technique is justified by its higher accuracy with a smaller number of points, thus not increasing computational overhead.

Practical issues related mainly to the effects of nonlinearity and low SNR were independently analyzed.

Considering the presence of power grid signals in some recorded signals, the proposed technique for evaluating audio authenticity can be a useful tool in the field of forensic phonetics. The method described herein becomes even more important in those cases when there is no ENF database available.

It is worth mentioning that several extraneous phenomena either acoustically generated, or inherent to ac transmission or the recording system itself (e.g., transients, power-line spikes and surges, coding artifacts, etc.) may affect the recording under analysis, thus impacting the detection performance. A careful evaluation of those effects should be the object of future research.

REFERENCES
[1] R. C. Maher, “Audio forensic examination: Authenticity, enhancement, and interpretation,” IEEE Signal Process. Mag., vol. 26, no. 2, pp. 84–94, Mar. 2009.
[2] B. E. Koenig, “Authentication of forensic audio recordings,” J. Audio Eng. Soc., vol. 38, no. 1/2, pp. 3–33, Jan./Feb. 1990.
[3] B. E. Koenig and D. S. Lacey, “Forensic authentication of digital audio recordings,” J. Audio Eng. Soc., vol. 57, no. 9, pp. 662–695, Sep. 2009.
[4] R. W. Sanders, “Digital authenticity using the electric network frequency,” in Proc. AES 33rd Int. Conf. Audio Forensics, Theory and Practice, Denver, CO, Jun. 2008.
[5] M. Amin and J. Stringer, “The electric power grid: Today and tomorrow,” MRS Bulletin, vol. 33, pp. 399–407, Apr. 2008.
[6] E. Brixen, “ENF quantification of the magnetic field,” in Proc. AES 33rd Int. Conf. Audio Forensic, Theory and Practice, Denver, CO, Jun. 2008.
[7] C. Grigoras, “Applications of ENF criterion in forensic audio, video, computer, and telecommunication analysis,” Forensic Sci. Int., vol. 167, no. 2, pp. 136–145, Apr. 2007.
[8] C. Grigoras, “Applications of ENF analysis in forensic authentication of digital audio and video recordings,” J. Audio Eng. Soc., vol. 57, no. 9, pp. 643–661, Sep. 2009.
[9] A. J. Cooper, “The Electric Network Frequency (ENF) as an aid to authenticating forensic digital audio recordings—An automated approach,” in Proc. AES 33rd Int. Conf. Audio Forensic, Theory and Practice, Denver, CO, Jun. 2008.
[10] C. Grigoras, A. J. Cooper, and M. Michalek, Best Practice Guidelines for ENF Analysis in Forensic Authentication of Digital Evidence, European Network of Forensic Science Institutes, Forensic Speech and Audio Analysis Working Group, 2009, Ref. ENFSI-FSAAWG-BPM-ENF-001.
[11] P. A. Esquef and L. W. Biscainho, “Spectral-based analysis and synthesis of audio signals,” in Advances in Audio and Speech Signal Processing: Technologies and Applications, H. P. Meana, Ed. Hershey: Idea Group, 2007, pp. 56–92.
[12] M. Desainte-Catherine and S. Marchand, “High-precision Fourier analysis of sounds using signal derivatives,” J. Audio Eng. Soc., vol. 48, no. 7/8, pp. 654–667, Jul./Aug. 2000.


[13] D. P. Nicolalde R. and J. A. Apolinário, Jr., “Evaluating digital audio authenticity with spectral distances and ENF phase change,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP’09), Taipei, Taiwan, Apr. 2009.
[14] D. P. Nicolalde R., J. A. Apolinário, Jr., and L. W. Biscainho, “Autenticação de áudio digital com base na mudança de fase da frequência da rede elétrica,” in Proc. XXVII Brazilian Telecommunications Symp. (SBrT’09) (in Portuguese), Blumenau, Brazil, Sep. 2009.
[15] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, “The DET curve in assessment of detection task performance,” in Proc. Eur. Conf. Speech Communication and Technology, Rhodes, Greece, Sep. 1997.
[16] J. Ortega-García, J. González-Rodríguez, and V. Marrero-Aguiar, “AHUMADA, a large speech corpus in Spanish for speaker characterization and identification,” Elsevier Speech Commun., vol. 31, pp. 255–264, Jun. 2000.
[17] A. Benyassine et al., “ITU-T recommendation G.729 Annex B: A silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications,” IEEE Commun. Mag., vol. 35, no. 9, pp. 64–73, Sep. 1997.

Daniel Patricio Nicolalde Rodríguez was born in Quito, Ecuador, in 1982. He completed his undergraduate education in electronics and telecommunications engineering at the Escuela Politécnica del Ejército (ESPE), Quito, Ecuador, in 2007. He has been a graduate student at the Military Institute of Engineering (IME), Rio de Janeiro, Brazil, since 2008.
His professional interests include digital signal processing, speech and audio processing, as well as dynamic programming.

José Antonio Apolinário, Jr. (S’95–M’99–SM’04) was born in Taubaté, Brazil, in 1960. He graduated from the Military Academy of Agulhas Negras (AMAN), Resende, Brazil, in 1981 and received the B.Sc. degree from the Military Institute of Engineering (IME), Rio de Janeiro, Brazil, in 1988, the M.Sc. degree from the University of Brasília (UnB), Brasília, Brazil, in 1993, and the D.Sc. degree from the Federal University of Rio de Janeiro (COPPE/UFRJ), Rio de Janeiro, Brazil, in 1998, all in electrical engineering.
He is currently an Adjoint Professor with the Department of Electrical Engineering, IME, where he has already served as the Head of Department and as the Vice-Rector for Study and Research. He was a Visiting Professor at the Escuela Politécnica del Ejército (ESPE), Quito, Ecuador, from 1999 to 2000 and a Visiting Researcher and twice a Visiting Professor at Helsinki University of Technology (HUT), Finland, in 1997, 2004, and 2006, respectively. His research interests comprise many aspects of linear and nonlinear digital signal processing, including adaptive filtering, speech, and array processing. He has recently edited the book QRD-RLS Adaptive Filtering (New York: Springer, 2009).
Dr. Apolinário has organized and been the first Chair of the Rio de Janeiro Chapter of the IEEE Communications Society.

Luiz Wagner Pereira Biscainho (S’95–M’03) was born in 1962, in Rio de Janeiro, Brazil. He received the B.Sc. degree (magna cum laude) in electronics engineering and the M.Sc. and D.Sc. degrees, both in electrical engineering, all from the Federal University of Rio de Janeiro (UFRJ), Rio de Janeiro, Brazil, in 1985, 1990, and 2000, respectively.
He is with the Department of Electronics Engineering (DEL/UFRJ) and the Program of Electrical Engineering (COPPE/UFRJ) as an Associate Professor. His research area is digital signal processing, particularly audio processing and adaptive systems.
Besides the IEEE, Dr. Biscainho is an active member of the Audio Engineering Society (AES) and of the Brazilian Telecommunications Society (SBrT).
