
2015 IEEE Global Conference on Signal and Information Processing (GlobalSIP)

Speech Enhancement in Modulation Domain Using Codebook-based Speech and Noise Estimation

Vidhyasagar Mani and Benoit Champagne, Dept. of Electrical and Computer Engineering, McGill University, 3480 University St., Montreal, Quebec, Canada, H3A 0E9 ([email protected], [email protected])
Wei-Ping Zhu, Dept. of Electrical and Computer Engineering, Concordia University, 1455 Maisonneuve Blvd. West, Montreal, Quebec, Canada, H3G 1M8 ([email protected])

Abstract—Conventional single-channel speech enhancement methods implement the analysis-modification-synthesis (AMS) framework in the acoustic frequency domain. In recent years, it has been shown that the extension of this framework to the modulation frequency domain may result in better noise suppression. However, this conclusion has been reached by relying on a minimum statistics approach for the required noise power spectral density (PSD) estimation, which is known to create a time frame lag when the noise is non-stationary. In this paper, to avoid this problem, we perform noise suppression in the modulation domain with speech and noise power spectra obtained from a codebook-based estimation approach. The PSD estimates derived from the codebook approach are used to obtain a minimum mean square error (MMSE) estimate of the clean speech modulation magnitude spectrum, which is combined with the phase spectrum of the noisy speech to recover the enhanced speech signal. Results of objective evaluations indicate improvement in noise suppression with the proposed codebook-based speech enhancement approach, particularly in cases of non-stationary noise.¹

Index Terms—Speech enhancement, modulation domain, MMSE estimation, LPC codebooks

I. INTRODUCTION

Speech enhancement involves the suppression of background noise from a desired speech signal while ensuring that the incurred distortion remains within a tolerable limit. Some of the most commonly used single-channel speech enhancement methods include spectral subtraction [1], [2], Wiener filtering [3], and MMSE short-time spectral amplitude (STSA) estimation [4], [5]. These methods typically implement the following three-stage framework known as AMS [6], [7]: (1) Analysis, in which the short-time Fourier transform (STFT) is applied to successive frames of the noisy speech signal; (2) Modification, where the spectrum of the noisy speech is altered to achieve noise suppression; and (3) Synthesis, where the enhanced speech is recovered via inverse STFT and overlap-add (OLA) synthesis.

In past years, research has shown that extension of this framework into the modulation domain may result in improved noise suppression and better speech quality [8], [9]. For instance, in the case of spectral subtraction, musical noise distortion is less pronounced when the subtraction is performed in the modulation domain than in the conventional frequency domain [8]. Extension of the MMSE-STSA estimator to the modulation domain, in the form of the modulation magnitude estimator (MME) [9], has also shown positive results. The interest in this framework extension is further motivated by physiological evidence [10]–[12], which underlines the significance of modulation domain information in speech analysis.

Most speech enhancement algorithms, including those operating in the modulation domain, require an estimate of the background noise PSD, which is typically obtained via a minimum statistics (MS) approach [13]. Minimum statistics and its offshoots [14], [15] assume that the background noise exhibits semi-stationary behaviour (i.e., slowly changing statistics) while performing the estimation. This may not be the case in acoustic environments with a rapidly changing background, e.g., a street intersection with passing vehicles or a busy airport terminal. In such cases, the noise PSD cannot be tracked properly and speech enhancement algorithms may perform poorly.

Codebook-based approaches [16]–[20], which fit under the general category of unsupervised learning [21], try to overcome this limitation by estimating the noise parameters based on a priori knowledge about different speech and noise types. In these approaches, joint estimation of the speech and noise PSD is performed on a frame-by-frame basis by exploiting a priori information stored in the form of trained codebooks of short-time parameter vectors. Examples of such parameters include gain-normalized linear predictive (LP) coefficients [16]–[19] and cepstral coefficients [20].

The use of these codebook methods in the acoustic AMS framework has shown promising results in the enhancement of speech corrupted by non-stationary noise. However, to the best of our knowledge, they have not yet been applied to the modulation domain framework. In this work, we conjecture that codebook methods can bring similar benefits to the enhancement of noisy speech in the modulation domain by providing more accurate estimation of the noise PSD in non-stationary environments, and we validate this hypothesis experimentally.

Specifically, the new speech enhancement method that we propose in this paper incorporates codebook-assisted noise and speech PSD estimation into the modulation domain framework. We use codebooks of linear prediction coefficients and gains obtained by training with the Linde-Buzo-Gray (LBG) algorithm [22]. The PSD estimates derived from the codebook approach are used to calculate a gain function based on the MMSE criterion [9], which is applied to the modulation magnitude spectrum of the noisy speech in order to suppress noise. Results of objective evaluations indicate improvement in noise suppression with the proposed codebook-based speech enhancement method, especially in cases of non-stationary noise.

¹ Funding for this work was provided by a CRD grant from the Natural Sciences and Engineering Research Council of Canada under sponsoring from Microsemi Corporation (Ottawa, Canada).

II. ACOUSTIC VERSUS MODULATION DOMAIN PROCESSING

A. AMS in the Acoustic Frequency Domain

Conventional speech enhancement methods implement the AMS framework in the acoustic frequency domain, where the acoustic frequency spectrum of a speech signal is defined by its STFT.

To this end, an additive noise model is assumed, i.e.,

x[n] = s[n] + d[n]   (1)

where x[n], s[n] and d[n] refer to the noisy speech, clean speech and noise signals, respectively, while n ∈ Z is the discrete-time index. STFT analysis of (1) results in

X(ν, k) = S(ν, k) + D(ν, k)   (2)

where X(ν, k), S(ν, k) and D(ν, k) refer to the STFTs of the noisy speech, clean speech and noise signals, respectively, and where k is the discrete acoustic frequency index. The STFT X(ν, k) is obtained from

X(ν, k) = Σ_{l=−∞}^{∞} x(l) w(νF − l) e^{−j2πkl/N}   (3)

where w(l) is a windowing function of duration N samples and F is the frame advance. In this work, the Hamming window is used for this purpose [7]. The STFT of a signal is represented by its acoustic magnitude and phase spectra as

X(ν, k) = |X(ν, k)| e^{j∠X(ν,k)}   (4)

Speech enhancement methods, such as spectral subtraction [1] or MMSE-STSA [4], implement the modification part of the AMS framework by modifying the noisy magnitude spectrum whilst retaining the phase spectrum. Synthesis of the enhanced signal is performed by inverse STFT followed by OLA synthesis.
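To make the three-stage pipeline concrete, here is a minimal sketch of acoustic-domain AMS processing built on SciPy's STFT routines. It is illustrative rather than the authors' implementation: `gain_fn` is a hypothetical caller-supplied modification stage and the frame sizes are placeholder values.

```python
import numpy as np
from scipy.signal import stft, istft

def ams_enhance(x, fs, gain_fn, frame_len=512, frame_adv=64):
    """Analysis-modification-synthesis in the acoustic frequency domain."""
    # Analysis: STFT of the noisy signal (Hamming window, as in the paper)
    _, _, X = stft(x, fs=fs, window='hamming', nperseg=frame_len,
                   noverlap=frame_len - frame_adv)
    # Modification: scale the magnitude spectrum, keep the noisy phase
    mag, phase = np.abs(X), np.angle(X)
    S_hat = gain_fn(mag) * mag * np.exp(1j * phase)
    # Synthesis: inverse STFT with overlap-add
    _, s_hat = istft(S_hat, fs=fs, window='hamming', nperseg=frame_len,
                     noverlap=frame_len - frame_adv)
    return s_hat
```

A crude spectral-subtraction stage could be supplied as, e.g., `gain_fn=lambda mag: np.maximum(1.0 - noise_mag / np.maximum(mag, 1e-12), 0.1)` for some noise magnitude estimate `noise_mag`.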
B. Modulation Domain Enhancement

The calculation of the short-time modulation spectrum involves performing STFT analysis on the time trajectories of the individual acoustic frequency components of the signal STFT. The magnitude spectrum of the noisy speech in each acoustic frequency bin, i.e., |X(ν, k)|, is first windowed and then Fourier transformed again, resulting in

Z(t, k, m) = Σ_{ν=−∞}^{∞} |X(ν, k)| w_M(tF_M − ν) e^{−j2πνm/M}   (5)

where w_M(ν) is the so-called modulation window of length N_M, m ∈ {0, ..., M − 1} is the modulation frequency index, t is the modulation time-frame index, and F_M is the frame advance in the modulation domain. The resulting modulation spectrum can be expressed in polar form as

Z(t, k, m) = |Z(t, k, m)| e^{j∠Z(t,k,m)}   (6)

where |Z(t, k, m)| is the modulation magnitude spectrum and ∠Z(t, k, m) is the modulation phase spectrum.

Speech enhancement in the modulation domain involves spectral modification of the modulation magnitude spectrum while retaining the phase spectrum,

Ŝ(t, k, m) = G(t, k, m) Z(t, k, m)   (7)

where G(t, k, m) > 0 is a processing gain. Following this operation, the enhanced time-domain signal is recovered by applying inverse STFT and OLA operations twice. Previous works [8], [9] suggest that enhancement approaches applied in the modulation domain perform better than their traditional acoustic domain counterparts. In this work, the MMSE estimator of the modulation magnitude spectrum, also known as MME [9], will be used as a basis for developing the proposed codebook-based speech enhancement method.
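The second-stage transform (5) can be computed by applying an STFT along the frame axis of the magnitude spectrogram, one acoustic bin at a time. A minimal sketch, assuming placeholder modulation-frame parameters (`mod_len` and `mod_adv` are ours, not the paper's settings):

```python
import numpy as np
from scipy.signal import stft

def modulation_spectrum(X_mag, mod_len=32, mod_adv=8):
    """Short-time modulation spectrum Z(t, k, m) of a magnitude
    spectrogram X_mag with shape (num_bins, num_frames), as in (5)."""
    # STFT along the frame axis: each row (one acoustic bin's magnitude
    # trajectory) is windowed and Fourier transformed again
    _, _, Z = stft(X_mag, window='hamming', nperseg=mod_len,
                   noverlap=mod_len - mod_adv, axis=-1)
    return Z  # shape: (num_bins, mod_len // 2 + 1, num_mod_frames)
```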
III. CODEBOOK-BASED SPEECH AND NOISE ESTIMATION

A. Overview

Various noise estimation algorithms are available in the literature to estimate the background noise PSD needed to perform noise suppression in speech enhancement. In algorithms based on minimum statistics [13], [14], which are widely applied, the noise PSD is updated by tracking the minima of a smoothed version of |X(ν, k)|² within a finite window. Tracking the minimum power in this way results in a frame lag in the estimated PSD. This lag can lead to highly inaccurate results in the case of non-stationary noise.

The basis for the codebook-based speech and noise PSD estimation approach in [17]–[20] is the observation that the spectra of speech and of different noise classes can be approximately described by a few representative model spectra. These spectra are stored in finite codebooks as quantized vectors of short-time parameters (e.g., LP coefficients) and serve as the a priori knowledge of the respective signals. The use of a priori information about noise eliminates the dependence on buffers of past data. This makes the estimation robust to spectral variations in non-stationary noise conditions [16].
B. PSD Model

For the additive noise model (1), under the assumption of uncorrelated speech and noise signals, the PSD of the noisy speech can be represented as

P_xx(ω) = P_ss(ω) + P_dd(ω),  ω ∈ [0, 2π)   (8)

where P_ss(ω) and P_dd(ω) are the clean speech and background noise PSD, respectively, and ω ∈ [0, 2π) is the normalized angular frequency. The PSD shape of a signal y[n], where y ∈ {s, d} stands for either the speech or the noise, can be modelled in terms of its LP coefficients and corresponding excitation variance as

P_yy(ω) = g_y P̄_yy(ω)   (9)

where P̄_yy(ω) is the gain-normalized spectral envelope and g_y is the excitation gain (or variance). The former is given by

P̄_yy(ω) = |1 + Σ_{k=1}^{p} a_k^y e^{−jωk}|^{−2}   (10)

where {a_k^y}_{k=1}^{p} are the LP coefficients, represented here by the vector θ_y = [a_1^y, ..., a_p^y], and p is the chosen model order.
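Evaluating (9)-(10) on a uniform grid reduces to a zero-padded DFT of the LP polynomial followed by an inverse squared magnitude. A small sketch, under the assumption of a hypothetical grid size `n_freq`:

```python
import numpy as np

def lp_envelope(lp_coeffs, gain, n_freq=257):
    """Gain-shaped LP power spectrum on n_freq points of [0, pi],
    P_yy(w) = g_y * |1 + sum_k a_k^y e^{-jwk}|^{-2}, as in (9)-(10)."""
    # A(e^{jw}) = 1 + a_1 e^{-jw} + ... + a_p e^{-jwp}, zero-padded DFT
    a = np.concatenate(([1.0], np.asarray(lp_coeffs)))
    A = np.fft.rfft(a, 2 * (n_freq - 1))
    return gain / np.maximum(np.abs(A) ** 2, 1e-12)  # guard against nulls
```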

C. Codebook Generation

In this work, two different codebooks of short-time spectral parameters, one for the speech and the other for the noise, are generated from training data comprised of multiple speaker signals and different noise types. The codebook generation comprises the following steps: segmentation of the training speech and noise data into frames of 20-40 ms duration; computation of the LP coefficients {a_k^y}_{k=1}^{p} for each frame; and vector quantization of the LP coefficient vectors θ_y using the LBG algorithm to obtain the required codebook [22]. The LBG algorithm forms a set of median cluster vectors which best represent the given input set of LP coefficient vectors. Optimal values have to be chosen empirically for the sizes of the speech and noise codebooks, considering the trade-off between PSD estimation accuracy and complexity. In the sequel, we represent the speech and noise codebooks so obtained as {θ_s^i}_{i=1}^{N_s} and {θ_d^j}_{j=1}^{N_d}, where the vectors θ_s^i and θ_d^j are the corresponding i-th and j-th codebook entries, and N_s and N_d are the codebook sizes, respectively.

In addition to the codebook vectors generated from training on noise data, the noise codebook is supplemented by one extra vector during the estimation phase. The latter is updated for every frame based on a noise PSD estimate obtained using an MS method [13], [14]. This provides robustness in dealing with noise types which may not be present in the training set.
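A common formulation of LBG training [22] proceeds by binary splitting of centroids followed by Lloyd iterations, as sketched below. It is a simplified stand-in: Euclidean distance replaces the spectral distortion measures usually paired with LP vectors, and `eps` and the iteration count are assumptions.

```python
import numpy as np

def lbg_codebook(vectors, bits, eps=0.01, n_iter=20):
    """Train a 2**bits entry codebook from LP coefficient vectors
    (shape (num_frames, p)) by LBG binary splitting."""
    codebook = vectors.mean(axis=0, keepdims=True)  # start from global mean
    while len(codebook) < 2 ** bits:
        # Split every centroid into a slightly perturbed pair
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):  # Lloyd iterations at this codebook size
            dist = np.linalg.norm(vectors[:, None, :] - codebook[None], axis=2)
            nearest = dist.argmin(axis=1)
            for c in range(len(codebook)):
                members = vectors[nearest == c]
                if len(members):  # keep the old centroid if its cell is empty
                    codebook[c] = members.mean(axis=0)
    return codebook
```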
D. Gain Adaptation

Each codebook entry, i.e., θ_s^i or θ_d^j, can be used to compute a corresponding gain-normalized spectral envelope, respectively P̄_ss^i(ω) or P̄_dd^j(ω), by means of relation (10). To obtain the final PSD shape as in (9), however, the resulting envelope needs to be scaled by a corresponding excitation gain, which we denote as g_s^i and g_d^j, respectively. In this work, we use an adaptive approach whereby the excitation gains for the speech and noise codebooks are updated every frame based on the observed noisy speech magnitudes |X(ν, k)|.

Specifically, for every possible combination of vectors θ_s^i and θ_d^j from the speech and noise codebooks, respectively, the corresponding gains g_s^i and g_d^j at the ν-th frame are obtained by minimizing the Itakura-Saito distance measure between an estimated PSD and the squared magnitude spectrum |X(ν, k)|² of the noisy speech over the frequency domain. In this calculation, the estimated PSD is defined as the sum of the gain-adapted speech and noise envelopes, i.e.,

P_xx^{ij}(ω) = g_s^i P̄_ss^i(ω) + g_d^j P̄_dd^j(ω).   (11)

The final optimum values of g_s^i and g_d^j, which can be interpreted as conditional ML estimates, are approximated as in [18].
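The closed-form gain approximation of [18] is not reproduced here; as an illustrative stand-in, the per-pair gains can be obtained by a nonnegative least-squares fit of the two envelopes to the noisy periodogram:

```python
import numpy as np
from scipy.optimize import nnls

def adapt_gains(noisy_psd, env_s, env_d):
    """Per-frame excitation gains for one (theta_s^i, theta_d^j) pair.

    Fits g_s * env_s + g_d * env_d to |X(nu, k)|^2 in the nonnegative
    least-squares sense; an illustrative substitute for the conditional
    ML Itakura-Saito approximation of [18].
    """
    A = np.stack([env_s, env_d], axis=1)   # (n_freq, 2) design matrix
    (g_s, g_d), _ = nnls(A, noisy_psd)
    return g_s, g_d
```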
E. Joint PSD Estimation

The joint estimation of the speech and noise PSD is done on a frame-by-frame basis. Let θ = [θ_s, θ_d, g_s, g_d] denote the vector of unknown parameters to be estimated, from which the speech and noise PSD can be determined through (9)-(10). Following [19], we adopt an MMSE framework for the estimation of the parameter vector θ. This framework makes it possible to simultaneously estimate the LP coefficients (and excitation gains) of two linear processes that additively overlap with each other.

To this end, the noisy speech signal x[n] in (1) is assumed to follow a multivariate normal distribution when conditioned on θ,

p(x|θ) = (2π)^{−N/2} det(R_xx)^{−1/2} exp(−(1/2) x^T R_xx^{−1} x)   (12)

where x = [x[νF + 1], ..., x[νF + N]]^T is the observed data vector at the ν-th frame and R_xx = E{xx^T} is the associated covariance matrix. Under the previous modeling assumptions, the latter can be written as the sum of the speech and noise covariance matrices, i.e., R_xx = R_ss + R_dd. In turn, R_ss and R_dd are functions of the corresponding LP coefficients and excitation gains, as in R_ss = g_s (A_s^T A_s)^{−1}, where A_s is an N × N Toeplitz lower triangular matrix derived from θ_s.

The expression for the conditional distribution p(x|θ) in (12) involves a matrix inversion, which is computationally expensive. For a simpler and less time-consuming computation, the covariance matrices R_ss and R_dd can be approximated as circulant matrices [17], thereby reducing (12) to

ln p(x|θ) ≈ −(N/2) ln 2π − (1/2) Σ_{k=0}^{N−1} [ ln(g_s P̄_ss(ω_k) + g_d P̄_dd(ω_k)) + |X(ν, ω_k)|² / (g_s P̄_ss(ω_k) + g_d P̄_dd(ω_k)) ]   (13)

where ω_k = 2πk/N. Equation (13) is a reasonable approximation of (12) for large frame sizes N.

With the help of the estimated excitation gains at the ν-th frame, we can define for each pair of speech and noise codebook vectors θ_s^i and θ_d^j a complete codebook-based parameter vector θ^{ij} = [θ_s^i, θ_d^j, g_s^i, g_d^j]. The joint MMSE estimation of the unknown parameter vector θ is implemented by carrying out numerical integration over the product codebook of vectors θ^{ij} so obtained, as given by [19]:

θ̂_MMSE ≈ (1/(N_s N_d)) Σ_{i=1}^{N_s} Σ_{j=1}^{N_d} θ^{ij} p(x|θ^{ij}) / p(x)   (14)

p(x) ≈ (1/(N_s N_d)) Σ_{i=1}^{N_s} Σ_{j=1}^{N_d} p(x|θ^{ij}).   (15)

These equations provide a fair approximation to the MMSE estimate under the assumptions that the codebook is sufficiently large and that the unknown parameter vector θ is uniformly distributed.
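In PSD terms, (13)-(15) amount to scoring every codebook pair with a log-likelihood and forming likelihood-weighted averages. The sketch below averages per-pair PSDs rather than raw parameter vectors, which is our implementation choice rather than something prescribed by the paper; the envelope and gain arrays are assumed to come from the previous sketches.

```python
import numpy as np
from scipy.special import logsumexp

def codebook_mmse_psd(noisy_psd, envs_s, envs_d, gains_s, gains_d):
    """Frame-wise MMSE speech/noise PSD estimates over the product codebook.

    envs_s: (Ns, K) speech envelopes, envs_d: (Nd, K) noise envelopes,
    gains_s, gains_d: (Ns, Nd) adapted excitation gains per pair.
    """
    P_s = gains_s[:, :, None] * envs_s[:, None, :]   # (Ns, Nd, K)
    P_d = gains_d[:, :, None] * envs_d[None, :, :]   # (Ns, Nd, K)
    P_x = P_s + P_d                                  # modelled noisy PSD
    # log p(x | theta_ij) up to an additive constant, per (13)
    ll = -0.5 * (np.log(P_x) + noisy_psd / P_x).sum(axis=-1)
    w = np.exp(ll - logsumexp(ll))                   # normalized weights, (14)-(15)
    speech_psd = (w[:, :, None] * P_s).sum(axis=(0, 1))
    noise_psd = (w[:, :, None] * P_d).sum(axis=(0, 1))
    return speech_psd, noise_psd
```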
IV. INCORPORATION OF CODEBOOK-BASED PSD INTO THE MODULATION MAGNITUDE ESTIMATOR

The MME method [9] is an extension of the widely used acoustic domain MMSE spectral amplitude estimator [4] into the modulation domain. In the MME method, the clean speech modulation magnitude spectrum is estimated from the noisy speech by minimizing the mean square error, denoted as E, between the clean and estimated speech, i.e.,

E = E[(|S(t, k, m)| − |Ŝ(t, k, m)|)²]   (16)

where |S(t, k, m)| and |Ŝ(t, k, m)| denote the modulation magnitude spectra of the clean and estimated speech, respectively. Using this MMSE criterion, the modulation magnitude spectrum of the clean speech can be estimated from the noisy speech as

|Ŝ(t, k, m)| = G(t, k, m) |Z(t, k, m)|   (17)

where G(t, k, m) is the MME spectral gain function and Z(t, k, m) is the modulation spectrum of the noisy speech from (5). The MME gain function is given by [9]

G(t, k, m) = (√(πν) / (2γ)) exp(−ν/2) [(1 + ν) I₀(ν/2) + ν I₁(ν/2)]   (18)

where I₀(·) and I₁(·) denote the modified Bessel functions of order zero and one, respectively, and the parameter ν ≡ ν(t, k, m) = (ξ / (1 + ξ)) γ is defined in terms of the a priori and a posteriori SNRs ξ and γ.

It is precisely in the calculation of these SNR parameters that we make use of the codebook-based PSD estimates. In this work, the a posteriori SNR is estimated as

γ̂(t, k, m) = |Z(t, k, m)|² / |D̂(t, k, m)|²   (19)

where |D̂(t, k, m)|² is an estimate of the noise power in the modulation domain. This quantity is obtained by applying the STFT (over the frame index ν) to the square root of the codebook-based noise PSD estimate, and then squaring the result. Specifically,

D̂(t, k, m) = Σ_ν √(P_dd(ν, k)) w_M(tF_M − ν) e^{−j2πνm/M}   (20)

where P_dd(ν, k) is the noise PSD estimate obtained at the ν-th frame through codebook-based MMSE estimation.

To reduce spectral distortion, the following "decision-directed" approach is employed to obtain the value of the a priori SNR,

ξ̂(t, k, m) = α |Ŝ(t−1, k, m)|² / |D̂(t−1, k, m)|² + (1 − α) |C(t, k, m)|² / |D̂(t, k, m)|²   (21)

where |C(t, k, m)|² is an estimate of the clean speech power in the modulation domain and 0 < α < 1 is a control factor which acts as a trade-off between noise reduction and speech distortion. Similar to (20), C(t, k, m) is obtained by applying the STFT to the square root of P_ss(ν, k), i.e., the codebook-based PSD estimate of the clean speech at the ν-th frame.
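Equation (21) transcribes directly; the small guard constant below is ours, added to avoid division by zero:

```python
import numpy as np

def a_priori_snr(S_prev, D_prev, C_cur, D_cur, alpha=0.95):
    """Decision-directed a priori SNR of (21) for one modulation frame.

    S_prev: enhanced magnitudes |S_hat(t-1, k, m)|; D_prev, D_cur:
    codebook-based noise terms |D_hat| at frames t-1 and t; C_cur:
    codebook-based clean speech term |C(t, k, m)|.
    """
    eps = 1e-12  # numerical guard, not part of (21)
    return (alpha * np.abs(S_prev) ** 2 / np.maximum(np.abs(D_prev) ** 2, eps)
            + (1.0 - alpha) * np.abs(C_cur) ** 2
            / np.maximum(np.abs(D_cur) ** 2, eps))
```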
The estimated modulation magnitude spectrum |Ŝ(t, k, m)| in (17) is transformed to the acoustic frequency domain by applying an inverse STFT followed by OLA synthesis. The resulting spectrum is combined with the phase spectrum of the noisy speech to obtain the enhanced speech spectrum. The latter is mapped back to the time domain by performing an inverse STFT followed by OLA synthesis.

V. EXPERIMENTAL EVALUATION

In this section we describe objective evaluation experiments that were performed to assess the performance of the proposed algorithm, referred to as codebook-based MME (CB-MME). Other enhancement methods, including the acoustic domain MMSE-STSA [4] and the modulation domain MME [9], were also evaluated for comparison.

A. Methodology

Speech utterances of two male and two female speakers from the TSP [23] and TIMIT databases were used for conducting the experiments, along with different types of noise samples from the NoiseX92 [24] and Sound Jay [25] databases, including babble, street and restaurant noise. In addition, a non-stationary (i.e., amplitude-modulated) Gaussian white noise was also considered. All the speech and noise files were uniformly sampled at a rate of 16 kHz. The LP coefficient order p was set to 10 for both the speech and noise codebooks. A 7-bit speech codebook was trained with 7.5 minutes of clean speech from the above-mentioned sources (i.e., 55 short sentences for each speaker). A 4-bit noise codebook was trained using over 1 minute of noise data from the available databases (i.e., about 15 s for each noise type). For the testing, i.e., the objective evaluation of the various algorithms, noisy speech files were generated by adding scaled segments of noise to the clean speech. For each speaker, 3 sentences were selected and combined with the four different types of noise, properly scaled to obtain the desired SNR values of 0 and 5 dB. The speech and noise samples used for testing were different from those used to train the two codebooks.

Fine tuning of parameters is crucial for the performance of the proposed enhancement method. The acoustic frame duration was chosen to be 32 ms, while the values of the other analysis parameters were chosen empirically as follows: acoustic frame advance F = 4 ms, modulation frame duration N_M = 80, modulation frame advance F_M = 8 ms and control factor α = 0.95.
For the objective evaluation of the enhanced speech, we used the perceptual evaluation of speech quality (PESQ) and the segmental SNR (SegSNR) as performance measures. PESQ [26] is widely used for the automated assessment of speech quality as experienced by a listener, where higher PESQ values indicate better speech quality. SegSNR is defined as the average SNR calculated over short segments of speech; higher SegSNR values indicate less background noise.
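For reference, SegSNR can be computed as sketched below; the frame length and the per-frame clamp to [-10, 35] dB are common conventions that we assume here, since the paper does not state its exact settings.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=512, lo=-10.0, hi=35.0):
    """Average segmental SNR (dB) between clean and enhanced signals."""
    n = min(len(clean), len(enhanced)) // frame_len * frame_len
    c = clean[:n].reshape(-1, frame_len)
    e = enhanced[:n].reshape(-1, frame_len)
    per_frame = 10.0 * np.log10(
        (c ** 2).sum(axis=1) / (((c - e) ** 2).sum(axis=1) + 1e-12) + 1e-12)
    return float(np.clip(per_frame, lo, hi).mean())
```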

B. Results & Discussion

TABLE I: PESQ values

| Noise      | Input SNR | Noisy | MMSE | MME  | CB-MME |
|------------|-----------|-------|------|------|--------|
| NS-white   | 0 dB      | 1.75  | 1.78 | 2.04 | 2.24   |
| NS-white   | 5 dB      | 2.06  | 2.19 | 2.46 | 2.58   |
| Street     | 0 dB      | 1.72  | 1.85 | 1.95 | 2.07   |
| Street     | 5 dB      | 2.01  | 2.17 | 2.30 | 2.40   |
| Restaurant | 0 dB      | 1.78  | 1.84 | 1.87 | 2.04   |
| Restaurant | 5 dB      | 2.13  | 2.20 | 2.27 | 2.37   |
| Babble     | 0 dB      | 1.67  | 1.83 | 1.93 | 2.07   |
| Babble     | 5 dB      | 2.04  | 2.19 | 2.30 | 2.43   |

TABLE II: Segmental SNR values (dB)

| Noise      | Input SNR | Noisy | MMSE  | MME   | CB-MME |
|------------|-----------|-------|-------|-------|--------|
| NS-white   | 0 dB      | -2.02 | -1.19 | 0.57  | 1.63   |
| NS-white   | 5 dB      | 1.55  | 2.60  | 3.75  | 5.04   |
| Street     | 0 dB      | -2.75 | -0.96 | 0.47  | 1.09   |
| Street     | 5 dB      | 0.72  | 1.35  | 1.91  | 2.94   |
| Restaurant | 0 dB      | -2.44 | -2.31 | -0.59 | 0.71   |
| Restaurant | 5 dB      | 1.14  | 1.43  | 2.07  | 3.67   |
| Babble     | 0 dB      | -3.02 | -2.24 | -0.85 | 0.47   |
| Babble     | 5 dB      | 0.84  | 1.28  | 2.36  | 3.16   |

The PESQ and SegSNR results for the different noises at SNRs of 0 and 5 dB are reported in Tables I and II, respectively. It can be seen that the proposed CB-MME method performs better than the MME and MMSE methods for both performance metrics under consideration. Results for other SNRs and noise types (not shown) show a similar trend. Informal listening tests concur with the objective results. The proposed CB-MME method seems to suppress non-stationary elements of the background noise better than MMSE and MME, at the expense of some slight distortion in the enhanced speech. This is mainly due to the use of a codebook-based approach, which performs on-line noise PSD estimation on a frame-by-frame basis from the current observation, as opposed to the MS approach used in the MMSE and MME algorithms, which relies on a long buffer of past frames. The slight distortion could be caused by the spectral mismatch between the codebook-based speech PSD estimate and the actual one, which remains a topic for future study.

VI. CONCLUSION

In this paper, we have proposed a new speech enhancement method that performs noise suppression in the modulation domain with speech and noise PSD obtained from a codebook-based estimation approach. We use codebooks of linear prediction coefficients and gains obtained by training with the LBG algorithm. The PSD estimates derived from the codebooks were used to calculate an MMSE gain function, which was applied to the modulation magnitude spectrum of the noisy speech in order to suppress noise. Results of objective evaluation showed improvements in the suppression of non-stationary noise with the proposed CB-MME approach.
REFERENCES

[1] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Process., vol. 27, pp. 113-120, Apr. 1979.
[2] N. Virag, "Single channel speech enhancement based on masking properties of the human auditory system," IEEE Trans. Speech Audio Process., vol. 7, pp. 126-137, Mar. 1999.
[3] J. Chen, J. Benesty, Y. Huang, "New insights into the noise reduction Wiener filter," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, pp. 1218-1234, Jul. 2006.
[4] Y. Ephraim, D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. 32, pp. 1109-1121, Dec. 1984.
[5] E. Plourde, B. Champagne, "Generalized Bayesian estimators of the spectral amplitude for speech enhancement," IEEE Signal Process. Letters, vol. 16, pp. 485-488, Jun. 2009.
[6] D. Griffin, J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 2, pp. 236-243, Apr. 1984.
[7] T. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, Prentice Hall, 2002.
[8] K. Paliwal, K. Wojcicki, B. Schwerin, "Single-channel speech enhancement using spectral subtraction in the short-time modulation domain," Speech Commun., vol. 52, no. 5, pp. 450-475, May 2010.
[9] K. Paliwal, B. Schwerin, K. Wojcicki, "Speech enhancement using minimum mean-square error short-time spectral modulation magnitude estimator," Speech Commun., vol. 54, no. 2, pp. 282-305, Feb. 2012.
[10] L. Atlas, S. Shamma, "Joint acoustic and modulation frequency," EURASIP J. on Applied Signal Process., vol. 7, pp. 668-675, Jan. 2003.
[11] A. I. Shim, B. G. Berg, "Estimating critical bandwidths of temporal sensitivity to low-frequency amplitude modulation," J. Acoustical Society of America, vol. 133, no. 5, pp. 2834-2838, May 2013.
[12] K. Paliwal, B. Schwerin, "Modulation processing for speech enhancement," Chap. 10 in T. Ogunfunmi, R. Togneri and M. Narasimha, Eds., Speech and Audio Processing for Coding, Enhancement and Recognition, Springer, 2015.
[13] R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Trans. Speech Audio Process., vol. 9, no. 5, pp. 504-512, Jul. 2001.
[14] I. Cohen, "Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging," IEEE Trans. Speech Audio Process., vol. 11, pp. 466-475, Sep. 2003.
[15] V. Stahl, A. Fischer, R. Bippus, "Quantile based noise estimation for spectral subtraction and Wiener filtering," Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Process., vol. 3, pp. 1875-1878, Jun. 2000.
[16] S. Srinivasan, J. Samuelsson, W. B. Kleijn, "Speech enhancement using a-priori information," Proc. Eurospeech, pp. 1405-1408, Sep. 2003.
[17] M. Kuropatwinski, W. B. Kleijn, "Estimation of the short-term predictor parameters of speech under noisy conditions," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 5, pp. 1645-1655, Sep. 2006.
[18] S. Srinivasan, J. Samuelsson, W. B. Kleijn, "Codebook driven short-term predictor parameter estimation for speech enhancement," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 1, pp. 163-176, Jan. 2006.
[19] S. Srinivasan, J. Samuelsson, W. B. Kleijn, "Codebook-based Bayesian speech enhancement for nonstationary environments," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 2, pp. 441-452, Feb. 2007.
[20] T. Rosenkranz, "Modeling the temporal evolution of LPC parameters for codebook-based speech enhancement," Proc. Int. Symp. on Image and Signal Process. and Analysis, Salzburg, pp. 455-460, Sep. 2009.
[21] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, 2nd Ed., Springer, 2009.
[22] Y. Linde, A. Buzo, R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Communications, vol. 28, no. 1, pp. 84-95, Jan. 1980.
[23] P. Kabal, "TSP speech database," McGill University, Tech. Rep., 2002.
[24] Rice University, "Signal processing information base: noise data." Available online: http://spib.rice.edu/spib/select_noise.html.
[25] Sound Jay, "Ambient and special sound effects." Available online: http://www.soundjay.com/ambient-sounds-2.html.
[26] ITU-T Rec. P.862, "Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs," Tech. Rep., 2000.
[27] E. Vincent, R. Gribonval, C. Fevotte, "Performance measurement in blind audio source separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1462-1469, Jul. 2006.
