Speech Enhancement in Modulation Domain Using Codebook-Based Speech and Noise Estimation

Abstract—Conventional single-channel speech enhancement methods implement the analysis-modification-synthesis (AMS) framework in the acoustic frequency domain. In recent years, it has been shown that extending this framework to the modulation frequency domain may result in better noise suppression. However, this conclusion has been reached by relying on a minimum statistics approach for the required noise power spectral density (PSD) estimation, which is known to introduce a time-frame lag when the noise is non-stationary. In this paper, to avoid this problem, we perform noise suppression in the modulation domain with speech and noise power spectra obtained from a codebook-based estimation approach. The PSD estimates derived from the codebook approach are used to obtain a minimum mean square error (MMSE) estimate of the clean speech modulation magnitude spectrum, which is combined with the phase spectrum of the noisy speech to recover the enhanced speech signal. Results of objective evaluations indicate improvement in noise suppression with the proposed codebook-based speech enhancement approach, particularly in cases of non-stationary noise.¹

Index Terms—Speech enhancement, modulation domain, MMSE estimation, LPC codebooks

¹ Funding for this work was provided by a CRD grant from the Natural Sciences and Engineering Research Council of Canada with sponsorship from Microsemi Corporation (Ottawa, Canada).

I. INTRODUCTION

Speech enhancement involves the suppression of background noise from a desired speech signal while ensuring that the incurred distortion remains within a tolerable limit. Some of the most commonly used single-channel speech enhancement methods include spectral subtraction [1], [2], Wiener filtering [3], and MMSE short-time spectral amplitude (STSA) estimation [4], [5]. These methods typically implement the following three-stage framework known as AMS [6], [7]: (1) Analysis, in which the short-time Fourier transform (STFT) is applied to successive frames of the noisy speech signal; (2) Modification, where the spectrum of the noisy speech is altered to achieve noise suppression; and (3) Synthesis, where the enhanced speech is recovered via inverse STFT and overlap-add (OLA) synthesis.

In past years, research has shown that extending this framework into the modulation domain may result in improved noise suppression and better speech quality [8], [9]. For instance, in the case of spectral subtraction, musical noise distortion is less severe when the subtraction is performed in the modulation domain than in the conventional frequency domain [8]. Extension of the MMSE-STSA estimator to the modulation domain, in the form of the modulation magnitude estimator (MME) [9], has also shown positive results. Interest in this framework extension is further motivated by physiological evidence [10]–[12], which underlines the significance of modulation domain information in speech analysis.

Most speech enhancement algorithms, including those operating in the modulation domain, require an estimate of the background noise PSD, which is typically obtained via a minimum statistics [13] approach. Minimum statistics and its offshoots [14], [15] assume that the background noise exhibits semi-stationary behaviour (i.e., slowly changing statistics) while performing the estimation. This may not be the case in acoustic environments with a rapidly changing background, e.g., a street intersection with passing vehicles or a busy airport terminal. In such cases, the noise PSD cannot be tracked properly and speech enhancement algorithms may perform poorly.

Codebook-based approaches [16]–[20], which fit under the general category of unsupervised learning [21], try to overcome this limitation by estimating the noise parameters based on a priori knowledge about different speech and noise types. In these approaches, joint estimation of the speech and noise PSDs is performed on a frame-by-frame basis by exploiting a priori information stored in the form of trained codebooks of short-time parameter vectors. Examples of such parameters include gain-normalized linear predictive (LP) coefficients [16]–[19] and cepstral coefficients [20].

The use of these codebook methods in the acoustic AMS framework has shown promising results in the enhancement of speech corrupted by non-stationary noise. However, to the best of our knowledge, they have not yet been applied to the modulation domain framework. In this work, we conjecture that codebook methods can indeed bring similar benefits to the enhancement of noisy speech in the modulation domain by providing more accurate estimation of the noise PSD in non-stationary environments, and we validate this hypothesis experimentally.

Specifically, the new speech enhancement method that we propose in this paper incorporates codebook-assisted noise and speech PSD estimation into the modulation domain framework. We use codebooks of linear prediction coefficients and gains obtained by training with the Linde-Buzo-Gray (LBG) algorithm [22]. The PSD estimates derived from the codebook approach are used to calculate a gain function based on the MMSE criterion [9], which is applied to the modulation magnitude spectrum of the noisy speech in order to suppress noise. Results of objective evaluations indicate improvement in noise suppression with the proposed codebook-based speech enhancement method, especially in cases of non-stationary noise.

II. ACOUSTIC VERSUS MODULATION DOMAIN PROCESSING

A. AMS in the Acoustic Frequency Domain

Conventional speech enhancement methods implement the AMS framework in the acoustic frequency domain, where the acoustic frequency spectrum of a speech signal is defined by its STFT. To
this end, an additive noise model is assumed, i.e.,

x[n] = s[n] + d[n],    (1)

where x[n], s[n] and d[n] refer to the noisy speech, clean speech and noise signals respectively, while n ∈ Z is the discrete-time index. STFT analysis of (1) results in

X(ν, k) = S(ν, k) + D(ν, k)    (2)

where X(ν, k), S(ν, k) and D(ν, k) refer to the STFTs of the noisy speech, clean speech and noise signals, respectively, and where k is the discrete acoustic frequency index. The STFT X(ν, k) is obtained from

X(ν, k) = Σ_{l=−∞}^{∞} x(l) w(νF − l) e^{−2jklπ/N}    (3)

where w(l) is a windowing function of duration N samples, and F is the frame advance. In this work, the Hamming window is used for this purpose [7]. The STFT of a signal is represented by its acoustic magnitude and phase spectra as

X(ν, k) = |X(ν, k)| e^{j∠X(ν,k)}    (4)

Speech enhancement methods, such as spectral subtraction [1] or MMSE-STSA [4], implement the modification part of the AMS framework by modifying the noisy magnitude spectrum whilst retaining the phase spectrum. Synthesis of the enhanced signal is performed by inverse STFT followed by OLA synthesis.

B. Modulation Domain Enhancement

The calculation of the short-time modulation spectrum involves performing STFT analysis on the time trajectories of the individual acoustic frequency components of the signal STFT. The magnitude spectrum of the noisy speech in each acoustic frequency bin, i.e. |X(ν, k)|, is first windowed and then Fourier transformed again, resulting in

Z(t, k, m) = Σ_{ν=−∞}^{∞} |X(ν, k)| w_M(tF_M − ν) e^{−2jνmπ/M}    (5)

where w_M(ν) is the so-called modulation window of length N_M, m ∈ {0, ..., M − 1} is the modulation frequency index, t is the modulation time-frame index, and F_M is the frame advance in the modulation domain. The resulting modulation spectrum can be expressed in polar form as

Z(t, k, m) = |Z(t, k, m)| e^{j∠Z(t,k,m)}    (6)

where |Z(t, k, m)| is the modulation magnitude spectrum and ∠Z(t, k, m) is the modulation phase spectrum.

Speech enhancement in the modulation domain involves spectral modification of the modulation magnitude spectrum while retaining the phase spectrum,

Ŝ(t, k, m) = G(t, k, m) Z(t, k, m)    (7)

where G(t, k, m) > 0 is a processing gain. Following this operation, the enhanced time-domain signal is recovered by applying inverse STFT and OLA operations twice. Previous works [8], [9] suggest that enhancement approaches applied in the modulation domain perform better than their traditional acoustic domain counterparts. In this work, the MMSE estimator of the modulation magnitude spectrum, also known as the MME [9], will be used as the basis for developing the proposed codebook-based speech enhancement method.

III. CODEBOOK-BASED SPEECH AND NOISE ESTIMATION

A. Overview

Various noise estimation algorithms are available in the literature to estimate the background noise PSD, which is needed to perform noise suppression in speech enhancement. In algorithms based on minimum statistics [13], [14], which are widely applied, the noise PSD is updated by tracking the minima of a smoothed version of |X(ν, k)|² within a finite window. Tracking the minimum power in this way results in a frame lag in the estimated PSD. This lag can lead to highly inaccurate results in the case of non-stationary noise.

The basis for the codebook-based speech and noise PSD estimation approach in [17]–[20] is the observation that the spectra of speech and of different noise classes can be approximately described by a few representative model spectra. These spectra are stored in finite codebooks as quantized vectors of short-time parameters (e.g., LP coefficients) and serve as the a priori knowledge of the respective signals. The use of a priori information about noise eliminates the dependence on buffers of past data. This makes the estimation robust to spectral variations in non-stationary noise conditions [16].

B. PSD Model

For the additive noise model (1), under the assumption of uncorrelated speech and noise signals, the PSD of the noisy speech can be represented as

P_xx(ω) = P_ss(ω) + P_dd(ω),  ω ∈ [0, 2π)    (8)

where P_ss(ω) and P_dd(ω) are the clean speech and background noise PSDs, respectively, and ω ∈ [0, 2π) is the normalized angular frequency. The PSD shape of a signal y[n], where y ∈ {s, d} stands for either the speech or the noise, can be modelled in terms of its LP coefficients and corresponding excitation variance as

P_yy(ω) = g_y P̄_yy(ω)    (9)

where P̄_yy(ω) is the gain-normalized spectral envelope and g_y is the excitation gain (or variance). The former is given by

P̄_yy(ω) = |1 + Σ_{k=1}^{p} a_{y,k} e^{−jωk}|^{−2}    (10)

where {a_{y,k}}_{k=1}^{p} are the LP coefficients, represented here by the vector θ_y = [a_{y,1}, ..., a_{y,p}], and p is the chosen model order.

C. Codebook Generation

In this work, two different codebooks of short-time spectral parameters, one for the speech and the other for the noise, are generated from training data comprised of multiple speaker signals and different noise types. The codebook generation comprises the following steps: segmentation of the training speech and noise data into frames of 20-40 ms duration; computation of the LP coefficients {a_{y,k}}_{k=1}^{p} for each frame; and vector quantization of the LP coefficient vectors θ_y using the LBG algorithm [22] to obtain the required codebook. The LBG algorithm forms a set of median cluster vectors which best represent the given input set of LP coefficient vectors. Optimal values have to be chosen empirically for the sizes of the speech and noise codebooks, considering the trade-off between PSD estimation accuracy and complexity. In the sequel, we shall represent the speech and noise codebooks so obtained as {θ_s^i}_{i=1}^{N_s} and {θ_d^j}_{j=1}^{N_d}, where the vectors θ_s^i and θ_d^j are the corresponding i-th and j-th codebook entries, and N_s and N_d are the codebook sizes, respectively.
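To make the training procedure concrete, the sketch below implements the three codebook-generation steps described above (framing, LP analysis via Levinson-Durbin, and LBG quantization) in Python with NumPy, together with the gain-normalized envelope of Eq. (10). This is an illustrative simplification under assumptions, not the authors' implementation: the function names are ours, a plain Euclidean distortion stands in for the spectral distortion measures usually paired with LP parameters, and the toy settings are far smaller than the codebooks trained in the paper.

```python
import numpy as np

def autocorrelation(frame, p):
    """First p+1 autocorrelation lags of a windowed frame."""
    r = np.correlate(frame, frame, mode="full")
    mid = len(frame) - 1
    return r[mid:mid + p + 1]

def levinson_durbin(r, p):
    """LP coefficients a_1..a_p from autocorrelation lags r[0..p]."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        prev = a[:i].copy()
        a[1:i] = prev[1:] + k * prev[:0:-1]  # reflection update of a_1..a_{i-1}
        a[i] = k
        err *= (1.0 - k * k)
    return a[1:]

def lbg(vectors, bits, passes=20, eps=1e-3):
    """LBG: binary splitting of centroids, each followed by k-means refinement."""
    codebook = vectors.mean(axis=0, keepdims=True)
    for _ in range(bits):                      # double the codebook size each round
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(passes):
            dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            nearest = dist.argmin(axis=1)
            for c in range(len(codebook)):     # move each centroid to its cluster mean
                members = vectors[nearest == c]
                if len(members):
                    codebook[c] = members.mean(axis=0)
    return codebook

def train_lp_codebook(signal, bits, frame_len=512, hop=256, p=10):
    """Frame the training data, extract LP vectors, and quantize them with LBG."""
    window = np.hamming(frame_len)
    feats = np.array([
        levinson_durbin(autocorrelation(signal[s:s + frame_len] * window, p), p)
        for s in range(0, len(signal) - frame_len, hop)
    ])
    return lbg(feats, bits)

def lp_envelope(theta, n_freq=256):
    """Gain-normalized spectral envelope of Eq. (10) for one codebook entry."""
    w = np.linspace(0.0, np.pi, n_freq)
    k = np.arange(1, len(theta) + 1)
    A = 1.0 + np.exp(-1j * np.outer(w, k)) @ theta
    return 1.0 / np.abs(A) ** 2
```

A b-bit codebook trained this way contains 2^b entries; the scaled noise PSD of Eq. (9) is then `g * lp_envelope(theta)` for a codebook entry `theta` and excitation gain `g`.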
2015 IEEE Global Conference on Signal and Information Processing (GlobalSIP)
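For readers who want to trace the processing chain of Section II-B, the following Python/SciPy sketch performs the double STFT analysis of Eq. (5), modifies the modulation magnitude spectrum while retaining the modulation phase as in Eq. (7), and resynthesizes the signal with two inverse STFT/OLA stages. It is a schematic illustration under assumed settings: the paper's actual gain is the MMSE modulation magnitude estimator of [9], which is not reproduced here, so a pass-through unity gain is substituted as a placeholder.

```python
import numpy as np
from scipy.signal import stft, istft

def modulation_domain_ams(x, fs=16000, gain_fn=None,
                          nperseg=512, hop=64,          # 32 ms frames, 4 ms advance at 16 kHz
                          mod_nperseg=80, mod_hop=2):   # N_M = 80 frames, F_M = 8 ms (2 frames)
    """Analysis-modification-synthesis in the modulation domain (schematic)."""
    if gain_fn is None:
        gain_fn = lambda zmag: np.ones_like(zmag)  # placeholder for the MMSE gain of [9]

    # 1) Acoustic STFT; the noisy acoustic phase is kept for resynthesis.
    _, _, X = stft(x, fs, window="hamming", nperseg=nperseg, noverlap=nperseg - hop)
    mag, phase = np.abs(X), np.angle(X)

    # 2) Second STFT along the time trajectory of each acoustic frequency bin (Eq. 5).
    _, _, Z = stft(mag, window="hamming", nperseg=mod_nperseg,
                   noverlap=mod_nperseg - mod_hop, axis=-1)
    zmag, zphase = np.abs(Z), np.angle(Z)

    # 3) Modify the modulation magnitude, retain the modulation phase (Eq. 7).
    S = gain_fn(zmag) * zmag * np.exp(1j * zphase)

    # 4) First inverse STFT/OLA stage: back to acoustic magnitude trajectories.
    _, mag_hat = istft(S, window="hamming", nperseg=mod_nperseg,
                       noverlap=mod_nperseg - mod_hop)
    mag_hat = np.maximum(mag_hat[:, :mag.shape[1]], 0.0)  # trim padding, clip negatives

    # 5) Second inverse STFT/OLA stage: back to the time domain.
    _, x_hat = istft(mag_hat * np.exp(1j * phase), fs, window="hamming",
                     nperseg=nperseg, noverlap=nperseg - hop)
    return x_hat
```

With the unity gain the chain is a near-identity, which is a useful sanity check before plugging in an actual suppression rule as `gain_fn`.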
A. Methodology

Speech utterances of two male and two female speakers from the TSP [23] and TIMIT databases were used for conducting the experiments, along with different types of noise samples from the NoiseX92 [24] and Sound Jay [25] databases, including babble, street and restaurant noise. In addition, a non-stationary (i.e., amplitude-modulated) Gaussian white noise was also considered. All the speech and noise files were uniformly sampled at a rate of 16 kHz. The LP coefficient order p was set to 10 for both the speech and noise codebooks. A 7-bit speech codebook was trained with 7.5 minutes of clean speech from the above-mentioned sources (i.e., 55 short sentences for each speaker). A 4-bit noise codebook was trained using over 1 minute of noise data from the available databases (i.e., about 15 s for each noise type). For the testing, i.e., the objective evaluation of the various algorithms, noisy speech files were generated by adding scaled segments of noise to the clean speech. For each speaker, 3 sentences were selected and combined with the four different types of noise, properly scaled to obtain the desired SNR values of 0 and 5 dB. The speech and noise samples used for testing were different from those used to train the two codebooks.

Fine tuning of the parameters is crucial for the performance of the proposed enhancement method. The acoustic frame duration was chosen to be 32 ms, while the values of the other analysis parameters were chosen empirically as follows: acoustic frame advance F = 4 ms, modulation frame duration N_M = 80, modulation frame advance F_M = 8 ms and control factor α = 0.95.

For the objective evaluation of the enhanced speech, we used the perceptual evaluation of speech quality (PESQ) and the segmental SNR (SegSNR) as performance measures. PESQ [26] is widely used for automated assessment of speech quality as experienced by a listener, where higher PESQ values indicate better speech quality. SegSNR is defined as the average SNR calculated over

B. Results & Discussion

The PESQ and SegSNR results for the different noises at SNRs of 0 and 5 dB are reported in Tables I and II, respectively. It can be seen that the proposed CB-MME method performs better than the MME and MMSE methods for both performance metrics under consideration. Results for other SNRs and noise types (not shown) show a similar trend. Informal listening tests concur with the objective results. The proposed CB-MME method seems to suppress non-stationary elements of the background noise better than MMSE and MME, at the expense of some slight distortion in the enhanced speech. This is mainly due to the use of a codebook-based approach, which performs on-line noise PSD estimation on a frame-by-frame basis from the current observation, as opposed to the MS approach used in the MMSE and MME algorithms, which relies on a long buffer of past frames. The slight distortion could be caused by the spectral mismatch between the codebook-based speech PSD estimate and the actual one, which remains a topic for future study.

VI. CONCLUSION

In this paper, we have proposed a new speech enhancement method that performs noise suppression in the modulation domain with speech and noise PSDs obtained from a codebook-based estimation approach. We use codebooks of linear prediction coefficients and gains obtained by training with the LBG algorithm. The PSD estimates derived from the codebooks were used to calculate an MMSE gain function, which was applied to the modulation magnitude spectrum of the noisy speech in order to suppress noise. Results of the objective evaluation showed improvements in the suppression of non-stationary noise with the proposed CB-MME approach.