Speech Coding: Fundamentals and Applications: Mark Hasegawa-Johnson

SPEECH CODING: FUNDAMENTALS AND APPLICATIONS

MARK HASEGAWA-JOHNSON
University of Illinois at Urbana–Champaign
Urbana, Illinois

ABEER ALWAN
University of California at Los Angeles
Los Angeles, California

1. INTRODUCTION

Speech coding is the process of obtaining a compact representation of voice signals for efficient transmission over band-limited wired and wireless channels and/or storage. Today, speech coders have become essential components in telecommunications and in the multimedia infrastructure. Commercial systems that rely on efficient speech coding include cellular communication, voice over internet protocol (VOIP), videoconferencing, electronic toys, archiving, and digital simultaneous voice and data (DSVD), as well as numerous PC-based games and multimedia applications.

Speech coding is the art of creating a minimally redundant representation of the speech signal that can be efficiently transmitted or stored in digital media, and decoding the signal with the best possible perceptual quality. Like any other continuous-time signal, speech may be represented digitally through the processes of sampling and quantization; speech is typically quantized using either 16-bit uniform or 8-bit companded quantization. Like many other signals, however, a sampled speech signal contains a great deal of information that is either redundant (nonzero mutual information between successive samples in the signal) or perceptually irrelevant (information that is not perceived by human listeners). Most telecommunications coders are lossy, meaning that the synthesized speech is perceptually similar to the original but may be physically dissimilar.

A speech coder converts a digitized speech signal into a coded representation, which is usually transmitted in frames. A speech decoder receives coded frames and synthesizes reconstructed speech. Standards typically dictate the input–output relationships of both coder and decoder. The input–output relationship is specified using a reference implementation, but novel implementations are allowed, provided that input–output equivalence is maintained. Speech coders differ primarily in bit rate (measured in bits per sample or bits per second), complexity (measured in operations per second), delay (measured in milliseconds between recording and playback), and perceptual quality of the synthesized speech. Narrowband (NB) coding refers to coding of speech signals whose bandwidth is less than 4 kHz (8 kHz sampling rate), while wideband (WB) coding refers to coding of 7-kHz-bandwidth signals (14–16 kHz sampling rate). NB coding is more common than WB coding mainly because of the narrowband nature of the wireline telephone channel (300–3600 Hz). More recently, however, there has been an increased effort in wideband speech coding because of several applications such as videoconferencing.

There are different types of speech coders. Table 1 summarizes the bit rates, algorithmic complexity, and standardized applications of the four general classes of coders described in this article; Table 2 lists a selection of specific speech coding standards. Waveform coders attempt to code the exact shape of the speech signal waveform, without considering the nature of human speech production and speech perception. These coders are high-bit-rate coders (typically above 16 kbps). Linear prediction coders (LPCs), on the other hand, assume that the speech signal is the output of a linear time-invariant (LTI) model of speech production. The transfer function of that model is assumed to be all-pole (autoregressive model). The excitation function is a quasiperiodic signal constructed from discrete pulses (1–8 per pitch period), pseudorandom noise, or some combination of the two. If the excitation is generated only at the receiver, based on a transmitted pitch period and voicing information, then the system is designated as an LPC vocoder. LPC vocoders that provide extra information about the spectral shape of the excitation have been adopted as coder standards between 2.0 and 4.8 kbps. LPC-based analysis-by-synthesis coders (LPC-AS), on the other hand, choose an excitation function by explicitly testing a large set of candidate excitations and choosing the best. LPC-AS coders are used in most standards between 4.8 and 16 kbps. Subband coders are frequency-domain coders that attempt to parameterize the speech signal in terms of spectral properties in different frequency bands. These coders are less widely used than LPC-based coders but have the advantage of being scalable
and do not model the incoming signal as speech. Subband coders are widely used for high-quality audio coding.

This article is organized as follows. Sections 2, 3, 4, and 5 present the basic principles behind waveform coders, subband coders, LPC-based analysis-by-synthesis coders, and LPC-based vocoders, respectively. Section 6 describes the different quality metrics that are used to evaluate speech coders, while Section 7 discusses a variety of issues that arise when a coder is implemented in a communications network, including voice over IP, multirate coding, and channel coding. Section 8 presents an overview of standardization activities involving speech coding, and we conclude in Section 9 with some final remarks.

2. WAVEFORm CODING

Waveform coders attempt to code the exact shape of the speech signal waveform, without considering in detail the nature of human speech production and speech perception. Waveform coders are most useful in applications that require the successful coding of both speech and nonspeech signals. In the public switched telephone network (PSTN), for example, successful transmission of modem and fax signaling tones, and switching signals is nearly as important as the successful transmission of speech. The most commonly used waveform coding algorithms are uniform 16-bit PCM, companded 8-bit PCM [48], and ADPCM [46].

2.1. Pulse Code Modulation (PCM)

In PCM, each sample of s(n) is quantized independently, using one of a fixed set of reconstruction levels ŝk, k = 0, . . . , m, . . . , K, regardless of the values of previous samples. The reconstructed signal ŝ(n) is given by

    ŝ(n) = ŝm, where (s(n) − ŝm)² = min_{k=0,...,K} (s(n) − ŝk)²    (1)

Many speech and audio applications use an odd number of reconstruction levels, so that background noise signals with a very low level can be quantized exactly to ŝK/2 = 0. One important exception is the A-law companded PCM standard [48], which uses an even number of reconstruction levels.

2.1.1. Uniform PCM. Uniform PCM is the name given to quantization algorithms in which the reconstruction levels are uniformly distributed between Smax and Smin. The advantage of uniform PCM is that the quantization error power is independent of signal power; high-power signals are quantized with the same resolution as low-power signals. Invariant error power is considered desirable in many digital audio applications, so 16-bit uniform PCM is a standard coding scheme in digital audio.

The error power and SNR of a uniform PCM coder vary with bit rate in a simple fashion. Suppose that a signal is quantized using B bits per sample. If zero is a reconstruction level, then the quantization step size Δ is

    Δ = (Smax − Smin) / (2^B − 1)    (2)

Assuming that quantization errors are uniformly distributed between Δ/2 and −Δ/2, the quantization error power is

    E[e²(n)] = Δ²/12    (3)
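The error-power analysis above is easy to check numerically. The sketch below (plain Python; all helper names are illustrative, not from any standard) quantizes a sine wave with the step size of Eq. (2) and measures the resulting SNR; each added bit should buy roughly 6 dB.

```python
import math

def uniform_pcm(x, B, s_min=-1.0, s_max=1.0):
    """Uniform quantizer with 2**B levels spaced per Eq. (2)."""
    step = (s_max - s_min) / (2**B - 1)
    return [s_min + step * round((s - s_min) / step) for s in x]

def snr_db(x, y):
    """Signal-to-quantization-noise ratio in dB."""
    sig = sum(s * s for s in x)
    err = sum((s - t) ** 2 for s, t in zip(x, y))
    return 10.0 * math.log10(sig / err)

# One second of a 211-Hz sine at an 8-kHz sampling rate.
x = [0.9 * math.sin(2 * math.pi * 211 * n / 8000) for n in range(8000)]
snr8 = snr_db(x, uniform_pcm(x, 8))
snr9 = snr_db(x, uniform_pcm(x, 9))
# snr9 - snr8 should be close to 6 dB, consistent with Eqs. (2)-(3).
```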
2.1.2. Companded PCM. Companded PCM is the name given to coders in which the reconstruction levels ŝk are not uniformly distributed. Such coders may be modeled using a compressive nonlinearity, followed by uniform PCM, followed by an expansive nonlinearity:

    s(n) → compress → t(n) → uniform PCM → t̂(n) → expand → ŝ(n)    (4)

It can be shown that, if small values of s(n) are more likely than large values, expected error power is minimized by a companding function that results in a higher density of reconstruction levels x̂k at low signal levels than at high signal levels [78]. A typical example is the µ-law companding function [48] (Fig. 1), which is given by

    t(n) = Smax sign(s(n)) [log(1 + µ|s(n)/Smax|) / log(1 + µ)]    (5)

where µ is typically between 0 and 256 and determines the amount of nonlinear compression applied.

[Figure 1: the µ-law companding function; output signal t(n) versus input s(n), shown for µ = 256.]
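Equation (5) and its inverse can be sketched directly (illustrative function names; µ = 255 is the value used in the North American companded PCM standard, but any µ in the stated range works the same way):

```python
import math

def mu_compress(s, mu=255.0, s_max=1.0):
    """Eq. (5): t = s_max * sign(s) * log(1 + mu*|s|/s_max) / log(1 + mu)."""
    sign = 1.0 if s >= 0 else -1.0
    return s_max * sign * math.log(1 + mu * abs(s) / s_max) / math.log(1 + mu)

def mu_expand(t, mu=255.0, s_max=1.0):
    """Inverse of Eq. (5): |s| = s_max * ((1 + mu)**(|t|/s_max) - 1) / mu."""
    sign = 1.0 if t >= 0 else -1.0
    return s_max * sign * ((1 + mu) ** (abs(t) / s_max) - 1) / mu

x = [-0.5, -0.01, 0.0, 0.01, 0.5]
round_trip = [mu_expand(mu_compress(s)) for s in x]
# Compression expands small amplitudes, so that uniform quantization of
# t(n) yields a higher density of reconstruction levels near zero.
```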
2.2. Differential PCM (DPCM)

Successive speech samples are highly correlated. The long-term average spectrum of voiced speech is reasonably well approximated by the function S(f) = 1/f above about 500 Hz; the first-order intersample correlation coefficient is approximately 0.9. In differential PCM, each sample s(n) is compared to a prediction sp(n), and the difference is called the prediction residual d(n) (Fig. 2). d(n) has a smaller dynamic range than s(n), so for a given error power, fewer bits are required to quantize d(n).

Accurate quantization of d(n) is useless unless it leads to accurate quantization of s(n). In order to avoid amplifying the error, DPCM coders use a technique copied by many later speech coders: the encoder includes an embedded decoder, so that the reconstructed signal ŝ(n) is known at the encoder. By using ŝ(n) to create sp(n), DPCM coders avoid amplifying the quantization error:

    d(n) = s(n) − sp(n)    (6)
    ŝ(n) = d̂(n) + sp(n)    (7)
    e(n) = s(n) − ŝ(n) = d(n) − d̂(n)    (8)

[Figure 2: block diagram of a DPCM encoder and decoder; the predictor P(z) forms sp(n) from the reconstructed signal ŝ(n), and the quantized residual d̂(n) is transmitted over the channel.]

Two existing standards are based on DPCM. In the first type of coder, continuously varying slope delta modulation (CVSD), the input speech signal is upsampled to either 16 or 32 kHz. Values of the upsampled signal are predicted using a one-tap predictor, and the difference signal is quantized at one bit per sample, with an adaptively varying Δ. CVSD performs badly in quiet environments, but in extremely noisy environments (e.g., a helicopter cockpit), CVSD performs better than any LPC-based algorithm, and for this reason it remains the U.S. Department of Defense recommendation for extremely noisy environments [64,96].

DPCM systems with adaptive prediction and quantization are referred to as adaptive differential PCM systems (ADPCM). A commonly used ADPCM standard is G.726, which can operate at 16, 24, 32, or 40 kbps.
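The embedded-decoder structure of Eqs. (6)–(8) can be exercised in a few lines (a hypothetical one-tap predictor with a fixed coefficient and a coarse uniform residual quantizer; all names are illustrative, not G.726). Because the prediction is formed from the reconstructed signal, the reconstruction error equals the residual quantization error and stays bounded by half the quantizer step instead of accumulating.

```python
import math

def quantize(d, step=0.05):
    """Coarse uniform quantizer for the prediction residual d(n)."""
    return step * round(d / step)

def dpcm_encode_decode(x, a=0.9):
    """One-tap DPCM with sp(n) = a * s_hat(n-1); returns the reconstruction."""
    s_hat_prev = 0.0
    recon = []
    for s in x:
        sp = a * s_hat_prev          # prediction from the *reconstructed* past
        d = s - sp                   # Eq. (6)
        d_hat = quantize(d)
        s_hat = d_hat + sp           # Eq. (7): embedded decoder at the encoder
        recon.append(s_hat)
        s_hat_prev = s_hat
    return recon

x = [0.5 * math.sin(2 * math.pi * 100 * n / 8000) for n in range(2000)]
y = dpcm_encode_decode(x)
# Eq. (8): e(n) = d(n) - d_hat(n), so |e(n)| never exceeds step/2.
max_err = max(abs(s - t) for s, t in zip(x, y))
```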
3. SUBBAND CODING

In subband coding, an analysis filterbank is first used to filter the signal into a number of frequency bands, and then bits are allocated to each band by a certain criterion. Because of the difficulty in obtaining high-quality speech at low bit rates using subband coding schemes, these techniques have been used mostly for wideband medium-to-high-bit-rate speech coders and for audio coding. For example, G.722 is a standard in which ADPCM speech coding occurs within two subbands, and bit allocation is set to achieve 7-kHz audio coding at rates of 64 kbps or less.

In Refs. 12, 13, and 30, subband coding is proposed as a flexible scheme for robust speech coding. A speech production model is not used, ensuring robustness to speech in the presence of background noise, and to nonspeech sources. High-quality compression can be achieved by incorporating masking properties of the human auditory system [54,93]. In particular, Tang et al. [93] present a scheme for robust, high-quality, scalable, and embedded speech coding. Figure 3 illustrates the basic structure of the coder. Dynamic bit allocation and prioritization and embedded quantization are used to optimize the perceptual quality of the embedded bitstream, resulting in little performance degradation relative to a nonembedded implementation. A subband spectral analysis technique was developed that substantially reduces the complexity of computing the perceptual model.

The encoded bitstream is embedded, allowing the coder output to be scalable from high quality at higher bit rates, to lower quality at lower rates, supporting a wide range of service and resource utilization. The lower-bit-rate representation is obtained simply through truncation of the higher-bit-rate representation. Since source rate adaptation is performed through truncation of the encoded stream, interaction with the source coder is not required, making the coder ideally suited for rate-adaptive communication systems.

Even though subband coding is not widely used for speech coding today, it is expected that new standards for wideband coding and rate-adaptive schemes will be based on subband coding or a hybrid technique that includes subband coding. This is because subband coders are more easily scalable in bit rate than standard CELP techniques, an issue which will become more critical for high-quality speech and audio transmission over wireless communication channels and the Internet, allowing the system to seamlessly adapt to changes in both the transmission environment and network congestion.

4. LPC-BASED ANALYSIS BY SYNTHESIS

An analysis-by-synthesis speech coder consists of the following components:

• A model of speech production that depends on certain parameters θ:

      ŝ(n) = f(θ)    (9)

• A list of K possible parameter sets for the model

      θ1, . . . , θk, . . . , θK    (10)

• An error metric |Ek|² that compares the original speech signal s(n) and the coded speech signal ŝ(n). In LPC-AS coders, |Ek|² is typically a perceptually weighted mean-squared error measure.

A general analysis-by-synthesis coder finds the optimum set of parameters by synthesizing all of the K different speech waveforms ŝk(n) corresponding to the K possible parameter sets θk, computing |Ek|² for each synthesized waveform, and then transmitting the index of the parameter set which minimizes |Ek|². Choosing a set of transmitted parameters by explicitly computing ŝk(n) is called "closed-loop" optimization, and may be contrasted with "open-loop" optimization, in which coder parameters are chosen on the basis of an analytical formula without explicit computation of ŝk(n). Closed-loop optimization of all parameters is prohibitively expensive, so LPC-based analysis-by-synthesis coders typically adopt the following compromise. The gross spectral shape is modeled using an all-pole filter 1/A(z) whose parameters are estimated in open-loop fashion, while spectral fine structure is modeled using an excitation function U(z) whose parameters are optimized in closed-loop fashion (Fig. 4).

Figure 4. General structure of an LPC-AS coder (a) and decoder (b). LPC filter A(z) and perceptual weighting filter W(z) are chosen open-loop, then the excitation vector u(n) is chosen in a closed-loop fashion in order to minimize the error metric |E|².

Figure 5. Normalized magnitude spectrum of the pitch prediction filter for several values of the prediction coefficient (b = 0.25, 0.5, 0.75, 1.0; horizontal axis: digital frequency, radians/sample).
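The closed-loop principle can be sketched generically (a toy one-parameter model and unweighted squared error; this illustrates the search loop itself, not any standard's algorithm): synthesize ŝk(n) for every candidate parameter set, score each, and keep the argmin index.

```python
def analysis_by_synthesis(s, candidates, synthesize):
    """Closed-loop search: synthesize every candidate, keep the best index."""
    best_k, best_err = None, float("inf")
    for k, theta in enumerate(candidates):
        s_hat = synthesize(theta)            # Eq. (9): s_hat(n) = f(theta)
        err = sum((a - b) ** 2 for a, b in zip(s, s_hat))
        if err < best_err:
            best_k, best_err = k, err
    return best_k, best_err

# Toy model: theta is a single gain applied to a fixed shape vector.
shape = [1.0, -1.0, 0.5, 0.0]
target = [2.0, -2.0, 1.0, 0.0]               # exactly shape scaled by 2.0
gains = [0.5, 1.0, 2.0, 4.0]                 # the K candidate parameter sets
k, err = analysis_by_synthesis(target, gains, lambda g: [g * v for v in shape])
# k == 2 selects the gain 2.0, which reproduces the target exactly (err == 0).
```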
The number of LPC coefficients (p) depends on the signal bandwidth. Since each pair of complex-conjugate poles represents one formant frequency and since there is, on average, one formant frequency per 1 kHz, p is typically equal to 2BW (in kHz) + (2 to 4). Thus, for a 4-kHz speech signal, a 10th–12th-order LPC model would be used.

This system is excited by a signal u(n) that is uncorrelated with itself over lags of less than p + 1. If the underlying speech sound is unvoiced (the vocal folds do not vibrate), then u(n) is uncorrelated with itself even at larger time lags, and may be modeled using a pseudorandom-noise signal. If the underlying speech is voiced (the vocal folds vibrate), then u(n) is quasiperiodic with a fundamental period called the "pitch period."

4.2. Pitch Prediction Filtering

In an LPC-AS coder, the LPC excitation is allowed to vary smoothly between fully voiced conditions (as in a vowel) and fully unvoiced conditions (as in /s/). Intermediate levels of voicing are often useful to model partially voiced phonemes such as /z/.

The partially voiced excitation in an LPC-AS coder is constructed by passing an uncorrelated noise signal c(n) through a pitch prediction filter [2,79]. A typical pitch prediction filter is

    u(n) = gc(n) + bu(n − T0)    (12)

where T0 is the pitch period. If c(n) is unit-variance white noise, then according to Eq. (12) the spectrum of u(n) is […] spectrum that is heard as voiced, without the need for a binary voiced/unvoiced decision.

In LPC-AS coders, the noise signal c(n) is chosen from a "stochastic codebook" of candidate noise signals. The stochastic codebook index, the pitch period, and the gains b and g are chosen in a closed-loop fashion in order to minimize a perceptually weighted error metric. The search for an optimum T0 typically uses the same algorithm as the search for an optimum c(n). For this reason, the list of excitation samples delayed by different candidate values of T0 is typically called an "adaptive codebook" [87].
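Equation (12) is easy to exercise numerically. The sketch below (illustrative names) drives the pitch prediction filter with Gaussian noise and confirms that a large pitch gain b makes u(n) strongly correlated at lag T0, while b = 0 leaves it uncorrelated:

```python
import random

def pitch_excitation(c, g, b, T0):
    """Eq. (12): u(n) = g*c(n) + b*u(n - T0), with zero initial history."""
    u = []
    for n, cn in enumerate(c):
        past = u[n - T0] if n >= T0 else 0.0
        u.append(g * cn + b * past)
    return u

def lag_correlation(u, T0):
    """Normalized correlation of u with itself at lag T0."""
    num = sum(u[n] * u[n - T0] for n in range(T0, len(u)))
    den = sum(v * v for v in u)
    return num / den

random.seed(0)
c = [random.gauss(0.0, 1.0) for _ in range(4000)]
u_voiced = pitch_excitation(c, g=1.0, b=0.9, T0=80)    # strongly periodic
u_unvoiced = pitch_excitation(c, g=1.0, b=0.0, T0=80)  # plain white noise
```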
4.3. Perceptual Error Weighting

Not all types of distortion are equally audible. Many types of speech coders, including LPC-AS coders, use simple models of human perception in order to minimize the audibility of different types of distortion. In LPC-AS coding, two types of perceptual weighting are commonly used. The first type, perceptual weighting of the residual quantization error, is used during the LPC excitation search in order to choose the excitation vector with the least audible quantization error. The second type, adaptive postfiltering, is used to reduce the perceptual importance of any remaining quantization error.

4.3.1. Perceptual Weighting of the Residual Quantization Error. The excitation in an LPC-AS coder is chosen to minimize a perceptually weighted error metric. Usually, the error metric is a function of the time-domain waveform error signal

    e(n) = s(n) − ŝ(n)    (14)
[…] spectrum at lower SNR may be less audible than a white-noise spectrum at higher SNR (Fig. 7). The audibility of noise may be estimated using a noise-to-masker ratio |Ew|²:

    |Ew|² = (1/2π) ∫_{−π}^{π} |E(e^{jω})|² / |M(e^{jω})|² dω    (16)

The masking spectrum M(e^{jω}) has peaks and valleys at the same frequencies as the speech spectrum, but the difference in amplitude between peaks and valleys is somewhat smaller than that of the speech spectrum. A variety of algorithms exist for estimating the masking spectrum, ranging from extremely simple to extremely complex [51]. One of the simplest model masking spectra that has the properties just described is as follows [2]:

    M(z) = |A(z/γ2)| / |A(z/γ1)|,   0 < γ2 < γ1 ≤ 1    (17)

where 1/A(z) is an LPC model of the speech spectrum. The poles and zeros of M(z) are at the same frequencies as the poles of 1/A(z), but have broader bandwidths. Since the zeros of M(z) have broader bandwidth than its poles, M(z) has peaks where 1/A(z) has peaks, but the difference between peak and valley amplitudes is somewhat reduced.

The noise-to-masker ratio may be efficiently computed by filtering the speech signal using a perceptual weighting filter W(z) = 1/M(z). The perceptually weighted input speech signal is

    Sw(z) = W(z)S(z)    (18)

Likewise, for any particular candidate excitation signal, the perceptually weighted output speech signal is

    Ŝw(z) = W(z)Ŝ(z)    (19)

Figure 6. The minimum-energy quantization noise is usually characterized as white noise. (Plot: speech spectrum versus white noise at 5 dB SNR, 0–4000 Hz.)
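The weighting filter W(z) = 1/M(z) = A(z/γ1)/A(z/γ2) can be realized simply by scaling the LPC coefficients, since A(z/γ) has coefficients ak·γ^k. The sketch below (illustrative names, toy second-order A(z)) implements this as a direct-form pole-zero filter; it is a sketch of the idea in Eqs. (17)–(19), not any standard's filter:

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): a_k -> a_k * gamma**k,
    for A(z) = 1 + a_1 z^-1 + ... + a_p z^-p given as a = [1, a_1, ..., a_p]."""
    return [ak * gamma**k for k, ak in enumerate(a)]

def pole_zero_filter(x, b, a):
    """Direct-form IIR: y(n) = sum_k b_k x(n-k) - sum_{k>0} a_k y(n-k)."""
    y = []
    for n in range(len(x)):
        acc = sum(bk * x[n - k] for k, bk in enumerate(b) if n - k >= 0)
        acc -= sum(ak * y[n - k] for k, ak in enumerate(a)
                   if k > 0 and n - k >= 0)
        y.append(acc)
    return y

def perceptual_weighting(x, a, gamma1=0.9, gamma2=0.5):
    """W(z) = A(z/gamma1) / A(z/gamma2) = 1/M(z), per Eq. (17)."""
    return pole_zero_filter(x, bandwidth_expand(a, gamma1),
                            bandwidth_expand(a, gamma2))

a = [1.0, -1.2, 0.8]          # toy second-order LPC polynomial (illustrative)
h = perceptual_weighting([1.0] + [0.0] * 15, a)   # impulse response of W(z)
```

Note that when γ1 = γ2 the numerator and denominator cancel and W(z) reduces to unity, a convenient sanity check.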
delay and gain of the comb filter may be set equal to the
transmitted pitch lag and gain, or they may be recalculated
at the decoder using the reconstructed signal ŝ(n). The
80 pitch postfilter is applied only if the proposed comb filter
gain is above a threshold; if the comb filter gain is below
threshold, the speech is considered unvoiced, and no pitch
postfilter is used. For improved perceptual quality, the
Spectral amplitude (dB)
75 Speech spectrum
LPC excitation signal may be interpolated to a higher
sampling rate in order to allow the use of fractional pitch
periods; for example, the postfilter in the ITU G.729 coder
70
White noise at 5 dB SNR uses pitch periods quantized to 18 sample.
A short-term predictive postfilter enhances peaks in the
spectral envelope. The form of the short-term postfilter is
65 similar to that of the masking function M(z) introduced
in the previous section; the filter has peaks at the same
frequencies as 1/A(z), but the peak-to-valley ratio is less
60 than that of A(z).
0 1000 2000 3000 4000 Postfiltering may change the gain and the average
Frequency (Hz) spectral tilt of ŝ(n). In order to correct these problems,
Figure 6. The minimum-energy quantization noise is usually systems that employ postfiltering may pass the final signal
characterized as white noise. through a one-tap FIR preemphasis filter, and then modify
Given a candidate excitation vector U, the perceptually weighted error vector Ew may be defined as

    Ew = Sw − Ŝw = S̃ − UH    (26)

where the target vector S̃ is

    S̃ = Sw − ŜZIR    (27)

The target vector needs to be computed only once per subframe, prior to the codebook search. The objective of the codebook search, therefore, is to find an excitation vector U that minimizes |S̃ − UH|².

4.4.2. Optimum Gain and Optimum Excitation. Recall that the excitation vector U is modeled as the weighted sum of a number of codevectors Xm, m = 1, . . . , M. The perceptually weighted error is therefore

    |E|² = |S̃ − GXH|² = S̃S̃′ − 2GXHS̃′ + GXH(GXH)′    (28)

where the prime denotes transpose. Minimizing |E|² requires an optimum choice of the shape vectors X and of the gains G. It turns out that the optimum gain for each excitation vector can be computed in closed form. Since the optimum gain can be computed in closed form, it need not be computed during the closed-loop search; instead, one can simply assume that each candidate excitation, if selected, would be scaled by its optimum gain. Assuming an optimum gain results in an extremely efficient criterion for choosing the optimum excitation vector [3].

Suppose we define the following additional bits of notation:

    RX = XHS̃′,   Φ = XH(XH)′    (29)

Then the mean-squared error is

    |E|² = S̃S̃′ − 2GRX + GΦG′    (30)

For any given set of shape vectors X, G is chosen so that |E|² is minimized, which yields

    G = RX′Φ⁻¹    (31)

If we substitute the minimum-MSE value of G into Eq. (30), we get

    |E|² = S̃S̃′ − RX′Φ⁻¹RX    (32)

Hence, in order to minimize the perceptually weighted MSE, we choose the shape vectors X in order to maximize the covariance-weighted sum of correlations:

    Xopt = arg max (RX′Φ⁻¹RX)    (33)

When the shape matrix X contains more than one row, the matrix inversion in Eq. (33) is often computed using approximate algorithms [4]. In the VSELP coder [25], X is transformed using a modified Gram–Schmidt orthogonalization so that Φ has a diagonal structure, thus simplifying the computation of Eq. (33).
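For a single shape vector, Eqs. (29)–(33) reduce to scalars: R = (XH)S̃′, Φ = (XH)(XH)′, the optimum gain is G = R/Φ, and the best codevector is the one maximizing R²/Φ. A minimal sketch (illustrative names; candidate "weighted zero-state responses" are given directly rather than computed through H):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def best_codevector(target, responses):
    """Scalar case of Eqs. (29)-(33): for each candidate's weighted
    zero-state response w, R = w . target and Phi = w . w; the best
    candidate maximizes R**2 / Phi, and its optimum gain is G = R / Phi."""
    best_k, best_score = None, -1.0
    for k, w in enumerate(responses):
        R, Phi = dot(w, target), dot(w, w)
        if Phi == 0.0:
            continue                      # skip an all-zero codevector
        score = R * R / Phi
        if score > best_score:
            best_k, best_score = k, score
    w = responses[best_k]
    G = dot(w, target) / dot(w, w)        # closed-form optimum gain
    return best_k, G

target = [3.0, 0.0, -3.0, 0.0]
responses = [[1.0, 1.0, 1.0, 1.0],        # orthogonal to the target
             [1.0, 0.0, -1.0, 0.0],       # collinear with the target
             [0.0, 1.0, 0.0, -1.0]]
k, G = best_codevector(target, responses)
# The collinear candidate wins, with optimum gain G = 3.0.
```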
4.5. Types of LPC-AS Coder

4.5.1. Multipulse LPC (MPLPC). In the multipulse LPC algorithm [4,50], the shape vectors are impulses. U is typically formed as the weighted sum of 4–8 impulses per subframe.

The number of possible combinations of impulses grows exponentially in the number of impulses, so joint optimization of the positions of all impulses is usually impossible. Instead, most MPLPC coders optimize the pulse positions one at a time, using something like the following strategy. First, the weighted zero-state response of H(z) corresponding to each impulse location is computed. If Ck is an impulse located at n = k, the corresponding weighted zero-state response is

    CkH = [0, . . . , 0, h(0), h(1), . . . , h(L − k − 1)]    (34)

The location of the first impulse is chosen in order to optimally approximate the target vector S̃1 = S̃, using the methods described in the previous section. After selecting the first impulse location k1, the target vector is updated according to

    S̃m = S̃m−1 − Ckm−1H    (35)

Additional impulses are chosen until the desired number of impulses is reached. The gains of all pulses may be reoptimized after the selection of each new pulse [87].

Variations are possible. The multipulse coder described in ITU standard G.723.1 transmits a single gain for all the impulses, plus sign bits for each individual impulse. The G.723.1 coder restricts all impulse locations to be either odd or even; the choice of odd or even locations is coded using one bit per subframe [50]. The regular-pulse-excited LPC algorithm, which was the first GSM full-rate speech coder, synthesized speech using a train of impulses spaced one per 4 samples, all scaled by a single gain term [65]. The alignment of the pulse train was restricted to one of four possible locations, chosen in a closed-loop fashion together with a gain, an adaptive codebook delay, and an adaptive codebook gain.

Singhal and Atal demonstrated that the quality of MPLPC may be improved at low bit rates by modeling the periodic component of an LPC excitation vector using a pitch prediction filter [87]. Using a pitch prediction filter, the LPC excitation signal becomes

    u(n) = bu(n − D) + Σ_{m=1}^{M} ckm(n)    (36)

where the signal ck(n) is an impulse located at n = k and b is the pitch prediction filter gain. Singhal and Atal proposed choosing D before the locations of any impulses are known, by minimizing the following perceptually weighted error:

    |ED|² = |S̃ − bXDH|²,   XD = [u(−D), . . . , u((L − 1) − D)]    (37)

The G.723.1 multipulse LPC coder and the GSM (Global System for Mobile Communication) full-rate RPE-LTP (regular-pulse excitation with long-term prediction)
coder both use a closed-loop pitch predictor, as do all standardized variations of the CELP coder (see Sections 4.5.2 and 4.5.3). Typically, the pitch delay and gain are optimized first, and then the gains of any additional excitation vectors (e.g., impulses in an MPLPC algorithm) are selected to minimize the remaining error.

4.5.2. Code-Excited LPC (CELP). LPC analysis finds a filter 1/A(z) whose excitation is uncorrelated for correlation distances smaller than the order of the filter. Pitch prediction, especially closed-loop pitch prediction, removes much of the remaining intersample correlation. The spectrum of the pitch prediction residual looks like the spectrum of uncorrelated Gaussian noise, but replacing the residual with real noise (noise that is independent of the original signal) yields poor speech quality. Apparently, some of the temporal details of the pitch prediction residual are perceptually important. Schroeder and Atal proposed modeling the pitch prediction residual using a stochastic excitation vector ck(n) chosen from a list of stochastic excitation vectors, k = 1, . . . , K, known to both the transmitter and receiver [85]:

    u(n) = bu(n − D) + gck(n)    (38)

The list of stochastic excitation vectors is called a stochastic codebook, and the index of the stochastic codevector is chosen in order to minimize the perceptually weighted error metric |Ek|². Rose and Barnwell discussed the similarity between the search for an optimum stochastic codevector index k and the search for an optimum predictor delay D [82], and Kleijn et al. coined the term "adaptive codebook" to refer to the list of delayed excitation signals u(n − D) which the coder considers during closed-loop pitch delay optimization (Fig. 9).

Figure 9. The code-excited LPC algorithm (CELP) constructs an LPC excitation signal by optimally choosing input vectors from two codebooks: an "adaptive" codebook, which represents the pitch periodicity; and a "stochastic" codebook, which represents the unpredictable innovations in each speech frame.

The CELP algorithm was originally not considered efficient enough to be used in real-time speech coding, but a number of computational simplifications were proposed that resulted in real-time CELP-like algorithms. Trancoso and Atal proposed efficient search methods based on the truncated impulse response of the filter W(z)/A(z), as discussed in Section 4.4 [3,97]. Davidson and Lin separately proposed center clipping the stochastic codevectors, so that most of the samples in each codevector are zero [15,67]. Lin also proposed structuring the stochastic codebook so that each codevector is a slightly shifted version of the previous codevector; such a codebook is called an overlapped codebook [67]. Overlapped stochastic codebooks are rarely used in practice today, but overlapped-codebook search methods are often used to reduce the computational complexity of an adaptive codebook search. In the search of an overlapped codebook, the correlation RX and autocorrelation Φ introduced in Section 4.4 may be recursively computed, thus greatly reducing the complexity of the codebook search [63].

Most CELP coders optimize the adaptive codebook index and gain first, and then choose a stochastic codevector and gain in order to minimize the remaining perceptually weighted error. If all the possible pitch periods are longer than one subframe, then the entire content of the adaptive codebook is known before the beginning of the codebook search, and the efficient overlapped-codebook search methods proposed by Lin may be applied [67]. In practice, the pitch period of a female speaker is often shorter than one subframe. In order to guarantee that the entire adaptive codebook is known before beginning a codebook search, two methods are commonly used: (1) the adaptive codebook search may simply be constrained to consider only pitch periods longer than L samples; in this case, the adaptive codebook will lock onto values of D that are an integer multiple of the actual pitch period (if the same integer multiple is not chosen for each subframe, the reconstructed speech quality is usually good); and (2) adaptive codevectors with delays of D < L may be constructed by simply repeating the most recent D samples as necessary to fill the subframe.
4.5.3. SELP, VSELP, ACELP, and LD-CELP. Rose and Barnwell demonstrated that reasonable speech quality is achieved if the LPC excitation vector is computed completely recursively, using two closed-loop pitch predictors in series, with no additional information [82]. In their "self-excited LPC" algorithm (SELP), the LPC excitation is initialized during the first subframe using a vector of samples known at both the transmitter and receiver. For all frames after the first, the excitation is the sum of an arbitrary number of adaptive codevectors:

    u(n) = Σ_{m=1}^{M} bm u(n − Dm)    (39)

Kleijn et al. developed efficient recursive algorithms for searching the adaptive codebook in the SELP coder and other LPC-AS coders [63].

Just as there may be more than one adaptive codebook, it is also possible to use more than one stochastic codebook. The vector-sum excited LPC algorithm (VSELP) models the LPC excitation vector as the sum of one adaptive and two stochastic codevectors [25]:

    u(n) = bu(n − D) + Σ_{m=1}^{2} gm ckm(n)    (40)
The two stochastic codebooks are each relatively small (typically 32 vectors), so that each of the codebooks may be searched efficiently. The adaptive codevector and the two stochastic codevectors are searched sequentially. In the low-delay CELP coder (LD-CELP) [10], standardized by the ITU as G.728, the total delay of a tandem coder and decoder must be no more than 2 ms. LPC analysis and codevector search are computed once per 2 ms (16 samples). Transmission of LPC coefficients once per two milliseconds would require too many bits, so LPC coefficients are computed in a recursive backward-adaptive fashion. Before coding or decoding each frame, samples of ŝ(n) from the previous frame are windowed and used to update a recursive estimate of the autocorrelation function. The resulting autocorrelation coefficients are similar to those that would be obtained using a relatively long asymmetric analysis window. LPC coefficients are then computed from the autocorrelation function using the Levinson–Durbin algorithm.

4.6. Line Spectral Frequencies (LSFs) or Line Spectral Pairs (LSPs)

Linear prediction can be viewed as an inverse filtering procedure in which the speech signal is passed through an all-zero filter A(z). The filter coefficients of A(z) are chosen such that the energy in the output, that is, the residual or error signal, is minimized. Alternatively, the inverse filter A(z) can be transformed into two other filters P(z) and Q(z). These new filters turn out to have some interesting properties, and the representation based on them, called the line spectrum pairs [89,91], has been used in speech coding and synthesis applications.

Let A(z) be the frequency response of an LPC inverse filter of order p:

A(z) = 1 − Σ_{n=1}^{p} a_n z^{−n}

The polynomials P(z) and Q(z) are defined as

P(z) = A(z) + z^{−(p+1)} A(z^{−1}) = Π_{n=1}^{(p+1)/2} (1 − e^{jp_n} z^{−1})(1 − e^{−jp_n} z^{−1})    (44)

Q(z) = A(z) − z^{−(p+1)} A(z^{−1}) = (1 − z^{−2}) Π_{n=1}^{(p−1)/2} (1 − e^{jq_n} z^{−1})(1 − e^{−jq_n} z^{−1})    (45)

(the factorizations shown are for odd p). The LSFs have some interesting characteristics: the frequencies {p_n} and {q_n} are related to the formant frequencies; the dynamic range of {p_n} and {q_n} is limited and the two alternate around the unit circle (0 ≤ p_1 ≤ q_1 ≤ p_2 ≤ · · ·); {p_n} and {q_n} are correlated, so that intraframe prediction is possible; and they change slowly from one frame to another, hence interframe prediction is also possible. The interleaving nature of the {p_n} and {q_n} allows for efficient iterative solutions [58].

Almost all LPC-based coders today use the LSFs to represent the LP parameters. Considerable recent research has been devoted to methods for efficiently quantizing the LSFs, especially using vector quantization (VQ) techniques. Typical algorithms include predictive VQ, split VQ [76], and multistage VQ [66,74]. All of these methods are used in the ITU standard ACELP coder G.729: the moving-average vector prediction residual is quantized using a 7-bit first-stage codebook, followed by second-stage quantization of two subvectors using independent 5-bit codebooks, for a total of 17 bits per frame [49,84].
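One direct (though inefficient) way to obtain the LSFs is to root P(z) and Q(z) numerically. Standardized coders use fast Chebyshev-series searches instead, so the helper below is purely an illustrative sketch of Eqs. (44) and (45):

```python
import numpy as np

def lsf_from_lpc(a):
    """Line spectral frequencies of the inverse filter A(z).

    a -- coefficients [1, a_1, ..., a_p] of A(z) = sum_i a[i] z^-i.
    Returns (p_freqs, q_freqs): angles in (0, pi) of the unit-circle
    roots of P(z) = A(z) + z^-(p+1) A(1/z) and
    Q(z) = A(z) - z^-(p+1) A(1/z).
    """
    a_ext = np.concatenate([np.asarray(a, dtype=float), [0.0]])
    P = a_ext + a_ext[::-1]
    Q = a_ext - a_ext[::-1]

    def unit_circle_angles(poly):
        w = np.angle(np.roots(poly))
        # drop the trivial roots at z = +/-1 (angles 0 and pi)
        return np.sort(w[(w > 1e-9) & (w < np.pi - 1e-9)])

    return unit_circle_angles(P), unit_circle_angles(Q)
```

For a stable A(z) the returned angles interlace on the unit circle, which also makes a convenient stability check after quantization.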
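The Levinson–Durbin recursion used in the backward-adaptive LPC analysis above admits a compact transcription; the sketch below is illustrative and not the reference implementation of any standard:

```python
import numpy as np

def levinson_durbin(r):
    """Solve the LPC normal equations from autocorrelations r[0..p].

    Returns (a, E): inverse-filter coefficients A(z) = sum_j a[j] z^-j
    (with a[0] = 1) and the final prediction error energy E.
    """
    p = len(r) - 1
    a = np.array([1.0])
    E = float(r[0])
    for i in range(1, p + 1):
        # reflection coefficient from the order-(i-1) solution
        k = -np.dot(a, r[1:i + 1][::-1]) / E
        a = np.concatenate([a, [0.0]])
        a = a + k * a[::-1]          # order-update of the coefficients
        E *= 1.0 - k * k             # prediction error shrinks each order
    return a, E
```

Applied to the autocorrelation of a first-order process, the recursion returns a first-order predictor and sets the remaining coefficients to zero.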
d̄(m) = (1/(N − |m|)) Σ_{n=|m|}^{N−1} |d(n) − d(n − |m|)|    (47)

The frame is labeled as voiced if there is a trough in d̄(m) that is large enough to be caused by voiced excitation. Only values of m between 20 and 160 are examined, corresponding to pitch frequencies between 50 and 400 Hz. If the minimum value of d̄(m) in this range is less than a threshold, the frame is declared voiced; otherwise it is declared unvoiced [8].

If the frame is voiced, then the LPC residual is represented using an impulse train of period T0, where

T0 = arg min_{20 ≤ m ≤ 160} d̄(m)    (48)

If the frame is unvoiced, a pitch period of T0 = 0 is transmitted, indicating that an uncorrelated Gaussian random noise signal should be used as the excitation of the LPC synthesis filter.

5.3. Multiband Excitation (MBE)

In multiband excitation (MBE) coding the voiced/unvoiced decision is not a binary one; instead, a series of voicing decisions is made for independent harmonic intervals [31]. Since voicing decisions can be made in different frequency bands individually, synthesized speech may be partially voiced and partially unvoiced. An improved version of the MBE, referred to as the IMBE coder, was introduced in the late 1980s [7,35]. The IMBE at 2.4 kbps produces better sound quality than does LPC-10e. The IMBE was adopted as the Inmarsat-M coding standard for satellite voice communication at a total rate of 6.4 kbps, including 4.15 kbps of source coding and 2.25 kbps of channel coding [104]. The advanced MBE (AMBE) coder was adopted as the Inmarsat Mini-M standard at a 4.8-kbps total data rate, including 3.6 kbps of speech and 1.2 kbps of channel coding [18,27]. In [14] an enhanced multiband excitation (EMBE) coder was presented. The distinguishing features of the EMBE coder include signal-adaptive multimode spectral modeling and parameter quantization, a two-band signal-adaptive
frequency-domain voicing decision, a novel VQ scheme for the efficient encoding of the variable-dimension spectral magnitude vectors at low rates, and multiclass selective protection of spectral parameters from channel errors. The 4-kbps EMBE coder accounts for both source (2.9 kbps) and channel (1.1 kbps) coding and was designed for satellite-based communication systems.

5.4. Prototype Waveform Interpolative (PWI) Coding

A different kind of coding technique that has properties of both waveform and LPC-based coders has been proposed [59,60] and is called prototype waveform interpolation (PWI). PWI uses both interpolation in the frequency domain and forward–backward prediction in the time domain. The technique is based on the assumption that, for voiced speech, a perceptually accurate speech signal can be reconstructed from a description of the waveform of a single, representative pitch cycle per interval of 20–30 ms. The assumption exploits the fact that voiced speech can be interpreted as a concatenation of slowly evolving pitch cycle waveforms. The prototype waveform is described by a set of linear prediction (LP) filter coefficients describing the formant structure and a prototype excitation waveform, quantized with analysis-by-synthesis procedures. The speech signal is reconstructed by filtering an excitation signal consisting of the concatenation of (infinitesimal) sections of the instantaneous excitation waveforms. By coding the voiced and unvoiced components separately, a 2.4-kbps version of the coder performed similarly to the 4.8-kbps FS1016 standard [61].

Recent work has aimed at reducing the computational complexity of the coder for rates between 1.2 and 2.4 kbps by including a time-varying waveform sampling rate and a cubic B-spline waveform representation [62,86].

6. MEASURES OF SPEECH QUALITY

Deciding on an appropriate measurement of quality is one of the most difficult aspects of speech coder design, and is an area of current research and standardization. Early military speech coders were judged according to only one criterion: intelligibility. With the advent of consumer-grade speech coders, intelligibility is no longer a sufficient condition for speech coder acceptability. Consumers want speech that sounds ''natural.'' A large number of subjective and objective measures have been developed to quantify ''naturalness,'' but it must be stressed that any scalar measurement of ''naturalness'' is an oversimplification. ''Naturalness'' is a multivariate quantity, including such factors as the metallic versus breathy quality of speech, the presence of noise, the color of the noise (narrowband noise tends to be more annoying than wideband noise, but the parameters that predict ''annoyance'' are not well understood), the presence of unnatural spectral envelope modulations (e.g., flutter noise), and the absence of natural spectral envelope modulations.

6.1. Psychophysical Measures of Speech Quality (Subjective Tests)

The final judgment of speech coder quality is the judgment made by human listeners. If consumers (and reviewers) like the way the product sounds, then the speech coder is a success. The reaction of consumers can often be predicted to a certain extent by evaluating the reactions of experimental listeners in a controlled psychophysical testing paradigm. Psychophysical tests (often called ''subjective tests'') vary depending on the quantity being evaluated, and the structure of the test.

6.1.1. Intelligibility. Speech coder intelligibility is evaluated by coding a number of prepared words, asking listeners to write down the words they hear, and calculating the percentage of correct transcriptions (an adjustment for guessing may be subtracted from the score). The diagnostic rhyme test (DRT) and diagnostic alliteration test (DALT) are intelligibility tests which use a controlled vocabulary to test for specific types of intelligibility loss [101,102]. Each test consists of 96 pairs of confusable words spoken in isolation. The words in a pair differ in only one distinctive feature, where the distinctive feature dimensions proposed by Voiers are voicing, nasality, sustention, sibilation, graveness, and compactness. In the DRT, the words in a pair differ in only one distinctive feature of the initial consonant; for instance, ''jest'' and ''guest'' differ in the sibilation of the initial consonant. In the DALT, words differ in the final consonant; for instance, ''oaf'' and ''oath'' differ in the graveness of the final consonant. Listeners hear one of the words in each pair, and are asked to select the word from two written alternatives. Professional testing firms employ trained listeners who are familiar with the speakers and speech tokens in the database, in order to minimize test–retest variability.

Intelligibility scores quoted in the speech coding literature often refer to the composite results of a DRT. In a comparison of two federal standard coders, the LPC-10e algorithm resulted in 90% intelligibility, while the FS-1016 CELP algorithm had 91% intelligibility [64]. An evaluation of waveform interpolative (WI) coding published DRT scores of 87.2% for the WI algorithm, and 87.7% for FS-1016 [61].

6.1.2. Numerical Measures of Perceptual Quality. Perhaps the most commonly used speech quality measure is the mean opinion score (MOS). A mean opinion score is computed by coding a set of spoken phrases using a variety of coders, presenting all of the coded speech together with undegraded speech in random order, asking listeners to rate the quality of each phrase on a numerical scale, and then averaging the numerical ratings of all phrases coded by a particular coder. The five-point numerical scale is associated with a standard set of descriptive terms: 5 = excellent, 4 = good, 3 = fair, 2 = poor, and 1 = bad. A rating of 4 is supposed to correspond to standard toll-quality speech, quantized at 64 kbps using ITU standard G.711 [48].

Mean opinion scores vary considerably depending on background noise conditions; for example, CVSD performs significantly worse than LPC-based methods in quiet recording conditions, but significantly better under extreme noise conditions [96]. Gender of the speaker may also affect the relative ranking of coders [96]. Expert listeners tend to give higher rankings to speech coders with which they are familiar, even when they are not consciously aware of the order in which coders are presented [96]. Factors such as language and location of the testing laboratory may shift the scores of all coders up or down, but tend not to change the rank order of individual coders [39]. For all of these reasons, a serious MOS test must evaluate several reference coders in parallel with the coder of interest, and under identical test conditions. If an MOS test is performed carefully, intercoder differences of approximately 0.15 opinion points may be considered significant. Figure 12 is a plot of MOS as a function of bit rate for coders evaluated under quiet listening conditions in five published studies (one study included separately tabulated data from two different testing sites [96]).

Figure 12. MOS of speech coders as a function of bit rate under quiet recording conditions — Jarvinen [53], Kohler [64], MPEG [39], Yeldener [107], and the COMSAT and MPC sites from Tardelli et al. [96]: (A) unmodified speech, (B) ITU G.722 subband ADPCM, (C) ITU G.726 ADPCM, (D) ISO MPEG-II layer 3 subband audio coder, (E) DDVPC CVSD, (F) GSM full-rate RPE-LTP.

The diagnostic acceptability measure (DAM) is an attempt to control some of the factors that lead to variability in published MOS scores [100]. The DAM employs trained listeners, who rate the quality of standardized test phrases on 10 independent perceptual scales, including six scales that rate the speech itself (fluttering, thin, rasping, muffled, interrupted, nasal), and four scales that rate the background noise (hissing, buzzing, babbling, rumbling). Each of these is a 100-point scale, with a range of approximately 30 points between the LPC-10e algorithm (50 points) and clean speech (80 points) [96]. Scores on the various perceptual scales are combined into a composite quality rating. DAM scores are useful for pointing out specific defects in a speech coding algorithm. If the only desired test outcome is a relative quality ranking of multiple coders, a carefully controlled MOS test in which all coders of interest are tested under the same conditions may be as reliable as DAM testing [96].

6.1.3. Comparative Measures of Perceptual Quality. It is sometimes difficult to evaluate the statistical significance of a reported MOS difference between two coders. A more powerful statistical test can be applied if coders are evaluated in explicit A/B comparisons. In a comparative test, a listener hears the same phrase coded by two different coders, and chooses the one that sounds better. The result of a comparative test is an apparent preference score, and an estimate of the significance of the observed preference; for example, in a 1999 study, WI coding at 4.0 kbps was preferred to 4 kbps HVXC 63.7% of the time, to 5.3 kbps G.723.1 57.5% of the time (statistically significant differences), and to 6.3 kbps G.723.1 53.9% of the time (not statistically significant) [29]. It should be noted that ''statistical significance'' in such a test refers only to the probability that the same listeners listening to the same waveforms will show the same preference in a future test.

6.2. Algorithmic Measures of Speech Quality (Objective Measures)

Psychophysical testing is often inconvenient; it is not possible to run psychophysical tests to evaluate every proposed adjustment to a speech coder. For this reason, a number of algorithms have been proposed that approximate, to a greater or lesser extent, the results of psychophysical testing.
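The simplest of these algorithmic measures are the frame SNR and average log segmental SNR of Eqs. (49) and (50) below; a direct transcription, with an assumed (illustrative) frame length of 160 samples, is:

```python
import numpy as np

def segsnr(s, shat, N=160):
    """Average log segmental SNR of coded speech shat relative to the
    original s, over K non-overlapping frames of N samples."""
    K = len(s) // N
    snrs = []
    for k in range(K):
        frame = slice(k * N, (k + 1) * N)
        e = s[frame] - shat[frame]                        # quantization error
        snr = np.sum(s[frame] ** 2) / np.sum(e ** 2)      # Eq. (49)
        snrs.append(10 * np.log10(snr))
    return np.mean(snrs)                                  # Eq. (50)
```

Because each frame contributes its log-SNR equally, a single noisy frame lowers the score far more than it would lower a global SNR, which roughly matches how listeners accumulate a perception of quantization noise.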
The signal-to-noise ratio of a frame of N speech samples starting at sample number n may be defined as

SNR(n) = [Σ_{m=n}^{n+N−1} s²(m)] / [Σ_{m=n}^{n+N−1} e²(m)]    (49)

High-energy signal components can mask quantization error that is synchronous with the signal component, or separated from it by at most a few tens of milliseconds. Over longer periods of time, listeners accumulate a general perception of quantization noise, which can be modeled as the average log segmental SNR:

SEGSNR = (1/K) Σ_{k=0}^{K−1} 10 log_{10} SNR(kN)    (50)

The ITU perceptual speech quality measure (PSQM) computes the perceptual quality of a speech signal by filtering the input and quantized signals using a Bark-scale filterbank, nonlinearly compressing the amplitudes in each band, and then computing an average subband signal-to-noise ratio [51]. The development of algorithms that accurately predict the results of MOS or comparative testing is an area of active current research, and a number of improvements, alternatives, and/or extensions to the PSQM measure have been proposed. An algorithm that has been the focus of considerable research activity is the Bark spectral distortion measure [73,103,105,106]. The ITU has also proposed an extension of the PSQM standard called perceptual evaluation of speech quality (PESQ) [81], which will be released as ITU standard P.862.

7. NETWORK ISSUES

7.1. Voice over IP
hand, the bitstream of the coder operating at low bit rates is embedded in the bitstream of the coder operating at higher rates. Each increment in bit rate provides marginal improvement in speech quality. Lower bit rate coding is obtained by puncturing bits from the higher rate coder and typically exhibits graceful degradation in quality with decreasing bit rates.

ITU Standard G.727 describes an embedded ADPCM coder, which may be run at rates of 40, 32, 24, or 16 kbps (5, 4, 3, or 2 bits per sample) [46]. Embedded ADPCM algorithms are a family of variable-bit-rate coding algorithms operating on a sample-by-sample basis (as opposed to, e.g., a subband coder that operates on a frame-by-frame basis) that allow for bit dropping after encoding. The decision levels of the lower-rate quantizers are subsets of those of the quantizers at higher rates. This allows for bit reduction at any point in the network without the need for coordination between the transmitter and the receiver. The prediction in the encoder is computed using a coarser quantization of d̂(n) than the quantization actually transmitted. For example, 5 bits per sample may be transmitted, but as few as 2 bits may be used to reconstruct d̂(n) in the prediction loop. Any bits not used in the prediction loop are marked as ''optional'' by the signaling channel mode flag. If network congestion disrupts traffic at a router between sender and receiver, the router is allowed to drop optional bits from the coded speech packets. Embedded ADPCM algorithms produce codewords that contain enhancement and core bits. The feedforward (FF) path of the codec utilizes both enhancement bits and core bits, while the feedback (FB) path uses core bits only. With this structure, enhancement bits can be discarded or dropped during network congestion.

An important example of a multimode coder is QCELP, the speech coder standard that was adopted by the TIA North American digital cellular standard based on code-division multiple access (CDMA) technology [9]. The coder selects one of four data rates every 20 ms depending on the speech activity; for example, background noise is coded at a lower rate than speech. The four rates are approximately 1 kbps (eighth rate), 2 kbps (quarter rate), 4 kbps (half rate), and 8 kbps (full rate). QCELP is based on the CELP structure but integrates implementation of the different rates, thus reducing the average bit rate. For example, at the higher rates, the LSP parameters are more finely quantized and the pitch and codebook parameters are updated more frequently [23]. The coder provides good-quality speech at average rates of 4 kbps.

Another example of a multimode coder is ITU standard G.723.1, which is an LPC-AS coder that can operate at two rates: 5.3 or 6.3 kbps [50]. At 6.3 kbps, the coder is a multipulse LPC (MPLPC) coder, while the 5.3-kbps coder is an algebraic CELP (ACELP) coder. The frame size is 30 ms with an additional lookahead of 7.5 ms, resulting in a total algorithmic delay of 37.5 ms. The ACELP and MPLPC coders share the same LPC analysis algorithm and frame/subframe structure, so that most of the program code is used by both coders. As mentioned earlier, in ACELP, an algebraic transformation of the transmitted index produces the excitation signal for the synthesizer. In MPLPC, on the other hand, the perceptually weighted error is minimized by choosing the amplitudes and positions of a number of pulses in the excitation signal. Voice activity detection (VAD) is used to reduce the bit rate during silent periods, and switching from one bit rate to another is done on a frame-by-frame basis.

Multimode coders have been proposed over a wide variety of bandwidths. Taniguchi et al. proposed a multimode ADPCM coder at bit rates between 10 and 35 kbps [94]. Johnson and Taniguchi proposed a multimode CELP algorithm at data rates of 4.0–5.3 kbps in which additional stochastic codevectors are added to the LPC excitation vector when channel conditions are sufficiently good to allow high-quality transmission [55]. The European Telecommunications Standards Institute (ETSI) has recently proposed a standard for adaptive multirate coding at rates between 4.75 and 12.2 kbps.

7.3. Joint Source-Channel Coding

In speech communication systems, a major challenge is to design a system that provides the best possible speech quality throughout a wide range of channel conditions. One solution consists of allowing the transceivers to monitor the state of the communication channel and to dynamically allocate the bitstream between source and channel coding accordingly. For low-SNR channels, the source coder operates at low bit rates, thus allowing powerful forward error control. For high-SNR channels, the source coder uses its highest rate, resulting in high speech quality, but with little error control. An adaptive algorithm selects a source coder and channel coder based on estimates of channel quality in order to maintain a constant total data rate [95]. This technique is called adaptive multirate (AMR) coding, and requires the simultaneous implementation of an AMR source coder [24], an AMR channel coder [26,28], and a channel quality estimation algorithm capable of acquiring information about channel conditions with a relatively small tracking delay.

The notion of determining the relative importance of bits for further unequal error protection (UEP) was pioneered by Rydbeck and Sundberg [83]. Rate-compatible channel codes, such as Hagenauer's rate-compatible punctured convolutional codes (RCPC) [34], are a collection of codes providing a family of channel coding rates. By puncturing bits in the bitstream, the channel coding rate of RCPC codes can be varied instantaneously, providing UEP by imparting on different segments different degrees of protection. Cox et al. [13] address the issue of channel coding and illustrate how RCPC codes can be used to build a speech transmission scheme for mobile radio channels. Their approach is based on a subband coder with dynamic bit allocation proportional to the average energy of the bands. RCPC codes are then used to provide UEP.

Relatively few AMR systems describing source and channel coding have been presented. The AMR systems [99,98,75,44] combine different types of variable-rate CELP coders for source coding with RCPC and cyclic redundancy check (CRC) codes for channel coding and were presented as candidates for the European Telecommunications Standards Institute (ETSI) GSM AMR codec
standard. In [88], UEP is applied to perceptually based audio coders (PAC). The bitstream of the PAC is divided into two classes, and punctured convolutional codes are used to provide different levels of protection, assuming a BPSK constellation.

In [5,6], a novel UEP channel encoding scheme is introduced by analyzing how symbol-wise puncturing of symbols in a trellis code and the rate-compatibility constraint (progressive puncturing pattern) can be used to derive rate-compatible punctured trellis codes (RCPT). While conceptually similar to RCPC codes, RCPT codes are specifically designed to operate efficiently on large constellations (for which Euclidean and Hamming distances are no longer equivalent) by maximizing the residual Euclidean distance after symbol puncturing. Large constellation sizes, in turn, lead to higher throughput and spectral efficiency on high-SNR channels. An AMR system is then designed based on a perceptually based embedded subband encoder. Since perceptually based dynamic bit allocations lead to a wide range of bit error sensitivities (the perceptually least important bits being almost insensitive to channel transmission errors), the channel protection requirements are determined accordingly. The AMR systems utilize the new rate-compatible channel coding technique (RCPT) for UEP and operate on an 8-PSK constellation. The AMR-UEP system is bandwidth-efficient, operates over a wide range of channel conditions, and degrades gracefully with decreasing channel quality.

Systems using AMR source and channel coding are likely to be integrated in future communication systems since they have the capability for providing graceful speech degradation over a wide range of channel conditions.

8. STANDARDS

Standards for landline public switched telephone service (PSTN) networks are established by the International Telecommunication Union (ITU) (http://www.itu.int). The ITU has promulgated a number of important speech and waveform coding standards at high bit rates and with very low delay, including G.711 (PCM), G.727 and G.726 (ADPCM), and G.728 (LDCELP). The ITU is also involved in the development of internetworking standards, including the voice over IP standard H.323. The ITU has developed one widely used low-bit-rate coding standard (G.729), and a number of embedded and multimode speech coding standards operating at rates between 5.3 kbps (G.723.1) and 40 kbps (G.727). Standard G.729 is a speech coder operating at 8 kbps, based on algebraic code-excited LPC (ACELP) [49,84]. G.723.1 is a multimode coder, capable of operating at either 5.3 or 6.3 kbps [50]. G.722 is a standard for wideband speech coding, and the ITU will announce an additional wideband standard within the near future. The ITU has also published standards for the objective estimation of perceptual speech quality (P.861 and P.862).

The International Organization for Standardization (ISO) (http://www.iso.ch) develops standards for the Moving Picture Experts Group (MPEG). The MPEG-2 standard included digital audio coding at three levels of complexity, including the layer 3 codec commonly known as MP3 [72]. The MPEG-4 motion picture standard includes a structured audio standard [40], in which speech and audio ''objects'' are encoded with header information specifying the coding algorithm. Low-bit-rate speech coding is performed using harmonic vector excited coding (HVXC) [43] or code-excited LPC (CELP) [41], and audio coding is performed using time–frequency coding [42]. The MPEG homepage is at drogo.cselt.stet.it/mpeg.

Standards for cellular telephony in Europe are established by the European Telecommunications Standards Institute (ETSI) (http://www.etsi.org). ETSI speech coding standards are published by the Global System for Mobile Telecommunications (GSM) subcommittee. All speech coding standards for digital cellular telephone use are based on LPC-AS algorithms. The first GSM standard coder was based on a precursor of CELP called regular-pulse excitation with long-term prediction (RPE-LTP) [37,65]. Current GSM standards include the enhanced full-rate codec GSM 06.60 [32,53] and the adaptive multirate codec [33]; both standards use algebraic code-excited LPC (ACELP). At the time of writing, both ITU and ETSI are expected to announce new standards for wideband speech coding in the near future. ETSI's standard will be based on GSM AMR.

The Telecommunications Industry Association (http://www.tiaonline.org) published some of the first U.S. digital cellular standards, including the vector-sum-excited LPC (VSELP) standard IS54 [25]. In fact, both the initial U.S. and Japanese digital cellular standards were based on the VSELP algorithm. The TIA has been active in the development of standard TR41 for voice over IP.

The U.S. Department of Defense Voice Processing Consortium (DDVPC) publishes speech coding standards for U.S. government applications. As mentioned earlier, the original FS-1015 LPC-10e standard at 2.4 kbps [8,16], originally developed in the 1970s, was replaced in 1996 by the newer MELP standard at 2.4 kbps [92]. Transmission at slightly higher bit rates uses the FS-1016 CELP standard at 4.8 kbps [17,56,57]. Waveform applications use the continuously variable slope delta modulator (CVSD) at 16 kbps. Descriptions of all DDVPC standards and code for most are available at http://www.plh.af.mil/ddvpc/index.html.

9. FINAL REMARKS

In this article, we presented an overview of coders that compress speech by attempting to match the time waveform as closely as possible (waveform coders), and coders that attempt to preserve perceptually relevant spectral properties of the speech signal (LPC-based and subband coders). LPC-based coders use a speech production model to parameterize the speech signal, while subband coders filter the signal into frequency bands and assign bits by either an energy or perceptual criterion. Issues pertaining to networking, such as voice over IP and joint source–channel coding, were also touched on. There are several other coding techniques that we have not discussed in this article because of space limitations. We hope to have provided the reader
with an overview of the fundamental techniques of speech compression.

Acknowledgments

This research was supported in part by the NSF and HRL. We thank Alexis Bernard and Tomohiko Taniguchi for their suggestions on earlier drafts of the article.

BIOGRAPHIES

Mark A. Hasegawa-Johnson received his S.B., S.M., and Ph.D. degrees in electrical engineering and computer science from MIT in 1989, 1989, and 1996, respectively. From 1989 to 1990 he worked as a research engineer at Fujitsu Laboratories Ltd., Kawasaki, Japan, where he developed and patented a multimodal CELP speech coder with an efficient algebraic fixed codebook. From 1996–1999 he was a postdoctoral fellow in the Electrical Engineering Department at UCLA. Since 1999, he has been on the faculty of the University of Illinois at Urbana-Champaign. Dr. Hasegawa-Johnson holds four U.S. patents and is the author of four journal articles and twenty conference papers. His areas of interest include speech coding, automatic speech understanding, acoustics, and the physiology of speech production.

Abeer Alwan received her Ph.D. in electrical engineering from MIT in 1992. Since then, she has been with the Electrical Engineering Department at UCLA, California, as an assistant professor (1992–1996), associate professor (1996–2000), and professor (2000–present). Professor Alwan established and directs the Speech Processing and Auditory Perception Laboratory at UCLA (http://www.icsl.ucla.edu/~spapl). Her research interests include modeling human speech production and perception mechanisms and applying these models to speech-processing applications such as automatic recognition, compression, and synthesis. She is the recipient of the NSF Research Initiation Award (1993), the NIH FIRST Career Development Award (1994), the UCLA-TRW Excellence in Teaching Award (1994), the NSF Career Development Award (1995), and the Okawa Foundation Award in Telecommunications (1997). Dr. Alwan is an elected member of Eta Kappa Nu, Sigma Xi, Tau Beta Pi, and the New York Academy of Sciences. She served as an elected member on the Acoustical Society of America Technical Committee on Speech Communication (1993–1999), and on the IEEE Signal Processing Technical Committees on Audio and Electroacoustics (1996–2000) and Speech Processing (1996–2001). She is an editor in chief of the journal Speech Communication.

BIBLIOGRAPHY

3. B. S. Atal, High-quality speech at low bit rates: Multi-pulse and stochastically excited linear predictive coders, Proc. ICASSP, 1986, pp. 1681–1684.

4. B. S. Atal and J. R. Remde, A new model of LPC excitation for producing natural-sounding speech at low bit rates, Proc. ICASSP, 1982, pp. 614–617.

5. A. Bernard, X. Liu, R. Wesel, and A. Alwan, Channel adaptive joint-source channel coding of speech, Proc. 32nd Asilomar Conf. Signals, Systems, and Computers, 1998, Vol. 1, pp. 357–361.

6. A. Bernard, X. Liu, R. Wesel, and A. Alwan, Embedded joint-source channel coding of speech using symbol puncturing of trellis codes, Proc. IEEE ICASSP, 1999, Vol. 5, pp. 2427–2430.

7. M. S. Brandstein, P. A. Monta, J. C. Hardwick, and J. S. Lim, A real-time implementation of the improved MBE speech coder, Proc. ICASSP, 1990, Vol. 1, pp. 5–8.

8. J. P. Campbell and T. E. Tremain, Voiced/unvoiced classification of speech with applications to the U.S. government LPC-10E algorithm, Proc. ICASSP, 1986, pp. 473–476.

9. CDMA, Wideband Spread Spectrum Digital Cellular System Dual-Mode Mobile Station-Base Station Compatibility Standard, Technical Report Proposed EIA/TIA Interim Standard, Telecommunications Industry Association TR45.5 Subcommittee, 1992.

10. J.-H. Chen et al., A low delay CELP coder for the CCITT 16 kb/s speech coding standard, IEEE J. Select. Areas Commun. 10: 830–849 (1992).

11. J.-H. Chen and A. Gersho, Adaptive postfiltering for quality enhancement of coded speech, IEEE Trans. Speech Audio Process. 3(1): 59–71 (1995).

12. R. Cox et al., New directions in subband coding, IEEE JSAC 6(2): 391–409 (Feb. 1988).

13. R. Cox, J. Hagenauer, N. Seshadri, and C. Sundberg, Subband speech coding and matched convolutional coding for mobile radio channels, IEEE Trans. Signal Process. 39(8): 1717–1731 (Aug. 1991).

14. A. Das and A. Gersho, Low-rate multimode multiband spectral coding of speech, Int. J. Speech Tech. 2(4): 317–327 (1999).

15. G. Davidson and A. Gersho, Complexity reduction methods for vector excitation coding, Proc. ICASSP, 1986, pp. 2055–2058.

16. DDVPC, LPC-10e Speech Coding Standard, Technical Report FS-1015, U.S. Dept. of Defense Voice Processing Consortium, Nov. 1984.

17. DDVPC, CELP Speech Coding Standard, Technical Report FS-1016, U.S. Dept. of Defense Voice Processing Consortium, 1989.

18. S. Dimolitsas, Evaluation of voice coded performance for the Inmarsat Mini-M system, Proc. 10th Int. Conf. Digital
Satellite Communications, 1995.
BIBLIOGRAPHY 19. M. Handley et al., SIP: Session Initiation Protocol, IETF
RFC, March 1999, http://www.cs.columbia.edu/hgs/sip/
sip.html.
1. J.-P. Adoul, P. Mabilleau, M. Delprat, and S. Morisette, Fast
CELP coding based on algebraic codes, Proc. ICASSP, 1987, 20. H. Fletcher, Speech and Hearing in Communication, Van
pp. 1957–1960. Nostrand, Princeton, NJ, 1953.
2. B. S. Atal, Predictive coding of speech at low bit rates, IEEE 21. D. Florencio, Investigating the use of asymmetric windows
Trans. Commun. 30: 600–614 (1982). in CELP vocoders, Proc. ICASSP, 1993, Vol. II, pp. 427–430.
22. S. Furui, Digital Speech Processing, Synthesis, and Recognition, Marcel Dekker, New York, 1989.

23. W. Gardner, P. Jacobs, and C. Lee, QCELP: A variable rate speech coder for CDMA digital cellular, in B. Atal, V. Cuperman, and A. Gersho, eds., Speech and Audio Coding for Wireless and Network Applications, Kluwer, Dordrecht, The Netherlands, 1993, pp. 85–93.

24. A. Gersho and E. Paksoy, An overview of variable rate speech coding for cellular networks, IEEE Int. Conf. Selected Topics in Wireless Communications Proc., June 1999, pp. 172–175.

25. I. Gerson and M. Jasiuk, Vector sum excited linear prediction (VSELP), in B. S. Atal, V. S. Cuperman, and A. Gersho, eds., Advances in Speech Coding, Kluwer, Dordrecht, The Netherlands, 1991, pp. 69–80.

26. D. Goeckel, Adaptive coding for time-varying channels using outdated fading estimates, IEEE Trans. Commun. 47(6): 844–855 (1999).

27. R. Goldberg and L. Riek, A Practical Handbook of Speech Coders, CRC Press, Boca Raton, FL, 2000.

28. A. Goldsmith and S. G. Chua, Variable-rate variable-power MQAM for fading channels, IEEE Trans. Commun. 45(10): 1218–1230 (1997).

29. O. Gottesman and A. Gersho, Enhanced waveform interpolative coding at 4 kbps, IEEE Workshop on Speech Coding, Piscataway, NJ, 1999, pp. 90–92.

30. K. Gould, R. Cox, N. Jayant, and M. Melchner, Robust speech coding for the indoor wireless channel, AT&T Tech. J. 72(4): 64–73 (1993).

31. D. W. Griffin and J. S. Lim, Multi-band excitation vocoder, IEEE Trans. Acoust. Speech Signal Process. 36(8): 1223–1235 (1988).

32. Special Mobile Group (GSM), Digital Cellular Telecommunications System: Enhanced Full Rate (EFR) Speech Transcoding, Technical Report GSM 06.60, European Telecommunications Standards Institute (ETSI), 1997.

33. Special Mobile Group (GSM), Digital Cellular Telecommunications System (Phase 2+): Adaptive Multi-rate (AMR) Speech Transcoding, Technical Report GSM 06.90, European Telecommunications Standards Institute (ETSI), 1998.

34. J. Hagenauer, Rate-compatible punctured convolutional codes and their applications, IEEE Trans. Commun. 36(4): 389–400 (1988).

35. J. C. Hardwick and J. S. Lim, A 4.8 kbps multi-band excitation speech coder, Proc. ICASSP, 1988, Vol. 1, pp. 374–377.

36. M. Hasegawa-Johnson, Line spectral frequencies are the poles and zeros of a discrete matched-impedance vocal tract model, J. Acoust. Soc. Am. 108(1): 457–460 (2000).

37. K. Hellwig et al., Speech codec for the European mobile radio system, Proc. IEEE Global Telecomm. Conf., 1989.

38. O. Hersent, D. Gurle, and J.-P. Petit, IP Telephony, Addison-Wesley, Reading, MA, 2000.

39. ISO, Report on the MPEG-4 Speech Codec Verification Tests, Technical Report JTC1/SC29/WG11, ISO/IEC, Oct. 1998.

40. ISO/IEC, Information Technology — Coding of Audiovisual Objects, Part 3: Audio, Subpart 1: Overview, Technical Report ISO/JTC 1/SC 29/N2203, ISO/IEC, 1998.

41. ISO/IEC, Information Technology — Coding of Audiovisual Objects, Part 3: Audio, Subpart 3: CELP, Technical Report ISO/JTC 1/SC 29/N2203CELP, ISO/IEC, 1998.

42. ISO/IEC, Information Technology — Coding of Audiovisual Objects, Part 3: Audio, Subpart 4: Time/Frequency Coding, Technical Report ISO/JTC 1/SC 29/N2203TF, ISO/IEC, 1998.

43. ISO/IEC, Information Technology — Very Low Bitrate Audio-Visual Coding, Part 3: Audio, Subpart 2: Parametric Coding, Technical Report ISO/JTC 1/SC 29/N2203PAR, ISO/IEC, 1998.

44. H. Ito, M. Serizawa, K. Ozawa, and T. Nomura, An adaptive multi-rate speech codec based on MP-CELP coding algorithm for ETSI AMR standard, Proc. ICASSP, 1998, Vol. 1, pp. 137–140.

45. ITU-T, 40, 32, 24, 16 kbit/s Adaptive Differential Pulse Code Modulation (ADPCM), Technical Report G.726, International Telecommunications Union, Geneva, 1990.

46. ITU-T, 5-, 4-, 3- and 2-bits per Sample Embedded Adaptive Differential Pulse Code Modulation (ADPCM), Technical Report G.727, International Telecommunications Union, Geneva, 1990.

47. ITU-T, Coding of Speech at 16 kbit/s Using Low-Delay Code Excited Linear Prediction, Technical Report G.728, International Telecommunications Union, Geneva, 1992.

48. ITU-T, Pulse Code Modulation (PCM) of Voice Frequencies, Technical Report G.711, International Telecommunications Union, Geneva, 1993.

49. ITU-T, Coding of Speech at 8 kbit/s Using Conjugate-Structure Algebraic-Code-Excited Linear-Prediction (CS-ACELP), Technical Report G.729, International Telecommunications Union, Geneva, 1996.

50. ITU-T, Dual Rate Speech Coder for Multimedia Communications Transmitting at 5.3 and 6.3 kbit/s, Technical Report G.723.1, International Telecommunications Union, Geneva, 1996.

51. ITU-T, Objective Quality Measurement of Telephone-Band (300–3400 Hz) Speech Codecs, Technical Report P.861, International Telecommunications Union, Geneva, 1998.

52. ITU-T, Packet Based Multimedia Communications Systems, Technical Report H.323, International Telecommunications Union, Geneva, 1998.

53. K. Jarvinen et al., GSM enhanced full rate speech codec, Proc. ICASSP, 1997, pp. 771–774.

54. N. Jayant, J. Johnston, and R. Safranek, Signal compression based on models of human perception, Proc. IEEE 81(10): 1385–1421 (1993).

55. M. Johnson and T. Taniguchi, Low-complexity multi-mode VXC using multi-stage optimization and mode selection, Proc. ICASSP, 1991, pp. 221–224.

56. J. P. Campbell, Jr., T. E. Tremain, and V. C. Welch, The DOD 4.8 kbps standard (proposed Federal Standard 1016), in B. S. Atal, V. C. Cuperman, and A. Gersho, eds., Advances in Speech Coding, Kluwer, Dordrecht, The Netherlands, 1991, pp. 121–133.

57. J. P. Campbell, Jr., V. C. Welch, and T. E. Tremain, An expandable error-protected 4800 bps CELP coder (U.S. Federal Standard 4800 bps voice coder), Proc. ICASSP, 1989, pp. 735–738.
58. P. Kabal and R. Ramachandran, The computation of line spectral frequencies using Chebyshev polynomials, IEEE Trans. Acoust. Speech Signal Process. ASSP-34: 1419–1426 (1986).

59. W. Kleijn, Speech coding below 4 kb/s using waveform interpolation, Proc. GLOBECOM, 1991, Vol. 3, pp. 1879–1883.

60. W. Kleijn and W. Granzow, Methods for waveform interpolation in speech coding, Digital Signal Process. 1(4): 215–230 (1991).

61. W. Kleijn and J. Haagen, A speech coder based on decomposition of characteristic waveforms, Proc. ICASSP, 1995, pp. 508–511.

62. W. Kleijn, Y. Shoham, D. Sen, and R. Hagen, A low-complexity waveform interpolation coder, Proc. ICASSP, 1996, pp. 212–215.

63. W. B. Kleijn, D. J. Krasinski, and R. H. Ketchum, Improved speech quality and efficient vector quantization in SELP, Proc. ICASSP, 1988, pp. 155–158.

64. M. Kohler, A comparison of the new 2400 bps MELP federal standard with other standard coders, Proc. ICASSP, 1997, pp. 1587–1590.

65. P. Kroon, E. F. Deprettere, and R. J. Sluyter, Regular-pulse excitation: A novel approach to effective and efficient multipulse coding of speech, IEEE Trans. ASSP 34: 1054–1063 (1986).

66. W. LeBlanc, B. Bhattacharya, S. Mahmoud, and V. Cuperman, Efficient search and design procedures for robust multi-stage VQ of LPC parameters for 4 kb/s speech coding, IEEE Trans. Speech Audio Process. 1: 373–385 (1993).

67. D. Lin, New approaches to stochastic coding of speech sources at very low bit rates, in I. T. Young et al., eds., Signal Processing III: Theories and Applications, Elsevier, Amsterdam, 1986, pp. 445–447.

68. A. McCree and J. C. De Martin, A 1.7 kb/s MELP coder with improved analysis and quantization, Proc. ICASSP, 1998, Vol. 2, pp. 593–596.

69. A. McCree et al., A 2.4 kbps MELP coder candidate for the new U.S. Federal standard, Proc. ICASSP, 1996, Vol. 1, pp. 200–203.

70. A. V. McCree and T. P. Barnwell, III, A mixed excitation LPC vocoder model for low bit rate speech coding, IEEE Trans. Speech Audio Process. 3(4): 242–250 (1995).

71. B. C. J. Moore, An Introduction to the Psychology of Hearing, Academic Press, San Diego, 1997.

72. P. Noll, MPEG digital audio coding, IEEE Signal Process. Mag. 14(5): 59–81 (1997).

73. B. Novorita, Incorporation of temporal masking effects into Bark spectral distortion measure, Proc. ICASSP, Phoenix, AZ, 1999, pp. 665–668.

74. E. Paksoy, W.-Y. Chan, and A. Gersho, Vector quantization of speech LSF parameters with generalized product codes, Proc. ICASSP, 1992, pp. 33–36.

75. E. Paksoy et al., An adaptive multi-rate speech coder for digital cellular telephony, Proc. ICASSP, 1999, Vol. 1, pp. 193–196.

76. K. K. Paliwal and B. S. Atal, Efficient vector quantization of LPC parameters at 24 bits/frame, IEEE Trans. Speech Audio Process. 1: 3–14 (1993).

77. L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ, 1993.

78. L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, NJ, 1978.

79. R. P. Ramachandran and P. Kabal, Stability and performance analysis of pitch filters in speech coders, IEEE Trans. ASSP 35(7): 937–946 (1987).

80. V. Ramamoorthy and N. S. Jayant, Enhancement of ADPCM speech by adaptive post-filtering, AT&T Bell Labs. Tech. J. 63(8): 1465–1475 (1984).

81. A. Rix, J. Beerends, M. Hollier, and A. Hekstra, PESQ — the new ITU standard for end-to-end speech quality assessment, AES 109th Convention, Los Angeles, CA, Sept. 2000.

82. R. C. Rose and T. P. Barnwell, III, The self-excited vocoder — an alternate approach to toll quality at 4800 bps, Proc. ICASSP, 1986, pp. 453–456.

83. N. Rydbeck and C. E. Sundberg, Analysis of digital errors in non-linear PCM systems, IEEE Trans. Commun. COM-24: 59–65 (1976).

84. R. Salami et al., Design and description of CS-ACELP: A toll quality 8 kb/s speech coder, IEEE Trans. Speech Audio Process. 6(2): 116–130 (1998).

85. M. R. Schroeder and B. S. Atal, Code-excited linear prediction (CELP): High-quality speech at very low bit rates, Proc. ICASSP, 1985, pp. 937–940.

86. Y. Shoham, Very low complexity interpolative speech coding at 1.2 to 2.4 kbps, Proc. ICASSP, 1997, pp. 1599–1602.

87. S. Singhal and B. S. Atal, Improving performance of multipulse LPC coders at low bit rates, Proc. ICASSP, 1984, pp. 1.3.1–1.3.4.

88. D. Sinha and C.-E. Sundberg, Unequal error protection methods for perceptual audio coders, Proc. ICASSP, 1999, Vol. 5, pp. 2423–2426.

89. F. Soong and B.-H. Juang, Line spectral pair (LSP) and speech data compression, Proc. ICASSP, 1984, pp. 1.10.1–1.10.4.

90. J. Stachurski, A. McCree, and V. Viswanathan, High quality MELP coding at bit rates around 4 kb/s, Proc. ICASSP, 1999, Vol. 1, pp. 485–488.

91. N. Sugamura and F. Itakura, Speech data compression by LSP speech analysis-synthesis technique, Trans. IECE J64-A(8): 599–606 (1981) (in Japanese).

92. L. Supplee, R. Cohn, and J. Collura, MELP: The new federal standard at 2400 bps, Proc. ICASSP, 1997, pp. 1591–1594.

93. B. Tang, A. Shen, A. Alwan, and G. Pottie, A perceptually-based embedded subband speech coder, IEEE Trans. Speech Audio Process. 5(2): 131–140 (March 1997).

94. T. Taniguchi, ADPCM with a multiquantizer for speech coding, IEEE J. Select. Areas Commun. 6(2): 410–424 (1988).

95. T. Taniguchi, F. Amano, and S. Unagami, Combined source and channel coding based on multimode coding, Proc. ICASSP, 1990, pp. 477–480.

96. J. Tardelli and E. Kreamer, Vocoder intelligibility and quality test methods, Proc. ICASSP, 1996, pp. 1145–1148.

97. I. M. Trancoso and B. S. Atal, Efficient procedures for finding the optimum innovation in stochastic coders, Proc. ICASSP, 1986, pp. 2379–2382.