0% found this document useful (0 votes)

77 views

Speech Coding: Fundamentals and Applications: ARK Asegawa Ohnson

Uploaded by

Kevin Crisdi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

77 views

Speech Coding: Fundamentals and Applications: ARK Asegawa Ohnson

Uploaded by

Kevin Crisdi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 20

EOT156

SPEECH CODING: FUNDAMENTALS AND The input–output relationship is specified using a ref-
APPLICATIONS erence implementation, but novel implementations are
allowed, provided that input–output equivalence is main-
MARK HASEGAWA-JOHNSON tained. Speech coders differ primarily in bit rate (mea-
University of Illinois at sured in bits per sample or bits per second), complexity
Urbana–Champaign (measured in operations per second), delay (measured in
Urbana, Illinois milliseconds between recording and playback), and percep-
ABEER ALWAN tual quality of the synthesized speech. Narrowband (NB)
University of California at Los coding refers to coding of speech signals whose bandwidth
Angeles is less than 4 kHz (8 kHz sampling rate), while wideband
Los Angeles, California (WB) coding refers to coding of 7-kHz-bandwidth signals
(14–16 kHz sampling rate). NB coding is more common
than WB coding mainly because of the narrowband nature
1. INTRODUCTION of the wireline telephone channel (300–3600 Hz). More
recently, however, there has been an increased effort in
Speech coding is the process of obtaining a compact wideband speech coding because of several applications
representation of voice signals for efficient transmission such as videoconferencing.
over band-limited wired and wireless channels and/or There are different types of speech coders. Table 1
storage. Today, speech coders have become essential summarizes the bit rates, algorithmic complexity, and
components in telecommunications and in the multimedia standardized applications of the four general classes of
infrastructure. Commercial systems that rely on efficient coders described in this article; Table 2 lists a selection
speech coding include cellular communication, voice over of specific speech coding standards. Waveform coders
internet protocol (VOIP), videoconferencing, electronic attempt to code the exact shape of the speech signal
toys, archiving, and digital simultaneous voice and data waveform, without considering the nature of human
(DSVD), as well as numerous PC-based games and speech production and speech perception. These coders
multimedia applications. are high-bit-rate coders (typically above 16 kbps). Linear
Speech coding is the art of creating a minimally prediction coders (LPCs), on the other hand, assume that
redundant representation of the speech signal that can the speech signal is the output of a linear time-invariant
be efficiently transmitted or stored in digital media, and (LTI) model of speech production. The transfer function
decoding the signal with the best possible perceptual of that model is assumed to be all-pole (autoregressive
quality. Like any other continuous-time signal, speech may model). The excitation function is a quasiperiodic signal
be represented digitally through the processes of sampling constructed from discrete pulses (1–8 per pitch period),
and quantization; speech is typically quantized using pseudorandom noise, or some combination of the two. If
either 16-bit uniform or 8-bit companded quantization. the excitation is generated only at the receiver, based on a
Like many other signals, however, a sampled speech transmitted pitch period and voicing information, then the
signal contains a great deal of information that is system is designated as an LPC vocoder. LPC vocoders that
either redundant (nonzero mutual information between provide extra information about the spectral shape of the
successive samples in the signal) or perceptually irrelevant excitation have been adopted as coder standards between
(information that is not perceived by human listeners). 2.0 and 4.8 kbps. LPC-based analysis-by-synthesis coders
Most telecommunications coders are lossy, meaning that (LPC-AS), on the other hand, choose an excitation function
the synthesized speech is perceptually similar to the by explicitly testing a large set of candidate excitations
original but may be physically dissimilar. and choosing the best. LPC-AS coders are used in most
A speech coder converts a digitized speech signal into standards between 4.8 and 16 kbps. Subband coders are
a coded representation, which is usually transmitted in frequency-domain coders that attempt to parameterize the
frames. A speech decoder receives coded frames and syn- speech signal in terms of spectral properties in different
thesizes reconstructed speech. Standards typically dictate frequency bands. These coders are less widely used than
the input–output relationships of both coder and decoder. LPC-based coders but have the advantage of being scalable

Table 1. Characteristics of Standardized Speech Coding Algorithms in Each of

Four Broad Categories
Speech Coder Class Rates (kbps) Complexity Standardized Applications Section

Waveform coders 16–64 Low Landline telephone 2

Subband coders 12–256 Medium Teleconferencing, audio 3
LPC-AS 4.8–16 High Digital cellular 4
LPC vocoder 2.0–4.8 High Satellite telephony, military 5

Wiley Encyclopedia of Telecommunications, Edited by John G. Proakis

ISBN 0-471-36972-1 Copyright  2003 John Wiley & Sons, Inc.
1
EOT156
2 SPEECH CODING: FUNDAMENTALS AND APPLICATIONS

Table 2. A Representative Sample of Speech Coding Standards

Rate BW Standards Standard
Application (kbps) (kHz) Organization Number Algorithm Year

Landline telephone 64 3.4 ITU G.711 µ-law or A-law PCM 1988

16–40 3.4 ITU G.726 ADPCM 1990
16–40 3.4 ITU G.727 ADPCM 1990
Tele conferencing 48–64 7 ITU G.722 Split-band ADPCM 1988
16 3.4 ITU G.728 Low-delay CELP 1992
Digital 13 3.4 ETSI Full-rate RPE-LTP 1992
cellular 12.2 3.4 ETSI EFR ACELP 1997
7.9 3.4 TIA IS-54 VSELP 1990
6.5 3.4 ETSI Half-rate VSELP 1995
8.0 3.4 ITU G.729 ACELP 1996
4.75–12.2 3.4 ETSI AMR ACELP 1998
1–8 3.4 CDMA-TIA IS-96 QCELP 1993
Multimedia 5.3–6.3 3.4 ITU G.723.1 MPLPC, CELP 1996
2.0–18.2 3.4–7.5 ISO MPEG-4 HVXC, CELP 1998
Satellite telephony 4.15 3.4 INMARSAT M IMBE 1991
3.6 3.4 INMARSAT Mini-M AMBE 1995
Secure communications 2.4 3.4 DDVPC FS1015 LPC-10e 1984
2.4 3.4 DDVPC MELP MELP 1996
4.8 3.4 DDVPC FS1016 CELP 1989
16–32 3.4 DDVPC CVSD CVSD

and do not model the incoming signal as speech. Subband 0, . . . , m, . . . , K, regardless of the values of previous
coders are widely used for high-quality audio coding. samples. The reconstructed signal ŝ(n) is given by
This article is organized as follows. Sections 2, 3,
4 and 5 present the basic principles behind waveform ŝ(n) = ŝm for s(n) s.t. (s(n) − ŝm )2 = min (s(n) − ŝk )2
k=0,...,K
coders, subband coders, LPC-based analysis-by-synthesis
(1)
coders, and LPC-based vocoders, respectively. Section 6
Many speech and audio applications use an odd number of
describes the different quality metrics that are used
reconstruction levels, so that background noise signals
to evaluate speech coders, while Section 7 discusses a
with a very low level can be quantized exactly to
variety of issues that arise when a coder is implemented
ŝK/2 = 0. One important exception is the A-law companded
in a communications network, including voice over IP,
PCM standard [48], which uses an even number of
multirate coding, and channel coding. Section 8 presents reconstruction levels.
an overview of standardization activities involving speech
coding, and we conclude in Section 9 with some final
remarks. 2.1.1. Uniform PCM. Uniform PCM is the name given
to quantization algorithms in which the reconstruction
levels are uniformly distributed between Smax and Smin .
The advantage of uniform PCM is that the quantization
2. WAVEFORM CODING
error power is independent of signal power; high-power
signals are quantized with the same resolution as
Waveform coders attempt to code the exact shape of the low-power signals. Invariant error power is considered
speech signal waveform, without considering in detail the desirable in many digital audio applications, so 16-bit
nature of human speech production and speech perception. uniform PCM is a standard coding scheme in digital audio.
Waveform coders are most useful in applications that The error power and SNR of a uniform PCM coder
require the successful coding of both speech and nonspeech vary with bit rate in a simple fashion. Suppose that a
signals. In the public switched telephone network (PSTN), signal is quantized using B bits per sample. If zero is a
for example, successful transmission of modem and reconstruction level, then the quantization step size is
fax signaling tones, and switching signals is nearly as
important as the successful transmission of speech. The Smax − Smin
most commonly used waveform coding algorithms are = (2)
2B − 1
uniform 16-bit PCM, companded 8-bit PCM [48], and
ADPCM [46]. Assuming that quantization errors are uniformly dis-
tributed between /2 and −/2, the quantization error
power is
2.1. Pulse Code Modulation (PCM)

Pulse code modulation (PCM) is the name given to 2

10 log10 E[e2 (n)] = 10 log10
memoryless coding algorithms that quantize each sample 12
of s(n) using the same reconstruction levels ŝk , k = ≈ constant + 20 log10 (Smax − Smin ) − 6B (3)
EOT156
SPEECH CODING: FUNDAMENTALS AND APPLICATIONS 3

2.1.2. Companded PCM. Companded PCM is the name is approximately 0.9. In differential PCM, each sample
given to coders in which the reconstruction levels ŝk are s(n) is compared to a prediction sp (n), and the difference
not uniformly distributed. Such coders may be modeled is called the prediction residual d(n) (Fig. 2). d(n) has
using a compressive nonlinearity, followed by uniform a smaller dynamic range than s(n), so for a given error
PCM, followed by an expansive nonlinearity: power, fewer bits are required to quantize d(n).
Accurate quantization of d(n) is useless unless it
s(n) → compress → t(n) → uniform PCM leads to accurate quantization of s(n). In order to avoid
amplifying the error, DPCM coders use a technique copied
→ t̂(n) → expand → ŝ(n) (4) by many later speech coders; the encoder includes an
embedded decoder, so that the reconstructed signal ŝ(n) is
It can be shown that, if small values of s(n) are more known at the encoder. By using ŝ(n) to create sp (n), DPCM
likely than large values, expected error power is minimized coders avoid amplifying the quantization error:
by a companding function that results in a higher density
of reconstruction levels x̂k at low signal levels than at
high signal levels [78]. A typical example is the µ-law d(n) = s(n) − sp (n) (6)
companding function [48] (Fig. 1), which is given by ŝ(n) = d̂(n) + sp (n) (7)
log(1 + µ|s(n)/Smax |) e(n) = s(n) − ŝ(n) = d(n) − d̂(n) (8)
t(n) = Smax sign(s(n)) (5)
log(1 + µ)

where µ is typically between 0 and 256 and determines Two existing standards are based on DPCM. In the
the amount of nonlinear compression applied. first type of coder, continuously varying slope delta
modulation (CVSD), the input speech signal is upsampled
2.2. Differential PCM (DPCM) to either 16 or 32 kHz. Values of the upsampled signal are
predicted using a one-tap predictor, and the difference
Successive speech samples are highly correlated. The long- signal is quantized at one bit per sample, with an
term average spectrum of voiced speech is reasonably adaptively varying . CVSD performs badly in quiet
well approximated by the function S(f ) = 1/f above about environments, but in extremely noisy environments (e.g.,
500 Hz; the first-order intersample correlation coefficient helicopter cockpit), CVSD performs better than any
LPC-based algorithm, and for this reason it remains
the U.S. Department of Defense recommendation for
1
extremely noisy environments [64,96].
DPCM systems with adaptive prediction and quan-
tization are referred to as adaptive differential PCM
0.8 m= 256
systems (ADPCM). A commonly used ADPCM standard
is G.726, which can operate at 16, 24, 32, or 40 kbps
Output signal t (n )

0.6 (2–5 bits/sample) [45]. G.726 ADPCM is frequently used

m= 0 at 32 kbps in landline telephony. The predictor in G.726
consists of an adaptive second-order IIR predictor in series
0.4 with an adaptive sixth-order FIR predictor. Filter coef-
ficients are adapted using a computationally simplified
gradient descent algorithm. The prediction residual is
0.2 quantized using a semilogarithmic companded PCM quan-
tizer at a rate of 2–5 bits per sample. The quantization
step size adapts to the amplitude of previous samples of
0 the quantized prediction error signal; the speed of adapta-
0 0.2 0.4 0.6 0.8 1
tion is controlled by an estimate of the type of signal, with
Input signal s (n )
adaptation to speech signals being faster than adaptation
Figure 1. µ-law companding function, µ = 0, 1, 2, 4, 8, . . . , 256. to signaling tones.

Quantization step
∧
∧ d (n ) ∧
s (n ) + d (n ) d (n ) + s (n )
+ Quantizer Encoder Decoder +
− +

∧ Channel
sp (n ) Predictor s (n ) + Predictor
+
P (z ) + sp (n ) P (z )

Figure 2. Schematic of a DPCM coder.

EOT156
4 SPEECH CODING: FUNDAMENTALS AND APPLICATIONS

3. SUBBAND CODING are more easily scalable in bit rate than standard CELP
techniques, an issue which will become more critical for
In subband coding, an analysis filterbank is first used to high-quality speech and audio transmission over wireless
filter the signal into a number of frequency bands and communication channels and the Internet, allowing the
then bits are allocated to each band by a certain criterion. system to seamlessly adapt to changes in both the
Because of the difficulty in obtaining high-quality speech transmission environment and network congestion.
at low bit rates using subband coding schemes, these
techniques have been used mostly for wideband medium
4. LPC-BASED ANALYSIS BY SYNTHESIS
to high bit rate speech coders and for audio coding.
For example, G.722 is a standard in which ADPCM
An analysis-by-synthesis speech coder consists of the
speech coding occurs within two subbands, and bit
following components:
allocation is set to achieve 7-kHz audio coding at rates
of 64 kbps or less.
In Refs. 12,13, and 30 subband coding is proposed as • A model of speech production that depends on certain
a flexible scheme for robust speech coding. A speech pro- parameters θ :
duction model is not used, ensuring robustness to speech ŝ(n) = f (θ ) (9)
in the presence of background noise, and to nonspeech
sources. High-quality compression can be achieved by • A list of K possible parameter sets for the model
incorporating masking properties of the human auditory
system [54,93]. In particular, Tang et al. [93] present a θ1 , . . . , θk , . . . , θK (10)
scheme for robust, high-quality, scalable, and embedded
speech coding. Figure 3 illustrates the basic structure • An error metric |Ek |2 that compares the original
of the coder. Dynamic bit allocation and prioritization speech signal s(n) and the coded speech signal ŝ(n).
and embedded quantization are used to optimize the per- In LPC-AS coders, |Ek |2 is typically a perceptually
ceptual quality of the embedded bitstream, resulting in weighted mean-squared error measure.
little performance degradation relative to a nonembedded
implementation. A subband spectral analysis technique A general analysis-by-synthesis coder finds the opti-
was developed that substantially reduces the complexity mum set of parameters by synthesizing all of the K
of computing the perceptual model. different speech waveforms ŝk (n) corresponding to the
The encoded bitstream is embedded, allowing the K possible parameter sets θk , computing |Ek |2 for each
coder output to be scalable from high quality at higher synthesized waveform, and then transmitting the index of
bit rates, to lower quality at lower rates, supporting the parameter set which minimizes |Ek |2 . Choosing a set
a wide range of service and resource utilization. The of transmitted parameters by explicitly computing ŝk (n) is
lower bit rate representation is obtained simply through called ‘‘closed loop’’ optimization, and may be contrasted
truncation of the higher bit rate representation. Since with ‘‘open-loop’’ optimization, in which coder parameters
source rate adaptation is performed through truncation are chosen on the basis of an analytical formula without
of the encoded stream, interaction with the source coder explicit computation of ŝk (n). Closed-loop optimization of
is not required, making the coder ideally suited for rate all parameters is prohibitively expensive, so LPC-based
adaptive communication systems. analysis-by-synthesis coders typically adopt the following
Even though subband coding is not widely used for compromise. The gross spectral shape is modeled using an
speech coding today, it is expected that new standards all-pole filter 1/A(z) whose parameters are estimated in
for wideband coding and rate-adaptive schemes will be open-loop fashion, while spectral fine structure is modeled
based on subband coding or a hybrid technique that using an excitation function U(z) whose parameters are
includes subband coding. This is because subband coders optimized in closed-loop fashion (Fig. 4).

4.1. The Basic LPC Model

PCM
In LPC-based coders, the speech signal S(z) is viewed as
input Scale
Analysis Scale the output of a linear time-invariant (LTI) system whose
& M
filterbank factors input is the excitation signal U(z), and whose transfer
quantize U
X function is represented by the following:
Masking Dynamic bit
FFT thresholds allocation U(z) U(z)
S(z) = = (11)
Signal-to-mask Source Noisy A(z) p

ratios bitrate channel 1− ai z−i

PCM
i=1
output Scale D
Synthesis &
filterbank E Most of the zeros of A(z) correspond to resonant
dequantize M frequencies of the vocal tract or formant frequencies.
U
Dynamic bit X Formant frequencies depend on the geometry of the vocal
decoder tract; this is why men and women, who have different
vocal-tract shapes and lengths, have different formant
Figure 3. Structure of a perceptual subband speech coder [93]. frequencies for the same sounds.
EOT156
SPEECH CODING: FUNDAMENTALS AND APPLICATIONS 5

(a) Codebook(s) (b) Codebook(s) Spectrum of a pitch− prediction filter, b = [0.25 0.5 0.75 1.0]
of LPC excitation of LPC excitation
vectors vectors 1.2

s (n ) + 1
Get specified

Filter response magnitude

W (z ) Minimize Error codevector
−
Perceptual u (n ) 0.8
weighting u (n )
1 LPC
A(z ) synthesis 1 LPC 0.6
A(z ) synthesis
∧
s (n )
0.4
∧
Perceptual s (n )
W (z )
weighting 0.2

∧
sw (n )
0
0 1 2 3 4 5 6
Figure 4. General structure of an LPC-AS coder (a) and
decoder (b). LPC filter A(z) and perceptual weighting filter W(z) Digital frequency (radians/sample)
are chosen open-loop, then the excitation vector u(n) is chosen in Figure 5. Normalized magnitude spectrum of the pitch predic-
a closed-loop fashion in order to minimize the error metric |E|2 . tion filter for several values of the prediction coefficient.

The number of LPC coefficients (p) depends on the spectrum that is heard as voiced, without the need for
signal bandwidth. Since each pair of complex-conjugate a binary voiced/unvoiced decision.
poles represents one formant frequency and since there is, In LPC-AS coders, the noise signal c(n) is chosen from
on average, one formant frequency per 1 kHz, p is typically a ‘‘stochastic codebook’’ of candidate noise signals. The
equal to 2BW (in kHz) + (2 to 4). Thus, for a 4 kHz speech stochastic codebook index, the pitch period, and the gains
signal, a 10th–12th-order LPC model would be used. b and g are chosen in a closed-loop fashion in order to
This system is excited by a signal u(n) that is minimize a perceptually weighted error metric. The search
uncorrelated with itself over lags of less than p + 1. If for an optimum T0 typically uses the same algorithm as
the underlying speech sound is unvoiced (the vocal folds the search for an optimum c(n). For this reason, the list of
do not vibrate), then u(n) is uncorrelated with itself even excitation samples delayed by different candidate values
at larger time lags, and may be modeled using a pseudo- of T0 is typically called an ‘‘adaptive codebook’’ [87].
random-noise signal. If the underlying speech is voiced
(the vocal folds vibrate), then u(n) is quasiperiodic with a 4.3. Perceptual Error Weighting
fundamental period called the ‘‘pitch period.’’ Not all types of distortion are equally audible. Many
types of speech coders, including LPC-AS coders, use
4.2. Pitch Prediction Filtering simple models of human perception in order to minimize
the audibility of different types of distortion. In LPC-AS
In an LPC-AS coder, the LPC excitation is allowed to vary
coding, two types of perceptual weighting are commonly
smoothly between fully voiced conditions (as in a vowel)
used. The first type, perceptual weighting of the residual
and fully unvoiced conditions (as in / s /). Intermediate
quantization error, is used during the LPC excitation
levels of voicing are often useful to model partially voiced
search in order to choose the excitation vector with the
phonemes such as / z /.
least audible quantization error. The second type, adaptive
The partially voiced excitation in an LPC-AS coder is
postfiltering, is used to reduce the perceptual importance
constructed by passing an uncorrelated noise signal c(n)
of any remaining quantization error.
through a pitch prediction filter [2,79]. A typical pitch
prediction filter is 4.3.1. Perceptual Weighting of the Residual Quantization
Error. The excitation in an LPC-AS coder is chosen to
u(n) = gc(n) + bu(n − T0 ) (12) minimize a perceptually weighted error metric. Usually,
the error metric is a function of the time domain waveform
where T0 is the pitch period. If c(n) is unit variance white error signal
noise, then according to Eq. (12) the spectrum of u(n) is e(n) = s(n) − ŝ(n) (14)

g2 Early LPC-AS coders minimized the mean-squared error

|U(ejω )|2 = (13)
1 + b2 − 2b cos ωT0 π
1
e2 (n) = |E(ejω )|2 dω (15)
Figure 5 shows the normalized magnitude spectrum n
2π −π
(1 − b)|U(ejω )| for several values of b between 0.25 and
1. As shown, the spectrum varies smoothly from a uniform It turns out that the MSE is minimized if the error
spectrum, which is heard as unvoiced, to a harmonic spectrum, E(ejω ), is white — that is, if the error signal
EOT156
6 SPEECH CODING: FUNDAMENTALS AND APPLICATIONS

e(n) is an uncorrelated random noise signal, as shown in 80

Fig. 6.
Not all noises are equally audible. In particular, noise
components near peaks of the speech spectrum are hidden 75 Speech spectrum
by a ‘‘masking spectrum’’ M(ejω ), so that a shaped noise

Amplitude (dB)
spectrum at lower SNR may be less audible than a white- 70
noise spectrum at higher SNR (Fig. 7). The audibility of
noise may be estimated using a noise-to-masker ratio
|Ew |2 : 65
π
1 |E(ejω )|2
|Ew |2 = dω (16)
2π −π |M(ejω )|2
60 Shaped noise, 4.4 dB SNR
jω
The masking spectrum M(e ) has peaks and valleys at
0 1000 2000 3000 4000
the same frequencies as the speech spectrum, but the
Frequency (Hz)
difference in amplitude between peaks and valleys is
somewhat smaller than that of the speech spectrum. A Figure 7. Shaped quantization noise may be less audible than
variety of algorithms exist for estimating the masking white quantization noise, even at slightly lower SNR.
spectrum, ranging from extremely simple to extremely
complex [51]. One of the simplest model masking spectra
Given sw (n) and ŝw (n), the noise-to-masker ratio may be
that has the properties just described is as follows [2]:
computed as follows:
|A(z/γ2 )| π
M(z) = , 0 < γ2 < γ1 ≤ 1 (17) 1
|A(z/γ1 )| |Ew |2 = |Sw (ejω ) − Ŝw (ejω )|2 dω = (sw (n) − ŝw (n))2
2π −π n
(20)
where 1/A(z) is an LPC model of the speech spectrum.
The poles and zeros of M(z) are at the same frequencies
as the poles of 1/A(z), but have broader bandwidths. Since 4.3.2. Adaptive Postfiltering. Despite the use of per-
the zeros of M(z) have broader bandwidth than its poles, ceptually weighted error minimization, the synthesized
M(z) has peaks where 1/A(z) has peaks, but the difference speech coming from an LPC-AS coder may contain audible
between peak and valley amplitudes is somewhat reduced. quantization noise. In order to minimize the perceptual
The noise-to-masker ratio may be efficiently computed effects of this noise, the last step in the decoding process is
by filtering the speech signal using a perceptual weighting often a set of adaptive postfilters [11,80]. Adaptive postfil-
filter W(z) = 1/M(z). The perceptually weighted input tering improves the perceptual quality of noisy speech by
speech signal is giving a small extra emphasis to features of the spectrum
Sw (z) = W(z)S(z) (18) that are important for human-to-human communication,
including the pitch periodicity (if any) and the peaks in
Likewise, for any particular candidate excitation signal, the spectral envelope.
the perceptually weighted output speech signal is A pitch postfilter (or long-term predictive postfilter)
enhances the periodicity of voiced speech by applying
either an FIR or IIR comb filter to the output. The time
Ŝw (z) = W(z)Ŝ(z) (19)
delay and gain of the comb filter may be set equal to the
transmitted pitch lag and gain, or they may be recalculated
at the decoder using the reconstructed signal ŝ(n). The
80 pitch postfilter is applied only if the proposed comb filter
gain is above a threshold; if the comb filter gain is below
threshold, the speech is considered unvoiced, and no pitch
postfilter is used. For improved perceptual quality, the
Spectral amplitude (dB)

75 Speech spectrum
LPC excitation signal may be interpolated to a higher
sampling rate in order to allow the use of fractional pitch
periods; for example, the postfilter in the ITU G.729 coder
70
White noise at 5 dB SNR uses pitch periods quantized to 18 sample.
A short-term predictive postfilter enhances peaks in the
spectral envelope. The form of the short-term postfilter is
65 similar to that of the masking function M(z) introduced
in the previous section; the filter has peaks at the same
frequencies as 1/A(z), but the peak-to-valley ratio is less
60 than that of A(z).
0 1000 2000 3000 4000 Postfiltering may change the gain and the average
Frequency (Hz) spectral tilt of ŝ(n). In order to correct these problems,
Figure 6. The minimum-energy quantization noise is usually systems that employ postfiltering may pass the final signal
characterized as white noise. through a one-tap FIR preemphasis filter, and then modify
EOT156
SPEECH CODING: FUNDAMENTALS AND APPLICATIONS 7

Speech signal s (n ) the excitation. In general, U may be represented as

the weighted sum of several ‘‘shape vectors’’ Xm , m =
1, . . . , M, which may be drawn from several codebooks,
n
including possibly multiple adaptive codebooks and
multiple stochastic codebooks:
 
X1
Speech frame for codec  
LPC analysis window U = GX, G = [g1 , g2 , . . .], X =  X2  (22)
(typically 30-60ms) synchronization (typically 20ms) ..
.

The choice of shape vectors and the values of the gains

Sub-frames for excitation search gm are jointly optimized in a closed-loop search, in
(typically 3-4/ frame) order to minimize the perceptually weighted error metric
|Sw − Ŝw |2 .
The value of Sw may be computed prior to any codebook
search by perceptually weighting the input speech vector.
LPC coefficients Excitation indices and gains
The value of Ŝw must be computed separately for each
Figure 8. The frame/subframe structure of most LPC analysis candidate excitation, by synthesizing the speech signal
by synthesis coders. ŝ(n), and then perceptually weighting to obtain ŝw (n).
These operations may be efficiently computed, as described
below.
its gain, prior to sending the reconstructed signal to a D/A
converter.
4.4.1. Zero State Response and Zero Input Response. Let
4.4. Frame-Based Analysis the filter H(z) be defined as the composition of the LPC
synthesis filter and the perceptual weighting filter, thus
The characteristics of the LPC excitation signal u(n)
H(z) = W(z)/A(z). The computational complexity of the
change quite rapidly. The energy of the signal may change
excitation parameter search may be greatly simplified
from zero to nearly full amplitude within one millisecond
if Ŝw is decomposed into the zero input response (ZIR)
at the release of a plosive sound, and a mistake of more
and zero state response (ZSR) of H(z) [97]. Note that the
than about 5 ms in the placement of such a sound is
weighted reconstructed speech signal is
clearly audible. The LPC coefficients, on the other hand,
change relatively slowly. In order to take advantage of the
slow rate of change of LPC coefficients without sacrificing
∞
Ŝw = [ŝw (0), . . . , ŝw (L − 1)], ŝw (n) = h(i)u(n − i)
the quality of the coded residual, most LPC-AS coders i=0
encode speech using a frame–subframe structure, as (23)
depicted in Fig. 8. A frame of speech is approximately
20 ms in length, and is composed of typically three to four
where h(n) is the infinite-length impulse response of H(z).
subframes. The LPC excitation is transmitted only once
Suppose that ŝw (n) has already been computed for n < 0,
per subframe, while the LPC coefficients are transmitted
and the coder is now in the process of choosing the optimal
only once per frame. The LPC coefficients are computed
u(n) for the subframe 0 ≤ n ≤ L − 1. The sum above can be
by analyzing a window of speech that is usually longer
divided into two parts: a part that depends on the current
than the speech frame (typically 30–60 ms). In order
subframe input, and a part that does not:
to minimize the number of future samples required to
compute LPC coefficients, many LPC-AS coders use an
asymmetric window that may include several hundred Ŝw = ŜZIR + UH (24)
milliseconds of past context, but that emphasizes the
samples of the current frame [21,84]. where ŜZIR contains samples of the zero input response
The perceptually weighted original signal sw (n) and of H(z), and the vector UH contains the zero state
weighted reconstructed signal ŝw (n) in a given subframe response. The zero input response is usually computed by
are often written as L-dimensional row vectors S and Ŝ, implementing the recursive filter H(z) = W(z)/A(z) as the
where the dimension L is the length of a subframe: sequence of two IIR filters, and allowing the two filters to
run for L samples with zero input. The zero state response
Sw = [sw (0), . . . , sw (L − 1)], Ŝw = [ŝw (0), . . . , ŝw (L − 1)] is usually computed as the matrix product UH, where
(21)
 
h(0) h(1) . . . h(L − 1)
The core of an LPC-AS coder is the closed-loop  0 h(0) . . . h(L − 2) 
 
search for an optimum coded excitation vector U, where H =  .. .. .. .. ,
 . . . . 
U is typically composed of an ‘‘adaptive codebook’’
0 0 ... h(0)
component representing the periodicity, and a ‘‘stochastic
codebook’’ component representing the noiselike part of U = [u(0), . . . , u(L − 1)] (25)
EOT156
8 SPEECH CODING: FUNDAMENTALS AND APPLICATIONS

Given a candidate excitation vector U, the perceptually 4.5. Types of LPC-AS Coder
weighted error vector E may be defined as
4.5.1. Multipulse LPC (MPLPC). In the multipulse LPC
algorithm [4,50], the shape vectors are impulses. U is
Ew = Sw − Ŝw = S̃ − UH (26)
typically formed as the weighted sum of 4–8 impulses per
subframe.
where the target vector S̃ is
The number of possible combinations of impulses
grows exponentially in the number of impulses, so joint
S̃ = Sw − ŜZIR (27)
optimization of the positions of all impulses is usually
impossible. Instead, most MPLPC coders optimize the
The target vector needs to be computed only once per
pulse positions one at a time, using something like
subframe, prior to the codebook search. The objective of
the following strategy. First, the weighted zero state
the codebook search, therefore, is to find an excitation
response of H(z) corresponding to each impulse location
vector U that minimizes |S̃ − UH|2 .
is computed. If Ck is an impulse located at n = k, the
corresponding weighted zero state response is
4.4.2. Optimum Gain and Optimum Excitation. Recall
that the excitation vector U is modeled as the weighted
Ck H = [0, . . . , 0, h(0), h(1), . . . , h(L − k − 1)] (34)
sum of a number of codevectors Xm , m = 1, . . . , M. The
perceptually weighted error is therefore:
The location of the first impulse is chosen in order to
optimally approximate the target vector S̃1 = S̃, using the
|E| = |S̃ − GXH| = S̃S̃ − 2GXH S̃ + GXH(GXH)
2 2
(28)
methods described in the previous section. After selecting
the first impulse location k1 , the target vector is updated
where prime denotes transpose. Minimizing |E|2 requires
according to
optimum choice of the shape vectors X and of the gains G. It
S̃m = S̃m−1 − Ckm−1 H (35)
turns out that the optimum gain for each excitation vector
can be computed in closed form. Since the optimum gain
Additional impulses are chosen until the desired number
can be computed in closed form, it need not be computed
of impulses is reached. The gains of all pulses may be
during the closed-loop search; instead, one can simply
reoptimized after the selection of each new pulse [87].
assume that each candidate excitation, if selected, would
Variations are possible. The multipulse coder described
be scaled by its optimum gain. Assuming an optimum gain
in ITU standard G.723.1 transmits a single gain for all the
results in an extremely efficient criterion for choosing the
impulses, plus sign bits for each individual impulse. The
optimum excitation vector [3].
G.723.1 coder restricts all impulse locations to be either
Suppose we define the following additional bits of
odd or even; the choice of odd or even locations is coded
notation:
using one bit per subframe [50]. The regular pulse excited
RX = XH S̃ , = XH(XH) (29)
LPC algorithm, which was the first GSM full-rate speech
coder, synthesized speech using a train of impulses spaced
Then the mean-squared error is
one per 4 samples, all scaled by a single gain term [65].
The alignment of the pulse train was restricted to one
|E|2 = S̃S̃ − 2GRX + G G (30)
of four possible locations, chosen in a closed-loop fashion
together with a gain, an adaptive codebook delay, and an
For any given set of shape vectors X, G is chosen so that
adaptive codebook gain.
|E|2 is minimized, which yields
Singhal and Atal demonstrated that the quality of
MPLPC may be improved at low bit rates by modeling
G = RX −1 (31)
the periodic component of an LPC excitation vector using
a pitch prediction filter [87]. Using a pitch prediction filter,
If we substitute the minimum MSE value of G into
the LPC excitation signal becomes
Eq. (30), we get

M
|E|2 = S̃S̃ − RX −1 RX (32) u(n) = bu(n − D) + ckm (n) (36)
m=1
Hence, in order to minimize the perceptually weighted
MSE, we choose the shape vectors X in order to maximize where the signal ck (n) is an impulse located at n = k
the covariance-weighted sum of correlations: and b is the pitch prediction filter gain. Singhal and Atal
proposed choosing D before the locations of any impulses
Xopt = arg max(RX −1 RX ) (33) are known, by minimizing the following perceptually
weighted error:
When the shape matrix X contains more than one row,
the matrix inversion in Eq. (33) is often computed using |ED |2 = |S̃ − bXD H|2 , XD = [u(−D), . . . , u((L − 1) − D)]
approximate algorithms [4]. In the VSELP coder [25], (37)
X is transformed using a modified Gram–Schmidt The G.723.1 multipulse LPC coder and the GSM
orthogonalization so that has a diagonal structure, thus (Global System for Mobile Communication) full-rate RPE-
simplifying the computation of Eq. (33). LTP (regular-pulse excitation with long-term prediction)
EOT156
SPEECH CODING: FUNDAMENTALS AND APPLICATIONS 9

coder both use a closed-loop pitch predictor, as do all truncated impulse response of the filter W(z)/A(z), as dis-
standardized variations of the CELP coder (see Sections cussed in Section 4.4 [3,97]. Davidson and Lin separately
4.5.2 and 4.5.3). Typically, the pitch delay and gain are proposed center clipping the stochastic codevectors, so that
optimized first, and then the gains of any additional most of the samples in each codevector are zero [15,67].
excitation vectors (e.g., impulses in an MPLPC algorithm) Lin also proposed structuring the stochastic codebook so
are selected to minimize the remaining error. that each codevector is a slightly-shifted version of the pre-
vious codevector; such a codebook is called an overlapped
4.5.2. Code-Excited LPC (CELP). LPC analysis finds codebook [67]. Overlapped stochastic codebooks are rarely
a filter 1/A(z) whose excitation is uncorrelated for used in practice today, but overlapped-codebook search
correlation distances smaller than the order of the filter. methods are often used to reduce the computational com-
Pitch prediction, especially closed-loop pitch prediction, plexity of an adaptive codebook search. In the search of
removes much of the remaining intersample correlation. an overlapped codebook, the correlation RX and autocor-
The spectrum of the pitch prediction residual looks like the relation introduced in Section 4.4 may be recursively
spectrum of uncorrelated Gaussian noise, but replacing the computed, thus greatly reducing the complexity of the
residual with real noise (noise that is independent of the codebook search [63].
original signal) yields poor speech quality. Apparently, Most CELP coders optimize the adaptive codebook
some of the temporal details of the pitch prediction index and gain first, and then choose a stochastic
residual are perceptually important. Schroeder and Atal codevector and gain in order to minimize the remaining
proposed modeling the pitch prediction residual using a perceptually weighted error. If all the possible pitch
stochastic excitation vector ck (n) chosen from a list of periods are longer than one subframe, then the entire
stochastic excitation vectors, k = 1, . . . , K, known to both content of the adaptive codebook is known before the
the transmitter and receiver [85]: beginning of the codebook search, and the efficient
overlapped codebook search methods proposed by Lin
u(n) = bu(n − D) + gck (n) (38) may be applied [67]. In practice, the pitch period of a
female speaker is often shorter than one subframe. In
The list of stochastic excitation vectors is called a stochastic order to guarantee that the entire adaptive codebook is
codebook, and the index of the stochastic codevector is known before beginning a codebook search, two methods
chosen in order to minimize the perceptually weighted are commonly used: (1) the adaptive codebook search may
error metric |Ek |2 . Rose and Barnwell discussed the simply be constrained to only consider pitch periods longer
similarity between the search for an optimum stochastic than L samples — in this case, the adaptive codebook will
codevector index k and the search for an optimum predictor lock onto values of D that are an integer multiple of
delay D [82], and Kleijn et al. coined the term ‘‘adaptive the actual pitch period (if the same integer multiple is not
codebook’’ to refer to the list of delayed excitation signals chosen for each subframe, the reconstructed speech quality
u(n − D) which the coder considers during closed-loop is usually good); and (2) adaptive codevectors with delays
pitch delay optimization (Fig. 9). of D < L may be constructed by simply repeating the most
The CELP algorithm was originally not considered effi- recent D samples as necessary to fill the subframe.
cient enough to be used in real-time speech coding, but 4.5.3. SELP, VSELP, ACELP, and LD-CELP. Rose and
a number of computational simplifications were proposed Barnwell demonstrated that reasonable speech quality
that resulted in real-time CELP-like algorithms. Trancoso is achieved if the LPC excitation vector is computed com-
and Atal proposed efficient search methods based on the pletely recursively, using two closed-loop pitch predictors
in series, with no additional information [82]. In their
‘‘self-excited LPC’’ algorithm (SELP), the LPC excitation
is initialized during the first subframe using a vector of
u (n −Dmin)
samples known at both the transmitter and receiver. For
all frames after the first, the excitation is the sum of an
b arbitrary number of adaptive codevectors:
sw (n )
u (n −Dmax)
M
∧ u(n) = bm u(n − Dm ) (39)
Adaptive codebook u (n ) W (z ) sw (n )
+ − sum(||^2) m=1
A(z )
c1(n )
Choose codebook indices
Kleijn et al. developed efficient recursive algorithms for
to minimize MSE searching the adaptive codebook in SELP coder and other
g LPC-AS coders [63].
Just as there may be more than one adaptive codebook,
cK (n ) it is also possible to use more than one stochastic codebook.
Stochastic codebook The vector-sum excited LPC algorithm (VSELP) models
the LPC excitation vector as the sum of one adaptive and
Figure 9. The code-excited LPC algorithm (CELP) constructs an
LPC excitation signal by optimally choosing input vectors from
two stochastic codevectors [25]:
two codebooks: an ‘‘adaptive’’ codebook, which represents the
2
pitch periodicity; and a ‘‘stochastic’’ codebook, which represents u(n) = bu(n − D) + gm ckm (n) (40)
the unpredictable innovations in each speech frame. m=1
EOT156
10 SPEECH CODING: FUNDAMENTALS AND APPLICATIONS

The two stochastic codebooks are each relatively small Let A(z) be the frequency response of an LPC inverse
(typically 32 vectors), so that each of the codebooks may be filter of order p:
searched efficiently. The adaptive codevector and the two
p

stochastic codevectors are chosen sequentially. After selec- A(z) = − ai z−i

i=0
tion of the adaptive codevector, the stochastic codebooks
are transformed using a modified Gram–Schmidt orthog-
with a0 = −1. The ai values are real, and all the zeros of
onalization, so that the perceptually weighted speech
A(z) are inside the unit circle.
vectors generated during the first stochastic codebook
If we use the lattice formulation of LPC, we arrive at a
search are all orthogonal to the perceptually weighted
recursive relation between the mth stage [Am (z)] and the
adaptive codevector. Because of this orthogonalization,
one before it [Am−1 (z)] For the pth-order inverse filter, we
the stochastic codebook search results in the choice of a
have
stochastic codevector that is jointly optimal with the adap-
Ap (z) = Ap−1 (z) − kp z−p Ap−1 (z−1 )
tive codevector, rather than merely sequentially optimal.
VSELP is the basis of the Telecommunications Industry
Associations digital cellular standard IS-54. By allowing the recursion to go one more iteration, we
The algebraic CELP (ACELP) algorithm creates an obtain
LPC excitation by choosing just one vector from an Ap+1 (z) = Ap (z) − kp+1 z−(p+1) Ap (z−1 ) (41)
adaptive codebook and one vector from a fixed codebook.
In the ACELP algorithm, however, the fixed codebook If we choose kp+1 = ±1 in Eq. (41), we can define two new
is composed of binary-valued or trinary-valued algebraic polynomials as follows:
codes, rather than the usual samples of a Gaussian noise
process [1]. Because of the simplicity of the codevectors, P(z) = A(z) − z−(p+1) A(z−1 ) (42)
it is possible to search a very large fixed codebook very Q(z) = A(z) + z−(p+1) A(z−1 ) (43)
quickly using methods that are a hybrid of standard CELP
and MPLPC search algorithms. ACELP is the basis of the
Physically, P(z) and Q(z) can be interpreted as the inverse
ITU standard G.729 coder at 8 kbps. ACELP codebooks
transfer function of the vocal tract for the open-glottis and
may be somewhat larger than the codebooks in a standard
closed-glottis boundary conditions, respectively [22], and
CELP coder; the codebook in G.729, for example, contains
P(z)/Q(z) is the driving-point impedance of the vocal tract
8096 codevectors per subframe.
as seen from the glottis [36].
Most LPC-AS coders operate at very low bit rates, but
If p is odd, the formulae for pn and qn are as follows:
require relatively large buffering delays. The low-delay
CELP coder (LD-CELP) operates at 16 kbps [10,47] and
P(z) = A(z) + z−(p+1) A(z−1 )
is designed to obtain the best possible speech quality,
with the constraint that the total algorithmic delay of a
(p+1)/2

tandem coder and decoder must be no more than 2 ms. = (1 − ejpn z−1 )(1 − e−jpn z−1 ) (44)
n=1
LPC analysis and codevector search are computed once
per 2 ms (16 samples). Transmission of LPC coefficients Q(z) = A(z) − z−(p+1) A(z−1 )
once per two milliseconds would require too many bits, so
LPC coefficients are computed in a recursive backward-
(p−1)/2
= (1 − z−2 ) (1 − ejqn z−1 )(1 − e−jqn z−1 ) (45)
adaptive fashion. Before coding or decoding each frame, n=1
samples of ŝ(n) from the previous frame are windowed, and
used to update a recursive estimate of the autocorrelation The LSFs have some interesting characteristics: the
function. The resulting autocorrelation coefficients are frequencies {pn } and {qn } are related to the formant
similar to those that would be obtained using a relatively frequencies; the dynamic range of {pn } and {qn } is
long asymmetric analysis window. LPC coefficients are limited and the two alternate around the unit circle
then computed from the autocorrelation function using (0 ≤ p1 ≤ q1 ≤ p2 . . .); {pn } and {qn } are correlated so that
the Levinson–Durbin algorithm. intraframe prediction is possible; and they change slowly
from one frame to another, hence, interframe prediction is
4.6. Line Spectral Frequencies (LSFs) or Line Spectral Pairs also possible. The interleaving nature of the {pn } and {qn }
(LSPs) allow for efficient iterative solutions [58].
Almost all LPC-based coders today use the LSFs
Linear prediction can be viewed as an inverse filtering to represent the LP parameters. Considerable recent
procedure in which the speech signal is passed through an research has been devoted to methods for efficiently
all-zero filter A(z). The filter coefficients of A(z) are chosen quantizing the LSFs, especially using vector quantization
such that the energy in the output, that is, the residual or (VQ) techniques. Typical algorithms include predictive
error signal, is minimized. Alternatively, the inverse filter VQ, split VQ [76], and multistage VQ [66,74]. All of these
A(z) can be transformed into two other filters P(z) and methods are used in the ITU standard ACELP coder G.729:
Q(z). These new filters turn out to have some interesting the moving-average vector prediction residual is quantized
properties, and the representation based on them, called using a 7-bit first-stage codebook, followed by second-stage
the line spectrum pairs [89,91], has been used in speech quantization of two subvectors using independent 5-bit
coding and synthesis applications. codebooks, for a total of 17 bits per frame [49,84].
EOT156
SPEECH CODING: FUNDAMENTALS AND APPLICATIONS 11

5. LPC VOCODERS 5.2. Mixed-Excitation Linear Prediction (MELP)

The mixed-excitation linear prediction (MELP) coder [69]
5.1. The LPC-10e Vocoder
was selected in 1996 by the United States Department
The 2.4-kbps LPC-10e vocoder (Fig. 10) is one of the of Defense Voice Processing Consortium (DDVPC) to
earliest and one of the longest-lasting standards for low- be the U.S. Federal Standard at 2.4 kbps, replacing
bit-rate digital speech coding [8,16]. This standard was LPC-10e. The MELP coder is based on the LPC model
originally proposed in the 1970s, and was not officially with additional features that include mixed excitation,
replaced until the selection of the MELP 2.4-kbps coding aperiodic pulses, adaptive spectral enhancement, pulse
standard in 1996 [64]. Speech coded using LPC-10e sounds dispersion filtering, and Fourier magnitude modeling [70].
metallic and synthetic, but it is intelligible. The synthesis model for the MELP coder is illustrated
In the LPC-10e algorithm, speech is first windowed in Fig. 11. LP coefficients are converted to LSFs and a
using a Hamming window of length 22.5ms. The gain multistage vector quantizer (MSVQ) is used to quantize
(G) and coefficients (ai ) of a linear prediction filter are the LSF vectors. For voiced segments a total of 54 bits that
calculated for the entire frame using the Levinson–Durbin represent: LSF parameters (25), Fourier magnitudes of the
recursion. Once G and ai have been computed, the LPC prediction residual signal (8), gain (8), pitch (7), bandpass
residual signal d(n) is computed: voicing (4), aperiodic flag (1), and a sync bit are sent.
The Fourier magnitudes are coded with an 8-bit VQ and

the associated codebook is searched with a perceptually-
1
p
d(n) = s(n) − ai s(n − i) (46) weighted Euclidean distance. For unvoiced segments, the
G Fourier magnitudes, bandpass voicing, and the aperiodic
i=1
flag bit are not sent. Instead, 13 bits that implement
The residual signal d(n) is modeled using either a forward error correction (FEC) are sent. The performance
periodic train of impulses (if the speech frame is voiced) of MELP at 2.4 kbps is similar to or better than that of the
or an uncorrelated Gaussian random noise signal (if the federal standard at 4.8 kbps (FS 1016) [92]. Versions of
frame is unvoiced). The voiced/unvoiced decision is based MELP coders operating at 1.7 kbps [68] and 4.0 kbps [90]
on the average magnitude difference function (AMDF), have been reported.

1
N−1
5.3. Multiband Excitation (MBE)
d (m) = |d(n) − d(n − |m|)| (47)
N − |m| n=|m| In multiband excitation (MBE) coding the voiced/unvoiced
decision is not a binary one; instead, a series of
The frame is labeled as voiced if there is a trough in d (m) voicing decisions are made for independent harmonic
that is large enough to be caused by voiced excitation. intervals [31]. Since voicing decisions can be made in
Only values of m between 20 and 160 are examined, different frequency bands individually, synthesized speech
corresponding to pitch frequencies between 50 and 400 Hz. may be partially voiced and partially unvoiced. An
If the minimum value of d (m) in this range is less than improved version of the MBE was introduced in the late
a threshold, the frame is declared voiced, and otherwise it 1980s [7,35] and referred to as the IMBE coder. The IMBE
is declared unvoiced [8]. at 2.4 kbps produces better sound quality than does the
If the frame is voiced, then the LPC residual is LPC-10e. The IMBE was adopted as the Inmarsat-M
represented using an impulse train of period T0 , where coding standard for satellite voice communication at a
total rate of 6.4 kbps, including 4.15 kbps of source coding
160 and 2.25 kbps of channel coding [104]. The Advanced
T0 = arg min d (m) (48) MBE (AMBE) coder was adopted as the Inmarsat Mini-M
m=20
standard at a 4.8 kbps total data rate, including 3.6 kbps
If the frame is unvoiced, a pitch period of T0 = 0 is of speech and 1.2 kbps of channel coding [18,27]. In [14]
transmitted, indicating that an uncorrelated Gaussian an enhanced multiband excitation (EMBE) coder was
random noise signal should be used as the excitation of presented. The distinguishing features of the EMBE coder
the LPC synthesis filter. include signal-adaptive multimode spectral modeling
and parameter quantization, a two-band signal-adaptive

Vocal fold oscillation

Source
Pulse Pulse
spectral
train train
shaping
H(z ) 1/A (z )
Frication, aspiration G Transfer + Transfer G
function function
Voiced /unvoiced
White switch Source
White
noise spectral
noise
shaping
Figure 10. A simplified model of speech production whose
parameters can be transmitted efficiently across a digital channel. Figure 11. The MELP speech synthesis model.
EOT156
12 SPEECH CODING: FUNDAMENTALS AND APPLICATIONS

frequency-domain voicing decision, a novel VQ scheme for reviewers) like the way the product sounds, then the
the efficient encoding of the variable-dimension spectral speech coder is a success. The reaction of consumers
magnitude vectors at low rates, and multiclass selective can often be predicted to a certain extent by evaluating
protection of spectral parameters from channel errors. The the reactions of experimental listeners in a controlled
4-kbps EMBE coder accounts for both source (2.9 kbps) and psychophysical testing paradigm. Psychophysical tests
channel (1.1 kbps) coding and was designed for satellite- (often called ‘‘subjective tests’’) vary depending on the
based communication systems. quantity being evaluated, and the structure of the test.

5.4. Prototype Waveform Interpolative (PWI) Coding 6.1.1. Intelligibility. Speech coder intelligibility is eval-
A different kind of coding technique that has proper- uated by coding a number of prepared words, asking lis-
ties of both waveform and LPC-based coders has been teners to write down the words they hear, and calculating
proposed [59,60] and is called prototype waveform interpo- the percentage of correct transcriptions (an adjustment for
lation (PWI). PWI uses both interpolation in the frequency guessing may be subtracted from the score). The diagnostic
domain and forward–backward prediction in the time rhyme test (DRT) and diagnostic alliteration test (DALT)
domain. The technique is based on the assumption that, for are intelligibility tests which use a controlled vocabulary
voiced speech, a perceptually accurate speech signal can to test for specific types of intelligibility loss [101,102].
be reconstructed from a description of the waveform of a Each test consists of 96 pairs of confusable words spo-
single, representative pitch cycle per interval of 20–30 ms. ken in isolation. The words in a pair differ in only one
The assumption exploits the fact that voiced speech can distinctive feature, where the distinctive feature dimen-
be interpreted as a concentration of slowly evolving pitch sions proposed by Voiers are voicing, nasality, sustention,
cycle waveforms. The prototype waveform is described by sibilation, graveness, and compactness. In the DRT, the
a set of linear prediction (LP) filter coefficients describing words in a pair differ in only one distinctive feature of the
the formant structure and a prototype excitation wave- initial consonant; for instance, ‘‘jest’’ and ‘‘guest’’ differ in
form, quantized with analysis-by-synthesis procedures. the sibilation of the initial consonant. In the DALT, words
The speech signal is reconstructed by filtering an excita- differ in the final consonant; for instance, ‘‘oaf’’ and ‘‘oath’’
tion signal consisting of the concatenation of (infinitesimal) differ in the graveness of the final consonant. Listeners
sections of the instantaneous excitation waveforms. By hear one of the words in each pair, and are asked to select
coding the voiced and unvoiced components separately, a the word from two written alternatives. Professional test-
2.4-kbps version of the coder performed similarly to the ing firms employ trained listeners who are familiar with
4.8-kbps FS1016 standard [61]. the speakers and speech tokens in the database, in order
Recent work has aimed at reducing the computational to minimize test-retest variability.
complexity of the coder for rates between 1.2 and 2.4 kbps Intelligibility scores quoted in the speech coding
by including a time-varying waveform sampling rate and literature often refer to the composite results of a DRT.
a cubic B-spline waveform representation [62,86]. In a comparison of two federal standard coders, the LPC
10e algorithm resulted in 90% intelligibility, while the
6. MEASURES OF SPEECH QUALITY FS-1016 CELP algorithm had 91% intelligibility [64].
An evaluation of waveform interpolative (WI) coding
Deciding on an appropriate measurement of quality is published DRT scores of 87.2% for the WI algorithm, and
one of the most difficult aspects of speech coder design, 87.7% for FS-1016 [61].
and is an area of current research and standardization.
Early military speech coders were judged according to only 6.1.2. Numerical Measures of Perceptual Qual-
one criterion: intelligibility. With the advent of consumer- ity. Perhaps the most commonly used speech quality
grade speech coders, intelligibility is no longer a sufficient measure is the mean opinion score (MOS). A mean opin-
condition for speech coder acceptability. Consumers want ion score is computed by coding a set of spoken phrases
speech that sounds ‘‘natural.’’ A large number of subjective using a variety of coders, presenting all of the coded
and objective measures have been developed to quantify speech together with undegraded speech in random order,
‘‘naturalness,’’ but it must be stressed that any scalar asking listeners to rate the quality of each phrase on
measurement of ‘‘naturalness’’ is an oversimplification. a numerical scale, and then averaging the numerical
‘‘Naturalness’’ is a multivariate quantity, including such ratings of all phrases coded by a particular coder. The
factors as the metallic versus breathy quality of speech, five-point numerical scale is associated with a standard
the presence of noise, the color of the noise (narrowband set of descriptive terms: 5 = excellent, 4 = good, 3 = fair,
noise tends to be more annoying than wideband noise, 2 = poor, and 1 = bad. A rating of 4 is supposed to corre-
but the parameters that predict ‘‘annoyance’’ are not well spond to standard toll-quality speech, quantized at 64 kbps
understood), the presence of unnatural spectral envelope using ITU standard G.711 [48].
modulations (e.g., flutter noise), and the absence of natural Mean opinion scores vary considerably depending
spectral envelope modulations. on background noise conditions; for example, CVSD
performs significantly worse than LPC-based methods in
6.1. Psychophysical Measures of Speech Quality quiet recording conditions, but significantly better under
(Subjective Tests) extreme noise conditions [96]. Gender of the speaker may
The final judgment of speech coder quality is the also affect the relative ranking of coders [96]. Expert
judgment made by human listeners. If consumers (and listeners tend to give higher rankings to speech coders
EOT156
SPEECH CODING: FUNDAMENTALS AND APPLICATIONS 13

4 4 G
C

Jarvinen
Kohler

P F
3 K 3

2 O E 2
2.4 4.8 8 16 32 64 2.4 4.8 8 16 32 64

4 4 G B A Figure 12. Mean opinion scores

Yeldener

I B from five published studies in quiet

MPEG
N H H D
recording conditions — Jarvinen [53],
3 3 M J Kohler [64], MPEG [39], Yeldener
M [107], and the COMSAT and MPC
2 2 K sites from Tardelli et al. [96]: (A)
unmodified speech, (B) ITU G.722
2.4 4.8 8 16 32 64 2.4 4.8 8 16 32 64 subband ADPCM, (C) ITU G.726
ADPCM, (D) ISO MPEG-II layer 3
A A subband audio coder, (E) DDVPC
4 4
C CVSD, (F) GSM full-rate RPE-LTP,
COMSAT

I C (G) GSM EFR ACELP, (H) ITU G.729

MPC

3 L 3 I ACELP, (I) TIA IS54 VSELP, (J)

K
L L
K ITU G.723.1 MPLPC, (K) DDVPC
E L
E FS-1016 CELP, (L) sinusoidal trans-
O O form coding, (M) ISO MPEG-IV
2 E 2 E
HVXC, (N) Inmarsat mini-M AMBE,
2.4 4.8 8 16 32 64 2.4 4.8 8 16 32 64
(O) DDVPC FS-1015 LPC-10e, (P)
Bit rate (kbps) Bit rate (kbps) DDVPC MELP.

with which they are familiar, even when they are not tested under the same conditions may be as reliable as
consciously aware of the order in which coders are DAM testing [96].
presented [96]. Factors such as language and location of
the testing laboratory may shift the scores of all coders up
6.1.3. Comparative Measures of Perceptual Quality. It is
or down, but tend not to change the rank order of individual
sometimes difficult to evaluate the statistical significance
coders [39]. For all of these reasons, a serious MOS test
of a reported MOS difference between two coders. A
must evaluate several reference coders in parallel with the
more powerful statistical test can be applied if coders are
coder of interest, and under identical test conditions. If an
evaluated in explicit A/B comparisons. In a comparative
MOS test is performed carefully, intercoder differences
test, a listener hears the same phrase coded by two
of approximately 0.15 opinion points may be considered different coders, and chooses the one that sounds better.
significant. Figure 12 is a plot of MOS as a function of bit The result of a comparative test is an apparent preference
rate for coders evaluated under quiet listening conditions score, and an estimate of the significance of the observed
in five published studies (one study included separately preference; for example, in a 1999 study, WI coding at
tabulated data from two different testing sites [96]). 4.0 kbps was preferred to 4 kbps HVXC 63.7% of the
The diagnostic acceptability measure (DAM) is an time, to 5.3 kbps G.723.1 57.5% of the time (statistically
attempt to control some of the factors that lead to significant differences), and to 6.3 kbps G.723.1 53.9% of
variability in published MOS scores [100]. The DAM the time (not statistically significant) [29]. It should be
employs trained listeners, who rate the quality of noted that ‘‘statistical significance’’ in such a test refers
standardized test phrases on 10 independent perceptual only to the probability that the same listeners listening to
scales, including six scales that rate the speech itself the same waveforms will show the same preference in a
(fluttering, thin, rasping, muffled, interrupted, nasal), future test.
and four scales that rate the background noise (hissing,
buzzing, babbling, rumbling). Each of these is a 100-
6.2. Algorithmic Measures of Speech Quality (Objective
point scale, with a range of approximately 30 points
Measures)
between the LPC-10e algorithm (50 points) and clean
speech (80 points) [96]. Scores on the various perceptual Psychophysical testing is often inconvenient; it is not
scales are combined into a composite quality rating. DAM possible to run psychophysical tests to evaluate every
scores are useful for pointing out specific defects in a proposed adjustment to a speech coder. For this reason,
speech coding algorithm. If the only desired test outcome a number of algorithms have been proposed that
is a relative quality ranking of multiple coders, a carefully approximate, to a greater or lesser extent, the results
controlled MOS test in which all coders of interest are of psychophysical testing.
EOT156
14 SPEECH CODING: FUNDAMENTALS AND APPLICATIONS

The signal-to-noise ratio of a frame of N speech samples The ITU perceptual speech quality measure (PSQM)
starting at sample number n may be defined as computes the perceptual quality of a speech signal by
filtering the input and quantized signals using a Bark-

n+N−1 scale filterbank, nonlinearly compressing the amplitudes
s2 (m) in each band, and then computing an average subband
SNR(n) =
m=n
(49) signal to noise ratio [51]. The development of algorithms

n+N−1 that accurately predict the results of MOS or comparative
2
e (m) testing is an area of active current research, and a number
m=n of improvements, alternatives, and/or extensions to the
PSQM measure have been proposed. An algorithm that
High-energy signal components can mask quantization has been the focus of considerable research activity is the
error, which is synchronous with the signal component, Bark spectral distortion measure [73,103,105,106]. The
or separated by at most a few tens of milliseconds. Over ITU has also proposed an extension of the PSQM standard
longer periods of time, listeners accumulate a general called perceptual evaluation of speech quality (PESQ) [81],
perception of quantization noise, which can be modeled as which will be released as ITU standard P.862.
the average log segmental SNR:

7. NETWORK ISSUES
1
K−1
SEGSNR = 10 log10 SNR(kN) (50)
K 7.1. Voice over IP
k=0

Speech coding for the voice over Internet Protocol (VOIP)

High-amplitude signal components tend to mask application is becoming important with the increasing
quantization error components at nearby frequencies and dependency on the Internet. The first VoIP standard
times. A high-amplitude spectral peak in the speech signal was published in 1998 as recommendation H.323 [52] by
is able to mask quantization error components at the same the International Telecommunications Union (ITU-T). It
frequency, at higher frequencies, and to a much lesser is a protocol for multimedia communications over local
extent, at lower frequencies. Given a short-time speech area networks using packet switching, and the voice-only
spectrum S(ejω ), it is possible to compute a short-time subset of it provides a platform for IP-based telephony.
‘‘masking spectrum’’ M(ejω ) which describes the threshold At high bit rates, H.323 recommends the coders G.711
energy at frequency ω below which noise components are (3.4 kHz at 48, 56, and 64 kbps) and G.722 (wideband
inaudible. The perceptual salience of a noise signal e(n) speech and music at 7 kHz operating at 48, 56, and
may be estimated by filtering the noise signal into K 64 kbps) while at the lower bit rates G.728 (3.4 kHz at
different subband signals ek (n), and computing the ratio 16 kbps), G.723 (5.3 and 6.5 kbps), and G.729 (8 kbps) are
between the noise energy and the masking threshold in recommended [52].
each subband: In 1999, a competing and simpler protocol named
the Session Initiation Protocol (SIP) was developed by

n+N−1
the Internet Engineering Task Force (IETF) Multiparty
e2k (m)
Multimedia Session Control working group and published
m=n
NMR(n, k) = ωk+1 (51) as RFC 2543 [19]. SIP is a signaling protocol for Internet
|M(ejω )|2 dω conferencing and telephony, is independent of the packet
ωk layer, and runs over UDP or TCP although it supports
more protocols and handles the associations between
where ωk is the lower edge of band k, and ωk+1 is the Internet end systems. For now, both systems will coexist
upper band edge. The band edges must be close enough but it is predicted that the H.323 and SIP architectures will
together that all of the signal components in band k are evolve such that two systems will become more similar.
effective in masking the signal ek (n). The requirement of Speech transmission over the Internet relies on sending
effective masking is met if each band is exactly one Bark ‘‘packets’’ of the speech signal. Because of network
in width, where the Bark frequency scale is described in congestion, packet loss can occur, resulting in audible
many references [71,77]. artifacts. High-quality VOIP, hence, would benefit from
Fletcher has shown that the perceived loudness of a variable-rate source and channel coding, packet loss
signal may be approximated by adding the cube roots concealment, and jitter buffer/delay management. These
of the signal power in each one-bark subband, after are challenging issues and research efforts continue to
properly accounting for masking effects [20]. The total generate high-quality speech for VOIP applications [38].
loudness of a quantization noise signal may therefore be
approximated as 7.2. Embedded and Multimode Coding
 1/3 When channel quality varies, it is often desirable to adjust

n+N−1
the bit rate of a speech coder in order to match the channel
 
e2k [m]

K−1
 m=n


capacity. Varying bit rates are achieved in one of two
NMR(n) =   (52) ways. In multimode speech coding, the transmitter and
 ωk+1

k=0  |M(e )| dω 
jω 2
the receiver must agree on a bit rate prior to transmission
ωk
of the coded bits. In embedded source coding, on the other
EOT156
SPEECH CODING: FUNDAMENTALS AND APPLICATIONS 15

hand, the bitstream of the coder operating at low bit rates In MPLPC, on the other hand, minimizing the perceptual-
is embedded in the bitstream of the coder operating at error weighting is achieved by choosing the amplitude and
higher rates. Each increment in bit rate provides marginal position of a number of pulses in the excitation signal.
improvement in speech quality. Lower bit rate coding is Voice activity detection (VAD) is used to reduce the bit
obtained by puncturing bits from the higher rate coder rate during silent periods, and switching from one bit rate
and typically exhibits graceful degradation in quality with to another is done on a frame-by-frame basis.
decreasing bit rates. Multimode coders have been proposed over a wide vari-
ITU Standard G.727 describes an embedded ADPCM ety of bandwidths. Taniguchi et al. proposed a multimode
coder, which may be run at rates of 40, 32, 24, or ADPCM coder at bit rates between 10 and 35 kbps [94].
16 kbps (5, 4, 3, or 2 bits per sample) [46]. Embedded Johnson and Taniguchi proposed a multimode CELP algo-
ADPCM algorithms are a family of variable bit rate coding rithm at data rates of 4.0–5.3 kbps in which additional
algorithms operating on a sample per sample basis (as stochastic codevectors are added to the LPC excitation vec-
opposed to, e.g., a subband coder that operates on a frame- tor when channel conditions are sufficiently good to allow
by-frame basis) that allows for bit dropping after encoding. high-quality transmission [55]. The European Telecom-
The decision levels of the lower-rate quantizers are subsets munications Standards Institute (ETSI) has recently pro-
of those of the quantizers at higher rates. This allows for posed a standard for adaptive multirate coding at rates
bit reduction at any point in the network without the need between 4.75 and 12.2 kbps.
of coordination between the transmitter and the receiver.
The prediction in the encoder is computed using a 7.3. Joint Source-Channel Coding
more coarse quantization of d̂(n) than the quantization In speech communication systems, a major challenge is
actually transmitted. For example, 5 bits per sample to design a system that provides the best possible speech
may be transmitted, but as few as 2 bits may be used quality throughout a wide range of channel conditions. One
to reconstruct d̂(n) in the prediction loop. Any bits not solution consists of allowing the transceivers to monitor
used in the prediction loop are marked as ‘‘optional’’ by the state of the communication channel and to dynamically
the signaling channel mode flag. If network congestion allocate the bitstream between source and channel coding
disrupts traffic at a router between sender and receiver, accordingly. For low-SNR channels, the source coder
the router is allowed to drop optional bits from the coded operates at low bit rates, thus allowing powerful forward
speech packets. error control. For high-SNR channels, the source coder
Embedded ADPCM algorithms produce codewords that uses its highest rate, resulting in high speech quality,
contain enhancement and core bits. The feedforward (FF) but with little error control. An adaptive algorithm selects
path of the codec utilizes both enhancement bits and core a source coder and channel coder based on estimates
bits, while the feedback (FB) path uses core bits only. of channel quality in order to maintain a constant
With this structure, enhancement bits can be discarded or total data rate [95]. This technique is called adaptive
dropped during network congestion. multirate (AMR) coding, and requires the simultaneous
An important example of a multimode coder is QCELP, implementation of an AMR source coder [24], an AMR
the speech coder standard that was adopted by the TIA channel coder [26,28], and a channel quality estimation
North American digital cellular standard based on code- algorithm capable of acquiring information about channel
division multiple access (CDMA) technology [9]. The coder conditions with a relatively small tracking delay.
selects one of four data rates every 20 ms depending on the The notion of determining the relative importance
speech activity; for example, background noise is coded at a of bits for further unequal error protection (UEP)
lower rate than speech. The four rates are approximately was pioneered by Rydbeck and Sundberg [83]. Rate-
1 kbps (eighth rate), 2 kbps (quarter rate), 4 kbps (half compatible channel codes, such as Hagenauer’s rate
rate), and 8 kbps (full rate). QCELP is based on the CELP compatible punctured convolutional codes (RCPC) [34],
structure but integrates implementation of the different are a collection of codes providing a family of channel
rates, thus reducing the average bit rate. For example, coding rates. By puncturing bits in the bitstream, the
at the higher rates, the LSP parameters are more finely channel coding rate of RCPC codes can be varied
quantized and the pitch and codebook parameters are instantaneously, providing UEP by imparting on different
updated more frequently [23]. The coder provides good segments different degrees of protection. Cox et al. [13]
quality speech at average rates of 4 kbps. address the issue of channel coding and illustrate how
Another example of a multimode coder is ITU standard RCPC codes can be used to build a speech transmission
G.723.1, which is an LPC-AS coder that can operate at scheme for mobile radio channels. Their approach is
2 rates: 5.3 or 6.3 kbps [50]. At 6.3 kbps, the coder is a based on a subband coder with dynamic bit allocation
multipulse LPC (MPLPC) coder while the 5.3-kbps coder proportional to the average energy of the bands. RCPC
is an algebraic CELP (ACELP) coder. The frame size is codes are then used to provide UEP.
30 ms with an additional lookahead of 7.5 ms, resulting Relatively few AMR systems describing source and
in a total algorithmic delay of 67.5 ms. The ACELP and channel coding have been presented. The AMR sys-
MPLPC coders share the same LPC analysis algorithm tems [99,98,75,44] combine different types of variable rate
and frame/subframe structure, so that most of the program CELP coders for source coding with RCPC and cyclic
code is used by both coders. As mentioned earlier, in redundancy check (CRC) codes for channel coding and
ACELP, an algebraic transformation of the transmitted were presented as candidates for the European Telecom-
index produces the excitation signal for the synthesizer. munications Standards Institute (ETSI) GSM AMR codec
EOT156
16 SPEECH CODING: FUNDAMENTALS AND APPLICATIONS

standard. In [88], UEP is applied to perceptually based including the layer 3 codec commonly known as MP3 [72].
audio coders (PAC). The bitstream of the PAC is divided The MPEG-4 motion picture standard includes a struc-
into two classes and punctured convolutional codes are tured audio standard [40], in which speech and audio
used to provide different levels of protection, assuming a ‘‘objects’’ are encoded with header information specifying
BPSK constellation. the coding algorithm. Low-bit-rate speech coding is per-
In [5,6], a novel UEP channel encoding scheme is formed using harmonic vector excited coding (HVXC) [43]
introduced by analyzing how symbol-wise puncturing of or code-excited LPC (CELP) [41], and audiocoding is per-
symbols in a trellis code and the rate-compatibility con- formed using time–frequency coding [42]. The MPEG
straint (progressive puncturing pattern) can be used to homepage is at drogo.cselt.stet.it/mpeg.
derive rate-compatible punctured trellis codes (RCPT). Standards for cellular telephony in Europe are estab-
While conceptually similar to RCPC codes, RCPT codes lished by the European Telecommunications Standards
are specifically designed to operate efficiently on large con- Institute (ETSI) (http://www.etsi.org). ETSI speech coding
stellations (for which Euclidean and Hamming distances standards are published by the Global System for Mobile
are no longer equivalent) by maximizing the residual Telecommunications (GSM) subcommittee. All speech cod-
Euclidean distance after symbol puncturing. Large con- ing standards for digital cellular telephone use are based
stellation sizes, in turn, lead to higher throughput and on LPC-AS algorithms. The first GSM standard coder was
spectral efficiency on high SNR channels. An AMR system based on a precursor of CELP called regular-pulse excita-
is then designed based on a perceptually-based embed- tion with long-term prediction (RPE-LTP) [37,65]. Current
ded subband encoder. Since perceptually based dynamic GSM standards include the enhanced full-rate codec GSM
bit allocations lead to a wide range of bit error sensitiv- 06.60 [32,53] and the adaptive multirate codec [33]; both
ities (the perceptually least important bits being almost standards use algebraic code-excited LPC (ACELP). At
insensitive to channel transmission errors), the channel the time of writing, both ITU and ETSI are expected to
protection requirements are determined accordingly. The announce new standards for wideband speech coding in the
AMR systems utilize the new rate-compatible channel cod- near future. ETSI’s standard will be based on GSM AMR.
ing technique (RCPT) for UEP and operate on an 8-PSK The Telecommunications Industry Association (http://
constellation. The AMR-UEP system is bandwidth effi- www.tiaonline.org) published some of the first U.S. digital
cient, operates over a wide range of channel conditions cellular standards, including the vector-sum-excited LPC
and degrades gracefully with decreasing channel quality. (VSELP) standard IS54 [25]. In fact, both the initial U.S.
Systems using AMR source and channel coding are and Japanese digital cellular standards were based on
likely to be integrated in future communication systems the VSELP algorithm. The TIA has been active in the
since they have the capability for providing graceful speech development of standard TR41 for voice over IP.
degradation over a wide range of channel conditions. The U.S. Department of Defense Voice Processing
Consortium (DDVPC) publishes speech coding standards
8. STANDARDS for U.S. government applications. As mentioned earlier,
the original FS-1015 LPC-10e standard at 2.4 kbps [8,16],
Standards for landline public switched telephone service originally developed in the 1970s, was replaced in
(PSTN) networks are established by the International 1996 by the newer MELP standard at 2.4 kbps [92].
Telecommunication Union (ITU) (http://www.itu.int). The Transmission at slightly higher bit rates uses the FS-
ITU has promulgated a number of important speech 1016 CELP (CELP) standard at 4.8 kbps [17,56,57].
and waveform coding standards at high bit rates and Waveform applications use the continuously variable slope
with very low delay, including G.711 (PCM), G.727 and delta modulator (CVSD) at 16 kbps. Descriptions of all
G.726 (ADPCM), and G.728 (LDCELP). The ITU is also DDVPC standards and code for most are available at
involved in the development of internetworking standards, http://www.plh.af.mil/ddvpc/index.html.
including the voice over IP standard H.323. The ITU has
developed one widely used low-bit-rate coding standard
(G.729), and a number of embedded and multimode speech 9. FINAL REMARKS
coding standards operating at rates between 5.3 kbps
(G.723.1) and 40 kbps (G.727). Standard G.729 is a speech In this article, we presented an overview of coders
coder operating at 8 kbps, based on algebraic code-excited that compress speech by attempting to match the time
LPC (ACELP) [49,84]. G.723.1 is a multimode coder, waveform as closely as possible (waveform coders), and
capable of operating at either 5.3 or 6.3 kbps [50]. G.722 coders that attempt to preserve perceptually relevant
is a standard for wideband speech coding, and the ITU spectral properties of the speech signal (LPC-based
will announce an additional wideband standard within and subband coders). LPC-based coders use a speech
the near future. The ITU has also published standards production model to parameterize the speech signal,
for the objective estimation of perceptual speech quality while subband coders filter the signal into frequency
(P.861 and P.862). bands and assign bits by either an energy or perceptual
The ITU is a branch of the International Standards criterion. Issues pertaining to networking, such as voice
Organization (ISO) (http://www.iso.ch). In addition to over IP and joint source–channel coding, were also
ITU activities, the ISO develops standards for the Moving touched on. There are several other coding techniques
Picture Experts Group (MPEG). The MPEG-2 standard that we have not discussed in this article because of
included digital audiocoding at three levels of complexity, space limitations. We hope to have provided the reader
EOT156
SPEECH CODING: FUNDAMENTALS AND APPLICATIONS 17

with an overview of the fundamental techniques of speech 3. B. S. Atal, High-quality speech at low bit rates: Multi-pulse
compression. and stochastically excited linear predictive coders, Proc.
ICASSP, 1986, pp. 1681–1684.
4. B. S. Atal and J. R. Remde, A new model of LPC excitation
Acknowledgments
for producing natural-sounding speech at low bit rates, Proc.
This research was supported in part by the NSF and HRL.
ICASSP, 1982, pp. 614–617.
We thank Alexis Bernard and Tomohiko Taniguchi for their
suggestions on earlier drafts of the article. 5. A. Bernard, X. Liu, R. Wesel, and A. Alwan, Channel
adaptive joint-source channel coding of speech, Proc. 32nd
Asilomar Conf. Signals, Systems, and Computers, 1998,
BIOGRAPHIES Vol. 1, pp. 357–361.
6. A. Bernard, X. Liu, R. Wesel, and A. Alwan, Embedded
Mark A. Hasegawa-Johnson received his S.B., S.M., joint-source channel coding of speech using symbol punc-
and Ph.D. degrees in electrical engineering and computer turing of trellis codes, Proc. IEEE ICASSP, 1999, Vol. 5,
science from MIT in 1989, 1989, and 1996, respectively. pp. 2427–2430.
From 1989 to 1990 he worked as a research engineer 7. M. S. Brandstein, P. A. Monta, J. C. Hardwick, and
at Fujitsu Laboratories Ltd., Kawasaki, Japan, where J. S. Lim, A real-time implementation of the improved MBE
he developed and patented a multimodal CELP speech speech coder, Proc. ICASSP, 1990, Vol. 1: pp. 5–8.
coder with an efficient algebraic fixed codebook. From 8. J. P. Campbell and T. E. Tremain, Voiced/unvoiced classifi-
1996–1999 he was a postdoctoral fellow in the Electrical cation of speech with applications to the U.S. government
Engineering Department at UCLA. Since 1999, he has LPC-10E algorithm, Proc. ICASSP, 1986, pp. 473–476.
been on the faculty of the University of Illinois at
9. CDMA, Wideband Spread Spectrum Digital Cellular System
Urbana-Champaign. Dr. Hasegawa-Johnson holds four
Dual-Mode Mobile Station-Base Station Compatibility
U.S. patents and is the author of four journal articles Standard, Technical Report Proposed EIA/TIA Interim
and twenty conference papers. His areas of interest include Standard, Telecommunications Industry Association TR45.5
speech coding, automatic speech understanding, acoustics, Subcommittee, 1992.
and the physiology of speech production.
10. J.-H. Chen et al., A low delay CELP coder for the CCITT
16 kb/s speech coding standard, IEEE J. Select. Areas
Abeer Alwan received her Ph.D. in electrical engineer- Commun. 10: 830–849 (1992).
ing from MIT in 1992. Since then, she has been with
11. J.-H. Chen and A. Gersho, Adaptive postfiltering for quality
the Electrical Engineering Department at UCLA, Cal- enhancement of coded speech, IEEE Trans. Speech Audio
ifornia, as an assistant professor (1992–1996), associate Process. 3(1): 59–71 (1995).
professor (1996–2000), and professor (2000–present). Pro-
12. R. Cox et al., New directions in subband coding, IEEE JSAC
fessor Alwan established and directs the Speech Pro-
6(2): 391–409 (Feb. 1988).
cessing and Auditory Perception Laboratory at UCLA
(http://www.icsl.ucla.edu/∼spapl). Her research interests 13. R. Cox, J. Hagenauer, N. Seshadri, and C. Sundberg, Sub-
include modeling human speech production and percep- band speech coding and matched convolutional coding for
mobile radio channels, IEEE Trans. Signal Process. 39(8):
tion mechanisms and applying these models to speech-
1717–1731 (Aug. 1991).
processing applications such as automatic recognition,
compression, and synthesis. She is the recipient of the NSF 14. A. Das and A. Gersho, Low-rate multimode multiband
Research Initiation Award (1993), the NIH FIRST Career spectral coding of speech, Int. J. Speech Tech. 2(4): 317–327
Development Award (1994), the UCLA-TRW Excellence (1999).
in Teaching Award (1994), the NSF Career Develop- 15. G. Davidson and A. Gersho, Complexity reduction meth-
ment Award (1995), and the Okawa Foundation Award ods for vector excitation coding, Proc. ICASSP, 1986,
in Telecommunications (1997). Dr. Alwan is an elected pp. 2055–2058.
member of Eta Kappa Nu, Sigma Xi, Tau Beta Pi, and the 16. DDVPC, LPC-10e Speech Coding Standard, Technical
New York Academy of Sciences. She served as an elected Report FS-1015, U.S. Dept. of Defense Voice Processing
member on the Acoustical Society of America Technical Consortium, Nov. 1984.
Committee on Speech Communication (1993–1999), on the 17. DDVPC, CELP Speech Coding Standard, Technical Report
IEEE Signal Processing Technical Committees on Audio FS-1016, U.S. Dept. of Defense Voice Processing Consor-
and Electroacoustics (1996–2000) and Speech Processing tium, 1989.
(1996–2001). She is an editor in chief of the journal Speech
18. S. Dimolitsas, Evaluation of voice coded performance for
Communication. the Inmarsat Mini-M system, Proc. 10th Int. Conf. Digital
Satellite Communications, 1995.
BIBLIOGRAPHY 19. M. Handley et al., SIP: Session Initiation Protocol, IETF
RFC, March 1999, http://www.cs.columbia.edu/hgs/sip/
sip.html.
1. J.-P. Adoul, P. Mabilleau, M. Delprat, and S. Morisette, Fast
CELP coding based on algebraic codes, Proc. ICASSP, 1987, 20. H. Fletcher, Speech and Hearing in Communication, Van
pp. 1957–1960. Nostrand, Princeton, NJ, 1953.
2. B. S. Atal, Predictive coding of speech at low bit rates, IEEE 21. D. Florencio, Investigating the use of asymmetric windows
Trans. Commun. 30: 600–614 (1982). in CELP vocoders, Proc. ICASSP, 1993, Vol. II, pp. 427–430.
EOT156
18 SPEECH CODING: FUNDAMENTALS AND APPLICATIONS

22. S. Furui, Digital Speech Processing, Synthesis, and Recogni- 41. ISO/IEC, Information Technology — Coding of Audiovisual
tion, Marcel Dekker, New York, 1989. Objects, Part 3: Audio, Subpart 3: CELP, Technical Report
ISO/JTC 1/SC 29/N2203CELP, ISO/IEC, 1998.
23. W. Gardner, P. Jacobs, and C. Lee, QCELP: A variable
rate speech coder for CDMA digital cellular, in B. Atal, 42. ISO/IEC, Information Technology — Coding of Audiovisual
V. Cuperman, and A. Gersho, eds., Speech and Audio Coding Objects, Part 3: Audio, Subpart 4: Time/Frequency Coding,
for Wireless and Network Applications, Kluwer, Dordrecht, Technical Report ISO/JTC 1/SC 29/N2203TF, ISO/IEC,
The Netherlands, 1993, pp. 85–93. 1998.
24. A. Gersho and E. Paksoy, An overview of variable rate 43. ISO/IEC, Information Technology — Very Low Bitrate Audio-
speech coding for cellular networks, IEEE Int. Conf. Selected Visual Coding, Part 3: Audio, Subpart 2: Parametric Coding,
Topics in Wireless Communications Proc., June 1999, Technical Report ISO/JTC 1/SC 29/N2203PAR, ISO/IEC,
pp. 172–175. 1998.

25. I. Gerson and M. Jasiuk, Vector sum excited linear predic- 44. H. Ito, M. Serizawa, K. Ozawa, and T. Nomura, An adaptive
tion (VSELP), in B. S. Atal, V. S. Cuperman, and A. Gersho, multi-rate speech codec based on mp-celp coding algorithm
eds., Advances in Speech Coding, Kluwer, Dordrecht, The for etsi amr standard, Proc. ICASSP, 1998, Vol. 1,
Netherlands, 1991, pp. 69–80. pp. 137–140.

26. D. Goeckel, Adaptive coding for time-varying channels using 45. ITU-T, 40, 32, 24, 16 kbit/s Adaptive Differential Pulse
outdated fading estimates, IEEE Trans. Commun. 47(6): Code Modulation (ADPCM), Technical Report G.726,
844–855 (1999). International Telecommunications Union, Geneva, 1990.

27. R. Goldberg and L. Riek, A Practical Handbook of Speech 46. ITU-T, 5-, 4-, 3- and 2-bits per Sample Embedded Adaptive
Differential Pulse Code Modulation (ADPCM), Technical
Coders, CRC Press, Boca Raton, FL, 2000.
Report G.727, International Telecommunications Union,
28. A. Goldsmith and S. G. Chua, Variable-rate variable power Geneva, 1990.
MQAM for fading channels, IEEE Trans. Commun. 45(10):
47. ITU-T, Coding of Speech at 16 kbit/s Using Low-Delay
1218–1230 (1997).
Code Excited Linear Prediction, Technical Report G.728,
29. O. Gottesman and A. Gersho, Enhanced waveform inter- International Telecommunications Union, Geneva, 1992.
polative coding at 4 kbps, IEEE Workshop on Speech Coding,
48. ITU-T, Pulse Code Modulation (PCM) of Voice Frequencies,
Piscataway, NY, 1999, pp. 90–92.
Technical Report G.711, International Telecommunications
30. K. Gould, R. Cox, N. Jayant, and M. Melchner, Robust Union, Geneva, 1993.
speech coding for the indoor wireless channel, ATT Tech.
49. ITU-T, Coding of Speech at 8 kbit/s Using Conjugate-
J. 72(4): 64–73 (1993).
Structure Algebraic-Code-Excited Linear-Prediction (CS-
31. D. W. Griffn and J. S. Lim, Multi-band excitation vocoder, ACELP), Technical Report G.729, International Telecom-
IEEE Trans. Acoust. Speech Signal Process. 36(8): munications Union, Geneva, 1996.
1223–1235 (1988). 50. ITU-T, Dual Rate Speech Coder for Multimedia Communi-
32. Special Mobile Group (GSM), Digital Cellular cations Transmitting at 5.3 and 6.3 kbit/s, Technical Report
Telecommunications System: Enhanced Full Rate (EFR) G.723.1, International Telecommunications Union, Geneva,
Speech Transcoding, Technical Report GSM 06.60, 1996.
European Telecommunications Standards Institute (ETSI), 51. ITU-T, Objective Quality Measurement of Telephone-
1997. Band (300–3400 Hz) speech codecs, Technical Report
33. Special Mobile Group (GSM), Digital Cellular Telecommu- P.861, International Telecommunications Union, Geneva,
nications System (Phase 2+): Adaptive Multi-rate (AMR) 1998.
Speech Transcoding, Technical Report GSM 06.90, Euro- 52. ITU-T, Packet Based Multimedia Communications Systems,
pean Telecommunications Standards Institute (ETSI), 1998. Technical Report H.323, International Telecommunications
34. J. Hagenauer, Rate-compatible punctured convolutional Union, Geneva, 1998.
codes and their applications, IEEE Trans. Commun. 36(4): 53. K. Jarvinen et al., GSM enhanced full rate speech codec,
389–400 (1988). Proc. ICASSP, 1997, pp. 771–774.
35. J. C. Hardwick and J. S. Lim, A 4.8 kbps multi-band excita- 54. N. Jayant, J. Johnston, and R. Safranek, Signal compres-
tion speech coder, Proc. ICASSP, 1988, Vol. 1, pp. 374–377. sion based on models of human perception, Proc. IEEE
36. M. Hasegawa-Johnson, Line spectral frequencies are the 81(10): 1385–1421 (1993).
poles and zeros of a discrete matched-impedance vocal tract 55. M. Johnson and T. Taniguchi, Low-complexity multi-mode
model, J. Acoust. Soc. Am. 108(1): 457–460 (2000). VXC using multi-stage optimization and mode selection,
Proc. ICASSP, 1991, pp. 221–224.
37. K. Hellwig et al., Speech codec for the european mobile radio
system, Proc. IEEE Global Telecomm. Conf., 1989. 56. J. P. Campbell Jr., T. E. Tremain, and V. C. Welch, The
DOD 4.8 KBPS standard (proposed federal standard 1016),
38. O. Hersent, D. Gurle, and J.-P. Petit, IP Telephony, Addison-
in B. S. Atal, V. C. Cuperman, and A. Gersho, ed., Advances
Wesley, Reading, MA, 2000.
in Speech Coding, Kluwer, Dordrecht, The Netherlands,
39. ISO, Report on the MPEG-4 Speech Codec Verification Tests, 1991, pp. 121–133.
Technical Report JTC1/SC29/WG11, ISO/IEC, Oct. 1998.
57. J. P. Campbell, Jr., V. C. Welch, and T. E. Tremain, An
40. ISO/IEC, Information Technology — Coding of Audiovisual expandable error-protected 4800 BPS CELP coder (U.S.
Objects, Part 3: Audio, Subpart 1: Overview, Technical federal standard 4800 BPS voice coder), Proc. ICASSP, 1989,
Report ISO/JTC 1/SC 29/N2203, ISO/IEC, 1998. 735–738.
EOT156
SPEECH CODING: FUNDAMENTALS AND APPLICATIONS 19

58. P. Kabal and R. Ramachandran, The computation of line 77. L. Rabiner and B.-H Juang, Fundamentals of Speech
spectral frequencies using chebyshev polynomials, IEEE Recognition, Prentice-Hall, Englewood Cliffis, NJ, 1993.
Trans. Acoust. Speech Signal Process. ASSP-34: 1419–1426
78. L. R. Rabiner and R. W. Schafer, Digital Processing of
(1986).
Speech Signals, Prentice-Hall, Englewood Cliffs, NJ, 1978.
59. W. Kleijn, Speech coding below 4 kb/s using waveform inter-
79. R. P. Ramachandran and P. Kabal, Stability and perfor-
polation, Proc. GLOBECOM 1991, Vol. 3, pp. 1879–1883.
mance analysis of pitch filters in speech coders, IEEE Trans.
60. W. Kleijn and W. Granzow, Methods for waveform interpola- ASSP 35(7): 937–946 (1987).
tion in speech coding, Digital Signal Process. 1(4): 215–230
80. V. Ramamoorthy and N. S. Jayant, Enhancement of ADPCM
(1991).
speech by adaptive post-filtering, AT&T Bell Labs. Tech. J.
61. W. Kleijn and J. Haagen, A speech coder based on 63(8): 1465–1475 (1984).
decomposition of characteristic waveforms, Proc. ICASSP,
1995, pp. 508–511. 81. A. Rix, J. Beerends, M. Hollier, and A. Hekstra, PESQ — the
new ITU standard for end-to-end speech quality assessment,
62. W. Kleijn, Y. Shoham, D. Sen, and R. Hagen, A low- AES 109th Convention, Los Angeles, CA, Sept. 2000.
complexity waveform interpolation coder, Proc. ICASSP,
1996, pp. 212–215. 82. R. C. Rose and T. P. Barnwell, III, The self-excited
vocoder — an alternate approach to toll quality at 4800 bps,
63. W. B. Kleijn, D. J. Krasinski, and R. H. Ketchum, Improved Proc. ICASSP, 1986, pp. 453–456.
speech quality and efficient vector quantization in SELP,
Proc. ICASSP, 1988, pp. 155–158. 83. N. Rydbeck and C. E. Sundberg, Analysis of digital errors in
non-linear PCM systems, IEEE Trans. Commun. COM-24:
64. M. Kohler, A comparison of the new 2400bps MELP federal 59–65 (1976).
standard with other standard coders, Proc. ICASSP, 1997,
pp. 1587–1590. 84. R. Salami et al., Design and description of CS-ACELP: A
toll quality 8 kb/s speech coder, IEEE Trans. Speech Audio
65. P. Kroon, E. F. Deprettere, and R. J. Sluyter, Regular-pulse
Process. 6(2): 116–130 (1998).
excitation: A novel approach to effective and efficient multi-
pulse coding of speech, IEEE Trans. ASSP 34: 1054–1063 85. M. R. Schroeder and B. S. Atal, Code-excited linear predic-
(1986). tion (CELP): High-quality speech at very low bit rates, Proc.
ICASSP, 1985, pp. 937–940.
66. W. LeBlanc, B. Bhattacharya, S. Mahmoud, and V. Cuper-
man, Efficient search and design procedures for robust 86. Y. Shoham, Very low complexity interpolative speech coding
multi-stage VQ of LPC parameters for 4kb/s speech cod- at 1.2 to 2.4 kbp, Proc. ICASSP, 1997, pp. 1599–1602.
ing, IEEE Trans. Speech Audio Process. 1: 373–385 87. S. Singhal and B. S. Atal, Improving performance of multi-
(1993). pulse LPC coders at low bit rates, Proc. ICASSP, 1984,
67. D. Lin, New approaches to stochastic coding of speech pp. 1.3.1–1.3.4.
sources at very low bit rates, in I. T. Young et al., ed.,
88. D. Sinha and C.-E. Sundberg, Unequal error protection
Signal Processing III: Theories and Applications, Elsevier,
methods for perceptual audio coders, Proc. ICASSP, 1999,
Amsterdam, 1986, pp. 445–447.
Vol. 5, pp. 2423–2426.
68. A. McCree and J. C. De Martin, A 1.7 kb/s MELP coder with
89. F. Soong and B.-H. Juang, Line spectral pair (LSP)
improved analysis and quantization, Proc. ICASSP, 1998,
and speech data compression, Proc. ICASSP, 1984,
Vol. 2, pp. 593–596.
pp. 1.10.1–1.10.4.
69. A. McCree et al., A 2.4 kbps MELP coder candidate for the
90. J. Stachurski, A. McCree, and V. Viswanathan, High quality
new U.S. Federal standard, Proc. ICASSP, 1996, Vol. 1,
MELP coding at bit rates around 4 kb/s, Proc. ICASSP, 1999,
pp. 200–203.
Vol. 1, pp. 485–488.
70. A. V. McCree and T. P. Barnwell, III, A mixed excitation
LPC vocoder model for low bit rate speech coding, IEEE 91. N. Sugamura and F. Itakura, Speech data compression by
Trans. Speech Audio Process. 3(4): 242–250 (1995). LSP speech analysis-synthesis technique, Trans. IECE J64-
A(8): 599–606 (1981) (in Japanese).
71. B. C. J. Moore, An Introduction to the Psychology of Hearing,
Academic Press, San Diego, (1997). 92. L. Supplee, R. Cohn, and J. Collura, MELP: The new federal
standard at 2400 bps, Proc. ICASSP, 1997, pp. 1591–1594.
72. P. Noll, MPEG digital audio coding, IEEE Signal Process.
Mag. 14(5): 59–81 (1997). 93. B. Tang, A. Shen, A. Alwan, and G. Pottie, A perceptually-
based embedded subband speech coder, IEEE Trans. Speech
73. B. Novorita, Incorporation of temporal masking effects into Audio Process. 5(2): 131–140 (March 1997).
bark spectral distortion measure, Proc. ICASSP, Phoenix,
AZ, 1999, pp. 665–668. 94. T. Taniguchi, ADPCM with a multiquantizer for speech
coding, IEEE J. Select. Areas Commun. 6(2): 410–424 (1988).
74. E. Paksoy, W-Y. Chan, and A. Gersho, Vector quantization
of speech LSF parameters with generalized product codes, 95. T. Taniguchi, F. Amano, and S. Unagami, Combined source
Proc. ICASSP, 1992, pp. 33–36. and channel coding based on multimode coding, Proc.
ICASSP, 1990, pp. 477–480.
75. E. Paksoy et al., An adaptive multi-rate speech coder for
digital cellular telephony, Proc. of ICASSP, 1999, Vol. 1, 96. J. Tardelli and E. Kreamer, Vocoder intelligibility and
pp. 193–196. quality test methods, Proc. ICASSP, 1996, pp. 1145–1148.
76. K. K. Paliwal and B. S. Atal, Efficient vector quantization 97. I. M. Trancoso and B. S. Atal, Efficient procedures for
of LPC parameters at 24 bits/frame, IEEE Trans. Speech finding the optimum innovation in stochastic coders, Proc.
Audio Process. 1: 3–14 (1993). ICASSP, 1986, pp. 2379–2382.
EOT156
20 SPEECH CODING: FUNDAMENTALS AND APPLICATIONS

98. A. Uvliden, S. Bruhn, and R. Hagen, Adaptive multi-rate.

A speech service adapted to cellular radio network quality,
Proc. 32nd Asilomar Conf., 1998, Vol. 1, pp. 343–347.
99. J. Vainio, H. Mikkola, K. Jarvinen, and P. Haavisto, GSM
EFR based multi-rate codec family, Proc. ICASSP, 1998,
Vol. 1, pp. 141–144.
100. W. D. Voiers, Diagnostic acceptability measure for speech
communication systems, Proc. ICASSP, 1977, pp. 204–207.
101. W. D. Voiers, Evaluating processed speech using the diag-
nostic rhyme test, Speech Technol. 1(4): 30–39 (1983).
102. W. D. Voiers, Effects of noise on the discriminability of
distinctive features in normal and whispered speech, J.
Acoust. Soc. Am. 90: 2327 (1991).
103. S. Wang, A. Sekey, and A. Gersho, An objective measure
for predicting subjective quality of speech coders, IEEE J.
Select. Areas Commun. 10(5): 819–829 (1992).
104. S. W. Wong, An evaluation of 6.4 kbit/s speech codecs for
Inmarsat-M system, Proc. ICASSP, 1991, pp. 629–632.
105. W. Yang, M. Benbouchta, and R. Yantorno, Performance of
the modified bark spectral distortion measure as an objective
speech quality measure, Proc. ICASSP, 1998, pp. 541–544.
106. W. Yang and R. Yantorno, Improvement of MBSD by scaling
noise masking threshold and correlation analysis with MOS
difference instead of MOS, Proc. ICASSP, Phoenix, AZ, 1999,
pp. 673–676.
107. S. Yeldener, A 4 kbps toll quality harmonic excitation linear
predictive speech coder, Proc. ICASSP, 1999, pp. 481–484.

Speech Coding: Fundamentals and Applications: ARK Asegawa Ohnson
No ratings yet
Speech Coding: Fundamentals and Applications: ARK Asegawa Ohnson
20 pages
Speech Coding Journal
No ratings yet
Speech Coding Journal
20 pages
DAP Speech Coding v3 2025
No ratings yet
DAP Speech Coding v3 2025
49 pages
Speech Coder
No ratings yet
Speech Coder
20 pages
Speech and Audio Processing: Lecture-3
No ratings yet
Speech and Audio Processing: Lecture-3
20 pages
Wireless Communications by Theodore S Ra
No ratings yet
Wireless Communications by Theodore S Ra
31 pages
Speech Coding
100% (3)
Speech Coding
36 pages
New Speech Coding Techniques: Mr. L.Ramesh Ap/Ece
No ratings yet
New Speech Coding Techniques: Mr. L.Ramesh Ap/Ece
24 pages
Speech Coding Techniques
No ratings yet
Speech Coding Techniques
38 pages
Speech Coders For Wireless Communication
No ratings yet
Speech Coders For Wireless Communication
53 pages
LPC Modeling: Unit 5 1.speech Compression
No ratings yet
LPC Modeling: Unit 5 1.speech Compression
13 pages
Adaptive Multi Rate Coder Using ACLP
No ratings yet
Adaptive Multi Rate Coder Using ACLP
45 pages
4: Speech Compression: Data Rates
No ratings yet
4: Speech Compression: Data Rates
14 pages
Dokumen - Tips Elec9344speech Audio Processing 4pdfspeech Signal For Digital Storage or Transmission
No ratings yet
Dokumen - Tips Elec9344speech Audio Processing 4pdfspeech Signal For Digital Storage or Transmission
87 pages
Procedia: Speech Coding Techniques
No ratings yet
Procedia: Speech Coding Techniques
11 pages
Speech and Audio Coding
No ratings yet
Speech and Audio Coding
16 pages
Speech Generation
No ratings yet
Speech Generation
11 pages
Audio Compression
No ratings yet
Audio Compression
81 pages
Project Guidelines 3
No ratings yet
Project Guidelines 3
6 pages
Introduction To Speech Coding What, Why, Where & How (First Part)
No ratings yet
Introduction To Speech Coding What, Why, Where & How (First Part)
10 pages
2720_Slides7
No ratings yet
2720_Slides7
18 pages
Digital Speech Processing
No ratings yet
Digital Speech Processing
18 pages
5. Audio Coding and Standards
No ratings yet
5. Audio Coding and Standards
32 pages
Nice
No ratings yet
Nice
15 pages
Bab 7 Multimedia Kompresi Audio
No ratings yet
Bab 7 Multimedia Kompresi Audio
52 pages
MMC Unit III-1
No ratings yet
MMC Unit III-1
122 pages
(Ebook) Digital speech: coding for low bit rate communication systems by A. M. Kondoz ISBN 9780470870082, 9781423717584, 0470870087, 1423717589 - Discover the ebook with all chapters in just a few seconds
100% (3)
(Ebook) Digital speech: coding for low bit rate communication systems by A. M. Kondoz ISBN 9780470870082, 9781423717584, 0470870087, 1423717589 - Discover the ebook with all chapters in just a few seconds
50 pages
Ellis, Mandel - 2009 - Lecture 7 Audio Compression and Coding
No ratings yet
Ellis, Mandel - 2009 - Lecture 7 Audio Compression and Coding
37 pages
Speech Coding Systems
No ratings yet
Speech Coding Systems
90 pages
M3 Codecs
No ratings yet
M3 Codecs
61 pages
Transmission of Information: David Falconer and Halim Yanikomeroglu
No ratings yet
Transmission of Information: David Falconer and Halim Yanikomeroglu
42 pages
4 Chapter Audio and Video Compression (1)
No ratings yet
4 Chapter Audio and Video Compression (1)
122 pages
CELP
No ratings yet
CELP
23 pages
Speech Signal Analysis and Coding: Dr. Arun Kumar
No ratings yet
Speech Signal Analysis and Coding: Dr. Arun Kumar
52 pages
Human Speech Producing Organs: 2.4 Kbps
No ratings yet
Human Speech Producing Organs: 2.4 Kbps
108 pages
Low Bit Rate Speech Coding
No ratings yet
Low Bit Rate Speech Coding
165 pages
EE412/CS455 Principles of Digital Audio and Video
No ratings yet
EE412/CS455 Principles of Digital Audio and Video
71 pages
Unit 2 Wireless
No ratings yet
Unit 2 Wireless
159 pages
4. Human Speech Communication
No ratings yet
4. Human Speech Communication
44 pages
Digital Transmission
No ratings yet
Digital Transmission
25 pages
5. Speech Coding Techniques
No ratings yet
5. Speech Coding Techniques
17 pages
1.1 Motivation: Subband Coding Using Filter Banks OCTOBER 2011
No ratings yet
1.1 Motivation: Subband Coding Using Filter Banks OCTOBER 2011
30 pages
PPT Sistem Digital Nirkabel [TM3]
No ratings yet
PPT Sistem Digital Nirkabel [TM3]
64 pages
Bab13 TeknikDasarKompresiAudio
No ratings yet
Bab13 TeknikDasarKompresiAudio
36 pages
Chapter 9 - Speech Coding in GSM
No ratings yet
Chapter 9 - Speech Coding in GSM
44 pages
Lecture 16
No ratings yet
Lecture 16
23 pages
Digitizing and Packetizing Voice: Describe Cisco Voip Implementations
No ratings yet
Digitizing and Packetizing Voice: Describe Cisco Voip Implementations
24 pages
dịch bt
No ratings yet
dịch bt
13 pages
Comparative Analysis of Speech Compression Algorithms With Perceptual and LP Based Quality Evaluations
No ratings yet
Comparative Analysis of Speech Compression Algorithms With Perceptual and LP Based Quality Evaluations
1 page
Voipfuture_Whitepaper_CodecChoice_2020-1
No ratings yet
Voipfuture_Whitepaper_CodecChoice_2020-1
16 pages
CSPL 392
No ratings yet
CSPL 392
20 pages
Multi-Band Excitation Vocoder: RLE Technical Report No. 524
No ratings yet
Multi-Band Excitation Vocoder: RLE Technical Report No. 524
140 pages
UNIT IV AUDIO AND VIDEO CODING
No ratings yet
UNIT IV AUDIO AND VIDEO CODING
15 pages
Unit2 1
No ratings yet
Unit2 1
23 pages
Digital Speech Processing
No ratings yet
Digital Speech Processing
46 pages
Low Bit Rate Speech Coding
No ratings yet
Low Bit Rate Speech Coding
3 pages
Audio Coding
No ratings yet
Audio Coding
349 pages
Error-Correction on Non-Standard Communication Channels
From Everand
Error-Correction on Non-Standard Communication Channels
Edward A. Ratzer
No ratings yet
Key technologies for NG-PON2 system
From Everand
Key technologies for NG-PON2 system
Rawa Muayad
No ratings yet
Introduction to GSM: Second Edition
From Everand
Introduction to GSM: Second Edition
Guy Inchbald
No ratings yet
Kabir Data Preprocessing Python
No ratings yet
Kabir Data Preprocessing Python
14 pages
Performance Evaluation of Machine Learning Algorithms For A Cluster-Based Crop Recommendation System
No ratings yet
Performance Evaluation of Machine Learning Algorithms For A Cluster-Based Crop Recommendation System
5 pages
Notes For 351
No ratings yet
Notes For 351
16 pages
Advanced Power System Anlaysis Syllabus
No ratings yet
Advanced Power System Anlaysis Syllabus
2 pages
Channel Coding For Modern Communication Systems: Presented by Yasir Mehmood (200411018)
No ratings yet
Channel Coding For Modern Communication Systems: Presented by Yasir Mehmood (200411018)
20 pages
Cycling in Simplex Method: Aryaman Banga Karan Kukreja
No ratings yet
Cycling in Simplex Method: Aryaman Banga Karan Kukreja
12 pages
Numerical Methods Important PDF
No ratings yet
Numerical Methods Important PDF
174 pages
CS3401 Algorithms Unit IV
No ratings yet
CS3401 Algorithms Unit IV
57 pages
Artificial Neural Network - ..
100% (1)
Artificial Neural Network - ..
15 pages
CNN For Deep Learning - Convolutional Neural Networks
No ratings yet
CNN For Deep Learning - Convolutional Neural Networks
10 pages
The Transformer Architecture
No ratings yet
The Transformer Architecture
9 pages
Pengenalan FIlter Aktif
No ratings yet
Pengenalan FIlter Aktif
19 pages
Monte Carlo Turbulence Simulation Using Rational Approximations To Von Karman Spectra
No ratings yet
Monte Carlo Turbulence Simulation Using Rational Approximations To Von Karman Spectra
5 pages
TKPS - Discrete-Time Filter Design
No ratings yet
TKPS - Discrete-Time Filter Design
29 pages
161411-161601-Modelling, Simulation and Operations Research
No ratings yet
161411-161601-Modelling, Simulation and Operations Research
2 pages
Newton-Raphson Method
No ratings yet
Newton-Raphson Method
32 pages
pereira2021
No ratings yet
pereira2021
7 pages
DSP File
No ratings yet
DSP File
26 pages
Chapter 5 PDF
100% (1)
Chapter 5 PDF
16 pages
Chapter 4 Elements of Realizability Theory
No ratings yet
Chapter 4 Elements of Realizability Theory
37 pages
CS510 Notes 17 Approximation - Algorithms
No ratings yet
CS510 Notes 17 Approximation - Algorithms
40 pages
Tae 2 DM
No ratings yet
Tae 2 DM
2 pages
I Jcs It 20150603227
No ratings yet
I Jcs It 20150603227
3 pages
Chapter 3 Solution
No ratings yet
Chapter 3 Solution
134 pages
lec05-multi-armed-bandit
No ratings yet
lec05-multi-armed-bandit
4 pages
Modeling An Optimized Approach For Load Balancing in Cloud
No ratings yet
Modeling An Optimized Approach For Load Balancing in Cloud
19 pages
Digital Communication
No ratings yet
Digital Communication
7 pages
CS 240: Data Structures
No ratings yet
CS 240: Data Structures
17 pages
Adaptive Fault Detection Scheme Using An Optimized Self-Healing Ensemble Machine Learning Algorithm
100% (1)
Adaptive Fault Detection Scheme Using An Optimized Self-Healing Ensemble Machine Learning Algorithm
12 pages
Self Evaluation Test (Polynomials)
No ratings yet
Self Evaluation Test (Polynomials)
2 pages

Speech Coding: Fundamentals and Applications: ARK Asegawa Ohnson

Uploaded by

Speech Coding: Fundamentals and Applications: ARK Asegawa Ohnson

Uploaded by

EOT156

Table 1. Characteristics of Standardized Speech Coding Algorithms in Each of

Waveform coders 16–64 Low Landline telephone 2

Wiley Encyclopedia of Telecommunications, Edited by John G. Proakis

Table 2. A Representative Sample of Speech Coding Standards

Landline telephone 64 3.4 ITU G.711 µ-law or A-law PCM 1988

Pulse code modulation (PCM) is the name given to 2

0.6 (2–5 bits/sample) [45]. G.726 ADPCM is frequently used

Figure 2. Schematic of a DPCM coder.

4.1. The Basic LPC Model

ratios bitrate channel 1− ai z−i

Filter response magnitude

g2 Early LPC-AS coders minimized the mean-squared error

e(n) is an uncorrelated random noise signal, as shown in 80

Speech signal s (n ) the excitation. In general, U may be represented as

The choice of shape vectors and the values of the gains

stochastic codevectors are chosen sequentially. After selec- A(z) = − ai z−i

5. LPC VOCODERS 5.2. Mixed-Excitation Linear Prediction (MELP)

Vocal fold oscillation

4 4 G B A Figure 12. Mean opinion scores

I B from five published studies in quiet

I C (G) GSM EFR ACELP, (H) ITU G.729

3 L 3 I ACELP, (I) TIA IS54 VSELP, (J)

Speech coding for the voice over Internet Protocol (VOIP)

98. A. Uvliden, S. Bruhn, and R. Hagen, Adaptive multi-rate.

You might also like

Pulse code modulation (PCM) is the name given to 2