Digital Audio Processing
This material is based on work supported by the National Science Foundation. The first draft of this chapter was written under Grant No. DUE-0127280. The final version was written under Grant No. 0340969. This chapter was written by Dr. Jennifer Burg ([email protected]).
Know the common file types for digital audio and be able to choose an
multiple outputs so that speakers can be set up appropriately around an auditorium to create the
desired sound environment.
Figure 5.2 shows what your live performance audio set-up would look like if you used a
digital mixer. Notice how much hardware is absorbed into the software of the digital mixer.
Inside the digital mixer is an analog-to-digital converter (ADC). As the signal enters the mixer,
it is immediately converted to digital form so that further processing can be done on it in
software. Before it is sent to the outputs, it is converted back to analog form with a digital-to-analog converter (DAC).
Sound processing can also take place in settings other than live performances, for example, in sound recording studios. Much of the equipment is the same, with the difference that the sound is recorded, refined, processed with special effects, and finally compressed and packaged for distribution. Much of this work can be done by means of a personal computer equipped with the right sound card, microphones, and audio processing software.
In this book, we're going to assume that you're working in a mostly-digital environment,
using your personal computer as the central component of your digital audio workstation
(DAW). The most basic element of your DAW is a sound card or external sound interface. A
sound card is a piece of hardware that performs the following basic functions:
provides input jacks for microphones and external audio sources
converts sound from analog to digital form as it is being recorded, using an ADC
provides output jacks for headphones and external speakers
converts sound from digital to analog form as it is being played, using a DAC
synthesizes MIDI sound samples using either FM or wavetable synthesis, more often the
latter. (Not all sound cards have this capability.)
If the sound card that comes with your computer doesn't do a very good job with ADC, DAC, or
MIDI-handling, or if it doesn't offer you the input/output interface you want, you may want to
buy a better internal sound card or even an external sound interface. An external sound interface
can offer extra input and output channels and connections, including MIDI, and a choice of XLR,
¼-inch TRS, or other audio jacks to connect the microphones. To complete your DAW, you can
add digital audio processing software; headphones; and good quality monitors. If you want to
work with MIDI, you can also add hardware or software MIDI samplers, synthesizers, and/or a
controller.
Many computers have built-in microphones, but their quality isn't always very good. The
two types of external microphones that you're most likely to use are dynamic and capacitor mics.
A dynamic mic operates by means of an induction coil attached to a diaphragm and placed in a
magnetic field such that the coil moves in response to sound pressure oscillations. Dynamic
mics are versatile, sturdy, and fairly inexpensive, and they're good enough for most home or
classroom recording studio situations. A capacitor mic (also called a condenser mic) has a
diaphragm and a backplate that together form the two plates of a capacitor. As the distance
between the plates is changed by vibration, the capacitance changes, and in turn, so does the
voltage representing the sound wave. The material from which the plates are made is lighter and
thus more sensitive than the coil of a dynamic mic.
The main differences between dynamic and capacitor mics are these:
A dynamic mic is less expensive than a capacitor mic.
A capacitor mic requires an external power supply to charge the diaphragm and to drive
the preamp. A dynamic mic is sometimes more convenient because it doesn't require an
external power supply.
A dynamic mic has an upper frequency limit of about 16 kHz, while the capacitor's
frequency limit goes to 20 kHz and beyond. This difference is probably most noticeable
with regard to the upper harmonics of musical instruments.
A dynamic mic is fine for loud instruments or reasonably close vocals. A capacitor mic
is more sensitive to soft or distant sounds.
A capacitor mic is susceptible to damage from moisture condensation.
You can't really say that one type of microphone is better than another. It depends on
what you're recording, and it depends on the make and model of the microphone. Greater
sensitivity isn't always a good thing, since you may be recording vocals at close range and don't
want to pick up background noise, in which case a dynamic mic is preferable.
Microphones can also be classified according to the direction from which they receive
sound. The cardioid mic is most commonly used for voice recording, since its area of sensitivity
resembles a heart-shape around the front of the mic. An omnidirectional mic senses sound fairly
equally in a circle around the mic. It can be used for general environmental sounds. A bidirectional mic senses sound generally in a figure-eight area. The direction from which sounds
are sensed also depends in part on the frequency of the sound; for high frequencies in particular, the directional response of a microphone varies with the angle from which the sound arrives.
You may want to buy monitors for the audio work you do on your computer. A monitor
is just another name for a speaker, but one that treats all frequency components equally.
Speakers can be engineered to favor some frequencies over others, but for your audio work you
want to hear the sound exactly as you create it, with no special coloring. For this reason, you
should pay particular attention to the frequency characteristics of the speakers or monitors you
use. It's nice to have a good set of earphones as well, so that you can listen to your audio without
environmental background noise.
Not all audio application programs include all the features listed above. In some cases,
you have to install plug-ins to get all the functionality you want. Common plug-in formats include VST (Virtual Studio Technology), TDM (time division multiplexing), DirectX, and Apple's Audio Units (part of Core Audio). Plug-ins provide fine-tuned effects such as dynamic compression,
equalization, reverb, and delay.
In digital audio processing programs, you often have a choice between working in a
waveform view or a multitrack view. The waveform view gives you a graphical picture of
sound waves like the ones we have been describing since the beginning of Chapter 4. You can view and
edit this sound wave down to the level of individual sample values, and for this reason the
waveform view is sometimes called the sample editor. The waveform view is where you apply
effects and processes that cannot be done in real-time, primarily because these processes require
that the whole audio file be examined before new values can be computed. When such processes
are executed, they permanently alter the sample values. In contrast, effects that can be applied in
real-time do not alter sample values.
The waveform view of an audio file displays time across the horizontal axis and
amplitude up the vertical axis. The standard representation of time for digital audio and video is
SMPTE, which stands for Society of Motion Picture and Television Engineers. SMPTE
divides the timeline into units of hours, minutes, seconds, and frames. The unit of a frame is
derived from digital audio's association with video. A video file is divided into a sequence of
individual still images called frames. One standard frame rate is 30 frames per second, but this
varies according to the type of video. A position within an audio file can be denoted as h:m:s:f.
For example, 0:10:42:14 would be 0 hours, 10 minutes, 42 seconds, and 14 frames from the beginning of
the audio file. Audio editing programs allow you to slide a Current Position Cursor to any
moment in the audio file, or alternatively you can type in the position as h:m:s:f. Other time
formats are also possible. For example, some audio editors allow you to look at time in terms of
samples, bars and beats, or decimal seconds.
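For example, a position given as h:m:s:f can be converted to an absolute sample index. The sketch below assumes a 30 fps frame rate and a 44.1 kHz sampling rate; both are illustrative values, since the actual rates depend on the project settings.

```python
def smpte_to_samples(h, m, s, f, frame_rate=30, sampling_rate=44100):
    """Convert an h:m:s:f SMPTE position to an absolute sample index."""
    total_seconds = h * 3600 + m * 60 + s + f / frame_rate
    return round(total_seconds * sampling_rate)

# 0:10:42:14 at 30 fps and 44.1 kHz
print(smpte_to_samples(0, 10, 42, 14))   # 28332780
```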
The multitrack view allows you to record different sounds, musical instruments, or voices on separate tracks so that you can work with these units independently, modifying frequency or dynamic range, applying reverb or phase shifts, whatever you want for your artistic purposes. A track is a sequence of audio samples that can be played and/or edited as a separate unit. Often, different tracks are associated with different instruments or voices. You can record one instrument on one track, and then you can record a second instrument or voice on another. If you play the first track while recording the second, you can easily synchronize the two. With the right digital processing software and a good computer with plenty of RAM and external storage, you can have your own sound studio. If you're also a musician, you can do all of the musical parts yourself, recording one track after another. In the end, you'll probably want to mix down the tracks, collapsing them all into one unit. This is very much like flattening the layers of a digital image. Once you have done this, you can't apply effects to the separate parts (but if you're smart, you'll have kept a copy of the multitrack version just in case). The mixed-down file might then go through the mastering process. Mastering puts the final touches on the audio and prepares it for distribution. For example, if the audio file is one musical piece to be put on a CD with others, the pieces are sequenced, their volumes can be normalized with respect to each other so one doesn't sound much louder than another, their dynamic ranges can be altered for artistic reasons, and so forth.
A channel is different from a track. A channel corresponds to a stream of audio data,
both input and output. Recording on only one channel is called monophonic or simply mono.
Two channels are stereo. When you record in stereo, the recording picks up sound from slightly
different directions. When the sound is played back, the intention is to send different channels
out through different speakers so that the listener can hear a separation of the sound. This gives
the sound more dimension, as if it comes from different places in the room. Stereo recording
used to be the norm in music, but with the proliferation of DVD audio formats and the advent of
home theatres that imitate the sound environments of movie theatres, the popularity of
multichannel audio is growing. A 5.1 multichannel setup has five main channels (front center, front right, front left, rear right, and rear left) plus a low-frequency extension (LFE) channel. The LFE channel, reproduced by a subwoofer, carries frequencies from about 10 to 120 Hz. A 6.1 multichannel setup has a
back center channel as well.
able to undo a lot of operations, but you'll need sufficient disk space for all the temp files.
Real-time processing is always non-destructive. It's amazing to discover just how much
real-time processing can go on while an audio file is playing or being mixed in live performance.
It's possible to equalize frequencies, compress dynamic range, add reverb, and so on, even combining these processing algorithms. The processing happens so quickly that the delay isn't noticed.
However, all systems have their limit for real-time processing, and you should make yourself
aware of these limits. In mixing of live performances, there are also times when you have to pay
attention to the delay inevitably introduced in any kind of digital audio processing. ADC and
DAC themselves introduce some delay right off the top, to which you have to add the delay of
special effects.
Raw files have nothing but sample values in them. There's no header to indicate the sampling rate, sample size, or type of encoding. If you try to open a .raw file in an audio processing program, you'll be prompted for the necessary information.
Sometimes the file extension of an audio file uniquely identifies the file type, but this isn't always the case. Often it is necessary for the audio file player or decoder to read the header of the file to identify the file type. For example, the .wav extension is used for files in a variety of formats: uncompressed, A-law encoded, µ-law encoded, ADPCM encoded, or compressed with any codec. The header of the file specifies exactly which format is being used. .wav files are the standard audio format of Microsoft and IBM.
To facilitate file exchange, a standard format called IFF, the Interchange File Format, has been devised. IFF files are divided into chunks, each with its own header and data. .wav and .aif (Apple) are versions of the IFF format. They differ in that Microsoft/IBM use little-endian byte order while Macintosh uses big-endian. In little-endian byte ordering, the least significant byte comes first, and in big-endian the most significant byte comes first.
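As a rough illustration of the chunk idea, the sketch below reads the first fields of a .wav header using Python's struct module with little-endian formats. It assumes the canonical layout in which the "fmt " chunk immediately follows the RIFF/WAVE identifier; real files may contain additional chunks before it.

```python
import struct

def read_wav_header(path):
    """Read the main fields of a canonical RIFF/WAVE header (little-endian)."""
    with open(path, "rb") as f:
        riff, size, wave = struct.unpack("<4sI4s", f.read(12))
        assert riff == b"RIFF" and wave == b"WAVE"
        chunk_id, chunk_size = struct.unpack("<4sI", f.read(8))
        assert chunk_id == b"fmt "   # assumes the fmt chunk comes first (canonical layout)
        # audio format, channels, sample rate, byte rate, block align, bits per sample
        fmt = struct.unpack("<HHIIHH", f.read(16))
        return {
            "audio_format": fmt[0],   # 1 = uncompressed PCM; other codes indicate codecs
            "channels": fmt[1],
            "sample_rate": fmt[2],
            "bits_per_sample": fmt[5],
        }
```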
It is sometimes difficult to separate audio file formats from codecs, but they are not the
same. Codecs are compression/decompression algorithms that can be applied within a variety of
file formats. For example, the fact that a file has the .wav file extension says nothing about
whether or not the file is compressed. The file could be compressed with any one of a number of
codecs, as specified in the file's header. However, if a file has the file extension .mp3 or .aac,
then you know that it has been compressed with the MP3 or AAC codec, respectively. Some
codecs are proprietary (e.g. WMA) and some are open source (e.g., FLAC and Ogg Vorbis).
Some provide lossy compression, some lossless.
Representative audio file formats and codecs are listed in Table 5.1. Your choice of
audio file format depends on a number of factors. On what platforms do you expect to deliver
the audio? What sampling rate and bit depth are appropriate to the type of audio you're recording? Not all file types offer all the choices you might want in sampling rate and bit depth. Remember to keep the file in an uncompressed form while you're working on it in order to maintain the greatest fidelity possible. But if file size is ultimately an issue, you need to consider what codec to use for the audio's final deliverable form. Will your user have the needed codec
for decompression? Is lossy compression sufficient for the desired quality, or do you want
lossless compression? All these decisions should become clearer as you work more with audio
files and read more about audio file compression later in this chapter.
Table 5.1 Representative audio file formats and codecs
File extensions: .aac; .aif; .au; .flac; .mp3; .ogg; .raw; .rm (RealMedia); .tta (True Audio); .wav (DVI/IMA ADPCM, the International Multimedia Association version of the ADPCM .wav format); .wav (Microsoft ADPCM, .wav with ADPCM compression); .wav (Windows PCM); .wma or .asf (Microsoft)
In contrast, "elevator music" or "muzak" is intentionally produced with a small dynamic range.
Its purpose is to lie in the background, pleasantly but almost imperceptibly.
Musicians and music editors have words to describe the character of different pieces that
arise from their variance in dynamic range. A piece can sound "punchy," "wimpy," "smooth,"
"bouncy," "hot," or "crunchy," for example. Audio engineers train their ears to hear subtle
nuances in sound and to use their dynamics processing tools to create the effects they want.
Deciding when and how much to compress or expand dynamic range is more art than
science. Compressing the dynamic range is desirable for some types of sound and listening
environments and not for others. It's generally a good thing to compress the dynamic range of
music intended for radio. You can understand why if you think about how radio sounds in a car,
which is where radio music is often heard. With the background noise of your tires humming on
the highway, you don't want music that has big differences between the loudest and softest parts.
Otherwise, the soft parts will be drowned out by the background noise. For this reason, radio
music is dynamically compressed, and then the amplitude is raised overall. The result is that the
sound has a higher average RMS, and overall it is perceived to be louder.
There's a price to be paid for dynamic compression. Some sounds like percussion
instruments or the beginning notes of vocal music have a fast attack time. The attack time of a sound is the time it takes for the sound to rise to its peak amplitude. With a fast attack time, the sound
reaches high amplitude in a sudden burst, and then it may drop off quickly. Fast-attack
percussion sounds like drums or cymbals are called transients. Increasing the perceived
loudness of a piece by compressing the dynamic range and then increasing the overall amplitude
can leave little headroom for transients to stand out with higher amplitude. The entire
piece of music may sound louder, but it can lose much of its texture and musicality. Transients
give brightness or punchiness to sound, and suppressing them too much can make music sound
dull and flat. Allowing the transients to be sufficiently loud without compromising the overall
perceived loudness and dynamic range of a piece is one of the challenges of dynamics
processing.
While dynamic compression is more common than expansion, expansion has its uses
also. Expansion allows more of the potential dynamic range (the range made possible by the bit depth of the audio file) to be used. This can brighten a music selection.
With downward expansion, it's possible to lower the amplitude of signals such that they
can no longer be heard. The point below which a digital audio signal is no longer audible is
called the noise floor. Say that your audio processing software represents amplitude in dBFS (decibels relative to full scale), where the maximum amplitude of a sample is 0 and the minimum possible amplitude, a function of bit depth, lies somewhere between 0 and $-\infty$. For 16-bit audio, the minimum possible amplitude is approximately -96 dBFS. Ideally, this is the noise floor, but in
most recording situations there is a certain amount of low amplitude background noise that
masks low amplitude sounds. The maximum amplitude of the background noise is the actual
noise floor. If you apply downward expansion to an audio selection and you lower some of your
audio below the noise floor, you've effectively lost it. (On the other hand, you could get rid of
the background noise itself by downward expansion, moving the background below the -96 dB
noise floor).
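As a quick check of the -96 dBFS figure, the ratio of the smallest representable step to full scale can be computed directly from the bit depth (a minimal sketch):

```python
import math

def dbfs_floor(bit_depth):
    """Approximate quantization floor in dBFS: smallest step relative to full scale."""
    return 20 * math.log10(1 / 2 ** bit_depth)

print(round(dbfs_floor(16), 1))   # -96.3
print(round(dbfs_floor(24), 1))   # -144.5
```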
To understand how dynamic processing works, let's look more closely at the tools and the
mathematics underlying dynamics processing, including hard limiting, normalization,
compression, and expansion.
We've talked mostly about dynamic range compression in the examples above, but there
are four ways to change dynamic range: downward compression, upward compression,
downward expansion, and upward expansion, as illustrated in Figure 5.3. The two most
commonly-applied processes are downward compression and downward expansion. You have to
look at your hardware or software tool to see what types of dynamics processing it can do. Some
tools allow you to use these four types of compression and expansion in various combinations
with each other.
Downward compression lowers the amplitude of signals that are above a designated
level, without changing the amplitude of signals below the designated level. It reduces
the dynamic range.
Upward compression raises the amplitude of signals that are below a designated level
without altering the amplitude of signals above the designated level. It reduces the
dynamic range.
Upward expansion raises the amplitude of signals that are above a designated level,
without changing the amplitude of signals below that level. It increases the dynamic
range.
Downward expansion lowers the amplitude of signals that are below a designated level
without changing the amplitude of signals above this level. It increases the dynamic
range.
Figure 5.3 Four ways of changing dynamic range: downward compression and upward compression (compressed dynamic range), and upward expansion and downward expansion (expanded dynamic range), each shown against the original audio selection
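As a sketch of how downward compression maps input levels to output levels, the function below applies a fixed threshold and ratio to a level expressed in dBFS; the -20 dB threshold and 4:1 ratio are illustrative values, not taken from the text.

```python
def downward_compress_db(level_db, threshold_db=-20.0, ratio=4.0):
    """Map an input level (dBFS) to an output level: levels above the threshold
    are reduced by the given ratio; levels below it pass through unchanged."""
    if level_db <= threshold_db:
        return level_db
    return threshold_db + (level_db - threshold_db) / ratio

for x in (-40.0, -20.0, -8.0, 0.0):
    print(x, "->", downward_compress_db(x))
# -40 -> -40, -20 -> -20, -8 -> -17, 0 -> -15
```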
Audio limiting, as the name implies, limits the amplitude of an audio signal to a
designated level. Imagine how this might be done in real-time during recording. If hard
limiting is applied, the recording system does not allow sound to be recorded above a given
amplitude. Samples above the limit are clipped. Clipping cuts amplitudes of samples to a given
maximum and/or minimum level. If soft limiting is applied, then audio signals above the
designated amplitude are recorded at lower amplitude. Both hard and soft limiting cause some
distortion of the waveform.
Normalization is a process which raises the amplitude of audio signal values and thus the
perceived loudness of an audio selection. Because normalization operates on an entire audio
signal, it has to be applied after the audio has been recorded.
The normalization algorithm proceeds as follows:
find the highest amplitude sample in the audio selection
determine the gain needed in the amplitude to raise the highest amplitude to maximum
amplitude (0 dBFS by default, or some limit set by the user)
raise all samples in the selection by this amount
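A minimal sketch of this algorithm, assuming samples stored as floating-point values on a -1 to 1 scale (an assumption about the representation, not something prescribed by the text):

```python
import numpy as np

def normalize_peak(samples, target_dbfs=0.0):
    """Scale all samples so the highest-amplitude sample reaches the target level."""
    peak = np.max(np.abs(samples))
    if peak == 0:
        return samples
    target_amplitude = 10 ** (target_dbfs / 20)   # 0 dBFS corresponds to 1.0
    return samples * (target_amplitude / peak)
```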
A variation of this algorithm is to normalize the RMS amplitude to a decibel level specified by
the user. RMS can give a better measure of the perceived loudness of the audio. In digital audio
processing software, predefined settings are sometimes offered with descriptions that are
intuitively understandable for example, "Normalize RMS to -10dB (speech)."
Often, normalization is used to increase the perceived loudness of a piece after the dynamic range of the piece has been compressed, as was described above in the processing of radio music. Normalization can also be applied to a group of audio selections. For example, the
different tracks on a CD can be normalized so that they are at basically the same amplitude level.
This is part of the mastering process.
Compression and expansion can be represented mathematically by means of a transfer
function and graphically by means of the corresponding transfer curve. Digital audio processing
programs sometimes give you this graphical view with which you can specify the type of
compression or expansion you wish to apply. Alternatively, you may be able to type in values
that indicate the compression or expansion ratio. The transfer function maps an input amplitude
level to the amplitude level that results from compression or expansion. If you apply no
compression or expansion to an audio file, the transfer function graphs as a straight line at a 45° angle, as shown in Figure 5.4. If you choose to raise the amplitude of the entire audio piece by a
constant amount, this can also be represented by a straight line of slope 1, but the line crosses the
vertical axis at the decibel amount by which all samples are raised. For example, the two
transfer functions in Figure 5.4 show a 5 dB increase and a 5 dB decrease in the amplitude of the
entire audio piece.
Figure 5.4 Linear transfer functions for 5 dB gain and 5 dB loss (no compression or expansion)
Supplements on dynamics processing: interactive tutorial; worksheet
Often, a gain makeup is applied after downward compression. You can see in Figure 5.6
that there is a place to set Output Gain. The Output Gain is set to 0 in the figure. If you set the
output gain to a value g dB greater than 0, this means that after the audio selection is
compressed, the amplitudes of all samples are increased by g dB. Gain makeup can also be done
by means of normalization, as described above. The result is to increase the perceived loudness
of the entire piece. However, if the dynamic range has been decreased, the perceived difference
between the loud and soft parts is reduced.
a. Uncompressed audio    b. Compressed audio
It is also possible to compress the dynamic range at both ends, by making high amplitudes lower and low amplitudes higher. Below is an example of compressing the dynamic range by "squashing" it at both the low and high amplitudes. The compression is performed on an audio file that has three pure tones at 440 Hz: the amplitude of the first is -5 dB, the second is -3 dB, and the third is -12 dB. Values above -4 dB are made smaller; this is downward compression. Values below -10 dB are made larger; this is upward compression. The settings are given in Figure 5.12. The audio file before and after compression is shown in Figure 5.11a and Figure 5.11b. (The three sine waves appear as solid blocks because the view is zoomed too far out to show detail. You can see only the amplitudes of the three waves.)
a. Before compression
b. After compression
Figure 5.11 Audio file (three consecutive sine waves of different amplitudes)
before and after dynamic compression
Figure 5.12 Compression of dynamic range at both high and low amplitudes
There's one more part of the compression and expansion process to be aware of. The
attack of a dynamics processor is defined as the time between the appearance of the first sample
value beyond the threshold and the full change in amplitude of samples beyond the threshold.
Thus, attack relates to how quickly the dynamics processor initiates the compression or
expansion. When you downward compress the dynamic range, a slower attack time can sometimes sound more gradual and natural, preserving transients by not immediately lowering
amplitude when the amplitude suddenly, but perhaps only momentarily, goes above the
threshold. The release is the time between the appearance of the first sample that is not beyond
the threshold (before processing) and the cessation of compression or expansion.
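One common way to realize attack and release, shown in the hedged sketch below, is to smooth the processor's target gain with one-pole filters whose time constants come from the attack and release settings; the coefficient formula and parameter values are illustrative conventions, not taken from the text.

```python
import math
import numpy as np

def smooth_gain(target_gain, sample_rate=44100, attack_ms=10.0, release_ms=200.0):
    """Smooth a per-sample target gain curve: move quickly toward lower gains
    (attack) and more slowly back toward higher gains (release)."""
    a_coef = math.exp(-1.0 / (sample_rate * attack_ms / 1000.0))
    r_coef = math.exp(-1.0 / (sample_rate * release_ms / 1000.0))
    out = np.empty_like(target_gain)
    g = 1.0
    for i, t in enumerate(target_gain):
        coef = a_coef if t < g else r_coef   # attack when gain must drop
        g = coef * g + (1.0 - coef) * t
        out[i] = g
    return out
```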
a "p" is spoken too close to the mic, for example. Three basic types of audio restoration are used
to alleviate these problems: noise gating, noise reduction, and click and pop removal.
The operation of a noise gate is very simple. (See Figure 5.13.) A noise gate serves as a
block to signals below a given amplitude threshold. When samples fall below the threshold, the
gate closes and the samples are not passed through. When the samples rise above the threshold,
the gate opens. Some noise gates allow you to indicate the reduction level, which tells the
amplitude to which you want the below-threshold samples to be reduced. Often, reduction is set
at the maximum value, completely eliminating the signals below the threshold. The attack time
indicates how quickly you want the gate to open when the signal goes above the threshold. If
you want to preserve transients like sudden drum beats, then you would want the attack to be
short so that the gate opens quickly for these. A lookahead feature allows the noise gate to look
ahead to anticipate a sudden rise in amplitude and open the gate shortly before the rise occurs.
Some instruments like strings fade in slowly, and a short attack time doesn't work well in this
case. If the attack time is too short, then at the moment the strings go above the threshold, the
signal amplitude will rise suddenly. The release time indicates how quickly you want the gate to
close when the signal goes below the threshold. If a musical piece fades gradually, then you
want a long release time to model the release of the instruments. Otherwise, the amplitude of the
signal will drop suddenly, ruining the decrescendo. Some noise gates also have a hold control, indicating the minimum amount of time that the gate must stay open. A hysteresis control may be available to handle cases where the audio signal hovers around the threshold. If the signal keeps moving back and forth around the threshold, the gate will open and close continuously, creating a kind of chatter, as it is called by audio engineers. The hysteresis control indicates the difference between the value that caused the gate to open (call it n) and the value that will cause it to close again (call it m). If n - m is large enough to contain the fluctuating signal, the noise
gate won't cause chatter.
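The sketch below implements a bare-bones noise gate with a threshold, a hysteresis range, and a reduction factor, operating sample by sample on floating-point amplitudes; attack, release, hold, and lookahead are omitted, and all parameter values are illustrative.

```python
import numpy as np

def noise_gate(samples, open_db=-50.0, hysteresis_db=6.0, reduction=0.0):
    """Close the gate when the level falls below (open_db - hysteresis_db);
    reopen when it rises above open_db. Closed samples are scaled by `reduction`."""
    open_amp = 10 ** (open_db / 20)
    close_amp = 10 ** ((open_db - hysteresis_db) / 20)
    out = np.empty_like(samples)
    gate_open = True
    for i, x in enumerate(samples):
        level = abs(x)
        if gate_open and level < close_amp:
            gate_open = False
        elif not gate_open and level > open_amp:
            gate_open = True
        out[i] = x if gate_open else x * reduction
    return out
```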
Noise reduction tools can eliminate noise in a digital audio file after the file has been
recorded. The first step is to get a profile of the background noise. This can be done by
selecting an area that should be silent, but which contains a hum or buzz. The noise reduction
tool does a spectral analysis of the selected area in order to determine the frequencies in the noise
and their corresponding amplitude levels. Then the entire signal is processed in sections. The
frequencies in each section are analyzed and compared to the profile, and if these sections
contain frequency components similar to the noise, these can be eliminated below certain
amplitudes. Noise reduction is always done at the risk of changing the character of the sound in
unwanted ways. Music is particularly sensitive to changes in its frequency components. A good
noise reduction tool can analyze the harmonic complexity of a sound to distinguish between
music and noise. A sound segment with more complex harmonic structure is probably music and
therefore should not be altered. In any case, it is often necessary to tweak the parameters
experimentally to get the best results.
Some noise reduction interfaces, such as the one in Figure 5.14, give a graphical view of
the frequency analysis. The graph in the top window is the noise profile. The horizontal axis
moves from lower to higher frequencies, while the vertical axis is amplitude. The original signal
is in one color (the data points at the top of the graph), the amount of noise reduction is in a
second color (the data points at the next level down in amplitude), and the noise is in a third
color (the data points at the bottom of the graph). The main setting is the amount of noise
reduction. Some tools allow you to say how much to "reduce by" (Figure 5.14) and some say
exactly the level to "reduce to" (Figure 5.15).
You can see from the interface in Figure 5.14 that the noise profile is done by Fourier
analysis. In this interface, the size of the FFT of the noise profiler can be set explicitly. Think about what the profiler is doing: determining the frequencies at which the noise appears and the corresponding amplitudes of these frequencies. Once the noise profile has been made, the
frequency spectrum of the entire audio file can be compared to the noise profile, and sections
that match the profile can be eliminated. With a larger FFT size, the noise profile's frequency spectrum is divided into a greater number of frequency components; that is, there is greater frequency resolution. This is a good thing, up to a point, because noise is treated differently for each frequency component. For example, your audio sample may have a -70 dB background hum at 100 Hz. If the frequency resolution isn't fine enough, the noise reducer may have to treat frequencies between 80 and 120 Hz all the same way, perhaps doing more harm than good. On
the other hand, it's also possible to set the FFT size too high because there is a tradeoff between
frequency resolution and time resolution. The higher the frequency resolution is, the lower the
time resolution. If the FFT size is set too high, time slurring can occur, manifested in the form of
reverberant or echo-like effects.
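A heavily simplified sketch of profile-based noise reduction follows: it builds a noise profile from a selection assumed to contain only noise, then zeroes FFT bins in each block of the signal that fall near the profile. Real tools use overlapping windowed blocks and smoother attenuation; the block size and margin here are illustrative, and the samples are assumed to be floating-point.

```python
import numpy as np

def reduce_noise(signal, noise_sample, fft_size=2048, margin=2.0):
    """Zero out FFT bins in each block whose magnitude is below `margin` times
    the average magnitude of that bin in the noise profile."""
    blocks = len(noise_sample) // fft_size
    profile = np.mean([np.abs(np.fft.rfft(noise_sample[i * fft_size:(i + 1) * fft_size]))
                       for i in range(blocks)], axis=0)
    out = np.copy(signal)
    for start in range(0, len(signal) - fft_size + 1, fft_size):
        spectrum = np.fft.rfft(signal[start:start + fft_size])
        keep = np.abs(spectrum) > margin * profile
        out[start:start + fft_size] = np.fft.irfft(spectrum * keep, fft_size)
    return out
```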
Noise reducers often allow you to set parameters for "smoothing." Time smoothing
adjusts the attack and release times for the noise reduction. Frequency smoothing adjusts the
extent to which noise that is identified in one frequency band affects amplitude changes in
neighboring frequency bands. Transition smoothing sets a range between amplitudes that are
considered noise and those that are not considered noise.
Noise reducers can also be set to look for certain types of noise. White noise is noise that
occurs with equal amplitude (or relatively equally) at all frequencies (Figure 5.16). Another way
to say this is that white noise has equal energy in frequency bands of the same size. (Intuitively,
you can think of the energy as the area under the curve, the "colored" area, for a frequency
band.) Pink noise has equal energy in octave bands. Recall from Chapter 4 that as you move
from one octave to the same note in the next higher octave, you double the frequency. Middle C
on the piano has a frequency of about 261.6 Hz, and the next higher C has a frequency of
2*261.6, which is about 523 Hz. In pink noise, there is an equal amount of noise in the band
from 100 to 200 Hz as in the band from 1000 to 2000 Hz because each band is one octave wide.
Human perception of frequency is roughly logarithmic, organized around octaves rather than fixed steps in Hz. Thus, pink noise is perceived by humans to have roughly equal loudness across the frequency spectrum. If you know the character of the noise in your audio file, you may be
able to set the noise reducer to eliminate this particular type of noise.
Figure 5.16 White noise with energy equally distributed across frequency spectrum (from Adobe Audition)
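The difference between white and pink noise can be checked numerically. In the sketch below, pink noise is approximated by shaping a white spectrum so that power falls off as 1/f; the band edges, signal length, and sampling rate are arbitrary choices for the illustration.

```python
import numpy as np

def band_energy(x, sample_rate, lo, hi):
    """Total spectral energy of x between lo and hi Hz."""
    spectrum = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    return spectrum[(freqs >= lo) & (freqs < hi)].sum()

rng = np.random.default_rng(0)
sample_rate, n = 44100, 2 ** 18
white = rng.standard_normal(n)

# Pink noise: scale the white spectrum so power falls off as 1/f
spectrum = np.fft.rfft(white)
freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
scale = np.ones_like(freqs)
scale[1:] = 1.0 / np.sqrt(freqs[1:])
pink = np.fft.irfft(spectrum * scale, n)

for name, x in (("white", white), ("pink", pink)):
    e1 = band_energy(x, sample_rate, 100, 200)
    e2 = band_energy(x, sample_rate, 1000, 2000)
    print(name, round(e2 / e1, 2))   # roughly 10 for white, roughly 1 for pink
```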
A click or pop eliminator can look at a selected portion of an audio file, detect a sudden
amplitude change, and eliminate this change by interpolating the sound wave between the start
and end point of the click or pop. Figure 5.17 shows how a waveform can be altered by click
removal.
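A simplified sketch of the idea: detect a sample-to-sample jump larger than a threshold and replace a short window with a straight line between its endpoints. The threshold and window length are illustrative, and real click eliminators use more sophisticated detection and reconstruction than this.

```python
import numpy as np

def remove_clicks(samples, jump_threshold=0.3, repair_width=32):
    """Replace short bursts that begin with an abrupt amplitude jump by a straight
    line drawn between the samples just before and just after the burst."""
    out = np.copy(samples)
    diffs = np.abs(np.diff(samples))   # jumps measured on the original signal
    i = 1
    while i < len(out) - repair_width:
        if diffs[i - 1] > jump_threshold:
            start, end = i - 1, i + repair_width
            out[start:end + 1] = np.linspace(out[start], out[end], end - start + 1)
            i = end
        else:
            i += 1
    return out
```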
Digital Audio Filters
Digital audio filters are applied for a variety of reasons. They can be used to separate and
analyze frequency components in order to identify the source of sounds. For example, undersea
sounds could be detected and analyzed to determine the marine life in the area. Filters can also
be used for restoration of a poorly-recorded or poorly-preserved audio recording, like an old,
scratched vinyl record disk that is being digitized. Digital filters are the basis for common digital
audio processing tools for equalization, reverb, and special effects.
Digital audio filters can be divided into two categories based on the way they're implemented: FIR (finite impulse response) and IIR (infinite impulse response) filters. Mathematically, FIR and IIR filters can be represented as a convolution operation. Let's look at the FIR filter first. The FIR filter is defined as follows:
Let $x(n)$ be a digital audio signal of $L$ samples for $0 \le n \le L-1$. Let $y(n)$ be the audio signal after it has undergone an FIR filtering operation. Let $h(n)$ be a convolution mask operating as an FIR filter, where $N$ is the length of the mask. Then an FIR filter function is defined by

$$y(n) = h(n) \otimes x(n) = \sum_{k=0}^{N-1} h(k)\,x(n-k)$$

where $x(n-k) = 0$ if $n-k < 0$.    (Equation 5.1)
Aside: $h(n)$ goes by different names, depending on your source. It can be called the convolution mask, the impulse response, the filter, or the convolution kernel.

Note that in this section we use notation that is standard in the literature. Instead of using subscripts for discrete functions to emphasize that they are arrays, as in $x_n$, we use $x(n)$, $y(n)$, and $h(n)$. These functions are in the time domain. (To emphasize this, we could write $x(Tn)$, where $T$ is the interval between time samples, but $x(n)$ captures the same information with $T$ implicit.) $\otimes$ is the convolution operator.
Consider what values convolution produces and, procedurally, how it operates.
$y(0) = h(0)\,x(0)$
$y(1) = h(0)\,x(1) + h(1)\,x(0)$
$y(2) = h(0)\,x(2) + h(1)\,x(1) + h(2)\,x(0)$
In general, for $n \ge N$, $y(n) = h(0)\,x(n) + h(1)\,x(n-1) + \dots + h(N-1)\,x(n-N+1)$
We already covered convolution in Chapter 3 as applied to two-dimensional digital
images. Convolution works in basically the same way here, as pictured in Figure 5.19. h(n) can
be thought of as a convolution mask that is moved across the sample values. The values in the
mask serve as multipliers for the corresponding sample values. To compute succeeding values
of y (n) , the samples are shifted left and the computations are done again. With digital images,
we applied a two-dimensional mask. For digital sound, convolution is applied in only one
dimension, the time domain. You may notice that the mask is "flipped" relative to the sample values. That is, $x(0)$ is multiplied by $h(N-1)$, and $x(N-1)$ is multiplied by $h(0)$.
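Equation 5.1 can be implemented directly, as in the sketch below; the three-coefficient smoothing mask is just an example, and the result is checked against the library convolution.

```python
import numpy as np

def fir_filter(x, h):
    """y(n) = sum_{k=0}^{N-1} h(k) * x(n-k), with x(n-k) = 0 for n-k < 0."""
    N = len(h)
    y = np.zeros(len(x))
    for n in range(len(x)):
        for k in range(N):
            if n - k >= 0:
                y[n] += h[k] * x[n - k]
    return y

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
h = np.array([0.25, 0.5, 0.25])          # a small smoothing mask, chosen for illustration
print(fir_filter(x, h))
print(np.convolve(x, h)[:len(x)])        # same result, computed by the library
```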
Equation 5.1 describes an FIR filter. The essence of the filtering operation is h(n) ,
which is just a vector of multipliers to be applied successively to sample values. The multipliers
are usually referred to as coefficients, and the number of coefficients is the order of a filter.
Engineers also call the coefficients taps or tap weights. The central question is this: How do
you determine what the coefficients h(n) should be, so that h(n) changes a digital signal in the
way that you want? This is the area of digital filter design, an elegant mathematical approach
that allows you to create filters to alter the frequency spectrum of a digital signal with a great
amount of control. We'll return to this later in the chapter.
Now let's consider IIR filters. To describe an IIR filter, we need a mask of infinite length, given by this equation:
Let $x(n)$ be a digital audio signal of $L$ samples for $0 \le n \le L-1$. Let $y(n)$ be the audio signal after it has undergone an IIR filtering operation. Let $h(n)$ be a convolution mask operating as an IIR filter, where $N$ is the length of the forward filter and $M$ is the length of the feedback filter. Then

$$y(n) = h(n) \otimes x(n) = \sum_{k=0}^{\infty} h(k)\,x(n-k)$$

and the recursive form of the IIR filter function is defined by

$$y(n) = \sum_{k=0}^{N-1} a_k\,x(n-k) - \sum_{k=1}^{M-1} b_k\,y(n-k)$$    (Equation 5.2)
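A minimal sketch of the recursive form of Equation 5.2, with feedforward coefficients $a_k$ and feedback coefficients $b_k$; the coefficients chosen here form a simple one-pole smoothing filter whose impulse response never quite dies out, which is where the name "infinite impulse response" comes from.

```python
import numpy as np

def iir_filter(x, a, b):
    """y(n) = sum_k a[k]*x(n-k) - sum_k b[k]*y(n-k); b[0] is unused since k starts at 1."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        for k in range(len(a)):
            if n - k >= 0:
                y[n] += a[k] * x[n - k]
        for k in range(1, len(b)):
            if n - k >= 0:
                y[n] -= b[k] * y[n - k]
    return y

x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])       # an impulse
print(iir_filter(x, a=[0.5], b=[1.0, -0.5]))  # 0.5, 0.25, 0.125, ... (keeps decaying)
```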
coefficients based on specifications for the desired filter. We'll see how to do this later in the chapter.
The main advantage of IIR filters is that they allow you to create a sharp cutoff between
frequencies that are filtered out and those that are not. More precisely, FIR filters require larger
masks and thus more arithmetic operations to achieve an equivalently sharp cutoff as compared
to what could be achieved with an IIR filter. Thus, FIR filters generally require more memory
and processing time. A second advantage of IIR filters is that they can be designed from
equivalent analog filters. FIR filters do not have analog counterparts. On the other hand, FIR
filters have an important advantage in that they can be constrained to have a linear phase
response. In a filter with linear phase response, phase shifts for frequency components are
proportional to the frequency. Thus, harmonic frequencies are shifted by the same proportions
so that harmonic relationships are not distorted. Clearly, linear phase response is important for
music. Finally, FIR filters are not as sensitive to noise resulting from low bit depth and roundoff error.
low-pass filter
high-pass filter
bandpass filter
bandstop filter
convolution filters: FIR filters that can be used to add an acoustical environment to a sound file, for example, mimicking the reverberations of a concert hall
graphic equalizer: gives a graphical view that allows you to adjust the gain of frequencies within an array of bands
parametric equalizer: similar to a graphic equalizer but with control over the width of frequency bands relative to their location
crossover: splits an input signal into several output signals, each within a certain frequency range, so that each output signal can be directed to a speaker that handles that range of frequencies best.
We'll look at some of these tools more closely in the context of digital processing software, and
then return to the mathematics that makes them work.
5.4.4 EQ
Equalization (EQ) is the process of selectively boosting or cutting certain frequency
components in an audio signal. In digital audio processing, EQ can be used during recording to
balance instruments, voices, and sound effects; it can be used in post-processing to restore a
poorly recorded signal, apply special effects, or achieve a balance of frequencies that suits the
purpose of the audio; or it can be applied when audio is played, allowing frequencies to be
adjusted to the ear of the listener.
EQ is a powerful and useful audio processing tool, but it must be applied carefully.
When you change the relative amplitudes of frequency components of an audio piece, you run
the risk of making it sound artificial. The environment in which music is recorded puts an
acoustic signature on the music, adding resonant frequencies to the mix. The frequency spectra
of instruments and voices are also complex. Each instrument and voice has its own timbre,
characterized by its frequency range and overtones. Separating the frequency components of
different instruments and voices is next to impossible. You may think that by lowering the high-frequency components you're only affecting the flutes, but really you may also be affecting the
overtones of the oboes. The point is that generally, it's better to get the best quality, truest sound
when you record it rather than relying on EQ to "fix it in post-processing." Nevertheless, there
are times when a good equalizer is an invaluable tool, one that is frequently used by audio
engineers.
You've probably already used EQ yourself. The bass/treble controls on a car radio are a
simple kind of equalizer. These tone controls, as they are sometimes called, allow you to adjust
the amplitude in two broad bands of frequencies. The bass band is approximately 20 to 200 Hz
while the treble is approximately 4 to 20 kHz. Figure 5.22 shows four shelving filters.
Frequency is on the horizontal axis and gain is on the vertical axis. The gain level marked with a
1 indicates that the filter makes no change to the input signal. The other lines indicate five levels
to which you can boost or cut frequencies, like five discrete levels you might have for both bass
and treble on your car's tone controls. A negative gain cuts the amplitude of a frequency. A
shelving filter for cutting or boosting low frequencies is called a low-shelf filter. A high-shelf
filter boosts or cuts high frequencies. Shelving filters are similar to low- and high-pass filters
except that they boost or cut frequencies up to a certain cutoff frequency (or starting at a cutoff
frequency) after which (or before which) they allow the frequencies to pass through unchanged.
Low- and high-pass filters, in contrast, completely block frequencies lower or higher than a
given limit, as we'll see later in this chapter.
Consider the low-shelf filter for boosting low frequencies in Figure 5.22A. For any of
the five boost levels, as frequencies increase, the amount of gain decreases such that higher
frequencies, above the cutoff level, are not increased at all.
Figure 5.22 Shelving filters (gain vs. frequency, each with five boost or cut levels): A. low-shelf filter for boosting low frequencies; B. high-shelf filter for boosting high frequencies; C. and D. corresponding shelving filters for cutting low and high frequencies
You may also be familiar with more fine-tuned control of EQ in the form of a graphic
equalizer on your home stereo. A graphic equalizer divides the frequency spectrum into bands
and allows you to control the amplitude for these bands individually, usually with sliders.
Digital audio processing tools have their equivalent of graphic equalizers, and with an interface
that looks very much like the interface to the analog graphic equalizer on your home stereo
(Figure 5.23). With graphic equalizers, you can select the number of frequency bands.
Typically, there are ten frequency bands that are proportionately divided by octaves rather than
by frequency levels in Hz. Recall that if two notes (i.e., frequencies) are an octave apart, then
the higher note has twice the frequency of the lower one. Thus, the starting point of each band of
the graphic equalizer in Figure 5.23 is twice that of the previous band. If you ask for 20 bands, then the bands are separated by half an octave and thus are spaced by a factor of $2^{1/2} = \sqrt{2} \approx 1.41$. If the first band starts at 31 Hz, then the next starts at about $31 \times \sqrt{2} \approx 44$ Hz, the next at about $44 \times \sqrt{2} \approx 62$ Hz, and so forth. If you divide the spectrum into 30 bands, the bands are separated by 1/3 of an octave and spaced by a factor of $2^{1/3} \approx 1.26$.
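A small sketch of this spacing: starting from 31 Hz (as in the example above) and multiplying repeatedly by $2^{1/\text{bands per octave}}$; the 20 kHz upper limit is an assumption for the illustration.

```python
def eq_band_frequencies(first=31.0, bands_per_octave=1, top=20000.0):
    """Starting frequencies of graphic-EQ bands spaced by a factor of 2**(1/bands_per_octave)."""
    factor = 2 ** (1.0 / bands_per_octave)
    freqs = [first]
    while freqs[-1] * factor <= top:
        freqs.append(freqs[-1] * factor)
    return [round(f) for f in freqs]

print(eq_band_frequencies())                        # 31, 62, 124, 248, ... (10 octave bands)
print(eq_band_frequencies(bands_per_octave=2)[:4])  # 31, 44, 62, 88 (half-octave spacing)
```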
We will return to the concept of the frequency resolution vs. time resolution tradeoff when we discuss frequency analysis for noise reduction.
Parametric EQs are based on bandpass and bandstop filters. A bandpass filter allows frequencies
in a certain band to pass through and filters out frequencies below and above the band. Ideally,
the unwanted frequencies would be filtered out entirely, but in reality this isn't possible. Thus,
the frequency response graph for a bandpass filter looks more like the bell curve in Figure 5.25.
This type of filter is sometimes called a peaking filter. If you set the gain to a positive number,
the peak is pointed upward and you create a bandpass filter around your central frequency. If
you set the gain to a negative number, the peak is pointed downward and you create a bandstop
filter. A very narrow bandstop filter is also called a notch filter. The Q-factor (also called
quality factor or simply Q) determines how steep and wide the peaking curve is. The higher the
Q-factor, the higher the peak in relation to the width of the frequency band.
Figure 5.25 Frequency response of a peaking filter (gain vs. frequency)
Q-factor is a term adopted from physics and electronics. In physical and electrical
systems, Q-factor measures the rate at which a vibrating system dissipates its energy, a process
called damping. More precisely, it is the number of cycles required for the energy to fall off by a factor of about 535. (The value 535 is approximately $e^{2\pi}$, a number chosen because it simplifies related computations.) The Q-factor of an inductor is the ratio of its inductance to its resistance at a
given frequency, which is a measure of its efficiency. Q-factors are also used in acoustics to
describe how much a surface resonates. A surface with a high Q-factor resonates more than one
with a low Q-factor. Instruments and sound speakers have Q-factors. A bell is an example of a
high-Q system, since it resonates for a long time after it is struck. Although a high Q-factor seems to imply better quality, the right Q-factor depends on the situation. If your speakers' Q-factor is too high, they may prolong the normal decay of the sounds of instruments, or "ring" if a signal suddenly stops, making the music sound artificial or distorted.
A graph can be drawn to depict how a system resonates, with frequency on the horizontal
axis and energy on the vertical axis. The same type of graph is used to depict a peaking filter. It
is essentially a frequency response graph, depicting which frequencies are filtered out or left in, except that it plots energy (rather than amplitude) against frequency. This is shown
in Figure 5.26. The peaking filter graph is parameterized by its Q-factor, a high Q-factor
corresponding to a steep peak. Formally, Q-factor can be defined as follows:
Given the graph of a peaking filter, let $f_{width}$ be the width of the peak measured at the point that is $1/2$ the peak's height, and let $f_{center}$ be the frequency at the center of the peak, both in Hz. Then the Q-factor, $Q$, is defined as

$$Q = \frac{f_{center}}{f_{width}}$$    (Equation 5.3)
Note that in Figure 5.26, frequency is shown on a logarithmic scale. On this graph, the center frequency $f_{center}$ is exactly in the center of the peak, and the bandwidth $f_{width}$ is measured at the point that is $1/2$ the peak height, called the 3 dB point. On a frequency response graph of the filter that plots amplitude (rather than energy) against frequency, the bandwidth would be measured at the point that is $1/\sqrt{2}$ times the peak height.
Figure 5.26 Graph of a peaking filter: energy vs. frequency (on a logarithmic scale), showing the peak, the -3 dB point, and the bandwidth
For a peaking filter whose bandwidth is $n$ octaves, the Q-factor is given by

$$Q = \frac{2^{n/2}}{2^{n}-1}$$    (Equation 5.4)

Given the Q-factor, you can compute the bandwidth of the filter. Let $Q$ be the Q-factor of a peaking filter. Then the bandwidth of the filter in octaves, $n$, is given by

$$n = 2\log_{2}\!\left(\frac{1}{2Q} + \sqrt{\left(\frac{1}{2Q}\right)^{2}+1}\right)$$    (Equation 5.5)
When you use a parametric EQ, you may not need the equations above to compute a
precise bandwidth. The main point to realize is that the higher the Q, the steeper the peak of the
band. The parametric EQ pictured in Figure 5.24 allows Q to be set in the range of 0.10 to 10,
the former being a fairly flat peak and the latter a steep one. Some EQ interfaces actually show
you the graph and allow you to "pull" on the top of the curve to adjust Q, or you can type in
numbers or control them with a slider.
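For reference, Equations 5.4 and 5.5 can be computed directly; the sketch below converts a bandwidth in octaves to Q and back.

```python
import math

def q_from_octaves(n):
    """Equation 5.4: Q-factor of a peaking filter whose bandwidth is n octaves."""
    return 2 ** (n / 2) / (2 ** n - 1)

def octaves_from_q(q):
    """Equation 5.5: bandwidth in octaves of a peaking filter with Q-factor q."""
    return 2 * math.log2(1 / (2 * q) + math.sqrt((1 / (2 * q)) ** 2 + 1))

print(round(q_from_octaves(1.0), 3))                  # 1.414 for a one-octave band
print(round(octaves_from_q(q_from_octaves(1.0)), 3))  # 1.0 (round trip)
```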
Different types of filters (low-pass, high-pass, shelving, and bandpass) can be grouped into one EQ tool, sometimes called a paragraphic equalizer. Figure 5.27, Figure 5.28, Figure 5.29, and Figure 5.30 show four paragraphic EQs. Icons indicate the type of filter being applied: shelf filters, bandpass or bandstop filters, high-pass filters, and low-pass filters each have their own icon.
Figure 5.28 An EQ that combines high-pass, bandpass/stop, shelving, and low-pass filters
Figure 5.29 Another EQ that combines high-pass, bandpass/stop, shelving, and low-pass filters
When different filters are applied to different frequency bands, as is the case with a
paragraphic EQ, the filters are applied in parallel rather than serially. That is, the audio signal is
input separately into each of the filters, and the outputs of the filters are combined after
processing. A parallel rather than serial configuration is preferable because any phase distortions
introduced by a single filter will be compounded by a subsequent phase distortion when filters
are applied serially. If phase linearity is important, a linear phase EQ can be applied. Linear
phase filters will not distort the harmonic structure of music because if there is a phase shift at a
certain frequency, the harmonic frequencies will be shifted by the same amount. A linear phase
filter is pictured in Figure 5.31.
Some EQs allow you to set the master gain, which boosts or cuts the total audio signal
output after the EQ processing. You may also be able to control the amount of wet and dry
signal added to the final output. A wet signal is the audio signal that has undergone processing.
A dry signal is unchanged. A copy of the dry signal can be retained, and after the signal has
been processed and made "wet," varying amounts of wet and dry can be combined for the final
audio output. A paragraphic EQ with wet and dry controls is shown in Figure 5.30.
Flow diagrams of a non-recursive filter and a recursive filter (comb filters): each delays the input by m samples ($Z^{-m}$), applies a gain g, and sums the result with the direct signal; in the recursive form the delayed, scaled signal is taken from the output and fed back.
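As a sketch consistent with these flow diagrams, the standard non-recursive (feedforward) and recursive (feedback) comb filters can be written as follows; the delay m and gain g are parameters you choose.

```python
import numpy as np

def feedforward_comb(x, m, g):
    """Non-recursive comb: y(n) = x(n) + g * x(n - m)."""
    y = np.copy(x)
    y[m:] += g * x[:-m]
    return y

def feedback_comb(x, m, g):
    """Recursive comb: y(n) = x(n) + g * y(n - m); echoes trail off as long as |g| < 1."""
    y = np.copy(x)
    for n in range(m, len(x)):
        y[n] += g * y[n - m]
    return y
```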
The comb filters given above can be used to create simple delay effects. Changing the
coefficients, changing the delay m, and cascading filters can create different effects similar to
real-world echoes and reverberation. The phase characteristics of these filters, however, are not
very natural, and thus a single filter like those above isn't the best for creating realistic echo and
reverberation. What is unnatural is that the phases of all the frequencies aren't changed
sufficiently relative to each other. This isn't how real audio echoes and reverberations happen.
Imagine that you stand in the middle of a room and play a note on a musical instrument.
You hear the initial sound, and you also hear reflections of the sound as the sound waves move
through space, bounce off the walls and objects in the room, and repeatedly come back to your
ears until the reflections finally lose all their energy and die away. The way in which the sound
waves reflect depends on many factors the size of the room, the materials with which it's built,
the number of objects, the heat and humidity, and so forth. Furthermore, different frequencies
are reflected or absorbed differently, low frequencies being absorbed less easily than high.
(Have you ever sat beside another car at a traffic light and heard only the booming bass of the
other person's stereo?)
Another type of filter, the all-pass filter, is often used as a building block of reverberation
effects. Adding the all-pass filter to a chain or bank of comb filters helps to make the reflections
sound like they're arriving at different times. The all-pass filter doesn't change the frequencies of
the wave on which it acts. It only changes the phase, and it does so in a way that "smears" the
harmonic relationships more realistically. A simple all-pass filter is defined
by $y(n) = -g\,x(n) + x(n-m) + g\,y(n-m)$, where $m$ is the number of samples in the delay and $|g| < 1$. The flow diagram is in Figure 5.33.
Figure 5.33 Flow diagram of a simple all-pass filter (input, m-sample delay $Z^{-m}$, gain -g, output)
Different echo and reverberation effects can be created by combining the above filters in
various ways in series, in parallel, or nested. One of the first designs that combined filters in
this way was made by Manfred Schroeder. Filter designs for reverberation have gone beyond
this simple configuration, but the diagram in Figure 5.34 serves to illustrate how the simple
filters can be put together for more realistic sound than is possible with a single filter.
Figure 5.34 A Schroeder reverberator: four comb filters in parallel, their outputs summed and passed through two all-pass filters in series
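A hedged sketch of a Schroeder-style arrangement like the one in Figure 5.34: four feedback comb filters in parallel, summed, followed by two all-pass filters in series. The delay times and gains below are commonly cited illustrative values, not taken from the text, and the feedback comb is repeated here so the sketch is self-contained.

```python
import numpy as np

def feedback_comb(x, m, g):
    """y(n) = x(n) + g * y(n - m)"""
    y = np.copy(x)
    for n in range(m, len(x)):
        y[n] += g * y[n - m]
    return y

def allpass(x, m, g):
    """y(n) = -g*x(n) + x(n - m) + g*y(n - m)"""
    y = -g * x
    for n in range(m, len(x)):
        y[n] += x[n - m] + g * y[n - m]
    return y

def schroeder_reverb(x, sample_rate=44100):
    """Four parallel feedback combs summed, then two all-pass filters in series."""
    comb_delays_ms = (29.7, 37.1, 41.1, 43.7)   # mutually unrelated delays to avoid a single pitch
    combs = sum(feedback_comb(x, int(sample_rate * d / 1000), 0.77)
                for d in comb_delays_ms)
    out = allpass(combs, int(sample_rate * 5.0 / 1000), 0.7)
    out = allpass(out, int(sample_rate * 1.7 / 1000), 0.7)
    return out
```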
Digital audio processing programs offer tools for delay, multi-tap delay, chorus, flange,
and reverberation. More sophisticated algorithms can be found in audio effects plug-ins that you
can add to your basic software. Reverb (short for reverberation) is one of the most useful effects
for giving vibrance and color to sound. It can simulate not only how music and voice sound in
different listening spaces, but also how it is produced by particular instruments, microphones,
loudspeakers, and so forth. Reverb can also enhance the timbre of instruments and voices,
making a recording that is otherwise flat and dull sound interesting and immediate. Let's look at
how these effects are presented to you in audio processing software. The distinction between the
different delay-based effects is not a strict one, and the methods of implementation are mostly
transparent to the user, but the following discussion will give you an idea of what to expect.
Some audio processing programs make no distinction between delay and echo, while
others have separate tools for them. If the features are separated, then the delay effect is less
realistic, possibly implemented with non-recursive delay units that create simple repetition of
the sound rather than trailing echoes. From the interface in an audio processing program, you're
able to set the delay time and the amount of wet and dry signal to be added together. The multi-tap delay effect compounds the effects of multiple delays, a tap being a single delay. When an
echo effect is considered separate from delay, echo effect is more realistic in that the spectrum of
the signal is changed. That is, a special filter may be used so that the phases of echoed sounds
are changed to simulate the way different frequencies are reflected at different rates. The echo
effect uses feedback, and thus interfaces to echo tools generally allow you to control this
parameter.
The chorus effect takes a single audio signal presumably a voice and makes it sound
like multiple voices singing or speaking in unison. You can choose how many voices you want,
and the chorus effect delays the different copies of the signal one from the other. This isn't
enough, however, for a realistic effect. When people sing together, they sing neither in perfect
unison nor at precisely the same pitch. Thus, the frequencies of the different voices are also
modulated according to your settings. Too much modulation makes the chorus sound off-pitch
as a group, but the right amount makes it sound like a real chorus.
The flange effect uses a comb filter with a variable delay time that changes while the
music is played, varying between about 0 and 20 ms. You can picture the teeth of the comb filter
moving closer together and then farther apart as the delay time changes. Rather than creating an
echo, the flange effect continuously boosts some frequencies, cuts some frequencies, and filters
others out completely, but which frequencies are affected, and in which way, changes continuously over time, which is the distinguishing characteristic of the flange effect. The resulting sound is
described variously as "swooshy," "warbly," "wah wah" and so forth, depending on the audio
clip to which the effect is applied and the parameters that are set. Parameters that a flange tool
will probably allow you to set include the initial and final delay time and the amount of
feedback.
Reverb can alter an audio signal so that it sounds like it comes from a particular
acoustical space. Let's consider how reverb can be used to simulate the way sound travels in a
room, as depicted in Figure 5.36. Imagine that a sound is emitted in a room an instantaneous
impulse. The line marked direct sound indicates the moment when the sound of the impulse first
reaches the listener's ears, having traveled directly and with no reflections. In the meantime, the
sound wave also propagates toward the walls, ceilings, and objects in the room and is reflected
back. The moments when these first reflections reach the listener's ears are labeled first-order
reflections. When the sound reflects off surfaces, it doesn't reflect perfectly. It's not like a
billiard ball striking the side of a billiard table, where the angle of incidence equals the angle of
reflection. Some of the sound is diffused, meaning that a wave reflects back at multiple angles.
The first-order reflections then strike surfaces again, and these become the secondary reflections. Because of the diffusion of the sound, each higher order of reflections contains more reflected waves than the order before it. However, each time the waves reflect, they lose some of their energy to
heat, as shown by the decreasing intensity of higher-order reflections. Eventually, there can be
so many reflected waves that there is barely any time between their moments of arrival to the
listener's ear. This blur of reflections is perceived as reverb.
[Figure 5.36: Reverberation in a room. The direct sound reaches the listener first, followed by first-order reflections and then secondary reflections of decreasing amplitude.]
Two basic strategies are used to implement reverb effects. The first is to try to model the
physical space or the acoustical properties of an instrument or recording device. For rooms, this
involves analyzing the size of the room, the angles and spacing of walls and ceilings, the objects
in the room, the building and decorating materials, the usual temperature and humidity, the
number of people likely to be in the room, and so forth. One standard measure of an acoustical space is its reverberation time, the time it takes for a sound to decay by 60 dB from its original level. Based on the analysis of the space, the space's effect on sound is simulated by arranging
multiple recursive and all-pass filters in serial, in parallel, or in nested configurations. The
many possible arrangements of the filters along with the possible parameter settings of delay
time and coefficients allow for a wide array of designs.
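As an illustration of this filter-based strategy, the following Python sketch implements one classic arrangement, Schroeder's reverberator: four recursive comb filters in parallel followed by two all-pass filters in series. The delay times and gains are illustrative assumptions, not values taken from this chapter.

import numpy as np

def comb(x, delay, gain):
    # Recursive (feedback) comb filter: y(n) = x(n) + gain * y(n - delay)
    y = np.array(x, dtype=float)
    for n in range(delay, len(y)):
        y[n] += gain * y[n - delay]
    return y

def allpass(x, delay, gain):
    # All-pass filter: y(n) = -gain*x(n) + x(n - delay) + gain*y(n - delay)
    y = np.zeros(len(x))
    for n in range(len(x)):
        x_d = x[n - delay] if n >= delay else 0.0
        y_d = y[n - delay] if n >= delay else 0.0
        y[n] = -gain * x[n] + x_d + gain * y_d
    return y

def simple_reverb(x, sample_rate, wet=0.3):
    # Parallel combs with mutually prime delays, then all-pass filters in series.
    comb_delays_ms = [29.7, 37.1, 41.1, 43.7]
    combed = sum(comb(x, int(sample_rate * d / 1000), 0.77) for d in comb_delays_ms)
    out = allpass(combed, int(sample_rate * 5.0 / 1000), 0.7)
    out = allpass(out, int(sample_rate * 1.7 / 1000), 0.7)
    return (1 - wet) * x + wet * out

Changing the delay times and gains changes the simulated room; longer comb delays and higher feedback gains give a longer reverberation time.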
A second method, called the impulse response method, is something you could try
experimentally yourself. This method is applied to modeling the way things sound in a particular
listening space. The idea is to test the acoustics of a space by recording how a single quick
impulse of sound reverberates, and then using that recording to make your own sound file
reverberate in the same manner. Ideally, you would go into the listening room and generate a
pure, instantaneous impulse of sound that contains all audible frequencies, recording this sound.
In practice, you can't create an ideal impulse, but you can do something like clap your hands, pop
a balloon, or fire a starter pistol. This recording is your impulse response, a filter that can be convolved with the audio file to which you're applying the effect. (This technique is sometimes called convolution reverb.) There are some hitches in this if you try it experimentally. The impulse you generate may be noisy. One way to get around this is to use a longer sound to generate the impulse response, for example a sine wave that sweeps through all audible frequencies over several seconds. Then this recorded sound is deconvolved so that, hypothetically, only the reverberations from the room remain, and these can be used as the impulse response. (Deconvolution begins with a filter and the output that would result from applying this filter, and determines the input signal.)
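If you want to try convolution reverb yourself, the following Python sketch shows the core of the technique. The file names are hypothetical, and the code assumes mono WAV files with matching sample rates.

import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

rate, dry = wavfile.read("dry_voice.wav")              # the audio you want to place in the room
ir_rate, ir = wavfile.read("impulse_response.wav")     # e.g., a recorded balloon pop or starter pistol

# Convolve the dry signal with the impulse response.
wet = fftconvolve(dry.astype(float), ir.astype(float))

# Normalize to avoid clipping and write the result as 16-bit PCM.
wet = wet / np.max(np.abs(wet))
wavfile.write("convolved.wav", rate, (wet * 32767).astype(np.int16))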
Trying to generate your own impulse response file is interesting and illustrative, but for
more polished results you can turn to the hundreds of libraries of impulse files (sometimes called IRs) available commercially or free on the web. These IRs simulate the reverberance of concert halls, echo chambers, electric guitars, violins, specialized mics, and just about any other sound environment, instrument, voice, or audio equipment you can think of. In fact, if you have
reverberation software that takes IRs as input, you can use any file at all, as long as it's of the
expected format (e.g., WAV or AIFF). You can get creative and add the resonance of your own
voice, for example, to a piece of music.
take the inverse discrete Fourier transform of Y (z ) , which gives y (n) , your filtered
signal represented in the time domain
The difficult part of the process above is finding H (z ) , just as the difficult part of designing a
convolution filter is finding h(n) . Actually, the two processes amount to the same thing.
Performing Y ( z ) = H ( z ) X ( z ) and then taking the inverse discrete Fourier transform of Y (z ) to
get y(n) is equivalent to doing a convolution operation described as

y(n) = h(n) ∗ x(n) = Σ_{k=0}^{N−1} h(k) x(n−k)

[Figure: the convolution theorem. In the time domain, x(n) convolved with h(n) gives y(n). Taking the DFT gives X(z) and H(z); X(z) multiplied by H(z) gives Y(z); the inverse DFT of Y(z) gives y(n).]
The bottom line is that to design a particular filter yourself, you still have to determine
either h(n) or H (z ) .
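The equivalence is easy to verify numerically. In the Python sketch below, the signal and the convolution mask are arbitrary example values; the point is only that time-domain convolution and frequency-domain multiplication give the same result when both sequences are zero-padded to the length of the full convolution.

import numpy as np

x = np.array([5.0, -2.0, 3.0, 6.0, 6.0])     # an arbitrary input signal
h = np.array([0.5, 0.25, 0.25])              # an arbitrary convolution mask (impulse response)

# Convolution in the time domain
y_time = np.convolve(x, h)

# Multiplication in the frequency domain, then the inverse DFT
L = len(x) + len(h) - 1
Y = np.fft.fft(x, L) * np.fft.fft(h, L)
y_freq = np.real(np.fft.ifft(Y))

print(np.allclose(y_time, y_freq))           # True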
The convolution mask h(n) is also called the impulse response, representing a filter in
the time domain. Its counterpart in the frequency domain, the frequency response H ( z ) , is also
sometimes referred to as the transfer function. The relationship between the impulse response
and the frequency response of a digital audio filter is precisely the relationship between h(n) and
H ( z ) described in the previous section.
A frequency response graph can be used to show the desired frequency response of a
filter you are designing. For example, the frequency response of an ideal low-pass filter is
shown in Figure 5.38. In the graph, angular frequency is on the horizontal axis. The vertical
axis represents the fraction of each frequency component to be permitted in the filtered signal.
This figure indicates that frequency components between −ω_c and ω_c are to be left unchanged, while all other frequency components are to be removed entirely. ω_c is the angular cutoff frequency. It can be assumed that the angular frequency in this graph is normalized. This is done by mapping the Nyquist angular frequency, which is π times the sampling frequency, to π. That is, if f_samp is the sampling frequency in Hz and f_c is the cutoff frequency in Hz, then f_c, normalized, is f_c / f_samp, and the normalized angular cutoff frequency is ω_c = 2π f_c / f_samp. It makes sense to normalize in this way because the only frequencies that can be validly digitized at a sampling frequency of f_samp are frequencies between 0 and f_samp / 2. Note also that normalization implies that the cutoff frequency ω_c must be less than π.
This frequency response is ideal in that it displays a completely sharp cutoff between the
frequencies that are removed and those that are not changed. In reality, it isn't possible to design
such a filter, but the ideal frequency response graph serves as a place to begin. The first step in
FIR filter design is to determine what ideal impulse response corresponds to the ideal frequency
response.
[Figure 5.38: Frequency response of an ideal low-pass filter. The response is 1 in the passband between −ω_c and ω_c and 0 everywhere else.]
sinc(x) = sin(x) / x for x ≠ 0, and sinc(x) = 1 for x = 0
Let's see how this works mathematically. Recall the relationship between the frequency response and the impulse response. The former is the Fourier transform of the latter. Saying that the other way around, the inverse Fourier transform of the ideal frequency response H_ideal(ω) gives us the ideal impulse response h_ideal(n). (An integral rather than a summation is used because the function is not assumed to be periodic.)
h_ideal(n) = (1 / 2π) ∫_{−π}^{π} H_ideal(ω) e^{iωn} dω
Equation 5.6

step 1:   h_ideal(n) = (1 / 2π) ∫_{−π}^{π} H_ideal(ω) e^{iωn} dω
step 2:              = (1 / 2π) ∫_{−ω_c}^{ω_c} e^{iωn} dω = (1 / 2π) ∫_{−ω_c}^{ω_c} (cos(ωn) + i sin(ωn)) dω
step 3:              = sin(ω_c n) / (πn)
step 4:              = sin(2π f_c n) / (πn) for −∞ < n < ∞, n ≠ 0, and 2f_c for n = 0

In short, we get

h_ideal(n) = sin(2π f_c n) / (πn) for −∞ < n < ∞, n ≠ 0, and 2f_c for n = 0
Equation 5.7
Step 1 applies Equation 5.6 to the ideal frequency response. Step 2 restricts the limits of integration to the passband, where H_ideal(ω) = 1, and does a substitution using Euler's identity. (See Chapter 4.) Step 3 does the integration. Notice that the sine term falls out because it is an odd function, and the negatives cancel the positives. In step 4, we use the substitution ω_c = 2πf_c, from the relationship between angular frequency and frequency in Hz. The case where n = 0 results from taking the limit as n goes to 0, by l'Hôpital's rule. We have derived a form of the sinc function, as expected. The importance of this derivation is that Equation 5.7 gives you the ideal impulse response based on your desired cutoff frequency f_c. By a similar process, you can derive the equations for ideal high-pass, bandpass, and bandstop filters, which are given in Table 5.2.
Type of filter     h_ideal(n), n ≠ 0                                    h_ideal(0)
Low-pass           sin(2π f_c n) / (πn)                                 2 f_c
High-pass          −sin(2π f_c n) / (πn)                                1 − 2 f_c
Bandpass           sin(2π f_2 n) / (πn) − sin(2π f_1 n) / (πn)          2 (f_2 − f_1)
Bandstop           sin(2π f_1 n) / (πn) − sin(2π f_2 n) / (πn)          1 − 2 (f_2 − f_1)
Table 5.2 Equations for ideal impulse responses for standard filters,
based on cutoff frequency f_c and band edge frequencies f_1 and f_2
But we're still in the realm of the ideal. We don't yet have a workable FIR filter. The problem is that a sinc function goes on infinitely in the positive and negative directions, as shown in Figure 5.39. The amplitude continues to diminish in each direction but never goes to 0. We can't use this ideal, infinite sinc function as an impulse response, i.e., a convolution mask, for an FIR filter. Think about the problem of trying to apply a convolution mask like this in real time. To compute a new value for a given audio sample, we'd need to know all past sample values on to infinity, and we'd need to predict what future sample values are on to infinity. The next step in FIR filter design is to take this ideal, infinite sinc function and modify it to something that is realizable as a convolution mask of finite length. However, a consequence of the modification is that you can't achieve an ideal frequency response. Modifying the ideal impulse response by making it finite creates ripples in the corresponding frequency response.
[Figure 5.40: A realistic frequency response graph, showing the passband up to the cutoff frequency f_c, the transition band, and the stopband along the frequency axis.]
Now let's consider a realistic frequency response graph. Figure 5.40 shows the kind of frequency response graph you'll encounter in the literature and documentation on digital audio filters. Such a graph could represent the specifications of a filter design, or it could represent the behavior of an existing filter. As with the graph of the ideal frequency response, the horizontal axis corresponds to frequency and the vertical axis gives the fraction of each frequency component that will remain in the audio file after it is filtered, corresponding to attenuation. In Figure 5.40, frequency is given in Hz rather than radians, and only the positive portion need be shown since the graph is implicitly symmetrical around 0, but otherwise the graph is the same in nature as the ideal frequency response graph in Figure 5.38. As before, we consider only the frequency components up to one-half the sampling rate, which is the Nyquist frequency, since no other
frequencies can be validly sampled. In frequency response graphs, units on the vertical axis are sometimes shown in decibels. If measured in decibels, attenuation is equal to 20 log10(a_out / a_in) dB, where a_in is the amplitude of frequency component f before the filter is applied, and a_out is the amplitude after the filter is applied. In the graph shown, the units are normalized to range from 0 to 1, indicating the fraction by which each frequency component is attenuated. In the case of normalization, attenuation is measured as a_out / a_in.
Compare the realistic frequency response graph to an ideal one. In a realistic graph, the
passband is the area from 0 up to the cutoff frequency f c and corresponds to the frequencies the
filter tries to retain. Ideally, you want frequencies up to f_c to pass through the filter unchanged, but in fact they may be slightly attenuated in either the positive or negative direction, shown as ripple, that is, fluctuations in the frequency magnitude. The stopband corresponds to the frequencies the filter attenuates or filters out. In the realistic graph, the stopband has ripples like the passband, indicating that the unwanted frequencies are not filtered out perfectly. The transition band lies in between. Unlike the transition in an ideal filter, in a real filter the rolloff, the slope of the curve through the transition band, is not infinitely steep.
When you design an FIR filter, you specify certain parameters that define an acceptable,
though not ideal, frequency response. For example, you could specify:
for a low-pass filter, the cutoff frequency marking the end of the passband ( f c above)
for a low-pass filter, transition width, the maximum acceptable bandwidth from the end
of the passband to the beginning of the stopband
for a bandstop or bandpass filter, f_1 and f_2, the beginning and end frequencies for the
stopband or passband
passband deviation, the maximum acceptable range of attenuation in the passband due to
rippling
stopband deviation, the maximum acceptable range of attenuation in the stopband due to
rippling
So how do you realize a workable FIR filter based on your design specifications? This takes
us back to the ideal impulse response functions we derived in Table 5.2. One way to design an
FIR filter is to take an ideal impulse response and multiply it by a windowing function w(n) .
The purpose of the windowing function is to make the impulse response finite. However,
making the impulse response finite results in a frequency response that is less than ideal. It has
ripples in the passband or stopband, like those shown in Figure 5.40. Thus, you need to select a
windowing function that minimizes the ripples in the passband or stopband and/or creates your
desired cutoff slope.
The simplest windowing function is rectangular. It has the value 1 in some finite range across the infinite impulse response and 0s everywhere else. Multiplying by this window simply truncates the infinite impulse response on both sides. The disadvantage of a rectangular windowing function is that it provides only limited attenuation in the stopband. Four other commonly-used windowing functions are the triangular, Hanning, Hamming, and Blackman windows. The triangular, Hanning, Hamming, and Blackman windows are tapered smoothly to 0 at each end. The effect is that when you multiply a sinc function by one of these windowing functions, the resulting frequency response has less of a ripple in the stopband than would be achieved if you multiply by a rectangular window. The disadvantage is that the transition band will generally be wider.

> Aside: These are essentially the same Hanning, Hamming, and Blackman functions as discussed in Chapter 4, though they look a little different in form. The functions in Chapter 4 were expressed as continuous functions going from 0 to T. In Table 5.3 they are expressed as discrete functions going from −N/2 to N/2. This shift to values symmetric around 0 results in terms being added rather than subtracted in the function.
Rectangular     w(n) = 1
Hanning         w(n) = 0.5 + 0.5 cos(2πn / (N − 1))
Hamming         w(n) = 0.54 + 0.46 cos(2πn / (N − 1))
Blackman        w(n) = 0.42 + 0.5 cos(2πn / (N − 1)) + 0.08 cos(4πn / (N − 1))
Table 5.3 Commonly-used windowing functions
The process discussed in this section is called the windowing method of FIR filter
design. In summary, the steps are given in Algorithm 5.1. The algorithm is for a low-pass filter,
but a similar one could be created for a high-pass, bandpass, or bandstop filter simply by
replacing the function with the appropriate one from Table 5.2 and including the needed
parameters. One parameter needed for all types of filters is the order of the filter, N, i.e., the
number of coefficients. The order of the filter will have an effect on the width of the transition
band. Call the width of the transition band b. Assume that the Nyquist frequency in Hz is
normalized to 0.5 and that b is on this scale. Then the relationship between b and N is given by
b = 4 / N . Thus, if you know the width you'd like for your transition bandwidth, you can
determine an appropriate order for the filter. A higher-order filter gives a sharper cutoff between
filtered and unfiltered frequencies, but at the expense of more computation.
algorithm FIR_low_pass_filter
/*Input: f_c, the cutoff frequency for the lowpass filter, in Hz
         f_samp, the sampling frequency of the audio signal to be filtered, in Hz
         N, the order of the filter; assume N is odd
  Output: a low-pass FIR filter in the form of an N-element array */
{
    f_c = f_c / f_samp                        /*normalize the cutoff frequency*/
    for n = -(N-1)/2 to (N-1)/2               /*ideal impulse response from Table 5.2*/
        if (n == 0) then h(n) = 2 * f_c * w(0)
        else h(n) = (sin(2 * π * f_c * n) / (π * n)) * w(n)   /*w is a windowing function from Table 5.3*/
    shift the indexes of h so that it is stored as an array running from 0 to N-1
}
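For comparison, here is a sketch of the same procedure in Python using NumPy. The function name and the choice of a Hanning window are illustrative; any window from Table 5.3 could be substituted.

import numpy as np

def fir_low_pass(f_c, f_samp, N):
    # Windowed-sinc low-pass FIR filter of odd order N.
    fc = f_c / f_samp                                   # normalize the cutoff frequency
    n = np.arange(-(N - 1) // 2, (N - 1) // 2 + 1)
    # Ideal impulse response from Table 5.2, with the n = 0 case handled separately
    h = np.empty(N)
    nonzero = n != 0
    h[nonzero] = np.sin(2 * np.pi * fc * n[nonzero]) / (np.pi * n[nonzero])
    h[~nonzero] = 2 * fc
    # Hanning window from Table 5.3, symmetric around 0
    w = 0.5 + 0.5 * np.cos(2 * np.pi * n / (N - 1))
    return h * w

# Example: a 1000 Hz cutoff at a 44.1 kHz sampling rate, with 101 coefficients.
h = fir_low_pass(1000, 44100, 101)
# Filtering is then a convolution of the input samples with h, e.g. np.convolve(x, h)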
Supplement on windowing method for FIR filter design: mathematical modeling worksheet

We've described the windowing method of FIR filter design because it gives an intuitive sense of the filters and related terminology and is based on the mathematical relationship between the impulse and frequency responses. However, it isn't necessarily the best method. Two other methods are the optimal method and the frequency sampling method. The optimal method is easy to apply and has the advantage of distributing the ripples more evenly over the passband and stopband, called equiripple. The advantage of the frequency sampling method is that it allows FIR filters to be implemented both non-recursively and recursively and thereby leads to computational efficiency. Another important implementation issue is the effect of quantization error on filter performance. We refer you to sources in the references for details on these filter design approaches and issues.
In this section, we've given you the basic mathematical knowledge with which you could design and implement your own software FIR filters. Of course you don't always have to design filters from scratch. Signal processing tools such as MATLAB allow you to design filters at a higher level of abstraction, but to do so, you still need to understand the terminology and
processes described here. Even when you use the predesigned filters that are available in audio
processing hardware or software, it helps to understand how the filters operate in order to choose
and apply them appropriately.
sided z-transform, with values of n going from 0 to ∞. A full z-transform sums from −∞ to ∞, but the one-sided transform works for our purposes.
Notice the naming convention that x(n) is transformed to X(z), y(n) is transformed to Y(z), h(n) is transformed to H(z), and so forth.
First, think of the z-transform in the abstract. It doesn't really matter what x(n) corresponds to in the real world. Try applying the definition to x(n) = [5, −2, 3, 6, 6], assuming that x(n) = 0 for n > 4. (Henceforth, we'll assume x(n) = 0 if no value is specified for x(n). Array positions are numbered from 0.)

X(z) = 5 − 2z^−1 + 3z^−2 + 6z^−3 + 6z^−4
You can see that X ( z ) is a function of the complex variable z.
It's no coincidence that the functions are called H(z), X(z), and Y(z) as they were in the previous sections. This is because you can turn the z-transform into something more familiar if you understand that it is a generalization of the discrete Fourier transform. Let's see how this works.
In general, X(z) is a function that runs over all complex numbers z. However, consider what you have if you specifically set z = e^{iω} and apply the z-transform to a vector of length N.
This gives you
X(z_k) = Σ_{n=0}^{N−1} x(n) e^{−iωn}   for z = e^{iω} and ω = 2πk / N
Equation 5.8
Equation 5.8 is the discrete Fourier transform of x (n) . The subscript k indicates that the
summation is performed to determine each k th frequency component. We will omit the k when
it is not important to the discussion, but you should keep in mind that the summation is
performed for every frequency component that is computed. In the context of this equation, we
are considering discretely-spaced frequency components. The N frequency components are
spaced by 2π / N, and the kth frequency component is frequency ω = 2πk / N. Thus, although X(z) is
defined over the whole complex plane, when we perform the discrete Fourier transform of x (n) ,
we are only evaluating X (z ) at N specific points on the complex plane. These points represent
different frequency components, and they all lie evenly spaced around the unit circle.
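You can confirm this relationship numerically. The Python sketch below evaluates the one-sided z-transform of the earlier example vector at the N points z = e^{i2πk/N} on the unit circle and compares the result with NumPy's DFT.

import numpy as np

x = np.array([5.0, -2.0, 3.0, 6.0, 6.0])
N = len(x)
n = np.arange(N)

# X(z) = sum over n of x(n) * z^(-n), evaluated at z = e^(i*2*pi*k/N)
X_circle = np.array([np.sum(x * np.exp(-1j * 2 * np.pi * k * n / N)) for k in range(N)])

# The discrete Fourier transform gives the same N values (Equation 5.8).
print(np.allclose(X_circle, np.fft.fft(x)))     # True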
We already observed that convolution in the time domain is equivalent to multiplication in the frequency domain in the sense that performing y(n) = h(n) ∗ x(n) is equivalent to performing Y(z) = H(z)X(z) and then taking the inverse Fourier transform of Y(z). So why do you need z-transforms, since you can focus on a special case of the z-transform, which is the Fourier transform? The reason is that expressing things using the notation of a z-transform is more convenient mathematically in that it helps to lay bare how a given filter will behave. An IIR filter is defined as a convolution by y(n) = h(n) ∗ x(n). The equivalent relation in terms of the z-transform is Y(z) = H(z)X(z), from which it follows that

H(z) = Y(z) / X(z)
Equation 5.9
This ratio, Y(z) / X(z), is referred to as a transfer function. Note that this form is closely related to the difference equation form of an IIR filter.

key equation

Let y(n) = h(n) ∗ x(n) = Σ_{k=0}^{N−1} a_k x(n−k) − Σ_{k=1}^{N−1} b_k y(n−k) be an IIR filter, as defined in Equation 5.2. Let H(z) = Y(z) / X(z) be the transfer function for this filter. Then

H(z) = (a_0 + a_1 z^−1 + a_2 z^−2 + ...) / (1 + b_1 z^−1 + b_2 z^−2 + ...) = (a_0 z^N + a_1 z^(N−1) + ...) / (z^N + b_1 z^(N−1) + ...)
Equation 5.10
We'll see in the next section that the transfer function gives us a convenient way of predicting how a filter will behave, or designing an IIR filter so that it behaves the way we want it to.
A point on the unit circle at angle ω from the real axis has coordinates cos(ω) + i sin(ω); by Euler's identity, this is e^{iω}.
[Figure: the complex plane, showing the point cos(ω) + i sin(ω) = e^{iω} on the unit circle at angle ω from the positive real axis.]
Why do we want to trace a circle on the complex number plane using the point e^{iω}? Because this circle is exactly the circle on which we evaluate the z-transform for each frequency component. By Equation 5.8, the kth frequency component, H(z)_k, is obtained by evaluating H(z) at z = e^{i 2πk / N}.
Another way to get a zero and a pole associated with h(n) = [1, −0.5] is as follows. The difference equation form of this convolution is
y(n) = Σ_{k=0}^{1} h(k) x(n−k) = x(n) − 0.5 x(n−1)
Equation 5.11
The z-transform has a property called the delay property whereby the z-transform of x(n−1) equals z^−1 X(z). The z-transform also has the linearity property that says if two functions are equal, then their z-transforms are equal; and the z-transform of a sum of terms is equal to the sum of the z-transforms of the terms. Thus, if we take the z-transform of both sides of the equation y(n) = x(n) − 0.5 x(n−1), we get

Y(z) = X(z) − 0.5 z^−1 X(z)
Equation 5.12

Dividing both sides by X(z) yields
Y(z) / X(z) = H(z) = 1 − 0.5 z^−1 = (z − 0.5) / z
Equation 5.13
Thus, we have a zero at z = 0.5 (where the numerator equals 0) and a pole at z = 0 (where the
denominator equals 0).
Once you know the zeros and poles, you can plot them on the complex plane in a graph
called a zero-pole diagram. From this you can derive information about the frequency response
of the filter. At point z = 0.5 , there is no imaginary component, so this point lies on the
horizontal axis. The pole is at the origin. This is shown in Figure 5.42.
[Figure 5.42: Zero-pole diagram for h(n) = [1, −0.5], with a zero at z = 0.5 and a pole at the origin. Points P_k lie on the unit circle, with P_0 on the positive real axis.]
To predict the filter's effect on frequencies, we look at each kth point, call it P_k, around the unit circle, where P_0 falls on the positive real axis. The angle ω formed between P_k, the origin, and P_0 is equal to 2πk / N. Each such angle corresponds to a frequency component of a signal being altered by the filter. Since the Nyquist frequency has been normalized to π, we consider only the frequencies between 0 and π, in the top semicircle. Let d_zero be the distance between P_k and the zero of the filter, and let d_pole be the distance between P_k and the pole. The magnitude of the frequency response at the frequency corresponding to P_k is given by the ratio d_zero / d_pole. In this case, because the pole of the filter is at the origin, d_pole is the distance from the origin to the unit circle, which is always 1, so the size of d_zero / d_pole depends entirely on d_zero. As d_zero gets larger, the frequency response gets larger, and d_zero gets increasingly large as you move from ω = 0 to ω = π, which means the frequency response gets increasingly large. Thus, the zero-pole diagram represents a high-pass filter in that it attenuates low frequencies more than high frequencies.

Supplement on z-transforms, zero-pole diagrams, and IIR filters: interactive tutorial
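The geometric argument can be checked directly in Python. The sketch below evaluates the ratio d_zero / d_pole for the filter h(n) = [1, −0.5] at a few points on the upper half of the unit circle.

import numpy as np

zero, pole = 0.5, 0.0                       # zero at z = 0.5, pole at the origin
omega = np.linspace(0, np.pi, 8)            # frequencies from 0 to the Nyquist frequency
P = np.exp(1j * omega)                      # points P_k on the unit circle

d_zero = np.abs(P - zero)                   # distance from P_k to the zero
d_pole = np.abs(P - pole)                   # distance from P_k to the pole (always 1 here)
response = d_zero / d_pole

for w, r in zip(omega, response):
    print(f"omega = {w:.2f}   |H| = {r:.3f}")   # the magnitude grows with frequency: a high-pass filter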
Now let's try deriving and analyzing the zero-pole diagram of an IIR filter. Say that your filter is defined by the difference equation

y(n) = x(n) − x(n−2) − 0.64 y(n−2)
Equation 5.14

Supplement on creating FIR and IIR filters in MATLAB: mathematical modeling worksheet
The zeros are at z = 1 and z = −1, and the poles are at z = 0.8i and z = −0.8i. (Note that if the imaginary part of z is non-zero, roots appear as conjugate pairs a ± ib.) The zero-pole diagram
is in Figure 5.43. This diagram is harder to analyze by inspection because we have more than
one zero and more than one pole. To determine when the frequency response gets larger, we have to determine when H(z) = ((z + 1)(z − 1)) / ((z + 0.8i)(z − 0.8i)) grows, which is the same thing as determining how (z + 1)(z − 1) gets large relative to (z + 0.8i)(z − 0.8i). In fact, this diagram represents a bandpass filter, with frequencies near π/2 being the least attenuated.
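A quick way to confirm this behavior, if you have SciPy available, is to compute the frequency response directly from the difference equation's coefficients.

import numpy as np
from scipy.signal import freqz

# y(n) = x(n) - x(n-2) - 0.64*y(n-2), i.e., Equation 5.14
b = [1, 0, -1]           # feedforward (numerator) coefficients
a = [1, 0, 0.64]         # feedback (denominator) coefficients

w, H = freqz(b, a, worN=512)                 # w runs from 0 to pi
peak = w[np.argmax(np.abs(H))]
print(f"peak response near omega = {peak:.2f} (pi/2 = {np.pi / 2:.2f})")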
[Figure 5.43: Zero-pole diagram for the filter of Equation 5.14, with zeros at z = 1 and z = −1 and poles at z = 0.8i and z = −0.8i.]
We've looked at how you analyze an existing filter by means of its zero-pole diagram.
It is also possible to design a filter using a zero-pole diagram by placing the zeros and poles
in relative positions such that they create the desired frequency response. This is feasible,
however, only for simple filters.
The most commonly-used design method for IIR filters is to convert analog filters into
equivalent digital ones. This method relies on the z-transform as well as the equivalent
transform in the continuous domain, the Laplace transform. We wont go into the details of this
method of filter design. However, you should be aware of four commonly-used types of IIR
filters, the Bessel, Butterworth, Chebyshev, and elliptic filters.
Recall that FIR filters are generally phase linear, which means that phase shifts affect frequency components in a linear fashion. The result is that harmonic frequencies are shifted in the same proportions and thus don't sound distorted after filtering, which is important in music. In IIR filters, on the other hand, care must be taken to keep the phase response as close to linear as possible.
Bessel filters are the best IIR filter for ensuring a linear phase response. Butterworth
filters sacrifice phase linearity so that they can provide a frequency response that is as flat as
possible in the passband and stopband i.e., they minimize ripples. A disadvantage of
Butterworth filters is that they have a wide transition region. Chebyshev filters have a steeper
roll-off than Butterworth filters, the disadvantage being that they have more non-linear phase
response and more ripple. However, the amount of ripple can be constrained, with tradeoffs
made between ripple and roll-off or between ripple in the passband (Chebyshev 1 filters) versus
ripple in the stopband (Chebyshev 2 filters). Elliptic filters, also called Cauer filters, have the worst phase linearity of the common IIR filters, but they have the sharpest roll-off relative to the number of coefficients and are equiripple in the passband and stopband. It should be clear that
the choice of filter depends on the nature of your audio signal and how you want it to sound after
filtering.
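If you are working in Python rather than MATLAB, SciPy provides ready-made designs for all four families. The order, cutoff, and ripple values below are illustrative.

from scipy.signal import bessel, butter, cheby1, cheby2, ellip

# Fourth-order low-pass designs with the cutoff at 0.25 of the Nyquist frequency
b_bes, a_bes = bessel(4, 0.25)               # best phase linearity
b_but, a_but = butter(4, 0.25)               # maximally flat passband and stopband
b_ch1, a_ch1 = cheby1(4, 1, 0.25)            # allows 1 dB of ripple in the passband
b_ch2, a_ch2 = cheby2(4, 40, 0.25)           # ripple confined to the stopband, 40 dB down
b_ell, a_ell = ellip(4, 1, 40, 0.25)         # sharpest roll-off, ripple in both bands

# Each call returns feedforward (b) and feedback (a) coefficients for the filter's
# difference equation; inspect them with scipy.signal.freqz or apply them with scipy.signal.lfilter.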
Supplement on creating a filter via a transfer function: mathematical modeling worksheet
to a voice signal such as a telephone transmission, the reduction in quality resulting from
effectively reducing the bit depth isn't objectionable.
Differential pulse code modulation (DPCM) is another way of making an audio file smaller than it would be otherwise, by recording the difference between one sample and the next rather than recording the actual sample value. Variations of DPCM include ADPCM and delta modulation.
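The idea behind DPCM is easy to see in code. The sketch below is a lossless simplification: real ADPCM encoders also quantize the differences and adapt the quantization step size, but the principle of storing small differences instead of full sample values is the same.

import numpy as np

def dpcm_encode(samples):
    # Store the first sample, then the difference between each sample and the previous one.
    samples = np.asarray(samples, dtype=int)
    return np.concatenate(([samples[0]], np.diff(samples)))

def dpcm_decode(encoded):
    # A running sum of the differences recovers the original samples.
    return np.cumsum(encoded)

x = [138, 140, 143, 141, 137, 136]
print(dpcm_encode(x))                        # [138   2   3  -2  -4  -1]; the differences need fewer bits
print(dpcm_decode(dpcm_encode(x)))           # the original samples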
A-law encoding, μ-law encoding, delta modulation, and other variations of PCM are
time-based methods. That is, there is no need to transform the data into the frequency domain in
order to decide where and how to eliminate information. In a sense, these methods aren't really
compression at all. Rather, they are ways of reducing the amount of data at the moment an audio
signal is digitized, and they are often labeled "conversion techniques" rather than "compression
methods."
The most effective audio compression methods require some information about the
frequency spectrum of the audio signal. These compression methods are based on
psychoacoustical modeling and perceptual encoding.
in the band. This phenomenon is called masking. Critical bands are narrower for lower than for
higher frequency sounds. Between 1 and 500 Hz, bands are about 100 Hz in width. The critical
band at the highest audible frequency is over 4000 Hz wide. Think about the implications of relatively narrow bands at low frequencies: narrow bands imply that interference among frequencies occurs over a narrower range, which explains why we have greater frequency resolution at low frequencies.
The phenomenon of critical bands is one of the most important in perceptual encoding,
since it gives us a basis for eliminating sound information that is not perceived anyway. The
masking phenomenon is pictured in Figure 5.44. Recall from Chapter 1 that the limit below
which you can't hear is called the threshold of hearing. Masking causes the threshold of hearing
to change in the presence of a dominant frequency component in a band. For example, a 500 Hz
frequency component is easy to hear, but when a 400 Hz tone occurs at about the same time and
at sufficient amplitude, it can mask the 500 Hz component. Another way to say this is that the
threshold of hearing for 500 Hz is higher in the presence of the 400 Hz signal. A frequency that
raises the threshold of hearing of another frequency is called a masking frequency or masking
tone. As shown in Figure 5.44, the threshold of hearing for all the frequencies within the critical
band of the masking frequency are raised, indicated by the masking threshold in the figure. All
frequencies that appear at amplitudes beneath the masking threshold will be inaudible. This
graph represents the effect of just one masking tone within one critical band at one window in
time. These effects happen continuously over time, over all the critical bands. The important
point to note is that frequencies that are masked out in an audio signal cannot be heard, so they
do not need to be stored. Eliminating masked information results in compression.
[Figure 5.44: Masking. A masking tone raises the masking threshold above the threshold of hearing within its critical band, so that a nearby masked frequency at an amplitude beneath the masking threshold becomes inaudible.]
Here's a sketch of how the masking phenomenon can be applied in compression: A small
window of time called a frame is moved across a sound file, and the samples in that frame are
compressed as one unit. Spectral analysis by means of a filter bank divides each frame into
bands of frequencies; 32 is often the number of bands used. A masking curve for each band is
calculated based upon the relative amplitudes of the frequencies in that band. Analysis of the
masking curve reveals the information that can be discarded or compressed. This is done by
determining the lowest possible bit-depth that can be used for the band such that the resulting
quantization noise is under the masking curve. Further refinements can be made to this scheme.
For example, temporal masking can be applied, based on the phenomenon of one signal blocking
out another that occurs just before or just after it. Following the perceptual filtering step, the
remaining information is encoded at the bit depth determined for each band.
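The following Python sketch illustrates the bit-depth decision in the last step. It relies on the common rule of thumb that each additional bit of quantization lowers the noise floor by about 6 dB; the band SMR values are made up for the example, and a real encoder allocates bits under a total bit-budget constraint as well.

import math

def bits_for_band(smr_db, db_per_bit=6.02):
    # Allocate just enough bits that the quantization noise stays below the masking curve.
    if smr_db <= 0:
        return 0                              # everything in the band is masked; store nothing
    return math.ceil(smr_db / db_per_bit)

smr = [42.0, 18.5, 7.2, -3.0]                 # hypothetical signal-to-mask ratios (dB) for four bands
print([bits_for_band(s) for s in smr])        # [7, 4, 2, 0]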
In the sections that follow, we will use MPEG compression to illustrate how perceptual
encoding is implemented. MPEG coding has been one of the dominant audio compression
methods since the 1990s. Other important audio coding schemes exist that we won't treat here.
For example, Dolby AC-3 is a widely-used audio codec found in DVD video, digital TV, and game consoles. Based on perceptual encoding, AC-3 uses many of the same techniques as MPEG compression.
gets smaller from Layer I to Layer III is a result of the greater complexity of the Layer III compression algorithm compared to the Layer I algorithm; i.e., Layer III can compress more without loss of quality, but it takes more time to do so.
The amount of time needed for encoding and decoding has the greatest impact on real-time audio, such as two-way voice conversations. In general, a delay of more than 10 ms can be disturbing in a voice conversation. MPEG audio encoding delays are on the order of 15 ms for Layer I to 60 ms for Layer III, depending on the hardware and software implementation.
The compression rates that you may see cited for different layers of MPEG give you the
general picture that higher layers yield better compression. For example, MPEG-1 Layer 1 is
cited as having a compression rate of about 4:1; Layer 2 as about 7:1; and Layer III as about
11:1. However, as is true with bit rates, compression rates don't mean much unless you consider
what type of audio is being compressed. To get a sense of the compression rate, consider that uncompressed CD-quality stereo requires 16 bits per sample * 2 channels * 44,100 samples/s = 1.4112 Mb/s. Compressing CD-quality stereo to 128 kb/s using MP3 encoding yields a compression rate of about 11:1. But MP3 can also be compressed at 192 kb/s, which is a rate of about 7:1. You can see
that to generate specific compression rates, you need to compare the initial bit rate with the final
bit rate. To some extent, the final bit rate is your own choice. When you compress an audio file
with an MPEG compressor, you get a list of bit rate options, as shown in Figure 5.45. The lower
the bit rate, the more compression you get, but at the sacrifice of some sound fidelity. Notice
that you can also choose between constant bit rate (CBR) and variable bit rate (VBR). CBR
uses the same number of bits per second regardless of how complex the signal is. VBR uses
fewer bits for passages with a smaller dynamic range, resulting overall in a smaller file.
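The arithmetic behind those rates is simple enough to check, as the short Python calculation below shows.

cd_bit_rate = 16 * 2 * 44100                  # 16 bits/sample, 2 channels, 44,100 samples/s = 1,411,200 b/s
for mp3_rate in (128000, 192000):
    ratio = cd_bit_rate / mp3_rate
    print(mp3_rate // 1000, "kb/s compresses CD-quality stereo by about", round(ratio), "to 1")
# 128 kb/s gives about 11:1; 192 kb/s gives about 7:1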
The MPEG audio layer and bit rate that is best for a given application depends on the
nature of the audio files being compressed and how high a bit rate the user can tolerate. For
example, MPEG-1 Layer 2 (sometimes referred to as MP2) gives better quality sound than MP3
if you don't mind relatively higher bit rates, and it requires lower encoding delays. For these
reasons, it is widely used for digital radio. MP3 compresses well and has been extremely
popular for music files that are shared through the web. The usual MP3 bit rate of 128 or 192
kb/s creates reasonable-size files that don't take too much time to download and that have good
quality for playback.
MPEG layers are backward compatible so that, for example, a Layer III decoder can
decode a Layer I stream. MPEG-2 decoders are backward compatible with MPEG-1.
Proprietary implementations of MPEG such as AT&T's a2b and Liquid Audio are not mutually
compatible.
MPEG-1
   Data rates: Layer I 32–448 kb/s; Layer II 32–384 kb/s; Layer III 32–320 kb/s
   Channels: mono or stereo
   Sampling rates supported: 32, 44.1, and 48 kHz

MPEG-2 LSF (Low Sampling Frequency)
   Data rates: Layer I 32–256 kb/s; Layer II 8–160 kb/s; Layer III 8–160 kb/s
   Channels: mono or stereo
   Sampling rates supported: 16, 22.05, and 24 kHz

MPEG-2 Multichannel
   Data rates: Layer I max 448 kb/s; Layer II max 384 kb/s; Layer III max 32 kb/s
   Channels: multichannel (5.1 surround), with multilingual extensions
   Sampling rates supported: 32, 44.1, and 48 kHz

AAC in MPEG-2 and MPEG-4 (Advanced Audio Coding)
   Data rates: 8–384 kb/s
   Channels: multichannel, up to 48 channels
   Sampling rates supported: 32, 44.1, and 48 kHz, and other rates between 8 and 96 kHz
   Applications: music shared on the web, portable music players, cell phones

Table 5.4 Phases and layers of MPEG audio compression
algorithm MPEG-1_audio
/*Input: An audio file in the time domain
Output: The same audio file, compressed*/
{
Divide the audio file into frames
For each frame {
By applying a bank of filters, separate the signal into frequency bands.
For each frequency band {
Perform a Fourier transform to analyze the band's frequency spectrum
Analyze the influence of tonal and non-tonal elements (i.e., transients)
Analyze how much the frequency band is influenced by neighboring bands
Find the masking threshold and signal-to-mask ratio (SMR) for the band, and
determine the bit depth in the band accordingly
Quantize the samples from the band using the determined bit depth
Apply Huffman encoding (optionally)
}
Create a frame with a header and encoded samples from all bands
}
}
Algorithm 5.2
[Figure: Block diagram of an MPEG-1 audio encoder. The audio signal passes through a 32-filter filter bank; in parallel, a Fourier transform feeds the psychoacoustical analysis, which drives the bit allocator that quantizes the subband samples to produce the compressed audio signal.]
1. Divide the audio file into frames and analyze the psychoacoustical properties of
each frame individually.
Motivation: Each frame covers a sequence of samples in a small window of time. To
analyze how one frequency component might mask neighboring ones, it's necessary to look at
samples that are close to each other in time. The masking phenomenon happens only when
different frequencies are played at close to the same time.
Details: Frames can contain 384, 576, or 1152 samples, depending on the MPEG phase
and layer. It may seem odd that these numbers are not powers of two. However, when
frequency analysis is done on these samples, a larger window is placed around the sample, and
the window is a power of two.
For the remainder of these steps, it is assumed that we're operating on an individual
frame.
2. By applying a bank of filters, separate the signal into frequency bands.
Motivation: The samples are divided into frequency bands because the psychoacoustical
properties of each band will be analyzed separately. Each filter removes all frequencies except
for those in its designated band. Once the psychoacoustical properties of a band are analyzed,
the appropriate number of bits to represent samples in that band can be determined.
Details: The use of filter banks is called subband coding. Say that there are n filters in
the filter bank. n copies of the signal are generated, and a copy is sent through each of the filters.
(In MPEG-1, for example, n = 32.) Time-domain bandpass filters are applied such that each filter lets only a range of frequencies pass through. Note that the n frequency bands are still represented in the time domain; i.e., they contain samples of audio at consecutive points in time. Also, the amount of data does not increase as a result of dividing the frame into bands. If, for example, 384 samples enter the 32 filters, each filter produces 12 samples; since 12 * 32 = 384, there is no increase in data.
The previous section on FIR and IIR filters shows that bandpass filters are never ideal in that you cannot perfectly isolate the band of frequencies you want. The frequency response graph is
never a perfect rectangle like the one pictured in Figure 5.38, with a perfectly vertical cutoff
between the desired frequencies and the filtered-out frequencies. Instead, the cutoff between the
frequencies is a sloped line, called an attenuation slope. For this reason, the frequency bands
overlap a little. If this overlap is not accounted for, then there will be aliasing when the signal is
reconstructed. Quadrature mirror filtering (QMF) is a technique that eliminates the aliasing. It
does so by making the attenuation slope of one band a mirror image of the attenuation slope of
the previous band so that the aliased frequencies cancel out when the signal is reconstructed.
This is pictured in Figure 5.47.
In MPEG-1 Layers I and II, the frequency bands created by the filter banks are uniform
in size. This means that their width doesn't match the width of the critical bands in human
hearing, which are wider at high frequencies. However, the way in which the frequency
components of a band are analyzed can compensate somewhat for this disparity. In MP3
encoding, the modified discrete cosine transform (MDCT) is used to improve frequency
resolution, particularly at low-frequency bands, thus modeling the human ear's critical bands
more closely. Also, MP3 uses non-linear quantization, where (as in μ-law encoding) the quantization intervals are larger for high-amplitude samples.
[Figure 5.47: Two adjacent frequency bands whose attenuation slopes are mirror images of each other (quadrature mirror filtering).]
3. Perform a Fourier transform on the samples in each band in order to analyze the
band's frequency spectrum.
Motivation: The filters have already limited each band to a certain range of frequencies,
but it's important to know exactly how much of each frequency component occurs in each band.
This is the purpose of applying a Fourier transform. From the frequency spectrum, a masking
curve can be produced for each band. Then the number of bits with which to encode each band
can be chosen in a way that pays attention to the elevated noise floor produced from masking.
Details: Another copy of each frame, called a sidechain, is created so that the Fourier
transform can be applied. A 512 or 1024-sample window is used. You may notice that this is
not the same size as the number of samples in a frame. In fact, the Fourier transform is done on
a window that surrounds the frame samples. For example, if a frame has 384 samples, 1024
samples that encompass this window are used for the Fourier transform. If the frame has 1152
samples, two 1024-sample Fourier transforms are done, covering two halves of the frame.
4. Analyze the influence of tonal and non-tonal elements in each band. (Tonal
elements are simple sinusoidal components, e.g., frequencies related to melodic and
harmonic music. Non-tonal elements are transients like the strike of a drum or the
clapping of hands.)
Motivation: It would not be good to allow masking to eliminate non-tonal elements that
punctuate an audio signal in meaningful ways. Also, if non-tonal elements are not compressed
properly, ringing or pre-echo effects can result.
Details: The longer the time window is in frequency analysis, the better the frequency
resolution, but the worse the time resolution. The effect of poor time resolution is that sound
elements that occur very close to one another in time might not be distinguished sufficiently, one
from the other. This in turn can mean that transient signals are not properly identified. Layers II
and III have a longer time window than Layer I, so they have better time resolution.
5. Determine how much each band's influence is likely to spread to neighboring
frequency bands.
Motivation: It isn't sufficient to deal with bands entirely in isolation from each other,
since there can be a masking effect between bands.
Details: Empirically-determined spreading functions can be applied.
6. Find the masking threshold and signal-to-mask ratio (SMR) for each band, and
determine the bit depth of each band accordingly.
Motivation: Within each band, a "loud" frequency component can mask other frequency components. The precision with which a group of samples needs to be represented depends on the ratio of the loudest sample among them to the amplitude below which the others can't be heard, even if they are present. This is in fact the definition of SMR: the ratio between the peak sound pressure level and the masking threshold (Figure 5.48). This ratio determines how many bits are needed to represent samples within a band.
Details: The main idea you should get from this is that the masking phenomenon causes
the noise floor within a band to be raised. The number of bits per sample varies from band to
band depending on how high the noise floor is. When the noise floor is high relative to the
maximum sound pressure level within a band, then fewer bits are needed. Fewer bits create
more quantization noise, but it doesn't matter if that quantization noise is below the masking
threshold. The noise won't be heard anyway.
MP3 has an interesting variation in its design, allowing for a bit reservoir. In a constant bit
rate (CBR) encoding, all frames have the same number of bits at their disposal, but not all frames
need all the bits allocated to them. A bit reservoir scheme makes it possible to take unneeded
bits from a frame with a low SMR and use them in a frame with a high SMR. This doesn't
increase the compression rate (since we're assuming CBR), but it does improve the quality of the
compressed audio.
[Figure 5.48: The signal-to-mask ratio (SMR), the distance between the maximum sound pressure level in a band and the masking threshold, shown relative to the threshold of hearing along the frequency axis.]
7. Quantize the samples for the band with the appropriate number of bits, possibly
following this with Huffman encoding.
Motivation: Some bit patterns occur more frequently than others. Huffman encoding
allows the frequently-occurring bit patterns to be encoded with fewer bits than patterns that occur
infrequently.
Details: MPEG-1 Layers 1 and 2 use linear quantization, while MP3 uses non-linear.
After quantization, the values can be scaled so that the full dynamic range offered by the
bit depth is used. The scale factor then has to be stored in the frame.
In MP3 encoding, the frame data is reordered after quantization and divided into regions
so that Huffman tables can be applied. These empirically-generated tables reflect the
probabilities of certain bit patterns in each region.
Intensity stereo coding replaces the two-channel signal with one signal plus a vector representing directional information. This yields the highest stereo compression rate, but obviously it is a lossy method. Mid/side stereo coding calculates a middle channel as (left+right)/2 and a side channel as (left−right)/2, which then is encoded with fewer bits than a straightforward stereo representation. Mid/side coding has a less detrimental effect on phase information than intensity stereo does.
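A short Python sketch makes the mid/side transformation concrete. The sample values are made up; the point is that for typical stereo material the side channel is small and therefore cheap to encode.

import numpy as np

def mid_side_encode(left, right):
    return (left + right) / 2, (left - right) / 2      # mid, side

def mid_side_decode(mid, side):
    return mid + side, mid - side                      # left, right

left = np.array([0.30, 0.52, 0.48, -0.10])
right = np.array([0.28, 0.50, 0.47, -0.12])
mid, side = mid_side_encode(left, right)
print(side)                                            # small values, on the order of 0.01
left2, right2 = mid_side_decode(mid, side)
print(np.allclose(left, left2), np.allclose(right, right2))   # True True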
The MPEG standard defines how a compressed bit stream must look. The compressed
signal is divided into frames. A sketch of an MPEG frame is shown in Figure 5.49. Each frame
begins with a sync word to signal that this is the beginning of a new frame. The header then tells
the bit rate, sampling frequency, mono or stereo mode, and so forth. CRC stands for cyclic
redundancy check, which is an error checking algorithm to ensure that no errors have been
inserted in transmission. Following the CRC, the number of bits used in each band is listed.
Then the scale factor for each band is given. The largest part of the frame is devoted to the
samples. Extra information may be added to a frame in the form of ID3 tags. ID3 is an informal
standard for giving information about music files being compressed, including copyright, artist,
title, album name, lyrics, year, and web links.
[Figure 5.49: Layout of an MPEG frame: sync word; header (phase and layer numbers, bit rate, sampling frequency, etc.); CRC; bit allocation by band (Layers I and II); scale factor by band; sample values; and extra info (artist, album, year, etc.).]
MP3 at 128 kb/s. Because of its good quality, AAC has gained popularity for use in portable
music players and cell phones, two of the biggest markets for compressed audio files.
The AAC psychoacoustical model is too complicated to describe in detail in a short
space. The main features include the following:
The modified discrete cosine transform improves frequency resolution as the signal is
divided into frequency bands;
Temporal noise shaping (TNS) helps in handling transients, moving the distortion
associated with their compression to a point in time that doesn't cause a pre-echo.
Predictive coding improves compression by storing the difference between a predicted
next value and what the value actually is. (Since the decoder uses the same prediction
algorithm, it is able to recapture the actual value from the difference value stored.)
Not all scale factors need to be stored since consecutive ones are often the same. AAC
uses a more condensed method for storing scale factors.
A combination of intensity coding and mid/side coding is used to compress information
from multiple channels.
A model for low-delay encoding provides good perceptual encoding properties with the
small coding delay necessary for two-way voice communication.
The AAC compression model is a central feature of both MPEG-2 and MPEG-4 audio.
MPEG-4, however, is a broad standard that also encompasses DVD video, streaming media,
MIDI, multimedia objects, text-to-speech conversion, score-driven sound synthesis, and
interactivity.
We refer you to the sources at the end of this chapter and Chapter 4 if you're interested in
more implementation details of AAC, MPEG-4, and the other compression models.
5.8 Vocabulary
AAC compression (Advanced Audio Coding)
all-pass filter
analog-to-digital converter (ADC)
attack time (of a sound)
band filters
low-pass
high-pass
bandpass
bandstop
bus
channel
mono
stereo
chorus effect
clipping
comb filter
constant vs. variable bit rate (CBR vs. VBR)
convolution filter
convolution theorem
crossover
delay effect (multi-tap delay)
release
order of a filter (number of taps or tap weights)
parametric equalizer
paragraphic equalizer
peaking filter (notch filter)
Q-factor (quality factor)
psychoacoustical modeling
rectangle function
reverb
shelving filters
low-shelf
high-shelf
cutoff frequency
sinc function
SMPTE (Society of Motion Picture and Television Engineers) time
sound card
external (digital audio interface)
internal
spectral leakage
stopband
track
transients
transport controls
unit impulse (delta function)
waveform view (sample editor)
wet vs. dry signal
windowing method of FIR filter design
zero-pole diagram
pole
zero
z-transform (one-sided)
delay property
linearity property
x (n) = [138, 232, 253, 194, 73, 70, 191, 252, 233, 141, 4, 135, 230, 253, 196, 77,
66, 189, 251, 235, 144, 7, 131, 229, 253, 198, 80, 63, 186, 251]
h(n) = [0.0788, 0.1017, 0.1201, 0.1319, 0.1361, 0.1319, 0.1201, 0.1017, 0.0788]
y(n) = h(n) ∗ x(n) = Σ_{k=0}^{N−1} h(k) x(n−k)
5.10 Applications
1. Examine the specifications of your sound card (or the one you would like to have). What
kind of input and output ports does it have? What is its signal-to-noise ratio? Does it handle
MIDI? If so, how?
2. Examine the specifications of your microphone (or the one you would like to have). Is it a dynamic or a condenser mic? Some other kind? Does your microphone require power? What frequencies does it handle best? Do you need different mics for different needs?
3. If you have access to two or more microphones with significantly different characteristics, try making similar recordings with each of them. Listen to the recordings to see if you can hear
the difference. Then look at the frequency spectrum of the recordings and analyze them. Are the
results what you expect, based on the specifications of the mics?
4. Create a one-minute radio commercial for a hypothetical product, mixing voices, music, and
sound effects. Record the commercial at your digital audio workstation using multitrack editing
software. Record and edit the different instruments, voices, and so forth on different tracks.
Experiment with filters, dynamics processing, and special effects. When you've finished, mix
down the tracks, and save and compress the audio in an appropriate file type. Document and
justify your steps and choices of sampling rate, bit depth, editing tools, compression, and final
file type.
5. If you don't know already, find out what "loops" are. Find a source of free online loops.
Create background music for a hypothetical computer game by combining and editing loops. (At
the time of this writing, Acid Loops is an excellent commercial program for working with loops.
Royalty Free Music is a source for free loops.) If you create a computer game when you get to
Chapter 8, you can use your music in that game.
6. See if you can replicate the impulse response method of creating a convolution filter for
reverb that makes an audio file sound like it was recorded in a certain acoustical space. (If you
do a web search, you should be able to find descriptions of the details of this method. Look
under "convolution reverb." For example, at the writing of this chapter, a paper entitled
Implementation of Impulse Response Measurement Techniques by J. Shaeffer and E. Elyashiv
could be found at http://www.acoustics.net/objects/pdf/IR-paper.pdf.)
Examine the features of your digital audio processing program or programs and try the
exercises below with features that are available.
7. What file types are supported by your audio editing software? What compression methods?
8. Make a simple audio file and intentionally put a click or pop in it. See if you can remove the
click or pop with your audio editing program.
9. Examine the tools for dynamics processing offered in your audio editing program. Try
compressing the dynamic range of an audio file. Then try expansion. Experiment with different
attack and release times and listen to the differences.
10. Examine the types of filters offered in your audio editing program. See if your software has
a graphic equalizer, parametric EQ, paragraphic EQ, notch filters, and/or convolution. Can you
tell if they are implemented with FIR or IIR filters? What parameters do you have to set to use
these filters?
11. Examine the types of reverb, echo, and delay effects available in your audio editing program
and experiment with the effects.
Additional exercises or applications may be found at the book or author's websites.
5.11 References
5.11.1 Print Publications
Coulter, Doug. Digital Audio Processing. Lawrence, KS: R & D Books, 2000.
Embree, Paul M., and Damon Danieli. C++ Algorithms for Digital Signal Processing. 2nd ed.
Upper Saddle River, NJ: Prentice Hall, 1999.
Noll, Peter. "MPEG Digital Audio Coding." IEEE Signal Processing Magazine, 14 (5): 59-81,
Sept. 1997.
Pan, Davis. "A Tutorial on MPEG/Audio Compression." IEEE Multimedia 2 (2): 60-74, 1995.
Phillips, Dave. The Book of Linux Music & Sound. San Francisco: No Starch Press/Linux
Journal Press, 2000.
Smith, Steven W. The Scientist and Engineer's Guide to Digital Signal Processing. San Diego:
California Technical Publishing, 1997.
See Chapter 4 for additional references on digital audio.
5.11.2 Websites
aRts-project. http://www.arts-projects.org.
Audacity. http://jackit.sourceforge.net.