Audio Compression

Digital audio compression removes redundant or irrelevant information from audio signals to reduce file sizes and transmission bandwidth needs. It works by lowering sampling rates, bit rates, or the number of channels, or by using data compression algorithms. Lossy compression achieves higher compression ratios by removing perceptually irrelevant audio data; popular lossy techniques include psychoacoustic modeling to eliminate inaudible frequency components, predictive coding, and Huffman coding. Lossless compression removes only redundant data, allowing perfect reconstruction.


Audio Compression Techniques
Introduction
 Digital Audio Compression
 Removal of redundant or otherwise irrelevant information from an audio signal
 Audio compression algorithms are often referred to as
“audio encoders”
 Applications
 Reduces required storage space
 Reduces required transmission bandwidth

Audio Compression
 Audio signal – overview
 Sampling rate (# of samples per second)
 Bit rate (# of bits per second). A typical uncompressed stereo 16-bit 44.1 kHz signal has a bit rate of about 1.4 Mbps
 Number of channels (mono / stereo / multichannel)
 Reduction by lowering those values or by data
compression / encoding

Why Compression is Needed
 Data rate = sampling rate * quantization
bits * channels (+ control information)

 For example (digital audio):
 44100 Hz; 16 bits; 2 channels
 generates about 1.4 Mbit of data per second; 84 Mbit per minute; 5 Gbit per hour
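
As a quick check of the formula above, a minimal Python sketch of the arithmetic:

```python
# Uncompressed PCM data rate: sampling rate * quantization bits * channels.
sampling_rate = 44_100      # samples per second
bits_per_sample = 16        # quantization bits
channels = 2                # stereo

bps = sampling_rate * bits_per_sample * channels
print(f"{bps / 1e6:.2f} Mbit/s")           # ~1.41 Mbit/s
print(f"{bps * 60 / 1e6:.1f} Mbit/min")    # ~84.7 Mbit/min
print(f"{bps * 3600 / 1e9:.2f} Gbit/h")    # ~5.08 Gbit/h
```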
Audio Data Compression
 Redundant information
 Implicit in the remaining information
 Ex. oversampled audio signal

 Irrelevant information
 Perceptually insignificant
 Cannot be recovered from remaining
information

Audio Data Compression
 Lossless Audio Compression
 Removes redundant data
 Resulting signal is the same as the original – perfect reconstruction. E.g., Huffman, LZW
 Lossy Audio Encoding
 Removes irrelevant data
 Resulting signal is similar to the original. E.g., ADPCM, LPC

Audio Data Compression
 Audio vs. Speech Compression
Techniques
 Speech Compression uses a human vocal
tract model to compress signals
 Audio Compression does not use this technique due to the larger variety of possible signal variations

Generic Audio Encoder
 Psychoacoustic Model
 Psychoacoustics – study of how sounds are
perceived by humans
 Uses perceptual coding
 eliminates information from the audio signal that is inaudible to the ear
 Detects conditions under which different audio signal components mask each other

Additional Encoding Techniques
 Other encoding techniques are available (as alternatives or in combination)
 Predictive Coding
 Coupling / Delta Encoding
 Huffman Encoding

Additional Encoding Techniques
 Predictive Coding
 Often used in speech and image compression
 Estimates the expected value for each sample based
on previous sample values
 Transmits/stores the difference between the expected
and received value
 Generates an estimate for the next sample and then
adjusts it by the difference stored for the current
sample
 Used for additional compression in MPEG-2 AAC

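
A minimal sketch of the idea in Python, assuming the simplest possible predictor (the previous sample); production coders such as MPEG-2 AAC use higher-order adaptive predictors and quantize the residual:

```python
def dpcm_encode(samples):
    """Predictive (DPCM) encoder sketch: predict each sample as the
    previous one and store only the prediction error."""
    prediction, residuals = 0, []
    for x in samples:
        residuals.append(x - prediction)  # difference: expected vs. received
        prediction = x                    # next estimate = current sample
    return residuals

def dpcm_decode(residuals):
    """Invert the encoder by adding each residual to the running estimate."""
    prediction, samples = 0, []
    for d in residuals:
        prediction += d
        samples.append(prediction)
    return samples

# Residuals of a slowly varying signal are small, hence cheaper to entropy-code:
print(dpcm_encode([100, 102, 103, 103, 101]))  # [100, 2, 1, 0, -2]
```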
Additional Encoding Techniques
 Coupling / Delta encoding
 Used in cases where audio signal consists of two or
more channels (stereo or surround sound)
 Similarities between channels are used for
compression
 A sum and a difference between the two channels are derived; the difference is usually close to zero and therefore requires less space to encode
 This is a lossless encoding process

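
A minimal sketch of sum/difference coupling for integer samples (real coders apply this per frequency band rather than per raw sample, but the bookkeeping is the same):

```python
def ms_encode(left, right):
    """Sum/difference (mid/side) coupling for one stereo sample pair.
    For correlated channels the side value is near zero, so it codes cheaply."""
    return left + right, left - right

def ms_decode(mid, side):
    """Exact reconstruction: mid + side = 2*left and mid - side = 2*right,
    so integer halving is lossless."""
    return (mid + side) // 2, (mid - side) // 2

print(ms_encode(1000, 997))   # (1997, 3)  -> the side channel is tiny
print(ms_decode(1997, 3))     # (1000, 997) -> perfectly recovered
```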
Additional Encoding Techniques
 Huffman Coding
 Information-theory-based technique
 An element of a signal that recurs often is represented by a shorter symbol, and its value is stored in a look-up table
 Implemented using look-up tables in the encoder and decoder
 Provides substantial lossless compression, but
requires high computational power and therefore is
not very popular
 Used by MPEG-1 and MPEG-2 AAC

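
A compact Python sketch of building such a look-up table with a min-heap (the classic construction; the symbol frequencies stand in for the signal statistics):

```python
import heapq
from collections import Counter

def huffman_table(data):
    """Build a Huffman look-up table (symbol -> bit string).
    Frequently recurring symbols receive shorter codes."""
    heap = [[freq, [sym, ""]] for sym, freq in Counter(data).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]   # left subtree: prepend a 0 bit
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]   # right subtree: prepend a 1 bit
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(heap[0][1:])

print(huffman_table("aaaaaabbbc"))  # e.g. {'a': '0', 'c': '10', 'b': '11'}
```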
Psychoacoustics
Limits of Human Hearing

– Time Domain Considerations

– Frequency Domain (Spectral) Considerations

– Amplitude vs. Power

– Masking in Time and Frequency Domains

– Sampling Rate and Signal Bandwidth


Limits of Human Hearing
 Time and Frequency

Events longer than 0.03 seconds are resolvable in time; shorter events are perceived as features in frequency

20 Hz < Human Hearing < 20 kHz (for those from 18 to 25)

"Pitch" is PERCEPTION related to FREQUENCY

Human Pitch Resolution is about 40 – 4000 Hz
Limits of Human Hearing
 Amplitude or Power???
– "Loudness" is PERCEPTION related to POWER, not AMPLITUDE

– Power is proportional to the (integrated) square of the signal

– Human Loudness perception range is about 120 dB, where +10 dB = 10x the power (dB are 10·log10 of a power ratio, or equivalently 20·log10 of an amplitude ratio, so +10 dB ≈ 3.16x the amplitude)

– Waveform shape is of little consequence. Energy at each frequency, and how that changes in time, is the most important feature of a sound.
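
The power/amplitude bookkeeping, as a small Python sketch:

```python
import math

def db_from_power_ratio(p):
    """Decibels from a power ratio: dB = 10 * log10(P1/P0)."""
    return 10 * math.log10(p)

def db_from_amplitude_ratio(a):
    """Decibels from an amplitude ratio: dB = 20 * log10(A1/A0),
    since power is proportional to amplitude squared."""
    return 20 * math.log10(a)

print(db_from_power_ratio(10))           # 10.0  -> +10 dB is 10x the power
print(db_from_amplitude_ratio(10**0.5))  # 10.0  -> i.e. ~3.16x the amplitude
print(db_from_power_ratio(1e12))         # 120.0 -> the full loudness range
```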
Limits of Human Hearing
 Waveshape or Frequency Content??
– Here are two waveforms with identical power spectra,
and which are (nearly) perceptually identical:

[Figure: Wave 1 and Wave 2, shown with the magnitude spectrum of either]
Limits of Human Hearing
Masking in Amplitude, Time, and Frequency

– Masking in Amplitude: Loud sounds 'mask' soft ones.
Example: Quantization Noise

– Masking in Time: A soft sound just before a louder sound is more likely to be heard than if it is just after.

– Masking in Frequency: A loud 'neighbor' frequency masks soft spectral components. Low sounds mask higher ones more than high sounds mask low ones.
Limits of Human Hearing
 Masking in Amplitude
 Intuitively, a soft sound will not be heard if there
is a competing loud sound. Reasons:
 Gain controls in the ear
 stapedius reflex and more
 Interaction (inhibition) in the cochlea
 Other mechanisms at higher levels
Limits of Human Hearing
 Masking in Time
 In the time range of a few milliseconds:
 A soft event following a louder event tends to be grouped perceptually as part of that louder event
 If the soft event precedes the louder event, it might be heard as a separate event (become audible)
Limits of Human Hearing
 Masking in Frequency

Only one component in this spectrum is audible because of frequency masking
Spectral Analysis
 Tasks of Spectral Analysis
 To derive masking thresholds to determine
which signal components can be eliminated
 To generate a representation of the signal to
which masking thresholds can be applied
 Spectral Analysis is done through
transforms or filter banks

Spectral Analysis
 Transforms
 Fast Fourier Transform (FFT)
 Discrete Cosine Transform (DCT) - similar to
FFT but uses cosine values only
 Modified Discrete Cosine Transform (MDCT)
[used by MPEG-1 Layer-III, MPEG-2 AAC,
Dolby AC-3] – overlapped and windowed
version of DCT

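
A direct-form MDCT sketch in Python (O(N²) and unwindowed for clarity; real coders overlap consecutive blocks by half, apply a window, and use FFT-based fast algorithms):

```python
import numpy as np

def mdct(x):
    """MDCT of a length-2N block: returns N coefficients.
    X[k] = sum_n x[n] * cos(pi/N * (n + 0.5 + N/2) * (k + 0.5))"""
    N = len(x) // 2
    n = np.arange(2 * N)[None, :]
    k = np.arange(N)[:, None]
    basis = np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
    return basis @ x

# A pure tone concentrates its energy in a few MDCT coefficients:
print(mdct(np.sin(2 * np.pi * 4 * np.arange(64) / 64)).round(2))
```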
Spectral Analysis
 Filter Banks
 Time sample blocks are passed through a set
of bandpass filters
 Masking thresholds are applied to resulting
frequency subband signals
 Polyphase and wavelet filter banks are the most popular structures

Compression Models

• Perceptual Models

• Production Models

• Event-Based Models

Perceptual Models
 Exploit masking, etc., to discard perceptually
irrelevant information.
 Example: Quantize soft sounds more accurately,
loud sounds less accurately

 Benefits: Generic, does not require assumptions about what produced the sound
 Drawbacks: Highest compression is difficult to achieve
Loudness and Pitch
(Review on Psychoacoustic Effects)

 More sensitive to loudness at mid frequencies than at other frequencies
 intermediate frequencies at [500 Hz, 5000 Hz]
 Human hearing frequencies at [20 Hz, 20000 Hz]

 Perceived loudness of a sound changes based on the frequency of that sound
 basilar membrane reacts more to intermediate frequencies than to other frequencies
Fletcher-Munson Contours

Each contour represents an equal perceived loudness level

Perceptual sensitivity (loudness) is not linear across all frequencies and intensities
Production Models
 Build a model of the sound production system, then
fit the parameters

 Example: If the signal is speech, then a well-parameterized vocal model can yield the highest quality and compression ratio

 Benefits: Highest possible compression
 Drawbacks: Signal source(s) must be assumed, known, or identified
MIDI and Other 'Event' Models
 Musical Instrument Digital Interface
Represents Music as Notes and Events and uses a
synthesis engine to “render” it.

 An Edit Decision List (EDL) is another example.
A history of source materials, transformations, and processing steps is kept. Operations can be undone or recreated easily.
Future: Multi-Model Parametric Compressors?
 Analysis front end identifies source(s)
 Audio is (separated and) sent to optimal
model(s)
Benefits:
 High compression
Drawbacks:
 Complexity
MPEG-1 Audio Encoding
 Characteristics
 Precision: 16 bits
 Sampling frequencies: 32 kHz, 44.1 kHz, 48 kHz
 3 compression layers: Layer 1, Layer 2, Layer
3 (MP3)
 Layer I: Uses sub-band coding 32-448 kbps, target 192
kbps
 Layer II: Uses sub-band coding (longer frames, more
compression) 32-384 kbps, target 128 kbps
 Layer III: Uses both sub-band coding and transform coding
32-320 kbps, target 64 kbps
MPEG Audio Encoding Steps
MPEG Audio Filter Bank
 Filter bank divides the input into 32 equal-width frequency sub-bands
 Sub-band output i is defined as:

St[i] = Σ(k=0..63) Σ(j=0..7) cos[(2i + 1)(k − 16)π / 64] · (C[k + 64j] · x[k + 64j])

 i ∈ [0, 31]; St[i] – filter output sample for sub-band i at time t
 C[n] – one of 512 analysis window coefficients
 x[n] – audio input sample from a 512-sample buffer
MPEG/audio divides the audio signal into frequency sub-bands that approximate critical bands, then quantizes each sub-band according to the audibility of quantization noise within that band.
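
A direct Python sketch of the analysis equation above (the 512 window coefficients C come from the table in the standard and are passed in, not reproduced here):

```python
import numpy as np

def subband_filter(x, C):
    """One step of the MPEG-1 analysis filter bank: 32 sub-band samples
    St[i] from the 512-sample input buffer x and window coefficients C."""
    windowed = C * x                                  # C[k+64j] * x[k+64j]
    z = windowed.reshape(8, 64).sum(axis=0)           # inner sum over j = 0..7
    i = np.arange(32)[:, None]
    k = np.arange(64)[None, :]
    M = np.cos((2 * i + 1) * (k - 16) * np.pi / 64)   # analysis matrix
    return M @ z                                      # outer sum over k = 0..63
```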
Dominant band and the mask
 The dominant band is found and the corresponding mask is applied
Quantization of Audible Sound
 The components that exceed the mask are quantized and encoded using the Huffman coding method
 Masking and Quantization (Example): performing the sub-band filtering step on the input results in the following values (for demonstration, we are only looking at the first 16 of the 32 bands):
 Level (dB): 0 8 12 10 6 2 10 60 35 20 15 2 3 5 3 1
 Band:       1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
 The 60 dB level of the 8th band gives a masking of 12 dB in the 7th band and 15 dB in the 9th (according to the psychoacoustic model). The level in the 7th band is 10 dB (< 12 dB), so ignore it. The level in the 9th band is 35 dB (> 15 dB), so send it. We only send the amount above the masking level.

 Determine the number of bits needed to represent the coefficient such that the noise introduced by quantization is below the masking effect, i.e. [noise introduced = 12 dB; masking = 15 dB]

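
The same decision, spelled out as a small Python sketch (the mask offsets are the illustrative numbers from the example above, not a real psychoacoustic model):

```python
levels = [0, 8, 12, 10, 6, 2, 10, 60, 35, 20, 15, 2, 3, 5, 3, 1]  # dB, bands 1..16
masks = {7: 12, 9: 15}  # masking thresholds induced by the 60 dB tone in band 8

for band, threshold in masks.items():
    level = levels[band - 1]
    if level < threshold:
        print(f"band {band}: {level} dB < {threshold} dB mask -> drop")
    else:
        print(f"band {band}: {level} dB >= {threshold} dB mask -> "
              f"send {level - threshold} dB above the mask")
```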
Rate control loop
 For a given bit rate allocation, adjust the
quantization steps to achieve the bit rate.
 This loop checks whether the number of bits resulting from the coding operation exceeds the number of bits available to code a given block of data.
 If so, the quantization step is increased to reduce the total bits.
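
A self-contained sketch of that loop in Python (the bit-cost estimate is a crude stand-in for the real Huffman bit count):

```python
import numpy as np

def rate_control(coeffs, bit_budget):
    """Grow the global quantizer step until the coded block fits the budget."""
    step = 1.0
    while True:
        q = np.round(coeffs / step).astype(int)
        cost = int(np.sum(np.log2(np.abs(q) + 1) + 1))  # crude bit estimate
        if cost <= bit_budget or not q.any():
            return q, step
        step *= 2 ** 0.25   # each pass quantizes a little more coarsely
```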
MPEG Audio Bit Allocation
 This process determines the number of code bits allocated to each sub-band, based on information from the psychoacoustic model
 Algorithm:
1. Compute mask-to-noise ratio: MNR = SNR − SMR (signal-to-mask ratio)
 Standard provides tables that give estimates for SNR resulting
from quantizing to a given number of quantizer levels
2. Search for sub-band with the lowest MNR
3. Allocate code bits to this sub-band.
 After a sub-band is allocated more code bits, look up the new estimate of its SNR and repeat from step 1
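
A greedy Python sketch of this loop (snr_table is a stand-in for the standard's SNR-per-quantizer-level tables; here roughly 6 dB per extra bit):

```python
import math

def allocate_bits(smr, total_bits, snr_table):
    """Repeatedly give one more bit to the sub-band with the lowest
    mask-to-noise ratio, MNR = SNR - SMR."""
    bits = [0] * len(smr)
    for _ in range(total_bits):
        mnr = [snr_table[b] - s if b + 1 < len(snr_table) else math.inf
               for b, s in zip(bits, smr)]
        worst = min(range(len(smr)), key=mnr.__getitem__)
        if mnr[worst] == math.inf:
            break                     # every band is at its finest quantizer
        bits[worst] += 1
    return bits

snr_per_bit = [0, 7, 13, 19, 25, 31, 37, 43]        # ~6 dB per extra bit
print(allocate_bits([20, 5, -3], 12, snr_per_bit))  # loudest band gets most bits
```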
Distortion control loop
 This loop shapes the quantization steps according to the perceptual masking threshold
 Start with a default factor 1.0 for every band
 If the quantization error in a band exceeds the mask
threshold, the scale factor is adjusted to reduce this
quantization error
 This requires more bits, so the rate control loop has to be invoked every time the scale factors are changed
 The distortion control is executed until the noise level
is below the perceptual mask for every band
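
A sketch of this outer loop, reusing the rate_control sketch above (per-coefficient masks for simplicity; real coders work on scale-factor bands and likewise bound the iteration count):

```python
import numpy as np

def distortion_control(coeffs, mask, bit_budget, max_iters=32):
    """Raise scale factors wherever quantization noise exceeds the mask,
    rerunning the rate loop after every change."""
    scale = np.ones(len(coeffs))
    for _ in range(max_iters):
        q, step = rate_control(coeffs * scale, bit_budget)  # inner rate loop
        error = np.abs(q * step / scale - coeffs)           # quantization noise
        noisy = error > mask
        if not noisy.any():
            break                        # every band is below its mask
        scale[noisy] *= 2 ** 0.25        # finer effective quantization there
    return q, step, scale
```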
Decoder
 The decoder side is relatively simple. The gain, scale factors, and quantization steps recovered are used to reconstruct the filter bank responses.
 Filter bank responses are combined to
reconstruct the decoded audio signal
MPEG Coding Specifications

MPEG Layer I
Filter is applied one frame (12 x 32 = 384 samples) at a time. At 48 kHz, each frame carries 8 ms of sound.
Uses a 512-point FFT to get detailed spectral information
about the signal. (sub-band filter).
Uses equal frequency spread per band.
Psychoacoustic model only uses frequency masking.

Typical applications: Digital recording on tapes, hard disks, or magneto-optical disks, which can tolerate the high bit rate. Highest quality is achieved with a bit rate of 384 kbps.
MPEG Layer II
Uses three frames in the filter (before, current, next: a total of 1152 samples). At 48 kHz, each frame carries 24 ms of sound.
--Models a little of the temporal masking.
--Uses a 1024-point FFT for greater frequency resolution.
--Uses equal frequency spread per band.
--Highest quality is achieved with a bit rate of 256 kbps.

Typical applications: Audio Broadcasting, Television, Consumer and Professional Recording, and Multimedia.

MPEG Layer III

Better critical-band filter is used (non-equal frequency bands)
Psychoacoustic model includes temporal masking effects, takes into account stereo redundancy, and uses a Huffman coder.
Stereo Redundancy Coding: Intensity stereo coding -- at
upper-frequency sub-bands, encode summed signals
instead of independent signals from left and right channels.

Middle/Side (MS) stereo coding -- encode middle (sum of left and right) and side (difference of left and right) channels.
Joint Stereo
 Joint stereo coding takes advantage of the fact
that both channels of a stereo channel pair
contain similar information
 These stereophonic irrelevancies and
redundancies are exploited to reduce the total
bitrate.
 Joint stereo is used in cases where only low
bitrates are available but stereo signals are
desired.
MP3 Audio Format

[Figure: MP3 file structure. Source: http://wiki.hydrogenaudio.org/images/e/ee/Mp3filestructure.jpg]
Successor of MP3
 Advanced Audio Coding (AAC, MPEG-2 AAC) – now part of MPEG-4 Audio
 Can deliver 320 kbps for five channels (5.1 Channel
system).
 Also capable of delivering high quality stereo sound at
bitrates of below 128 kbps.
 Inclusion of 48 full-bandwidth audio channels
 Supports 3 different profiles: Main, Low Complexity, Scalable Sampling Rate.
 Default audio format for iPhone, iPad, PlayStation, Nokia,
Android, BlackBerry
 Introduced in 1997 as MPEG-2 Part 7
 In 1999 – updated and included in MPEG-4
AAC's Improvements over MP3
 More sample frequencies (8-96 kHz)
 Arbitrary bit rates and variable frame
length
 Higher efficiency and simpler filterbank
 Uses pure MDCT (modified discrete cosine
transform)
 Used in Windows Media Audio
MPEG-4 Audio
 Variety of applications
 General audio signals
 Speech signals
 Synthetic audio
 Synthesized speech (structured audio)
MPEG-4 Audio Part 3
 Includes variety of audio coding technologies
 Lossy speech coding (e.g., CELP)
 CELP – code-excited linear prediction – speech
coding
 General audio coding (AAC)
 Hardware data compression
 Text-to-Speech interface
 Structured Audio (e.g., MIDI)
MPEG-4 Part 14
 Called MP4 with Extension .mp4
 Multimedia container format
 Stores digital video and audio streams and
allows streaming over Internet
 Container or wrapper format
 meta-file format whose spec describes how different data elements and metadata coexist in a computer file
Conclusion
 MPEG Audio is an integral part of the
MPEG standard to be considered together
with video
 MPEG-4 Audio represents a major
extension in terms of capabilities to
MPEG-1 Audio
