
COM 416 - MULTIMEDIA

Introduction
What is Multimedia?
The term multimedia means different things to different people. While a PC vendor
may take it to mean a PC that has sound capability, a DVD-ROM drive and
probably a better multimedia-enabled microprocessor, a consumer entertainment
vendor may take it to mean an interactive cable TV with hundreds of digital channels.
A computer science student would look at multimedia in terms of what it consists of:
applications that use multiple modalities to their advantage, including text, graphics,
animation, video, sound and, most likely, some level of interactivity.
It is safe to say that multimedia is computer information represented through audio,
graphics, images, video and animation in addition to text.
Components of Multimedia
The multiple modalities of text, audio, images, drawings, animation and video are the
components that, either alone or in combination of two or more, give rise to
multimedia applications such as video conferencing, telemedicine and cooperative work environments.
Multimedia and Hypermedia
While our traditional media (like books) are linear, that is, read from beginning to end,
a hypertext system is meant to be read in a nonlinear fashion.
Hypermedia is not constrained to be text-based. It can include other media such as
graphics, images and continuous media like sound and video. The World Wide Web
(WWW) is a good example of a hypermedia application.
We can therefore say hypermedia is one particular application of multimedia.
World Wide Web
The World Wide Web is the largest and most commonly used hypermedia application.
Owing to the amount of information available from web servers, the capacity to post such
information, and the ease of navigating it with a web browser, the WWW has become
enormously popular.
HyperText Transfer Protocol (HTTP)
HTTP is a protocol for the transmission of hypermedia which supports any file type. It
is a stateless request/response protocol: a client opens a connection to the
HTTP server and requests information, the server responds, and the connection is
terminated.
The basic format is:
Method URI version
Additional Headers

Message body
The Uniform Resource Identifier (URI) identifies the resource accessed, such as the
host name, always preceded by the token http://. The URI could be a URL.
Two popular methods are used in HTTP. They are the GET and POST methods. GET
specifies that the information requested is in the request string itself, while POST
method specifies that the resource pointed to in the URI should consider the message
body. POST is generally used for submitting HTML forms.
GET method has the following format:
GET URI HTTP/version
An example of a GET method is GET http://www.kanopoly.edu.ng/ HTTP/1.1
The basic response format is:
Version Status-Code Status-Phrase
Additional Headers
Message body
Status-Code is a number that identifies the response type (or error that occurs), and
Status-Phrase is a textual description of it. Two commonly seen status codes and
phrases are 200 OK when the request was processed successfully and 404 Not
Found when the URI does not exist.
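To make the request and response formats concrete, the exchange can be sketched with a raw socket in Python. This is only a sketch: the host name below is an assumed example, and a real client would also handle redirects, chunked transfers and persistent connections.

import socket

HOST = "www.example.com"            # illustrative host only
request = (
    "GET / HTTP/1.1\r\n"            # Method URI version
    f"Host: {HOST}\r\n"             # additional header
    "Connection: close\r\n"
    "\r\n"                          # blank line ends the headers
)

with socket.create_connection((HOST, 80)) as s:
    s.sendall(request.encode("ascii"))
    response = b""
    while chunk := s.recv(4096):    # read until the server closes the connection
        response += chunk

status_line = response.split(b"\r\n", 1)[0]
print(status_line.decode())         # e.g. "HTTP/1.1 200 OK"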
HyperText Markup Language (HTML)
HTML is the language used to publish hypermedia on the World Wide Web. It
is defined using SGML (Standard Generalized Markup Language), and derives
elements that describe generic document structure and formatting. Since it uses
ASCII, it is portable to all different computer hardware, which allows for global
exchange of information.
HTML uses tags to describe document elements. The tags are in the format <token
params> to define the start point of a document element and </token> to define the
end of the element. Some elements have only inline parameters and don't require
ending tags.
HTML divides the document into HEAD and BODY as follows:
<HTML>
<HEAD>
..
</HEAD>
<BODY>
.
</BODY>
</HTML>
The HEAD describes document definitions, which are parsed before any document
rendering is done. These include page title, resource links and meta-information the
author decides to specify. The BODY part describes the document structure and
content. Common structure elements are paragraphs, tables, forms, links, item lists
and buttons.
A simple HTML page is given below:

<!DOCTYPE HTML>
<HTML>
<HEAD>
<TITLE>
Web Page Demo
</TITLE>
</HEAD>
<BODY>
<P> This is the body of the page. Everything goes here
</P>
</BODY>
</HTML>
The current HTML standard is HTML Version 5.
Graphics and Image Data Representations
Graphics/Image Data Types
The number of file formats used in multimedia keeps on increasing. We shall be more
interested in GIF and JPG image file formats since these formats are distinguished by
the fact that most web browsers can decompress and display them.
1-Bit Images
Images consist of pixels, or pels (picture elements in digital images). A 1-bit image
consists of on and off bits only and is thus the simplest type of image. Each pixel is stored
as a single bit (0 or 1). Hence, such an image is also referred to as a binary image.
It is also called a 1-bit monochrome image, since it contains no colour. Monochrome 1-
bit images can be satisfactory for pictures containing only simple graphics and text.
8-Bit Gray-Level Images
Every pixel in an 8-bit image has a gray value between 0 and 255. Each pixel is
represented by a byte; for example, a dark pixel might have a value of 10 and a bright
one might be 230.
The whole image can be viewed as a two-dimensional array of pixel values. Such an
array is referred to as a bitmap, a representation of the graphics/image data that
parallels the manner in which it is stored in video memory.
Image resolution refers to the number of pixels in a digital image (higher resolutions
yield better quality).
24-Bit Colour Images
In a 24-bit colour image, each pixel is represented by three bytes, usually representing
RGB. Since each value is in the range 0 to 255, this format supports 256 x 256 x
256, or a total of 16,777,216, possible combined colours.
8-Bit Colour Images
Many systems can make use of only 8 bits of colour information (the so-called 256
colours) in producing a screen image.
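A quick way to see the effect of these bit depths is to compute the uncompressed storage each one needs; the 640 x 480 resolution below is just an assumed example, not a figure from the notes.

width, height = 640, 480
pixels = width * height

one_bit   = pixels / 8      # 1-bit monochrome: one bit per pixel
gray_8bit = pixels * 1      # 8-bit gray-level: one byte per pixel
rgb_24bit = pixels * 3      # 24-bit colour: three bytes (R, G, B) per pixel

for name, size in (("1-bit", one_bit), ("8-bit gray", gray_8bit), ("24-bit RGB", rgb_24bit)):
    print(f"{name:12s}{size / 1024:8.1f} KB")
# 1-bit: 37.5 KB, 8-bit gray: 300.0 KB, 24-bit RGB: 900.0 KB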
Popular File Formats
GIF
Graphics Interchange Format (GIF) was devised for transmitting graphical images
over phone lines via modem. The GIF standard uses the Lempel-Ziv-Welch algorithm,
modified slightly for image scanline packets to use the line grouping of pixels
effectively.
The GIF standard is limited to 8-bit colour images only. While this produces
acceptable colour, it is best suited for images with few distinctive colours.
The GIF image format has some interesting features. It allows for successive display of
pixels in widely spaced rows by a four-pass display process known as interlacing. It
also supports simple animation via a Graphic Control Extension block in the data.
JPEG
JPEG is the most important current standard for image compression. It was created by
a working group of the International Organization for Standardization (ISO) popularly
known as the Joint Photographic Experts Group.
The human vision system has some limitations, which JPEG takes advantage of to
achieve high rates of compression. The eye-brain system cannot see extremely fine
detail. This limitation is even more conspicuous for colour vision than for grayscale
(black and white).
PNG
With the popularity of the Internet, there have been efforts toward more system-independent
image formats. Portable Network Graphics (PNG) is one such format. It is meant to
supersede GIF and extend it in important ways.
Special features of PNG files include support for up to 48 bits of colour information.
Files may also contain gamma-correction information.
Fundamental Concepts in Video
Type of Video Signals
Video signals can be organized in three different ways: component video, composite
video and S-video.
Component Video
Component video makes use of three separate video signals for the red, green and blue
image planes. This kind of system has three wires (and connectors) connecting the
camera or other devices to a TV or monitor. Most computer systems use component
video, with separate R, G and B signals.
For any colour separation scheme, component video gives the best colour
reproduction, since there is no crosstalk between the three different channels, unlike
composite video or S-video. Component video, however, requires more bandwidth and
good synchronization of the three components.
Composite Video
In composite video, colour (chrominance) and intensity (luminance) signals are
mixed into a single carrier wave. Chrominance is a composite of two colour
components. This type of signal is used by broadcast colour TVs; it is downward
compatible with black-and-white TV.
When connecting to TVs or VCRs, composite video uses only one wire (and hence one
connector), and video colour signals are mixed, not sent separately. The audio signal is
another addition to this signal. Since colour information is mixed and both colour and
intensity are wrapped into the same signal, some interference between the luminance
and chrominance signals is inevitable.
S-Video
As a compromise, S-video (separated video or super-video) uses two wires: one for
luminance and another for a composite chrominance signal. As a result there is less
crosstalk between the colour information and the crucial gray-scale information.
The reason for placing luminance into its own part of the signal is that black-and-
white information is crucial for visual perception.
Analog Video
Most TV signals are still sent and received as analog signals. An analog signal samples
a time-varying image. So-called progressive scanning traces through a complete
picture (frame) row-wise for each time interval. A high resolution computer monitor
typically uses a time interval of 1/72 second.
In TV and in some monitors and multimedia standards, another system, interlaced
scanning, is used. Here, odd-numbered lines are traced first, then the even-numbered
lines. This results in odd and even fields; two fields make up one frame.
NTSC Video
The NTSC TV standard is mostly used in North America and Japan. It uses a familiar
4:3 aspect ratio (i.e. the ratio of picture width to height) and 525 scan lines per frame
at 30 frames per second.
NTSC follows the interlaced scanning system and each frame is divided into two fields,
with 262.5 lines/field.
PAL Video
PAL (Phase Alternating Line) is a TV standard originally invented by German
scientists. It uses 625 lines per frame, at 25 frames per second, with a 4:3 aspect ratio
and interlaced fields. Its broadcast TV signals are also used in composite video.
SECAM Video
SECAM stands for Systeme Electronique Couleur Avec Memoire. SECAM uses 625
scan lines per frame, at 25 frames per second, with a 4:3 aspect ratio and interlaced
fields.
SECAM and PAL are similar, differing slightly in their coding scheme.
Digital Video
The advantages of digital representation for video are many. It permits
o Storing video on digital devices or in memory, ready to be processed (noise
removal, cut and paste, and so on) and integrated into various multimedia
applications.
o Direct access, which makes nonlinear video editing simple.
o Repeated recording without degradation of image quality.
o Ease of encryption and better tolerance to channel noise.
High Definition TV (HDTV)
The main thrust of High Definition TV (HDTV) is not to increase the definition in
each unit area, but rather to increase the visual field, especially its width.
The salient difference between conventional TV and HDTV is that the latter has a
much wider aspect ratio of 16:9 instead of 4:3. Another feature of HDTV is its move
toward progressive scan. The rationale is that interlacing introduces serrated edges to
moving objects and flickers along horizontal edges.
Basics of Digital Audio
Digitization of sound
What is Sound?
Sound is a wave phenomenon that involves molecules of air being compressed and
expanded under the action of some physical device. For example, a speaker in an audio
system vibrates back and forth and produces a longitudinal pressure wave that we
perceive as sound.
Without air there is no sound. Since sound is a pressure wave, it takes on continuous
values, as opposed to digitized ones with a finite range. Nevertheless, if we wish to use
a digital version of sound waves, we must form digitized representations of audio
information.
Digitization
The values of a sound wave change over time in amplitude: the pressure increases or
decreases with time. The amplitude value is a continuous quantity. Since we are
interested in working with such data in computer storage, we must digitize the
analog signals produced by microphones. Digitization means conversion to a stream
of numbers, preferably integers for efficiency.
To fully digitize an analog signal we have to sample it both in time and amplitude.
Sampling means measuring the quantity we are interested in, usually at evenly spaced
intervals. The first kind of sampling is simply called sampling, and the rate at which it
is performed is called sampling frequency.
For audio, typical sampling rates are from 8 kHz to 48 kHz. The human ear can hear
from about 20 Hz to as much as 20 kHz. The human voice can reach approximately
4 kHz.
Sampling in the amplitude or voltage dimension is called quantization.
To decide how to digitize audio data, we need to answer the following questions:
1. What is the sampling rate?
2. How finely is the data to be quantized, and is the quantization uniform?
3. How is audio data formatted?
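A minimal sketch of the two digitization steps in Python, assuming a 1 kHz tone sampled at 8 kHz and quantized uniformly to 8 bits (all values are illustrative):

import math

SAMPLE_RATE = 8000       # samples per second (sampling in time)
TONE_HZ = 1000           # the "analog" signal: a 1 kHz sine wave
LEVELS = 256             # 8-bit uniform quantization (sampling in amplitude)

samples = []
for n in range(SAMPLE_RATE // 100):                    # 10 ms of audio
    t = n / SAMPLE_RATE                                # evenly spaced sampling instants
    amplitude = math.sin(2 * math.pi * TONE_HZ * t)    # continuous value in [-1, 1]
    code = round((amplitude + 1) / 2 * (LEVELS - 1))   # quantize to an integer 0..255
    samples.append(code)

print(samples[:8])    # first few quantized sample values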
Nyquist Theorem
The Nyquist Theorem states that: if a signal is band-limited that is, if it has a lower
limit f1 and an upper limit f2 of frequency components in the signal then we need a
sampling rate of at least 2(f1 f2).
Signal-to-Noise Ratio (SNR)
In any analog system, random fluctuations produce noise added to the signal, and the
measured voltage is thus incorrect. The ratio of the power of the correct signal to the
noise is called the signal-to-noise ratio (SNR). Therefore, the SNR is a measure of the
quality of the signal.
The SNR is usually measured in decibels (dB). The SNR value, in units of dB, is defined
in terms of base-10 logarithms of squared voltages:

SNR = 10 log10(V_signal^2 / V_noise^2) = 20 log10(V_signal / V_noise)
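As a small illustration of the formula, the voltages below are made-up example values, not measurements from the notes:

import math

def snr_db(v_signal: float, v_noise: float) -> float:
    """SNR in dB: 10*log10(Vsignal^2 / Vnoise^2) = 20*log10(Vsignal / Vnoise)."""
    return 20 * math.log10(v_signal / v_noise)

print(snr_db(1.0, 0.01))   # a signal 100 times the noise voltage gives 40.0 dB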

Audio Filtering
Prior to sampling, the audio signal is also usually filtered to remove unwanted
frequencies. The frequencies depend on the application. For speech, typically from
50 Hz to 10 kHz is retained. Other frequencies are blocked by a band-pass filter, also
called a band-limiting filter, which screens out lower and higher frequencies.
An audio music signal will typically contain from about 20 Hz up to 20 kHz. So the
band-pass filter for music will screen out frequencies outside this range.
Multimedia Data Compression
Lossless Compression Algorithms
With the emergence of multimedia technologies the quantum of data generated is
enormous. As a result, there is a need for a technique that will reduce the number of bits
involved in storing and/or transmitting this data. The process is referred to as
compression.
A general data compression scheme consists of an encoder, an intermediate medium and a decoder.

We call the output of the encoder codes or codewords. The intermediate medium could
either be data storage or a communication/computer network. If the compression and
decompression processes induce no information loss, the compression scheme is
lossless; otherwise, it is lossy.
If the total number of bits required to represent the data before compression is B0 and
the total number of bits required to represent the data after compression is B1, then we
define the compression ratio as

compression ratio = B0 / B1

For example, if an image originally requires 65,536 bits and its compressed version
requires 16,384 bits, the compression ratio is 4.
Basics of Information Theory
According to Claude E. Shannon, the entropy of an information source with alphabet
S = {s1, s2, ..., sn} is defined as:

H(S) = Σ (from i = 1 to n) p_i log2(1 / p_i) = - Σ (from i = 1 to n) p_i log2(p_i)

where p_i is the probability that symbol s_i in S will occur.


What is the entropy? In science, entropy is a measure of the disorder of a system the
more entropy, the more disorder. Typically, we add negative entropy to a system when
we impart more order to it.
The definition of entropy is aimed at identifying often-occurring symbols in the
datastream as good candidates for short codewords in the compressed bitstream. We
use a variable-length coding scheme for entropy coding - frequently-occurring symbols
are given codes that are quickly transmitted, while infrequently-occurring ones are
given longer codes. For example, E occurs frequently in English, so we should give it a
shorter code than Q, say.
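As a small illustration (not part of the original notes), the entropy of the symbol distribution in the word HELLO, which is used again in the coding examples below, can be computed directly from the definition:

import math
from collections import Counter

def entropy(text: str) -> float:
    """Entropy in bits per symbol, with probabilities taken as counts / total."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(round(entropy("HELLO"), 3))   # about 1.922 bits per symbol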
Run-Length Coding
Instead of assuming a memoryless source, run-length coding (RLC) exploits memory
present in the information source. It is one of the simplest forms of data compression.
The basic idea is that if the information source we wish to compress has the property
that symbols tend to form continuous groups, instead of coding each symbol in the
group individually, we can code one such symbol and the length of the group.
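A minimal run-length encoder sketch of this idea (the input string is just an assumed example):

def rle_encode(data: str) -> list[tuple[str, int]]:
    """Code each run as (symbol, run length) instead of repeating the symbol."""
    runs = []
    for symbol in data:
        if runs and runs[-1][0] == symbol:
            runs[-1] = (symbol, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((symbol, 1))               # start a new run
    return runs

print(rle_encode("aaaabbbcca"))   # [('a', 4), ('b', 3), ('c', 2), ('a', 1)]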
Variable-Length Coding
Since the entropy indicates the information content in an information source S, it
leads to a family of coding methods commonly known as entropy coding methods. As
described earlier, variable-length coding (VLC) is one of the best-known such
methods. Here, we will study the Shannon-Fano algorithm and Huffman coding.
Shannon-Fano Algorithm
To illustrate the algorithm, let's suppose the symbols to be coded are the characters in
the word HELLO.
Symbol H E L O
Count 1 1 2 1

The encoding steps of the Shannon-Fano algorithm can be presented in the following
top-down manner:
1. Sort the symbols according to the frequency count of their occurrences.
2. Recursively divide the symbols into two parts, each with approximately the same
number of counts, until all parts contain only one symbol.
A natural way of implementing the above procedure is to build a binary tree. As a
convention, let's assign bit 0 to its left branches and 1 to the right branches.

Coding tree for HELLO by the Shannon-Fano algorithm.
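A sketch of the top-down procedure in Python follows. The split point is chosen so the two halves have roughly equal counts; ties can be broken differently, so the exact codes may differ from the tree above while the total code length for HELLO (10 bits) stays the same.

def shannon_fano(symbols):
    """symbols: list of (symbol, count) pairs, sorted by descending count."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}            # a single leaf needs no further bits
    total = sum(count for _, count in symbols)
    running, split = 0, 1
    for i, (_, count) in enumerate(symbols[:-1], start=1):
        running += count
        if running >= total / 2:              # divide into two roughly equal halves
            split = i
            break
    codes = {}
    for sym, code in shannon_fano(symbols[:split]).items():
        codes[sym] = "0" + code               # left branch gets bit 0
    for sym, code in shannon_fano(symbols[split:]).items():
        codes[sym] = "1" + code               # right branch gets bit 1
    return codes

counts = [("L", 2), ("H", 1), ("E", 1), ("O", 1)]       # HELLO, sorted by count
print(shannon_fano(counts))
# {'L': '00', 'H': '01', 'E': '10', 'O': '11'} -- 10 bits in total for HELLO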


Huffman Coding
In contradistinction to Shannon-Fano, which is top-down, the encoding steps of the
Huffman algorithm are described in the following bottom-up manner. Let's use the
same example word, HELLO. A similar binary coding tree will be used as above, in
which the left branches are coded 0 and right branches 1. A simple list data structure is
also used.

Coding tree for HELLO using the Huffman algorithm


Huffman Coding Algorithm
1. Initialization; put all symbols on the list sorted according to their frequency counts.
2. Repeat until the list has only one symbol left.
(a) From the list, pick two symbols with the lowest frequency counts. Form a
Huffman subtree that has these two symbols as child nodes and create a parent
node for them.
(b) Assign the sum of the children's frequency counts to the parent and insert it
into the list, such that the order is maintained.
(c) Delete the children from the list.
3. Assign a codeword for each leaf based on the path from the root.
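A bottom-up sketch of the algorithm in Python, using a heap as the sorted list. Because ties between equal counts can be broken in different ways, the individual codes may differ from the tree above, but the total length for HELLO is again 10 bits.

import heapq

def huffman_codes(freqs: dict[str, int]) -> dict[str, str]:
    # Each heap entry is (count, tie breaker, {symbol: code_so_far}).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)      # the two lowest-count subtrees
        c2, _, right = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in left.items()}          # left child: 0
        merged.update({sym: "1" + code for sym, code in right.items()})   # right child: 1
        heapq.heappush(heap, (c1 + c2, tie, merged))   # parent gets the summed count
        tie += 1
    return heap[0][2]

print(huffman_codes({"H": 1, "E": 1, "L": 2, "O": 1}))
# {'H': '00', 'E': '01', 'O': '10', 'L': '11'} -- 10 bits in total for HELLO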
Dictionary-Based Coding
The Lempel-Ziv-Welch (LZW) algorithm employs an adaptive, dictionary-based
compression technique. Unlike variable-length coding, in which the lengths of the
codewords are different, LZW uses fixed-length codewords to represent variable-
length strings of symbols/characters that commonly occur together, such as words in
English text.
As in the other adaptive compression techniques, the LZW encoder and decoder build
up the same dictionary dynamically while receiving the data. Since a single code can now represent more
than one symbol/character, data compression is realized.
LZW proceeds by placing longer and longer repeated entries into a dictionary, then
emitting the code for an element rather than the string itself, if the element has
already been placed in the dictionary.
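A sketch of LZW encoding in Python; the dictionary starts with single characters and grows with longer strings as they are seen (the input string is an assumed example):

def lzw_encode(text: str) -> list[int]:
    dictionary = {chr(i): i for i in range(256)}   # initial single-character entries
    next_code = 256
    current = ""
    output = []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate                    # keep extending the longest match
        else:
            output.append(dictionary[current])     # emit the code for the longest match
            dictionary[candidate] = next_code      # add the new, longer string
            next_code += 1
            current = ch
    if current:
        output.append(dictionary[current])
    return output

print(lzw_encode("ABABABA"))   # [65, 66, 256, 258]: 'AB' and 'ABA' reuse dictionary codes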
Lossy Compression Algorithms
For image compression in multimedia applications, where a higher compression ratio
is required, lossy methods are usually adopted. In lossy compression, the compressed
image is usually not the same as the original image but is meant to form a close
approximation to the original image perceptually.
Distortion Measures
A distortion measure is a mathematical quantity that specifies how close an
approximation is to its original, using some distortion criteria. When looking at
compressed data, it is natural to think of the distortion in terms of the numerical
difference between the original data and the reconstructed data. However, when the
data to be compressed is an image, such a measure may not yield the intended result.
Quantization
Quantization in some form is the heart of any lossy scheme. Without quantization, we
would indeed be losing little information.
The source we are interested in compressing may contain a large number of distinct
output values (or even infinite, if analog). To efficiently represent the source output,
we have to reduce the number of distinct values to a much smaller set, via
quantization.
Each algorithm (each quantizer) can be uniquely determined by its partition of the
input range, on the encoder side, and the set of output values, on the decoder side. The
input and output of each quantizer can be either scalar values or vector values, thus
leading to scalar quantizers and vector quantizers.
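A sketch of a uniform scalar quantizer makes the encoder/decoder split concrete: the encoder maps an input value to the index of its partition cell, and the decoder maps the index back to the cell midpoint. The step size and input range below are assumed values.

STEP = 0.25               # width of each partition cell
LOW, HIGH = -1.0, 1.0     # input range being partitioned

def quantize(x: float) -> int:
    """Encoder side: index of the cell the input value falls into."""
    x = min(max(x, LOW), HIGH - 1e-9)     # clamp into the supported range
    return int((x - LOW) // STEP)

def dequantize(index: int) -> float:
    """Decoder side: reconstruct using the cell midpoint as the output value."""
    return LOW + (index + 0.5) * STEP

for x in (-0.8, 0.03, 0.61):
    i = quantize(x)
    print(f"{x:+.2f} -> cell {i} -> reconstructed {dequantize(i):+.3f}")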
Discrete Cosine Transform (DCT)
The Discrete Cosine Transform (DCT), a widely used transform coding technique, is
able to perform decorrelation of the input signal in a data-independent manner.
Because of this, it has gained tremendous popularity.
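The 2D DCT can be written out directly from its cosine basis. The sketch below transforms one 8 x 8 block the slow, literal way (real codecs use fast DCT algorithms); a constant block ends up with all of its energy in the single F(0, 0) coefficient, which illustrates the decorrelation.

import math

N = 8

def dct_2d(block):
    """block: an 8x8 list of lists of pixel values -> 8x8 DCT coefficients F(u, v)."""
    def c(k):                               # normalization factor
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for i in range(N):
                for j in range(N):
                    s += (block[i][j]
                          * math.cos((2 * i + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * j + 1) * v * math.pi / (2 * N)))
            out[u][v] = c(u) * c(v) * s
    return out

flat_block = [[100] * N for _ in range(N)]        # a constant (slowly varying) block
print(round(dct_2d(flat_block)[0][0], 1))         # 800.0 -- all energy in F(0, 0)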
Image Compression Standards
The JPEG Standard
JPEG is an image compression standard developed by the Joint Photographic Experts
Group. It was formally accepted as an international standard in 1992.
JPEG consists of a number of steps, each of which contributes to compression. We'll
look at the motivation behind these steps.
Main Steps in JPEG Image Compression
As we know, unlike one-dimensional audio signals, a digital image f(i, j) is not defined
over the time domain. Instead, it is defined over a spatial domain - that is, an image is
a function of the two dimensions i and j (or, conventionally, x and y). The 2D DCT is
used as one step in JPEG, to yield a frequency response that is a function F(u, v) in the
spatial frequency domain, indexed by two integers u and v.
JPEG is a lossy image compression method. The effectiveness of the DCT transform
coding method in JPEG relies on three major observations:
Observation 1. Useful image contents change relatively slowly across the
image - that is, it is unusual for intensity values to vary widely several times in a
small area, for example, in an 8 x 8 image block. Spatial frequency indicates
how many times pixel values change across an image block. The DCT formalizes
this notion with a measure of how much the image contents change in relation
to the number of cycles of a cosine wave per block.
Observation 2. Psychophysical experiments suggest that humans are much
less likely to notice the loss of very high-spatial-frequency components than
lower-frequency components.
Observation 3. Visual acuity (accuracy in distinguishing closely spaced lines)
is much greater for gray (black and white) than for color. We simply cannot see
much change in color if it occurs in close proximity; think of the blobby ink
used in comic books. This works simply because our eye sees the black lines
best, and our brain just pushes the color into place. In fact, ordinary broadcast
TV makes use of this phenomenon to transmit much less color information than
gray information.
JPEG Modes
The JPEG standard supports numerous modes (variations). Some of the commonly
used ones are:
Sequential Mode
Progressive Mode
Hierarchical Mode
Lossless Mode
Sequential Mode. This is the default JPEG mode. Each gray-level image or color
image component is encoded in a single left-to-right, top-to-bottom scan. We
implicitly assumed this mode in the discussions so far. The "Motion JPEG" video
codec uses Baseline Sequential JPEG, applied to each image frame in the video.
Progressive Mode. Progressive JPEG delivers low-quality versions of the image
quickly, followed by higher-quality passes, and has become widely supported in web
browsers. Such multiple scans of images are of course most useful when the speed of
the communication line is low. In Progressive Mode, the first few scans carry only a
few bits and deliver a rough picture of what is to follow. After each additional scan,
more data is received, and image quality is gradually enhanced. The advantage is that
the user-end has a choice whether to continue receiving image data after the first
scan(s).
Progressive JPEG can be realized in one of the following two ways. The main steps
(DCT, quantization, etc.) are identical to those in Sequential Mode.
Spectral selection: This scheme takes advantage of the spectral (spatial frequency
spectrum) characteristics of the DCT coefficients: the higher AC components provide
only detail information.
Successive approximation: Instead of gradually encoding spectral bands, all DCT
coefficients are encoded simultaneously, but with their most significant bits (MSBs)
first.
Hierarchical Mode. As its name suggests, Hierarchical JPEG encodes the image in
a hierarchy of several different resolutions. The encoded image at the lowest resolution
is basically a compressed low-pass-filtered image, whereas the images at successively
higher resolutions provide additional details (differences from the lower-resolution
images). Similar to Progressive JPEG, Hierarchical JPEG images can be transmitted in
multiple passes with progressively improving quality.
Lossless Mode. Lossless JPEG is a very special case of JPEG which indeed has no
loss in its image quality. However, it employs only a simple differential coding method,
involving no transform coding. It is rarely used, since its compression ratio is very low
compared to other, lossy modes. On the other hand, it meets a special need, and the
newly developed JPEG-LS standard is specifically aimed at lossless image
compression.
The JPEG2000 Standard
The JPEG standard is no doubt the most successful and popular image format to date.
The main reason for its success is the quality of its output for a relatively good
compression ratio. However, in anticipation of the needs and requirements of next-
generation imagery applications, the JPEG committee has defined a new standard:
JPEG2000.
The new JPEG2000 standard aims to provide not only a better rate-distortion tradeoff
and improved subjective image quality but also additional functionalities the current
JPEG standard lacks. In particular, the JPEG2000 standard addresses the following
problems:
Low-bitrate compression. The current JPEG standard offers excellent rate-
distortion performance at medium and high bitrates. However, at bitrates
below 0.25 bpp, subjective distortion becomes unacceptable. This is important
if we hope to receive images on our web-enabled ubiquitous devices, such as
web-aware wristwatches, and so on.
Lossless and lossy compression. Currently, no standard can provide
superior lossless compression and lossy compression in a single bitstream.
Large images. The new standard will allow image resolutions greater than
64 k x 64 k without tiling. It can handle image sizes up to 2^32 - 1.
Single decompression architecture. The current JPEG standard has 44
modes, many of which are application-specific and not used by the majority of
JPEG decoders.
Transmission in noisy environments. The new standard will provide
improved error resilience for transmission in noisy environments such as
wireless networks and the Internet.
Progressive transmission. The new standard provides seamless quality
and resolution scalability from low to high bitrates. The target bitrate and
reconstruction resolution need not be known at the time of compression.
Region-of-interest coding. The new standard permits specifying Regions of
Interest (ROI), which can be coded with better quality than the rest of the
image. We might, for example, like to code the face of someone making a
presentation with more quality than the surrounding furniture.
Computer-generated imagery. The current JPEG standard is optimized
for natural imagery and does not perform well on computer-generated
imagery.
Compound documents. The new standard offers metadata mechanisms for
incorporating additional non-image data as part of the file. This might be
useful for including text along with imagery, as one important example.
PNG Standard
Portable Network Graphics (PNG) is a raster graphics file format that supports
lossless data compression. PNG was created as an improved, non-patented
replacement for Graphics Interchange Format (GIF), and is the most used lossless
image compression format on the Internet.
PNG supports palette-based images (with palettes of 24-bit RGB or 32-bit RGBA
colors), grayscale images (with or without alpha channel), and full-color non-palette-
based RGB images (with or without alpha channel). PNG was designed for transferring
images on the Internet, not for professional-quality print graphics, and therefore does
not support non-RGB color spaces such as CMYK.
PNG files nearly always use file extension PNG or png and are assigned MIME media
type image/png. PNG was approved for this use by the Internet Engineering Steering
Group on 14 October 1996, and was published as an ISO/IEC standard in 2004.
Audio Compression Standards
MP3
MPEG-1 and/or MPEG-2 Audio Layer III, more commonly referred to as MP3, is an
audio coding format for digital audio which uses a form of lossy data compression. It is
a common audio format for consumer audio streaming or storage, as well as a de facto
standard of digital audio compression for the transfer and playback of music on most
digital audio players and computing devices.
The use of lossy compression is designed to reduce by a factor of 10 the amount of data
required to represent digital audio recordings yet still sound like the original
uncompressed audio to most listeners.
Compared to CD quality digital audio, MP3 compression commonly achieves 75 to
95% reduction in size. MP3 files are thus 1/4 to 1/20 the size of the original digital
audio stream. This is important for both transmission and storage concerns. The basis
for such comparison is the CD-ROM digital audio format which requires 1411200
bit/s. A commonly used MP3 encoding setting is CBR 128 kbit/s, resulting in a file
roughly 1/11 the size (about 9%, i.e. 91% compression) of the original CD-quality file.
The MP3 lossy compression works by reducing (or approximating) the accuracy of
certain parts of a continuous sound that are considered to be beyond the auditory
resolution ability of most people. This method is commonly referred to as perceptual
coding. It uses psychoacoustic models to discard or reduce precision of components
less audible to human hearing, and then records the remaining information in an
efficient manner.
MP3 was designed by the Moving Picture Experts Group (MPEG) as part of its MPEG-
1 standard and later extended in the MPEG-2 standard. MPEG-1 Audio (MPEG-1 Part
3), which included MPEG-1 Audio Layer I, II and III was approved as a committee
draft of ISO/IEC standard in 1991, finalized in 1992 and published in 1993 (ISO/IEC
11172-3:1993). A backwards compatible MPEG-2 Audio (MPEG-2 Part 3) extension
with lower sample and bit rates was published in 1995 (ISO/IEC 13818-3:1995).
MP3 is a streaming or broadcast format (as opposed to a file format) meaning that
individual frames can be lost without affecting the ability to decode successfully
delivered frames. Storing an MP3 stream in a file enables time shifted playback.
Basic Video Compression Techniques
Introduction to Video Compression
A video consists of a time-ordered sequence of frames - images. An obvious solution to
video compression would be predictive coding based on previous frames. For example,
suppose we simply created a predictor such that the prediction equals the previous
frame. Then compression proceeds by subtracting images: instead of subtracting the
image from itself (i.e., use a derivative), we subtract in time order and code the
residual error.
And this works. Suppose most of the video is unchanging in time. Then we get a nice
histogram peaked sharply at zero - a great reduction in terms of the entropy of the
original video, just what we wish for.
Video Compression Based on Motion compensation
A video can be viewed as a sequence of images stacked in the temporal dimension.
Since the frame rate of the video is often relatively high (e.g., 15 frames per second)
and the camera parameters (focal length, position, viewing angle, etc.) usually do not
change rapidly between frames, the contents of consecutive frames are usually similar,
unless certain objects in the scene move extremely fast. In other words, the video has
temporal redundancy.
Temporal redundancy is often significant, and it is exploited so that not every frame of
the video needs to be coded independently as a new image. Instead, the difference
between the current frame and other frame(s) in the sequence is coded. If the redundancy
between them is great enough, the difference images consist mainly of small
values and have low entropy, which is good for compression.
As we mentioned, although a simplistic way of deriving the difference image is to
subtract one image from the other (pixel by pixel), such an approach is ineffective in
yielding a high compression ratio. Since the main cause of the difference between
frames is camera and/or object motion, these motion generators can be
compensated by detecting the displacement of corresponding pixels or regions in
these frames and measuring their differences. Video compression algorithms that
adopt this approach are said to be based on motion compensation (MC). The three
main steps of these algorithms are:
1. Motion estimation (motion vector search)
2. Motion-compensation-based prediction
3. Derivation of the prediction error, that is, the difference between the current block and its motion-compensated prediction (a sketch of these steps follows below)
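A minimal sketch of the three steps for a single block, assuming an exhaustive search over a small window and the sum of absolute differences (SAD) as the matching criterion; block size, search range and frame contents are all illustrative.

def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b) for row_a, row_b in zip(block_a, block_b)
                          for a, b in zip(row_a, row_b))

def get_block(frame, top, left, size):
    return [row[left:left + size] for row in frame[top:top + size]]

def motion_search(prev_frame, cur_frame, top, left, size=4, search=2):
    """Step 1: find the motion vector (dy, dx) that best predicts the current block."""
    target = get_block(cur_frame, top, left, size)
    best, best_cost = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if 0 <= y <= len(prev_frame) - size and 0 <= x <= len(prev_frame[0]) - size:
                cost = sad(target, get_block(prev_frame, y, x, size))
                if best_cost is None or cost < best_cost:
                    best_cost, best = cost, (dy, dx)
    return best

def prediction_error(prev_frame, cur_frame, top, left, mv, size=4):
    """Steps 2 and 3: motion-compensated prediction, then the residual to be coded."""
    dy, dx = mv
    predicted = get_block(prev_frame, top + dy, left + dx, size)
    current = get_block(cur_frame, top, left, size)
    return [[c - p for c, p in zip(cur_row, pred_row)]
            for cur_row, pred_row in zip(current, predicted)]

# A tiny frame shifted right by one pixel: the search recovers the vector (0, -1),
# and the residual for this block is then all zeros.
prev = [[(r * 8 + c) % 251 for c in range(8)] for r in range(8)]
cur = [[prev[r][(c - 1) % 8] for c in range(8)] for r in range(8)]
mv = motion_search(prev, cur, 2, 2)
print(mv)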
H.263
H.263 is an improved video coding standard for videoconferencing and other audio-
visual services transmitted on Public Switched Telephone Networks (PSTN). It aims at
low bitrate communications at bitrates of less than 64 kbps. It was adopted by the
ITU-T Study Groups in 1995. It uses predictive coding for inter-frames, to reduce
temporal redundancy, and transform coding for the remaining signal, to reduce spatial
redundancy (for both intra-frames and difference macroblocks from inter-frame
prediction).
MP4
MPEG-4 Part 14 or MP4 is a digital multimedia container format most commonly used
to store video and audio, but can also be used to store other data such as subtitles and
still images. Like most modern container formats, it allows streaming over the
Internet. The only official filename extension for MPEG-4 Part 14 files is .mp4, but
many have other extensions, most commonly .m4a and .m4p. M4A (audio only) is
often compressed using AAC encoding (lossy), but can also be in Apple Lossless
format. M4P is a protected format which employs DRM technology to restrict copying.
MPEG-4 Part 14 (formally ISO/IEC 14496-14:2003) is a standard specified as a part of
MPEG-4.
.MP4 versus .M4A
M4A stands for MPEG 4 Audio and is a filename extension used to represent audio
files.
The existence of two different filename extensions, .MP4 and .M4A, for naming audio-
only MP4 files has been a source of confusion among users and multimedia playback
software. Some file managers, such as Windows Explorer, look up the media type and
associated applications of a file based on its filename extension. But since MPEG-4
Part 14 is a container format, MPEG-4 files may contain any number of audio, video,
and even subtitle streams, making it impossible to determine the type of streams in an
MPEG-4 file based on its filename extension alone. In response, Apple Inc. started
using and popularizing the .m4a filename extension, which is used for MP4 containers
with audio data in the lossy Advanced Audio Coding (AAC) or its own Apple Lossless
(ALAC) formats. Software capable of audio/video playback should recognize files with
either .m4a or .mp4 filename extensions, as would be expected, since there are no file
format differences between the two. Most software capable of creating MPEG-4 audio
will allow the user to choose the filename extension of the created MPEG-4 files.
