Seminar Report On MPEG-7
( SEMINAR REPORT )
CHANDERSHEEL PAHAL
07-CS-036
( DEPARTMENT OF CSE )
LINGAYA’S INSTITUTE OF
MANAGEMENT & TECHNOLOGY
ACKNOWLEDGEMENT
It is indeed a matter of immense pleasure and privilege for me to present this seminar report on “MPEG-7”.
Needless to say, without the help and support I have received, successful completion of my research for this seminar would not have been possible.
CHANDERSHEEL PAHAL
7CS-36, B.TECH 8th sem.
Computer Science.
LIMAT.
WHY HAVE I CHOSEN THIS TOPIC?
Multimedia has become a part of everyone’s life. The fast-growing technologies of the Internet world have created an interest in knowing the technologies concerned. Since it is more than a mammoth task to know about all the technologies and architectures of the multimedia world, I have chosen the topic “MPEG-7” so that I can begin to develop my understanding of multimedia technologies from this topic.
Keeping all the minute aspects in mind, I have chosen this topic for the seminar as per the curriculum of the RTU Final Year. The research work done in exploring the intricacies of this topic has left an essence which always reminds me of the great pleasure I felt while studying it. I have referred to a number of books and web links for this research work. All the references have been specified clearly in the References part of this report.
Finally, I would like to thank God for keeping me well during the research work period. I would also like to thank the Seminar Incharge for supporting me in doing this research work.
INDEX
TOPIC
1. MPEG-7: AN INTRODUCTION
2. CONTEXT OF MPEG-7
3. MPEG-7 OBJECTIVES
4. SCOPE OF THE STANDARD
5. MPEG-7 APPLICATION AREAS
6. MPEG-7 PARTS
7. METHOD OF WORK AND DEVELOPMENT SCHEDULE
8. MAJOR FUNCTIONALITIES IN MPEG-7
9. MPEG-7 VISUAL
- Basic Structures
- Color Descriptors
- Texture Descriptors
- Shape Descriptors
- Motion Descriptors
- Localization
- Others
10. MPEG-7 AUDIO
- Audio Framework
- High-Level Audio Description Tools
11. MPEG-7 CONFORMANCE TESTING
12. MPEG-7 EXTRACTION AND USE OF DESCRIPTIONS
13. REFERENCES
INTRODUCTION
Accessing audio and video used to be a simple matter - simple because of the simplicity
of the access mechanisms and because of the poverty of the sources. An
incommensurable amount of audiovisual information is becoming available in digital
form, in digital archives, on the World Wide Web, in broadcast data streams and in
personal and professional databases, and this amount is only growing. The value of
information often depends on how easily it can be found, retrieved, accessed, filtered and
managed.
The transition between the second and third millennium abounds with new ways to
produce, offer, filter, search, and manage digitized multimedia information. Broadband
is being offered with increasing audio and video quality and speed of access. The trend
is clear: in the next few years, users will be confronted with such a large amount of content provided by multiple sources that efficient and accurate access to this almost
infinite amount of content seems unimaginable today. In spite of the fact that users
have increasing access to these resources, identifying and managing them efficiently is
becoming more difficult, because of the sheer volume. This applies to professional as
well as end users. The question of identifying and managing content is not just
restricted to database retrieval applications such as digital libraries, but extends to areas
like broadcast channel selection, multimedia editing, and multimedia directory services.
This challenging situation demands a timely solution to the problem. MPEG-7 is the
answer to this need.
MPEG-7 offers a comprehensive set of audiovisual Description Tools (the metadata
elements and their structure and relationships, that are defined by the standard in the
form of Descriptors and Description Schemes) to create descriptions (i.e., a set of
instantiated Description Schemes and their corresponding Descriptors at the user’s will),
which will form the basis for applications enabling the needed effective and efficient
access (search, filtering and browsing) to multimedia content. This is a challenging task
given the broad spectrum of requirements and targeted multimedia applications, and
the broad number of audiovisual features of importance in such context.
CONTEXT OF MPEG-7
Audiovisual sources will play an increasingly pervasive role in our lives, and there will
be a growing need to have these sources processed further. This makes it necessary to
develop forms of audiovisual information representation that go beyond the simple
waveform or sample-based, compression-based (such as MPEG-1 and MPEG-2) or even
objects-based (such as MPEG-4) representations. Forms of representation that allow
some degree of interpretation of the information meaning are necessary. These forms
can be passed onto, or accessed by, a device or a computer code. For example, an image sensor may produce visual data not in the form of PCM samples (pixel values) but in the form of objects with associated physical measures and time information. These could then be stored and processed to verify if certain programmed conditions are met. A PVR could receive descriptions of the audiovisual information associated with a program that would enable it to record, for example, only news with the
exclusion of sport. Products from a company could be described in such a way that a
machine could respond to unstructured queries from customers making inquiries.
MPEG-7 is a standard for describing the multimedia content data that will support
these operational requirements. The requirements apply, in principle, to both real-time
and non real-time as well as push and pull applications. MPEG-7 does not standardize
or evaluate applications, although in the development of the MPEG-7 standard
applications have been used for understanding the requirements and evaluation of
technology. It must be made clear that the requirements are derived from analyzing a
wide range of potential applications that could use MPEG-7 tools. MPEG-7 is not aimed
at any one application in particular; rather, the elements that MPEG-7 standardizes
support as broad a range of applications as possible.
MPEG-7 OBJECTIVES
In October 1996, MPEG started a new work item to provide a solution to the questions
described above. The new member of the MPEG family, named "Multimedia Content
Description Interface" (in short MPEG-7), provides standardized core technologies
allowing the description of audiovisual data content in multimedia environments. It
extends the limited capabilities of proprietary solutions in identifying content that exist
today, notably by including more data types.
Audiovisual data content that has MPEG-7 descriptions associated with it, may include:
still pictures, graphics, 3D models, audio, speech, video, and composition information
about how these elements are combined in a multimedia presentation (scenarios). A
special case of these general data types is facial characteristics.
MPEG-7 descriptions do not, however, depend on the ways the described content is
coded or stored. It is possible to create an MPEG-7 description of an analogue movie or
of a picture that is printed on paper, in the same way as of digitized content.
MPEG-7 allows different granularity in its descriptions, offering the possibility to have
different levels of discrimination. Even though the MPEG-7 description does not depend
on the (coded) representation of the material, MPEG-7 can exploit the advantages
provided by MPEG-4 coded content. If the material is encoded using MPEG-4, which
provides the means to encode audio-visual material as objects having certain relations
in time (synchronization) and space (on the screen for video, or in the room for audio),
it will be possible to attach descriptions to elements (objects) within the scene, such as
audio and visual objects.
Because the descriptive features must be meaningful in the context of the application,
they will be different for different user domains and different applications. This implies
that the same material can be described using different types of features, tuned to the
area of application. To take the example of visual material: a lower abstraction level
would be a description of e.g. shape, size, texture, color, movement (trajectory) and
position (‘where in the scene can the object be found?’); and for audio: key, mood,
tempo, tempo changes, position in sound space. The highest level would give semantic
information: ‘This is a scene with a barking brown dog on the left and a blue ball that
falls down on the right, with the sound of passing cars in the background.’ Intermediate
levels of abstraction may also exist.
The level of abstraction is related to the way the features can be extracted: many low-
level features can be extracted in fully automatic ways, whereas high level features need
(much) more human interaction.
Besides a description of what is depicted in the content, MPEG-7 descriptions can also include other types of information about the multimedia data:
· The form - An example of the form is the coding format used (e.g. JPEG, MPEG-2), or the overall data size. This information helps determine whether the material can be ‘read’ by the user terminal;
· Conditions for accessing the material - This includes links to a registry with intellectual
property rights information, and price;
· Classification - This includes parental rating, and content classification into a number
of pre-defined categories;
· Links to other relevant material - The information may help the user speeding up the
search;
· The context - In the case of recorded non-fiction content, it is very important to know
the occasion of the recording (e.g. Olympic Games 1996, final of 200 meter hurdles,
men).
The main elements of the MPEG-7 standard are:
· Description Tools: Descriptors (D), that define the syntax and the semantics of each
feature (metadata element); and Description Schemes (DS), that specify the structure
and semantics of the relationships between their components, that may be both
Descriptors and Description Schemes;
· System tools, to support binary coded representation for efficient storage and
transmission, transmission mechanisms (both for textual and binary formats),
multiplexing of descriptions, synchronization of descriptions with content, management
and protection of intellectual property in MPEG-7 descriptions, etc.
Using MPEG-7 Description Tools, a description of multimedia content may include:
· Information describing the creation and production processes of the content (director,
title, short feature movie).
· Information related to the usage of the content (copyright pointers, usage history,
broadcast schedule).
· Information about the storage features of the content (storage format, encoding).
· Information about low level features in the content (colors, textures, sound timbres,
melody description).
· Conceptual information of the reality captured by the content (objects and events,
interactions among objects).
· Information about the interaction of the user with the content (user preferences, usage
history).
All these descriptions are of course coded in an efficient way for searching, filtering, etc.
A description generated using MPEG-7 Description Tools will be associated with the
content itself, to allow fast and efficient searching for, and filtering of material that is of
interest to the user.
MPEG-7 data may be physically located with the associated AV material, in the same
data stream or on the same storage system, but the descriptions could also live
somewhere else on the globe. When the content and its descriptions are not co-located,
mechanisms that link the multimedia material and their MPEG-7 descriptions are
needed; these links will have to work in both directions.
SCOPE OF THE STANDARD
MPEG-7 addresses applications that can be stored (on-line or off-line) or streamed (e.g.
broadcast, push models on the Internet), and can operate in both real-time and non
real-time environments. A ‘real-time environment’ in this context means that the
description is generated while the content is being captured.
The DDL allows the definition of the MPEG-7 description tools, both Descriptors and
Description Schemes, providing the means for structuring the Ds into DSs. The DDL
also allows the extension for specific applications of particular DSs. The description
tools are instantiated as descriptions in textual format (XML) thanks to the DDL (based
on XML Schema). The binary format of descriptions is obtained by means of the BiM encoding defined in the Systems part.
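To make the textual format concrete, the following is a minimal sketch of an MPEG-7-style description built with Python's standard XML library. The element names used here (Mpeg7, Description, CreationInformation, VisualDescriptor) are illustrative of the standard's naming style rather than a verified, schema-valid instance.

```python
import xml.etree.ElementTree as ET

# Build a minimal MPEG-7-style textual description (illustrative element
# names; a real description must validate against the ISO/IEC 15938 schema).
mpeg7 = ET.Element("Mpeg7")
description = ET.SubElement(mpeg7, "Description")
creation = ET.SubElement(description, "CreationInformation")
ET.SubElement(creation, "Title").text = "Example video clip"
ET.SubElement(creation, "Creator").text = "Unknown director"

# A low-level visual feature attached to the same content.
visual = ET.SubElement(description, "VisualDescriptor", {"type": "DominantColor"})
visual.text = "120 64 32"  # e.g. one dominant RGB value

# Serialize to the textual (XML) format; a BiM encoder would produce the
# equivalent binary access units from the same description tree.
print(ET.tostring(mpeg7, encoding="unicode"))
```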
MPEG-7 APPLICATION AREAS
All application domains making use of multimedia will benefit from MPEG-7.
Considering that at the present day it is hard to find an application domain that does not use multimedia, the examples below are only a small selection:
· Architecture, real estate, and interior design (e.g., searching for ideas).
· Journalism (e.g. searching speeches of a certain politician using his name, his voice or
his face).
· Multimedia editing (e.g., personalized electronic news service, media authoring).
The way MPEG-7 descriptions will be used to answer user queries or filtering operations
is outside the scope of the standard. The type of content and the query do not have to be
the same; for example, visual material may be queried and filtered using visual content,
music, speech, etc. It is the responsibility of the search engine and filter agent to match
the query data to the MPEG-7 description.
· Play a few notes on a keyboard and retrieve a list of musical pieces similar to the
required tune, or images matching the notes in a certain way, e.g. in terms of emotions.
· Draw a few lines on a screen and find a set of images containing similar graphics,
logos, ideograms,...
· Define objects, including color patches or textures and retrieve examples among which
you select the interesting objects to compose your design.
· Using an excerpt of Pavarotti’s voice, obtain a list of Pavarotti’s records, video clips
where Pavarotti is singing and photographic material portraying Pavarotti.
MPEG-7 PARTS
1. MPEG-7 Systems – the tools needed to prepare MPEG-7 descriptions for efficient
transport and storage and the terminal architecture.
2. MPEG-7 Description Definition Language - the language for defining the syntax of
the MPEG-7 Description Tools and for defining new Description Schemes.
3. MPEG-7 Visual – the Description Tools dealing with (only) Visual descriptions.
4. MPEG-7 Audio – the Description Tools dealing with (only) Audio descriptions.
5. MPEG-7 Multimedia Description Schemes – the Description Tools dealing with generic features and multimedia descriptions.
6. MPEG-7 Reference Software – a software implementation of relevant parts of the MPEG-7 standard (the eXperimentation Model).
7. MPEG-7 Conformance Testing – guidelines and procedures for testing conformance of MPEG-7 implementations.
8. MPEG-7 Extraction and use of descriptions – informative material (in the form of a Technical Report) about the extraction and use of some of the Description Tools.
9. MPEG-7 Profiles and Levels – guidelines and standard profiles and levels.
10. MPEG-7 Schema Definition – specifies the schema using the Description Definition Language.
Besides the different official MPEG-7 parts, there are also MPEG-7 Liaisons within the
broader scope of the MPEG Liaisons activity. MPEG Liaisons deals with organizing
formal collaboration between MPEG and other related activities under development in
other standardization bodies. Currently MPEG-7 related liaisons include, among others,
SMPTE, TV-Anytime, EBU P/Meta, Dublin Core and W3C.
METHOD OF WORK AND DEVELOPMENT SCHEDULE
The method of development has been comparable to that of the previous MPEG
standards. MPEG work is usually carried out in three stages: definition, competition,
and collaboration. In the definition phase, the scope, objectives and requirements for
MPEG-7 were defined. In the competitive stage, participants worked on their
technology by themselves. The end of this stage was marked by the MPEG-7 Evaluation
following an open Call for Proposals (CfP). The Call asked for relevant technology fitting
the requirements. In answer to the Call, all interested parties, whether or not they had participated in MPEG, were invited to submit their technology to
MPEG. Some 60 parties submitted, in total, almost 400 proposals, after which MPEG
made a fair expert comparison between these submissions.
Selected elements of different proposals were incorporated into a common model (the
eXperimentation Model, or XM) during the collaborative phase of the standard with the
goal of building the best possible model, which was in essence a draft of the standard
itself. During the collaborative phase, the XM was updated and improved in an iterative
fashion, until MPEG-7 reached the Committee Draft (CD) stage in October 2000, after
several versions of the Working Draft. Improvements to the XM were made through
Core Experiments (CEs). CEs were defined to test the existing tools against new
contributions and proposals, within the framework of the XM, according to well-defined
test conditions and criteria. Finally, those parts of the XM (or of the Working Draft) that
corresponded to the normative elements of MPEG-7 were standardized.
MPEG-7 (version 1) reached International Standard status in 2001, although the first parts were not published by ISO until 2002.
MAJOR FUNCTIONALITIES IN MPEG-7
The following subsections (MPEG-7 part ordered) contain the major functionalities
offered by the different parts of the MPEG-7 standard.
1. MPEG-7 Systems
MPEG-7 Systems includes currently the binary format for encoding MPEG-7
descriptions and the terminal architecture.
"... a language that allows the creation of new Description Schemes and, possibly,
Descriptors. It also allows the extension and modification of existing Description
Schemes."
The DDL is based on XML Schema Language. But because XML Schema Language has
not been designed specifically for audiovisual content description, there are certain
MPEG-7 extensions which have been added. As a consequence, the DDL can be broken
down into the following logical normative components:
3. MPEG-7 Visual
MPEG-7 Visual Description Tools consist of basic structures and Descriptors that cover
following basic visual features: color, texture, shape, motion, localization, and face
recognition. Each category consists of elementary and sophisticated Descriptors.
4. MPEG-7 Audio
MPEG-7 Audio provides a set of low-level Descriptors, for audio features that cut across many applications, and high-level Description Tools that are more specific to a set of applications. Those high-level tools include general sound recognition and indexing Description Tools, instrumental timbre Description Tools, spoken content Description Tools, an audio signature Description Scheme, and melodic Description Tools to facilitate query-by-humming.
5. MPEG-7 Multimedia Description Schemes
MPEG-7 Multimedia Description Schemes (also called MDS) comprises the set of
Description Tools (Descriptors and Description Schemes) dealing with generic as well as
multimedia entities.
Generic entities are features, which are used in audio and visual descriptions, and
therefore "generic" to all media. These are, for instance, "vector", "time", textual
description tools, controlled vocabularies, etc.
Apart from this set of generic Description Tools, more complex Description Tools are
standardized. They are used whenever more than one medium needs to be described
(e.g. audio and video.) These Description Tools can be grouped into 5 different classes
according to their functionality:
· Content description: representation of perceivable information, describing the structure and the semantics of the AV content;
· Content management: information about the media features, the creation and the usage of the AV content;
· Content organization: representation and analysis of collections of AV content and of descriptions;
· Navigation and access: specification of summaries and variations of the AV content;
· User interaction: description of user preferences and usage history pertaining to the consumption of the multimedia material.
6. MPEG-7 Reference Software
The eXperimentation Model (XM) software is the simulation platform for the MPEG-7 Descriptors (Ds), Description Schemes (DSs), Coding Schemes (CSs), and Description Definition Language (DDL). Besides the normative components, the simulation platform also needs some non-normative components, essentially procedural code that operates on the data structures. The data structures and the
procedural code together form the applications. The XM applications are divided in two
types: the server (extraction) applications and the client (search, filtering and/or
transcoding) applications.
7. MPEG-7 Conformance
MPEG-7 Conformance includes the guidelines and procedures for testing conformance
of MPEG-7 implementations.
9. MPEG-7 Profiles
The MPEG-7 "Profiles and Levels" collects standard profiles and levels for MPEG-7,
specified across ISO/IEC 15938 parts. While all parts are potential candidates for
profiling, current Profiles concentrate on the Description Definition Language [ISO/IEC
15938-2], Visual [ISO/IEC 15938-3], Audio [ISO/IEC 15938-4], Multimedia Description
Schemes [ISO/IEC 15938-5], which are based on the namespace versioning defined in
Schema Definition [ISO/IEC 15938-10].
The MPEG-7 "Schema Definition" collects the complete MPEG-7 schemas, collecting
them from the different standards, corrigenda and amendments.
MPEG-7 VISUAL
MPEG-7 Visual Description Tools included in the standard consist of basic structures
and Descriptors that cover the following basic visual features: Color, Texture, Shape,
Motion, Localization, and Face recognition. Each category consists of elementary and
sophisticated Descriptors.
Basic structures
There are five Visual-related basic structures: the Grid layout, the Time series, the Multiple view, the Spatial 2D coordinates, and the Temporal interpolation.
1.Grid layout
The grid layout is a splitting of the image into a set of equally sized rectangular regions,
so that each region can be described separately. Each region of the grid can be described
in terms of other Descriptors such as color or texture. Furthermore, the descriptor allows sub-Descriptors to be assigned either to all rectangular regions or to an arbitrary subset of them.
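As a rough, non-normative illustration of the grid layout idea, the sketch below splits an image into equally sized rectangular regions and computes a simple per-region feature (here the mean colour) that could be carried by a sub-Descriptor for each cell.

```python
import numpy as np

def grid_layout_features(image: np.ndarray, rows: int = 4, cols: int = 4) -> np.ndarray:
    """Split an H x W x 3 image into rows x cols equal regions and return
    the mean colour of each region (a stand-in for any per-region Descriptor)."""
    h, w, _ = image.shape
    features = np.zeros((rows, cols, 3))
    for r in range(rows):
        for c in range(cols):
            cell = image[r * h // rows:(r + 1) * h // rows,
                         c * w // cols:(c + 1) * w // cols]
            features[r, c] = cell.reshape(-1, 3).mean(axis=0)
    return features

# Example: a random 64 x 64 RGB image described by a 4 x 4 grid of mean colours.
print(grid_layout_features(np.random.rand(64, 64, 3)).shape)  # (4, 4, 3)
```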
2. Time Series
This descriptor defines a temporal series of Descriptors in a video segment and provides
image to video-frame matching and video-frames to video-frames matching
functionalities. Two types of TimeSeries are available: RegularTimeSeries and
IrregularTimeSeries. In the former, Descriptors are located at regular (constant) intervals within a given time span, which enables a simple representation for applications that require low complexity. In the latter, Descriptors are located irregularly (with varying intervals) within a given time span, which enables an efficient representation for applications with narrow transmission bandwidth or low storage capacity. These structures are useful in particular to build Descriptors that contain time series of Descriptors.
3. Multiple View
The Multiple View descriptor specifies a structure that combines 2D Descriptors, representing a visual feature of a 3D object seen from different view angles in the image plane, to describe features of the 3D (real world) object. The descriptor allows the matching of 3D objects by comparing their views, as well as comparing pure 2D views to 3D objects.
4. Spatial 2D Coordinates
It supports two kinds of coordinate systems: "local" and "integrated". In a "local" coordinate system, the coordinates used for the calculation of the description are mapped to the current coordinate system applicable. In an "integrated" coordinate
system, each image (frame) of e.g. a video may be mapped to different areas with
respect to the first frame of a shot or video.
5. Temporal Interpolation
The TemporalInterpolation descriptor characterizes a variable that changes over time (for example the position of a moving object) by a small set of interpolation functions that approximate the real data, so that values need not be stored for every time instant.
Figure 21: Real Data and Interpolation functions
Color Descriptors
There are seven Color Descriptors: Color space, Color Quantization, Dominant Colors,
Scalable Color, Color Layout, Color-Structure, and GoF/GoP Color.
1. Color space
This descriptor specifies the color space that is to be used in other color-based descriptions. The following color spaces are supported:
· R,G,B
· Y,Cr,Cb
· H,S,V
· HMMD
· Monochrome
2. Color Quantization
This descriptor defines a uniform quantization of a color space. The number of bins
which the quantizer produces is configurable, such that great flexibility is provided for a
wide range of applications. For a meaningful application in the context of MPEG-7, this
descriptor has to be combined with dominant color descriptors, e.g. to express the
meaning of the values of dominant colors.
3. Dominant Color(s)
This color descriptor is most suitable for representing local (object or image region)
features where a small number of colors are enough to characterize the color
information in the region of interest. Whole images are also applicable, for example, flag
images or color trademark images. Color quantization is used to extract a small number
of representing colors in each region/image. The percentage of each quantized color in
the region is calculated correspondingly. A spatial coherency on the entire descriptor is
also defined, and is used in similarity retrieval.
4. Scalable Color
The Scalable Color Descriptor is a Color Histogram in HSV Color Space, which is
encoded by a Haar transform. Its binary representation is scalable in terms of bin
numbers and bit representation accuracy over a broad range of data rates. The Scalable
Color Descriptor is useful for image-to-image matching and retrieval based on color
feature. Retrieval accuracy increases with the number of bits used in the representation.
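A minimal sketch of that idea, assuming pixels already converted to HSV with values in [0, 1]: build a coarse HSV histogram and apply a Haar transform to it, so that truncating coefficients yields progressively smaller, scalable representations. The bin counts are illustrative, not the normative values.

```python
import numpy as np

def haar_1d(x: np.ndarray) -> np.ndarray:
    """One level of a Haar transform: pairwise averages and differences."""
    even, odd = x[0::2], x[1::2]
    return np.concatenate([(even + odd) / 2.0, (even - odd) / 2.0])

def scalable_color(hsv_pixels: np.ndarray, bins=(16, 4, 4)) -> np.ndarray:
    """HSV histogram followed by a full Haar decomposition of the histogram."""
    hist, _ = np.histogramdd(hsv_pixels, bins=bins, range=[(0, 1)] * 3)
    coeffs = hist.flatten()
    n = coeffs.size
    while n > 1:                       # repeatedly transform the low-pass part
        coeffs[:n] = haar_1d(coeffs[:n])
        n //= 2
    return coeffs                      # keep fewer coefficients for a coarser description

hsv = np.random.rand(1000, 3)          # 1000 pixels in HSV, values in [0, 1]
print(scalable_color(hsv)[:16])        # a truncated, scalable representation
```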
5. Color Layout
This descriptor effectively represents the spatial distribution of color of visual signals in
a very compact form. This compactness allows visual signal matching functionality with
high retrieval efficiency at very small computational costs. It provides image-to-image
matching as well as ultra high-speed sequence-to-sequence matching, which requires many repetitions of similarity calculations. It also provides a very friendly user interface for hand-drawn sketch queries, since this descriptor captures the layout information of the color feature. Sketch queries are not supported by the other color descriptors.
Other advantages of this descriptor are:
· There is no dependency on image/video format, resolution, or bit-depth. The descriptor can be applied to any still pictures or video frames even though their resolutions are different. It can also be applied both to a whole image and to any connected or unconnected parts of an image with arbitrary shapes.
· The required hardware/software resources for the descriptor are very small. It needs as little as 8 bytes per image in the default video frame search, and the calculation complexity of both extraction and matching is very low. It is feasible to apply this descriptor to mobile terminal applications where the available resources are strictly limited due to hardware constraints.
· The captured feature is represented in the frequency domain, so that users can easily introduce the perceptual sensitivity of the human visual system into the similarity calculation.
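The sketch below follows the commonly described recipe behind this descriptor, as an illustration rather than the normative extraction: the image is reduced to an 8x8 grid of representative (average) colours, each channel grid is transformed with a 2-D DCT, and only a few low-frequency coefficients are kept. The channel handling and coefficient counts here are simplifying assumptions.

```python
import numpy as np

def dct2(block: np.ndarray) -> np.ndarray:
    """2-D DCT-II of a square block via the orthonormal DCT matrix."""
    n = block.shape[0]
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c @ block @ c.T

def color_layout(image: np.ndarray) -> np.ndarray:
    """Shrink the image to an 8x8 grid of average colours, DCT each channel
    and keep a handful of low-frequency coefficients (row-major selection is
    an illustrative simplification of the usual zig-zag ordering)."""
    h, w, _ = image.shape
    grid = np.zeros((8, 8, 3))
    for r in range(8):
        for c in range(8):
            cell = image[r * h // 8:(r + 1) * h // 8, c * w // 8:(c + 1) * w // 8]
            grid[r, c] = cell.reshape(-1, 3).mean(axis=0)
    return np.concatenate([dct2(grid[:, :, ch])[:2, :3].ravel() for ch in range(3)])

print(color_layout(np.random.rand(64, 64, 3)).shape)  # compact 18-value signature
```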
6. Color-Structure Descriptor
The Color structure descriptor is a color feature descriptor that captures both color
content (similar to a color histogram) and information about the structure of this
content. Its main functionality is image-to-image matching and its intended use is for
still-image retrieval, where an image may consist of either a single rectangular frame or
arbitrarily shaped, possibly disconnected, regions. The extraction method embeds color
structure information into the descriptor by taking into account all colors in a
structuring element of 8x8 pixels that slides over the image, instead of considering each
pixel separately. Unlike the color histogram, this descriptor can distinguish between two
images in which a given color is present in identical amounts but where the structure of
the groups of pixels having that color is different in the two images. Color values are
represented in the double-coned HMMD color space, which is quantized non-uniformly
into 32, 64, 128 or 256 bins. Each bin amplitude value is represented by an 8-bit code.
The Color Structure descriptor provides additional functionality and improved
similarity-based image retrieval performance for natural images compared to the
ordinary color histogram.
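A simplified sketch of the extraction idea described above, assuming the image has already been quantized to colour-bin indices (the normative HMMD quantization and bin re-mapping are omitted):

```python
import numpy as np

def color_structure_histogram(quantized: np.ndarray, n_bins: int = 32,
                              window: int = 8) -> np.ndarray:
    """Slide an 8x8 structuring element over an image of quantized colour
    indices and, at every position, count each colour once if it appears
    anywhere inside the window. Colours scattered over many locations thus
    score higher than the same number of pixels packed into one blob, which
    a plain histogram cannot distinguish."""
    h, w = quantized.shape
    hist = np.zeros(n_bins)
    for y in range(h - window + 1):
        for x in range(w - window + 1):
            present = np.unique(quantized[y:y + window, x:x + window])
            hist[present] += 1
    return hist / hist.sum()

img = np.random.randint(0, 32, size=(32, 32))   # stand-in for an HMMD-quantized image
print(color_structure_histogram(img)[:8])
```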
7. GoF/GoP Color
The Group of Frames/Group of Pictures (GoF/GoP) color descriptor extends the Scalable Color descriptor, which is defined for a still image, to the color description of a video segment or a collection of still images.
Texture Descriptors
There are three texture Descriptors: Homogeneous Texture, Edge Histogram, and
Texture Browsing.
1. Homogeneous Texture
Homogeneous texture has emerged as an important visual primitive for searching and
browsing through large collections of similar looking patterns. An image can be
considered as a mosaic of homogeneous textures so that these texture features
associated with the regions can be used to index the image data. For instance, a user
browsing an aerial image database may want to identify all parking lots in the image
collection. A parking lot with cars parked at regular intervals is an excellent example of
a homogeneous textured pattern when viewed from a distance, such as in an Air Photo.
Similarly, agricultural areas and vegetation patches are other examples of homogeneous
textures commonly found in aerial and satellite imagery. Examples of queries that could
be supported in this context could include "Retrieve all Land-Satellite images of Santa
Barbara which have less than 20% cloud cover" or "Find a vegetation patch that looks
like this region". To support such image retrieval, an effective representation of texture
is required. The Homogeneous Texture Descriptor provides a quantitative
representation using 62 numbers (quantified to 8 bits each) that is useful for similarity
retrieval. The extraction is done as follows: the image is first filtered with a bank of orientation- and scale-tuned filters (modeled using Gabor functions).
The first and the second moments of the energy in the frequency domain in the
corresponding sub-bands are then used as the components of the texture descriptor.
The number of filters used is 5x6 = 30 where 5 is the number of "scales" and 6 is the
number of "directions" used in the multi-resolution decomposition using Gabor
functions. An efficient implementation using projections and 1-D filtering operations
exists for feature extraction. The Homogeneous Texture descriptor provides a precise
quantitative description of a texture that can be used for accurate search and retrieval in
this respect. The computation of this descriptor is based on filtering using scale and
orientation selective kernels.
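A compact, non-normative sketch of this recipe: the image spectrum is weighted by a bank of 5 x 6 orientation- and scale-tuned Gabor-like filters, and the mean and deviation of the energy in each sub-band are collected into a 62-value feature vector. The filter bandwidths and centre frequencies used here are illustrative assumptions.

```python
import numpy as np

def gabor_energy_features(image: np.ndarray, scales: int = 5,
                          orientations: int = 6) -> np.ndarray:
    """Filter the image with 5 x 6 = 30 frequency-domain Gabor-like weights and
    keep the mean and standard deviation of the energy per sub-band, giving
    2 + 2 x 30 = 62 values (overall mean and deviation come first)."""
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    fy, fx = np.meshgrid(np.linspace(-0.5, 0.5, h), np.linspace(-0.5, 0.5, w),
                         indexing="ij")
    radius, angle = np.hypot(fy, fx), np.arctan2(fy, fx)
    feats = [image.mean(), image.std()]
    for s in range(scales):
        f0 = 0.4 / (2 ** s)                       # centre frequency per scale
        for o in range(orientations):
            theta = o * np.pi / orientations      # centre orientation
            weight = np.exp(-((radius - f0) ** 2) / (2 * (f0 / 2) ** 2)) \
                   * np.exp(-((angle - theta) ** 2) / (2 * (np.pi / 12) ** 2))
            energy = np.abs(spectrum) ** 2 * weight
            feats += [np.log1p(energy.mean()), np.log1p(energy.std())]
    return np.array(feats)

print(gabor_energy_features(np.random.rand(64, 64)).shape)  # (62,)
```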
2. Texture Browsing
The Texture Browsing Descriptor is useful for representing homogeneous texture for
browsing type applications, and requires only 12 bits (maximum). It provides a
perceptual characterization of texture, similar to a human characterization, in terms of
regularity, coarseness and directionality. The computation of this descriptor proceeds
similarly as the Homogeneous Texture Descriptor. First, the image is filtered with a
bank of orientation and scale tuned filters (modeled using Gabor functions); from the
filtered outputs, two dominant texture orientations are identified. Three bits are used to
represent each of the dominant orientations. This is followed by analyzing the filtered
image projections along the dominant orientations to determine the regularity
(quantified to 2 bits) and coarseness (2 bits x 2). The second dominant orientation and
second scale feature are optional. This descriptor, combined with the Homogeneous Texture Descriptor, provides a scalable solution for representing homogeneous texture regions in images.
3. Edge Histogram
The edge histogram descriptor represents the spatial distribution of five types of edges,
namely four directional edges and one non-directional edge. Since edges play an
important role for image perception, it can retrieve images with similar semantic
meaning. Thus, it primarily targets image-to-image matching (by example or by sketch),
especially for natural images with non-uniform edge distribution. In this context, the
image retrieval performance can be significantly improved if the edge histogram
descriptor is combined with other Descriptors such as the color histogram descriptor.
Besides, the best retrieval performances considering this descriptor alone are obtained
by using the semi-global and the global histograms generated directly from the edge
histogram descriptor as well as the local ones for the matching process.
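A simplified sketch of the extraction idea, using the commonly described ingredients (4x4 sub-images, five 2x2 edge filter masks, 80 histogram bins); the filter coefficients and the threshold are illustrative rather than normative values.

```python
import numpy as np

# 2x2 filter masks for the five edge types (vertical, horizontal, 45 degree,
# 135 degree, non-directional), as commonly described for this descriptor.
EDGE_FILTERS = {
    "vertical":        np.array([[1, -1], [1, -1]]),
    "horizontal":      np.array([[1, 1], [-1, -1]]),
    "diag_45":         np.array([[np.sqrt(2), 0], [0, -np.sqrt(2)]]),
    "diag_135":        np.array([[0, np.sqrt(2)], [-np.sqrt(2), 0]]),
    "non_directional": np.array([[2, -2], [-2, 2]]),
}

def edge_histogram(gray: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    """Divide the image into 4x4 = 16 sub-images, classify each 2x2 block by
    its strongest edge filter response, and build a 5-bin histogram per
    sub-image (16 x 5 = 80 bins)."""
    h, w = gray.shape
    hist = np.zeros((4, 4, 5))
    for by in range(0, h - 1, 2):
        for bx in range(0, w - 1, 2):
            block = gray[by:by + 2, bx:bx + 2]
            responses = [abs((block * f).sum()) for f in EDGE_FILTERS.values()]
            if max(responses) > threshold:
                sub_y, sub_x = min(by * 4 // h, 3), min(bx * 4 // w, 3)
                hist[sub_y, sub_x, int(np.argmax(responses))] += 1
    return hist.reshape(80)

print(edge_histogram(np.random.rand(64, 64)).sum())
```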
Shape Descriptors
There are three shape Descriptors: Region Shape, Contour Shape, and Shape 3D.
1. Region Shape
The shape of an object may consist of either a single region or a set of regions as well as
some holes in the object as illustrated in Figure 22. Since the Region Shape descriptor
makes use of all pixels constituting the shape within a frame, it can describe any shapes,
i.e. not only a simple shape with a single connected region as in Figure 22 (a) and (b)
but also a complex shape that consists of holes in the object or several disjoint regions
as illustrated in Figure 22 (c), (d) and (e), respectively. The Region Shape descriptor not
only can describe such diverse shapes efficiently in a single descriptor, but is also robust
to minor deformation along the boundary of the object.
Figure 22 (g), (h) and (i) are very similar shape images for a cup. The differences are at
the handle. Shape (g) has a crack at the lower handle while the handle in (i) is filled. The
region-based shape descriptor considers (g) and (h) similar but different from (i)
because the handle is filled. Similarly, Figure 22 (j-l) shows part of a video sequence where two disks are being separated. With the region-based descriptor, they are considered similar.
The descriptor is also characterized by its small size, fast extraction time and matching.
The data size for this representation is fixed to 17.5 bytes. The feature extraction and matching processes are straightforward, have low computational complexity, and are suitable for tracking shapes in video data processing.
2. Contour Shape
The Contour Shape descriptor captures characteristic shape features of an object or region based on its contour, using the Curvature Scale-Space (CSS) representation. Among its properties:
· It reflects properties of the perception of the human visual system and offers good generalization
· It is robust to non-rigid motion
· It is compact
Some of the above properties of this descriptor are illustrated in Figure 23, each frame
containing very similar images according to CSS, based on the actual retrieval results
from the MPEG-7 shape database.
Figure 23: (a) shape generalization properties (perceptual similarity among different
shapes), (b) robustness to non-rigid motion (man running), (c) robustness to partial
occlusion (tails or legs of the horses)
3. Shape 3D
The Shape 3D descriptor provides an intrinsic shape description of 3D mesh models, based on a histogram of a local geometric attribute (the shape index) computed over the mesh surface.
Motion Descriptors
There are four motion Descriptors: Camera Motion, Motion Trajectory, Parametric
Motion, and Motion Activity.
1. Camera Motion
This descriptor characterizes 3-D camera motion parameters. It is based on 3-D camera
motion parameter information, which can be automatically extracted or generated by
capture devices.
The camera motion descriptor supports the following well-known basic camera
operations (see Figure 24): fixed, panning (horizontal rotation), tracking (horizontal
transverse movement, also called traveling in the film industry), tilting (vertical
rotation), booming (vertical transverse movement), zooming (change of the focal
length), dollying (translation along the optical axis), and rolling (rotation around the
optical axis).
Figure 24: (a) Camera track, boom, and dolly motion modes, (b) Camera pan, tilt and
roll motion modes.
The sub-shots for which all frames are characterized by a particular type of camera
motion, which can be single or mixed, determine the building blocks for the camera
motion descriptor. Each building block is described by its start time, the duration, the
speed of the induced image motion, by the fraction of time of its duration compared
with a given temporal window size, and the focus-of-expansion (FOE) (or focus-of-
contraction – FOC). The Descriptor represents the union of these building blocks, and it
has the option of describing a mixture of different camera motion types. The mixture
mode captures the global information about the camera motion parameters,
disregarding detailed temporal information, by jointly describing multiple motion types,
even if these motion types occur simultaneously. On the other hand, the non-mixture mode captures the notion of pure motion types and their union within a certain time interval. Situations where multiple motion types occur simultaneously are then described as a union of the descriptions of the pure motion types. In this mode of description, the time window of a particular elementary segment can overlap with the time window of another elementary segment.
2. Motion Trajectory
The motion trajectory of an object is a simple, high level feature, defined as the
localization, in time and space, of one representative point of this object.
The descriptor is essentially a list of keypoints (x,y,z,t) along with a set of optional
interpolating functions that describe the path of the object between keypoints, in terms
of acceleration. The speed is implicitly known by the keypoints specification. The
keypoints are specified by their time instant and either their 2-D or 3-D Cartesian
coordinates, depending on the intended application. The interpolating functions are
defined for each component x(t), y(t), and z(t) independently.
Properties of this descriptor include:
· It is compact and scalable. Instead of storing object coordinates for each frame, the
granularity of the descriptor is chosen through the number of keypoints used for each
time interval. Besides, interpolating function-data may be discarded, as keypoint-data
are already a trajectory description.
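A small sketch of how such a keypoint-based trajectory can be evaluated, assuming first-order (constant-velocity) interpolation between keypoints; the standard also allows second-order interpolating functions that encode acceleration.

```python
import numpy as np

def trajectory_position(keypoints, t: float):
    """Keypoints are (t, x, y) tuples; the path between two keypoints is
    reconstructed here with constant-velocity (linear) interpolation."""
    keypoints = sorted(keypoints)
    for (t0, x0, y0), (t1, x1, y1) in zip(keypoints, keypoints[1:]):
        if t0 <= t <= t1:
            a = (t - t0) / (t1 - t0)
            return (x0 + a * (x1 - x0), y0 + a * (y1 - y0))
    raise ValueError("time outside the described interval")

# Object described by only three keypoints instead of one position per frame.
path = [(0.0, 10.0, 5.0), (1.0, 20.0, 5.0), (3.0, 20.0, 25.0)]
print(trajectory_position(path, 2.0))   # (20.0, 15.0)
```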
3. Parametric Motion
Parametric motion models have been extensively used within various related image
processing and analysis areas, including motion-based segmentation and estimation,
global motion estimation, mosaicing and object tracking. Parametric motion models
have been already used in MPEG-4, for global motion estimation and compensation and
sprite generation. Within the MPEG-7 framework, motion is a highly relevant feature,
related to the spatio-temporal structure of a video and concerning several MPEG-7
specific applications, such as storage and retrieval of video databases and hyperlinking
purposes. Motion is also a crucial feature for some domain specific applications that
have already been considered within the MPEG-7 framework, such as sign language
indexation.
The basic underlying principle consists of describing the motion of objects in video
sequences as a 2D parametric model. Specifically, affine models include translations, rotations, scaling and combinations of them; planar perspective models make it possible to take into account global deformations associated with perspective projections; and quadratic models make it possible to describe more complex movements.
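As a worked illustration of the affine case, the sketch below applies the usual six-parameter 2-D affine motion model to a set of pixel coordinates; the parameter values are arbitrary examples.

```python
import numpy as np

def affine_motion(points: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Apply a 2-D affine motion model to pixel coordinates:
        vx = a1 + a2*x + a3*y
        vy = a4 + a5*x + a6*y
    Affine models cover translation, rotation, scaling and their combinations;
    perspective and quadratic models add further parameters."""
    x, y = points[:, 0], points[:, 1]
    vx = a[0] + a[1] * x + a[2] * y
    vy = a[3] + a[4] * x + a[5] * y
    return np.stack([vx, vy], axis=1)

# Example: a small rotation plus translation described by six parameters.
params = np.array([1.0, 0.0, -0.05, 0.5, 0.05, 0.0])
pts = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
print(affine_motion(pts, params))       # per-pixel displacement vectors
```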
4. Motion Activity
The Motion Activity descriptor captures the intuitive notion of the ‘intensity of action’ or ‘pace of action’ in a video segment, so that, for example, a fast-moving sports sequence can be distinguished from a slow, largely static scene such as an interview.
Localization
There are two descriptors for localization: the Region Locator and the Spatio-temporal Locator.
1. Region Locator
The Region Locator enables localization of regions within images or frames by specifying them with a brief and scalable representation of a box or a polygon.
2. Spatio-temporal Locator
The Spatio-temporal Locator describes spatio-temporal regions in a video sequence, such as moving object regions, and provides localization functionality.
Others
1. Face Recognition
The FaceRecognition descriptor can be used to retrieve face images which match a
query face image. The descriptor represents the projection of a face vector onto a set of
basis vectors which span the space of possible face vectors. The FaceRecognition feature
set is extracted from a normalized face image. This normalized face image contains 56
lines with 46 intensity values in each line. The centers of the two eyes in each face image
are located on the 24th row and the 16th and 31st column for the right and left eye
respectively. This normalized image is then used to extract the one dimensional face
vector which consists of the luminance pixel values from the normalized face image
arranged into a one dimensional vector using a raster scan starting at the top-left corner
of the image and finishing at the bottom-right corner of the image. The FaceRecognition
feature set is then calculated by projecting the one dimensional face vector onto the
space defined by a set of basis vectors.
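A minimal sketch of this projection step, with an illustrative random basis standing in for the basis vectors specified by the standard:

```python
import numpy as np

def face_feature(normalized_face: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Raster-scan the normalized 56 x 46 luminance image into a 2576-element
    vector and project it onto a set of basis vectors (the basis here is
    random for illustration; the standard specifies the actual basis)."""
    assert normalized_face.shape == (56, 46)
    face_vector = normalized_face.reshape(-1)       # raster scan, top-left first
    return basis @ face_vector                      # one coefficient per basis vector

rng = np.random.default_rng(0)
basis_vectors = rng.standard_normal((48, 56 * 46))  # 48 illustrative basis vectors
face = rng.random((56, 46))                          # stand-in normalized face image
print(face_feature(face, basis_vectors).shape)       # (48,)
```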
MPEG-7 AUDIO
MPEG-7 Audio provides structures—building upon some basic structures from the MDS
—for describing audio content. Utilizing those structures are a set of low-level
Descriptors, for audio features that cut across many applications (e.g., spectral,
parametric, and temporal features of a signal), and high-level Description Tools that are
more specific to a set of applications. Those high-level tools include the audio signature
Description Scheme, musical instrument timbre Description Schemes, the melody
Description Tools to aid query-by-humming, general sound recognition and indexing
Description Tools, and spoken content Description Tools.
The Audio Framework contains low-level tools designed to provide a basis for the
construction of higher level audio applications. By providing a common platform for the
structure of descriptions and the basic semantics for commonly regarded audio features,
MPEG-7 Audio establishes a platform for interoperability across all applications that
might be built on the framework. The framework provides structures appropriate for
representing audio features, and a basic set of features.
1. Structures
There are essentially two ways of describing low-level audio features. One may sample
values at regular intervals or one may use Segments (see the discussion on segments in
3.1.1.3) to demark regions of similarity and dissimilarity within the sound. Both of these
possibilities are embodied in two low-level descriptor types (one for scalar values, such
as power or fundamental frequency, and one for vector types, such as spectra), which
create a consistent interface. Any descriptor inheriting from these types can be
instantiated, describing a segment with a single summary value or a series of sampled
values, as the application requires.
The sampled values themselves may be further manipulated through another unified
interface: they can form a Scalable Series. The Scalable Series allows one to
progressively down-sample the data contained in a series, as the application,
bandwidth, or storage requires. This hierarchical resampling forms a sort of ‘scale tree,’
which may also store various summary values along the way, such as minimum,
maximum, mean, and variance of the descriptor values.
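A small sketch of the Scalable Series idea: the sampled values are repeatedly halved, and summary statistics are kept at every level so that an application can pick the resolution its bandwidth or storage allows. The number of levels and the chosen statistics here are illustrative.

```python
import numpy as np

def scalable_series(values: np.ndarray, levels: int = 3):
    """Repeatedly halve a sampled series, keeping min, max, mean and variance
    summaries at every level of the resulting scale tree."""
    series, tree = np.asarray(values, dtype=float), []
    for _ in range(levels):
        pairs = series[: len(series) // 2 * 2].reshape(-1, 2)
        tree.append({
            "mean": pairs.mean(axis=1),
            "min": pairs.min(axis=1),
            "max": pairs.max(axis=1),
            "var": pairs.var(axis=1),
        })
        series = pairs.mean(axis=1)      # next, coarser level of the scale tree
    return tree

power = np.abs(np.random.randn(64))      # e.g. sampled AudioPower values
for level, summary in enumerate(scalable_series(power), start=1):
    print(f"level {level}: {summary['mean'].size} samples")
```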
2. Features
The low-level audio Descriptors are of general importance in describing audio. There
are seventeen temporal and spectral Descriptors that may be used in a variety of
applications. They can be roughly divided into the following groups:
· Basic
· Basic Spectral
· Signal Parameters
· Timbral Temporal
· Timbral Spectral
· Spectral Basis
Additionally, a very simple but useful tool is the MPEG-7 silence Descriptor. Each of
these classes of audio Descriptors can be seen in Figure 26 and are briefly described
below.
3. Basic
The two basic audio Descriptors are temporally sampled scalar values for general use,
applicable to all kinds of signals. The AudioWaveform Descriptor describes the audio
waveform envelope (minimum and maximum), typically for display purposes. The
AudioPower Descriptor describes the temporally-smoothed instantaneous power, which
is useful as a quick summary of a signal, and in conjunction with the power spectrum,
below.
4. Basic Spectral
The four basic spectral audio Descriptors all share a common basis, all deriving from a
single time-frequency analysis of an audio signal. They are all informed by the first
Descriptor, the AudioSpectrumEnvelope Descriptor, which is a logarithmic-frequency
spectrum, spaced by a power-of-two divisor or multiple of an octave. This
AudioSpectrumEnvelope is a vector that describes the short-term power spectrum of an
audio signal. It may be used to display a spectrogram, to synthesize a crude
"auralization" of the data, or as a general-purpose descriptor for search and comparison.
5. Signal Parameters
The two signal parameter Descriptors apply chiefly to periodic or quasi-periodic signals.
The AudioFundamentalFrequency descriptor describes the fundamental frequency of an
audio signal. The representation of this descriptor allows for a confidence measure in
recognition of the fact that the various extraction methods, commonly called "pitch-
tracking," are not perfectly accurate, and in recognition of the fact that there may be
sections of a signal (e.g., noise) for which no fundamental frequency may be extracted.
The AudioHarmonicity Descriptor represents the harmonicity of a signal, allowing
distinction between sounds with a harmonic spectrum (e.g., musical tones or voiced
speech [e.g., vowels]), sounds with an inharmonic spectrum (e.g., metallic or bell-like
sounds) and sounds with a non-harmonic spectrum (e.g., noise, unvoiced speech [e.g.,
fricatives like ‘f’], or dense mixtures of instruments).
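As an illustration of the kind of extraction these descriptors summarize, the sketch below estimates a fundamental frequency by autocorrelation and returns the peak height as a confidence value; the normative extraction method differs, and the search range is an assumption.

```python
import numpy as np

def fundamental_frequency(frame: np.ndarray, sample_rate: float = 16000.0,
                          f_min: float = 60.0, f_max: float = 400.0):
    """Autocorrelation-based pitch estimate with a confidence value; low
    confidence indicates noise-like, non-periodic content."""
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[frame.size - 1:]
    corr /= corr[0] + 1e-12                       # normalize so lag 0 equals 1
    lag_min, lag_max = int(sample_rate / f_max), int(sample_rate / f_min)
    lag = lag_min + int(np.argmax(corr[lag_min:lag_max]))
    confidence = float(corr[lag])                 # peak height acts as confidence
    return sample_rate / lag, confidence

t = np.arange(2048) / 16000.0
voiced = np.sin(2 * np.pi * 220.0 * t)            # a 220 Hz tone
print(fundamental_frequency(voiced))              # roughly (220.0, close to 1.0)
```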
6. Timbral Temporal
The two timbral temporal Descriptors, LogAttackTime and TemporalCentroid, describe the temporal evolution of a sound's energy envelope and are especially useful for the description of musical timbre.
7. Timbral Spectral
The five timbral spectral Descriptors are spectral features in a linear-frequency space
especially applicable to the perception of musical timbre. The SpectralCentroid
Descriptor is the power-weighted average of the frequency of the bins in the linear
power spectrum. As such, it is very similar to the AudioSpectrumCentroid Descriptor,
but specialized for use in distinguishing musical instrument timbres. It has a high correlation with the perceptual feature of the "sharpness" of a sound.
The four remaining timbral spectral Descriptors operate on the harmonic regularly-
spaced components of signals. For this reason, the descriptors are computed in linear-
frequency space. The HarmonicSpectralCentroid is the amplitude-weighted mean of the
harmonic peaks of the spectrum. It has a similar semantic to the other centroid
Descriptors, but applies only to the harmonic (non-noise) parts of the musical tone. The
HarmonicSpectralDeviation Descriptor indicates the spectral deviation of log-amplitude
components from a global spectral envelope. The HarmonicSpectralSpread describes
the amplitude-weighted standard deviation of the harmonic peaks of the spectrum,
normalized by the instantaneous HarmonicSpectralCentroid. The
HarmonicSpectralVariation Descriptor is the normalized correlation between the
amplitude of the harmonic peaks between two subsequent time-slices of the signal.
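The centroid computations themselves are simple; the sketch below shows a power-weighted spectral centroid for a frame and an amplitude-weighted harmonic spectral centroid over a given set of harmonic peaks (windowing and peak picking are illustrative simplifications).

```python
import numpy as np

def spectral_centroid(frame: np.ndarray, sample_rate: float = 16000.0) -> float:
    """Power-weighted average frequency of the linear power spectrum."""
    power = np.abs(np.fft.rfft(frame * np.hanning(frame.size))) ** 2
    freqs = np.fft.rfftfreq(frame.size, d=1.0 / sample_rate)
    return float((freqs * power).sum() / (power.sum() + 1e-12))

def harmonic_spectral_centroid(harmonic_freqs, harmonic_amps) -> float:
    """Amplitude-weighted mean of the harmonic peaks only (non-noise part)."""
    f, a = np.asarray(harmonic_freqs), np.asarray(harmonic_amps)
    return float((f * a).sum() / (a.sum() + 1e-12))

tone = np.sin(2 * np.pi * 440.0 * np.arange(1024) / 16000.0)
print(spectral_centroid(tone))                       # near 440 Hz
print(harmonic_spectral_centroid([220, 440, 660], [1.0, 0.5, 0.25]))
```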
8. Spectral Basis
The two spectral basis Descriptors, AudioSpectrumBasis and AudioSpectrumProjection, are low-dimensional representations of a spectrogram that are used primarily as inputs to the sound classification and indexing tools. Together, the descriptors may be used to view and to represent compactly the
independent subspaces of a spectrogram. Often these independent subspaces (or groups
thereof) correlate strongly with different sound sources. Thus one gets more salience
and structure out of a spectrogram while using less space. For example, in Figure 27, a
pop song is represented by an AudioSpectrumEnvelope Descriptor, and visualized using
a spectrogram. The same song has been data-reduced in Figure 28, and yet the
individual instruments become more salient in this representation.
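A simplified sketch of the spectral-basis idea using a plain SVD: the log spectrogram is factored into a few basis functions plus per-frame projection coefficients. The normative extraction adds envelope normalization and optionally ICA, which are omitted here.

```python
import numpy as np

def spectrum_basis_projection(spectrogram: np.ndarray, k: int = 4):
    """Factor a (frames x bands) log spectrogram with the SVD and keep only k
    basis functions, so each frame is represented by k projection
    coefficients instead of the full band vector."""
    log_spec = np.log1p(spectrogram)
    u, s, vt = np.linalg.svd(log_spec, full_matrices=False)
    basis = vt[:k].T                     # AudioSpectrumBasis-like: bands x k
    projection = log_spec @ basis        # AudioSpectrumProjection-like: frames x k
    return basis, projection

spec = np.abs(np.random.randn(200, 32)) # 200 frames, 32 log-frequency bands
basis, proj = spectrum_basis_projection(spec)
print(basis.shape, proj.shape)          # (32, 4) (200, 4)
```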
9. Silence segment
The silence segment simply attaches the simple semantic of "silence" (i.e. no significant
sound) to an Audio Segment. Although it is extremely simple, it is a very effective
descriptor. It may be used to aid further segmentation of the audio stream, or as a hint
not to process a segment.
High-level audio Description Tools (Ds and DSs)
Because there is a smaller set of audio features (as compared to visual features) that
may canonically represent a sound without domain-specific knowledge, MPEG-7 Audio
includes a set of specialized high-level tools that exchange some degree of generality for
descriptive richness. The five sets of audio Description Tools that roughly correspond to
application areas are integrated in the standard: audio signature, musical instrument
timbre, melody description, general sound recognition and indexing, and spoken
content. The latter two are excellent examples of how the Audio Framework and MDS
Description Tools may be integrated to support other applications.
While low-level audio Descriptors in general can serve many conceivable applications,
the spectral flatness Descriptor specifically supports the functionality of robust
matching of audio signals. The Descriptor is statistically summarized in the
AudioSignature Description Scheme as a condensed representation of an audio signal
designed to provide a unique content identifier for the purpose of robust automatic
identification of audio signals. Applications include audio fingerprinting, identification
of audio based on a database of known works and, thus, locating metadata for legacy
audio content without metadata annotation.
Within four possible classes of musical instrument sounds, two classes are well detailed and have been the subject of extensive development within MPEG-7. Harmonic, coherent, sustained sounds and non-sustained, percussive sounds are represented in the standard. The HarmonicInstrumentTimbre Descriptor for sustained harmonic sounds combines the four harmonic timbral spectral Descriptors (see 3.3.1.7) with the LogAttackTime Descriptor. The PercussiveInstrumentTimbre Descriptor combines the timbral temporal Descriptors (see 3.3.1.6) with a SpectralCentroid Descriptor.
Comparisons between descriptions using either set of Descriptors are done with an
experimentally-derived scaled distance metric.
The melody Description Tools include a rich representation for monophonic melodic
information to facilitate efficient, robust, and expressive melodic similarity matching.
The Melody Description Scheme includes a MelodyContour Description Scheme for
extremely terse, efficient melody contour representation, and a MelodySequence
Description Scheme for a more verbose, complete, expressive melody representation.
Both tools support matching between melodies, and can support optional supporting
information about the melody that may further aid content-based search, including
query-by-humming.
The general sound recognition and indexing Description Tools are a collection of tools
for indexing and categorization of general sounds, with immediate application to sound
effects. The tools enable automatic sound identification and indexing, and the
specification of a Classification Scheme of sound classes and tools for specifying
hierarchies of sound recognizers. Such recognizers may be used to automatically index
and segment sound tracks. Thus the Description Tools address recognition and
representation all the way from low-level signal-based analyses, through mid-level
statistical models, to highly semantic labels for sound classes.
The recognition tools use the low-level spectral basis Descriptors as their foundation
(see 3.3.1.8). These basis functions are then collected into a series of states that
comprise a statistical model (the SoundModel Description Scheme), such as a hidden
Markov or Gaussian mixture model. The SoundClassificationModel Description Scheme
combines a set of SoundModels into a multi-way classifier for automatic labelling of
audio segments using terms from a Classification Scheme. The resulting probabilistic
classifiers may recognize broad sounds classes, such as speech and music, or they can be
trained to identify narrower categories such as male, female, trumpet, or violin. Other
applications include genre classification and voice recognition.
The spoken content Description Tools allow detailed description of words spoken within
an audio stream. In recognition of the fact that current Automatic Speech Recognition
(ASR) technologies have their limits, and that one will always encounter out-of-
vocabulary utterances, the spoken content Description Tools sacrifice some
compactness for robustness of search. To accomplish this, the tools represent the output
and what might normally be seen as intermediate results of Automatic Speech
Recognition (ASR). The tools can be used for two broad classes of retrieval scenario:
indexing into and retrieval of an audio stream, and indexing of multimedia objects
annotated with speech.
The Spoken Content Description Tools are divided into two broad functional units: the
SpokenContentLattice Description Scheme, which represents the actual decoding
produced by an ASR engine, and the SpokenContentHeader, which contains
information about the speakers being recognized and the recognizer itself.
By combining the use of word and phone lattices, the out-of-vocabulary problem is greatly alleviated and retrieval may still be carried out when the original word recognition was in error. A simplified SpokenContentLattice is depicted in Figure 29.
Figure 29: A lattice structure for an hypothetical (combined phone and word)
decoding of the expression "Taj Mahal drawing …". It is assumed that the name ‘Taj
Mahal’ is out of the vocabulary of the ASR system
Example applications of these tools include:
· Spoken Document Retrieval. Indexing into and retrieval of an audio stream by searching for spoken words or phrases within it.
· Annotated Media Retrieval. This is similar to spoken document retrieval, but the
spoken part of the media would generally be quite short (a few seconds). The result of
the query is the media which is annotated with speech, and not the speech itself. An
example is a photograph retrieved using a spoken annotation.
MPEG-7 CONFORMANCE TESTING
MPEG-7 Conformance Testing includes the guidelines and procedures for testing
conformance of MPEG-7 implementations both for descriptions and terminals.
Conformance testing
1. Conformance testing of descriptions
Conformance testing of descriptions checks whether a description complies with the normative syntax and semantics defined in ISO/IEC 15938, i.e. whether it is a valid instance of the MPEG-7 schema.
2. Conformance testing of terminals
Conformance testing of terminals addresses two questions:
(1) Does the test terminal provide the correct response when checking description validity, and
(2) Does the test terminal provide the same results for the reconstructed canonical XML (infoset) representation as the reference terminal?
In the case of an input description that is in the form of textual or binary access units,
the Systems processing must first convert the description into a textual XML form. In
the case of an input description that is already in textual XML form, the Systems
processor passes the input description on for DDL processing. In either case, the textual
XML form of the description is then operated on by the DDL processor, which checks the description for well-formedness and validity. The DDL processor takes as input the schema composed from the MDS, visual, audio, and other parts in order to allow checking of the syntax of the textual XML description against the specifications of ISO/IEC 15938 Parts 1-5.
Interoperability points
Given the conformance testing procedures described above, the interoperability point in
the standard corresponds to the reconstruction of a canonical XML representation of
the description at the terminal. This allows for different possible implementations at the
terminal in which different internal representations are used as long as the terminal is
able to produce a conforming canonical XML representation of the description.
1. Normative interfaces
The objective of this section is to describe MPEG-7 normative interfaces. MPEG-7 has
two normative interfaces as depicted in Figure 42 and further described in this section.
Content: These are the data to be represented according to the format described by this specification. Content refers either to essence or to content description.
Textual Format interface: This interface describes the format of the textual access units. The MPEG-7 Textual Decoder consumes a flow of such Access Units and reconstructs the content description in a normative way.
Binary Format Interface: This interface describes the format of the binary access units. The MPEG-7 Binary Decoder consumes a flow of such Access Units and reconstructs the content description in a normative way.
The objective of this section is to describe how proof can be established that the lossless binary representation and the textual representation provide dual representations of the content. The process is described in Figure 43 and further described in this section.
In addition to the elements described in section 3.8.2.1.1, the validation process involves
the definition of a canonical representation of a content description. In the canonical
space, content description can be compared. The validation process works as follows:
· The two encoded descriptions are decoded with their respective binary and textual
decoders.
· Two canonical descriptions are generated from the reconstructed content descriptions.
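A small sketch of this comparison step, using W3C Canonical XML (Python's ElementTree.canonicalize, available from Python 3.8) as a stand-in for the canonical representation; the two input strings are hypothetical decoder outputs for the same content.

```python
import xml.etree.ElementTree as ET

def descriptions_match(decoded_textual: str, decoded_binary: str) -> bool:
    """Bring both reconstructed descriptions into a canonical XML form and
    compare them; matching canonical forms establish that the textual and
    binary paths describe the same content."""
    return ET.canonicalize(decoded_textual) == ET.canonicalize(decoded_binary)

# Hypothetical outputs of the textual and binary decoders for the same content;
# the attribute order differs, but the canonical forms are equal.
from_textual = '<Mpeg7><Descriptor type="DominantColor" n="1">120 64 32</Descriptor></Mpeg7>'
from_binary = '<Mpeg7><Descriptor n="1" type="DominantColor">120 64 32</Descriptor></Mpeg7>'
print(descriptions_match(from_textual, from_binary))   # True
```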
MPEG-7 Extraction and Use of Descriptions
The MPEG-7 Extraction and Use of Descriptions Technical Report gives examples of
extraction and use of descriptions using Description Schemes, Descriptors, and
datatypes as specified in ISO/IEC 15938. The following set of subclauses is provided for each description tool, where optional subclauses are indicated as (optional):
· Use (optional): provides informative examples that illustrate the use of descriptions.
REFERENCES
• www.wikipedia.org
• www.mpegif.com
• www.chiariglione.org/mpeg
• http://www.cselt.it/mpeg
• http://www.mpeg-7.com
• www.berkeley.org