Cavallaro CSVT Semantic Encoding
Abstract— We present an encoding framework that exploits semantics for video content delivery. The video content is organized based on the idea of a main content message. In the work reported in this paper, the main content message is extracted from the video data through semantic video analysis, an application-dependent process that separates relevant information from non-relevant information. Here we use semantic analysis and the corresponding content annotation from a new perspective: the results of the analysis are exploited for object-based encoders, such as MPEG-4, as well as for frame-based encoders, such as MPEG-1. Moreover, MPEG-7 content descriptors are used in conjunction with the video to improve content visualization for narrow channels and devices with limited capabilities. Finally, we analyze and evaluate the impact of semantic video analysis on video encoding and show that the use of semantic video analysis prior to encoding considerably reduces the bandwidth requirements compared to traditional encoders, not only for an object-based encoder but also for a frame-based encoder.

Index Terms— Video analysis, video encoding, object segmentation, metadata, MPEG.

A. Cavallaro is with the Multimedia and Vision Laboratory, Queen Mary, University of London (QMUL), E1 4NS London, United Kingdom. O. Steiger and T. Ebrahimi are with the Signal Processing Institute, Swiss Federal Institute of Technology (EPFL), CH-1015 Lausanne, Switzerland.

I. INTRODUCTION

The diffusion of network appliances such as cellular phones, personal digital assistants, and hand-held computers creates a new challenge for content delivery: how to adapt the media transmission to various device capabilities, network characteristics, and user preferences [1], [2], [3]. Each device is characterized by certain display capabilities and processing power. Moreover, such appliances are connected through different kinds of networks with diverse bandwidths. Finally, users with different preferences access the same multimedia content. Therefore, there is a need to personalize the way media content is delivered to the end user. In addition to the above, recent devices, such as digital radio receivers, and new applications, such as intelligent visual surveillance, require novel forms of video analysis for content adaptation and summarization. Digital radios allow for the display of additional information alongside the traditional audio stream to enrich the audio content. For instance, digital audio broadcasting (DAB) allocates 128 Kb/s to streaming audio, whereas 8 Kb/s can be used to send additional information, such as visual data [4]. Moreover, the growth of video surveillance systems poses challenging problems for the automatic analysis, interpretation and indexing of video data as well as for selective content filtering for privacy preservation. Finally, the instantaneous indexing of video content is also a desirable feature for sports broadcasting [5].

To cope with these challenges, video content needs to be automatically analyzed and adapted to the needs of the specific application, to the capabilities of the connected terminal and network, and to the preferences of the user. Three main strategies for adaptive content delivery have been proposed throughout the literature, namely Info Pyramid, scalable coding and transcoding. The work presented in this paper aims to go beyond traditional adaptation techniques. We focus on semantic encoding by exploiting video analysis prior to encoding (Figure 1). Specifically, we use semantic video analysis to extract relevant areas of a video. These areas are encoded at a higher level of quality or summarized in textual form. The idea behind this approach is to organize the content so that a particular network or device does not inhibit the main content message. The main content message depends on the specific application. In particular, for applications such as video surveillance and sports video, the main content message is defined based on motion information.

The contribution of this paper is twofold. On the one hand, a framework for adaptive video delivery is defined and implemented based on video objects and on their associated metadata. On the other hand, two new modalities of video delivery are proposed in such a framework. The first modality combines semantic analysis with a traditional frame-based video encoder. The second modality uses metadata to efficiently encode the main content message. In particular, the use of metadata not only makes the content more searchable, but also improves visualization and preserves privacy in video-based applications.

The paper is organized as follows. Section II is an overview of existing adaptation techniques. Section III presents the algorithm for extracting the main content message, the framework for adaptive video delivery and automatic description using semantic video analysis. Section IV discusses quality assessment issues, whereas experimental results are presented in Section V. Finally, Section VI concludes the paper.
[Figure 1 (diagram): the input video is processed by semantic video analysis, which separates foreground from background. Temporal downsampling, spatial downsampling and compositing feed a frame-based encoder (modalities 1-5); the foreground objects and the (optionally simplified) background feed an object-based encoder (modalities 6, 7); a metadata encoder and a still image encoder complete the set of numbered outputs (1)-(9), which are multiplexed for delivery.]
Fig. 1. Flow diagram of the proposed encoding framework based on semantic video analysis and description
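To make the branch structure of Figure 1 concrete, the following sketch shows one possible way to organize the dispatch from the semantic analysis output to the four encoder families. It is an illustrative skeleton only: the class, function and branch names are assumptions, not part of the paper, and the encoder callables stand in for the actual MPEG-1/MPEG-4/MPEG-7/JPEG2000 tools listed in Section V.

```python
from dataclasses import dataclass
from typing import Callable, Dict
import numpy as np

@dataclass
class SemanticAnalysisOutput:
    """Result of the application-dependent semantic analysis stage (Figure 1)."""
    foreground_mask: np.ndarray   # HxW binary mask of the relevant objects
    background: np.ndarray        # HxWx3 estimate of the scene background

def route_frame(frame: np.ndarray,
                analysis: SemanticAnalysisOutput,
                branch: str,
                encoders: Dict[str, Callable]) -> object:
    """Dispatch one frame to an encoding branch of Figure 1 (names are hypothetical)."""
    if branch == "frame_based":        # composited/downsampled frames, modalities (1)-(5)
        return encoders["frame_based"](frame)
    if branch == "object_based":       # separate foreground/background streams, modalities (6), (7)
        return encoders["object_based"](frame, analysis.foreground_mask, analysis.background)
    if branch == "metadata":           # MPEG-7 description of the foreground objects
        return encoders["metadata"](analysis.foreground_mask)
    return encoders["still_image"](analysis.background)   # still image of the background
```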
II. BACKGROUND

Three main approaches have been presented in the literature to provide adaptive content delivery, namely Info Pyramid, scalable coding and transcoding. Info Pyramid provides a general framework for managing and manipulating media objects [6], [7]. Info Pyramid manages different versions, or variations, of media objects with different modalities (e.g., video, image, text, and audio) and fidelities (summarized, compressed, and scaled variations). Moreover, it defines methods for manipulating, translating, transcoding, and generating the content. When a client device requests a multimedia document, the server selects and delivers the most appropriate variation. The selection is made based on network characteristics and terminal capabilities, such as display size, frame rate, color depth and storage capacity.

As opposed to Info Pyramid, scalable coding processes multimedia content only once. Lower qualities and lower spatial and temporal resolutions of the same content are then obtained by truncating certain layers or bits from the original stream [8]. Basic modes of video scalability include quality scalability, spatial scalability, temporal scalability, and frequency scalability. Combinations of these basic modes are also possible. Quality scalability is the representation of a video sequence with varying accuracies in the color patterns; this is typically obtained by quantizing the color values with increasingly finer quantization step sizes. Spatial scalability is the representation of the same video at varying spatial resolutions. Temporal scalability is the representation of the same video at varying temporal resolutions, or frame rates. Frequency scalability includes different frequency components in each layer, with the base layer containing low-frequency components and the other layers containing increasingly high-frequency components; such a decomposition can be achieved via frequency transforms such as the DCT or wavelet transforms. Finally, the basic scalability schemes can be combined to reach fine-granularity scalability, as in MPEG–4 FGS [9].
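As a toy illustration of quality (SNR) scalability, not taken from the paper, the sketch below encodes a frame as a coarsely quantized base layer plus a finely quantized residual enhancement layer; dropping the enhancement layer yields a lower-quality version of the same content from the same bitstream.

```python
import numpy as np

def quantize(x, step):
    return np.round(x / step).astype(np.int32)

def dequantize(q, step):
    return q.astype(np.float64) * step

def encode_two_layers(frame, base_step=16.0, enh_step=4.0):
    """Toy quality scalability: a coarse base layer plus a residual
    enhancement layer quantized with a finer step size."""
    base_q = quantize(frame.astype(np.float64), base_step)
    residual = frame.astype(np.float64) - dequantize(base_q, base_step)
    enh_q = quantize(residual, enh_step)
    return base_q, enh_q

def decode(base_q, enh_q=None, base_step=16.0, enh_step=4.0):
    """Reconstruct at base quality (enh_q=None) or at enhanced quality."""
    rec = dequantize(base_q, base_step)
    if enh_q is not None:
        rec += dequantize(enh_q, enh_step)
    return np.clip(rec, 0, 255).astype(np.uint8)

# Example: the same layered representation serves two quality levels.
frame = (np.random.rand(288, 352) * 255).astype(np.uint8)   # CIF-sized luminance frame
base_q, enh_q = encode_two_layers(frame)
low_quality = decode(base_q)             # base layer only
high_quality = decode(base_q, enh_q)     # base + enhancement
```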
The various scalable coding methods introduced previously perform the same operation over the entire video frame. In object-based temporal scalability (OTS), the frame rate of the foreground objects is enhanced so that they exhibit smoother motion than the background.

Video transcoding is the process of converting a compressed video signal into another compressed signal with different properties. Early solutions to video transcoding determine the output format based on network and appliance constraints, independently of the semantics in the content (content-blind transcoding). Content-blind transcoding strategies include spatial resolution reduction, temporal resolution reduction, and bit-rate reduction [10]. Recent transcoding techniques make use of semantics to minimize the degradation of important image regions [11], [12]. In [13], optimal quantization parameters and frame skip are determined for each video object individually. The bit-rate budget for each object is allocated by a difficulty hint, a weight indicating the relative encoding complexity of each object. Frame skip is controlled by a shape hint, which measures the difference between two consecutive shapes to determine whether an object can be temporally downsampled without visible composition problems. Key objects are selected based on motion activity and on bit complexity.

The transcoding strategies described thus far are referred to as intramedia transcoding strategies and do not change the media nature of the input signal. On the other hand, intermedia transcoding, or transmoding, is the process of converting the media input into another media format. Examples of intermedia transcoding include speech-to-text [14] and video-to-text [15] translation. Both the intramedia and the intermedia adaptation concepts are used in this paper for video encoding, as described in the following section.

III. ADAPTIVE CONTENT DELIVERY AND DESCRIPTION USING SEMANTICS

The proposed framework for adaptive video delivery and automatic description uses video content analysis and semantic pre-filtering prior to encoding (Figure 1) in order to improve the perceived content quality and to provide additional functionalities, such as privacy preservation and automatic video indexing. Semantic video analysis and semantic encoding are described next.

A. Semantic video analysis

Semantic video analysis is used to extract the main content message from the video. The semantics to be included in
and text-based modalities is here referred to as metadata-enhanced encoding. Using metadata-enhanced encoding, content descriptors help enhance parts of the video that are hidden or difficult to perceive due to heavy compression. In this case, the video itself is the background and the descriptors highlight relevant portions of the data. One example is the ball in a football match for transmission to a PDA or a mobile phone, as shown in Figure 3(e).

IV. QUALITY ASSESSMENT

Perceptual video quality assessment is a difficult task already when dealing with traditional coders [22]. When dealing with object-based coders, the task becomes even more challenging. For this reason, we use a combination of subjective and objective evaluation techniques.

The evaluation metric, the semantic peak signal-to-noise ratio (SPSNR), is the following:

SPSNR = 10 log10 (VM² / SMSE),    (6)

where VM is the maximum peak-to-peak value of the color range. When the object classes are foreground and background, then N = 2 in Eq. (3). If we denote with wf the foreground weight, then SPSNR ≡ PSNR when wf = 0.5. The larger wf, the more important the contribution of the foreground. When wf = 1, only the foreground is considered in the evaluation of the peak signal-to-noise ratio. An illustration of the impact of wf on the distortion measure is shown in Fig. 5. The figure presents a comparison of the distortion curves.
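A minimal sketch of the SPSNR computation for the two-class case is given below. Since Eq. (3) defining the SMSE is not reproduced in this excerpt, the weighting used here is an assumption, chosen so that the stated properties hold (SPSNR equals PSNR for wf = 0.5 and reduces to a foreground-only PSNR for wf = 1). The inputs are assumed to be single-channel (luminance) images of the same size.

```python
import numpy as np

def spsnr(reference, decoded, fg_mask, w_f=0.5, v_m=255.0):
    """Semantic PSNR (Eq. 6) for two classes, foreground and background.

    The semantic MSE used here weights the squared errors of foreground and
    background pixels by w_f and (1 - w_f) and normalizes by the weighted
    pixel count; this form is an assumption consistent with the properties
    stated in the text, not the paper's Eq. (3) verbatim."""
    ref = reference.astype(np.float64)
    dec = decoded.astype(np.float64)
    err2 = (ref - dec) ** 2
    fg = fg_mask.astype(bool)
    w_b = 1.0 - w_f
    weighted_error = w_f * err2[fg].sum() + w_b * err2[~fg].sum()
    weighted_count = w_f * fg.sum() + w_b * (~fg).sum()
    smse = weighted_error / weighted_count
    return 10.0 * np.log10(v_m ** 2 / smse)
```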
The SPSNR and subjective results are summarized in Table I. For the sequence Akiyo, where the foreground covers a large area of each frame and the background is simple, the observers focused mostly on the foreground, thus leading to a value of wf = 0.97. For Hall Monitor, whose background is more complex and whose objects are smaller, the foreground attracted slightly more attention than the background (wf = 0.55). The sequence Children has a very complex and colored background that attracted the observers' attention, thus resulting in foreground and background being equally weighted (wf = 0.5). The sequence Coastguard contains camera motion that prevented the observers from focusing steadily on the background, even though the background is quite complex. In this case, the resulting foreground weight is wf = 0.7. In general, the results confirm that large moving objects and complex backgrounds tend to attract the user's attention. Based on the data collected with the subjective experiments, it is possible to predict the foreground weight with the following formula:

wf = α · r + (δ − β · r)σb + γ · v + δ,    (7)

where r represents the portion of the image occupied by foreground pixels, expressed as r = |Cf|/(|Cf| + |Cb|), with |Cf| and |Cb| representing the number of foreground and background pixels, respectively. The background complexity is taken into account with σb, the standard deviation of the luminance of the background pixels. The presence of camera motion is considered with the term v: v = 1 for a moving camera, and v = 0 otherwise. α, β, γ, and δ are constants whose values are determined based on the results of the subjective experiments and are the following: α = 5.7, β = 0.108, γ = 0.2 and δ = 0.01. The final value of wf is the average of the foreground weights over the sequence.
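For reference, Eq. (7) maps directly to a few lines of code; the Highway values quoted in Section V reproduce the reported weight wf ≈ 0.53. The function name is illustrative.

```python
def predict_foreground_weight(r, sigma_b, v,
                              alpha=5.7, beta=0.108, gamma=0.2, delta=0.01):
    """Foreground weight predicted by Eq. (7):
    wf = alpha*r + (delta - beta*r)*sigma_b + gamma*v + delta,
    where r = |Cf| / (|Cf| + |Cb|), sigma_b is the standard deviation of the
    background luminance, and v = 1 for a moving camera (0 otherwise).
    The final wf is the average of the per-frame weights over the sequence."""
    return alpha * r + (delta - beta * r) * sigma_b + gamma * v + delta

# Highway values quoted in Section V: r = 0.07, sigma_b = 48, v = 0  ->  wf ~ 0.53
wf_highway = predict_foreground_weight(0.07, 48.0, 0)
```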
In addition to the semantic weight, Table I provides information about the accuracy, monotonicity and consistency of the SPSNR metric. Accuracy is given by the Pearson linear correlation coefficient rp, monotonicity by the Spearman rank-order correlation coefficient rs, and consistency by the outliers ratio ro [24]. The Pearson correlation of PSNR, rp(0.5), is given for comparison. The Pearson correlation rp and the Spearman correlation rs are close to 1 for all sequences; thus, the accuracy and monotonicity of SPSNR are high. The outliers ratio ro is around 10%, so the consistency of the metric is good as well. Note that using semantics improves accuracy by up to 8% (Akiyo), as compared to PSNR.

TABLE I
FOREGROUND WEIGHT AND SPSNR ACCURACY

            Akiyo   Hall monitor   Children   Coastguard
wf          0.97    0.55           0.50       0.7
rp(wf)      0.95    0.90           0.95       0.92
rp(0.5)     0.87    0.89           0.95       0.90
rs(wf)      0.90    0.84           0.95       0.93
ro(wf)      0.10    0.11           0.07       0.07

V. EXPERIMENTAL RESULTS

In this section, experimental results of the proposed semantic video encoding and annotation framework with standard test sequences are presented. The results illustrate the impact of semantic analysis on the encoding performance of frame-based as well as object-based coders and demonstrate the use of the proposed approach for advanced applications, such as privacy preservation in video surveillance. Sample results are shown for the MPEG–4 test sequence Hall Monitor and the MPEG–7 test sequence Highway. Both sequences are in CIF format at 25 Hz. The modalities under analysis are: (1) coded original sequence; (2) temporal resolution reduction (from 25 frames/s to 12.5 frames/s); (3) spatial resolution reduction (from CIF to QCIF); (4, 6) video objects composited with a static background; (5, 7) video objects composited with a simplified background. The background is simplified using a Gaussian 9×9 low-pass filter with µ = 0 and σ = 2.

The following coders have been used in the encoding process: (i) TMPGEnc 2.521.58.169 using constant bitrate (CBR) rate control for frame-based MPEG–1; (ii) MoMuSys MPEG–4 VM reference software version 1.0 using VM5+ global rate control for object-based MPEG–4; (iii) Expway MPEG–7 BiM Payload encoder/decoder version 02/11/07 for MPEG–7 metadata; (iv) Kakadu JPEG2000 codec version 4.2 for JPEG2000 still images. The value of the foreground weight used in the objective evaluation is wf = 0.55 for Hall Monitor, as determined with the subjective experiments, and wf = 0.53 for Highway, computed using Eq. (7) with r = 0.07, σb = 48, v = 0.

Figure 6 shows the rate-distortion diagrams for the test sequences. The average SPSNR for five encoding modalities is plotted against the encoding bitrate. Figures 6 (a) and (b) show the rate-distortion diagrams for MPEG–1 at bitrates between 150 Kbit/s and 1000 Kbit/s. At low bitrates (150-300 Kbit/s), semantic encoding with a static background (4) leads to a larger SPSNR than any of the content-blind methods (1-3). This is because inter-coded static background blocks do not produce residue and most of the available bitrate can be allocated to the foreground objects. In Figures 6 (c) and (d), foreground and background are encoded in two separate streams using object-based MPEG–4 at bitrates between 100 Kbit/s and 500 Kbit/s. Here semantic analysis is used in all five modalities. It is possible to notice that quality is improved at low bitrates by low-pass filtering the background or by using a still frame representing the background.

Figure 7 shows a sample frame from each test sequence coded with MPEG–1 at 150 Kbit/s with and without semantic pre-filtering. Figure 8 shows magnified excerpts of both test sequences coded with MPEG–1 at 150 Kbit/s. Figure 8 (top) shows the person that carries a monitor in Hall Monitor. The amount of coding artifacts is notably reduced by semantic pre-filtering ((d) and (e)). In particular, the person's mouth and the monitor are visible in (e), whereas they are corrupted by coding artifacts in the non-semantic modalities. Similar observations can be made for Figure 8 (bottom), which shows a blue truck entering the scene at the beginning of the Highway sequence. Coding artifacts are less disturbing on the object in (d) and (e) than in (a)-(c). Moreover, the front-left wheel of the truck is only visible with semantic pre-filtering ((d) and (e)).
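The semantic pre-filtering used in modalities (4) and (5), compositing the foreground objects over a static or low-pass filtered background before running a standard frame-based encoder, can be sketched as follows. The sketch assumes a per-frame binary foreground mask from the semantic analysis stage; the function and parameter names are illustrative.

```python
import cv2
import numpy as np

def prefilter_frame(frame, fg_mask, mode="simplified", still_background=None):
    """Semantic pre-filtering before a frame-based encoder (modalities (4) and (5)).

    Foreground pixels (fg_mask > 0) are kept intact; background pixels are
    replaced either by a still frame ('static', still_background required) or
    by a low-pass filtered version of the current frame ('simplified',
    Gaussian 9x9 kernel with sigma = 2, as in Section V)."""
    mask3 = (fg_mask > 0)[..., None]
    if mode == "static":
        background = still_background
    else:
        background = cv2.GaussianBlur(frame, (9, 9), 2)
    return np.where(mask3, frame, background)

# The filtered sequence is then passed unchanged to a standard MPEG-1 encoder:
# identical background blocks (static mode) produce little or no residue, so
# more of the bitrate budget is spent on the foreground objects.
```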
Fig. 6. Rate-distortion diagrams. (a) Hall monitor, MPEG–1; (b) Highway, MPEG–1; (c) Hall monitor, MPEG–4 object-based; (d) Highway, MPEG–4
object-based
Next, we evaluate the cost of sending metadata for metadata-based and metadata-enhanced encoding. Table II shows the bitrate required by three types of description for Hall Monitor and Highway using the MPEG–7 binary format (BiM). The MPEG–7 binary format is used for sending summary information to terminals with limited capabilities and to enhance heavily compressed videos. The descriptions are represented by the spatial locators of the foreground objects, their bounding boxes, and an approximation of their shape with 20-sided polygons, respectively. The metadata size increases with the description complexity and with the number of objects in the scene (Hall Monitor vs. Highway). The cost for metadata-enhanced encoding can be further reduced by sending the description of critical objects only.

TABLE II
AVERAGE BITRATE OF MPEG–7 BiM SEQUENCE DESCRIPTION

DESCRIPTION     Spatial locator   Bounding box   Polygon shape
Hall monitor    21 Kbit/s         59 Kbit/s      89 Kbit/s
Highway         26 Kbit/s         66 Kbit/s      98 Kbit/s
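To illustrate the three description types of Table II, the sketch below extracts a spatial locator (here, the object centroid), a bounding box and a polygonal shape approximation for each connected foreground object in a binary mask. The centroid-based locator and the epsilon search for the polygon are assumptions made for illustration; the paper's exact MPEG–7 description schemes and their BiM encoding are not reproduced here.

```python
import cv2
import numpy as np

def object_descriptors(fg_mask, max_vertices=20):
    """Per-object descriptors of the kind carried in the metadata stream:
    a spatial locator (centroid), a bounding box, and a polygonal
    approximation of the shape (the paper uses 20-sided polygons)."""
    contours, _ = cv2.findContours(fg_mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    descriptors = []
    for c in contours:
        m = cv2.moments(c)
        if m["m00"] == 0:
            continue
        locator = (m["m10"] / m["m00"], m["m01"] / m["m00"])   # centroid (x, y)
        bbox = cv2.boundingRect(c)                              # (x, y, w, h)
        # Grow the approximation tolerance until the polygon has at most
        # max_vertices vertices.
        eps = 0.001 * cv2.arcLength(c, True)
        poly = cv2.approxPolyDP(c, eps, True)
        while len(poly) > max_vertices:
            eps *= 1.5
            poly = cv2.approxPolyDP(c, eps, True)
        descriptors.append({"locator": locator, "bbox": bbox,
                            "polygon": poly.reshape(-1, 2)})
    return descriptors
```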
In addition to the above, metadata-enhanced encoding is used for privacy preservation in video surveillance. Figure 9 shows an example of different levels of information hiding obtained using object descriptors for the sequence Hall Monitor. A surveillance operator can be shown different video types, ranging from the full appearance of the objects (Figure 9 (a)) to the visualization of a position locator that allows the operator to derive statistics about the number of objects, their behavior and position without disclosing their identity (Figure 9 (d)). Intermediate levels of visualization include the approximation of the object shapes, which hides the identity of the subjects captured by the surveillance camera while allowing the operator to derive information about their size and form (Figure 9 (b)), and the bounding box (Figure 9 (c)). The encoding cost associated with this additional functionality added to a surveillance system is 21 Kbit/s for the spatial locator, 59 Kbit/s for the bounding box and 89 Kbit/s for the polygonal shape. The choice of the description to be used depends on the trade-off between privacy and the monitoring task at hand.

VI. CONCLUSIONS

We presented a content-based video encoding framework which is based on semantic analysis. Semantic analysis enables
Fig. 7. Frame 190 of Hall monitor (top) and frame 44 of Highway (bottom) coded with MPEG–1 at 150 Kbit/s using different modalities: (a) coded original
sequence; (b) static background; (c) simplified background
Fig. 8. Details of frame 280 of Hall monitor (top) and frame 16 of Highway (bottom). The sequences are encoded with MPEG–1 at 150 Kbit/s using
different encoding modalities: (a) coded original sequence; (b) temporal resolution reduction; (c) spatial resolution reduction; (d) static background; (e)
simplified background
video content delivery,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 11, no. 3, pp. 387–401, March 2001.
[14] N. Morgan and H. Bourlard, “Continuous speech recognition,” IEEE Signal Processing Magazine, vol. 12, no. 3, pp. 24–42, May 1995.
[15] K. Jung, K.I. Kim, and A.K. Jain, “Text information extraction in images and video: a survey,” Pattern Recognition, vol. 37, no. 5, pp. 977–997, May 2004.
[16] R.L. Hsu, M. Abdel-Mottaleb, and A. Jain, “Face detection in color images,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 696–706, 2002.
[17] A. Cavallaro and T. Ebrahimi, “Video object extraction based on adaptive background and statistical change detection,” in Proceedings of SPIE