

Semantic video analysis for adaptive content delivery and automatic description

Andrea Cavallaro, Olivier Steiger, Touradj Ebrahimi

Abstract— We present an encoding framework which exploits semantics for video content delivery. The video content is organized based on the idea of a main content message. In the work reported in this paper, the main content message is extracted from the video data through semantic video analysis, an application-dependent process that separates relevant information from non-relevant information. We use semantic analysis and the corresponding content annotation under a new perspective: the results of the analysis are exploited for object-based encoders, such as MPEG-4, as well as for frame-based encoders, such as MPEG-1. Moreover, MPEG-7 content descriptors are used in conjunction with the video to improve content visualization for narrow channels and devices with limited capabilities. Finally, we analyze and evaluate the impact of semantic video analysis on video encoding and show that the use of semantic video analysis prior to encoding sensibly reduces the bandwidth requirements compared to traditional encoders, not only for an object-based encoder but also for a frame-based encoder.

Index Terms— Video analysis, video encoding, object segmentation, metadata, MPEG.

I. INTRODUCTION

The diffusion of network appliances such as cellular phones, personal digital assistants, and hand-held computers creates a new challenge for content delivery: how to adapt the media transmission to various device capabilities, network characteristics, and user preferences [1], [2], [3]. Each device is characterized by certain display capabilities and processing power. Moreover, such appliances are connected through different kinds of networks with diverse bandwidths. Finally, users with different preferences access the same multimedia content. There is therefore a need to personalize the way media content is delivered to the end user. In addition, recent devices, such as digital radio receivers, and new applications, such as intelligent visual surveillance, require novel forms of video analysis for content adaptation and summarization. Digital radios allow for the display of additional information alongside the traditional audio stream to enrich the audio content. For instance, digital audio broadcasting (DAB) allocates 128 Kb/s to streaming audio, whereas 8 Kb/s can be used to send additional information, such as visual data [4]. Moreover, the growth of video surveillance systems poses challenging problems for the automatic analysis, interpretation and indexing of video data, as well as for selective content filtering for privacy preservation. Finally, instantaneous indexing of video content is also a desirable feature for sports broadcasting [5].

To cope with these challenges, video content needs to be automatically analyzed and adapted to the needs of the specific application, to the capabilities of the connected terminal and the network, and to the preferences of the user. Three main strategies for adaptive content delivery have been proposed in the literature, namely Info Pyramid, scalable coding and transcoding. The work presented in this paper aims to go beyond these traditional adaptation techniques. We focus on semantic encoding by exploiting video analysis prior to encoding (Figure 1). Specifically, we use semantic video analysis to extract relevant areas of a video. These areas are encoded at a higher level of quality or summarized in textual form. The idea behind this approach is to organize the content so that a particular network or device does not inhibit the main content message. The main content message depends on the specific application. In particular, for applications such as video surveillance and sports video, the main content message is defined based on motion information.

The contribution of this paper is twofold. On the one hand, a framework for adaptive video delivery is defined and implemented based on video objects and on their associated metadata. On the other hand, two new modalities of video delivery are proposed within this framework. The first modality combines semantic analysis with a traditional frame-based video encoder. The second modality uses metadata to efficiently encode the main content message. In particular, the use of metadata not only makes the content more searchable, but also improves visualization and preserves privacy in video-based applications.

The paper is organized as follows. Section II is an overview of existing adaptation techniques. Section III presents the algorithm for extracting the main content message and the framework for adaptive video delivery and automatic description using semantic video analysis. Section IV discusses quality assessment issues, whereas experimental results are presented in Section V. Finally, Section VI concludes the paper.

A. Cavallaro is with the Multimedia and Vision Laboratory, Queen Mary, University of London (QMUL), E1 4NS London, United Kingdom. O. Steiger and T. Ebrahimi are with the Signal Processing Institute, Swiss Federal Institute of Technology (EPFL), CH-1015 Lausanne, Switzerland.
[Figure 1: block diagram. Semantic video analysis splits the input video into foreground and background; branches include temporal and spatial downsampling feeding a frame-based encoder (modalities 1-5), background simplification and compositing (8), an object-based encoder (6, 7), a metadata encoder (9), and a still image encoder, with multiplexers combining the output streams.]

Fig. 1. Flow diagram of the proposed encoding framework based on semantic video analysis and description.

II. BACKGROUND

Three main approaches have been presented in the literature to provide adaptive content delivery, namely Info Pyramid, scalable coding and transcoding. Info Pyramid provides a general framework for managing and manipulating media objects [6], [7]. Info Pyramid manages different versions, or variations, of media objects with different modalities (e.g., video, image, text, and audio) and fidelities (summarized, compressed, and scaled variations). Moreover, it defines methods for manipulating, translating, transcoding, and generating the content.

When a client device requests a multimedia document, the server selects and delivers the most appropriate variation. The selection is made based on network characteristics and terminal capabilities, such as display size, frame rate, color depth and storage capacity.

As opposed to Info Pyramid, scalable coding processes multimedia content only once. Lower qualities and lower spatial and temporal resolutions of the same content are then obtained by truncating certain layers or bits from the original stream [8]. Basic modes of video scalability include quality scalability, spatial scalability, temporal scalability, and frequency scalability. Combinations of these basic modes are also possible. Quality scalability is defined as the representation of a video sequence with varying accuracies in the color patterns. This is typically obtained by quantizing the color values with increasingly finer quantization step sizes. Spatial scalability is the representation of the same video at varying spatial resolutions. Temporal scalability is the representation of the same video at varying temporal resolutions or frame rates. Frequency scalability includes different frequency components in each layer, with the base layer containing low-frequency components and the other layers containing increasingly high-frequency components. Such a decomposition can be achieved via frequency transforms such as the DCT or wavelet transforms. Finally, the basic scalability schemes can be combined to reach fine-granularity scalability, as in MPEG–4 FGS [9].

The various scalable coding methods introduced above perform the same operation over the entire video frame. In object-based temporal scalability (OTS), by contrast, the frame rate of foreground objects is enhanced so that they move more smoothly than the background.

Video transcoding is the process of converting a compressed video signal into another compressed signal with different properties. Early solutions to video transcoding determine the output format based on network and appliance constraints, independently of the semantics of the content (content-blind transcoding). Content-blind transcoding strategies include spatial resolution reduction, temporal resolution reduction, and bit-rate reduction [10]. Recent transcoding techniques make use of semantics to minimize the degradation of important image regions [11], [12]. In [13], optimal quantization parameters and frame skip are determined for each video object individually. The bit-rate budget for each object is allocated by a difficulty hint, a weight indicating the relative encoding complexity of each object. Frame skip is controlled by a shape hint, which measures the difference between two consecutive shapes to determine whether an object can be temporally downsampled without visible composition problems. Key objects are selected based on motion activity and on bit complexity.

The transcoding strategies described thus far are referred to as intramedia transcoding strategies and do not change the media nature of the input signal. On the other hand, intermedia transcoding, or transmoding, is the process of converting the media input into another media format. Examples of intermedia transcoding include speech-to-text [14] and video-to-text [15] translation. Both the intramedia and the intermedia adaptation concepts are used in this paper for video encoding, as described in the following section.

III. ADAPTIVE CONTENT DELIVERY AND DESCRIPTION USING SEMANTICS

The proposed framework for adaptive video delivery and automatic description uses video content analysis and semantic pre-filtering prior to encoding (Figure 1) in order to improve the perceived content quality and to provide additional functionalities, such as privacy preservation and automatic video indexing. Semantic video analysis and semantic encoding are described next.

A. Semantic video analysis

Semantic video analysis is used to extract the main content message from the video.
The semantics to be included in the analysis depend on the specific application. In the following, we discuss possible semantics and, in particular, we describe the use of motion as semantics.

Semantic video analysis refers to a human abstraction and uses a priori information to translate the semantics into rules. The rules are then applied through an algorithm. Examples of semantic video analysis based on a priori information are template matching, extraction of captions and text, face detection, and moving object segmentation. Template matching is used to implement the semantics when the shape of the objects we want to segment is known a priori. In this case, which includes in particular the detection of captions and text, the extraction method searches for specific object features in terms of geometry. For segmenting faces of people, color-based segmentation can be used [16]. The face detection task consists in finding the pixels whose spectral characteristics lie in a specific region of the chromaticity diagram. For extracting moving objects, motion information can be used as semantics. Several applications, such as sports broadcasting and video surveillance, deal with segmenting moving objects.

Fig. 2. Example of semantic video analysis results for the test sequences (a) Hall Monitor and (c) Highway: (b) separation of foreground and background for Hall Monitor; (d) separation of foreground and background for Highway. The background is color-coded in black.

A typical tool used to tackle the problem of object segmentation based on motion is change detection. Different change detection techniques can be employed for moving-camera and static-camera conditions. If the camera moves, change detection aims at recognizing coherent and incoherent moving areas. The former correspond to background areas, the latter to video objects. If the camera is static, the goal of change detection is to recognize moving objects (foreground) and the static background. The semantic analysis we use addresses the static-camera problem and is applicable in the case of a moving camera after global motion compensation. The change detector decides whether the foreground signal corresponding to an object is present at each pixel position. This decision is taken by thresholding the frame difference between the current frame and a frame representing the background. The frame representing the background is dynamically generated based on temporal information [17]. The thresholding aims at discarding the effect of camera noise after frame differencing. A locally adaptive threshold, τ(i, j), is used that models the noise statistics and applies a significance test. To this end, we want to determine the probability that the frame difference at a given position (i, j) is due to noise, and not to other causes. Let us suppose that there is no moving object in the frame difference; we refer to this hypothesis as the null hypothesis, H0. Let g(i, j) be the sum of the absolute values of the frame difference in an observation window of q pixels around (i, j). Moreover, let us assume that the camera noise is additive and follows a Gaussian distribution with variance σ². Given H0, the conditional pdf of the frame difference follows a χ² distribution with q degrees of freedom:

f(g(i,j) \mid H_0) = \frac{1}{2^{q/2}\,\sigma^q\,\Gamma(q/2)}\, g(i,j)^{(q-2)/2}\, e^{-g(i,j)^2/(2\sigma^2)},   (1)

where Γ(·) is the Gamma function, which can be evaluated as Γ(x+1) = xΓ(x) with Γ(1/2) = √π. To obtain a good trade-off between robustness to noise and accuracy in the detection we choose q = 25 (a 5 × 5 window centered at (i, j)). It is now possible to derive the significance test as

P\{g(i,j) \geq \tau(i,j) \mid H_0\} = \frac{\Gamma(q/2,\; g(i,j)^2/(2\sigma^2))}{\Gamma(q/2)}.   (2)

When this probability is smaller than a certain significance level, α, we consider that H0 is not satisfied at the pixel position (i, j), and we therefore label that pixel as belonging to a moving object. The significance level α is a stable parameter that does not need manual tuning along a sequence or across different sequences; experimental results indicate that valid values fall in the range from 10^-2 to 10^-6.

The change detection process produces the segmentation of the moving objects from the background (Figure 2) and, coupled with video object tracking [18], enables the subsequent extraction of object metadata, as described in the following section.
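To illustrate how the significance test above can be applied, the sketch below labels foreground pixels by evaluating the tail probability of Eq. (2) at the observed g(i, j) and comparing it with α. This is a minimal sketch, not the authors' implementation; it assumes grayscale frames, a precomputed background frame, and SciPy, whose gammaincc computes exactly the regularized upper incomplete Gamma function of Eq. (2).

```python
# Minimal sketch of the change detector of Section III-A (Eqs. (1)-(2)).
# Assumptions not stated in the paper: grayscale frames as NumPy arrays,
# a known noise standard deviation sigma, and a SciPy dependency.
import numpy as np
from scipy.ndimage import uniform_filter
from scipy.special import gammaincc  # Q(a, x) = Gamma(a, x) / Gamma(a)

def detect_foreground(frame, background, sigma=2.0, alpha=1e-4, win=5):
    """Return a boolean mask of pixels labeled as moving objects.

    sigma: camera noise standard deviation (assumed known or estimated);
    alpha: significance level (the paper reports 1e-2 to 1e-6 as valid);
    win:   side of the observation window, so q = win * win (25 in the paper).
    """
    q = win * win
    diff = np.abs(frame.astype(np.float64) - background.astype(np.float64))
    # g(i, j): sum of absolute frame differences over the q-pixel window
    # (uniform_filter computes the local mean, hence the factor q).
    g = uniform_filter(diff, size=win) * q
    # Tail probability P{g(i, j) >= tau(i, j) | H0} from Eq. (2).
    p_h0 = gammaincc(q / 2.0, g**2 / (2.0 * sigma**2))
    # Reject H0 (camera noise only) where the probability falls below alpha.
    return p_h0 < alpha
```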
B. Semantic encoding

The decomposition computed with semantic video analysis is used with an object-based encoder as well as with a traditional frame-based encoder. We will refer to the former case as object-based encoding and to the latter as frame-based encoding. Furthermore, metadata are used to efficiently encode relevant information and to enhance relevant parts of a low-quality coded video. These approaches are referred to as metadata-based encoding and metadata-enhanced encoding, respectively. Relevant examples of the modalities presented in this section are illustrated in Figure 3. The analysis and evaluation of the different approaches in terms of results and bandwidth requirements are presented in Section V.

Fig. 3. Examples of encoding modalities. (a) Sample frame from the sequence Soccer; (b) semantic frame-based encoding: the background is selectively lowpass-filtered prior to encoding; (c) metadata-based encoding: object shapes are superimposed on the background; (d) spatial resolution reduction; (e) metadata-enhanced encoding: metadata are used to enhance relevant portions of a video.

1) Object-based encoding: With object-based encoding, the encoder needs to support the coding of individual video objects (e.g., MPEG–4 object-based). Each video object is assigned to a distinct object class, according to its importance in the scene. The encoding quality can be set depending on the object class: the higher the relevance, the higher the encoding quality. One advantage of this approach is the possibility of controlling the sequencing of objects: video objects may be encoded with different degrees of compression, thus allowing better granularity for the areas in the video that are of more interest to the viewer. Moreover, objects may be decoded in their order of priority, and the relevant content can be viewed without having to reconstruct the entire image (network limitations). Another advantage is the possibility of using a simplified background (appliance limitations), so as to enhance the relevant objects. Using a simplified background aims at exploiting the task-oriented behavior of the human visual system to improve compression ratios. Recent work on foveation [19] demonstrated that a nonlinear integration of low-level visual cues, mimicking the processing in primate occipital and posterior parietal cortex, allows one to sensibly increase compression ratios. Moreover, the work reported in [20] demonstrated that an overall increase in image quality can be obtained when the increase in quality of the relevant areas of an image more than compensates for the decrease in quality of the image background.

2) Semantic frame-based encoding: The semantic frame-based encoding mode exploits semantics in a traditional frame-based encoding framework (e.g., MPEG–1). The decomposition of the scene into meaningful objects prior to encoding, referred to here as semantic pre-filtering, helps support low-bandwidth transmission. The areas belonging to the foreground class, or semantic objects, are used as the region of interest. The areas not included in the region of interest may either be eliminated, that is, set to a constant value, or lowered in importance by using a low-pass filter. The latter solution simplifies the information in the background while still retaining essential contextual information. An example of this solution is reported in Fig. 4(a); on the other hand, filtering the entire image inhibits the main content message, Fig. 4(b). Another way to take less relevant portions of an image into account before coding is to exploit the specifics of the coding algorithm: in the case of block-based coding, each background macroblock can be replaced by its DC value. Semantic frame-based encoding mimics the way humans perceive visual information and allows for a reduction of the information to be coded. A sketch of this pre-filtering step is given below.

Fig. 4. (a) Selective lowpass-filtering simplifies the information in the background, while still retaining essential contextual information; (b) filtering the entire image inhibits the main content message.
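To make the pre-filtering step concrete, the following sketch composites the extracted foreground over a simplified background before the frame is handed to a conventional frame-based encoder. It is a minimal illustration, not the authors' implementation; OpenCV is an assumed dependency, the Gaussian parameters follow the experimental setup of Section V (9x9 kernel, σ = 2), and the DC-macroblock variant mentioned above is included as an option.

```python
# Minimal sketch of semantic pre-filtering (Section III-B.2), assuming
# H x W x 3 frames and an OpenCV dependency. The foreground mask comes
# from the change detector of Section III-A.
import cv2
import numpy as np

def prefilter_frame(frame, fg_mask, ksize=9, sigma=2.0, dc_blocks=False):
    """Return a frame with a simplified background and untouched foreground."""
    if dc_blocks:
        # Variant mentioned in the paper: replace each background 16x16
        # macroblock by its DC (mean) value.
        simplified = frame.copy().astype(np.float64)
        h, w = frame.shape[:2]
        for y in range(0, h, 16):
            for x in range(0, w, 16):
                block = simplified[y:y + 16, x:x + 16]
                block[...] = block.mean(axis=(0, 1), keepdims=True)
        simplified = simplified.astype(frame.dtype)
    else:
        # Low-pass filtering, as in the experiments of Section V.
        simplified = cv2.GaussianBlur(frame, (ksize, ksize), sigma)
    mask3 = fg_mask.astype(bool)[..., None]  # broadcast over color channels
    return np.where(mask3, frame, simplified)
```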
3) Metadata-based and metadata-enhanced encoding: A further processing of the video content is performed to cope with limited device or network capabilities, as well as to automatically generate metadata. Such processing transforms the foreground objects extracted through semantic analysis into quantitative descriptors and enables video annotation. Video annotation is desirable for applications such as video surveillance, where terabytes of data are produced and need to be searched quickly. Moreover, the descriptors can be transmitted instead of the video content itself and superimposed by the terminal on a still background. For example, an object identifier and a shape descriptor are used in [21]. The object identifier is a unique numerical identifier describing the spatial location of each object in the scene. The shape descriptor is used to represent the shape of an object, ranging from a bounding box to a polygonal representation with a varying number of vertices (Figure 3(c)). This approach is useful to preserve privacy in video surveillance applications, as well as to reduce bandwidth requirements under critical network conditions. A progressive representation is used: the number of vertices corresponding to the best resolution is computed, and any number of vertices smaller than this maximum can be used, according to the requirements of the application. In addition, other features, such as color and texture descriptors, may be added in the description process. The choice of these additional features depends on the application at hand. A sketch of such a progressive shape description is given below.
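The paper does not specify the polygonalization algorithm, so the following sketch stands in with OpenCV's Douglas-Peucker simplification (cv2.approxPolyDP), refining progressively from the bounding box toward a vertex budget such as the 20-sided polygons used in the experiments of Section V. The function name and the dictionary layout are illustrative choices, not part of the original framework.

```python
# Illustrative sketch of a progressive shape descriptor (Section III-B.3),
# assuming an OpenCV dependency. Douglas-Peucker simplification is a
# stand-in for the unspecified polygonalization method of the paper.
import cv2

def describe_object(fg_mask, max_vertices=20):
    """Describe the largest foreground object at increasing resolutions,
    from the bounding box (coarsest) up to max_vertices vertices."""
    contours, _ = cv2.findContours(fg_mask.astype('uint8'),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contour = max(contours, key=cv2.contourArea)
    levels = {'bounding_box': cv2.boundingRect(contour)}  # coarsest level
    # Progressively tighten the approximation tolerance until the polygon
    # reaches the vertex budget; any intermediate polygon can be sent when
    # the network or the privacy policy requires fewer vertices.
    eps = 0.5 * cv2.arcLength(contour, True)
    polygon = None
    while eps > 0.5:
        candidate = cv2.approxPolyDP(contour, eps, True)
        if len(candidate) > max_vertices:
            break
        polygon = candidate
        eps *= 0.5
    if polygon is not None:
        levels['polygon'] = polygon.reshape(-1, 2)
    return levels
```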
In addition to the above, the descriptors can be transmitted along with the video itself and used for rendering the video content. This solution, consisting of a mixture of video-based and text-based modalities, is here referred to as metadata-enhanced encoding. With metadata-enhanced encoding, content descriptors help enhance parts of the video that are hidden or difficult to perceive due to heavy compression. In this case, the video itself is the background and the descriptors highlight relevant portions of the data. One example is the ball in a football match transmitted to a PDA or a mobile phone, as shown in Figure 3(e).

IV. QUALITY ASSESSMENT

Perceptual video quality assessment is a difficult task already when dealing with traditional coders [22]. When dealing with object-based coders, the task becomes even more challenging. For this reason, we use a combination of subjective and objective evaluation techniques to compare the performance of the different encoding modalities. Subjective evaluation includes the visual comparison of frames and frame details. This analysis is performed at different bitrates and at different frame resolutions. Objective evaluation includes temporal signal-to-noise ratio analysis and the analysis of rate-distortion curves.

A. Semantic peak signal-to-noise ratio

Traditional peak signal-to-noise ratio (PSNR) analysis uniformly weights the contribution of each pixel in an image when computing the mean squared error (MSE). This analysis gives the same importance to relevant and less relevant areas of an image. To account for the way humans perceive visual information, different areas of an image, or object classes, should be considered [11]. We take object classes into account through a distortion measure, the semantic mean squared error, SMSE, defined as

SMSE = \sum_{k=1}^{N} w_k \cdot MSE_k,   (3)

where N is the number of object classes and w_k the weight of class k. Class weights are chosen depending on the semantics, with w_k ≥ 0 for all k = 1, ..., N and \sum_{k=1}^{N} w_k = 1. The mean squared error of each class, MSE_k, can be written as

MSE_k = \frac{1}{|C_k|} \sum_{(i,j) \in C_k} d^2(i,j),   (4)

where C_k is the set of pixels belonging to the object class k and |C_k| is its cardinality. The class membership of each pixel (i, j) is defined by semantic video analysis. The error d(i, j) between the original image I_O and the distorted image I_D in Eq. (4) is the pixel-wise color distance. The color distance is computed in the 1976 CIE Lab color space, in order to obtain perceptually uniform color distances with the Euclidean norm, and is expressed as

d(i,j) = \sqrt{\Delta I^L(i,j)^2 + \Delta I^a(i,j)^2 + \Delta I^b(i,j)^2},   (5)

with ΔI^L(i,j) = I_O^L(i,j) − I_D^L(i,j), ΔI^a(i,j) = I_O^a(i,j) − I_D^a(i,j), and ΔI^b(i,j) = I_O^b(i,j) − I_D^b(i,j). The final quality evaluation metric, the semantic peak signal-to-noise ratio, SPSNR, is

SPSNR = 10 \log_{10}\left(\frac{V_M^2}{SMSE}\right),   (6)

where V_M is the maximum peak-to-peak value of the color range. When the object classes are foreground and background, N = 2 in Eq. (3). If we denote the foreground weight by w_f, then SPSNR ≡ PSNR when w_f = 0.5. The larger w_f, the more important the contribution of the foreground; when w_f = 1, only the foreground is considered in the evaluation of the peak signal-to-noise ratio. An illustration of the impact of w_f on the distortion measure is shown in Fig. 5. The figure presents a comparison of the average SPSNR of the sequence Hall Monitor for the different encoding modalities described in Section III-B as a function of w_f. The value of w_f to be used is estimated as described in the next section.

Fig. 5. Illustration of the impact of w_f on the distortion measure: average SPSNR vs. foreground weight for the Hall Monitor sequence. The five labels correspond to the following sequence types: (1) coded original; (2) temporally down-sampled; (3) spatially down-sampled; (4) static background; (5) simplified background. Content-blind coding methods (1)-(3) decrease in performance when the foreground is given more importance; methods based on semantics, (4) and (5), increase in performance when the foreground is given more importance.
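For the two-class (foreground/background) case, Eqs. (3)-(6) reduce to a few lines of code. The sketch below is a minimal illustration, not the authors' evaluation code; it assumes 8-bit RGB frames (V_M = 255) and uses scikit-image's rgb2lab for the CIE Lab conversion, though any equivalent conversion would do.

```python
# Minimal sketch of the SMSE/SPSNR computation of Eqs. (3)-(6) for
# N = 2 object classes (foreground and background). Assumptions: uint8
# RGB frames and a scikit-image dependency for the Lab conversion.
import numpy as np
from skimage.color import rgb2lab

def spsnr(original, distorted, fg_mask, w_f=0.55, v_m=255.0):
    """Semantic PSNR with foreground weight w_f (the paper equates
    w_f = 0.5 with conventional PSNR)."""
    # Squared CIE Lab color distance d^2(i, j) of Eq. (5).
    d2 = np.sum((rgb2lab(original) - rgb2lab(distorted)) ** 2, axis=-1)
    fg = fg_mask.astype(bool)
    mse_fg = d2[fg].mean()                        # Eq. (4), foreground class
    mse_bg = d2[~fg].mean()                       # Eq. (4), background class
    smse = w_f * mse_fg + (1.0 - w_f) * mse_bg    # Eq. (3), N = 2
    return 10.0 * np.log10(v_m**2 / smse)         # Eq. (6)
```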
B. Determination of the foreground weight

Subjective performance evaluation experiments have been performed to estimate the foreground weight leading to the closest match between the SPSNR prediction and human judgment. Twenty non-expert observers of different ages and backgrounds were presented a series of video sequences according to ITU-T Recommendation P.910, Absolute Category Rating [23]. The evaluation has been carried out using the MPEG–4 test sequences Akiyo, Hall Monitor, Children, and Coastguard. The video sequences have been generated using the encoding strategies described in Section III-B, at different bitrates, and rated by the observers on a scale ranging from 0 (bad) to 100 (excellent). This range of values was presented to the observers in a training phase.

The foreground weight, w_f, is determined for each test sequence by maximizing the Pearson correlation [24] between SPSNR and the subjective results. The results are summarized in Table I. For the sequence Akiyo, where the foreground covers a large area of each frame and the background is simple, the observers focused mostly on the foreground, leading to a value of w_f = 0.97. For Hall Monitor, whose background is more complex and whose objects are smaller, the foreground attracted slightly more attention than the background (w_f = 0.55). The sequence Children has a very complex and colorful background that attracted the observers' attention, resulting in foreground and background being equally weighted (w_f = 0.5). The sequence Coastguard contains camera motion that prevented the observers from steadily focusing on the background, even though the background is quite complex; in this case, the resulting foreground weight is w_f = 0.7. In general, the results confirm that large moving objects and complex backgrounds tend to attract the user's attention. Based on the data collected with the subjective experiments, it is possible to predict the foreground weight with the following formula:

w_f = \alpha \cdot r + (\delta - \beta \cdot r)\,\sigma_b + \gamma \cdot v + \delta,   (7)

where r represents the portion of the image occupied by foreground pixels, expressed as r = |C_f| / (|C_f| + |C_b|), with |C_f| and |C_b| representing the number of foreground and background pixels, respectively. The background complexity is taken into account through σ_b, the standard deviation of the luminance of the background pixels. The presence of camera motion is considered through the term v: v = 1 for a moving camera, and v = 0 otherwise. α, β, γ, and δ are constants whose values are determined from the results of the subjective experiments: α = 5.7, β = 0.108, γ = 0.2 and δ = 0.01. The final value of w_f is the average of the foreground weights over the sequence.
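As a quick check of Eq. (7), the following snippet reproduces from the stated constants the Highway value reported in Section V (r = 0.07, σ_b = 48, v = 0).

```python
# Worked example of Eq. (7), the predictor for the foreground weight w_f,
# using the constants reported in the paper.
ALPHA, BETA, GAMMA, DELTA = 5.7, 0.108, 0.2, 0.01

def predict_foreground_weight(r, sigma_b, v):
    """w_f = alpha*r + (delta - beta*r)*sigma_b + gamma*v + delta."""
    return ALPHA * r + (DELTA - BETA * r) * sigma_b + GAMMA * v + DELTA

# 5.7*0.07 + (0.01 - 0.108*0.07)*48 + 0 + 0.01 = 0.399 + 0.117 + 0.01 ~ 0.53
print(round(predict_foreground_weight(0.07, 48, 0), 2))  # -> 0.53
```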
In addition to the semantic weight, Table I provides information about the accuracy, monotonicity and consistency of the SPSNR metric. Accuracy is given by the Pearson linear correlation coefficient r_p, monotonicity by the Spearman rank-order correlation coefficient r_s, and consistency by the outliers ratio r_o [24]. The Pearson correlation of PSNR, r_p(0.5), is given for comparison. The Pearson correlation r_p and the Spearman correlation r_s are close to 1 for all sequences; thus, the accuracy and monotonicity of SPSNR are high. The outliers ratio r_o is around 10%, so the consistency of the metric is good as well. Note that using semantics improves accuracy by up to 8% (Akiyo), as compared to PSNR.

TABLE I
FOREGROUND WEIGHT AND SPSNR ACCURACY

            Akiyo   Hall monitor   Children   Coastguard
w_f         0.97    0.55           0.50       0.70
r_p(w_f)    0.95    0.90           0.95       0.92
r_p(0.5)    0.87    0.89           0.95       0.90
r_s(w_f)    0.90    0.84           0.95       0.93
r_o(w_f)    0.10    0.11           0.07       0.07

V. EXPERIMENTAL RESULTS

In this section, experimental results of the proposed semantic video encoding and annotation framework on standard test sequences are presented. The results illustrate the impact of semantic analysis on the encoding performance of frame-based as well as object-based coders, and demonstrate the use of the proposed approach for advanced applications, such as privacy preservation in video surveillance. Sample results are shown from the MPEG–4 test sequence Hall Monitor and from the MPEG–7 test sequence Highway. Both sequences are in CIF format at 25 Hz. The modalities under analysis are: (1) coded original sequence; (2) temporal resolution reduction (from 25 frames/s to 12.5 frames/s); (3) spatial resolution reduction (from CIF to QCIF); (4,6) video objects composited with a static background; (5,7) video objects composited with a simplified background. The background is simplified using a 9x9 Gaussian low-pass filter with µ = 0 and σ = 2.

The following coders have been used in the encoding process: (i) TMPGEnc 2.521.58.169 using constant bitrate (CBR) rate control for frame-based MPEG–1; (ii) the MoMuSys MPEG–4 VM reference software version 1.0 using VM5+ global rate control for object-based MPEG–4; (iii) the Expway MPEG–7 BiM Payload encoder/decoder version 02/11/07 for MPEG–7 metadata; (iv) the Kakadu JPEG 2000 codec version 4.2 for JPEG 2000 still images. The value of the foreground weight used in the objective evaluation is w_f = 0.55 for Hall Monitor, as determined with the subjective experiments, and w_f = 0.53 for Highway, computed using Eq. (7) with r = 0.07, σ_b = 48, v = 0.

Figure 6 shows the rate-distortion diagrams for the test sequences. The average SPSNR for five encoding modalities is plotted against the encoding bitrate. Figures 6 (a) and (b) show the rate-distortion diagrams for MPEG–1 at bitrates between 150 Kbit/s and 1000 Kbit/s. At low bitrates (150-300 Kbit/s), semantic encoding with a static background (4) leads to a larger SPSNR than any of the content-blind methods (1-3). This is because inter-coded static background blocks do not produce residue, and most of the available bitrate can be allocated to foreground objects. In Figures 6 (c) and (d), foreground and background are encoded in two separate streams using object-based MPEG–4 at bitrates between 100 Kbit/s and 500 Kbit/s. Here, semantic analysis is used in all five modalities. It is possible to notice that quality is improved at low bitrates by low-pass filtering the background or by using a still frame representing the background.

Figure 7 shows a sample frame from each test sequence coded with MPEG–1 at 150 Kbit/s with and without semantic pre-filtering. Figure 8 shows magnified excerpts of both test sequences coded with MPEG–1 at 150 Kbit/s. Figure 8 (top) shows the person carrying a monitor in Hall Monitor. The amount of coding artifacts is notably reduced by semantic pre-filtering ((d) and (e)). In particular, the person's mouth and the monitor are visible in (e), whereas they are corrupted by coding artifacts in the non-semantic modalities. Similar observations can be made for Figure 8 (bottom), which shows a blue truck entering the scene at the beginning of the Highway sequence. Coding artifacts are less disturbing on the object in (d) and (e) than in (a)-(c). Moreover, the front-left wheel of the truck is only visible with semantic pre-filtering ((d) and (e)).

Next, we evaluate the cost of sending metadata for metadata-based and metadata-enhanced encoding.
[Figure 6: four rate-distortion panels, average SPSNR (dB) vs. bitrate (Kbit/s); legend entries: (1) original sequence; (2) 12.5 frames/s; (3) QCIF; (4,6) lowpass-filtered background; (5,7) static background.]

Fig. 6. Rate-distortion diagrams. (a) Hall monitor, MPEG–1; (b) Highway, MPEG–1; (c) Hall monitor, MPEG–4 object-based; (d) Highway, MPEG–4 object-based.

Table II shows the bitrate required by three types of description for Hall Monitor and Highway using the MPEG–7 binary format (BiM). The MPEG–7 binary format is used for sending summary information to terminals with limited capabilities and for enhancing heavily compressed videos. The descriptions are represented by the spatial locators of the foreground objects, their bounding boxes, and an approximation of their shape with 20-sided polygons, respectively. The metadata size increases with the description complexity and with the number of objects in the scene (Hall Monitor vs. Highway). The cost of metadata-enhanced encoding can be further reduced by sending the description of critical objects only.

TABLE II
AVERAGE BITRATE OF MPEG–7 BiM SEQUENCE DESCRIPTION

Description     Spatial locator   Bounding box   Polygon shape
Hall monitor    21 Kbit/s         59 Kbit/s      89 Kbit/s
Highway         26 Kbit/s         66 Kbit/s      98 Kbit/s

In addition to the above, metadata-enhanced encoding is used for privacy preservation in video surveillance. Figure 9 shows an example of the different levels of information hiding obtained using object descriptors for the sequence Hall Monitor. A surveillance operator can be shown different video types, ranging from the full appearance of the objects (Figure 9 (a)) to the visualization of a position locator that allows the operator to derive statistics about the number of objects and their behavior and position without disclosing their identity (Figure 9 (d)). Intermediate levels of visualization include the approximation of object shapes, which hides the identity of the subjects captured by the surveillance camera while still conveying their size and form (Figure 9 (b)), and the bounding box (Figure 9 (c)). The encoding cost associated with adding this functionality to a surveillance system is 21 Kbit/s for the spatial locator, 59 Kbit/s for the bounding box and 89 Kbit/s for the polygonal shape. The choice of the description to be used depends on the trade-off between privacy and the monitoring task at hand.
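As an illustration (not the authors' renderer), the sketch below shows how a terminal could draw the privacy levels of Figure 9 from an object descriptor shaped like the output of the describe_object() sketch of Section III-B.3. OpenCV is an assumed dependency, and the level names are hypothetical labels.

```python
# Illustrative sketch of rendering the privacy levels of Figure 9 from an
# object descriptor (see the describe_object() sketch above), assuming an
# OpenCV dependency. Each level discloses progressively less identity.
import cv2

def render_privacy_level(background, descriptor, level):
    """level: 'shape', 'box', or 'position' (anything else: background only)."""
    frame = background.copy()
    x, y, w, h = descriptor['bounding_box']
    if level == 'shape' and 'polygon' in descriptor:
        # Object shape: size and form visible, identity hidden.
        cv2.fillPoly(frame, [descriptor['polygon']], color=(255, 255, 255))
    elif level == 'box':
        # Bounding box: coarse extent only.
        cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 255, 255), 2)
    elif level == 'position':
        # Position locator: statistics without any appearance information.
        cv2.circle(frame, (x + w // 2, y + h // 2), 4, (255, 255, 255), -1)
    return frame
```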
Fig. 7. Frame 190 of Hall monitor (top) and frame 44 of Highway (bottom) coded with MPEG–1 at 150 Kbit/s using different modalities: (a) coded original sequence; (b) static background; (c) simplified background.

VI. CONCLUSIONS

We presented a content-based video encoding framework which is based on semantic analysis. Semantic analysis enables the decomposition of a video into meaningful objects. Using this decomposition, the encoder may adapt its behavior to code relevant and non-relevant objects differently. Three modalities of video delivery have been discussed, analyzed, and compared using standard encoders. The first exploits semantics in traditional frame-based encoding: semantically pre-filtering the video prior to coding leads to significant improvements in video compression efficiency, in terms of bandwidth requirements as well as visual quality at low bitrates. The second modality uses metadata to efficiently encode relevant information: object descriptors are generated for content retrieval and are also used for coding at very low bit-rates or for devices with limited capabilities. The third modality combines video and metadata for visualization: metadata are used for content enhancement at low bitrates and for preserving privacy in video surveillance applications.

In the specific implementation discussed in Section V, the semantics is defined by motion. Given the modularity of the proposed encoding framework, other semantics can also be used in the analysis step; examples are face detection and text segmentation.

The quality metric used in this work is a promising first step towards measuring quality with semantics taken into account. Future work includes the study and definition of a perceptual metric that accounts for user satisfaction, depending on the application and the user preferences. To this end, an object-of-interest metric, such as that used in [3], will be an important building block of the overall quality metric. This quality metric will be used to automatically select the encoding technique that maximizes the user experience.

Fig. 8. Details of frame 280 of Hall monitor (top) and frame 16 of Highway (bottom). The sequences are encoded with MPEG–1 at 150 Kbit/s using different encoding modalities: (a) coded original sequence; (b) temporal resolution reduction; (c) spatial resolution reduction; (d) static background; (e) simplified background.

Fig. 9. Example of use of the proposed encoding framework for privacy preservation in an indoor surveillance application. Four different methods are shown, representing different privacy levels: (a) video objects; (b) object shape; (c) bounding box; (d) object position. The method can also be used to adapt the video delivery to the channel capacity and the terminal characteristics.

REFERENCES
[1] P. van Beek, J. Smith, T. Ebrahimi, T. Suzuki, and J. Askelof, “Metadata-driven multimedia access,” IEEE Signal Processing Magazine, pp. 40–52, March 2003.
[2] R. Mohan, J. Smith, and C.-S. Li, “Adapting multimedia internet content for universal access,” IEEE Transactions on Multimedia, vol. 1, no. 1, pp. 104–114, 1999.
[3] A. Vetro, H. Sun, and Y. Wang, “Object-based transcoding for adaptable video content delivery,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 3, pp. 387–401, 2001.
[4] I. P. Duncumb, P. F. Gadd, G. Wu, and J. L. Alty, “Visual radio: Should we paint pictures with words, or pictures?,” Tech. Rep., Loughborough University, UK, 2004.
[5] G. S. Pingali, A. Opalach, Y. D. Jean, and I. B. Carlbom, “Instantly indexed multimedia databases of real world events,” IEEE Transactions on Multimedia, vol. 4, no. 2, pp. 269–282, June 2002.
[6] C.-S. Li, R. Mohan, and J. R. Smith, “Multimedia content description in the Info Pyramid,” in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing, May 1998, pp. 171–178.
[7] J. R. Smith, R. Mohan, and C.-S. Li, “Scalable multimedia delivery for pervasive computing,” in Proc. of the ACM Conference on Multimedia, Oct.-Nov. 1999, vol. 1, pp. 131–140.
[8] Y. Wang, J. Ostermann, and Y.-Q. Zhang, Video Processing and Communications, Prentice Hall, 1st edition, 2001.
[9] W. Li, “Overview of fine granularity scalability in MPEG–4 video standard,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 11, no. 3, pp. 301–317, March 2001.
[10] A. Vetro, C. Christopoulos, and H. Sun, “Video transcoding architectures and techniques: an overview,” IEEE Signal Processing Magazine, vol. 20, no. 2, pp. 18–29, March 2003.
[11] R. Cucchiara, C. Grana, and A. Prati, “Semantic transcoding for live video server,” in Proceedings of ACM Multimedia, Juan-Les-Pins (France), December 2002, pp. 223–226.
[12] A. Cavallaro, O. Steiger, and T. Ebrahimi, “Semantic segmentation and description for video transcoding,” in Proceedings of the IEEE International Conference on Multimedia and Expo, July 2003, vol. 3, pp. 597–600.
[13] A. Vetro, H. Sun, and Y. Wang, “Object-based transcoding for adaptable video content delivery,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 11, no. 3, pp. 387–401, March 2001.
[14] N. Morgan and H. Bourlard, “Continuous speech recognition,” IEEE Signal Processing Magazine, vol. 12, no. 3, pp. 24–42, May 1995.
[15] K. Jung, K.I. Kim, and A.K. Jain, “Text information extraction in images and video: a survey,” Pattern Recognition, vol. 37, no. 5, pp. 977–997, May 2004.
[16] R.L. Hsu, M. Abdel-Mottaleb, and A. Jain, “Face detection on color images,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 696–706, 2002.
[17] A. Cavallaro and T. Ebrahimi, “Video object extraction based on adaptive background and statistical change detection,” in Proceedings of SPIE Electronic Imaging - Visual Communications and Image Processing, San Jose, California, USA, 2001, pp. 465–475.
[18] A. Cavallaro, O. Steiger, and T. Ebrahimi, “Multiple object tracking in
complex scenes,” in Proceedings of ACM Multimedia, Juan–Les–Pins
(France), December 2002, pp. 523–532.
[19] L. Itti, “Automatic foveation for video compression using a neurobio-
logical model of visual attention,” IEEE Trans. on Image Processing,
vol. 13, no. 10, pp. 1304–1318, 2004.
[20] A.P. Bradley and F.W.M. Stentiford, “Visual attention for region of
interest coding in JPEG 2000,” Journal of Visual Communication and
Image Representation, vol. 14, pp. 232–250, 2003.
[21] O. Steiger, A. Cavallaro, and T. Ebrahimi, “MPEG-7 description of
generic video objects for scene reconstruction,” in Proceedings of SPIE
Electronic Imaging, San Jose, California, USA, January 2002, pp. 223–
226.
[22] S. Olsson, M. Stroppiana, and J. Baina, “Objective methods for
assessment of video quality: state of the art,” IEEE Transactions on
Broadcasting, vol. 43, no. 4, pp. 487–495, 1997.
[23] ITU, “Subjective video quality assessment methods for multimedia
applications,” Tech. Rep. P.910, ITU-T Recommendation, September
1999.
[24] David Freedman, Robert Pisani, and Roger Purves, Statistics, W.W.
Norton & Company, 3 edition, 1997.
[25] Yap-Peng Tan, Yongqing Liang, and Haiwei Sun, “On the methods
and performances of rational downsizing video transcoding,” Signal
Processing: Image Communication, vol. 19, pp. 47–65, 2004.
[26] Anthony Vetro, Toshihiko Hata, Naoki Kuwahara, Hari Kalva, and Shun-
Ichi Sekiguchi, “Complexity-quality analysis of transcoding architec-
tures for reduced spatial resolution,” in IEEE Transactions on Consumer
Electronics, August 2002, pp. 515–521.
[27] P. A. A. Assunção and M. Ghanbari, “A frequency-domain video
transcoder for dynamic bit-rate reduction of MPEG–2 bit streams,” IEEE
Trans. on Circuits and Systems for Video Technology, vol. 8, no. 8, pp.
953–967, Dec. 1998.
[28] T. Shanableh and M. Ghanbari, “Heterogeneous video transcoding to
lower spatio-temporal resolutions and different encoding formats,” IEEE
Trans. on Multimedia, vol. 2, no. 2, pp. 101–110, June 2000.
[29] Y. Liang and Y.-P. Tan, “A new content-based hybrid video transcoding
method,” in Proc. of IEEE Int. Conf. on Image Processing 2001, Oct.
2001, vol. 1, pp. 429–432.
[30] A. Cavallaro, E. Salvador, and T. Ebrahimi, “Shadow detection in image
sequences,” in Proc. of IEE Conference on Visual Media Production,
London, UK, 2004, pp. 165–174.
[31] A. Vetro and H. Sun, “Media conversions to support mobile users,” in
Proc. of IEEE Canadian Conf. on Electrical and Computer Engineering,
CCECE 2001, May 2001, vol. 1, pp. 607–612.
[32] R. Mohan, J.R. Smith, and C.-S. Li, “Adapting multimedia internet
content for universal access,” IEEE Trans. on Multimedia, vol. 1, no. 1,
pp. 104–114, March 1999.
[33] H. Sun, A. Vetro, and K. Asai, “Resource adaptation based on MPEG–
21 usage environment description,” in Proc. IEEE Int. Symposium on
Circuits and Systems, May 2003, vol. 2, pp. 536–539.
[34] ISO/IEC, “Information technology – generic coding of moving pictures
and associated audio information: Video, 2nd ed.,” Tech. Rep. ISO/IEC
FDIS 13818-2:2000, ISO/IEC JTC 1/SC29/WG11, 2000.
[35] ISO/IEC, “Information technology – coding of audio-visual objects
– part 2 visual–amendment 2: Streaming video profiles,” Tech. Rep.
ISO/IEC FDIS 14496-2:2001, ISO/IEC JTC 1/SC29/WG11, 2001.
[36] Julien Bourgeois, Emmanuel Mory, and François Spies, “Video trans-
mission adaptation on mobile devices,” Journal of System Architecture,
vol. 49, pp. 475–484, 2003.
[37] Surya Nepal and Uma Srinivasan, “Dave: A system for quality
driven adaptive video delivery,” Proceedings of the 5th ACM SIGMM
international workshop on Multimedia information retrieval, pp. 223–
230, 2003.
[38] Niklas Björk and Charilaos Christopoulos, “Video transcoding for
universal multimedia access,” in ACM Multimedia Workshop, 2000, pp.
75–79.
