
Semantic Segmentation in Compressed Videos

Ang Li*    Yiwei Lu*    Yang Wang
University of Manitoba, Winnipeg, Canada
[email protected]  [email protected]  [email protected]
* Equal Contribution

Abstract—Existing approaches for semantic segmentation in videos usually extract each frame as an RGB image, then apply standard image-based semantic segmentation models on each frame. This is time-consuming. In this paper, we tackle this problem by exploiting the nature of video compression techniques. A compressed video contains three types of frames: I-frames, P-frames, and B-frames. I-frames are stored as regular images, P-frames are stored as motion vectors and residual errors, and B-frames are bidirectional frames that can be regarded as a special case of P-frames. We propose a method that directly operates on I-frames (as RGB images) and P-frames (motion vectors and residual errors) in a video. Our proposed model uses a ConvLSTM module to capture the temporal information in the video required for producing the semantic segmentation on P-frames. Our experimental results show that our method runs much faster than the alternatives while achieving similar accuracy.

Fig. 1. Current solutions for semantic segmentation in videos require extracting all frames as regular RGB images, then processing each image separately to produce its semantic segmentation. This leads to heavy computation and low speed. In this paper, we propose a semantic segmentation method that directly operates on compressed videos without extracting all frames.

I. INTRODUCTION

Semantic segmentation in videos is of crucial importance for real-time applications such as autonomous driving. Existing approaches usually operate on a frame-by-frame basis: they first extract each frame as a regular RGB image, then apply a standard image-based semantic segmentation model to that frame. These methods suffer from very high computational cost and low speed. Videos are typically encoded at 15 to 30 frames per second (fps), yet a frame-by-frame model needs about 0.17 s to segment a single frame. For example, a 2-minute video played at 30 fps contains 3,600 frames, so a frame-by-frame model takes roughly 3,600 × 0.17 s ≈ 612 s, i.e., about 10 minutes, to segment the whole video. As a result, such methods are not applicable to real-time semantic segmentation scenarios such as self-driving.

Existing frame-by-frame approaches also ignore the fact that videos usually come in a compressed format for transmission and storage. In this paper, we propose a semantic segmentation method that directly operates on compressed videos. Working directly with compressed videos provides several advantages. First, since we do not need to extract frames from the video, our method can be much faster. Second, the compressed video directly provides motion information that RGB images do not have, so our method can exploit this information and take the temporal structure of a video clip into account.

Existing work has already explored the use of compressed videos in computer vision tasks such as action recognition [18] and object detection [17]. However, to the best of our knowledge, this is the first work to use compressed videos for semantic segmentation. We propose a ConvLSTM model that propagates temporal information from an I-frame to the succeeding P/B-frames for semantic segmentation. Our experimental results show that the proposed method performs better than or on par with standard frame-based methods, while running at a much faster speed.

II. RELATED WORK

A. Semantic Segmentation

The goal of semantic segmentation is to assign a label to each pixel in an image (see Fig. 1). For semantic segmentation in images, many models apply deep convolutional neural networks [6], [7], [16]; for example, FCN [10], dilated convolutions [19], and SegNet [1] are widely used. To apply semantic segmentation to videos, the most popular approach is to extract each frame of the video as an image, then run a standard image-based semantic segmentation algorithm on each frame.

For semantic segmentation in videos, there is always a trade-off between accuracy and efficiency. To obtain higher accuracy, a method that handles the spatial and temporal features of video semantic segmentation was proposed in [5], and a pyramid scene parsing network was applied in [20] to produce more accurate segmentation, but these methods require a lot of computation time. A model that focuses on a single annotated object was proposed in [3]. To reduce the computation time, methods based on clockwork modules driven by fixed or adaptive clock signals were proposed in [9], [15].

Fig. 2. We divide the video into groups. Each group contains one RGB image for the I-frame and 11 P-frames represented by motion vectors and residual errors. I-frames and P-frames are processed differently: we first obtain a semantic segmentation of the I-frame based on ResNet. The I-frame features are then used as the initial state of a ConvLSTM module, which takes the information of each P-frame to update its hidden state. At each time step, the module produces a semantic segmentation prediction.
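
As a rough illustration of the pipeline described in the caption of Fig. 2, the sketch below traces one group through the model. All module names (iframe_net, pframe_encoder, convlstm_cell, upsample_head) are placeholders of ours, not names from the paper, and the zero initialization of the ConvLSTM cell state is an assumption; Sections III-B and III-C below define the I-frame and P-frame branches such modules would implement.

```python
import torch

def segment_group(i_frame, p_frames, iframe_net, pframe_encoder, convlstm_cell, upsample_head):
    """Trace one group {I, P_1, ..., P_T} through the pipeline of Fig. 2 (module names are ours).

    i_frame:  (B, 3, H, W) RGB tensor of the I-frame.
    p_frames: list of T (motion_vector, residual_error) tensor pairs for P_1..P_T.
    """
    # I-frame branch: encoder-decoder segmentation; its feature map initialises the ConvLSTM.
    z_i, seg_i = iframe_net(i_frame)              # z_i: (B, c, H/32, W/32), seg_i: (B, c, H, W)
    h, cell = z_i, torch.zeros_like(z_i)          # h(0) = z(I); zero cell state is our assumption

    predictions = [seg_i]
    for motion, residual in p_frames:
        # P-frame branch: encode motion vector and residual error, update the ConvLSTM, decode.
        x = pframe_encoder(motion, residual)      # concatenated features cat(z1(t), z2(t))
        h, cell = convlstm_cell(x, (h, cell))
        predictions.append(upsample_head(h))      # full-resolution class scores for this P-frame
    return predictions
```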

B. Computer Vision with Compressed Videos

Videos usually come in a compressed format for transmission and storage. Several popular compression formats are widely used, including AVI [12], MPEG4 [8], and FLV [13]. Recently, there has been some work on solving computer vision problems directly on compressed videos. For example, [18] uses MPEG4 videos for action recognition and shows that operating on motion vectors and residual errors in compressed videos is more efficient than traditional methods that operate on RGB frames. [17] combines compressed video representations with an LSTM to obtain spatial and temporal information for object detection. However, as far as we know, there is no existing work on semantic segmentation in compressed videos.

III. APPROACH

A. Overview

Videos are usually stored and transmitted in a compressed format, such as MPEG-4 or H.264. Most video compression techniques exploit the fact that adjacent frames in a video are often similar. As a result, only a small number of frames (called I-frames) need to be stored as regular images, while the remaining frames (called P-frames) can be represented efficiently by storing only the difference between frames.

Following prior work [18], we divide the frames of a video into several groups, where each group contains one I-frame and several P-frames, represented by the collection {I, P_1, P_2, ..., P_T}. The I-frame I is represented as a regular RGB image, while each P-frame P_t only stores the difference with respect to the previous frame. Our model takes {I, P_1, P_2, ..., P_T} as input. The desired output is the semantic segmentation of every frame, regardless of the frame type. The semantic segmentation network is denoted by f_s(x), where x can be either an I-frame or a P-frame. Given the ground-truth semantic segmentation masks, our learning objective function is

L = L_ce(GT_I − f_s(I)) + Σ_{t=1}^{T} L_ce(GT_{P_t} − f_s(P_t))    (1)

where L_ce is the cross-entropy loss, GT_I is the ground-truth semantic segmentation mask of the I-frame, and GT_{P_t} is the ground-truth mask of the P-frame P_t. Our goal is to learn a network that minimizes the loss function defined in Eq. 1.
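
Eq. 1 writes the loss in terms of the ground truth and the prediction; in practice this is the standard per-pixel cross-entropy between class scores and ground-truth labels. The following is a minimal sketch of the per-group objective under that reading, assuming the network outputs per-pixel class scores and the ground truth is stored as class-index maps; we use PyTorch's cross-entropy purely for illustration, since the paper does not specify an implementation.

```python
import torch.nn.functional as F

def group_loss(seg_i, segs_p, gt_i, gts_p):
    """Eq. 1: cross-entropy on the I-frame plus the sum over the T P-frames of the group.

    seg_i:  (B, c, H, W) class scores for the I-frame.
    segs_p: list of T tensors of class scores for P_1..P_T.
    gt_i:   (B, H, W) long tensor of ground-truth class indices for the I-frame.
    gts_p:  list of T (B, H, W) long tensors for the P-frames.
    """
    loss = F.cross_entropy(seg_i, gt_i)
    for seg_t, gt_t in zip(segs_p, gts_p):
        loss = loss + F.cross_entropy(seg_t, gt_t)
    return loss
```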
B. Semantic Segmentation for I-frame

In order to obtain the semantic segmentation of an I-frame, we use a standard encoder-decoder architecture for semantic segmentation (see Fig. 2). An I-frame is represented as a regular RGB image tensor with three channels. Let I ∈ R^{H×W×3} be the image of the I-frame, where H × W is the spatial size of the image. We use ResNet as the backbone network to extract a feature map of the image, denoted z(I) ∈ R^{(H/32)×(W/32)×c}, where c is the number of channels of the last convolutional layer of the feature extractor. We set c to the number of classes in the semantic segmentation. The spatial size of z(I) is smaller than that of the original image I due to max-pooling. In order to obtain pixel-wise predictions at the original image size, we apply an upsampling layer that enlarges z(I) to the same spatial size as the input image. We use f_s(I) ∈ R^{H×W×c} to denote the output of this upsampling layer. The c-dimensional vector at each pixel location of f_s(I) can be interpreted as the scores for classifying that pixel into each of the c classes.
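
A minimal sketch of the I-frame branch described above, using torchvision's ResNet-18 as a stand-in backbone (the paper only says "ResNet", without specifying the depth) and bilinear upsampling as the upsampling layer; this is our illustration under those assumptions, not the authors' implementation.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class IFrameSegNet(nn.Module):
    """Encoder-decoder for I-frames: ResNet features -> c-channel map z(I) -> upsampled scores f_s(I)."""

    def __init__(self, num_classes):
        super().__init__()
        backbone = resnet18(weights=None)  # torchvision >= 0.13; older versions use pretrained=False
        # Keep everything up to the last residual block; overall output stride is 32.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        # 1x1 convolution maps the 512-channel ResNet-18 features to c = num_classes channels.
        self.classifier = nn.Conv2d(512, num_classes, kernel_size=1)

    def forward(self, x):                                  # x: (B, 3, H, W)
        z = self.classifier(self.encoder(x))               # z(I): (B, c, H/32, W/32)
        seg = F.interpolate(z, size=x.shape[-2:], mode="bilinear", align_corners=False)
        return z, seg                                      # feature map for the ConvLSTM, and f_s(I)
```

Returning both the low-resolution feature map z(I) and the full-resolution scores f_s(I) matches how the feature map is reused as the initial ConvLSTM state in Section III-C.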

C. Semantic Segmentation for P-frame


Fig. 3. The processing of our network for a P-frame at time step t.

Since a P-frame is represented as the difference from the previous frame, a P-frame by itself does not contain enough information for semantic segmentation. To segment a P-frame, we should intuitively capture the temporal information between this P-frame and the preceding I-frame. In this work, we apply a ConvLSTM module to accumulate the information of previous frames (see Fig. 3) that is needed for segmenting a particular P-frame at time t.

Let P_t denote the P-frame at time t. A P-frame is represented by a motion vector and a residual error (see Fig. 2), both of which can be interpreted as images. We apply two different CNNs to extract features from these two images, denoted z_1(t) and z_2(t), where z_1, z_2 ∈ R^{(H/32)×(W/32)×c}. We then concatenate z_1 and z_2 as the input to the ConvLSTM at time t (see Fig. 3).

The ConvLSTM module processes information starting from the I-frame in the group. We set the initial hidden state h(0) of the ConvLSTM to the feature map of the corresponding I-frame, i.e., h(0) = z(I). For the P-frame P(t) at time t, we simply take the concatenated features cat(z_1(t), z_2(t)) (where cat denotes the concatenation operation) as the input to the ConvLSTM at time t.

We regard the hidden state h(t) as the feature representation of the P-frame P(t). Since h(t) has accumulated the information of all frames from the I-frame up to P(t), it carries enough information for the semantic segmentation of P(t). We therefore take h(t) as input to an upsampling layer to obtain the semantic segmentation of P(t).
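
The paper does not spell out the ConvLSTM update equations. The sketch below shows a standard ConvLSTM cell (the usual LSTM gates computed with convolutions) of the kind the P-frame branch could use, with the concatenated features cat(z_1(t), z_2(t)) as input; the gate layout is the conventional one and is our assumption, not taken from the paper.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Generic ConvLSTM cell: LSTM gates computed with convolutions (not the authors' code)."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        # One convolution produces all four gates: input, forget, output, candidate.
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state                                          # hidden and cell state, (B, hidden, H', W')
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)
        h = o * torch.tanh(c)
        return h, c

# Per P-frame: concatenate the motion-vector and residual features and update the state, e.g.
#   cell = ConvLSTMCell(in_channels=2 * c, hidden_channels=c)
#   h, c_state = cell(torch.cat([z1_t, z2_t], dim=1), (h, c_state))
```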
IV. EXPERIMENTS

In this section, we first describe our experimental setup and datasets in Section IV-A, then present the experimental results in Section IV-B.

A. Experimental Setup

a) Datasets: We evaluate the performance of our approach on the Cambridge-driving Labeled Video Database (CamVid) [2] and the Cityscapes dataset for semantic understanding of urban street scenes [4]. CamVid provides object-class semantic labels that assign each pixel to one of 32 semantic classes. Most videos are captured with a fixed-position, CCTV-style camera from the perspective of a driver in a car; the driving scenes increase the number and heterogeneity of the observed object classes. We use three videos from CamVid: seq06R0, seq01TP, and seq05VD. This gives 1,436 groups containing 17,239 frames in total, where each group contains 12 frames (1 I-frame and 11 P-frames), and we use 19 semantic classes in the selected images. We split the groups into 70% training data (1,005 groups) and 30% test data (431 groups). Cityscapes provides an image segmentation dataset in a self-driving environment and is used to evaluate the performance of visual algorithms for the semantic understanding of urban scenes. It contains 50 different scenes with different backgrounds and different seasons of streetscapes. It gives 960 groups containing 11,520 frames in total at 15 fps, with 19 classes. We split the groups into 70% training data (672 groups) and 30% test data (288 groups).

b) Ground-truth labels: The videos in our evaluation datasets do not contain ground-truth labels for all frames. To obtain the ground truth, we first decompress each video and extract all frames as regular RGB images. We then run ResNet [14] to obtain semantic segmentation maps for all frames from their RGB images and use the predicted segmentation maps as the ground truth.

c) Evaluation metrics: We use mean Intersection over Union (MeanIoU) and pixel accuracy to measure segmentation performance (a reference sketch of both metrics is given after this section). We also measure the speed of the proposed approach at inference time.

d) Baselines: We consider the following baseline methods for comparison. First, we consider standard semantic segmentation models that operate on regular images, including FCN-32s, FCN-8s, and ResNet [5], [11]. Note that these baselines cannot directly handle the compressed video format; they have to extract each frame as a regular image in order to predict its semantic segmentation. Since there is no existing work that directly produces semantic segmentation for compressed videos, we also define our own baseline as follows. This baseline first produces the semantic segmentation map of the I-frame; for the remaining P-frames in the group, it simply reuses the I-frame's segmentation map as the prediction for each P-frame.
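
For reference, the two reported metrics can be computed from a per-class confusion matrix as in the sketch below (our implementation, assuming integer label maps; labels outside [0, num_classes) are masked out).

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """Accumulate a num_classes x num_classes confusion matrix from integer label maps."""
    mask = (gt >= 0) & (gt < num_classes)                 # ignore labels outside the class range
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def pixel_accuracy_and_miou(conf):
    """Pixel accuracy = correct / total; IoU_k = TP_k / (TP_k + FP_k + FN_k), averaged over classes."""
    acc = np.diag(conf).sum() / conf.sum()
    union = conf.sum(axis=1) + conf.sum(axis=0) - np.diag(conf)
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = np.diag(conf) / union                       # NaN for classes absent from both maps
    return acc, np.nanmean(iou)
```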
Fig. 4. Speed and accuracy on CamVid, compared to FCN-32s, FCN-8s, and ResNet.

TABLE I
EVALUATING PERFORMANCE OF FCN, RESNET, AND OUR APPROACH FOR VIDEO SEMANTIC SEGMENTATION ON CAMVID

Network       Pixel Accuracy   MeanIoU
FCN-32s [5]   91%              46.1%
FCN-8s [5]    92.6%            49.7%
ResNet [5]    95%              53%
Ours          94%              51%

TABLE II
EVALUATING INFERENCE TIME OF FCN, RESNET, AND OUR APPROACH FOR VIDEO SEMANTIC SEGMENTATION ON CAMVID

Network   Inference time (ms per frame)
FCN-32s   42.5
FCN-8s    56
ResNet    168
Ours      17

TABLE III
EVALUATING PERFORMANCE OF THE BASELINE AND OUR APPROACH FOR VIDEO SEMANTIC SEGMENTATION ON CAMVID AND CITYSCAPES

CamVid
Network    Pixel Accuracy   MeanIoU
Baseline   89%              25%
Ours       94%              51%

Cityscapes
Network    Pixel Accuracy   MeanIoU
Baseline   80%              22%
Ours       87%              34%
B. Results

We first compare the different methods in terms of both accuracy and inference speed on the CamVid dataset. The comparisons are shown in Table I, Table II, and Fig. 4. Our method achieves better performance than FCN-32s and FCN-8s [5] in terms of both accuracy and speed. Our method performs comparably to ResNet in terms of MeanIoU and pixel accuracy, but is much faster.

We also compare the performance of our method against the baseline defined earlier in Table III. Our method achieves higher pixel accuracy and MeanIoU.

V. CONCLUSION

We have proposed a new method for semantic segmentation in compressed videos. Our method does not require extracting each frame as an RGB image. Instead, it directly operates on the compressed video format consisting of I-frames and P-frames. Our model uses a ConvLSTM module to capture the temporal information required for segmenting the P-frames. Our experimental results show that the proposed method performs on par with frame-based methods in terms of accuracy, but runs at a much higher speed during inference. We believe our method can potentially be used in real-time applications where efficiency is crucial.

VI. ACKNOWLEDGEMENT

This work was supported by NSERC. We thank NVIDIA for donating some of the GPUs used in this work.

REFERENCES

[1] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[2] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. In European Conference on Computer Vision, 2008.
[3] S. Caelles, K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. V. Gool. One-shot video object segmentation. CoRR, 2016.
[4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[5] M. Fayyaz, M. H. Saffar, M. Sabokrou, M. Fathy, F. Huang, and R. Klette. Stfcn: spatio-temporal fully convolutional neural network for semantic segmentation of street scenes. In Asian Conference on Computer Vision, 2016.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for
image recognition. In IEEE Conference on Computer Vision and Pattern
Recognition, 2016.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification
with deep convolutional neural networks. In Advances in Neural
Information Processing Systems, 2012.
[8] D. J. LeGall. Mpeg (moving pictures expert group) video compression
algorithm: a review. In Image Processing Algorithms and Techniques II,
1991.
[9] Y. Li, J. Shi, and D. Lin. Low-latency video semantic segmentation.
CoRR, 2018.
[10] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for
semantic segmentation. In IEEE Conference on Computer Vision and
Pattern Recognition, 2015.
[11] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for
semantic segmentation. In IEEE Conference on Computer Vision and
Pattern Recognition, 2015.
[12] G. Maertens and K. Soroushian. Accurate and error resilient time
stamping method and/or apparatus for the audio-video interleaved (avi)
format, 2007.
[13] A. Mozo, M. Obien, C. Rigor, D. Rayel, K. Chua, and G. Tangonan.
Video steganography using flash video (flv). In IEEE Instrumentation
and Measurement Technology Conference, 2009.
[14] D. Pakhomov, V. Premachandran, M. Allan, M. Azizian, and N. Navab.
Deep residual learning for instrument segmentation in robotic surgery.
arXiv preprint arXiv:1703.08580, 2017.
[15] E. Shelhamer, K. Rakelly, J. Hoffman, and T. Darrell. Clockwork
convnets for video semantic segmentation. CoRR, 2016.
[16] K. Simonyan and A. Zisserman. Very deep convolutional networks for
large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[17] S. Wang, H. Lu, P. Dmitriev, and Z. Deng. Fast object detection in
compressed video. CoRR, 2018.
[18] C.-Y. Wu, M. Zaheer, H. Hu, R. Manmatha, A. J. Smola, and
P. Krähenbühl. Compressed video action recognition. In IEEE Con-
ference on Computer Vision and Pattern Recognition, 2018.
[19] F. Yu and V. Koltun. Multi-scale context aggregation by dilated
convolutions. arXiv preprint arXiv:1511.07122, 2015.
[20] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing net-
work. In IEEE conference on Computer Vision and Pattern Recognition,
2017.
