on clockwork driven by a fixed or adaptive clock signal was proposed in [9], [15].

B. Computer Vision with Compressed Videos

Videos usually come in a compressed format for transmission and storage. Several popular video formats are widely used, including AVI [12], MPEG4 [8], and FLV [13]. Recently, there has been some work on solving computer vision problems directly on compressed videos. For example, [18] performs action recognition on MPEG4 videos and shows that operating on the motion vectors and residual errors stored in the compressed stream is more efficient than traditional methods that operate on RGB frames. [17] combines compressed-video representations with an LSTM to capture spatial and temporal information for the object detection problem. However, as far as we know, there is no existing work on semantic segmentation in compressed videos.

III. APPROACH

A. Overview

Videos are usually stored and transmitted in a compressed format, such as MPEG-4 or H.264. Most video compression techniques exploit the fact that adjacent frames in a video are often similar. As a result, only a small number of frames (called I-frames) need to be stored as regular images, while the remaining frames (called P-frames) can be represented efficiently by storing only the difference between adjacent frames.

Following prior work [18], we divide the frames of an entire video into several groups, where each group contains one I-frame and several P-frames, represented by the collection {I, P_1, P_2, ..., P_T}. The I-frame I is stored as a regular RGB image, while each P-frame P_t only stores the difference with respect to the previous frame. Our model takes {I, P_1, P_2, ..., P_T} as input, and the desired output is the semantic segmentation of every frame, regardless of the frame type. The semantic segmentation network is denoted f_s(x), where x can be either an I-frame or a P-frame. Given the ground-truth semantic segmentation masks, our learning objective is:

    L = L_ce(GT_I, f_s(I)) + Σ_{t=1}^{T} L_ce(GT_{P_t}, f_s(P_t))    (1)

where L_ce is the cross-entropy loss, GT_I is the ground-truth semantic segmentation mask of the I-frame, and GT_{P_t} is the ground-truth semantic segmentation mask of the P-frame P_t. Our goal is to learn a network that minimizes the loss function defined in Eq. 1.
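To make the objective concrete, the following is a minimal PyTorch-style sketch of Eq. 1. The function name group_loss and the tensor layout are illustrative assumptions; the paper only specifies that f_s is applied to each frame and that the per-frame cross-entropy losses are summed over a group.

```python
import torch
import torch.nn.functional as F

def group_loss(f_s, I, P, gt_I, gt_P):
    """Eq. 1: L = L_ce(GT_I, f_s(I)) + sum_t L_ce(GT_Pt, f_s(Pt)).

    f_s  -- segmentation network mapping a frame to (B, c, H, W) scores
    I    -- I-frame tensor, shape (B, 3, H, W)
    P    -- list of T P-frame tensors (each stores a frame difference)
    gt_I -- ground-truth mask for the I-frame, (B, H, W) class indices
    gt_P -- list of T ground-truth masks, one per P-frame
    """
    # cross-entropy on the I-frame prediction
    loss = F.cross_entropy(f_s(I), gt_I)
    # plus cross-entropy on every P-frame prediction in the group
    for P_t, gt_t in zip(P, gt_P):
        loss = loss + F.cross_entropy(f_s(P_t), gt_t)
    return loss
```

Note that, following Eq. 1, f_s is applied to the P-frame representation itself; the P-frame-specific processing (Fig. 3) is internal to the network.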
B. Semantic Segmentation for I-frame

To obtain the semantic segmentation of an I-frame, we use a standard encoder-decoder architecture for semantic segmentation (see Fig. 2). An I-frame is represented as a regular RGB image tensor with three channels. Let I ∈ R^{H×W×3} be the image of the I-frame, where H × W is its spatial size. We use ResNet as the backbone network to extract a feature map z(I) ∈ R^{(H/32)×(W/32)×c}, where c is the number of channels of the last convolutional layer of the feature extractor. We set c to the number of classes in semantic segmentation.
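As a sketch of this encoder, one common way to obtain z(I) is to keep a torchvision ResNet up to its last residual stage (overall stride 32) and project to c channels with a 1×1 convolution. The choice of resnet50 and the 1×1 projection are assumptions; the paper does not specify the ResNet variant or how the c channels are produced.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class IFrameEncoder(nn.Module):
    """Backbone producing z(I) of shape (B, c, H/32, W/32)."""
    def __init__(self, num_classes: int):
        super().__init__()
        backbone = resnet50()
        # keep the convolutional stages (overall stride 32),
        # dropping global average pooling and the fc classifier
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        # 1x1 convolution so the channel count equals c, the number of classes
        self.to_scores = nn.Conv2d(2048, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.to_scores(self.features(x))

z = IFrameEncoder(num_classes=19)(torch.randn(1, 3, 224, 224))
print(z.shape)  # torch.Size([1, 19, 7, 7]): c channels at 1/32 resolution
```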
The spatial size of z(I) is smaller than that of the original image I due to max-pooling in the backbone. To obtain pixel-wise predictions at the original image size, we apply an upsampling layer that enlarges z(I) to the same spatial size as the input image. We use f_s(I) ∈ R^{H×W×c} to denote the output of this upsampling layer. The c-dimensional vector at each pixel location of f_s(I) can be interpreted as the scores for classifying that pixel into each of the c classes.

Fig. 3. The process of our network on a P-frame at timestep t.
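Combining the backbone with the upsampling layer gives the complete I-frame path. Below is a minimal self-contained sketch (here with resnet18, whose last stage has 512 channels); the bilinear interpolation mode is an assumption, since the paper only mentions an upsampling layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class IFrameSegNet(nn.Module):
    """f_s for I-frames: coarse class scores upsampled back to H x W."""
    def __init__(self, num_classes: int):
        super().__init__()
        backbone = resnet18()
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.to_scores = nn.Conv2d(512, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.to_scores(self.features(x))   # z(I): (B, c, H/32, W/32)
        # enlarge z(I) to the input's spatial size: f_s(I) in R^{H x W x c}
        return F.interpolate(z, size=x.shape[-2:], mode="bilinear",
                             align_corners=False)

scores = IFrameSegNet(num_classes=19)(torch.randn(1, 3, 256, 512))
pred = scores.argmax(dim=1)  # per-pixel class with the highest score
print(scores.shape, pred.shape)  # (1, 19, 256, 512), (1, 256, 512)
```

A network of this form can be plugged in as f_s in the group_loss sketch of Eq. 1 above.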