Video Summarization Using Deep Semantic Features

1 Nara Institute of Science and Technology (NAIST)
{otani.mayu.ob9,n-yuta,yokoya}@is.naist.jp
2 Center for Machine Vision and Signal Analysis, University of Oulu
{erahtu,jth}@ee.oulu.fi
1 Introduction
With the proliferation of devices for capturing and watching videos, video hosting
services have gained an enormous number of users. According to [1], for example,
almost one third of the people online use YouTube to upload or review videos.
This increasing popularity of Internet videos has accelerated the demand for
efficient video retrieval. Current video retrieval engines usually rely on various
types of metadata, including titles, user tags, descriptions, and thumbnails, which
are usually provided by video owners. However, such metadata may not be
descriptive enough to represent the entire content of a video. Moreover, titles
and tags are entirely up to video owners, so their semantic granularity can vary
from video to video, and such metadata can even be irrelevant to the content.
Consequently, users need to review retrieved videos, at least partially, to get a
rough idea of their content.
Fig. 1. An example of an input video and a generated video summary. The same
content (i.e., the dog) repeatedly appears in the input video with different appearances
or backgrounds, which may be semantically redundant. Our video summary successfully
reduces such redundant video segments, thanks to our deep features encoding higher-level
semantics.
Our approach first divides the original video into short video segments, for each
of which we calculate deep features in a high-dimensional, continuous semantic
space using a DNN. We then sample a subset of video segments such that the
sampled segments are semantically representative of the entire video content and
are not redundant. For sampling such segments, we define an objective function
that evaluates the representativeness and redundancy of the sampled segments.
After sampling video segments, we simply concatenate them in temporal order to
generate a video summary (Fig. 1).
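As an illustration of this pipeline, the following Python sketch selects representative segments via k-means cluster centers, as depicted in Fig. 2; the helper extract_deep_feature stands in for the DNN of Sec. 3, and the number of selected segments is an arbitrary choice rather than the paper's exact objective-based selection.

```python
# Minimal sketch of the summarization pipeline described above.
# `extract_deep_feature` is an assumed stand-in for the DNN of Sec. 3.
import numpy as np
from sklearn.cluster import KMeans

def summarize(segments, n_select=5):
    """segments: list of per-segment frame arrays; returns indices of kept segments."""
    # 1) Map each segment to a point in the semantic space.
    feats = np.stack([extract_deep_feature(s) for s in segments])  # (N, d)

    # 2) Cluster the segment features; each cluster groups semantically
    #    similar (i.e., mutually redundant) segments.
    km = KMeans(n_clusters=n_select, n_init=10).fit(feats)

    # 3) For each cluster, pick the segment closest to the cluster center
    #    as its representative.
    chosen = []
    for c in range(n_select):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1)
        chosen.append(members[np.argmin(dists)])

    # 4) Keep temporal order so the concatenated summary plays chronologically.
    return sorted(chosen)
```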
To capture various levels of semantics in the original video, deep features
play the most important role. Several types of deep features based on
convolutional neural networks (CNNs) have been proposed recently [7, 8]. These deep features
are basically trained for a certain classification task, which predicts class labels
in a certain domain, such as objects or actions. Unlike these deep features,
ours need to encode a diversity of concepts to handle a wide range of Internet
video content. To obtain such deep features, we design a DNN that maps videos
and descriptions to a common semantic space and train it with a dataset
consisting of videos and their associated descriptions. Such a dataset contains
descriptions like "a man is playing the guitar on stage," which include various
levels of semantic concepts, such as objects (man, guitar), actions (play),
and a scene (on stage). Our DNN is jointly trained on such a dataset so that
a pair of a video and its associated sentence gives a smaller Euclidean distance
in the semantic space than an irrelevant pair. We use this DNN to obtain our
deep features; therefore, our deep features capture various levels of semantic
concepts well.
The contribution of this work can be summarized as follows:
2 Related Work
focus on a certain genre of videos. For example, the importance of a video seg-
ment in broadcasting sports program may be easily defined based on the event
happening in that segment according to the rules of the sports [11]. Furthermore,
a game of some sports (e.g., baseball and American football) has a specific struc-
ture that can facilitate important segment extraction. Similarly, characters that
appear in movies are also used as domain knowledge [12]. For these domains,
various types of metadata (e.g., a textual record of scoring in a game, movie
scripts, and closed captions) help to generate video summaries [1113]. Egocen-
tric videos are another interesting example of video domains, for which a video
summarization approach using a certain set of predefined objects as a type of
domain knowledge has been proposed [14]. More recent approaches in this di-
rection adopt supervised learning techniques to embody domain knowledge. For
example, Potapov et al. [15] proposed to summarize a video focusing on a spe-
cific event and used an event classifiers confidence score as the importance of
a video segment. Such approaches, however, are almost impossible to generalize
to other genres because they heavily depend on domain knowledge.
In the last few years, video summarization has been addressed in an unsupervised
fashion or without using any domain knowledge. Such approaches introduce the
importance of a video segment using various types of criteria and cast video
summarization into an optimization problem involving these criteria. Yang et
al. [16] proposed to utilize an auto-encoder, in which the encoder converts an
input video's features into more compact ones, and the decoder then reconstructs
the input. The auto-encoder is trained with Internet videos on the same topic.
Following the intuition that the decoder can well reconstruct features from videos
with frequently appearing content, they assess segment importance based on the
reconstruction errors. Another innovative approach, presented by Zhao et al. [4],
finds a video summary that well reconstructs the rest of the original video. The
diversity of segments included in a video summary is another important criterion,
and many approaches use various definitions of diversity [3, 17, 18].
These approaches use various criteria in their objective functions, but their
relative contributions have been determined heuristically. Gygli et al. added a
supervised flavor to these approaches by learning each criterion's weight [19, 20].
One major problem with these supervised approaches is that the required datasets
do not scale because manually creating good video summaries is a cumbersome task.
Canonical views of visual concepts can be an indicator of important video
segments, and several existing works use this intuition for generating a video
summary [21-23]. These approaches basically find canonical views in a given
video, assuming that the results of image or video retrieval using the video's title
or keywords as the query contain canonical views. Although a group of images or
videos retrieved for a given video can effectively predict the importance of
video segments, retrieving these images/videos for every input video is expensive
and can be difficult because there are only a few relevant images/videos for rare
concepts.
3 Approach
Fig. 2. Our approach for video summarization using deep semantic features. We extract
uniform length video segments from an input video. The segments are fed to a CNN
for feature extraction and mapped to points in a semantic space. We generate a video
summary by sampling video segments that correspond to cluster centers in the semantic
space.
Fig. 3. A two-dimensional plot of our deep features calculated from a video, where
we reduce the deep features' dimensionality with t-SNE [29]. Some deep features are
represented by the corresponding video segments' keyframes, and the edges connecting
deep features represent the temporal adjacency of video segments. The colors of deep
features indicate clusters obtained by k-means, i.e., points with the same color belong
to the same cluster.
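A plot in the spirit of Fig. 3 can be produced with a few lines of Python; here feats is an assumed (N, d) array of per-segment deep features in temporal order, and the cluster count is illustrative.

```python
# Sketch of a Fig. 3-style visualization of per-segment deep features `feats`.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

labels = KMeans(n_clusters=10, n_init=10).fit_predict(feats)   # colors = k-means clusters
xy = TSNE(n_components=2, perplexity=30).fit_transform(feats)  # 2-D layout of the features

plt.scatter(xy[:, 0], xy[:, 1], c=labels, cmap="tab10", s=20)
plt.plot(xy[:, 0], xy[:, 1], lw=0.3, c="gray")  # edges connect temporally adjacent segments
plt.axis("off")
plt.show()
```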
The quality of the deep features is crucial in our approach. To obtain good
deep features that capture higher-level semantics, we use the DNN shown
in Fig. 4, which consists of two sub-networks that map a video and a description to
a common semantic space; we jointly train them using a large-scale dataset of
videos and their associated descriptions.
Fig. 4. The network architecture. Video segments and descriptions are encoded into
vectors of the same size. Both sub-networks, for videos and for descriptions, are trained
jointly by minimizing the contrastive loss.
To cope with higher-level semantics, we jointly train the DNN shown in Fig. 4
with pairs of videos and sentences, and we use its video sub-network for extracting
deep features. The video sub-network is a modified version of VGG [10], which
is renowned for its good classification performance. In our video sub-network,
VGG's classification layer (fc8) is replaced with two fully-connected layers
with hyperbolic tangent (tanh) nonlinearity, followed by a mean pooling layer
that fuses the different frames in a video segment. Let $V = \{v_i \mid i = 1, \dots, M\}$
be a video segment, where $v_i$ represents frame $i$. We feed the frames to the video
sub-network and compute a video representation $X \in \mathbb{R}^d$.
For the sentence sub-network, we use the skip-thought vectors of Kiros et al. [28],
which encode a sentence into a 4800-dimensional vector with an RNN. Similarly
to the video sub-network, we introduce two fully-connected layers with tanh
nonlinearity (but without a mean pooling layer), as in Fig. 4, to calculate a sentence
representation $Y \in \mathbb{R}^d$ from a sentence $S$.
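A minimal PyTorch-style sketch of the two embedding heads is given below; the 4,096- and 4,800-dimensional inputs correspond to VGG fc7 activations and skip-thought vectors, and the 1,000/300 unit sizes follow the implementation details reported later. The pretrained VGG and skip-thought encoders are assumed to be provided externally and kept fixed.

```python
import torch
import torch.nn as nn

class EmbeddingHead(nn.Module):
    """Two fully-connected layers with tanh nonlinearity, as in Fig. 4."""
    def __init__(self, in_dim, hidden_dim=1000, out_dim=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, out_dim), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)

video_head = EmbeddingHead(4096)  # on top of per-frame VGG fc7 activations
sent_head = EmbeddingHead(4800)   # on top of a skip-thought sentence vector

def video_embedding(frame_fc7):            # frame_fc7: (M, 4096) tensor
    # Per-frame embedding followed by mean pooling over the M frames -> X in R^300.
    return video_head(frame_fc7).mean(dim=0)

def sentence_embedding(skip_thought_vec):  # skip_thought_vec: (4800,) tensor
    # No pooling on the sentence side -> Y in R^300.
    return sent_head(skip_thought_vec)
```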
For training these sub-networks jointly, we use a video-description dataset
(e.g., [9]). We sample positive and negative pairs, where a positive pair consists
of a video segment and its associated description, and a negative pair consists
of a video and a randomly sampled irrelevant description. Our DNN is trained
with the contrastive loss [26], which is defined using the extracted features
$(X_n, Y_n)$ of the $n$-th video-description pair.
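As a concrete reference, a standard margin-based form of this loss (in the spirit of [26]; the pair label $y_n$ and the margin $\delta$ are our notation) can be written as

\[
  \mathcal{L}(X_n, Y_n, y_n) = y_n\, d(X_n, Y_n)^2 + (1 - y_n)\, \max\!\bigl(0,\ \delta - d(X_n, Y_n)\bigr)^2,
  \qquad d(X_n, Y_n) = \lVert X_n - Y_n \rVert_2,
\]

where $y_n = 1$ for a positive pair, $y_n = 0$ for a negative pair, and the margin $\delta$ is set as described next.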
We compute the Euclidean distances of positive pairs with the initial DNN before
training and employ the largest distance among them as the margin $\delta$. This
enables most pairs to contribute to the parameter updates at the beginning of
training. Our DNNs for videos and descriptions can be optimized using
backpropagation.
Figure 5 shows a 2D plot of the learned deep features, in which the dimensionality
of the semantic space is reduced using t-SNE [29] and a keyframe of each video
segment is placed at the corresponding position. This plot demonstrates that our
deep neural network successfully places semantically related videos at nearby points.
For example, the group of videos around the upper left area (pink) contains
cooking videos, and another group on the lower left (green) shows various sports
videos. For video summarization, we use the deep features to represent a video
segment.
From each segment, we sample M frames (M = 5). The activations of VGG's fc7
layer consist of 4,096 units. We set the unit sizes of the two fully-connected
layers to 1,000 and 300, respectively, which means our deep feature is a
300-dimensional vector. For the description sub-network, the fully-connected
layers on top of the RNN have the same sizes as the video sub-network's. During
training, we fixed the network parameters of VGG and skip-thought, but those of
the top two fully-connected layers for both the
video and description sub-networks were updated. We sampled 20 negative pairs
for each positive pair to compute the contrastive loss. Our DNN was trained over
the MSR-VTT dataset [9], which consists of 10K video clips, each annotated with 20
descriptions. We used Adam [31] to optimize the network parameters
with a learning rate of $2^{-4}$ and trained for 4 epochs.
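The following sketch, reusing video_embedding and sentence_embedding from the earlier snippet, illustrates how these training details fit together; pairs is an assumed list of positive (frame fc7 features, skip-thought vector) tuples, and the learning-rate value is a placeholder.

```python
import random
import torch

def distance(x, y):
    return torch.norm(x - y, p=2)  # Euclidean distance in the semantic space

# Margin: largest positive-pair distance under the untrained heads.
with torch.no_grad():
    delta = max(float(distance(video_embedding(v), sentence_embedding(s)))
                for v, s in pairs)

params = list(video_head.parameters()) + list(sent_head.parameters())
opt = torch.optim.Adam(params, lr=2e-4)  # placeholder learning rate

for epoch in range(4):                       # 4 epochs, as in the text
    for v, s in pairs:
        x, y = video_embedding(v), sentence_embedding(s)
        loss = distance(x, y) ** 2           # positive-pair term
        for _ in range(20):                  # 20 sampled negatives per positive
            _, s_neg = random.choice(pairs)  # a real run should exclude (v, s) itself
            y_neg = sentence_embedding(s_neg)
            loss = loss + torch.clamp(delta - distance(x, y_neg), min=0) ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()
```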
4 Experiment
To demonstrate the advantages of incorporating our deep features into video
summarization, we evaluated our approach and compared it with several baselines.
We used the SumMe dataset [19], which consists of 25 videos, for evaluation. As
the videos in this dataset are either unedited or only slightly edited, unimportant
or redundant parts are left in the videos. The dataset includes videos with various
contents. It also provides multiple manually-created video summaries for each
video, with which we compare our summaries. We compute the f-measure that
evaluates agreement with the reference video summaries using the code provided
by [19].
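For reference, a frame-level f-measure between a generated summary and a single reference summary can be computed as below; this mirrors the evaluation only in spirit and is not the released code of [19].

```python
# Frame-level f-measure between two binary per-frame selection masks.
import numpy as np

def f_measure(selected, reference):
    """selected, reference: boolean arrays of length n_frames."""
    overlap = np.logical_and(selected, reference).sum()
    if overlap == 0:
        return 0.0
    precision = overlap / selected.sum()
    recall = overlap / reference.sum()
    return 2 * precision * recall / (precision + recall)
```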
4.1 Baselines
We compared our video summaries with the following baselines as well as recent
video summarization approaches: (i) Manually-created video summaries
are a powerful baseline that may be viewed as an upper bound for automatic
approaches. The SumMe dataset provides at least 15 manually-created video
summaries per video, whose length is 15% of the original video. We computed the
average f-measure of each manually-created video summary by letting each of
the remaining manually-created video summaries serve as ground truth (i.e., if there
are 20 manually-created video summaries, we compute 19 f-measures for each summary
in a pairwise manner and calculate their average; see the sketch after this list). We
denote the summary with the highest f-measure among all manually-created video
summaries as the best-human video summary. (ii) Uniform sampling (Uni.) is a widely
used baseline for video summarization evaluation. (iii) We also compare to video
summaries generated by the same approach as ours except that VGG's fc7 activations
are used instead of our deep features, which we refer to as the VGG-based video
summary (VGG).
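A sketch of the pairwise averaging used for baseline (i), building on the f_measure function above; human_summaries is an assumed list of boolean per-frame masks, one per annotator.

```python
# Average f-measure of each human summary against all the others (baseline i).
import numpy as np

def human_baseline_scores(human_summaries):
    """human_summaries: list of boolean per-frame masks, one per annotator."""
    scores = []
    for i, summary in enumerate(human_summaries):
        others = [r for j, r in enumerate(human_summaries) if j != i]
        scores.append(np.mean([f_measure(summary, r) for r in others]))
    return scores  # the best-human summary is the argmax of these scores
```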
Fig. 6. Segments selected by our approach for the videos Car Railcrossing, Paluma
Jump, and Valparaiso Downhill. Keyframes of the selected segments are shown. The
green areas in the graphs indicate selected segments. The blue lines represent the
ratio of annotators who selected each segment for their manually-created summary.
4.2 Results
Several examples of video summaries generated with our approach are shown in
Fig. 6, along with the ratio of annotators who included each video segment in
their manually-created video summary. The peaks of the blue lines indicate
that the corresponding video segments were frequently selected for a video
summary. These blue lines demonstrate that the human annotators were consistent
to some extent. We also observe that the video segments selected by our approach
(green areas) correlate with the blue lines. This suggests that our approach
is consistent with the human annotators.
The results of the quantitative evaluation are shown in Table 1. In
this table, we report the minimum, average, and maximum f-measure scores
over the dataset. Our video summaries were generated using a relatively simple
algorithm to extract a subset of segments; nevertheless, ours outperformed the
interestingness-based approach for some videos and even achieved a better mean
f-measure score than the attention-based approach.
Our approach got low scores especially for short videos, such as Jumps
and Fire Domino. Since we extract uniform-length segments (5 seconds), our
approach extracts only a few segments from short videos, which may result
in a lower f-measure score. This limitation can be addressed by extracting shorter
video segments or by using a more sophisticated video segmentation method such as [12, 19].
We also observed that our approach got lower scores than the others on
St Maarten Landing and Notre Dame, which are challenging because of
long unimportant parts and the diversity of content, respectively. For St Maarten
Landing, as our approach is unsupervised, it failed to exclude unimportant
segments. For Notre Dame, generating a summary is difficult because there
are too many possible segments to be included in a summary. While our summary
shares only small parts with the manually-created summaries, this video is a
challenging example even for human annotators, as shown by the low scores of the
manually-created video summaries.
Figure 7 shows examples of video summaries created with our approach and the
baselines. The video Cooking shows a person cooking some vegetables while
doing a performance. Ours and the best-human video summary include the same
scene of the performance with fire, while the others do not. On the other hand, ours
extracts unimportant segments from the video Car over Camera. The original
video is highly redundant, with static scenes just showing the ground or the sky,
and such scenes form large clusters in the semantic space even though they are
unimportant. As our approach extracts representatives from each cluster, a video
with lengthy unimportant parts results in a poor video summary. We believe
that this problem can be avoided by using visual cues such as interestingness
[34] and objectness [35].
5 Conclusion
In this work, we proposed to learn semantic deep features for video summarization
and a video summarization approach that extracts a video summary
based on representativeness in the semantic feature space. For deep feature
learning, we designed a DNN with two sub-networks for videos and descriptions,
which are jointly trained using the contrastive loss. We observed that the learned
features extracted from videos with similar content form clusters in the semantic
space. In our approach, the input video is represented by deep features in the
semantic space, and segments corresponding to cluster centers are extracted to
generate a video summary. By comparing our summaries to manually-created
summaries, we showed the advantage of incorporating our deep features into a
video summarization technique. Furthermore, our results even outperformed the
worst human-created summaries. We expect that the quality of video summaries
will be improved by incorporating video segmentation methods.
Fig. 7. Video summaries for the videos Cooking and Bear Climbing (rows include
Human, Ours, VGG, and Uniform).
Moreover, our objective function can be extended by considering other criteria used
in the area of video summarization, such as interestingness and temporal uniformity.
References
1. YouTube.com: Statistics - YouTube. https://www.youtube.com/yt/press/en-GB/statistics.html (2016)
2. Gong, Y., Liu, X.: Video summarization using singular value decomposition. In:
Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition
(CVPR). (2000) 174-180
3. Gong, B., Chao, W.L., Grauman, K., Sha, F.: Diverse sequential subset selection
for supervised video summarization. In: Proc. Advances in Neural Information
Processing Systems (NIPS). (2014) 2069-2077
4. Zhao, B., Xing, E.P.: Quasi real-time summarization for consumer videos. In:
Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition
(CVPR). (2014) 2513-2520
5. Lowe, D.G.: Distinctive image features from scale invariant keypoints. Int. Journal
of Computer Vision 60(2) (2004) 91-110
6. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In:
Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition
(CVPR). (2005) 886-893
7. Yao, L., Ballas, N., Larochelle, H., Courville, A.: Describing videos by exploiting
temporal structure. In: Proc. IEEE Int. Conf. Computer Vision (ICCV). (2015)
4507-4515
8. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.:
DeCAF: A deep convolutional activation feature for generic visual recognition. In:
Proc. Int. Conf. Machine Learning (ICML). Volume 32. (2014) 647-655
9. Xu, J., Mei, T., Yao, T., Rui, Y.: MSR-VTT: A large video description dataset for
bridging video and language. In: Proc. IEEE Computer Society Conf. Computer
Vision and Pattern Recognition (CVPR). (2016) 5288-5296
10. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. In: Proc. Int. Conf. Learning Representations (ICLR). (2015)
14 pages
11. Babaguchi, N., Kawai, Y., Ogura, T., Kitahashi, T.: Personalized abstraction of
broadcasted American football video by highlight selection. IEEE Trans. Multimedia
6 (2004) 575-586
12. Sang, J., Xu, C.: Character-based movie summarization. In: Proc. ACM Int. Conf.
Multimedia (MM). (2010) 855-858
13. Evangelopoulos, G., Zlatintsi, A., Potamianos, A., Maragos, P., Rapantzikos, K.,
Skoumas, G., Avrithis, Y.: Multimodal saliency and fusion for movie summarization
based on aural, visual, and textual attention. IEEE Trans. Multimedia 15
(2013) 1553-1568
14. Lu, Z., Grauman, K.: Story-driven summarization for egocentric video. In: Proc.
IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR).
(2013) 2714-2721
15. Potapov, D., Douze, M., Harchaoui, Z., Schmid, C.: Category-specific video summarization.
In: Proc. European Conf. Computer Vision (ECCV). (2014) 540-555
16. Yang, H., Wang, B., Lin, S., Wipf, D., Guo, M., Guo, B.: Unsupervised extraction
of video highlights via robust recurrent auto-encoders. In: Proc. IEEE Int. Conf.
Computer Vision (ICCV). (2015) 4633-4641
17. Xu, J., Mukherjee, L., Li, Y., Warner, J., Rehg, J.M., Singh, V.: Gaze-enabled
egocentric video summarization via constrained submodular maximization. In:
Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition
(CVPR). (2015) 2235-2244
18. Tschiatschek, S., Iyer, R.K., Wei, H., Bilmes, J.A.: Learning mixtures of submodular
functions for image collection summarization. In: Proc. Advances in Neural
Information Processing Systems (NIPS). (2014) 1413-1421
19. Gygli, M., Grabner, H., Riemenschneider, H., van Gool, L.: Creating summaries
from user videos. In: Proc. European Conf. Computer Vision (ECCV). (2014)
505-520
20. Gygli, M., Grabner, H., van Gool, L.: Video summarization by learning submodular
mixtures of objectives. In: Proc. IEEE Computer Society Conf. Computer Vision
and Pattern Recognition (CVPR). (2015) 3090-3098
21. Song, Y., Vallmitjana, J., Stent, A., Jaimes, A.: TVSum: Summarizing web videos
using titles. In: Proc. IEEE Computer Society Conf. Computer Vision and Pattern
Recognition (CVPR). (2015) 5179-5187
22. Khosla, A., Hamid, R., Lin, C.J., Sundaresan, N.: Large-scale video summarization
using web-image priors. In: Proc. IEEE Computer Society Conf. Computer Vision
and Pattern Recognition (CVPR). (2013) 2698-2705
23. Chu, W.S., Jaimes, A.: Video co-summarization: Video summarization by visual co-occurrence.
In: Proc. IEEE Computer Society Conf. Computer Vision and Pattern
Recognition (CVPR). (2015) 3584-3592
24. Wang, X., Gupta, A.: Unsupervised learning of visual representations using videos.
In: Proc. IEEE Int. Conf. Computer Vision (ICCV). (2015) 2794-2802
25. Frome, A., Corrado, G., Shlens, J.: DeViSE: A deep visual-semantic embedding
model. In: Proc. Advances in Neural Information Processing Systems (NIPS).
(2013) 2121-2129
26. Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively,
with application to face verification. In: Proc. IEEE Computer Society Conf.
Computer Vision and Pattern Recognition (CVPR). (2005) 539-546
27. Lin, T.Y., Belongie, S., Hays, J.: Learning deep representations for ground-to-aerial
geolocalization. In: Proc. IEEE Computer Society Conf. Computer Vision
and Pattern Recognition (CVPR). (2015) 5007-5015
28. Kiros, R., Zhu, Y., Salakhutdinov, R.R., Zemel, R., Urtasun, R., Torralba, A., Fidler, S.:
Skip-thought vectors. In: Proc. Advances in Neural Information Processing
Systems (NIPS). (2015) 3276-3284
29. van der Maaten, L., Hinton, G.E.: Visualizing high-dimensional data using t-SNE.
Journal of Machine Learning Research 9 (2008) 2579-2605
30. DeMenthon, D., Kobla, V., Doermann, D.: Video summarization by curve simplification.
In: Proc. ACM Int. Conf. Multimedia (MM). (1998) 211-218
31. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: Proc. Int.
Conf. Learning Representations (ICLR). (2015) 11 pages
32. Leskovec, J., Krause, A., Guestrin, C., Faloutsos, C., VanBriesen, J., Glance, N.:
Cost-effective outbreak detection in networks. In: Proc. ACM SIGKDD Int. Conf.
Knowledge Discovery and Data Mining (KDD). (2007) 420-429
33. Ejaz, N., Mehmood, I., Wook Baik, S.: Efficient visual attention based framework
for extracting key frames from videos. Signal Processing: Image Communication
28 (2013) 34-44
34. Gygli, M., Grabner, H., Riemenschneider, H., Nater, F., van Gool, L.: The interestingness
of images. In: Proc. IEEE Int. Conf. Computer Vision (ICCV). (2013) 1633-1640
35. Alexe, B., Deselaers, T., Ferrari, V.: What is an object? In: Proc. IEEE Computer
Society Conf. Computer Vision and Pattern Recognition (CVPR). (2010) 73-80