
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 8, AUGUST 2023

Brief Papers

AudioVisual Video Summarization

Bin Zhao, Member, IEEE, Maoguo Gong, Senior Member, IEEE, and Xuelong Li, Fellow, IEEE
Abstract— Audio and vision are two main modalities in video data. Multimodal learning, especially audiovisual learning, has drawn considerable attention recently and can boost the performance of various computer vision tasks. However, in video summarization, most existing approaches exploit only the visual information while neglecting the audio information. In this brief, we argue that the audio modality can assist the vision modality to better understand the video content and structure and can further benefit the summarization process. Motivated by this, we propose to jointly exploit the audio and visual information for the video summarization task and develop an audiovisual recurrent network (AVRN) to achieve this. Specifically, the proposed AVRN can be separated into three parts: 1) the two-stream long short-term memory (LSTM) is used to encode the audio and visual features sequentially by capturing their temporal dependency; 2) the audiovisual fusion LSTM is used to fuse the two modalities by exploring the latent consistency between them; and 3) the self-attention video encoder is adopted to capture the global dependency in the video. Finally, the fused audiovisual information and the integrated temporal and global dependencies are jointly used to predict the video summary. Practically, the experimental results on the two benchmarks, i.e., SumMe and TVsum, have demonstrated the effectiveness of each part and the superiority of AVRN compared with those approaches exploiting only visual information for video summarization.

Index Terms— Audiovisual learning, multimodal learning, recurrent network, video summarization.

Manuscript received 18 January 2021; revised 10 July 2021; accepted 4 October 2021. Date of publication 25 October 2021; date of current version 4 August 2023. This work was supported in part by the National Key Research and Development Program of China under Grant 2018AAA0102200, in part by the National Natural Science Foundation of China under Grant 62106183, in part by the Natural Science Basic Research Program of Shaanxi under Grant 2021JQ-204, and in part by the China Postdoctoral Science Foundation under Grant 2020TQ0236. (Corresponding author: Bin Zhao.)
Bin Zhao is with the Academy of Advanced Interdisciplinary Research, Xidian University, Xi'an 710071, China, and also with the School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an 710072, China (e-mail: [email protected]).
Maoguo Gong is with the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, International Research Center for Intelligent Perception and Computation, Xidian University, Xi'an 710071, China (e-mail: [email protected]).
Xuelong Li is with the School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an 710072, China (e-mail: [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNNLS.2021.3119969.
Digital Object Identifier 10.1109/TNNLS.2021.3119969

I. INTRODUCTION

Video summarization is a typical computer vision task developed for video analysis. It can distill the video information effectively by extracting several key-frames or key-shots to display the video content [1]. With the help of the video summary, the viewer can perceive the information without watching the whole video. Therefore, the video summary provides an efficient way for video browsing. Moreover, by removing the redundant and meaningless video content, it has potential applications in video retrieval, storage, and indexing [2], [3], as well as boosting the performance of related video analysis tasks, such as video captioning [4] and action recognition [5].

In the data explosion era, video summarization draws increasing attention, and many approaches have been proposed in recent years [6]–[8]. The existing approaches mainly summarize videos in two aspects. On one hand, comprehensive models are developed to summarize videos according to manual criteria [9]–[11], including representativeness, importance, interestingness, and so on. They are developed to select the key-frames or key-shots that represent the whole video content, contain the important objects, have less redundancy, etc., so as to distill the video information effectively. On the other hand, the video data are taken as a sequence of frames, and the summary is generated according to the temporal dependencies among frames. To achieve this, the most popular recurrent neural network, long short-term memory (LSTM) [12], is used as the backbone in video summarization. Recently, various sequence models have been developed based on it, such as bidirectional LSTM [13], hierarchical LSTM [14], [15], and attention-based LSTM [16], [17]. By taking advantage of the deep learning and sequential modeling ability of LSTM, they surpass the traditional approaches developed based on manual criteria and take the leading position.

A. Motivation and Overview

Video data are naturally composed of two modalities, i.e., audio and vision. They record the activities from different aspects and cooperate to help the viewer understand the video content. Recently, multimodal learning has proved that the audio and vision modalities share a consistency space and that there are semantic relationships between them [18], [19]. Many relevant video analysis tasks have demonstrated that the performance of previously single-modality tasks is promoted by using multimodal information [20]–[22]. However, few researchers in video summarization have recognized the potential contribution of audio information to the performance. Most of them consider only the vision modality and extract shallow or deep visual features to represent video frames, while the audio features are ignored.

In this brief, we argue that the audio modality can assist the vision modality to better understand the video content and structure. Concretely, the audio and vision modalities are complementary in presenting activities. For example, the music at a party reflects the pleasant atmosphere of the scene, and the cheers in soccer games indicate a good goal. However, audiovisual inconsistency also occurs in videos, for example, when the sounding object is not in the field of view. This brings interference to the vision modality, which is the main challenge in audiovisual video summarization.

Facing the above opportunities and challenges, we propose an audiovisual recurrent network (AVRN) to jointly use audio and visual information in the video summarization task. To guarantee the consistency, the fusion of audio and visual information is performed in two stages. The two-stream LSTM is used in the first stage to encode the audio and visual features sequentially and capture their temporal dependency. Then, the audiovisual fusion LSTM is developed to exploit the consistency space between audio and visual information and fuse them with an adaptive gating mechanism. Besides, considering that there are multihop storylines in the video stream due to montage and editing, the temporal neighborhood dependency alone is not capable enough for video summarization. In this case, a self-attention video encoder is adopted to encode the global video information. Finally, the temporal and global dependencies of the audiovisual information captured by the sequence encoder and the global encoder are jointly used for predicting the video summary.

B. Novelties and Contributions

The novelties and contributions of our work are as follows.
1) The audio information is introduced to the video summarization task, so as to complement the visual information in video content and structure modeling.
2) A hierarchical multimodal LSTM is developed to exploit the latent consistency space between the two modalities and capture the temporal dependencies.
3) By combining the LSTM and the self-attention encoder, the global and temporal dependencies are captured jointly to benefit the summarization process.

C. Organization

The rest of this brief is organized as follows. The relevant works in the literature are analyzed in Section II. The proposed AVRN is described in Section III. The experimental results are discussed and compared with the state-of-the-art in Section IV. Finally, the conclusion of our work is drawn in Section V.

II. RELATED WORKS

A. Traditional Video Summarization

Earlier works are devoted to finding a set of representatives to summarize the video content. Hand-crafted feature extractors are first used to extract feature vectors for each frame, such as color histogram [23], optical flow [24], and histograms of gradient [25]. To determine the representativeness, clustering algorithms and dictionary learning methods are used to generate the video summary [26], [27]. For example, k-means [28] and k-medoids [29] allocate frames into different clusters and obtain the cluster centers. Naturally, the cluster centers are viewed as the representatives and selected into the summary. Dictionary learning is also an effective method to select the representatives. It takes the frame sequence as a dictionary and tries to determine a subset of the elements to represent the original video [30], [31]. Sparsity is a widely used prior for dictionary-based video summarization. To achieve this, the l0 and l0,1 norms are added as the regularizer, and the block sparsity constraint is designed to speed up the convergence [32]–[34].

Later, researchers realized that representativeness alone is not enough to quantify the summary quality, and more manual criteria were proposed. A user attention model is constructed in [35] by combining the visual, audio, and textual information. It shows the superiority of multimodal information and inspires our work. Diversity is designed to reduce the redundancy in the summary, where similar clips are removed from the video. In this case, the key point is to determine the similarity among frames and shots. In [36], the Segeral distance is used as the similarity metric. Furthermore, several distance metrics are combined in [37], and the determinantal point process (DPP) model is modified to select the diverse key-frames sequentially [38]. Importance is developed to constrain the summary to maintain important objects in the video, in which several local features are used to determine the importance of different objects [39], [40], including distance to the frame centroid, frequency of occurrence, esthetics metrics, etc. To model the summary comprehensively, several criteria are combined in [24] and [41], including importance, representativeness, uniformity, and storyness. They measure the summary quality in different aspects and provide a comprehensive score function.

B. Deep-Learning-Based Video Summarization

Recently, deep-learning-based approaches have made tremendous progress and taken the leading position in video summarization [8], [42]. They prefer to learn the complex mapping from video to summary directly by taking advantage of the learning ability of deep networks. Specifically, RNNs are used to model the frame sequence. In [13], the bidirectional LSTM is developed to exploit the temporal dependency. Considering that the most favorable video length for LSTM is less than 100 frames, a stacked memory network is proposed in [43], where the LSTM is augmented with a memory module to boost the performance on long-term dependency modeling. Similarly, a hierarchical LSTM is developed to extend the capability of dealing with long sequences [14]. Different from recursive models, a self-attention model is adopted in [17] as the video encoder, and a fully convolutional sequence network is conducted in [44]. However, they both consider only one side of the dependencies among frames, i.e., temporal dependencies or global dependencies, while neglecting the other one.

To boost the performance, different mechanisms are developed to optimize the sequence model [6], [7], [45]. An attentive and semantic preserving model is proposed in [42] to explore the inherent relationships and minimize the semantic information loss between video and summary. A dual mixture attention is developed in [46], which shows better generalization to small datasets. In SUM-GAN [47], a discriminator is conducted after the summary generator, and the generated summary is discriminated by the adversarial loss. The encoder–decoder architecture is designed in [48], where the decoder is used to recover the video content from the obtained summary in the semantic space. It can guide the encoder to select key-shots that best represent the video content. Similarly, a dual learning framework is developed in [1] based on a summary generator and a video reconstructor. In this case, the unsupervised optimization of the summary generator is achieved. The manual criteria are also adopted in [49], and the reinforcement learning strategy is adopted to reward the summary generator.

C. Multimodal Video Analysis

Recently, researchers have realized the multimodal characteristics of video data and developed many interesting tasks for video analysis. A spatial–temporal graph is proposed for the video captioning task in [50], which can translate the visual modality into the text modality. A transformer with instance attention is conducted in [20] to jointly use audiovisual information for event localization. A two-stream architecture is developed based on the transformer to fuse text and visual information for video retrieval [51]. Furthermore, there are also many works developed for multimodal representation learning [52], cross-modal consistency learning [53], multimodal generation tasks [18], etc. All of them inspire us to develop a multimodal approach to video summarization.

III. PROPOSED APPROACH

In this brief, we propose an AVRN to integrate the audio and visual information for the video summarization task. As depicted in Fig. 1, the proposed AVRN is composed of three parts, where the two-stream LSTM encodes the audio and visual features sequentially, the fusion LSTM fuses the audiovisual multimodal information dynamically, and the self-attention module captures the video information globally. In the following, they are elaborated successively.


Fig. 1. Architecture of the proposed AVRN, which is composed of the two-stream LSTM, the audiovisual fusion LSTM, and the self-attention video encoder.
The last row displays the log-mel spectrograms of audio data.
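The log-mel spectrograms shown in the last row of Fig. 1 are the standard time–frequency representation used for the audio stream. As a minimal illustration (not taken from the authors' code), the sketch below computes such a spectrogram with librosa; the sample rate, FFT size, hop length, and number of mel bands are illustrative assumptions rather than the paper's settings.

```python
# Minimal sketch: compute a log-mel spectrogram for one audio track.
# Assumptions: librosa is available; the sample rate, FFT size, hop length,
# and 64 mel bands are illustrative choices, not the paper's settings.
import librosa
import numpy as np

def log_mel_spectrogram(wav_path, sr=16000, n_fft=400, hop_length=160, n_mels=64):
    """Load audio and return a log-scaled mel spectrogram (n_mels x frames)."""
    waveform, _ = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    # Convert power to decibels, which is the scale visualized in Fig. 1.
    return librosa.power_to_db(mel, ref=np.max)
```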

A. Two-Stream LSTM

LSTM is a typical recurrent neural network developed to deal with sequence data, and videos are naturally temporal sequence data. To encode the audio and visual information, the video data are separated into the audio signal and the visual frames, and a two-stream LSTM structure is conducted.

Specifically, given the frame sequence, the visual features are first extracted as X^v = {x_1^v, x_2^v, ..., x_n^v}, where n stands for the length of the frame sequence. Then, to capture the temporal dependencies among frames, a bidirectional LSTM is used as one stream to process X^v sequentially, which is formulated as

    h_t^v = BiLSTM(x_t^v, h_{t-1}^v)                                   (1)

where h_t^v is the hidden state of the bidirectional LSTM. Practically, BiLSTM is conducted by combining two LSTMs. They capture the temporal dependency among frames from the forward and reverse directions, respectively, and encode the dependency into the hidden state vector h_t^v at each step.

Similarly, to exploit the temporal dependency of the audio features X^a = {x_1^a, x_2^a, ..., x_n^a}, another stream of bidirectional LSTM is conducted, i.e.,

    h_t^a = BiLSTM(x_t^a, h_{t-1}^a).                                  (2)

In the two-stream LSTM, the temporal dependencies of the audio and visual modalities are encoded into h_t^a and h_t^v separately. They are not yet fused, and the difference and consistency between them are not considered, which may cause interference to the video summarization task. To address this problem, an audiovisual fusion LSTM is further developed.

B. Audiovisual Fusion LSTM

In this part, an audiovisual fusion LSTM is conducted to explore the shared latent space between the audio and visual modalities, so as to exploit the consistency and reduce the difference. Different from the two-stream structure, the audiovisual fusion LSTM takes the combined audio and visual information as input and fuses them sequentially.

Specifically, an adaptive gating mechanism is adopted in the audiovisual fusion LSTM, which is formulated as

    c_t = Sigmoid(W_a h_t^a + W_v h_t^v + b)                           (3)

where W_a, W_v, and b are the training parameters. c_t is the gate to control the information flow of the different modalities, which is operated as follows:

    x_t^{av} = c_t h_t^a + (1 − c_t) h_t^v.                            (4)

Then, the fused audio and visual information x_t^{av} is input to the audiovisual fusion LSTM to encode it sequentially, i.e.,

    h_t^{av} = BiLSTM(x_t^{av}, h_{t-1}^{av}).                         (5)

Practically, h_t^{av} captures the temporal dependencies of the audiovisual information at the t-th step, which is essential for summary generation.

C. Self-Attention Video Encoder

Although videos are naturally sequential data, multiple hops of the storyline usually occur in the video stream, and the activities recorded in each shot vary largely. On some occasions, there are no obvious temporal dependencies among consecutive shots, such as in videos with editing and montage. Therefore, the sequence networks are not enough for modeling the complex video structure and content. In this case, we develop a global encoder to cooperate with the sequence model. It is achieved by a self-attention module

    V_t = Σ_{i=1}^{n} α_{ti} x_i^{av}                                  (6)

where V_t is the encoded global dependency among frames at the t-th step and α_{ti} is the attention weight, which is computed by

    l_{ti} = ⟨W_1 x_i^{av}, W_2 x_t^{av}⟩                              (7)
    α_{ti} = exp(l_{ti}) / Σ_{i=1}^{n} exp(l_{ti})                     (8)

where W_1 and W_2 are the training weights and ⟨·, ·⟩ denotes the inner-product operation. l_{ti} captures the dependency between x_i^{av} and x_t^{av}, and (8) is used to normalize the attention weights.
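To make the data flow of (1)–(8) concrete, the following PyTorch sketch wires the two-stream bidirectional LSTMs, the adaptive gate, the fusion LSTM, and the self-attention encoder together. It is a minimal re-implementation written against the description above, not the authors' released code; the feature dimensions, the use of nn.LSTM with batch_first=True, and the module and variable names are our own assumptions.

```python
# Minimal sketch of the AVRN encoders described by Eqs. (1)-(8).
# Assumptions: 1024-dim visual and 128-dim audio features, hidden size 256;
# all names are ours, not from the released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioVisualEncoder(nn.Module):
    def __init__(self, vis_dim=1024, aud_dim=128, hidden=256):
        super().__init__()
        # Two-stream bidirectional LSTMs, Eqs. (1) and (2).
        self.vis_lstm = nn.LSTM(vis_dim, hidden, batch_first=True, bidirectional=True)
        self.aud_lstm = nn.LSTM(aud_dim, hidden, batch_first=True, bidirectional=True)
        # Adaptive gate of Eq. (3); both streams produce 2*hidden features.
        self.gate = nn.Linear(4 * hidden, 2 * hidden)
        # Audiovisual fusion LSTM of Eq. (5).
        self.fusion_lstm = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        # Projections W1 and W2 used by the self-attention scores in Eq. (7).
        self.w1 = nn.Linear(2 * hidden, 2 * hidden, bias=False)
        self.w2 = nn.Linear(2 * hidden, 2 * hidden, bias=False)

    def forward(self, x_vis, x_aud):
        # x_vis: (B, n, vis_dim), x_aud: (B, n, aud_dim), aligned frame by frame.
        h_v, _ = self.vis_lstm(x_vis)            # (B, n, 2*hidden), Eq. (1)
        h_a, _ = self.aud_lstm(x_aud)            # (B, n, 2*hidden), Eq. (2)

        # Adaptive gating, Eqs. (3)-(4): c weighs the audio stream against vision.
        c = torch.sigmoid(self.gate(torch.cat([h_a, h_v], dim=-1)))
        x_av = c * h_a + (1.0 - c) * h_v         # fused input x_t^{av}

        # Sequential fusion, Eq. (5).
        h_av, _ = self.fusion_lstm(x_av)         # (B, n, 2*hidden)

        # Self-attention video encoder, Eqs. (6)-(8), computed for all t at once.
        scores = torch.bmm(self.w2(x_av), self.w1(x_av).transpose(1, 2))  # l_{ti}
        alpha = F.softmax(scores, dim=-1)        # attention weights alpha_{ti}
        v_glob = torch.bmm(alpha, x_av)          # global representation V_t

        return x_av, h_av, v_glob
```

Computing the attention scores as one batched matrix product is equivalent to evaluating (7) and (8) separately for every step t; it simply keeps the sketch short.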


D. Summary Generation

Given the computed temporal and global dependencies of the audiovisual information, the importance of each frame to the video content is computed by

    p_t = Sigmoid(W_p [x_t^{av}, h_t^{av}, V_t] + b_p)                 (9)

where W_p and b_p are the training weight and bias, respectively, and x_t^{av}, h_t^{av}, and V_t denote the fused audiovisual information, the temporal dependency, and the global video information, respectively. They are integrated to predict the frame-level importance. Then, the shot-level importance is obtained by averaging the importance scores of the frames in each shot, i.e.,

    p_i^s = Σ_{t=s_{i-1}+1}^{s_i} p_t                                  (10)

where S = [0, s_1, s_2, ..., s_{m−1}, n] are the shot boundaries of the m shots in the video. Following the existing protocols, they are computed by the kernel temporal segmentation (KTS) method [9]. In the end, the video summary is generated from the shots with higher scores.

Practically, the proposed AVRN is trained end to end under the supervision of human-created summaries. The mean square error (MSE) is used for optimization, i.e.,

    loss = (1/n) ‖p − g‖_2^2                                           (11)

where p is the predicted frame-level importance vector and g is the ground truth annotated by human beings.

The proposed AVRN is implemented in Python 3.7 with the deep learning platform PyTorch 1.6. The dimensionalities of the hidden states in both the two-stream LSTM and the audiovisual fusion LSTM are fixed as 256. The optimizer is Adam with learning rate 1e−5, decay rate 0.1, and decay step 30. Generally, AVRN reaches convergence in less than 60 epochs.
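As a companion to the encoder sketch given earlier, the fragment below illustrates the frame-importance head of (9), the shot-level aggregation of (10), and the MSE objective of (11) with the reported optimizer settings (Adam, learning rate 1e−5, step decay 0.1 every 30 epochs). Only these hyperparameter values come from the text; the data loader, the KTS shot boundaries, and all names are placeholders introduced for illustration, and the block assumes the AudioVisualEncoder class from the previous sketch.

```python
# Minimal sketch of the summary head and training step, Eqs. (9)-(11).
# Assumptions: `AudioVisualEncoder` is the earlier sketch; `loader` yields
# (visual_feats, audio_feats, frame_gt) tensors; `boundaries` are KTS shot
# boundary indices [0, s1, ..., n]. Names are illustrative placeholders.
import torch
import torch.nn as nn


class AVRN(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.encoder = AudioVisualEncoder(hidden=hidden)
        # Eq. (9): predict frame importance from [x_av, h_av, V_t].
        self.head = nn.Linear(3 * 2 * hidden, 1)

    def forward(self, x_vis, x_aud):
        x_av, h_av, v_glob = self.encoder(x_vis, x_aud)
        p = torch.sigmoid(self.head(torch.cat([x_av, h_av, v_glob], dim=-1)))
        return p.squeeze(-1)                      # (B, n) frame-level scores


def shot_scores(frame_scores, boundaries):
    """Eq. (10): average the frame scores inside each KTS shot."""
    return [frame_scores[lo:hi].mean().item()
            for lo, hi in zip(boundaries[:-1], boundaries[1:])]


model = AVRN()
criterion = nn.MSELoss()                          # Eq. (11)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

def train_one_epoch(loader):
    model.train()
    for x_vis, x_aud, frame_gt in loader:
        pred = model(x_vis, x_aud)
        loss = criterion(pred, frame_gt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```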


IV. EXPERIMENTS

The experiments are carried out on two benchmark datasets, SumMe [25] and TVsum [54]. In the following subsections, ablation studies are conducted, and several state-of-the-art approaches are compared.

A. Setup

1) Dataset Introduction: SumMe and TVsum are popular video summarization datasets. The SumMe dataset consists of 25 videos with diverse topics, including cooking, traveling, sports, etc. Most of them are raw videos without human editing. For each video, 15–18 users are employed to select key-shots and generate the video summaries. The TVsum dataset is composed of 50 videos in the open domain. They are edited videos about news, cooking, pets, etc. Each of them contains 20 annotations of shot-level importance scores, where the shots are generated by segmenting the video evenly into 2-s clips.

Following the existing protocols, the videos in SumMe and TVsum are separated into 80% for training and the remaining 20% for testing. For simplicity, the validation is operated on the training set. In the training process, the human-created summaries in SumMe and TVsum are converted to frame-level importance scores, which are taken as the supervision information for AVRN optimization. Practically, the summarization performance varies among different videos. To make the results more convincing, random training/test splits are carried out five times, and the average performance is taken as the final result. Besides, the OVP [28] and YouTube [28] datasets are used in the experimental part to augment the training set. They are composed of 89 videos with human-created summaries.

2) Feature Extraction: For visual information, GoogLeNet [55] pretrained on ImageNet¹ is adopted to extract visual features for video frames, and the 1024-dim feature vector of the pool5 layer is taken as the frame representation. Particularly, considering that neighboring frames are quite similar to each other, the features are extracted for every 15 frames (about 2 fps).

For audio information, VGGish [56] pretrained on AudioSet² is adopted for audio feature extraction. Specifically, the audio data are temporally separated into segments with a duration of 1 s. Two neighboring segments share an overlap of half a second, so that the length of the audio feature sequence can match that of the visual features. Note that there are two videos in SumMe without audio data, "Scuba" and "St Maarten Landing." Their audio features are padded with zeros.

¹http://www.image-net.org/    ²https://research.google.com/audioset/index.html

3) Performance Evaluation: The summary quality is evaluated by measuring the temporal consistency between the predicted summary and the human-created summary. Precision (P), recall (R), and F-measure (F) are widely adopted metrics. They are defined as

    P = #(summary_p ∩ summary_h) / #summary_p
    R = #(summary_p ∩ summary_h) / #summary_h
    F = 2 · P · R / (P + R)

where #summary_p and #summary_h denote the number of frames in the predicted summary and the human-created summary, respectively, and #(summary_p ∩ summary_h) stands for their overlap. Considering that each video contains multiple human-created summaries, pairwise comparisons are conducted. Following the existing protocols, the maximum scores are taken as the results on SumMe, and the average scores are taken as the results on the TVsum dataset.

To evaluate the performance comprehensively, rank-based evaluation metrics are also adopted in this work. They are Kendall's τ and Spearman's ρ [57], which measure the correlation coefficients between the generated importance scores and the annotated importance scores. The pairwise evaluations are conducted among the multiple annotated importance scores, and the average coefficients are taken as the final results on both the SumMe and TVsum datasets.

B. Ablation Studies

The proposed AVRN is composed of three parts, including the two-stream LSTM (TS-LSTM), the audiovisual fusion LSTM (AVF-LSTM), and the self-attention video encoder (SAVE). To verify the effectiveness of each part, ablation studies are conducted, and several baselines are compared, including the following.
1) Audio LSTM: A bidirectional LSTM is used to encode the audio feature and generate the video summary, while the visual feature is ignored.
2) Visual LSTM: A bidirectional LSTM is used to encode the visual feature and generate the video summary, while the audio feature is ignored.
3) TS-LSTM: Only the two-stream LSTM is conducted to capture the temporal dependency among audio and visual features and generate the video summary.
4) AVF-LSTM: The audio and visual features are fused directly without exploiting their temporal dependencies.
5) AVRN(w/o SAVE): The global video information is not considered when predicting the video summary.
6) AVRN(single): The bidirectional LSTMs in TS-LSTM and AVF-LSTM are replaced with single (forward) LSTMs.

TABLE I: Results of Ablation Studies for AVRN

Table I exhibits the results of the ablation studies. The audio LSTM can obtain satisfactory results with only the audio feature as input, which indicates that the audio modality can indeed provide useful information for video summarization. TS-LSTM outperforms the visual LSTM and the audio LSTM. It proves the necessity of integrating the audio and visual information together to boost the performance.
AVF-LSTM takes the audio and visual features as input and fuses them without using the TS-LSTM to capture their temporal dependency. AVRN(w/o SAVE) equals the combination of TS-LSTM and AVF-LSTM, and it outperforms both of them. This verifies the importance of the temporal dependency among audio and visual features and the necessity of fusing the audio and visual information to achieve the mutual benefit between them.

The difference between the proposed AVRN and the baseline AVRN(w/o SAVE) lies in that the self-attention video encoder is not included in AVRN(w/o SAVE). It means that only the temporal dependency is captured in AVRN(w/o SAVE), while the global video information is ignored. The better performance of the proposed AVRN indicates that the global video information is also very important to the summary quality. Besides, AVRN(single) gets worse results than AVRN, which uses bidirectional LSTMs to encode and fuse the audiovisual information. It explains the rationality of capturing the bidirectional temporal dependencies jointly for video summarization. Overall, the results in Table I have demonstrated the effectiveness of each part of the proposed AVRN, including the two-stream LSTM, the audiovisual fusion LSTM, and the self-attention video encoder.

C. Comparison With State-of-the-Arts

TABLE II: Results of AVRN and Traditional Approaches

Table II presents the results of AVRN and traditional approaches. The compared approaches fall into various categories. k-medoids, Delaunay, and VSUMM are the clustering-based approaches. They are developed based on k-medoids, Delaunay clustering, and k-means, respectively. Particularly, considering that the video frames vary smoothly, the clusters in VSUMM are initialized by segmenting the video temporally, so that better results are achieved. It indicates that the domain knowledge of video is important for the summarization task. SALF, LiveLight, and Block Sparse are the dictionary-learning-based approaches. SALF and Block Sparse generate the summary by sparsely self-reconstructing the video with shots. LiveLight conducts an online learning strategy to incrementally select those shots that cannot be represented by the current key-shot set. They achieve better performance than the clustering-based approaches, mainly because they can capture the dependency among frames. CSUV and LSMO are designed based on manual criteria. CSUV selects key-shots according to their interestingness, measured by factors such as esthetics, landmarks, faces, persons, and objects. Inspired by it, LSMO further constructs interestingness, representativeness, and uniformity models. CSUV and LSMO are supervised approaches, so they can outperform the unsupervised clustering and dictionary-learning-based approaches.

The proposed AVRN maintains the advantages of the traditional approaches. It can capture the temporal and global dependencies among frames by the LSTM and the global video encoder. It can also be optimized under the supervision of human-created summaries. Moreover, AVRN uses audio and visual features jointly to predict the summary, which means more information is exploited than in the traditional approaches that extract only visual features. Therefore, it outperforms the traditional approaches significantly.

TABLE III: Results of AVRN and Deep Learning Approaches

Table III presents the results of AVRN and other deep learning approaches, to show the superiority of AVRN. vsLSTM first uses a bidirectional LSTM to capture the temporal dependency among frames and summarize the video. dppLSTM extends vsLSTM using the DPP model to guarantee the diversity of key-shots. However, they are plain sequence models without considering the global video information, so they perform much worse than the proposed AVRN. vsLSTM-att and dppLSTM-att are modified by adding attention models to encode the global video information. By exploiting both the temporal and global video information, vsLSTM-att and dppLSTM-att perform much better than the original ones. A-AVS is also an attention-based LSTM model, which performs comparably with vsLSTM-att and dppLSTM-att. It demonstrates the necessity of integrating a global encoder into the sequence model for the video summarization task, and that is why the self-attention video encoder is developed in AVRN. Furthermore, the even better performance of AVRN has verified the superiority of using both the audio and visual features to summarize the video.

SUM-GAN proposes to use the generative adversarial network (GAN) to generate the video summary discriminatively and uses the discriminator to achieve unsupervised learning. SUM-GANsup is its supervised version, which outperforms SUM-GAN considerably. SASUM and SASUMsup take the video captions as auxiliary information to boost the performance; this is also a multimodal video summarization approach. However, the video captions require much human labor, which limits its applicability. Fortunately, the audio modality matches the vision modality naturally in videos. Better results are obtained by AVRN, which shows the advantage of audiovisual fusion over text-visual fusion in the video summarization task.


Fig. 2. Example summaries generated by AVRN. The above and below samples are from SumMe and TVsum, respectively. The blue curves and red curves
below each sample are frame-level importance scores annotated by human and predicted by AVRN. The red histograms above each sample indicate the
generated summary.
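For readers who want to reproduce plots in the style of Fig. 2, the following matplotlib sketch overlays a human-annotated importance curve, a predicted curve, and the selected-shot indicator for one video. The array names and the binary selection mask are illustrative assumptions, not artifacts of the authors' code.

```python
# Minimal sketch: plot annotated vs. predicted importance curves and the
# generated summary, in the spirit of Fig. 2. All inputs are placeholders.
import numpy as np
import matplotlib.pyplot as plt

def plot_summary(gt_scores, pred_scores, selected, title="AVRN summary"):
    """gt_scores, pred_scores: (n,) arrays in [0, 1]; selected: (n,) 0/1 mask."""
    t = np.arange(len(gt_scores))
    fig, ax = plt.subplots(figsize=(10, 3))
    ax.fill_between(t, 0, selected, step="mid", color="red", alpha=0.3,
                    label="generated summary")
    ax.plot(t, gt_scores, color="blue", label="human annotation")
    ax.plot(t, pred_scores, color="red", label="AVRN prediction")
    ax.set_xlabel("frame index")
    ax.set_ylabel("importance")
    ax.set_title(title)
    ax.legend(loc="upper right")
    fig.tight_layout()
    return fig
```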

DR-DSN adopts the manual criteria, i.e., representativeness and diversity, to guide the summary generator. The results indicate that the manual criteria used in the traditional approaches can also promote the performance of deep learning approaches.

H-RNN and HSA-RNN are hierarchical structures of LSTMs. They outperform the approaches with plain LSTM structures, such as vsLSTM and dppLSTM, mainly due to the better non-linear fitting ability of the hierarchical structure. Our AVRN also follows the hierarchical structure of LSTM, where the first layer is the two-stream LSTM and the second layer is the audiovisual fusion LSTM. They encode and fuse the audiovisual information hierarchically, so better performance is achieved than with plain LSTMs. Furthermore, AVRN also surpasses H-RNN and HSA-RNN, mainly because AVRN extracts multimodal features for video summarization, while H-RNN and HSA-RNN extract only visual features. AVRN can better understand the video content and structure with audiovisual information, so better results are obtained than with H-RNN and HSA-RNN.

In Table IV, different settings of the training data are conducted to further analyze the results on the SumMe and TVsum datasets. They are canonical, augmented, and transfer, which are described as follows.
1) Canonical: The SumMe and TVsum datasets are trained and tested individually. The training/test splits are fixed as 80% and 20%.
2) Augmented: When training on the SumMe dataset, the videos in TVsum, OVP, and YouTube are used to augment the training set. Similar strategies are also conducted on the TVsum dataset.
3) Transfer: The videos in TVsum, OVP, and YouTube are used to train for the SumMe dataset. Similarly, the videos in SumMe, OVP, and YouTube are used to train for the TVsum dataset.

From Table IV, we can see that most of the approaches get better results under the augmented setting. This phenomenon indicates that the training data are not enough for most approaches, which leads to the overfitting problem. One effective way to address this problem is to provide more information for the summary generator. DR-DSN adopts the reinforcement learning scheme to exploit the summary properties to reward the summary generator, so better results are obtained than with the plain LSTMs, including vsLSTM and dppLSTM. VASNet also conducts an attention model to select key-shots according to the encoded video information.³ re-SEQ2SEQ develops a retrospective encoder to keep the consistency of the semantics between video and summary. It also develops a single LSTM as an extra component to segment videos into shots. In contrast, the proposed AVRN follows a much more compact end-to-end architecture. Overall, AVRN performs the best on the SumMe dataset and nearly the best on the TVsum dataset. It has verified the superiority of jointly using the audio and visual information in video summarization.

³The results of VASNet are produced by modifying the released source code to the same experimental setting as in this brief.

D. Evaluation on Rank-Based Metrics

Precision, recall, and F-measure quantify the summary quality by measuring the temporal consistency between the generated summary and the human-created summary. They neglect the more fine-grained human preference over video shots. To address this problem, the rank-based metrics, Kendall's τ and Spearman's ρ, are used in this part to provide a comprehensive evaluation of summary quality. They measure the correlation between the predicted probability curve and the human-annotated importance curve.

TABLE IV: Results of AVRN and Compared Approaches in Different Training Data Organizations

TABLE V: Results of Rank-Based Evaluation (Kendall's τ and Spearman's ρ)

Table V presents the results evaluated on the rank-based metrics. We can see that the result of the summary generated by random selection is zero, which means there is no correlation between a randomly generated summary and the human annotation. Besides, considering that each video contains multiple human annotations, the evaluation is also conducted among them via the leave-one-out strategy. They get the highest scores in Table V, which shows that there is considerable consistency among human annotations. The results meet our expectations.

Some typical RNN-based approaches are compared in Table V. dppLSTM is developed based on a plain bidirectional LSTM. DR-DSN conducts representativeness and diversity rewards for the summary generator. HSA-RNN constructs the hierarchical structure of LSTM. The proposed AVRN surpasses most of them on Kendall's τ and Spearman's ρ. Besides, Fig. 2 displays some exemplar summaries generated by the proposed AVRN. It can be observed that AVRN is able to accurately predict the importance scores and effectively summarize the video. The results have demonstrated the superiority of AVRN: 1) the fusion of audio and visual features can provide more information for understanding the video content and structure, so as to benefit the video summarization process; 2) the hierarchical structure of LSTM can enhance the learning ability and further promote the performance; and 3) the temporal and global dependencies are both very important to the summarization task.

V. CONCLUSION

In this brief, we propose to introduce the audio information into the video summarization task and develop an AVRN to achieve the fusion of audiovisual features and boost the summarization performance. Specifically, AVRN contains three parts, including the two-stream LSTM, the audiovisual fusion LSTM, and the self-attention video encoder. The two-stream LSTM captures the temporal dependencies of the audio features and the visual features, respectively. The audiovisual fusion LSTM exploits the latent consistency between the audio and visual information. The self-attention video encoder captures the global dependency in the whole video stream. The experimental results on SumMe and TVsum have demonstrated that: 1) the audiovisual multimodal feature can provide more information for the summarization task than the single visual feature; 2) the hierarchical structure can enhance the learning ability of LSTM; and 3) the fusion of audio and visual features and the integration of temporal and global dependencies are both necessary for the video summarization task.

REFERENCES

[1] B. Zhao, X. Li, and X. Lu, “Property-constrained dual learning for video summarization,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 10, pp. 3989–4000, Oct. 2020.
[2] L. Jin, Z. Li, and J. Tang, “Deep semantic multimodal hashing network for scalable image-text and video-text retrievals,” IEEE Trans. Neural Netw. Learn. Syst., early access, Jun. 5, 2020, doi: 10.1109/TNNLS.2020.2997020.
[3] X. Li, M. Chen, F. Nie, and Q. Wang, “A multiview-based parameter free framework for group detection,” in Proc. 31st AAAI Conf. Artif. Intell., 2017, pp. 4147–4153.
[4] Y. Chen, S. Wang, W. Zhang, and Q. Huang, “Less is more: Picking informative frames for video captioning,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 367–384.
[5] H. Zhang, L. Zhang, X. Qui, H. Li, P. H. Torr, and P. Koniusz, “Few-shot action recognition with permutation-invariant attention,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 525–542.
[6] W. Zhu, J. Lu, J. Li, and J. Zhou, “DSNet: A flexible detect-to-summarize network for video summarization,” IEEE Trans. Image Process., vol. 30, pp. 948–962, 2021.
[7] E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, and I. Patras, “AC-SUM-GAN: Connecting actor-critic and generative adversarial networks for unsupervised video summarization,” IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 8, pp. 3278–3292, Aug. 2021.
[8] B. Zhao, X. Li, and X. Lu, “TTH-RNN: Tensor-train hierarchical recurrent neural network for video summarization,” IEEE Trans. Ind. Electron., vol. 68, no. 4, pp. 3629–3637, Apr. 2021.
[9] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid, “Category-specific video summarization,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 540–555.
[10] R. Anirudh, A. Masroor, and P. Turaga, “Diversity promoting online sampling for streaming video summarization,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2016, pp. 3329–3333.
[11] Z. Lu and K. Grauman, “Story-driven summarization for egocentric video,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2714–2721.
[12] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[13] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Video summarization with long short-term memory,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 766–782.
[14] B. Zhao, X. Li, and X. Lu, “Hierarchical recurrent neural network for video summarization,” in Proc. 25th ACM Int. Conf. Multimedia, Oct. 2017, pp. 863–871.

[15] B. Zhao, X. Li, and X. Lu, “HSA-RNN: Hierarchical structure-adaptive RNN for video summarization,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7405–7414.
[16] L. L. Casas and E. Koblents, “Video summarization with LSTM and deep attention models,” in Proc. Int. Conf. MultiMedia Modeling, 2019, pp. 67–79.
[17] J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, and P. Remagnino, “Summarizing videos with attention,” in Proc. Asian Conf. Comput. Vis., 2018, pp. 39–54.
[18] H. Zhu, M. Luo, R. Wang, A. Zheng, and R. He, “Deep audio-visual learning: A survey,” 2020, arXiv:2001.04758. [Online]. Available: http://arxiv.org/abs/2001.04758
[19] T. Afouras, A. Owens, J. S. Chung, and A. Zisserman, “Self-supervised learning of audio-visual objects from video,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 208–224.
[20] Y.-B. Lin and Y.-C. F. Wang, “Audiovisual transformer with instance attention for audio-visual event localization,” in Proc. Asian Conf. Comput. Vis., 2020, pp. 274–290.
[21] X. Li, M. Chen, F. Nie, and Q. Wang, “Locality adaptive discriminant analysis,” in Proc. 26th Int. Joint Conf. Artif. Intell., Aug. 2017, pp. 2201–2207.
[22] Y. Wang, W. Huang, F. Sun, T. Xu, Y. Rong, and J. Huang, “Deep multimodal fusion by channel exchanging,” in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 1–11.
[23] X. Li, B. Zhao, and X. Lu, “Key frame extraction in the summary space,” IEEE Trans. Cybern., vol. 48, no. 6, pp. 1923–1934, Jun. 2018.
[24] X. Li, B. Zhao, and X. Lu, “A general framework for edited video and raw video summarization,” IEEE Trans. Image Process., vol. 26, no. 8, pp. 3652–3664, Aug. 2017.
[25] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool, “Creating summaries from user videos,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 505–520.
[26] Y. Cong, J. Liu, G. Sun, Q. You, Y. Li, and J. Luo, “Adaptive greedy dictionary selection for web media summarization,” IEEE Trans. Image Process., vol. 26, no. 1, pp. 185–195, Jan. 2017.
[27] Y. Zhuang, Y. Rui, T. S. Huang, and S. Mehrotra, “Adaptive key frame extraction using unsupervised clustering,” in Proc. Int. Conf. Image Process., 1998, pp. 866–870.
[28] S. E. F. de Avila, A. P. B. Lopes, A. da Luz, Jr., and A. de A. Araújo, “VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method,” Pattern Recognit. Lett., vol. 32, no. 1, pp. 56–68, Jan. 2011.
[29] Y. Hadi, F. Essannouni, and R. O. H. Thami, “Video summarization by k-medoid clustering,” in Proc. ACM Symp. Appl. Comput. (SAC), 2006, pp. 1400–1401.
[30] E. Elhamifar, G. Sapiro, and R. Vidal, “See all by looking at a few: Sparse modeling for finding representative objects,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 1600–1607.
[31] S. Mei, G. Guan, Z. Wang, S. Wan, M. He, and D. D. Feng, “Video summarization via minimum sparse reconstruction,” Pattern Recognit., vol. 48, no. 2, pp. 522–533, Feb. 2015.
[32] S. Mei, G. Guan, Z. Wang, M. He, X.-S. Hua, and D. D. Feng, “L2,0 constrained sparse dictionary selection for video summarization,” in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2014, pp. 1–6.
[33] Y. Cong, J. Yuan, and J. Luo, “Towards scalable summarization of consumer videos via sparse dictionary selection,” IEEE Trans. Multimedia, vol. 14, no. 1, pp. 66–75, Feb. 2012.
[34] M. Ma, S. Mei, S. Wan, J. Hou, Z. Wang, and D. D. Feng, “Video summarization via block sparse dictionary selection,” Neurocomputing, vol. 378, pp. 197–209, Feb. 2019.
[35] Y.-F. Ma, X.-S. Hua, L. Lu, and H.-J. Zhang, “A generic framework of user attention model and its application in video summarization,” IEEE Trans. Multimedia, vol. 7, no. 5, pp. 907–919, Oct. 2005.
[36] J. Ren, J. Jiang, and Y. Feng, “Activity-driven content adaptation for effective video summarization,” J. Vis. Commun. Image Represent., vol. 21, no. 8, pp. 930–938, Nov. 2010.
[37] N. Ejaz, T. B. Tariq, and S. W. Baik, “Adaptive key frame extraction for video summarization using an aggregation mechanism,” J. Vis. Commun. Image Represent., vol. 23, no. 7, pp. 1031–1040, Oct. 2012.
[38] B. Gong, W. Chao, K. Grauman, and F. Sha, “Diverse sequential subset selection for supervised video summarization,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2069–2077.
[39] Y. J. Lee and K. Grauman, “Predicting important objects for egocentric video summarization,” Int. J. Comput. Vis., vol. 114, no. 1, pp. 38–55, 2015.
[40] Y. Jae Lee, J. Ghosh, and K. Grauman, “Discovering important people and objects for egocentric video summarization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 1346–1353.
[41] S. Tschiatschek, R. Iyer, H. Wei, and J. Bilmes, “Learning mixtures of submodular functions for image collection summarization,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1413–1421.
[42] Z. Ji, F. Jiao, Y. Pang, and L. Shao, “Deep attentive and semantic preserving video summarization,” Neurocomputing, vol. 405, pp. 200–207, Sep. 2020.
[43] J. Wang, W. Wang, Z. Wang, L. Wang, D. Feng, and T. Tan, “Stacked memory network for video summarization,” in Proc. 27th ACM Int. Conf. Multimedia, Oct. 2019, pp. 836–844.
[44] M. Rochan, L. Ye, and Y. Wang, “Video summarization using fully convolutional sequence networks,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 347–363.
[45] J. Gao, X. Yang, Y. Zhang, and C. Xu, “Unsupervised video summarization via relation-aware assignment learning,” IEEE Trans. Multimedia, vol. 23, pp. 3203–3214, 2021.
[46] J. Wang et al., “Query twice: Dual mixture attention meta learning for video summarization,” in Proc. 28th ACM Int. Conf. Multimedia, Oct. 2020, pp. 4023–4031.
[47] B. Mahasseni, M. Lam, and S. Todorovic, “Unsupervised video summarization with adversarial LSTM networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2982–2991.
[48] K. Zhang, K. Grauman, and F. Sha, “Retrospective encoders for video summarization,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 383–399.
[49] K. Zhou, Y. Qiao, and T. Xiang, “Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 7582–7589.
[50] B. Pan et al., “Spatio-temporal graph for video captioning with knowledge distillation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, p. 10.
[51] M. Dzabraev, M. Kalashnikov, S. Komkov, and A. Petiushko, “MDMMT: Multidomain multimodal transformer for video retrieval,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2021, pp. 3354–3363.
[52] S. Mai, H. Hu, and S. Xing, “Modality to modality translation: An adversarial representation learning and graph fusion network for multimodal fusion,” in Proc. AAAI Conf. Artif. Intell., 2020, pp. 164–172.
[53] D. Xie, C. Deng, C. Li, X. Liu, and D. Tao, “Multi-task consistency-preserving adversarial hashing for cross-modal retrieval,” IEEE Trans. Image Process., vol. 29, pp. 3626–3637, 2020.
[54] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, “TVSum: Summarizing web videos using titles,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 5179–5187.
[55] C. Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1–9.
[56] S. Hershey et al., “CNN architectures for large-scale audio classification,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2017, pp. 131–135.
[57] M. Otani, Y. Nakashima, E. Rahtu, and J. Heikkila, “Rethinking the evaluation of video summaries,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 7596–7604.
[58] P. Mundur, Y. Rao, and Y. Yesha, “Keyframe-based video summarization using Delaunay clustering,” Int. J. Digit. Libraries, vol. 6, no. 2, pp. 219–232, 2006.
[59] B. Zhao and E. P. Xing, “Quasi real-time summarization for consumer videos,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 2513–2520.
[60] M. Gygli, H. Grabner, and L. Van Gool, “Video summarization by learning submodular mixtures of objectives,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3090–3098.
[61] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Summary transfer: Exemplar-based subset selection for video summarization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1059–1067.
[62] H. Wei, B. Ni, Y. Yan, H. Yu, X. Yang, and C. Yao, “Video summarization via semantic attended networks,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 216–223.
[63] Z. Ji, K. Xiong, Y. Pang, and X. Li, “Video summarization with attention-based encoder-decoder networks,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 6, pp. 1709–1717, Jun. 2020.
[64] J. Lei, Q. Luan, X. Song, X. Liu, D. Tao, and M. Song, “Action parsing-driven video summarization based on reinforcement learning,” IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 7, pp. 2126–2137, Jul. 2019.
[65] Y. Chen, L. Tao, X. Wang, and T. Yamasaki, “Weakly supervised video summarization by hierarchical reinforcement learning,” in Proc. ACM Multimedia Asia, 2019, pp. 1–6.
