0% found this document useful (0 votes)
35 views

Attention-Guided Multi-Granularity Fusion Model For Video Summarization

Uploaded by

hanochliu
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
35 views

Attention-Guided Multi-Granularity Fusion Model For Video Summarization

Uploaded by

hanochliu
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

Expert Systems With Applications 249 (2024) 123568

Contents lists available at ScienceDirect

Expert Systems With Applications


journal homepage: www.elsevier.com/locate/eswa

Attention-guided multi-granularity fusion model for video summarization


Yunzuo Zhang ∗, Yameng Liu, Cunyu Wu
School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang 050043, China

ARTICLE INFO ABSTRACT

Keywords: Video summarization has attracted extensive attention benefiting from its valuable capability to facilitate
Video summarization video browsing. While achieving notable improvement, existing methods still fail to sufficiently and effectively
Multi-granularity model contextual information within videos, hindering the summarization performance owing to a deficiency
Content-aware enhancement
in powerful contextual representations. To address this limitation, we present a novel Attention-Guided Multi-
Scale-adaptive fusion
Granularity Fusion Model (AMFM), which allows for optimizing the modeling process from the context
Self-attention
Temporal convolution
capturing and fusion perspective. AMFM comprises three dominant components including a content-aware
enhancement (CAE) module, a multi-granularity encoder (MGE), and a scale-adaptive fusion (SAF) module.
More specifically, CAE dynamically enhances pre-trained visual features by learning the potential visual
relationship across frame-level and video-level embeddings. Subsequently, coarse-grained and fine-grained con-
textual information is simultaneously modeled in the same representation space by MGE with the combination
of self-attention and temporal convolution scheme. Furthermore, the multi-granularity representations with a
significant difference in the semantic scale are adaptively fused by SAF. Our method can precisely pinpoint key
segments by effectively modeling and processing rich temporal representations. Extensive comparisons with
state-of-the-art methods on standard datasets demonstrate the effectiveness of the proposed method, and the
ablation studies further verify the positive impact of each module in our model.

1. Introduction & Sohn, 2020; Xie et al., 2022; Zhu, Lu, Li, & Zhou, 2021). The
majority of traditional methods (De Avila, Lopes, da Luz Jr, & de
Recently, the proliferation of mobile devices has led to an expo- Albuquerque Araújo, 2011; Gygli, Grabner, Riemenschneider, & Gool,
nential increase in the number of videos (Hussain et al., 2021; Lin, 2014; Zhang, Tao, & Wang, 2017) concentrate on the selection of
Zhao, Su, Wang, & Yang, 2018), as evidenced by the staggering amount meaningful segments with heuristic representations based on hand-
of videos uploaded to YouTube daily (James, 0000). This creates an crafted features. Nevertheless, these features are limited in their ability
urgent demand for intelligent video analysis, and video summarization to provide rich semantic information, and the temporal cues within
has become a hot research topic aimed at reducing this overload. At its videos are rarely exploited, which is insufficient for comprehensively
core, video summarization involves understanding the content of videos understanding the source video. Recently, deep learning-based meth-
and generating a concise yet comprehensive synopsis by removing ods (Jung, Cho, Kim, Woo, & Kweon, 2019; Yuan, Tay, Li, & Feng,
massive redundant content (Xiao, Zhao, Zhang, Guan, & Cai, 2020). 2020; Zhu, Lu, Han, & Zhou, 2022) have gained increasing interest.
To date, it has been studied in many specific scenarios (Bettadapura,
Usually, lots of methods adopt Recurrent Neural Networks (RNNs) to
Pantofaru, & Essa, 2016; Li, Pan, Wang, Xing, & Han, 2022; Merler
enrich visual features with contextual information. For instance, Zhang,
et al., 2019; Xu et al., 2021; Zhang, Zhu and Roy-Chowdhury, 2016).
Chao, Sha, and Grauman (2016b) fed frame-level representations into
Video summarization can be broadly categorized into static methods
Long Short-Term Memory (LSTM) for long-range temporal aggregation.
and dynamic methods (Huang & Wang, 2020; Yuan, Mei, Cui, & Zhu,
Although these variants of RNNs are capable of effectively modeling
2019). Static methods aim to select a set of key frames, while dynamic
the video sequence, they encounter gradient vanishing. Additionally,
methods pick several key shots composed of consecutive frames to
the difficulty in implementing parallel computing (Liang, Lv, Li, Zhang
represent the entire video content. This paper concentrates on key shot-
based video summarization, as dynamic summarization is more helpful and Zhang, 2022) is also an aspect that cannot be ignored.
for users to understand the video storyline. To address these issues, the fully convolutional sequence network
Existing methods have made unprecedented progress in summa- (Rochan, Ye, & Wang, 2018) is proposed. However, it cannot effec-
rizing videos (Jiang & Mu, 2022; Liu et al., 2022; Park, Lee, Kim, tively learn the pairwise relationship across frames. To improve the

∗ Corresponding author.
E-mail addresses: [email protected] (Y. Zhang), [email protected] (Y. Liu), [email protected] (C. Wu).

https://doi.org/10.1016/j.eswa.2024.123568
Received 20 January 2024; Received in revised form 20 February 2024; Accepted 24 February 2024
Available online 27 February 2024
0957-4174/© 2024 Elsevier Ltd. All rights reserved.
Y. Zhang et al. Expert Systems With Applications 249 (2024) 123568

Fig. 1. Overview of AMFM. The visual features modeled from the input video are sequentially fed into the content-aware enhancement module, the multi-granularity encoder,
and the scale-adaptive fusion module. Based on the fused contexts, the prediction head outputs importance scores, which are used for generating video summarization.

capability of video understanding, some attention-based hierarchical motivated by the fact that global and local contextual information is
methods (Zhao, Gong and Li, 2022; Zhu et al., 2022) are proposed to obtained from frames of significantly different ranges, hence intuitive
model local contextual information before learning global contextual fusion strategies cannot ensure powerful contextualized representations
information. They usually segment the entire video sequence into owing to the lack of deep interaction. Finally, the fused features are
subsequences, which serve as local modeling units and the basis for fed into a prediction head for frame-level importance score prediction
learning long-range contextual information (Zhao, Li, & Lu, 2017). and summary generation. In practice, the training procedure of the
However, such a rough division either introduces too much noise or proposed method can be easily parallelized, making it computationally
discards too many valuable characteristics. Although shot boundary de- efficient. Extensive experiments on standard datasets are conducted,
tection (Potapov, Douze, Harchaoui, & Schmid, 2014) is also exploited and empirical studies demonstrate the effectiveness of the proposed
in existing efforts, they are still susceptible to inaccurate subsequence method.
partitioning and cascading negative effects on global contextual feature In a nutshell, the main contributions of this paper can be summa-
learning due to the limited detection performance (Zhao, Li, & Lu, rized as follows:
2018). Actually, global and local contextual information can provide
• We develop the CAE module grounded on the attention scheme
coarse-grained and fine-grained semantic information about the input
to learn powerful visual representations to facilitate the effective-
video, respectively, allowing the deep learning-based model for com-
ness of modeling contextual information within videos.
prehensive video understanding. Because of defective learning schemes
• We build the MGE module, which simultaneously models global
in existing methods, contextual information within videos is still diffi-
and local contextual information and allows for robust learning
cult to sufficiently and effectively model, hindering the summarization
of finer temporal cues for accurate video understanding.
performance.
• We consider the significant semantic scale difference across global
Given the aforementioned problem, this paper proposes a novel
and local contextual information, devising the SAF module to
Attention-Guided Multi-Granularity Fusion Model (AMFM), which al-
form powerful contextualized representations.
lows for optimizing the modeling process from the context capturing
• We conduct extensive experiments on popular benchmark
and fusion perspective. As shown in Fig. 1, AMFM consists of three
datasets including SumMe and TVSum. The experimental results
dominant components including a content-aware enhancement (CAE)
clearly demonstrate the effectiveness of the proposed method.
module, a multi-granularity encoder (MGE), and a scale-adaptive fusion
(SAF) module. Firstly, based on the attention scheme, CAE targets The remaining parts of this paper are organized as follows. Sec-
to achieve feature enhancement by estimating the potential visual tion 2 briefly reviews the related work. Then, the proposed method is
relationship across frame-level and video-level embeddings, learning described in Section 3. Section 4 shows the experimental results and
powerful visual representations to facilitate the effectiveness of model- analysis. Finally, we conclude this work and provide the future prospect
ing contextual information. As an individual module, it can be easily in Section 5.
embedded into our model to be trained in an end-to-end manner.
Secondly, MGE simultaneously aggregates global and local contextual 2. Related work
information with the combination of self-attention and temporal con-
volution scheme. Different from previous hierarchical structure-based This section briefly reviews the related methods, broadly including
methods, our method models entire video sequences and subsequences two topics: video summarization and attention mechanism, which are
separately in a parallel manner, which avoids the potential negative discussed in the following.
impact of cascading learning. In addition, MGE can capture rich and
finer temporal cues using multiple convolution operators for accurate 2.1. Video summarization
video understanding.
After global and local contextual information is aggregated, SAF Creating high-quality video summaries has remained a continual
adaptively fuses these features by learning fusion attention. This is challenge, prompting researchers to investigate numerous promising

2
Y. Zhang et al. Expert Systems With Applications 249 (2024) 123568

methods over the years. Existing methods can be broadly categorized Currently, attention-based methods have exhibited remarkable
into two groups: traditional methods, which rely on conventional tech- progress in a wide range of domains, including object detection (Carion
niques, and deep learning-based methods which harness the power of et al., 2020), image segmentation (Cheng, Misra, Schwing, Kirillov, &
neural networks. Traditional methods typically utilize the unsupervised Girdhar, 2022), speech recognition (Yeh et al., 2021), and person re-
learning paradigm to identify and curate key frames or shots that identification (Zhao, Wang et al., 2022). For instance, Mao, Yang, Lin,
encapsulate the core content. Cluster-based methods (Chu, Song, &
Xuan, and Liu (2022) proposed positional attention-guided transformer-
Jaimes, 2015; De Avila et al., 2011; Zhuang, Rui, Huang, & Mehro-
like architecture to model features within and across the visual and
tra, 1998) represent an initial and straightforward attempt at video
language modalities. Yang, Miech, Sivic, Laptev, and Schmid (2022)
summarization. The objective is to cluster visually similar frames or
shots and designate their centers as representative summaries of the leveraged self-attention to jointly model spatial and visual-linguistic
original content. On the other hand, dictionary learning belongs to interactions. Badamdorj, Rochan, Wang, and Cheng (2022) proposed
the realm of unsupervised methods. It entails the selection of repre- a simple contrastive learning method to detect video highlights, which
sentative elements from the video to construct a summary dictionary were directly selected by attention scores. For the video summarization
capable of accurately reconstructing the content. For example, Mei task, Ji, Xiong, Pang, and Li (2020) combined RNNs with self-attention
et al. (2014) put forward a sparse dictionary selection method by 𝐿2,0 to mimic the way of selecting the shots of humans. Fu and Wang
norm. Wang et al. (2016) introduced a similar inhibition constraint for (2021) designed a self-attention binary neural tree to address shot-
increasing the diversity of summaries. Nevertheless, the performance level video summarization, where feature representations are learned
of these traditional methods remains suboptimal due to their lim- from coarse to fine. Zhu et al. (2022) presented a hierarchical attention
ited representation capability, highlighting the need for deep learning model via multi-scale features. In particular, it tried to learn the intra-
methods.
block attention and the inter-block attention, all of which including
The fact that videos are displayed frame by frame underscores the
frame embeddings are immediately concatenated and fed into a scoring
importance of aggregating the temporal cues within the video sequence
and this core idea has been widely studied in computer vision (Cui, module.
2022; Zhang, Guo, Wu, Li & Tao, 2023; Zhang, Kang, Liu and Zhu, Motivated by the success of the attention mechanism, we design
2023; Zhang, Zhang, Wu and Tao, 2023). Due to the outstanding an attention-based feature enhancement module to guarantee more
modeling capability, RNN-based methods achieve substantial improve- effective features by estimating the potential visual relationship across
ment (Fu & Wang, 2021; Zhong, Wang, Zou, Hong, & Hu, 2021). To frame-level and video-level embeddings. Moreover, we incorporate
predict importance scores, Zhang et al. (2016b) utilized bidirectional attention mechanism and temporal convolution into a unified learn-
LSTM to model the forward and backward dependencies, and intro- able module to robustly learn multi-granularity contextual information
duced determinantal point processes (DPP) to increase the diversity of within videos. Finally, we reflect on the significant difference in the
summary content. On the basis of the hierarchical LSTM network (Zhao semantic scale across global and local temporal cues and adopt a fusion
et al., 2017), Zhao et al. (2018) incorporated shot boundary detec-
module, which achieves adaptive fusion by learning attention-based
tion and sequence modeling into one unified method to select key
fusion weights. Experimental results on standard datasets demonstrate
shots. Zhou, Qiao, and Xiang (2018) proposed a reinforcement learning
method for video summarization, which comprehensively considered the effectiveness of the proposed method.
the dissimilarity and representativeness of summary results. Mahas-
seni, Lam, and Todorovic (2017) combined a LSTM-based key frame
3. Proposed method
selector with a discriminator to generate video summarization through
adversarial learning. Apostolidis, Adamantidou, Metsai, Mezaris, and
Patras (2021) introduced the Actor-Critic model into the summarization 3.1. Preliminary
task and tried to learn a policy to select key shots. In practice, these
RNN-based networks are generally hampered by their expensive com-
putational cost. To achieve parallel computing, Rochan et al. (2018) This section briefly reviews the multi-head attention mechanism
employed the fully convolutional sequence model to label each frame. (MHAM) in Transformer (Vaswani et al., 2017) since it plays a crucial
Although these methods have proved to be effective, they either role in our overall method. In order to make the method description
ignore the pairwise relationships across video frames or the local clearer and avoid excessive redundant descriptions, the part involving
contextual information within the video sequence, both of which are MHAM simply uses symbols instead of listing the detailed calculation
essential to video understanding. Additionally, in these methods that si- process.
multaneously learn global and local contextual information, they might Concretely, MHAM takes a query matrix 𝑸, a key matrix 𝑲, and
still face the situation of inappropriate video sequence partitioning,
a value matrix 𝑽 as input and maps them to different representa-
which leads to difficulties in fully and effectively modeling contextual
tion subspaces using linear transformation layers. These features are
information. Our AMFM can overcome the aforementioned issues and
enriched with global dependencies according to scaled dot-product
successfully outperform state-of-the-art methods.
attention. The final features can be obtained by concatenating all
2.2. Attention mechanism outputs of different subspaces, followed by a linear transformation
layer. Mathematically, it is represented as:
The attention mechanism, which mimics the selective cognitive
MultiHead(𝑸, 𝑲, 𝑽 ) = Concat(𝑯 1 , 𝑯 2 , … , 𝑯 ℎ )𝑊 𝑜 (1)
function of humans in focusing on relevant information, has emerged
as a powerful deep learning technique (Niu, Zhong, & Yu, 2021). As
where
an exceptional method for processing sequential data, the self-attention ( )
block initially showed remarkable performance in the machine transla- 𝑸𝑾 𝑞𝑖 (𝑲𝑾 𝑘𝑖 )T
𝑯 𝒊 = Softmax √ 𝑽 𝑾 𝑣𝑖 (2)
tion task (Vaswani et al., 2017). It assigns weights to each element by 𝑑ℎ
calculating the pairwise relationship, allowing the current location to
access all positions without considering their distance. In comparison where ℎ denotes the number of attention heads, which is simply set to 1
to sequential models like LSTM, self-attention overcomes the inher- in this paper to save parameters. 𝑯 𝒊 is the output of 𝑖th attention head.
ent issues of RNNs, such as inefficient computing parallelization and 𝑑ℎ is used for scaling. 𝑾 𝑞𝑖 , 𝑾 𝑘𝑖 , 𝑾 𝑣𝑖 , and 𝑾 𝑜𝑖 are learnable parameters.
historical information decay with increasing sequence length. MHAM performs self-attention when 𝑸 = 𝑲 = 𝑽 .

3
Y. Zhang et al. Expert Systems With Applications 249 (2024) 123568

Fig. 2. Pipeline of the CAE module, which primarily aims to learn attention for each
frame to enhance visual features by weighting.

3.2. Overview

Fig. 1 illustrates the overview of the proposed method. Specifically,


given an input video with 𝑁 frames, AMFM initially utilizes a pre-
trained feature extractor to encode frames, forming visual features 𝑭 =
[𝒇 1 , 𝒇 2 , … , 𝒇 𝑁 ], where 𝒇 𝑖 ∈ R𝐷 is the feature vector of 𝑖th frame and 𝐷
indicates the dimension. These features are fed into CAE to be enhanced
by referring to the potential visual relationship across frame-level Fig. 3. Illustration of MGE, which consists of (a) a coarse-grained stream and (b) a
fine-grained stream to simultaneously model global and local contextual information.
and video-level embeddings. Afterward, both coarse-grained and fine-
grained contextual information is simultaneously captured by MGE.
Next, SAF adaptively fuses multi-granularity contextual information
by learning fusion attention. Finally, the fused contextual information 3.4. Multi-granularity encoder
is fed into the prediction head to compute importance scores 𝑺 =
[𝑠1 , 𝑠2 , … , 𝑠𝑁 ], which are used to select key shots under a duration Temporal cues are of great essence to video understanding (Wang
constraint. et al., 2021). Modeling the entire video sequence is capable of pro-
viding a coarse-grained view of the storyline while modeling the local
3.3. Content-aware enhancement module
sequences can effectively tackle the detailed information happening in
certain period segments. We present MGE, which consists of a coarse-
The majority of existing methods usually exploit visual features
grained stream (CGS) and a fine-grained stream (FGS) to aggregate
extracted by pre-trained deep networks to perform subsequent tasks.
global and local contextual information based on enhanced visual
However, these independent features only reflect the visual content at
the current position and do not consider their interrelationships with features. The core idea is shown in Fig. 3.
the entire video content, which might result in bottlenecks in video (1) Coarse-grained Stream: The long-range modeling capability and
understanding. Inspired by the success of the attention mechanism, we the computing efficiency can be measured by the maximum path length
propose CAE to address this limitation, which enables each frame to and the minimum number of sequential operations (Vaswani et al.,
receive informative guidance signals through modeling the similarity 2017). The self-attention mechanism shows great advantages in both
relationship across frame-level and video-level representations. By us- aspects compared to RNNs. Regarding global contextual information,
ing the idea, frames related to video content are given higher attention, we employ self-attention to obtain the responses at all positions, which
while irrelevant backgrounds are suppressed, further improving the can dramatically reduce the expensive computing cost brought by
discriminability of features. RNNs and achieve remarkable temporal aggregation. Particularly, CGS
The pipeline of CAE is depicted in Fig. 2. Specifically, similar to Xiao begins by projecting the enhanced features 𝑭 𝑟 into 𝑭 𝑟𝑞 , 𝑭 𝑟𝑘 , and 𝑭 𝑟𝑣 ,
et al. (2020), we initiate the process by defining a video-level embed- respectively, followed by pairwise attention calculation and feature
ding 𝑭 𝑒 ∈ R1×𝐷 . This is achieved through global average pooling (GAP) aggregation. Simply put, the globally contextualized information 𝑮 ∈
along the temporal dimension, which allows us to obtain a general R𝑁×𝐷 can be represented by:
representation of video content. Mathematically, the calculation can be
written as follows: 𝑮 = MultiHead(𝑭 𝑟 , 𝑭 𝑟 , 𝑭 𝑟 ) (5)
1 ∑
𝑁
𝑭𝑒 = 𝒇𝑖 (3) (2) Fine-grained Stream: FGS concentrates on modeling temporal
𝑁 𝑖=1 cues within tiny local windows to finely understand the video story-
Through MHAM, we compute the similarity attention 𝑨𝑒 ∈ R1×𝑁 line. Specifically, temporal convolution is selected as our fine-grained
by setting 𝑭 𝑒 and 𝑭 as the query and key matrices, respectively, which context aggregator due to its excellent performance in extracting local
is used to reveal how 𝑖th frame is similar to the input video itself. features in computer vision (Ke, Sun, Li, Yan, & Lau, 2022; Liang, Guo,
Subsequently, the similarity attention is exploited to form weighted Li, Jin and Shen, 2022; Zhang, Song and Li, 2023). Multi-granularity
representations 𝑭 𝑖𝑛𝑡𝑒𝑟 ∈ R𝑁×𝐷 . This calculation can be written as: features can provide more valuable assistance, which encourages us
T to specify more than one window size. Relied on the consideration
̃𝑒 ⊙ 𝑭 𝑾 𝑒
𝑭 𝑖𝑛𝑡𝑒𝑟 = 𝑨 (4) of summarization effectiveness and the number of parameters, the
{ }
̃𝑒 𝑒
where 𝑨 ∈ R𝐷×𝑁 is obtained by repeating 𝑨 . ⊙ denotes element-wise candidate sizes are pre-denoted as 𝑘1 = 3, 𝑘2 = 5, 𝑘3 = 7 , respectively.
production. 𝑾 𝑒 ∈ R𝐷×𝐷 is the projection parameter to be learned. The Accordingly, to alleviate the computational burden arising from multi-
enhanced features 𝑭 𝑟 ∈ R𝑁×𝐷 are calculated by a linear transform. ple windows, we further exploit depth-separable convolution to achieve
Based on CAE, our method learns discriminative visual features by our purpose. Technically, the fine-grained stream starts with encoded
assigning attention to frames with different video content correlations. representation 𝑭 𝑟𝑣 to model representations from the same latent space

4
Y. Zhang et al. Expert Systems With Applications 249 (2024) 123568

Fig. 5. Pipeline of the prediction head, which mainly consists of two fully connected
layers and a sigmoid function for importance score prediction.

layer to model global attention 𝑪 𝑓𝑔 ∈ R1×𝐷 . Formally, this can be


formulated as follows:

𝑪 𝑓𝑙 = H(𝑪 𝑓 ) (8)
𝑪 𝑓𝑔 = H(𝛿(𝑪 )) 𝑓
(9)

where 𝛿(⋅) denote the GAP operation. Subsequently, we incorporate the


learned attentions into a unified representation form, which is followed
by a sigmoid function to generate an adaptive attention matrix 𝑴 ∈
R𝑁×𝐷 :
( )
𝑴 = 𝜎 𝑪 𝑓𝑙 + 𝑪 𝑓𝑔 (10)

Fig. 4. Illustration of the SAF module, which consists of dual pathways to adaptively where 𝜎(⋅) is the sigmoid function. Based on the attention matrix, the
perform deep context fusion by learning fusion attention. fused contextual representations 𝑪 ℎ ∈ R𝑁×𝐷 are formed by a weighted
averaging. Mathematically, this process can be written as:

with the coarse-grained stream. The result regarding window 𝑘𝑖 (𝑖 = 𝑪 ℎ = 𝑴 ⊙ 𝑮 + (𝟏 − 𝑴) ⊙ 𝑳 (11)


1, 2, 3) can be calculated by:
Leveraging this module, our method is capable of conducting com-
𝑶𝑘𝑖 = PConv(DConv𝑘𝑖 (𝑭 𝑟𝑣 )) (6) prehensive and adaptive context fusion while preserving important
information to accurately represent the storyline. Moreover, the well-
where PConv(⋅) and DConv(⋅) denote the pointwise and depthwise
designed bottleneck structure significantly reduces the training burden,
convolution. Since they are aggregated within extremely similar frame
resulting in a concise and feasible method.
ranges, we form the final fine-grained contextual information 𝑳 ∈
R𝑁×𝐷 by directly summing them up:

𝑤 3.6. Summary generation
𝑳= 𝑎𝑖 𝑶 𝑘 𝑖 (7)
𝑖=1
The proposed method includes a prediction head, in addition to
where 𝑤 = 3 and 𝑎𝑖 ∈ {0, 1} denotes whether window 𝑘𝑖 participates in
the components mentioned before, which is responsible for predicting
calculation. Our default model only incorporates 𝑘1 and 𝑘2 , which will
importance scores for each frame according to the fused contextual
be discussed in Section 4.
information 𝑪 ℎ . Its architecture, depicted in Fig. 5, consists primarily
of two fully connected layers. The final summary of an input video is
3.5. Scale-adaptive fusion module
created by selecting a set of sub-shots. In particular, following previous
efforts, we exploit the Kernel Temporal Segmentation (KTS) (Potapov
The purpose of context fusion is to gather the positive aspects of et al., 2014) to detect change points, which are used for segmenting an
features across multiple levels to represent video content in a con- entire video into 𝑈 disjoint subsequences. Then, we convert frame-level
densed form. Typically, intuitive fusion strategies (e.g., summation) are importance scores to shot-level importance scores by taking an average
often preferred due to their ease of computation. However, coarse- within each shot:
grained and fine-grained contextual information cover a significant
𝑙𝑖
difference in the semantic scale, hence, these simple operations can- 1∑
𝑝𝑖 = 𝑠 (12)
not achieve sufficient context fusion owing to limited information 𝑙𝑖 𝑘=1 𝑘
exchange. To address this limitation, we propose SAF to adaptively
where 𝑝𝑖 and 𝑙𝑖 are importance score and duration of 𝑖th shot, respec-
perform more effective fusion across coarse-grained and fine-grained
tively. The duration of a summary is limited to no more than 15%
contextual information learned by MGE by learning fusion attention.
of the total duration. Next, we formulate the selection of a key-shot-
As depicted in Fig. 4, SAF is composed of dual-learning pathways
based summary as a knapsack problem, which can be mathematically
to aggregate features comprehensively. Similar to the bottleneck layer
represented as:
in ResNet (He, Zhang, Ren, & Sun, 2016), each pathway includes a
bottleneck structure H(⋅) that consists of two temporal convolution
layers and an activation function sandwiched by them, where the first ∑
𝑈 ∑
𝑈

temporal convolution reduces the feature dimension from 𝐷 to 𝐷∕𝑚, max 𝑥𝑖 𝑝𝑖 , 𝑠.𝑡. 𝑥𝑖 𝑙𝑖 ≤ 0.15 × 𝑁 (13)
𝑖=1 𝑖=1
followed by another temporal convolution to restore the dimension to
𝐷. Concretely, the first pathway is responsible for learning local atten- where 𝑥𝑖 ∈ {0, 1} indicates whether 𝑖th shot is selected or not. We solve
tion 𝑪 𝑓𝑙 ∈ R𝑁×𝐷 from initial fusion features 𝑪 𝑓 ∈ R𝑁×𝐷 obtained by this problem by dynamic programming and generate summarization by
summation. The second pathway performs GAP before the bottleneck concatenating those shots with 𝑎𝑖 = 1 in chronological order.

5
Y. Zhang et al. Expert Systems With Applications 249 (2024) 123568

3.7. Network training (3) Implementation Detail: We employ GoogLeNet (Szegedy et al.,
2015) pre-trained on the ImageNet (Deng et al., 2009) to extract visual
To measure the difference between the ground truth scores and pre- features. The output of the pool-5 layer is taken as the feature vector
dicted scores, we adopt the mean-square error (MSE) as our objective to represent the visual content. It indicates that the dropout, fully
function  to iteratively optimize the model parameters during the connected layer, and softmax are excluded. The dimension 𝐷 of visual
training process. The form can be written as: features is 1024. The number of attention heads is set to 1 for saving
parameters. By default, we use convolution kernels of size 3 and 5 for
1 ∑
𝑁
 (𝜽) = (𝑠 − 𝑦𝑖 )2 (14) fine-grained context modeling in MGE. In SAF, by sensitivity analysis,
𝑁 𝑖=1 𝑖 we set the reduction rate 𝑚 = 4. Moreover, we adopt convolution
where 𝜽 is the parameters of our AMFM. 𝑦𝑖 is the human-created kernels of size 3 and 1 for feature learning in the first and second
importance score of 𝑖th frame. pathways, respectively. We train our model using the Adam optimizer
with the learning rate of 2 × 10−5 for the SumMe dataset, 5 × 10−5 for the
4. Experiment TVSum dataset. The training process is terminated after 300 epochs.

This section focuses on evaluating the summarization performance 4.2. Comparisons with state-of-the-art methods
of our method on benchmark datasets. We start by providing the exper-
imental setup, which includes the description of datasets, evaluation (1) Baselines: We perform comprehensive experiments to compare
metrics, and implementation details. Then, we present quantitative the effectiveness of the proposed method with that of existing state-of-
results, ablation study, and qualitative results. the-art methods, across different evaluation manners. Specifically, these
methods can be categorized into traditional methods including TV-
Sum (Song et al., 2015), DPP (Zhang, Chao, Sha, & Grauman, 2016a),
4.1. Experimental setup
ERSUM (Li, Zhao, & Lu, 2017), MSDS-CC (Meng, Wang, Wang, Yuan, &
Tan, 2017), and deep learning-based methods including vsLSTM (Zhang
(1) Datasets: We evaluate our method using two well-established
et al., 2016b), dppLSTM (Zhang et al., 2016b), SUM-GAN (Mahasseni
benchmark datasets including SumMe (Gygli et al., 2014) and TV-
et al., 2017), DR-DSN (Zhou et al., 2018), FCSN (Rochan et al., 2018),
Sum (Song, Vallmitjana, Stent, & Jaimes, 2015). The SumMe dataset
SASUM (Wei et al., 2018), HSA-RNN (Zhao et al., 2018), ACGAN (He
consists of 25 videos depicting various events, such as food and sports,
et al., 2019), CSNet (Jung et al., 2019), A-AVS (Ji et al., 2020), M-
and each video is annotated by at least 15 users. The duration of
AVS (Ji et al., 2020), RSGN (Zhao, Li, Lu and Li, 2022), DHAVS (Lin,
each video ranges from 1 to 6 min. The TVSum dataset comprises 50
Zhong, & Fares, 2022), LMHA (Zhu et al., 2022), CAAN (Liang, Lv et al.,
videos from 10 categories, with each video annotated by 20 users.
2022), HMT (Zhao, Gong et al., 2022), VJMHT (Li, Ke, Gong and Zhang,
The duration of the videos varies from 2 to 10 min. In addition, we
2023), and SSPVS (Li, Ke, Gong and Drummond, 2023).
employ YouTube (De Avila et al., 2011) and OVP (De Avila et al.,
(2) Comparisons Under the Standard Setting : Table 1 presents the
2011) to augment our training data, both of which are created with
experimental results for the standard setting on the SumMe and TVSum
key frame-level labels and contain 89 videos in total. The datasets
datasets. The results demonstrate that the proposed method performs
used in our experiments contain videos with quick and slow scene
exceptionally well in comparison to existing state-of-the-art methods.
changes, posing challenges in evaluating the effectiveness of the pro-
Notably, the values reported indicate that the deep learning-based
posed method. Regarding the evaluation setting, following previous
methods, which leverage feature representations with rich semantic
methods, 80% videos of each dataset are selected for training and
information and effective modeling of temporal cues, generally out-
20% for testing. Existing methods primarily adopt 5 random splits
perform traditional methods. AMFM is based solely on matrix op-
(5 Random), 10 random splits (10 Random), and multiple random
erations and does not use any recursive structures used in works
splits (M Random) and report average summarization performance.
like (Yuan et al., 2019; Zhang et al., 2016b; Zhao, Li et al., 2022).
Nevertheless, such split manners inevitably can cause certain videos
This design choice enables efficient GPU parallelization. Although our
in the dataset to be used or omitted multiple times, leading to unfair
method achieves comparable performance to M-AVS (Ji et al., 2020)
performance comparisons due to inappropriate data splitting schema.
and LHMA (Zhu et al., 2022) on the TVSum dataset, all the experiments
To alleviate this problem, this paper follows He et al. (2019), Jung et al.
are conducted using standard cross-validation instead of a random split,
(2019), Li, Ke, Gong and Zhang (2023), Liang, Lv et al. (2022) and
indicating comprehensive and reliable evaluation results as discussed
Zhou et al. (2018), adopting standard 5-fold cross-validation (5 FCV)
in Section 4.1. Compared with those efforts adopting the 5 FCV test
to ensure that each video participates in testing procedure, and thus
method, AMFM surpasses them by a large margin in the F-score evalua-
generates more reliable experimental results.
tion, which can be attributed to the excellent extraction and processing
(2) Evaluation Metric: In order to evaluate our method comprehen-
capability of our meticulous model. By observing the provided param-
sively with other state-of-the-art methods, we first report the F-score
eters, our method can achieve a good trade-off compared with other
performance on the SumMe and TVSum datasets. Suppose 𝑿 𝑝 and 𝑿 ℎ
state-of-the-art methods.
represent the predicted summaries and human summaries, respectively.
(3) Comparisons Under the Augmented and Transfer Settings: We fur-
The F-score can be computed by:
ther compare the proposed method with state-of-the-art methods under
overlapped duration of 𝑿 𝑝 and 𝑿 ℎ the augmented and transfer settings, as shown in Table 2. We can
𝑃 = (15)
duration of 𝑿 𝑝 observe that AMFM achieves an encouraging performance. The transfer
overlapped duration of 𝑿 𝑝 and 𝑿 ℎ setting is an effective but challenging method to verify the transfer-
𝑅= (16) ability of the model. The reported performance values on both datasets
duration of 𝑿 ℎ
illustrate that AMFM can learn meaningful semantic information from
2×𝑃 ×𝑅 videos from other domains. The performance under the transfer setting
𝐹 -𝑠𝑐𝑜𝑟𝑒 = × 100% (17)
𝑃 +𝑅 is significantly lower, indicating that cross-dataset learning remains a
Moreover, it has been studied in a recent study (Otani, Nakashima, difficult problem that requires further investigation.
Rahtu, & Heikkila, 2019) that the F-score metric may not be sensitive (4) Comparisons of Rank Correlation Coefficients: Moreover, we also
enough to the differences in the importance score computation, thus take into account the rank order statistics on the TVSum dataset to eval-
we also use Kendall’s 𝜏 and Spearman’s 𝜌 to calculate correlation uate the effectiveness of different methods, as recommended by Otani
coefficients between the ground truth scores and predicted scores. et al. (2019). Table 3 presents the results, which indicate that random

6
Y. Zhang et al. Expert Systems With Applications 249 (2024) 123568

Table 1
Comparisons of the F-score (%) and the number of parameters (M) with state-of-the-art methods under the standard evaluation setting.
Method Shot segmentation Feature SumMe ↑ TVSum ↑ Params ↓ Test method
TVSum (Song et al., 2015) Change-point detection HoG+GIST+SIFT – 50.0 – –
DPP (Zhang et al., 2016a) KTS AlexNet 40.9 – – 5 Random
ERSUM (Li et al., 2017) Uniform segmentation VGGNet-16 43.1 59.4 – –
MSDS-CC (Meng et al., 2017) KTS GIST+GoogLeNet 40.6 52.3 – –
vsLSTM (Zhang et al., 2016b) KTS GoogLeNet 37.6 54.2 2.63 5 Random
dppLSTM (Zhang et al., 2016b) KTS GoogLeNet 38.6 54.7 2.63 5 Random
SUM-GAN (Mahasseni et al., 2017) KTS GoogLeNet 41.7 56.3 295.86 5 Random
FCSN (Rochan et al., 2018) KTS GoogLeNet 47.5 56.8 36.58 M Random
SASUM (Wei et al., 2018) KTS InceptionV3 45.3 58.2 44.07 10 Random
HSA-RNN (Zhao et al., 2018) Change-point detection VGGNet-16 42.3 58.7 4.20 5 Random
A-AVS (Ji et al., 2020) KTS GoogLeNet 43.9 59.4 4.40 5 Random
M-AVS (Ji et al., 2020) KTS GoogLeNet 44.4 61.0 4.40 5 Random
RSGN (Zhao, Li et al., 2022) KTS GoogLeNet 45.0 60.1 – 5 Random
DHAVS (Lin et al., 2022) KTS 3D ResNeXt-101 45.6 60.8 – 5 Random
LMHA (Zhu et al., 2022) KTS GoogLeNet 51.1 61.0 – 5 Random
HMT (Zhao, Gong et al., 2022) KTS GoogLeNet 44.1 60.1 – 5 Random
DR-DSN (Zhou et al., 2018) KTS GoogLeNet 42.1 58.1 2.63 5 FCV
ACGAN (He et al., 2019) KTS GoogLeNet 47.2 59.4 – 5 FCV
CSNet (Jung et al., 2019) KTS GoogLeNet 48.6 58.5 – 5 FCV
CAAN (Liang, Lv et al., 2022) KTS GoogLeNet 50.6 59.3 – 5 FCV
VJMHT (Li, Ke, Gong and Zhang, 2023) KTS GoogLeNet 50.6 60.9 35.44 5 FCV
SSPVS (Li, Ke, Gong and Drummond, 2023) KTS GoogLeNet 48.7 60.3 – 5 FCV
AMFM KTS GoogLeNet 51.8 61.0 13.66 5 FCV

Table 2
Comparisons of the F-score (%) with state-of-the-art methods under the canonical (C), augmented (A), and transfer (T) settings,
respectively.
Method SumMe ↑ TVSum ↑
C A T C A T
vsLSTM (Zhang et al., 2016b) 37.6 41.6 40.7 54.2 57.9 56.9
dppLSTM (Zhang et al., 2016b) 38.6 42.9 41.8 54.7 59.6 58.7
SUM-GAN (Mahasseni et al., 2017) 41.7 43.6 – 56.3 61.2 –
DR-DSN (Zhou et al., 2018) 42.1 43.9 42.6 58.1 59.8 58.9
FCSN (Rochan et al., 2018) 47.5 51.1 44.1 56.8 59.2 58.2
HSA-RNN (Zhao et al., 2018) 42.3 42.1 – 58.7 59.8 –
CSNet (Jung et al., 2019) 48.6 48.7 44.1 58.5 57.1 57.4
A-AVS (Ji et al., 2020) 43.9 44.6 – 59.4 60.8 –
M-AVS (Ji et al., 2020) 44.4 46.1 – 61.0 61.8 –
RSGN (Zhao, Li et al., 2022) 45.0 45.7 44.0 60.1 61.1 60.0
DHAVS (Lin et al., 2022) 45.6 46.5 43.5 60.8 61.2 57.5
LMHA (Zhu et al., 2022) 51.1 52.1 45.4 61.0 61.5 55.1
HMT (Zhao, Gong et al., 2022) 44.1 44.8 – 60.1 60.3 –
VJMHT (Li, Ke, Gong and Zhang, 2023) 50.6 51.7 46.4 60.9 61.9 58.9
SSPVS (Li, Ke, Gong and Drummond, 2023) 48.7 50.4 45.8 60.3 61.8 57.8
AMFM 51.8 52.8 46.4 61.0 60.8 58.6

Table 3 summary performs the worst and is entirely irrelevant to manual anno-
Comparisons of rank correlation coefficients with state-of-the-art methods. tations. Compared to existing state-of-the-art methods under this more
Method Kendall’s 𝜏 ↑ Spearman’s 𝜌 ↑ reliable evaluation metric, our method yields significantly superior per-
Random (Otani et al., 2019) 0.000 0.000 formance and even surpasses human summaries. This finding suggests
Human (Otani et al., 2019) 0.177 0.204 that there may be internal inconsistency within human annotations,
dppLSTM (Zhang et al., 2016b) 0.042 0.055
whereas our method can effectively capture multiple users’ preferences.
DR-DSN (Zhou et al., 2018) 0.020 0.026
HSA-RNN (Zhao et al., 2018) 0.082 0.088 Besides, our method can accurately capture the storyline by robustly
CAAN (Liang, Lv et al., 2022) 0.038 0.050 mining and fusing global and local information, enabling it to precisely
DHAVS (Lin et al., 2022) 0.082 0.089 predict the importance of each frame. It is noteworthy that SSPVS yields
RSGN (Zhao, Li et al., 2022) 0.083 0.090 comparable performance with AMFM in these correlation coefficients,
HTM (Zhao, Gong et al., 2022) 0.096 0.107
possibly owing to it being trained on an extremely larger-scale dataset.
VJMHT (Li, Ke, Gong and Zhang, 2023) 0.097 0.105
SSPVS (Li, Ke, Gong and Drummond, 2023) 0.177 0.233 However, our AMFM, trained on a small dataset, achieves comparable
AMFM 0.179 0.233 or even superior performance, thus demonstrating the effectiveness of
the proposed method.
(5) Comparisons of Elapsed Time: To investigate the inference effi-
Table 4
Comparisons of elapsed time (millisecond).
ciency of AMFM, we provide experimental results about our method
Dataset SumMe ↓ TVSum ↓
and a RNN-based method in Table 4. Here, as a classic and effective
(Average duration) (2min 26s) (3min 55s) architecture, DR-DSN (Zhou et al., 2018) is chosen as the target for
DR-DSN (Zhou et al., 2018) 5.18 7.87
comparison, which only includes a concise Bi-LSTM structure. Since our
AMFM 5.24 5.61 method is obviously more complex, the reported values can highlight
our advantage in terms of computing time. All the results are conducted
under a completely fair experimental environment using a 2080Ti GPU.
After obtaining all inference times of videos on the SumMe and TVSum

7
Y. Zhang et al. Expert Systems With Applications 249 (2024) 123568

Table 6
Ablation study on multi-granularity information. ✓ and ✗ denote the corresponding
contextual information is learned or not.
Exp. Granularity settings SumMe ↑ TVSum ↑
Coarse-grained Fine-grained
1 ✗ ✗ 48.7 58.9
2 ✓ ✗ 48.8 60.0
3 ✗ ✓ 50.0 60.1
4 ✓ ✓ 51.8 61.0

Table 7
Ablation study on local window. ✓ denotes whether the corresponding temporal
convolution kernel is used in the MGE module.
Exp. Local window settings SumMe ↑ TVSum ↑
𝑘1 = 3 𝑘2 = 5 𝑘3 = 7
1 ✓ 50.6 60.4
2 ✓ 51.4 60.4
3 ✓ 51.9 60.2
4 ✓ ✓ 51.8 61.0
Fig. 6. Ablation study on the dimension reduction rate 𝑚. 5 ✓ ✓ 51.9 60.4
6 ✓ ✓ 53.0 60.3
Table 5 7 ✓ ✓ ✓ 51.9 60.6
Ablation study on dominant components. ✓ denotes whether the corresponding module
is included in the model.
Exp. Component settings SumMe ↑ TVSum ↑
Subsequently, we further integrate MGE into our architecture to model
CAE MGE Sum. Concat. SAF
contextual information within videos. By obtaining both coarse-grained
1 47.2 56.7
and fine-grained contextual information, we implement different fusion
2 ✓ 48.7 58.9
3 ✓ ✓ ✓ 50.1 60.7 strategies, including summation (Sum.), concatenation (Concat.), and
4 ✓ ✓ ✓ 50.7 59.5 SAF. The summation-based model directly fuses multi-granularity con-
5 ✓ ✓ ✓ 51.8 61.0 texts by element-wise summation and the concatenation-based model
leverages a fully connected layer to reduce the dimension of con-
catenated multi-granularity contexts from 2 × 𝐷 to 𝐷. Notably, SAF
outperforms other fusion methods by a substantial margin, thanks to
datasets, we report the average values, respectively. In addition, to
its unique adaptive fusion mechanism. In summary, these experimen-
eliminate the time deviation caused by irrelevant factors, we only
tal findings strongly underscore the importance of each individual
consider the time it takes to predict the importance score, excluding
component in our method.
pre-processing and post-processing. It can be found that AMFM has
(3) Study on Multi-Granularity Information: We explore the impact of
significantly less running time on the TVSum datasets and a comparable
coarse-grained and fine-grained contextual information by sequentially
running time on the SumMe dataset compared with DR-DSN. This is
removing CGS and FGS in the MGE module. As depicted in Table 6,
because the TVSum dataset has a longer duration, hence, recursive
the model (Exp. 1) devoid of any contextual information exhibits
neural networks need to execute recursively more times while our
the worst performance across the listed situations, underscoring the
architecture can support parallel computing.
pivotal role of context in comprehending video content. Introducing
single-granularity contextual information (Exp. 2 and Exp. 3) yields
4.3. Ablation study substantial performance enhancements, enhancing the model’s perfor-
mance by a minimum of 1.6% and 3.3% on the SumMe and TVSum
There are three dominant components in AMFM, including CAE, datasets, respectively. Notably, fine-grained context confers a more
MGE, and SAF. To verify the necessity and contribution of each com- significant performance boost for summarization. This is attributable to
ponent, we implement ablation experiments and conduct analysis. our well-designed architecture, which adeptly captures precise content
(1) Study on Dimension Reduction Rate: We conduct a thorough understanding by robustly learning rich temporal cues within small
analysis of parameter 𝑚 in SAF to examine its effect on information local windows. These experimental results given by our default method
transmission. To obtain the most suitable parameter for video summa- (Exp. 4) substantiate the rationality behind our method.
rization, we sequentially set 𝑚 to 1 to 64 and report the experimental (4) Study on Local Window: To investigate the influence of different
results on the SumMe dataset and TVSum dataset. The results are local window sizes, we conduct ablation experiments by using different
presented in Fig. 6. We can observe that the F-scores are lower as 𝑚 in- combinations of temporal convolution in FGS of MGE. The experimen-
creases and it can be explained by that significant dimension reduction tal results are presented in Table 7. Our method achieves the best
would lead to more information loss. To balance the summarization performance on the SumMe dataset when utilizing both 𝑘2 and 𝑘3 ,
performance between the SumMe and TVSum datasets, we set 𝑚 to 4 as while on the TVSum dataset, it performs best with 𝑘1 and 𝑘2 . These
our default value, which determines the intermediate feature dimension results can be attributed to the nature of the videos in the TVSum
in the SAF module. dataset, which frequently feature abrupt shot changes. Consequently,
(2) Study on Dominant Components: The contribution of different employing larger local windows in such cases may result in a less
components has been thoroughly examined, and the results are sum- coherent information representation. Furthermore, it is worth noting
marized in Table 5. In our initial experiment (Exp. 1), the base model that the F-scores on the SumMe dataset exhibit significant fluctuations
only includes the prediction head, directly predicting importance scores with an absolute difference of 2.4%. This can be attributed to the
based on the pre-trained visual features. To enhance the representation specific evaluation protocol adopted for this dataset. Considering the
capability of these visual features, we introduce CAE to the base summarization performance on both datasets and the parameters, we
model (Exp. 2), which leads to significant improvements of 1.5% and opt to use 𝑘1 and 𝑘2 as the default configuration to ensure comparable
2.2% in F-scores on the SumMe and TVSum datasets, respectively. performance with other state-of-the-art methods.

8
Y. Zhang et al. Expert Systems With Applications 249 (2024) 123568

Fig. 7. Qualitative results on the SumMe and TVSum datasets. Four videos are selected as examples, including 9th and 19th videos in SumMe, 15th and 41th videos in TVSum.
The 𝑥-axis is the frame index. The images below are some example frames in generated summaries by our method.

Fig. 8. Correlation visualization results of 14th and 39th video in TVSum. The predicted scores are consistent with manual annotations, indicating that the AMFM can be aware
of video content and select important and valuable segments.

4.4. Qualitative results The results affirm that our method excels at identifying the most
significant shots, as indicated by their high-importance scores. This
underscores the capability of AMFM to effectively capture the primary
To visually demonstrate the effectiveness of our method, we present
content of the video. Furthermore, the generated summaries offer a
qualitative results in Fig. 7. In this illustration, the light gray bars
comprehensive narrative of the entire story and encompass a diverse
represent the ground truth summaries, while the colored bars depict range of content. This diversity enables viewers to quickly grasp the
the predicted summaries of our method. Additionally, at the bottom of activities depicted in the videos. Additionally, we provide correlation
the histogram, we showcase example frames chosen by our method. To results between ground truth scores and predicted scores in Fig. 8. The
ensure a comprehensive evaluation of the summarization performance, results clearly highlight the effectiveness of our method in modeling
we select four videos from both the SumMe and TVSum datasets, the relative importance of videos.
encompassing various topics such as landscapes, pets, and sports. These Interestingly, our visualization results reveal that despite achieving
selections allow us to thoroughly validate the performance of our a high F-score, some peak points are not selected. This phenomenon
method. may be attributed to the inherent limitations of the widely employed

9
Y. Zhang et al. Expert Systems With Applications 249 (2024) 123568

KTS segmentation method, which often prioritizes computational speed Chu, Wen-Sheng, Song, Yale, & Jaimes, Alejandro (2015). Video co-summarization:
at the expense of accuracy. Consequently, it becomes imperative for Video summarization by visual co-occurrence. In Proceedings of the IEEE conference
on computer vision and pattern recognition (pp. 3584–3592).
future research endeavors to concentrate on the development of more
Cui, Yiming (2022). Dynamic feature aggregation for efficient video object detection.
advanced shot boundary detection methods that can enhance the per- In Proceedings of the Asian conference on computer vision (pp. 944–960).
formance of video summarization methods. De Avila, Sandra Eliza Fontes, Lopes, Ana Paula Brandao, da Luz Jr, Antonio, & de
Albuquerque Araújo, Arnaldo (2011). VSUMM: A mechanism designed to produce
5. Conclusion static video summaries and a novel evaluation method. Pattern Recognition Letters,
32(1), 56–68.
Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, & Fei-Fei, Li (2009). ImageNet:
In this work, we propose an effective Attention-Guided Multi-
A large-scale hierarchical image database. In Proceedings of the IEEE conference on
Granularity Fusion Model, which comprises the CAE module, the MGE computer vision and pattern recognition (pp. 248–255).
module, and the SAF module, to sufficiently and effectively model Fu, Hao, & Wang, Hongxing (2021). Self-attention binary neural tree for video
contextual information for video summarization. Specifically, CAE summarization. Pattern Recognition Letters, 143, 19–26.
targets to complete the feature enhancement for pre-trained visual Gygli, Michael, Grabner, Helmut, Riemenschneider, Hayko, & Gool, Luc Van (2014).
Creating summaries from user videos. In Proceedings of the European conference on
features. Then, MGE is introduced to robustly model coarse-grained and
computer vision (pp. 505–520).
fine-grained contextual information. Finally, SAF is used to facilitate He, Xufeng, Hua, Yang, Song, Tao, Zhang, Zongpu, Xue, Zhengui, Ma, Ruhui, et
adaptive fusion across multi-granularity contextual information with al. (2019). Unsupervised video summarization with attentive conditional gener-
a significant difference in the semantic scale. The proposed AMFM is ative adversarial networks. In Proceedings of the ACM international conference on
extensively evaluated on benchmark datasets and demonstrates signif- multimedia (pp. 2296–2304).
icantly competitive performance. Moving forward, the integration of He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, & Sun, Jian (2016). Deep residual
learning for image recognition. In Proceedings of the IEEE conference on computer
multimodal information to generate high-quality summary results that vision and pattern recognition (pp. 770–778).
meet human preferences is an interesting avenue for future research. Huang, Cheng, & Wang, Hongmei (2020). A novel key-frames selection framework for
comprehensive video summarization. IEEE Transactions on Circuits and Systems for
CRediT authorship contribution statement Video Technology, 30(2), 577–589.
Hussain, Tanveer, Muhammad, Khan, Ding, Weiping, Lloret, Jaime, Baik, Sung Wook,
& de Albuquerque, Victor Hugo C. (2021). A comprehensive survey of multi-view
Yunzuo Zhang: Conceptualization, Supervision, Project administra-
video summarization. Pattern Recognition, 109, Article 107567.
tion, Funding acquisition, Writing – review & editing. Yameng Liu: James, Hale More than 500 hours of content are now being uploaded to
Conceptualization, Methodology, Investigation, Data curation, Writing – original draft, Writing – review & editing. Cunyu Wu: Conceptualization, Investigation, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

This work is jointly supported by the National Natural Science Foundation of China (No. 61702347, No. 62027801), the Natural Science Foundation of Hebei Province, China (No. F2022210007, No. F2017210161), the Science and Technology Project of Hebei Education Department, China (No. ZD2022100, No. QN2017132), and the Central Guidance on Local Science and Technology Development Fund, China (No. 226Z0501G).

References

Apostolidis, Evlampios, Adamantidou, Eleni, Metsai, Alexandros I., Mezaris, Vasileios, & Patras, Ioannis (2021). AC-SUM-GAN: Connecting actor-critic and generative adversarial networks for unsupervised video summarization. IEEE Transactions on Circuits and Systems for Video Technology, 31(8), 3278–3292.
Badamdorj, Taivanbat, Rochan, Mrigank, Wang, Yang, & Cheng, Li (2022). Contrastive learning for unsupervised video highlight detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 14042–14052).
Bettadapura, Vinay, Pantofaru, Caroline, & Essa, Irfan (2016). Leveraging contextual cues for generating basketball highlights. In Proceedings of the ACM international conference on multimedia (pp. 908–917).
Carion, Nicolas, Massa, Francisco, Synnaeve, Gabriel, Usunier, Nicolas, Kirillov, Alexander, & Zagoruyko, Sergey (2020). End-to-end object detection with transformers. In Proceedings of the European conference on computer vision (pp. 213–229).
Cheng, Bowen, Misra, Ishan, Schwing, Alexander G., Kirillov, Alexander, & Girdhar, Rohit (2022). Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1290–1299).
YouTube every minute. https://www.tubefilter.com/2019/05/07/number-hours-video-uploaded-to-youtube-per-minute/.
Ji, Zhong, Xiong, Kailin, Pang, Yanwei, & Li, Xuelong (2020). Video summarization with Attention-Based Encoder–Decoder networks. IEEE Transactions on Circuits and Systems for Video Technology, 30(6), 1709–1717.
Jiang, Hao, & Mu, Yadong (2022). Joint video summarization and moment localization by cross-task sample transfer. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 16388–16398).
Jung, Yunjae, Cho, Donghyeon, Kim, Dahun, Woo, Sanghyun, & Kweon, In So (2019). Discriminative feature learning for unsupervised video summarization. Vol. 33, In Proceedings of the AAAI conference on artificial intelligence (pp. 8537–8544).
Ke, Zhanghan, Sun, Jiayu, Li, Kaican, Yan, Qiong, & Lau, Rynson W. H. (2022). Modnet: Real-time trimap-free portrait matting via objective decomposition. Vol. 36, In Proceedings of the AAAI conference on artificial intelligence (pp. 1140–1147).
Li, Haopeng, Ke, Qiuhong, Gong, Mingming, & Drummond, Tom (2023). Progressive video summarization via multimodal self-supervised learning. In Proceedings of the IEEE winter conference on applications of computer vision (pp. 5584–5593).
Li, Haopeng, Ke, Qiuhong, Gong, Mingming, & Zhang, Rui (2023). Video joint modelling based on hierarchical transformer for co-summarization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3), 3904–3917.
Li, Wenxu, Pan, Gang, Wang, Chen, Xing, Zhen, & Han, Zhenjun (2022). From coarse to fine: Hierarchical structure-aware video summarization. ACM Transactions on Multimedia Computing Communications and Applications, 18(1s).
Li, Xuelong, Zhao, Bin, & Lu, Xiaoqiang (2017). A general framework for edited video and raw video summarization. IEEE Transactions on Image Processing, 26(8), 3652–3664.
Liang, Zhiyuan, Guo, Kan, Li, Xiaobo, Jin, Xiaogang, & Shen, Jianbing (2022). Person foreground segmentation by learning multi-domain networks. IEEE Transactions on Image Processing, 31, 585–597.
Liang, Guoqiang, Lv, Yanbing, Li, Shucheng, Zhang, Shizhou, & Zhang, Yanning (2022). Video summarization with a convolutional attentive adversarial network. Pattern Recognition, 131, Article 108840.
Lin, Tianwei, Zhao, Xu, Su, Haisheng, Wang, Chongjing, & Yang, Ming (2018). Bsn: Boundary sensitive network for temporal action proposal generation. In Proceedings of the European conference on computer vision (pp. 3–19).
Lin, Jingxu, Zhong, Sheng-hua, & Fares, Ahmed (2022). Deep hierarchical LSTM networks with attention for video summarization. Computers & Electrical Engineering, 97, Article 107618.
Liu, Tianrui, Meng, Qingjie, Huang, Jun-Jie, Vlontzos, Athanasios, Rueckert, Daniel, & Kainz, Bernhard (2022). Video summarization through reinforcement learning with a 3D spatio-temporal U-Net. IEEE Transactions on Image Processing, 31, 1573–1586.
Mahasseni, Behrooz, Lam, Michael, & Todorovic, Sinisa (2017). Unsupervised video summarization with adversarial LSTM networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 202–211).
Mao, Aihua, Yang, Zhi, Lin, Ken, Xuan, Jun, & Liu, Yong-Jin (2022). Positional attention guided transformer-like architecture for visual question answering. IEEE Transactions on Multimedia, 1–13.

Mei, Shaohui, Guan, Genliang, Wang, Zhiyong, He, Mingyi, Hua, Xian-Sheng, & Dagan Feng, David (2014). L2,0 constrained sparse dictionary selection for video summarization. In Proceedings of the IEEE international conference on multimedia and expo (pp. 1–6).
Meng, Jingjing, Wang, Suchen, Wang, Hongxing, Yuan, Junsong, & Tan, Yap-Peng (2017). Video summarization via multi-view representative selection. In Proceedings of the IEEE international conference on computer vision workshops (pp. 1189–1198).
Merler, Michele, Mac, Khoi-Nguyen C., Joshi, Dhiraj, Nguyen, Quoc-Bao, Hammer, Stephen, Kent, John, et al. (2019). Automatic curation of sports highlights using multimodal excitement features. IEEE Transactions on Multimedia, 21(5), 1147–1160.
Niu, Zhaoyang, Zhong, Guoqiang, & Yu, Hui (2021). A review on the attention mechanism of deep learning. Neurocomputing, 452, 48–62.
Otani, Mayu, Nakashima, Yuta, Rahtu, Esa, & Heikkila, Janne (2019). Rethinking the evaluation of video summaries. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7596–7604).
Park, Jungin, Lee, Jiyoung, Kim, Ig-Jae, & Sohn, Kwanghoon (2020). Sumgraph: Video summarization via recursive graph modeling. In Proceedings of the European conference on computer vision (pp. 647–663).
Potapov, Danila, Douze, Matthijs, Harchaoui, Zaid, & Schmid, Cordelia (2014). Category-specific video summarization. In Proceedings of the European conference on computer vision (pp. 540–555).
Rochan, Mrigank, Ye, Linwei, & Wang, Yang (2018). Video summarization using fully convolutional sequence networks. In Proceedings of the European conference on computer vision (pp. 347–363).
Song, Yale, Vallmitjana, Jordi, Stent, Amanda, & Jaimes, Alejandro (2015). Tvsum: Summarizing web videos using titles. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5179–5187).
Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, et al. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N., et al. (2017). Attention is all you need. Vol. 30, In Proceedings of the advances in neural information processing systems.
Wang, Shuai, Cong, Yang, Cao, Jun, Yang, Yunsheng, Tang, Yandong, Zhao, Huaici, et al. (2016). Scalable gastroscopic video summarization via similar-inhibition dictionary selection. Artificial Intelligence in Medicine, 66, 1–13.
Wang, Zhikang, He, Lihuo, Tu, Xiaoguang, Zhao, Jian, Gao, Xinbo, Shen, Shengmei, et al. (2021). Robust video-based person re-identification by hierarchical mining. IEEE Transactions on Circuits and Systems for Video Technology, 1.
Wei, Huawei, Ni, Bingbing, Yan, Yichao, Yu, Huanyu, Yang, Xiaokang, & Yao, Chen (2018). Video summarization via semantic attended networks. Vol. 32, In Proceedings of the AAAI conference on artificial intelligence.
Xiao, Shuwen, Zhao, Zhou, Zhang, Zijian, Guan, Ziyu, & Cai, Deng (2020). Query-biased self-attentive network for query-focused video summarization. IEEE Transactions on Image Processing, 29, 5889–5899.
Xie, Jiehang, Chen, Xuanbai, Zhang, Tianyi, Zhang, Yixuan, Lu, Shao-Ping, Cesar, Pablo, et al. (2022). Multimodal-based and aesthetic-guided narrative video summarization. IEEE Transactions on Multimedia, 1–15.
Xu, Minghao, Wang, Hang, Ni, Bingbing, Zhu, Riheng, Sun, Zhenbang, & Wang, Changhu (2021). Cross-category video highlight detection via set-based learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7970–7979).
Yang, Antoine, Miech, Antoine, Sivic, Josef, Laptev, Ivan, & Schmid, Cordelia (2022). Tubedetr: Spatio-temporal video grounding with transformers. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 16442–16453).
Yeh, Ching-Feng, Wang, Yongqiang, Shi, Yangyang, Wu, Chunyang, Zhang, Frank, Chan, Julian, et al. (2021). Streaming attention-based models with augmented memory for end-to-end speech recognition. In Proceedings of the IEEE spoken language technology workshop (pp. 8–14).
Yuan, Yitian, Mei, Tao, Cui, Peng, & Zhu, Wenwu (2019). Video summarization by learning deep side semantic embedding. IEEE Transactions on Circuits and Systems for Video Technology, 29(1), 226–237.
Yuan, Li, Tay, Francis Eng Hock, Li, Ping, & Feng, Jiashi (2020). Unsupervised video summarization with cycle-consistent adversarial LSTM networks. IEEE Transactions on Multimedia, 22(10), 2711–2722.
Zhang, Ke, Chao, Wei-Lun, Sha, Fei, & Grauman, Kristen (2016a). Summary transfer: Exemplar-based subset selection for video summarization. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1059–1067).
Zhang, Ke, Chao, Wei-Lun, Sha, Fei, & Grauman, Kristen (2016b). Video summarization with long short-term memory. In Proceedings of the European conference on computer vision (pp. 766–782).
Zhang, Yunzuo, Guo, Wei, Wu, Cunyu, Li, Wei, & Tao, Ran (2023). FANet: An arbitrary direction remote sensing object detection network based on feature fusion and angle classification. IEEE Transactions on Geoscience and Remote Sensing.
Zhang, Yunzuo, Kang, Weili, Liu, Yameng, & Zhu, Pengfei (2023). Joint multi-level feature network for lightweight person re-identification. In ICASSP 2023-2023 IEEE international conference on acoustics, speech and signal processing (pp. 1–5).
Zhang, Yunzuo, Song, Zhouchen, & Li, Wenbo (2023). Enhancement multi-module network for few-shot leaky cable fixture detection in railway tunnel. Signal Processing: Image Communication, 113, Article 116943.
Zhang, Yunzuo, Tao, Ran, & Wang, Yue (2017). Motion-state-adaptive video summarization via spatiotemporal analysis. IEEE Transactions on Circuits and Systems for Video Technology, 27(6), 1340–1352.
Zhang, Yunzuo, Zhang, Tian, Wu, Cunyu, & Tao, Ran (2023). Multi-scale spatiotemporal feature fusion network for video saliency prediction. IEEE Transactions on Multimedia.
Zhang, Shu, Zhu, Yingying, & Roy-Chowdhury, Amit K. (2016). Context-aware surveillance video summarization. IEEE Transactions on Image Processing, 25(11), 5469–5478.
Zhao, Bin, Gong, Maoguo, & Li, Xuelong (2022). Hierarchical multimodal transformer to summarize videos. Neurocomputing, 468, 360–369.
Zhao, Bin, Li, Xuelong, & Lu, Xiaoqiang (2017). Hierarchical recurrent neural network for video summarization. In Proceedings of the ACM international conference on multimedia (pp. 863–871).
Zhao, Bin, Li, Xuelong, & Lu, Xiaoqiang (2018). Hsa-rnn: Hierarchical structure-adaptive rnn for video summarization. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7405–7414).
Zhao, Bin, Li, Haopeng, Lu, Xiaoqiang, & Li, Xuelong (2022). Reconstructive sequence-graph network for video summarization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5), 2793–2801.
Zhao, Jiaqi, Wang, Hanzheng, Zhou, Yong, Yao, Rui, Chen, Silin, & El Saddik, Abdulmotaleb (2022). Spatial-channel enhanced transformer for visible-infrared person re-identification. IEEE Transactions on Multimedia, 1.
Zhong, Rui, Wang, Rui, Zou, Yang, Hong, Zhiqiang, & Hu, Min (2021). Graph attention networks adjusted Bi-LSTM for video summarization. IEEE Signal Processing Letters, 28, 663–667.
Zhou, Kaiyang, Qiao, Yu, & Xiang, Tao (2018). Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. Vol. 32, In Proceedings of the AAAI conference on artificial intelligence.
Zhu, Wencheng, Lu, Jiwen, Han, Yucheng, & Zhou, Jie (2022). Learning multiscale hierarchical attention for video summarization. Pattern Recognition, 122, Article 108312.
Zhu, Wencheng, Lu, Jiwen, Li, Jiahao, & Zhou, Jie (2021). DSNet: A flexible detect-to-summarize network for video summarization. IEEE Transactions on Image Processing, 30, 948–962.
Zhuang, Yueting, Rui, Yong, Huang, T. S., & Mehrotra, S. (1998). Adaptive key frame extraction using unsupervised clustering. Vol. 1, In Proceedings of the international conference on image processing (pp. 866–870).
