Deep Multi-Scale Pyramidal Features Network For Supervised Video Summarization

Keywords: Video summarization; Supervised learning; Keyframes; Feature fusion; Keyshots; Feature refinement

Abstract: Video data are witnessing exponential growth, and extracting summarized information is challenging. It is always necessary to reduce the load of video traffic for the efficient video storage, transmission, and retrieval requirements. The aim of video summarization (VS) is to extract the most important contents from video repositories effectively. Recent attempts have used fewer representative features, which have been fed to recurrent networks to achieve VS. However, generating the desired summaries can become challenging due to the limited representativeness of extracted features and a lack of consideration for feature refinement. In this article, we introduce a vision transformer (ViT)-assisted deep pyramidal refinement network that can extract and refine multi-scale features and can predict an importance score for each frame. The proposed network comprises four main modules; initially, a dense prediction transformer with a ViT backbone is applied for the first time in this domain to acquire the optimal representations from the input frames. Then, feature maps are obtained from various layers separately and processed individually to support multi-scale progressive feature fusion and refinement before the data are passed to the ultimate prediction module. Next, a customized pyramidal refinement block is employed to refine the multi-level feature set before predicting the importance scores. Finally, video summaries are produced by selecting keyframes based on the predictions. To explore the performance of the proposed network, extensive experiments are conducted on the TVSum and SumMe datasets, and our network is found to achieve F1-scores of 62.4% and 51.9%, respectively, outperforming state-of-the-art alternatives by 0.9% and 0.5%.
* Corresponding author.
E-mail address: [email protected] (S.W. Baik).
https://doi.org/10.1016/j.eswa.2023.121288
Received 21 November 2022; Received in revised form 11 August 2023; Accepted 21 August 2023
Available online 29 August 2023
0957-4174/© 2023 Elsevier Ltd. All rights reserved.
information and to generate a summary that provides concise, informative, and essential information on the content of the original video. The idea of VS arose in the early 2000s (A. D. Doulamis et al., 2000; N. D. Doulamis et al., 2000), and there have been many scientific achievements in this field. Early research works in the VS domain (Ma et al., 2002) focused on developing the simplest methods, such as video skimming and weak visual saliency-based methods. Other notable achievements include the extraction of representative frames from a video (Panagiotakis et al., 2009) and clustering-based methods in which similar frames are combined to generate a summary. Gesture-based VS (Kosmopoulos et al., 2005) is another approach that identifies hand and head gestures through skin colour segmentation and utilizes Zernike moments for representation; it leverages gesture energy to pinpoint keyframes, with proven efficacy on sign language videos. Similarly, category-specific VS (Potapov et al., 2014) involves generating tailored summaries. VS techniques developed after 2010 can be mainly classified into supervised and unsupervised methods, depending on whether annotations are used. The outcomes of VS learned by unsupervised methods are often less accurate, and it is not always clear whether the generated summaries are beneficial, as there are no labels to measure and cross-check their usefulness. Supervised video summarization (SVS) techniques learn directly from human labels assigned to each frame in videos to preserve the keyframe inclusion criteria. Significant development has arisen after the introduction of the benchmark datasets SumMe and TVSum, making it easy to evaluate summaries against ground truth (GT). The categorization of the SumMe and TVSum datasets is presented in Table 1. These datasets have enabled researchers to compare and improve VS methods under human supervision. In a nutshell, VS has made significant progress over the past two decades and continues to be an active area of research with many practical applications.

Table 1
Primary statistics and characterisation of the datasets.

Dataset                    | No. of Videos | Mean Length | Type of Annotation
TVSum (Song et al., 2015)  | 50            | 4 min 18 s  | Frame-level importance scores
SumMe (Gygli et al., 2014) | 25            | 2 min 40 s  | Frame-level importance scores

In recent years, supervised approaches have made significant progress in the field of VS (Apostolidis et al., 2021; Zhu et al., 2022). These techniques aim to achieve a substantial similarity between the summaries produced by models and GT guidance (Yao et al., 2016). Based on the number of views to be summarized, VS techniques can be further classified into two types: single-view VS (Muhammad, Hussain, Rodrigues, et al., 2020) and multi-view VS (Hussain et al., 2021). The purpose of single-view VS is to summarize videos taken by a single-view camera, whereas, in multi-view VS, the aim is to generate a summary of videos from several cameras, each with a different point of view. Our research focuses on SVS, which has recently gained attention from researchers in the VS domain.

Several works on VS have been presented in the past few years. A recent survey (Apostolidis et al., 2021) reported that early research used recurrent neural networks (RNNs) to model temporal information and analyze the context dependencies between video frames (e.g., Lebron Casas and Koblents, 2018; Zhang et al., 2016b; B. Zhao et al., 2017; Zhao et al., 2018). Long short-term memory (LSTM) mechanisms offer benefits, as they can capture the long-term hierarchical dependencies between frames. However, LSTM-based models for VS are constrained by certain intrinsic limitations. Computations are generally performed from left to right, meaning that the processing of subsequent frames in a sequence is contingent upon the processing of the previous frame. Despite the existence of bi-directional LSTM (BiLSTM), computations in both directions are still affected by the same issue as in LSTM (Zhong et al., 2021). The sequential nature of these approaches gives rise to a disruption problem within video sequences, especially in VS. Semantic discontinuity in video sequences is problematic for LSTM-based models and also prevents the efficient parallelization of processing in the LSTM network, which could otherwise exploit the GPU more effectively. Due to these challenges, RNNs are incapable of resolving VS issues. Furthermore, the superior results from attention-based strategies for machine translation tasks have highlighted their importance in various domains of computer vision (CV), such as detecting salient content in an image (Hussain et al., 2022). Hence, some attention-based research works (Apostolidis et al., 2021; Liang et al., 2022; Liu et al., 2019; Zhu et al., 2022) have used entirely different methods to overcome the issues of RNN-based models.

These existing networks focus on using fewer representative features, which are inefficient in achieving better performance. Furthermore, existing attention-based SVS techniques neglect to gather global features and avoid deep focus on the visually critical contents of the video frames, which is essential when generating a summary. They rely on single feature maps, which are unable to capture both low-level visual features and high-level semantic concepts. Moreover, the lack of attention to complex contextual information at multiple scales restricts the performance of these mechanisms. This results in a more significant variation in attention levels, which has a detrimental effect on the attention ratings and the relative significance of frames. To bridge these research gaps, we developed a novel network called the deep multi-scale pyramidal features network (MPFN) for SVS. Our proposed network performs well and makes the following significant contributions in the domain:

• Our attempt is the first to incorporate a ViT backbone for multi-scale feature extraction in the domain of VS. Optimal, global, and deep domain-specific representations are extracted in a unique way to offer deeper information about the visual contents of videos. We avoid following the trend towards the use of conventional CNN-based backbone models and propose a novel, distinctive way of extracting the most representative features. A ViT uses the transformer's self-attention mechanism to gather global features, thereby achieving outstanding performance. The acquisition of features via a dense prediction transformer (DPT) with a ViT backbone allows us to leverage the strengths of both architectures to extract rich features from input frames and to produce accurate, detailed predictions.
• Rather than relying on a typical single feature map, we use multi-scale feature maps to capture low-level visual features and high-level semantic concepts. This method allows the model to focus on crucial aspects of the video that remain undiscovered by a conventional process for acquiring features at a single scale. This approach has a substantial positive impact on network performance, as empirically validated in a later results section.
• We use a pyramidal refinement block (PRB) in our network, consisting of several multi-headed self-attention (MHSA) and dense atrous spatial pyramid pooling (DASPP) modules, which are applied to each level of the feature maps, excluding the initial layer features. The MHSA module processes the feature representations by computing multiple self-attention mechanisms in parallel, resulting in multiple attention vectors. The DASPP module considers features at multiple scales to capture local and global context information. The resulting PRB significantly improves the quality of the features by capturing complex contextual information at multiple scales and levels, which are then forwarded to the convolution head for further processing.
• As the features mature through various layers, they are integrated with the initial-level features at different stages. The progressive feature fusion combines multi-level refined features to boost the representations for additional effectiveness before the predictions.
• Before finalizing the architecture of the proposed model, comprehensive ablation studies are conducted from a technical perspective. Extensive experiments over two benchmarks (TVSum and SumMe) verify the strength of the employed network and show that it achieves good performance compared to SOTA approaches.
The remaining sections of this article are organized as follows. Related research is briefly summarized in Section 2, while Section 3 provides the technical details of the proposed network. A comprehensive discussion of the experiments is presented in Section 4. Finally, some concluding remarks and suggestions for future research are given in Section 5.

2. Related literature

Research on VS has become much more widespread and advanced over the last two decades. The quantity of video data in the digital world is increasing, thus highlighting the value of and need for VS and attracting researchers to this domain. Consequently, several techniques for automating VS have been presented, and traditional approaches have now been replaced by techniques based on deep network architectures. In this section, we review methods that are suitable for VS, which can be divided into two categories: (i) unsupervised and (ii) supervised approaches, as described below.

2.1. Unsupervised video summarization

Traditional unsupervised VS techniques typically produce summaries based on set criteria like representativeness and the discovery of key moments. Most unsupervised VS techniques rely on clustering (Basavarajaiah & Sharma, 2021; Chu et al., 2015; De Avila et al., 2011), dictionary learning (Elhamifar et al., 2015), adversarial or reinforcement learning, and memorability (Muhammad, Hussain, & Baik, 2020), with coarse and fine feature refining strategies (Muhammad, Hussain, Del Ser, et al., 2019). Clustering techniques apply clustering algorithms to partition data into groups to enable keyframes to be identified. For instance, De Avila et al. (2011) exploited domain knowledge to choose important shots or frames. Chu et al. (2015) explored visually conveyed information to find video frames containing a similar subject. Using a clustering approach, Basavarajaiah and Sharma (2021) employed a CNN-based pre-trained network to retrieve visual representations from the video frames. Following the trend, k-means was deployed on these features, and a sequential keyframe-generating approach was used for better results.

Clustering techniques cannot adapt well to changes in the complexity of a shot. The second subgroup of these approaches therefore uses dictionary learning. In these methods, the frames are interpreted as components of dictionaries, and researchers have explored the minimum subset of dictionaries needed to produce video summaries (Basavarajaiah & Sharma, 2021). For example, in the work by Basavarajaiah and Sharma (2021), videos were structured by arranging them as a linear sequence of keyframes. Li et al. (2017) summarized both raw and edited versions of videos using a set of summary criteria that included storyness, significance, diversity, and representativeness. Mei et al. (2020) tackled the challenges of VS by using several patches taken from individual frames. Zhang et al. (2020) explored a unique motion method that imitated online dictionary learning in remembering the previous movements of an object by constantly updating a customized auto-encoder. Unfortunately, this approach cannot consistently achieve a reliable division, because the fundamental principle of block sparsity is that the frames within a given block share similarities. As a result, the performance is usually low and could be further improved.

The third subgroup of approaches uses deep adversarial or reinforcement learning to summarize videos. For instance, Mahasseni et al. (2017) used an unsupervised generative adversarial framework using LSTM networks. Zhou et al. (2018) developed a deep summarization network to retrieve essential frames; their network assigned a probability to each frame to determine which frames should be included in a video summary. A paper by Rochan et al. (2018) described the use of a fully convolutional framework to generate summaries and a discriminator to distinguish these generated summaries from original human-labelled summaries. Lei et al. (2018) developed a reinforcement network to address the issue of the time needed for GT labelling. Yuan et al. (2019) considered a consistent cycle LSTM adversarial architecture that could be leveraged to significantly minimize the information lost in a video summary. Zhao et al. (2019) suggested a dual-learning strategy in which the summary generator was rewarded by combining a summary with video reconstruction. Zhong et al. (2021) introduced a graph-attention-network-adapted Bi-LSTM to improve the frame-wise probability of a keyframe being chosen. Since the input videos were quite long, the unrolling in the decoder was likewise very long, making it difficult to train the network, and performance degraded for lengthy videos. The fourth subgroup uses memorability techniques to select keyframes with maximum memorability and entropy scores for each shot (Fei et al., 2017; Muhammad, Hussain, Tanveer, et al., 2019; Zhong et al., 2021). In another research work, Muhammad, Hussain, Del Ser, et al. (2019) introduced an unsupervised technique with coarse and fine refining strategies for deep features for use in resource-constrained industrial surveillance systems. Unlike the approaches described above, our technique is based on human-labelled data, meaning that the summaries are more in line with human-generated GT results.

2.2. Supervised video summarization

Traditional machine learning methods and deep learning (DL) techniques are employed in SVS strategies. Non-DL practices mainly focused on handcrafted features or training models for video summaries. Examples of such methods include first-person VS based on various graph representations (Sahu & Chowdhury, 2021), category-specific VS (Li et al., 2019), and the generation of summaries from user videos (Gygli et al., 2015). Without temporal information, these approaches cannot derive long-range relationships. In the second subgroup, RNNs have been implemented for VS due to their better sequence modelling capability, and RNN-based models have obtained promising results. For instance, B. Zhao et al. (2017) used a Bi-LSTM to record temporal relationships in both the forward and reverse directions for VS. Zhao et al. (2020) employed structure-adaptive sliding bidirectional windows to integrate shot detection into their framework. Li et al. (2019) used a meta-learning method to achieve task-driven VS. However, RNN-based models have been shown to have issues with gradient vanishing and explosion, and several alternative deep models have avoided using RNNs. For instance, to capture the dependencies between video frames, Rochan et al. (2018) presented layered temporal convolutional, pooling, and deconvolutional-based frameworks, and their results were comparatively reasonable. Zhao et al. (2021) explored a sequence-graph network to learn both intra- and inter-shot relationships. The attention mechanism has recently been incorporated into VS (Fajtl et al., 2018; Ji et al., 2019; B. Zhao et al., 2017; Puthige et al., 2023) and has shown comparatively impressive results. In this regard, Liu et al. (2019) suggested the use of multi-head attention to highlight the keyframes. Most recently, to generate video summaries, Zhu et al. (2022) proposed an attention-based multi-scale hierarchical (LMHA) method that could learn short and long-range temporal representations by paying attention both inside and between blocks.
However, most approaches have focused on conventional CNN-based features, as highlighted in Table 3, such as GoogleNet, which results in comparatively non-representative features. What sets our method apart from existing approaches is that we are able to capture highly representative features after analysing an extensive range of backbone networks, and we present a novel network for refining these features. We integrate the PRB with more attention in our network, which simultaneously collects features and learns frame dependencies by modelling based on human annotations. In this way, the proposed network addresses the drawbacks of existing attention-based techniques and gives improved summarization performance.

3. The proposed network

This section presents the theoretical and technical details of the proposed MPFN. It consists of eight subsections: the feasibility and concept of supervised learning in VS is discussed in Section 3.1, and the problem formulation is given in Section 3.2. An overview of the proposed network is briefly presented in Section 3.3, followed by a description of the backbone networks used for feature extraction in Section 3.4. Next, the proposed PRB is discussed in Section 3.5, and the progressive encoder feature fusion in Section 3.6. The prediction of frame-level importance scores by the keyframes ConvNet block is explained in detail in Section 3.7. Finally, the conversion of the importance scores of each frame to keyshots is described in Section 3.8.

3.1. Concept of supervised video summarization

The supervised approach is particularly useful for VS, allowing a model to learn from human-labelled scores corresponding to video frames. Supervised techniques make the model more accurate, as the network can learn patterns from the human-labelled frames to ensure that the keyframe inclusion criteria for the summary video are met effectively. In contrast, the outcomes of VS learned by unsupervised methods are less accurate (Jung et al., 2019; Potapov et al., 2014). These approaches lack supervision in terms of human-labelled importance scores when interpreting patterns, and the frame scores are instead determined based on the visual features of the frames, such as their entropy and complexity. Furthermore, unsupervised methods cannot use the GT (importance scores) to evaluate the quality of the generated summary. This can make it difficult to determine whether the summary generated by an unsupervised approach is accurate or not. The SVS datasets are labelled by 15 to 20 human annotators who give importance scores based on several criteria, including representativeness, diversity, and context (Song et al., 2015). The probability that a given frame is selected in the video summary depends on the score predicted by the model.

Supervised summarization offers distinct advantages over unsupervised approaches for specific applications. SVS is suitable for domains where precise, human-interpretable summaries are essential for decision-making and understanding complex video data. It allows for more precise control over the final output. Learning from labeled examples can be tailored to capture specific elements and shots within a video, such as key events, characters, or points of interest. This makes it ideal for applications where consistency and precision are essential, such as news broadcast summarization, sports event highlights, and summarization of educational or instructional videos. This approach can offer significant value for healthcare, as it can aid in efficiently transferring human knowledge by identifying critical moments in surgical procedures and medical imaging sequences. In the field of education, concise summaries can be generated from educational videos based on annotations by domain experts. This approach can provide informative broadcast highlights in news and media, thereby saving viewers time. It can also aid in sports analysis, allow for critical event logging by autonomous vehicles, and enable marketing through effective advertisements. Thus, SVS can be used to generate concise summaries across domains in which the inclusion of keyframes is critical.

On the other hand, SVS approaches are not always suitable, especially in domains with unstructured data like large-scale video collections lacking human-labelled importance scores. Dynamic environments with continuous data streams make supervised training difficult due to the need for constant annotations (Amin et al., 2023). Quantifying subjective moments, such as emotionally significant ones, is complex with human-labelled scores, making unsupervised methods more fitting. SVS methods are domain-specific and might not generalize well to others, while unsupervised approaches are more adaptable. Supervised methods struggle to keep pace in fast-evolving areas like trending events or emerging technologies, whereas unsupervised ones flexibly adapt to changing trends. For tasks like user-generated content summarization or wildlife video analysis, obtaining labeled data can be challenging due to content diversity, privacy concerns, or the need for domain expertise. In such cases, unsupervised methods become more applicable. The choice between these methods depends on the task's specifics. This emphasizes the trade-offs between supervised summarization benefits and the resources needed for manual annotation, underscoring the importance of selecting the right approach based on an application's constraints and needs.

3.2. Problem formulation

In SVS, the aim is to take a sequence of video frames VF = [F1, F2, F3, ⋯, Fn] and the corresponding human-labelled importance scores VS = [S1, S2, S3, ⋯, Sn] and to select the most informative video frames, where n is the total number of frames in the input video. Prior research has mainly considered two types of formats for the output of SVS models: (i) frame importance scores and (ii) binary labels (0, 1). Keyframes (Gong et al., 2014; Rochan et al., 2018; Zhang et al., 2016b; Zhu et al., 2022) or keyshots (Ji et al., 2019; Liang et al., 2022; Song et al., 2015) are widely used to describe binary label outputs. We assume that each frame is passed through the deep feature extraction and refining module and is finally represented as a feature vector.

The frames are then passed to the keyframes ConvNet block (KCB) to obtain a prediction for each frame in the video. The video frames are referred to as F1, F2, F3, ⋯, Fn, and the objective is to assign each of the n frames a binary label (zero or one). The video summary is generated by collecting the keyframes labelled with one, as shown in Fig. 1. Each frame is linked to the GT, which is simplified to a binary label (keyframe/non-keyframe) and defines whether it should form part of the video summary.

Fig. 1. Visual illustration of the problem formulation for SVS: each frame is labelled with a binary number (a score of zero or one) to determine its inclusion or exclusion from the final summary.
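To make the formulation above concrete, the following minimal sketch shows how frame-level importance scores could be binarized into keyframe labels. It is an illustration only, not the authors' code: the function name, the top-k selection rule, and the 15% summary ratio (borrowed from the keyshot budget discussed later) are assumptions.

```python
import numpy as np

def scores_to_keyframe_labels(importance_scores, summary_ratio=0.15):
    """Convert frame-level importance scores S_1..S_n into binary
    keyframe labels (1 = keyframe, 0 = non-keyframe), keeping roughly
    the top `summary_ratio` fraction of frames, as in Fig. 1."""
    scores = np.asarray(importance_scores, dtype=np.float32)
    n = len(scores)
    num_keyframes = max(1, int(round(summary_ratio * n)))
    # Indices of the highest-scoring frames form the summary.
    top_idx = np.argsort(scores)[::-1][:num_keyframes]
    labels = np.zeros(n, dtype=np.int64)
    labels[top_idx] = 1
    return labels

# Example: ten frames with human-labelled importance scores.
scores = [0.1, 0.8, 0.3, 0.9, 0.2, 0.4, 0.7, 0.1, 0.6, 0.2]
print(scores_to_keyframe_labels(scores))  # [0 1 0 1 0 0 0 0 0 0]
```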
A visual overview of the proposed network is given in Fig. 2, and it can be seen that it is composed of five key modules: a deep multi-scale feature extraction block, feature refining via the PRB, progressive encoder feature fusion, a KCB, and conversion from frame scores to a summary of keyshots. In the deep multi-scale features block, the input frames of the video are fed to a dense transformer network to extract multi-scale features. Four multi-scale feature maps, Ƒm1, Ƒm2, Ƒm3, and Ƒm4, are acquired from the backbone ViT, and the three deeper, more mature feature maps, Ƒm2, Ƒm3 and Ƒm4, are passed to the PRB for further refinement. The base DPT-assisted ViT generates raw feature maps that are refined by the PRB containing MHSAs and DASPPs. Furthermore, feature maps are fused from lower to deeper mature layers at different stages in the architecture. The use of a progressive encoder for feature fusion boosts the features before the prediction. The KCB takes the resultant 1,024-dimensional feature vector obtained from the efficient features Ƒα and Ƒβ as input and predicts the frame importance scores. The kernel temporal segmentation (KTS) technique (Potapov et al., 2014) was used by Liang et al. (2022) and Zhu et al. (2022) to calculate the importance score of shots by taking the average frame score inside each shot and identifying the shot boundaries. Finally, our proposed network generates a summary while ensuring that the overall score is maximized and the length of the summary does not exceed 15% of the total duration of the video.

Fig. 2. Visual overview of the proposed network, consisting of five main components: a deep multi-level features extraction module followed by the PRB, a progressive feature fusion module, a convolutional head for parameter learning, and a KCB for final prediction of the importance scores.

Exploring multi-scale representations is more important than considering straightforward feature maps, since promising features can be obtained from various layers. The fundamental concept of deep multi-level feature extraction involves obtaining abstract information from video frames at many scales. Existing methods in the SVS domain only use pool5 features of, for example, the GoogleNet architecture (Ji et al., 2019; Liang et al., 2022; Mahasseni et al., 2017; Rochan et al., 2018; Wei et al., 2018; Yuan et al., 2017; Zhang et al., 2016b; Zhao et al., 2021; Zhong et al., 2021; Zhou et al., 2018; Zhu et al., 2022). However, recent advancements in CV, and specifically with regard to the ViT, have shown competitive performance on vision tasks that involve learning a mapping from input images to complex output structures. The ViT is used as the fundamental computing unit of the encoder in an architecture known as the DPT, which is based on an encoder-decoder structure. Research on dense prediction frequently focuses on the decoder and its aggregation strategy (Chen et al., 2017; Yuan et al., 2020; H. Zhao et al., 2017); however, it is commonly agreed that the selection of the backbone model significantly impacts the ultimate predictions.
Unlike convolution-based models (Munsif et al., 2022), the ViT explicitly avoids applying downsampling processes once the initial image embedding has been obtained and keeps the dimensions of the representation constant in all processing steps. Our proposed architecture for feature extraction contains a DPT model with a ViT backbone, which is responsible for encoding the input frames into a sequence of features using a self-attention mechanism. This sequence of features is then fed to the DPT architecture, which processes them and outputs dense predictions. The DPT architecture consists of a series of transformer layers, in a similar way to the ViT architecture, but with modifications to enable it to handle dense prediction tasks. Notably, this strategy is commonly utilized for dense prediction tasks; we apply it in the VS domain due to its better image representation abilities and its focus on the details of important objects.

Following the findings of Ranftl et al. (2021), we extract deep pixel features from the input RGB frame, represented as F(h×w), and use them as tokens K = {K0, ⋯, KNp}, where Np = (h × w)/p², h is the height, w is the width, and p is the patch size in the network. Each patch is a D-dimensional vector with a trainable linear projection. The encoder network produces Ƒm1, Ƒm2, Ƒm3, Ƒm4 = E(K), where E refers to the proposed encoder. These feature maps are acquired from the stages TB1, TB2, TB3, and TB4 of the backbone ViT. To enable the maximum flow of information among the ViT blocks, dense connections are embedded, as described below. We consider two kinds of features, the first of which is the initial edge-information-level features (Ƒm1) presented in Eq. (1), including the shapes, colours, edges, and structures of the objects in the frames:

Ƒm1 = E(TB1(Ƒith RGB))   (1)

The second type of features, from lower to upper (Ƒm2, Ƒm3 and Ƒm4), are considered more mature, as shown in Eqs. (2)–(4):

Ƒm2 = E(TB2(Ƒm1))   (2)

Ƒm3 = E(TB3(Ƒm1 ⊕ Ƒm2))   (3)

Ƒm4 = E(TB4(Ƒm1 ⊕ Ƒm2 ⊕ Ƒm3))   (4)

It is worth noting that although Ƒm2, Ƒm3 and Ƒm4 are all mature global features, Ƒm4 are the most mature global features, possessing all the representation information of the three blocks, as shown in Eq. (4). All of the features are combined in Ƒm4, which can therefore be referred to as global features. Four feature maps are finally acquired: the initial ViT features with specific information, and the remaining three ViT blocks with the mature layers in our deep multi-level features. To further enhance and refine the acquired features, and to obtain better results, the PRB is applied. The positive impact of this step is shown empirically in the experimental section.
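The dense connections of Eqs. (1)–(4) can be sketched as follows. This is a rough illustrative approximation under stated assumptions: simple linear layers stand in for the transformer stages TB1–TB4, the dimensions are invented, and the real DPT/ViT backbone of Ranftl et al. (2021) is considerably more involved.

```python
import torch
import torch.nn as nn

class MultiScaleEncoder(nn.Module):
    """Toy stand-in for the encoder E with four stages TB1..TB4.
    Each deeper stage receives the concatenation (⊕) of all earlier
    feature maps, mirroring the dense connections of Eqs. (1)-(4)."""
    def __init__(self, dim=256):
        super().__init__()
        self.tb1 = nn.Linear(dim, dim)          # stage TB1: initial, edge-level features
        self.tb2 = nn.Linear(dim, dim)          # stage TB2
        self.tb3 = nn.Linear(2 * dim, dim)      # stage TB3: takes Fm1 ⊕ Fm2
        self.tb4 = nn.Linear(3 * dim, dim)      # stage TB4: takes Fm1 ⊕ Fm2 ⊕ Fm3

    def forward(self, tokens):                  # tokens: (num_patches, dim)
        fm1 = self.tb1(tokens)                                  # Eq. (1)
        fm2 = self.tb2(fm1)                                     # Eq. (2)
        fm3 = self.tb3(torch.cat([fm1, fm2], dim=-1))           # Eq. (3)
        fm4 = self.tb4(torch.cat([fm1, fm2, fm3], dim=-1))      # Eq. (4)
        return fm1, fm2, fm3, fm4

# A 224x224 frame with patch size 16 yields Np = (224 * 224) / 16**2 = 196 tokens.
tokens = torch.randn(196, 256)
fm1, fm2, fm3, fm4 = MultiScaleEncoder()(tokens)
```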
3.5. Feature refinement with the pyramidal refinement block

Feature refinement can enhance the intermediate representations by making the frame features more representative and informative. Pyramidal refinement is the most suitable option for feature refinement, as it processes frames at progressively finer scales to extract scale- and translation-invariant features. Finally, the features collected at various levels of detail are merged to form the output. We include multiple MHSA and DASPP modules in our proposed PRB. MHSA applies attention followed by DASPPs, making it possible to successfully integrate information from many scales, which enhances the performance of the pyramidal refinement process. The proposed PRB contains three main stages, denoted as Pm2, Pm3 and Pm4. Although four stages were initially deployed, we reduced this to three after an empirical analysis. The initial features described in Eq. (1) are specific features (Ranftl et al., 2021), which we decided to exclude from the final proposed block and use in the later fusion process, as shown in Fig. 2. These initially acquired features are used to assist and boost the global features at the specific points discussed in Section 3.6. The proposed PRB is employed to upgrade the mature features Ƒm2, Ƒm3 and Ƒm4 by focusing attention, and the results are validated. The PRB applies multi-scale DASPP layers and MHSA at the end of each stage. The features Ƒm2 acquired in the second stage are used in the pyramid refinement process, which includes one MHSA module and three DASPPs. Two DASPPs and one MHSA are applied to Ƒm3, as these features are comparatively mature. The most mature features, Ƒm4, are already relatively enhanced compared to Ƒm3, and a single DASPP with one MHSA is therefore used for further refinement. The overall structure of the block is shown in Eq. (5), where the pyramid representations Pm2, Pm3 and Pm4 are given:

[Pm2]   [DASPP  DASPP  DASPP  MHSA]
[Pm3] = [DASPP  DASPP  MHSA      ]   (5)
[Pm4]   [DASPP  MHSA             ]

Our ultimate objective with the proposed encoder network is to avoid losing any informative features, as they cannot be recovered during decoding. The structured attention module in the proposed PRB therefore boosts the features for the VS problem by applying DASPP (Yang et al., 2018) to the receptive fields captured from the overall frame. Compared to the ASPP module (from DeepLabv3), DASPP is more advanced, since it uses 3 × 3 convolutions followed by 3 × 3 dilated convolutions to enhance the features and fuses the input with the output of the DASPP module through short residual connections. A multi-head attention mechanism is employed to obtain more critical information from the paired feature representations, as discussed.
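A rough sketch of the PRB stages in Eq. (5) is given below. It is a simplified illustration under assumptions: the exact DASPP internals, head count, and channel sizes are not taken from the paper, and standard PyTorch layers stand in for the authors' customized modules.

```python
import torch
import torch.nn as nn

class SimpleDASPP(nn.Module):
    """Very reduced stand-in for dense atrous spatial pyramid pooling:
    a 3x3 conv followed by a 3x3 dilated conv, with a short residual
    connection fusing input and output (as described for DASPP above)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.dilated = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)

    def forward(self, x):
        return x + self.dilated(torch.relu(self.conv(x)))   # short residual fusion

class PRBStage(nn.Module):
    """One pyramid stage: several DASPP modules followed by MHSA,
    e.g. three DASPPs for Fm2, two for Fm3, one for Fm4 (Eq. (5))."""
    def __init__(self, channels, num_daspp, num_heads=8):
        super().__init__()
        self.daspps = nn.ModuleList(SimpleDASPP(channels) for _ in range(num_daspp))
        self.mhsa = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):                        # x: (B, C, H, W)
        for daspp in self.daspps:
            x = daspp(x)
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)       # (B, H*W, C) token sequence
        attn, _ = self.mhsa(seq, seq, seq)
        return attn.transpose(1, 2).reshape(b, c, h, w)

# Pm2, Pm3, Pm4 with decreasing numbers of DASPP modules.
prb = nn.ModuleList([PRBStage(256, n) for n in (3, 2, 1)])
fm2 = torch.randn(1, 256, 14, 14)
pm2 = prb[0](fm2)
```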
3.6. Progressive feature fusion

Feature fusion plays an influential role in the excellent performance of SOTA networks. Identifying relevant features and enhancing them through refinement processes offers the advantage of generating prominent features that can increase recognition performance. The aim of fusing the initial and mature refined features is to give more assistance to the features at each level before the final prediction. The results show that the fusion of different-level features can improve the features in terms of the final recognition of keyframes. From Eq. (5), it can be seen that Pm4 are global features, which contain all stages, and are comparatively refined. However, the problem with global features is that less attention is paid to specific features, which are also important. To overcome this, the fourth-stage pyramid-refined features Pm4 are fused with Pm3, as visually illustrated in Fig. 2 and shown mathematically in Eq. (6):

Pm43 = σ(Pm4 ⊕ Pm3)   (6)

The resultant feature maps Pm43 are fused with Pm2, as shown in Eq. (7):

Pm432 = σ(Ω(Pm43 ⊕ Pm2))   (7)

To give more importance to the features resulting from Eq. (7), the initial edge features Ƒm1 are fused with the pyramid-refined features to create the efficient features Pm432Ƒm1, as shown in Eq. (8):

Pm432Ƒm1 = σ(Ω(Pm432 ⊕ Ƒm1))   (8)

These features are fed into the proposed convolution head CH, which includes two 1D convolutions and two max-pooling layers. The representations are then flattened to generate the features Ƒβ, as shown in Eq. (9):

Pm432Ƒm1 → CH + FL1 = Ƒβ   (9)

To strengthen the features further, the mature and global features Ƒm4 are flattened:

Ƒm4 → FL2 = Ƒα   (10)

The features Ƒβ are fused with Ƒα through the operation in Eq. (11):

Ƒαβ = Ω(h×w)(Bc(Ƒα ⊕ Ƒβ))   (11)

where σ is the residual channel attention, Ω refers to the upsampling operation, Bc is the balancing convolution, and FL denotes the flattened layers. The features Ƒαβ are fed to a fully connected block which contains a ReLU activation function, followed by a fully connected layer. The final output consists of the resulting features for the individual frame. The final features have a size of 1024 dimensions and are then fed to the KCB.
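The progressive fusion of Eqs. (6)–(9) can be sketched as below. This is an illustrative sketch only: the fusion operator ⊕ is treated as element-wise addition, the channel attention σ, upsampling Ω, balancing convolution Bc, and convolution head CH are reduced to simple placeholder layers, and all sizes are assumptions rather than the authors' configuration.

```python
import torch
import torch.nn as nn

class ProgressiveFusion(nn.Module):
    """Illustrative sketch of Eqs. (6)-(9): progressively fuse the refined
    pyramid features Pm4 -> Pm3 -> Pm2 and finally the initial features Fm1,
    then produce a 1024-D per-frame feature via a small 1D convolution head."""
    def __init__(self, channels=64, out_dim=1024):
        super().__init__()
        self.sigma = nn.Sequential(                       # stand-in for residual channel attention σ
            nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.omega = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)  # Ω
        self.balance = nn.Conv2d(channels, channels, 1)   # placeholder for the balancing convolution Bc
        self.head = nn.Sequential(                        # convolution head CH: 1D convs + max pooling
            nn.Conv1d(channels, channels, 3, padding=1), nn.MaxPool1d(2),
            nn.Conv1d(channels, channels, 3, padding=1), nn.MaxPool1d(2))
        self.fc = nn.LazyLinear(out_dim)                  # flatten + project to 1024-D per frame

    def fuse(self, a, b):
        # Element-wise fusion (assumed for ⊕) followed by the attention gate σ.
        x = a + b
        return x * self.sigma(x)

    def forward(self, fm1, pm2, pm3, pm4):
        pm43 = self.fuse(pm4, pm3)                        # Eq. (6)
        pm432 = self.fuse(self.omega(pm43), pm2)          # Eq. (7), upsample to the Pm2 resolution
        eff = self.fuse(self.omega(pm432), fm1)           # Eq. (8), fuse with the initial Fm1
        eff = self.balance(eff)
        seq = eff.flatten(2)                              # (B, C, H*W) for the 1D head
        f_beta = self.head(seq).flatten(1)                # Eq. (9)
        return self.fc(f_beta)                            # 1024-D frame feature

fusion = ProgressiveFusion()
fm1, pm2 = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 16, 16)
pm3, pm4 = torch.randn(1, 64, 8, 8), torch.randn(1, 64, 8, 8)
frame_feature = fusion(fm1, pm2, pm3, pm4)                # shape: (1, 1024)
```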
3.7. Keyframes ConvNet block

The KCB is employed to predict the importance score. The proposed module applies a one-dimensional convolution operation to simultaneously process all frames and learn temporal patterns. It also includes deconvolution, feature normalization, and convolution layers. Fig. 3 shows the structure of the block. The convolutional layers are arranged to include convolution, batch normalization, ReLU activation, and dropout. Lastly, two deconvolutions are applied to carry out reconstruction and to obtain the prediction score, assisted by softmax. The convolution transpose helps obtain a prediction score that aligns with the input data's dimensions. The input has dimensions of 1 × VF × D, where VF is the number of frames in the video and D represents the dimensions of the resulting frame feature. The output dimensions will be 1 × VF × C, since we require scores for each frame to indicate whether it is a keyframe or a non-keyframe.

Fig. 3. Architecture of the keyframes ConvNet block, which takes refined features as input and gives predicted frame-level importance scores as output.
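A minimal sketch of a KCB-style prediction head is shown below. The layer widths, kernel sizes, and number of layers are assumptions for illustration and do not reproduce the exact KCB configuration in Fig. 3.

```python
import torch
import torch.nn as nn

class KeyframesConvNetBlock(nn.Module):
    """Illustrative KCB-style head: 1D convolutions over the frame axis
    (with batch norm, ReLU and dropout), followed by two transposed
    convolutions that restore the temporal length, and a softmax over
    C = 2 classes (keyframe / non-keyframe) per frame."""
    def __init__(self, feat_dim=1024, hidden=256, num_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(0.5),
            nn.Conv1d(hidden, hidden, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(0.5))
        self.decoder = nn.Sequential(            # two deconvolutions for reconstruction
            nn.ConvTranspose1d(hidden, hidden, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(hidden, num_classes, kernel_size=4, stride=2, padding=1))

    def forward(self, x):                        # x: (1, VF, D) refined frame features
        x = x.transpose(1, 2)                    # -> (1, D, VF) for 1D convolution over frames
        logits = self.decoder(self.encoder(x))
        return torch.softmax(logits.transpose(1, 2), dim=-1)   # (1, VF, C) frame scores

scores = KeyframesConvNetBlock()(torch.randn(1, 320, 1024))
print(scores.shape)                              # torch.Size([1, 320, 2])
```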
Learning: A summary of the original video contains a small collection of frames, meaning that the keyframes and non-keyframes are unbalanced when a keyframe-based supervised setup is used. Consequently, the keyframes are small in number compared to the non-keyframes. The use of a weighted loss for learning is a standard strategy for addressing this issue. The weight for each class is determined as Cw = Medf / fc, where Medf refers to the median of the calculated frequencies, and fc represents the number of frames with label c divided by the total number of frames in the videos that contain label c. This balancing technique has also been used for pixel labelling (Eigen & Fergus, 2015). We assume here that we have a number of video frames T for training. Each video frame is also annotated with a GT label (keyframe or non-keyframe). The loss function Losssum in Eq. (12) is used for learning:

Losssum = −(1/T) Σ_{t=1}^{T} Cwt log( exp(λt, Gt) / Σ_{c=1}^{C} exp(λt, c) )   (12)

where Gt is the annotated frame label, λt refers to the predicted frame scores of the individual frame, c represents the binary label class (zero or one), and Cw is the class weight.
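The median-frequency class weighting and the weighted cross-entropy of Eq. (12) can be computed as in the sketch below. This is an illustration under assumptions (function names and the use of torch.nn.functional are ours, not the authors' code):

```python
import torch
import torch.nn.functional as F

def median_frequency_weights(labels):
    """Class weights Cw = Med_f / f_c from binary keyframe labels (0/1)."""
    labels = labels.long()
    freq = torch.stack([(labels == c).float().mean() for c in (0, 1)])
    return freq.median() / freq.clamp(min=1e-8)

def weighted_summary_loss(frame_logits, labels):
    """Eq. (12): weighted cross-entropy over T frames.
    frame_logits: (T, 2) raw scores λ_t; labels: (T,) ground-truth labels G_t."""
    weights = median_frequency_weights(labels)
    return F.cross_entropy(frame_logits, labels.long(), weight=weights)

# Example: 320 frames with roughly 15% keyframes.
labels = (torch.rand(320) < 0.15).long()
logits = torch.randn(320, 2)
print(weighted_summary_loss(logits, labels))
```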
3.8. Generation of keyshots summary

One of our objectives was to use the scores for the frames to produce a summary containing keyshots. The KTS technique (Potapov et al., 2014) is employed here to extract shots from videos, as suggested by Zhang et al. (2016b). The score for each shot is computed by averaging the scores of its frames, where each shot corresponds to a particular segment. The optimization problem below can then be solved to obtain the keyshot-based summary. By applying the optimization process shown in Eq. (13), a keyshot-based summary is obtained:

max Σ_{i=1}^{m} Oνi ℧i   s.t.   Σ_{i=1}^{m} Oνi δi ⩽ O,   Oνi ∈ {0, 1}   (13)

where ℧i is the significance score for the i-th shot, δi denotes the shot length, O refers to the summarized video length, and Oνi is the optimization indicator, which indicates whether or not a shot is included. In a similar way to previous approaches, the threshold O is fixed at 15% of the total length of the video. Eq. (13) maximizes the summary score while limiting the length under the specified threshold.
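A small sketch of the shot-selection step in Eq. (13) is given below, using a standard 0/1 knapsack dynamic program over shots with a 15% length budget. The function name and the integer shot-length assumption are illustrative; this is not the authors' implementation.

```python
def select_keyshots(shot_scores, shot_lengths, total_length, budget_ratio=0.15):
    """0/1 knapsack over shots: maximize the summed shot scores subject to
    the selected length not exceeding budget_ratio * total_length (Eq. (13)).
    Shot lengths are assumed to be integer frame counts."""
    budget = int(budget_ratio * total_length)
    m = len(shot_scores)
    # dp[c] = best score achievable with capacity c; keep[i][c] records choices.
    dp = [0.0] * (budget + 1)
    keep = [[False] * (budget + 1) for _ in range(m)]
    for i, (score, length) in enumerate(zip(shot_scores, shot_lengths)):
        for c in range(budget, length - 1, -1):
            if dp[c - length] + score > dp[c]:
                dp[c] = dp[c - length] + score
                keep[i][c] = True
    # Backtrack to recover the selection indicators Ov_i.
    selected, c = [0] * m, budget
    for i in range(m - 1, -1, -1):
        if keep[i][c]:
            selected[i] = 1
            c -= shot_lengths[i]
    return selected

# Example: five shots with averaged frame scores and lengths (in frames).
print(select_keyshots([0.9, 0.2, 0.7, 0.4, 0.8], [40, 60, 30, 50, 20], 200))  # [0, 0, 0, 0, 1]
```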
4. Experimental results

This section provides a detailed description of the datasets, the system configuration, and the experimental setup. We present a comparative analysis that was performed to evaluate the proposed network against SOTA approaches for VS. An extensive ablation study is also conducted to explore the strength of the proposed strategies. Finally, qualitative results are given to enable a visual comparison of the results with the GT.

4.1. Datasets

Two standard SVS datasets, TVSum (Song et al., 2015) and SumMe (Gygli et al., 2014), were used to train and evaluate the proposed network. Both datasets cover a wide range of activities, as depicted in the sample examples provided in Fig. 4.

Fig. 4. Sample frames from the (a) SumMe and (b) TVSum datasets. The frames show the wide range of activities in both datasets.

TVSum: This dataset includes 50 videos downloaded from YouTube. They show 10 types of activities, such as animal healthcare, dog shows, a person doing bike tricks, changing of vehicle tyres, making sandwiches, etc. Each category includes five videos with a diversity of data. A total of 20 users manually created GT summaries for the TVSum dataset, annotating them with scores ranging from one to five, where one means that the shot is unimportant, and a score closer to five indicates that the shot is more significant. The durations of the videos in this dataset are typically 1–5 min, and the mean length for all 50 videos is 4 min and 18 s.

SumMe: This database contains 25 videos of various activities, such as cooking, festival events, sports, and holidays, which were also obtained from YouTube. According to the baseline study (Gygli et al., 2014), they analysed different types of videos: egocentric, moving, and static. The mean length of the 25 videos is 2 min 40 s, and each video contains 15–18 user summaries. The duration of the videos varies between 1.5 and 6.5 min. The data are diverse and include various kinds of annotations. The way in which the different GT annotations are handled is covered in the subsequent sections.

4.2. Dataset manual annotations

Manual annotation is critical to SVS, providing the GT or target summaries that guide the model's learning process. These annotations serve as GT labels for the models to learn from, resulting in high-quality labelled datasets tailored for VS. It allows the model to understand which aspects of a video are important and should be included in a summary. The quality and accuracy of SVS largely depend on the quality of manual annotations. However, manual annotation can be labour-intensive, requiring skilled annotators and multiple individuals to ensure consistency. Despite these challenges, manual annotation is essential for advancing this field and provides crucial benchmarks for developing precise algorithms. The main challenge with manual annotation is ensuring the consistency of the annotations. Different annotators may have varying perceptions of what constitutes critical events or points of interest in a video, leading to potential discrepancies in the annotated data. This inconsistency can affect the model's performance and its ability to generalize. These factors underscore the importance of careful planning and resource allocation in any SVS project. Balancing the quality of manual annotations and the cost and time implications involved in the process is crucial. The TVSum and SumMe datasets, which contain videos from a wide range of domains such as news, documentaries, sports, lectures, talks, and presentations, have annotations that were created by 15 to 20 human annotators, thus ensuring their reliability and comprehensiveness for VS research (Gygli et al., 2014; Song et al., 2015).

4.3. Implementation and setup

The videos were uniformly downsampled to two frames per second following the process described by Rochan et al. (2018). The output in the form of the final features was then passed to the KCB, with a size of 1024 dimensions for each frame. We note that any type of feature representation can be used with the proposed network. Our experiments were performed using a Windows server equipped with an NVIDIA GeForce GTX 3090 graphics processing unit with a 12 GB RAM capacity. The model was implemented in PyTorch V1.12.0, and the parameters were optimised with the ADAM optimiser. For training, an optimal batch size of two was determined, the momentum was set to 0.9, and the learning rate was fixed to 10⁻³.

4.3.1. Ground truth
The GT strategy applied by Rochan et al. (2018) was adopted to produce a subset of frames (a small number of separated frames) for each video. Summaries based on keyframes and non-keyframes were used for training. Keyshot summaries are required for a fair comparison with previous approaches. The SumMe dataset includes keyshot GT labels, and the network is directly evaluated based on these GT summaries to exploit the benefits of labelled data for a supervised scheme. However, the TVSum dataset lacks keyshot annotations, and keyshot labels were therefore generated from the human annotations. The strategy described in one of the baseline works (Zhang et al., 2016b) was followed, in which keyshot summaries were generated from importance scores.

Table 2
Ground truth labels used during training and testing for both datasets.

Dataset | No. of Annotators | Training GT | Testing GT
TVSum   | 20                | Frame score | Frame score
SumMe   | 15–18             | Frame score | Keyshots
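The training configuration reported in Section 4.3 corresponds roughly to the following PyTorch setup. This is a hedged sketch: the placeholder model, synthetic data loader, mapping of "momentum 0.9" to Adam's first beta, and the 50-epoch count (taken from the later ablation discussion) are assumptions, not a verbatim training script.

```python
import torch

# Hypothetical model and data loader standing in for the MPFN pipeline.
model = torch.nn.Linear(1024, 2)                 # placeholder frame-scoring network
loader = [(torch.randn(2, 320, 1024), torch.randint(0, 2, (2, 320))) for _ in range(4)]

# Reported settings: Adam, learning rate 1e-3, batch size 2, momentum 0.9
# (interpreted here as Adam's first moment coefficient).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

for epoch in range(50):                          # 50 epochs, as in the ablation study
    for features, labels in loader:
        optimizer.zero_grad()
        logits = model(features)                 # (2, 320, 2) frame scores
        loss = torch.nn.functional.cross_entropy(
            logits.reshape(-1, 2), labels.reshape(-1))
        loss.backward()
        optimizer.step()
```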
Table 3
Comparison of results from traditional to SOTA techniques (highest performance shown in bold, and second highest in italics). The techniques marked with † are those where an attention mechanism is incorporated.

Technique                          | Shot segmentation              | Features             | F1 (SumMe) | F1 (TVSum) | Average | Rank
Motion-AE (Zhang et al., 2020)     | Motion trajectory segmentation | FRCNN                | 37.7       | 51.5       | 44.6    | 27th
VsLSTM (Zhang et al., 2016b)       | KTS                            | GoogleNet, HOG, SIFT | 37.6       | 54.2       | 45.9    | 26th
APD-VS (Lei et al., 2018)          | –                              | AlexNet              | 41.2       | 51.3       | 46.25   | 25th
MST_C (Sahu & Chowdhury, 2021)     | –                              | PHOG                 | 38.3       | 54.6       | 46.45   | 24th
DppLSTM (Zhang et al., 2016b)      | KTS                            | GoogleNet            | 38.6       | 54.7       | 46.65   | 23rd
FCSN Unsup (Rochan et al., 2018)   | KTS                            | GoogleNet            | 41.5       | 52.7       | 47.1    | 22nd
AR (Elfeki & Borji, 2019)          | KTS                            | GoogleNet            | 40.1       | 56.3       | 48.2    | 21st
Cycle-Sum (Yuan et al., 2019)      | Uniform segmentation           | GoogleNet            | 41.9       | 57.6       | 49.75   | 20th
DR-DSN Sup (Zhou et al., 2018)     | KTS                            | GoogleNet            | 42.8       | 57.6       | 50.2    | 19th
DR-DSN Unsup (Zhou et al., 2018)   | KTS                            | GoogleNet            | 43.9       | 58.1       | 51.0    | 18th
MetaL-TDVS (Li et al., 2019)       | KTS                            | GoogleNet            | 44.1       | 58.2       | 51.15   | 17th
ERSUM (Li et al., 2017)            | Uniform segmentation           | VGG16                | 43.1       | 59.4       | 51.25   | 16th
DySeqDPP (Li et al., 2018)         | KTS                            | GoogleNet            | 44.3       | 58.4       | 51.35   | 15th
Sum-GANdpp (Yuan et al., 2019)     | KTS                            | GoogleNet            | 43.4       | 59.5       | 51.45   | 14th
Sum-GAN (Yuan et al., 2019)        | KTS                            | GoogleNet            | 43.6       | 59.5       | 51.55   | 13th
HSA-RNN (Zhao et al., 2018)        | KTS                            | VGG16                | 44.1       | 59.8       | 51.95   | 12th
†SF-CVS (Huang & Wang, 2019)       | KTS                            | CapsulesNet          | 46.0       | 58.0       | 52.0    | 11th
TTH-RNN (Zhao et al., 2020)        | Dictionary learning            | GoogleNet            | 44.3       | 60.2       | 52.25   | 10th
RSGN Sup (Zhao et al., 2021)       | KTS                            | GoogleNet            | 45.0       | 60.1       | 52.55   | 9th
RSGN Unsup (Zhao et al., 2021)     | KTS                            | GoogleNet            | 45.5       | 61.1       | 53.3    | 8th
†M-AVS (Ji et al., 2019)           | KTS                            | GoogleNet            | 46.1       | 61.8       | 53.95   | 7th
FCSN Sup (Rochan et al., 2018)     | KTS                            | GoogleNet            | 51.1       | 59.2       | 55.15   | 6th
†VASNet (Fajtl et al., 2018)       | KTS                            | GoogleNet            | 49.71      | 61.37      | 55.54   | 5th
†GAN-VS (Zhong et al., 2021)       | KTS                            | GoogleNet, VGG       | 51.7       | 59.6       | 55.65   | 4th
†SABTNet (Fu & Wang, 2021)         | KTS                            | GoogleNet            | 50.7       | 61.0       | 55.85   | 3rd
†SUM-GDA (Li et al., 2021)         | KTS                            | GoogleNet            | 52.8       | 58.9       | 55.85   | 3rd
†LMHA (Zhu et al., 2022)           | KTS                            | GoogleNet + Op       | 51.4       | 61.5       | 56.45   | 2nd
†VSDA (Liang et al., 2022)         | KTS                            | GoogleNet            | 51.7       | 61.2       | 56.45   | 2nd
†MPFN (Our proposed network)       | KTS                            | DPT ViT              | 51.9       | 62.4       | 57.15   | 1st
F1-score = (2 × P × R) / (P + R) × 100%   (16)

The usual method outlined in SOTA work (Zhu et al., 2022) was applied to compute this metric for videos with numerous GT summaries. We randomly selected 20% of each dataset as test samples, and the remaining 80% was used for training and validation. We randomly distributed the data, repeated the experiments several times, and calculated the performance based on the average F1-score.
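Eq. (16) can be evaluated directly from the predicted and ground-truth keyframe sets; the helper below is a small illustration. The set-based precision/recall definition is the standard one and is assumed here rather than quoted from the paper.

```python
def f1_score(pred_frames, gt_frames):
    """F1 = 2PR / (P + R) * 100%, with precision and recall computed from
    the overlap between predicted and ground-truth summary frames."""
    pred, gt = set(pred_frames), set(gt_frames)
    overlap = len(pred & gt)
    if not pred or not gt or overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gt)
    return 2 * precision * recall / (precision + recall) * 100

print(f1_score(pred_frames=[3, 4, 5, 10, 11], gt_frames=[4, 5, 6, 10]))  # ≈ 66.7
```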
4.4. Performance comparison with conventional and SOTA techniques

The proposed network was assessed in comparison with previous SOTA unsupervised and SVS techniques. The results of experiments in which various methods were applied to the TVSum and SumMe datasets are shown in Tables 3 and 4. We also present a comparative analysis with three different traditional approaches: the method of Song et al. (2015), exemplar-based subset selection (Zhang et al., 2016a), and edited video and raw VS (Li et al., 2017). Our results are also compared to those of various DL-based approaches, including attention-based methods, in the domain of VS.

Our network significantly outperformed the SOTA alternatives on the SumMe and TVSum datasets. The performance on TVSum was higher than on SumMe, since the themes of the videos in TVSum are similar, whereas those in SumMe are highly diverse, as shown in Fig. 4. In addition, the number of videos in the SumMe dataset is about half of the total in the TVSum dataset. The most recent works, VSDA (Liang et al., 2022) and LMHA (Zhu et al., 2022), both achieved an average F1-score of 56.45 on the two datasets, giving them the second-best performance after our mechanism. VSDA and LMHA leverage attention-based modules to strengthen the importance of each feature. However, both extract visual features using conventional backbone networks without exploring highly representative features. As a result, fewer representative features are extracted, as they follow the current trend of VS mechanisms by relying on pool5 of GoogleNet.

We also conducted an extensive study of various backbones and developed a unique way of extracting features using a SOTA network. Our feature refinement strategies, followed by the fusion of the dominant features at different scales, were shown to be effective through empirical validation. Our proposed MPFN network marginally outperformed the SOTA alternatives, as can be seen from Tables 3 and 4.

4.5. Ablation study

We carried out an ablation analysis to examine our network more deeply by comparing different backbones and deploying effective PRB stages, followed by exploring the impact of the attention heads on the results. Extensive experiments were conducted, and the performance of our network is reported accordingly.

4.5.1. Backbone analysis
Feature extraction is very important, and the selection of the best feature descriptor can significantly improve the performance of a model. An extensive feature analysis can enable the best-performing feature extractor to be identified. We analysed the impact of several widely used feature extraction models for VS, and the features were then passed to the proposed KCB to evaluate the performance. The features were examined at this early stage because we intended to feed the most efficient features to the attention module. Experiments were conducted on both datasets using various backbone descriptors, and the results are shown in Table 5. It can be observed that the best results were achieved using our strategy. The main reason for its superior performance is that the DPT-assisted ViT backbone uses an attention mechanism to attend to different parts of the input frames. This enables the proposed model to capture the spatial relationships between different frame regions, making it more suitable for feature extraction. We also use a unique method to extract the features by considering four feature maps for feature assistance at different scales. The combination of the attention mechanism, scalability, and our method of extracting features from each specific level makes our deployed network a strong candidate as a backbone model for feature extraction. The features that give the best performance are selected, and a refinement module is developed to further improve these features.

Table 5
Results from using various backbones to extract features for input to the proposed KCB (best results shown in bold).

Dataset | GoogleNet | MobileNet | ResNet-152 | ResNet-101 | ViT
TVSum   | 56.4      | 55.2      | 53.8       | 52.3       | 58.5
SumMe   | 47.2      | 46.9      | 46.3       | 43.8       | 48.7

4.5.2. Pyramid feature refinement and fusion analysis
We carried out an empirical study involving feature refinement and fusion at different scales. Refining the features at multiple scales can enhance the overall feature representation. Ablation experiments were conducted to evaluate the impact of the PRB. Table 6 shows the results from different stages of the feature refinement block. Our scheme uses the mature feature maps Ƒm2, Ƒm3 and Ƒm4 acquired from the backbone. To examine the features, the PRB was investigated in depth to identify the most representative stages. It can be seen from the results reported below that, when the PRB is not included for refinement, our network gives the worst results, and that the performance is marginally increased when more pyramidal attention stages are employed. The table shows that good results are achieved on both datasets when the three mature global features Ƒm2, Ƒm3 and Ƒm4 are passed to the PRB. This superior performance is because MHSA uses attention to selectively attend to different levels of abstraction in the feature maps. This allows the network to capture both the local and global contexts, which is essential for distinguishing keyframes from non-keyframes. DASPP selectively attends to both the spatial and channel dimensions of the feature maps; this helps the model to capture fine-grained spatial details and encode richer channel-wise representations, which enable the network to focus on the most relevant features at each scale via the PRB. The visual and mathematical representations are updated as Pm2, Pm3 and Pm4 after passing through the PRB. As discussed in Section 3.5, the initial features Ƒm1 containing specific representations are stored for later fusion with the mature features to give more effective outcomes. The Ƒm1 features were fed to the pyramid1 block, which contains a single MHSA; the performance is reported in Table 6.

Table 6
F1-score (%) results from ablation experiments on the suggested PRB.

Dataset | Pyramid1 | Pyramid2 | Pyramid3 | Pyramid4 | F1-score
TVSum   | ✔        | ×        | ×        | ×        | 58.7
TVSum   | ×        | ✔        | ✔        | ×        | 59.1
TVSum   | ×        | ×        | ✔        | ✔        | 61.5
TVSum   | ×        | ✔        | ✔        | ✔        | 61.9
SumMe   | ✔        | ×        | ×        | ×        | 48.6
SumMe   | ×        | ✔        | ✔        | ×        | 49.4
SumMe   | ×        | ×        | ✔        | ✔        | 50.8
SumMe   | ×        | ✔        | ✔        | ✔        | 51.2

4.6. Impact of the MHSA attention heads

The number of attention heads in MHSA has a significant impact on the performance of the network. Increasing the number of heads allows the network to capture more complex relationships in the input data (Voita et al., 2019), leading to better performance. Our proposed PRB contains an MHSA in each of the three pyramid stages. The number of attention heads has a marginal effect on the overall frame recognition performance.
Fig. 5. (a) F1-score (percentage) results from ablation studies on the number of attention heads; (b) results from the final ablation study on both datasets.
Fig. 6. (a) Promising results from the 25th video of the SumMe dataset with an F1-score of 77.4; (b) poor results for the 17th video of the SumMe dataset with an F1-
score of 32.5. Overlapping frames are denoted by a black arrow, while incorrect predictions are shown with red arrows. The ground truth (top bar) and predicted
labels (bottom bar) are shown for each video.
performance. Numerous experiments were conducted with varying numbers of MHSA attention heads in the PRB to demonstrate the impact of this operation, and the results are shown in Fig. 5(a). It is worth noting that the performance of MHSA depends on the number of heads involved. The key reason for this is that a network with more attention heads can focus on specific features throughout the feature processing steps; however, the performance is degraded when too many attention heads are used, and the model tends towards overfitting, as illustrated in Fig. 5(a). It can be seen that the best performance increased from 61.9% and 51.2% to 62.4% and 51.9% on TVSum and SumMe, respectively, when 16 uniform heads were deployed in each MHSA in the pyramid. Optimal performance was achieved when the input feature maps were divided into several subspaces. The attention mechanism can be applied at various levels of parallelism, depending on the number of attention heads, which in turn defines the number of subspaces. Increasing the number of attention heads has been empirically shown to improve the performance of the model up to a specific level, and the performance decreases on both datasets when the number of heads is increased beyond 16. Hence, the fluctuation in the F1-score shown in the graph supports our conclusion regarding the significance of the number of uniform attention heads. These results demonstrate the importance of including attention heads. Fig. 5(b) depicts the training results after all ablations on both datasets. The network was trained for 50 epochs, and the best performance was found at the 45th epoch on both datasets. The final testing results on TVSum and SumMe are shown in Fig. 5(b).
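Since the head count of an MHSA layer determines how the feature dimension is partitioned into attention subspaces, the brief PyTorch sketch below illustrates this relationship. The 512-dimensional features and the 60-frame sequence are arbitrary placeholder values; only the 16-head setting corresponds to the configuration reported above.

# Sketch of how the number of MHSA heads partitions the feature dimension into
# subspaces. embed_dim must be divisible by num_heads; each head then attends
# over a subspace of size embed_dim // num_heads.
import torch
import torch.nn as nn

frame_feats = torch.randn(1, 60, 512)   # (batch, frames, feature dim), dummy data

for num_heads in (4, 8, 16, 32):
    mhsa = nn.MultiheadAttention(embed_dim=512, num_heads=num_heads,
                                 batch_first=True)
    out, attn = mhsa(frame_feats, frame_feats, frame_feats)
    print(f"{num_heads} heads -> subspace size {512 // num_heads}, "
          f"output {tuple(out.shape)}")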
Fig. 7. Examples of results for two videos (good and bad results) from the TVSum dataset: (a) promising results on the 41st video with an F1-score of 78.0; (b) poor
results on the 17th video with an F1-score of 47.9. The predicted labels (bottom bar) and GT (top bar) are shown for each video.
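The per-video F1-scores quoted in the captions of Figs. 6 and 7 (and discussed in the following subsection) quantify the overlap between predicted and ground-truth keyframes. The sketch below gives a minimal version of this computation under the simplifying assumption of binary per-frame labels; the standard SumMe/TVSum protocols additionally involve shot segments and multiple annotators, which are omitted here.

# Minimal sketch of an overlap-based F1-score between predicted and GT
# keyframes, assuming binary per-frame labels (simplified evaluation).
import numpy as np

def keyframe_f1(pred: np.ndarray, gt: np.ndarray) -> float:
    """F1 between two binary keyframe indicator vectors of equal length."""
    overlap = np.logical_and(pred == 1, gt == 1).sum()
    if pred.sum() == 0 or gt.sum() == 0:
        return 0.0
    precision = overlap / pred.sum()
    recall = overlap / gt.sum()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gt = np.array([0, 1, 1, 0, 1, 0, 0, 1])     # annotator keyframes (toy example)
pred = np.array([0, 1, 1, 0, 0, 0, 1, 1])   # predicted keyframes
print(f"F1 = {keyframe_f1(pred, gt):.3f}")  # 0.750 for this toy pair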
4.7. Qualitative results

Qualitative findings are provided to visually assess the effectiveness of the proposed network. Figs. 6 and 7 show examples from four videos across both datasets, together with the GT. Two videos were selected from each dataset to illustrate the best F1-scores and the poor results. The summaries produced by our network show a high overlap and a low percentage of non-overlap. Fig. 6 shows several examples of keyframes from the SumMe dataset: the first row shows the GT, while the keyframes predicted by our network are shown in the second row. It can be seen from Fig. 6(a) that for the 25th video of the SumMe dataset, good results are obtained that are almost the same as the GT. However, the performance of our network on the 17th video is poor due to its high diversity, as shown in Fig. 6(b). Our network identifies relevant keyframes that are well-aligned with the GT summary. It can be observed that the network selects representative video frames and produces a summary that represents the central theme of the original video.
In Fig. 7, we show the best and worst video results for the TVSum dataset. The frames in the summary video are indicated by the green bars, and the black bars show non-keyframes. Our network identifies almost the same frames as the GT summary for the 41st video, which mainly shows bike riding. However, it produces poor results on the 25th video. The best result on the TVSum dataset was an F1-score of 78.0, while the poorest result for this dataset, with an F1-score of 47.9, was found for the 25th video. Overall, the summary obtained with our method is similar to the summary generated by the annotators for the video. The best results on the SumMe and TVSum datasets were obtained for videos showing ball games and a person attempting bike tricks, respectively. Our network selects the most relevant frames to represent these activities, which match the GT summary closely.

5. Conclusion

Pyramidal attention has been shown to give SOTA performance when applied to video analysis problems such as event detection and scene recognition. With this motivation, we studied various pyramid-based feature refinement strategies and developed our MPFN network for efficient video summarization. After an extensive analysis of backbone models, we employed a dense prediction transformer assisted by a ViT backbone to extract the optimal representations. Our model extracted deep specific and global features from four multi-scaled feature maps rather than relying on a single feature map. The acquired multi-level feature representations were learned via pyramid attention, and feature refinement stages were used to boost the frame-level representations. A progressive feature fusion process was applied to the features before they were fed to the proposed KCB to predict importance scores close to the GT. We also conducted a deeper investigation, with extensive experiments, to explore the effectiveness of each component in the proposed network. The predicted scores for the videos were used to generate visual results that were compared with the GT frames. Our experimental results indicated that the proposed approach outperformed SOTA techniques in terms of the F1-score. In the future, we aim to integrate feature selection and an optimization algorithm to improve the performance and inferencing of the model. We also intend to explore the use of VS in uncertain environments.

CRediT authorship contribution statement

Habib Khan: Conceptualization, Methodology, Software, Validation, Writing – original draft. Tanveer Hussain: Conceptualization, Validation, Writing – review & editing. Samee Ullah Khan: Data curation, Formal analysis, Methodology, Software, Supervision, Writing – review & editing. Zulfiqar Ahmad Khan: Formal analysis, Software, Writing – review & editing. Sung Wook Baik: Funding acquisition, Project administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Data availability

The authors do not have permission to share data.

Acknowledgment

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea Government, MSIT, under Grant 2023R1A2C1005788.