Refined Semantic Enhancement Towards Frequency Diffusion for Video Captioning

Abstract

Video captioning aims to generate natural language sentences that describe the given video accurately. Existing methods obtain favorable generation by exploring richer visual representations in the encoding phase or by improving the decoding ability. However, the long-tailed problem hinders these attempts at low-frequency tokens, which rarely occur but carry critical semantics and play a vital role in detailed generation. In this paper, we introduce a novel Refined Semantic enhancement method towards Frequency Diffusion (RSFD), a captioning model that constantly perceives the linguistic representation of infrequent tokens. Concretely, a Frequency-Aware Diffusion (FAD) module is proposed to comprehend the semantics of low-frequency tokens and break through generation limitations. In this way, the caption is refined by promoting the absorption of tokens with insufficient occurrence. Based on FAD, we design a Divergent Semantic Supervisor (DSS) module to compensate for the information loss of high-frequency tokens brought by the diffusion process, where the semantics of low-frequency tokens is further emphasized to alleviate the long-tailed problem. Extensive experiments on two benchmark datasets, i.e., MSR-VTT and MSVD, indicate that RSFD outperforms the state-of-the-art methods, demonstrating that enhancing the semantics of low-frequency tokens yields a competitive generation effect. Code is available at https://github.com/lzp870/RSFD.

Figure 1: Illustration of existing models' limitations: the insufficient occurrence of low-frequency tokens affects information extraction during training, thus failing to generate refined low-frequency words in the testing phase.

Introduction

Video captioning is the task of understanding video content and describing it accurately. It has considerable application value in social networks, human-computer interaction, and other fields. Despite recent progress, it remains a challenging task because existing models prefer to generate commonly used words while disregarding many infrequent tokens that carry critical semantics of the video content, limiting refined semantic generation in the testing phase.

Existing methods generally adopt the encoder-decoder framework (Venugopalan et al. 2015), where the encoder generates visual representations by receiving a set of consecutive frames as input, and the decoder generates captions via recurrent neural networks (RNNs) or Transformer (Hori et al. 2017). Some efforts (Hu et al. 2022a,b; Jia et al. 2022; Liao et al. 2022b; Xu et al. 2022) have been developed to explore richer visual features. Prior works (Zhang and Peng 2019; Pan et al. 2020; Zhang et al. 2020; Li et al. 2022c) enhance the spatio-temporal representations between objects, while others (Jin et al. 2020; Gao et al. 2022) tend to improve the architecture of the decoder to obtain better linguistic representations. However, they focus either on enhancing visual representations or on the generation ability of the model, ignoring the tokens with insufficient occurrence in the training phase. Many low-frequency tokens, a critical factor for improving caption quality, have not received adequate attention, which makes it challenging for the model to express similar information in other videos in the testing phase, as in the instance in Fig. 1. Concretely, a man is playing the keyboard, yet the model generates "a person is playing a video game". This result commonly occurs when the low-frequency token "keyboard" has insufficient occurrence in the given video.

To deal with the challenge of token imbalance, an approach in neural machine translation (Wu et al. 2016) splits words into more fine-grained units. However, the token imbalance phenomenon is not fundamentally eliminated. BMI (Xu et al. 2021) has been dedicated to assigning adaptive weights to target tokens in light of the token frequency.
Figure 2: Overview of the proposed RSFD architecture. It mainly consists of the encoder in the top-left box and the decoder with the FAD and DSS modules in the other box. In the training phase, FAD promotes the model's comprehension of refined information by mapping the ground-truth caption to the semantic space and fusing it in frequency diffusion, while DSS supervises the central word to obtain its distinctive semantics. In the testing phase, only the Transformer-based parts are used for sentence generation.
Although such methods increase the exposure of low-frequency tokens during training, they potentially damage the information of high-frequency ones. In video captioning, few researchers focus on the learning effect of low-frequency tokens. ORG-TRL (Zhang et al. 2020) introduces an external language model to extend the low-frequency tokens, which depends heavily on external knowledge. Taking the above attempts as a reference, the motivation of our design is twofold: 1) Since high-frequency tokens are trained more often by the model, low-frequency tokens can entrust their semantics to high-frequency ones to obtain adequate exposure and alleviate the effect of token imbalance. 2) Instruction from multiple teachers provides high-frequency tokens with specific semantics suited to them, compensating for the information loss caused by the previous stage, while low-frequency tokens further enhance their generation.

In this work, we propose a novel method named Refined Semantic enhancement towards Frequency Diffusion (RSFD) for video captioning to address the above issue. In RSFD, unlike the previous effort (Zhang et al. 2020) that relies on external modules, we propose a module named Frequency-Aware Diffusion (FAD) to explore the semantics of low-frequency tokens inside our model. Firstly, in light of the number of occurrences of each token in the whole corpus and in each video it appears in, we split tokens into three categories (high-frequency, low-frequency, and unmarked). Secondly, we exploit the diffusion process to comprehend low-frequency tokens by integrating them with high-frequency ones. With the assistance of the diffusion of low-frequency tokens, the model develops the ability to absorb their semantics. Afterward, to alleviate the semantic loss of high-frequency tokens caused by the diffusion process (pure high-frequency words can be trained thoroughly, whereas the low-frequency semantics they now carry disturbs the model's learning of them) and to further excavate the low-frequency information, we design a Divergent Semantic Supervisor (DSS) module that exploits adjacent words to supervise the central word. With this adjacent supervisory information, a central high-frequency token obtains semantics from multiple teachers adapted to itself as compensation, and a central low-frequency token further promotes its generation. We concentrate on the representation of low-frequency tokens, which helps the model capture information about relatively rare but vital words in the testing phase. The improvement of low-frequency tokens enables the generated captions to carry more refined descriptions.

Our main contributions are threefold:

• We put emphasis on token imbalance, which has not been seriously considered in video captioning, and dedicate our effort to infrequent but vital tokens to enhance the refined semantics of generated captions.

• We devise a unique module, termed Frequency-Aware Diffusion (FAD), in the form of diffusion to learn tokens with insufficient occurrence that are most likely to carry critical semantics. To the best of our knowledge, this is the first time that token frequency is attached great importance in video captioning.

• We further design a Divergent Semantic Supervisor (DSS) module as a constraint for balancing tokens of different frequencies by leveraging distinctive semantics adapted to the token itself, and quantitatively validate its effectiveness.
Related Works

RNN- and Transformer-Based Methods

With the development of deep learning, researchers extract video features through deep convolutional neural networks and combine them with RNNs or Transformer (Hori et al. 2017) to make the generated words more accurate. Early work (Venugopalan et al. 2015) first applies the encoder-decoder framework to the captioning model, where visual features are obtained by mean-pooling over each frame, and long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) is adopted to generate sentences. Benefiting from the parallel computing ability and scalability of Transformer (Vaswani et al. 2017), it is applied to many multi-modal tasks. Inspired by this, Transformer has been introduced for video captioning (Chen et al. 2018a). SBAT (Jin et al. 2020) is intended to reduce the redundancy of consecutive frames. D2 (Gao et al. 2022) utilizes the syntactic prior to measure the contribution of visual features and syntactic prior information for each word.

Diffusion Model

Some efforts (Li et al. 2022a; Ye et al. 2022b) have been conducted on data noise. In recent years, the diffusion model (Sohl-Dickstein et al. 2015) has made remarkable progress. DDPM (Ho, Jain, and Abbeel 2020) develops the denoising ability by artificially adding noise in the diffusion process and matching each small step of the inverse process to the corresponding forward step. TimeGrad (Rasul et al. 2021) integrates the diffusion model with the autoregressive model by sampling from the data distribution at each time step via estimating its gradient. Diffusion approaches (Liao et al. 2022a; Zhu et al. 2022) have obtained significant progress in computer vision. Inspired by the potential of learning noise through noise addition, which aims purely at learning the noise itself, RSFD regards the information of low-frequency tokens as noise and adds it to high-frequency tokens, developing the model's ability to comprehend low-frequency tokens.

Translation Methods with Frequency Modeling

In neural machine translation, token-imbalanced data drive the model to preferentially generate high-frequency words while ignoring rarely used words. GNMT (Wu et al. 2016) proposes to divide words into a limited set of common sub-word units to handle rare words naturally. BMI (Xu et al. 2021) assigns an adaptive weight to promote token-level adaptive training. In video captioning, the large-scale training corpus also exhibits the phenomenon of imbalanced token occurrence. An external language model (Zhang et al. 2020) is introduced to extend the ground-truth tokens to deal with the token imbalance problem, which heavily depends on external knowledge. In contrast, RSFD highlights relatively rare words that most likely convey refined semantics in the sentence, alleviating token imbalance inside our model.

Proposed Method

We devise our Refined Semantic enhancement towards Frequency Diffusion (RSFD) for video captioning. As shown in Fig. 2, our overall framework follows the encoder-decoder structure. During the training process, Frequency-Aware Diffusion (FAD) encourages the model to add low-frequency token noise to learn its semantics. Then the diffusion features of tokens are fused with the corresponding visual features through the cross-attention mechanism. At the head of the decoder, the Divergent Semantic Supervisor (DSS) obtains distinctive semantic features by updating the gradient that adapts to the token itself. In the testing phase, only the original Transformer architecture is retained to generate captions.

Encoder-Decoder Framework

Encoder. Image and motion features are processed by two separate encoders, which receive a set of consecutive video clips of length K. Here we use V ∈ R^{K×d_v} to represent the features of these two types, where d_v denotes the dimension of each feature. We feed them into a highway embedding layer (HEL) to obtain representations R ∈ R^{K×d_h}, i.e., R = f_HEL(V), where d_h is the output dimension of HEL, and f_HEL can be trained directly through simple gradient descent (Srivastava, Greff, and Schmidhuber 2015); thus it can be formulated as:

f_HEL(V) = BN(T(V) ◦ H(V) + (1 − T(V)) ◦ P(V)),  (1)

where H(V) = V W_eh, P(V) = tanh(H(V) W_ep), and T(V) = ϕ(H(V) W_et). BN denotes the batch normalization operation (Ioffe and Szegedy 2015), ◦ is the element-wise product, and ϕ is the sigmoid function. W_eh ∈ R^{d_v×d_h}, and {W_ep, W_et} ∈ R^{d_h×d_h}. The multi-modal features, e.g., image and motion features, are concatenated to obtain R ∈ R^{2K×d_h}, which is fed into the decoder.
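To make Eq. (1) concrete, the following is a minimal PyTorch sketch of the highway embedding layer; the module and variable names (HighwayEmbedding, feat_dim, hidden_dim) are our own assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class HighwayEmbedding(nn.Module):
    """Hypothetical sketch of the highway embedding layer (HEL) in Eq. (1):
    f_HEL(V) = BN(T(V) * H(V) + (1 - T(V)) * P(V))."""

    def __init__(self, feat_dim: int = 2048, hidden_dim: int = 512):
        super().__init__()
        self.w_eh = nn.Linear(feat_dim, hidden_dim, bias=False)    # H(V) = V W_eh
        self.w_ep = nn.Linear(hidden_dim, hidden_dim, bias=False)  # P(V) = tanh(H(V) W_ep)
        self.w_et = nn.Linear(hidden_dim, hidden_dim, bias=False)  # T(V) = sigmoid(H(V) W_et)
        self.bn = nn.BatchNorm1d(hidden_dim)

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (batch, K, feat_dim) frame-level image or motion features
        h = self.w_eh(v)
        p = torch.tanh(self.w_ep(h))
        t = torch.sigmoid(self.w_et(h))
        r = t * h + (1.0 - t) * p                       # highway gating
        # BatchNorm1d expects (batch, channels, length)
        return self.bn(r.transpose(1, 2)).transpose(1, 2)

# Image and motion features are embedded separately and concatenated
# along the temporal axis to form R with shape (batch, 2K, hidden_dim).
hel_2d, hel_3d = HighwayEmbedding(), HighwayEmbedding()
v_2d, v_3d = torch.randn(4, 8, 2048), torch.randn(4, 8, 2048)
R = torch.cat([hel_2d(v_2d), hel_3d(v_3d)], dim=1)      # (4, 16, 512)
```

The gating follows the highway-network formulation cited above: the transform gate T(V) decides how much of the non-linear path P(V) is mixed with the linear projection H(V).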
Decoder. The decoder block consists of a self-attention layer, a cross-attention layer, and a feed-forward layer. The self-attention layer is formulated as:

SelfAtt(E_{<t}) = MultiHead(E_{<t}, E_{<t}, E_{<t}) = Concat(head_{1:h}) W_h,  (2)

where E_{<t} ∈ R^{T×d_h} denotes the feature vector of the embedded captions and T represents the caption length. Concat denotes the concatenation operation, and h denotes the number of attention heads. W_h ∈ R^{d_h×d_h} is a trainable parameter. A residual connection and layer normalization are adopted after the self-attention layer:

E'_{<t} = LayerNorm(E_{<t} + SelfAtt(E_{<t})),  (3)

where LayerNorm smooths the scale relationship between different samples.

The result E'_{<t} is then employed as the query for cross-attention:

D_{<t} = LayerNorm(E'_{<t} + MultiHead(E'_{<t}, R, R)),  (4)

where D_{<t} denotes the sentence feature of the cross-attention output. We adopt a feed-forward network (FFN) that applies a non-linear transformation, apply LayerNorm, and compute the probability distribution over words by Softmax:

D_t = LayerNorm(D_{<t} + FFN(D_{<t})),  (5)

P_t(y_t | y_{<t}, R) = Softmax(D_t W_p),  (6)

where R denotes the encoded video features and W_p ∈ R^{d_h×d_e} is a trainable variable. The objective is to minimize the cross-entropy loss:

L_t = − Σ_{t=1}^{T} log P_t(y*_t | y*_{<t}, R),  (7)

where y*_t denotes the ground-truth token at time step t.

Frequency-Aware Diffusion

ORG-TRL (Zhang et al. 2020) introduces an external language model to extend the ground-truth tokens to cope with the token imbalance problem. However, it needs to be trained on a large-scale corpus in advance, is divorced from the captioning model, and the external model consumes additional computing resources and memory. Instead, we propose the FAD module inside the captioning model to sufficiently grasp low-frequency semantics. We first introduce our split of high-frequency and low-frequency tokens and present a reasonable explanation for it. Then we elaborate on the diffusion process that learns the semantics of low-frequency tokens.

Split of Distinct Frequency Words. Due to the severe imbalance of tokens in the corpus, a small number of high-frequency tokens account for many occurrences while a large number of low-frequency tokens appear rarely. Here we write HFT for high-frequency tokens, LFT for low-frequency ones, and UMT for unmarked tokens. We then assign a token to a frequency category as:

HFT if |vid(tok)| / |vid_all(tok)| ≥ γ, LFT if |vid(tok)| / |vid_all(tok)| < γ, when |tok| / |cap| ≥ δ,  (8)

UMT if |vid(tok)| / |vid_all(tok)| ≥ γ, LFT if |vid(tok)| / |vid_all(tok)| < γ, when |tok| / |cap| < δ,  (9)

where δ and γ are hyper-parameters that decide which frequency category the token belongs to. In video captioning, a video has multiple ground-truth captions. |vid(·)| and |vid_all(·)| respectively represent the number of occurrences of the token in the corresponding video and the total number of tokens in all captions of that video. |tok| denotes the number of occurrences of the token in all ground-truth captions of the whole corpus, and |cap| indicates the number of all captions in the dataset.

We use γ to assess the frequency of the token within a video, for it only indicates the frequency inside the video in which it is located (termed intra-frequency), and δ to evaluate the frequency of the token across videos, for it expresses the frequency over the whole dataset (termed inter-frequency). We split tokens into the three frequency categories according to four cases. When the inter-frequency exceeds δ: if the token also has a high intra-frequency, it is assigned to HFT owing to its universality; otherwise, it cannot be integrated sufficiently with the corresponding video and is assigned to LFT, even if it is a common token. When the inter-frequency is below δ but the intra-frequency is greater than γ, the token, though not frequent overall, can be adequately comprehended by the video to which it belongs; it is therefore assigned to UMT, for it is self-sufficient and does not disturb other high-frequency words. Finally, when both ratios are below their respective hyper-parameters, the token is classified as LFT, for it requires more training.

Noising Frequency Diffusion. Our diffusion is an intuitive diffusion that exploits a single noise-addition step rather than the vanilla Markov chain with multiple noising steps. Inspired by denoising after adding noise (Ho, Jain, and Abbeel 2020), we regard low-frequency tokens as noise and add them to high-frequency tokens, so our diffusion is a process from low frequency to high frequency. We respectively define Token^L = {token^L_1, token^L_2, ..., token^L_m} and Token^H = {token^H_1, token^H_2, ..., token^H_n} as the lists of low-frequency and high-frequency token features. A similarity matrix S is constructed to measure the semantic similarity between the i-th LFT and the j-th HFT:

S_ij = exp(s_ij) / Σ_{j=1}^{n} exp(s_ij),  (10)

where S_ij is the (i, j)-th element of S ∈ R^{m×n}. We adopt the softmax function as the normalization function, and s_ij denotes the cosine similarity score between the two tokens. We choose the j-th HFT corresponding to the maximum value in each row as the diffusion object of the i-th LFT:

LoH = arg max(S, dim = 1),  (11)

where LoH denotes the list of high-frequency tokens corresponding to the low-frequency tokens one by one, and the values of LoH are indices into Token^H. In this way, we establish the correspondence between each low-frequency token and its semantically most similar high-frequency token. Afterward, we obtain the weight of the noise according to the similarity:

α_i = exp(σ(token^L_i, token^H_loh)) / Σ_{i∈N} exp(σ(token^L_i, token^H_loh)),  (12)

where loh denotes the i-th value of LoH, α_i represents the significance of the i-th noise, N denotes the indices of the low-frequency words corresponding to one high-frequency word, and σ denotes the cosine similarity between the semantics of the two tokens. We then obtain the diffused high-frequency token token^H_loh:

token^H_loh = token^H_loh + Σ_{i∈N} α_i ◦ token^L_i.  (13)

Note that unlike DDPM (Ho, Jain, and Abbeel 2020) and TimeGrad (Rasul et al. 2021), which first learn the noise and then denoise, our model only learns the low-frequency tokens regardless of how to remove them. By adding appropriate low-frequency tokens as noise to high-frequency ones, RSFD can generate relatively rarely used but more refined words.
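The following PyTorch sketch illustrates the two FAD steps above under our own assumptions about data layout: the frequency split of Eqs. (8)-(9) operating on pre-computed occurrence counts, and the one-step frequency diffusion of Eqs. (10)-(13) operating on token embeddings. Names such as split_frequency and frequency_diffusion are hypothetical and are not taken from the released code.

```python
import torch
import torch.nn.functional as F

def split_frequency(vid_cnt, vid_all, tok_cnt, cap_cnt, gamma=0.015, delta=0.0015):
    """Assign 'HFT', 'LFT', or 'UMT' per Eqs. (8)-(9).
    vid_cnt: occurrences of the token in its video; vid_all: total tokens in that
    video's captions; tok_cnt: corpus-wide occurrences; cap_cnt: number of captions."""
    intra = vid_cnt / vid_all          # intra-frequency, compared with gamma
    inter = tok_cnt / cap_cnt          # inter-frequency, compared with delta
    if inter >= delta:
        return "HFT" if intra >= gamma else "LFT"
    return "UMT" if intra >= gamma else "LFT"

def frequency_diffusion(lft_emb, hft_emb):
    """One-step diffusion of Eqs. (10)-(13): add each low-frequency embedding,
    weighted by cosine similarity, onto its most similar high-frequency embedding.
    lft_emb: (m, d) low-frequency features; hft_emb: (n, d) high-frequency features."""
    # Cosine similarities s_ij and row-wise softmax S (Eq. 10).
    sim = F.cosine_similarity(lft_emb.unsqueeze(1), hft_emb.unsqueeze(0), dim=-1)  # (m, n)
    loh = sim.softmax(dim=1).argmax(dim=1)            # Eq. (11): index of the closest HFT

    diffused = hft_emb.clone()
    for j in range(hft_emb.size(0)):
        group = (loh == j).nonzero(as_tuple=True)[0]  # N: LFTs mapped to HFT j
        if group.numel() == 0:
            continue
        # Eq. (12): softmax over cosine similarities inside the group.
        alpha = sim[group, j].softmax(dim=0).unsqueeze(1)
        # Eq. (13): add the weighted low-frequency "noise" to the HFT feature.
        diffused[j] = hft_emb[j] + (alpha * lft_emb[group]).sum(dim=0)
    return diffused

# Toy usage: 6 low-frequency and 3 high-frequency embeddings of size 512.
lft, hft = torch.randn(6, 512), torch.randn(3, 512)
print(frequency_diffusion(lft, hft).shape)  # torch.Size([3, 512])
```

Because the softmax in Eq. (10) is monotonic, the argmax in Eq. (11) simply picks the high-frequency token with the largest cosine similarity; the normalization in Eq. (12) then balances several low-frequency tokens that share the same diffusion target.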
Divergent Semantic Supervisor

Due to the low-frequency noise semantics introduced by FAD, the semantics of high-frequency tokens is constrained to some extent. To complement the semantics of central high-frequency tokens and to further encourage the formation of central low-frequency tokens, DSS offers the contextual cue of adjacent tokens to the central tokens.

Skip-gram (Mikolov et al. 2013) designs a window centered on the central word and supervises the generation of the central word using the contextual words. Motivated by this, we design DSS to obtain a gradient adapted to the token itself so as to understand the word adequately.

As shown in Fig. 2, we project the hidden features into a new feature space before the last linear layer and then project the new features into a space with the dimension of the corpus size:

D'_t = D_t W_a W_p,  (14)

P'_t(y_t | y_{<t}, R) = Softmax(D'_t),  (15)

where D_t is the result of Eq. (5), W_a ∈ R^{d_h×d_h} is a trainable variable, and W_p ∈ R^{d_h×d_e} is a trainable variable shared with Eq. (6). We use D'^f_t and D'^l_t to respectively represent the former and the latter linear projections built on the same layer as the original last unified linear projection:

P'^f_t(y_t | y_{<t−1}, R) = Softmax(D'^f_t),  (16)

P'^l_t(y_t | y_{<t+1}, R) = Softmax(D'^l_t).  (17)

The divergent loss is calculated as:

L_div = − Σ_{t=1}^{T} log P'^f_t(y*_t | y*_{<t−1}, R) − Σ_{t=1}^{T} log P'^l_t(y*_t | y*_{<t+1}, R),  (18)

where the two terms denote the former and the latter divergent losses, respectively. The final loss is calculated as:

L = L_t + λ L_div,  (19)

where λ is a trade-off parameter that decides the significance of L_div. In the above formulation, we only use one word before and one word after to supervise the central word. In the experimental results section, we report the number of supervisor words that achieves the best generation effect.
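As a rough illustration of Eqs. (14)-(19), the sketch below adds a divergent head on top of the decoder output and combines the divergent loss with the cross-entropy term. The class name DivergentSupervisor, the window of one word on each side, and the target-shifting scheme are our assumptions about one plausible reading of the equations, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DivergentSupervisor(nn.Module):
    """Hypothetical DSS head: re-projects decoder states (Eq. 14) and predicts
    the adjacent ground-truth tokens (Eqs. 16-18) with the shared output matrix W_p."""

    def __init__(self, hidden_dim: int, w_p: nn.Linear):
        super().__init__()
        self.w_a = nn.Linear(hidden_dim, hidden_dim, bias=False)  # W_a in Eq. (14)
        self.w_p = w_p                                            # shared with Eq. (6)

    def forward(self, d_t: torch.Tensor) -> torch.Tensor:
        # d_t: (batch, T, hidden_dim) decoder features from Eq. (5)
        return self.w_p(self.w_a(d_t))          # logits over the vocabulary, Eq. (14)

def divergent_loss(logits, targets, pad_id=0):
    """Eq. (18): supervise position t with the previous and the next ground-truth
    token (a window of one word on each side, as in the paper's example)."""
    former = F.cross_entropy(logits[:, 1:].flatten(0, 1), targets[:, :-1].flatten(),
                             ignore_index=pad_id)   # predict y_{t-1}
    latter = F.cross_entropy(logits[:, :-1].flatten(0, 1), targets[:, 1:].flatten(),
                             ignore_index=pad_id)   # predict y_{t+1}
    return former + latter

# Toy usage with assumed sizes: vocab 10547, hidden 512, lambda = 0.07 on MSR-VTT.
vocab, hidden, lam = 10547, 512, 0.07
w_p = nn.Linear(hidden, vocab, bias=False)
dss = DivergentSupervisor(hidden, w_p)

d_t = torch.randn(4, 30, hidden)                 # decoder features
targets = torch.randint(1, vocab, (4, 30))       # ground-truth token ids
main_logits = w_p(d_t)                           # Eq. (6) logits
loss_t = F.cross_entropy(main_logits.flatten(0, 1), targets.flatten())  # Eq. (7)
loss = loss_t + lam * divergent_loss(dss(d_t), targets)                 # Eq. (19)
loss.backward()
```

Sharing W_p between the main head and the divergent head mirrors the weight sharing stated after Eq. (15), so the adjacent-word supervision corrects the gradients flowing into the same output projection.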
Experimental Results

Datasets and Evaluation Metrics

MSR-VTT (Xu et al. 2016) consists of 10,000 video clips, each annotated with 20 English captions and a category. MSVD (Chen and Dolan 2011) is a widely used benchmark video captioning dataset collected from YouTube, composed of 1,970 video clips and roughly 80,000 English sentences. Following the standard split, we use the same setup as previous works (Pan et al. 2020; Ye et al. 2022a), which takes 6,513 video clips for training, 497 for validation, and 2,990 for testing on MSR-VTT, as well as 1,200, 100, and 670 videos for training, validation, and testing on MSVD. The vocabulary size of MSR-VTT is 10,547, whereas that of MSVD is 9,468.

We report results on four commonly used metrics to evaluate caption quality: BLEU-4 (B-4) (Papineni et al. 2002), METEOR (M) (Banerjee and Lavie 2005), ROUGE_L (R) (Lin 2004), and CIDEr (C) (Vedantam, Zitnick, and Parikh 2015). BLEU-4, METEOR, and ROUGE_L are generally used in machine translation. CIDEr is proposed specifically for captioning tasks and is considered more consistent with human judgment.

Implementation Details

In our experiments, we follow (Pei et al. 2019) to extract image and motion features to encode video information. Specifically, we use ImageNet (Deng et al. 2009) pre-trained ResNet-101 (He et al. 2016) to extract 2D scene features for each frame. We also utilize Kinetics (Kay et al. 2017) pre-trained ResNeXt-101 with 3D convolutions (Hara, Kataoka, and Satoh 2018).

In our implementation, the size of the video feature d_v and the hidden size d_h are set to 2,048 and 512. Empirically, we set the number of sampled frames K = 8 for each video clip. We set the maximum sequence length T to 30 on MSR-VTT and T = 20 on MSVD. The Transformer decoder has one decoder layer, 8 attention heads, a 0.5 dropout ratio, and 0.0005 ℓ2 weight decay. We implement word embeddings with trainable embedding layers of dimension 512. In the training phase, we adopt Adam (Kingma and Ba 2015) with an initial learning rate of 0.005 to optimize our model. The batch size is set to 64, and the number of training epochs is set to 50. During testing, we use beam search with size 5 to generate the predicted sentences. γ and δ for deciding the category of a token are set to 0.015 and 0.0015, respectively. We set λ = 0.07 on MSR-VTT and λ = 0.4 on MSVD to weight the divergent loss. All experiments are conducted on two NVIDIA Tesla PH402 SKU 200 GPUs.

Comparison with the State-of-the-Arts

The comparison with state-of-the-art methods in Table 1 illustrates that RSFD achieves a significant performance improvement. On MSR-VTT, RSFD achieves the best results under 3 out of 4 metrics. Specifically, RSFD outperforms all existing methods except ORG-TRL (Zhang et al. 2020) under all metrics on MSR-VTT, showing that our focus on solving the insufficient occurrence of low-frequency words is practical. In particular, RSFD achieves 53.1% and 96.7% under CIDEr on MSR-VTT and MSVD, respectively, an improvement of 2.0% and 5.3% over AR-B (Yang et al. 2021). As CIDEr captures human consensus judgment better than other metrics, superior performance under CIDEr indicates that RSFD can generate captions semantically closer to human understanding. The boost in performance demonstrates the advantages of RSFD, which exploits FAD and DSS and models low-frequency tokens for detailed and comprehensive textual representations.

RSFD is slightly lower than ORG-TRL under BLEU-4 on MSR-VTT, which uses an external language model to integrate linguistic knowledge, but RSFD is second only to it, demonstrating its superiority. OA-BTG (Zhang and Peng 2019), STG-KD (Pan et al. 2020), and ORG-TRL perform better under METEOR and ROUGE_L on MSVD; they either capture the temporal interaction of objects well on a small dataset or bring complementary representations by considering video semantics. Despite not exploring such information, RSFD focuses on generating refined descriptions and outperforms the other methods under CIDEr, benefiting from its emphasis on refined low-frequency tokens; as noted, CIDEr best matches human consensus judgment.

Method | Venue | MSR-VTT: B-4, M, R, C | MSVD: B-4, M, R, C
M3-VC (Wang et al. 2018b) CVPR ’18 38.1 26.6 - - 51.8 32.5 - -
RecNet (Wang et al. 2018a) CVPR ’18 39.1 26.6 59.3 42.7 52.3 34.1 69.8 80.3
PickNet (Chen et al. 2018b) ECCV ’18 41.3 27.7 59.8 44.1 52.3 33.3 69.6 76.5
OA-BTG (Zhang and Peng 2019) CVPR ’19 41.4 28.2 - 46.9 56.9 36.2 - 90.6
MARN (Pei et al. 2019) CVPR ’19 40.4 28.1 60.7 47.1 48.6 35.1 71.9 92.2
MGSA (Chen and Jiang 2019) AAAI ’19 42.4 27.6 - 47.5 53.4 35.0 - 86.7
GRU-EVE (Aafaq et al. 2019) CVPR ’19 38.3 28.4 60.7 48.1 47.9 35.0 71.5 78.1
POS-CG (Wang et al. 2019) ICCV ’19 42.0 28.2 61.6 48.7 52.5 34.1 71.3 88.7
STG-KD (Pan et al. 2020) CVPR ’20 40.5 28.3 60.9 47.1 52.2 36.9 73.9 93.0
SAAT (Zheng, Wang, and Tao 2020) CVPR ’20 40.5 28.2 60.9 49.1 46.5 33.5 69.4 81.0
ORG-TRL (Zhang et al. 2020) CVPR ’20 43.6 28.8 62.1 50.9 54.3 36.4 73.9 95.2
SBAT (Jin et al. 2020) IJCAI ’20 42.9 28.9 61.5 51.6 53.1 35.3 72.3 89.5
TTA (Tu et al. 2021) PR ’21 41.4 27.7 61.1 46.7 51.8 35.5 72.4 87.7
SibNet (Liu, Ren, and Yuan 2021) TPAMI ’21 41.2 27.8 60.8 48.6 55.7 35.5 72.6 88.8
SGN (Ryu et al. 2021) AAAI ’21 40.8 28.3 60.8 49.5 52.8 35.5 72.9 94.3
AR-B (Yang et al. 2021) (Baseline) † AAAI ’21 42.1 29.1 61.2 51.1 49.2 35.3 72.1 91.4
FrameSel (Li et al. 2022b) TCSVT ’22 38.4 27.2 59.7 44.1 50.4 34.2 70.4 73.7
TVRD (Wu et al. 2022) TCSVT ’22 43.0 28.7 62.2 51.8 50.5 34.5 71.7 84.3
RSFD (Ours) 43.4 29.3 62.3 53.1 51.2 35.7 72.9 96.7
Table 1: Performance (%) comparison with the state-of-the-arts on MSR-VTT and MSVD. † indicates the reproduced method.
The best results are shown in bold.
Table 2: Performance (%) comparison of different configurations of window size in the DSS module on MSR-VTT and MSVD. The best results are shown in bold.

Table 3: Performance (%) comparison of different components in RSFD on MSR-VTT and MSVD. The best results are shown in bold.

Study on Trade-off Parameter

Trade-off Parameter of the Supervision Window. To explore the impact of the number of supervising words, we design supervision windows of different sizes to extract adjacent supervisory information, as shown in Table 2. When the window size increases, the overall performance shows an upward trend, but when the size exceeds 5, the performance begins to decline. When the supervision window is too small, the insufficient adjacent information cannot play a supervisory role. In an oversized supervision window, not only low-frequency tokens but all tokens receive unique gradient updates, causing overfitting of tokens that have already been trained enough.

Trade-off Hyper-Parameter λ. To evaluate the effectiveness of λ and find an appropriate value for it, we adjust its value in Eq. (19); results under all metrics on MSR-VTT are given in Fig. 3(a). Note that using only the original loss (λ = 0) shows the worst performance. Thus it can be concluded again that adding the divergent loss boosts the performance of the captioning model, which benefits from the gradient correction of adjacent words to central words. Considering all metrics comprehensively, we empirically set λ = 0.07 on MSR-VTT.

Trade-off Hyper-Parameters γ and δ. We conduct experiments on MSR-VTT to explore the effect of the hyper-parameter γ for the intra-frequency and the hyper-parameter δ for the inter-frequency. The scores of all metrics under different settings are illustrated in Figs. 3(b) and 3(c). If δ is too high, the reduced number of high-frequency tokens means the information of low-frequency tokens cannot be well absorbed and fully integrated with the visual features. Similarly, if γ is too high, some words that should not be treated as low-frequency tokens are split into that category, causing overfitting of these words. In contrast, if γ is set too low, low-frequency tokens that need to be sufficiently trained are ignored. Considering the above discussion, we set γ and δ to 0.015 and 0.0015, respectively.
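For reference, the hyper-parameters discussed above can be collected into a single configuration; this is a convenience sketch with assumed field names, not a file from the official repository.

```python
from dataclasses import dataclass

@dataclass
class RSFDConfig:
    """Hyper-parameters reported in the paper (values for MSR-VTT unless noted)."""
    feat_dim: int = 2048        # d_v, size of 2D/3D video features
    hidden_dim: int = 512       # d_h, hidden size
    num_frames: int = 8         # K, sampled frames per clip
    max_len: int = 30           # T (20 on MSVD)
    num_heads: int = 8          # attention heads in the decoder
    dropout: float = 0.5
    weight_decay: float = 5e-4  # l2 weight decay
    lr: float = 0.005           # initial learning rate for Adam
    batch_size: int = 64
    epochs: int = 50
    beam_size: int = 5
    gamma: float = 0.015        # intra-frequency threshold, Eqs. (8)-(9)
    delta: float = 0.0015       # inter-frequency threshold, Eqs. (8)-(9)
    lam: float = 0.07           # divergent-loss weight lambda (0.4 on MSVD)

config = RSFDConfig()
```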
Figure 3: Analysis of (a) different parameter λ for the significance of divergent semantic supervisor, (b) different parameter γ
deciding intra-frequency, and (c) different parameter δ determining inter-frequency on MSR-VTT. We show the relative results
on all metrics.
Figure 4: Instances of AR-B (baseline), RSFD with FAD, and the full RSFD model in the testing phase. The yellow and blue words indicate the high-frequency and low-frequency words to be predicted. The yellow box on the left corresponds to the three methods predicting the yellow word, while the blue box on the right generates the blue word.

Figure 5: Qualitative results generated on the testing set of MSR-VTT with AR-B (baseline) and RSFD. The red words indicate the refined information in the ground truth as well as in the generated sentences of RSFD.
Ablation Study

Effectiveness of Different Components. The results are presented in Table 3. The method in the first line, AR-B (Yang et al. 2021) (baseline), only applies a Transformer-based model to the encoder-decoder framework. The methods in the second and third rows refer to the variants with FAD and DSS, respectively. It can be observed that FAD, by sufficiently exploiting the semantics of low-frequency tokens, outperforms AR-B under the popular metrics, in particular bringing significant CIDEr improvements. Although FAD slightly affects the semantics of high-frequency tokens, we integrate it with DSS, which makes up the information of high-frequency tokens to some extent. The results of our complete method RSFD demonstrate that combining the two modules achieves the best performance.

Evaluation of FAD and DSS. We analyze the effect of each component in the testing phase, as shown in Fig. 4. FAD generates "ocean", shown in the second line, which as an infrequent token carries critical semantics but is challenging for AR-B (Yang et al. 2021) to predict. FAD helps the model comprehend low-frequency tokens relative to the video content. Although FAD slightly narrows the probability of the high-frequency token "swims" from 0.627 to 0.579, DSS compensates for it, raising it to 0.656, and further enhances the semantics of the low-frequency token "ocean" from 0.682 to 0.697. In general, our method promotes the generation of refined captions owing to its focus on low-frequency tokens and effectively alleviates the long-tailed problem in video captioning, demonstrating a competitive generation effect.

Qualitative Results

We present some results generated on MSR-VTT in Fig. 5. It can be observed that the content of the captions generated by RSFD is more refined than that of AR-B (Yang et al. 2021) (baseline). Taking the first result as an example, AR-B only understands the rough meaning of the video, e.g., "folding", missing the more detailed description, e.g., "airplane". By contrast, our model captures more details and generates them accurately, demonstrating that RSFD produces relatively less common but critical words. The rest of the results show similar characteristics.

Conclusion

In this paper, we present Refined Semantic enhancement towards Frequency Diffusion (RSFD) for video captioning. RSFD addresses the problem that Transformer-based or RNN-based architectures suffer from the insufficient occurrence of low-frequency tokens, which limits the capability of comprehensive semantic generation. Our method incorporates the Frequency-Aware Diffusion (FAD) and Divergent Semantic Supervisor (DSS) modules, which capture more detailed semantics of relatively less common but critical tokens to provide refined semantic enhancement with the help of low-frequency tokens. RSFD achieves competitive performance, outperforming the baseline by large margins of 2.0% and 5.3% in terms of CIDEr on MSR-VTT and MSVD, respectively.
Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 62271361 and 62066021, the Department of Science and Technology, Hubei Provincial People's Government under Grant 2021CFB513, and the Hubei Province Educational Science Planning Special Funding Project under Grant 2022ZA41.

References

Aafaq, N.; Akhtar, N.; Liu, W.; Gilani, S. Z.; and Mian, A. 2019. Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 12487–12496.

Banerjee, S.; and Lavie, A. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proc. ACL Annu. Meeting Assoc. Comput. Linguistics Workshop, 65–72.

Chen, D. L.; and Dolan, W. B. 2011. Collecting Highly Parallel Data for Paraphrase Evaluation. In Proc. ACL Annu. Meeting Assoc. Comput. Linguistics, 190–200.

Chen, M.; Li, Y.; Zhang, Z.; and Huang, S. 2018a. TVT: Two-View Transformer Network for Video Captioning. In Proc. PMLR Asian Conf. Mach. Learn., 847–862.

Chen, S.; and Jiang, Y. 2019. Motion Guided Spatial Attention for Video Captioning. In Proc. AAAI Conf. Artif. Intell., 8191–8198.

Chen, Y.; Wang, S.; Zhang, W.; and Huang, Q. 2018b. Less Is More: Picking Informative Frames for Video Captioning. In Proc. Springer Eur. Conf. Comput. Vis., 367–384.

Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; and Li, F. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 248–255.

Gao, Y.; Hou, X.; Suo, W.; Sun, M.; Ge, T.; Jiang, Y.; and Wang, P. 2022. Dual-Level Decoupled Transformer for Video Captioning. In Proc. ACM Int. Conf. Multimedia Retr., 219–228.

Hara, K.; Kataoka, H.; and Satoh, Y. 2018. Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 6546–6555.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 770–778.

Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising Diffusion Probabilistic Models. In Adv. Neural Inf. Process. Syst.

Hochreiter, S.; and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Comput., 9(8): 1735–1780.

Hori, C.; Hori, T.; Lee, T.; Zhang, Z.; Harsham, B.; Hershey, J. R.; Marks, T. K.; and Sumi, K. 2017. Attention-Based Multimodal Fusion for Video Description. In Proc. IEEE/CVF Int. Conf. Comput. Vis., 4203–4212.

Hu, M.; Jiang, K.; Liao, L.; Nie, Z.; Xiao, J.; and Wang, Z. 2022a. Progressive Spatial-temporal Collaborative Network for Video Frame Interpolation. In Proc. ACM Int. Conf. Multimedia, 2145–2153.

Hu, M.; Jiang, K.; Liao, L.; Xiao, J.; Jiang, J.; and Wang, Z. 2022b. Spatial-Temporal Space Hand-in-Hand: Spatial-Temporal Video Super-Resolution via Cycle-Projected Mutual Learning. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 3564–3573.

Ioffe, S.; and Szegedy, C. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proc. JMLR Int. Conf. Mach. Learn., 448–456.

Jia, X.; Zhong, X.; Ye, M.; Liu, W.; and Huang, W. 2022. Complementary Data Augmentation for Cloth-Changing Person Re-Identification. IEEE Trans. Image Process., 31: 4227–4239.

Jin, T.; Huang, S.; Chen, M.; Li, Y.; and Zhang, Z. 2020. SBAT: Video Captioning with Sparse Boundary-Aware Transformer. In Proc. IJCAI Int. Joint Conf. Artif. Intell., 630–636.

Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. 2017. The Kinetics Human Action Video Dataset. arXiv:1705.06950.

Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In Proc. Int. Conf. Learn. Represent. Workshop.

Li, J.; Zha, S.; Chen, C.; Ding, M.; Zhang, T.; and Yu, H. 2022a. Attention Guided Global Enhancement and Local Refinement Network for Semantic Segmentation. IEEE Trans. Image Process., 31: 3211–3223.

Li, L.; Zhang, Y.; Tang, S.; Xie, L.; Li, X.; and Tian, Q. 2022b. Adaptive Spatial Location with Balanced Loss for Video Captioning. IEEE Trans. Circuits Syst. Video Technol., 17–30.

Li, Z.; Zhong, X.; Chen, S.; Liu, W.; Huang, W.; and Li, L. 2022c. Background Disturbance Mitigation for Video Captioning via Entity-Action Relocation. In Proc. IEEE Int. Conf. Acoustics Speech Signal Process.

Liao, L.; Chen, W.; Xiao, J.; Wang, Z.; Lin, C.; and Satoh, S. 2022a. Unsupervised Foggy Scene Understanding via Self Spatial-Temporal Label Diffusion. IEEE Trans. Image Process., 31: 3525–3540.

Liao, L.; Xu, K.; Wu, H.; Chen, C.; Sun, W.; Yan, Q.; and Lin, W. 2022b. Exploring the Effectiveness of Video Perceptual Representation in Blind Video Quality Assessment. In Proc. ACM Int. Conf. Multimedia, 837–846.

Lin, C. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Proc. ACL Annu. Meeting Assoc. Comput. Linguistics Workshop, 74–81.

Liu, S.; Ren, Z.; and Yuan, J. 2021. SibNet: Sibling Convolutional Encoder for Video Captioning. IEEE Trans. Pattern Anal. Mach. Intell., 3259–3272.

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient Estimation of Word Representations in Vector Space. In Proc. Int. Conf. Learn. Represent. Workshop.

Pan, B.; Cai, H.; Huang, D.; Lee, K.; Gaidon, A.; Adeli, E.; and Niebles, J. C. 2020. Spatio-Temporal Graph for Video Captioning with Knowledge Distillation. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 10867–10876.

Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proc. ACL Annu. Meeting Assoc. Comput. Linguistics, 311–318.

Pei, W.; Zhang, J.; Wang, X.; Ke, L.; Shen, X.; and Tai, Y. 2019. Memory-Attended Recurrent Network for Video Captioning. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 8347–8356.

Rasul, K.; Seward, C.; Schuster, I.; and Vollgraf, R. 2021. Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting. In Proc. JMLR Int. Conf. Mach. Learn., 8857–8868.

Ryu, H.; Kang, S.; Kang, H.; and Yoo, C. D. 2021. Semantic Grouping Network for Video Captioning. In Proc. AAAI Conf. Artif. Intell., 2514–2522.

Sohl-Dickstein, J.; Weiss, E. A.; Maheswaranathan, N.; and Ganguli, S. 2015. Deep Unsupervised Learning Using Nonequilibrium Thermodynamics. In Proc. JMLR Int. Conf. Mach. Learn., 2256–2265.

Srivastava, R. K.; Greff, K.; and Schmidhuber, J. 2015. Training Very Deep Networks. In Adv. Neural Inf. Process. Syst., 2377–2385.

Tu, Y.; Zhou, C.; Guo, J.; Gao, S.; and Yu, Z. 2021. Enhancing the Alignment between Target Words and Corresponding Frames for Video Captioning. Pattern Recognit., 107702.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention Is All You Need. In Adv. Neural Inf. Process. Syst., 5998–6008.

Vedantam, R.; Zitnick, C. L.; and Parikh, D. 2015. CIDEr: Consensus-Based Image Description Evaluation. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 4566–4575.

Venugopalan, S.; Xu, H.; Donahue, J.; Rohrbach, M.; Mooney, R. J.; and Saenko, K. 2015. Translating Videos to Natural Language Using Deep Recurrent Neural Networks. In Proc. ACL Conf. North Am. Chapter Assoc. Comput. Linguistics, 1494–1504.

Wang, B.; Ma, L.; Zhang, W.; Jiang, W.; Wang, J.; and Liu, W. 2019. Controllable Video Captioning with POS Sequence Guidance Based on Gated Fusion Network. In Proc. IEEE/CVF Int. Conf. Comput. Vis., 2641–2650.

Wang, B.; Ma, L.; Zhang, W.; and Liu, W. 2018a. Reconstruction Network for Video Captioning. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 7622–7631.

Wang, J.; Wang, W.; Huang, Y.; Wang, L.; and Tan, T. 2018b. M3: Multimodal Memory Modelling for Video Captioning. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 7512–7520.

Wu, B.; Niu, G.; Yu, J.; Xiao, X.; Zhang, J.; and Wu, H. 2022. Towards Knowledge-Aware Video Captioning via Transitive Visual Relationship Detection. IEEE Trans. Circuits Syst. Video Technol., 32: 6753–6765.

Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; Klingner, J.; Shah, A.; Johnson, M.; Liu, X.; Kaiser, L.; Gouws, S.; Kato, Y.; Kudo, T.; Kazawa, H.; Stevens, K.; Kurian, G.; Patil, N.; Wang, W.; Young, C.; Smith, J.; Riesa, J.; Rudnick, A.; Vinyals, O.; Corrado, G.; Hughes, M.; and Dean, J. 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144.

Xu, J.; Mei, T.; Yao, T.; and Rui, Y. 2016. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 5288–5296.

Xu, X.; Liu, W.; Wang, Z.; Hu, R.; and Tian, Q. 2022. Towards Generalizable Person Re-Identification with a Bi-Stream Generative Model. Pattern Recognit., 132: 108954.

Xu, Y.; Liu, Y.; Meng, F.; Zhang, J.; Xu, J.; and Zhou, J. 2021. Bilingual Mutual Information Based Adaptive Training for Neural Machine Translation. In Proc. ACL Annu. Meeting Assoc. Comput. Linguistics, 511–516.

Yang, B.; Zou, Y.; Liu, F.; and Zhang, C. 2021. Non-Autoregressive Coarse-to-Fine Video Captioning. In Proc. AAAI Conf. Artif. Intell., 3119–3127.

Ye, H.; Li, G.; Qi, Y.; Wang, S.; Huang, Q.; and Yang, M.-H. 2022a. Hierarchical Modular Network for Video Captioning. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 17939–17948.

Ye, M.; Li, H.; Du, B.; Shen, J.; Shao, L.; and Hoi, S. C. H. 2022b. Collaborative Refining for Person Re-Identification with Label Noise. IEEE Trans. Image Process., 31: 379–391.

Zhang, J.; and Peng, Y. 2019. Object-Aware Aggregation with Bidirectional Temporal Graph for Video Captioning. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 8327–8336.

Zhang, Z.; Shi, Y.; Yuan, C.; Li, B.; Wang, P.; Hu, W.; and Zha, Z. 2020. Object Relational Graph with Teacher-Recommended Learning for Video Captioning. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 13275–13285.

Zheng, Q.; Wang, C.; and Tao, D. 2020. Syntax-Aware Action Targeting for Video Captioning. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 13093–13102.

Zhu, H.; Yuan, J.; Yang, Z.; Zhong, X.; and Wang, Z. 2022. Fine-Grained Fragment Diffusion for Cross Domain Crowd Counting. In Proc. ACM Int. Conf. Multimedia, 5659–5668.