
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE

LoCATe-GAT: Modeling Multi-Scale Local Context and Action Relationships for Zero-Shot Action Recognition

Sandipan Sarma, Divyam Singal, and Arijit Sur, Member, IEEE

Abstract—The increasing number of actions in the real world makes it difficult for traditional deep-learning models to recognize unseen actions. Recently, pretrained contrastive image-based visual-language (I-VL) models have been adapted for efficient "zero-shot" scene understanding. Pairing such models with transformers to implement temporal modeling has been rewarding for zero-shot action recognition (ZSAR). However, the significance of modeling the local spatial context of objects and action environments remains unexplored. In this work, we propose a ZSAR framework called LoCATe-GAT, comprising a novel Local Context-Aggregating Temporal transformer (LoCATe) and a Graph Attention Network (GAT). Specifically, image and text encodings extracted from a pretrained I-VL model are used as inputs for LoCATe-GAT. Motivated by the observation that object-centric and environmental contexts drive both distinguishability and functional similarity between actions, LoCATe captures multi-scale local context using dilated convolutional layers during temporal modeling. Furthermore, the proposed GAT models semantic relationships between classes and achieves a strong synergy with the video embeddings produced by LoCATe. Extensive experiments on four widely-used benchmarks – UCF101, HMDB51, ActivityNet, and Kinetics – show we achieve state-of-the-art results. Specifically, we obtain relative gains of 3.8% and 4.8% on these datasets in conventional and 16.6% on UCF101 in generalized ZSAR settings. For large-scale datasets like ActivityNet and Kinetics, our method achieves a relative gain of 31.8% and 27.9%, respectively, over the previous methods. Additionally, we gain 25.3% and 18.4% on UCF101 and HMDB51 as per the recent "TruZe" evaluation protocol.

Index Terms—Zero-shot learning, action recognition, transformer, graph attention network.

Received 27 June 2024; revised 27 August 2024; accepted 6 October 2024. (Corresponding author: Sandipan Sarma.) Sandipan Sarma and Arijit Sur are with the Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, Guwahati 781039, India (e-mail: [email protected]; [email protected]). Divyam Singal is with the Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, Guwahati 781039, India (e-mail: [email protected]). Recommended for acceptance by Y. Wu. Digital Object Identifier 10.1109/TETCI.2024.3499995

I. INTRODUCTION

The ability of humans to analyze motion patterns and perceive an action subsequently follows their understanding of object recognition. In the last decade or so, computer vision as a research area has improved in leaps and bounds following the deep learning revolution, with CNNs surpassing human performance in object recognition [1], [2], [3]. Naturally, video understanding has been gaining immense attention recently, where human action recognition (HAR) in videos is among the front-runners [4], [5]. Realizing that machines mimicking human intelligence for HAR can have potentially explosive real-world applications, several studies have been conducted in areas like video surveillance, autonomous driving, sports analysis, and others. However, these methods are restricted by the curse of deep learning – lack of large-scale annotated training data.

Zero-shot learning (ZSL) [6], [7] approaches have recently emerged to alleviate this problem. Contrary to supervised learning, where all categories to be recognized are predefined, a zero-shot paradigm learns from the visual data of only a few seen classes during training. At test time, it achieves knowledge transferability from the seen to unseen domains/classes. Beyond methods based on generative models [8], [9], knowledge graphs [10], [11], and transformers [12], a new line of work based on large-scale pretraining of contrastive image-based visual-language (I-VL) models [13], [14] seems very promising in addressing zero-shot HAR. The impressive performance of these models across a wide set of vision tasks stems from a strong representation learning using millions of image-text pairs publicly available on the web, allowing even zero-shot capabilities [13]. Consequently, a few recent works have tried efficiently adapting these I-VL models to downstream video tasks. In the context of HAR, learning video-specific/task-specific prompt vectors to obtain better discriminative text representations for action classes and temporal modeling [15], [16] using transformers [17], [18] have been explored recently. However, transformers usually have a limited ability to consider local spatial context, which could be a major hindrance in contextual information flow during temporal modeling. When distinguishing two actions, the object of interaction can play an important role even if the spatiotemporal motion trajectories of the body parts are relatively similar (e.g., throwing a hammer vs. discus in Fig. 1(a)). Even for differentiating actions independent of any object of interest, an additional semantic extracted from video frames can be useful (e.g., understanding "the presence of water" can differentiate skijet and horse riding in Fig. 1(b)). On the other hand, humans also tend to interact similarly with functionally-similar objects (punching bags and speed bags in Fig. 1(c)) and environments


(water bodies in Fig. 1(d)). Therefore, our intuition is that local contextual cues from objects and environments play a dual role – fuelling both distinguishability and functional similarity. Overlooking this dual role can be detrimental to zero-shot HAR, which relies on efficient knowledge transfer from seen to unseen actions.

Fig. 1. Illustration of how object-centric and environmental context plays a dual role in zero-shot HAR – (a) similar spatiotemporal action differentiation based on object dependency (in red labels); (b) differentiation by action environment; (c) human behavior to functionally-similar objects (in blue labels); (d) human behavior in similar environments.

In a multimodal approach like zero-shot, how well we capture class relationships plays a vital role in how strong the semantic bridge between seen and unseen classes will be. Knowledge graphs (KGs) have proven helpful as a structured medium for modeling prior class knowledge, and graph neural networks (GNNs) have been exploited in the recent past that leverage these KGs for transferring seen knowledge to unseen [10], [19], [20]. By design, the input graph and the initial node representations for such GNNs heavily impact their learning [11]. Moreover, the dynamic relationships of a node with its neighbors must be scrutinized.

To leverage frame-wise local context and establish useful temporal dependencies with object-centric and environmental cues, we propose a novel temporal transformer called Local Context-Aggregating Temporal transformer (LoCATe) that builds on top of a CLIP [13] image encoder. The powerful frame-wise visual representations from CLIP provide a solid foundation for us to capitalize on and look for contextual cues and their inter-frame correlations. Specifically, with visual features from CLIP as input, LoCATe uses a transformer encoder consisting of Layer Normalization and Multi-head Self-Attention layers to capture global spatiotemporal dependencies. However, instead of passing the attention outputs through a simple feedforward network like previous works that do not pay attention to local context [17], [18], we propose to use a novel Local Context Aggregator (LCA) block. An LCA block encases a hierarchical convolutional setup wherein three different convolutional branches try to capture frame-wise local spatial context at multiple scales. The outputs of each branch are further refined using Convolutional Block Attention Modules (CBAMs [21]), summarized using Global Average Pooling layers, and finally concatenated to provide us video embeddings enriched with multi-scale temporal context.

To model action class relationships, we build a knowledge graph (KG) with the action classes as its nodes. Existing graph-based methods have tried using word embeddings [10], [19], [22] like Word2Vec [23] and GloVe [24] for initializing the node representations while working in the zero-shot setting. However, action names can be defined using multiple words as well, such as apply eye makeup, in which case an average word embedding for all words in the class name is used as node representation. Ghosh et al. [20] show that this approach does not always capture the correct relationships between classes and demonstrate improvements using Sent2Vec embeddings [25] instead. On the contrary, we feed a handcrafted prompt to the text encoder of CLIP for every action class, and the output representations initialize the class nodes in our KG. Then, a Graph Attention Network (GAT [26]) is trained to model the relationships of a node with its neighbors via an attention mechanism. Contrary to the widely-used GCNs [27] for knowledge transfer [11], GATs consider different statistical strengths for different neighbors of a node, producing better embeddings and achieving substantial improvements over GCN.

To summarize, the following are our main contributions:

• We propose a novel temporal transformer called LoCATe that looks for frame-wise local spatial context and aggregates it to account for temporal dependencies to perform zero-shot HAR in videos. We feed visual embeddings obtained from CLIP to LoCATe, which has a hierarchical structure to capture multi-scale context.

• We use a Graph Attention Network (GAT) for modeling action relationships semantically and demonstrate two aspects in zero-shot HAR: 1) the importance of node initialization using powerful textual representations from CLIP, and 2) the strong synergy achieved between CLIP-based temporal modeling and attention modeling for action classes using GAT, which provides an improved alignment between visual-semantic spaces.

• Extensive experiments show that our framework has the ability to tackle the polysemy of action verbs. We evaluate our framework on four challenging zero-shot action recognition benchmarks – UCF101, HMDB51, ActivityNet, and Kinetics – and achieve significant gains in accuracy over the existing state-of-the-art methods.

II. RELATED WORK

Zero-shot action recognition (ZSAR): The initial works in ZSAR extended the idea of zero-shot image classification [6], where a strong alignment between visual and semantic embeddings is a straightforward way of attaining generalizability to unseen classes while handling domain shift [28], [29]. Recently, skeleton data has been used as a reference for estimating the pose of a human for distinguishing actions, and several works [22], [30] have demonstrated the benefit of skeleton data as auxiliary information for ZSAR. However, human pose might misguide a zero-shot learner due to the polysemy of action verbs. For example, it is challenging to transfer pose cues learned from a seen action billiards shot to an unseen action like cricket shot (Fig. 2). Hence, we broadly identify two groups of works that


take a more scalable and zero-shot-compatible approach toward disambiguating actions.

Fig. 2. Polysemy of action verbs (shown in red). The human pose may differ significantly depending upon a primary object of interest (top row), the environment/field of activity (middle row), or a mixture of both (bottom row).

The first group seeks better attribute representations of action classes. Manually-defined attributes [31] and word embeddings [28], [32] have been the usual choice for representing action semantics for a long time. Some works try leveraging additional context by modeling linguistic attributes of verbs [33] and using synonyms of actions to supplement their semantic representations [34]. Others use textual descriptions to enrich word embeddings [35] or create their own elaborative descriptions [36], [37], [38] and object-action/sentence-action semantic descriptors [39], but are not scalable to large datasets as they require human annotators. On the contrary, we use the CLIP text encoder to get semantics and require only the action class name as input, achieving scalability.

The second group emphasizes obtaining better visual representations and learning from seen data to augment unseen data. The existing methods predominantly take a generative approach, addressing several challenges in video feature generation, such as heterogeneity gap [8], [40], hubness problem [41], semantic consistency [42] and reducing the effect of inherent bias towards seen classes [9], [43]. Recently, space-time attention [44] has been leveraged in [45] for joint visual-text representation learning in transformers, and an end-to-end trainable transformer network called ResT [12] aims to offer better generalizability. However, their temporal modeling remains sub-optimal, confusing similar actions with subtle muscle movements like chewing and laughing.

Visual-language pretraining: Although I-VL models [13], [14] have shown impressive zero-shot transferability of prior knowledge, acquiring a large amount of labeled video-text pairs for training them is a significant obstacle in HAR. Recently, ActionCLIP [46] became the first to introduce a pretrain, prompt and finetune paradigm in HAR. Ju et al. [17] additionally attach a temporal transformer and formulate several video understanding tasks under the same umbrella. X-CLIP [18] builds a cross-frame transformer for frame-level message passing and then integrates them to get the video features. However, they ignore the local spatial context in the frames while establishing temporal dependencies. We address this using our novel transformer LoCATe (refer to Section III-C).

Graph Neural Networks: Since their inception, knowledge graphs (KGs) have been extensively used to structure prior knowledge and build relationships between concepts [47], [48]. Several works incorporate auxiliary skeleton information [19], [49], [50] to boost performance, but this hinders their scalability. GCNs [27] have been the go-to networks for spatiotemporal modeling of actions [22], where a convolution operation aggregates information from node neighbors, calculating static weights based on node degrees. [10], [11], [20], [22], [51], [52] use GCNs for modeling semantic relationships between actions. In this work, we highlight our use of a Graph Attention Network [26] (GAT) that aims to improve upon GCN-based action representations by incorporating multi-head attention over the node features. As per [20], initial node representation drives the learning in the later stages. Hence, we initialize the nodes with text embeddings from CLIP, achieving better synergy with visual features produced by LoCATe.

III. APPROACH

A. Problem Formulation

In ZSAR, we are given a training set D_train = {(v_i, l_i) | v_i ∈ V_seen, l_i ∈ S}, where v_i is a video clip of a seen class l_i. A separate set of novel (unseen) data D_novel = {(v_j, l_j) | v_j ∈ V_unseen, l_j ∈ U} is given such that the sets of seen and unseen classes are disjoint, i.e., S ∩ U = ∅. We work in the more realistic inductive ZSAR setting – where video samples of unseen classes are unavailable during training – instead of the transductive setting [19], [53]. Then, the task in conventional ZSAR (CZSAR) is to learn a classifier:

f_zsl : V_unseen → U    (1)

For generalized ZSAR (GZSAR), a small subset of V_seen (V_seen^sub) is used as the set of seen samples at test time. The objective changes to learning a classifier:

f_gzsl : V_seen^sub ∪ V_unseen → S ∪ U    (2)

B. System Overview

Inspired by the recent success of I-VL models adapted to the ZSAR task [17], [18], [46], we propose to use CLIP [13] to get video and text encodings. Specifically, during training, given a video clip v_i ∈ R^(T×H×W×C) of T sampled frames, each of spatial resolution H × W with C channels, we feed it into the image encoder of CLIP Φ_vis(.), obtaining frame-wise visual encodings:

Ω(v_i) = [ω_1, ω_2, . . ., ω_T] = Φ_vis(v_i),    (3)

where v_i ∈ V_seen and ω_i ∈ R^d.
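For readers who want to make the frame-wise encoding of (3) and the prompt-based class encodings concrete, the following is a minimal sketch, assuming the open-source CLIP package released with [13] and the ViT-B/16 backbone reported later in Section IV-B. The frame sampler, PIL-image inputs, and L2 normalization of the image features are illustrative assumptions rather than the authors' released pipeline.

# Minimal sketch: frame-wise CLIP encodings (Eq. 3) and prompt-based class
# encodings; assumes the open-source "clip" package released with [13].
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)   # frozen encoders, d = 512
model.eval()

@torch.no_grad()
def encode_frames(frames):
    """frames: list of T PIL.Image frames sampled from one video clip."""
    batch = torch.stack([preprocess(f) for f in frames]).to(device)  # (T, 3, 224, 224)
    omega = model.encode_image(batch)                                # (T, d): [w_1, ..., w_T]
    return omega / omega.norm(dim=-1, keepdim=True)                  # normalization: assumption

@torch.no_grad()
def encode_classes(class_names):
    """Handcrafted prompt per class, e.g. 'A video of a person apply eye makeup'."""
    prompts = clip.tokenize([f"A video of a person {c}" for c in class_names]).to(device)
    psi = model.encode_text(prompts)                                 # (|S|+|U|, d)
    return psi / psi.norm(dim=-1, keepdim=True)

Both encoders stay frozen throughout, so these calls can be precomputed once per dataset.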


Fig. 3. Proposed framework. Given video frames as input, a CLIP image encoder produces the frame encodings. Our novel temporal transformer takes these, along with temporal positional encodings, as input and passes them down the encoder structure. The intermediate output goes through an LCA block, where local context is aggregated from multiple scales. Temporal averaging produces video embeddings. Additionally, semantic relationships between actions are learned using a graph attention network (GAT). Visual-semantic alignment of video and semantic embeddings using a cosine similarity enables ZSAR.

Furthermore, to generate text encodings for a class, the text encoder of CLIP Φ_text(.) needs a prompt template [46] as input (further explained in Section III-D1). These constitute the set of semantic embeddings:

Ψ = {ψ(l_k) = Φ_text(prompt(l_k)) | l_k ∈ S ∪ U}    (4)

The proposed framework has two main components, as shown in Fig. 3. The first is a visual branch for context-aware temporal modeling of the frame-wise video encodings Ω(v_i) using LoCATe, which captures multi-scale local spatial context and aggregates them from the temporal dimension. The second is a semantic branch for modeling action relationships using GAT [26] that models a node's relationships with its neighbors via a multi-head attention mechanism and learns seen-unseen associations. In the following sections, we provide an in-depth discussion of the two components and how they contribute to visual-semantic alignment.

C. Visual Branch: Context-Aware Temporal Modeling

Attention mechanisms have proven to be useful in image classification for extracting discriminative information and form the core of transformer-based models [1]. Recently, some transformer-based models [13], [54] pretrained in a multimodal fashion with visual-language data (I-VL models) have shown promising zero-shot generalizability in areas like image classification. However, videos pose a bigger challenge. Firstly, unlike images, large-scale labeled video-text pairs for pretraining I-VL models are hard to collect and can incur enormous computation and memory costs. And secondly, the semantic meaning of a video clip is engraved within multiple individual frames and the spatiotemporal correlations between them. Recently, a few works [17], [18] have successfully adapted these I-VL models to action recognition by focusing on temporal modeling via transformers. Such transformer encoders usually consist of multi-head self-attention (MHSA), layer normalization (LN), and MLPs. But the inputs to them are meticulously obtained frame-level visual encodings from a previous stage of the framework, and hence these transformers just act as an aggregator of frame-level encodings. Moreover, their capabilities are limited when it comes to capturing local spatial context, which could be detrimental to ZSAR. Object-centric and environmental contexts can influence both distinguishability and functional similarity between actions (Fig. 1), eventually assisting in knowledge transfer from seen to unseen actions.

In our visual branch, after extracting frame-level visual encodings for a video v_i, we pass them through our Local Context-Aggregating Temporal Transformer (LoCATe) that performs self-attention operations. Moreover, instead of using an MLP like traditional transformers, we propose a Local Context Aggregator (LCA) block to leverage multi-scale local context from the frame-level encodings. For LoCATe, we first prepare our input embeddings as:

x_i = Ω(v_i) + ρ_temp    (5)

where ρ_temp denotes temporal positional encodings, which are learnable vectors. These embeddings pass through layer normalization and MHSA:

x̃_i = x_i + MHSA(LN(x_i))
x̂_i = LN(x̃_i)    (6)

Having accounted for global spatiotemporal attention via MHSA within the tokens x̂_i, we reshape them to 2D feature maps and pass them through an LCA block that has three separate convolutional branches (Fig. 4). Each branch consists of a 1 × 1 convolutional (CONV) layer, followed by a 3 × 3 dilated CONV layer. Different dilation factors in different branches allow us to tune the receptive field sizes, capturing local context from multiple scales. The output feature maps are refined via a Convolutional Block Attention Module (CBAM [21]), which consists of spatial and channel attention blocks. From a frame's perspective, channel attention discovers the contribution of a feature map during learning, while spatial attention looks for the essential signals to learn from that feature map, hence refining the output features as a whole. We then use another 1 × 1 CONV to obtain a desired number of feature maps and pass them through a global average pooling (GAP) layer for summarizing the resulting spatial information. We ensure that the output from each branch, when concatenated, yields d-dimensional embeddings such that they can be passed on to the next encoder layer. Averaging the final embeddings along the temporal dimension produces the video embeddings for v_i, i.e., v_i^out ∈ R^d.
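To make the LCA structure concrete, the following is a minimal PyTorch sketch, not the authors' implementation. Assumptions: the attended tokens have already been reshaped into per-frame 2D feature maps of shape (B*T, C, H, W); CBAM is replaced by a lightweight channel-plus-spatial attention stand-in (SimpleCBAM); and a final linear projection maps the concatenated branch outputs to d dimensions, whereas the paper sizes the branches so that the concatenation itself is d-dimensional.

# Hedged sketch of an LCA-style block (illustrative, not the released code).
import torch
import torch.nn as nn

class SimpleCBAM(nn.Module):
    """Lightweight stand-in for CBAM's channel + spatial attention refinement."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        ca = torch.sigmoid(self.channel_mlp(x.mean(dim=(2, 3)))).view(b, c, 1, 1)
        x = x * ca                                               # channel attention
        sa_in = torch.cat([x.mean(dim=1, keepdim=True),
                           x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(sa_in))      # spatial attention

class LCABlock(nn.Module):
    """Three parallel branches with dilations 1, 2, 4, refinement, GAP, concat (Fig. 4)."""
    def __init__(self, in_channels, d_model=512, dilations=(1, 2, 4)):
        super().__init__()
        branch_dim = d_model // len(dilations)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, branch_dim, kernel_size=1),
                nn.Conv2d(branch_dim, branch_dim, kernel_size=3, padding=d, dilation=d),
                SimpleCBAM(branch_dim),
                nn.Conv2d(branch_dim, branch_dim, kernel_size=1),
                nn.AdaptiveAvgPool2d(1),                        # global average pooling
            ) for d in dilations])
        # Assumption: project concatenated branches to d_model dims; the paper
        # instead sizes the branches so the concatenation is already d-dimensional.
        self.proj = nn.Linear(branch_dim * len(dilations), d_model)

    def forward(self, x):                                       # x: (B*T, C, H, W)
        outs = [branch(x).flatten(1) for branch in self.branches]
        return self.proj(torch.cat(outs, dim=1))                # (B*T, d_model)

A full LoCATe encoder block would wrap this module with the layer normalization and MHSA of (5) and (6), and average the per-frame outputs over the T frames to obtain v_i^out.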


Fig. 4. The novel LCA block. Each branch captures context at a different scale by controlling the effective receptive field using dilation factor d. CBAM [21] performs essential feature refinement. GAP = Global Average Pooling.

D. Semantic Branch: Modeling Action Relationships

Text encodings that represent action class semantics inherently establish a relationship between classes. However, we construct a knowledge graph (KG) and use a Graph Attention Network (GAT) to emphasize the associative strength between different actions.

1) Constructing the KG: Instead of using off-the-shelf semantic networks like ConceptNet to build KGs [10] or relying on skeleton-based KGs [22], [49], we take a more scalable approach. Inspired by Ghosh et al. [20], we build a KG with one node for each action class l_k of a dataset, where l_k ∈ S ∪ U. Additionally, while experimenting on a dataset other than Kinetics, we add nodes corresponding to classes in Kinetics-400 [55] in the KG of every dataset since previous works [20], [28] have shown advantages of augmenting classes from other large-scale datasets in the KG. To initialize the node features, previous graph-based approaches have mostly used word embeddings. For example, [10] uses Word2Vec [23], [22] uses GloVe [24], and [20] uses Sent2Vec [25] embeddings. On the contrary, we use one of the standard handcrafted prompt templates [13] for an action class l_k – "A video of a person [l_k]" – obtaining semantic embeddings ψ(l_k) from CLIP and normalizing them. Initializing the action nodes with these embeddings provides a two-fold advantage. Firstly, there is no need to average word vectors for every word to get embeddings for multi-word action class names like "Apply eye makeup". Ghosh et al. [20] have previously illustrated that the averaging approach [10], [19], [56] fails to capture the correct relationships between actions. And secondly, since graph-based approaches rely heavily on the input graph and its node representations, using ψ(l_k) instead of other embeddings like Sent2Vec [11], [20] proves to be complementary to the video embeddings from LoCATe, as we show in the ablations. Edge formations are done following KG1 [20], where node i is connected to j if j is among the top N_G closest neighbors of i, as per the cosine similarity of their node features:

Θ(ψ(l_i), ψ(l_j)) = (ψ(l_i) · ψ(l_j)) / (||ψ(l_i)||_2 · ||ψ(l_j)||_2)    (7)

We also make the resulting adjacency matrix A symmetric and add self-loops for every node.

2) Learning Semantic Associations: Instead of summing the neighborhood features like GCN [27] or simply averaging over them, GAT [26] employs multi-head attention to aggregate the dynamic relationships of a node with its neighborhood. To compute the node embedding h_i^(l+1) of layer (l + 1) from the embeddings of layer l, GAT computes unnormalized pair-wise attention coefficients c_ij^(l) using the transformed embeddings z_i^(l) of adjacent nodes i and j:

z_i^(l) = W^(l) · h_i^(l)
c_ij^(l) = LeakyReLU(δ^(l) · (z_i^(l) ∥ z_j^(l)))    (8)

Here, W^(l) is a learnable weight matrix, ∥ is a concatenation operation, and δ^(l) is a learnable weight vector. For a neighborhood N_i of node i, softmax is applied to normalize the attention coefficients:

α_ij^(l) = exp(c_ij^(l)) / Σ_{k∈N_i} exp(c_ik^(l))    (9)

Finally, the outputs are aggregated from M different attention heads:

h_i^(l+1) = ∥_{m=1}^{M} σ( Σ_{j∈N_i} α_ij^(l,m) · z_j^(l,m) ),    (10)

where α_ij^(l,m) and z_j^(l,m) are the attention coefficients and transformed embeddings from the m-th head and σ is a non-linear activation function. The output node features h_i^out are mapped to d-dimensions.
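The semantic branch can be summarized with the short sketch below: graph construction per (7) and one dense multi-head GAT layer implementing (8) to (10). It is an illustrative, dense-adjacency implementation under stated assumptions (ELU as the non-linearity σ, no dropout, a boolean adjacency with self-loops), not the authors' code.

# Hedged sketch: KG adjacency from CLIP text embeddings (Eq. 7) and one
# dense multi-head GAT layer (Eqs. 8-10). N nodes, F-dimensional inputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_adjacency(psi, n_g=5):
    """psi: (N, d) L2-normalized class embeddings. Top-n_g cosine neighbors,
    symmetrized, with self-loops (boolean adjacency matrix)."""
    sim = psi @ psi.t()                                   # cosine similarities (Eq. 7)
    sim.fill_diagonal_(-float("inf"))                     # ignore self when picking neighbors
    idx = sim.topk(n_g, dim=1).indices                    # (N, n_g)
    adj = torch.zeros_like(sim, dtype=torch.bool)
    adj.scatter_(1, idx, torch.ones_like(idx, dtype=torch.bool))
    adj = adj | adj.t()                                   # make symmetric
    adj.fill_diagonal_(True)                              # add self-loops
    return adj

class DenseGATLayer(nn.Module):
    def __init__(self, in_dim, out_dim, heads=4):
        super().__init__()
        self.heads, self.out_dim = heads, out_dim
        self.W = nn.Linear(in_dim, heads * out_dim, bias=False)     # W^(l)
        self.attn = nn.Parameter(torch.empty(heads, 2 * out_dim))   # delta^(l), per head
        nn.init.xavier_uniform_(self.attn)

    def forward(self, h, adj):                            # h: (N, in_dim), adj: (N, N) bool
        n = h.size(0)
        z = self.W(h).view(n, self.heads, self.out_dim)   # z_i^(l), one slice per head
        # c_ij = LeakyReLU(delta . [z_i || z_j]), split into source/target halves (Eq. 8).
        a_src = (z * self.attn[:, :self.out_dim]).sum(-1)            # (N, heads)
        a_dst = (z * self.attn[:, self.out_dim:]).sum(-1)            # (N, heads)
        c = F.leaky_relu(a_src.unsqueeze(1) + a_dst.unsqueeze(0))    # (N, N, heads)
        c = c.masked_fill(~adj.unsqueeze(-1), -float("inf"))         # keep neighbors only
        alpha = torch.softmax(c, dim=1)                              # Eq. 9
        out = torch.einsum("ijh,jhf->ihf", alpha, z)                 # weighted aggregation
        return F.elu(out.reshape(n, self.heads * self.out_dim))      # concat heads (Eq. 10)

In the full model, three such layers (with 4, 4, and 6 heads, as reported in Section IV-B) map the CLIP-initialized node features to the d-dimensional outputs h_i^out that the alignment stage of Section III-E consumes.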


E. Model Training and Inference

In our two-stage training process (Fig. 3), we first train the transformer LoCATe with seen data. Once trained, we map the v_i^out ∈ R^d to |S| outputs and use it as a classifier with weight matrix W_LoCATe ∈ R^(d×|S|). Next, we train the GAT and take the output node features corresponding to seen classes h^out ∈ R^(d×|S|). Visual-semantic alignment is learned by optimizing a mean squared error loss:

L_MSE = ||W_LoCATe − h^out||^2    (11)

At test time, a batch of B video samples are passed through LoCATe to obtain their video embeddings v^out ∈ R^(B×d). Node features corresponding to the unseen classes h_j^out ∈ R^(d×|U|) (j ∈ U) are extracted from the trained GAT, and a cosine similarity is utilized to find the similarity between a video embedding v_i^out and the semantic embeddings:

Θ(v_i^out, h_j^out) = (v_i^out · h_j^out) / (||v_i^out||_2 · ||h_j^out||_2)    (12)

The most similar class is assigned as the test label for a given video v_i. In the case of generalized ZSAR, node features corresponding to a node j ∈ S ∪ U are considered for computing similarity scores.

IV. EXPERIMENTS

In this section, we conduct experiments on four benchmark datasets in two different settings – conventional ZSAR (CZSAR) and generalized ZSAR (GZSAR). Furthermore, several ablation studies illustrate our framework's efficacy.

A. Datasets and Evaluation Protocols

There are two datasets predominantly used by previous works for experiments in ZSAR – UCF101 [68] and HMDB51 [69]. They consist of 13,320 and 6,766 videos from 101 and 51 classes, respectively. Moreover, a few works have recently demonstrated model performance on large-scale datasets like ActivityNet-v1.3 [70] and Kinetics [55], [71]. ActivityNet-v1.3 contains 19,994 videos from 200 activity categories. Meanwhile, the Kinetics-400 [55] dataset contains around 240,000 videos in its training set. Unlike zero-shot image classification [6], disjoint seen-unseen splits for these datasets in ZSAR have not been standardized yet. The most widely used ones are several 50%/50% random splits of the dataset classes into seen and unseen since obtaining large enough sets for both learning and evaluation is feasible. We follow [36] and use the 50%/50% splits for a fair comparison with a wide array of existing works. Hence, we use half the classes as "seen" (for training) and the rest as "unseen" classes for testing in the CZSAR setting. However, we follow [36] for the "seen-unseen splits" of the Kinetics benchmark. As per [36], the training set of the Kinetics-400 [55] dataset is used as "seen data" (i.e., 400 "seen" classes). For testing, three independent sets of 160 classes from Kinetics-600 [71] are used as "unseen classes".

While evaluating our method in the CZSAR setting, we report the top-1 accuracy for the unseen classes. In the GZSAR setting, we compute average top-1 accuracy for seen classes (S_acc) and unseen classes (U_acc) and report the harmonic mean [6] of these two as the primary evaluation metric.

Recently, a new evaluation protocol named TruZe [72] has emerged that redefines the sets of seen and unseen classes for UCF101 and HMDB51. It provides a new set of 70/29 seen and 31/22 unseen classes for the UCF101/HMDB51 datasets.

B. Implementation Details

All experiments are performed on a single NVIDIA A100 80 GB GPU based on a PyTorch framework.

1) Visual Branch: Our frame sampling strategy follows [12], [32], wherein we prepare T = 16 frames for a video clip. These frames are processed to have a spatial dimension of 224 × 224. Following the observation of ActionCLIP [46] that ViT-B/16 provides the best parameter-accuracy balance, we adopt the same backbone for the image and text encoders of our pretrained CLIP. Throughout the framework, both of these encoders are kept frozen, and the output image and text encodings have a dimension d = 512. The transformer LoCATe is initialized with a Gaussian distribution N(0, 0.02), where the positional encoding ρ_temp ∈ R^(T×d). The MHSA block has 8 attention heads, and no dropout layer has been used inside the block. An LCA block within the transformer encoder contains three branches b_1, b_2, and b_3 for capturing multi-scale context. Each branch b_i has a 3 × 3 dilated convolution layer with dilation factor d_i (values 1, 2, 4 for i = 1, 2, 3 respectively). LoCATe is trained using an Adam optimizer with a learning rate of 2 × 10^-7 and a batch size of 22 videos. Following previous work [12], [32], we use one clip during training and 25 clips during testing per video. Our best results are achieved using one transformer encoder block. Our transformer is trained for 28 epochs for the HMDB51 dataset and 16 epochs for the other datasets.

2) Semantic Branch: With one node per action class, the number of nodes in our KG while experimenting with the Kinetics dataset is |S| + |U|. However, while experimenting with any dataset other than Kinetics, the number of nodes equals |S| + |U| + |K_400|, where K_400 is the set of classes from Kinetics-400 [55] during KG construction. Following previous works [11], [20], [28], we also reap the benefits of using these additional nodes for capturing richer class relationships but do not utilize them while learning visual-semantic alignment. After being initialized by d-dimensional text encodings from CLIP, edges are formed between these nodes with N_G = 5. We apply a three-layer GAT model [26], with the first two layers having 4 and the last layer having 6 attention heads. We train it with an Adam SGD optimizer having a learning rate of 10^-3 for 100000 epochs. To prevent overfitting, we use dropout layers with a rate of 0.6.

C. Zero-Shot Results

1) Conventional ZSAR: We compare our method with an extensive array of previous works. Here, our focus is only on the inductive methods since they strictly follow the zero-shot criteria of not using any unseen class visual data during training. Table I shows that our proposed method, LoCATe-GAT, consistently improves over the state-of-the-art (SOTA). We achieve relative accuracy gains of 3.8% and 4.8% on UCF101 and HMDB51 over the SOTA, X-FLO [18]. On large-scale datasets, our LoCATe-GAT outperforms previous methods significantly by obtaining relative gains of 31.8% and 27.9% on ActivityNet and Kinetics, respectively (Table II).

2) Generalized ZSAR: Being the realistic but more difficult setting in zero-shot literature, only a few methods present their model's efficacy in the GZSAR setting. A few methods [19], [29], [47], [53] attempt to enhance model generalizability using a transductive approach (training with unlabeled unseen class visual samples in addition to seen class samples). In Table III, we show that our (inductive) method beats these transductive methods too. Existing works provide GZSAR results only on UCF101 and HMDB51 datasets. On UCF101, we beat the closest method [53] considerably by achieving a relative gain of 16.6%. On HMDB51, our results are lower but comparable to the SOTA. One reason behind this lower performance could be that many actions in HMDB51, like sit, walk, and laugh, are insensitive to object-centric or environmental context, the fundamental motivation behind our novelty.
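To illustrate how the prediction rule of (12) and the GZSAR harmonic-mean metric of Section IV-A are computed, here is a small sketch. The function names are illustrative assumptions, and for brevity accuracy is taken over samples within each class group rather than averaged per class as in the formal protocol.

# Hedged sketch: cosine-similarity classification over seen+unseen classes (Eq. 12)
# and the harmonic mean of seen/unseen accuracies used for GZSAR.
import torch
import torch.nn.functional as F

def predict(video_emb, node_feats):
    """video_emb: (B, d) LoCATe outputs; node_feats: (|S|+|U|, d) GAT node features."""
    sim = F.normalize(video_emb, dim=-1) @ F.normalize(node_feats, dim=-1).t()
    return sim.argmax(dim=1)                          # most similar class per video

def gzsar_harmonic_mean(preds, labels, seen_ids):
    seen_ids = torch.as_tensor(sorted(seen_ids))
    is_seen = torch.isin(labels, seen_ids)
    s_acc = (preds[is_seen] == labels[is_seen]).float().mean()      # S_acc
    u_acc = (preds[~is_seen] == labels[~is_seen]).float().mean()    # U_acc
    return 2 * s_acc * u_acc / (s_acc + u_acc)                      # harmonic mean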


TABLE I: CZSAR PERFORMANCE ON UCF101 AND HMDB51 BENCHMARKS
TABLE II: CZSAR PERFORMANCE ON ACTIVITYNET AND KINETICS
TABLE III: GZSAR PERFORMANCE ON UCF101 AND HMDB51
TABLE IV: CZSAR PERFORMANCE ON UCF101 AND HMDB51 FOLLOWING THE TRUZE PROTOCOL
TABLE V: COMPONENT-WISE ANALYSIS IN THE CZSAR SETTING FOR UCF-101

3) Evaluation on TruZe Protocol: The recent work TruZe [72] claims that their newly proposed seen-unseen splits are fairer but more restrictive. They justified this by comparing the top-1 accuracy for a wide range of ZSAR models using random 50%/50% splits and TruZe splits and found that all models perform relatively poorly with TruZe splits. Table IV presents the CZSAR performance of several models as reported by [37], [72]. Despite harder splits, LoCATe-GAT outperforms the SOTA [37] by attaining a whopping relative gain of 25.3% on UCF101. On HMDB51, LoCATe-GAT achieves a relative gain of 18.4% over the SOTA E2E [32].

D. Ablation Studies and Discussions

For a uniform evaluation, we perform our studies on the UCF101 dataset due to its diversity in terms of action classes. Moreover, there are large variations in camera motion, object appearance, pose, object scales, and viewpoint that are necessary to establish the robustness of our model components.

1) Effect of Visual and Semantic Branches: Temporal modeling is especially essential for video understanding, as shown by previous works on ZSAR [17], [18]. However, by additionally capturing object-centric and environmental contexts, we beat the SOTA method. Table V summarizes the contribution from the visual and semantic branches of our framework. We first evaluate our framework with the semantic branch only (i.e., without LoCATe), where visual embeddings directly from CLIP and semantic embeddings learned via GAT are used for computing


similarity scores at test time. In this setting, our result lies only behind the SOTA method X-FLO [18] (with 73.2%). Next, we evaluate using only the visual branch (i.e., without GAT). In this case, similarity scores are computed at test time between the video embeddings given by our temporal transformer LoCATe and the text encodings given by CLIP. Table V shows that LoCATe itself beats the SOTA [18] by 0.4% absolute gain in top-1 accuracy. Finally, by combining LoCATe and GAT, we gain an additional boost of 2.4%. This indicates that appreciating action relationships via GAT encourages a strong synergy between the video and semantic embeddings, improving their alignment that the text encodings of CLIP alone could not provide.

Fig. 5. Attention visualization with respect to a node of interest (in orange). The KG has nodes from UCF101+Kinetics-400, but neighbors (in yellow) only from UCF101 are shown to reduce clutter. Self-loops are omitted, and attention weights of our trained GAT for different neighbors are illustrated via the colored edges.

TABLE VI: IMPACT OF DIFFERENT SUB-COMPONENTS IN LOCATE ON TOP-1 ACCURACY (CZSAR SETTING ON UCF-101)

2) Model Complexity and Scalability: Table V shows that we achieve a good inference speed of 0.54 videos per second. Moreover, our model is quite lightweight with 5.85 million parameters – significantly lower than others like ActionCLIP [46] (141.7 million). As compared to ResT [12], which encounters 30.8 × 25 GFLOPs for a 25-clip inference, LoCATe-GAT achieves a relative gain of up to 38.9% over ResT with around 8× fewer GFLOPs (3.7 × 25). Thanks to the scalability of our method, we witness consistent improvements over the SOTA across all four evaluated datasets (including large datasets like Kinetics) and obtain similar inference speeds for all. No fine-tuning is needed for target video datasets, which saves time and makes our method very scalable to future datasets as well.

3) Improvements by Capturing Multi-Scale Context: To analyze the benefits of capturing local context from multiple scales, we consider reducing the number of branches in the LCA block (Fig. 4) to 2 and 1 (from the default 3). We first remove one branch with dilation factor d = 1 and then remove two branches with d = 1 and d = 2 to observe the effect of neglecting contextual cues from smaller scales. The decreasing accuracy with decreasing branches in Table VI conveys that capturing context at smaller scales is beneficial.

Fig. 6. Benefit of capturing local context. The small objects of interest are enclosed in yellow boxes. Action classes are taken from UCF-101.

For a class-level comparison, we randomly pick four actions involving small objects of interest and plot the top-1 accuracy with different architectures in Fig. 6. Even without GAT, a 3-branch LoCATe easily outperforms recent works like AURL [67] on most occasions. With a 3-branch LoCATe-GAT, state-of-the-art results are achieved in all cases. However, a LoCATe-GAT framework with fewer branches fares below a 3-branch LoCATe, indicating that only enhancing the semantic representations of actions is inadequate for ZSAR; multi-scale local context for empowering the visual embeddings is essential as well. In more detail, confusion matrices for different LCA branch setups of our method in Fig. 7 provide a visualization of ZSAR accuracy obtained on all unseen classes. It is observed that with more branches capturing context at smaller scales, the recognition ability mostly improves progressively for actions


involving small objects, such as knitting and nunchucks. A noteworthy case of an action involving a small object is the class yo-yo, which goes completely unrecognized by the 1-branch (Fig. 7(d)) and 2-branch (Fig. 7(e)) LCA-based setups, but with a 3-branch setup (Fig. 7(f)), it gets a positive boost in accuracy (specifically 10.93%).

Fig. 7. (Left to right): Confusion matrices for (a) AURL, (b) LoCATe (default), (c) LoCATe-GAT (default) without CBAM, (d) LoCATe-GAT (with 1 LCA branch), (e) LoCATe-GAT (with 2 LCA branches), and (f) complete LoCATe-GAT (default). Here, 'default' indicates a 3-branch LCA setting. Action classes are from UCF-101.

4) Dual Role of Object-Centric and Environmental Context: The distinguishability and functional similarity cues mined by object-centric and environmental contexts are difficult to evaluate and measure quantitatively since they are observational intuitions of our work. However, some useful insights into these concepts can be ascertained from the confusion matrices in Fig. 7. Similar to the action pairs shown in the introductory Fig. 1, several such pairs can be spotted among the unseen actions in Fig. 7, and their recognition accuracy in previous works like AURL [67] (Fig. 7(a)) can be compared to ours (Fig. 7(f)). For instance, actions band marching and military parade comprise similar spatiotemporal body motions, but the object of interaction can be a major factor in distinguishing them. AURL achieves the same accuracy as LoCATe-GAT on band marching (96.77%), but misclassifies military parade as band marching most of the time, classifying only 4.27% of its samples correctly. On the other hand, LoCATe-GAT achieves 88.8% accuracy for the same class. Reflecting a dual role, the functional similarity of objects like cello and violin (both string-based instruments) is also appreciated by our method, unlike AURL. Therefore, AURL fails miserably in consistently recognizing actions involving similar objects like playing cello (38.64% accuracy) and playing violin (8% accuracy), while LoCATe-GAT obtains consistent results (93.29% and 99% respectively). Meanwhile, contextual cues from the environment can also help in better understanding human behavior. One such indicator is the performance consistency achieved by LoCATe-GAT for water sport actions like kayaking (58.16%) and skijet (99%). However, AURL struggles in this case as well, with 0% accuracy in recognizing videos from skijet. Overall, it is evident from the diagonal entries of the confusion matrices in Fig. 7 that our method is superior to AURL [67]. More importantly, our motivation of exploiting the distinguishability and functional similarity of objects and action environments is highlighted by the improved consistency in predicting actions.

5) Role of CBAM in LCA: Table VI shows that the exclusion of CBAM modules from the LCA block yields a lower accuracy, suggesting that feature refinement using channel and spatial attention helps produce better video embeddings. Moreover, while comparing in a 3-branch LCA setting (Fig. 6), excluding CBAM consistently shows an inferior performance to both LoCATe and LoCATe-GAT frameworks with CBAM, proving its worth within the framework.

6) Different Text Representations: We compare the semantic representations learned by GAT when initialized with word embeddings (Word2Vec [23]), sentence embeddings


(Sent2Vec [25]), and CLIP text encodings [13] in Table VII. It is evident that CLIP text encodings are more powerful than word embeddings. Moreover, they are better as node initializers for our knowledge graph since we beat LGKT [11], which uses Sent2Vec [25], comprehensively (Table I) on CZSAR. In fact, Sent2Vec worsens the visual-semantic alignment with video embeddings obtained from LoCATe, yielding an accuracy of 30.1%. This is similar to a previous Sent2Vec-based framework [20] (32.8%) but is heavily outperformed using our framework based on CLIP text encodings by an absolute gain of 45.9%. Hence, learning from CLIP text encodings clearly has superior compatibility with our video embeddings.

TABLE VII: INFLUENCE OF DIFFERENT TEXTUAL REPRESENTATIONS OF ACTION CLASSES ON TOP-1 ACCURACY (CZSAR SETTING ON UCF-101)

Fig. 8. Comparison of LCA with the traditional MLP used in transformers with respect to (a) accuracy and (b) trainable parameters in the model.

7) Significance of LCA Over MLP: Traditionally, transformer encoder blocks employ MLPs as feed-forward networks (FFNs) that constitute about two-thirds of the total trainable parameters. These FFNs are position-wise functions, and [75] recently showed that they emulate neural memory and act as pattern detectors over an input sequence of textual data. However, with video frames (images), where spatial relationships are crucial, convolutional layers can play a similar role due to their advantageous compatibility with image data by design. Experimentally, we justify the merit of our LCA block (that has convolutional layers) in the transformer LoCATe as opposed to the traditional MLP from two aspects – classification accuracy and number of trainable parameters (Fig. 8). Using only LoCATe as our ZSAR framework, LCA achieves an absolute gain of 4% over MLP, with just about one-third of the number of trainable parameters of the MLP setting. With LoCATe-GAT as the ZSAR framework, LCA again overtakes MLP by 4.5% while using about 1.5 times fewer trainable parameters. Hence, using LCA makes the framework lightweight while improving accuracy.

8) Mitigating Polysemy: Several actions may contain the same verbs in the class name, making polysemy a possibility – e.g., playing, jump, and others. Recent works like AURL [67] rely simply on word embeddings for semantic descriptions and are unable to mitigate polysemy, yielding poor results for such classes. However, LoCATe produces much more distinguishable visual representations despite polysemy, owing to the powerful CLIP visual embeddings as inputs and accounting for action context. Using a GAT additionally induces even greater distinguishability in the semantic space, handling polysemy better and yielding massive improvements over recent methods like AURL. For instance, the red-marked region in Fig. 7(a) shows a clear confusion in action recognition by AURL because of polysemy in more than five action names containing the verb playing. Comparatively, our model (Fig. 7(b) and (f)) mitigates these confusions to a large extent.

9) Is GAT Necessary for ZSAR?: The node aggregation strategy of a GAT aids zero-shot knowledge transfer between classes. This is visualized in Fig. 5, where neighbors are connected to a node with different statistical strengths (attention) after training the GAT, as shown by different edge colors. Additionally, semantically-similar neighbors highlight a meaningful construction of our knowledge graph. Fig. 5(a) and (b) show similar sports/musical instruments as neighbors of hammer throw/playing dhol. Considering hammer throw, GAT learns to give more attention to throw discus due to the similar spatiotemporal motion of the body in these sports, followed by javelin throw (due to object shape similarities). Dhol and tabla are both two-surface percussions and hence receive more attention, unlike drum kits. Fig. 5(c) presents the interesting case of high jump, where polysemy of the verb "jump" can be seen among the connected neighbors. However, long jump and balance beam are given lesser attention than sky diving and trampoline jumping, both of which involve greater heights like high jump. Finally, transcending objects, environments, and even body motions, writing on board and typing (Fig. 5(d)) are strongly connected.

Fig. 9. Grad-CAM visualizations [76] for action classes from UCF-101. (a) and (b) denote successful cases while (c) denotes a failure case of our method.

10) Visualizing Attention Maps: Since Grad-CAM [76] is a popular method to obtain visual explanations for decisions given by a convolutional network-based method, we deem it suitable for analyzing our novel transformer LoCATe containing the LCA block. Fig. 9 shows Grad-CAM visualizations of the visual embeddings learned by our LCA block for predicting three different actions. A general observation is that for actions involving direct points of contact between humans and objects (e.g., between human and bow in archery and humans and bicycles in biking), the model focuses well on the objects (bow in Fig. 9(a)) and points of contact (seat, pedals and handles of bicycles in Fig. 9(b)). However, in a few cases like Fig. 9(c) (contactless


actions like blowing candles), the attention is scattered around novel framework called LoCATe-GAT for improving the visual-
the human body. Although it is mainly around the face (relevant semantic alignment of action classes during zero-shot action
to the ground truth action), the action is misclassified and hence recognition (ZSAR). It consists of a lightweight transformer
denotes a failure case. that focuses on the multi-scale local context during temporal
11) Comparison With Prior Transformer-Based Methods: modeling, as well as a graph attention network for capturing
The idea of local context aggregation in LoCATe-GAT is action relationships, empowered by the powerful image and
lightweight but effective, boosting ZSAR performance across text encodings from a pretrained CLIP. Extensive experiments
various datasets with simpler architecture and fewer compo- on four challenging benchmarks demonstrate state-of-the-art
nents than the recent transformer-based methods. For instance, results in conventional and generalized zero-shot settings. Using
ActionCLIP [46] and X-CLIP [18] leveraged prompt engineer- several ablations, we show the contribution of both the visual and
ing as a way of improving semantic representations of actions semantic branches toward improving ZSAR. The significance of
during image-text alignment. Both methods adjudged additional capturing local spatial context from multiple scales is reflected in
transformers as the best prompt learners, which increases model the confusion matrices for our method and compared to existing
complexity. ResT [12] incorporated masked language modeling works. These confusion matrices also highlight that recent works
as an auxiliary task, and ViSET [45] combined self-attention still suffer from the polysemy problem, and our method mitigates
with space-time attention [44] for better textual modeling. On the it to a large extent. Finally, from a computational perspective,
other hand, LoCATe-GAT is able to achieve higher performance LoCATe-GAT achieves better accuracy while being smaller in
using simple handcrafted prompts for CLIP’s text encoder, heav- terms of total network parameters – thanks to our novel LCA
ily reducing model size and complexity. Several works [18], block that we use in place of the traditional MLPs used in
[46], [66] additionally cited fine-tuning on target datasets as transformers.
a reason for dramatic improvements. However, LoCATe-GAT
does not need any fine-tuning, saving resources and time. Finally,
despite acknowledging the inductive bias of CNNs [45], prior REFERENCES
works shifted their attention completely to transformers, where
[1] Y. Li et al., “MViTv2: Improved multiscale vision transformers for clas-
we capitalized on this quality and reformulated our transformer sification and detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
for efficient ZSAR. As a result, we attain SOTA performance Recognit., 2022, pp. 4804–4814.
using a single GPU and fewer sampled frames (16) compared to [2] S.-H. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. Torr,
“Res2net: A new multi-scale backbone architecture,” IEEE Trans. Pattern
others like X-CLIP (32 GPUs) and ViSET (96 frames). Anal. Mach. Intell., vol. 43, no. 2, pp. 652–662, Feb. 2021.
12) Comparison With Prior GNN-Based Methods: Some previous works have adopted fully graph-based frameworks that generate classifiers for unseen actions by learning action-attribute relationships [10] or from additional modalities like human pose [22]. However, they are less scalable because such additional modalities are difficult to leverage, and their performance remains comparatively low. To improve performance, a transductive approach [10], [19] works well, but it requires processing unlabeled video data from unseen classes, which is a less practical setting. Other works [11], [20] initialize GCNs with Sent2Vec embeddings [77] to learn action relationships; however, they use a 3D CNN like I3D to extract video features, which is computationally heavy and yields sub-optimal ZSAR performance. In contrast, our LoCATe-GAT mitigates all these issues to a large extent. First, it relies only on the visual and text modalities and hence scales easily. Second, ours is an inductive approach: we do not use video data from any unseen action class throughout the training phase. Third, we use a GAT instead of GCNs and initialize it with CLIP text embeddings to better capture correlations between actions. We also avoid 3D CNNs and instead account for temporal information using our LoCATe transformer. Consequently, pairing the GAT with LoCATe helps us beat prior GNN-based ZSAR works by large margins across various datasets.
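As a minimal sketch of this idea (not the paper's exact architecture), the snippet below initializes graph nodes with CLIP text embeddings of the action classes and refines them with two GAT layers from PyTorch Geometric. The k-NN similarity graph and the layer sizes are assumptions made only for illustration.

```python
# Minimal sketch, assuming a k-NN similarity graph over action classes:
# GAT layers refine CLIP text embeddings of action names so that related
# actions exchange information. Graph construction and sizes are assumptions.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

def knn_graph_from_similarity(emb: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Build directed k-NN edges from cosine similarity between class embeddings."""
    sim = F.normalize(emb, dim=-1) @ F.normalize(emb, dim=-1).t()
    sim.fill_diagonal_(-1.0)                        # exclude self-edges
    nbrs = sim.topk(k, dim=-1).indices              # (num_classes, k)
    src = torch.arange(emb.size(0)).repeat_interleave(k)
    return torch.stack([src, nbrs.reshape(-1)])     # edge_index of shape (2, E)

class ActionGAT(torch.nn.Module):
    def __init__(self, in_dim: int = 512, hid_dim: int = 256, out_dim: int = 512, heads: int = 4):
        super().__init__()
        self.gat1 = GATConv(in_dim, hid_dim, heads=heads)
        self.gat2 = GATConv(hid_dim * heads, out_dim, heads=1)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        x = F.elu(self.gat1(x, edge_index))
        return self.gat2(x, edge_index)             # refined class representations

# Usage: nodes are initialized with CLIP text embeddings of the action names.
clip_text_emb = torch.randn(101, 512)               # placeholder for real CLIP embeddings
edge_index = knn_graph_from_similarity(clip_text_emb, k=5)
refined_class_emb = ActionGAT()(clip_text_emb, edge_index)   # (101, 512)
```

The attention mechanism lets each class weigh its neighbors differently, which is the practical difference from a plain GCN with fixed, degree-normalized edge weights.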
V. CONCLUSION

Motivated by the dual role of object-centric and environmental context for zero-shot action recognition, we present a framework called LoCATe-GAT. Extensive evaluations on four challenging benchmarks demonstrate state-of-the-art results in conventional and generalized zero-shot settings. Using several ablations, we show the contribution of both the visual and semantic branches toward improving ZSAR. The significance of capturing local spatial context from multiple scales is reflected in the confusion matrices of our method compared with those of existing works. These confusion matrices also highlight that recent works still suffer from the polysemy problem, which our method mitigates to a large extent. Finally, from a computational perspective, LoCATe-GAT achieves better accuracy while being smaller in terms of total network parameters, thanks to our novel LCA block that replaces the traditional MLPs used in transformers.
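The following sketch is illustrative only and does not reproduce the actual LCA block; it conveys the general parameter-count argument by swapping a transformer's 4x-expansion MLP for a multi-scale, depthwise-convolutional mixer that aggregates local context over the token sequence. All kernel sizes and dimensions are assumptions.

```python
# Illustrative only: NOT the paper's LCA block. A multi-scale depthwise mixer
# in place of the transformer feed-forward MLP uses far fewer parameters.
import torch
import torch.nn as nn

class MultiScaleLocalMixer(nn.Module):
    """Aggregates local context at several scales over a token sequence."""
    def __init__(self, dim: int = 512, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(dim, dim, k, padding=k // 2, groups=dim)  # depthwise conv per scale
            for k in kernel_sizes
        ])
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:         # x: (batch, tokens, dim)
        y = x.transpose(1, 2)                                    # (batch, dim, tokens)
        y = sum(branch(y) for branch in self.branches)           # multi-scale aggregation
        return self.proj(y.transpose(1, 2))                      # back to (batch, tokens, dim)

dim = 512
mlp_params = sum(p.numel() for p in nn.Sequential(
    nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)).parameters())
mixer_params = sum(p.numel() for p in MultiScaleLocalMixer(dim).parameters())
print(mlp_params, mixer_params)   # the convolutional mixer is markedly smaller
```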
REFERENCES

[1] Y. Li et al., "MViTv2: Improved multiscale vision transformers for classification and detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 4804–4814.
[2] S.-H. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. Torr, "Res2Net: A new multi-scale backbone architecture," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 2, pp. 652–662, Feb. 2021.
[3] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum contrast for unsupervised visual representation learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 9729–9738.
[4] Y. Kong and Y. Fu, "Human action recognition and prediction: A survey," Int. J. Comput. Vis., vol. 130, no. 5, pp. 1366–1401, 2022.
[5] S. P. Sahoo, S. Ari, K. Mahapatra, and S. P. Mohanty, "HAR-Depth: A novel framework for human action recognition using sequential learning and depth estimated history images," IEEE Trans. Emerg. Topics Comput. Intell., vol. 5, no. 5, pp. 813–825, Oct. 2021.
[6] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata, "Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 9, pp. 2251–2265, Sep. 2019.
[7] A. Panda, B. Santra, and D. P. Mukherjee, "Isolating features of object and its state for compositional zero-shot learning," IEEE Trans. Emerg. Topics Comput. Intell., vol. 7, no. 5, pp. 1571–1583, Oct. 2023.
[8] C. Zhang and Y. Peng, "Visual data synthesis via GAN for zero-shot video classification," in Proc. 27th Int. Joint Conf. Artif. Intell., 2018, pp. 1128–1134.
[9] M. Hong, G. Li, X. Zhang, and Q. Huang, "Generalized zero-shot video classification via generative adversarial networks," in Proc. 28th ACM Int. Conf. Multimedia, 2020, pp. 2419–2426.
[10] J. Gao, T. Zhang, and C. Xu, "I know the relationships: Zero-shot action recognition via two-stream graph convolutional networks and knowledge graphs," in Proc. AAAI Conf. Artif. Intell., 2019, pp. 8303–8311.
[11] P. Ghosh, N. Saini, L. S. Davis, and A. Shrivastava, "Learning graphs for knowledge transfer with limited labels," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 11151–11161.
[12] C.-C. Lin, K. Lin, L. Wang, Z. Liu, and L. Li, "Cross-modal representation learning for zero-shot action recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 19978–19988.
[13] A. Radford et al., "Learning transferable visual models from natural language supervision," in Proc. Int. Conf. Mach. Learn., 2021, pp. 8748–8763.
[14] C. Jia et al., "Scaling up visual and vision-language representation learning with noisy text supervision," in Proc. Int. Conf. Mach. Learn., 2021, pp. 4904–4916.
[15] L. Wang, Z. Tong, B. Ji, and G. Wu, "TDN: Temporal difference networks for efficient action recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 1895–1904.
[16] Z. Tu, Y. Liu, Y. Zhang, Q. Mu, and J. Yuan, "DTCM: Joint optimization of dark enhancement and action recognition in videos," IEEE Trans. Image Process., vol. 32, pp. 3507–3520, 2023.
[17] C. Ju, T. Han, K. Zheng, Y. Zhang, and W. Xie, "Prompting visual-language models for efficient video understanding," in Proc. Eur. Conf. Comput. Vis., 2022, pp. 105–124.
[18] B. Ni et al., "Expanding language-image pretrained models for general video recognition," in Proc. 17th Eur. Conf. Comput. Vis., 2022, pp. 1–18.
[19] Y. Xu, C. Han, J. Qin, X. Xu, G. Han, and S. He, "Transductive zero-shot action recognition via visually connected graph convolutional networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 8, pp. 3761–3769, Aug. 2021.
[20] P. Ghosh, N. Saini, L. S. Davis, and A. Shrivastava, "All about knowledge graphs for actions," 2020, arXiv:2008.12432.
[21] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 3–19.
[22] N. Wu and K. Kawamoto, "Zero-shot action recognition with three-stream graph convolutional networks," Sensors, vol. 21, no. 11, 2021, Art. no. 3793.
[23] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 3111–3119.
[24] J. Pennington, R. Socher, and C. D. Manning, "Glove: Global vectors for word representation," in Proc. Conf. Empirical Methods Natural Lang. Process., 2014, pp. 1532–1543.
[25] M. Pagliardini, P. Gupta, and M. Jaggi, "Unsupervised learning of sentence embeddings using compositional N-gram features," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., 2018, pp. 528–540.
[26] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," in Proc. Int. Conf. Learn. Representations, 2018. [Online]. Available: https://openreview.net/forum?id=rJXMpikCZ
[27] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in Proc. 5th Int. Conf. Learn. Representations, 2017.
[28] X. Xu, T. M. Hospedales, and S. Gong, "Multi-task zero-shot action recognition with prioritised data augmentation," in Proc. 14th Eur. Conf. Comput. Vis., 2016, pp. 343–359.
[29] Y. Tian, Y. Kong, Q. Ruan, G. An, and Y. Fu, "Aligned dynamic-preserving embedding for zero-shot action recognition," IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 6, pp. 1597–1612, Jun. 2020.
[30] Y. Su, M. Xing, S. An, W. Peng, and Z. Feng, "Vdarn: Video disentangling attentive relation network for few-shot and zero-shot action recognition," Ad Hoc Netw., vol. 113, 2021, Art. no. 102380.
[31] J. Liu, B. Kuipers, and S. Savarese, "Recognizing human actions by attributes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011, pp. 3337–3344.
[32] B. Brattoli, J. Tighe, F. Zhdanov, P. Perona, and K. Chalupka, "Rethinking zero-shot video classification: End-to-end training for realistic applications," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 4613–4623.
[33] R. Zellers and Y. Choi, "Zero-shot activity recognition with verb attribute induction," in Proc. Conf. Empirical Methods Natural Lang. Process., 2017. [Online]. Available: https://arxiv.org/abs/1707.09468
[34] I. Alexiou, T. Xiang, and S. Gong, "Exploring synonyms as context in zero-shot action recognition," in Proc. 2016 IEEE Int. Conf. Image Process., 2016, pp. 4190–4194.
[35] Q. Wang and K. Chen, "Alternative semantic representations for zero-shot human action recognition," in Proc. Eur. Conf. Mach. Learn. Knowl. Discov. Databases, 2017, pp. 87–102.
[36] S. Chen and D. Huang, "Elaborative rehearsal for zero-shot action recognition," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 13638–13647.
[37] J. Gao, Y. Hou, Z. Guo, and H. Zheng, "Learning spatio-temporal semantics and cluster relation for zero-shot action recognition," IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 11, pp. 6519–6530, Nov. 2023.
[38] L. Zhou and J. Mao, "Improving class representation for zero-shot action recognition," in Proc. 5th ACM Int. Conf. Multimedia Asia, 2023, pp. 1–7.
[39] V. Estevam, R. Laroca, H. Pedrini, and D. Menotti, "Global semantic descriptors for zero-shot action recognition," IEEE Signal Process. Lett., vol. 29, pp. 1843–1847, 2022.
[40] Y. Xu, X. Xu, G. Han, and S. He, "Holistically associated transductive zero-shot learning," IEEE Trans. Cogn. Devel. Syst., vol. 14, no. 2, pp. 437–447, Jun. 2022.
[41] A. Mishra, V. K. Verma, M. S. K. Reddy, S. Arulkumar, P. Rai, and A. Mittal, "A generative approach to zero-shot and few-shot action recognition," in Proc. 2018 IEEE Winter Conf. Appl. Comput. Vis., 2018, pp. 372–380.
[42] M. Hong, X. Zhang, G. Li, and Q. Huang, "Fine-grained feature generation for generalized zero-shot video classification," IEEE Trans. Image Process., vol. 32, pp. 1599–1612, 2023.
[43] D. Mandal et al., "Out-of-distribution detection for generalized zero-shot action recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 9985–9993.
[44] G. Bertasius, H. Wang, and L. Torresani, "Is space-time attention all you need for video understanding?," in Proc. Int. Conf. Mach. Learn., 2021, Art. no. 4.
[45] K. Doshi and Y. Yilmaz, "Zero-shot action recognition with transformer-based video semantic embedding," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 4858–4867.
[46] M. Wang, J. Xing, and Y. Liu, "ActionCLIP: A new paradigm for video action recognition," 2021, arXiv:2109.08472.
[47] J. Gao and C. Xu, "CI-GNN: Building a category-instance graph for zero-shot video classification," IEEE Trans. Multimedia, vol. 22, no. 12, pp. 3088–3100, Dec. 2020.
[48] J. Xing et al., "Boosting few-shot action recognition with graph-guided hybrid matching," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 1740–1750.
[49] Z. Tu, J. Zhang, H. Li, Y. Chen, and J. Yuan, "Joint-bone fusion graph convolutional network for semi-supervised skeleton action recognition," IEEE Trans. Multimedia, vol. 25, pp. 1819–1831, 2023.
[50] J. Zhang, Z. Tu, J. Weng, J. Yuan, and B. Du, "A modular neural motion retargeting system decoupling skeleton and shape perception," IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 10, pp. 6889–6904, Oct. 2024.
[51] J. Gao, T. Zhang, and C. Xu, "Learning to model relationships for zero-shot video classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 10, pp. 3476–3491, Oct. 2021.
[52] Q. Qi, H. Wang, T. Su, and X. Liu, "Learning temporal information and object relation for zero-shot action recognition," Displays, vol. 73, 2022, Art. no. 102177.
[53] T. Su, H. Wang, Q. Qi, L. Wang, and B. He, "Transductive learning with prior knowledge for generalized zero-shot action recognition," IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 1, pp. 260–273, Jan. 2024.
[54] L. Yuan et al., "Florence: A new foundation model for computer vision," 2021, arXiv:2111.11432.
[55] W. Kay et al., "The kinetics human action video dataset," 2017, arXiv:1705.06950.
[56] M. Jain, J. C. Van Gemert, T. Mensink, and C. G. Snoek, "Objects2action: Classifying and localizing actions without any video example," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 4588–4596.
[57] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele, "Evaluation of output embeddings for fine-grained image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 2927–2936.
[58] Q. Wang and K. Chen, "Zero-shot visual recognition via bidirectional latent embedding," Int. J. Comput. Vis., vol. 124, pp. 356–383, 2017.
[59] M. Bishay, G. Zoumpourlis, and I. Patras, "TARN: Temporal attentive relation network for few-shot and zero-shot action recognition," in Proc. 30th Brit. Mach. Vis. Conf., 2019, Art. no. 154. [Online]. Available: https://bmvc2019.org/wp-content/uploads/papers/0650-paper.pdf
[60] C. Qi, Z. Feng, M. Xing, Y. Su, J. Zheng, and Y. Zhang, "Energy-based temporal summarized attentive network for zero-shot action recognition," IEEE Trans. Multimedia, vol. 25, pp. 1940–1953, 2023.
[61] C. Bretti and P. Mettes, "Zero-shot action recognition from diverse object-scene compositions," in Proc. 32nd Brit. Mach. Vis. Conf., 2021.
[62] M. Hong, X. Zhang, G. Li, and Q. Huang, "Multi-modal multi-grained embedding learning for generalized zero-shot video classification," IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 10, pp. 5959–5972, Oct. 2023.
[63] P. Mettes, W. Thong, and C. G. Snoek, "Object priors for classifying and localizing unseen actions," Int. J. Comput. Vis., vol. 129, no. 6, pp. 1954–1971, 2021.
[64] V. Estevam, R. Laroca, H. Pedrini, and D. Menotti, "Tell me what you see: A zero-shot action recognition method based on natural language descriptions," Multimedia Tools Appl., vol. 83, no. 9, pp. 28147–28173, 2024.
[65] J. Wang, Y. Jiang, Y. Long, X. Sun, M. Pagnucco, and Y. Song, "Deconfounding causal inference for zero-shot action recognition," IEEE Trans. Multimedia, vol. 26, pp. 3976–3986, 2024.
[66] Y. Qian, L. Yu, W. Liu, and A. G. Hauptmann, "Rethinking zero-shot action recognition: Learning from latent atomic actions," in Proc. 17th Eur. Conf. Comput. Vis., 2022, pp. 104–120.
[67] S. Pu, K. Zhao, and M. Zheng, "Alignment-uniformity aware representation learning for zero-shot video classification," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 19968–19977.
[68] K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild," 2012, arXiv:1212.0402.
[69] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, "HMDB: A large video database for human motion recognition," in Proc. Int. Conf. Comput. Vis., 2011, pp. 2556–2563.
[70] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles, "ActivityNet: A large-scale video benchmark for human activity understanding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 961–970.
[71] J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman, "A short note about Kinetics-600," 2018, arXiv:1808.01340.
[72] S. N. Gowda, L. Sevilla-Lara, K. Kim, F. Keller, and M. Rohrbach, "A new split for evaluating true zero-shot action recognition," in Proc. DAGM German Conf. Pattern Recognit., 2021, pp. 191–205.
[73] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele, "Latent embeddings for zero-shot classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 69–77.
[74] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha, "Synthesized classifiers for zero-shot learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 5327–5336.
[75] M. Geva, R. Schuster, J. Berant, and O. Levy, "Transformer feed-forward layers are key-value memories," in Proc. 2021 Conf. Empirical Methods Natural Lang. Process., 2021, pp. 5484–5495.
[76] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 618–626.
[77] M. Pagliardini, P. Gupta, and M. Jaggi, "Unsupervised learning of sentence embeddings using compositional N-gram features," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., Jun. 2018, pp. 528–540. [Online]. Available: https://aclanthology.org/N18-1049

Sandipan Sarma received the B.Tech. degree in computer science and technology from the Indian Institute of Engineering Science and Technology, Shibpur, India, in 2019. He is currently working toward the Ph.D. degree with the Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, Guwahati, India. His research interests include leveraging machine learning and deep learning methods, particularly zero-shot, for computer vision applications like image classification, object detection, and action recognition.

Divyam Singal received the B.Tech. degree in computer science and engineering with a minor in robotics and artificial intelligence from the Indian Institute of Technology Guwahati, Guwahati, India, in 2023. He is currently a Software Engineer with Google under Google Ads. His research interests include computer vision, video understanding, natural language processing, and machine learning.

Arijit Sur (Member, IEEE) received the Ph.D. degree in computer science and engineering from the Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India. He is currently an Associate Professor with the Department of CSE, Indian Institute of Technology Guwahati, Guwahati, India. He has coauthored 28 top-tier journal papers and 50 conference papers in various premier forums. His research focuses on deep learning approaches for solving different computer vision problems. Specifically, it includes image, video restoration, underwater vision, super-resolution, satellite image segmentation, image translation, image steganalysis, zero-shot learning, active learning, medical imaging, adversarial perturbation, and different ML-based approaches for adaptive video streaming in a 5G environment.