LoCATe-GAT: Modeling Multi-Scale Local Context and Action Relationships for Zero-Shot Action Recognition

Abstract—The increasing number of actions in the real world makes it difficult for traditional deep-learning models to recognize unseen actions. Recently, pretrained contrastive image-based visual-language (I-VL) models have been adapted for efficient “zero-shot” scene understanding. Pairing such models with transformers to implement temporal modeling has been rewarding for zero-shot action recognition (ZSAR). However, the significance of modeling the local spatial context of objects and action environments remains unexplored. In this work, we propose a ZSAR framework called LoCATe-GAT, comprising a novel Local Context-Aggregating Temporal transformer (LoCATe) and a Graph Attention Network (GAT). Specifically, image and text encodings extracted from a pretrained I-VL model are used as inputs for LoCATe-GAT. Motivated by the observation that object-centric and environmental contexts drive both distinguishability and functional similarity between actions, LoCATe captures multi-scale local context using dilated convolutional layers during temporal modeling. Furthermore, the proposed GAT models semantic relationships between classes and achieves a strong synergy with the video embeddings produced by LoCATe. Extensive experiments on four widely-used benchmarks – UCF101, HMDB51, ActivityNet, and Kinetics – show that we achieve state-of-the-art results. Specifically, we obtain relative gains of 3.8% and 4.8% on UCF101 and HMDB51 in the conventional ZSAR setting and 16.6% on UCF101 in the generalized setting. For large-scale datasets like ActivityNet and Kinetics, our method achieves relative gains of 31.8% and 27.9%, respectively, over previous methods. Additionally, we gain 25.3% and 18.4% on UCF101 and HMDB51 as per the recent “TruZe” evaluation protocol.

Index Terms—Zero-shot learning, action recognition, transformer, graph attention network.

Received 27 June 2024; revised 27 August 2024; accepted 6 October 2024. (Corresponding author: Sandipan Sarma.)
Sandipan Sarma and Arijit Sur are with the Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, Guwahati 781039, India (e-mail: [email protected]; [email protected]).
Divyam Singal is with the Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, Guwahati 781039, India (e-mail: [email protected]).
Recommended for acceptance by Y. Wu.
Digital Object Identifier 10.1109/TETCI.2024.3499995

I. INTRODUCTION

The ability of humans to analyze motion patterns and perceive an action follows their understanding of object recognition. In the last decade or so, computer vision as a research area has improved in leaps and bounds following the deep learning revolution, with CNNs surpassing human performance in object recognition [1], [2], [3]. Naturally, video understanding has been gaining immense attention recently, where human action recognition (HAR) in videos is among the front-runners [4], [5]. Realizing that machines mimicking human intelligence for HAR can have potentially explosive real-world applications, several studies have been conducted in areas like video surveillance, autonomous driving, sports analysis, and others. However, these methods are restricted by the curse of deep learning – the lack of large-scale annotated training data.

Zero-shot learning (ZSL) [6], [7] approaches have recently emerged to alleviate this problem. Contrary to supervised learning, where all categories to be recognized are predefined, a zero-shot paradigm learns from the visual data of only a few seen classes during training. At test time, it achieves knowledge transferability from the seen to unseen domains/classes. Beyond methods based on generative models [8], [9], knowledge graphs [10], [11], and transformers [12], a new line of work based on large-scale pretraining of contrastive image-based visual-language (I-VL) models [13], [14] seems very promising in addressing zero-shot HAR. The impressive performance of these models across a wide set of vision tasks stems from strong representation learning using millions of image-text pairs publicly available on the web, allowing even zero-shot capabilities [13]. Consequently, a few recent works have tried efficiently adapting these I-VL models to downstream video tasks. In the context of HAR, learning video-specific/task-specific prompt vectors to obtain better discriminative text representations for action classes and temporal modeling [15], [16] using transformers [17], [18] have been explored recently. However, transformers usually have a limited ability to consider local spatial context, which could be a major hindrance to contextual information flow during temporal modeling. When distinguishing two actions, the object of interaction can play an important role even if the spatiotemporal motion trajectories of the body parts are relatively similar (e.g., throwing a hammer vs. discus in Fig. 1(a)). Even for differentiating actions independent of any object of interest, an additional semantic cue extracted from video frames can be useful (e.g., understanding “the presence of water” can differentiate skijet and horse riding in Fig. 1(b)). On the other hand, humans also tend to interact similarly with functionally-similar objects (punching bags and speed bags in Fig. 1(c)) and environments
Fig. 3. Proposed framework. Given video frames as input, a CLIP image encoder produces the frame encodings. Our novel temporal transformer takes these,
along with temporal positional encodings, as input and passes them down the encoder structure. The intermediate output goes through an LCA block, where local
context is aggregated from multiple scales. Temporal averaging produces video embeddings. Additionally, semantic relationships between actions are learned using
a graph attention network (GAT). Visual-semantic alignment of video and semantic embeddings using a cosine similarity enables ZSAR.
where v_i ∈ V_seen and ω_i ∈ R^d. Furthermore, to generate text encodings for a class, the text encoder of CLIP, Φ_text(·), needs a prompt template [46] as input (further explained in Section III-D1). These constitute the set of semantic embeddings:

Ψ = {ψ(l_k) = Φ_text(prompt(l_k)) | l_k ∈ S ∪ U}    (4)
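For concreteness, the following is a minimal sketch of how the frozen CLIP encoders could produce the frame encodings Ω(v_i) and the prompt-based semantic embeddings ψ(l_k) of Eq. (4), assuming OpenAI's publicly released clip package and a ViT-B/16 backbone; the prompt template shown here is only an illustrative placeholder for the template discussed in Section III-D1.

```python
import torch
import clip  # OpenAI CLIP package (github.com/openai/CLIP), assumed available

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)  # frozen image/text encoders
model.eval()

@torch.no_grad()
def frame_encodings(frames):
    """T sampled PIL frames of one video v_i -> (T, 512) tensor of encodings Omega(v_i)."""
    batch = torch.stack([preprocess(f) for f in frames]).to(device)
    return model.encode_image(batch).float()

@torch.no_grad()
def semantic_embeddings(labels, template="a video of a person {}"):
    """Action labels l_k (seen and unseen) -> (C, 512) semantic embeddings psi(l_k)."""
    tokens = clip.tokenize([template.format(l) for l in labels]).to(device)
    return model.encode_text(tokens).float()
```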
The proposed framework has two main components, as shown in Fig. 3. The first is a visual branch for context-aware temporal modeling of the frame-wise video encodings Ω(v_i) using LoCATe, which captures multi-scale local spatial context and aggregates it along the temporal dimension. The second is a semantic branch for modeling action relationships using GAT [26], which models a node's relationships with its neighbors via a multi-head attention mechanism and learns seen-unseen associations. In the following sections, we provide an in-depth discussion of the two components and how they contribute to visual-semantic alignment.

C. Visual Branch: Context-Aware Temporal Modeling

Attention mechanisms have proven to be useful in image classification for extracting discriminative information and form the core of transformer-based models [1]. Recently, some transformer-based models [13], [54] pretrained in a multimodal fashion with visual-language data (I-VL models) have shown promising zero-shot generalizability in areas like image classification. However, videos pose a bigger challenge. Firstly, unlike images, large-scale labeled video-text pairs for pretraining I-VL models are hard to collect and can incur enormous computation and memory costs. Secondly, the semantic meaning of a video clip is engraved within multiple individual frames and the spatiotemporal correlations between them. Recently, a few works [17], [18] have successfully adapted these I-VL models to action recognition by focusing on temporal modeling via transformers. Such transformer encoders usually consist of multi-head self-attention (MHSA), layer normalization (LN), and MLPs. But the inputs to them are meticulously obtained frame-level visual encodings from a previous stage of the framework, and hence these transformers just act as aggregators of frame-level encodings. Moreover, their capabilities are limited when it comes to capturing local spatial context, which could be detrimental to ZSAR. Object-centric and environmental contexts can influence both distinguishability and functional similarity between actions (Fig. 1), eventually assisting in knowledge transfer from seen to unseen actions.

In our visual branch, after extracting frame-level visual encodings for a video v_i, we pass them through our Local Context-Aggregating Temporal Transformer (LoCATe) that performs self-attention operations. Moreover, instead of using an MLP like traditional transformers, we propose a Local Context Aggregator (LCA) block to leverage multi-scale local context from the frame-level encodings. For LoCATe, we first prepare our input embeddings as:

x_i = Ω(v_i) + ρ_temp    (5)

where ρ_temp denotes temporal positional encodings, which are learnable vectors. These embeddings pass through layer normalization and MHSA:

x̃_i = x_i + MHSA(LN(x_i))
x̂_i = LN(x̃_i)    (6)

Having accounted for global spatiotemporal attention via MHSA within the tokens x̂_i, we reshape them to 2D feature maps and pass them through an LCA block that has three separate convolutional branches (Fig. 4). Each branch consists of a 1 × 1 convolutional (CONV) layer, followed by a 3 × 3 dilated CONV layer. Different dilation factors in different branches allow us to tune the receptive field sizes, capturing local context from multiple scales. The output feature maps are refined via a Convolutional Block Attention Module (CBAM [21]), which
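To make the encoder structure concrete, below is a minimal PyTorch sketch of one LoCATe encoder block with a three-branch LCA in place of the usual MLP, following Eqs. (5)-(6). The way the attended tokens are lifted into 2D maps, the channel width of the branches, and the omission of the CBAM refinement are simplifying assumptions of this sketch rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class LCABranch(nn.Module):
    """One LCA branch: a 1x1 conv followed by a 3x3 dilated conv with dilation d_i."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation),  # padding keeps spatial size
            nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)

class LoCATeBlock(nn.Module):
    """One LoCATe encoder block: LN + MHSA (Eqs. 5-6), then a 3-branch LCA instead of an MLP."""
    def __init__(self, dim=512, num_frames=16, heads=8, dilations=(1, 2, 4), channels=8):
        super().__init__()
        self.pos = nn.Parameter(0.02 * torch.randn(num_frames, dim))   # learnable rho_temp
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        # Assumed lifting of the attended token sequence into a small-channel 2D map.
        self.lift = nn.Conv2d(1, channels, kernel_size=1)
        self.branches = nn.ModuleList([LCABranch(channels, d) for d in dilations])
        self.fuse = nn.Conv2d(channels * len(dilations), 1, kernel_size=1)

    def forward(self, frame_feats):                 # (B, T, d) CLIP frame encodings Omega(v_i)
        x = frame_feats + self.pos                  # Eq. (5)
        x = x + self.attn(self.ln1(x), self.ln1(x), self.ln1(x), need_weights=False)[0]
        x = self.ln2(x)                             # Eq. (6)
        maps = self.lift(x.unsqueeze(1))            # (B, C, T, d): tokens viewed as a 2D map
        ctx = torch.cat([b(maps) for b in self.branches], dim=1)
        # A CBAM-style refinement [21] would be applied to `ctx` here; omitted for brevity.
        x = x + self.fuse(ctx).squeeze(1)           # residual multi-scale context aggregation
        return x.mean(dim=1)                        # temporal averaging -> one video embedding
```

With T = 16 and d = 512, calling LoCATeBlock()(omega) on a (B, 16, 512) tensor of CLIP frame encodings returns one 512-dimensional video embedding per clip.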
and the semantic embeddings:

Θ(v_i^out, h_j^out) = (v_i^out · h_j^out) / (‖v_i^out‖_2 · ‖h_j^out‖_2)    (12)

The most similar class is assigned as the test label for a given video v_i. In the case of generalized ZSAR, node features corresponding to a node j ∈ S ∪ U are considered for computing similarity scores.
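In code, this test-time assignment amounts to a cosine-similarity lookup between video and class embeddings; the sketch below is illustrative (the tensor and function names are ours, not from a released implementation).

```python
import torch
import torch.nn.functional as F

def predict_classes(video_emb, class_emb, class_labels):
    """Assign each video the most similar class via the cosine similarity of Eq. (12).

    video_emb:  (N, d) video embeddings v_i^out from LoCATe
    class_emb:  (C, d) class node features h_j^out from the GAT
                (unseen classes for CZSAR; seen and unseen nodes for generalized ZSAR)
    """
    sims = F.normalize(video_emb, dim=-1) @ F.normalize(class_emb, dim=-1).T  # (N, C) cosine scores
    return [class_labels[j] for j in sims.argmax(dim=-1).tolist()]
```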
IV. EXPERIMENTS

In this section, we conduct experiments on four benchmark datasets in two different settings – conventional ZSAR (CZSAR) and generalized ZSAR (GZSAR). Furthermore, several ablation studies illustrate our framework's efficacy.

A. Datasets and Evaluation Protocols

There are two datasets predominantly used by previous works for experiments in ZSAR – UCF101 [68] and HMDB51 [69]. They consist of 13,320 and 6,766 videos from 101 and 51 classes, respectively. Moreover, a few works have recently demonstrated model performance on large-scale datasets like ActivityNet-v1.3 [70] and Kinetics [55], [71]. ActivityNet-v1.3 contains 19,994 videos from 200 activity categories. Meanwhile, the Kinetics-400 [55] dataset contains around 240,000 videos in its training set. Unlike zero-shot image classification [6], disjoint seen-unseen splits for these datasets in ZSAR have not been standardized yet. The most widely used ones are several 50%/50% random splits of the dataset classes into seen and unseen since obtaining large enough sets for both learning and evaluation is feasible. We follow [36] and use the 50%/50% splits for a fair comparison with a wide array of existing works. Hence, we use half the classes as “seen” (for training) and the rest as “unseen” classes for testing in the CZSAR setting. However, we follow [36] for the “seen-unseen splits” of the Kinetics benchmark. As per [36], the training set of the Kinetics-400 [55] dataset is used as “seen data” (i.e., 400 “seen” classes). For testing, three independent sets of 160 classes from Kinetics-600 [71] are used as “unseen classes”.

While evaluating our method in the CZSAR setting, we report the top-1 accuracy for the unseen classes. In the GZSAR setting, we compute the average top-1 accuracy for seen classes (S_acc) and unseen classes (U_acc) and report the harmonic mean [6] of these two as the primary evaluation metric.
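Concretely, the harmonic-mean criterion of [6] can be computed as follows (a minimal sketch; the function and variable names are ours):

```python
def gzsar_harmonic_mean(seen_acc, unseen_acc):
    """Primary GZSAR metric: harmonic mean of seen (S_acc) and unseen (U_acc) top-1 accuracy [6]."""
    if seen_acc + unseen_acc == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)

# e.g., gzsar_harmonic_mean(0.70, 0.50) -> 0.5833...
```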
Recently, a new evaluation protocol named TruZe [72] has emerged that redefines the sets of seen and unseen classes for UCF101 and HMDB51. It provides a new set of 70/29 seen and 31/22 unseen classes for the UCF101/HMDB51 datasets.

B. Implementation Details

All experiments are performed on a single NVIDIA A100 80 GB GPU using the PyTorch framework.

1) Visual Branch: Our frame sampling strategy follows [12], [32], wherein we prepare T = 16 frames for a video clip. These frames are processed to have a spatial dimension of 224 × 224. Following the observation of ActionCLIP [46] that ViT-B/16 provides the best parameter-accuracy balance, we adopt the same backbone for the image and text encoders of our pretrained CLIP. Throughout the framework, both of these encoders are kept frozen, and the output image and text encodings have a dimension d = 512. The transformer LoCATe is initialized with a Gaussian distribution N(0, 0.02), where the positional encoding ρ_temp ∈ R^{T×d}. The MHSA block has 8 attention heads, and no dropout layer has been used inside the block. An LCA block within the transformer encoder contains three branches b_1, b_2, and b_3 for capturing multi-scale context. Each branch b_i has a 3 × 3 dilated convolution layer with dilation factor d_i (values 1, 2, 4 for i = 1, 2, 3, respectively). LoCATe is trained using an Adam optimizer with a learning rate of 2 × 10^−7 and a batch size of 22 videos. Following previous work [12], [32], we use one clip during training and 25 clips during testing per video. Our best results are achieved using one transformer encoder block. Our transformer is trained for 28 epochs for the HMDB51 dataset and 16 epochs for the other datasets.

2) Semantic Branch: With one node per action class, the number of nodes in our KG while experimenting with the Kinetics dataset is |S| + |U|. However, while experimenting with any dataset other than Kinetics, the number of nodes equals |S| + |U| + |K_400|, where K_400 is the set of classes from Kinetics-400 [55] used during KG construction. Following previous works [11], [20], [28], we also reap the benefits of using these additional nodes for capturing richer class relationships but do not utilize them while learning visual-semantic alignment. After being initialized with d-dimensional text encodings from CLIP, edges are formed between these nodes with N_G = 5. We apply a three-layer GAT model [26], with the first two layers having 4 and the last layer having 6 attention heads. We train it with an Adam SGD optimizer with a learning rate of 10^−3 for 100,000 epochs. To prevent overfitting, we use dropout layers with a rate of 0.6.
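For concreteness, a small sketch of this semantic branch is given below, assuming PyTorch Geometric for the GAT layers; the per-layer hidden widths, the ELU activations, and the top-N_G nearest-neighbour rule for forming edges over CLIP text encodings are assumptions of the sketch, not specifics reported here.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv  # assumes PyTorch Geometric is available

class SemanticGAT(torch.nn.Module):
    """Three-layer GAT over the action knowledge graph, initialized with CLIP text encodings."""
    def __init__(self, dim=512, heads=(4, 4, 6), dropout=0.6):
        super().__init__()
        self.drop = dropout
        self.g1 = GATConv(dim, dim // heads[0], heads=heads[0], dropout=dropout)
        self.g2 = GATConv(dim, dim // heads[1], heads=heads[1], dropout=dropout)
        self.g3 = GATConv(dim, dim, heads=heads[2], concat=False, dropout=dropout)

    def forward(self, x, edge_index):               # x: (num_nodes, 512) CLIP text encodings
        x = F.elu(self.g1(F.dropout(x, self.drop, self.training), edge_index))
        x = F.elu(self.g2(F.dropout(x, self.drop, self.training), edge_index))
        return self.g3(F.dropout(x, self.drop, self.training), edge_index)  # node features h_j^out

def knn_edges(text_emb, k=5):
    """Illustrative KG construction: connect each class node to its N_G = 5 most similar
    classes by cosine similarity of CLIP text encodings (edges added in both directions)."""
    sim = F.normalize(text_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T
    sim.fill_diagonal_(-1.0)                         # exclude self; GATConv adds self-loops itself
    nbrs = sim.topk(k, dim=-1).indices               # (N, k)
    src = torch.arange(text_emb.size(0)).repeat_interleave(k)
    dst = nbrs.reshape(-1)
    return torch.stack([torch.cat([src, dst]), torch.cat([dst, src])])
```

The resulting node features h_j^out are the semantic embeddings matched against the video embeddings via Eq. (12); the GAT itself is trained with Adam (learning rate 10^−3) as stated above.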
C. Zero-Shot Results

1) Conventional ZSAR: We compare our method with an extensive array of previous works. Here, our focus is only on inductive methods since they strictly follow the zero-shot criterion of not using any unseen-class visual data during training. Table I shows that our proposed method, LoCATe-GAT, consistently improves over the state-of-the-art (SOTA). We achieve relative accuracy gains of 3.8% and 4.8% on UCF101 and HMDB51 over the SOTA, X-FLO [18]. On large-scale datasets, our LoCATe-GAT outperforms previous methods significantly by obtaining relative gains of 31.8% and 27.9% on ActivityNet and Kinetics, respectively (Table II).

2) Generalized ZSAR: Being the realistic but more difficult setting in zero-shot literature, only a few methods present their model's efficacy in the GZSAR setting. A few methods [19], [29], [47], [53] attempt to enhance model generalizability using a transductive approach (training with unlabeled unseen-class visual samples in addition to seen-class samples). In Table III, we show that our (inductive) method beats these transductive methods too. Existing works provide GZSAR results only on the UCF101 and HMDB51 datasets. On UCF101, we beat the
TABLE IV
CZSAR PERFORMANCE ON UCF101 AND HMDB51 FOLLOWING THE TRUZE PROTOCOL

TABLE II
CZSAR PERFORMANCE ON ACTIVITYNET AND KINETICS

TABLE V
COMPONENT-WISE ANALYSIS IN THE CZSAR SETTING FOR UCF-101
closest method [53] considerably by achieving a relative gain of 16.6%. On HMDB51, our results are lower but comparable to the SOTA. One reason behind this lower performance could be that many actions in HMDB51, like sit, walk, and laugh, are insensitive to object-centric or environmental context, the fundamental motivation behind our novelty.

3) Evaluation on TruZe Protocol: The recent work TruZe [72] claims that their newly proposed seen-unseen splits are fairer but more restrictive. They justified this by comparing the top-1 accuracy for a wide range of ZSAR models using random 50%/50% splits and TruZe splits and found that all models perform relatively poorly with TruZe splits. Table IV presents the CZSAR performance of several models as reported by [37], [72]. Despite harder splits, LoCATe-GAT outperforms the SOTA [37] by attaining a whopping relative gain of 25.3%

D. Ablation Studies and Discussions

For a uniform evaluation, we perform our studies on the UCF101 dataset due to its diversity in terms of action classes. Moreover, there are large variations in camera motion, object appearance, pose, object scales, and viewpoint that are necessary to establish the robustness of our model components.

1) Effect of Visual and Semantic Branches: Temporal modeling is especially essential for video understanding, as shown by previous works on ZSAR [17], [18]. However, by additionally capturing object-centric and environmental contexts, we beat the SOTA method. Table V summarizes the contribution from the visual and semantic branches of our framework. We first evaluate our framework with the semantic branch only (i.e., without LoCATe), where visual embeddings directly from CLIP and semantic embeddings learned via GAT are used for computing
Fig. 5. Attention visualization with respect to a node of interest (in orange). The KG has nodes from UCF101+Kinetics-400, but neighbors (in yellow) only from
UCF101 are shown to reduce clutter. Self-loops are omitted, and attention weights of our trained GAT for different neighbors are illustrated via the colored edges.
TABLE VI
IMPACT OF DIFFERENT SUB-COMPONENTS IN LOCATE ON TOP-1 ACCURACY
(CZSAR SETTING ON UCF-101)
similarity scores at test time. In this setting, our result lies only behind the SOTA method X-FLO [18] (with 73.2%). Next, we evaluate using only the visual branch (i.e., without GAT). In this case, similarity scores are computed at test time between the video embeddings given by our temporal transformer LoCATe and the text encodings given by CLIP. Table V shows that LoCATe itself beats the SOTA [18] by a 0.4% absolute gain in top-1 accuracy. Finally, by combining LoCATe and GAT, we gain an additional boost of 2.4%. This indicates that appreciating action relationships via GAT encourages a strong synergy between the video and semantic embeddings, improving their alignment in a way that the text encodings of CLIP alone could not provide.
2) Model Complexity and Scalability: Table V shows that we achieve a good inference speed of 0.54 videos per second. Moreover, our model is quite lightweight with 5.85 million parameters – significantly fewer than others like ActionCLIP [46] (141.7 million). Compared to ResT [12], which incurs 30.8 × 25 GFLOPs for a 25-clip inference, LoCATe-GAT achieves a relative gain of up to 38.9% over ResT with around 8× fewer GFLOPs (3.7 × 25). Thanks to the scalability of our method, we witness consistent improvements over the SOTA across all four evaluated datasets (including large datasets like Kinetics) and obtain similar inference speeds for all. No fine-tuning is needed for target video datasets, which saves time and makes our method very scalable to future datasets as well.
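For reference, the GFLOP figures above amount to 30.8 × 25 = 770 GFLOPs for ResT versus 3.7 × 25 = 92.5 GFLOPs for LoCATe-GAT, a ratio of 770 / 92.5 ≈ 8.3, consistent with the roughly 8× reduction stated above.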
3) Improvements by Capturing Multi-Scale Context: To analyze the benefits of capturing local context from multiple scales, we consider reducing the number of branches in the LCA block (Fig. 4) to 2 and 1 (from the default 3). We first remove one branch with dilation factor d = 1 and then remove two branches with d = 1 and d = 2 to observe the effect of neglecting contextual cues from smaller scales. The decreasing accuracy with decreasing branches in Table VI conveys that capturing context at smaller scales is beneficial.
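For intuition on the scales involved: by standard dilated-convolution arithmetic, a 3 × 3 convolution with dilation factor d has an effective kernel size of d(3 − 1) + 1 = 2d + 1, so the three branches with d = 1, 2, 4 cover 3 × 3, 5 × 5, and 9 × 9 neighborhoods, respectively; removing the d = 1 (and then the d = 2) branch therefore discards exactly the smallest-scale context.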
Fig. 6. Benefit of capturing local context. The small objects of interest are enclosed in yellow boxes. Action classes are taken from UCF-101.

Fig. 7. (Left to right): Confusion matrices for (a) AURL, (b) LoCATe (default), (c) LoCATe-GAT (default) without CBAM, (d) LoCATe-GAT (with 1 LCA branch), (e) LoCATe-GAT (with 2 LCA branches), and (f) complete LoCATe-GAT (default). Here, 'default' indicates a 3-branch LCA setting. Action classes are from UCF-101.

For a class-level comparison, we randomly pick four actions involving small objects of interest and plot the top-1 accuracy with different architectures in Fig. 6. Even without GAT, a 3-branch LoCATe easily outperforms recent works like AURL [67] on most occasions. With a 3-branch LoCATe-GAT, state-of-the-art results are achieved in all cases. However, a LoCATe-GAT framework with fewer branches fares below a 3-branch LoCATe, indicating that only enhancing the semantic representations of actions is inadequate for ZSAR; multi-scale local context for empowering the visual embeddings is essential as well. In more detail, confusion matrices for different LCA branch setups of our method in Fig. 7 provide a visualization of ZSAR accuracy obtained on all unseen classes. It is observed that with more branches capturing context at smaller scales, the recognition ability mostly improves progressively for actions
involving small objects, such as knitting and nunchucks. A noteworthy case of an action involving a small object is the class yo-yo, which goes completely unrecognized by the 1-branch (Fig. 7(d)) and 2-branch (Fig. 7(e)) LCA-based setups, but with a 3-branch setup (Fig. 7(f)) it gets a positive boost in accuracy (specifically 10.93%).

4) Dual Role of Object-Centric and Environmental Context: The distinguishability and functional similarity cues mined by object-centric and environmental contexts are difficult to evaluate and measure quantitatively since they are observational intuitions of our work. However, some useful insights into these concepts can be ascertained from the confusion matrices in Fig. 7. Similar to the action pairs shown in the introductory Fig. 1, several such pairs can be spotted among the unseen actions in Fig. 7, and their recognition accuracy in previous works like AURL [67] (Fig. 7(a)) can be compared to ours (Fig. 7(f)). For instance, the actions band marching and military parade comprise similar spatiotemporal body motions, but the object of interaction can be a major factor in distinguishing them. AURL achieves the same accuracy as LoCATe-GAT on band marching (96.77%), but misclassifies military parade as band marching most of the time, classifying only 4.27% of its samples correctly. On the other hand, LoCATe-GAT achieves 88.8% accuracy for the same class. Reflecting a dual role, the functional similarity of objects like cello and violin (both string-based instruments) is also appreciated by our method, unlike AURL. Therefore, AURL fails miserably in consistently recognizing actions involving similar objects like playing cello (38.64% accuracy) and playing violin (8% accuracy), while LoCATe-GAT obtains consistent results (93.29% and 99%, respectively). Meanwhile, contextual cues from the environment can also help in better understanding human behavior. One such indicator is the performance consistency achieved by LoCATe-GAT for water sport actions like kayaking (58.16%) and skijet (99%). However, AURL struggles in this case as well, with 0% accuracy in recognizing videos from skijet. Overall, it is evident from the diagonal entries of the confusion matrices in Fig. 7 that our method is superior to AURL [67]. More importantly, our motivation of exploiting the distinguishability and functional similarity of objects and action environments is highlighted by the improved consistency in predicting actions.

5) Role of CBAM in LCA: Table VI shows that the exclusion of CBAM modules from the LCA block yields lower accuracy, suggesting that feature refinement using channel and spatial attention helps produce better video embeddings. Moreover, while comparing in a 3-branch LCA setting (Fig. 6), excluding CBAM consistently shows an inferior performance to both LoCATe and LoCATe-GAT frameworks with CBAM, proving its worth within the framework.

6) Different Text Representations: We compare the semantic representations learned by GAT when initialized with word embeddings (Word2Vec [23]), sentence embeddings
TABLE VII
INFLUENCE OF DIFFERENT TEXTUAL REPRESENTATIONS OF ACTION CLASSES
ON TOP-1 ACCURACY (CZSAR SETTING ON UCF-101)
actions like blowing candles), the attention is scattered around the human body. Although it is mainly around the face (relevant to the ground-truth action), the action is misclassified and hence denotes a failure case.

11) Comparison With Prior Transformer-Based Methods: The idea of local context aggregation in LoCATe-GAT is lightweight but effective, boosting ZSAR performance across various datasets with a simpler architecture and fewer components than the recent transformer-based methods. For instance, ActionCLIP [46] and X-CLIP [18] leveraged prompt engineering as a way of improving semantic representations of actions during image-text alignment. Both methods adjudged additional transformers as the best prompt learners, which increases model complexity. ResT [12] incorporated masked language modeling as an auxiliary task, and ViSET [45] combined self-attention with space-time attention [44] for better textual modeling. On the other hand, LoCATe-GAT is able to achieve higher performance using simple handcrafted prompts for CLIP's text encoder, heavily reducing model size and complexity. Several works [18], [46], [66] additionally cited fine-tuning on target datasets as a reason for dramatic improvements. However, LoCATe-GAT does not need any fine-tuning, saving resources and time. Finally, despite acknowledging the inductive bias of CNNs [45], prior works shifted their attention completely to transformers, whereas we capitalize on this quality and reformulate our transformer for efficient ZSAR. As a result, we attain SOTA performance using a single GPU and fewer sampled frames (16) compared to others like X-CLIP (32 GPUs) and ViSET (96 frames).

12) Comparison With Prior GNN-Based Methods: Some previous works have adopted fully graph-based frameworks that generate classifiers for unseen actions by learning action-attribute relationships [10] or from additional modalities like human pose [22]. However, they are less scalable because such additional modalities are difficult to leverage, and their performance remains comparatively very low. To improve performance, a transductive approach [10], [19] works well, but it requires processing unlabeled unseen-class video data – a less practical setting. The works in [11], [20] initialize GCNs with Sent2Vec embeddings [77] to learn action relationships. However, they use a 3D CNN like I3D to extract the video features, which is computationally heavy and yields sub-optimal ZSAR performance. In contrast, our LoCATe-GAT mitigates all these issues to a large extent. First, it relies only on the visual and text modalities and hence becomes very scalable. Second, ours is an inductive approach, where we do not use video data from any unseen action class throughout the training phase. And third, we use GAT instead of GCNs and initialize it with CLIP text embeddings to better capture correlations between actions. We avoid 3D CNNs and instead account for temporal information using our LoCATe transformer. Consequently, pairing the GAT with LoCATe helps us beat the prior GNN-based ZSAR works by huge margins across various datasets.

V. CONCLUSION

Motivated by the dual role of object-centric and environmental context for zero-shot action recognition, we present a novel framework called LoCATe-GAT for improving the visual-semantic alignment of action classes during zero-shot action recognition (ZSAR). It consists of a lightweight transformer that focuses on the multi-scale local context during temporal modeling, as well as a graph attention network for capturing action relationships, empowered by the powerful image and text encodings from a pretrained CLIP. Extensive experiments on four challenging benchmarks demonstrate state-of-the-art results in conventional and generalized zero-shot settings. Using several ablations, we show the contribution of both the visual and semantic branches toward improving ZSAR. The significance of capturing local spatial context from multiple scales is reflected in the confusion matrices for our method as compared to existing works. These confusion matrices also highlight that recent works still suffer from the polysemy problem, and our method mitigates it to a large extent. Finally, from a computational perspective, LoCATe-GAT achieves better accuracy while being smaller in terms of total network parameters – thanks to our novel LCA block that we use in place of the traditional MLPs used in transformers.

REFERENCES

[1] Y. Li et al., “MViTv2: Improved multiscale vision transformers for classification and detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 4804–4814.
[2] S.-H. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. Torr, “Res2net: A new multi-scale backbone architecture,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 2, pp. 652–662, Feb. 2021.
[3] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 9729–9738.
[4] Y. Kong and Y. Fu, “Human action recognition and prediction: A survey,” Int. J. Comput. Vis., vol. 130, no. 5, pp. 1366–1401, 2022.
[5] S. P. Sahoo, S. Ari, K. Mahapatra, and S. P. Mohanty, “HAR-Depth: A novel framework for human action recognition using sequential learning and depth estimated history images,” IEEE Trans. Emerg. Topics Comput. Intell., vol. 5, no. 5, pp. 813–825, Oct. 2021.
[6] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata, “Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 9, pp. 2251–2265, Sep. 2019.
[7] A. Panda, B. Santra, and D. P. Mukherjee, “Isolating features of object and its state for compositional zero-shot learning,” IEEE Trans. Emerg. Topics Comput. Intell., vol. 7, no. 5, pp. 1571–1583, Oct. 2023.
[8] C. Zhang and Y. Peng, “Visual data synthesis via GAN for zero-shot video classification,” in Proc. 27th Int. Joint Conf. Artif. Intell., 2018, pp. 1128–1134.
[9] M. Hong, G. Li, X. Zhang, and Q. Huang, “Generalized zero-shot video classification via generative adversarial networks,” in Proc. 28th ACM Int. Conf. Multimedia, 2020, pp. 2419–2426.
[10] J. Gao, T. Zhang, and C. Xu, “I know the relationships: Zero-shot action recognition via two-stream graph convolutional networks and knowledge graphs,” in Proc. AAAI Conf. Artif. Intell., 2019, pp. 8303–8311.
[11] P. Ghosh, N. Saini, L. S. Davis, and A. Shrivastava, “Learning graphs for knowledge transfer with limited labels,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 11151–11161.
[12] C.-C. Lin, K. Lin, L. Wang, Z. Liu, and L. Li, “Cross-modal representation learning for zero-shot action recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 19978–19988.
[13] A. Radford et al., “Learning transferable visual models from natural language supervision,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 8748–8763.
[14] C. Jia et al., “Scaling up visual and vision-language representation learning with noisy text supervision,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 4904–4916.
[15] L. Wang, Z. Tong, B. Ji, and G. Wu, “TDN: Temporal difference networks for efficient action recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 1895–1904.
[16] Z. Tu, Y. Liu, Y. Zhang, Q. Mu, and J. Yuan, “DTCM: Joint optimization of dark enhancement and action recognition in videos,” IEEE Trans. Image Process., vol. 32, pp. 3507–3520, 2023.
[17] C. Ju, T. Han, K. Zheng, Y. Zhang, and W. Xie, “Prompting visual-language models for efficient video understanding,” in Proc. Eur. Conf. Comput. Vis., 2022, pp. 105–124.
[18] B. Ni et al., “Expanding language-image pretrained models for general video recognition,” in Proc. 17th Eur. Conf. Comput. Vis., 2022, pp. 1–18.
[19] Y. Xu, C. Han, J. Qin, X. Xu, G. Han, and S. He, “Transductive zero-shot action recognition via visually connected graph convolutional networks,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 8, pp. 3761–3769, Aug. 2021.
[20] P. Ghosh, N. Saini, L. S. Davis, and A. Shrivastava, “All about knowledge graphs for actions,” 2020, arXiv:2008.12432.
[21] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “CBAM: Convolutional block attention module,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 3–19.
[22] N. Wu and K. Kawamoto, “Zero-shot action recognition with three-stream graph convolutional networks,” Sensors, vol. 21, no. 11, 2021, Art. no. 3793.
[23] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 3111–3119.
[24] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2014, pp. 1532–1543.
[25] M. Pagliardini, P. Gupta, and M. Jaggi, “Unsupervised learning of sentence embeddings using compositional N-gram features,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., 2018, pp. 528–540.
[26] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in Proc. Int. Conf. Learn. Representations, 2018. [Online]. Available: https://openreview.net/forum?id=rJXMpikCZ
[27] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in Proc. 5th Int. Conf. Learn. Representations, 2017.
[28] X. Xu, T. M. Hospedales, and S. Gong, “Multi-task zero-shot action recognition with prioritised data augmentation,” in Proc. 14th Eur. Conf. Comput. Vis., 2016, pp. 343–359.
[29] Y. Tian, Y. Kong, Q. Ruan, G. An, and Y. Fu, “Aligned dynamic-preserving embedding for zero-shot action recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 6, pp. 1597–1612, Jun. 2020.
[30] Y. Su, M. Xing, S. An, W. Peng, and Z. Feng, “Vdarn: Video disentangling attentive relation network for few-shot and zero-shot action recognition,” Ad Hoc Netw., vol. 113, 2021, Art. no. 102380.
[31] J. Liu, B. Kuipers, and S. Savarese, “Recognizing human actions by attributes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2011, pp. 3337–3344.
[32] B. Brattoli, J. Tighe, F. Zhdanov, P. Perona, and K. Chalupka, “Rethinking zero-shot video classification: End-to-end training for realistic applications,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 4613–4623.
[33] R. Zellers and Y. Choi, “Zero-shot activity recognition with verb attribute induction,” in Proc. Conf. Empirical Methods Natural Lang. Process., 2017. [Online]. Available: https://arxiv.org/abs/1707.09468
[34] I. Alexiou, T. Xiang, and S. Gong, “Exploring synonyms as context in zero-shot action recognition,” in Proc. 2016 IEEE Int. Conf. Image Process., 2016, pp. 4190–4194.
[35] Q. Wang and K. Chen, “Alternative semantic representations for zero-shot human action recognition,” in Proc. Eur. Conf. Mach. Learn. Knowl. Discov. Databases, 2017, pp. 87–102.
[36] S. Chen and D. Huang, “Elaborative rehearsal for zero-shot action recognition,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 13638–13647.
[37] J. Gao, Y. Hou, Z. Guo, and H. Zheng, “Learning spatio-temporal semantics and cluster relation for zero-shot action recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 11, pp. 6519–6530, Nov. 2023.
[38] L. Zhou and J. Mao, “Improving class representation for zero-shot action recognition,” in Proc. 5th ACM Int. Conf. Multimedia Asia, 2023, pp. 1–7.
[39] V. Estevam, R. Laroca, H. Pedrini, and D. Menotti, “Global semantic descriptors for zero-shot action recognition,” IEEE Signal Process. Lett., vol. 29, pp. 1843–1847, 2022.
[40] Y. Xu, X. Xu, G. Han, and S. He, “Holistically associated transductive zero-shot learning,” IEEE Trans. Cogn. Devel. Syst., vol. 14, no. 2, pp. 437–447, Jun. 2022.
[41] A. Mishra, V. K. Verma, M. S. K. Reddy, S. Arulkumar, P. Rai, and A. Mittal, “A generative approach to zero-shot and few-shot action recognition,” in Proc. 2018 IEEE Winter Conf. Appl. Comput. Vis., 2018, pp. 372–380.
[42] M. Hong, X. Zhang, G. Li, and Q. Huang, “Fine-grained feature generation for generalized zero-shot video classification,” IEEE Trans. Image Process., vol. 32, pp. 1599–1612, 2023.
[43] D. Mandal et al., “Out-of-distribution detection for generalized zero-shot action recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 9985–9993.
[44] G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all you need for video understanding?,” in Proc. Int. Conf. Mach. Learn., 2021, Art. no. 4.
[45] K. Doshi and Y. Yilmaz, “Zero-shot action recognition with transformer-based video semantic embedding,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 4858–4867.
[46] M. Wang, J. Xing, and Y. Liu, “Actionclip: A new paradigm for video action recognition,” 2021, arXiv:2109.08472.
[47] J. Gao and C. Xu, “CI-GNN: Building a category-instance graph for zero-shot video classification,” IEEE Trans. Multimedia, vol. 22, no. 12, pp. 3088–3100, Dec. 2020.
[48] J. Xing et al., “Boosting few-shot action recognition with graph-guided hybrid matching,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 1740–1750.
[49] Z. Tu, J. Zhang, H. Li, Y. Chen, and J. Yuan, “Joint-bone fusion graph convolutional network for semi-supervised skeleton action recognition,” IEEE Trans. Multimedia, vol. 25, pp. 1819–1831, 2023.
[50] J. Zhang, Z. Tu, J. Weng, J. Yuan, and B. Du, “A modular neural motion retargeting system decoupling skeleton and shape perception,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 10, pp. 6889–6904, Oct. 2024.
[51] J. Gao, T. Zhang, and C. Xu, “Learning to model relationships for zero-shot video classification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 10, pp. 3476–3491, Oct. 2021.
[52] Q. Qi, H. Wang, T. Su, and X. Liu, “Learning temporal information and object relation for zero-shot action recognition,” Displays, vol. 73, 2022, Art. no. 102177.
[53] T. Su, H. Wang, Q. Qi, L. Wang, and B. He, “Transductive learning with prior knowledge for generalized zero-shot action recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 1, pp. 260–273, Jan. 2024.
[54] L. Yuan et al., “Florence: A new foundation model for computer vision,” 2021, arXiv:2111.11432.
[55] W. Kay et al., “The kinetics human action video dataset,” 2017, arXiv:1705.06950.
[56] M. Jain, J. C. Van Gemert, T. Mensink, and C. G. Snoek, “Objects2action: Classifying and localizing actions without any video example,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 4588–4596.
[57] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele, “Evaluation of output embeddings for fine-grained image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 2927–2936.
[58] Q. Wang and K. Chen, “Zero-shot visual recognition via bidirectional latent embedding,” Int. J. Comput. Vis., vol. 124, pp. 356–383, 2017.
[59] M. Bishay, G. Zoumpourlis, and I. Patras, “TARN: Temporal attentive relation network for few-shot and zero-shot action recognition,” in Proc. 30th Brit. Mach. Vis. Conf., 2019, Art. no. 154. [Online]. Available: https://bmvc2019.org/wp-content/uploads/papers/0650-paper.pdf
[60] C. Qi, Z. Feng, M. Xing, Y. Su, J. Zheng, and Y. Zhang, “Energy-based temporal summarized attentive network for zero-shot action recognition,” IEEE Trans. Multimedia, vol. 25, pp. 1940–1953, 2023.
[61] C. Bretti and P. Mettes, “Zero-shot action recognition from diverse object-scene compositions,” in Proc. 32nd Brit. Mach. Vis. Conf., 2021.
[62] M. Hong, X. Zhang, G. Li, and Q. Huang, “Multi-modal multi-grained embedding learning for generalized zero-shot video classification,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 10, pp. 5959–5972, Oct. 2023.
[63] P. Mettes, W. Thong, and C. G. Snoek, “Object priors for classifying and localizing unseen actions,” Int. J. Comput. Vis., vol. 129, no. 6, pp. 1954–1971, 2021.
[64] V. Estevam, R. Laroca, H. Pedrini, and D. Menotti, “Tell me what you see: A zero-shot action recognition method based on natural language descriptions,” Multimedia Tools Appl., vol. 83, no. 9, pp. 28147–28173, 2024.
[65] J. Wang, Y. Jiang, Y. Long, X. Sun, M. Pagnucco, and Y. Song, “Deconfounding causal inference for zero-shot action recognition,” IEEE Trans. Multimedia, vol. 26, pp. 3976–3986, 2024.
[66] Y. Qian, L. Yu, W. Liu, and A. G. Hauptmann, “Rethinking zero-shot action recognition: Learning from latent atomic actions,” in Proc. 17th Eur. Conf. Comput. Vis., 2022, pp. 104–120.
[67] S. Pu, K. Zhao, and M. Zheng, “Alignment-uniformity aware representation learning for zero-shot video classification,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 19968–19977.
[68] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” 2012, arXiv:1212.0402.
[69] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB: A large video database for human motion recognition,” in Proc. Int. Conf. Comput. Vis., 2011, pp. 2556–2563.
[70] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles, “Activitynet: A large-scale video benchmark for human activity understanding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 961–970.
[71] J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman, “A short note about kinetics-600,” 2018, arXiv:1808.01340.
[72] S. N. Gowda, L. Sevilla-Lara, K. Kim, F. Keller, and M. Rohrbach, “A new split for evaluating true zero-shot action recognition,” in Proc. DAGM German Conf. Pattern Recognit., 2021, pp. 191–205.
[73] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele, “Latent embeddings for zero-shot classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 69–77.
[74] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha, “Synthesized classifiers for zero-shot learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 5327–5336.
[75] M. Geva, R. Schuster, J. Berant, and O. Levy, “Transformer feed-forward layers are key-value memories,” in Proc. 2021 Conf. Empirical Methods Natural Lang. Process., 2021, pp. 5484–5495.
[76] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 618–626.
[77] M. Pagliardini, P. Gupta, and M. Jaggi, “Unsupervised learning of sentence embeddings using compositional N-gram features,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., Jun. 2018, pp. 528–540. [Online]. Available: https://aclanthology.org/N18-1049

Divyam Singal received the B.Tech. degree in computer science and engineering with a minor in robotics and artificial intelligence from the Indian Institute of Technology Guwahati, Guwahati, India, in 2023. He is currently a Software Engineer with Google under Google Ads. His research interests include computer vision, video understanding, natural language processing, and machine learning.

Arijit Sur (Member, IEEE) received the Ph.D. degree in computer science and engineering from the Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India. He is currently an Associate Professor with the Department of CSE, Indian Institute of Technology Guwahati, Guwahati, India. He has coauthored 28 top-tier journal papers and 50 conference papers in various premier forums. His research focuses on deep learning approaches for solving different computer vision problems. Specifically, it includes image and video restoration, underwater vision, super-resolution, satellite image segmentation, image translation, image steganalysis, zero-shot learning, active learning, medical imaging, adversarial perturbation, and different ML-based approaches for adaptive video streaming in a 5G environment.