
One-Shot Open Affordance Learning with Foundation Models

Gen Li¹   Deqing Sun²   Laura Sevilla-Lara¹   Varun Jampani³

¹University of Edinburgh   ²Google Research   ³Stability AI

arXiv:2311.17776v1 [cs.CV] 29 Nov 2023

Abstract

We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category, but is expected to identify novel objects and affordances. While vision-language models excel at recognizing novel objects and scenes, they often struggle to understand finer levels of granularity such as affordances. To handle this issue, we conduct a comprehensive analysis of existing foundation models to explore their inherent understanding of affordances and assess the potential for data-limited affordance learning. We then propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings. Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data, and exhibits reasonable generalization capability on unseen objects and affordances.

Figure 1. The pipeline of one-shot open affordance learning. It uses one image per base object for training, and can perform zero-shot inference on novel objects and affordances.

1. Introduction

Affordances are the potential “action possibilities” regions of an object [21, 23], which play a pivotal role in various applications, including robotic learning [20, 27, 44], scene understanding [13, 33, 52], and human-object interaction [24, 43]. In particular, affordance is crucial for embodied intelligence, since it facilitates agents’ understanding of the associations between objects, actions, and effects in dynamic environments, thus bridging the gap between passive perception and active interaction [15, 39].

Learning to recognize object affordances across a variety of scenarios is challenging, since different objects can vary significantly in appearance, shape, and size, yet have the same functionality. For instance, a chef’s knife and a pair of office scissors share common affordances of cutting and holding, but their blades and handles look different.

A large portion of prior work [13, 17, 18, 40, 45, 46] has focused on learning a mapping between visual features and affordance labels, utilizing diverse resources as inputs, such as 2D images, RGB-D data, and 3D point clouds. This mapping can be established through a labeled dataset with predefined objects and affordances. However, large-scale affordance datasets are scarce, and most of them cover only a small number of object categories, making it difficult to apply the learned mapping to novel objects and scenes. To reduce the reliance on costly annotation, some recent studies perform affordance learning from sparse keypoints [16, 53, 54], videos of humans in action [19, 31, 43], or human-object interaction images [29, 36]. While alleviating the need for dense pixel labeling, these methods still require a large amount of training data. In addition, they often struggle to generalize to unseen objects and cannot identify novel affordances.

To tackle the above limitations, we are interested in learning an affordance model that does not rely on extensive datasets and can comprehend novel object and affordance classes. For example, after a model is trained with the knowledge that scissor blades afford cutting, it should generalize to related objects such as knives and axes, inferring that their blades can cut objects too. Moreover, the model should be able to reason about semantically similar vocabularies, e.g., “hold” and “grasp”, or “cut” and “slice”, instead of knowing only predefined affordance categories.

In this paper, we target the extreme case of using merely one example from each base object category, and term this research problem One-shot Open Affordance Learning (OOAL), where the model is trained with very little data and is expected to recognize novel objects and affordances during inference.
The illustration of the OOAL pipeline is shown in Fig. 1. Compared with typical affordance learning, which requires numerous training samples and can only reason within a closed affordance vocabulary, OOAL alleviates the need for large-scale datasets and broadens the scope of inference.

To this end, we note that foundation Vision-Language Models (VLMs), which have recently emerged as powerful tools for a wide array of computer vision tasks, can be a potential solution. The open-vocabulary nature of VLMs like CLIP [50], which are trained on a large corpus of image-text data, enables reasoning about previously unseen objects, scenes, and concepts. However, we observe that these models often fail to understand nuanced vocabularies such as affordances or object parts. One hypothesis is that object parts and affordances appear much less frequently in image captions compared with objects. Therefore, the following question naturally arises: Can we teach foundation models to comprehend more subtle, fine-grained aspects of objects, such as affordances, with very few examples? In this way, the generalization capability of foundation models can be inherited with minimal annotation effort.

To achieve this, we first conduct a thorough analysis of several representative foundation models. The objective is to delve into their inherent understanding of affordances, and to figure out which visual representation is suitable for data-limited affordance learning. Based on the analysis, we then build a learning architecture and propose several methods, including text prompt learning, multi-layer feature fusion, and a CLS-token-guided transformer decoder, that facilitate the alignment between visual representations and affordance text embeddings. Lastly, we select a dense prediction task, affordance segmentation, for evaluation and comparison with a variety of state-of-the-art models, where we find that our method achieves higher performance with less than 1% of the complete training data.

Overall, our contributions can be summarized as follows: (1) We introduce the problem of OOAL, aiming to develop a robust affordance model that can generalize to novel object and affordance categories without the need for massive training data. (2) We conduct a comprehensive analysis of existing foundation models to explore their potential for OOAL. Following the analysis, we build a learning architecture with vision-language foundation models, and design several methods to improve the alignment between visual features and affordance text labels. (3) We implement extensive experiments with two datasets on affordance segmentation to demonstrate the effectiveness of our learning pipeline, and observe significant gains over baselines with strong generalization capability.

2. Related Work

Affordance Learning. The term “affordance” was popularized by the psychologist James Gibson, who describes it as the properties of an object or the environment that suggest possible actions or interactions. Building on this, researchers have developed many approaches to acquire affordance information in various ways. In computer vision, initial research [13, 18, 28, 41] focused on affordance detection using convolutional neural networks. As manual affordance annotations are often costly to acquire, much subsequent research has shifted its focus to weak supervision such as keypoints [16, 53, 54] or image-level labels [36, 43]. Recent work has explored a novel perspective on how to ground affordances from human-object interaction images [29, 36, 64] or human action videos [9, 19, 31, 43]. In robotics, affordance learning enables robots to interact effectively and intelligently with complex and dynamic environments [2, 63]. Specifically, some work [3, 27, 58] utilizes affordance to build relationships between objects, tasks, and manipulations for robotic grasping. Other studies focus on learning affordance from other available resources that can be deployed on real robots, such as human teleoperated play data [6], image pairs [5], and egocentric video datasets [4].

In contrast to the works above, which often require a large amount of training data, we propose the problem of OOAL, which aims to perform affordance learning with one sample per base object category and allows zero-shot inference to handle novel objects and affordances.

Foundation Models for Affordance Learning. With the rapid development of foundation models such as Large Language Models (LLMs) and vision-language models, many research efforts have explored their utilization in affordance learning or reasoning. Mees et al. [37] leverage GPT-3 [7] to break down language instructions into subgoals, and learn a visual affordance model to complete real-world long-horizon tasks. Li et al. [29] adopt DINO-ViT features to perform affordance grounding by transferring affordance knowledge from human-object interaction images to egocentric views. Huang et al. [25] propose a novel pipeline that uses LLMs [48] for affordance reasoning, which interacts with VLMs to produce 3D affordance maps for robotic manipulation. Recent studies [38, 51, 56] delve into the integration of affordance and language models for task-oriented grasping, which allows robots to grasp objects in a more appropriate and safe manner.

The closest methods to ours are AffCorrs [22] and OpenAD [47]. AffCorrs utilizes the visual foundation model DINO to find corresponding affordances in a one-shot manner, but relevant objects are explicitly selected as support images, which significantly reduces the difficulty. OpenAD takes advantage of CLIP for open-vocabulary affordance detection in point clouds. It requires a large number of manual annotations, while our work performs affordance learning with merely one example per base object category.
3. Problem Setting

One-shot Open Affordance Learning (OOAL) aims to learn a model that predicts affordances from one example per base object class and generalizes to novel object classes. In this work, we focus on the dense prediction task of affordance segmentation. Specifically, objects are first divided into Nb base classes and No novel classes without intersection. The model receives only Nb samples during training, one for each base object category, where each sample is a pair of an image I ∈ R^{H×W×3} and a pixel-wise affordance annotation M ∈ R^{H×W×N} (N is the number of affordance categories in the dataset). After training, evaluation is performed on the combination of base and novel object categories to measure the generalization ability of the model. Also, affordance labels can be replaced with novel vocabularies that share similar semantics, such as “chop”, “slice”, and “trim” to represent an affordance akin to “cut”.
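To make the setting concrete, the sketch below (illustrative Python with hypothetical class names, not the authors' released code) assembles the one-shot training set of image-annotation pairs over the Nb base classes and the evaluation list over both base and novel classes, with tensor shapes following the definitions above.

```python
import torch

# Hypothetical split for illustration: N_b base classes, N_o novel classes,
# N affordance categories, images of size H x W (all values are placeholders).
BASE_CLASSES = ["knife", "scissors", "cup"]      # N_b = 3
NOVEL_CLASSES = ["axe", "mug"]                   # N_o = 2
N_AFF, H, W = 7, 224, 224

def build_one_shot_train_set():
    """Exactly one (image, annotation) pair per base object class.
    image: (3, H, W); mask: (N, H, W), one map per affordance category."""
    samples = []
    for cls in BASE_CLASSES:
        image = torch.rand(3, H, W)              # stand-in for a real photo
        mask = torch.zeros(N_AFF, H, W)          # stand-in for the annotation M
        samples.append({"object": cls, "image": image, "mask": mask})
    return samples

def build_eval_set():
    # Evaluation covers both base and novel object categories.
    return [{"object": cls, "image": torch.rand(3, H, W)}
            for cls in BASE_CLASSES + NOVEL_CLASSES]

print(len(build_one_shot_train_set()), "training samples (one per base class)")
print(len(build_eval_set()), "evaluation images over base + novel classes")
```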
It is worth noting that OOAL is different from one-shot semantic segmentation (OSSS) [55] and one-shot affordance detection (OS-AD) [34]. Both OSSS and OS-AD receive a one-shot sample during training. However, the sample keeps changing in each iteration, so the model can be exposed to many different images. Additionally, a support image is required at inference to provide prior information. In comparison, OOAL performs one-shot training and zero-shot inference, which poses additional challenges. The model needs to generalize to previously unseen objects, necessitating the ability to understand and recognize semantic relationships between seen and unseen classes with very limited data.

4. Method

4.1. Analysis of Foundation Models

The field of computer vision has recently witnessed a surge in the prevalence of large foundation models, such as CLIP [50], Segment Anything [26], and DINO [8, 49]. These models exhibit strong zero-shot generalization capabilities on several computer vision tasks, making them seem like a great option for tackling the problem of OOAL. To this end, we analyze several existing foundation models around three questions: ❶ Do current vision-language foundation models and their variants have the ability to detect affordances via affordance/part-based prompting? ❷ Can the features of visual foundation models discriminate affordance regions in images? ❸ Can these models generalize affordance recognition to novel objects and perform well in the low-shot setting?

Figure 2. Analysis of vision-language foundation models on text-based affordance grounding. The 1st and 3rd rows use affordance texts as input queries, and the 2nd and 4th rows use corresponding object parts as input text queries. Visualizations show that these models have limited ability to recognize fine-grained affordances and object parts.

Driven by question ❶, we select four representative models, i.e., the vanilla CLIP, a CLIP-based explainability method CLIP Surgery [30], a state-of-the-art open-vocabulary segmentation method CAT-Seg [12], and an open-vocabulary detection method GroundingDINO [32]. For vanilla CLIP, we employ the method proposed in MaskCLIP [66] that directly extracts dense predictions without fine-tuning. We use the text prompt template of “somewhere to [affordance]” to query visual features and find corresponding areas. As illustrated in Fig. 2, we note that most models cannot understand affordance well, except the detection model GroundingDINO, but its predictions mainly focus on the whole object rather than parts. As for dense prediction models, CAT-Seg often recognizes affordance regions as background, and CLIP gives high activation on both foreground and background. In comparison, CLIP Surgery fails to localize the “holding” area for a knife, but manages to associate the phrase “sit on” with a chair. Furthermore, even when the affordance text is replaced with corresponding object parts, predictions from CLIP and GroundingDINO remain biased toward objects, while CLIP Surgery and CAT-Seg tend to activate the wrong parts. This is consistent with recent findings [57, 60] that CLIP has limited part recognition ability.

To answer questions ❷ and ❸, we consider two essential characteristics of a good affordance model in the low-shot setting: (1) Part-aware representation. The visual representation should exhibit awareness of object parts, given that affordance often denotes small and fine-grained regions, e.g., a bicycle saddle to sit on or a knife handle to hold. (2) Part-level semantic correspondence. This property is critical for generalization, since the model requires the understanding of semantic relations to make reasonable predictions on novel objects.
In addition, good correspondence proves advantageous in scenarios with limited data, as the model becomes more robust to intra-class variation and less susceptible to changes in appearance. We then analyze the features from three representative and powerful visual foundation models, i.e., vision-language contrastive learning (CLIP), fully-supervised learning (DeiT III [59]), and self-supervised learning (DINOv2). First, we perform principal component analysis (PCA) on the extracted patch features of each model to investigate part awareness. Visualization of the PCA components in the top row of Fig. 4 shows that all three models have part-aware features to some extent, yet CLIP cannot well distinguish the background, and features of DeiT III are not discriminative enough for different parts. Next, we choose a different object that has equivalent affordances, i.e., knife and scissors, to assess the semantic correspondence. The bottom row of Fig. 4 shows the feature similarity maps computed as the cosine similarity between one patch representation on the knife blade and an image of scissors. It is obvious that DINOv2 shows finer correspondence between the blades of the knife and the scissors. By contrast, CLIP produces messy correspondences in both foreground and background, and feature correspondences of DeiT III are only discriminative at the object level, but not specific to the affordance part region (cut). From the above analysis, we conclude that DINOv2 is well suited for affordance learning due to its fine-grained part-aware representation and superior part-level semantic correspondence. Quantitative comparisons are shown in Sec. 5.5.
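As a rough, self-contained illustration of this analysis (not the authors' code), the sketch below assumes patch features have already been extracted from a ViT-style encoder such as DINOv2; it projects them onto their top principal components and computes a cosine-similarity map between one query patch and all patches of a second image.

```python
import torch
import torch.nn.functional as F

def pca_components(patch_feats, k=3):
    """patch_feats: (L, C) patch embeddings of one image.
    Returns (L, k) projections onto the top-k principal components, which
    can be reshaped to the patch grid and rendered as a pseudo-RGB image."""
    centered = patch_feats - patch_feats.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(centered, q=k)
    return centered @ v[:, :k]

def similarity_map(query_feat, target_feats, grid_hw):
    """Cosine similarity between one query patch embedding (C,) and every
    patch of a target image (L, C), reshaped to the target patch grid."""
    sim = F.cosine_similarity(query_feat.unsqueeze(0), target_feats, dim=-1)
    return sim.reshape(grid_hw)

# Toy usage with random tensors standing in for real patch features
# (a 16x16 grid of 768-dim patches for a ViT-B-style encoder).
knife_feats = torch.randn(16 * 16, 768)
scissors_feats = torch.randn(16 * 16, 768)
pca_rgb = pca_components(knife_feats).reshape(16, 16, 3)       # cf. top row of Fig. 4
blade_patch = knife_feats[5 * 16 + 8]      # index assumed to fall on the blade
corr = similarity_map(blade_patch, scissors_feats, (16, 16))   # cf. bottom row of Fig. 4
print(pca_rgb.shape, corr.shape)
```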
Figure 4. Analysis of visual foundation models on affordance learning. Top row: visualizations of PCA components. Bottom row: feature similarity maps between the yellow mark on the knife blade and the image of scissors. Qualitative results show that DINOv2 has clearer part-aware representations and better part-level semantic correspondence.

Figure 3. Proposed learning framework for OOAL. Our designs are highlighted in three color blocks, which are text prompt learning, multi-layer feature fusion, and CLS-guided transformer decoder. [CLS] denotes the CLS token of the vision encoder.

4.2. Motivation and method

Through a systematic analysis, we identify DINOv2 as a powerful tool for addressing the OOAL problem. However, there are still fundamental issues that hinder performance in this challenging setting. The first is that DINOv2 is a vision-only model and lacks the ability to identify novel affordances. One potential solution involves integrating a text encoder like CLIP, but its output is known to be sensitive to the input prompt. This is particularly problematic in the case of affordances, which combine both an object and a verb, making manual prompt design a complex task. The second issue is that, while the features of DINOv2 are part-oriented, the level of granularity varies across layers. Determining the appropriate granularity level is crucial when handling affordances associated with diverse objects. The third issue arises from the absence of alignment between the DINOv2 vision encoder and the CLIP text encoder, as they are trained separately and independently of each other. Building upon these observations, we establish a vision-language framework based on DINOv2 and CLIP, and propose three modules to resolve each of these three fundamental bottlenecks.
In this section, we first describe the overview of our proposed learning framework, which builds on powerful foundation models. Then, we elaborate on the three proposed designs that help in the challenging OOAL problem. Finally, we discuss the framework’s capability to identify novel objects and affordances at inference.

Overview. The proposed learning framework is presented in Fig. 3, which consists of a vision encoder, a text encoder, and a transformer decoder. First, the pretrained vision encoder DINOv2 is used to extract dense patch embeddings F̂_v ∈ R^{L×C_v}, where L is the number of tokens or patches. Then, affordance labels are processed by the CLIP text encoder to obtain text embeddings F_t ∈ R^{N×C}. To cope with inconsistent dimensions between visual and text embeddings, an embedder e_v : R^{C_v} → R^{C} with a single MLP layer is employed. In the end, the lightweight transformer decoder takes both visual and text embeddings as input, and outputs the affordance prediction.
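The sketch below is a minimal skeleton of this pipeline under our own naming and shape assumptions (it uses a stock transformer decoder layer in place of the CLS-guided decoder, which is detailed later): DINOv2 patch tokens are projected by the one-layer MLP embedder to the text dimension, the decoder updates the affordance text embeddings, and a final matrix product against the visual features yields per-patch affordance scores.

```python
import torch
import torch.nn as nn

L, C_v, C, N = 256, 768, 512, 7   # patches, vision dim, text dim, affordances

class OOALHead(nn.Module):
    """Skeleton: embedder e_v plus a lightweight decoder on top of frozen
    DINOv2 patch tokens and frozen CLIP text embeddings (simplified)."""
    def __init__(self):
        super().__init__()
        self.embedder = nn.Linear(C_v, C)   # e_v : R^{C_v} -> R^{C}
        self.decoder = nn.TransformerDecoderLayer(d_model=C, nhead=8, batch_first=True)

    def forward(self, patch_tokens, text_embeds):
        # patch_tokens: (B, L, C_v) from the vision encoder
        # text_embeds:  (N, C)      from the text encoder
        F_v = self.embedder(patch_tokens)                           # (B, L, C)
        F_t = text_embeds.unsqueeze(0).expand(F_v.size(0), -1, -1)  # (B, N, C)
        F_t = self.decoder(tgt=F_t, memory=F_v)                     # updated text embeddings
        return torch.einsum("bnc,blc->bnl", F_t, F_v)               # (B, N, L) patch scores

head = OOALHead()
logits = head(torch.randn(2, L, C_v), torch.randn(N, C))
print(logits.shape)   # (2, 7, 256); reshape L back to the patch grid for a dense map
```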
Text Prompt Learning. Manually designing prompts for affordances can be complicated, especially considering that CLIP has difficulty in recognizing affordances (see Fig. 2). Thus, we adopt the Context Optimization (CoOp) [67] method to introduce automatic text prompt learning. Instead of finetuning the CLIP text encoder, the inclusion of learnable prompts is an effective strategy that can alleviate the problem of overfitting and retain the inherent text recognition ability of CLIP. Specifically, p randomly initialized learnable context vectors {v_1, v_2, ..., v_p} are inserted in front of the text CLS token, and they are shared across all affordance classes.
are inserted in front of the text CLS token, and they are transformers, and the ultimate prediction is generated by
shared for all affordance classes. performing matrix product between the output of the last
Multi-Layer Feature Fusion. Different layers of DINOv2 transform layer and original visual features Fv , thereby en-
features often exhibit different levels of granularity [1]. suring the maximum retention of part-aware representations
Since affordance may correspond to multiple parts of an ob- from DINOv2. Lastly, binary cross entropy is employed as
ject, a diverse set of granularities can be beneficial. For this loss function to optimize parameters of linear layers, em-
purpose, we aggregate the features of the last j layers. Each bedder, and decoder.
layer of features is first processed by a linear projection, Inference on novel objects and affordances. During the
and then all features are linearly combined with a weighted training process, the decoder learns to establish an align-
summation: ment between visual features and affordance text embed-
j
X dings. When encountering a novel object at inference, the
F̂v = αi · ϕ(Fn−i+1 ), α1 + α2 + ... + αj = 1, (1) aligned affordance text embeddings can locate correspond-
i=1 ing object regions, leveraging the part-level semantic cor-
where Fn denotes the last layer, α is a learnable parame- respondence property inherent in DINOv2. Similarly, as
ter that controls the fusion ratio of each layer, and ϕ indi- the model processes novel affordance text inputs, the gen-
cates the linear transformation. This straightforward fusion erated text embeddings can also retrieve the aligned visual
scheme enables adaptive selection among different granu- features, which are based on the semantic similarities to the
larity levels, allowing the model to handle affordance recog- base affordances seen in the training.
nition across diverse scenarios.
CLS-Guided Transformer Decoder. To deal with the lack 5. Experiments
of alignment between visual and text features, we propose
5.1. Datasets
a lightweight transformer decoder that applies a masked
cross-attention mechanism to promote the mutual commu- We choose two typical datasets, AGD20K [36] and
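A single-head sketch of Eqs. (2)-(4) follows (head count, FFN width, and the exact way M_cls gates the attention are our assumptions; here it rescales the attention weights element-wise before they are applied to V). At the end of the t-layer decoder, the updated text embeddings would be multiplied with the original visual features to produce the dense prediction, as described above.

```python
import torch
import torch.nn as nn

class CLSGuidedDecoderLayer(nn.Module):
    """One decoder layer following Eqs. (2)-(4): affordance text embeddings
    query the visual features, gated by a sigmoid mask built from [CLS]."""
    def __init__(self, dim=512):
        super().__init__()
        self.phi_q, self.phi_k, self.phi_v, self.phi_c = (nn.Linear(dim, dim) for _ in range(4))
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, F_t, F_v, cls_token):
        # F_t: (B, N, dim) text embeddings; F_v: (B, L, dim); cls_token: (B, dim)
        Q, K, V = self.phi_q(F_t), self.phi_k(F_v), self.phi_v(F_v)
        scale = Q.size(-1) ** 0.5
        # Eq. (3): CLS-guided foreground mask over the L visual tokens
        M_cls = torch.sigmoid(self.phi_c(cls_token).unsqueeze(1) @ K.transpose(1, 2) / scale)
        # Eq. (4): masked cross-attention with a residual connection
        attn = torch.softmax(Q @ K.transpose(1, 2) / scale, dim=-1)    # (B, N, L)
        F_t_hat = (attn * M_cls) @ V + F_t
        return self.ffn(F_t_hat) + F_t_hat                             # updated text embeddings

layer = CLSGuidedDecoderLayer()
out = layer(torch.randn(2, 7, 512), torch.randn(2, 256, 512), torch.randn(2, 512))
print(out.shape)   # torch.Size([2, 7, 512])
```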
Inference on novel objects and affordances. During the training process, the decoder learns to establish an alignment between visual features and affordance text embeddings. When encountering a novel object at inference, the aligned affordance text embeddings can locate corresponding object regions, leveraging the part-level semantic correspondence property inherent in DINOv2. Similarly, as the model processes novel affordance text inputs, the generated text embeddings can also retrieve the aligned visual features, based on their semantic similarity to the base affordances seen in training.

5. Experiments

5.1. Datasets

We choose two typical datasets, AGD20K [36] and UMD part affordance [46], both of which include a large number of object categories that help in the evaluation of novel objects.
| Task | Training Data (seen / unseen split) | Method | Seen KLD↓ | Seen SIM↑ | Seen NSS↑ | Unseen KLD↓ | Unseen SIM↑ | Unseen NSS↑ |
| WSAG | 23,083 / 15,543 images, image-level labels | Hotspots [43] | 1.773 | 0.278 | 0.615 | 1.994 | 0.237 | 0.577 |
| WSAG | | Cross-view-AG [36] | 1.538 | 0.334 | 0.927 | 1.787 | 0.285 | 0.829 |
| WSAG | | Cross-view-AG+ [35] | 1.489 | 0.342 | 0.981 | 1.765 | 0.279 | 0.882 |
| WSAG | | LOCATE [29] | 1.226 | 0.401 | 1.177 | 1.405 | 0.372 | 1.157 |
| OOAL | 50 / 33 images, keypoint labels | MaskCLIP [66] | 5.752 | 0.169 | 0.041 | 6.052 | 0.152 | 0.047 |
| OOAL | | SAN [62] | 1.435 | 0.357 | 0.941 | 1.580 | 0.351 | 1.022 |
| OOAL | | ZegCLIP [68] | 1.413 | 0.387 | 1.001 | 1.552 | 0.361 | 1.042 |
| OOAL | | Ours | 0.740 | 0.577 | 1.745 | 1.070 | 0.461 | 1.503 |

Table 1. Comparison with the state of the art on the AGD20K dataset. The OOAL setting uses 0.22% / 0.21% of the full training data. WSAG denotes weakly-supervised affordance grounding. The best and second-best results are highlighted in bold and underlined, respectively.

| Setting | Method | Seen mIoU | Unseen mIoU | hIoU |
| Fully Supervised | DeepLabV3+ [10] | 70.5 | 57.5 | 63.3 |
| Fully Supervised | SegFormer [61] | 74.6 | 57.7 | 65.0 |
| Fully Supervised | PSPNet [65] | 72.0 | 60.8 | 66.0 |
| OOAL | PSPNet [65] | 56.7 | 46.6 | 51.1 |
| OOAL | DeepLabV3+ [10] | 56.8 | 48.4 | 52.3 |
| OOAL | SegFormer [11] | 64.6 | 51.4 | 57.3 |
| OOAL | MaskCLIP [66] | 4.25 | 4.24 | 4.25 |
| OOAL | SAN [62] | 45.1 | 32.2 | 37.5 |
| OOAL | ZegCLIP [68] | 47.4 | 36.0 | 40.9 |
| OOAL | Ours | 74.6 | 59.7 | 66.4 |

Table 2. Comparison on the UMD dataset. Fully-supervised methods are trained with 14,823 and 20,874 images with pixel-level labels for the seen and unseen split, respectively. In contrast, the OOAL setting uses 54 and 76 images, 0.36% of the full training data.

AGD20K is a large-scale affordance grounding dataset with 36 affordances and 50 objects, containing 23,816 images from exocentric and egocentric views. It aims to learn affordance from human-object interaction images, and to perform affordance localization on egocentric images. As it is a dataset for weakly-supervised learning, images in the training set only have image-level labels. Therefore, we manually annotate 50 randomly selected egocentric images from each object category for training. AGD20K also has two train-test splits for the seen and unseen settings, and we follow these splits to evaluate performance. Note that AGD20K uses sparse annotation, where the ground truth consists of keypoints within affordance areas, and a Gaussian kernel is then applied over each point to produce dense annotation.

The UMD dataset consists of 28,843 RGB-D images with 7 affordances and 17 object categories, and images of each object are captured on a revolving turntable. It has two train-test splits, termed the category split and the novel split. We use the category split to evaluate base object categories and the novel split to evaluate performance on novel object classes. Due to its small number of object categories, we take one example from each base object instance to form the training set. Specific affordance categories and object class splits can be found in the supplementary material.

5.2. Implementation details

Experiments are implemented on two GeForce RTX 3090 GPUs. All visual foundation models use the same base-sized vision transformer (ViT-base). We train the model using the SGD optimizer with a learning rate of 0.01 for 20k iterations. For experiments on AGD20K, images are first resized to 256 × 256 and randomly cropped to 224 × 224 with horizontal flipping. Experiments on the UMD dataset are conducted with the open-source toolbox MMSegmentation [14] using the default training setting. The hyperparameters p, j, and t are set to 8, 3, and 2, respectively.

Following previous work, we adopt the commonly used Kullback-Leibler Divergence (KLD), Similarity (SIM), and Normalized Scanpath Saliency (NSS) metrics to evaluate results on AGD20K. For the UMD dataset, we use mean intersection-over-union (mIoU), and also report the harmonic mIoU (hIoU) as a balanced measure that accounts for both the seen and unseen settings.
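For reference, the sketch below computes these metrics with their standard saliency-benchmark definitions (the exact evaluation scripts used by the benchmarks may differ in details such as normalization), together with the harmonic mIoU used for UMD.

```python
import numpy as np

def kld(pred, gt, eps=1e-12):
    """KL divergence between the GT and predicted maps, each normalized
    to sum to one (lower is better)."""
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return float((g * np.log(eps + g / (p + eps))).sum())

def sim(pred, gt, eps=1e-12):
    """Histogram intersection of the two normalized maps (higher is better)."""
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return float(np.minimum(p, g).sum())

def nss(pred, fixations, eps=1e-12):
    """Mean of the z-scored prediction at GT fixation points (higher is better)."""
    z = (pred - pred.mean()) / (pred.std() + eps)
    return float(z[fixations > 0].mean())

def harmonic_miou(seen, unseen):
    """Balanced measure over the seen and unseen splits."""
    return 2 * seen * unseen / (seen + unseen)

pred, gt = np.random.rand(224, 224), np.random.rand(224, 224)
fix = (gt > 0.95).astype(np.float32)
print(kld(pred, gt), sim(pred, gt), nss(pred, fix), harmonic_miou(74.6, 59.7))
```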
5.3. Comparison to state-of-the-art methods

The AGD20K dataset is benchmarked against weakly-supervised affordance grounding (WSAG) approaches, which use image-level object and affordance labels to perform affordance segmentation. Note that results from WSAG methods are not directly comparable to our setting, as the training labels are different. Despite using only image-level labels, the training data they require is more than 460 times ours. The results in Tab. 1 demonstrate that our results exceed all WSAG counterparts in an easy and realistic setting. We also benchmark the open-vocabulary segmentation methods MaskCLIP, SAN, and ZegCLIP for further comparison. We find that these CLIP-based methods have a large performance gap with ours, and are also inferior to the state-of-the-art WSAG method LOCATE.
Figure 5. Qualitative comparison with LOCATE and ZegCLIP on the AGD20K dataset. When multiple affordance predictions overlap, the one with the higher value is displayed. Our predictions distinguish different object parts, while other methods often make overlapping predictions.

The comprehensive comparison on the UMD dataset is displayed in Tab. 2, where we benchmark against several representative semantic segmentation methods (PSPNet, DeepLabV3+, SegFormer) and open-vocabulary semantic segmentation methods. For a fair comparison, the classical segmentation methods are trained with the full training set, while foundation-model-based methods like ZegCLIP and SAN are evaluated in the OOAL setting. It is clear that our proposed model is quite effective, as it is comparable to fully-supervised methods with only 0.36% of their training data. To explore how fully-supervised methods are affected by limited data, we further train these models in the OOAL setting. Results in Tab. 2 show that the performance of these models degrades by around 10% in both the seen and unseen settings when given only a one-shot example. Additionally, under the same OOAL setting, we observe a more apparent gain over other CLIP-based open-vocabulary segmentation methods, showing that CLIP is not suitable for data-limited affordance learning. The poor performance of MaskCLIP in both tables also verifies that CLIP has very limited understanding of affordance.

5.4. Qualitative results

Qualitative comparisons on the AGD20K dataset are shown in Fig. 5. We note that WSAG methods like LOCATE often make overlapping predictions for examples with multiple affordances, while our results show a clear separation between different affordance regions. ZegCLIP can make reasonable predictions to some extent, but it mostly focuses on the whole object and its accuracy is far from satisfactory, whereas our results are more part-focused, especially for unseen objects.

Figure 6. Qualitative comparison with SegFormer and ZegCLIP on the UMD affordance dataset in the OOAL setting. Images have been enlarged and cropped for better visualization.
For example, the prediction for the unseen object bicycle shows that our model can handle the complex affordance (ride) with multiple separated affordance areas (saddle, handlebar, and pedal). In Fig. 6, we display the results for the UMD dataset. We observe that SegFormer and ZegCLIP often fail to recognize affordances of objects whose parts are similar in appearance. Also, they tend to misclassify metallic object parts as the cuttable affordance, suggesting that inferring affordances from appearance features alone can be misleading. In comparison, our predictions are more accurate due to the utilization of DINOv2’s part-level semantic correspondences.

One particular feature of our model is that it can recognize novel affordances not shown during training. To demonstrate this, we replace the original affordance labels with semantically similar words and check whether the model can still reason about the corresponding affordance areas. As shown in Fig. 7, the model manages to make correct predictions for novel affordances, such as “hold and grab” for the base affordance “grasp”, “saw” for “cut”, and “accommodate” for “contain”.

Figure 7. Qualitative examples of novel affordance prediction on the UMD dataset. The 1st and 2nd rows display results on base objects, and the 3rd and 4th rows show results for novel objects.

5.5. Ablation study

The ablation study is performed on the more challenging AGD20K dataset due to its natural images with diverse backgrounds. Ablations on hyperparameters are left to the supplementary material.

Different Vision Encoders. To complement the qualitative analysis in Sec. 4.1, we conduct quantitative experiments on CLIP, DeiT III, and DINOv2. Specifically, we simply process the visual features with the embedder, and perform matrix multiplication with pre-computed affordance text embeddings to output segmentation maps. As shown in Tab. 3, CLIP and DeiT III exhibit comparable performance, whereas DINOv2 achieves much better results in both the seen and unseen settings, which is consistent with the analysis that DINOv2 is more suitable for affordance learning.

| Model | Seen KLD↓ | Seen SIM↑ | Seen NSS↑ | Unseen KLD↓ | Unseen SIM↑ | Unseen NSS↑ |
| CLIP | 1.294 | 0.384 | 1.107 | 1.556 | 0.327 | 0.966 |
| DeiT III | 1.301 | 0.378 | 1.140 | 1.535 | 0.321 | 1.049 |
| DINOv2 | 1.156 | 0.425 | 1.297 | 1.462 | 0.360 | 1.105 |

Table 3. Ablation results of different visual foundation models.

Proposed Methods. We use DINOv2 with a simple embedder as the baseline, and gradually integrate our methods to analyze the effect of each proposed design. The results in Tab. 4 reveal that each module consistently delivers notable improvements. In particular, we notice that the inclusion of a transformer decoder enhances performance in the seen setting, but yields inferior results in the unseen setting. With the integration of the CLS-guided mask, results in both settings improve, suggesting that restricting the cross-attention space is an effective strategy for unseen object affordance recognition.

| Method | Seen KLD↓ | Seen SIM↑ | Seen NSS↑ | Unseen KLD↓ | Unseen SIM↑ | Unseen NSS↑ |
| Baseline | 1.156 | 0.425 | 1.297 | 1.462 | 0.360 | 1.105 |
| + TPL | 1.060 | 0.455 | 1.422 | 1.338 | 0.390 | 1.302 |
| + MLFF | 0.846 | 0.537 | 1.622 | 1.115 | 0.447 | 1.440 |
| + TD | 0.749 | 0.578 | 1.738 | 1.131 | 0.443 | 1.408 |
| + CTM | 0.740 | 0.577 | 1.745 | 1.070 | 0.461 | 1.503 |

Table 4. Ablation results of proposed modules. TPL: text prompt learning. MLFF: multi-layer feature fusion. TD: transformer decoder. CTM: CLS-guided mask.

6. Conclusion

In this paper, we propose the problem of one-shot open affordance learning, which uses one example per base object category as training data and requires the ability to recognize novel objects and affordances. We first present a detailed analysis of different foundation models for the purpose of data-limited affordance learning. Motivated by the analysis, we build a vision-language learning framework with several proposed designs that better utilize the visual features and promote alignment with text embeddings. Experimental results demonstrate that we achieve comparable performance to several fully-supervised baselines with less than 1% of the full training data.
References

[1] Shir Amir, Yossi Gandelsman, Shai Bagon, and Tali Dekel. Deep vit features as dense visual descriptors. ECCVW What is Motion For, 2022. 5
[2] Paola Ardón, Èric Pairet, Katrin S Lohan, Subramanian Ramamoorthy, and Ronald Petrick. Affordances in robotic tasks–a survey. arXiv preprint arXiv:2004.07400, 2020. 2
[3] Paola Ardón, Eric Pairet, Ronald PA Petrick, Subramanian Ramamoorthy, and Katrin S Lohan. Learning grasp affordance reasoning through semantic relations. IEEE Robotics and Automation Letters, 4(4):4571–4578, 2019. 2
[4] Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13778–13790, 2023. 2
[5] Homanga Bharadhwaj, Abhinav Gupta, and Shubham Tulsiani. Visual affordance prediction for guiding robot exploration. arXiv preprint arXiv:2305.17783, 2023. 2
[6] Jessica Borja-Diaz, Oier Mees, Gabriel Kalweit, Lukas Hermann, Joschka Boedecker, and Wolfram Burgard. Affordance learning from play for sample-efficient policy learning. In 2022 International Conference on Robotics and Automation (ICRA), pages 6372–6378. IEEE, 2022. 2
[7] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020. 2
[8] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021. 3
[9] Joya Chen, Difei Gao, Kevin Qinghong Lin, and Mike Zheng Shou. Affordance grounding from demonstration video to target image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6799–6808, 2023. 2
[10] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801–818, 2018. 6
[11] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1290–1299, 2022. 6
[12] Seokju Cho, Heeseong Shin, Sunghwan Hong, Seungjun An, Seungjun Lee, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. Cat-seg: Cost aggregation for open-vocabulary semantic segmentation. arXiv preprint arXiv:2303.11797, 2023. 3
[13] Ching-Yao Chuang, Jiaman Li, Antonio Torralba, and Sanja Fidler. Learning to act properly: Predicting and explaining affordances from images. In CVPR, 2018. 1, 2
[14] MMSegmentation Contributors. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020. 6
[15] Francisco Cruz, Sven Magg, Cornelius Weber, and Stefan Wermter. Training agents with interactive reinforcement learning and contextual affordances. IEEE Transactions on Cognitive and Developmental Systems, 8(4):271–284, 2016. 1
[16] Leiyao Cui, Xiaoxue Chen, Hao Zhao, Guyue Zhou, and Yixin Zhu. Strap: Structured object affordance segmentation with point supervision. arXiv preprint arXiv:2304.08492, 2023. 1, 2
[17] Shengheng Deng, Xun Xu, Chaozheng Wu, Ke Chen, and Kui Jia. 3d affordancenet: A benchmark for visual object affordance understanding. In CVPR, 2021. 1
[18] Thanh Toan Do, Anh Nguyen, and Ian Reid. Affordancenet: An end-to-end deep learning approach for object affordance detection. ICRA, 2018. 1, 2
[19] Kuan Fang, Te Lin Wu, Daniel Yang, Silvio Savarese, and Joseph J. Lim. Demo2Vec: Reasoning Object Affordances from Online Videos. CVPR, 2018. 1, 2
[20] Yiran Geng, Boshi An, Haoran Geng, Yuanpei Chen, Yaodong Yang, and Hao Dong. End-to-end affordance learning for robotic manipulation. arXiv preprint arXiv:2209.12941, 2022. 1
[21] James J. Gibson. The Ecological Approach to Visual Perception: Classic Edition. Houghton Mifflin, 1979. 1
[22] Denis Hadjivelichkov, Sicelukwanda Zwane, Marc Deisenroth, Lourdes Agapito, and Dimitrios Kanoulas. One-Shot Transfer of Affordance Regions? AffCorrs! CoRL, 2022. 2
[23] Mohammed Hassanin, Salman Khan, and Murat Tahtali. Visual affordance and function understanding: A survey. ACM Computing Surveys (CSUR), 54(3):1–35, 2021. 1
[24] Zhi Hou, Baosheng Yu, Yu Qiao, Xiaojiang Peng, and Dacheng Tao. Affordance transfer learning for human-object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 495–504, 2021. 1
[25] Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023. 2
[26] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023. 3
[27] Mia Kokic, Johannes A Stork, Joshua A Haustein, and Danica Kragic. Affordance detection for task-specific grasping using deep learning. In 2017 IEEE-RAS 17th International Conference on Humanoid Robotics (Humanoids), pages 91–98. IEEE, 2017. 1, 2
[28] Hema Swetha Koppula, Rudhir Gupta, and Ashutosh Saxena. Learning human activities and object affordances from rgb-d videos. The International Journal of Robotics Research, 32(8):951–970, 2013. 2
[29] Gen Li, Varun Jampani, Deqing Sun, and Laura Sevilla-Lara. Locate: Localize and transfer object parts for weakly supervised affordance grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10922–10931, 2023. 1, 2, 6
[30] Yi Li, Hualiang Wang, Yiqun Duan, and Xiaomeng Li. Clip surgery for better explainability with enhancement in open-vocabulary tasks. arXiv preprint arXiv:2304.05653, 2023. 3
[31] Shaowei Liu, Subarna Tripathi, Somdeb Majumdar, and Xiaolong Wang. Joint hand motion and interaction hotspots prediction from egocentric videos. In CVPR, 2022. 1, 2
[32] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023. 3
[33] Timo Luddecke and Florentin Worgotter. Learning to segment affordances. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 769–776, 2017. 1
[34] Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, and Dacheng Tao. One-shot affordance detection. In IJCAI, 2021. 3
[35] Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, and Dacheng Tao. Grounded affordance from exocentric view. arXiv preprint arXiv:2208.13196, 2022. 6
[36] Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, and Dacheng Tao. Learning affordance grounding from exocentric images. CVPR, 2022. 1, 2, 5, 6, 12
[37] Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard. Grounding language with visual affordances over unstructured data. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11576–11582. IEEE, 2023. 2
[38] Reihaneh Mirjalili, Michael Krawez, Simone Silenzi, Yannik Blei, and Wolfram Burgard. Lan-grasp: Using large language models for semantic object grasping. arXiv preprint arXiv:2310.05239, 2023. 2
[39] Luis Montesano, Manuel Lopes, Alexandre Bernardino, and Jose Santos-Victor. Affordances, development and imitation. In 2007 IEEE 6th International Conference on Development and Learning, pages 270–275. IEEE, 2007. 1
[40] Lorenzo Mur-Labadia, Ruben Martinez-Cantin, and Jose J Guerrero. Bayesian deep learning for affordance segmentation in images. arXiv preprint arXiv:2303.00871, 2023. 1
[41] Austin Myers, Angjoo Kanazawa, Cornelia Fermuller, and Yiannis Aloimonos. Affordance of Object Parts from Geometric Features. Int. Conf. Robot. Autom., pages 5–6, 2015. 2
[42] Austin Myers, Ching L Teo, Cornelia Fermüller, and Yiannis Aloimonos. Affordance detection of tool parts from geometric features. ICRA, 2015. 12
[43] Tushar Nagarajan, Christoph Feichtenhofer, and Kristen Grauman. Grounded human-object interaction hotspots from video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8688–8697, 2019. 1, 2, 6
[44] Tushar Nagarajan and Kristen Grauman. Learning affordance landscapes for interaction exploration in 3d environments. Advances in Neural Information Processing Systems, 33:2005–2015, 2020. 1
[45] Anh Nguyen, Dimitrios Kanoulas, Darwin G Caldwell, and Nikos G Tsagarakis. Detecting object affordances with convolutional neural networks. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2765–2770. IEEE, 2016. 1
[46] Anh Nguyen, Dimitrios Kanoulas, Darwin G Caldwell, and Nikos G Tsagarakis. Object-based affordances detection with convolutional neural networks and dense conditional random fields. In IROS, 2017. 1, 5
[47] Toan Ngyen, Minh Nhat Vu, An Vuong, Dzung Nguyen, Thieu Vo, Ngan Le, and Anh Nguyen. Open-vocabulary affordance detection in 3d point clouds. arXiv preprint arXiv:2303.02401, 2023. 2
[48] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 2
[49] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 3
[50] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 2, 3
[51] Adam Rashid, Satvik Sharma, Chung Min Kim, Justin Kerr, Lawrence Chen, Angjoo Kanazawa, and Ken Goldberg. Language embedded radiance fields for zero-shot task-oriented grasping. arXiv preprint arXiv:2309.07970, 2023. 2
[52] Anirban Roy and Sinisa Todorovic. A multi-scale cnn for affordance segmentation in rgb images. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 186–201. Springer, 2016. 1
[53] Johann Sawatzky and Jurgen Gall. Adaptive binarization for weakly supervised affordance segmentation. In ICCVW, 2017. 1, 2
[54] Johann Sawatzky, Abhilash Srikantha, and Juergen Gall. Weakly supervised affordance detection. CVPR, 2017. 1, 2
[55] Amirreza Shaban, Shray Bansal, Zhen Liu, Irfan Essa, and Byron Boots. One-shot learning for semantic segmentation. arXiv preprint arXiv:1709.03410, 2017. 3
[56] Yaoxian Song, Penglei Sun, Yi Ren, Yu Zheng, and Yue Zhang. Learning 6-dof fine-grained grasp detection based on part affordance grounding. arXiv preprint arXiv:2301.11564, 2023. 2
[57] Peize Sun, Shoufa Chen, Chenchen Zhu, Fanyi Xiao, Ping Luo, Saining Xie, and Zhicheng Yan. Going denser with open-vocabulary part segmentation. arXiv preprint arXiv:2305.11173, 2023. 3
[58] Chao Tang, Jingwen Yu, Weinan Chen, and Hong Zhang.
Relationship oriented affordance learning through manipula-
tion graph construction. arXiv preprint arXiv:2110.14137,
2021. 2
[59] Hugo Touvron, Matthieu Cord, and Hervé Jégou. Deit iii:
Revenge of the vit. In European Conference on Computer
Vision, pages 516–533. Springer, 2022. 4
[60] Meng Wei, Xiaoyu Yue, Wenwei Zhang, Shu Kong,
Xihui Liu, and Jiangmiao Pang. Ov-parts: To-
wards open-vocabulary part segmentation. arXiv preprint
arXiv:2310.05107, 2023. 3
[61] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar,
Jose M Alvarez, and Ping Luo. Segformer: Simple and
efficient design for semantic segmentation with transform-
ers. Advances in Neural Information Processing Systems,
34:12077–12090, 2021. 6
[62] Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xi-
ang Bai. Side adapter network for open-vocabulary semantic
segmentation. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 2945–
2954, 2023. 6
[63] Xintong Yang, Ze Ji, Jing Wu, and Yu-Kun Lai. Recent ad-
vances of deep robotic affordance learning: a reinforcement
learning perspective. IEEE Transactions on Cognitive and
Developmental Systems, 2023. 2
[64] Yuhang Yang, Wei Zhai, Hongchen Luo, Yang Cao, Jiebo
Luo, and Zheng-Jun Zha. Grounding 3d object affor-
dance from 2d interactions in images. arXiv preprint
arXiv:2303.10437, 2023. 2
[65] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang
Wang, and Jiaya Jia. Pyramid scene parsing network. In
Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 2881–2890, 2017. 6
[66] Chong Zhou, Chen Change Loy, and Bo Dai. Extract free
dense labels from clip. In European Conference on Com-
puter Vision, pages 696–712. Springer, 2022. 3, 6
[67] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei
Liu. Learning to prompt for vision-language models. In-
ternational Journal of Computer Vision, 130(9):2337–2348,
2022. 5
[68] Ziqin Zhou, Yinjie Lei, Bowen Zhang, Lingqiao Liu, and
Yifan Liu. Zegclip: Towards adapting clip for zero-shot se-
mantic segmentation. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, pages
11175–11185, 2023. 6

A. Dataset Details

To evaluate the model’s generalization ability in the challenging One-shot Open Affordance Learning (OOAL) setting, datasets with a large number of object categories are required. In addition, at least two object categories are needed for each affordance so that the model can be trained on one object and tested on the other. After an investigation of existing affordance datasets, we find only two datasets, AGD20K [36] and UMD [42], that fulfill these prerequisites and can be used to evaluate the affordance segmentation task. The specific affordance and object categories of these two datasets are shown in Tab. 5. For the unseen split, we display the object category division in Tab. 6. The model is trained on base object classes and evaluated on novel object categories.

| Dataset | Affordance | Object |
| UMD | (7) grasp, cut, scoop, contain, pound, support, wrap-grasp | (17) bowl, cup, hammer, knife, ladle, mallet, mug, pot, saw, scissors, scoop, shears, shovel, spoon, tenderizer, trowel, turner |
| AGD20K | (37) beat, boxing, brush with, carry, catch, cut, cut with, drag, drink with, eat, hit, hold, jump, kick, lie on, lift, look out, open, pack, peel, pick up, pour, push, ride, sip, sit on, stick, stir, swing, take photo, talk on, text on, throw, type on, wash, write | (50) apple, axe, badminton racket, banana, baseball, baseball bat, basketball, bed, bench, bicycle, binoculars, book, bottle, bowl, broccoli, camera, carrot, cell phone, chair, couch, cup, discus, drum, fork, frisbee, golf clubs, hammer, hot dog, javelin, keyboard, knife, laptop, microwave, motorcycle, orange, oven, pen, punching bag, refrigerator, rugby ball, scissors, skateboard, skis, snowboard, soccer ball, suitcase, surfboard, tennis racket, toothbrush, wine glass |

Table 5. Affordance and object classes in the UMD and AGD20K dataset. The number of classes is shown in parentheses.

| Dataset | Base Objects (Train) | Novel Objects (Test) |
| UMD | (9) cup, ladle, pot, saw, scoop, shears, shovel, tenderizer, trowel | (8) bowl, hammer, knife, mallet, mug, scissors, spoon, turner |
| AGD20K | (33) apple, badminton racket, baseball, baseball bat, bench, book, bottle, bowl, carrot, cell phone, chair, couch, discus, fork, frisbee, hammer, hot dog, javelin, keyboard, microwave, motorcycle, orange, oven, punching bag, rugby ball, scissors, skateboard, snowboard, suitcase, surfboard, tennis racket, toothbrush, wine glass | (14) axe, banana, basketball, bed, bicycle, broccoli, camera, cup, golf clubs, knife, laptop, refrigerator, skis, soccer ball |

Table 6. Object category division in the unseen split of the UMD and AGD20K dataset. The number of categories is shown in parentheses.

Moreover, it is worth noting that the annotations in AGD20K and UMD are of different types. UMD uses pixel-level dense binary maps, while the ground truth of AGD20K consists of sparse keypoints within the affordance areas, and a Gaussian distribution is then applied to each point to generate dense annotation. The difference between dense and sparse affordance annotation is highlighted in Fig. 8.

Figure 8. Different affordance annotation schemes. Dense affordance annotation is labeled as binary masks. Sparse affordance annotation is first labeled as keypoints, and then a Gaussian kernel is applied over each point to produce pixel-wise ground truth.

B. Ablation Study on Hyperparameters

The proposed framework involves three primary hyperparameters, i.e., the number of learnable text tokens p, vision encoder fusion layers j, and decoder transformer layers t. We conduct ablation studies individually to explore the impact of these hyperparameters, as detailed in Tab. 7, Tab. 8, and Tab. 9. Notably, increasing the number of learnable text tokens up to 8 shows a gradual improvement in performance in the seen setting, but leads to fluctuating results in the unseen setting, indicating that generalization is sensitive to this choice when confronted with unseen objects. In terms of the fusion layers, fusing the last two layers demonstrates an obvious performance gain compared to the single-layer counterpart, and integrating the last three layers yields the best results. Lastly, we note that the transformer decoder can effectively improve performance in both the seen and unseen settings, and a two-layer transformer decoder produces the best results.

| p | Seen KLD↓ | Seen SIM↑ | Seen NSS↑ | Unseen KLD↓ | Unseen SIM↑ | Unseen NSS↑ |
| 2 | 0.774 | 0.568 | 1.710 | 1.119 | 0.457 | 1.434 |
| 4 | 0.765 | 0.573 | 1.714 | 1.102 | 0.469 | 1.449 |
| 6 | 0.760 | 0.572 | 1.726 | 1.162 | 0.440 | 1.383 |
| 8 | 0.740 | 0.577 | 1.745 | 1.070 | 0.461 | 1.503 |
| 10 | 0.768 | 0.581 | 1.726 | 1.111 | 0.460 | 1.463 |

Table 7. Ablation study on the number of learnable tokens p in text prompt learning.

| j | Seen KLD↓ | Seen SIM↑ | Seen NSS↑ | Unseen KLD↓ | Unseen SIM↑ | Unseen NSS↑ |
| 1 | 1.060 | 0.455 | 1.422 | 1.338 | 0.390 | 1.302 |
| 2 | 0.748 | 0.576 | 1.756 | 1.105 | 0.456 | 1.452 |
| 3 | 0.740 | 0.577 | 1.745 | 1.070 | 0.461 | 1.503 |
| 4 | 0.762 | 0.579 | 1.713 | 1.129 | 0.453 | 1.401 |

Table 8. Ablation study on the number of fusion layers j in multi-layer feature fusion.

| t | Seen KLD↓ | Seen SIM↑ | Seen NSS↑ | Unseen KLD↓ | Unseen SIM↑ | Unseen NSS↑ |
| 0 | 0.846 | 0.537 | 1.622 | 1.115 | 0.447 | 1.440 |
| 1 | 0.753 | 0.574 | 1.737 | 1.094 | 0.449 | 1.492 |
| 2 | 0.740 | 0.577 | 1.745 | 1.067 | 0.465 | 1.492 |
| 3 | 0.746 | 0.575 | 1.738 | 1.110 | 0.458 | 1.456 |

Table 9. Ablation study on the number of transformer decoder layers t.

C. Additional Visualizations

C.1. Visualization of CLS-guided mask

In Fig. 9, we display the visualization of the CLS-guided mask from the proposed CLS-guided transformer decoder. It can be seen that the mask primarily concentrates on foreground objects, thus facilitating the cross-attention within salient regions.

Figure 9. Visualization of the CLS-guided mask.

C.2. Visualization of Unseen Affordances

In Fig. 11, we further display examples on the AGD20K dataset to showcase that our model has the ability to recognize unseen affordances. It is evident that the model can consistently activate relevant affordance areas when receiving text that was previously unseen during training.

Figure 11. Qualitative examples of unseen affordance prediction on the AGD20K dataset. The 2nd column shows the results on seen affordances, and the 3rd and 4th columns show results with unseen affordances.

C.3. Additional Qualitative Results

In Fig. 10, we present more qualitative results on the AGD20K dataset. The comparison demonstrates that predictions from our method exhibit clear separation among object parts, while predictions from other approaches often bias towards one part or the whole object. In particular, our method can locate very fine-grained affordance areas even for unseen objects, such as the saddle of a bicycle for “sit on”, and the handle of a golf club for “hold”.

Figure 10. Additional qualitative comparison on the AGD20K dataset.

D. Discussion and Limitations

This study introduces the novel problem of OOAL, and presents a framework built upon foundation models that can perform effective affordance learning with limited samples and annotations. We note that this framework can potentially be used in various applications, such as robotic manipulation and virtual reality. For instance, in robotic manipulation, the model can make reasonable affordance predictions for diverse base and novel objects, requiring minimal annotation effort. This stands in contrast to traditional methods that necessitate extensive training data or numerous simulated interaction trials to gain affordance knowledge.
Despite achieving good performance with few training samples, our framework reveals two limitations. First, while text prompt learning enhances performance on unseen objects, it diminishes the framework’s generalization capacity to unseen affordances. This occurs because an excess of learnable tokens can weaken the intrinsic word similarities within the CLIP text encoder. A viable solution to this limitation involves combining the learnable prompts with manually designed prompts. Second, the performance is notably influenced by the selection of the one-shot example. Instances with heavy occlusion or inferior lighting conditions can impact the learning performance. Given the inherent challenges in learning from merely a one-shot example, this limitation appears reasonable and logical.
