
One-Shot Open Affordance Learning with Foundation Models

Gen Li¹   Deqing Sun²   Laura Sevilla-Lara¹   Varun Jampani³

¹University of Edinburgh   ²Google Research   ³Stability AI

arXiv:2311.17776v1 [cs.CV] 29 Nov 2023

Abstract

We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category, but is expected to identify novel objects and affordances. While vision-language models excel at recognizing novel objects and scenes, they often struggle to understand finer levels of granularity such as affordances. To handle this issue, we conduct a comprehensive analysis of existing foundation models to explore their inherent understanding of affordances and assess the potential for data-limited affordance learning. We then propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings. Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data, and exhibits reasonable generalization capability on unseen objects and affordances.

Figure 1. The pipeline of one-shot open affordance learning. It uses one image per base object for training, and can perform zero-shot inference on novel objects and affordances.

1. Introduction

Affordances are the potential “action possibilities” regions of an object [21, 23], which play a pivotal role in various applications, including robotic learning [20, 27, 44], scene understanding [13, 33, 52], and human-object interaction [24, 43]. In particular, affordance is crucial for embodied intelligence, since it facilitates agents’ understanding of the associations between objects, actions, and effects in dynamic environments, thus bridging the gap between passive perception and active interaction [15, 39].

Learning to recognize object affordances across a variety of scenarios is challenging, since different objects can vary significantly in appearance, shape, and size, yet have the same functionality. For instance, a chef’s knife and a pair of office scissors share common affordances of cutting and holding, but their blades and handles look different.

A large portion of prior work [13, 17, 18, 40, 45, 46] has focused on learning a mapping between visual features and affordance labels, utilizing diverse resources as inputs, such as 2D images, RGB-D data, and 3D point clouds. This mapping can be established through a labeled dataset with predefined objects and affordances. However, large-scale affordance datasets are scarce, and most of them cover only a small number of object categories, making it difficult to apply the learned mapping to novel objects and scenes. To reduce the reliance on costly annotation, some recent studies perform affordance learning from sparse keypoints [16, 53, 54], videos of humans in action [19, 31, 43], or human-object interaction images [29, 36]. While alleviating the need for dense pixel labeling, these methods still require a large amount of training data. In addition, they often struggle to generalize to unseen objects and cannot identify novel affordances.

To tackle the above limitations, we are interested in learning an affordance model that does not rely on extensive datasets and can comprehend novel object and affordance classes. For example, after a model is trained with the knowledge that scissor blades afford cutting, it should generalize to related objects such as knives and axes, inferring that their blades can cut objects too. Moreover, the model should be able to reason about semantically similar vocabularies, e.g., “hold” and “grasp”, or “cut” and “slice”, instead of knowing only predefined affordance categories.

In this paper, we target the extreme case of using merely one example from each base object category, and term this research problem One-shot Open Affordance Learning (OOAL), where the model is trained with very little data and is expected to recognize novel objects and affordances during inference.
The illustration of the OOAL pipeline is shown in Fig. 1. Compared with typical affordance learning, which requires numerous training samples and can only reason within a closed affordance vocabulary, OOAL alleviates the need for large-scale datasets and broadens the scope of inference.

To this end, we note that foundation Vision-Language Models (VLMs), which have recently emerged as powerful tools for a wide array of computer vision tasks, can be a potential solution. The open-vocabulary nature of VLMs like CLIP [50], which are trained on a large corpus of image-text data, enables reasoning about previously unseen objects, scenes, and concepts. However, we observe that these models often fail to understand nuanced vocabularies such as affordances or object parts. One hypothesis is that object parts and affordances appear much less frequently in image captions compared with objects. Therefore, the following question naturally arises: Can we teach foundation models to comprehend more subtle, fine-grained aspects of objects, such as affordances, with very few examples? In this way, the generalization capability of foundation models can be inherited with minimal annotation effort.

To achieve this, we first conduct a thorough analysis of several representative foundation models. The objective is to delve into their inherent understanding of affordances, and to figure out which visual representation is suitable for data-limited affordance learning. Based on the analysis, we then build a learning architecture and propose several methods, including text prompt learning, multi-layer feature fusion, and a CLS-token-guided transformer decoder, that facilitate the alignment between visual representations and affordance text embeddings. Lastly, we select a dense prediction task, affordance segmentation, for evaluation and comparison with a variety of state-of-the-art models, where we find that our method achieves higher performance with less than 1% of the complete training data.

Overall, our contributions can be summarized as follows: (1) We introduce the problem of OOAL, aiming to develop a robust affordance model that can generalize to novel object and affordance categories without the need for massive training data. (2) We conduct a comprehensive analysis of existing foundation models to explore their potential for OOAL. Following the analysis, we build a learning architecture with vision-language foundation models, and design several methods to improve the alignment between visual features and affordance text labels. (3) We implement extensive experiments with two datasets on affordance segmentation to demonstrate the effectiveness of our learning pipeline, and observe significant gains over baselines with strong generalization capability.

2. Related Work

Affordance Learning. The term “affordance” was popularized by the psychologist James Gibson, who describes it as the properties of an object or the environment that suggest possible actions or interactions. Building on this, researchers have developed many approaches to acquire affordance information in various ways. In computer vision, initial research [13, 18, 28, 41] focused on affordance detection using convolutional neural networks. As manual affordance annotations are often costly to acquire, much subsequent research has shifted its focus to weak supervision such as keypoints [16, 53, 54] or image-level labels [36, 43]. Recent work has explored a novel perspective on how to ground affordances from human-object interaction images [29, 36, 64] or human action videos [9, 19, 31, 43]. In robotics, affordance learning enables robots to interact effectively and intelligently with complex and dynamic environments [2, 63]. Specifically, some work [3, 27, 58] utilizes affordance to build relationships between objects, tasks, and manipulations for robotic grasping. Other studies focus on learning affordance from other available resources that can be deployed on real robots, such as human teleoperated play data [6], image pairs [5], and egocentric video datasets [4].

In contrast to the works above, which often require a large amount of training data, we propose the problem of OOAL, which aims to perform affordance learning with one sample per base object category and allows zero-shot inference to handle novel objects and affordances.

Foundation Models for Affordance Learning. With the rapid development of foundation models such as Large Language Models (LLMs) and vision-language models, many research efforts have explored their utilization in affordance learning or reasoning. Mees et al. [37] leverage GPT-3 [7] to break down language instructions into subgoals, and learn a visual affordance model to complete real-world long-horizon tasks. Li et al. [29] adopt DINO-ViT features to perform affordance grounding by transferring affordance knowledge from human-object interaction images to egocentric views. Huang et al. [25] propose a novel pipeline that uses LLMs [48] for affordance reasoning, which interacts with VLMs to produce 3D affordance maps for robotic manipulation. Recent studies [38, 51, 56] delve into the integration of affordance and language models for task-oriented grasping, which allows robots to grasp objects in a more appropriate and safe manner.

The closest methods to ours are AffCorrs [22] and OpenAD [47]. AffCorrs utilizes the visual foundation model DINO to find corresponding affordances in a one-shot manner, but relevant objects are explicitly selected as support images, which significantly reduces the difficulty. OpenAD takes advantage of CLIP for open-vocabulary affordance detection in point clouds. It requires a large number of manual annotations, while our work performs affordance learning with merely one example per base object category.
3. Problem Setting

One-shot Open Affordance Learning (OOAL) aims to learn a model that predicts affordances from one example per base object class and generalizes to novel object classes. In this work, we focus on the dense prediction task of affordance segmentation. Specifically, objects are first divided into Nb base classes and No novel classes without intersection. The model receives only Nb samples during training, one for each base object category, where each sample is a pair of an image I ∈ R^{H×W×3} and a pixel-wise affordance annotation M ∈ R^{H×W×N} (N is the number of affordance categories in the dataset). After training, evaluation is performed on the combination of base and novel object categories to measure the generalization ability of the model. Also, affordance labels can be replaced with novel vocabularies that share similar semantics, such as “chop”, “slice”, and “trim” to represent an affordance akin to “cut”.
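To make the setting concrete, the sketch below (illustrative Python with hypothetical class names, not the authors' released code) assembles the one-shot training set of image-annotation pairs over the Nb base classes and the evaluation list over both base and novel classes, with tensor shapes following the definitions above.

```python
import torch

# Hypothetical split for illustration: N_b base classes, N_o novel classes,
# N affordance categories, images of size H x W (all values are placeholders).
BASE_CLASSES = ["knife", "scissors", "cup"]      # N_b = 3
NOVEL_CLASSES = ["axe", "mug"]                   # N_o = 2
N_AFF, H, W = 7, 224, 224

def build_one_shot_train_set():
    """Exactly one (image, annotation) pair per base object class.
    image: (3, H, W); mask: (N, H, W), one map per affordance category."""
    samples = []
    for cls in BASE_CLASSES:
        image = torch.rand(3, H, W)              # stand-in for a real photo
        mask = torch.zeros(N_AFF, H, W)          # stand-in for the annotation M
        samples.append({"object": cls, "image": image, "mask": mask})
    return samples

def build_eval_set():
    # Evaluation covers both base and novel object categories.
    return [{"object": cls, "image": torch.rand(3, H, W)}
            for cls in BASE_CLASSES + NOVEL_CLASSES]

print(len(build_one_shot_train_set()), "training samples (one per base class)")
print(len(build_eval_set()), "evaluation images over base + novel classes")
```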
It is worth noting that OOAL is different from one-shot semantic segmentation (OSSS) [55] and one-shot affordance detection (OS-AD) [34]. Both OSSS and OS-AD receive a one-shot sample during training. However, the sample keeps changing in each iteration, so the model can be exposed to many different images. Additionally, a support image is required at inference to provide prior information. In comparison, OOAL performs one-shot training and zero-shot inference, which poses additional challenges. The model needs to generalize to previously unseen objects, necessitating the ability to understand and recognize semantic relationships between seen and unseen classes with very limited data.

4. Method

4.1. Analysis of Foundation Models

The field of computer vision has recently witnessed a surge in the prevalence of large foundation models, such as CLIP [50], Segment Anything [26], and DINO [8, 49]. These models exhibit strong zero-shot generalization capabilities on several computer vision tasks, making them seem like a great option for tackling the problem of OOAL. To this end, we analyze several existing foundation models around three questions: ❶ Do current vision-language foundation models and their variants have the ability to detect affordances via affordance/part-based prompting? ❷ Can the features of visual foundation models discriminate affordance regions in images? ❸ Can these models generalize affordance recognition to novel objects and perform well in the low-shot setting?

Figure 2. Analysis of vision-language foundation models on text-based affordance grounding. The 1st and 3rd rows use affordance texts as input queries, and the 2nd and 4th rows use corresponding object parts as input text queries. Visualizations show that these models have limited ability to recognize fine-grained affordances and object parts.

Driven by question ❶, we select four representative models, i.e., the vanilla CLIP, a CLIP-based explainability method CLIP Surgery [30], a state-of-the-art open-vocabulary segmentation method CAT-Seg [12], and an open-vocabulary detection method GroundingDINO [32]. For vanilla CLIP, we employ the method proposed in MaskCLIP [66] that directly extracts dense predictions without fine-tuning. We use the text prompt template of “somewhere to [affordance]” to query visual features and find corresponding areas. As illustrated in Fig. 2, we note that most models cannot understand affordance well, except the detection model GroundingDINO, but its predictions mainly focus on the whole object rather than parts. As for dense prediction models, CAT-Seg often recognizes affordance regions as background, and CLIP gives high activation on both foreground and background. In comparison, CLIP Surgery fails to localize the “holding” area for a knife, but manages to associate the phrase “sit on” with a chair. Furthermore, even when the affordance text is replaced with corresponding object parts, predictions from CLIP and GroundingDINO remain biased toward objects, while CLIP Surgery and CAT-Seg tend to activate the wrong parts. This is consistent with recent findings [57, 60] that CLIP has limited part recognition ability.

To answer questions ❷ and ❸, we consider two essential characteristics of a good affordance model in the low-shot setting: (1) Part-aware representation. The visual representation should exhibit awareness of object parts, given that affordance often denotes small and fine-grained regions, e.g., a bicycle saddle to sit on or a knife handle to hold. (2) Part-level semantic correspondence. This property is critical for generalization, since the model requires the understanding of semantic relations to make reasonable predictions on novel objects.
In addition, good correspondence proves advantageous in scenarios with limited data, as the model becomes more robust to intra-class variation and less susceptible to changes in appearance. We then analyze the features from three representative and powerful visual foundation models, i.e., vision-language contrastive learning (CLIP), fully-supervised learning (DeiT III [59]), and self-supervised learning (DINOv2). First, we perform principal component analysis (PCA) on the extracted patch features of each model to investigate part awareness. Visualization of the PCA components in the top row of Fig. 4 shows that all three models have part-aware features to some extent, yet CLIP cannot well distinguish the background, and features of DeiT III are not discriminative enough for different parts. Next, we choose a different object that has equivalent affordances, i.e., knife and scissors, to assess the semantic correspondence. The bottom row of Fig. 4 shows the feature similarity maps computed as the cosine similarity between one patch representation on the knife blade and an image of scissors. It is obvious that DINOv2 shows finer correspondence between the blades of the knife and the scissors. By contrast, CLIP produces messy correspondences in both foreground and background, and feature correspondences of DeiT III are only discriminative at the object level, but not specific to the affordance part region (cut). From the above analysis, we conclude that DINOv2 is well suited for affordance learning due to its fine-grained part-aware representation and superior part-level semantic correspondence. Quantitative comparisons are shown in Sec. 5.5.
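As a rough, self-contained illustration of this analysis (not the authors' code), the sketch below assumes patch features have already been extracted from a ViT-style encoder such as DINOv2; it projects them onto their top principal components and computes a cosine-similarity map between one query patch and all patches of a second image.

```python
import torch
import torch.nn.functional as F

def pca_components(patch_feats, k=3):
    """patch_feats: (L, C) patch embeddings of one image.
    Returns (L, k) projections onto the top-k principal components, which
    can be reshaped to the patch grid and rendered as a pseudo-RGB image."""
    centered = patch_feats - patch_feats.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(centered, q=k)
    return centered @ v[:, :k]

def similarity_map(query_feat, target_feats, grid_hw):
    """Cosine similarity between one query patch embedding (C,) and every
    patch of a target image (L, C), reshaped to the target patch grid."""
    sim = F.cosine_similarity(query_feat.unsqueeze(0), target_feats, dim=-1)
    return sim.reshape(grid_hw)

# Toy usage with random tensors standing in for real patch features
# (a 16x16 grid of 768-dim patches for a ViT-B-style encoder).
knife_feats = torch.randn(16 * 16, 768)
scissors_feats = torch.randn(16 * 16, 768)
pca_rgb = pca_components(knife_feats).reshape(16, 16, 3)       # cf. top row of Fig. 4
blade_patch = knife_feats[5 * 16 + 8]      # index assumed to fall on the blade
corr = similarity_map(blade_patch, scissors_feats, (16, 16))   # cf. bottom row of Fig. 4
print(pca_rgb.shape, corr.shape)
```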
Figure 4. Analysis of visual foundation models on affordance learning. Top row: visualizations of PCA components. Bottom row: feature similarity maps between the yellow mark on the knife blade and the image of scissors. Qualitative results show that DINOv2 has clearer part-aware representations and better part-level semantic correspondence.

Figure 3. Proposed learning framework for OOAL. Our designs are highlighted in three color blocks, which are text prompt learning, multi-layer feature fusion, and CLS-guided transformer decoder. [CLS] denotes the CLS token of the vision encoder.

4.2. Motivation and method

Through a systematic analysis, we identify DINOv2 as a powerful tool for addressing the OOAL problem. However, there are still fundamental issues that hinder performance in this challenging setting. The first is that DINOv2 is a vision-only model and lacks the ability to identify novel affordances. One potential solution involves integrating a text encoder like CLIP, but its output is known to be sensitive to the input prompt. This is particularly problematic in the case of affordances, which combine both an object and a verb, making manual prompt design a complex task. The second issue is that, while the features of DINOv2 are part-oriented, the level of granularity varies across layers. Determining the appropriate granularity level is crucial when handling affordances associated with diverse objects. The third issue arises from the absence of alignment between the DINOv2 vision encoder and the CLIP text encoder, as they are trained separately and independently of each other. Building upon these observations, we establish a vision-language framework based on DINOv2 and CLIP, and propose three modules to resolve each of these three fundamental bottlenecks.
In this section, we first describe the overview of our proposed learning framework, which builds on powerful foundation models. Then, we elaborate on the three proposed designs that help in the challenging OOAL problem. Finally, we discuss the framework’s capability to identify novel objects and affordances at inference.

Overview. The proposed learning framework is presented in Fig. 3, which consists of a vision encoder, a text encoder, and a transformer decoder. First, the pretrained vision encoder DINOv2 is used to extract dense patch embeddings F̂_v ∈ R^{L×C_v}, where L is the number of tokens or patches. Then, affordance labels are processed by the CLIP text encoder to obtain text embeddings F_t ∈ R^{N×C}. To cope with inconsistent dimensions between visual and text embeddings, an embedder e_v : R^{C_v} → R^{C} with a single MLP layer is employed. In the end, the lightweight transformer decoder takes both visual and text embeddings as input, and outputs the affordance prediction.
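The sketch below is a minimal skeleton of this pipeline under our own naming and shape assumptions (it uses a stock transformer decoder layer in place of the CLS-guided decoder, which is detailed later): DINOv2 patch tokens are projected by the one-layer MLP embedder to the text dimension, the decoder updates the affordance text embeddings, and a final matrix product against the visual features yields per-patch affordance scores.

```python
import torch
import torch.nn as nn

L, C_v, C, N = 256, 768, 512, 7   # patches, vision dim, text dim, affordances

class OOALHead(nn.Module):
    """Skeleton: embedder e_v plus a lightweight decoder on top of frozen
    DINOv2 patch tokens and frozen CLIP text embeddings (simplified)."""
    def __init__(self):
        super().__init__()
        self.embedder = nn.Linear(C_v, C)   # e_v : R^{C_v} -> R^{C}
        self.decoder = nn.TransformerDecoderLayer(d_model=C, nhead=8, batch_first=True)

    def forward(self, patch_tokens, text_embeds):
        # patch_tokens: (B, L, C_v) from the vision encoder
        # text_embeds:  (N, C)      from the text encoder
        F_v = self.embedder(patch_tokens)                           # (B, L, C)
        F_t = text_embeds.unsqueeze(0).expand(F_v.size(0), -1, -1)  # (B, N, C)
        F_t = self.decoder(tgt=F_t, memory=F_v)                     # updated text embeddings
        return torch.einsum("bnc,blc->bnl", F_t, F_v)               # (B, N, L) patch scores

head = OOALHead()
logits = head(torch.randn(2, L, C_v), torch.randn(N, C))
print(logits.shape)   # (2, 7, 256); reshape L back to the patch grid for a dense map
```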
Text Prompt Learning. Manually designing prompts for affordances can be complicated, especially considering that CLIP has difficulty in recognizing affordances (see Fig. 2). Thus, we adopt the Context Optimization (CoOp) [67] method to introduce automatic text prompt learning. Instead of finetuning the CLIP text encoder, the inclusion of learnable prompts is an effective strategy that can alleviate the problem of overfitting and retain the inherent text recognition ability of CLIP. Specifically, p randomly initialized learnable context vectors {v_1, v_2, ..., v_p} are inserted in front of the text CLS token, and they are shared across all affordance classes.
are inserted in front of the text CLS token, and they are transformers, and the ultimate prediction is generated by
shared for all affordance classes. performing matrix product between the output of the last
Multi-Layer Feature Fusion. Different layers of DINOv2 transform layer and original visual features Fv , thereby en-
features often exhibit different levels of granularity [1]. suring the maximum retention of part-aware representations
Since affordance may correspond to multiple parts of an ob- from DINOv2. Lastly, binary cross entropy is employed as
ject, a diverse set of granularities can be beneficial. For this loss function to optimize parameters of linear layers, em-
purpose, we aggregate the features of the last j layers. Each bedder, and decoder.
layer of features is first processed by a linear projection, Inference on novel objects and affordances. During the
and then all features are linearly combined with a weighted training process, the decoder learns to establish an align-
summation: ment between visual features and affordance text embed-
j
X dings. When encountering a novel object at inference, the
F̂v = αi · ϕ(Fn−i+1 ), α1 + α2 + ... + αj = 1, (1) aligned affordance text embeddings can locate correspond-
i=1 ing object regions, leveraging the part-level semantic cor-
where Fn denotes the last layer, α is a learnable parame- respondence property inherent in DINOv2. Similarly, as
ter that controls the fusion ratio of each layer, and ϕ indi- the model processes novel affordance text inputs, the gen-
cates the linear transformation. This straightforward fusion erated text embeddings can also retrieve the aligned visual
scheme enables adaptive selection among different granu- features, which are based on the semantic similarities to the
larity levels, allowing the model to handle affordance recog- base affordances seen in the training.
nition across diverse scenarios.
CLS-Guided Transformer Decoder. To deal with the lack 5. Experiments
of alignment between visual and text features, we propose
5.1. Datasets
a lightweight transformer decoder that applies a masked
cross-attention mechanism to promote the mutual commu- We choose two typical datasets, AGD20K [36] and
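A single-head sketch of Eqs. (2)-(4) follows (head count, FFN width, and the exact way M_cls gates the attention are our assumptions; here it rescales the attention weights element-wise before they are applied to V). At the end of the t-layer decoder, the updated text embeddings would be multiplied with the original visual features to produce the dense prediction, as described above.

```python
import torch
import torch.nn as nn

class CLSGuidedDecoderLayer(nn.Module):
    """One decoder layer following Eqs. (2)-(4): affordance text embeddings
    query the visual features, gated by a sigmoid mask built from [CLS]."""
    def __init__(self, dim=512):
        super().__init__()
        self.phi_q, self.phi_k, self.phi_v, self.phi_c = (nn.Linear(dim, dim) for _ in range(4))
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, F_t, F_v, cls_token):
        # F_t: (B, N, dim) text embeddings; F_v: (B, L, dim); cls_token: (B, dim)
        Q, K, V = self.phi_q(F_t), self.phi_k(F_v), self.phi_v(F_v)
        scale = Q.size(-1) ** 0.5
        # Eq. (3): CLS-guided foreground mask over the L visual tokens
        M_cls = torch.sigmoid(self.phi_c(cls_token).unsqueeze(1) @ K.transpose(1, 2) / scale)
        # Eq. (4): masked cross-attention with a residual connection
        attn = torch.softmax(Q @ K.transpose(1, 2) / scale, dim=-1)    # (B, N, L)
        F_t_hat = (attn * M_cls) @ V + F_t
        return self.ffn(F_t_hat) + F_t_hat                             # updated text embeddings

layer = CLSGuidedDecoderLayer()
out = layer(torch.randn(2, 7, 512), torch.randn(2, 256, 512), torch.randn(2, 512))
print(out.shape)   # torch.Size([2, 7, 512])
```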
Inference on novel objects and affordances. During the training process, the decoder learns to establish an alignment between visual features and affordance text embeddings. When encountering a novel object at inference, the aligned affordance text embeddings can locate corresponding object regions, leveraging the part-level semantic correspondence property inherent in DINOv2. Similarly, as the model processes novel affordance text inputs, the generated text embeddings can also retrieve the aligned visual features, based on their semantic similarity to the base affordances seen in training.

5. Experiments

5.1. Datasets

We choose two typical datasets, AGD20K [36] and UMD part affordance [46], both of which include a large number of object categories that help in the evaluation of novel objects.
| Task | Training Data (seen / unseen split) | Method | Seen KLD↓ | Seen SIM↑ | Seen NSS↑ | Unseen KLD↓ | Unseen SIM↑ | Unseen NSS↑ |
| WSAG | 23,083 / 15,543 images, image-level labels | Hotspots [43] | 1.773 | 0.278 | 0.615 | 1.994 | 0.237 | 0.577 |
| WSAG | | Cross-view-AG [36] | 1.538 | 0.334 | 0.927 | 1.787 | 0.285 | 0.829 |
| WSAG | | Cross-view-AG+ [35] | 1.489 | 0.342 | 0.981 | 1.765 | 0.279 | 0.882 |
| WSAG | | LOCATE [29] | 1.226 | 0.401 | 1.177 | 1.405 | 0.372 | 1.157 |
| OOAL | 50 / 33 images, keypoint labels | MaskCLIP [66] | 5.752 | 0.169 | 0.041 | 6.052 | 0.152 | 0.047 |
| OOAL | | SAN [62] | 1.435 | 0.357 | 0.941 | 1.580 | 0.351 | 1.022 |
| OOAL | | ZegCLIP [68] | 1.413 | 0.387 | 1.001 | 1.552 | 0.361 | 1.042 |
| OOAL | | Ours | 0.740 | 0.577 | 1.745 | 1.070 | 0.461 | 1.503 |

Table 1. Comparison with the state of the art on the AGD20K dataset. The OOAL setting uses 0.22% / 0.21% of the full training data. WSAG denotes weakly-supervised affordance grounding. The best and second-best results are highlighted in bold and underlined, respectively.

| Setting | Method | Seen mIoU | Unseen mIoU | hIoU |
| Fully Supervised | DeepLabV3+ [10] | 70.5 | 57.5 | 63.3 |
| Fully Supervised | SegFormer [61] | 74.6 | 57.7 | 65.0 |
| Fully Supervised | PSPNet [65] | 72.0 | 60.8 | 66.0 |
| OOAL | PSPNet [65] | 56.7 | 46.6 | 51.1 |
| OOAL | DeepLabV3+ [10] | 56.8 | 48.4 | 52.3 |
| OOAL | SegFormer [11] | 64.6 | 51.4 | 57.3 |
| OOAL | MaskCLIP [66] | 4.25 | 4.24 | 4.25 |
| OOAL | SAN [62] | 45.1 | 32.2 | 37.5 |
| OOAL | ZegCLIP [68] | 47.4 | 36.0 | 40.9 |
| OOAL | Ours | 74.6 | 59.7 | 66.4 |

Table 2. Comparison on the UMD dataset. Fully-supervised methods are trained with 14,823 and 20,874 images with pixel-level labels for the seen and unseen split, respectively. In contrast, the OOAL setting uses 54 and 76 images, 0.36% of the full training data.

AGD20K is a large-scale affordance grounding dataset with 36 affordances and 50 objects, containing 23,816 images from exocentric and egocentric views. It aims to learn affordance from human-object interaction images, and to perform affordance localization on egocentric images. As it is a dataset for weakly-supervised learning, images in the training set only have image-level labels. Therefore, we manually annotate 50 randomly selected egocentric images from each object category for training. AGD20K also has two train-test splits for the seen and unseen settings, and we follow these splits to evaluate performance. Note that AGD20K uses sparse annotation, where the ground truth consists of keypoints within affordance areas, and a Gaussian kernel is then applied over each point to produce dense annotation.

The UMD dataset consists of 28,843 RGB-D images with 7 affordances and 17 object categories, and images of each object are captured on a revolving turntable. It has two train-test splits, termed the category split and the novel split. We use the category split to evaluate base object categories and the novel split to evaluate performance on novel object classes. Due to its small number of object categories, we take one example from each base object instance to form the training set. Specific affordance categories and object class splits can be found in the supplementary material.

5.2. Implementation details

Experiments are implemented on two GeForce RTX 3090 GPUs. All visual foundation models use the same base-sized vision transformer (ViT-base). We train the model using the SGD optimizer with a learning rate of 0.01 for 20k iterations. For experiments on AGD20K, images are first resized to 256 × 256 and randomly cropped to 224 × 224 with horizontal flipping. Experiments on the UMD dataset are conducted with the open-source toolbox MMSegmentation [14] using the default training setting. The hyperparameters p, j, and t are set to 8, 3, and 2, respectively.

Following previous work, we adopt the commonly used Kullback-Leibler Divergence (KLD), Similarity (SIM), and Normalized Scanpath Saliency (NSS) metrics to evaluate results on AGD20K. For the UMD dataset, we use mean intersection-over-union (mIoU), and also report the harmonic mIoU (hIoU) as a balanced measure that accounts for both the seen and unseen settings.
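For reference, the sketch below computes these metrics with their standard saliency-benchmark definitions (the exact evaluation scripts used by the benchmarks may differ in details such as normalization), together with the harmonic mIoU used for UMD.

```python
import numpy as np

def kld(pred, gt, eps=1e-12):
    """KL divergence between the GT and predicted maps, each normalized
    to sum to one (lower is better)."""
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return float((g * np.log(eps + g / (p + eps))).sum())

def sim(pred, gt, eps=1e-12):
    """Histogram intersection of the two normalized maps (higher is better)."""
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return float(np.minimum(p, g).sum())

def nss(pred, fixations, eps=1e-12):
    """Mean of the z-scored prediction at GT fixation points (higher is better)."""
    z = (pred - pred.mean()) / (pred.std() + eps)
    return float(z[fixations > 0].mean())

def harmonic_miou(seen, unseen):
    """Balanced measure over the seen and unseen splits."""
    return 2 * seen * unseen / (seen + unseen)

pred, gt = np.random.rand(224, 224), np.random.rand(224, 224)
fix = (gt > 0.95).astype(np.float32)
print(kld(pred, gt), sim(pred, gt), nss(pred, fix), harmonic_miou(74.6, 59.7))
```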
5.3. Comparison to state-of-the-art methods

The AGD20K dataset is benchmarked against weakly-supervised affordance grounding (WSAG) approaches, which use image-level object and affordance labels to perform affordance segmentation. Note that results from WSAG methods are not directly comparable to our setting, as the training labels are different. Despite using only image-level labels, the training data they require is more than 460 times ours. The results in Tab. 1 demonstrate that our results exceed all WSAG counterparts in an easy and realistic setting. We also benchmark the open-vocabulary segmentation methods MaskCLIP, SAN, and ZegCLIP for further comparison. We find that these CLIP-based methods have a large performance gap with ours, and are also inferior to the state-of-the-art WSAG method LOCATE.
Figure 5. Qualitative comparison with LOCATE and ZegCLIP on the AGD20K dataset. When multiple affordance predictions overlap, the one with the higher value is displayed. Our predictions distinguish different object parts, while other methods often make overlapping predictions.

The comprehensive comparison on the UMD dataset is displayed in Tab. 2, where we benchmark against several representative semantic segmentation methods (PSPNet, DeepLabV3+, SegFormer) and open-vocabulary semantic segmentation methods. For a fair comparison, the classical segmentation methods are trained with the full training set, while foundation-model-based methods like ZegCLIP and SAN are evaluated in the OOAL setting. It is clear that our proposed model is quite effective, as it is comparable to fully-supervised methods with only 0.36% of their training data. To explore how fully-supervised methods are affected by limited data, we further train these models in the OOAL setting. Results in Tab. 2 show that the performance of these models degrades by around 10% in both the seen and unseen settings when given only a one-shot example. Additionally, under the same OOAL setting, we observe a more apparent gain over other CLIP-based open-vocabulary segmentation methods, showing that CLIP is not suitable for data-limited affordance learning. The poor performance of MaskCLIP in both tables also verifies that CLIP has very limited understanding of affordance.

5.4. Qualitative results

Qualitative comparisons on the AGD20K dataset are shown in Fig. 5. We note that WSAG methods like LOCATE often make overlapping predictions for examples with multiple affordances, while our results show a clear separation between different affordance regions. ZegCLIP can make reasonable predictions to some extent, but it mostly focuses on the whole object and its accuracy is far from satisfactory, whereas our results are more part-focused, especially for unseen objects.

Figure 6. Qualitative comparison with SegFormer and ZegCLIP on the UMD affordance dataset in the OOAL setting. Images have been enlarged and cropped for better visualization.
For example, the prediction for the unseen object bicycle shows that our model can handle the complex affordance (ride) with multiple separated affordance areas (saddle, handlebar, and pedal). In Fig. 6, we display the results for the UMD dataset. We observe that SegFormer and ZegCLIP often fail to recognize affordances of objects whose parts are similar in appearance. Also, they tend to misclassify metallic object parts as the cuttable affordance, suggesting that inferring affordances from appearance features alone can be misleading. In comparison, our predictions are more accurate due to the utilization of DINOv2’s part-level semantic correspondences.

One particular feature of our model is that it can recognize novel affordances not shown during training. To demonstrate this, we replace the original affordance labels with semantically similar words and check whether the model can still reason about the corresponding affordance areas. As shown in Fig. 7, the model manages to make correct predictions for novel affordances, such as “hold and grab” for the base affordance “grasp”, “saw” for “cut”, and “accommodate” for “contain”.

Figure 7. Qualitative examples of novel affordance prediction on the UMD dataset. The 1st and 2nd rows display results on base objects, and the 3rd and 4th rows show results for novel objects.

5.5. Ablation study

The ablation study is performed on the more challenging AGD20K dataset due to its natural images with diverse backgrounds. Ablations on hyperparameters are left to the supplementary material.

Different Vision Encoders. To complement the qualitative analysis in Sec. 4.1, we conduct quantitative experiments on CLIP, DeiT III, and DINOv2. Specifically, we simply process the visual features with the embedder, and perform matrix multiplication with pre-computed affordance text embeddings to output segmentation maps. As shown in Tab. 3, CLIP and DeiT III exhibit comparable performance, whereas DINOv2 achieves much better results in both the seen and unseen settings, which is consistent with the analysis that DINOv2 is more suitable for affordance learning.

| Model | Seen KLD↓ | Seen SIM↑ | Seen NSS↑ | Unseen KLD↓ | Unseen SIM↑ | Unseen NSS↑ |
| CLIP | 1.294 | 0.384 | 1.107 | 1.556 | 0.327 | 0.966 |
| DeiT III | 1.301 | 0.378 | 1.140 | 1.535 | 0.321 | 1.049 |
| DINOv2 | 1.156 | 0.425 | 1.297 | 1.462 | 0.360 | 1.105 |

Table 3. Ablation results of different visual foundation models.

Proposed Methods. We use DINOv2 with a simple embedder as the baseline, and gradually integrate our methods to analyze the effect of each proposed design. The results in Tab. 4 reveal that each module consistently delivers notable improvements. In particular, we notice that the inclusion of a transformer decoder enhances performance in the seen setting, but yields inferior results in the unseen setting. With the integration of the CLS-guided mask, results in both settings improve, suggesting that restricting the cross-attention space is an effective strategy for unseen object affordance recognition.

| Method | Seen KLD↓ | Seen SIM↑ | Seen NSS↑ | Unseen KLD↓ | Unseen SIM↑ | Unseen NSS↑ |
| Baseline | 1.156 | 0.425 | 1.297 | 1.462 | 0.360 | 1.105 |
| + TPL | 1.060 | 0.455 | 1.422 | 1.338 | 0.390 | 1.302 |
| + MLFF | 0.846 | 0.537 | 1.622 | 1.115 | 0.447 | 1.440 |
| + TD | 0.749 | 0.578 | 1.738 | 1.131 | 0.443 | 1.408 |
| + CTM | 0.740 | 0.577 | 1.745 | 1.070 | 0.461 | 1.503 |

Table 4. Ablation results of proposed modules. TPL: text prompt learning. MLFF: multi-layer feature fusion. TD: transformer decoder. CTM: CLS-guided mask.

6. Conclusion

In this paper, we propose the problem of one-shot open affordance learning, which uses one example per base object category as training data and requires the ability to recognize novel objects and affordances. We first present a detailed analysis of different foundation models for the purpose of data-limited affordance learning. Motivated by the analysis, we build a vision-language learning framework with several proposed designs that better utilize the visual features and promote alignment with text embeddings. Experimental results demonstrate that we achieve comparable performance to several fully-supervised baselines with less than 1% of the full training data.
References

[1] Shir Amir, Yossi Gandelsman, Shai Bagon, and Tali Dekel. Deep vit features as dense visual descriptors. ECCVW What is Motion For, 2022. 5
[2] Paola Ardón, Èric Pairet, Katrin S Lohan, Subramanian Ramamoorthy, and Ronald Petrick. Affordances in robotic tasks–a survey. arXiv preprint arXiv:2004.07400, 2020. 2
[3] Paola Ardón, Eric Pairet, Ronald PA Petrick, Subramanian Ramamoorthy, and Katrin S Lohan. Learning grasp affordance reasoning through semantic relations. IEEE Robotics and Automation Letters, 4(4):4571–4578, 2019. 2
[4] Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13778–13790, 2023. 2
[5] Homanga Bharadhwaj, Abhinav Gupta, and Shubham Tulsiani. Visual affordance prediction for guiding robot exploration. arXiv preprint arXiv:2305.17783, 2023. 2
[6] Jessica Borja-Diaz, Oier Mees, Gabriel Kalweit, Lukas Hermann, Joschka Boedecker, and Wolfram Burgard. Affordance learning from play for sample-efficient policy learning. In 2022 International Conference on Robotics and Automation (ICRA), pages 6372–6378. IEEE, 2022. 2
[7] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020. 2
[8] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021. 3
[9] Joya Chen, Difei Gao, Kevin Qinghong Lin, and Mike Zheng Shou. Affordance grounding from demonstration video to target image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6799–6808, 2023. 2
[10] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801–818, 2018. 6
[11] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1290–1299, 2022. 6
[12] Seokju Cho, Heeseong Shin, Sunghwan Hong, Seungjun An, Seungjun Lee, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. Cat-seg: Cost aggregation for open-vocabulary semantic segmentation. arXiv preprint arXiv:2303.11797, 2023. 3
[13] Ching-Yao Chuang, Jiaman Li, Antonio Torralba, and Sanja Fidler. Learning to act properly: Predicting and explaining affordances from images. In CVPR, 2018. 1, 2
[14] MMSegmentation Contributors. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020. 6
[15] Francisco Cruz, Sven Magg, Cornelius Weber, and Stefan Wermter. Training agents with interactive reinforcement learning and contextual affordances. IEEE Transactions on Cognitive and Developmental Systems, 8(4):271–284, 2016. 1
[16] Leiyao Cui, Xiaoxue Chen, Hao Zhao, Guyue Zhou, and Yixin Zhu. Strap: Structured object affordance segmentation with point supervision. arXiv preprint arXiv:2304.08492, 2023. 1, 2
[17] Shengheng Deng, Xun Xu, Chaozheng Wu, Ke Chen, and Kui Jia. 3d affordancenet: A benchmark for visual object affordance understanding. In CVPR, 2021. 1
[18] Thanh Toan Do, Anh Nguyen, and Ian Reid. Affordancenet: An end-to-end deep learning approach for object affordance detection. ICRA, 2018. 1, 2
[19] Kuan Fang, Te Lin Wu, Daniel Yang, Silvio Savarese, and Joseph J. Lim. Demo2Vec: Reasoning Object Affordances from Online Videos. CVPR, 2018. 1, 2
[20] Yiran Geng, Boshi An, Haoran Geng, Yuanpei Chen, Yaodong Yang, and Hao Dong. End-to-end affordance learning for robotic manipulation. arXiv preprint arXiv:2209.12941, 2022. 1
[21] James J. Gibson. The Ecological Approach to Visual Perception: Classic Edition. Houghton Mifflin, 1979. 1
[22] Denis Hadjivelichkov, Sicelukwanda Zwane, Marc Deisenroth, Lourdes Agapito, and Dimitrios Kanoulas. One-Shot Transfer of Affordance Regions? AffCorrs! CoRL, 2022. 2
[23] Mohammed Hassanin, Salman Khan, and Murat Tahtali. Visual affordance and function understanding: A survey. ACM Computing Surveys (CSUR), 54(3):1–35, 2021. 1
[24] Zhi Hou, Baosheng Yu, Yu Qiao, Xiaojiang Peng, and Dacheng Tao. Affordance transfer learning for human-object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 495–504, 2021. 1
[25] Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023. 2
[26] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023. 3
[27] Mia Kokic, Johannes A Stork, Joshua A Haustein, and Danica Kragic. Affordance detection for task-specific grasping using deep learning. In 2017 IEEE-RAS 17th International Conference on Humanoid Robotics (Humanoids), pages 91–98. IEEE, 2017. 1, 2
[28] Hema Swetha Koppula, Rudhir Gupta, and Ashutosh Saxena. Learning human activities and object affordances from rgb-d videos. The International Journal of Robotics Research, 32(8):951–970, 2013. 2
[29] Gen Li, Varun Jampani, Deqing Sun, and Laura Sevilla-Lara. Locate: Localize and transfer object parts for weakly supervised affordance grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10922–10931, 2023. 1, 2, 6
[30] Yi Li, Hualiang Wang, Yiqun Duan, and Xiaomeng Li. Clip surgery for better explainability with enhancement in open-vocabulary tasks. arXiv preprint arXiv:2304.05653, 2023. 3
[31] Shaowei Liu, Subarna Tripathi, Somdeb Majumdar, and Xiaolong Wang. Joint hand motion and interaction hotspots prediction from egocentric videos. In CVPR, 2022. 1, 2
[32] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023. 3
[33] Timo Luddecke and Florentin Worgotter. Learning to segment affordances. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 769–776, 2017. 1
[34] Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, and Dacheng Tao. One-shot affordance detection. In IJCAI, 2021. 3
[35] Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, and Dacheng Tao. Grounded affordance from exocentric view. arXiv preprint arXiv:2208.13196, 2022. 6
[36] Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, and Dacheng Tao. Learning affordance grounding from exocentric images. CVPR, 2022. 1, 2, 5, 6, 12
[37] Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard. Grounding language with visual affordances over unstructured data. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11576–11582. IEEE, 2023. 2
[38] Reihaneh Mirjalili, Michael Krawez, Simone Silenzi, Yannik Blei, and Wolfram Burgard. Lan-grasp: Using large language models for semantic object grasping. arXiv preprint arXiv:2310.05239, 2023. 2
[39] Luis Montesano, Manuel Lopes, Alexandre Bernardino, and Jose Santos-Victor. Affordances, development and imitation. In 2007 IEEE 6th International Conference on Development and Learning, pages 270–275. IEEE, 2007. 1
[40] Lorenzo Mur-Labadia, Ruben Martinez-Cantin, and Jose J Guerrero. Bayesian deep learning for affordance segmentation in images. arXiv preprint arXiv:2303.00871, 2023. 1
[41] Austin Myers, Angjoo Kanazawa, Cornelia Fermuller, and Yiannis Aloimonos. Affordance of Object Parts from Geometric Features. Int. Conf. Robot. Autom., pages 5–6, 2015. 2
[42] Austin Myers, Ching L Teo, Cornelia Fermüller, and Yiannis Aloimonos. Affordance detection of tool parts from geometric features. ICRA, 2015. 12
[43] Tushar Nagarajan, Christoph Feichtenhofer, and Kristen Grauman. Grounded human-object interaction hotspots from video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8688–8697, 2019. 1, 2, 6
[44] Tushar Nagarajan and Kristen Grauman. Learning affordance landscapes for interaction exploration in 3d environments. Advances in Neural Information Processing Systems, 33:2005–2015, 2020. 1
[45] Anh Nguyen, Dimitrios Kanoulas, Darwin G Caldwell, and Nikos G Tsagarakis. Detecting object affordances with convolutional neural networks. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2765–2770. IEEE, 2016. 1
[46] Anh Nguyen, Dimitrios Kanoulas, Darwin G Caldwell, and Nikos G Tsagarakis. Object-based affordances detection with convolutional neural networks and dense conditional random fields. In IROS, 2017. 1, 5
[47] Toan Ngyen, Minh Nhat Vu, An Vuong, Dzung Nguyen, Thieu Vo, Ngan Le, and Anh Nguyen. Open-vocabulary affordance detection in 3d point clouds. arXiv preprint arXiv:2303.02401, 2023. 2
[48] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 2
[49] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 3
[50] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 2, 3
[51] Adam Rashid, Satvik Sharma, Chung Min Kim, Justin Kerr, Lawrence Chen, Angjoo Kanazawa, and Ken Goldberg. Language embedded radiance fields for zero-shot task-oriented grasping. arXiv preprint arXiv:2309.07970, 2023. 2
[52] Anirban Roy and Sinisa Todorovic. A multi-scale cnn for affordance segmentation in rgb images. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 186–201. Springer, 2016. 1
[53] Johann Sawatzky and Jurgen Gall. Adaptive binarization for weakly supervised affordance segmentation. In ICCVW, 2017. 1, 2
[54] Johann Sawatzky, Abhilash Srikantha, and Juergen Gall. Weakly supervised affordance detection. CVPR, 2017. 1, 2
[55] Amirreza Shaban, Shray Bansal, Zhen Liu, Irfan Essa, and Byron Boots. One-shot learning for semantic segmentation. arXiv preprint arXiv:1709.03410, 2017. 3
[56] Yaoxian Song, Penglei Sun, Yi Ren, Yu Zheng, and Yue Zhang. Learning 6-dof fine-grained grasp detection based on part affordance grounding. arXiv preprint arXiv:2301.11564, 2023. 2
[57] Peize Sun, Shoufa Chen, Chenchen Zhu, Fanyi Xiao, Ping Luo, Saining Xie, and Zhicheng Yan. Going denser with open-vocabulary part segmentation. arXiv preprint arXiv:2305.11173, 2023. 3
[58] Chao Tang, Jingwen Yu, Weinan Chen, and Hong Zhang.
Relationship oriented affordance learning through manipula-
tion graph construction. arXiv preprint arXiv:2110.14137,
2021. 2
[59] Hugo Touvron, Matthieu Cord, and Hervé Jégou. Deit iii:
Revenge of the vit. In European Conference on Computer
Vision, pages 516–533. Springer, 2022. 4
[60] Meng Wei, Xiaoyu Yue, Wenwei Zhang, Shu Kong,
Xihui Liu, and Jiangmiao Pang. Ov-parts: To-
wards open-vocabulary part segmentation. arXiv preprint
arXiv:2310.05107, 2023. 3
[61] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar,
Jose M Alvarez, and Ping Luo. Segformer: Simple and
efficient design for semantic segmentation with transform-
ers. Advances in Neural Information Processing Systems,
34:12077–12090, 2021. 6
[62] Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xi-
ang Bai. Side adapter network for open-vocabulary semantic
segmentation. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 2945–
2954, 2023. 6
[63] Xintong Yang, Ze Ji, Jing Wu, and Yu-Kun Lai. Recent ad-
vances of deep robotic affordance learning: a reinforcement
learning perspective. IEEE Transactions on Cognitive and
Developmental Systems, 2023. 2
[64] Yuhang Yang, Wei Zhai, Hongchen Luo, Yang Cao, Jiebo
Luo, and Zheng-Jun Zha. Grounding 3d object affor-
dance from 2d interactions in images. arXiv preprint
arXiv:2303.10437, 2023. 2
[65] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang
Wang, and Jiaya Jia. Pyramid scene parsing network. In
Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 2881–2890, 2017. 6
[66] Chong Zhou, Chen Change Loy, and Bo Dai. Extract free
dense labels from clip. In European Conference on Com-
puter Vision, pages 696–712. Springer, 2022. 3, 6
[67] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei
Liu. Learning to prompt for vision-language models. In-
ternational Journal of Computer Vision, 130(9):2337–2348,
2022. 5
[68] Ziqin Zhou, Yinjie Lei, Bowen Zhang, Lingqiao Liu, and
Yifan Liu. Zegclip: Towards adapting clip for zero-shot se-
mantic segmentation. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, pages
11175–11185, 2023. 6

A. Dataset Details

To evaluate the model’s generalization ability in the challenging One-shot Open Affordance Learning (OOAL) setting, datasets with a large number of object categories are required. In addition, at least two object categories are needed for each affordance so that the model can be trained on one object and tested on the other. After an investigation of existing affordance datasets, we find only two datasets, AGD20K [36] and UMD [42], that fulfill these prerequisites and can be used to evaluate the affordance segmentation task. The specific affordance and object categories of these two datasets are shown in Tab. 5. For the unseen split, we display the object category division in Tab. 6. The model is trained on base object classes and evaluated on novel object categories.

| Dataset | Affordance | Object |
| UMD | (7) grasp, cut, scoop, contain, pound, support, wrap-grasp | (17) bowl, cup, hammer, knife, ladle, mallet, mug, pot, saw, scissors, scoop, shears, shovel, spoon, tenderizer, trowel, turner |
| AGD20K | (37) beat, boxing, brush with, carry, catch, cut, cut with, drag, drink with, eat, hit, hold, jump, kick, lie on, lift, look out, open, pack, peel, pick up, pour, push, ride, sip, sit on, stick, stir, swing, take photo, talk on, text on, throw, type on, wash, write | (50) apple, axe, badminton racket, banana, baseball, baseball bat, basketball, bed, bench, bicycle, binoculars, book, bottle, bowl, broccoli, camera, carrot, cell phone, chair, couch, cup, discus, drum, fork, frisbee, golf clubs, hammer, hot dog, javelin, keyboard, knife, laptop, microwave, motorcycle, orange, oven, pen, punching bag, refrigerator, rugby ball, scissors, skateboard, skis, snowboard, soccer ball, suitcase, surfboard, tennis racket, toothbrush, wine glass |

Table 5. Affordance and object classes in the UMD and AGD20K dataset. The number of classes is shown in parentheses.

| Dataset | Base Objects (Train) | Novel Objects (Test) |
| UMD | (9) cup, ladle, pot, saw, scoop, shears, shovel, tenderizer, trowel | (8) bowl, hammer, knife, mallet, mug, scissors, spoon, turner |
| AGD20K | (33) apple, badminton racket, baseball, baseball bat, bench, book, bottle, bowl, carrot, cell phone, chair, couch, discus, fork, frisbee, hammer, hot dog, javelin, keyboard, microwave, motorcycle, orange, oven, punching bag, rugby ball, scissors, skateboard, snowboard, suitcase, surfboard, tennis racket, toothbrush, wine glass | (14) axe, banana, basketball, bed, bicycle, broccoli, camera, cup, golf clubs, knife, laptop, refrigerator, skis, soccer ball |

Table 6. Object category division in the unseen split of the UMD and AGD20K dataset. The number of categories is shown in parentheses.

Moreover, it is worth noting that the annotations in AGD20K and UMD are of different types. UMD uses pixel-level dense binary maps, while the ground truth of AGD20K consists of sparse keypoints within the affordance areas, and a Gaussian distribution is then applied to each point to generate dense annotation. The difference between dense and sparse affordance annotation is highlighted in Fig. 8.

Figure 8. Different affordance annotation schemes. Dense affordance annotation is labeled as binary masks. Sparse affordance annotation is first labeled as keypoints, and then a Gaussian kernel is applied over each point to produce pixel-wise ground truth.

B. Ablation Study on Hyperparameters

The proposed framework involves three primary hyperparameters, i.e., the number of learnable text tokens p, vision encoder fusion layers j, and decoder transformer layers t. We conduct ablation studies individually to explore the impact of these hyperparameters, as detailed in Tab. 7, Tab. 8, and Tab. 9. Notably, increasing the number of learnable text tokens up to 8 shows a gradual improvement in performance in the seen setting, but leads to fluctuating results in the unseen setting, indicating that generalization is sensitive to this choice when confronted with unseen objects. In terms of the fusion layers, fusing the last two layers demonstrates an obvious performance gain compared to the single-layer counterpart, and integrating the last three layers yields the best results. Lastly, we note that the transformer decoder can effectively improve performance in both the seen and unseen settings, and a two-layer transformer decoder produces the best results.

| p | Seen KLD↓ | Seen SIM↑ | Seen NSS↑ | Unseen KLD↓ | Unseen SIM↑ | Unseen NSS↑ |
| 2 | 0.774 | 0.568 | 1.710 | 1.119 | 0.457 | 1.434 |
| 4 | 0.765 | 0.573 | 1.714 | 1.102 | 0.469 | 1.449 |
| 6 | 0.760 | 0.572 | 1.726 | 1.162 | 0.440 | 1.383 |
| 8 | 0.740 | 0.577 | 1.745 | 1.070 | 0.461 | 1.503 |
| 10 | 0.768 | 0.581 | 1.726 | 1.111 | 0.460 | 1.463 |

Table 7. Ablation study on the number of learnable tokens p in text prompt learning.

| j | Seen KLD↓ | Seen SIM↑ | Seen NSS↑ | Unseen KLD↓ | Unseen SIM↑ | Unseen NSS↑ |
| 1 | 1.060 | 0.455 | 1.422 | 1.338 | 0.390 | 1.302 |
| 2 | 0.748 | 0.576 | 1.756 | 1.105 | 0.456 | 1.452 |
| 3 | 0.740 | 0.577 | 1.745 | 1.070 | 0.461 | 1.503 |
| 4 | 0.762 | 0.579 | 1.713 | 1.129 | 0.453 | 1.401 |

Table 8. Ablation study on the number of fusion layers j in multi-layer feature fusion.

| t | Seen KLD↓ | Seen SIM↑ | Seen NSS↑ | Unseen KLD↓ | Unseen SIM↑ | Unseen NSS↑ |
| 0 | 0.846 | 0.537 | 1.622 | 1.115 | 0.447 | 1.440 |
| 1 | 0.753 | 0.574 | 1.737 | 1.094 | 0.449 | 1.492 |
| 2 | 0.740 | 0.577 | 1.745 | 1.067 | 0.465 | 1.492 |
| 3 | 0.746 | 0.575 | 1.738 | 1.110 | 0.458 | 1.456 |

Table 9. Ablation study on the number of transformer decoder layers t.

C. Additional Visualizations

C.1. Visualization of CLS-guided mask

In Fig. 9, we display the visualization of the CLS-guided mask from the proposed CLS-guided transformer decoder. It can be seen that the mask primarily concentrates on foreground objects, thus facilitating the cross-attention within salient regions.

Figure 9. Visualization of the CLS-guided mask.

C.2. Visualization of Unseen Affordances

In Fig. 11, we further display examples on the AGD20K dataset to showcase that our model has the ability to recognize unseen affordances. It is evident that the model can consistently activate relevant affordance areas when receiving text that was previously unseen during training.

Figure 11. Qualitative examples of unseen affordance prediction on the AGD20K dataset. The 2nd column shows the results on seen affordances, and the 3rd and 4th columns show results with unseen affordances.

C.3. Additional Qualitative Results

In Fig. 10, we present more qualitative results on the AGD20K dataset. The comparison demonstrates that predictions from our method exhibit clear separation among object parts, while predictions from other approaches often bias towards one part or the whole object. In particular, our method can locate very fine-grained affordance areas even for unseen objects, such as the saddle of a bicycle for “sit on”, and the handle of a golf club for “hold”.

Figure 10. Additional qualitative comparison on the AGD20K dataset.

D. Discussion and Limitations

This study introduces the novel problem of OOAL, and presents a framework built upon foundation models that can perform effective affordance learning with limited samples and annotations. We note that this framework can potentially be used in various applications, such as robotic manipulation and virtual reality. For instance, in robotic manipulation, the model can make reasonable affordance predictions for diverse base and novel objects, requiring minimal annotation effort. This stands in contrast to traditional methods that necessitate extensive training data or numerous simulated interaction trials to gain affordance knowledge.
Despite achieving good performance with few training samples, our framework reveals two limitations. First, while text prompt learning enhances performance on unseen objects, it diminishes the framework’s generalization capacity to unseen affordances. This occurs because an excess of learnable tokens can weaken the intrinsic word similarities within the CLIP text encoder. A viable solution to this limitation involves combining the learnable prompts with manually designed prompts. Second, the performance is notably influenced by the selection of the one-shot example. Instances with heavy occlusion or inferior lighting conditions can impact the learning performance. Given the inherent challenges in learning from merely a one-shot example, this limitation appears reasonable and logical.
