One-Shot Open Affordance Learning with Foundation Models
(OOAL), where the model is trained with very little data and is expected to recognize novel objects and affordances during inference. An illustration of the OOAL pipeline is shown in Fig. 1. Compared with typical affordance learning, which requires numerous training samples and can only reason within a closed affordance vocabulary, OOAL alleviates the need for large-scale datasets and broadens the scope of inference.

To this end, we note that foundation Vision-Language Models (VLMs), which have recently emerged as powerful tools for a wide array of computer vision tasks, can be a potential solution. The open-vocabulary nature of VLMs like CLIP [50], trained on a large corpus of image-text data, enables reasoning about previously unseen objects, scenes, and concepts. However, we observe that these models often fail to understand nuanced vocabularies such as affordances or object parts. One hypothesis is that object parts and affordances appear much less frequently in image captions than objects do. Therefore, the following question naturally arises: Can we teach foundation models to comprehend more subtle, fine-grained aspects of objects, such as affordances, with very few examples? In this way, the generalization capability of foundation models can be inherited with minimal annotation effort.

To achieve this, we first conduct a thorough analysis of several representative foundation models. The objective is to probe their inherent understanding of affordances and to determine which visual representation is suitable for data-limited affordance learning. Based on the analysis, we then build a learning architecture and propose several methods, including text prompt learning, multi-layer feature fusion, and a CLS-token-guided transformer decoder, that facilitate the alignment between visual representations and affordance text embeddings. Lastly, we select a dense prediction task, affordance segmentation, for evaluation and comparison with a variety of state-of-the-art models, and find that our method achieves higher performance with less than 1% of the complete training data.

Overall, our contributions can be summarized as follows: (1) We introduce the problem of OOAL, aiming to develop a robust affordance model that can generalize to novel object and affordance categories without the need for massive training data. (2) We conduct a comprehensive analysis of existing foundation models to explore their potential for OOAL. Following the analysis, we build a learning architecture with vision-language foundation models and design several methods to improve the alignment between visual features and affordance text labels. (3) We conduct extensive experiments on two affordance segmentation datasets to demonstrate the effectiveness of our learning pipeline, and observe significant gains over baselines together with strong generalization capability.

2. Related Work

Affordance Learning. The term "affordance" was popularized by the psychologist James Gibson, who describes it as the properties of an object or the environment that suggest possible actions or interactions. Building on this, researchers have developed many approaches to acquire affordance information in various ways. In computer vision, initial research [13, 18, 28, 41] focused on affordance detection using convolutional neural networks. As manual affordance annotations are often costly to acquire, much subsequent research has shifted its focus to weak supervision such as keypoints [16, 53, 54] or image-level labels [36, 43]. Recent work has explored a novel perspective on how to ground affordances from human-object interaction images [29, 36, 64] or human action videos [9, 19, 31, 43]. In robotics, affordance learning enables robots to interact effectively and intelligently with complex and dynamic environments [2, 63]. Specifically, some work [3, 27, 58] utilizes affordances to build relationships between objects, tasks, and manipulations for robotic grasping. Other studies focus on learning affordances from other available resources that can be deployed on real robots, such as human teleoperated play data [6], image pairs [5], and egocentric video datasets [4].

In contrast to the works above, which often require a large amount of training data, we propose the problem of OOAL, which aims to perform affordance learning with one sample per base object category and allows zero-shot inference to handle novel objects and affordances.

Foundation Models for Affordance Learning. With the rapid development of foundation models such as Large Language Models (LLMs) and vision-language models, many research efforts have explored their utilization in affordance learning or reasoning. Mees et al. [37] leverage GPT-3 [7] to break down language instructions into subgoals, and learn a visual affordance model to complete real-world long-horizon tasks. Li et al. [29] adopt DINO-ViT features to perform affordance grounding by transferring affordance knowledge from human-object interaction images to egocentric views. Huang et al. [25] propose a novel pipeline that uses LLMs [48] for affordance reasoning, which interacts with VLMs to produce 3D affordance maps for robotic manipulation. Recent studies [38, 51, 56] delve into the integration of affordance and language models for task-oriented grasping, which allows robots to grasp objects in a more appropriate and safe manner.

The closest methods to ours are AffCorrs [22] and OpenAD [47]. AffCorrs utilizes the visual foundation model DINO to find corresponding affordances in a one-shot manner, but relevant objects are explicitly selected as support images, which significantly reduces the difficulty. OpenAD takes advantage of CLIP for open-vocabulary affordance detection in point clouds, but it requires a large number of manual annotations, whereas our work performs affordance learning with merely one example per base object category.
3. Problem Setting
One-shot Open Affordance Learning (OOAL) aims to learn a model that predicts affordances from one example per base object class and generalizes to novel object classes. In this work, we focus on the dense prediction task of affordance segmentation. Specifically, objects are first divided into N_b base classes and N_o novel classes without intersection. The model receives only N_b samples during training, one for each base object category, where each sample is a pair of an image I ∈ R^{H×W×3} and a pixel-wise affordance annotation M ∈ R^{H×W×N} (N is the number of affordance categories
in the dataset). After training, evaluation is performed on the combination of base and novel object categories to measure the generalization ability of the model. Also, affordance labels can be replaced with novel vocabularies that share similar semantics, such as "chop", "slice", and "trim" to represent an affordance akin to "cut".

Figure 2. Analysis of vision-language foundation models on text-based affordance grounding. The 1st and 3rd rows use affordance texts as input queries, and the 2nd and 4th rows use corresponding object parts as input text queries. Visualizations show that these models have limited ability to recognize fine-grained affordances and object parts.
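To make the setting concrete, the following Python sketch lays out the OOAL data protocol described above; the listed class names, array shapes, and sample counts are illustrative placeholders rather than the actual AGD20K or UMD splits.

```python
# Illustrative sketch of the OOAL data protocol; class names and shapes are hypothetical,
# not the actual AGD20K/UMD splits.
from dataclasses import dataclass
import numpy as np

@dataclass
class OOALSample:
    image: np.ndarray      # (H, W, 3) RGB image
    mask: np.ndarray       # (H, W, N) pixel-wise affordance annotation, N affordance classes
    object_class: str

base_classes  = ["knife", "cup", "bicycle"]          # N_b base objects: one annotated sample each
novel_classes = ["scissors", "bowl", "motorcycle"]   # N_o novel objects: never seen in training

train_set = [OOALSample(np.zeros((224, 224, 3)), np.zeros((224, 224, 36)), c) for c in base_classes]
# Evaluation covers base + novel objects, and affordance labels may be swapped for unseen
# synonyms (e.g. "chop"/"slice"/"trim" in place of "cut") at inference time.
eval_classes = base_classes + novel_classes
print(len(train_set), len(eval_classes))
```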
It is worth noting that OOAL is different from one-shot semantic segmentation (OSSS) [55] and one-shot affordance detection (OS-AD) [34]. Both OSSS and OS-AD receive a one-shot sample during training. However, the sample keeps changing in each iteration, so the model can be exposed to many different images. Additionally, a support image is required at inference to provide prior information. In comparison, OOAL performs one-shot training and zero-shot inference, which poses additional challenges. The model needs to generalize to previously unseen objects, necessitating the ability to understand and recognize semantic relationships between seen and unseen classes with very limited data.

4. Method

4.1. Analysis of Foundation Models

The field of computer vision has recently witnessed a surge in the prevalence of large foundation models, such as CLIP [50], Segment Anything [26], and DINO [8, 49]. These models exhibit strong zero-shot generalization capabilities for several computer vision tasks, making them seem like a great option for tackling the problem of OOAL. To this end, we analyze several existing foundation models, organized around three questions: ❶ Do current vision-language foundation models and their variants have the ability to detect affordances via affordance/part-based prompting? ❷ Can the features of visual foundation models discriminate affordance regions in images? ❸ Can these models generalize affordance recognition to novel objects and perform well in the low-shot setting?

Driven by question ❶, we select four representative models, i.e., the vanilla CLIP, a CLIP-based explainability method CLIP Surgery [30], a state-of-the-art open-vocabulary segmentation method CAT-Seg [12], and an open-vocabulary detection method GroundingDINO [32]. For vanilla CLIP, we employ the method proposed in MaskCLIP [66], which directly extracts dense predictions without fine-tuning. We use the text prompt template "somewhere to [affordance]" to query the visual features and find corresponding areas. As illustrated in Fig. 2, we note that most models cannot understand affordances well, except the detection model GroundingDINO, whose predictions nevertheless focus mainly on the whole object rather than its parts. As for the dense prediction models, CAT-Seg often recognizes affordance regions as background, and CLIP gives high activation on both foreground and background. In comparison, CLIP Surgery fails to localize the "holding" area of a knife, but manages to associate the phrase "sit on" with a chair. Furthermore, even when the affordance text is replaced with corresponding object parts, predictions from CLIP and GroundingDINO remain biased toward objects, while CLIP Surgery and CAT-Seg tend to activate the wrong parts. This is consistent with recent findings [57, 60] that CLIP has limited part recognition ability.

To answer questions ❷ and ❸, we consider two essential characteristics of a good affordance model in the low-shot setting: (1) Part-aware representation. The visual representation should exhibit awareness of object parts, given that affordances often denote small and fine-grained regions, e.g., a bicycle saddle to sit on or a knife handle to hold. (2) Part-level semantic correspondence. This property is critical for generalization, since the model requires an understanding of semantic relations to make reasonable predictions on unseen objects.
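For reference, a minimal sketch of the prompt-based probing used in this analysis is given below: dense patch features are scored against the text embedding of a "somewhere to [affordance]" prompt. It assumes the dense visual features and the prompt embedding are already computed (e.g., MaskCLIP-style dense CLIP features); the function name and shapes are illustrative.

```python
# Minimal sketch of text-query affordance scoring; feature tensors are assumed precomputed.
import torch
import torch.nn.functional as F

def affordance_heatmap(patch_feats: torch.Tensor, text_emb: torch.Tensor, hw=(16, 16)):
    """patch_feats: (L, C) dense visual features; text_emb: (C,) embedding of the prompt."""
    sim = F.normalize(patch_feats, dim=-1) @ F.normalize(text_emb, dim=-1)   # (L,) cosine similarity
    return sim.reshape(hw)                                                   # coarse spatial heatmap

heat = affordance_heatmap(torch.randn(256, 512), torch.randn(512))
print(heat.shape)  # torch.Size([16, 16])
```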
Figure 3. Proposed learning framework for OOAL. Our designs are highlighted in three color blocks, which are text prompt learning,
multi-layer feature fusion, and CLS-guided transformer decoder. [CLS] denotes the CLS token of the vision encoder.
In this section, we first describe the overview of our proposed learning framework, which builds on powerful foundation models. Then, we elaborate on the three proposed designs that help with the challenging OOAL problem. Finally, we discuss the framework's capability to identify novel objects and affordances at inference.

Overview. The proposed learning framework is presented in Fig. 3 and consists of a vision encoder, a text encoder, and a transformer decoder. First, the pretrained vision encoder DINOv2 is used to extract dense patch embeddings F̂_v ∈ R^{L×C_v}, where L is the number of tokens or patches. Then, affordance labels are processed by the CLIP text encoder to obtain text embeddings F_t ∈ R^{N×C}. To cope with the inconsistent dimensions of the visual and text embeddings, an embedder e_v : R^{C_v} → R^{C} with a single MLP layer is employed. In the end, the lightweight transformer decoder takes both visual and text embeddings as input and outputs the affordance prediction.

Text Prompt Learning. Manually designing prompts for affordances can be complicated, especially considering that CLIP has difficulty recognizing affordances (see Fig. 2). Thus, we adopt the Context Optimization (CoOp) [67] method to introduce automatic text prompt learning. Instead of finetuning the CLIP text encoder, including learnable prompts is an effective strategy that alleviates overfitting and retains the inherent text recognition ability of CLIP. Specifically, p randomly initialized learnable context vectors {v_1, v_2, ..., v_p} are inserted in front of the text CLS token, and they are shared across all affordance classes.

Multi-Layer Feature Fusion. Different layers of DINOv2 features often exhibit different levels of granularity [1]. Since an affordance may correspond to multiple parts of an object, a diverse set of granularities can be beneficial. For this purpose, we aggregate the features of the last j layers. Each layer of features is first processed by a linear projection, and then all features are linearly combined with a weighted summation:

F̂_v = Σ_{i=1}^{j} α_i · ϕ(F_{n−i+1}),   α_1 + α_2 + ... + α_j = 1,   (1)

where F_n denotes the last layer, α is a learnable parameter that controls the fusion ratio of each layer, and ϕ indicates the linear transformation. This straightforward fusion scheme enables adaptive selection among different granularity levels, allowing the model to handle affordance recognition across diverse scenarios.

CLS-Guided Transformer Decoder. To deal with the lack of alignment between visual and text features, we propose a lightweight transformer decoder that applies a masked cross-attention mechanism to promote mutual communication between the two branches. Since the [CLS] token of a foundation model is used in the computation of the objective function, it often carries rich prior information about the whole image, such as salient objects or regions. Consequently, we utilize the [CLS] token to produce a guidance mask that constrains the cross-attention to a foreground region.

The decoder receives three inputs, i.e., the text embeddings F_t, the visual features F_v, and the [CLS] token L_cls. Firstly, linear transformations are performed to yield query, key, and value:

Q = ϕ_q(F_t),   K = ϕ_k(F_v),   V = ϕ_v(F_v).   (2)

Here we use the text embeddings as query, and the visual features as key and value, allowing the model to focus on updating the text embeddings by retrieving the relevant visual information that corresponds to the affordance text. Next, the CLS-guided mask is calculated between the [CLS] token and the key via matrix multiplication:

M_cls = sigmoid(ϕ_c(L_cls) K^T / √d_k),   (3)

where d_k is a scaling factor that equals the dimension of the keys. The masked cross-attention is then computed as:

F̂_t = softmax(QK^T / √d_k) · M_cls V + F_t.   (4)

After that, the updated text embeddings F′_t are obtained by sending F̂_t through a feed-forward network (FFN) with a residual connection. The decoder comprises t transformer layers, and the final prediction is generated by taking the matrix product between the output of the last transformer layer and the original visual features F_v, thereby ensuring maximum retention of the part-aware representations from DINOv2. Lastly, binary cross entropy is employed as the loss function to optimize the parameters of the linear layers, the embedder, and the decoder.

Inference on novel objects and affordances. During training, the decoder learns to establish an alignment between visual features and affordance text embeddings. When encountering a novel object at inference, the aligned affordance text embeddings can locate the corresponding object regions, leveraging the part-level semantic correspondence inherent in DINOv2. Similarly, as the model processes novel affordance text inputs, the generated text embeddings can also retrieve the aligned visual features, based on their semantic similarities to the base affordances seen during training.
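To make the three designs above concrete, the following PyTorch sketch implements the multi-layer fusion of Eq. (1) and the CLS-guided masked cross-attention of Eqs. (2)–(4) under stated assumptions: the CLIP text embeddings (produced with CoOp-style learnable context vectors) and the per-layer DINOv2 patch features with their [CLS] token are taken as precomputed inputs, and all module and variable names are illustrative rather than taken from the authors' code.

```python
# Minimal sketch of the proposed designs, assuming precomputed text embeddings and
# per-layer DINOv2 patch features; a reading of Eqs. (1)-(4), not the official implementation.
import torch
import torch.nn as nn


class MultiLayerFusion(nn.Module):
    """Eq. (1): softmax-normalised weighted sum of linearly projected last-j layer features."""
    def __init__(self, dim_v: int, num_layers: int = 3):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(dim_v, dim_v) for _ in range(num_layers))
        self.alpha = nn.Parameter(torch.zeros(num_layers))  # softmax keeps weights summing to 1

    def forward(self, feats):  # feats: list of (B, L, C_v) tensors from the last j encoder layers
        w = self.alpha.softmax(dim=0)
        return sum(w[i] * self.proj[i](f) for i, f in enumerate(feats))


class CLSGuidedDecoderLayer(nn.Module):
    """Eqs. (2)-(4): text queries attend to visual keys/values, masked by a [CLS]-derived map."""
    def __init__(self, dim: int):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.c = nn.Linear(dim, dim)          # phi_c for the [CLS] token
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.scale = dim ** -0.5

    def forward(self, f_t, f_v, cls_tok):     # f_t: (B, N, C), f_v: (B, L, C), cls_tok: (B, C)
        q, k, v = self.q(f_t), self.k(f_v), self.v(f_v)
        m_cls = torch.sigmoid(self.c(cls_tok).unsqueeze(1) @ k.transpose(1, 2) * self.scale)  # (B, 1, L)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)                      # (B, N, L)
        f_t = (attn * m_cls) @ v + f_t         # masked cross-attention with residual, Eq. (4)
        return self.ffn(f_t) + f_t             # FFN with residual


class OOALHead(nn.Module):
    def __init__(self, dim_v: int = 768, dim_t: int = 512, num_layers: int = 3, depth: int = 2):
        super().__init__()
        self.fusion = MultiLayerFusion(dim_v, num_layers)
        self.embedder = nn.Linear(dim_v, dim_t)                     # single-MLP embedder e_v
        self.decoder = nn.ModuleList(CLSGuidedDecoderLayer(dim_t) for _ in range(depth))

    def forward(self, layer_feats, cls_tok, text_emb):
        f_v = self.embedder(self.fusion(layer_feats))               # (B, L, C)
        f_t = text_emb.expand(f_v.size(0), -1, -1)                  # (B, N, C), frozen CLIP text side
        cls_tok = self.embedder(cls_tok)
        for layer in self.decoder:
            f_t = layer(f_t, f_v, cls_tok)
        return f_v @ f_t.transpose(1, 2)                            # (B, L, N) affordance logits


# Toy shapes only; the paper trains with binary cross-entropy against pixel-wise masks.
head = OOALHead()
feats = [torch.randn(2, 256, 768) for _ in range(3)]                # last 3 DINOv2 layers
logits = head(feats, torch.randn(2, 768), torch.randn(36, 512))
print(logits.shape)  # torch.Size([2, 256, 36])
```

The softmax over α keeps the fusion weights summing to one, and the [CLS]-derived mask simply rescales the attention toward foreground patches before the values are aggregated.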
5. Experiments

5.1. Datasets

We choose two typical datasets, AGD20K [36] and UMD part affordance [46], both of which include a large number of object categories that help in the evaluation of novel objects.

| Task (training data, seen / unseen split) | Method | Seen KLD↓ | SIM↑ | NSS↑ | Unseen KLD↓ | SIM↑ | NSS↑ |
| WSAG (23,083 / 15,543 images, image-level labels) | Hotspots [43] | 1.773 | 0.278 | 0.615 | 1.994 | 0.237 | 0.577 |
| | Cross-view-AG [36] | 1.538 | 0.334 | 0.927 | 1.787 | 0.285 | 0.829 |
| | Cross-view-AG+ [35] | 1.489 | 0.342 | 0.981 | 1.765 | 0.279 | 0.882 |
| | LOCATE [29] | 1.226 | 0.401 | 1.177 | 1.405 | 0.372 | 1.157 |
| OOAL (50 / 33 images, keypoint labels) | MaskCLIP [66] | 5.752 | 0.169 | 0.041 | 6.052 | 0.152 | 0.047 |
| | SAN [62] | 1.435 | 0.357 | 0.941 | 1.580 | 0.351 | 1.022 |
| | ZegCLIP [68] | 1.413 | 0.387 | 1.001 | 1.552 | 0.361 | 1.042 |
| | Ours | 0.740 | 0.577 | 1.745 | 1.070 | 0.461 | 1.503 |

Table 1. Comparison with the state of the art on the AGD20K dataset. The OOAL setting uses 0.22% / 0.21% of the full training data. WSAG denotes weakly-supervised affordance grounding. The best and second-best results are highlighted in bold and underlined, respectively.
| Setting | Method | Seen | Unseen | hIoU |
| Fully Supervised | DeepLabV3+ [10] | 70.5 | 57.5 | 63.3 |
| | SegFormer [61] | 74.6 | 57.7 | 65.0 |
| | PSPNet [65] | 72.0 | 60.8 | 66.0 |
| OOAL | PSPNet [65] | 56.7 | 46.6 | 51.1 |
| | DeepLabV3+ [10] | 56.8 | 48.4 | 52.3 |
| | SegFormer [11] | 64.6 | 51.4 | 57.3 |
| | MaskCLIP [66] | 4.25 | 4.24 | 4.25 |
| | SAN [62] | 45.1 | 32.2 | 37.5 |
| | ZegCLIP [68] | 47.4 | 36.0 | 40.9 |
| | Ours | 74.6 | 59.7 | 66.4 |

Table 2. Comparison on the UMD dataset (mIoU). Fully-supervised methods are trained with 14,823 and 20,874 images with pixel-level labels for the seen and unseen splits, respectively. In contrast, the OOAL setting uses 54 and 76 images, 0.36% of the full training data.

AGD20K is a large-scale affordance grounding dataset with 36 affordances and 50 objects, containing 23,816 images from exocentric and egocentric views. It aims to learn affordance from human-object interaction images and perform affordance localization on egocentric images. As it is a dataset for weakly-supervised learning, images in the training set only have image-level labels. We therefore manually annotate 50 randomly selected egocentric images, one from each object category, for training. AGD20K also has two train-test splits for seen and unseen settings, and we follow these splits to evaluate performance. Note that AGD20K uses sparse annotation, where the ground truth consists of keypoints within affordance areas, over which a Gaussian kernel is applied to produce dense annotation.

The UMD dataset consists of 28,843 RGB-D images with 7 affordances and 17 object categories, and the images of each object are captured on a revolving turntable. It has two train-test splits, termed the category split and the novel split. We use the category split to evaluate base object categories and the novel split to evaluate performance on novel object classes. Due to its small number of object categories, we take one example from each base object instance to form the training set. The specific affordance categories and object class splits can be found in the supplementary material.

5.2. Implementation details

Experiments are implemented on two GeForce RTX 3090 GPUs. All visual foundation models use the same base-sized vision transformer (ViT-Base). We train the model using the SGD optimizer with a learning rate of 0.01 for 20k iterations. For experiments on AGD20K, images are first resized to 256 × 256 and randomly cropped to 224 × 224 with horizontal flipping. Experiments on the UMD dataset are conducted with the open-source toolbox MMSegmentation [14] using the default training setting. The hyperparameters p, j, and t are set to 8, 3, and 2, respectively.
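A hedged sketch of the AGD20K training configuration described above is shown below, covering only the optimizer and augmentations; the model object and data loading are placeholders, and the exact schedule may differ from the released implementation.

```python
# Sketch of the training configuration (optimizer and augmentations only); the model is a stand-in.
import torch
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

model = torch.nn.Linear(8, 8)  # placeholder for the actual OOAL model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
num_iterations = 20_000        # other hyperparameters from the paper: p=8, j=3, t=2
```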
Following previous work, we adopt the commonly used Kullback-Leibler Divergence (KLD), Similarity (SIM), and Normalized Scanpath Saliency (NSS) metrics to evaluate the results on AGD20K. For the UMD dataset, we use the mean intersection-over-union (mIoU) metric, and also incorporate the harmonic mIoU as a balanced measure that accounts for both the seen and unseen settings.
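For clarity, the snippet below gives reference implementations of these metrics in their common saliency-style form, plus the harmonic mIoU; minor normalization details may differ from the official benchmark code.

```python
# Reference-style metric implementations (common saliency definitions); normalisation may
# differ slightly from the benchmark's evaluation scripts.
import numpy as np

EPS = 1e-12

def kld(pred, gt):
    p, q = pred / (pred.sum() + EPS), gt / (gt.sum() + EPS)
    return float(np.sum(q * np.log(EPS + q / (p + EPS))))

def sim(pred, gt):
    p, q = pred / (pred.sum() + EPS), gt / (gt.sum() + EPS)
    return float(np.minimum(p, q).sum())

def nss(pred, fixation_mask):
    p = (pred - pred.mean()) / (pred.std() + EPS)      # z-score the prediction map
    return float(p[fixation_mask > 0].mean())          # average at annotated keypoints

def harmonic_miou(seen_miou, unseen_miou):
    return 2 * seen_miou * unseen_miou / (seen_miou + unseen_miou)

print(harmonic_miou(74.6, 59.7))  # ~66.3, close to the hIoU reported for "Ours" in Tab. 2
```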
5.3. Comparison to state-of-the-art methods

AGD20K is benchmarked against weakly-supervised affordance grounding (WSAG) approaches, which use image-level object and affordance labels for affordance segmentation. Note that results from WSAG methods are not directly comparable to our setting, as the training labels are different. Although they use only image-level labels, the training data they require are more than 460 times ours. The results in Tab. 1 show that our method exceeds all WSAG counterparts in an easy and realistic setting. We also benchmark the open-vocabulary segmentation methods MaskCLIP, SAN, and ZegCLIP for further comparison, and find that these CLIP-based methods have a large performance gap compared with our method.
Figure 5. Qualitative comparison with LOCATE and ZegCLIP on AGD20K dataset. When multiple affordance predictions overlap, the one
with higher value is displayed. Our predictions distinguish different object parts, while other methods often make overlapping predictions.
| Model | Seen KLD↓ | SIM↑ | NSS↑ | Unseen KLD↓ | SIM↑ | NSS↑ |
| CLIP | 1.294 | 0.384 | 1.107 | 1.556 | 0.327 | 0.966 |
| DeiT III | 1.301 | 0.378 | 1.140 | 1.535 | 0.321 | 1.049 |
| DINOv2 | 1.156 | 0.425 | 1.297 | 1.462 | 0.360 | 1.105 |

Table 3. Comparison of different vision encoders on AGD20K.

| Method | Seen KLD↓ | SIM↑ | NSS↑ | Unseen KLD↓ | SIM↑ | NSS↑ |
| Baseline | 1.156 | 0.425 | 1.297 | 1.462 | 0.360 | 1.105 |
| + TPL | 1.060 | 0.455 | 1.422 | 1.338 | 0.390 | 1.302 |
| + MLFF | 0.846 | 0.537 | 1.622 | 1.115 | 0.447 | 1.440 |
| + TD | 0.749 | 0.578 | 1.738 | 1.131 | 0.443 | 1.408 |
| + CTM | 0.740 | 0.577 | 1.745 | 1.070 | 0.461 | 1.503 |

Table 4. Ablation results of the proposed modules. TPL: text prompt learning. MLFF: multi-layer feature fusion. TD: transformer decoder. CTM: CLS-guided mask.

Figure 7. Qualitative examples of novel affordance prediction on the UMD dataset. The 1st and 2nd rows display results on base objects, and the 3rd and 4th rows show results for novel objects.

on the whole object and the accuracy is far from satisfactory, whereas our results are more part-focused, especially for the unseen objects. For example, the prediction for the unseen object bicycle shows that our model can handle the complex affordance (ride) with multiple separated affordance areas (saddle, handlebar, and pedal). In Fig. 6, we display the results for the UMD dataset. We observe that SegFormer and ZegCLIP often fail to recognize the affordances of objects whose parts are similar in appearance. They also tend to misclassify metallic object parts as the cuttable affordance, suggesting that inferring affordances from appearance features alone can be misleading. In comparison, our predictions are more accurate due to the utilization of DINOv2's part-level semantic correspondences.

One particular feature of our model is that it can recognize novel affordances not shown during training. To demonstrate this, we replace the original affordance labels with semantically similar words and check whether the model can still reason about the corresponding affordance areas. As shown in Fig. 7, the model manages to make correct predictions for novel affordances, such as "hold and grab" for the base affordance "grasp", "saw" for "cut", and "accommodate" for "contain".

5.5. Ablation study

The ablation study is performed on the more challenging AGD20K dataset due to its natural images with diverse backgrounds. Ablations on the hyperparameters are left to the supplementary material.

Different Vision Encoders. To complement the qualitative analysis in Sec. 4.1, we conduct quantitative experiments on CLIP, DeiT III, and DINOv2. Specifically, we simply process the visual features with the embedder and perform matrix multiplication with pre-computed affordance text embeddings to output segmentation maps. As shown in Tab. 3, CLIP and DeiT III exhibit comparable performance, whereas DINOv2 achieves much better results in both the seen and unseen settings, which is consistent with the analysis that DINOv2 is more suitable for affordance learning.
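The encoder-probing baseline described in this paragraph reduces to a single matrix product between embedded patch features and the affordance text embeddings, as the hedged sketch below illustrates; shapes and names are illustrative.

```python
# Sketch of the encoder-probing baseline: embedder output matched against precomputed
# affordance text embeddings by a plain matrix product; no decoder or prompt learning.
import torch

def baseline_segmentation(patch_feats, embedder, text_emb, hw=(16, 16)):
    """patch_feats: (L, C_v) from a frozen vision encoder; text_emb: (N, C) CLIP text embeddings."""
    logits = embedder(patch_feats) @ text_emb.t()        # (L, N) patch-to-affordance scores
    return logits.t().reshape(text_emb.size(0), *hw)     # (N, H', W') coarse segmentation maps

maps = baseline_segmentation(torch.randn(256, 768), torch.nn.Linear(768, 512), torch.randn(36, 512))
print(maps.shape)  # torch.Size([36, 16, 16])
```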
Proposed Methods. We use DINOv2 with a simple embedder as the baseline, and gradually integrate our methods to analyze the effect of each proposed design. The results in Tab. 4 reveal that each module consistently delivers notable improvements. In particular, we notice that the inclusion of the transformer decoder enhances performance in the seen setting but yields inferior results in the unseen setting. With the integration of the CLS-guided mask, the results in both settings improve, suggesting that restricting the cross-attention space is an effective strategy for unseen object affordance recognition.

6. Conclusion

In this paper, we propose the problem of one-shot open affordance learning, which uses one example per base object category as training data and requires the ability to recognize novel objects and affordances. We first present a detailed analysis of different foundation models for the purpose of data-limited affordance learning. Motivated by the analysis, we build a vision-language learning framework with several proposed designs that better utilize the visual features and promote their alignment with text embeddings. Experimental results demonstrate that we achieve comparable performance to several fully-supervised baselines with less than 1% of the full training data.
References

[1] Shir Amir, Yossi Gandelsman, Shai Bagon, and Tali Dekel. Deep vit features as dense visual descriptors. ECCVW What is Motion For?, 2022.
[2] Paola Ardón, Èric Pairet, Katrin S Lohan, Subramanian Ramamoorthy, and Ronald Petrick. Affordances in robotic tasks – a survey. arXiv preprint arXiv:2004.07400, 2020.
[3] Paola Ardón, Eric Pairet, Ronald PA Petrick, Subramanian Ramamoorthy, and Katrin S Lohan. Learning grasp affordance reasoning through semantic relations. IEEE Robotics and Automation Letters, 4(4):4571–4578, 2019.
[4] Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13778–13790, 2023.
[5] Homanga Bharadhwaj, Abhinav Gupta, and Shubham Tulsiani. Visual affordance prediction for guiding robot exploration. arXiv preprint arXiv:2305.17783, 2023.
[6] Jessica Borja-Diaz, Oier Mees, Gabriel Kalweit, Lukas Hermann, Joschka Boedecker, and Wolfram Burgard. Affordance learning from play for sample-efficient policy learning. In 2022 International Conference on Robotics and Automation (ICRA), pages 6372–6378. IEEE, 2022.
[7] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
[8] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
[9] Joya Chen, Difei Gao, Kevin Qinghong Lin, and Mike Zheng Shou. Affordance grounding from demonstration video to target image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6799–6808, 2023.
[10] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801–818, 2018.
[11] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1290–1299, 2022.
[12] Seokju Cho, Heeseong Shin, Sunghwan Hong, Seungjun An, Seungjun Lee, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. Cat-seg: Cost aggregation for open-vocabulary semantic segmentation. arXiv preprint arXiv:2303.11797, 2023.
[13] Ching-Yao Chuang, Jiaman Li, Antonio Torralba, and Sanja Fidler. Learning to act properly: Predicting and explaining affordances from images. In CVPR, 2018.
[14] MMSegmentation Contributors. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
[15] Francisco Cruz, Sven Magg, Cornelius Weber, and Stefan Wermter. Training agents with interactive reinforcement learning and contextual affordances. IEEE Transactions on Cognitive and Developmental Systems, 8(4):271–284, 2016.
[16] Leiyao Cui, Xiaoxue Chen, Hao Zhao, Guyue Zhou, and Yixin Zhu. Strap: Structured object affordance segmentation with point supervision. arXiv preprint arXiv:2304.08492, 2023.
[17] Shengheng Deng, Xun Xu, Chaozheng Wu, Ke Chen, and Kui Jia. 3d affordancenet: A benchmark for visual object affordance understanding. In CVPR, 2021.
[18] Thanh Toan Do, Anh Nguyen, and Ian Reid. Affordancenet: An end-to-end deep learning approach for object affordance detection. ICRA, 2018.
[19] Kuan Fang, Te Lin Wu, Daniel Yang, Silvio Savarese, and Joseph J. Lim. Demo2Vec: Reasoning object affordances from online videos. CVPR, 2018.
[20] Yiran Geng, Boshi An, Haoran Geng, Yuanpei Chen, Yaodong Yang, and Hao Dong. End-to-end affordance learning for robotic manipulation. arXiv preprint arXiv:2209.12941, 2022.
[21] James J. Gibson. The Ecological Approach to Visual Perception: Classic Edition. Houghton Mifflin, 1979.
[22] Denis Hadjivelichkov, Sicelukwanda Zwane, Marc Deisenroth, Lourdes Agapito, and Dimitrios Kanoulas. One-shot transfer of affordance regions? AffCorrs! CoRL, 2022.
[23] Mohammed Hassanin, Salman Khan, and Murat Tahtali. Visual affordance and function understanding: A survey. ACM Computing Surveys (CSUR), 54(3):1–35, 2021.
[24] Zhi Hou, Baosheng Yu, Yu Qiao, Xiaojiang Peng, and Dacheng Tao. Affordance transfer learning for human-object interaction detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 495–504, 2021.
[25] Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023.
[26] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
[27] Mia Kokic, Johannes A Stork, Joshua A Haustein, and Danica Kragic. Affordance detection for task-specific grasping using deep learning. In 2017 IEEE-RAS 17th International Conference on Humanoid Robotics (Humanoids), pages 91–98. IEEE, 2017.
[28] Hema Swetha Koppula, Rudhir Gupta, and Ashutosh Saxena. Learning human activities and object affordances from rgb-d videos. The International Journal of Robotics Research, 32(8):951–970, 2013.
[29] Gen Li, Varun Jampani, Deqing Sun, and Laura Sevilla-Lara. Locate: Localize and transfer object parts for weakly supervised affordance grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10922–10931, 2023.
[30] Yi Li, Hualiang Wang, Yiqun Duan, and Xiaomeng Li. Clip surgery for better explainability with enhancement in open-vocabulary tasks. arXiv preprint arXiv:2304.05653, 2023.
[31] Shaowei Liu, Subarna Tripathi, Somdeb Majumdar, and Xiaolong Wang. Joint hand motion and interaction hotspots prediction from egocentric videos. In CVPR, 2022.
[32] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
[33] Timo Luddecke and Florentin Worgotter. Learning to segment affordances. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 769–776, 2017.
[34] Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, and Dacheng Tao. One-shot affordance detection. In IJCAI, 2021.
[35] Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, and Dacheng Tao. Grounded affordance from exocentric view. arXiv preprint arXiv:2208.13196, 2022.
[36] Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, and Dacheng Tao. Learning affordance grounding from exocentric images. CVPR, 2022.
[37] Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard. Grounding language with visual affordances over unstructured data. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11576–11582. IEEE, 2023.
[38] Reihaneh Mirjalili, Michael Krawez, Simone Silenzi, Yannik Blei, and Wolfram Burgard. Lan-grasp: Using large language models for semantic object grasping. arXiv preprint arXiv:2310.05239, 2023.
[39] Luis Montesano, Manuel Lopes, Alexandre Bernardino, and Jose Santos-Victor. Affordances, development and imitation. In 2007 IEEE 6th International Conference on Development and Learning, pages 270–275. IEEE, 2007.
[40] Lorenzo Mur-Labadia, Ruben Martinez-Cantin, and Jose J Guerrero. Bayesian deep learning for affordance segmentation in images. arXiv preprint arXiv:2303.00871, 2023.
[41] Austin Myers, Angjoo Kanazawa, Cornelia Fermuller, and Yiannis Aloimonos. Affordance of object parts from geometric features. Int. Conf. Robot. Autom., pages 5–6, 2015.
[42] Austin Myers, Ching L Teo, Cornelia Fermüller, and Yiannis Aloimonos. Affordance detection of tool parts from geometric features. ICRA, 2015.
[43] Tushar Nagarajan, Christoph Feichtenhofer, and Kristen Grauman. Grounded human-object interaction hotspots from video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8688–8697, 2019.
[44] Tushar Nagarajan and Kristen Grauman. Learning affordance landscapes for interaction exploration in 3d environments. Advances in Neural Information Processing Systems, 33:2005–2015, 2020.
[45] Anh Nguyen, Dimitrios Kanoulas, Darwin G Caldwell, and Nikos G Tsagarakis. Detecting object affordances with convolutional neural networks. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2765–2770. IEEE, 2016.
[46] Anh Nguyen, Dimitrios Kanoulas, Darwin G Caldwell, and Nikos G Tsagarakis. Object-based affordances detection with convolutional neural networks and dense conditional random fields. In IROS, 2017.
[47] Toan Ngyen, Minh Nhat Vu, An Vuong, Dzung Nguyen, Thieu Vo, Ngan Le, and Anh Nguyen. Open-vocabulary affordance detection in 3d point clouds. arXiv preprint arXiv:2303.02401, 2023.
[48] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[49] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[50] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[51] Adam Rashid, Satvik Sharma, Chung Min Kim, Justin Kerr, Lawrence Chen, Angjoo Kanazawa, and Ken Goldberg. Language embedded radiance fields for zero-shot task-oriented grasping. arXiv preprint arXiv:2309.07970, 2023.
[52] Anirban Roy and Sinisa Todorovic. A multi-scale cnn for affordance segmentation in rgb images. In Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV, pages 186–201. Springer, 2016.
[53] Johann Sawatzky and Jurgen Gall. Adaptive binarization for weakly supervised affordance segmentation. In ICCVW, 2017.
[54] Johann Sawatzky, Abhilash Srikantha, and Juergen Gall. Weakly supervised affordance detection. CVPR, 2017.
[55] Amirreza Shaban, Shray Bansal, Zhen Liu, Irfan Essa, and Byron Boots. One-shot learning for semantic segmentation. arXiv preprint arXiv:1709.03410, 2017.
[56] Yaoxian Song, Penglei Sun, Yi Ren, Yu Zheng, and Yue Zhang. Learning 6-dof fine-grained grasp detection based on part affordance grounding. arXiv preprint arXiv:2301.11564, 2023.
[57] Peize Sun, Shoufa Chen, Chenchen Zhu, Fanyi Xiao, Ping Luo, Saining Xie, and Zhicheng Yan. Going denser with open-vocabulary part segmentation. arXiv preprint arXiv:2305.11173, 2023.
[58] Chao Tang, Jingwen Yu, Weinan Chen, and Hong Zhang. Relationship oriented affordance learning through manipulation graph construction. arXiv preprint arXiv:2110.14137, 2021.
[59] Hugo Touvron, Matthieu Cord, and Hervé Jégou. Deit iii: Revenge of the vit. In European Conference on Computer Vision, pages 516–533. Springer, 2022.
[60] Meng Wei, Xiaoyu Yue, Wenwei Zhang, Shu Kong, Xihui Liu, and Jiangmiao Pang. Ov-parts: Towards open-vocabulary part segmentation. arXiv preprint arXiv:2310.05107, 2023.
[61] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090, 2021.
[62] Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. Side adapter network for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2945–2954, 2023.
[63] Xintong Yang, Ze Ji, Jing Wu, and Yu-Kun Lai. Recent advances of deep robotic affordance learning: a reinforcement learning perspective. IEEE Transactions on Cognitive and Developmental Systems, 2023.
[64] Yuhang Yang, Wei Zhai, Hongchen Luo, Yang Cao, Jiebo Luo, and Zheng-Jun Zha. Grounding 3d object affordance from 2d interactions in images. arXiv preprint arXiv:2303.10437, 2023.
[65] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2881–2890, 2017.
[66] Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. In European Conference on Computer Vision, pages 696–712. Springer, 2022.
[67] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.
[68] Ziqin Zhou, Yinjie Lei, Bowen Zhang, Lingqiao Liu, and Yifan Liu. Zegclip: Towards adapting clip for zero-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11175–11185, 2023.
A. Dataset Details
To evaluate the model's generalization ability in the challenging One-shot Open Affordance Learning (OOAL) setting, datasets with a large number of object categories are required. In addition, at least two object categories are needed for each affordance so that the model can be trained on one object and tested on the other. After an investigation of existing affordance datasets, we find only two datasets, AGD20K [36] and UMD [42], that fulfill these prerequisites and can be used to evaluate the affordance segmentation task. The specific affordance and object categories of these two datasets are shown in Tab. 5. For the unseen split, we display the object category division in Tab. 6. The model is trained on base object classes and evaluated on novel object categories.

Moreover, it is worth noting that the annotations in AGD20K and UMD are of different types. UMD uses pixel-level dense binary maps, while the ground truth of AGD20K consists of sparse keypoints within the affordance areas, over which a Gaussian distribution is applied at each point to generate dense annotation. The difference between dense and sparse affordance annotation is highlighted in Fig. 8.

Figure 8. Different affordance annotation schemes. Dense affordance annotation is labeled as binary masks. Sparse affordance annotation is first labeled as keypoints, and then a Gaussian kernel is applied over each point to produce pixel-wise ground truth.
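As an illustration of the sparse scheme, the sketch below converts keypoint annotations into a dense map by accumulating a Gaussian placed on each point; the kernel width is an arbitrary choice, not the value used by AGD20K.

```python
# Sketch of keypoint-to-dense annotation: place a Gaussian on each keypoint and accumulate.
import numpy as np

def keypoints_to_heatmap(points, height, width, sigma=10.0):
    ys, xs = np.mgrid[0:height, 0:width]
    heat = np.zeros((height, width), dtype=np.float32)
    for (px, py) in points:
        heat += np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
    return heat / (heat.max() + 1e-12)   # normalise to [0, 1]

dense_gt = keypoints_to_heatmap([(50, 60), (120, 80)], height=224, width=224)
print(dense_gt.shape, float(dense_gt.max()))
```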
Table 5. Affordance and object classes in the UMD and AGD20K dataset. The number of classes is shown in parentheses.
Table 6. Object category division in the unseen split of UMD and AGD20K dataset. The number of categories is shown in parentheses.