Abstract—As the most fundamental scene understanding tasks, object detection and segmentation have made tremendous progress in the deep learning era. Due to the expensive manual labeling cost, the annotated categories in existing datasets are often small-scale and pre-defined, i.e., state-of-the-art fully-supervised detectors and segmentors fail to generalize beyond the closed vocabulary. To resolve this limitation, the community has witnessed increasing attention toward Open-Vocabulary Detection (OVD) and Segmentation (OVS) in the last few years. By "open-vocabulary", we mean that the models can classify objects beyond pre-defined categories. In this survey, we provide a comprehensive review of recent developments in OVD and OVS. A taxonomy is first developed to organize
different tasks and methodologies. We find that whether weak supervision signals are permitted, and which of them are used, well discriminates different methodologies, including: visual-semantic space mapping, novel visual feature synthesis, region-aware training, pseudo-labeling, knowledge distillation, and transfer learning. The proposed taxonomy is universal across different tasks, covering object detection, semantic/instance/panoptic segmentation, and 3D and video understanding. The main design principles, key challenges, development routes, methodology strengths, and weaknesses are thoroughly analyzed. In addition, we benchmark each task, along with the vital components of each method, in the appendix, with an online version maintained at awesome-ovd-ovs. Finally, several promising directions are provided and discussed to stimulate future research.
Index Terms—Open-Vocabulary, Zero-Shot Learning, Object Detection, Image Segmentation, Future Directions
1 INTRODUCTION
[Fig. 1 appears here as a taxonomy tree. It crosses settings and tasks (zero-shot and open-vocabulary detection, semantic/instance/panoptic segmentation, and 3D/video understanding) with methodologies (visual-semantic space mapping, novel visual feature synthesis, region-aware training and pseudo-labeling with image-text pairs, knowledge distillation, and transfer learning). Representative models listed in the figure include LAB [18], SPNet [37], PL [32], JoEm [38], MS-Zero [33], PMOSR [39], DELO [34], ZS3Net [21], 3DGenZ [63], GTNet [35], CaGNet [40], SeCondPoint [64], RRFS [36], OVR-CNN [27], MDETR [41], GroupViT [50], VLDet [42], OpenSeg [29], CGG [57], APE [59], OV-DETR [43], GLIP [44], PB-OVD [45], TTD [51], OV-3DET [65], XPM [28], FM-OV3D [66], Detic [46], ViLD [26], DetPro [47], GKC [52], CoDA [67], SAM-CLIP [53], OV-SAM [58], PADing [60], OpenScene [68], and BARON [48].]

Fig. 1: The proposed taxonomy. Typical models are shown in each category. VLMs-IE denotes the image encoder of VLMs.
focusing on limited tasks and settings². We include the zero-shot setting as a complement to the open-vocabulary setting for two reasons: 1) both settings aim to resolve the closed-vocabulary constraint; 2) methodologies under the two settings are interchangeable, e.g., the novel visual feature synthesis methodology (discussed later) under the zero-shot setting can be transferred to the open-vocabulary setting with negligible effort. The motivation for separating the detection and segmentation tasks is that their definitions, training losses, architectures, evaluation metrics, and datasets are different. These tasks have also advanced separately over the past decade; we discuss unifying open-vocabulary detection and segmentation in Sec. 8.

2. In this survey, "zero-shot" and "open-vocabulary" are regarded as two different settings following prior work.

In this paper, we provide a comprehensive review of different scene understanding tasks and settings, including zero-shot/open-vocabulary detection, zero-shot/open-vocabulary semantic/instance/panoptic segmentation, as well as 3D scene and video understanding. To organize methods from these diverse tasks and settings, we need to answer the question: How do we build a taxonomy that differentiates the zero-shot and open-vocabulary settings while, at the same time, abstracting universal methodologies across tasks? We find that whether or not access to weak supervision signals is permitted, and, if permitted, which of them is utilized, is key to categorization. As shown in Fig. 1, zero-shot and open-vocabulary settings are differentiated by the permission of weak supervision signals, and different methodologies differ in which weak supervision signal they use during training. Under each setting, different tasks can share the same taxonomy.

Concretely, ZSD and ZSS are not allowed to access weak supervision signals. To generalize beyond seen objects, besides substituting the learnable classifier in closed-set detectors/segmentors with fixed semantic embeddings, the class-specific localizer is also switched to a class-agnostic one, i.e., the output dimension of the last regression layer is four ([x1, y1, x2, y2] or [x, y, w, h]) instead of four times the number of test classes (see Figs. 2a and 2b). Methodologies under the zero-shot setting can be grouped into:

Visual-Semantic Space Mapping. Though the visual and semantic spaces may each bear discriminative capability within their own modality, there is no direct cross-modality training mechanism mining the mutual relationships between the two spaces. Thus, learning a mapping from visual to semantic space, from semantic to visual space, or a joint mapping of the visual-semantic space via tailored losses is crucial to enable a reliable cross-space similarity measurement. However, due to the lack of unseen annotations, the prediction confidence is always biased toward seen classes.

Novel Visual Feature Synthesis. This methodology utilizes an additional generative model [71], [72], [73] to synthesize fake unseen visual features conditioned on semantic embeddings and random noise, which transfers the problem into a "fully-supervised" setting. The generation loss in Fig. 2c approximates the underlying distribution of real visual features. Then, the classifier embedded in the detector head is retrained on both pristine real seen and generated unseen visual features. Since non-seen regions are typically classified as background, this methodology alleviates both the confusion between novel and background concepts and the bias issue of the previous methodology. A more detailed pipeline is given in Fig. 3.

Once allowed to access the weak supervision signals, OVD and OVS methodologies can be mainly categorized into four types: Region-Aware Training mainly leverages image-text pairs during training.
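The classifier substitution described in this section (fixed semantic, i.e., text, embeddings replacing the learnable classifier, plus a class-agnostic box regressor) can be made concrete with a small sketch. This is illustrative only, assuming pre-computed region features and class text embeddings; names such as `region_feats` and `text_embs` are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def open_vocab_classify(region_feats, text_embs, temperature=0.01):
    """Classify region features against an arbitrary set of class text embeddings.

    region_feats: (num_regions, d) visual features from the detector head.
    text_embs:    (num_classes, d) fixed semantic embeddings (e.g., from a text encoder);
                  swapping this matrix changes the vocabulary without retraining.
    Returns per-region class probabilities of shape (num_regions, num_classes)."""
    logits = l2_normalize(region_feats) @ l2_normalize(text_embs).T / temperature
    logits -= logits.max(axis=-1, keepdims=True)           # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

def class_agnostic_box_head(region_feats, weight, bias):
    """Class-agnostic localization: a single 4-dim box output per region
    ([x1, y1, x2, y2]), independent of the number of test classes."""
    return region_feats @ weight + bias                    # (num_regions, 4)
```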
TABLE 1: Differences between ZSD/ZSS and OVD/OVS. G and Non-G denote generalized and non-generalized evaluation; ✗ = not allowed, ✓ = allowed.

External Information | ZSD/ZSS Inductive (G / Non-G) | ZSD/ZSS Transductive (G / Non-G) | OVD/OVS
Occurrence of Unlabeled & Novel Objects | ✗ / ✗ | ✓ / ✓ | ✓
Image-Text Pairs | ✗ / ✗ | ✗ / ✗ | ✓
Large VLMs | ✗ / ✗ | ✗ / ✗ | ✓
Test Classes | C_B ∪ C_N / C_N | C_B ∪ C_N / C_N | C_B ∪ C_N

The remainder of the paper is organized as follows: Sec. 2 introduces preliminary background. Then we review ZSD and ZSS in Sec. 3 and Sec. 4, and OVD and OVS in Sec. 5 and Sec. 6, respectively. Open-vocabulary 3D scene and video understanding are covered in Sec. 7. We point out challenges and an outlook in Sec. 8 and conclude in Sec. 9. Additional benchmarks with the vital components of each method are listed in the appendix.

2 PRELIMINARIES

2.1 Problem Definition

The goal of OVD/OVS is to detect or segment unseen or novel classes that occupy semantically coherent regions or volumes within an image, video, or a set of point clouds. During the early stage of development, inductive ZSD/ZSS was first formulated to achieve this goal. ZSD/ZSS imposes the constraint that training images do not contain any unseen instance, even if it is not annotated. To resolve this limitation, transductive ZSD [80] considers unannotated unseen samples during training. However, inductive ZSD/ZSS has been the mainstream and is more actively studied than transductive ZSD/ZSS. OVD/OVS can be deemed a variant of transductive ZSD/ZSS that further allows the usage of weak supervision signals. Nonetheless, both the zero-shot and open-vocabulary settings avoid annotated novel objects appearing in the training set. OVD/OVS splits the labeled set C of annotations into two disjoint subsets of base and novel categories, denoted by C_B and C_N, respectively. Note that C_B ∩ C_N = ∅ and C = C_B ∪ C_N. Thus the labeled sets for training and testing are C_train = C_B and C_test = C_B ∪ C_N. With this definition, the difference from closed-set tasks, where C_test = C_train = C, is clear. A complete comparison between ZSD/ZSS and OVD/OVS is listed in Table 1.

2.2 Related Domains and Tasks

In this subsection, we describe several domains highly related to OVD/OVS and summarize their differences.

Visual Grounding. Visual grounding grounds semantic concepts to image regions [75], [76], [81], [82], [83]. Specifically, the task can be divided into: 1) phrase localization [82], which grounds all nouns in the sentence; 2) referring expression comprehension [75], [76], [81], which only grounds the referent in the sentence. In the latter, the referent is labeled not with a class name but with freeform natural language describing instance attributes, positions, and relationships with other objects or backgrounds. This task greatly expands the concepts that the model can ground, but the vocabulary is still closed-set. An ideal way to achieve OVD and OVS would be to scale the small-scale grounding datasets to web-scale datasets. However, the laborious labeling cost is non-negligible.

Weakly-Supervised Detection and Segmentation. Without any bounding box or dense mask annotation, the training labels of the weakly-supervised setting [84] only comprise image-level class names. The image-level labels indicate the object classes that appear in the image. Many weakly-supervised learning techniques, like multiple instance learning [85] and the weakly-supervised grounding loss, have been introduced into OVD/OVS. However, weakly-supervised object detection and segmentation work under the closed-set setting.

Open-Set Detection and Segmentation. Open-set detection [86], [87] and segmentation [88], [89] stem from open-set recognition [90], [91]. They require classifying known classes and identifying a single "unknown" class without further distinguishing its exact classes. This is equivalent to setting C_train = C_B and C_test = C_B ∪ {u}, where u represents the single "unknown" class. The main target is to reject unknown classes that emerge unexpectedly and may hamper the robustness of the recognition system.

Open-World Detection and Segmentation. Open-world detection [92], [93] and segmentation [94] take a step further than open-set detection and segmentation. At time step t, objects are classified as C_B^t = {c_1, c_2, ..., c_k} ∪ {u}. Then, unknown instances are labeled as newly known classes {c_{k+1}, ..., c_{k+m}} by an oracle and added back to C_B^t. At time t+1, the detector is required to detect C_B^{t+1} = {c_1, ..., c_k, ..., c_{k+m}} ∪ {u}. After each time step, the number of classes belonging to "unknown" decreases. This continual learning cycle repeats over the lifetime of the detector. The task aims at incrementally learning new classes without forgetting previously learned classes in a dynamic world. Note that the detector is not re-trained from scratch across time steps.

Out-of-Distribution Detection. In out-of-distribution detection [95], [96], test samples are not assumed to be drawn from the same distribution as the training data (i.e., i.i.d.). Specifically, the distribution shift can be divided into: 1) semantic shift, where out-of-distribution samples are from different classes; 2) covariate shift, where out-of-distribution samples are from different domains but with the same classes, such as sketches or adversarial examples. Out-of-distribution detection primarily focuses on the former, which resembles open-set recognition. Nonetheless, it still does not require sub-classifying the single "out-of-distribution" class, which is the same as the "unknown" class in open-set/world detection and segmentation.

2.3 Canonical Closed-Set Detectors and Segmentors

Object Detection. Faster R-CNN [5] is a representative two-stage detector. Based on anchor boxes, the region proposal network (RPN) first hypothesizes potential object regions to separate foreground and background proposals by measuring their objectness scores. Then, an R-CNN-style [97] detection head predicts per-class probabilities and refines the locations of positive proposals.
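To make the label-space convention of Sec. 2.1 and Table 1 concrete, a minimal sketch follows; the class names are hypothetical placeholders.

```python
# Base (seen) and novel (unseen) categories are disjoint: C = C_B ∪ C_N, C_B ∩ C_N = ∅.
C_B = {"person", "car", "dog"}          # annotated during training: C_train = C_B
C_N = {"zebra", "skateboard"}           # never annotated during training
assert C_B.isdisjoint(C_N)

C_train = set(C_B)
# Generalized (G) evaluation tests on C_B ∪ C_N; non-generalized (Non-G) tests on C_N only.
C_test_generalized = C_B | C_N
C_test_non_generalized = set(C_N)

# Closed-set detection, in contrast, uses the same label space for training and testing.
C_test_closed_set = set(C_train)
```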
Meanwhile, one-stage detectors directly refine the positions of anchors without a proposal generation stage. FCOS [98] regards each feature map grid within the ground-truth box as a positive anchor point and regresses its distances to the four edges of the target box. The centerness score suppresses low-quality predictions of anchor points that are near the boundary of ground-truths. With the development of Transformers in NLP, Transformer-based detectors have dominated the literature recently. DETR [13] reformulates object detection as a set matching problem with a Transformer encoder-decoder architecture. The learnable object queries attend to the encoder output via cross-attention and specialize in detecting objects with different positions and sizes. Deformable DETR [99] designs a multi-scale deformable attention mechanism that sparsely attends to sampled points around queries to accelerate convergence.

Segmentation. DeepLab [100], [101] enhances FCN [8] with dilated convolution, conditional random fields, and atrous spatial pyramid pooling. Mask R-CNN [10] adds a parallel mask branch to Faster R-CNN and proposes RoIAlign for instance segmentation. Following DETR, MaskFormer [9] obtains mask embeddings from object queries and performs a dot-product with up-sampled pixel embeddings to produce segmentation maps. It transforms the per-pixel classification paradigm into a mask region classification framework. Mask2Former [14] follows the same meta-architecture as MaskFormer and introduces a masked cross-attention module that only attends to predicted mask regions.

The MAE encoder processes only non-masked image patches, and the decoder regresses the pixel values of masked patches. Since CLIP behaves like a bag of words [106], i.e., a bag of local objects matches a bag of semantic concepts without differentiating distinct local regions, MAE and DINO are mainly utilized to enhance local image feature representations for OVD/OVS.

Text-to-Image (T2I) Diffusion Models. T2I diffusion models [107], [108] are also trained on internet-scale data. The step-by-step de-noising process gradually evolves pure noise tensors into realistic images conditioned on language. Clustering on their internal feature representations clearly separates different objects, which is much less noisy than CLIP [31], [109], [110]. Since the model has to differentiate semantic concepts to generate distinct objects, the text-guided generation objective naturally imposes precise region-word correspondence learning on the model during pretraining. In contrast, CLIP may cheat by only performing bag-of-words classification [106].

Parameter-Efficient Fine-Tuning (PEFT). Given the computational cost and memory footprint of these large VLMs, current endeavors only fine-tune a small set of additional parameters, such as prompt tuning [77], [111] and adapters [56], [78], [79]. Compared to fully fine-tuning the whole model on downstream data, PEFT balances the trade-off between overfitting on the downstream data and preserving the prior knowledge of VLMs.
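A minimal sketch of the adapter-style PEFT idea above: the large VLM backbone stays frozen and only a small residual bottleneck is trained. This is a generic illustration, not the design of any particular adapter paper; `frozen_backbone` is a placeholder module.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """A small residual bottleneck trained on top of a frozen VLM image encoder."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.up = nn.Linear(dim // reduction, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # residual path preserves the VLM prior

def build_peft_params(frozen_backbone: nn.Module, adapter: nn.Module):
    """Freeze the backbone; only the adapter's parameters are optimized."""
    for p in frozen_backbone.parameters():
        p.requires_grad_(False)
    return [p for p in adapter.parameters() if p.requires_grad]

# Usage sketch (shapes are illustrative):
# feats = frozen_backbone(images)          # (B, N, dim), no gradients to the backbone
# feats = adapter(feats)                   # adapted features for the downstream head
# optimizer = torch.optim.AdamW(build_peft_params(frozen_backbone, adapter), lr=1e-4)
```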
3 ZERO-SHOT DETECTION (ZSD)

3.1 Visual-Semantic Space Mapping

3.1.1 Learning a Mapping from Visual to Semantic Space

LAB [18] assigns multiple classes from WordNet [112], belonging to neither seen nor unseen classes, to the background. A contemporaneous work [19], [113] proposes a meta-class clustering loss besides the max-margin separation loss. It groups similar concepts to improve the separation between semantically-dissimilar concepts and reduce the noise in word vectors. Luo et al. [114] provide an external relationship knowledge graph as pairwise potentials, besides the unary potentials, in a conditional random field to achieve context awareness. ZSDTD [115] leverages textual descriptions to guide the mapping process. Textual descriptions are a general source for improving ZSD due to their rich and diverse context compared to a single word vector. Following the polarity loss [32], BLC [116] develops a cascade architecture to progressively learn the mapping with an external vocabulary and a background-learnable RPN to model the background appropriately. Rahman et al. [80] explore transductive generalized ZSD via fixed and dynamic pseudo-labeling strategies to promote training on unseen samples. Recently, SSB [117] establishes a simple but strong baseline. It carefully ablates model characteristics, learning dynamics, and inference procedures from a myriad of design options.

One-Stage Methods. Besides two-stage methods, applying one-stage detectors [12], [118], [119] to ZSD has also been explored. To improve the low recall rate of novel objects, ZS-YOLO [20] conditions the objectness branch of YOLO [119] on the combination of semantic attributes, visual features, and localization output instead of visual features alone. HRE [120] constructs two parallel visual-to-semantic mapping branches for classification. One is a convex combination of semantic embeddings, while the other maps grid features associated with positive anchors into the semantic space. Later, Rahman et al. [32] design a polarity loss that explicitly maximizes the margin between the predictions of target and negative classes based on the focal loss [12]. A vocabulary metric learning approach is also proposed to provide a richer and more complete semantic space for learning the mapping. Similar to the hybrid branches in HRE [120], Li et al. [121] perform the prediction of super-classes and fine-grained classes in parallel.

3.1.2 Learning a Joint Mapping of Visual-Semantic Space

Learning a mapping from visual to semantic space neglects the discriminative structure of the visual space itself. MS-Zero [33] demonstrates that classes can have poor separation in the semantic space but be well separated in the visual space, and vice versa. Hence, it exploits this complementary information via two unidirectional mapping functions. Similarity metrics are calculated in both spaces and then averaged as the final prediction. Similar to previous works [120], [121], DPIF [122] proposes a dual-path inference fusion module. It integrates an empirical analysis of unseen classes by analogy with seen classes (past knowledge) into the basic knowledge transfer branch. The association predictor learns unseen concepts using training data from a group of associative seen classes as their pseudo instances. ContrastZSD [123] proposes to contrast seen region features to make the visual space more discriminative. It contrasts seen region features with both seen and unseen semantic embeddings under the guidance of a semantic relation matrix.

Fig. 3: Flowchart of novel visual feature synthesis. [Figure: 1) a backbone with classifier and localizer is trained on seen data; 2) a synthesizer is trained on seen semantic embeddings plus noise against real seen visual features via a generation loss; 3) unseen semantic embeddings plus noise are fed to the re-used synthesizer to produce fake unseen visual features on which the classifier is re-trained; 4) the re-trained classifier replaces the original one.]

3.1.3 Learning a Mapping from Semantic to Visual Space

Zhang et al. [124] argue that learning a mapping from visual to semantic space or to a joint space will shrink the variance of the projected visual features and thus aggravate the hubness problem [125], i.e., the high-dimensional visual features are likely to be embedded into a low-dimensional area of incorrect labels. Hence, they embed semantic embeddings into the visual space via a least-squares loss.

3.2 Novel Visual Feature Synthesis

Overview. To enable recognition of novel concepts, novel visual feature synthesis produces fake unseen visual features as training samples for a new classifier. This methodology follows a multi-stage pipeline (as shown in Fig. 3): 1) Train the base model with seen-class annotations in a fully-supervised manner. 2) Train the feature synthesizer G : W × Z → F̃ on seen semantic embeddings w ∈ W_s ⊂ R^d and real seen visual features f_s ∈ F_s ⊂ R^c extracted from the base model to learn the underlying distribution of visual features. 3) Conditioned on the unseen semantic embeddings w ∈ W_u and a random noise vector z ∼ N(0, 1), the synthesizer generates novel unseen visual features. A new classifier is retrained on the fake unseen and real seen visual features, while the remaining parts of the base model are kept frozen. 4) Finally, the new classifier is plugged back into the base model. Note that the noise vector perturbs the synthesizer to produce visually diverging features given the same semantic embedding.
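The four-step pipeline above (cf. Fig. 3) can be summarized in a schematic sketch. The synthesizer, the generative training routine, and the classifier retraining are placeholders for whichever generative model ([71], [72], [73]) and detector a given method uses.

```python
import numpy as np

def synthesize_unseen_features(synthesizer, unseen_sem_embs, n_per_class, noise_dim):
    """Step 3: condition the trained synthesizer G: (w, z) -> f on unseen semantic
    embeddings and random noise z ~ N(0, 1) to produce fake unseen visual features."""
    fake_feats, fake_labels = [], []
    for label, w in enumerate(unseen_sem_embs):             # one embedding per unseen class
        z = np.random.randn(n_per_class, noise_dim)         # noise provides visual diversity
        w_tiled = np.tile(w, (n_per_class, 1))
        fake_feats.append(synthesizer(np.concatenate([w_tiled, z], axis=1)))
        fake_labels.append(np.full(n_per_class, label))
    return np.concatenate(fake_feats), np.concatenate(fake_labels)

# Overall flow (steps follow Fig. 3; the training routines below are placeholders):
# 1) base_model = train_detector_on_seen(seen_images, seen_annotations)
# 2) synthesizer = train_generator(seen_visual_feats, seen_sem_embs)   # match the real feature distribution
# 3) fake_feats, fake_labels = synthesize_unseen_features(synthesizer, unseen_sem_embs, 500, 300)
#    new_classifier = retrain_classifier(seen_visual_feats, seen_labels, fake_feats, fake_labels)
# 4) base_model.roi_head.classifier = new_classifier                   # the rest of the model stays frozen
```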
DELO [34] leverages a conditional variational auto-encoder [73] with three consistency losses forcing the generated visual features to be coherent with the original real ones in terms of the predicted objectness score, category, and class semantics. It then retrains the objectness branch to assign high confidence scores to both seen and unseen objects, which mitigates the low recall rate of unseen objects. Later on, Hayat et al. [126] use the same class consistency loss but adopt the mode-seeking regularization [127], which maximizes the distances between generated data points w.r.t. their noise vectors. Concurrently with DELO [34], GTNet [35] proposes an IoU-aware synthesizer based on a Wasserstein generative adversarial network [128]. Since DELO is only trained on ground-truth boxes that tightly enclose the object boundary, its synthesizer cannot generate unseen RoI features with diverse spatial context clues; that is, the retrained classifier cannot correctly classify RoIs that loosely enclose the object boundary. To mitigate this context gap between unseen RoI features from the RPN and those synthesized by the generator, GTNet [35] randomly samples foreground and background RoIs as additional generation targets, thus making the new classifier robust to various degrees of context. RRFS [36] proposes an intra-class semantic diverging loss and an inter-class structure preserving loss. The former pulls positive synthesized features lying within the hypersphere of the corresponding noise vector close while pushing away those generated from distinct noise vectors. The latter constructs a hybrid feature pool of real and fake features to avoid mixing up the inter-class relationships.

4 ZERO-SHOT SEGMENTATION (ZSS)

ZSS takes a step further than ZSD at a finer pixel-level granularity. We cover zero-shot semantic and instance segmentation in this section as a complement to ZSD.

4.1 Zero-Shot Semantic Segmentation (ZSSS)

4.1.1 Visual-Semantic Space Mapping

Learning a Mapping from Visual to Semantic Space. SPNet [37] is the first work to propose the zero-shot semantic segmentation task. It directly maps pixel features into the semantic space, optimized by the canonical cross-entropy loss. During inference, SPNet calibrates seen predictions by subtracting a factor tuned on a held-out validation set.

Learning a Joint Mapping of Visual-Semantic Space. Hu et al. [129] argue that noisy and irrelevant samples in seen classes have negative effects on learning the visual-semantic correspondence. An uncertainty-aware loss is proposed to adaptively strengthen representative samples, together with an attenuating loss for uncertain samples with high variance estimates. JoEm [38] learns a joint embedding space via the proposed boundary-aware regression loss and semantic consistency loss. At the test stage, semantic embeddings are transformed into semantic prototypes acting as a nearest-neighbor classifier (without the classifier retraining stage of Sec. 4.1.2). The Apollonius calibration inference technique is further proposed to alleviate the bias problem.

Learning a Mapping from Semantic to Visual Space. Kato et al. [130] propose a variational mapping from the semantic space to the visual space by sampling the conditions (mimicking the support images in few-shot semantic segmentation [131]) from the predicted distribution. PMOSR [39] abstracts a set of seen visual prototypes, then trains a projection network mapping seen semantic embeddings to these prototypes. Similar to JoEm [38], since one can simply project unseen semantic embeddings to unseen prototypes for classification, new unseen classes can be flexibly added at inference without classifier retraining. An open-set rejection module is further proposed to prevent unseen classes from directly competing with seen classes.

4.1.2 Novel Visual Feature Synthesis

Concurrent with SPNet [37], ZS3Net [21] conditions the synthesizer [71] on an adjacency graph encoding the structural object arrangement to capture contextual cues for the generation process. CSRL [132] transfers the relational structure constraints in the semantic space, including point-wise, pair-wise, and list-wise granularities, to the visual feature generation process. However, both methods suffer from the mode collapse problem, i.e., the generator often ignores the random noise vectors appended to the semantic embeddings and produces limited visual diversity, hindering the effectiveness of generative models. CaGNet [40] addresses this problem by replacing the simple noise with a contextual latent code, which captures pixel-wise contextual information via dilated convolution and adaptive weighting between different dilation rates. Following CaGNet [40], SIGN [133] also substitutes the noise vector, but with a spatial latent code incorporating relative positional encoding. While the previous ZS3Net [21] and CaGNet [40] simply discard pseudo-labels whose confidence scores are below a threshold and weight the importance of the remaining pseudo-labels equally, SIGN [133] utilizes all pseudo annotations but assigns different loss weights according to the confidence scores of the pseudo-labels.

4.2 Zero-Shot Instance Segmentation (ZSIS)

Zheng et al. [22] are the first to propose the task of zero-shot instance segmentation. They establish a simple mapping from visual features to the semantic space and then classify them using fixed semantic embeddings. The mapping is optimized by a mean-squared error reconstruction loss. Zheng et al. also argue that disambiguation between background and unseen classes is crucial [18], [116]; they design a background-aware RPN and a synchronized background strategy to adaptively represent the background.

5 OPEN-VOCABULARY DETECTION (OVD)

OVD removes the stringent restriction of inductive ZSD on the absence of unannotated novel objects as in Sec. 3. From this section on, we discuss methodologies resorting to weak supervision signals, i.e., the open-vocabulary setting in Sec. 1 and Sec. 2.

5.1 Region-Aware Training

This methodology incorporates image-text pairs [134], [135], [136] into the detection training phase. The vast vocabulary C_T in captions encompasses both C_B and C_N; thus, aligning proposals containing novel objects with words describing novel classes improves classification on C_N.

Weakly-Supervised Grounding or Contrastive Loss. This line of work establishes a coarse and soft correspondence between regions and words via the average of the following two symmetrical losses:

$$\mathcal{L}_{T\rightarrow I} = -\log \frac{\exp(\mathrm{sim}(I, T))}{\sum_{I'\in\mathcal{B}} \exp(\mathrm{sim}(I', T))}, \qquad (1)$$

$$\mathcal{L}_{I\rightarrow T} = -\log \frac{\exp(\mathrm{sim}(I, T))}{\sum_{T'\in\mathcal{B}} \exp(\mathrm{sim}(I, T'))}, \qquad (2)$$

where B represents the image-text batch. The similarity sim(I, T) between image I and caption T is given by:

$$\mathrm{sim}(I, T) = \frac{1}{N_T} \sum_{i=1}^{N_T} \sum_{j=1}^{N_I} \alpha_{i,j}\, \langle e_i^T \cdot e_j^I \rangle, \qquad (3)$$

where N_T and N_I are the number of nouns in the caption and the number of proposals in the image, respectively, and e_i^T and e_j^I are the text and region embeddings, typically encoded by the CLIP text encoder and the detection head. The weight α_{i,j} is calculated by:

$$\alpha_{i,j} = \frac{\exp\langle e_i^T \cdot e_j^I \rangle}{\sum_{j'=1}^{N_I} \exp\langle e_i^T \cdot e_{j'}^I \rangle}. \qquad (4)$$
By minimizing the distance between matched image-caption pairs, novel proposals and novel classes are aligned in a weakly-supervised manner (i.e., without knowing the correspondence between proposals and words). OVR-CNN [27] first formulates the OVD task. Previous ZSD methods only train the vision-to-language projection layer from scratch on base classes, which is prone to overfitting. OVR-CNN instead learns the projection layer during pretraining on image-caption pairs using Eq. (1) and Eq. (2) to align image grids and words. Following OVR-CNN, LocOv [137] introduces a consistency loss that regularizes image-caption similarities over a batch to be the same before and after the multi-modal fusion transformer. LocOv utilizes both region and grid features for measuring Eq. (3), while OVR-CNN only adopts the latter. However, the multi-modal masked language modeling [138], [139], [140], [141] objective in OVR-CNN and LocOv tends to attend to similar global proposals covering many concepts for distinct masked words. To force the multi-modal fusion transformer to focus more on exclusive proposals containing only one concept for different masked words, MMC-Det [142] drives the attention maps over proposals to be divergent for different masked words via the proposed divergence loss. DetCLIP [143] adds per-category definitions from WordNet [112] to the CLIP text encoder to encode richer semantics. The category names and definitions are encoded individually and in parallel to avoid unnecessary interactions between category names. DetCLIPv2 [144] selects the single region that best fits the current word via argmax instead of aggregating all region features via the softmax in Eq. (4). It excludes L_{I→T} from its bi-directional loss due to the partial labeling problem, i.e., the caption usually only describes salient objects in the image, hence most proposals cannot find their matching words in the caption. WSOVOD [145] recalibrates RoI features via input-conditional coefficients over dataset attribute prototypes to de-bias the different distributions of different datasets. It employs multiple instance learning [85] to address the lack of box annotations.

The previous works perform weakly-supervised grounding only on relatively small-scale image-text pairs; another series of works pretrains the model on web-scale image-text datasets. RO-ViT [146] is pretrained from scratch on the same dataset as ALIGN [102], using the focal loss [12] instead of the cross-entropy loss to mine hard negative examples. The positional embeddings are randomly cropped and resized to the whole-image resolution during pretraining, causing the model to regard pretrained images not as full images but as region crops from some unknown larger images. This matches the usage of proposals in the detection fine-tuning stage, as they are cropped from a holistic image. To improve local feature representations for localization, CFM-ViT [147] adds the masked autoencoder objective [105] besides the bi-directional contrastive loss during pretraining on the ALIGN [102] dataset. It randomly masks the positional embeddings during pretraining to achieve an effect similar to RO-ViT [146]. RO-ViT and CFM-ViT only pretrain the image encoder, while detection-specific components such as the FPN [6] and RoI head [5] are randomly initialized and trained only on detection data. To bridge this architecture gap between pretraining and detection finetuning, DITO [148] pretrains the FPN [6] and RoI head [5] along with the image encoder. Embeddings of randomly sampled regions are max-pooled, and an image-text contrastive loss is applied to each FPN level separately.

Bipartite Matching. VLDet [42] formulates region-word alignment as a set-matching problem between regions and nouns. The matching is automatically learned on image-caption pairs via the off-the-shelf Hungarian algorithm [13]. Following VLDet, GOAT [149] mitigates the biased objectness score by comparing region features with an open corpus of external object concepts as another assessment of objectness. GOAT reconstructs the base classifier by taking a weighted mean of the top-k external concept embeddings most similar to a base concept to enhance generalization. OV-DETR [43] conditions object queries on concept embeddings. They are constructed either from text embeddings or from image embeddings obtained by feeding base ground-truth boxes and novel proposals to the CLIP image encoder. It reformulates the set matching problem into conditional binary matching, which measures the matchability between detection outputs and the conditional object queries. However, the conditioned object queries are class-specific, i.e., their number is linearly proportional to the number of classes. Prompt-OVD [150] addresses this slow inference speed of OV-DETR by prepending class prompts instead of repeatedly adding to object queries, and by changing the binary matching objective to a multi-label classification cost. It further proposes RoI-based masked attention and RoI pruning to extract region embeddings in just one forward pass of CLIP. CORA [151] learns region prompts following the addition variant of VPT [77] to adapt CLIP to region-level classification. The proposed anchor pre-matching makes object queries class-aware and avoids the repetitive per-class inference of OV-DETR [43]. Based on OV-DETR [43], EdaDet [152] preserves fine-grained and generalizable local image semantics to attain better base-to-novel generalization.

Leveraging Visual Grounding Datasets. Resorting only to image-level captions brings noise and misalignment into region-word correspondence learning. This line of work leverages ground-truth region-level texts to help mitigate this problem. Each region-level description may contain one object (phrase grounding [74], [82]) or multiple objects with a subject (referring expression comprehension and segmentation [75], [76]). These datasets are combined into one dataset termed GoldG in MDETR [41]. It proposes soft token prediction to predict the span of tokens in these texts and performs a contrastive loss at the region-word level in the latent feature space. Note that GLIP [44] in Sec. 5.2 also uses the GoldG dataset, but its main purpose is to generate reliable pseudo labels for subsequent self-training. MAVL [153] improves MDETR with multi-scale deformable attention [99] and late fusion between the image and text modalities.
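A minimal sketch of the set-matching formulation used by the bipartite-matching line of work above, relying on the off-the-shelf Hungarian algorithm in SciPy. The cost (negative cosine similarity between nouns and regions) is a simplification of what any specific method actually optimizes.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_regions_to_nouns(region_embs, noun_embs):
    """Assign each caption noun to one region proposal (one-to-one) by maximizing similarity.

    region_embs: (num_regions, d), noun_embs: (num_nouns, d), with num_regions >= num_nouns.
    Returns (noun_idx, region_idx) index arrays describing the optimal assignment."""
    def norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    cost = -(norm(noun_embs) @ norm(region_embs).T)   # (num_nouns, num_regions), lower = better
    noun_idx, region_idx = linear_sum_assignment(cost)
    return noun_idx, region_idx

# The matched pairs then serve as region-word supervision for the open-vocabulary
# classifier, while unmatched regions are treated as background.
```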
MQ-Det [154] augments the language queries in GLIP [44] with fine-grained vision exemplars in a gated, residual-like manner. It takes vision queries as the keys and values of the class-specific cross-attention layer. The vision-conditioned masked language prediction forces the model to align with vision cues to reduce the learning inertia problem [154]. YOLO-World [155] follows the formulation of GLIP and mainly focuses on equipping the YOLO series with open-vocabulary detection capability. SGDN [156] also leverages grounding datasets [74], [82], but it exploits additional object relations in a scene graph to facilitate discovering, classifying, and localizing novel objects.

5.2 Pseudo-Labeling

Models advocating pseudo-labeling also leverage abundant image-text pairs as in Sec. 5.1, but additionally, they adopt large pretrained VLMs or themselves (via self-training) to generate pseudo labels. They need to know the exact novel categories C_N at the training stage. Detectors are then trained on the union of base seen-class annotations and pseudo labels. According to the type and granularity of pseudo labels, methods can be grouped into: pseudo region-word pairs, region-caption pairs, and pseudo captions.

Pseudo Region-Word Pairs. The bi-directional grounding or contrastive loss in Sec. 5.1 allows one region/word to correspond to multiple words/regions weighted by softmax. However, this line of work explicitly allows only one region/word to correspond to one word/region (i.e., hard alignment). RegionCLIP [157] leverages CLIP to create pseudo region-word pairs to pretrain the image encoder of the detector. However, proposals with the highest CLIP scores yield low localization performance. Targeting this problem, VL-PLM [158] fuses CLIP scores with objectness scores and repeatedly applies the RoI head to remove redundant proposals. GLIP [44] reformulates object detection as phrase grounding and trains the model on the union of detection and grounding data. This enables the teacher to utilize language context for grounding novel concepts, while previous pseudo-labeling methods only train the teacher on detection data. GLIPv2 [159] further reformulates visual question answering [160], [161] and image captioning [162], [163] into a grounded VL understanding task. Following GLIP [44], Grounding DINO [164] upgrades the detector into a Transformer-based one, enhancing the capacity of the teacher model. Instead of generating pseudo labels once, PromptDet [165] iteratively learns region prompts and sources uncurated web images in two rounds, leading to more accurate pseudo boxes in the second round. Similar to semi-supervised detection [166], [167], SAS-Det [168] also proposes to refine the quality of pseudo boxes online. The student is optimized on pseudo boxes and base ground-truths. Then, the teacher updates its parameters through the exponential moving average of the student. Thereby, the quality of the pseudo boxes is gradually improved during training. Previous approaches adopt simple thresholding [44], [159], [164] or a top-scoring heuristic rule [157], [165], [169], [170] to filter pseudo labels, which is vulnerable and lacks explainability. PB-OVD [45] employs GradCAM [171] to compute the activation map of the cross-attention layer of ALBEF [141] w.r.t. an object of interest in the caption. Then, the proposals that overlap the most with the activation map are regarded as pseudo ground-truths. CLIM [172] combines multiple images into a mosaicked image, with each image treated as a pseudo region and the paired caption deemed the region label. CLIM can be applied to many open-vocabulary detectors [27], [46], [48]. To generate more accurate pseudo labels, VTP-OVD [173] introduces an adapting stage to enhance the alignment between pixels and categories via learnable visual [77] and textual prompts. ProxyDet [174] argues that pseudo-labeling cannot improve novel classes other than those defined in C_N. It finds that many novel classes reside in the convex hull constructed by base classes in the CLIP visual-semantic embedding space. These novel classes can be approximated via a linear mixup between a pair of base classes. Hence, ProxyDet synthesizes region and text embeddings of proxy (fake) novel classes and aligns them to achieve a broader generalization ability beyond the pseudo-labeled novel classes. CoDet [175] builds a concept group comprising all image-text pairs that mention a concept in their captions. The most common object appearing in the group is assumed to match the concept. Hence, it reduces the problem of modeling cross-modality correspondence (region-word) to in-modality correspondence (region-region). The new formulation requires finding the most co-occurring objects by utilizing the fact that the VLMs-IE exhibits feature consistency for visually similar regions across images.

Pseudo Region-Caption Pairs. Models in this category establish a pseudo correspondence between the whole caption and a single image region, which is easier and less noisy compared to the more fine-grained region-word correspondence. Contrary to weakly-supervised detection, which propagates image-level labels to corresponding proposals, Detic [46] side-steps this error-prone label assignment process. It simply trains the max-size proposal to predict all image-level labels, similar to multiple instance learning [85] and multi-label classification. The max-size proposal is assumed to be big enough to cover all image-level labels. Thus, the classifier encounters various novel classes during training on ImageNet21K [176], and it can generalize to novel objects at inference. Building on top of Detic, Kaul et al. [177] employ a large language model, i.e., GPT-3 [178], to generate rich descriptions for text embeddings. Besides, vision exemplars are encoded and merged with text embeddings to incorporate in-modality classification clues. 3Ways [169] regards the top-scoring bounding box per image as corresponding to the whole caption. It also augments text embeddings to avoid overfitting and includes trainable gated shortcuts to stabilize training. PLAC [179] learns a region-to-text mapping module that pulls image regions close to the corresponding caption. Then, novel proposals are matched to these region embeddings containing novel semantics.

Pseudo Captions. PCL [180] proposes to generate another type of pseudo label, i.e., pseudo captions describing objects in natural language, instead of generating bounding boxes. It leverages an image captioning model to generate captions for each object, which are then fed into the CLIP text encoder, encoding the class attributes and relationships with the surrounding environment. This can be regarded as better prompting compared to the template prompts in CLIP.
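Across these pseudo-labeling variants, a common mechanic is turning VLM scores on class-agnostic proposals into box-level supervision for the novel categories C_N known at training time. The sketch below is a generic simplification in the spirit of VL-PLM's fusion of CLIP and objectness scores, not the exact rule of any single paper; all names and the threshold are illustrative.

```python
import numpy as np

def generate_pseudo_boxes(proposals, objectness, vlm_probs, novel_class_ids, thr=0.8):
    """proposals:   (N, 4) class-agnostic boxes from the RPN.
    objectness:     (N,)   objectness scores of the proposals.
    vlm_probs:      (N, C) VLM classification probabilities over the full vocabulary.
    novel_class_ids: indices of the novel categories C_N known at training time.
    Returns pseudo (box, class) pairs to be mixed with base ground-truth annotations."""
    pseudo = []
    for box, obj, probs in zip(proposals, objectness, vlm_probs):
        cls = int(np.argmax(probs))
        fused = np.sqrt(obj * probs[cls])       # fuse localization and VLM confidence
        if cls in novel_class_ids and fused > thr:
            pseudo.append((box, cls))
    return pseudo
```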
5.3 Knowledge Distillation

The knowledge distillation methodology (cf. Fig. 4) employs a teacher-student paradigm where the student learns to mimic the teacher's predictions or feature representations.

One method in this family trains on multiple data sources with heterogeneous label spaces; during inference, it calibrates the base and novel predictions via a class-specific prior probability recording how the network is biased towards that category. Instead of fine-tuning the whole model, F-VLM [49] leverages the frozen CLIP image encoder to extract features and only trains the detection head. It ensembles the predictions of the detector and CLIP via the same geometric mean (dual-path inference) as ViLD [26]. Another line of work only employs the CLIP text encoder for open-vocabulary classification and discards the CLIP image encoder. ScaleDet [192] unifies multi-dataset label spaces by relating labels through semantic similarities across datasets. OpenSeeD [193] unifies OVD and OVS in one network. It proposes decoupled foreground-background decoding and conditioned mask decoding to compensate for task and data discrepancies, respectively. CRR [194] analyzes whether the classification in the RoI head network hampers the generalization ability of the RPN and proposes to decouple them, i.e., the RPN and RoI head do not share the backbone and are separately trained. Sambor [195] introduces a ladder side adapter which assimilates localization and semantic priors from SAM and CLIP simultaneously. The automatic mask generation of SAM [196] is employed to enhance the robustness of class-agnostic proposal generation in the RPN.

6 OPEN-VOCABULARY SEGMENTATION (OVS)

In this section, we review semantic, instance, and panoptic segmentation tasks using the same taxonomy as in Sec. 5.

6.1 Open-Vocabulary Semantic Segmentation (OVSS)

6.1.1 Region-Aware Training

Weakly-Supervised Grounding or Contrastive Loss. Using the same Eq. (2) and Eq. (1), OpenSeg [29] randomly drops each word to prevent overfitting. SLIC [197] incorporates the local-to-global self-supervised learning of DINO [103], [104] to improve the quality of local features for dense prediction.

Learning from Natural Language Supervision Only. Methods falling into this category aim to learn transferrable segmentation models purely from image-text pairs without densely annotated masks. GroupViT [50] progressively groups segment tokens into arbitrary-shaped and semantically-coherent segments given their assigned group indices via Gumbel-softmax [198]. Segment tokens in the last layer are average-pooled and contrasted with captions as in CLIP [31]. Besides the image-text contrastive loss, ViL-Seg [199] incorporates the same local-to-global correspondence learning as DINO [103] via a multi-crop strategy [200] to capture fine-grained semantics. An online clustering head trained by mutual information maximization [201] groups pixel embeddings at the end of the ViT. Following GroupViT, SegCLIP [202] enhances local feature representations with an additional MAE [105] objective and a superpixel-based Kullback-Leibler loss. OVSegmentor [203] adopts slot attention [204] to bind patch tokens into groups. To ensure visual invariance across images for the same object, OVSegmentor proposes a cross-image mask consistency loss. The CLIP ViT-based image encoder performs worse on patch-level classification. To remedy this issue, PACL [205] adds an additional patch-text contrastive loss between patch embeddings and the caption. TCL [206] designs a region-level text grounder that produces text-grounded masks containing only the objects described in the caption, without irrelevant areas. The matching is then performed on the grounded image region and text instead of the whole image and text. SimSeg [207] points out that CLIP heavily relies on contextual pixels and contextual words instead of entity words. Hence, instead of densely aligning all image patches with all words, SimSeg sparsely samples a portion of the patches and words used for the bi-directional contrastive loss in Eq. (2) and Eq. (1).

6.1.2 Pseudo-Labeling

Zabari et al. [51] leverage an interpretability method [208] to generate a coarse relevance map for each category. The relevance map is refined by test-time-augmentation techniques (e.g., horizontal flip, contrast change, and crops). The synthetic supervision is then generated from the refined relevance maps using stochastic pixel sampling.

6.1.3 Knowledge Distillation

GKC [52] enriches the template prompts with synonyms from WordNet [112] instead of relying only on a single category name to guess what the object looks like. The text-guided knowledge distillation transfers the inter-class distance relationships of the semantic space into the visual space, which is similar to [170]. SAM-CLIP [53] merges the image encoders of SAM [196] and CLIP [31] into one via a cosine distillation loss, in a memory replay and rehearsal manner on a subset of their pretraining images. It combines the superior localization ability of SAM and the semantic understanding ability of CLIP. ZeroSeg [209] builds on top of MAE [105] and GroupViT [50]. Unlike GroupViT, which requires text supervision, ZeroSeg distills multi-scale CLIP image features into learnable segment tokens purely on unlabeled images. The image is divided into multiple views, thus capturing both local and global semantics.

6.1.4 Transfer Learning

This methodology aims to transfer the VLMs-TE and VLMs-IE to segmentation tasks. The transfer strategy is explored in the following aspects: 1) only adopting the VLMs-TE for open-vocabulary classification; 2) leveraging the frozen VLMs-IE as a feature extractor; 3) directly fine-tuning the VLMs-IE on segmentation datasets; 4) employing visual prompts or attaching a lightweight adapter to the frozen VLMs-IE for feature adaptation. A detailed comparison is given in Fig. 5.

VLMs-TE as Classifier. LSeg [30] simply replaces the learnable classifier in the segmentor [210] with text embeddings from the CLIP text encoder. SAZS [211] focuses on improving boundary segmentation, supervised by the output of edge detection on ground-truth masks. During inference, SAZS fuses the predictions with eigen segments obtained through spectral analysis on DINO [103] to promote shape-awareness. Son et al. [212] align object queries and text embeddings by bipartite matching [13]. They force queries not matched to any base or novel class to predict a uniform distribution over base and novel classes, thus avoiding the use of a "background" embedding. A multi-label ranking loss is employed to encourage the similarity of any positive label to be higher than that of any negative label.
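Secs. 5.3 and 6.1.3 revolve around pulling a student's region or image embeddings toward those of a frozen teacher such as the CLIP image encoder. A minimal sketch of such an embedding-level distillation loss is given below in cosine form (mirroring, e.g., SAM-CLIP's cosine distillation loss); the helper names in the usage comment are placeholders.

```python
import torch
import torch.nn.functional as F

def cosine_distillation_loss(student_embs: torch.Tensor, teacher_embs: torch.Tensor) -> torch.Tensor:
    """student_embs, teacher_embs: (N, d). The teacher (e.g., a frozen VLM image encoder)
    is not updated; only the student receives gradients."""
    teacher_embs = teacher_embs.detach()
    return (1.0 - F.cosine_similarity(student_embs, teacher_embs, dim=-1)).mean()

# Usage sketch (helper functions are placeholders, not a real API):
# with torch.no_grad():
#     teacher_embs = clip_image_encoder(crop_regions(images, proposals))
# student_embs = detector_extract_region_embeddings(images, proposals)
# loss_distill = cosine_distillation_loss(student_embs, teacher_embs)
```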
... ensure the reliability of the raw segmentation map produced by CLIP. CLIPSeg [240] attaches a lightweight decoder to the CLIP image encoder with U-Net [241] skip connections, conditioned on text embeddings. SAN [56] attaches a lightweight vision transformer to the frozen CLIP image encoder and requires only a single forward pass of CLIP. SAN decouples the mask proposal and classification stages by predicting attention biases applied to the deeper layers of CLIP for recognition. CLIP Surgery [109] discovers that CLIP has opposite visualizations, similar to the findings of SimSeg [207], as well as noisy activations. The proposed architecture surgery replaces the query-key self-attention, which causes the opposite visualization problem, with a value-value self-attention. The feature surgery identifies and removes redundant features to reduce noisy activations. Instead of learning visual prompts, CaR [242] puts a red circle on the proposal as a prompt to guide the attention of CLIP toward the region [243]. It designs a training-free, RNN-like framework where text queries are deemed hidden states and are gradually refined to remove categories not present in the image. However, repeatedly forwarding the whole image multiple times incurs a slow inference speed.

6.2 Open-Vocabulary Instance Segmentation (OVIS)

Region-Aware Training. CGG [57] achieves region-text alignment via a grounding loss, but not with the whole caption as in OVR-CNN [27]. CGG extracts object nouns so that object-unrelated words do not interfere with the matching process. In addition, CGG proposes caption generation to reproduce the caption paired with the image. D²Zero [244] proposes an unseen-constrained feature extractor and an input-conditional classifier to address the bias issue. It proposes image-adaptive background representations to discriminate novel and background classes more effectively.

Pseudo-Labeling. XPM [28] first trains a teacher model on base annotations, then self-trains a student model. The pseudo regions are selected as the regions most compatible with the object nouns in the caption. However, pseudo masks contain noise that degrades performance. XPM assumes that each pixel in a pseudo mask is corrupted by Gaussian noise, and the student is trained to predict the noise level to down-weight incorrect teacher predictions. Mask-free OVIS [245] performs iterative masking using ALBEF [141] and GradCAM [171] to generate pseudo-instances for both base and novel categories. To alleviate the overfitting issue, it avoids training base categories using strong supervision and novel categories using weak supervision. MosaicFusion [246] runs a T2I diffusion model on a mosaic image canvas to generate multiple pseudo instances simultaneously. Pseudo masks are obtained by aggregating cross-attention maps across heads, layers, and time steps.

Knowledge Distillation. OV-SAM [58] proposes to combine CLIP and SAM into one unified architecture. Adapters are inserted at the end of the CLIP image encoder and mimic SAM features via a mean-squared error loss. The CLIP features are also fed into the SAM mask decoder to enhance open-vocabulary recognition ability.

6.3 Open-Vocabulary Panoptic Segmentation (OVPS)

Region-Aware Training. Uni-OVSeg [247] employs bipartite matching on image-caption pairs similar to VLDet [42], except that Uni-OVSeg designs a multi-scale cost matrix to achieve more accurate matching. X-Decoder [232] proposes an image-text decoupled framework for not only open-vocabulary panoptic segmentation but also other vision-language tasks. It defines latent and text queries responsible for pixel-level segmentation and semantic-level classification, respectively. APE [59] jointly trains on multiple detection and grounding datasets as well as SA-1B [196]. In contrast to GLIP [44], it reformulates visual grounding as object detection and encodes concepts independently instead of concatenating concepts and encoding them all together. It treats a stuff class as multiple disconnected standalone regions, removing the separate head networks for things and stuff as in OpenSeeD [193]. Moreover, SA-1B is semantic-unaware, i.e., thing and stuff are indistinguishable; APE can utilize SA-1B for training with only one head, without differentiating things and stuff.

Knowledge Distillation. The previous synthesizers [72], [73], [128] in Sec. 4.1.2, built from several linear layers, do not consider the feature granularity gap between the image and text modalities. PADing [60] proposes learnable primitives to reflect the rich and fine-grained attributes of visual features, which are then synthesized via weighted assemblies of these abundant primitives. In addition, PADing [60] decouples visual features into semantic-related and semantic-unrelated parts, and it only aligns the semantic-related parts to the inter-class relationship structure of the semantic space.

Transfer Learning. FC-CLIP [61] closely resembles F-VLM [49] in that both use a frozen CNN-based CLIP image encoder and take a geometric ensemble for base and novel classes separately. FreeSeg [248] learns prompts separately for the semantic/instance/panoptic tasks and different classes during the training stage. At inference, FreeSeg optimizes class prompts following test-time adaptation [249], [250] by minimizing the entropy. PosSAM [251] employs the frozen CLIP and SAM image encoders and fuses their output visual features via cross-attention. MasQCLIP [252] follows MaskCLIP [218], except that the query projection of the mask class token is learnable. This is to reduce the shift of CLIP from extracting image-level features to classifying a masked region in an image. OMG-Seg [253] explores the paradigm of unifying semantic/instance/panoptic segmentation and their counterparts in videos under both closed- and open-vocabulary settings. Semantic-SAM [254] consolidates multiple datasets across granularities and trains on decoupled object and part classification, achieving semantic-awareness and granularity-abundance through a multiple-choice learning paradigm. ODISE [62] resorts to T2I diffusion models [108] as the mask feature extractor. It also proposes an implicit captioner via the CLIP image encoder to map images into pseudo words. The training is driven by the bi-directional grounding loss of Sec. 5.1. Same as ODISE [62], HIPIE [255] ensembles classification logits with CLIP. It can hierarchically segment things, stuff, and object parts. It employs two separate decoders for things and stuff to deal with their different losses. Zheng et al. [256] design mask class tokens to extract dense image features corresponding to each mask area via the proposed relative mask attention.
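The geometric ensemble ("dual-path inference") used by ViLD, F-VLM, and FC-CLIP, as mentioned above, fuses in-vocabulary and VLM probabilities with different exponents for base and novel classes. A minimal sketch follows; the exponents alpha and beta are illustrative hyper-parameters, not values reported by any specific paper.

```python
import numpy as np

def geometric_ensemble(det_probs, vlm_probs, is_base, alpha=0.35, beta=0.65):
    """det_probs, vlm_probs: (N, C) per-region probabilities from the trained head and
    from the frozen VLM, respectively. is_base: (C,) boolean mask of base classes.
    Base classes trust the detector more; novel classes trust the VLM more."""
    fused_base = det_probs ** (1 - alpha) * vlm_probs ** alpha
    fused_novel = det_probs ** (1 - beta) * vlm_probs ** beta
    fused = np.where(is_base[None, :], fused_base, fused_novel)
    return fused / fused.sum(axis=-1, keepdims=True)
```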
class recognition ability, thus OPSNet modulates the two to PLA [261] first calibrates predictions to avoid over-confident
learn from each other. scores on base classes. Then, hierarchical coarse-to-fine
point-caption pairs, i.e., scene-, view-, and entity-level point-
7 O PEN -VOCABULARY B EYOND I MAGES caption association via a pretrained captioning model [262]
are constructed, effectively facilitating learning from lan-
7.1 Open-Vocabulary 3D Scene Understanding guage supervisions. However, the pseudo-captions at the
Open-vocabulary 3D scene understanding is relatively view level only cover sparse and salient objects in a scene,
under-explored and suffers a more severe data scarcity failing to provide fine-grained descriptions. To enable dense
issue, even pairing point clouds with text descriptions is not regional point-language associations, RegionPLC [263] cap-
feasible. Hence, the point-cloud and text modality are typ- tions image patches in a sliding window fashion together
ically bridged via the intermediate image modality, where with object proposals. A point-discriminative contrastive
VLMs (e.g., CLIP) step in to guide the association. Typically, learning objective is further proposed that makes the gra-
the dataset annotates the projection matrix that transforms dient of each point unique. Note that both PLA and Region-
3D point clouds into 2D boxes and vice versa. PLC are capable of instance segmentation.
7.1.1 Open-Vocabulary 3D Object Detection (OV3D)
OV-3DET [65] leverages pseudo 2D boxes without class labels generated from a pretrained 2D open-vocabulary detector [46]. These pseudo 2D boxes are then back-projected to 3D space and deemed the supervision signal for localization learning. To classify the predicted 3D RoIs, their features are first projected to image crops, which are then encoded by the CLIP image encoder. These image crop features are assigned a semantic label by CLIP, so the 3D RoI feature can connect to its labeled text embedding via the paired relationship between the 3D and 2D spaces. A triplet contrastive loss is applied to drive the 3D RoI feature to approach both the projected image feature and its associated text embedding. FM-OV3D [66] follows a similar pipeline by back-projecting the 2D boxes of Grounded-SAM [258]. For open-vocabulary 3D recognition, FM-OV3D aligns 3D RoI features with CLIP representations of both text and visual prompts from GPT-3 [178] and Stable Diffusion [108].
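A minimal sketch of the alignment objective described above, assuming the 3D RoI, image-crop, and text embeddings share one dimension; it is only meant to show the shape of such a loss, not OV-3DET's [65] exact formulation.

```python
import torch
import torch.nn.functional as F

def align_3d_roi(roi_3d, crop_emb, text_emb, labels, tau=0.07):
    """roi_3d: [N, D] 3D RoI embeddings; crop_emb: [N, D] CLIP embeddings of the
    paired image crops; text_emb: [K, D] CLIP text embeddings; labels: [N] classes
    that CLIP assigned to the crops."""
    roi_3d, crop_emb = F.normalize(roi_3d, dim=-1), F.normalize(crop_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    loss_img = (1 - (roi_3d * crop_emb).sum(-1)).mean()  # pull toward the projected image feature
    logits = roi_3d @ text_emb.T / tau                   # pull toward the associated text embedding
    loss_txt = F.cross_entropy(logits, labels)
    return loss_img + loss_txt
```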
OpenSight [259] adds temporal awareness by correlating the predicted 3D boxes across consecutive timestamps and recalibrating missed or inaccurate boxes. In contrast to relying on a 2D open-vocabulary detector, CoDA [67] discovers novel 3D objects in an online and progressive manner, even though the detector is trained only on a few 3D base annotations. The projected 2D box features are used to distill CLIP knowledge into 3D object features. Though the image modality bridges point clouds and the text modality, it requires extra alignment between point clouds and images, which may limit the performance. L3Det [260] proposes not to leverage images but large-scale, large-vocabulary 3D object datasets. It inserts 3D objects covering both base and novel classes into the scene in a physically reasonable way and generates grounded descriptions for them. Therefore, L3Det bypasses the image modality and directly learns the alignment between 3D objects and texts through contrastive learning.
Open-Vocabulary 3D Semantic Segmentation (OV3SS). PLA [261] first calibrates predictions to avoid over-confident scores on base classes. Then, hierarchical coarse-to-fine point-caption pairs, i.e., scene-, view-, and entity-level point-caption associations, are constructed via a pretrained captioning model [262], effectively facilitating learning from language supervision. However, the pseudo-captions at the view level only cover sparse and salient objects in a scene, failing to provide fine-grained descriptions. To enable dense regional point-language associations, RegionPLC [263] captions image patches in a sliding-window fashion together with object proposals. A point-discriminative contrastive learning objective is further proposed that makes the gradient of each point unique. Note that both PLA and RegionPLC are capable of instance segmentation.
Open-Vocabulary 3D Instance Segmentation (OV3IS). OpenMask3D [264] aggregates per-mask features via multi-view fusion of CLIP image embeddings. The projected 2D segmentation masks are refined by SAM [196] to remove outliers. MaskClustering [265] proposes a multi-view consensus rate to assess whether 2D masks from off-the-shelf detectors belong to the same 3D instance (i.e., whether they should be merged). An iterative graph clustering is designed to better distinguish distinct 3D instances in a class-agnostic manner. OpenIns3D [266] requires no image inputs as a bridge between point clouds and texts. It synthesizes scene-level images, leverages an OVD detector in the 2D domain to detect objects, and then associates them with semantic labels. Open3DIS [267] addresses the problem of proposing high-quality masks for small-scale and geometrically ambiguous instances by aggregating them across frames.

7.2 Open-Vocabulary Video Understanding (OVVU)
OV2Seg [268] first proposes the open-vocabulary video instance segmentation task, which simultaneously detects, segments, and tracks arbitrary instances regardless of their presence in the training set. It collects a large-vocabulary video instance segmentation dataset covering 1,212 categories for benchmarking. OV2Seg leverages the CLIP text encoder to classify queries from its memory-induced tracking module. A concurrent work, OpenVIS [69], first proposes instances in a frame exhaustively. In the second stage, it designs a square crop that avoids distorting the aspect ratio of instances to better conform to the pre-processing of the CLIP image encoder. However, previous works [69], [268] align the same instance in different frames separately with text embeddings, without considering the correlation across frames. BriVIS [269] links instance features across frames as a Brownian bridge and aligns the bridge center with text embeddings. DVIS++ [270] presents a unified framework capable of various video segmentation tasks simultaneously.
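The open-vocabulary classification step shared by these video methods can be sketched as follows: per-frame instance embeddings are pooled (a plain temporal average here; OV2Seg's memory module or BriVIS's Brownian-bridge center are more elaborate choices) and matched against CLIP text embeddings. Names and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def classify_tracked_instance(frame_queries, text_emb, tau=0.01):
    """frame_queries: [T, D] embeddings of one tracked instance across T frames;
    text_emb: [K, D] CLIP text embeddings of the test-time vocabulary."""
    video_emb = F.normalize(frame_queries.mean(dim=0), dim=-1)  # naive temporal pooling
    text_emb = F.normalize(text_emb, dim=-1)
    probs = (video_emb @ text_emb.T / tau).softmax(dim=-1)      # [K]
    return int(probs.argmax()), probs
```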
For the first aspect, many endeavors seek to 1) adopt a standalone and frozen proposal generator so that the classification of base classes in the detector head does not affect the gradients of the proposal generator [194]; or 2) employ a pure localization-quality-based objectness score [271] without foreground-background binary classification, or design complementary objectness measures utilizing a large corpus of concepts [149]. Leveraging unsupervised localization methods [272] built on DINO [103] or SAM [196] for proposal generation can also potentially mitigate the bias. For the second aspect, recalibration in the inference stage is used in many works [26], [49] by separating and ensembling the predictions of base and novel classes between the detector and CLIP.
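A minimal sketch of this recalibration, in the spirit of ViLD [26] and F-VLM [49]: detector and CLIP probabilities are geometrically ensembled, trusting CLIP more on novel classes. The weights 0.35/0.65 are illustrative assumptions.

```python
import torch

def ensemble_scores(det_probs, clip_probs, is_base, alpha=0.35, beta=0.65):
    """det_probs, clip_probs: [N, K] per-region class probabilities from the detector
    head and from CLIP; is_base: [K] boolean mask marking base classes."""
    w = torch.where(is_base, torch.tensor(alpha), torch.tensor(beta))  # CLIP weight per class
    fused = det_probs.pow(1 - w) * clip_probs.pow(w)                   # geometric ensemble
    return fused / fused.sum(dim=-1, keepdim=True)
```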
Confusion on Novel and Background Concepts. Since only base and background text embeddings are used as classifier weights during training, both novel and background proposals are classified as background. This drawback causes novel proposals to be misclassified as background during inference. Besides, the background text embedding is typically obtained by passing the template prompt "A photo of the [background]." into the CLIP text encoder, or is simply an all-zero vector. Such a simple representation is neither sufficient nor representative enough to cope with diverse contexts.
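For concreteness, the sketch below shows how such a fixed classifier is typically assembled: CLIP text embeddings of prompted class names serve as classifier weights, a background embedding is appended (a learnable vector here; an encoded background prompt or an all-zero vector are the alternatives discussed above), and region features are scored by cosine similarity. All names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OpenVocabClassifier(nn.Module):
    def __init__(self, class_text_emb, dim, tau=0.01):
        """class_text_emb: [K, D] CLIP embeddings of 'A photo of the {class}.' prompts."""
        super().__init__()
        self.register_buffer("text_emb", F.normalize(class_text_emb, dim=-1))
        self.bg_emb = nn.Parameter(torch.randn(1, dim) * 0.01)  # learnable background embedding
        self.tau = tau

    def forward(self, region_feats):                            # region_feats: [N, D]
        weights = torch.cat([self.text_emb, F.normalize(self.bg_emb, dim=-1)], dim=0)
        return F.normalize(region_feats, dim=-1) @ weights.T / self.tau  # [N, K + 1] logits
```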
Correct Region-Word Correspondence. Though image-text pairs are cheap and abundant, the region-word correspondence is weak, noisy, and not explicitly known. The bi-directional grounding loss in Sec. 5.1 may cheat on establishing correct region-word correspondence by only aligning a bag of regions to a bag of words. Besides, the object nouns in a caption may only cover salient objects and are far fewer than the number of proposals, i.e., many objects may not find matching words. Pseudo labels impose the constraint that one region connects to one word and vice versa. However, they are generated once and for all; iteratively refining the quality of pseudo labels during online training is less explored.
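The bag-level alignment being criticized here can be sketched as follows: every word of a caption picks its best-matching region, the scores are pooled into an image-caption similarity, and a symmetric contrastive loss is applied over the batch, so no individual region-word pair is ever directly supervised. This is a generic form under assumed shapes, not the exact loss of any cited method.

```python
import torch
import torch.nn.functional as F

def bag_grounding_loss(region_feats, word_feats, tau=0.07):
    """region_feats: [B, R, D] and word_feats: [B, W, D] for B image-caption pairs."""
    region_feats = F.normalize(region_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)
    sim = torch.einsum("ird,jwd->ijrw", region_feats, word_feats)  # all region-word similarities
    scores = sim.max(dim=2).values.mean(dim=2) / tau               # [B, B] image-caption scores
    targets = torch.arange(scores.size(0), device=scores.device)
    return 0.5 * (F.cross_entropy(scores, targets) + F.cross_entropy(scores.t(), targets))
```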
Large VLMs Adaptation. There is a distinct discrepancy and domain gap in terms of image resolution, context, and task statistics between the pretraining and detection tuning phases. During the pretraining phase, CLIP receives low-resolution images with full context, including object occurrences, relationships, spatial layout, etc. In the detection tuning phase, however, CLIP receives either high-resolution images or low-resolution masked image crops containing a single object without any context. The masked image crops are of non-square sizes or extreme aspect ratios, and the pre-processing step of CLIP, which resizes the shorter edge and center-crops, adds further distortion and aggravates this gap. The prediction of CLIP is also not sensitive to localization quality, i.e., given an image crop containing only a small portion of the object of interest, CLIP still makes predictions with high confidence. Besides, fully finetuning the whole VLM for adaptation often leads to catastrophic forgetting of prior knowledge on open-vocabulary tasks. In light of this, lightweight adapters or prompt tuning play a crucial role in large VLM adaptation.
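A minimal sketch of the adapter idea: a small trainable bottleneck is blended with the frozen VLM features through a residual connection, so only a few parameters are updated and the pretrained knowledge is largely preserved. The bottleneck ratio and blend weight are assumptions.

```python
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Lightweight trainable module on top of frozen VLM features."""
    def __init__(self, dim, ratio=4, alpha=0.2):
        super().__init__()
        self.alpha = alpha                                   # weight of the adapted branch
        self.net = nn.Sequential(nn.Linear(dim, dim // ratio),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(dim // ratio, dim))

    def forward(self, x):                                    # x: frozen features [..., dim]
        return (1 - self.alpha) * x + self.alpha * self.net(x)
```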
Inference Speed and Evaluation Metrics. Current OVD and OVS methods mainly build on top of mainstream object detectors, such as Faster R-CNN [5], DETR [13], and Mask2Former [14], which are slow when deployed on edge devices. However, lightweight detectors like YOLO [119] may aggravate the above challenges: real-time detectors result in a lower recall rate for novel objects, and distilling the knowledge of large VLMs into these small-scale models remains questionable due to their limited learning capacity. Meanwhile, the evaluation is also problematic [273]: for example, if the predicted category and the ground-truth label are synonyms, current metrics will not deem the prediction a true positive. This might be too strict given the fact that, in an open world, many words are interchangeable.

8.2 Future Directions
Enabling Open-Vocabulary on Other Scene Understanding Tasks. Currently, other tasks, including open-vocabulary 3D scene understanding, video analysis [274], action recognition, object tracking, and human-object interaction [275], are underexplored. In these problems, either the weak supervision signals are absent or the large VLMs yield poor open-vocabulary classification ability. Enabling open-vocabulary capability beyond detection and segmentation has become a mainstream trend.
Unifying OVD and OVS. Unification is an inevitable trend for computer vision. Though several works address different segmentation tasks simultaneously [62], [248], [255], [276] or train on multiple detection datasets [191], [192], a universal foundation model for all tasks and datasets [193] remains largely untouched; going even further, accomplishing 2D and 3D open-vocabulary perception simultaneously can be more challenging.
Multimodal Large Language Models (LLMs) for Perception. Multimodal large language models [277], [278], [279], [280] typically comprise three parts: 1) a vision encoder; 2) a mapper that maps the visual features to the input space of the LLM; and 3) an LLM for decoding the desired outputs. Bounding boxes are represented as two corner integer points [281], and segmentation masks are handled similarly by sampling points on the contour [83], [282] of the mask. The ability to reason about user intentions and to detect interactively within a language context equips multimodal LLMs for detection and segmentation in the wild.
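The corner-point representation mentioned above can be sketched with a simple quantizer in the spirit of Pix2Seq [281]; the number of bins is an assumption.

```python
def box_to_corner_tokens(box, img_w, img_h, num_bins=1000):
    """(x1, y1, x2, y2) in pixels -> four integer location tokens."""
    def quantize(v, size):
        return min(num_bins - 1, max(0, int(v / size * num_bins)))
    x1, y1, x2, y2 = box
    return [quantize(x1, img_w), quantize(y1, img_h),
            quantize(x2, img_w), quantize(y2, img_h)]

def corner_tokens_to_box(tokens, img_w, img_h, num_bins=1000):
    x1, y1, x2, y2 = tokens
    return (x1 / num_bins * img_w, y1 / num_bins * img_h,
            x2 / num_bins * img_w, y2 / num_bins * img_h)
```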
Combining Large Foundation Models. Different foundation models have different capabilities. SAM [196] excels at localizing objects, but in a class-agnostic manner. CLIP [31] is superior at image-text alignment but behaves like a bag of words and lacks the spatial awareness needed for dense prediction tasks. DINO [103], [104] exhibits superior cross-image correspondence for objects or parts of the same class but is mainly used in unsupervised localization tasks. T2I diffusion models generate astonishing images, but their usage in discriminative dense prediction tasks remains under-explored. In a nutshell, how to benefit from these emerging large foundation models and how to combine them are key questions for future research; a naive combination of SAM and CLIP is sketched below.
Real-Time OVD and OVS. Current models possess heavy backbone and neck architectures, which are unsuitable for real-time applications. To fully unleash the productive potential of OVD and OVS, exploring real-time detectors [155], [283] with open-vocabulary recognition ability is a promising research direction.
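As a concrete, naive instance of combining foundation models, the sketch below chains SAM's class-agnostic masks with CLIP classification of the corresponding crops. It assumes the segment-anything and OpenAI CLIP Python packages with their published interfaces, and it is only a baseline illustration, not any surveyed method.

```python
import clip
import torch
from PIL import Image
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

def open_vocab_segments(image, class_names, sam_ckpt, device="cuda"):
    """image: HxWx3 uint8 RGB array -> list of (binary mask, predicted class name)."""
    sam = sam_model_registry["vit_h"](checkpoint=sam_ckpt).to(device)
    mask_generator = SamAutomaticMaskGenerator(sam)
    model, preprocess = clip.load("ViT-B/32", device=device)
    tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    with torch.no_grad():
        text_emb = model.encode_text(tokens)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    results = []
    for m in mask_generator.generate(image):                 # class-agnostic region proposals
        x, y, w, h = m["bbox"]
        crop = Image.fromarray(image[y:y + h, x:x + w])
        with torch.no_grad():
            img_emb = model.encode_image(preprocess(crop)[None].to(device))
            img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
            probs = (100.0 * img_emb @ text_emb.T).softmax(dim=-1)[0]
        results.append((m["segmentation"], class_names[int(probs.argmax())]))
    return results
```

As discussed above, such a cascade inherits CLIP's insensitivity to localization quality, which is exactly where tighter integration is expected to help.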
9 CONCLUSION
We covered a broad and concrete development of OVD and OVS in this survey. First, the background, consisting of definitions, related domains and tasks, canonical closed-set detectors and segmentors, and large VLMs, was introduced. Then, we detailed nearly two hundred OVD and OVS methods. At the task level, both 2D detection and different segmentation tasks are discussed, along with 3D scene and video understanding. At the methodology level, we pivoted on the permission and usage of weak supervision signals and grouped most of the existing methods into six categories, which are universal across tasks. In the end, challenges and promising directions are discussed to facilitate future research. In addition, we benchmarked the performance of state-of-the-art methods along with their vital components for each task in the appendix.

REFERENCES
[1] H. Caesar, V. Bankiti et al., "nuScenes: A multimodal dataset for autonomous driving," in CVPR, 2020.
[2] Z. Li, W. Wang et al., "BEVFormer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers," in ECCV, 2022.
[3] F. Zhu, Y. Zhu et al., "Deep learning for embodied vision navigation: A survey," arXiv, 2021.
[4] P. Anderson, Q. Wu et al., "Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments," in CVPR, 2018.
[5] S. Ren, K. He et al., "Faster R-CNN: Towards real-time object detection with region proposal networks," NeurIPS, 2015.
[6] T.-Y. Lin, P. Dollár et al., "Feature pyramid networks for object detection," in CVPR, 2017.
[7] I. Misra, R. Girdhar et al., "An end-to-end transformer model for 3D object detection," in ICCV, 2021.
[8] J. Long, E. Shelhamer et al., "Fully convolutional networks for semantic segmentation," in CVPR, 2015.
[9] B. Cheng, A. Schwing et al., "Per-pixel classification is not all you need for semantic segmentation," NeurIPS, 2021.
[10] K. He, G. Gkioxari et al., "Mask R-CNN," in ICCV, 2017.
[11] A. Kirillov, K. He et al., "Panoptic segmentation," in CVPR, 2019.
[12] T.-Y. Lin, P. Goyal et al., "Focal loss for dense object detection," in ICCV, 2017.
[13] N. Carion, F. Massa et al., "End-to-end object detection with transformers," in ECCV, 2020.
[14] B. Cheng, I. Misra et al., "Masked-attention mask transformer for universal image segmentation," in CVPR, 2022.
[15] M. Everingham, S. A. Eslami et al., "The Pascal visual object classes challenge: A retrospective," IJCV, 2015.
[16] T.-Y. Lin, M. Maire et al., "Microsoft COCO: Common objects in context," in ECCV, 2014.
[17] A. Gupta, P. Dollar et al., "LVIS: A dataset for large vocabulary instance segmentation," in CVPR, 2019.
[18] A. Bansal, K. Sikka et al., "Zero-shot object detection," in ECCV, 2018.
[19] S. Rahman, S. Khan et al., "Zero-shot object detection: Learning to simultaneously recognize and localize novel concepts," in ACCV, 2019.
[20] P. Zhu, H. Wang et al., "Zero shot detection," TCSVT, 2019.
[21] M. Bucher, T.-H. Vu et al., "Zero-shot semantic segmentation," NeurIPS, 2019.
[22] Y. Zheng, J. Wu et al., "Zero-shot instance segmentation," in CVPR, 2021.
[23] T. Mikolov, I. Sutskever et al., "Distributed representations of words and phrases and their compositionality," NeurIPS, 2013.
[24] J. Pennington, R. Socher et al., "GloVe: Global vectors for word representation," in EMNLP, 2014.
[25] J. Devlin, M.-W. Chang et al., "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv, 2018.
[26] X. Gu, T.-Y. Lin et al., "Open-vocabulary object detection via vision and language knowledge distillation," arXiv, 2021.
[27] A. Zareian, K. D. Rosa et al., "Open-vocabulary object detection using captions," in CVPR, 2021.
[28] D. Huynh, J. Kuen et al., "Open-vocabulary instance segmentation via robust cross-modal pseudo-labeling," in CVPR, 2022.
[29] G. Ghiasi, X. Gu et al., "Scaling open-vocabulary image segmentation with image-level labels," in ECCV, 2022.
[30] B. Li, K. Q. Weinberger et al., "Language-driven semantic segmentation," arXiv, 2022.
[31] A. Radford, J. W. Kim et al., "Learning transferable visual models from natural language supervision," in ICML, 2021.
[32] S. Rahman, S. Khan et al., "Improved visual-semantic alignment for zero-shot object detection," in AAAI, 2020.
[33] D. Gupta, A. Anantharaman et al., "A multi-space approach to zero-shot object detection," in WACV, 2020.
[34] P. Zhu, H. Wang et al., "Don't even look once: Synthesizing features for zero-shot detection," in CVPR, 2020.
[35] S. Zhao, C. Gao et al., "GTNet: Generative transfer network for zero-shot object detection," in AAAI, 2020.
[36] P. Huang, J. Han et al., "Robust region feature synthesizer for zero-shot object detection," in CVPR, 2022.
[37] Y. Xian, S. Choudhury et al., "Semantic projection network for zero- and few-label semantic segmentation," in CVPR, 2019.
[38] D. Baek, Y. Oh et al., "Exploiting a joint embedding space for generalized zero-shot semantic segmentation," in ICCV, 2021.
[39] H. Zhang and H. Ding, "Prototypical matching and open set rejection for zero-shot semantic segmentation," in ICCV, 2021.
[40] Z. Gu, S. Zhou et al., "Context-aware feature generation for zero-shot semantic segmentation," in ACM MM, 2020.
[41] A. Kamath, M. Singh et al., "MDETR - modulated detection for end-to-end multi-modal understanding," in ICCV, 2021.
[42] C. Lin, P. Sun et al., "Learning object-language alignments for open-vocabulary object detection," arXiv, 2022.
[43] Y. Zang, W. Li et al., "Open-vocabulary DETR with conditional matching," in ECCV, 2022.
[44] L. H. Li, P. Zhang et al., "Grounded language-image pre-training," in CVPR, 2022.
[45] M. Gao, C. Xing et al., "Open vocabulary object detection with pseudo bounding-box labels," in ECCV, 2022.
[46] X. Zhou, R. Girdhar et al., "Detecting twenty-thousand classes using image-level supervision," in ECCV, 2022.
[47] Y. Du, F. Wei et al., "Learning to prompt for open-vocabulary object detection with vision-language model," in CVPR, 2022.
[48] S. Wu, W. Zhang et al., "Aligning bag of regions for open-vocabulary object detection," in CVPR, 2023.
[49] W. Kuo, Y. Cui et al., "F-VLM: Open-vocabulary object detection upon frozen vision and language models," arXiv, 2022.
[50] J. Xu, S. De Mello et al., "GroupViT: Semantic segmentation emerges from text supervision," in CVPR, 2022.
[51] N. Zabari and Y. Hoshen, "Open-vocabulary semantic segmentation using test-time distillation," in ECCV, 2022.
[52] K. Han, Y. Liu et al., "Global knowledge calibration for fast open-vocabulary segmentation," arXiv, 2023.
[53] H. Wang, P. K. A. Vasu et al., "SAM-CLIP: Merging vision foundation models towards semantic and spatial understanding," arXiv, 2023.
[54] J. Ding, N. Xue et al., "Decoupling zero-shot semantic segmentation," in CVPR, 2022.
[55] F. Liang, B. Wu et al., "Open-vocabulary semantic segmentation with mask-adapted CLIP," in CVPR, 2023.
[56] M. Xu, Z. Zhang et al., "Side adapter network for open-vocabulary semantic segmentation," in CVPR, 2023.
[57] J. Wu, X. Li et al., "Betrayed by captions: Joint caption grounding and generation for open vocabulary instance segmentation," arXiv, 2023.
[58] H. Yuan, X. Li et al., "Open-vocabulary SAM: Segment and recognize twenty-thousand classes interactively," arXiv, 2024.
[59] Y. Shen, C. Fu et al., "Aligning and prompting everything all at once for universal visual perception," 2024.
[60] S. He, H. Ding et al., "Primitive generation and semantic-related alignment for universal zero-shot segmentation," in CVPR, 2023.
[61] Q. Yu, J. He et al., "Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional CLIP," NeurIPS, 2024.
[62] J. Xu, S. Liu et al., "Open-vocabulary panoptic segmentation with text-to-image diffusion models," in CVPR, 2023.
[63] B. Michele, A. Boulch et al., "Generative zero-shot learning for semantic segmentation of 3D point clouds," in 3DV, 2021.
[64] B. Liu, S. Deng et al., "Language-level semantics conditioned 3D point cloud segmentation," arXiv, 2021.
[65] Y. Lu, C. Xu et al., "Open-vocabulary point-cloud object detection without 3D annotation," in CVPR, 2023.
[66] D. Zhang, C. Li et al., “Fm-ov3d: Foundation model-based cross- [102] C. Jia, Y. Yang et al., “Scaling up visual and vision-language
modal knowledge blending for open-vocabulary 3d detection,” representation learning with noisy text supervision,” in ICML,
arXiv, 2023. 2021.
[67] Y. Cao, Z. Yihan et al., “Coda: Collaborative novel box discovery [103] M. Caron, H. Touvron et al., “Emerging properties in self-
and cross-modal alignment for open-vocabulary 3d object detec- supervised vision transformers,” in ICCV, 2021.
tion,” NeurIPS, 2024. [104] M. Oquab, T. Darcet et al., “Dinov2: Learning robust visual
[68] S. Peng, K. Genova et al., “Openscene: 3d scene understanding features without supervision,” arXiv, 2023.
with open vocabularies,” in CVPR, 2023. [105] K. He, X. Chen et al., “Masked autoencoders are scalable vision
[69] P. Guo, T. Huang et al., “Openvis: Open-vocabulary video in- learners,” in CVPR, 2022.
stance segmentation,” arXiv, 2023. [106] M. Yuksekgonul, F. Bianchi et al., “When and why vision-
[70] A. Joulin, E. Grave et al., “Bag of tricks for efficient text classifica- language models behave like bags-of-words, and what to do
tion,” arXiv, 2016. about it?” in ICLR, 2022.
[71] Y. Li, K. Swersky et al., “Generative moment matching networks,” [107] J. Ho, A. Jain et al., “Denoising diffusion probabilistic models,”
in ICML, 2015. NeurIPS, 2020.
[72] I. Goodfellow, J. Pouget-Abadie et al., “Generative adversarial [108] R. Rombach, A. Blattmann et al., “High-resolution image synthe-
networks,” ACM Communications, 2020. sis with latent diffusion models,” in CVPR, 2022.
[73] K. Sohn, H. Lee et al., “Learning structured output representation [109] Y. Li, H. Wang et al., “Clip surgery for better explainability with
using deep conditional generative models,” NeurIPS, 2015. enhancement in open-vocabulary tasks,” arXiv, 2023.
[74] R. Krishna, Y. Zhu et al., “Visual genome: Connecting language [110] S. Wu, W. Zhang et al., “CLIPSelf: Vision transformer distills itself
and vision using crowdsourced dense image annotations,” IJCV, for open-vocabulary dense prediction,” in ICLR, 2024.
2017. [111] K. Zhou, J. Yang et al., “Learning to prompt for vision-language
[75] L. Yu, P. Poirson et al., “Modeling context in referring expres- models,” IJCV, 2022.
sions,” in ECCV, 2016. [112] G. A. Miller, “Wordnet: a lexical database for english,” ACM
[76] V. K. Nagaraja, V. I. Morariu et al., “Modeling context between Communications, 1995.
objects for referring expression understanding,” in ECCV, 2016. [113] S. Rahman, S. H. Khan et al., “Zero-shot object detection: Joint
[77] M. Jia, L. Tang et al., “Visual prompt tuning,” in ECCV, 2022. recognition and localization of novel concepts,” IJCV, 2020.
[78] R. Zhang, R. Fang et al., “Tip-adapter: Training-free clip-adapter [114] R. Luo, N. Zhang et al., “Context-aware zero-shot recognition,”
for better vision-language modeling,” arXiv, 2021. in AAAI, 2020.
[79] Y.-L. Sung, J. Cho et al., “Vl-adapter: Parameter-efficient transfer [115] Z. Li, L. Yao et al., “Zero-shot object detection with textual
learning for vision-and-language tasks,” in CVPR, 2022. descriptions,” in AAAI, 2019.
[80] S. Rahman, S. Khan et al., “Transductive learning for zero-shot [116] Y. Zheng, R. Huang et al., “Background learnable cascade for
object detection,” in ICCV, 2019. zero-shot object detection,” in ACCV, 2020.
[81] S. Kazemzadeh, V. Ordonez et al., “Referitgame: Referring to [117] S. Khandelwal, A. Nambirajan et al., “Frustratingly simple but
objects in photographs of natural scenes,” in EMNLP, 2014. effective zero-shot detection and segmentation: Analysis and a
strong baseline,” arXiv, 2023.
[82] B. A. Plummer, L. Wang et al., “Flickr30k entities: Collecting
region-to-phrase correspondences for richer image-to-sentence [118] J. Redmon and A. Farhadi, “Yolo9000: Better, faster, stronger,” in
models,” in IJCV, 2015. CVPR, 2017.
[119] ——, “Yolov3: An incremental improvement,” arXiv, 2018.
[83] C. Zhu, Y. Zhou et al., “Seqtr: A simple yet universal network for
visual grounding,” in ECCV, 2022. [120] B. Demirel, R. G. Cinbis et al., “Zero-shot object detection by
hybrid region embedding,” arXiv, 2018.
[84] D. Zhang, J. Han et al., “Weakly supervised object localization
and detection: A survey,” TPAMI, 2021. [121] Y. Li, Y. Shao et al., “Context-guided super-class inference for
zero-shot detection,” in CVPRW, 2020.
[85] H. Bilen and A. Vedaldi, “Weakly supervised deep detection
[122] Y. Li, P. Li et al., “Inference fusion with associative semantics for
networks,” in CVPR, 2016.
unseen object detection,” in AAAI, 2021.
[86] A. Dhamija, M. Gunther et al., “The overlooked elephant of object
[123] C. Yan, X. Chang et al., “Semantics-guided contrastive network
detection: Open set,” in WACV, 2020.
for zero-shot object detection,” TPAMI, 2022.
[87] D. Miller, L. Nicholson et al., “Dropout sampling for robust object
[124] L. Zhang, X. Wang et al., “Zero-shot object detection via learning
detection in open-set conditions,” in ICRA, 2018.
an embedding from semantic space to visual space,” in IJCAI,
[88] T. Pham, T.-T. Do et al., “Bayesian semantic instance segmentation 2020.
in open set world,” in ECCV, 2018.
[125] G. Dinu, A. Lazaridou et al., “Improving zero-shot learning by
[89] J. Hwang, S. W. Oh et al., “Exemplar-based open-set panoptic mitigating the hubness problem,” arXiv, 2014.
segmentation network,” in CVPR, 2021. [126] N. Hayat, M. Hayat et al., “Synthesizing the unseen for zero-shot
[90] W. J. Scheirer, A. de Rezende Rocha et al., “Toward open set object detection,” in ACCV, 2020.
recognition,” TPAMI, 2012. [127] Q. Mao, H.-Y. Lee et al., “Mode seeking generative adversarial
[91] C. Geng, S.-j. Huang et al., “Recent advances in open set recogni- networks for diverse image synthesis,” in CVPR, 2019.
tion: A survey,” TPAMI, 2020. [128] M. Arjovsky, S. Chintala et al., “Wasserstein generative adversar-
[92] K. Joseph, S. Khan et al., “Towards open world object detection,” ial networks,” in ICML, 2017.
in CVPR, 2021. [129] P. Hu, S. Sclaroff et al., “Uncertainty-aware learning for zero-shot
[93] A. Gupta, S. Narayan et al., “Ow-detr: Open-world detection semantic segmentation,” NeurIPS, 2020.
transformer,” in CVPR, 2022. [130] N. Kato, T. Yamasaki et al., “Zero-shot semantic segmentation via
[94] J. Cen, P. Yun et al., “Deep metric learning for open world variational mapping,” in ICCVW, 2019.
semantic segmentation,” in ICCV, 2021. [131] K. Wang, J. H. Liew et al., “Panet: Few-shot image semantic
[95] W. Liu, X. Wang et al., “Energy-based out-of-distribution detec- segmentation with prototype alignment,” in ICCV, 2019.
tion,” NeurIPS, 2020. [132] P. Li, Y. Wei et al., “Consistent structural relation learning for
[96] J. Yang, K. Zhou et al., “Generalized out-of-distribution detection: zero-shot segmentation,” NeurIPS, 2020.
A survey,” arXiv, 2021. [133] J. Cheng, S. Nandi et al., “Sign: Spatial-information incorporated
[97] R. Girshick, “Fast r-cnn,” in ICCV, 2015. generative network for generalized zero-shot semantic segmen-
[98] Z. Tian, C. Shen et al., “Fcos: Fully convolutional one-stage object tation,” in ICCV, 2021.
detection,” in ICCV, 2019. [134] P. Sharma, N. Ding et al., “Conceptual captions: A cleaned, hyper-
[99] X. Zhu, W. Su et al., “Deformable detr: Deformable transformers nymed, image alt-text dataset for automatic image captioning,”
for end-to-end object detection,” in ICLR, 2021. in ACL, 2018.
[100] L.-C. Chen, G. Papandreou et al., “Deeplab: Semantic image [135] S. Changpinyo, P. Sharma et al., “Conceptual 12m: Pushing
segmentation with deep convolutional nets, atrous convolution, web-scale image-text pre-training to recognize long-tail visual
and fully connected crfs,” TPAMI, 2017. concepts,” in CVPR, 2021.
[101] L.-C. Chen, Y. Zhu et al., “Encoder-decoder with atrous separable [136] X. Chen, H. Fang et al., “Microsoft coco captions: Data collection
convolution for semantic image segmentation,” in ECCV, 2018. and evaluation server,” arXiv, 2015.
[137] M. A. Bravo, S. Mittal et al., “Localized vision-language matching [172] S. Wu, W. Zhang et al., “Clim: Contrastive language-image mosaic
for open-vocabulary object detection,” in DAGM GCPR, 2022. for region representation,” AAAI, 2024.
[138] Z. Huang, Z. Zeng et al., “Pixel-bert: Aligning image pixels with [173] Y. Long, J. Han et al., “Fine-grained visual–text prompt-driven
text by deep multi-modal transformers,” arXiv, 2020. self-training for open-vocabulary object detection,” TNNLS, 2023.
[139] J. Lu, D. Batra et al., “Vilbert: Pretraining task-agnostic visiolin- [174] J. Jeong, G. Park et al., “Proxydet: Synthesizing proxy novel
guistic representations for vision-and-language tasks,” NeurIPS, classes via classwise mixup for open vocabulary object detec-
2019. tion,” arXiv, 2023.
[140] W. Kim, B. Son et al., “Vilt: Vision-and-language transformer [175] C. Ma, Y. Jiang et al., “Codet: Co-occurrence guided region-word
without convolution or region supervision,” in ICML, 2021. alignment for open-vocabulary object detection,” NeurIPS, 2024.
[141] J. Li, R. Selvaraju et al., “Align before fuse: Vision and language [176] O. Russakovsky, J. Deng et al., “Imagenet large scale visual
representation learning with momentum distillation,” NeurIPS, recognition challenge,” IJCV, 2015.
2021. [177] P. Kaul, W. Xie et al., “Multi-modal classifiers for open-vocabulary
[142] Y. Xu, M. Zhang et al., “Exploring multi-modal contextual knowl- object detection,” arXiv, 2023.
edge for open-vocabulary object detection,” arXiv, 2023. [178] T. Brown, B. Mann et al., “Language models are few-shot learn-
[143] L. Yao, J. Han et al., “Detclip: Dictionary-enriched visual-concept ers,” NeurIPS, 2020.
paralleled pre-training for open-world detection,” arXiv, 2022. [179] S. Kang, J. Cha et al., “Learning pseudo-labeler beyond noun
[144] ——, “Detclipv2: Scalable open-vocabulary object detection pre- concepts for open-vocabulary object detection,” arXiv, 2023.
training via word-region alignment,” in CVPR, 2023. [180] H.-C. Cho, W. Y. Jhoo et al., “Open-vocabulary object detection
[145] J. Lin, Y. Shen et al., “Weakly supervised open-vocabulary object using pseudo caption labels,” arXiv, 2023.
detection,” arXiv, 2023. [181] e. a. Xie, Johnathan, “Zero-shot object detection through vision-
[146] D. Kim, A. Angelova et al., “Region-aware pretraining for open- language embedding alignment,” in ICDMW, 2022.
vocabulary object detection with vision transformers,” in CVPR, [182] C. Pham, T. Vu et al., “Lp-ovod: Open-vocabulary object detection
2023. by linear probing,” in WACV, 2024.
[147] ——, “Contrastive feature masking open-vocabulary vision [183] Z. Liu, X. Hu et al., “Efficient feature distillation for zero-shot
transformer,” in ICCV, 2023. detection,” arXiv, 2023.
[148] ——, “Detection-oriented image-text pretraining for open- [184] R. Fang, G. Pang et al., “Simple image-level classification im-
vocabulary detection,” arXiv, 2023. proves open-vocabulary object detection,” arXiv, 2023.
[149] J. Wang, H. Zhang et al., “Open-vocabulary object detection with [185] A. v. d. Oord, Y. Li et al., “Representation learning with con-
an open corpus,” in ICCV, 2023. trastive predictive coding,” arXiv, 2018.
[150] H. Song and J. Bang, “Prompt-guided transformers for end-to- [186] L. Wang, Y. Liu et al., “Object-aware distillation pyramid for
end open-vocabulary object detection,” arXiv, 2023. open-vocabulary object detection,” in CVPR, 2023.
[151] X. Wu, F. Zhu et al., “Cora: Adapting clip for open-vocabulary [187] J. Lin and S. Gong, “Gridclip: One-stage object detection by grid-
detection with region prompting and anchor pre-matching,” in level clip representation learning,” arXiv, 2023.
CVPR, 2023. [188] L. Li, J. Miao et al., “Distilling detr with visual-linguistic knowl-
[152] C. Shi and S. Yang, “Edadet: Open-vocabulary object detection edge for open-vocabulary object detection,” in ICCV, 2023.
using early dense alignment,” in ICCV, 2023. [189] Z. Ma, G. Luo et al., “Open-vocabulary one-stage detection with
[153] M. Maaz, H. Rasheed et al., “Class-agnostic object detection with hierarchical visual-language knowledge distillation,” in CVPR,
multi-modal transformer,” in ECCV, 2022. 2022.
[154] Y. Xu, M. Zhang et al., “Multi-modal queried object detection in [190] M. Minderer, A. Gritsenko et al., “Simple open-vocabulary object
the wild,” arXiv, 2023. detection with vision transformers,” arXiv, 2022.
[155] T. Cheng, L. Song et al., “Yolo-world: Real-time open-vocabulary [191] Z. Wang, Y. Li et al., “Detecting everything in the open world:
object detection,” in CVPR, 2024. Towards universal object detection,” in CVPR, 2023.
[156] H. Shi, M. Hayat et al., “Open-vocabulary object detection via [192] Y. Chen, M. Wang et al., “Scaledet: A scalable multi-dataset object
scene graph discovery,” arXiv, 2023. detector,” in CVPR, 2023.
[157] Y. Zhong, J. Yang et al., “Regionclip: Region-based language- [193] H. Zhang, F. Li et al., “A simple framework for open-vocabulary
image pretraining,” in CVPR, 2022. segmentation and detection,” arXiv, 2023.
[158] S. Zhao, Z. Zhang et al., “Exploiting unlabeled data with vision [194] J. Li, C. Xie et al., “What makes good open-vocabulary detector:
and language models for object detection,” in ECCV, 2022. A disassembling perspective,” arXiv, 2023.
[159] H. Zhang, P. Zhang et al., “Glipv2: Unifying localization and [195] X. Han, L. Wei et al., “Boosting segment anything model towards
vision-language understanding,” NeurIPS, 2022. open-vocabulary learning,” arXiv, 2023.
[160] S. Antol, A. Agrawal et al., “Vqa: Visual question answering,” in [196] A. Kirillov, E. Mintun et al., “Segment anything,” arXiv, 2023.
ICCV, 2015. [197] M. F. Naeem, Y. Xian et al., “Silc: Improving vision language
[161] P. Anderson, X. He et al., “Bottom-up and top-down attention pretraining with self-distillation,” arXiv, 2023.
for image captioning and visual question answering,” in CVPR, [198] E. Jang, S. Gu et al., “Categorical reparameterization with
2018. gumbel-softmax,” arXiv, 2016.
[162] K. Xu, J. Ba et al., “Show, attend and tell: Neural image caption [199] Q. Liu, Y. Wen et al., “Open-world semantic segmentation via con-
generation with visual attention,” in ICML, 2015. trasting and clustering vision-language embedding,” in ECCV,
[163] S. J. Rennie, E. Marcheret et al., “Self-critical sequence training for 2022.
image captioning,” in CVPR, 2017. [200] M. Caron, I. Misra et al., “Unsupervised learning of visual fea-
[164] S. Liu, Z. Zeng et al., “Grounding dino: Marrying dino with tures by contrasting cluster assignments,” NeurIPS, 2020.
grounded pre-training for open-set object detection,” arXiv, 2023. [201] M. Tschannen, J. Djolonga et al., “On mutual information maxi-
[165] C. Feng, Y. Zhong et al., “Promptdet: Towards open-vocabulary mization for representation learning,” arXiv, 2019.
detection using uncurated images,” in ECCV, 2022. [202] H. Luo, J. Bao et al., “Segclip: Patch aggregation with learnable
[166] Y.-C. Liu, C.-Y. Ma et al., “Unbiased teacher for semi-supervised centers for open-vocabulary semantic segmentation,” arXiv, 2022.
object detection,” in ICLR, 2020. [203] J. Xu, J. Hou et al., “Learning open-vocabulary semantic seg-
[167] M. Xu, Z. Zhang et al., “End-to-end semi-supervised object detec- mentation models from natural language supervision,” in CVPR,
tion with soft teacher,” in ICCV, 2021. 2023.
[168] S. Zhao, S. Schulter et al., “Improving pseudo labels for open- [204] F. Locatello, D. Weissenborn et al., “Object-centric learning with
vocabulary object detection,” arXiv, 2023. slot attention,” NeurIPS, 2020.
[169] R. Arandjelović, A. Andonian et al., “Three ways to improve [205] J. Mukhoti, T.-Y. Lin et al., “Open vocabulary semantic segmenta-
feature alignment for open vocabulary detection,” arXiv, 2023. tion with patch aligned contrastive learning,” in CVPR, 2023.
[170] H. Bangalath, M. Maaz et al., “Bridging the gap between object [206] J. Cha, J. Mun et al., “Learning to generate text-grounded mask for
and image-level representations for open-vocabulary detection,” open-world semantic segmentation from only image-text pairs,”
NeurIPS, 2022. in CVPR, 2023.
[171] R. R. Selvaraju, M. Cogswell et al., “Grad-cam: Visual explana- [207] M. Xu, Z. Zhang et al., “A simple baseline for open-vocabulary
tions from deep networks via gradient-based localization,” in semantic segmentation with pre-trained vision-language model,”
ICCV, 2017. in ECCV, 2022.
[208] H. Chefer, S. Gur et al., “Transformer interpretability beyond [243] A. Shtedritski, C. Rupprecht et al., “What does clip know about a
attention visualization,” in CVPR, 2021. red circle? visual prompt engineering for vlms,” in ICCV, 2023.
[209] J. Chen, D. Zhu et al., “Exploring open-vocabulary semantic [244] S. He, H. Ding et al., “Semantic-promoted debiasing and back-
segmentation from clip vision encoder distillation only,” in ICCV, ground disambiguation for zero-shot instance segmentation,” in
2023. CVPR, 2023.
[210] R. Ranftl, A. Bochkovskiy et al., “Vision transformers for dense [245] V. VS, N. Yu et al., “Mask-free ovis: Open-vocabulary instance
prediction,” in ICCV, 2021. segmentation without manual mask annotations,” in CVPR, 2023.
[211] X. Liu, B. Tian et al., “Delving into shape-aware zero-shot seman- [246] J. Xie, W. Li et al., “Mosaicfusion: Diffusion models as data
tic segmentation,” in CVPR, 2023. augmenters for large vocabulary instance segmentation,” arXiv,
[212] S. D. Dao, H. Shi et al., “Class enhancement losses with pseudo 2023.
labels for open-vocabulary semantic segmentation,” TMM, 2023. [247] Z. Wang, X. Xia et al., “Open-vocabulary segmentation with
[213] G. Shin, W. Xie et al., “Reco: Retrieve and co-segment for zero- unpaired mask-text supervision,” arXiv, 2024.
shot transfer,” NeurIPS, 2022. [248] J. Qin, J. Wu et al., “Freeseg: Unified, universal and open-
[214] K. He, H. Fan et al., “Momentum contrast for unsupervised visual vocabulary image segmentation,” in CVPR, 2023.
representation learning,” in CVPR, 2020. [249] D. Wang, E. Shelhamer et al., “Tent: Fully test-time adaptation by
[215] Y. Rao, W. Zhao et al., “Denseclip: Language-guided dense pre- entropy minimization,” arXiv, 2020.
diction with context-aware prompting,” in CVPR, 2022. [250] M. Shu, W. Nie et al., “Test-time prompt tuning for zero-shot
[216] Y. Liu, S. Bai et al., “Open-vocabulary segmentation with generalization in vision-language models,” NeruIPS, 2022.
semantic-assisted calibration,” arXiv, 2023. [251] V. VS, S. Borse et al., “Possam: Panoptic open-vocabulary segment
[217] M. Xu, Z. Zhang et al., “A simple baseline for open-vocabulary anything,” arXiv, 2024.
semantic segmentation with pre-trained vision-language model,” [252] X. Xu, T. Xiong et al., “Masqclip for open-vocabulary universal
in ECCV, 2022. image segmentation,” in ICCV, 2023.
[218] C. Zhou, C. C. Loy et al., “Extract free dense labels from clip,” in [253] X. Li, H. Yuan et al., “Omg-seg: Is one model good enough for all
ECCV, 2022. segmentation?” arXiv, 2024.
[219] M. Wysoczańska, O. Siméoni et al., “Clip-dinoiser: Teaching clip [254] F. Li, H. Zhang et al., “Semantic-sam: Segment and recognize
a few dino tricks,” arXiv, 2023. anything at any granularity,” arXiv, 2023.
[220] O. Siméoni, C. Sekkat et al., “Unsupervised object localization: [255] X. Wang, S. Li et al., “Hierarchical open-vocabulary universal
Observing the background to discover objects,” in CVPR, 2023. image segmentation,” arXiv, 2023.
[221] J. Guo, Q. Wang et al., “Mvp-seg: Multi-view prompt learning for [256] Z. Ding, J. Wang et al., “Open-vocabulary panoptic segmentation
open-vocabulary semantic segmentation,” arXiv, 2023. with maskclip,” arXiv, 2022.
[222] R. Burgert, K. Ranasinghe et al., “Peekaboo: Text to image diffu- [257] X. Chen, S. Li et al., “Open-vocabulary panoptic segmentation
sion models are zero-shot segmentors,” arXiv, 2022. with embedding modulation,” arXiv, 2023.
[258] T. Ren, S. Liu et al., “Grounded sam: Assembling open-world
[223] L. Karazija, I. Laina et al., “Diffusion models for zero-shot open-
models for diverse visual tasks,” arXiv, 2024.
vocabulary segmentation,” arXiv, 2023.
[259] H. Zhang, J. Xu et al., “Opensight: A simple open-vocabulary
[224] L. Barsellotti, R. Amoroso et al., “Fossil: Free open-vocabulary
framework for lidar-based object detection,” arXiv, 2023.
semantic segmentation through synthetic references retrieval,” in
[260] C. Zhu, W. Zhang et al., “Object2scene: Putting objects in context
WACV, 2024, pp. 1464–1473.
for open-vocabulary 3d detection,” arXiv, 2023.
[225] S. Ren, A. Zhang et al., “Prompt pre-training with twenty-
[261] R. Ding, J. Yang et al., “Pla: Language-driven open-vocabulary 3d
thousand classes for open-vocabulary visual recognition,” arXiv,
scene understanding,” in CVPR, 2023.
2023.
[262] A. Radford, J. Wu et al., “Language models are unsupervised
[226] C. Ma, Y. Yang et al., “Open-vocabulary semantic segmentation
multitask learners,” OpenAI blog, 2019.
via attribute decomposition-aggregation,” in NeurIPS, 2023.
[263] J. Yang, R. Ding et al., “Regionplc: Regional point-language con-
[227] L. Jiayun, S. Khandelwal et al., “Plug-and-play, dense-label- trastive learning for open-world 3d scene understanding,” arXiv,
free extraction of open-vocabulary semantic segmentation from 2023.
vision-language models,” arXiv, 2023.
[264] A. Takmaz, E. Fedele et al., “Openmask3d: Open-vocabulary 3d
[228] Q. Liu, K. Zheng et al., “Tagalign: Improving vision-language instance segmentation,” 2023.
alignment with multi-tag classification,” arXiv, 2023. [265] M. Yan, J. Zhang et al., “Maskclustering: View consensus based
[229] O. Ülger, M. Kulicki et al., “Self-guided open-vocabulary seman- mask graph clustering for open-vocabulary 3d instance segmen-
tic segmentation,” arXiv, 2023. tation,” arXiv, 2024.
[230] J. Li, D. Li et al., “Blip: Bootstrapping language-image pre- [266] Z. Huang, X. Wu et al., “Openins3d: Snap and lookup for 3d
training for unified vision-language understanding and gener- open-vocabulary instance segmentation,” arXiv, 2023.
ation,” in ICML, 2022. [267] P. D. Nguyen, T. D. Ngo et al., “Open3dis: Open-vocabulary 3d
[231] ——, “Blip-2: Bootstrapping language-image pre-training with instance segmentation with 2d mask guidance,” arXiv, 2023.
frozen image encoders and large language models,” in ICML, [268] H. Wang, C. Yan et al., “Towards open-vocabulary video instance
2023. segmentation,” in ICCV, 2023.
[232] X. Zou, Z.-Y. Dou et al., “Generalized decoding for pixel, image, [269] Z. Cheng, K. Li et al., “Instance brownian bridge as texts for open-
and language,” in CVPR, 2023. vocabulary video instance segmentation,” arXiv, 2024.
[233] H. Touvron, L. Martin et al., “Llama 2: Open foundation and fine- [270] T. Zhang, X. Tian et al., “Dvis++: Improved decoupled framework
tuned chat models,” arXiv, 2023. for universal video segmentation,” arXiv, 2023.
[234] S. Cho, H. Shin et al., “Cat-seg: Cost aggregation for open- [271] D. Kim, T.-Y. Lin et al., “Learning open-world object proposals
vocabulary semantic segmentation,” arXiv, 2023. without learning to classify,” Robotics and Automation, 2022.
[235] B. Xie, J. Cao et al., “Sed: A simple encoder-decoder for open- [272] O. Siméoni, É. Zablocki et al., “Unsupervised object localization
vocabulary semantic segmentation,” arXiv, 2023. in the era of self-supervised vits: A survey,” arXiv, 2023.
[236] Z. Liu, H. Mao et al., “A convnet for the 2020s,” in CVPR, 2022. [273] H. Zhou, T. Shen et al., “Rethinking evaluation metrics of open-
[237] S. Jiao, Y. Wei et al., “Learning mask-aware clip representations vocabulary segmentaion,” arXiv, 2023.
for zero-shot segmentation,” NeurIPS, 2023. [274] K. Gao, L. Chen et al., “Compositional prompt tuning with
[238] Z. Zhou, Y. Lei et al., “Zegclip: Towards adapting clip for zero- motion cues for open-vocabulary video relation detection,” in
shot semantic segmentation,” in CVPR, 2023. ICLR, 2023.
[239] J. Li, P. Chen et al., “Tagclip: Improving discrimination ability of [275] L. Li, J. Xiao et al., “Zero-shot visual relation detection via
open-vocabulary semantic segmentation,” arXiv, 2023. composite visual cues from large language models,” arXiv, 2023.
[240] T. Lüddecke and A. Ecker, “Image segmentation using text and [276] X. Gu, Y. Cui et al., “Dataseg: Taming a universal multi-dataset
image prompts,” in CVPR, 2022. multi-task segmentation model,” arXiv, 2023.
[241] O. Ronneberger, P. Fischer et al., “U-net: Convolutional networks [277] Y. Zang, W. Li et al., “Contextual object detection with multi-
for biomedical image segmentation,” in MICCAI, 2015. modal large language models,” arXiv, 2023.
[242] S. Sun, R. Li et al., “Clip as rnn: Segment countless visual concepts [278] R. Pi, J. Gao et al., “Detgpt: Detect what you need via reasoning,”
without training endeavor,” arXiv, 2023. arXiv, 2023.
[279] W. Wang, Z. Chen et al., “Visionllm: Large language model is also [317] H. Zhang, F. Li et al., “Dino: Detr with improved denoising
an open-ended decoder for vision-centric tasks,” NeurIPS, 2024. anchor boxes for end-to-end object detection,” arXiv, 2022.
[280] X. Lai, Z. Tian et al., “Lisa: Reasoning segmentation via large [318] F. Li, H. Zhang et al., “Mask dino: Towards a unified transformer-
language model,” arXiv, 2023. based framework for object detection and segmentation,” in
[281] T. Chen, S. Saxena et al., “Pix2seq: A language modeling frame- CVPR, 2023.
work for object detection,” arXiv, 2021. [319] Z. Cai and N. Vasconcelos, “Cascade r-cnn: Delving into high
[282] ——, “A unified sequence interface for vision tasks,” NeurIPS, quality object detection,” in CVPR, 2018.
2022. [320] X. Chen, X. Wang et al., “Pali: A jointly-scaled multilingual
[283] W. Lv, S. Xu et al., “Detrs beat yolos on real-time object detection,” language-image model,” arXiv, 2022.
arXiv, 2023. [321] O. Ronneberger, P. Fischer et al., “U-net: Convolutional networks
[284] R. Mottaghi, X. Chen et al., “The role of context for object for biomedical image segmentation,” in MICCAI, 2015.
detection and semantic segmentation in the wild,” in CVPR, 2014. [322] B. Zhang, Z. Tian et al., “Segvit: Semantic segmentation with plain
[285] S. Shao, Z. Li et al., “Objects365: A large-scale, high-quality vision transformers,” NeurIPS, 2022.
dataset for object detection,” in ICCV, 2019. [323] M. Ding, B. Xiao et al., “Davit: Dual attention vision transform-
[286] A. Kuznetsova, H. Rom et al., “The open images dataset v4: Uni- ers,” in ECCV, 2022.
fied image classification, object detection, and visual relationship
detection at scale,” IJCV, 2020. Chaoyang Zhu currently is a Ph.D. student at
[287] H. Caesar, J. Uijlings et al., “Coco-stuff: Thing and stuff classes in the Department of Computer Science and Engi-
context,” in CVPR, 2018. neering, HKUST. He received the M.Sc degree
[288] B. Zhou, H. Zhao et al., “Scene parsing through ade20k dataset,” in Computer Technology from Xiamen University
in CVPR, 2017. in 2023, and the B.Eng. degree in Computer
[289] M. Cordts, M. Omran et al., “The cityscapes dataset for semantic Science and Technology from Hangzhou Dianzi
urban scene understanding,” in CVPR, 2016. University in 2019. His research interests are
[290] S. Song, S. P. Lichtenberg et al., “Sun rgb-d: A rgb-d scene computer vision and multimodal learning.
understanding benchmark suite,” in CVPR, 2015.
[291] A. Dai, A. X. Chang et al., “Scannet: Richly-annotated 3d recon-
structions of indoor scenes,” in CVPR, 2017.
[292] D. Rozenberszki, O. Litany et al., “Language-grounded indoor 3d Long Chen received the Ph.D. degree in Com-
semantic segmentation in the wild,” in ECCV, 2022. puter Science from Zhejiang University in 2020,
[293] L. Yang, Y. Fan et al., “Video instance segmentation,” in ICCV, and the B.Eng. degree in Electrical Information
2019. Engineering from Dalian University of Technol-
[294] A. Athar, J. Luiten et al., “Burst: A benchmark for unifying object ogy in 2015. He is currently an assistant profes-
recognition, segmentation and tracking in video,” in WACV, 2023. sor at the Department of Computer Science and
[295] C. Szegedy, S. Ioffe et al., “Inception-v4, inception-resnet and the Engineering, HKUST. He was a postdoctoral re-
impact of residual connections on learning,” in AAAI, 2017. search scientist at Columbia University and a se-
[296] K. He, X. Zhang et al., “Deep residual learning for image recog- nior researcher at Tencent AI Lab. His research
nition,” in CVPR, 2016. interests are computer vision and multimedia.
[297] A. Farhadi, I. Endres et al., “Describing objects by their at-
tributes,” in CVPR, 2009.
[298] K. Simonyan and A. Zisserman, “Very deep convolutional net-
works for large-scale image recognition,” arXiv, 2014.
[299] Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point
cloud based 3d object detection,” in CVPR, 2018, pp. 4490–4499.
[300] J. Schult, F. Engelmann et al., “Mask3d: Mask transformer for 3d
semantic instance segmentation,” in ICRA, 2023.
[301] L. Qi, J. Kuen et al., “High-quality entity segmentation,” arXiv,
2022.
[302] A. Bewley, Z. Ge et al., “Simple online and realtime tracking,” in
ICIP, 2016.
[303] Y. Shen, R. Ji et al., “Enabling deep residual networks for weakly
supervised object detection,” in ECCV, 2020.
[304] G. Zhang, Z. Luo et al., “Accelerating detr convergence via
semantic-aligned matching,” in CVPR, 2022.
[305] Y. Liu, M. Ott et al., “Roberta: A robustly optimized bert pretrain-
ing approach,” arXiv, 2019.
[306] T. Wang, “Learning to detect and segment for open vocabulary
object detection,” in CVPR, 2023.
[307] C. Schuhmann, R. Vencu et al., “Laion-400m: Open dataset of
clip-filtered 400 million image-text pairs,” arXiv, 2021.
[308] V. Ordonez, G. Kulkarni et al., “Im2text: Describing images using
1 million captioned photographs,” NeurIPS, 2011.
[309] C.-Y. Wang, H.-Y. M. Liao et al., “Cspnet: A new backbone that
can enhance learning capability of cnn,” in CVPRW, 2020.
[310] [Online]. Available: https://github.com/ultralytics/yolov5
[311] S. Zhang, C. Chi et al., “Bridging the gap between anchor-
based and anchor-free detection via adaptive training sample
selection,” in CVPR, 2020.
[312] X. Zhou, V. Koltun et al., “Probabilistic two-stage detection,”
arXiv, 2021.
[313] A. Brock, S. De et al., “High-performance large-scale image recog-
nition without normalization,” in ICML, 2021.
[314] X. Zhai, X. Wang et al., “Lit: Zero-shot transfer with locked-image
text tuning,” in CVPR, 2022.
[315] B. Thomee, D. A. Shamma et al., “Yfcc100m: The new data in
multimedia research,” ACM Communications, 2016.
[316] X. Dai, Y. Chen et al., “Dynamic head: Unifying object detection
heads with attentions,” in CVPR, 2021.
APPENDIX
In this supplementary material, we provide as detailed, comprehensive, and fair comparisons of methods for different tasks and settings as possible. We keep track of new works at awesome-ovd-ovs. However, note that the benchmark does not differentiate subtle nuances such as image backbone initialization weights, evaluation with or without background, and different versions of validation sets, etc. For precise details, please refer to the original papers.

Evaluation Protocols, Metrics, and Datasets
ZSD and ZSS mainly use two evaluation protocols for assessment: 1) evaluating only on novel classes (non-generalized); 2) evaluating on both base and novel classes (generalized). The generalized assessment is more challenging and realistic than the non-generalized evaluation: it competes novel classes against base classes and requires the model not to overfit on base classes. OVD and OVS mainly adopt the generalized protocol for evaluation. Besides, OVD and OVS introduce a third protocol, termed cross-dataset transfer evaluation (CDTE): the model is trained on one source dataset and tested on other target datasets without adaptation. Vocabularies of the source and target datasets may or may not partially overlap with each other.
The evaluation metric for object detection and instance segmentation is mainly box and mask AP at a certain IoU threshold (AP50, AP25) or integrated over a series of IoU thresholds (0.5 to 0.95 with 0.05 as the interval). The AP can be divided into AP_B and AP_N considering only base or novel classes. For the LVIS [17] dataset, the rare categories are regarded as novel classes and the corresponding metric is denoted as AP_r, while common and frequent classes are base classes. Additionally, object detection and instance segmentation use recall as a complementary metric. For semantic segmentation, the metric is mIoU considering either base (mIoU_B) or novel (mIoU_N) classes. The harmonic mean (hIoU) between mIoU_B and mIoU_N is calculated as follows:

hIoU = 2 · mIoU_B · mIoU_N / (mIoU_B + mIoU_N).   (5)

Note that for ZSD, AP50 may represent the harmonic mean of AP50_N and AP50_B in some works; we do not differentiate them here. For panoptic segmentation, the metric is panoptic quality [11] (PQ), which can be viewed as a multiplication of segmentation quality (SQ) and recognition quality (RQ). For 3D scene and video understanding, the metrics are mainly inherited from their counterparts in the image domain. For a complete dataset and metric list, c.f. Table 2.

TABLE 2: Datasets and evaluation metrics.
Task | Datasets (split of base/novel categories) | Evaluation metrics
ZSD | Pascal VOC [15] (16/4); COCO [16] (48/17, 65/15); ILSVRC-2017 Detection [176] (177/23); Visual Genome [74] (478/130) | AP50_N, AP50_B, AP50, R@100
ZSSS | Pascal VOC [15] (15/5); Pascal Context [284] (29/4) | mIoU_B, mIoU_N, hIoU
OVD | COCO [16] (48/17); LVIS [17] (866/337); Objects365 [285]; OpenImages [286] | AP50_N, AP50_B, AP50, AP_r, AP_c, AP_f, AP
OVSS | Pascal VOC [15] (15/5); COCO Stuff [287] (156/15); ADE20K-150 [288] (135/15); ADE20K-847 [288] (572/275); Pascal Context-59 [284]; Pascal Context-459 [284] | mIoU, mIoU_N, mIoU_B, hIoU
OVIS | COCO [16] (48/17); ADE20K [288] (135/15); OpenImages [286] (200/100) | mask AP50, mask AP50_N, mask AP50_B
OVPS | COCO Panoptic [11] (119/14); ADE20K [288]; Cityscapes [289] | PQ, SQ, RQ
OV3D | SUN RGB-D [290]; ScanNet [291]; nuScenes [1] | AP25_N, AP25_B, AP25
OV3SS | ScanNet [291]; nuScenes [1] | mIoU, mIoU_N, mIoU_B, hIoU
OV3IS | ScanNet200 [292] | AP, AP25, AP50
OVVU | Youtube-VIS [293]; BURST [294]; LV-VIS [268] | AP, AP_B, AP_N

TABLE 3: ZSD performance on the COCO [16] dataset. IRv2 is InceptionResnetv2 [295]; c.f. Table 4 for other notations.
Method | Image Backbone | Semantic Embeddings | AP50_N | AP50_B/AP50_N/AP50
48/17 split [18]
SAN [113] | R50 | W2V | 5.1 | 13.9/2.6/4.3
SB [18] | IRv2 | - | 0.7 | -
LAB [18] | IRv2 | - | 0.3 | -
DSES [18] | IRv2 | - | 0.5 | -
MS-Zero [33] | R101 | GloVe [24] | 12.9 | -/-/30.7
PL [32] | R50-FPN | W2V | 10.0 | 35.9/4.1/7.4
CG-ZSD [121] | DN53 [119] | BERT [25] | 7.2 | -
BLC [116] | RN50 | W2V | 10.6 | 42.1/4.5/8.2
ContrastZSD [123] | R101 | W2V | 12.5 | 45.1/6.3/11.1
SSB [117] | R101 | W2V | 14.8 | 48.9/10.2/16.9
DELO [34] | DN19 [119] | W2VR [20] | 7.6 | -/-/13.0
RRFS [36] | R101 | FT | 13.4 | 42.3/13.4/20.4
65/15 split [32]
PL [32] | R50-FPN | W2V | 12.4 | 34.1/12.4/18.2
TL [80] | R50-FPN | W2V | 14.6 | 28.8/14.1/18.9
CG-ZSD [121] | DN53 | BERT [25] | 10.9 | -
BLC [116] | R50 | W2V | 14.7 | 36.0/13.1/19.2
DPIF-M [122] | R50 | W2V | 19.8 | 29.8/19.5/23.6
ContrastZSD [123] | R101 | W2V | 18.6 | 40.2/16.5/23.4
SSB [117] | R101 | W2V | 19.6 | 40.2/19.3/26.1
SU [126] | R101 | FT | 19.0 | 36.9/19.0/25.1
RRFS [36] | R101 | FT | 19.8 | 37.4/19.8/26.0

TABLE 4: ZSD performance on Pascal VOC [15] under the non-generalized and generalized evaluation protocols. R denotes ResNet [296]; W2V and FT are Word2Vec [23] and FastText [70], respectively.
Method | Image Backbone | Semantic Embeddings | AP50_N | AP50_B/AP50_N/AP50
SAN [19] | R50 | - | 59.1 | 48.0/37.0/41.8
HRE [120] | DN19 [119] | aPY [297] | 54.2 | 62.4/25.5/36.2
PL [32] | R50-FPN | aPY [297] | 62.1 | -
BLC [116] | R50 | - | 55.2 | 58.2/22.9/32.9
TL [80] | R50-FPN | W2V | 66.6 | -
MS-Zero [33] | R101 | aPY [297] | 62.2 | -/-/60.1
CG-ZSD [121] | DN53 [119] | BERT [25] | 54.8 | -
SU [126] | R101 | FT | 64.9 | -
DPIF [122] | R50 | aPY [297] | - | 73.2/62.3/67.3
ContrastZSD [123] | R101 | aPY [297] | 65.7 | 63.2/46.5/53.6
RRFS [36] | R101 | FT | 65.5 | 47.1/49.1/48.1
TABLE 7: Open-vocabulary 3D detection performance on SUN RGB-D [290], ScanNet [291], and nuScenes [1] datasets under the generalized evaluation and CDTE protocol.
TABLE 9: Open-vocabulary video instance segmentation performance on validation set of Youtube-VIS19 [293] (YTVIS-19),
Youtube-VIS21 [293] (YTVIS-21), BURST [294], and LV-VIS [268].
OV2Seg [268] - SORT [302] LVIS [17] L (cat) 37.6 41.1 21.3 33.9 36.7 18.2 4.9 5.3 3.0 21.1 27.5 16.3
OpenVIS [69] M2F [14] % YTVIS [293] T (cat) - - - - - - 3.5 5.8 3.0 - - -
BriVIS [269] M2F [14] % LV-VIS T (cat) 45.3 - - 39.5 - - 7.4 9.5 6.9 27.68 - -
TABLE 10: OVD performance on COCO [16] under generalized evaluation protocol. “T” and “L” denote template and
learnable prompts. “cat” and “desc” denote that the prompts are filled with class names or class descriptions (definitions,
synonyms, etc). “Ensemble” represents that whether detector prediction is ensembled with CLIP prediction [26], [49] or not.
COCO Cap is COCO Captions dataset [136], Visual Genome [74] is denoted as VG, Conceptual Captions [134] is denoted
as CC3M.
Columns: Method, Backbone, Detector, Extra Data, Text Encoder, Prompt, Ensemble, AP50^N, AP50^B, AP50.
Region-Aware Training
OVR-CNN [27] R50 FRCNN COCO Cap BERT % % 22.8 46.0 39.9
LocOv [137] R50 FRCNN COCO Cap BERT % % 28.6 51.3 45.7
MMC-Det [142] R50 FRCNN COCO Cap BERT % % 33.5 - 47.5
WSOVOD [145] DRN [303] FRCNN % CLIP T (cat) % 35.0 27.9 29.8
RO-ViT [146] ViT-B/16 MRCNN ALIGN [102] CLIP T (cat) ! 30.2 - 41.5
CFM-ViT [147] ViT-B/16 MRCNN ALIGN [102] CLIP T (cat) ! 30.8 - 42.4
DITO [148] ViT-B/16 FRCNN ALIGN [102] CLIP T (cat) ! 38.6 - 48.5
VLDet [42] R50 FRCNN COCO Cap CLIP T (cat) % 32.0 50.6 45.8
GOAT [149] R50 FRCNN COCO Cap CLIP T (cat) % 31.7 51.3 45.7
OV-DETR [43] R50 Def-DETR [99] % CLIP T (cat) % 29.4 61.0 52.7
Prompt-OVD [150] ViT-B/16 Def-DETR [99] % CLIP T (cat) ! 30.6 63.5 54.9
CORA [151] R50 SAM-DETR [304] % CLIP T (cat) % 35.1 35.5 35.4
EdaDet [152] R50 Def-DETR [99] CLIP [31] CLIP T (cat) ! 35.1 35.5 35.4
SGDN [156] R50 Def-DETR [99] VG, Flickr30K [82] RoBERTa [305] % % 37.5 61.0 54.9
Pseudo-Labeling
RegionCLIP [157] R50 FRCNN CC3M CLIP T (cat) % 31.4 57.1 50.4
CondHead [306] R50 RegionCLIP [157] % CLIP T (cat) % 33.7 58.0 51.7
VL-PLM [158] R50 FRCNN % CLIP T (cat) % 34.4 60.2 53.5
PromptDet [165] R50 MRCNN LAION [307] CLIP L (cat+desc) % 26.6 - 50.6
SAS-Det [168] R50 FRCNN % CLIP T (cat) ! 37.4 58.5 53.0
PB-OVD [45] R50 MRCNN COCO Cap, VG, SBU [308] CLIP T (cat) % 30.8 46.1 42.1
CLIM [172] R50 Detic [46] COCO Cap CLIP T (cat) % 35.4 - -
VTP-OVD [173] R50 MRCNN % CLIP T (cat) % 31.5 51.9 46.6
ProxyDet [174] R50 FRCNN COCO Cap CLIP T (cat) ! 30.4 52.6 46.8
CoDet [175] R50 FRCNN COCO Cap CLIP T (cat) ! 30.6 52.3 46.6
Detic [46] R50 FRCNN COCO Cap CLIP T (cat) % 27.8 47.1 45.0
Knowledge Distillation
ViLD [26] R50 MRCNN % CLIP T (cat) ! 27.6 59.5 51.3
ZSD-YOLO [181] CSP [309] YOLOv5x [310] % CLIP T (cat+desc) % 13.6 31.7 19.0
LP-OVOD [182] R50 FRCNN % CLIP T (cat) ! 40.5 60.5 55.2
EZSD [183] R50 MRCNN % CLIP T (cat) % 31.6 59.9 52.1
SIC-CADS [184] R50 BARON [48] % CLIP T (cat) ! 36.9 56.1 51.1
BARON [48] R50 FRCNN COCO Cap CLIP T (cat) % 42.7 54.9 51.7
OADP [186] R50 FRCNN % CLIP T (cat) ! 35.6 55.8 50.5
RKDWTF [170] R50 FRCNN COCO Cap CLIP T (cat) % 36.6 54.0 49.4
DK-DETR [188] R50 Def-DETR [99] % CLIP T (cat) % 32.3 61.1 -
HierKD [189] R50 ATSS [311] CC3M CLIP T (cat/desc) % 20.3 51.3 43.2
CLIPSelf [110] ViT-B/16 F-VLM [49] % CLIP T (cat) ! 37.6 - -
Transfer Learning
F-VLM [49] R50 MRCNN % CLIP T (cat) ! 28.0 - 39.6
DRR [194] R50 FRCNN CC3M CLIP T (cat) % 35.8 54.6 49.6
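The “Ensemble” column in Table 10 (and Table 11) refers to fusing the detector's per-class score for a region with the score obtained by classifying the same region crop or feature with the VLM (CLIP) image encoder. A minimal sketch of one widely used variant, a geometric-mean ensemble with separate weights for base and novel classes in the spirit of ViLD [26] and F-VLM [49], is shown below; the function name and weight values are illustrative only, and the exact formulation and normalization differ across methods.

```python
import numpy as np

def geometric_mean_ensemble(det_probs, vlm_probs, is_base, alpha=0.35, beta=0.65):
    """Fuse detector and VLM (CLIP) per-class probabilities for each region.

    det_probs, vlm_probs: (num_regions, num_classes) probabilities from the detector
        head and from the VLM applied to the same region proposals.
    is_base: (num_classes,) boolean mask, True for base (training) classes.
    alpha, beta: VLM weights for base / novel classes (illustrative values); the VLM
        is trusted more on novel classes, the detector more on base ones.
    """
    fused_base = det_probs ** (1 - alpha) * vlm_probs ** alpha
    fused_novel = det_probs ** (1 - beta) * vlm_probs ** beta
    return np.where(is_base[None, :], fused_base, fused_novel)
```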
TABLE 11: OVD performance on the LVIS [17] dataset under the generalized evaluation protocol. Base classes are the common and frequent classes in LVIS; the rare classes are treated as novel classes. The four numeric columns report AP_r (novel), AP_c, AP_f, and overall AP; numbers in parentheses denote mask AP. The Conceptual 12M dataset [135] is denoted as CC12M. The subset of ImageNet-21K [176] (IN21K) that overlaps with the LVIS vocabulary is IN-L [46] (cf. Table 10 for other abbreviations).
Region-Aware Training
MMC-Det [142] R50 CN2 [312] CC3M CLIP T (cat) % 21.1 30.9 35.5 31.0
RO-ViT [146] ViT-B/16 MRCNN ALIGN [102] CLIP T (cat) ! 28.0 - - 30.2
CFM-ViT [147] ViT-B/16 MRCNN ALIGN [102] CLIP T (cat) ! 29.6 (28.8) - - 33.8 (32.0)
DITO [148] ViT-B/16 FRCNN ALIGN [102] CLIP T (cat) ! 34.9 (32.5) - - 36.9 (34.0)
VLDet [42] R50 CN2 [312] CC3M CLIP T (cat) % 21.7 29.8 34.3 30.1
GOAT [149] R50 CN2 [312] CC3M CLIP T (cat) % 23.3 29.7 34.3 30.4
OV-DETR [43] R50 Def-DETR [99] % CLIP T (cat) % 17.4 25.0 32.5 26.6
Prompt-OVD [150] ViT-B/16 Def-DETR [99] % CLIP T (cat) % 29.4 (23.1) - - 33.0 (24.2)
CORA [151] R50x4 CN2 [312] % CLIP T (cat) % 28.1 - - -
EdaDet [152] R50 Def-DETR [99] % CLIP T (cat) ! 23.7 27.5 29.1 27.5
SGDN [156] R50 Def-DETR [99] VG, Flickr30K [82] RoBERTa [305] % % 23.6 29.0 34.3 31.1
Pseudo-Labeling
RegionCLIP [157] R50 MRCNN CC3M CLIP T (cat) % 17.1 (17.4) 27.4 (26.0) 34.0 (31.6) 28.2 (26.7)
CondHead [306] R50 RegionCLIP [157] % CLIP T (cat) % 19.9 (20.0) 28.6 (27.3) 35.2 (32.2) 29.7 (27.9)
PromptDet [165] R50 MRCNN LAION [307] CLIP L (cat+desc) % 21.4 23.3 29.3 25.3
SAS-Det [168] R50 FRCNN % CLIP T (cat) ! 20.9 26.1 31.6 27.4
CLIM [172] R50 VLDet [42] CC3M CLIP T (cat) % 22.2 - - -
ProxyDet [174] R50 CN2 [312] IN-L CLIP T (cat) ! 26.2 - - 32.5
CoDet [175] R50 CN2 [312] CC3M CLIP T (cat) % 23.4 30.0 34.6 30.7
Detic [46] R50 CN2 [312] IN-L CLIP T (cat) % 24.6 - - 32.4
MMC [177] R50 CN2 [312] IN-L CLIP GPT-3 [178] % 27.3 - - 33.1
3Ways [169] NF-F0 [313] FCOS [98] CC12M CLIP T (cat) % 25.6 34.2 41.8 35.7
PLAC [179] Swin-B Def-DETR [99] CC3M CLIP T (cat) % 27.0 40.0 44.5 39.5
Knowledge Distillation
ViLD-ens [26] R50 MRCNN % CLIP T (cat) ! 16.7 (16.6) 26.5 (24.6) 34.2 (30.3) 27.8 (25.5)
LP-OVOD [182] R50 MRCNN % CLIP T (cat) ! 19.3 26.1 29.4 26.2
EZSD [183] R50 MRCNN % CLIP T (cat) % 15.8 25.6 31.7 26.3
SIC-CADS [184] R50 Detic [46] IN21K CLIP T (cat) ! 26.5 33.0 35.6 32.9
BARON [48] R50 FRCNN % CLIP L (cat) % 23.2 (22.6) 29.3 (27.6) 32.5 (29.8) 29.5 (27.6)
OADP [186] R50 FRCNN % CLIP T (cat) ! 21.9 (21.7) 28.4 (26.3) 32.0 (29.0) 28.7 (26.6)
GridCLIP [187] R50 FCOS [98] % CLIP T (cat) % 15.0 22.7 32.5 25.2
RKDWTF [170] R50 CN2 [312] IN21K CLIP T (cat) % 25.2 33.4 35.8 32.9
DK-DETR [188] R50 Def-DETR [99] % CLIP T (cat) % 22.2 (20.5) 32.0 (28.9) 40.2 (35.4) 33.5 (30.0)
DetPro [47] R50 MRCNN % CLIP L (cat) ! 20.8 (19.8) 27.8 (25.6) 32.4 (28.9) 28.4 (25.9)
CLIPSelf [110] ViT-B/16 F-VLM [49] % CLIP T (cat) ! 25.3 - - -
Transfer Learning
OWL-ViT [190] ViT-H/14 DETR LiT [314] CLIP T (cat) % 23.3 - - 35.3
F-VLM [49] R50 MRCNN % CLIP T (cat) ! 18.6 - - 24.2
TABLE 12: OVD performance under the CDTE protocol on Pascal VOC [15] (VOC), Objects365 [285] (O365), COCO [16], OpenImages [286] (OI), and LVIS [17] validation sets. The Cap4M image-text pairs are crawled in [44]. GoldG denotes the merged grounding datasets in [41], [44] (cf. Table 11 for other notations). Note that some methods evaluate on different versions of the O365 and OI validation sets; we do not differentiate them here. Numbers in parentheses denote performance on the LVIS minival set [41]. All metrics are box AP.
Region-Aware Training
MMC-Det [142] R50 CN2 [312] LVIS, CC3M - - 56.4 - 21.4 38.6 -
DetCLIP [143] Swin-T ATSS [311] O365, GoldG, YFCC1M [315] - - - - - - 25.0 (33.2) / 28.4 (35.9)
DetCLIPv2 [144] Swin-T ATSS [311] O365, GoldG, CC3M, CC12M - - - - - - 36.0 / 40.4
RO-ViT [146] ViT-B/16 MRCNN LVIS, ALIGN [102] - - - 17.1 26.9 - -
CFM-ViT [147] ViT-B/16 MRCNN LVIS, ALIGN [102] - - - 15.9 24.6 - -
DITO [148] ViT-L/16 FRCNN LVIS, ALIGN [102] - - - 19.8 30.4 - -
OV-DETR [43] R50 Def-DETR [99] LVIS 76.1 38.1 58.4 - - - -
EdaDet [152] R50 Def-DETR [99] LVIS - - - 13.6 19.8 - -
MDETR [41] R101 DETR GoldG+ [41] - - - - - - 7.4 (20.9) / 22.5 (24.2)
MQ-Det [154] Swin-T GLIP [44] O365 - - - - - - 15.4 (21.0) / 22.6 (30.4)
YOLO-World [155] - YOLOv8-L O365, GoldG - - - - - - 27.1/35.0
SGDN [156] R50 Def-DETR [99] LVIS, Flickr30K, VG - 40.5 - - - - -
Pseudo-Labeling
GLIP [44] Swin-T DyHead [316] O365, GoldG, Cap4M - - - - - - 10.1 (20.8) / 17.2 (26.0)
GLIPv2 [159] Swin-T DyHead [316] O365, GoldG, Cap4M - - - - - - -/29.0
Grounding DINO [164] Swin-T DINO [317] O365, GoldG, Cap4M - 48.4 - - - - 18.1/27.4
PB-OVD [45] R50 MRCNN COCO, COCO Cap, VG, SBU [308] 59.2 - - 6.9 - - -
VTP-OVD [173] R50 MRCNN COCO 61.1 - - - 7.4 - -
ProxyDet [174] R50 CN2 [312] LVIS - - 57.0 - 19.1 - -
CoDet [175] R50 CN2 [312] LVIS, CC3M - 39.1 57.0 14.2 20.5 - -
Detic [46] Swin-B CN2 [312] LVIS, IN21K - - - - 21.5 55.2 -
MMC (text) [177] R50 CN2 [312] IN-L, LVIS - - - 16.6 23.1 - -
3Ways [169] NF-F0 [313] FCOS LVIS - 41.5 - 16.4 - - -
Knowledge Distillation
ViLD [26] R50 MRCNN LVIS 72.2 36.6 55.6 11.8 18.2 - -
CondHead [306] R50 ViLD [26] LVIS 74.6 39.1 59.1 13.2 20.4 - -
SIC-CADS [184] R50 Detic [46] LVIS - - - - 31.2 54.7 -
BARON [48] R50 FRCNN LVIS 76.0 36.2 55.7 13.6 21.0 - -
GridCLIP [187] R50 FCOS LVIS 70.9 34.7 52.2 - - - -
RKDWTF [170] R50 MRCNN IN21K, LVIS - - 56.6 - 22.3 42.9 -
DK-DETR [188] R50 Def-DETR [99] LVIS 71.3 39.4 54.3 12.4 17.3 - -
DetPro [47] R50 MRCNN LVIS 74.6 34.9 53.8 12.1 18.8 - -
CLIPSelf [110] ViT-L/14 F-VLM [49] LVIS - 40.5 63.8 19.5 31.3 - -
Transfer Learning
OWL-ViT [190] ViT-B/16 DETR O365, VG - - - - - - 23.6/26.7
UniDetector [191] R50 FRCNN COCO, O365, OI - - - - - - 18.0/19.8
F-VLM [49] R50 MRCNN LVIS - 32.5 53.1 11.9 19.2 - -
OpenSeeD [193] Swin-T Mask DINO [318] COCO, O365 - - - - - - -/21.8
Sambor [195] ViT-B Cascade R-CNN [319] O365 - 48.6 66.1 - - - 20.9 (29.6) / 26.3 (33.1)
TABLE 13: Open-vocabulary semantic segmentation performance (mIoU) on the validation sets of ADE20K [288] (A-847 and A-150), Pascal Context [15] (PC-459 and PC-59), Pascal VOC [15] (PAS-20), Cityscapes [289] (CS-19), COCO Stuff [287] (Stuff), and COCO [16] under the CDTE protocol. MF denotes MaskFormer [9]; cf. Tables 10 and 11 for other notations.
Region-Aware Training
OpenSeg [29] R101 MF COCO Pan, COCO Cap BERT cat+desc % 4.0 15.3 6.5 36.9 60.0 - - -
SLIC [197] ViT-B/16 CAT-Seg [234] WebLI [320], COCO Stuff CLIP T (cat) % 13.4 36.6 22.0 61.2 95.9 - - -
GroupViT [50] ViT-S - CC12M, YFCC14M [315] CLIP T (cat) % - - - 22.4 52.3 - - -
TCL [206] ViT-B/16 - CC3M, CC12M CLIP T (cat) % - 17.1 - 33.9 83.2 24.0 22.4 31.6
SimSeg [207] ViT-B/16 - CC3M, CC12M CLIP T (cat) % - - - 26.2 57.4 - 29.7 -
Knowledge Distillation
GKC [52] R50 MF COCO Pan, COCO Cap CLIP T (cat+desc) % 3.2 17.5 6.5 41.9 78.7 34.3 - -
SAM-CLIP [53] ViT-B SAM [196] CC3M, CC12M, YFCC15M [315], IN21K, SA-1B [196] CLIP T (cat) % - - - 29.2 60.6 17.1 31.5 -
ZeroSeg [209] ViT - IN1K [176] CLIP T (cat) % - - - 20.4 40.8 - - 20.2
Transfer Learning
LSeg+ [29] R101 SRB [30] COCO Pan CLIP T (cat) % 2.5 13.0 5.2 36.0 59.0 - - -
CEL [212] R50 MF COCO Cap, COCO Stuff CLIP T (cat) ! 7.2 20.5 9.6 49.6 86.7 - - -
ZSSeg [217] R101 MF COCO Stuff CLIP L (cat) ! 7.0 20.5 - 47.7 - 34.5 - -
MaskCLIP [218] R101 DL [100] % CLIP T (cat) % - - - 25.5 - - 14.6 -
CLIP-DINOiser [219] ViT-B/16 - Pascal VOC CLIP T (cat) % - 20.0 - 35.9 80.2 31.7 - -
MVP-SEG [221] R50 DL [100] COCO Stuff CLIP L (cat) % - - - 38.7 - - - -
ReCo [213] - - - CLIP T (cat) % - - - - - 19.3 26.3 -
OVDiff [223] UNet [321] - CLIP [31], Stable Diffusion [108] % T (cat) % - - - 30.1 67.1 - - 34.8
FOSSIL [224] ViT-L/14 - COCO Cap CLIP T (cat) % - - - 35.8 - 23.2 24.8 -
POMP [225] R101 MF COCO Stuff CLIP L (cat) % - 20.7 - 51.1 - - - -
AttrSeg [226] R101 - COCO Stuff CLIP desc % - - - 56.3 91.6 - - -
PnP-OVSS [227] ViT-L/16 BLIP [230] COCO Cap BERT T (cat) % - 23.2 - 41.9 55.7 - 32.6 33.8
SCAN [216] Swin-B M2F COCO Stuff CLIP T (cat) % 10.8 30.8 13.2 58.4 97.0 - - -
TagAlign [228] ViT-B/16 - CC12M CLIP T (cat) % - 17.3 - 37.6 87.9 27.5 - 33.3
Self-Seg [229] ViT-L X-Dec [232] COCO Cap - - % 6.4 - - - - 41.1 - -
OVSeg [55] R101c [100] MF COCO Stuff, COCO Cap CLIP T (cat) ! 7.1 24.8 11.0 53.3 92.6 - - -
CAT-Seg [234] Swin-B - COCO Stuff CLIP T (cat) % 10.8 31.5 20.4 62.0 96.6 - - -
SED [235] ConvNext-B [236] - COCO Stuff CLIP T (cat) % 11.4 31.6 18.6 57.3 94.4 - - -
MAFT [237] ViT-B/16 FreeSeg [248] COCO Stuff CLIP T (cat) % 10.1 29.1 12.8 53.5 90.0 - - -
SAN [56] ViT-B/16 - COCO Stuff CLIP T (cat) % 10.1 27.5 12.6 53.8 94.0 - - -
CaR [242] ViT-B/16 - % CLIP T (cat) % - - - 30.5 67.6 - - 36.6
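The semantic segmentation numbers in Table 13 (and in Table 14 below) are mIoU values. As a reminder of how they are computed, here is a minimal sketch with a hypothetical helper, assuming a class-by-class confusion matrix has been accumulated over the validation set:

```python
import numpy as np

def mean_iou(conf_matrix: np.ndarray) -> float:
    """Mean IoU from a (num_classes, num_classes) confusion matrix
    (rows: ground-truth class, columns: predicted class), counted per pixel."""
    tp = np.diag(conf_matrix).astype(float)
    fp = conf_matrix.sum(axis=0) - tp          # pixels wrongly predicted as the class
    fn = conf_matrix.sum(axis=1) - tp          # class pixels predicted as something else
    iou = tp / np.maximum(tp + fp + fn, 1e-9)  # per-class IoU
    return float(iou.mean())
```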
TABLE 14: Open-vocabulary semantic segmentation performance under the generalized evaluation protocol. HM is the harmonic mean (hIoU); cf. Tables 10, 11, and 13 for other notations.
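Concretely, writing mIoU_base and mIoU_novel for the mean IoU over base and novel classes, the harmonic mean reported in Table 14 follows the convention of prior zero-shot segmentation work:

```latex
\mathrm{hIoU} = \frac{2\,\mathrm{mIoU}_{\mathrm{base}}\,\mathrm{mIoU}_{\mathrm{novel}}}{\mathrm{mIoU}_{\mathrm{base}} + \mathrm{mIoU}_{\mathrm{novel}}}
```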
TABLE 15: Open-vocabulary instance segmentation performance on the COCO [16] and OpenImages [286] datasets under the generalized evaluation (gOVE) protocol. The metric is mask AP. M2F denotes Mask2Former [14]; cf. Tables 10 and 11 for other notations.
CGG [57] R50 M2F COCO Cap BERT % % 28.4 46.0 41.4 - - -
D2 Zero [244] R50 M2F % CLIP T (cat) % 15.8 54.1 24.5 - - -
XPM [28] R50 MRCNN CC3M BERT % % 21.6 41.5 36.3 22.7 49.8 40.7
Mask-free OVIS R50 MRCNN COCO, OI ALBEF [141] % % 25.0 - - 25.8 - -
TABLE 16: Open-vocabulary panoptic segmentation performance on the COCO Panoptic [11] and ADE20K [288] datasets. PQst and PQth represent PQ for stuff and thing classes, respectively. For other notations, cf. Tables 10, 11, and 15.
PADing [60] R50 M2F CLIP T (cat) % 41.5 80.6 49.7 15.3 72.8 18.4 - - - - -
FreeSeg [248] R101 M2F CLIP L (cat) % 31.4 78.3 38.9 29.8 79.2 37.6 - - - - -
MaskCLIP [256] R50 M2F CLIP cat % - - - - - - 15.1 13.5 18.3 70.5 19.2
OPSNet [257] R50 M2F CLIP cat % - - - - - - 17.7 15.6 21.9 54.9 21.6
TABLE 17: Open-vocabulary panoptic segmentation performance under the CDTE protocol. For notations, cf. Table 16.
Uni-OVSeg [247] ConvNext-L [236] M2F CLIP T (cat) cf. [247] % 18.0 72.6 24.3 14.1 66.1 19.0 17.5 65.2 23.5
X-Decoder [232] DaViT-B [323] M2F CLIP T (cat) CC3M, SBU [308], COCO Cap, VG % - - - 21.1 - - 39.5 - -