
From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models


Changming Xiao (a,∗), Qi Yang (a,∗), Feng Zhou (b), Changshui Zhang (a,∗∗)

(a) Institute for Artificial Intelligence, Tsinghua University (THUAI); Beijing National Research Center for Information Science and Technology (BNRist); Department of Automation, Tsinghua University, 100084, Beijing, P.R. China
(b) Algorithm Research, Aibee Inc., Beijing, P.R. China

∗ Indicates equal contribution. ∗∗ Corresponding author. Email address: [email protected] (Changshui Zhang)

arXiv:2309.04109v2 [cs.CV] 1 Oct 2024. Preprint submitted to Neurocomputing, October 2, 2024.

Abstract

Diffusion models have revolutionized the field of text-to-image generation recently.
The unique way of fusing text and image information contributes to their remarkable
capability of generating highly text-related images. From another perspective, these
generative models imply clues about the precise correlation between words and pixels.
This work proposes a simple but effective method to utilize the attention mechanism
in the denoising network of text-to-image diffusion models. Without additional
training or inference-time optimization, the semantic grounding of phrases can be
attained directly. We evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014
under the weakly-supervised semantic segmentation setting, and our method achieves
superior performance to prior methods. In addition, the acquired word-pixel
correlation generalizes to the learned text embeddings of customized generation
methods, requiring only a few modifications. To validate our
discovery, we introduce a new practical task called “personalized referring
image segmentation” with a new dataset. Experiments in various situations
demonstrate the advantages of our method compared to strong baselines on
this task. In summary, our work reveals a novel way to extract the rich
multi-modal knowledge hidden in diffusion models for segmentation.
Keywords: Diffusion model for segmentation, Text-image correlation,
Weakly-supervised semantic segmentation, Personalized recognition

1. Introduction

Dense image prediction is a long-established research field that aims at


producing pixel-level labels for given images [1]. It is the foundation of many
applications in fields such as biomedicine [2], robotics [3], and surveillance [4].
To attain precise masks, the dense image prediction task usually requires
expensive dense annotations for training. Although recent works [5, 6] have
shown impressive results when training without pixel-wise labels, deliberate
designs of the model architecture or the optimization objective are required.
With the development of powerful foundation models trained with internet-
scale data, studies have emerged on how to mine valuable information from
these off-the-shelf models for diverse tasks. Regarding segmentation, some
methods like [7] have utilized the learned localization information from text-
image discriminative models [8] to reduce reliance on segmentation-specific
schemes. Inspired by a renowned quote from Feynman, "What I cannot create, I do not
understand," we believe that generative models, the counterpart of discriminative
models, should also have a thorough comprehension of images.
Diffusion models [9, 10] have opened a new era of generative models,

[Figure 1 diagram; the example text query shown in the figure is "A photo including aeroplane, person with clothes and {Background Prompts}."]
Figure 1: An overview of our proposed framework. We first add noise to the latent
and then input it into the denoising U-net with specially designed text queries. Next,
we combine cross-attention and self-attention in the model to obtain the correlation map
between words and pixels. After comparing different correlation maps and post-processing
with dense CRF [16], we attain pseudo masks at last. Best viewed in color.

and their multi-modal variants [11] trained on billions of image-caption pairs


have revolutionized the field of text-to-image synthesis. When diffusion mod-
els generate images from texts, the association between words and different
spatial regions can be leveraged to indicate localization information. We
propose a simple but effective way to distill this information and use it for
dense image prediction. Building off a recently open-sourced text-to-image
diffusion model [11], our method exhibits remarkable semantic segmentation
capability. Figure 1 shows an overview of our framework, which effectively
leverages cross-attention and self-attention in the diffusion model to obtain
the segmentation map. Compared to previous works, our proposed method
doesn’t need segmentation-specific re-training [12], module addition [13], or
inference time optimization [14], and the whole pipeline is accomplished with
no exposure to pixel-wise labels [15].
Leveraging pre-trained text-to-image diffusion models brings more ben-

efits. It has been shown that generative models have a better spatial and
relational understanding than discriminative models [13], which is crucial for
segmentation. Besides, internet-scale training data endows the model with
the ability to handle open-vocabulary scenarios. Furthermore, customized
generation with text-to-image diffusion models has been proven feasible [17].
These methods typically map the subject to an identifier word, and novel
renditions of it can be obtained by inserting the word into various descrip-
tions. We integrate this technique with our method and locate user-specific
items by exploiting the correspondence between learned word embeddings
of personalized concepts and image segments. It is worth noting that we
can compose customized instances and textual contexts into multi-modal
segmentation queries, which was hard to achieve with previous methods [18].
To investigate the performance of different methods in this customized
case, we introduce a new task called “personalized referring image segmen-
tation” with a new dataset named Mug19. This task aims to locate user-
specific entities and has many possible practical application scenarios, such
as those for household robots. The dataset is collected in a laboratory scenario and
is carefully designed to ensure that in most cases, uni-modal information is
insufficient to locate the proper object. This setup makes the task more pragmatic
and user-friendly and allows this new benchmark to assess the multi-modal
comprehension ability of different models.
We conduct experiments from different aspects to validate the effective-
ness of our method. We first evaluate the weakly-supervised semantic seg-
mentation ability of our approach on classic datasets: Pascal VOC 2012 [19]
and MS COCO 2014 [20]. Then we demonstrate that we can locate per-

sonalized embeddings just like locating category embeddings, showcasing the
generality of our framework. We further reveal that traditional approaches
using uni-modal information have difficulties in dealing with our proposed
task. Finally, ablation studies are conducted to validate the intuition of our
framework designs.
In summary, the contributions of this work are as follows:

• A novel plug-and-play method for open-vocabulary segmentation is proposed, which utilizes the attention mechanism in off-the-shelf text-to-image diffusion models.

• A new benchmark named "personalized referring image segmentation" is introduced, which is valuable for both industrial application and academic research.

• Experimental results on diverse segmentation tasks verify that our method can achieve state-of-the-art (SOTA) performance.

2. Related Work

2.1. Open-Vocabulary Segmentation

Researchers have developed different technologies to make segmentation


systems more practical. One popular avenue is considering the weakly super-
vised setting and using image-level annotations which are easier to obtain [21,
22]. But these solutions based on Class Activation Mapping (CAM) [23]
are limited to a predefined taxonomy. Other researchers consider zero-shot
learning for segmentation, which aims to transfer the knowledge extracted

from seen categories to unseen ones. With the development of word em-
bedding [24], language generalization ability has been exploited to depict
semantic associations between different classes [25].
Recently, foundation models have made game-changing progress, and
zero-shot abilities have been found to automatically emerge from them [8].
Thus the open-vocabulary paradigm1, a more general zero-shot formulation, has
gradually replaced the classic paradigm [26]. Contrastive Language-Image
Pre-training (CLIP) [8], a discriminative foundation model that has learned
vast image-text knowledge from the internet, has become a common com-
ponent of open-vocabulary recognition methods [7]. Nevertheless, it has
been discovered that the representation of CLIP is sub-optimal for segmen-
tation tasks [13]. Therefore, we leverage a generative diffusion model in-
stead [11], which is believed to have a thorough perception of scene-level
structure [13, 27].

2.2. Diffusion Model for Segmentation

With the development of diffusion models, researchers have begun to pay


attention to their potential in perception tasks. In addition to global predic-
tion tasks like classification [28], dense prediction tasks like segmentation are
also gradually attracting attention [29, 12, 13, 30, 31]. These methods can
be roughly divided into 3 groups.
One group regards the segmentation task as a generative process [29].
They learn the conditional denoising process from noise distribution to dense
masks. The input image serves as a condition to guide the sampling process,

1
Also known as open-set paradigm or open-world paradigm.

and dense annotations are required for training. Instead of learning a differ-
ent denoising process, another group exploits the knowledge of pre-trained
diffusion models [12, 13]. They leverage diffusion features to train additional
networks that provide dense predictions. As pre-trained diffusion models
possess powerful generative capability, their internal representation captures
rich semantic information. Thus labels used for training and parameters to
be optimized can be greatly reduced, making their frameworks very efficient.
In order to further diminish dependence on pixel-wise labels, the last group
operates on synthetic data [30, 31]. They synthesize realistic images and
segmentation masks simultaneously and use this automatically constructed
dataset to train segmentation models. The trained model exhibits competitive
performance on real images compared to models trained on real data. Nevertheless,
due to the domain gap between synthetic data and real data, deliberate de-
signs of data processing are required for good performance.
Compared to these segmentation methods that employ diffusion mod-
els, our method is more flexible and does not require dense annotations,
segmentation-specific re-training, or complicated data processing designs.
Thus it only takes 1 second to infer an image in our framework while some
competing methods need minutes to segment one image.

2.3. Composed Image Retrieval

Composed image retrieval is a multi-modal task aimed at retrieving im-


ages through image and text queries [32]. Earlier studies have focused on
synthetic data [32] and fashion products [33], while recent works have taken
open-domain data into consideration [34]. From a methodological perspec-
tive, researchers have explored different ways to fuse multi-modal informa-

tion. They have proposed various mechanisms ranging from affine transfor-
mation [35] to cross-attention [36].
Currently, with the development of vision-and-language pre-trained mod-
els [8], researchers have contemplated applying them to composed image re-
trieval [18, 34, 33, 37]. Among these approaches, [18, 37] adopt early fusion and
thus have more flexible reasoning capabilities than late-fusion methods [34, 33]. We
also adopt early fusion in our proposed method, but we focus more on local-
ization ability, which is valuable for many practical applications. Although
PALAVRA [18] also considered the segmentation task, its experimental re-
sults indicated its failure to distinguish between context and subject. On
the contrary, we design the localization task to rely on the comprehension of
context and subject, and experimental results demonstrate the effectiveness
of our method.

3. Method

3.1. Preliminary

Diffusion models [9, 10] are a class of generative models which learn the
data distribution by progressively denoising from a tractable noise distribu-
tion. They can be interpreted as a sequence of time-conditional denoising
auto-encoders. To reduce resource consumption, Latent Diffusion Model [11]
is proposed which conducts the diffusion process in the latent space obtained
by a perceptual compression model [38]. Additionally, in order to achieve
flexible conditional generation, the denoising U-net backbone [39] is usually
augmented with the cross-attention and self-attention mechanism [36]. With
paired image-condition training data $(x, y)$, the optimization objective of the
denoising model $\epsilon_\theta$ is ordinarily simplified as:

$$\mathcal{L}_{LDM} := \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t \sim \mathcal{U}[1,T]}\big[\lVert \epsilon - \epsilon_\theta(z_t, t, y)\rVert^2\big], \qquad (1)$$

where $\mathcal{E}$ is the encoder of the compression model, $\epsilon$ is the sampled Gaussian
noise added to the image, $t$ represents the current time step, $T$ is the total
number of denoising steps, and $z_t$ is the noisy version of $\mathcal{E}(x)$. More specifically,
$z_t = \sqrt{\bar{\alpha}_t}\,\mathcal{E}(x) + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$, where $\bar{\alpha}_t \in (0, 1)$ is derived
from a predefined variance schedule and decreases as $t$ increases [10].
We adopt an open-sourced text-conditional latent diffusion model named
Stable Diffusion [11] for this work, which is affordable to infer on consumer
GPUs. Its compression model encodes images into a low-dimensional latent
space by sampling based on the predicted mean and standard deviation,
similar to Variational Auto-Encoder (VAE) [40]. y represents text in this
case, and the model is trained on 5 billion image-caption pairs [41]. The
language prompt is encoded by CLIP first and is then injected into the model
through cross-attention layers.
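To make the forward noising step concrete, the snippet below is a minimal sketch of encoding an image into the latent space and forming the noisy latent; it assumes the Hugging Face diffusers interface and the CompVis/stable-diffusion-v1-4 weights, whereas the paper builds directly on the original Stable Diffusion codebase.

```python
# A minimal sketch (assuming the diffusers API; the paper uses the original
# CompVis/stable-diffusion codebase) of encoding an image and adding noise:
# z = E(x), then z_t = sqrt(abar_t) * z + sqrt(1 - abar_t) * eps.
import torch
from diffusers import AutoencoderKL, DDPMScheduler

model_id = "CompVis/stable-diffusion-v1-4"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

image = torch.rand(1, 3, 512, 512) * 2 - 1            # placeholder RGB image in [-1, 1]
with torch.no_grad():
    z = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor  # latent z = E(x)

t = torch.tensor([150])                                # time step used later in Section 5.1
eps = torch.randn_like(z)                              # sampled Gaussian noise
z_t = scheduler.add_noise(z, eps, t)                   # noisy latent z_t
```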

3.2. Attention Mechanism

Previous works have considered exploiting cross-attention and self-attention


layers of text-to-image diffusion models for localization tasks [43, 44]. How-
ever, they often apply these two attention layers separately without fully
leveraging their internal correlations. We instead treat self-attention as the
affinity matrix of different patches [45], which conforms more to its essence
compared to regarding it as the clustering feature [43], as shown in Figure 2.
Specifically, the text condition $y$ is first projected to the CLIP embedding
space as $h \in \mathbb{R}^{l \times d_t}$, where $l$ is the length of tokens and $d_t$ is its dimension.

[Figure 2 rows: chair, dog, boat; columns: Image, Self, Cross, CBP, SelfCross.]

Figure 2: Visualization of correlation maps. The texts on the left are the corresponding
categories. The 2-nd column depicts the spectral clustering result [42] utilizing the self-
attention map, the 3-rd column shows the cross-attention map, the 4-th column displays
the attention score attained after employing the clustering technique in CBP [43], and the
last column shows our final correlation map after propagation. Best viewed in color.

The spatial visual feature at the intermediate layer $n$ of the backbone is denoted by
$f^{(n)} \in \mathbb{R}^{WH^{(n)} \times d_i^{(n)}}$, where $WH^{(n)}$ represents the resolution and
$d_i^{(n)}$ is its dimension. $f^{(n)}$ interacts with $h$ via a cross-attention mechanism,
which can derive a patch-token correlation map $\mathrm{Cross}^{(n)} \in \mathbb{R}^{WH^{(n)} \times l}$ as:

$$\mathrm{Cross}^{(n)} = \mathrm{softmax}\!\left(\frac{Q_{cross}^{(n)} K_{cross}^{(n)T}}{\sqrt{d}}\right), \qquad (2)$$

where $Q_{cross}^{(n)} = f^{(n)} \cdot W_{Qc}^{(n)}$ and $K_{cross}^{(n)} = h \cdot W_{Kc}^{(n)}$. Here
$W_{Qc}^{(n)} \in \mathbb{R}^{d_i^{(n)} \times d}$ and $W_{Kc}^{(n)} \in \mathbb{R}^{d_t \times d}$ are learned
projection matrices, while $d$ is the projection dimension. Following a similar calculation
process, we build the self-attention map as
$\mathrm{Self}^{(n)} = \mathrm{softmax}\big(Q_{self}^{(n)} K_{self}^{(n)T} / \sqrt{d}\big)$, with
$Q_{self}^{(n)} = f^{(n)} \cdot W_{Qs}^{(n)}$ and $K_{self}^{(n)} = f^{(n)} \cdot W_{Ks}^{(n)}$. Here
$W_{Qs}^{(n)} \in \mathbb{R}^{d_i^{(n)} \times d}$ and $W_{Ks}^{(n)} \in \mathbb{R}^{d_i^{(n)} \times d}$ are also
learned projection matrices.

Inspired by [45, 46], we regard self-attention maps as semantic affinity matrices and
propagate cross-attention scores accordingly. Thereby, we aggregate semantic information
from regions with similar appearances. We select a particular layer with a fine resolution
$WH$ to extract the self-attention map $\mathrm{Self} \in \mathbb{R}^{WH \times WH}$ and accumulate
$\mathrm{Cross}^{(n)}$ of different layers via interpolation and averaging to obtain the aggregated
cross-attention map $\mathrm{Cross} \in \mathbb{R}^{WH \times l}$. We then leverage this patch-patch
affinity matrix to refine the patch-token correlation map, which can be formulated as:

$$\mathrm{SelfCross} = \mathrm{Self}^{\,iter} \cdot \mathrm{Cross}, \qquad (3)$$

where $iter$ is the number of refining iterations. The obtained $\mathrm{SelfCross} \in \mathbb{R}^{WH \times l}$
can serve as a substitute for $\mathrm{Cross}$ to complete localization tasks [44]. The originally
predicted correlation between a patch and a token probably carries noise, yet ensembling
within relevant areas yields more stable results. This refinement also reduces prediction
discrepancy within semantically similar regions. As shown in Figure 2, the correlation map
attained after propagation preserves the structure better and possesses finer details.
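The propagation in Equation 3 amounts to repeated matrix multiplication between the patch-patch affinity matrix and the patch-token map; a minimal sketch with illustrative variable names is given below.

```python
# Sketch of Equation 3: propagate cross-attention scores along the self-attention
# affinity, i.e. SelfCross = Self^iter @ Cross.
import torch

def self_cross(self_attn: torch.Tensor, cross_attn: torch.Tensor, iters: int = 2) -> torch.Tensor:
    """self_attn: (WH, WH) self-attention map from one fine-resolution layer.
    cross_attn: (WH, l) cross-attention map averaged over the selected layers
    (each layer interpolated to the same WH resolution beforehand)."""
    out = cross_attn
    for _ in range(iters):
        out = self_attn @ out   # aggregate scores from patches with similar appearance
    return out
```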
To attain the pseudo mask, we leverage the image-level label to identify the categories
contained in the image and use $K$ to represent this set. Regarding class $k \in K$, we
extract the correlation maps of its relevant tokens from the corresponding positions of
$\mathrm{SelfCross}$ and average along the token dimension to obtain the object attribution map
$SC_k \in \mathbb{R}^{WH \times 1}$. We reshape it to $\mathbb{R}^{W \times H}$ and then normalize it so
that values at each position are between 0 and 1:

$$SC_k(x, y) \rightarrow \frac{SC_k(x, y) - \min_{x,y} SC_k(x, y)}{\max_{x,y} SC_k(x, y) - \min_{x,y} SC_k(x, y)}. \qquad (4)$$

We further estimate the attribution map of the background as
$SC_{bg}(x, y) = (1.0 - \max_{k \in K} SC_k(x, y))^2$. By concatenating these attribution maps,
we acquire the final attention map $SC \in \mathbb{R}^{(|K|+1) \times W \times H}$. After upsampling
and refining $SC$, we assign the label for each pixel to the class (including the
background) with the maximum attention score.
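A sketch of this pseudo-mask step under the definitions above (the tensor layout is an assumption for illustration):

```python
# Min-max normalize each class attribution map (Eq. 4), estimate the background map
# SC_bg = (1 - max_k SC_k)^2, and take the per-pixel argmax.
import torch

def pseudo_mask(sc_maps: torch.Tensor) -> torch.Tensor:
    """sc_maps: (|K|, W, H) attribution maps of the classes present in the image."""
    mins = sc_maps.amin(dim=(1, 2), keepdim=True)
    maxs = sc_maps.amax(dim=(1, 2), keepdim=True)
    sc = (sc_maps - mins) / (maxs - mins + 1e-8)          # Eq. 4: values in [0, 1]
    sc_bg = (1.0 - sc.max(dim=0).values) ** 2             # background attribution map
    sc_all = torch.cat([sc_bg.unsqueeze(0), sc], dim=0)   # (|K|+1, W, H), index 0 = background
    return sc_all.argmax(dim=0)                           # per-pixel label (0 = background)
```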

3.3. Text Query

When we want to locate specific objects in the image, we should provide a


text query y to distill the attention value. Regarding the weakly-supervised
semantic segmentation setting, we merge the category names contained in
the image into one sentence. For instance, if one image involves bottles,
chairs, and a sofa, the text prompt will be “A photo including bottle, chair,
and sofa.”. This approach has several advantages. First, it is more time-
efficient than querying each object separately, as only one forward pass is
required. Second, the attention value of different objects will be comparable
due to the normalization operation along the token dimension in Equation 2.
Last, we can conveniently append background prompts at the end of the
sentence, which can alleviate the false-activation issue found in [5]. Moreover,
class names in classic datasets may not fully represent the semantics of the
category. Therefore, we conduct prompt engineering to find synonyms with
richer semantics and replace original category names with them.
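A small sketch of this query construction is shown below; the exact phrasing of the sentence (e.g., whether an "and" is inserted before the last item) is an illustrative assumption, while the "person with clothes" replacement and the background prompt list are taken from the appendix of this paper.

```python
# Sketch of the text-query construction for the weakly-supervised setting.
SYNONYMS = {"person": "person with clothes"}             # replacement used in the paper
BACKGROUND_PROMPTS = ["tree", "river", "sea", "lake", "water",
                      "railway", "railroad", "track", "stone", "rocks"]

def build_query(image_level_labels):
    names = [SYNONYMS.get(c, c) for c in image_level_labels]
    return "A photo including " + ", ".join(names + BACKGROUND_PROMPTS) + "."

print(build_query(["bottle", "chair", "sofa"]))
# "A photo including bottle, chair, sofa, tree, river, ..., stone, rocks."
```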

3.4. Summary

We summarize our plug-and-play method by describing how one image is


processed. Given one image x, we first encode it using E to get the latent z.
We then obtain the noisy latent $z_t = \sqrt{\bar{\alpha}_t}\, z + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$ with a
specific time step $t$ as described in Section 3.1 and input it into the denoising network
$\epsilon_\theta$ along with the text query $y$ constructed as described in Section 3.3. During the
feed-forward calculation of $\epsilon_\theta(z_t, t, y)$, we distill Cross and Self from certain
layers of $\epsilon_\theta$ and acquire SelfCross using Equation 3. Next, we extract the
attribution map of each entity from the corresponding positions of SelfCross and normalize
it using Equation 4. Finally, after estimating the attribution map of the background, we
determine which entity each pixel belongs to by comparing its values on all attribution maps.
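How Cross and Self are distilled from the feed-forward pass is framework-specific; the sketch below shows one possible way to capture the attention probabilities with a custom attention processor, assuming a diffusers UNet2DConditionModel (the paper instead instruments the original CompVis codebase), and it ignores attention masks for brevity.

```python
# Sketch: record softmax(QK^T / sqrt(d)) from the attention layers of a diffusers U-net.
# Layers named "attn1" are self-attention; layers named "attn2" are cross-attention
# with the text tokens.
import torch

class StoreAttnProcessor:
    def __init__(self, store, name):
        self.store, self.name = store, name

    def __call__(self, attn, hidden_states, encoder_hidden_states=None,
                 attention_mask=None, **kwargs):
        context = hidden_states if encoder_hidden_states is None else encoder_hidden_states
        q = attn.head_to_batch_dim(attn.to_q(hidden_states))
        k = attn.head_to_batch_dim(attn.to_k(context))
        v = attn.head_to_batch_dim(attn.to_v(context))
        probs = attn.get_attention_scores(q, k)            # attention probabilities
        self.store.setdefault(self.name, []).append(probs.detach())
        out = attn.batch_to_head_dim(torch.bmm(probs, v))  # standard attention output
        out = attn.to_out[0](out)                          # output projection
        return attn.to_out[1](out)                         # dropout

# Usage sketch:
#   store = {}
#   unet.set_attn_processor({name: StoreAttnProcessor(store, name)
#                            for name in unet.attn_processors})
```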

4. Mug19 Dataset

4.1. Motivation
To investigate the problem of personalized segmentation, we create a
dataset in a laboratory scenario. As mentioned in Section 1, this new task
aims to locate the user-specific instance corresponding to the textual query.
It requires a more refined multi-modal comprehension capability than com-
posed image retrieval [32] and may have potential application perspectives
in products like home robotics. Most existing datasets for similar tasks are designed
for retrieval only [32, 34] and are thus not suitable for us. [18] created a personalized
segmentation benchmark by repurposing a video dataset [47]. However, the temporal
continuity in the video may leak localization information. Besides, the depictions were
not always useful for segmentation, as we found it hard to tell the difference between
similar entities in most images of this dataset.

4.2. Distractors
To mitigate these issues, we build a new dataset on which models with
good image-text reasoning ability would perform better. We consider vari-

[Figure 3 rows: mug_02, scene_32, query "a mug_02 filled with Cola"; mug_17, scene_19, query "a green mug_17".]
Figure 3: Examples in our proposed dataset. The first 2 columns display multi-view
photos of personalized items and the 3-rd column presents the image of different scenes.
The last column shows the highlighted segmentation result along with the text query.

ous ambiguities and ensure that the multi-modal references are necessary and
sufficient to discriminate different objects, which requires a careful arrange-
ment of scenes. As shown in the 1-st row of Figure 3, personalized knowledge
is needed to tell the difference between 2 mugs filled with Cola, while tex-
tual context is required to distinguish between 2 white mugs. These two
situations are termed “semantic distractors” and “visual distractors” in [18].

4.3. Statistics

We choose a typical daily essential “mug” to construct the dataset. The


dataset includes 19 mugs, which compose 47 diverse scenes. We provide 5 to 10 RGB images
of size 640 × 640 for each object. Each scene consists of 3 to 5 mugs and contains 100
RGB images of size 1137 × 640 captured from different viewing angles. We create 2468
segmentation triplets (instances in the scene, description, scene image) based on our
data, and we pick out 2
special groups regarding the degree of ambiguity: semantic distractor and
visual distractor. Scenes in the semantic distractor split include different
items with similar contexts and it consists of 440 triplets. Scenes in the
visual distractor split contain various items with analogous appearances and
this group comprises 170 triplets. The triplets are carefully annotated by
experts and the instance mask labels are acquired by the pre-trained Mask
R-CNN [48] model in the Detectron2 library2 . This dataset will be made
available online upon acceptance.

5. Experiments

We evaluate our framework quantitatively and qualitatively on different


segmentation tasks. We further compare with strong baselines to showcase
our strengths. Lastly, we conduct ablation studies to analyze various designs.
The source code and new dataset are available at https://github.com/
Big-Brother-Pikachu/Text2Mask.

5.1. Weakly-Supervised Semantic Segmentation

Implementation Details. We adopt Stable Diffusion for our experiment. When


processing one test image, we first rescale it to meet the model requirements
and then encode it through E. We add noise to the latent and input it into
the denoising network ϵθ . The conditional language prompt y is constructed
as described in Section 3.3. We choose the time step t of the input noisy
latent zt to be 150 with the total denoising length T = 1000. With one feed-
forward computation, we aggregate Cross from the 4, 5, 6, 7, 8-th layers of ϵθ

2
https://github.com/facebookresearch/detectron2

and distill Self from the 11-th layer of ϵθ. We set iter in Equation 3 to be 2.
Following [5, 6], the obtained segmentation maps are further refined by dense
Conditional Random Fields (CRF) [16] to generate pseudo masks, which are
then used to train a standard segmentation network based on DeepLab [49].
Furthermore, due to the stochastic sampling process in the diffusion frame-
work, we can easily ensemble segmentation masks generated from different
noises. Thus we can sample multiple times for better pseudo masks.
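The snippet below sketches the dense CRF refinement step, assuming the commonly used pydensecrf package; the kernel parameters shown are illustrative defaults rather than the paper's exact settings. One simple way to realize the multi-sample ensemble mentioned above is to average the attribution maps obtained from several noise draws before this refinement.

```python
# Sketch of dense CRF post-processing of the soft attribution maps (pydensecrf assumed).
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image_rgb: np.ndarray, probs: np.ndarray, steps: int = 10) -> np.ndarray:
    """image_rgb: (H, W, 3) uint8 image; probs: (K+1, H, W) per-class scores summing to 1."""
    n_labels, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, n_labels)
    d.setUnaryEnergy(unary_from_softmax(probs))            # unary = -log(prob)
    d.addPairwiseGaussian(sxy=3, compat=3)                 # spatial smoothness kernel
    d.addPairwiseBilateral(sxy=80, srgb=13, compat=10,
                           rgbim=np.ascontiguousarray(image_rgb))
    q = d.inference(steps)
    return np.argmax(np.array(q).reshape(n_labels, h, w), axis=0)   # refined pseudo mask
```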

Computation Consumption. Our method requires no re-training and has


1066.2M frozen parameters in total. It takes an average of 0.88 seconds
to infer an image with a single NVIDIA GeForce RTX 3090 GPU and 24 GB
of memory.

Evaluation. We select Pascal VOC 2012 [19] and Microsoft COCO 2014 [20]
for evaluation. Pascal VOC 2012 is a semantic segmentation dataset with
20 object categories. It consists of 1464 training images, 10582 augmented
training images, 1449 validation images, and 1456 test images. MS COCO
2014 dataset contains 80 object categories and 1 background class. It has
82081 training images and 40137 validation images. In our experiment, only
image-level labels are used to generate pseudo masks. The mean Intersection
over Union (mIoU) value is reported as the evaluation metric.
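For reference, a minimal sketch of the mIoU computation over a set of predicted and ground-truth label maps (void pixels assumed to be labeled 255, as in Pascal VOC):

```python
# Sketch of the mean Intersection over Union (mIoU) metric.
import numpy as np

def mean_iou(preds, gts, n_classes):
    """preds, gts: iterables of (H, W) integer label maps of matching shapes."""
    conf = np.zeros((n_classes, n_classes), dtype=np.int64)
    for p, g in zip(preds, gts):
        valid = g != 255                                   # ignore void pixels
        idx = g[valid].astype(np.int64) * n_classes + p[valid]
        conf += np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = inter / union                                # per-class IoU (nan if class absent)
    return float(np.nanmean(iou))
```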

Quantitative Results. We first evaluate the quality of our pseudo masks on


Pascal VOC 2012. As shown in Table 1, our method outperforms previous
methods by a considerable margin on initial seeds. Without training an extra
affinity network, we can achieve 76.1% mIoU on the train set after dCRF
post-processing. To train the segmentation model, we generate pseudo masks

Method            Initial   dCRF   RW
IRN [46]          48.8      54.3   66.3
M&L [22]          49.6      -      67.0
SC-CAM [50]       50.9      55.3   63.4
SEAM [51]         55.4      56.8   63.6
AdvCAM [52]       55.6      62.1   68.0
CLIMS [5]         56.6      62.4   70.5
RIB [53]          56.5      62.9   70.6
OoD [54]          59.1      65.5   72.1
MCT [55]          61.7      64.5   69.1
CLIP-ES [6]       70.8      75.0   -
Ours (1 time)     72.7      74.4   -
Ours (10 times)   74.2      76.1   -

Table 1: mIoU of pseudo masks on PASCAL VOC 2012 train set. The best results are
in bold. dCRF represents post-processing with dense CRF. RW denotes refining with
trained affinity networks.

Dataset    VOC 2012 trainaug          COCO 2014 train
Method     CLIP-ES [6]    Ours        CLIP-ES [6]    Ours
Initial    65.9           70.3        39.7           43.7
dCRF       68.7           71.7        41.5           45.3

Table 2: mIoU of pseudo masks on the data set used to train DeepLab. The best results
are in bold. dCRF represents post-processing with dense CRF.

on the trainaug set. As shown in Table 2, we obtain the best quality on this set, and
even our initial masks exceed the results of previous methods post-processed with dCRF.
Utilizing these masks, we train segmentation models based on DeepLabV2 and assess them
on the val and test sets. As shown in Table 3, our framework outperforms most previous
methods and we achieve a new state-of-the-art with the ImageNet pre-trained model. As
for MS COCO 2014, we first compare the generated pseudo masks on the
train images. As shown in Table 2, our method produces more precise masks.
After training a DeepLabV2 segmentation model with these pseudo masks,
we can achieve 45.7% mIoU on the val set as shown in Table 4, which is also
a new SOTA. As we adopt the classic DeepLab repository directly without
extensive modifications, the enhancement observed in the DeepLab results

Method Backbone Ver. Val Test

Image-level supervision only.


PSA [45] WR38 V1 61.7 63.7
IRN [46] R50 V2 63.5 64.8
ICD [56] R101 V1‡ 64.1 64.3
SC-CAM [50] R101 V2‡ 66.1 65.9
BES [57] R101 V2‡ 65.7 66.6
ETG [21] R101 V2 66.8 67.6
M&L [22] R101 V2 67.2 69.1
SIPE [58] R101 V2‡ 68.8 69.7
RIB [53] R101 V2 68.3 68.6
AMN [59] R101 V2‡ 70.7 70.6

Image-level supervision + Language supervision.


CLIMS [5] R101 V2 69.3 68.7
CLIMS [5] R101 V2‡ 70.4 70.0
CLIP-ES [6] R101 V2 71.1 71.4
CLIP-ES [6] R101 V2‡ 73.8 73.9
Ours R101 V2 71.2 71.5
Ours R101 V2‡ 73.3 74.2

Table 3: DeepLab results on PASCAL VOC 2012 val and test sets. The best results are

in bold. Ver. denotes the version. ‡ represents adopting COCO pre-trained models.

Method Backbone Sup. Val

ETG [21] R101 I 28.0


IRN [46] R50 I 32.6
IRN [46] R101 I 41.4
SIPE [58] R101 I 40.6
RIB [53] R101 I 43.8
AMN [59] R101 I 44.7
CLIP-ES [6] R101 I+L 45.4
Ours R101 I+L 45.7

Table 4: DeepLab results on MS COCO 2014 val set. The best results are in bold. I
and L represent image-level supervision and language supervision, respectively.

is relatively modest. However, we believe that investing excessive time in


tuning the training process of DeepLab may not be cost-effective. Our plug-
and-play approach inherently holds the advantage of simplicity over most
competing methods, as we don’t require extra optimization processes as in
CLIMS [5]. Besides, generative models are capable of determining where to
locate an entity while discriminative models can only ascertain its existence.
Thus we perform better than the discriminative-model-based method CLIP-
ES [6].

Qualitative Results. We visualize the results of our approach and other multi-
modal related methods [5, 6] in Figure 4. We produce pseudo masks with finer
structures, such as the feather of the bird and the legs of the horse as shown
in the image. Furthermore, we can directly generate the confidence map from
attention scores. If the value of a position is within the 0.05 interval of the

[Figure 4 columns: Image, CLIMS, CLIP-ES, Ours, Ground Truth.]

Figure 4: Visualizations of the pseudo masks generated by various methods. The 1-


st column shows the input image and the last column shows the ground truth mask.
Uncertain pixels are set to white.

Attention   Cross   CBP [43]   SelfCross   Cross+SelfCross
Initial     61.7    64.1       72.7        69.9
dCRF        67.1    69.9       74.4        73.0

Table 5: Attention mechanism analysis on PASCAL VOC 2012 train set. The best results
are in bold.

interface, we set it as uncertain. We find that ambiguous pixels are mainly


concentrated on object boundaries. More results are shown in Appendix.

5.2. Ablation Study

Attention. We verify the effectiveness of our attention mechanism based on


the quality of pseudo masks. As shown in Table 5, Self Cross achieves
remarkably better mIoU than Cross, and a qualitative comparison has been
displayed in Figure 2. A further ensemble of Cross and Self Cross shows no
benefits. We also compare to Controllable Background Preservation (CBP)
from [43], which clusters the self-attention map and assigns labels to each
segment based on the cross-attention value. We achieve better results as our

Comp.   Syn.   BG.     bird   boat   person   train   tvmonitor   mIoU
                       85.6   74.2   38.4     74.1    62.3        72.1
        ✓              87.1   72.8   59.3     73.5    60.6        73.1
               ✓       80.3   71.7   36.1     81.9    66.3        72.7
        ✓      ✓       83.9   71.6   63.5     81.5    67.3        74.3
✓                      85.3   76.4   45.2     75.2    62.8        73.2
✓       ✓              87.0   75.7   56.9     74.7    64.5        73.8
✓              ✓       79.4   77.1   41.0     82.2    66.3        73.1
✓       ✓      ✓       85.0   77.0   59.7     82.5    65.6        74.4

Table 6: Prompt strategy comparison on PASCAL VOC 2012 train set. Comp. denotes
using one composed sentence, Syn. denotes adopting category synonyms, and BG. denotes
appending background prompts. The best results are in bold.

approach conforms more to the essence of these two attention maps.

Prompt. We analyze different strategies for selecting the text query and de-
pict the IoU of some categories along with the overall average in Table 6.
Compared to using one composed sentence as the prompt, applying separate texts to query
multi-label scenes reduces segmentation precision and increases inference time (the number
of pseudo labels generated per second decreases from 1.6 to 1.0 in practice). Then,
following [6], we substitute
the original class name with category synonyms that can reduce ambiguity.
For instance, the high-attention region corresponding to “person” tends to
overlook the body, and adding “clothes” to the query prompt of “person”
can enhance the performance. Furthermore, we append some background
prompts at the end of text queries. With the existing softmax operation in

Equation 2, category scores in co-occurring background regions [5] will be
naturally suppressed. For example, when “railway” and “track” are used
as the background prompts for “train”, the segmentation results will be im-
proved due to the exclusion of these background areas.

5.3. Personalized Referring Image Segmentation

Implementation Details. This task aims to locate personalized instances in


the scene image with the help of textual descriptions. We first leverage
Custom Diffusion [17] to acquire the text embedding of the personalized item.
Given target object images, we optimize the token embedding of the identifier
word “<new1>” along with the diffusion model. Next, we replace “<class>”
in the textual context with “<new1> <class>” as the condition text prompt,
and then attain the object attribution map from Self Cross using the same
hyper-parameters as in Section 5.1. To focus on the foreground region, we
query the diffusion model with “A photo including <class>.” first to get the
object area in the scene. Then we compare the attribution map of different
instances to determine the segmentation results in the object area. We have
found it hard to gain a complete silhouette with the above process. Therefore,
we add a clustering process in advance, so that we only have to focus on
assigning regions instead of pixels. We employ a simple version of spectral
clustering [42] which performs surprisingly well in separating each instance.
Later, we average the attention value in different segments and finally assign
the instance to the area with the highest correlation.
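The sketch below illustrates this cluster-then-assign step, with scikit-learn's spectral clustering as a stand-in for the normalized-cuts variant used in the paper; the variable layout is an assumption for illustration.

```python
# Sketch: cluster foreground patches into segments via spectral clustering on the
# self-attention affinity, then assign each personalized instance to the segment
# with the highest average SelfCross score.
import numpy as np
from sklearn.cluster import SpectralClustering

def assign_instances(self_attn: np.ndarray, instance_maps: np.ndarray, n_segments: int):
    """self_attn: (WH, WH) patch-patch affinity; instance_maps: (N, WH) scores of the
    N personalized instances over the (foreground) patches."""
    affinity = 0.5 * (self_attn + self_attn.T)             # symmetrize for clustering
    labels = SpectralClustering(n_clusters=n_segments,
                                affinity="precomputed").fit_predict(affinity)
    seg_scores = np.array([[m[labels == s].mean() for s in range(n_segments)]
                           for m in instance_maps])        # (N, n_segments)
    return seg_scores.argmax(axis=1)                       # segment index per instance
```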

Computation Consumption. The training process of Custom Diffusion takes


around 6 minutes on 2 A100 GPUs for each instance, and the inference cost

is similar to that in Section 5.1.

Evaluation. We use standard mIoU along with two kinds of accuracy for eval-
uation. Accuracy represents the proportion of correctly predicted instances
to all instances. The ground truth assignment between instances and seg-
ments is determined based on the relative relationship between the center
positions of segments. When we consider each object solely, the segment
most relevant to each object is predicted as its location. bf acc denotes the
accuracy under this protocol. When we consider all the items contained in
the scene together, we treat the correlation between items and segments as
the cost matrix of an assignment problem. The Hungarian algorithm [60] is then
adopted to attain the assignment between instances and segments. af acc
denotes the accuracy under this protocol. Usually, the latter accuracy will
be higher as we compare the correlation values not only between segments
but also between contained instances.
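The two protocols can be summarized with the sketch below, using scipy's Hungarian solver; `score` and `gt` are hypothetical names for the instances-by-segments correlation matrix and the ground-truth segment index of each instance, and the scene is assumed to have at least as many segments as instances.

```python
# Sketch of the bf_acc and af_acc protocols. bf_acc scores each instance independently;
# af_acc solves a joint assignment problem over all instances in the scene.
import numpy as np
from scipy.optimize import linear_sum_assignment

def bf_acc(score: np.ndarray, gt: np.ndarray) -> float:
    return float(np.mean(score.argmax(axis=1) == gt))

def af_acc(score: np.ndarray, gt: np.ndarray) -> float:
    rows, cols = linear_sum_assignment(-score)             # maximize total correlation
    pred = np.empty_like(gt)
    pred[rows] = cols
    return float(np.mean(pred == gt))
```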

Baselines. We compare our method with several strong baselines. First,


we use the feature of Mask R-CNN [48] to calculate the similarity between
queries and instances in the scene. It is worth noting that we have used the
same model to provide instance masks and we aim to evaluate the ability of
this localization model to discriminate similar objects. We also adopt DINO-
ViT, a self-supervised Vision Transformer model, as dense visual descriptors
following [61]. Next, we simply query the image with “a <new1> <class>”
and we call this baseline “Subject Only”. On the contrary, the “Context
Only” baseline stands for not incorporating the identifier embedding in the
textual prompt. These approaches only use uni-modal information. Then
the “Arithmetic” baseline combines subject and context by replacing the

Split              All                      Semantic Distractor      Visual Distractor
Metric             mIoU   bf acc   af acc   mIoU   bf acc   af acc   mIoU   bf acc   af acc
Mask R-CNN [48]    75.1   58.9     75.1     78.2   63.3     78.2     42.9   38.0     42.9
DINO-ViT [61]      52.7   42.3     72.8     70.7   43.3     93.2     39.9   33.7     52.0
Subject Only       62.3   60.1     79.5     76.2   79.3     95.0     31.9   42.0     40.0
Context Only       54.4   51.2     70.9     35.3   40.5     45.1     50.3   42.4     64.3
Arithmetic         28.5   35.1     37.2     27.9   37.7     35.2     55.2   46.7     70.4
Ours               64.9   60.2     83.3     78.8   73.5     98.3     56.8   49.4     71.8

Table 7: Evaluation results on different splits of Mug19 dataset. The best results are in
bold while the second best are underlined.

category embedding in the description with the average between CLIP image
embeddings of the instance and the CLIP text embedding of the class.

Quantitative Results. As shown in Table 7, our method achieves the best


results on the entire data. Mask R-CNN obtains good mIoU because it
knows the exact position of all mugs in advance, but it performs poorly
in distinguishing between various mugs. It is also found that the subject
information is beneficial for excluding semantic distractors while the context
information is useful for dealing with visual distractors, which is in line with
our definition of these two groups. Besides, the average operation can not
fully leverage the information from different modalities, as the “Arithmetic”
baseline behaves similarly to “Context Only” on various splits. We can also
conclude that multi-modal knowledge is vital for this task, as uni-modal
methods have difficulties in handling distinct distractors.

[Figure 5 rows: mug_02 with query "a mug_02 filled with Cola" in scene_32; mug_19 with query "a green mug_19" in scene_19.]

Figure 5: Localization results of different methods on Mug19 dataset samples. The 1-st
column shows the object, the last column shows the scene and the rest columns display
highlighted segmentation masks with the text reference.

Case Study. We select several samples to showcase the deviation of different


approaches. As shown in Figure 5, the “Subject Only” baseline has difficulties in
distinguishing visual distractors while the “Context Only” baseline struggles to
discriminate semantic distractors. More specifically, the former finds another mug
with a similar appearance in the 1-st row, and the latter baseline detects another
“green” mug in the 2-nd row. We also compare our method with the recent work
SEEM [15]. We use its official demo and simultaneously choose the example and text
interactive modes. The result of the 2-nd
case indicates that it sometimes neglects textual information. Compared to
these methods, our approach leverages multi-modal information more effi-
ciently.

6. Limitations and Discussions

We have found that the segmentation results will drop drastically when an
image contains semantically similar objects as shown in Figure 6. The same
phenomenon has also been found in [27] and was termed “Cohyponym En-
tanglement”. Incorporating extra visual knowledge into the learning stage of


Figure 6: Failure cases: when “chair” and “sofa”, “cat” and “dog” are in the same image,
it is difficult for our method to distinguish them.

diffusion models [62] may also help enhance their discriminative abilities. Be-
sides, we have made preliminary attempts to adopt affordance-related texts.
When considering texts with richer semantics like “graspable part” and “cut-
table part”, our framework can provide meaningful masks for images contain-
ing knives. We will further explore how to efficiently equip our framework
with affordance ability in future works.

7. Broader Impacts

The boom in generative models has raised many concerns. The inauthentic
content they create can be misused. We instead pay attention to the
discriminative task and show the potential of these generative models from
another aspect. Our method can further help researchers to understand the
generation process in large models, which will also benefit the detection of
generated content. On the other hand, as we propose a more flexible seg-
mentation technology, surveillance of specific identities becomes easier, which
may be abused to infringe on personal privacy.

8. Conclusion

In this work, we introduce a simple but effective approach to distill the


attention value in text-to-image diffusion models for segmentation. Without
re-training or inference-time optimization, text-related regions can be
located precisely. We first validate the effectiveness of our framework on
weakly-supervised semantic segmentation tasks. To this end, we propose a
novel way to generate query prompts in accordance with image-level labels.
Our framework achieves state-of-the-art performance on PASCAL VOC 2012
and MS COCO 2014 and the validity of our designs is confirmed. To further
verify the role of word-pixel correlation, we introduce a new practical task
named “personalized referring image segmentation” with a new real-world
dataset. Experiments on this task demonstrate that our method possesses a
better multi-modal comprehension ability than several strong baselines.

Acknowledgments

This work is funded by the National Science and Technology Major


Project of China (No. 2022ZD0114903).

Appendix A. Experiment Details

Datasets and Pre-trained Models. The licenses of the datasets and the pre-
trained models are listed here. The PASCAL VOC 2012 dataset [19] is
from http://host.robots.ox.ac.uk/pascal/VOC/index.html. The MS
COCO 2014 dataset [20] is from https://cocodataset.org/#home. The
pre-trained CLIP [8] is from https://github.com/openai/CLIP. The Sta-
ble Diffusion v1.4 model [11] is from https://huggingface.co/CompVis/

stable-diffusion. The DeepLab pre-trained model [49] is from https:
//github.com/kazuto1011/deeplab-pytorch. The Detectron2 pre-trained
model is from https://github.com/facebookresearch/detectron2, and
we select mask rcnn R 50 FPN 3x from official baseline models for its good
performance.

Implementation Details. We use Pytorch [63] for our experiments. Our code-
base builds heavily on https://github.com/CompVis/stable-diffusion.
For Pascal VOC 2012, MS COCO 2014, and Mug19 datasets, we sample 10,
1, and 1 times respectively to generate pseudo masks.

Prompts. We adopt the category synonyms in [6] except for “person”, as we


find “person with clothes” works better than “person with clothes, people,
human”. We use “tree, river, sea, lake, water, railway, railroad, track, stone,
rocks” as the background prompts following [5]. We find the background
prompts in [6] work slightly worse. We attribute it to more background
classes, which leads to more complex query sentences as we append back-
ground prompts to the end of texts.

Training Details of DeepLab. We refer to the config file in https://github.


com/kazuto1011/deeplab-pytorch and https://github.com/CVI-SZU/CLIMS.
We adopt DeepLabV2 and use ResNet-101 as the backbone.

Appendix B. More Experiment Results

Appendix B.1. Weakly-Supervised Semantic Segmentation

Qualitative Results. We show more qualitative comparisons of our generated


pseudo masks in Figure B.7. We can observe that our framework accurately

[Figure B.7 columns: Image, CLIP-ES, Ours, Ground Truth (left group: VOC; right group: COCO).]

Figure B.7: More visualization results on PASCAL VOC 2012 and MS COCO 2014
datasets. Uncertain pixels are set to white.

[Figure B.8 panels: mIoU versus time step, cross layers, self layer, background threshold (with different power exponents), and frames per second, each with and without dCRF.]

Figure B.8: Analyses of different hyper-parameters. The mIoU values are evaluated on
PASCAL VOC 2012 train set. The default setting is 150 time step, 4-8-th cross layer,
11-th self layer, 1.0 background threshold with power 2, and sampling 1 time.

locates objects of different sizes in both simple and complex scenarios.

Hyper-parameters. We analyze the impact of different hyper-parameters and


compare the quality of generated pseudo masks in Figure B.8. The time
step t of the input noisy latent zt has little effect on the result, while the
performance is better when employing coarse inner layers for Cross layers,
where the text condition is more relevant to the layout according to a re-
cent study [64]. On the contrary, higher Self layers boost the performance
more, since deeper layers tend to have a better understanding of appearance
similarity. As for determining background regions, we have found a constant
threshold less effective. We follow [6] with a minus-max-power mechanism
where the background score per pixel is determined by threshold thr and the
maximum category attention value at that pixel. We compare different power
exponents, and the square works best. Lastly, as sampling more times requires more
time, the trade-off between time and performance should be considered. We
sample once for ablation studies and COCO experiments to save time, while
for VOC, we sample 10 times for better performance.
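A one-line sketch of this minus-max-power background score follows; with thr = 1.0 and power = 2 it reduces to the (1 − max_k SC_k)^2 map used in the main text.

```python
# Sketch of the minus-max-power background score analyzed in Figure B.8.
import torch

def background_map(sc: torch.Tensor, thr: float = 1.0, power: int = 2) -> torch.Tensor:
    """sc: (|K|, W, H) normalized class attribution maps with values in [0, 1]."""
    return (thr - sc.max(dim=0).values) ** power
```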

Seeds 40 41 42 43 44 Average

mIoU 74.5 74.0 74.4 74.6 74.4 74.4 ± 0.2

Table B.8: mIoU of pseudo masks on PASCAL VOC 2012 train set with different seeds.

Split                          All                      Semantic Distractor      Visual Distractor
Metric                         mIoU   bf acc   af acc   mIoU   bf acc   af acc   mIoU   bf acc   af acc
PALAVRA [18] + CLIP-ES [6]     26.0   44.5     31.5     2.5    33.0     3.0      11.5   38.8     14.1
Custom [17] + CLIP-ES [6]      36.1   37.6     43.0     33.4   37.1     39.6     53.4   42.0     63.5
PALAVRA [18] + SelfCross       53.1   48.2     69.4     53.6   46.3     68.3     58.2   50.8     74.5
Custom [17] + SelfCross        64.9   60.2     83.3     78.8   73.5     98.3     56.8   49.4     71.8

Table B.9: Evaluation results of various combinations on Mug19 dataset. The best
results are in bold while the second best results are underlined. Custom + Self Cross is
the combination we ultimately adopt.

Different Seeds. As we can obtain better performance when sampling several


times, it would be interesting to probe the performance changes with differ-
ent seeds. We run 5 experiments to generate pseudo masks on PASCAL
VOC 2012 train set with different seeds, and we only sample once for each
time. As shown in Table B.8, the results are stable, which indicates that the
improvements brought by the ensemble operation may come from its ability
to make the results in uncertain regions more reliable.

Appendix B.2. Personalized Referring Image Segmentation

Combinations of Personalized and Localized Methods. We conduct a study


concerning distinct combinations of personalized and localized approaches.
We adopt discriminative model-based methods: PALAVRA [18] for personal-
ization, CLIP-ES [6] for segmentation, and generative model-based methods:

Custom Diffusion [17] for personalization, our framework for segmentation.
As shown in Table B.9, CLIP-ES has difficulties dealing with personalized
items. We attribute it to the instability of Grad-CAM [65] as we occasion-
ally get all zero CAMs for specific objects. In contrast, our mechanism is
robust to various personalized embeddings and we ultimately choose Custom
Diffusion because it performs better and is more consistent with our method.

Objects in the Same Series. We further construct a new split from Mug19
dataset regarding series. We build a scenario where we want to locate the
object with only access to images of its variants within the same series.
For instance, we aim to find the red transparent plastic mug in the scene,
but we only possess photos of its blue variant. We handle it by using the
identifier embedding of the blue variant along with the context “a red mug”
as the segmentation query. We pick out 632 triplets based on this notion and
name them “variant split” as a whole. Quantitative comparisons of various
baselines are conducted on this split in Table B.10. Our method fully utilizes
both subject and context information and addresses the localization issue of
this situation more effectively than all the baselines.

Appendix C. Comparisons with SAM

Drawing inspiration from Large Language Models (LLM), foundation


models for segmentation have been developed recently [66, 15]. They can
solve universal segmentation tasks with prompts of various modalities. How-
ever, this series of methods need dense annotations while our framework only
requires image-level labels.

Split Variant
Metric mIoU bf acc af acc
Mask R-CNN [48] 71.9 59.5 71.9
DINO-ViT [61] 48.5 44.7 66.7
Subject Only 61.6 54.9 78.8
Context Only 54.7 52.6 71.1
Arithmetic 36.5 37.3 47.1
Ours 69.2 60.7 88.3

Table B.10: Evaluation results on the variant split of Mug19 dataset. The best results
are in bold while the second best results are underlined.

Appendix D. More Discussions on Limitations

First, we further analyze the situation where an image contains semanti-


cally similar objects. We compare the textual embeddings of different entities
in one image, and choose images in VOC train set that contain at least 2
classes to calculate the results. For each image, we compute the average
cosine similarity between different classes and record the mIoU. As shown in
Figure D.9, the more similar different entities in one image, the worse the
segmentation results of our method, which is consistent with our finding in
the Limitations and Discussions section. The p-value of rejecting the null
hypothesis that the slope is zero is 0.0049. Next, we discuss the limitations
of our proposed task. We care more about the appearance of different per-
sonalized entities, and our descriptions currently focus more on what they
look like. As CLIP, the text encoder of Stable Diffusion, struggles with un-
derstanding spatial relations [67], we doubt our framework’s ability on them

Figure D.9: The correlation between segmentation results and textual embedding simi-
larity. The r-value is -0.73.

and have not yet considered them in our dataset. In future work, we will
add this type of description to our dataset and consider how to enhance our
method’s ability to comprehend them.

References

[1] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for


semantic segmentation, in: CVPR, 2015, pp. 3431–3440.

[2] G. Wang, W. Li, M. Aertsen, J. Deprest, S. Ourselin, T. Vercauteren,


Aleatoric uncertainty estimation with test-time augmentation for medi-
cal image segmentation with convolutional neural networks, Neurocom-
puting 338 (2019) 34–45.

[3] T. Do, A. Nguyen, I. D. Reid, Affordancenet: An end-to-end deep

learning approach for object affordance detection, in: ICRA, 2018, pp.
5882–5889.

[4] W. Li, V. Mahadevan, N. Vasconcelos, Anomaly detection and localiza-


tion in crowded scenes, IEEE Trans. Pattern Anal. Mach. Intell. (2014).

[5] J. Xie, X. Hou, K. Ye, L. Shen, CLIMS: cross language image matching
for weakly supervised semantic segmentation, in: CVPR, 2022, pp.
4483–4492.

[6] Y. Lin, M. Chen, W. Wang, B. Wu, K. Li, B. Lin, H. Liu, X. He,


CLIP is also an efficient segmenter: A text-driven approach for weakly
supervised semantic segmentation, in: CVPR, 2023, pp. 15305–15314.

[7] Y. Rao, W. Zhao, G. Chen, Y. Tang, Z. Zhu, G. Huang, J. Zhou,


J. Lu, Denseclip: Language-guided dense prediction with context-aware
prompting, in: CVPR, 2022, pp. 18082–18091.

[8] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal,


G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever,
Learning transferable visual models from natural language supervision,
in: ICML, 2021, pp. 8748–8763.

[9] Y. Song, S. Ermon, Generative modeling by estimating gradients of the


data distribution, in: NeurIPS, 2019, p. 11918–11930.

[10] J. Ho, A. Jain, P. Abbeel, Denoising diffusion probabilistic models, in:


NeurIPS, 2020, pp. 6840–6851.

[11] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-
resolution image synthesis with latent diffusion models, in: CVPR,
2022, pp. 10684–10695.

[12] D. Baranchuk, A. Voynov, I. Rubachev, V. Khrulkov, A. Babenko,


Label-efficient semantic segmentation with diffusion models, in: ICLR,
2022.

[13] J. Xu, S. Liu, A. Vahdat, W. Byeon, X. Wang, S. D. Mello, Open-


vocabulary panoptic segmentation with text-to-image diffusion models,
in: CVPR, 2023, pp. 2955–2966.

[14] R. Burgert, K. Ranasinghe, X. Li, M. S. Ryoo, Peekaboo: Text to image


diffusion models are zero-shot segmentors, 2022. arXiv:2211.13224.

[15] X. Zou, J. Yang, H. Zhang, F. Li, L. Li, J. Wang, L. Wang, J. Gao, Y. J.


Lee, Segment everything everywhere all at once, in: NeurIPS, 2023, pp.
19769–19782.

[16] P. Krähenbühl, V. Koltun, Efficient inference in fully connected crfs


with gaussian edge potentials, in: NeurIPS, 2011, p. 109–117.

[17] N. Kumari, B. Zhang, R. Zhang, E. Shechtman, J.-Y. Zhu, Multi-


concept customization of text-to-image diffusion, in: CVPR, 2023, pp.
1931–1941.

[18] N. Cohen, R. Gal, E. A. Meirom, G. Chechik, Y. Atzmon, ”this is my


unicorn, fluffy”: Personalizing frozen vision-language representations,
in: ECCV, 2022, p. 558–577.

[19] M. Everingham, L. V. Gool, C. K. I. Williams, J. M. Winn, A. Zisser-
man, The pascal visual object classes (VOC) challenge, Int. J. Comput.
Vis. (2010).

[20] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan,


P. Dollár, C. L. Zitnick, Microsoft COCO: common objects in context,
in: ECCV, 2014, p. 740–755.

[21] Y. Chong, X. Chen, Y. Tao, S. Pan, Erase then grow: Generating cor-
rect class activation maps for weakly-supervised semantic segmentation,
Neurocomputing 453 (2021) 97–108.

[22] J. Li, Z. Jie, X. Wang, Y. Zhou, L. Ma, J. Jiang, Weakly supervised


semantic segmentation via self-supervised destruction learning, Neuro-
computing 561 (2023) 126821.

[23] B. Zhou, A. Khosla, À. Lapedriza, A. Oliva, A. Torralba, Learning deep


features for discriminative localization, in: CVPR, 2016, pp. 2921–2929.

[24] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed


representations of words and phrases and their compositionality, in:
NeurIPS, 2013, pp. 3111–3119.

[25] M. Bucher, T. Vu, M. Cord, P. Pérez, Zero-shot semantic segmentation,


in: NeurIPS, 2019, pp. 466–477.

[26] A. Zareian, K. D. Rosa, D. H. Hu, S. Chang, Open-vocabulary object


detection using captions, in: CVPR, 2021, pp. 14393–14402.

[27] R. Tang, A. Pandey, Z. Jiang, G. Yang, K. V. S. M. Kumar, J. Lin,
F. Ture, What the daam: Interpreting stable diffusion using cross at-
tention, in: ACL, 2022, pp. 5644–5659.

[28] A. C. Li, M. Prabhudesai, S. Duggal, E. Brown, D. Pathak, Your diffu-


sion model is secretly a zero-shot classifier, in: ICCV, 2023, pp. 2206–
2217.

[29] J. Wolleb, R. Sandkühler, F. Bieder, P. Valmaggia, P. C. Cattin, Dif-


fusion models for implicit image segmentation ensembles, in: MIDL,
2022, pp. 1336–1348.

[30] Z. Li, Q. Zhou, X. Zhang, Y. Zhang, Y. Wang, W. Xie, Open-vocabulary


object segmentation with diffusion models, in: ICCV, 2023, pp. 7667–
7676.

[31] W. Wu, Y. Zhao, M. Z. Shou, H. Zhou, C. Shen, Diffumask: Synthesiz-


ing images with pixel-level annotations for semantic segmentation using
diffusion models, in: ICCV, 2023, pp. 1206–1217.

[32] N. Vo, L. Jiang, C. Sun, K. Murphy, L. Li, L. Fei-Fei, J. Hays, Com-


posing text and image for image retrieval - an empirical odyssey, in:
CVPR, 2019, pp. 6439–6448.

[33] S. Goenka, Z. Zheng, A. Jaiswal, R. Chada, Y. Wu, V. Hedau, P. Natara-


jan, Fashionvlp: Vision language transformer for fashion retrieval with
feedback, in: CVPR, 2022, pp. 14085–14095.

[34] Z. Liu, C. R. Opazo, D. Teney, S. Gould, Image retrieval on real-life

images with pre-trained vision-and-language models, in: ICCV, 2021,
pp. 2105–2114.

[35] E. Perez, F. Strub, H. de Vries, V. Dumoulin, A. C. Courville, Film:


Visual reasoning with a general conditioning layer, in: AAAI, 2018, pp.
3942–3951.

[36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,


L. Kaiser, I. Polosukhin, Attention is all you need, in: NeurIPS, 2017,
pp. 5998–6008.

[37] K. Saito, K. Sohn, X. Zhang, C. Li, C. Lee, K. Saenko, T. Pfister,


Pic2word: Mapping pictures to words for zero-shot composed image
retrieval, in: CVPR, 2023, pp. 19305–19314.

[38] P. Esser, R. Rombach, B. Ommer, Taming transformers for high-


resolution image synthesis, in: CVPR, 2021, pp. 12873–12883.

[39] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks


for biomedical image segmentation, in: MICCAI, 2015, pp. 234–241.

[40] D. P. Kingma, M. Welling, Auto-encoding variational bayes, in: ICLR,


2014.

[41] C. Schuhmann, R. Beaumont, R. Vencu, C. W. Gordon, R. Wight-


man, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman,
P. Schramowski, S. R. Kundurthy, K. Crowson, L. Schmidt, R. Kacz-
marczyk, J. Jitsev, LAION-5b: An open large-scale dataset for training
next generation image-text models, in: NeurIPS Datasets and Bench-
marks Track, 2022, pp. 25278–25294.

[42] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Trans.
Pattern Anal. Mach. Intell. (2000).

[43] O. Patashnik, D. Garibi, I. Azuri, H. Averbuch-Elor, D. Cohen-Or, Lo-


calizing object-level shape variations with text-to-image diffusion mod-
els, in: ICCV, 2023, pp. 23051–23061.

[44] A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, D. Cohen-


or, Prompt-to-prompt image editing with cross-attention control, in:
ICLR, 2023.

[45] J. Ahn, S. Kwak, Learning pixel-level semantic affinity with image-level


supervision for weakly supervised semantic segmentation, in: CVPR,
2018, pp. 4981–4990.

[46] J. Ahn, S. Cho, S. Kwak, Weakly supervised learning of instance seg-


mentation with inter-pixel relations, in: CVPR, 2019, pp. 2209–2218.

[47] N. Xu, L. Yang, Y. Fan, J. Yang, D. Yue, Y. Liang, B. L. Price, S. Co-


hen, T. S. Huang, Youtube-vos: Sequence-to-sequence video object
segmentation, in: ECCV, 2018, pp. 603–619.

[48] K. He, G. Gkioxari, P. Dollár, R. B. Girshick, Mask r-cnn, IEEE Trans.


Pattern Anal. Mach. Intell. 42 (2017) 386–397.

[49] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. L. Yuille, Deeplab:


Semantic image segmentation with deep convolutional nets, atrous con-
volution, and fully connected crfs, IEEE Trans. Pattern Anal. Mach.
Intell. (2018).

[50] Y. Chang, Q. Wang, W. Hung, R. Piramuthu, Y. Tsai, M. Yang,
Weakly-supervised semantic segmentation via sub-category exploration,
in: CVPR, 2020, pp. 8988–8997.

[51] Y. Wang, J. Zhang, M. Kan, S. Shan, X. Chen, Self-supervised equivari-


ant attention mechanism for weakly supervised semantic segmentation,
in: CVPR, 2020, pp. 12272–12281.

[52] J. Lee, E. Kim, S. Yoon, Anti-adversarially manipulated attributions for


weakly and semi-supervised semantic segmentation, in: CVPR, 2021,
pp. 4071–4080.

[53] J. Lee, J. Choi, J. Mok, S. Yoon, Reducing information bottleneck


for weakly supervised semantic segmentation, in: NeurIPS, 2021, pp.
27408–27421.

[54] J. Lee, S. J. Oh, S. Yun, J. Choe, E. Kim, S. Yoon, Weakly supervised


semantic segmentation using out-of-distribution data, in: CVPR, 2022,
pp. 16876–16885.

[55] L. Xu, W. Ouyang, M. Bennamoun, F. Boussaïd, D. Xu, Multi-class


token transformer for weakly supervised semantic segmentation, in:
CVPR, 2022, pp. 4300–4309.

[56] J. Fan, Z. Zhang, C. Song, T. Tan, Learning integral objects with intra-
class discriminator for weakly-supervised semantic segmentation, in:
CVPR, 2020, pp. 4282–4291.

[57] L. Chen, W. Wu, C. Fu, X. Han, Y. Zhang, Weakly supervised semantic


segmentation with boundary exploration, in: ECCV, 2020, pp. 347–362.

[58] Q. Chen, L. Yang, J. Lai, X. Xie, Self-supervised image-specific pro-
totype exploration for weakly supervised semantic segmentation, in:
CVPR, 2022, pp. 4278–4288.

[59] M. Lee, D. Kim, H. Shim, Threshold matters in WSSS: manipulating


the activation for the robust and accurate segmentation model against
thresholds, in: CVPR, 2022, pp. 4320–4329.

[60] H. W. Kuhn, The hungarian method for the assignment problem, Naval
Research Logistics (NRL) (1955).

[61] S. Amir, Y. Gandelsman, S. Bagon, T. Dekel, Deep vit features as dense


visual descriptors, ECCVW What is Motion For? (2022).

[62] Z. Feng, Z. Zhang, X. Yu, Y. Fang, L. Li, X. Chen, Y. Lu, J. Liu,


W. Yin, S. Feng, Y. Sun, L. Chen, H. Tian, H. Wu, H. Wang, Ernie-vilg
2.0: Improving text-to-image diffusion model with knowledge-enhanced
mixture-of-denoising-experts, in: CVPR, 2023, pp. 10135–10145.

[63] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan,


T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf,
E. Z. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy,
B. Steiner, L. Fang, J. Bai, S. Chintala, Pytorch: An imperative style,
high-performance deep learning library, in: NeurIPS, 2019, pp. 8024–
8035.

[64] A. Voynov, Q. Chu, D. Cohen-Or, K. Aberman, P+: extended textual


conditioning in text-to-image generation, 2023. arXiv:2303.09522.

[65] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra,
Grad-cam: Visual explanations from deep networks via gradient-based
localization, in: ICCV, 2017, pp. 618–626.

[66] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson,


T. Xiao, S. Whitehead, A. C. Berg, W. Lo, P. Dollár, R. B. Girshick,
Segment anything, 2023. arXiv:2304.02643.

[67] N. Liu, S. Li, Y. Du, J. Tenenbaum, A. Torralba, Learning to compose


visual relations, in: NeurIPS, 2021, pp. 23166–23178.
