
DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models

Weijia Wu1,3, Yuzhong Zhao2, Mike Zheng Shou3*, Hong Zhou1*, Chunhua Shen1,4
1 Zhejiang University   2 University of Chinese Academy of Sciences   3 National University of Singapore   4 Ant Group

arXiv:2303.11681v4 [cs.CV] 21 Jan 2024

(a) Synthesizing Images with Pixel-level Annotations for Semantic Segmentation
(b) Open-Vocabulary Image and Semantic Mask Generation (prompts: 'A photograph of Eiffel Tower', 'A painting of a highly detailed Ultraman', 'A road sign shows Mask')
Figure 1 – DiffuMask synthesizes photo-realistic images and high-quality mask annotations by exploiting the attention maps of the diffusion model. Without human effort for localization, DiffuMask is capable of producing high-quality semantic masks.

Abstract

Collecting and annotating images with pixel-wise labels is time-consuming and laborious. In contrast, synthetic data can be generated freely with a generative model (e.g., DALL-E, Stable Diffusion). In this paper, we show that it is possible to automatically obtain accurate semantic masks of synthetic images generated by the off-the-shelf Stable Diffusion model, which uses only text-image pairs during training. Our approach, termed DiffuMask, exploits the potential of the cross-attention map between text and image, which makes it natural and seamless to extend text-driven image synthesis to semantic mask generation. DiffuMask uses text-guided cross-attention information to localize class/word-specific regions, which are combined with practical techniques to create high-resolution and class-discriminative pixel-wise masks. The method helps to significantly reduce data collection and annotation costs. Experiments demonstrate that existing segmentation methods trained on the synthetic data of DiffuMask achieve performance competitive with their counterparts trained on real data (VOC 2012, Cityscapes). For some classes (e.g., bird), DiffuMask presents promising performance, close to the state-of-the-art result on real data (within a 3% mIoU gap). Moreover, in the open-vocabulary segmentation (zero-shot) setting, DiffuMask achieves new state-of-the-art results on the Unseen classes of VOC 2012. The project website can be found at DiffuMask.

∗ Corresponding author
1. Introduction

Semantic segmentation is a fundamental task in vision, and existing data-hungry semantic segmentation models usually require a large amount of data with pixel-level annotations to achieve significant progress. Unfortunately, pixel-wise mask annotation is a labor-intensive and expensive process. For example, labeling a single semantic urban image in Cityscapes [14] can take up to 60 minutes, underscoring the level of difficulty involved in this task. Additionally, in some cases it may be challenging or even impossible to collect images due to privacy and copyright constraints. To reduce the cost of annotation, weakly-supervised learning has become a popular approach in recent years. This approach involves training strong segmentation models using weak or cheap labels, such as image-level labels [2, 33, 59, 61, 51, 52], points [3], scribbles [37, 63], and bounding boxes [34]. Although these methods are free of pixel-level annotations, they still suffer from several disadvantages, including low accuracy, complex training strategies, indispensable extra annotation cost (e.g., edges), and image collection cost.

Figure 2 – Cross-attention maps of a text-conditioned diffusion model (i.e., Stable Diffusion [49]). Prompt: 'a horse on the grass'. (a) Cross-attention maps of different text tokens ('a', 'horse', 'on', 'the', 'grass'). (b) Cross-attention maps at different resolutions (8×8, 16×16, 32×32, 64×64) and the average map. (c) Binarization masks with different thresholds γ (0.25, 0.3, 0.35, 0.4, 0.45) in Equ. (3).

With the great development of computer graphics (e.g., generative models), an alternative is to utilize synthetic data, which is largely available from the virtual world and for which pixel-level ground truth can be generated freely and automatically. DatasetGAN [65] first exploits the feature space of a trained GAN and trains a shallow decoder to produce pixel-level labeling. BigDatasetGAN [35] extends DatasetGAN to handle the large class diversity of ImageNet. However, both methods suffer from certain drawbacks: the need for a small number of pixel-level labeled examples to generalize to the rest of the latent space, and suboptimal performance due to imprecise generative masks.

Recently, large-scale language-image generation (LLIG) models, such as DALL-E [48] and Stable Diffusion [49], have shown phenomenal generative semantic and compositional power, as shown in Fig. 1. Given one language description, a text-conditioned image generation model can create the corresponding semantic things and stuff, where visual and textual embeddings are fused using spatial cross-attention. We dive deep into the cross-attention layers and explore how they affect the generated semantic objects and the structure of the image. We find that the cross-attention maps are the core component binding visual pixels and the text tokens of the prompt. Moreover, the cross-attention maps contain rich class (text token) discriminative spatial localization information, which critically affects the generated image.

Can the attention map be used as mask annotation? Consider semantic segmentation [19, 14]: a 'good' pixel-level semantic mask annotation should satisfy two conditions: (a) class-discriminative (i.e., localize and distinguish the categories in the image); (b) high-resolution, precise masks (i.e., capture fine-grained detail). Fig. 2b presents a visualization of cross-attention maps between text tokens and vision. Four different resolutions, 8×8, 16×16, 32×32, and 64×64, are extracted from different layers of the U-Net of Stable Diffusion [49]. The 8×8 feature map has the lowest resolution and contains obvious class-discriminative localization. The 32×32 and 64×64 feature maps are high-resolution and highlight fine-grained details. The average map shows the possibility of using it for semantic segmentation, as it is both class-discriminative and fine-grained. To further validate the potential of the attention map for the generative task, we convert the probability map to a binary map with fixed thresholds γ and refine it with Dense CRF [31], as shown in Fig. 2c. With the 0.35 threshold, the mask presents excellent precision on fine-grained details (e.g., the foot and ear of the 'horse').

Based on the above observation, we present DiffuMask, an automatic procedure to generate massive high-quality images with pixel-level semantic masks. Unlike DatasetGAN [65] and BigDatasetGAN [35], DiffuMask does not require any pixel-level annotations. The approach takes full advantage of powerful zero-shot text-to-image generative models such as Stable Diffusion [49], which are trained on web-scale image-text pairs. DiffuMask mainly includes two advantages for two challenges: 1) Precise Mask: an adaptive binarization threshold is proposed to convert the probability map (attention map) into a binary map used as the mask annotation; besides, noise learning [44, 56] is used to filter noisy labels. 2) Domain Gap: retrieval-based prompts (various and verisimilar prompt guidance) and data augmentations (e.g., Splicing [7]), as two effective solutions, are designed to reduce the domain gap by enhancing the diversity of the data. With the above advantages, DiffuMask can generate unlimited images with pixel-level annotations for any class without human effort. These synthetic data can then be used for training any semantic segmentation architecture (e.g., Mask2former [11]), replacing real data.

To summarize, our contributions are three-fold:

• We show a novel insight that it is possible to automatically obtain synthetic images and mask annotations from a text-supervised pre-trained diffusion model.
• We present DiffuMask, an automatic procedure to generate massive images and pixel-level semantic annotations without human effort or any manual mask annotation, which exploits the potential of the cross-attention map between text and image.
• Experiments demonstrate that segmentation methods trained on DiffuMask perform competitively with those trained on real data, e.g., VOC 2012. For some classes, e.g., dog, the performance is close to that of training with real data (within a 3% gap). Moreover, in the open-vocabulary segmentation (zero-shot) setting, DiffuMask achieves new SOTA results on the Unseen classes of VOC 2012.
2. Related Work

Reducing Annotation Cost. Various ways can be explored to reduce segmentation data cost, including interactive human-in-the-loop annotation [1, 39], nearest-neighbor mask transfer [26], or weak/cheap mask annotation supervision at different levels, such as image-level labels [2, 33, 59, 61, 51, 52], points [3], scribbles [37, 63], and bounding boxes [34, 9, 32]. Among the above related works, image-level label supervised learning [51, 52] presents the lowest cost, but its performance is unacceptable. Bounding-box supervision [9, 32] usually achieves performance competitive with pixel-wise supervised methods, but its annotation cost is the most expensive. By comparison, synthetic data presents many advantages, including lower data cost without image collection and unlimited availability for enhancing the diversity of the data.

Image Generation. Image generation is a basic and challenging task in computer vision. There are several mainstream methods for the task, including Generative Adversarial Networks (GAN) [23], Variational Autoencoders (VAE) [30], flow-based models [18], and Diffusion Probabilistic Models (DM) [55, 49, 24]. Recently, the diffusion model has drawn a lot of attention due to its impressive performance. GLIDE [43] used a pre-trained language model (CLIP [47]) and a cascaded diffusion structure for text-to-image generation. Similarly, DALL-E 2 [48] of OpenAI and Imagen [53] obtain the corresponding text embedding with CLIP and adopt a similar hierarchical structure to generate images. To increase accessibility and reduce significant resource consumption, Stable Diffusion [49] of Stability AI introduced a novel direction in which the model diffuses in the VAE latent space instead of pixel space.

Synthetic Dataset Generation. Prior works [29, 16] for dataset synthesis mainly utilize 3D scene graphs to render images and their labels. 2D methods, i.e., Generative Adversarial Networks (GAN) [23], are mainly used to solve the domain adaptation task [13], leveraging image-to-image translation to reduce the domain gap. Recently, inspired by the success of generative models (e.g., DALL-E 2, Stable Diffusion), some works further explore the potential of synthetic data to replace real data as the training data for many downstream tasks, including image classification [28, 6], object detection [60, 42, 21, 20, 67, 66], image segmentation [35, 65, 36], and 3D rendering [64, 46]. DatasetGAN [65] utilized a few labeled real images to train a segmentation mask decoder, leading to an unlimited synthetic image and mask generator. Based on DatasetGAN, BigDatasetGAN [35] scales the class diversity to ImageNet size, generating 1k classes with 5 manually annotated images per class. With Stable Diffusion and a Mask R-CNN pre-trained on the COCO dataset, Li et al. [36] design and train a grounding module to generate images and segmentation masks. Different from the above methods, we go one step further and synthesize accurate semantic labels by exploiting the potential of the cross-attention map between text and image. One significant advantage of DiffuMask is that it does not require any manual localization annotations (i.e., boxes and masks) and relies only on text supervision.

3. Methodology

In this paper, we explore simultaneously generating images and the semantic mask described in the text prompt with an existing pre-trained diffusion model. The synthetic data are then used to train existing segmentation methods, which are applied to real images. The core is to exploit the potential of the cross-attention map in the generative model and to address the domain gap between synthetic and real data, providing corresponding new insights, solutions, and analysis. We introduce the preliminary of cross-attention in Sec. 3.1, mask generation and refinement with the cross-attention map in text-conditioned diffusion models in Sec. 3.2, data diversity enhancement with prompt engineering in Sec. 3.4, and data augmentation in Sec. 3.5.

3.1. Cross-Attention of Text-Image

Text-guided generative models (e.g., Imagen [53], Stable Diffusion [49]) use a text prompt P to guide the generation of a content-related image I from random Gaussian noise z, where visual and textual embeddings are fused using spatial cross-attention. Specifically, Stable Diffusion [49] consists of a text encoder, a variational autoencoder (VAE), and a U-shaped network [50]. The interaction between text and vision occurs in the U-Net on the latent vectors at each time step, where cross-attention layers are used to fuse the embeddings of the visual and textual features and produce spatial attention maps for each textual token.
Formally, for step t, the visual features of the noisy image φ(z_t) ∈ R^{H×W×C} are flattened and linearly projected into a Query matrix Q = ℓ_Q(φ(z_t)). The text prompt P is projected into the textual embedding τ_θ(P) ∈ R^{N×d} (N refers to the sequence length of text tokens and d is the latent projection dimension) with the text encoder τ_θ, and is then mapped into a Key matrix K = ℓ_K(τ_θ(P)) and a Value matrix V = ℓ_V(τ_θ(P)) via learned projections ℓ_Q, ℓ_K, ℓ_V. The cross-attention maps can be calculated by:

A = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right),   (1)

where A ∈ R^{H×W×N} (after re-shaping). For the j-th text token, e.g., horse in Fig. 2a, the corresponding weight A_j ∈ R^{H×W} on the visual map φ(z_t) can be obtained. Finally, the output of the cross-attention is obtained as φ̂(z_t) = AV, which is then used to update the spatial features φ(z_t).
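The following is a minimal PyTorch-style sketch of Eq. (1) for a single U-Net layer and diffusion step. The projection names (to_q, to_k), the feature dimensions, and the token index used at the end are illustrative assumptions, not Stable Diffusion's actual module names.

```python
import torch
import torch.nn.functional as F

def cross_attention_maps(visual_feats, text_emb, to_q, to_k):
    """Compute per-token cross-attention maps A (Eq. 1).

    visual_feats: (H*W, C) flattened U-Net features phi(z_t) at one layer/step.
    text_emb:     (N, d_text) token embeddings tau_theta(P) from the text encoder.
    to_q, to_k:   learned linear projections l_Q, l_K mapping both inputs to dim d.
    Returns A with shape (H*W, N); column j is the spatial map of the j-th token.
    """
    Q = to_q(visual_feats)                      # (H*W, d)
    K = to_k(text_emb)                          # (N, d)
    d = Q.shape[-1]
    logits = Q @ K.transpose(0, 1) / d ** 0.5   # (H*W, N)
    return F.softmax(logits, dim=-1)            # softmax over text tokens

# Toy usage with random tensors standing in for real U-Net features.
H, W, C, N, d = 16, 16, 320, 8, 64
to_q = torch.nn.Linear(C, d, bias=False)
to_k = torch.nn.Linear(768, d, bias=False)
A = cross_attention_maps(torch.randn(H * W, C), torch.randn(N, 768), to_q, to_k)
horse_map = A[:, 2].reshape(H, W)   # e.g. the map of the token 'horse' (index assumed)
```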
Figure 3 – Relationship between mask quality (IoU) and threshold for various categories (Horse, Bird, Bottle, Dog, Cat; selection with AffinityNet). 1k generative images are used for each class from Stable Diffusion [49]. Mask2former [11] pre-trained on Pascal-VOC 2012 [19] is used to generate the ground truth. The optimal threshold of different classes is usually different.

3.2. Mask Generation and Refinement

Based on Equ. (1), we can obtain the corresponding cross-attention map A_j^{s,t}. Here s denotes the attention map from the s-th layer of the U-Net, corresponding to four different resolutions, i.e., 8×8, 16×16, 32×32, and 64×64, as shown in Fig. 2b, and t denotes the t-th diffusion step (time). The average cross-attention map is then calculated by aggregating the multi-layer and multi-step attention maps as follows:

\hat{A}_j = \frac{1}{S \cdot T} \sum_{s \in S,\, t \in T} \frac{A_j^{s,t}}{\max(A_j^{s,t})},   (2)

where S and T refer to the total number of steps and the number of layers (i.e., four for the U-Net). The normalization is necessary because the value of the attention map from the output of the Softmax is not a probability between 0 and 1.

3.2.1 Standard Binarization

Given an average attention map (a probability map) M ∈ R^{H×W} for the j-th text token produced by the cross-attention in Equ. (1), it is essential to convert it to a binary map, where pixels with value 1 form the foreground region (e.g., 'horse'). Usually, as shown in Fig. 2c, the simplest solution for the binarization process is to use a fixed threshold value γ and refine with DenseCRF [31] (a local relationship defined by the color and distance of pixels) as follows:

B = \mathrm{DenseCRF}\left(\operatorname{argmax}\,[\gamma;\ \hat{A}_j]\right).   (3)

The above method is neither practical nor effective, because the optimal threshold of each image and each category is not exactly the same. To explore the relationship between threshold and binary mask quality, we set up a simple analysis experiment. Stable Diffusion [49] is used to generate 1k images and corresponding attention maps for each class. The prediction of Mask2former [11] pre-trained on Pascal-VOC 2012 is adopted as the ground truth to calculate the mask quality (mIoU), as shown in Fig. 3. The optimal threshold of different classes is usually different, e.g., around 0.48 for the 'Bottle' class, different from that (i.e., around 0.39) of the 'Dog' class. To achieve the best mask quality, an adaptive threshold is a feasible solution for the varying binarization of each image and class.

3.2.2 Adaptive Threshold for Binarization

It is challenging to determine the optimal threshold for binarizing the probability maps because of the variation in shape and region for each object class. The image generation relies on text supervision, which does not provide a precise definition of the shape and region of object classes. For example, for the masks with γ = 0.45 and γ = 0.35 in Fig. 2c, the model cannot judge which one is better, since no location information is provided by human effort as supervision and reference.

Looking deeper at the challenge, pixels with a middle confidence score cause uncertainty, while those with high or low scores usually represent the true foreground and background. To address the challenge, semantic affinity learning (i.e., AffinityNet [2]) is used to give an estimation for those pixels with a middle confidence score. Thus we can obtain a definition of the global prototype, i.e., which semantic mask with which threshold γ is suitable to represent the whole prototype. AffinityNet aims to predict the semantic affinity between a pair of adjacent coordinates. During the training phase, pixels in the middle score range are considered neutral. If one of the adjacent coordinates is neutral, the network simply ignores the pair during training. For pairs without neutral pixels, the affinity label of two coordinates is set to 1 (positive pair) if their classes are the same, and 0 (negative pair) otherwise. During the inference phase, a coarse affinity map B̂ ∈ R^{H×W} can be predicted by AffinityNet for each class of each image. B̂ is used to search for a suitable threshold γ̂ over a search space Ω = {γ_i}_{i=1}^{L} as follows:

\hat{\gamma} = \operatorname*{arg\,max}_{\gamma \in \Omega} \mathcal{L}_{\mathrm{match}}(\hat{B}, B_{\gamma}),   (4)

where L_match(B̂, B_γ) is a pair-wise matching cost (IoU) between the affinity map B̂ and the binary map B_γ obtained from the attention map with threshold γ. As a result, an adaptive threshold γ̂ can be obtained for each image of each class. The red points in Fig. 3 represent the corresponding thresholds obtained from matching with the affinity map; they are usually close to the optimal threshold.
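As a concrete illustration of Eq. (4), the sketch below searches a grid of thresholds and keeps the one whose binarized attention map best matches the coarse AffinityNet map by IoU. The DenseCRF refinement step is omitted, and the function names and search range are illustrative assumptions rather than the released implementation.

```python
import numpy as np

def binarize(avg_attention, gamma):
    """Threshold the averaged attention map (Eq. 3 without the DenseCRF step)."""
    return (avg_attention >= gamma).astype(np.uint8)

def iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def adaptive_threshold(avg_attention, affinity_map, candidates=None):
    """Pick gamma-hat maximizing IoU with the coarse affinity map (Eq. 4)."""
    if candidates is None:
        candidates = np.arange(0.30, 0.56, 0.02)   # search space Omega (assumed range)
    scores = [(iou(binarize(avg_attention, g), affinity_map), g) for g in candidates]
    best_iou, best_gamma = max(scores)
    return best_gamma, best_iou

# Toy usage: a fake attention map and a fake affinity map.
rng = np.random.default_rng(0)
attn = rng.random((64, 64))
aff = (attn > 0.45).astype(np.uint8)               # stand-in for AffinityNet output
gamma_hat, match = adaptive_threshold(attn, aff)
print(f"selected threshold {gamma_hat:.2f} with IoU {match:.2f}")
```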
Figure 4 – Pipeline for DiffuMask with the prompt 'Photo of a [sub-class] car in the street'. The stages are labelled 'Diversity and Reality for Prompt', 'Image and Mask Generation and Refinement', and 'Noise Learning (Prune)'. DiffuMask mainly includes three steps: 1) Prompt engineering is used to enhance the diversity and reality of the prompt language (Sec. 3.4). 2) Image and mask generation and refinement with an adaptive threshold from AffinityNet (Sec. 3.2). 3) Noise learning is designed to further improve the quality of the data by filtering noisy labels (Sec. 3.3).
Figure 5 – Effect of Noise Learning (NL). Probability density distributions of IoU for (a) 'Horse' and (b) 'Bird', before (original) and after noise learning. 30k generative images are used for each class. NL prunes 70% of the images on the basis of the IoU rank. Mask2former [11] pre-trained on VOC 2012 [19] is used to generate the ground truth. NL brings an obvious improvement in mask quality by pruning data.

3.3. Noise Learning

Although the refined mask B_γ̂ presents a competitive result, there still exist noisy labels with low precision. Fig. 5 provides the probability density distribution of IoU for the 'Horse' and 'Bird' classes. Masks with IoU under 80% account for a non-negligible proportion and may cause a significant performance drop. Inspired by noise learning [44, 56, 10] for the classification task, we design a simple yet effective noise learning (NL) strategy to prune the noisy labels for the segmentation task.

NL improves the data quality by identifying and filtering noisy labels. The main procedure (see Fig. 4) comprises two steps: (1) Count: estimating the distribution of label noise Q_{B_γ̂, B*} to characterize pixel-level label noise, where B* refers to the prediction of the model; (2) Rank and Prune: filtering out noisy examples and training on the data with errors removed. Formally, given massive generative images and annotations {(I, B_γ̂)}, a segmentation model θ (e.g., Mask2former [11], Mask R-CNN [27]) is used to predict out-of-sample segmentation probabilities θ: I → M^c(B_γ̂; I, θ) by cross-validation. Then we can estimate the joint distribution of noisy labels B_γ̂ and true labels, Q^c_{B_γ̂, B*} = Φ_IoU(B_γ̂, B*), where c denotes the c-th class. With Q^c_{B_γ̂, B*}, some interpretable and explainable ranking methods, such as loss reweighting [22, 41], can be used within confident learning (CL) to find label errors. In this paper, we adopt a simple and effective modularized rank-and-prune method, i.e., Prune by Class, which decouples the model from the data cleaning procedure. For each class, we select and prune the α% of examples with the lowest self-confidence Q^c_{B_γ̂, B*} as noisy data, and train the model θ with the remaining clean data. When α% is set to 50%, the probability density distribution of IoU of the remaining clean data is as presented in Fig. 5 (yellow). This pruning brings an obvious gain in mask precision, which further taps the potential of the attention map as mask annotation.
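A minimal sketch of the Prune-by-Class step: per class, samples are ranked by the IoU between their generated mask and a cross-validated model prediction, and the lowest-scoring fraction α is dropped. The data structures and the fraction used in the toy example are illustrative assumptions.

```python
from collections import defaultdict

def prune_by_class(samples, alpha=0.5):
    """Drop the alpha fraction of samples with the lowest self-confidence per class.

    samples: list of dicts with keys 'image_id', 'class', and 'self_conf',
             where 'self_conf' is the IoU between the DiffuMask mask B_gamma
             and the cross-validated model prediction B* (Phi_IoU in Sec. 3.3).
    Returns the retained ("clean") samples.
    """
    by_class = defaultdict(list)
    for s in samples:
        by_class[s["class"]].append(s)

    clean = []
    for cls, items in by_class.items():
        items.sort(key=lambda s: s["self_conf"])   # lowest confidence first
        n_prune = int(len(items) * alpha)          # prune alpha% per class
        clean.extend(items[n_prune:])
    return clean

# Toy usage: three synthetic samples of one class.
data = [
    {"image_id": 0, "class": "horse", "self_conf": 0.91},
    {"image_id": 1, "class": "horse", "self_conf": 0.42},
    {"image_id": 2, "class": "horse", "self_conf": 0.78},
]
print([s["image_id"] for s in prune_by_class(data, alpha=0.34)])   # -> [2, 0]
```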
Figure 6 – Prompt for diversity in sub-classes for the bird class (e.g., Golden Bulbul, Crane, Macaw, Duck, Ostrich, Waterfowl, Magpie, Swan, Pigeon, Egret, Eagle, Hyliota). Upper: 'Photo of a bird'; lower: 'Photo of a [sub-class] bird'. 100 sub-classes are used for the bird class in total in our experiment. The same prompt strategy is used for other classes, e.g., cat, car.

Figure 7 – Data Augmentation. Four data augmentations are used to reduce the domain gap: (a) Splicing (2×2), (b) Gaussian Blur, (c) Occlusion, (d) Perspective Transform.
3.4. Prompt Engineering

Previous works [42, 58] have shown the effectiveness of prompt engineering for enhancing the diversity of generative data. These studies utilize a variety of prompt modifiers to influence the generated images, e.g., GPT-3 used by ImaginaryNet [42]. Unlike generation-based or modification-based prompts, we design two practical, reality-based prompt strategies.

Prompt with Sub-Classes. A simple text prompt, such as 'Photo of a bird', often results in monotonous generated images; as depicted in Fig. 6 (upper), they fail to capture the diverse range of objects and scenes found in the real world. To address this challenge, we incorporate 'sub-classes' for each category to improve diversity. To achieve this, we select K sub-classes for each category from Wiki¹ and integrate this information into the prompt templates. Fig. 6 (lower) presents an example for the 'bird' category. Given K sub-classes, e.g., Golden Bulbul, Crane, this allows us to obtain K corresponding text prompts 'Photo of a [sub-class] bird', denoted by {P̂_1, P̂_2, ..., P̂_K}.

Retrieval-based Prompt. The prompt P̂ is still a hand-crafted sentence template; we expect to develop it into a real language prompt from the human community. One feasible solution is prompt retrieval [5, 47]. As shown in Fig. 4, given a prompt P̂, i.e., 'Photo of a [sub-class] car in the street', Clip-retrieval [5] pre-trained on Laion-5B [54] is used to retrieve the top N real images and captions, where the captions serve as the final prompt set. Using this approach, we can collect a total of K × N text prompts, denoted by {P̂_i}_{i=1}^{K×N}, for our synthetic data. During inference, we randomly sample a prompt from this set to generate each image.

¹ https://en.wikipedia.org/wiki/Main_Page
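A small sketch of how the sub-class prompt pool can be built and sampled. The sub-class list and the retrieved-caption dictionary are placeholders; the retrieval step itself (Clip-retrieval over Laion-5B) is assumed to have been run separately and is not reproduced here.

```python
import random

def build_subclass_prompts(category, sub_classes, template="Photo of a {sub} {cat}"):
    """Expand one category into K sub-class prompts (Sec. 3.4, 'Prompt with Sub-Classes')."""
    return [template.format(sub=sub, cat=category) for sub in sub_classes]

def build_prompt_pool(subclass_prompts, retrieved_captions_per_prompt):
    """Merge retrieved captions (K x N in total) into the final prompt pool."""
    pool = []
    for prompt in subclass_prompts:
        # Fall back to the template itself if retrieval returned nothing for it.
        pool.extend(retrieved_captions_per_prompt.get(prompt, [prompt]))
    return pool

# Toy usage: two sub-classes, pretending retrieval returned two captions for one of them.
subs = ["Golden Bulbul", "Crane"]
prompts = build_subclass_prompts("bird", subs)
retrieved = {prompts[0]: ["Photo of a golden bulbul bird perched on a branch",
                          "Photo of a golden bulbul bird in a garden"]}
pool = build_prompt_pool(prompts, retrieved)
print(random.choice(pool))   # one prompt is sampled per generated image
```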
3.5. Data Augmentation

To further reduce the domain gap between the generated images and real-world images in terms of size, blur, and occlusion, data augmentations Φ(·) (e.g., Splicing [7]) are used as effective strategies, as shown in Fig. 7. Splicing. Synthetic images usually present a normal size for the foreground (object), i.e., objects typically occupy the majority of the image. However, real-world images often contain objects of varying resolutions, including small objects in datasets such as Cityscapes [15]. To address this issue, we use the Splicing augmentation. Fig. 7 (a) presents one example of image splicing (2 × 2). In the experiment, six scales of image splicing are used, i.e., 1 × 2, 2 × 1, 2 × 2, 3 × 3, 5 × 5, and 8 × 8, and the images are sampled randomly from the training set. Gaussian Blur. Synthetic images typically exhibit a uniform level of blur, whereas real images exhibit varying degrees of blur due to motion, focus, and artifact issues. Gaussian Blur [40] is used to increase the diversity of blur, where the length of the Gaussian kernel is randomly sampled from a range of 6 to 22. Occlusion. Similar to CutMix [62], to make the model focus on discriminative parts of objects, patches of another image are cut and pasted among training images, and the corresponding labels are also mixed proportionally to the area of the patches. Perspective Transform. Similar to the above augmentations, a perspective transform is used to improve the diversity of the generated images by simulating different viewpoints.
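As an illustration of the Splicing augmentation, the sketch below tiles grid×grid randomly chosen image/mask pairs into one training sample so that objects appear at smaller scales. The array shapes and the simple striding-based downscale are simplifications and assumptions, not the paper's exact implementation.

```python
import numpy as np

def splice(images, masks, grid=2):
    """Tile grid*grid image/mask pairs into a single spliced sample (Fig. 7a).

    images: list of (H, W, 3) uint8 arrays, all the same size.
    masks:  list of (H, W) integer label maps aligned with `images`.
    """
    assert len(images) >= grid * grid and len(images) == len(masks)
    h, w = images[0].shape[:2]
    out_img = np.zeros((h, w, 3), dtype=images[0].dtype)
    out_msk = np.zeros((h, w), dtype=masks[0].dtype)
    ch, cw = h // grid, w // grid
    for idx in range(grid * grid):
        r, c = divmod(idx, grid)
        # Nearest-neighbour downscale by striding (a stand-in for proper resizing).
        img = images[idx][::grid, ::grid][:ch, :cw]
        msk = masks[idx][::grid, ::grid][:ch, :cw]
        out_img[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw] = img
        out_msk[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw] = msk
    return out_img, out_msk

# Toy usage with four random 512x512 samples (2x2 splicing).
imgs = [np.random.randint(0, 255, (512, 512, 3), dtype=np.uint8) for _ in range(4)]
msks = [np.random.randint(0, 2, (512, 512), dtype=np.uint8) for _ in range(4)]
spliced_img, spliced_msk = splice(imgs, msks, grid=2)
```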
4. Experiments

4.1. Experimental Setups

Datasets and Tasks. Datasets. Following previous works [11, 36] on semantic segmentation, Pascal-VOC 2012 [19], ADE20K [68], and Cityscapes [15] are used to evaluate DiffuMask. Tasks. Three tasks are adopted in our experiments, i.e., semantic segmentation, open-vocabulary segmentation, and domain generalization.

Implementation Details. The pre-trained Stable Diffusion [49], the text encoder of CLIP [47], and AffinityNet [2] are adopted as the base components. We do not finetune Stable Diffusion and only train AffinityNet for each category.
Semantic Segmentation (IoU) for Selected Classes/%
Train Set       | Number             | Backbone | aeroplane bird boat bus  car  cat  chair cow  dog  horse person sheep sofa | mIoU
Train with Pure Real Data
VOC             | R: 10.6k (all)     | R50      | 87.5 94.4 70.6 95.5 87.7 92.2 44.0 85.4 89.1 82.1 89.2 80.6 53.6 | 77.3
VOC             | R: 10.6k (all)     | Swin-B   | 97.0 93.7 71.5 91.7 89.6 96.5 57.5 95.9 96.8 94.4 92.5 95.1 65.6 | 84.3
VOC             | R: 5.0k            | Swin-B   | 95.5 87.7 77.1 96.1 91.2 95.2 47.3 90.3 92.8 94.6 90.9 93.7 61.4 | 83.4
Train with Pure Synthetic Data
DiffuMask       | S: 60.0k           | R50      | 80.7 86.7 56.9 81.2 74.2 79.3 14.7 63.4 65.1 64.6 71.0 64.7 27.8 | 57.4
DiffuMask       | S: 60.0k           | Swin-B   | 90.8 92.9 67.4 88.3 82.9 92.5 27.2 92.2 86.0 89.0 76.5 92.2 49.8 | 70.6
Finetune on Real Data
VOC, DiffuMask  | S: 60.0k + R: 5.0k | R50      | 85.4 92.8 74.1 92.9 83.7 91.7 38.4 86.5 86.2 82.5 87.5 81.2 39.8 | 77.6
VOC, DiffuMask  | S: 60.0k + R: 5.0k | Swin-B   | 95.6 94.4 72.3 96.9 92.9 96.6 51.5 96.7 95.5 96.1 91.5 96.4 70.2 | 84.9

Table 1 – Results of Semantic Segmentation on the VOC 2012 val set. mIoU is computed over 20 classes. 'S' and 'R' refer to 'Synthetic' and 'Real'.

Train Set              | Number          | Backbone | Human | Vehicle | mIoU
Train with Pure Real Data
Cityscapes             | 3.0k (all)      | R50      | 83.4  | 94.5    | 89.0
Cityscapes             | 3.0k (all)      | Swin-B   | 85.5  | 96.0    | 90.8
Cityscapes             | 1.5k            | Swin-B   | 84.6  | 95.3    | 90.0
Train with Pure Synthetic Data
DiffuMask              | 100.0k          | R50      | 70.7  | 85.3    | 78.0
DiffuMask              | 100.0k          | Swin-B   | 72.1  | 87.0    | 79.6
Finetune with Real Data
Cityscapes, DiffuMask  | 100.0k + 1.5k   | R50      | 84.6  | 95.5    | 90.1
Cityscapes, DiffuMask  | 100.0k + 1.5k   | Swin-B   | 86.4  | 96.4    | 91.4

Table 2 – The mIoU (%) of Semantic Segmentation on Cityscapes val. 'Human' includes two sub-classes, person and rider. 'Vehicle' includes four sub-classes, i.e., car, bus, truck and train. Mask2former [11] with ResNet50 is used.

Methods                    | Type      | Categories | Seen | Unseen | Harmonic
Manual Mask Supervision
ZS3 [8]                    | real      | 15         | 78.0 | 21.2   | 33.3
CaGNet [25]                | real      | 15         | 78.6 | 30.3   | 43.7
Joint [4]                  | real      | 15         | 77.7 | 32.5   | 45.9
STRICT [45]                | real      | 15         | 82.7 | 35.6   | 49.8
SIGN [12]                  | real      | 15         | 83.5 | 41.3   | 55.3
ZegFormer [17]             | real      | 15         | 86.4 | 63.6   | 73.3
Pseudo Mask Supervision from Model pre-trained on COCO [38]
Li et al. [36] (ResNet101) | synthetic | 15+5       | 62.8 | 50.0   | 55.7
Text (Prompt) Supervision
DiffuMask (ResNet50)       | synthetic | 15+5       | 60.8 | 50.4   | 55.1
DiffuMask (ResNet101)      | synthetic | 15+5       | 62.1 | 50.5   | 55.7
DiffuMask (Swin-B)         | synthetic | 15+5       | 71.4 | 65.0   | 68.1

Table 3 – Performance for the Zero-Shot Semantic Segmentation Task on PASCAL VOC. 'Seen', 'Unseen', and 'Harmonic' denote the mIoU of seen categories, unseen categories, and their harmonic mean. Prior methods are trained with real data and masks.
The corresponding parameter optimization and settings (e.g., initialization, data augmentation, batch size, learning rate) are all similar to those of the original papers.

Synthetic data for training. For each category of Pascal-VOC 2012 [19], we generate 10k images and set the α of noise learning to 0.7 to filter out 7k images. As a result, we collect 60k synthetic images for 20 classes as the final training set, with a spatial resolution of 512 × 512. For Cityscapes [14], we only evaluate 2 important classes, i.e., 'Human' and 'Vehicle', including six sub-classes (person, rider, car, bus, truck, train), and generate 30k images for each sub-category, where 10k images are selected as the final training data by noise learning. Considering the relationship between rider and motorbike/bicycle, we set these two classes to be ignored when evaluating the 'Human' class in Table 2 and Table 6. In our experiments, only a single object per image is considered; multi-category generation [36] usually causes unstable image quality, limited by the generation ability of Stable Diffusion. Mask2Former [11] is used as the baseline to evaluate the dataset. 8 Tesla V100 GPUs are used for all experiments.

Evaluation Metrics. Mean intersection-over-union (mIoU) [19, 11], the common metric of semantic segmentation, is used to evaluate performance. For open-vocabulary segmentation, following prior work [17, 12], the mIoU averaged over seen classes, over unseen classes, and their harmonic mean are used.

Mask Smoothness. The mask B_γ̂ generated by the Dense CRF often contains jagged edges and numerous small regions that do not correspond to distinct objects in the image. To address these issues, we train a segmentation model θ (i.e., Mask2Former) using the mask B_γ̂ generated by the Dense CRF as input. We then use this model to predict pseudo labels for the training set of synthetic data, resulting in the final semantic mask annotation.

Cross-Validation for Noise Learning. In the experiment, we perform three-fold cross-validation for each class. k-fold cross-validation (CV) is a process in which all data are randomly split into k folds, in our case k = 3; the model is trained on k − 1 folds, while one fold is left out to test the quality.
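The sketch below shows the three-fold split used to obtain out-of-sample predictions for noise learning: each sample's mask is scored by a model that never saw that sample during training. The train_model and predict_iou callables are placeholders for whatever segmentation training and inference code is actually used.

```python
import random

def kfold_self_confidence(samples, train_model, predict_iou, k=3, seed=0):
    """Assign each sample an out-of-sample self-confidence score via k-fold CV.

    samples:     list of (image, mask) pairs.
    train_model: callable(train_samples) -> model, trained on k-1 folds.
    predict_iou: callable(model, sample) -> IoU between the sample's mask and
                 the model's prediction (Phi_IoU in Sec. 3.3).
    Returns a list of scores aligned with `samples`.
    """
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]            # k roughly equal folds
    scores = [None] * len(samples)
    for held_out in folds:
        held = set(held_out)
        model = train_model([samples[i] for i in idx if i not in held])
        for i in held_out:                           # score only unseen samples
            scores[i] = predict_iou(model, samples[i])
    return scores
```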
(a) DiffuMask v.s. Attention Map
Annotation   | γ   | Bird | Dog  | mIoU
Affinity map | -   | 84.4 | 78.8 | 81.6
Attention    | 0.4 | 88.1 | 82.4 | 85.3
Attention    | 0.5 | 90.3 | 67.4 | 78.9
Attention    | 0.6 | 50.5 | 38.3 | 44.4
DiffuMask    | AT  | 92.9 | 86.0 | 89.5

(b) Prompt Engineering
Retri. | Sub-C | Bird | Dog  | mIoU
-      | -     | 78.2 | 75.6 | 76.9
✓      | -     | 79.2 | 76.2 | 77.7
✓      | 10    | 91.3 | 83.9 | 87.6
✓      | 50    | 92.5 | 85.4 | 89.0
✓      | 100   | 92.9 | 86.0 | 89.5

(c) Noise Learning
α   | Bird | Dog  | mIoU
0.3 | 87.2 | 79.2 | 83.2
0.4 | 89.5 | 79.9 | 84.7
0.5 | 91.9 | 84.4 | 88.2
0.6 | 92.6 | 85.2 | 89.1
0.7 | 92.9 | 86.0 | 89.5

(d) Data Augmentation
Method           | Bird | Dog  | mIoU
-                | 87.0 | 81.5 | 84.3
Φ1               | 90.2 | 83.7 | 87.0
Φ1, Φ2           | 90.9 | 84.8 | 87.9
Φ1, Φ2, Φ3       | 91.2 | 85.1 | 88.2
Φ1, Φ2, Φ3, Φ4   | 92.9 | 86.0 | 89.5

Table 4 – DiffuMask ablations. We perform ablations on VOC 2012 val. γ and 'AT' denote the 'Threshold' and 'Adaptive Threshold', respectively. α refers to the proportion of data pruning. Φ1, Φ2, Φ3 and Φ4 refer to 'Splicing', 'Gaussian Blur', 'Occlusion', and 'Perspective Transform', respectively. 'Retri.' and 'Sub-C' denote 'retrieval-based' and 'Sub-Class', respectively. Mask2former with Swin-B is adopted as the baseline.

4.2. Protocol-I: Semantic Segmentation

VOC 2012. Table 1 presents the results of semantic segmentation on VOC 2012. Existing segmentation methods trained on synthetic data (DiffuMask) can achieve competitive performance, i.e., 70.6% vs. 84.3% mIoU with the Swin-B backbone. A point worth emphasizing is that our synthetic data do not need any manual localization and mask annotation, while real data require humans to perform pixel-wise mask annotation. For some categories, i.e., bird, cat, cow, horse, sheep, DiffuMask presents powerful performance, quite close to that of training on real data (within a 5% gap). Besides, by finetuning on a small amount of real data, the results can be improved further and exceed those of training on the full real data, e.g., 84.9% mIoU when finetuning on 5.0k real images vs. 83.4% mIoU when training on the full real data (10.6k).

Cityscapes. Table 2 presents the results on Cityscapes. Urban street scenes of Cityscapes are more challenging, including a mass of small objects and complex backgrounds. We only evaluate two classes, i.e., Vehicle and Human, which are the two most important categories in the driving scene. Compared with training on real images, DiffuMask presents a competitive result, i.e., 79.6% vs. 90.8% mIoU.

ADE20K. ADE20K, as an even more challenging dataset, is also used to evaluate DiffuMask. Table 5 presents the results for three categories (bus, car, person) on ADE20K. With fewer synthetic images (6k), we achieve competitive performance compared to a mass of real images (20.2k). Compared with the other two categories, the car class achieves the best performance, with 73.4% mIoU.

Train Set  | Number   | Backbone | bus  | car  | person | mIoU
Train with Pure Real Data
ADE20K     | R: 20.2k | R50      | 87.9 | 82.5 | 79.4   | 83.3
ADE20K     | R: 20.2k | Swin-B   | 93.6 | 86.1 | 84.0   | 87.9
Train with Pure Synthetic Data
DiffuMask  | S: 6.0k  | R50      | 43.4 | 67.3 | 60.2   | 57.0
DiffuMask  | S: 6.0k  | Swin-B   | 72.8 | 73.4 | 62.6   | 69.6

Table 5 – The mIoU (%) of Semantic Segmentation on the ADE20K val.

Train Set       | Test Set            | Car  | Person | Motorbike | mIoU
Cityscapes [14] | VOC 2012 [19] val   | 26.4 | 32.9   | 28.3      | 29.2
ADE20K [68]     | VOC 2012 [19] val   | 73.2 | 66.6   | 64.1      | 68.0
DiffuMask       | VOC 2012 [19] val   | 74.2 | 71.0   | 63.2      | 69.5
VOC 2012 [19]   | Cityscapes [14] val | 85.6 | 53.2   | 11.9      | 50.2
ADE20K [68]     | Cityscapes [14] val | 83.3 | 63.4   | 33.7      | 60.1
DiffuMask       | Cityscapes [14] val | 84.0 | 70.7   | 23.6      | 59.4

Table 6 – Performance for Domain Generalization between different datasets. Mask2former [11] with ResNet50 is used as the baseline. The Person and Rider classes of Cityscapes [14] are considered as the same class, i.e., Person, in the experiment.

4.3. Protocol-II: Open-vocabulary Segmentation

As shown in Fig. 1, it is natural and seamless to extend the text-driven synthetic data (our DiffuMask) to the open-vocabulary (zero-shot) task. As shown in Table 3, compared with prior methods trained on real images with manually annotated masks, DiffuMask achieves a SOTA result on the Unseen classes. It is worth mentioning that DiffuMask uses purely synthetic data supervised by text, while all prior methods require real images and the corresponding manual mask annotation. Li et al. [36], as a contemporaneous work, use a segmentation model pre-trained on COCO [38] to predict pseudo labels for the synthetic images, which is costly.

4.4. Protocol-III: Domain Generalization

Table 6 presents the results of cross-dataset validation, which evaluates the generalization of the data. Compared with real data, DiffuMask shows powerful effectiveness for domain generalization, e.g., 69.5% with DiffuMask vs. 68.0% with ADE20K [68] on VOC 2012 val. The domain gap [57] between real datasets is sometimes bigger than that between synthetic and real data. For the Motorbike class, a model trained on Cityscapes only achieves 28.9% mIoU, but that of DiffuMask is 63.2% mIoU. We argue that the main reason is the domain shift in the foreground and background domains, i.e., Cityscapes contains images of city roads, with the majority of Motorbike objects being small in size, whereas VOC 2012 is an open-set scenario in which Motorbike objects vary greatly in size and include close-up shots.

4.5. Ablation Study

Compared with Attention Map. Table 4a presents the comparison with the plain attention map and the impact of the binarization threshold γ. It is clear that the optimal threshold differs across categories and even varies for different images of the same category. Sometimes it is sensitive for some categories, such as Dog.
Backbone   | Bird | Dog  | Sheep | Horse | Person | mIoU
ResNet 50  | 86.7 | 65.1 | 64.7  | 64.6  | 71.0   | 70.3
ResNet 101 | 86.7 | 66.8 | 65.3  | 63.4  | 70.2   | 70.5
Swin-B     | 92.9 | 86.0 | 92.2  | 89.0  | 76.5   | 87.3
Swin-L     | 92.8 | 86.4 | 92.3  | 88.3  | 77.3   | 87.4

Table 7 – Impact of the Backbone on VOC 2012 val. Mask2former [11] is used as the baseline.

Figure 8 – Impact of Backbone (qualitative comparison of ResNet 50 vs. Swin-B in terms of classification, false negatives, and mask precision). A stronger backbone is robust for classification, false negatives, and mask precision.

Annotation                    | Bird | Dog  | Person | Sofa | mIoU
Real Image, Manual Label      | 93.7 | 96.8 | 92.5   | 65.6 | 87.2
Synthetic Image, Pseudo Label | 95.2 | 86.2 | 89.9   | 59.5 | 82.7
Synthetic Image, DiffuMask    | 92.9 | 86.0 | 76.5   | 49.8 | 76.3

Table 8 – Impact of Mask Precision and Domain Gap on VOC 2012 val. Mask2former [11] with Swin-B is used as the baseline. 'Pseudo' denotes pseudo mask annotation from Mask2former [11] pre-trained on VOC 2012.
The mIoU with γ = 0.4 is better than that with γ = 0.6 by around 40% mIoU, which is not negligible. By contrast, our adaptive threshold is robust; Fig. 3 also shows that it is close to the optimal threshold.

Prompt Engineering. Table 4b provides the ablation study for the prompt strategies. Both the retrieval-based and sub-class prompts bring an obvious gain. For dog, the 10 sub-class prompt brings a 7.7% mIoU improvement, which is quite significant. This is reasonable: the fine-grained prompts directly enhance the diversity of the generated images, as shown in Fig. 6.

Noise Learning. Table 4c presents the impact of the prune threshold α. 10k synthetic images for each class are used in this experiment. The gain is considerable as α changes from 0.3 to 0.5. In the other experiments, we set α to 0.7 for each category.

Data Augmentation. The ablation study for the four augmentations is shown in Table 4d. Compared with the other three augmentations, the gain of image splicing is the biggest. One main reason is that the synthetic images all have 512 × 512 resolution and the objects are usually of normal size; image splicing enhances the diversity of scale.

What causes the performance gap between synthetic and real data? The domain gap and mask precision are the main reasons for the performance gap between synthetic and real data. Table 8 is set up to further explore the problem. Li et al. [36] show that the pseudo mask of a synthetic image from Mask2former [11] pre-trained on VOC 2012 is quite accurate and can serve as the ground truth. Thus, we also use the pseudo labels from the pre-trained Mask2former to train the model. As shown in Table 8, mask precision causes a 6.4% mIoU gap, and the domain gap of the images causes a 4.5% mIoU gap. Notably, for the bird class, the use of synthetic data with pseudo labels results in better performance than the corresponding real images. This observation suggests that there may be no domain gap for the bird class in the VOC 2012 dataset.

Backbone. Table 7 presents the ablation study for the backbone. For some classes, e.g., sheep, the stronger backbone brings obvious gains, i.e., Swin-B achieves a 27.5% mIoU improvement over ResNet 50, and the mIoU over all classes with Swin-B improves by 19.2%. It is an interesting and novel insight that a stronger backbone can reduce the domain gap between synthetic and real data. To give a further analysis, we present some visual comparisons, as shown in Fig. 8. Swin-B brings an obvious improvement in classification, false negatives, and mask precision.

5. Conclusion

A new insight is presented in this paper, demonstrating that accurate semantic masks of generative images can be automatically obtained through the use of a text-driven diffusion model. To achieve this goal, we present DiffuMask, an automatic procedure to generate images and pixel-level semantic annotations. Existing segmentation methods trained on the synthetic data of DiffuMask can achieve performance competitive with their counterparts trained on real data. Besides, DiffuMask shows powerful performance for open-vocabulary segmentation, achieving promising results on Unseen categories. We hope DiffuMask can bring new insights and inspiration for bridging generative data and real-world data in the community.

Acknowledgements

W. Wu, C. Shen's participation was supported by the National Key R&D Program of China (No. 2022ZD0118700). W. Wu, H. Zhou's participation was supported by the National Key Research and Development Program of China (No. 2022YFC3602601), and the Key Research and Development Program of Zhejiang Province of China (No. 2021C02037). M. Shou's participation was supported by the National Research Foundation, Singapore under its NRFF Award NRF-NRFF13-2021-0008, and his Start-Up Grant from National University of Singapore. Thank you to Runlong Liao for pointing out some citation errors.
References [14] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo
Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe
[1] David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Ef- Franke, Stefan Roth, and Bernt Schiele. The cityscapes
ficient interactive annotation of segmentation datasets with dataset for semantic urban scene understanding. In Proceed-
polygon-rnn++. In Proceedings of the IEEE conference on ings of the IEEE conference on computer vision and pattern
Computer Vision and Pattern Recognition, pages 859–868, recognition, pages 3213–3223, 2016. 2, 7, 8
2018. 3 [15] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo
[2] Jiwoon Ahn and Suha Kwak. Learning pixel-level semantic Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe
affinity with image-level supervision for weakly supervised Franke, Stefan Roth, and Bernt Schiele. The cityscapes
semantic segmentation. In Proceedings of the IEEE con- dataset for semantic urban scene understanding. In Proceed-
ference on computer vision and pattern recognition, pages ings of the IEEE conference on computer vision and pattern
4981–4990, 2018. 2, 3, 4, 6 recognition, pages 3213–3223, 2016. 6
[3] Peri Akiva and Kristin Dana. Towards single stage [16] Jeevan Devaranjan, Sanja Fidler, and Amlan Kar. Unsuper-
weakly supervised semantic segmentation. arXiv preprint vised learning of scene structure for synthetic data genera-
arXiv:2106.10309, 2021. 2, 3 tion, Sept. 9 2021. US Patent App. 17/117,425. 3
[4] Donghyeon Baek, Youngmin Oh, and Bumsub Ham. Ex- [17] Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. De-
ploiting a joint embedding space for generalized zero-shot coupling zero-shot semantic segmentation. In Proc. CVPR,
semantic segmentation. In Proc. ICCV, 2021. 7 2022. 7
[5] Romain Beaumont. Clip retrieval: Easily compute clip em- [18] Laurent Dinh, David Krueger, and Yoshua Bengio. Nice:
beddings and build a clip retrieval system with them. GitHub, Non-linear independent components estimation. arXiv
2022. 6 preprint arXiv:1410.8516, 2014. 3
[6] Victor Besnier, Himalaya Jain, Andrei Bursuc, Matthieu [19] Mark Everingham, Luc Van Gool, Christopher KI Williams,
Cord, and Patrick Pérez. This dataset does not exist: training John Winn, and Andrew Zisserman. The pascal visual object
models from generated images. In ICASSP 2020-2020 IEEE classes (voc) challenge. International journal of computer
International Conference on Acoustics, Speech and Signal vision, 88(2):303–338, 2010. 2, 4, 5, 6, 7, 8
Processing (ICASSP), pages 1–5. IEEE, 2020. 3 [20] Yunhao Ge, Harkirat Behl, Jiashu Xu, Suriya Gunasekar,
[7] Alexey Bochkovskiy, Chien-Yao Wang, and Hong- Neel Joshi, Yale Song, Xin Wang, Laurent Itti, and Vibhav
Yuan Mark Liao. Yolov4: Optimal speed and accuracy of Vineet. Neural-sim: Learning to generate training data with
object detection. arXiv preprint arXiv:2004.10934, 2020. 2, nerf. In European Conference on Computer Vision, pages
6 477–493. Springer, 2022. 3
[21] Yunhao Ge, Jiashu Xu, Brian Nlong Zhao, Laurent Itti, and
[8] Maxime Bucher, Tuan-Hung Vu, Matthieu Cord, and Patrick
Vibhav Vineet. Dall-e for detection: Language-driven con-
Pérez. Zero-shot semantic segmentation. NeurIPS, 2019. 7
text image synthesis for object detection. arXiv preprint
[9] Liang-Chieh Chen, Sanja Fidler, Alan L Yuille, and Raquel arXiv:2206.09592, 2022. 3
Urtasun. Beat the mturkers: Automatic image labeling from [22] Jacob Goldberger and Ehud Ben-Reuven. Training deep
weak 3d supervision. In Proceedings of the IEEE conference neural-networks using a noise adaptation layer. In Proc. Int.
on computer vision and pattern recognition, pages 3198– Conf. Learn. Representations, 2017. 5
3205, 2014. 3
[23] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
[10] Pengfei Chen, Ben Ben Liao, Guangyong Chen, and Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Shengyu Zhang. Understanding and utilizing deep neural Yoshua Bengio. Generative adversarial networks. Commu-
networks trained with noisy labels. In International Confer- nications of the ACM, 63(11):139–144, 2020. 3
ence on Machine Learning, pages 1062–1070. PMLR, 2019. [24] Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yun-
5 peng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning
[11] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexan- Chang, Weijia Wu, et al. Mix-of-show: Decentralized low-
der Kirillov, and Rohit Girdhar. Masked-attention mask rank adaptation for multi-concept customization of diffusion
transformer for universal image segmentation. In Proceed- models. arXiv preprint arXiv:2305.18292, 2023. 3
ings of the IEEE/CVF Conference on Computer Vision and [25] Zhangxuan Gu, Siyuan Zhou, Li Niu, Zihan Zhao, and
Pattern Recognition, pages 1290–1299, 2022. 3, 4, 5, 6, 7, Liqing Zhang. Context-aware feature generation for zero-
8, 9 shot semantic segmentation. In ACM MM, 2020. 7
[12] Jiaxin Cheng, Soumyaroop Nandi, Prem Natarajan, and [26] Matthieu Guillaumin, Daniel Küttel, and Vittorio Ferrari.
Wael Abd-Almageed. Sign: Spatial-information incorpo- Imagenet auto-annotation with segmentation propagation.
rated generative network for generalized zero-shot semantic International Journal of Computer Vision, 110(3):328–348,
segmentation. In Proc. ICCV, 2021. 7 2014. 3
[13] Jaehoon Choi, Taekyung Kim, and Changick Kim. Self- [27] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Gir-
ensembling with gan-based data augmentation for domain shick. Mask r-cnn. In Proceedings of the IEEE international
adaptation in semantic segmentation. In Proceedings of the conference on computer vision, pages 2961–2969, 2017. 5
IEEE/CVF International Conference on Computer Vision, [28] Ruifei He, Shuyang Sun, Xin Yu, Chuhui Xue, Wenqing
pages 6830–6840, 2019. 3 Zhang, Philip Torr, Song Bai, and Xiaojuan Qi. Is synthetic
data from generative models ready for image recognition? [42] Minheng Ni, Zitong Huang, Kailai Feng, and Wangmeng
arXiv preprint arXiv:2210.07574, 2022. 3 Zuo. Imaginarynet: Learning object detectors without real
[29] Amlan Kar, Aayush Prakash, Ming-Yu Liu, Eric Cameracci, images and annotations. arXiv preprint arXiv:2210.06886,
Justin Yuan, Matt Rusiniak, David Acuna, Antonio Torralba, 2022. 3, 6
and Sanja Fidler. Meta-sim: Learning to generate synthetic [43] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav
datasets. In Proceedings of the IEEE/CVF International Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and
Conference on Computer Vision, pages 4551–4560, 2019. 3 Mark Chen. Glide: Towards photorealistic image generation
[30] Diederik P Kingma and Max Welling. Auto-encoding varia- and editing with text-guided diffusion models. arXiv preprint
tional bayes. arXiv preprint arXiv:1312.6114, 2013. 3 arXiv:2112.10741, 2021. 3
[31] Philipp Krähenbühl and Vladlen Koltun. Efficient inference [44] Curtis Northcutt, Lu Jiang, and Isaac Chuang. Confident
in fully connected crfs with gaussian edge potentials. Ad- learning: Estimating uncertainty in dataset labels. Journal
vances in neural information processing systems, 24, 2011. of Artificial Intelligence Research, 70:1373–1411, 2021. 2,
2, 4 5
[32] Viveka Kulharia, Siddhartha Chandra, Amit Agrawal, Philip [45] Giuseppe Pastore, Fabio Cermelli, Yongqin Xian, Massimil-
Torr, and Ambrish Tyagi. Box2seg: Attention weighted loss iano Mancini, Zeynep Akata, and Barbara Caputo. A closer
and discriminative feature learning for weakly supervised look at self-training for zero-label semantic segmentation. In
segmentation. In European Conference on Computer Vision, Proc. CVPRW, 2021. 7
pages 290–308. Springer, 2020. 3 [46] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Milden-
[33] Jungbeom Lee, Eunji Kim, and Sungroh Yoon. Anti- hall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv
adversarially manipulated attributions for weakly and semi- preprint arXiv:2209.14988, 2022. 3
supervised semantic segmentation. In Proceedings of the [47] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
IEEE/CVF Conference on Computer Vision and Pattern Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Recognition, pages 4071–4080, 2021. 2, 3 Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning
[34] Jungbeom Lee, Jihun Yi, Chaehun Shin, and Sungroh Yoon. transferable visual models from natural language supervi-
Bbam: Bounding box attribution map for weakly super- sion. In International conference on machine learning, pages
vised semantic and instance segmentation. In Proceedings 8748–8763. PMLR, 2021. 3, 6
of the IEEE/CVF conference on computer vision and pattern [48] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu,
recognition, pages 2643–2652, 2021. 2, 3 and Mark Chen. Hierarchical text-conditional image gen-
[35] Daiqing Li, Huan Ling, Seung Wook Kim, Karsten Kreis, eration with clip latents. arXiv preprint arXiv:2204.06125,
Sanja Fidler, and Antonio Torralba. Bigdatasetgan: Synthe- 2022. 2, 3
sizing imagenet with pixel-wise annotations. In Proceedings [49] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
of the IEEE/CVF Conference on Computer Vision and Pat- Patrick Esser, and Björn Ommer. High-resolution image
tern Recognition, pages 21330–21340, 2022. 2, 3 synthesis with latent diffusion models. In Proceedings of
[36] Ziyi Li, Qinye Zhou, Xiaoyun Zhang, Ya Zhang, Yan- the IEEE/CVF Conference on Computer Vision and Pattern
feng Wang, and Weidi Xie. Guiding text-to-image diffu- Recognition, pages 10684–10695, 2022. 2, 3, 4, 6
sion model towards grounded generation. arXiv preprint [50] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-
arXiv:2301.05221, 2023. 3, 6, 7, 9 net: Convolutional networks for biomedical image segmen-
[37] Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, and Jian Sun. tation. In Medical Image Computing and Computer-Assisted
Scribblesup: Scribble-supervised convolutional networks for Intervention–MICCAI 2015: 18th International Conference,
semantic segmentation. In Proceedings of the IEEE con- Munich, Germany, October 5-9, 2015, Proceedings, Part III
ference on computer vision and pattern recognition, pages 18, pages 234–241. Springer, 2015. 3
3159–3167, 2016. 2, 3 [51] Lixiang Ru, Bo Du, and Chen Wu. Learning visual words
[38] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, for weakly-supervised semantic segmentation. In IJCAI, vol-
Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence ume 5, page 6, 2021. 2, 3
Zitnick. Microsoft coco: Common objects in context. In [52] Lixiang Ru, Yibing Zhan, Baosheng Yu, and Bo Du. Learn-
Proc. ECCV, 2014. 7, 8 ing affinity from attention: End-to-end weakly-supervised
[39] Huan Ling, Jun Gao, Amlan Kar, Wenzheng Chen, and Sanja semantic segmentation with transformers. In Proceedings of
Fidler. Fast interactive object annotation with curve-gcn. In the IEEE/CVF Conference on Computer Vision and Pattern
Proceedings of the IEEE/CVF conference on computer vi- Recognition, pages 16846–16855, 2022. 2, 3
sion and pattern recognition, pages 5257–5266, 2019. 3 [53] Chitwan Saharia, William Chan, Saurabh Saxena, Lala
[40] Raphael Gontijo Lopes, Dong Yin, Ben Poole, Justin Gilmer, Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed
and Ekin D Cubuk. Improving robustness without sacrificing Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi,
accuracy with patch gaussian augmentation. arXiv preprint Rapha Gontijo Lopes, et al. Photorealistic text-to-image
arXiv:1906.02611, 2019. 6 diffusion models with deep language understanding. arXiv
[41] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Raviku- preprint arXiv:2205.11487, 2022. 3
mar, and Ambuj Tewari. Learning with noisy labels. 2013. [54] Christoph Schuhmann, Romain Beaumont, Richard Vencu,
5 Cade Gordon, Ross Wightman, Mehdi Cherti, Theo
Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- [66] Yuzhong Zhao, Weijia Wu, Zhuang Li, Jiahong Li, and
man, et al. Laion-5b: An open large-scale dataset for Weiqiang Wang. Flowtext: Synthesizing realistic scene
training next generation image-text models. arXiv preprint text video with optical flow estimation. arXiv preprint
arXiv:2210.08402, 2022. 6 arXiv:2305.03327, 2023. 3
[55] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, [67] Yuzhong Zhao, Qixiang Ye, Weijia Wu, Chunhua Shen, and
and Surya Ganguli. Deep unsupervised learning using Fang Wan. Generative prompt model for weakly supervised
nonequilibrium thermodynamics. In International Confer- object localization. arXiv preprint arXiv:2307.09756, 2023.
ence on Machine Learning, pages 2256–2265. PMLR, 2015. 3
3 [68] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela
[56] Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, Barriuso, and Antonio Torralba. Scene parsing through
and Jae-Gil Lee. Learning from noisy labels with deep neural ade20k dataset. In Proceedings of the IEEE conference on
networks: A survey. IEEE Transactions on Neural Networks computer vision and pattern recognition, pages 633–641,
and Learning Systems, 2022. 2, 5 2017. 6, 8
[57] Marco Toldo, Andrea Maracani, Umberto Michieli, and
Pietro Zanuttigh. Unsupervised domain adaptation in seman-
tic segmentation: a review. Technologies, 8(2):35, 2020. 8
[58] Sam Witteveen and Martin Andrews. Investigating
prompt engineering in diffusion models. arXiv preprint
arXiv:2211.15462, 2022. 6
[59] Tong Wu, Junshi Huang, Guangyu Gao, Xiaoming Wei, Xi-
aolin Wei, Xuan Luo, and Chi Harold Liu. Embedded dis-
criminative attention mechanism for weakly supervised se-
mantic segmentation. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, pages
16765–16774, 2021. 2, 3
[60] Zhenyu Wu, Lin Wang, Wei Wang, Tengfei Shi, Chenglizhao
Chen, Aimin Hao, and Shuo Li. Synthetic data supervised
salient object detection. In Proceedings of the 30th ACM
International Conference on Multimedia, pages 5557–5565,
2022. 3
[61] Lian Xu, Wanli Ouyang, Mohammed Bennamoun, Farid
Boussaid, Ferdous Sohel, and Dan Xu. Leveraging auxil-
iary tasks with affinity learning for weakly supervised se-
mantic segmentation. In Proceedings of the IEEE/CVF Inter-
national Conference on Computer Vision, pages 6984–6993,
2021. 2, 3
[62] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk
Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regu-
larization strategy to train strong classifiers with localizable
features. In Proceedings of the IEEE/CVF international con-
ference on computer vision, pages 6023–6032, 2019. 6
[63] Bingfeng Zhang, Jimin Xiao, Jianbo Jiao, Yunchao Wei,
and Yao Zhao. Affinity attention graph neural network for
weakly supervised semantic segmentation. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 2021. 2,
3
[64] Yuxuan Zhang, Wenzheng Chen, Huan Ling, Jun Gao,
Yinan Zhang, Antonio Torralba, and Sanja Fidler. Im-
age gans meet differentiable rendering for inverse graph-
ics and interpretable 3d neural rendering. arXiv preprint
arXiv:2010.09125, 2020. 3
[65] Yuxuan Zhang, Huan Ling, Jun Gao, Kangxue Yin, Jean-
Francois Lafleche, Adela Barriuso, Antonio Torralba, and
Sanja Fidler. Datasetgan: Efficient labeled data factory with
minimal human effort. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
pages 10145–10155, 2021. 2, 3
