DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models
Weijia Wu1,3, Yuzhong Zhao2, Mike Zheng Shou3*, Hong Zhou1*, Chunhua Shen1,4
1 Zhejiang University   2 University of Chinese Academy of Sciences   3 National University of Singapore   4 Ant Group
Figure 1 – DiffuMask synthesizes photo-realistic images and high-quality mask annotations by exploiting the attention maps of the diffusion model. Without human effort for localization, DiffuMask is capable of producing high-quality semantic masks. (a) Synthesizing images with pixel-level annotations for semantic segmentation, covering classes such as aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, motorbike, dog, person, sheep, sofa, train, potted plant, and background. (b) Open-vocabulary image and semantic mask generation, e.g., for the prompts 'A photograph of Eiffel Tower', 'A painting of a highly detailed Ultraman', and 'A road sign shows Mask'.
Figure 3 – Relationship between mask quality (IoU) and binarization threshold γ for various categories (Horse, Bird, Bottle, Dog, Cat; red points: selection with AffinityNet). 1k generative images are used for each class from Stable Diffusion [49]. Mask2former [11] pre-trained on Pascal-VOC 2012 [19] is used to generate the ground truth. The optimal threshold of different classes is usually different.

The prompt P is projected into the textual embedding τθ(P) ∈ R^{N×d} (N refers to the sequence length of text tokens and d is the latent projection dimension) with the text encoder τθ, and is then mapped into a Key matrix K = ℓK(τθ(P)) and a Value matrix V = ℓV(τθ(P)) via learned projections ℓQ, ℓK, ℓV. The cross-attention maps can be calculated by:

$$A = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right), \qquad (1)$$

where A ∈ R^{H×W×N} (after re-shaping). For the j-th text token, e.g., horse in Fig. 2a, the corresponding weight A_j ∈ R^{H×W} on the visual map φ(z_t) can be obtained. Finally, the output of the cross-attention is φ̂(z_t) = AV, which is then used to update the spatial features φ(z_t).
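For readers who want to extract such maps themselves, the snippet below is a minimal PyTorch sketch of Eq. (1): it computes the token-to-pixel attention weights from a flattened spatial feature map and the text embedding. The projection matrices, tensor sizes, and the token index are illustrative placeholders, not the actual Stable Diffusion modules.

```python
import torch
import torch.nn.functional as F

def cross_attention_maps(phi_zt, text_emb, W_q, W_k):
    """Eq. (1): A = Softmax(Q K^T / sqrt(d)).

    phi_zt:   (H*W, C)   flattened spatial features of one U-Net layer
    text_emb: (N, C_txt) text-encoder embedding of the prompt (N tokens)
    W_q, W_k: learned projection matrices (placeholders here)
    returns:  (H*W, N)   attention weight of every text token at every pixel
    """
    Q = phi_zt @ W_q                            # (H*W, d)
    K = text_emb @ W_k                          # (N, d)
    d = Q.shape[-1]
    A = F.softmax(Q @ K.T / d ** 0.5, dim=-1)   # softmax over the N tokens
    return A                                    # reshape to (H, W, N) to read off token j

# usage with illustrative sizes: a 16x16 feature map and 77 prompt tokens
H = W = 16
phi = torch.randn(H * W, 320)
txt = torch.randn(77, 768)
A = cross_attention_maps(phi, txt, torch.randn(320, 64), torch.randn(768, 64))
horse_map = A.reshape(H, W, -1)[..., 5]         # map of a hypothetical token index
```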
3.2. Mask Generation and Refinement

Based on Equ. 1, we can obtain the corresponding cross-attention map A^{s,t}_j, where s denotes the attention map from the s-th layer of the U-Net, corresponding to four different resolutions, i.e., 8×8, 16×16, 32×32, and 64×64, as shown in Fig. 2b, and t denotes the t-th diffusion step (time). Then the average cross-attention map can be calculated by aggregating the multi-layer and multi-time attention maps as follows:

$$\hat{A}_j = \frac{1}{S \cdot T}\sum_{s\in S,\, t\in T}\frac{A^{s,t}_j}{\max(A^{s,t}_j)}, \qquad (2)$$

where S and T refer to the number of layers (i.e., four for the U-Net) and the total number of diffusion steps, respectively. Normalization is necessary because the value of the attention map from the output of the Softmax is not a probability between 0 and 1.

3.2.1 Standard Binarization

Given an average attention map (a probability map) M ∈ R^{H×W} for the j-th text token produced by the cross attention in Equ. (1), it is essential to convert it to a binary map, where pixels with value 1 form the foreground region (e.g., 'horse'). Usually, as shown in Fig. 2c, the simplest solution for the binarization process is using a fixed threshold value γ and refining with DenseCRF [31] (local relationships defined by the color and distance of pixels) as follows:

$$B = \mathrm{DenseCRF}\big(\operatorname*{argmax}\,[\gamma;\, \hat{A}_j]\big). \qquad (3)$$

The above method is not practical and effective, because the optimal threshold of each image and each category is not exactly the same. To explore the relationship between threshold and binary mask quality, we set up a simple analysis experiment. Stable Diffusion [49] is used to generate 1k images and corresponding attention maps for each class. The prediction of Mask2former [11] pre-trained on Pascal-VOC 2012, adopted as the ground truth, is used to calculate the mask quality (mIoU), as shown in Fig. 3. The optimal threshold of different classes is usually different, e.g., around 0.48 for the 'Bottle' class versus around 0.39 for the 'Dog' class. To achieve the best mask quality, an adaptive threshold is a feasible solution for the binarization of each image and class.

3.2.2 Adaptive Threshold for Binarization

It is challenging to determine the optimal threshold for binarizing the probability maps because of the variation in shape and region for each object class. The image generation relies on text supervision, which does not provide a precise definition of the shape and region of object classes. For example, given the masks with γ = 0.45 and γ = 0.35 in Fig. 2c, the model cannot judge which one is better, since no location information is provided by human effort as supervision and reference.

Looking deeper at the challenge, pixels with a middle confidence score cause uncertainty, while those with high and low scores usually represent the true foreground and background. To address the challenge, semantic affinity learning (i.e., AffinityNet [2]) is used to give an estimation for those pixels with a middle confidence score. Thus we can obtain a definition of the global prototype, i.e., which semantic mask among different thresholds γ is suitable to represent the whole prototype. AffinityNet aims to predict the semantic affinity between a pair of adjacent coordinates. During the training phase, pixels in the middle score range are considered neutral. If one of the adjacent coordinates is neutral, the network simply ignores the pair during training. Without neutral pixels, the affinity label of two coordinates is set to 1 (positive pair) if their classes are the same, and 0 (negative pair) otherwise. During the inference phase, a coarse affinity map B̂ ∈ R^{H×W} can be predicted by AffinityNet for each class of each image. B̂ is used to search for a suitable threshold γ̂ over a search space Ω = {γ_i}^L_{i=1} as follows:

$$\hat{\gamma} = \operatorname*{arg\,max}_{\gamma\in\Omega}\; \sum \mathcal{L}_{match}(\hat{B}, B_\gamma), \qquad (4)$$

where L_match(B̂, B_γ) is a pair-wise matching cost of IoU between the affinity map B̂ and the binary map obtained from the attention map with threshold γ. As a result, an adaptive threshold γ̂ can be obtained for each image of each class. The red points in Fig. 3 represent the corresponding thresholds from matching with the affinity map; they are usually close to the optimal threshold.
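To make the aggregation and the adaptive-threshold search concrete, the following is a minimal NumPy sketch of Eqs. (2) and (4). It assumes the per-layer, per-step attention maps for token j have already been extracted and resized to a common resolution, and that the coarse affinity map from AffinityNet is available as a binary array; the DenseCRF refinement of Eq. (3) is omitted, and all function names and the candidate grid are ours, not from the released code.

```python
import numpy as np

def aggregate_attention(maps):
    """Eq. (2): average the per-layer, per-step attention maps for one token.

    maps: list of (H, W) arrays, one per (layer, timestep) pair, already resized
    to a common resolution. Each map is max-normalized before averaging because
    softmax outputs are not probabilities in [0, 1] per spatial location.
    """
    normed = [m / (m.max() + 1e-8) for m in maps]
    return np.mean(normed, axis=0)            # \hat{A}_j, values in [0, 1]

def binarize(prob_map, gamma):
    """Fixed-threshold binarization (DenseCRF refinement omitted)."""
    return (prob_map >= gamma).astype(np.uint8)

def iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def adaptive_threshold(prob_map, affinity_map,
                       candidates=np.arange(0.30, 0.56, 0.02)):
    """Eq. (4): pick the threshold whose binary mask best matches the coarse
    affinity map predicted by AffinityNet, using IoU as the matching cost."""
    best_gamma, best_iou = candidates[0], -1.0
    for gamma in candidates:
        score = iou(binarize(prob_map, gamma), affinity_map)
        if score > best_iou:
            best_gamma, best_iou = gamma, score
    return best_gamma

# usage (shapes and inputs are illustrative):
# maps = [np.random.rand(64, 64) for _ in range(4 * 50)]   # 4 layers x 50 steps
# affinity = np.random.rand(64, 64) > 0.5
# a_hat = aggregate_attention(maps)
# mask = binarize(a_hat, adaptive_threshold(a_hat, affinity))
```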
Figure 4 – Pipeline for DiffuMask with a prompt: 'Photo of a [sub-class] car in the street'. DiffuMask mainly includes three steps: 1) Prompt engineering is used to enhance the diversity and reality of the prompt language (Sec. 3.4). 2) Image and mask generation and refinement with the adaptive threshold from AffinityNet (Sec. 3.2). 3) Noise learning is designed to further improve the quality of the data via filtering the noisy labels (Sec. 3.3). The three panels correspond to 'Diversity and Reality for Prompt' (CLIP retrieval of image captions and random sampling), 'Image and Mask Generation and Refinement' (denoising U-Net, cross-attention map, AffinityNet affinity map, DenseCRF with different thresholds {γ_i}), and 'Noise Learning (Prune)' (IoU matching against the model prediction M(B_γ̂; I, θ) by cross-validation, then count, rank, and prune of the noisy data X into clean data).
Figure 5 – Effect of Noise Learning (NL) on the IoU distributions of (a) the 'Horse' class and (b) the 'Bird' class (probability density vs. IoU; 'Original Distribution of Accuracy' vs. 'With Noise Learning'). 30k generative images are used for each class. NL prunes 70% of the images on the basis of the rank of IoU. Mask2former [11] pre-trained on VOC 2012 [19] is used to generate the ground truth. NL brings an obvious improvement in mask quality by pruning data.

3.3. Noise Learning

Even after the mask generation and refinement above, there still exist noisy labels with low precision. Fig. 5 provides the probability density distribution of IoU for the 'Horse' and 'Bird' classes. The masks with IoU under 80% account for a non-negligible proportion and may cause a significant performance drop. Inspired by noise learning [44, 56, 10] for the classification task, we design a simple yet effective noise learning (NL) strategy to prune the noisy labels for the segmentation task.

NL improves the data quality by identifying and filtering noisy labels. The main procedure (see Fig. 4) comprises two steps: (1) Count: estimating the distribution of label noise Q_{B_γ̂, B*} to characterize pixel-level label noise, where B* refers to the prediction of the model. (2) Rank and Prune: filtering out noisy examples and training with the error-removed data. Formally, given massive generative images and annotations {(I, B_γ̂)}, a segmentation model θ (e.g., Mask2former [11], Mask-RCNN [27]) is used to predict out-of-sample probabilities of the segmentation result θ : I → M^c(B_γ̂; I, θ) by cross-validation. Then we can estimate the joint distribution of noisy labels B_γ̂ and true labels, Q^c_{B_γ̂, B*} = Φ_IoU(B_γ̂, B*), where c denotes the c-th class. With Q^c_{B_γ̂, B*}, some interpretable and explainable ranking methods, such as loss reweighting [22, 41], can be used for CL to find label errors. In this paper, we adopt a simple and effective modularized rank-and-prune method, i.e., Prune by Class, which decouples the model and the data cleaning procedure. For each class, we select and prune the α% of examples with the lowest self-confidence Q^c_{B_γ̂, B*} as the noisy data, and train the model θ with the remaining clean data; α% is set to 70% (Fig. 5).
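As a concrete reference, the following is a minimal sketch of the Prune-by-Class step, assuming the per-example self-confidence is the IoU between the generated mask B_γ̂ and the cross-validated model prediction B*; the cross-validation training loop itself is omitted, and all names are ours rather than the released implementation.

```python
import numpy as np

def iou(mask_a, mask_b):
    """Self-confidence Q^c: IoU between the generated mask and the model prediction."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union > 0 else 0.0

def prune_by_class(samples, alpha=0.7):
    """Rank & Prune: for each class, drop the alpha fraction with the lowest IoU.

    samples: list of dicts {"class": str, "gen_mask": ndarray, "pred_mask": ndarray, ...}
    returns the retained (clean) samples.
    """
    by_class = {}
    for s in samples:
        s["confidence"] = iou(s["gen_mask"] > 0, s["pred_mask"] > 0)
        by_class.setdefault(s["class"], []).append(s)

    clean = []
    for cls, items in by_class.items():
        # rank by self-confidence, keep only the top (1 - alpha) share per class
        items.sort(key=lambda s: s["confidence"], reverse=True)
        keep = max(1, int(round(len(items) * (1 - alpha))))
        clean.extend(items[:keep])
    return clean
```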
Figure – Data augmentation examples with the prompt 'Photo of a bird': (a) Splicing (2×2), (b) Gaussian Blur, (c) Occlusion, (d) Perspective Transform.
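For reference, the sketch below illustrates the four augmentations named above on paired (image, mask) samples using PIL and torchvision. The paper does not specify the exact parameters or how masks are handled, so the blur radius, occlusion size and color, distortion scale, and the choice to relabel occluded pixels as background are all our assumptions.

```python
import random
from PIL import Image, ImageFilter
from torchvision import transforms as T
import torchvision.transforms.functional as TF

def splice_2x2(pairs, size=512):
    """Phi_1 'Splicing (2x2)': tile four (image, mask) pairs into one sample."""
    canvas_img = Image.new("RGB", (size * 2, size * 2))
    canvas_msk = Image.new("L", (size * 2, size * 2))
    for (img, msk), (x, y) in zip(pairs, [(0, 0), (size, 0), (0, size), (size, size)]):
        canvas_img.paste(img.resize((size, size)), (x, y))
        canvas_msk.paste(msk.resize((size, size), Image.NEAREST), (x, y))
    return canvas_img, canvas_msk

def gaussian_blur(img, msk, radius=2.0):
    """Phi_2: blur the image only; the mask is unchanged."""
    return img.filter(ImageFilter.GaussianBlur(radius)), msk

def occlusion(img, msk, frac=0.25):
    """Phi_3: paste a random gray box over the image (assumption: the occluded
    region is relabeled as background, i.e., 0, in the mask)."""
    w, h = img.size
    bw, bh = int(w * frac), int(h * frac)
    x, y = random.randint(0, w - bw), random.randint(0, h - bh)
    img, msk = img.copy(), msk.copy()
    img.paste((128, 128, 128), (x, y, x + bw, y + bh))
    msk.paste(0, (x, y, x + bw, y + bh))
    return img, msk

def perspective(img, msk, distortion=0.3):
    """Phi_4: apply the same random perspective warp to image and mask."""
    w, h = img.size
    start, end = T.RandomPerspective.get_params(w, h, distortion)
    img = TF.perspective(img, start, end, interpolation=TF.InterpolationMode.BILINEAR)
    msk = TF.perspective(msk, start, end, interpolation=TF.InterpolationMode.NEAREST)
    return img, msk
```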
Table 1 – Results of semantic segmentation on the VOC 2012 val set. mIoU is over 20 classes. 'S' and 'R' refer to 'Synthetic' and 'Real'.

Table 4 – DiffuMask ablations, performed on VOC 2012 val: (a) DiffuMask vs. attention map, (b) prompt engineering, (c) noise learning, (d) data augmentation. γ and 'AT' denote the 'Threshold' and 'Adaptive Threshold', respectively. α refers to the proportion of data pruning. Φ1, Φ2, Φ3 and Φ4 refer to 'Splicing', 'Gaussian Blur', 'Occlusion', and 'Perspective Transform', respectively. 'Retri.' and 'Sub-C' denote 'retrieval-based' and 'Sub-Class', respectively. Mask2former with Swin-B is adopted as the baseline.
Our synthetic data does not need any manual localization or mask annotation, while real data needs humans to perform pixel-wise mask annotation. For some categories, i.e., bird, cat, cow, horse, sheep, DiffuMask presents a powerful performance, which is quite close to that of training on real data (within a 5% gap). Besides, by fine-tuning on a little real data, the results can be improved further and exceed that of training on the full real data, e.g., 84.9% mIoU when fine-tuning on 5.0k real images vs. 83.4% mIoU when training on the full real data (10.6k).

Cityscapes. Table 2 presents the results on Cityscapes. Urban street scenes of Cityscapes are more challenging, including a mass of small objects and complex backgrounds. We only evaluate two classes, i.e., Vehicle and Human, which are the two most important categories in the driving scene. Compared with training on real images, DiffuMask presents a competitive result, i.e., 79.6% vs. 90.8% mIoU.

ADE20K. ADE20K, a more challenging dataset, is also used to evaluate DiffuMask. Table 5 presents the results of three categories (bus, car, person) on ADE20K. With fewer synthetic images (6k), we achieve a competitive performance compared with that obtained with a mass of real images (20.2k). Compared with the other two categories, the car class achieves the best performance, with 73.4% mIoU.

Table 5 – The mIoU (%) of semantic segmentation on the ADE20K val set (category-wise IoU and mean IoU, in %).

Train Set    Number     Backbone   bus    car    person   mIoU
Train with Pure Real Data
ADE20K       R: 20.2k   R50        87.9   82.5   79.4     83.3
ADE20K       R: 20.2k   Swin-B     93.6   86.1   84.0     87.9
Train with Pure Synthetic Data
DiffuMask    S: 6.0k    R50        43.4   67.3   60.2     57.0
DiffuMask    S: 6.0k    Swin-B     72.8   73.4   62.6     69.6

Table 6 – Performance for domain generalization between different datasets (mIoU, %). Mask2former [11] with ResNet50 is used as the baseline. The Person and Rider classes of Cityscapes [14] are considered as the same class, i.e., Person, in the experiment.

Train Set         Test Set               Car    Person   Motorbike   mIoU
Cityscapes [14]   VOC 2012 [19] val      26.4   32.9     28.3        29.2
ADE20K [68]       VOC 2012 [19] val      73.2   66.6     64.1        68.0
DiffuMask         VOC 2012 [19] val      74.2   71.0     63.2        69.5
VOC 2012 [19]     Cityscapes [14] val    85.6   53.2     11.9        50.2
ADE20K [68]       Cityscapes [14] val    83.3   63.4     33.7        60.1
DiffuMask         Cityscapes [14] val    84.0   70.7     23.6        59.4
4.3. Protocol-II: Open-vocabulary Segmentation

As shown in Fig. 1, it is natural and seamless to extend the text-driven synthetic data (our DiffuMask) to the open-vocabulary (zero-shot) task. As shown in Table 3, compared with prior works trained on real images with manually annotated masks, DiffuMask can achieve a SOTA result on the Unseen classes. It is worth mentioning that DiffuMask is purely synthetic/fake data supervised by text, while prior works all need real images and the corresponding manual mask annotation. Li et al., as one contemporaneous work, use a segmentation model pre-trained on COCO [38] to predict the pseudo label of the synthetic image, which is costly.

4.4. Protocol-III: Domain Generalization

Table 6 presents the results for cross-dataset validation, which can evaluate the generalization of the data. Compared with real data, DiffuMask shows powerful effectiveness on domain generalization, e.g., 69.5% with DiffuMask vs. 68.0% with ADE20K [68] on VOC 2012 val. The domain gap [57] between real datasets is sometimes bigger than that between synthetic and real data. For the Motorbike class, the model trained with Cityscapes only achieves 28.9% mIoU, while that of DiffuMask is 63.2% mIoU. We argue that the main reason is the domain shift in the foreground and background domains, i.e., Cityscapes contains images of city roads, with the majority of Motorbike objects being small in size, whereas VOC 2012 is an open-set scenario, where Motorbike objects vary greatly in size and include close-up shots.

4.5. Ablation Study

Compared with Attention Map. Table 4a presents the comparison with the attention map and the impact of the binarization threshold γ. It is clear that the optimal threshold for different categories is different, and it even varies across different images of the same category. Sometimes it is sensitive for some categories, such as Dog. The mIoU of γ = 0.4 is
Backbone     Bird   Dog    Sheep   Horse   Person   mIoU
ResNet-50    86.7   65.1   64.7    64.6    71.0     70.3
ResNet-101   86.7   66.8   65.3    63.4    70.2     70.5
Swin-B       92.9   86.0   92.2    89.0    76.5     87.3

Annotation                      Bird   Dog    Person   Sofa   mIoU
Real Image, Manual Label        93.7   96.8   92.5     65.6   87.2
Synthetic Image, Pseudo Label   95.2   86.2   89.9     59.5   82.7
Synthetic Image, DiffuMask      92.9   86.0   76.5     49.8   76.3