
IMAGE AUGMENTATION WITH CONTROLLED DIFFUSION FOR WEAKLY-SUPERVISED SEMANTIC SEGMENTATION

Wangyu Wu1,2, Tianhong Dai3, Xiaowei Huang2, Fei Ma1, Jimin Xiao1
1 Xi'an Jiaotong-Liverpool University   2 The University of Liverpool   3 University of Aberdeen
∗ Corresponding authors

ABSTRACT

Weakly-supervised semantic segmentation (WSSS), which aims to train segmentation models using only image-level labels, has attracted significant attention. Existing methods primarily focus on generating high-quality pseudo labels from the available images and their image-level labels. However, the quality of the pseudo labels degrades significantly when the available dataset is small. In this paper, we therefore tackle the problem from a different angle by introducing a novel approach called Image Augmentation with Controlled Diffusion (IACD). This framework effectively augments existing labeled datasets by generating diverse images through controlled diffusion, where the available images and image-level labels serve as the controlling information. Moreover, we propose a high-quality image selection strategy to mitigate the potential noise introduced by the randomness of diffusion models. In the experiments, our proposed IACD approach clearly surpasses existing state-of-the-art methods. The effect is more pronounced when the amount of available data is small, demonstrating the effectiveness of our method.

Index Terms— weakly-supervised semantic segmentation, diffusion model, high-quality image selection

Fig. 1. (a) In previous methods, only images from the original dataset are used for training. (b) Our proposed IACD utilizes a diffusion model to generate synthetic images; an image selection module then annotates and selects the high-quality synthetic images to augment the original dataset for training.

1. INTRODUCTION

Weakly-supervised semantic segmentation (WSSS) leverages image-level labels to generate pixel-level pseudo masks for training segmentation models. The primary challenge lies in enhancing the quality of the generated pseudo labels. Most current methods either inject more category information into the network or perform additional learning on the existing training data [1, 2], such as exploring sub-class distinctions [2] and adding category information to the network [1]. Alternatively, efforts are directed towards optimizing network structures [3, 4, 5] to better suit learning in weakly-supervised scenarios. However, all of these methods are constrained by the scale of the available training data.

The Diffusion Probabilistic Model (DPM) [6] is an appealing choice for this problem because it belongs to a class of deep generative models that have recently gained prominence in computer vision [7, 8, 9]. The generated images exhibit high quality with few artifacts and align well with the given text prompts, even when these prompts depict unrealistic scenarios that were never encountered during training, which highlights the strong generalization capabilities of diffusion models. Notably, recent works such as Stable Diffusion [10] with ControlNet [11] are able to generate high-quality synthetic images.

In this paper, we propose Image Augmentation with Controlled Diffusion (IACD) to generate high-quality synthetic training data for WSSS (see Fig. 1). Our contributions are: 1) To our knowledge, this is the first proposal to utilize conditional diffusion to augment an original dataset that carries only image-level labels, with the aim of enhancing WSSS performance. 2) An image selection approach is introduced to keep high-quality training data while effectively filtering out low-quality generated images, preventing any adverse impact on model training. 3) Our proposed framework outperforms all current state-of-the-art methods, with performance improvements across different training data sizes; in particular, with 5% of the training data it yields a 4.9% increase on the segmentation task on the PASCAL VOC 2012 validation set.

Fig. 2. The pipeline of our IACD. In the illustrated example, an airplane image and a prompt built from its image labels are fed into the diffusion model to generate candidate images. The candidate images are then filtered through an image selection process so that only high-quality images are used as training data for the downstream WSSS.

2. METHODOLOGY

In this section, we present the general framework and key components of our proposed method. We first introduce the overall architecture and pipeline of our IACD method (Sec. 2.1). Then, a diffusion model based approach is proposed for data augmentation in WSSS tasks (Sec. 2.2). Furthermore, we develop a high-quality image selection strategy that aims to ensure the quality of the data generated by the diffusion model, thereby reducing the noise introduced into training (Sec. 2.3). Finally, the composition of the final dataset used for training is discussed (Sec. 2.4).

2.1. Overall Structure

As illustrated in Fig. 2, we utilize the diffusion model [10] along with ControlNet [11] to generate new training samples under the guidance of conditioning inputs: the original images and label prompts. In addition, we train a Vision Transformer (ViT) based image classifier [12] on the existing dataset with image-level labels to select high-quality generated training samples. During the selection, we keep generated images with high prediction scores as high-quality samples and filter out low-quality, noisy generated images. Finally, we extend the original dataset with the selected samples for the training of WSSS.

Fig. 3. The overall framework of IACD consists of several steps. Firstly, IACD utilizes controlled diffusion to generate entirely new images. Subsequently, the original image is processed by a Vision Transformer (ViT) encoder to produce patch embeddings, and a patch-driven classifier is trained for image categorization. The generated diffusion images are then passed through the same trained classifier to select a high-quality image set. Finally, the selected image set, along with the original images and their corresponding labels, is passed to the downstream WSSS task.

2.2. Controlled Diffusion for Data Augmentation

The motivation for using controlled image diffusion models to augment data is that these models can generate virtually unlimited and diverse task-specific synthetic images based on a given image and a text prompt. In this work, we utilize Stable Diffusion with ControlNet (SDC) [11] as our generative model (see Fig. 3). In the data augmentation stage, an input image Xin ∈ R^{h×w×3}, a text prompt P, and a detection map M are fed into the SDC model δ(·) to generate a new training sample Xaug:

    Xaug = δ(Xin, M, P).    (1)

More specifically, the text prompt is formulated from the corresponding image-level label Y. The detection map is an extra condition (e.g., Canny edges [13] or Openpose [14]) used to control the generation results. More details about the data augmentation process are described in Algo. 1.

Algorithm 1: Diffusion Model for Data Augmentation
    Input: an input image Xin, an image-level label Y
    Output: a generated image Xaug
    P ← generate_prompt(Y)
    if "person" ∈ Y then
        M ← detect_map(Xin, human_pose)
    else
        M ← detect_map(Xin, canny_edge)
    Xaug ← δ(Xin, M, P)
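
For illustration, a minimal sketch of the augmentation step of Algorithm 1 is given below, using the Hugging Face diffusers library. The model checkpoints, prompt template, Canny thresholds, and helper names are assumptions made for this sketch, not the authors' released code; only the Canny-edge branch is shown.

```python
# Illustrative sketch of Algorithm 1 with diffusers; checkpoints, prompt template,
# and thresholds are assumptions, not the authors' implementation.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline


def generate_augmented_image(x_in: Image.Image, labels: list[str]) -> Image.Image:
    # Prompt P formulated from the image-level labels Y (template is an assumption).
    prompt = "a photo of " + " and ".join(labels)

    # Detection map M: the paper uses Openpose when "person" is in Y and Canny edges
    # otherwise; only the Canny branch is sketched here.
    edges = cv2.Canny(np.array(x_in.convert("L")), 100, 200)
    control_map = Image.fromarray(np.stack([edges] * 3, axis=-1))

    # Stable Diffusion with ControlNet (SDC); checkpoints are illustrative choices.
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
        torch_dtype=torch.float16).to("cuda")

    # X_aug = delta(X_in, M, P); 20 diffusion steps, as in Sec. 3.1.
    return pipe(prompt, image=control_map, num_inference_steps=20).images[0]
```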

2.3. High-quality Synthetic Image Selection

In order to guarantee the quality of the synthetic data that will be used for training, a selection strategy is introduced to keep only the high-quality generated samples. As shown in Fig. 3, a ViT-based patch-driven classifier is first trained on the original dataset with its image-level labels. To train the classifier, the input image Xin is divided into s patches Xpatch ∈ R^{d×d×3} of fixed size, where s = hw/d². Then, the patch embeddings F ∈ R^{s×e} are obtained with the ViT encoder. Next, a weight matrix W ∈ R^{e×|C|} and a softmax function are applied to output the prediction scores Z ∈ R^{s×|C|} of each patch:

    Z = softmax(F W),    (2)

where C is the set of categories in the dataset. Global maximum pooling (GMP) is then used to select the highest prediction score ŷ ∈ R^{1×|C|} for each class among all patches. Finally, ŷ is utilized as the prediction score for image-level classification, and the classifier is trained with the multi-label classification error (MCE):

    L_MCE = (1/|C|) Σ_{c∈C} BCE(yc, ŷc)
          = −(1/|C|) Σ_{c∈C} [ yc log(ŷc) + (1 − yc) log(1 − ŷc) ],    (3)

where ŷc is the prediction score of class c and yc is the ground-truth label. Once the classifier is trained, we use it to select the high-quality generated training data.
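
A minimal PyTorch sketch of this patch-driven classifier and the MCE loss is given below. The timm backbone name, the assumption that forward_features returns the token sequence with a single class-token prefix, and the numerical clamping are illustrative choices, not the authors' implementation.

```python
# Sketch of the patch-driven classifier (Eqs. (2)-(3)); backbone name and token
# layout are assumptions for illustration.
import timm
import torch
import torch.nn as nn


class PatchDrivenClassifier(nn.Module):
    def __init__(self, num_classes: int, embed_dim: int = 768):
        super().__init__()
        # ViT-B/16 encoder at 384x384 input, giving a 24x24 grid of patch tokens.
        self.encoder = timm.create_model("vit_base_patch16_384", pretrained=True)
        self.head = nn.Linear(embed_dim, num_classes)  # weight W in Eq. (2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.encoder.forward_features(x)     # (B, 1 + s, e) in recent timm
        patch_embed = tokens[:, 1:, :]                # F in R^{s x e}, drop class token
        z = torch.softmax(self.head(patch_embed), -1) # Z in R^{s x |C|}, Eq. (2)
        y_hat = z.max(dim=1).values                   # global max pooling over patches
        return y_hat                                  # image-level score per class


def mce_loss(y_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Multi-label classification error of Eq. (3): mean BCE between pooled scores
    # and image-level labels; clamping avoids log(0).
    bce = nn.BCELoss(reduction="mean")
    return bce(y_hat.clamp(1e-6, 1 - 1e-6), y.float())
```
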
In the selection stage, the synthetic image Xaug generated from ⟨Xin, Y⟩ is passed into the classifier, followed by GMP, to output the image-level prediction scores ŷ. Then, the classes with scores above a certain threshold ϵ are used as the ground-truth label for the generated image: Yaug = {c | ŷc > ϵ}. If Yaug is a subset of the label Y of the input image, the generated sample ⟨Xaug, Yaug⟩ is added to the synthetic dataset Daug. More details about the image selection are described in Algo. 2.

Algorithm 2: High-Quality Image Selection
    Input: a ground-truth label Y of the input image, a generated image Xaug, a prediction score ŷ, a set of classes C, a threshold ϵ, and a synthetic dataset Daug
    Output: the synthetic dataset Daug
    Yaug ← ∅
    foreach c ∈ C do
        if ŷc > ϵ then
            Yaug ← Yaug ∪ {c}
    if Yaug ⊆ Y then
        Daug ← Daug ∪ {⟨Xaug, Yaug⟩}

This selection strategy serves two purposes. First, it only keeps the synthetic samples with high prediction scores in specific categories, which guarantees a high probability that objects of these classes actually appear in the synthetic image. Second, the synthetic image will not contain objects that do not belong to the image-level label of the input image. In this way, the quality of the synthetic dataset Daug can be improved.
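
The sketch below mirrors Algorithm 2 in Python, reusing the classifier from the previous sketch. The threshold value follows Sec. 3.1; rejecting an empty Yaug is an extra practical assumption on top of the subset test in Algorithm 2.

```python
# Sketch of the high-quality image selection (Algorithm 2).
import torch


@torch.no_grad()
def select_synthetic_sample(classifier, x_aug: torch.Tensor, y_input: set[int],
                            eps: float = 0.9):
    """Return the pseudo label set Y_aug for x_aug if it passes the check, else None."""
    y_hat = classifier(x_aug.unsqueeze(0)).squeeze(0)            # image-level scores ŷ
    y_aug = {c for c in range(y_hat.numel()) if y_hat[c] > eps}  # Y_aug = {c | ŷ_c > eps}
    # Algorithm 2 keeps the sample when Y_aug ⊆ Y; requiring a non-empty Y_aug is an
    # additional practical assumption so that images with no confident object are dropped.
    if y_aug and y_aug.issubset(y_input):
        return y_aug
    return None


# Usage sketch: build the synthetic dataset D_aug from generated candidates.
# d_aug = []
# for x_gen, y_in in candidates:        # candidates produced as in Algorithm 1
#     y_aug = select_synthetic_sample(classifier, x_gen, y_in)
#     if y_aug is not None:
#         d_aug.append((x_gen, y_aug))
```
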
2.4. Final Training Dataset of WSSS

After selecting the high-quality generated training samples, the synthetic dataset Daug and the original dataset Dorigin are combined into an extended dataset Dfinal for the training of WSSS: Dfinal = Dorigin ∪ Daug.

3. EXPERIMENTS

In this section, we describe the experimental setup, including the dataset, evaluation metric, and implementation details. We then compare our method with state-of-the-art approaches on PASCAL VOC 2012 [15]. Finally, ablation studies are performed to validate the effectiveness of the proposed method.

3.1. Experimental Settings

Dataset and Evaluation Metric. We conduct our experiments on PASCAL VOC 2012 [15], which comprises 21 categories, including the additional background class. The PASCAL VOC 2012 dataset is typically augmented with the SBD dataset [16]. In total, we utilize 10,582 images with image-level annotations for training and 1,449 images for validation. The training set of PASCAL VOC contains images with only image-level labels. We report the mean Intersection-over-Union (mIoU) as the evaluation criterion. Additionally, we evaluate the performance of our IACD method when the amount of original training data is gradually reduced from 10,582 (100%) to 529 (5%).
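
A simplified sketch of the mIoU computation is given below. It assumes integer label maps over the 21 classes; PASCAL VOC void pixels (label 255) are not handled here and should be excluded in a full evaluation.

```python
# Dataset-level mean Intersection-over-Union (mIoU) over paired label maps.
import numpy as np


def mean_iou(preds, gts, num_classes: int = 21) -> float:
    inter = np.zeros(num_classes)
    union = np.zeros(num_classes)
    for pred, gt in zip(preds, gts):
        for c in range(num_classes):
            inter[c] += np.logical_and(pred == c, gt == c).sum()
            union[c] += np.logical_or(pred == c, gt == c).sum()
    valid = union > 0                     # ignore classes absent from the whole set
    return float((inter[valid] / union[valid]).mean())
```
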
Implementation Details. In our experiments, we employ ViT-B/16 as the ViT model, and we use the Stable Diffusion model [10] with ControlNet [11] as our diffusion model. Images are resized to 384×384 pixels [17] during the training of the patch-driven image classifier, and the 24×24 encoded patch features are retained as input. The classifier is trained with a batch size of 16 for a maximum of 80 epochs. The image selection threshold ϵ is set to 0.9. We use Canny edges [13] and Openpose [14] as detectors for ControlNet [11], with a total of 20 diffusion steps. Due to limitations in computational resources, we generate 10,582 additional images with the diffusion model in the experiments. During the WSSS training stage, we combine our synthetic dataset with the original training dataset to form the final training dataset. We adopt ViT-PCM [3] as our WSSS framework without any modifications: the final training dataset serves as its input, while all other settings are kept consistent with ViT-PCM [3]. The experiments are conducted on two NVIDIA 4090 GPUs. Finally, we use the same evaluation task and settings as ViT-PCM [3].
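
The implementation details above can be collected into a single configuration sketch; the field names are illustrative, while the values follow this section.

```python
# Hyperparameters reported in Sec. 3.1, gathered for reference; names are illustrative.
IACD_CONFIG = {
    "classifier_backbone": "ViT-B/16",
    "input_resolution": (384, 384),        # resized classifier training images
    "patch_grid": (24, 24),                # encoded patch features kept as input
    "batch_size": 16,
    "max_epochs": 80,
    "selection_threshold": 0.9,            # eps in the image selection
    "controlnet_detectors": ["canny_edge", "openpose"],
    "diffusion_steps": 20,
    "num_generated_images": 10582,
    "wsss_framework": "ViT-PCM",           # downstream framework, unmodified
}
```
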
3.2. Comparison with State-of-the-arts

Comparison of Different Data Percentages. As shown in Tab. 1, our proposed IACD method effectively enlarges the original training dataset. As upstream data augmentation, it greatly helps the downstream WSSS framework achieve higher segmentation performance. Furthermore, we observe that the smaller the amount of training data, the more obvious the effect, which suggests that our approach is highly effective at augmenting the dataset and improving performance when the training data is insufficient.

Table 1. The comparison of segmentation performance on different sizes of training data.
    Percentage of Train Data    Baseline on Val    IACD on Val
    5%                          62.6%              67.5% (+4.9%)
    15%                         65.6%              68.5% (+2.9%)
    50%                         68.2%              70.5% (+2.3%)
    100%                        69.3%              71.4% (+2.1%)

Improvements in Segmentation Results. To assess our method, we apply our approach as upstream data augmentation to the current state-of-the-art ViT-PCM, while keeping the downstream WSSS consistent with the existing ViT-PCM. We then compare the segmentation results with state-of-the-art techniques in Tab. 2. Even with only 50% of the training data, our method outperforms the baseline ViT-PCM [3]. A comparison of qualitative segmentation results is shown in Fig. 4.

Table 2. The comparison of semantic segmentation performance by using only pseudo masks for training.
    Percentage of Train Data    Model                    Pub.      mIoU (%)
    100%                        MCTformer [4]            CVPR22    61.7
    100%                        PPC [18]                 CVPR22    61.5
    100%                        SIPE [19]                CVPR22    58.6
    100%                        AFA [5]                  CVPR22    63.8
    100%                        ViT-PCM [3]              ECCV22    69.3
    50%                         IACD (Ours) + ViT-PCM    —         70.5
    100%                        IACD (Ours) + ViT-PCM    —         71.4

Fig. 4. The comparison of qualitative segmentation results with ViT-PCM [3].

3.3. Ablation Studies

We conducted an ablation study to assess the impact of our two key contributions: diffusion augmentation and high-quality image selection. As shown in Tab. 3, using diffusion augmentation alone introduces some random noisy images generated by the diffusion model, resulting in a 0.2% decrease in mIoU on the validation set. The proposed high-quality image selection effectively removes these noisy images by filtering out low-quality ones, leading to a 2.1% improvement in mIoU over the baseline WSSS framework. When the two components are combined, our approach significantly outperforms the original framework.

Table 3. Ablation study on the data augmentation module and the high-quality image selection module.
    Backbone    Original Train    Diffusion Augmentation    Image Selection    Result on Val
    ViT-B/16    ✓                                                              69.3%
    ViT-B/16    ✓                 ✓                                            69.1%
    ViT-B/16    ✓                 ✓                         ✓                  71.4%

4. CONCLUSION

In this work, we propose the IACD approach for data augmentation in weakly-supervised semantic segmentation (WSSS). Unlike previous methods that focus on optimizing network structures or mining information from the existing images, we introduce a diffusion model based module to augment the training data. To guarantee the quality of the generated images, a high-quality image selection module is also proposed. By combining these two components, our approach achieves better performance than other state-of-the-art methods on the PASCAL VOC 2012 dataset.

5. REFERENCES

[1] Zhaozhi Xie and Hongtao Lu, "Exploring category consistency for weakly supervised semantic segmentation," in IEEE Int. Conf. Acoust. Speech Signal Process., 2022, pp. 2609–2613.

[2] Yu-Ting Chang, Qiaosong Wang, Wei-Chih Hung, Robinson Piramuthu, Yi-Hsuan Tsai, and Ming-Hsuan Yang, "Weakly-supervised semantic segmentation via sub-category exploration," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 8991–9000.

[3] Simone Rossetti, Damiano Zappia, Marta Sanzari, Marco Schaerf, and Fiora Pirri, "Max pooling with vision transformers reconciles class and shape in weakly supervised semantic segmentation," in Eur. Conf. Comput. Vis., 2022, pp. 446–463.

[4] Lian Xu, Wanli Ouyang, Mohammed Bennamoun, Farid Boussaid, and Dan Xu, "Multi-class token transformer for weakly supervised semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 4310–4319.

[5] Lixiang Ru, Yibing Zhan, Baosheng Yu, and Bo Du, "Learning affinity from attention: End-to-end weakly-supervised semantic segmentation with transformers," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 16846–16855.

[6] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli, "Deep unsupervised learning using nonequilibrium thermodynamics," in Int. Conf. Mach. Learn., 2015, pp. 2256–2265.

[7] Jonathan Ho, Ajay Jain, and Pieter Abbeel, "Denoising diffusion probabilistic models," in Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.

[8] Jiaming Song, Chenlin Meng, and Stefano Ermon, "Denoising diffusion implicit models," arXiv preprint arXiv:2010.02502, 2020.

[9] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole, "Score-based generative modeling through stochastic differential equations," arXiv preprint arXiv:2011.13456, 2020.

[10] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, "High-resolution image synthesis with latent diffusion models," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 10684–10695.

[11] Lvmin Zhang and Maneesh Agrawala, "Adding conditional control to text-to-image diffusion models," arXiv preprint arXiv:2302.05543, 2023.

[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.

[13] John Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 6, pp. 679–698, 1986.

[14] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh, "OpenPose: Realtime multi-person 2D pose estimation using part affinity fields," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 1, pp. 172–186, 2021.

[15] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, pp. 303–338, 2010.

[16] Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik, "Semantic contours from inverse detectors," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 991–998.

[17] Alexander Kolesnikov and Christoph H. Lampert, "Seed, expand and constrain: Three principles for weakly-supervised image segmentation," in Eur. Conf. Comput. Vis., 2016, pp. 695–711.

[18] Ye Du, Zehua Fu, Qingjie Liu, and Yunhong Wang, "Weakly supervised semantic segmentation by pixel-to-prototype contrast," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 4320–4329.

[19] Qi Chen, Lingxiao Yang, Jian-Huang Lai, and Xiaohua Xie, "Self-supervised image-specific prototype exploration for weakly supervised semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 4288–4298.
