Abstract
With the remarkable progress of deep CNNs, recent approaches have achieved some success in image generation from scene-level freehand sketches. However, most existing methods adopt a two-stage strategy, generating the foreground and the background of the image separately. In this paper, we propose a novel one-stage GAN-based architecture, named SSGAN, which generates images directly from sketches. Moreover, we design a novel Semantic Fusion Module (SFM) to better learn the intermediate features. Extensive experiments on SketchyCOCO demonstrate that our framework achieves competitive performance compared with state-of-the-art methods.
Figure 1 Overview of the proposed SSGAN for image synthesis from scene-level freehand sketches. Given a scene-level freehand sketch, we obtain its semantic mask with a pre-trained segmentation model. The proposed Semantic Fusion Module (SFM) turns the generative learning problem of sketch-to-image into sketch-to-mask-to-image learning. The bottom-right inset illustrates the SFM.
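As a rough, non-authoritative illustration of this pipeline, the sketch below mirrors Figure 1 in PyTorch. SegmentationNet and SSGANGenerator are hypothetical stand-ins with toy internals; the excerpt only states that a pre-trained segmentation model maps the sketch to a semantic mask and that the generator (with the SFM inside) maps the mask, together with sketch information, to an image.

# Minimal pipeline sketch (assumptions: single-channel sketch input, a toy
# number of semantic classes, and placeholder network bodies).
import torch
import torch.nn as nn

class SegmentationNet(nn.Module):
    # placeholder for the pre-trained segmentation model of Figure 1
    def __init__(self, num_classes=8):
        super().__init__()
        self.head = nn.Conv2d(1, num_classes, kernel_size=3, padding=1)
    def forward(self, sketch):
        return self.head(sketch).softmax(dim=1)  # soft semantic mask of the sketch

class SSGANGenerator(nn.Module):
    # placeholder generator; the real model uses SPADE-style blocks with the SFM
    def __init__(self, num_classes=8):
        super().__init__()
        self.body = nn.Conv2d(num_classes, 64, kernel_size=3, padding=1)
        self.to_rgb = nn.Conv2d(64, 3, kernel_size=3, padding=1)
    def forward(self, semantic_mask):
        return torch.tanh(self.to_rgb(self.body(semantic_mask)))

sketch = torch.rand(1, 1, 256, 256)      # scene-level freehand sketch
mask = SegmentationNet()(sketch)         # sketch -> semantic mask
image = SSGANGenerator()(mask)           # mask -> image, i.e. sketch-to-mask-to-image
print(image.shape)                       # torch.Size([1, 3, 256, 256])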
Mathematically, we define $M_i \in \{0,1\}^{H \times W \times C}$ as the intermediate mask from the $i$-th SPADE layer, $M_s \in \{0,1\}^{H \times W \times C}$ as the input semantic sketch, and $M_{fg} \in \{0,1\}^{H \times W \times C}$ as the foreground segmentation of the sketch, which is zero-padded to the same shape as $M_s$. As illustrated in Figure 1, we first use a convolutional network $\mathcal{F}_1$ to encode the label maps into feature maps:

$$f = \mathcal{F}_1(M_i) \oplus P_{avg}\big(\mathcal{F}_1(M_s, M_{fg})\big) \qquad (4)$$

where $P_{avg}$ denotes average pooling and $\oplus$ denotes element-wise addition. Average pooling is used because it better preserves background information. We then use another convolutional network $\mathcal{F}_2$ to obtain the final updated feature maps.
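The fusion step in Eq. (4) can be written compactly in PyTorch. This is only a minimal sketch under assumptions: kernel sizes, channel widths, the pooling window, and whether $\mathcal{F}_1$ shares weights across its two inputs are not specified in the text, so $\mathcal{F}_1$ is modeled here as two small convolutions (one per input) and $\mathcal{F}_2$ as a third.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticFusionModule(nn.Module):
    # Illustrative sketch of Eq. (4): f = F1(M_i) (+) AvgPool(F1(M_s, M_fg)), then F2(f).
    def __init__(self, num_classes, channels=64, pool=3):
        super().__init__()
        self.f1_mask = nn.Conv2d(num_classes, channels, 3, padding=1)        # F1 on M_i
        self.f1_sketch = nn.Conv2d(2 * num_classes, channels, 3, padding=1)  # F1 on (M_s, M_fg)
        self.f2 = nn.Conv2d(channels, channels, 3, padding=1)                # F2
        self.pool = pool

    def forward(self, m_i, m_s, m_fg):
        sketch_feat = self.f1_sketch(torch.cat([m_s, m_fg], dim=1))
        # stride-1 average pooling keeps the spatial size and, as argued above,
        # preserves background information better than max pooling
        sketch_feat = F.avg_pool2d(sketch_feat, self.pool, stride=1, padding=self.pool // 2)
        fused = self.f1_mask(m_i) + sketch_feat   # element-wise addition of Eq. (4)
        return self.f2(fused)                     # final updated feature maps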
The SketchyCOCO dataset is built from natural images in COCO Stuff [2]: using the segmentation masks of these natural images as reference, the scene sketches were generated by compositing instance freehand sketches from Sketchy [17], TU-Berlin [6], and QuickDraw [8]. SketchyCOCO contains 14,081 images, which we split into two sets, 80% for training and the remaining 20% for testing.

We use two metrics to evaluate the generated images. The first is FID [9], which has been widely used to evaluate the quality of generated images; the lower the FID value, the more realistic the image. The second is the structural similarity metric (SSIM) [23], which quantifies the structural similarity between a generated image and the ground-truth image; the higher the SSIM value, the closer they are.
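Below is a hedged example of how these two metrics are typically computed, using torchmetrics for FID and scikit-image for SSIM; the paper does not state which implementations were used, so treat this as a common recipe rather than the authors' exact evaluation script. Images are assumed to be uint8 tensors of shape (N, 3, H, W).

import numpy as np
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from skimage.metrics import structural_similarity

def evaluate(real_imgs: torch.Tensor, fake_imgs: torch.Tensor):
    # FID: lower is better (generated distribution closer to the real one)
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_imgs, real=True)
    fid.update(fake_imgs, real=False)
    fid_score = fid.compute().item()

    # SSIM: higher is better; averaged over ground-truth / generated pairs
    ssim_scores = [
        structural_similarity(
            r.permute(1, 2, 0).numpy(), f.permute(1, 2, 0).numpy(),
            channel_axis=-1, data_range=255)
        for r, f in zip(real_imgs, fake_imgs)
    ]
    return fid_score, float(np.mean(ssim_scores))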
The GauGAN model trained using semantic maps covers all categories in the COCO Stuff dataset, whereas our model is trained on SketchyCOCO, which contains only a subset of the categories in the ground truth. Compared with the GauGAN model trained using semantic sketches, SSGAN matches GauGAN-semantic sketch in SSIM but yields a better FID, indicating that the SFM can effectively learn fine-grained masks. Compared with the scene-level sketch-based image generation baseline SketchyCOCO, our SSGAN achieves a better FID but a lower SSIM. This may be because SketchyCOCO generates the foreground separately and uses the generated foreground instances as constraints, which provides more explicit spatial guidance.

4.3 Qualitative results
Figure 2 shows the images generated by our method and the comparison methods. Note that we cannot reproduce the results of SketchyCOCO because it only provides the pre-trained foreground generation model, not the pre-trained background generation model. Figure 2 demonstrates that SSGAN is able to generate complex images with multiple objects from simple scene-level freehand sketches, and the generated images respect the constraints of the input sketches. Our approach produces much better results than LostGANs, which uses layouts as input, but slightly worse images than the GauGAN model trained using semantic maps. This is consistent with our analysis of the quantitative results.
In Figure 3 we demonstrate the effectiveness of the proposed SFM: Figure 3(c) shows the semantic masks learned by the SFM. It is clear that our approach represents the foreground objects accurately and infers the background from limited information.
Figure 2 Scene-level comparison. (a) Input layout, (b) Generated images by LostGANs, (c) Input semantic map, (d) Generated images by GauGAN, (e) Input scene-level freehand sketch, (f) Generated images by our SSGAN.

5 Conclusion
In this paper, we propose SSGAN for synthesizing images from scene-level freehand sketches, which uses a joint learning paradigm to transform sketch-to-image generation into sketch-to-mask-to-image generation. Specifically, we present a new module, the SFM, which fuses the segmentation masks of each phase with the semantic sketches to realize the sketch-to-mask-to-image pipeline. Comprehensive experiments on the SketchyCOCO dataset demonstrate the effectiveness of our proposed model.
Table 1 The results of quantitative experiments

Model                     FID↓     SSIM↑
LostGANs-layout           134.6    0.280
GauGAN-semantic map        80.3    0.306
GauGAN-semantic sketch    215.1    0.285
SketchyCOCO-scene         164.8    0.288
Ours                      123.8    0.285

References
[1] Brock, Andrew, Jeff Donahue, and Karen Simonyan. "Large Scale GAN Training for High Fidelity Natural Image Synthesis." ArXiv:1809.11096 [Cs], 2018.
[2] Caesar, Holger, Jasper Uijlings, and Vittorio Ferrari. "COCO-Stuff: Thing and Stuff Classes in Context." ArXiv:1612.03716 [Cs], March 28, 2018.
[3] Chen, Tao, Ming-Ming Cheng, Ping Tan, Ariel Shamir, and Shi-Min Hu. "Sketch2Photo: Internet Image Montage." ACM Transactions on Graphics 28, no. 5 (December 2009): 1–10.
[4] Chen, Wengling, and James Hays. "SketchyGAN: Towards Diverse and Realistic Sketch to Image Synthesis." ArXiv:1801.02753 [Cs], April 12, 2018.
[5] Eitz, M., R. Richter, K. Hildebrand, T. Boubekeur, and M. Alexa. "Photosketcher: Interactive Sketch-Based Image Synthesis." IEEE Computer Graphics and Applications 31, no. 6 (November 2011): 56–66.
[6] Eitz, Mathias, James Hays, and Marc Alexa. "How Do Humans Sketch Objects?" ACM Transactions on Graphics 31, no. 4 (August 5, 2012): 1–10.
[7] Gao, Chengying, Qi Liu, Qi Xu, Limin Wang, Jianzhuang Liu, and Changqing Zou. "SketchyCOCO: Image Generation from Freehand Scene Sketches." ArXiv:2003.02683 [Cs], April 7, 2020.
[8] Ha, D., and D. Eck. "A Neural Representation of Sketch Drawings," 2017.
[9] Heusel, M., H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. "GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium," 2017.
[10] Hong, Seunghoon, Dingdong Yang, Jongwook Choi, and Honglak Lee. "Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis." ArXiv:1801.05091 [Cs], July 25, 2018.
[11] Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. "Image-to-Image Translation with Conditional Adversarial Networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1125–34, 2017.
[12] Karras, Tero, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. "Training Generative Adversarial Networks with Limited Data." ArXiv:2006.06676 [Cs, Stat], October 7, 2020.
[13] Karras, Tero, Samuli Laine, and Timo Aila. "A Style-Based Generator Architecture for Generative Adversarial Networks." ArXiv:1812.04948 [Cs, Stat], March 29, 2019.
[14] Lu, Yongyi, Shangzhe Wu, Yu-Wing Tai, and Chi-Keung Tang. "Image Generation from Sketch Constraint Using Contextual GAN." ArXiv:1711.08972 [Cs], July 25, 2018.
[15] Mirza, Mehdi, and Simon Osindero. "Conditional Generative Adversarial Nets." ArXiv:1411.1784 [Cs, Stat], November 6, 2014.
[16] Park, Taesung, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. "Semantic Image Synthesis with Spatially-Adaptive Normalization." ArXiv:1903.07291 [Cs], November 5, 2019.
[17] Sangkloy, Patsorn, Nathan Burnell, Cusuh Ham, and James Hays. "The Sketchy Database: Learning to Retrieve Badly Drawn Bunnies." ACM Transactions on Graphics 35, no. 4 (July 11, 2016): 1–12.
[18] Sun, Wei, and Tianfu Wu. "Image Synthesis From Reconfigurable Layout and Style," n.d., 10.
[19] Sun, Wei, and Tianfu Wu. "Learning Layout and Style Reconfigurable GANs for Controllable Image Synthesis." ArXiv:2003.11571 [Cs], March 26, 2021.
[20] Sushko, Vadim, Edgar Schönfeld, Dan Zhang, Juergen Gall, Bernt Schiele, and Anna Khoreva. "You Only Need Adversarial Supervision for Semantic Image Synthesis." ArXiv:2012.04781 [Cs, Eess], March 19, 2021.
[21] Tang, Hao, Song Bai, and Nicu Sebe. "Dual Attention GANs for Semantic Image Synthesis." ArXiv:2008.13024 [Cs], August 29, 2020.
[22] Wang, Sheng-Yu, David Bau, and Jun-Yan Zhu. "Sketch Your Own GAN." ArXiv:2108.02774 [Cs], September 20, 2021.
[23] Wang, Z. "Image Quality Assessment: From Error Visibility to Structural Similarity." IEEE Transactions on Image Processing, 2004.
[24] Zhao, Bo, Lili Meng, Weidong Yin, and Leonid Sigal. "Image Generation From Layout." In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8576–85. Long Beach, CA, USA: IEEE, 2019.
[25] Zou, Changqing, Haoran Mo, Chengying Gao, Ruofei Du, and Hongbo Fu. "Language-Based Colorization of Scene Sketches." ACM Transactions on Graphics 38, no. 6 (November 8, 2019): 1–16.
[26] Wang, Ting-Chun, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. "High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs." ArXiv:1711.11585 [Cs], August 20, 2018.