
SketchyCOCO: Image Generation from Freehand Scene Sketches

Chengying Gao¹, Qi Liu¹, Qi Xu¹, Limin Wang², Jianzhuang Liu³, Changqing Zou⁴*

¹School of Data and Computer Science, Sun Yat-sen University, China
²State Key Laboratory for Novel Software Technology, Nanjing University, China
³Noah's Ark Lab, Huawei Technologies   ⁴HMI Lab, Huawei Technologies

[email protected]  {liuq99, xuqi5}@mail2.sysu.edu.cn
[email protected]  [email protected]  [email protected]

*Corresponding author.

Abstract

We introduce the first method for automatic image generation from scene-level freehand sketches. Our model allows for controllable image generation by specifying the synthesis goal via freehand sketches. The key contribution is an attribute-vector-bridged Generative Adversarial Network called EdgeGAN, which supports high visual-quality object-level image content generation without using freehand sketches as training data. We have built a large-scale composite dataset called SketchyCOCO to support and evaluate the solution. We validate our approach on the tasks of both object-level and scene-level image generation on SketchyCOCO. Through quantitative and qualitative results, human evaluation, and ablation studies, we demonstrate the method's capacity to generate realistic, complex scene-level images from various freehand sketches.

1. Introduction

In recent years Generative Adversarial Networks (GANs) [16] have shown significant success in modeling high-dimensional distributions of visual data. In particular, high-fidelity images can be achieved by unconditional generative models trained on object-level data (e.g., animal pictures in [4]), class-specific datasets (e.g., indoor scenes [33]), or even a single image with repeated textures [32]. For practical applications, automatic image synthesis that can generate images and videos in response to specific requirements could be more useful. This explains why there are increasingly many studies on adversarial networks conditioned on another input signal such as text [37, 20], semantic maps [2, 21, 6, 34, 27], layouts [2, 20, 38], and scene graphs [2, 23]. Compared to these sources, a freehand sketch has a unique strength in expressing the user's idea in an intuitive and flexible way. Specifically, to describe an object or scene, sketches can better convey the user's intention than other sources, since they lessen the uncertainty by naturally providing more details such as object location, pose, and shape.

In this paper, we extend the use of Generative Adversarial Networks to a new problem: controllably generating realistic images with many objects and relationships from a freehand scene-level sketch, as shown in Figure 1. This problem is extremely challenging for several reasons. Freehand sketches are characterized by various levels of abstractness: the same common object may have a thousand different appearances from a thousand users, depending on their depictive abilities, which makes it difficult for existing techniques to model the mapping from a freehand scene sketch to realistic natural images that precisely meet the users' intention. More importantly, freehand scene sketches are often incomplete and contain both a foreground and a background. For example, users often prefer to sketch the foreground objects, which they care most about, with specific detailed appearances, and they would like the result to exactly satisfy this requirement, while they leave blank space and draw the background objects only roughly, without paying attention to their details. The algorithm therefore needs to be capable of coping with these different requirements.

[Figure 1 (panels, left to right): BigGAN [4], StackGAN [37], Sg2im [23], Layout2im [38], ContextualGAN [26], Ours, Pix2pix [21], Ashual et al. [2]. The proposed approach allows users to controllably generate realistic scene-level images with many objects from freehand sketches, which is in stark contrast to unconditional GANs and conditional GANs in that we use the scene sketch as context (a weak constraint) instead of generating from noise [4] or with harder conditions like semantic maps [2, 28] or edge maps [21]. The constraints of the input become stronger from left to right.]

To make this challenging problem tractable, we decompose it into two sequential stages, foreground and background generation, based on the characteristics of scene-level sketching. The first stage focuses on foreground generation, where the generated image content is supposed to exactly meet the user's specific requirement. The second stage is responsible for background generation, where the generated image content may be only loosely aligned with the sketches. Since the appearance of each object in the foreground has been specified by the user, it is possible to generate realistic and reasonable image content from the individual foreground objects separately. Moreover, the generated foreground provides additional constraints on the background generation, which makes background generation easier, i.e., progressive scene generation reduces the complexity of the problem.

To address the data variance problem caused by the abstractness of sketches, we propose a new neural network architecture called EdgeGAN. It learns a joint embedding that transforms images and the corresponding various-style edge maps into a shared latent space in which vectors can represent high-level attribute information (i.e., object pose and appearance information) from cross-domain data. With the bridge of the attribute vectors in the shared latent space, we are able to transform the problem of image generation from freehand sketches into the problem of image generation from edge maps, without the need to collect foreground freehand sketches as training data, and we can address the challenge of modeling one-to-many correspondences between an image and infinitely many freehand sketches.

To evaluate our approach, we build a large-scale composite dataset called SketchyCOCO based on MS COCO Stuff [5]. The current version of this dataset includes 14K+ pairwise examples of scene-level images and sketches; 20K+ triplet examples of foreground sketches, images, and edge maps covering 14 classes; 27K+ pairwise examples of background sketches and images covering 3 classes; and the segmentation ground truth of 14K+ scene sketches. We compare the proposed EdgeGAN to existing sketch-to-image approaches. Both qualitative and quantitative results show that the proposed EdgeGAN achieves significantly superior performance.

We summarize our contributions as follows:

• We propose the first deep neural network based framework for image generation from scene-level freehand sketches.

• We contribute a novel generative model called EdgeGAN for object-level image generation from freehand sketches. This model can be trained in an end-to-end manner and does not require sketch-image pairwise ground truth for training.

• We construct a large-scale composite dataset called SketchyCOCO based on MS COCO Stuff [5]. This dataset will greatly facilitate related research.

2. Related Work

Sketch-Based Image Synthesis. Early sketch-based image synthesis approaches are based on image retrieval. Sketch2Photo [7] and PhotoSketcher [15] synthesize realistic images by compositing objects and backgrounds retrieved from a given sketch. PoseShop [8] composites images of people by letting users input an additional 2D skeleton into the query so that the retrieval is more precise. Recently, SketchyGAN [9] and ContextualGAN [26] have demonstrated the value of GAN variants for image generation from freehand sketches. Different from SketchyGAN [9] and ContextualGAN [26], which mainly solve the problem of image generation from object-level sketches depicting single objects, our approach focuses on generating images from scene-level sketches.

Conditional Image Generation. Several recent studies have demonstrated the potential of GAN variants for scene-level complex image generation from text [37, 20], scene graphs [23], and semantic layout maps [20, 38]. Most of these methods use a multi-stage coarse-to-fine strategy to infer the image appearances of all semantic layouts in the input or intermediate results at the same time. We instead take another route and use a divide-and-conquer strategy to sequentially generate the foreground and background appearances of the image, because of the unique characteristics of freehand scene sketches, where foreground and background are obviously different.
[Figure 2: Workflow of the proposed framework (scene sketch → segmentation results → generated foreground → output image).]

On object-level image generation, our EdgeGAN is in stark contrast to unconditional GANs and conditional GANs in that we use a sketch as context (a weak constraint) instead of generating from noise like DCGAN [29], Wasserstein GANs [1], WGAN-GP [17], and their variants, or with a hard condition such as an edge map [10, 11, 24, 21] or semantic map [2, 21, 6, 34, 27], while providing more precise control than methods using text [37, 20], layouts [2, 20, 38], or scene graphs [2, 23] as context.

3. Method

Our approach mainly includes two sequential modules: foreground generation and background generation. As illustrated in Fig. 2, given a scene sketch, the object instances are first located and recognized by leveraging the sketch segmentation method in [40]. After that, image content is generated for each foreground object instance (i.e., each sketch instance belonging to a foreground category) individually, in a random order, by the foreground generation module. By taking the background sketches and the generated foreground image as input, the final image is obtained by generating the background image in a single pass. The two modules are trained separately. We next describe the details of each module.

3.1. Foreground Generation

Overall Architecture of EdgeGAN. Directly modeling the mapping between a single image and its corresponding sketches, as SketchyGAN [9] does, is difficult because of the enormous size of the mapping space. We therefore address the challenge in another, feasible way: we learn a common representation for an object expressed by cross-domain data. To this end, we design an adversarial architecture for EdgeGAN, shown in Fig. 3(a). Rather than directly inferring images from sketches, EdgeGAN transfers the problem of sketch-to-image generation into the problem of generating the image from an attribute vector that encodes the expression intent of the freehand sketch. At the training stage, EdgeGAN learns a common attribute vector for an object image and its edge maps by feeding the adversarial networks with images and their various-drawing-style edge maps. At the inference stage (Fig. 3(b)), EdgeGAN captures the user's expression intent with an attribute vector and then generates the desired image from it.

Structure of EdgeGAN. As shown in Fig. 3(a), the proposed EdgeGAN has two channels: one including generator GE and discriminator DE for edge map generation, the other including generator GI and discriminator DI for image generation. Both GI and GE take the same noise vector, together with a one-hot vector indicating a specific category, as input. The discriminators DI and DE attempt to distinguish the generated images or edge maps from the real distribution. Another discriminator, DJ, encourages the generated fake image and the generated edge map to depict the same object by judging whether the fake image matches the fake edge map; it takes the outputs of both GI and GE as input (the image and edge map are concatenated along the width dimension). The Edge Encoder is used to encourage the encoded attribute information of edge maps to be close to the noise vector fed to GI and GE, through an L1 loss. The classifier is used to infer the category label of the output of GI, which encourages the generated fake image to be recognized as the desired category via a focal loss [25]. The detailed structure of each module of EdgeGAN is illustrated in Fig. 3(c).

We implement the Edge Encoder with the same encoder module as in BicycleGAN [39], since they play a similar role functionally: our encoder encodes the "content" (e.g., the pose and shape information), while the encoder in BicycleGAN encodes properties into latent vectors. For the classifier, we use an architecture similar to the discriminator of SketchyGAN, while ignoring the adversarial loss and only using the focal loss [25] as the classification loss. The architectures of all generators and discriminators are based on WGAN-GP [17]. The objective function and more training details can be found in the supplementary materials.

3.2. Background Generation

Once all of the foreground instances have been synthesized, we train pix2pix [21] to generate the background. The major challenge of the background generation task is that the background of most scene sketches contains both background instances and blank areas (as shown in Fig. 2), which means that some areas belonging to the background are left uncertain due to the lack of sketch constraints. By leveraging pix2pix and using the generated foreground instances as constraints, we allow the network to generate a reasonable background that matches the synthesized foreground instances. Taking Fig. 2 as an example, the region below the zebras of the input image contains no background sketches as constraints, and the output image shows that such a region can be reasonably filled in with grass and ground.
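The losses described above combine adversarial, consistency, and classification terms. The following is a minimal, illustrative PyTorch-style sketch of one EdgeGAN generator update, not the authors' implementation: the module interfaces (G_I, G_E, D_I, D_E, D_J, the edge encoder E, and the classifier C), the loss weights, and the reduction of the WGAN-GP critic losses of [17] to a simplified adversarial term are all assumptions made for clarity.

```python
# Illustrative sketch of one EdgeGAN generator step (simplified; module
# interfaces and loss weights are assumed, and the WGAN-GP gradient penalty
# used in the paper is omitted here).
import torch
import torch.nn.functional as F

def edgegan_generator_loss(G_I, G_E, D_I, D_E, D_J, E, C,
                           labels, num_classes, noise_dim=128,
                           lambda_l1=1.0, lambda_cls=1.0):
    batch = labels.size(0)
    noise = torch.randn(batch, noise_dim)                 # shared noise vector
    one_hot = F.one_hot(labels, num_classes).float()      # category condition
    z = torch.cat([one_hot, noise], dim=1)                # {one-hot + noise}

    fake_img = G_I(z)                                     # image branch
    fake_edge = G_E(z)                                    # edge-map branch
    joint = torch.cat([fake_img, fake_edge], dim=3)       # concat along width

    # Simplified adversarial term (stand-in for the WGAN-GP losses).
    adv = -(D_I(fake_img).mean() + D_E(fake_edge).mean() + D_J(joint).mean())

    # Edge encoder pulls the edge map's attribute vector back to the input noise.
    l1 = F.l1_loss(E(fake_edge), noise)

    # Classifier pushes the fake image toward the desired category (focal loss).
    ce = F.cross_entropy(C(fake_img), labels, reduction="none")
    pt = torch.exp(-ce)
    focal = ((1.0 - pt) ** 2.0 * ce).mean()

    return adv + lambda_l1 * l1 + lambda_cls * focal
```

In this sketch the generators, discriminators, encoder, and classifier are passed in as callables, so the snippet only fixes how the individual terms are combined; the actual network architectures follow the paper's references.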

[Figure 3: Structure of the proposed EdgeGAN; panels: (a) training stage, (b) inference stage, (c) network structure. It contains four sub-networks: two generators GI and GE; three discriminators DI, DE, and DJ; an edge encoder E; and an image classifier C. EdgeGAN learns a joint embedding for an image and various-style edge maps depicting this image into a shared latent space where vectors can encode high-level attribute information from cross-modality data.]

[Figure 4: Representative sketch-image pairwise examples from the 14 foreground and 3 background categories in SketchyCOCO. The data size of each category, split into training/test, is shown at the top: 1819/232, 1690/189, 997/104, 3258/7, 2125/80, 1297/132, 726/111, 1067/156, 249/15, 683/55, 1145/21, 1848/168, 892/32, 481/27 (foreground), and 7230/1831, 7825/1910, 7116/1741 (background).]
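Before turning to the dataset, the two-stage workflow summarized in Figure 2 can be made concrete with a short, illustrative sketch of the inference path. The function names and signatures below (segment_fn, edgegan_fn, pix2pix_fn, and the instance dictionary fields) are hypothetical placeholders for the components of Section 3, not the released implementation.

```python
# Hypothetical end-to-end inference sketch for the two-stage pipeline of Section 3.
# segment_fn, edgegan_fn and pix2pix_fn stand in for the sketch segmentation [40],
# the trained EdgeGAN, and the trained pix2pix background model, respectively.
import numpy as np

def generate_scene(scene_sketch, segment_fn, edgegan_fn, pix2pix_fn,
                   foreground_classes):
    """scene_sketch: HxW sketch array; returns an HxWx3 image."""
    h, w = scene_sketch.shape
    canvas = np.zeros((h, w, 3), dtype=np.uint8)       # foreground composite
    background_sketch = scene_sketch.copy()

    # Stage 1: foreground generation, one instance at a time (order is arbitrary).
    for inst in segment_fn(scene_sketch):              # inst: {'mask', 'bbox', 'label'}
        if inst["label"] not in foreground_classes:
            continue
        x0, y0, x1, y1 = inst["bbox"]
        crop = scene_sketch[y0:y1, x0:x1]              # object-level sketch
        obj_img = edgegan_fn(crop, inst["label"])      # assumed to return a bbox-sized patch
        canvas[y0:y1, x0:x1] = obj_img
        background_sketch[inst["mask"] > 0] = 0        # remove foreground strokes

    # Stage 2: background generation in a single pass, conditioned on the
    # remaining background sketch plus the generated foreground.
    return pix2pix_fn(background_sketch, canvas)
```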

4. SketchyCOCO Dataset

We initialize the construction by collecting instance freehand sketches covering 3 background classes and 14 foreground classes from the Sketchy dataset [31], the TU-Berlin dataset [12], and the QuickDraw dataset [18] (around 700 sketches for each foreground class). For each class, we split these sketches into two parts: 80% for the training set and the remaining 20% for the test set. We collect 14,081 natural images from COCO Stuff [5] containing at least one object of the 17 categories and split them into two sets, 80% for training and the remaining 20% for test. Using the segmentation masks of these natural images, we place background instance sketches (cloud, grass, and tree sketches) at random positions within the corresponding background regions of these images. This step produces 27,683 (22,171 + 5,512) pairs of background sketch-image examples (shown in Fig. 4).

After that, for each foreground object in a natural image, we retrieve the most similar sketch with the same class label as the corresponding foreground object in the image. This step employs the sketch-image embedding method proposed with the Sketchy database [31]. In addition, in order to obtain more data for training the object generation model, we collect foreground objects from the full COCO Stuff dataset. With this step and manual selection, we obtain 20,198 (18,869 + 1,329) triplet examples of foreground sketches, images, and edge maps. Since all the background objects and foreground objects of natural images from COCO Stuff have category and layout information, we also obtain the layout (e.g., bounding boxes of objects) and segmentation information for the synthesized scene sketches. After the construction of both background and foreground sketches, we naturally obtain five-tuple ground truth data (Fig. 5). Note that in the above steps, scene sketches in the training and test sets can only be made up of instance sketches from the training and test sets, respectively.

5. Experiments

5.1. Object-level Image Generation

Baselines. We compare EdgeGAN with the general image-to-image model pix2pix [21] and two existing sketch-to-image models, ContextualGAN [26] and SketchyGAN [9], on the collected 20,198 triplet {foreground sketch, foreground image, foreground edge map} examples. Unlike SketchyGAN and pix2pix, which may use both edge maps and freehand sketches as training data, EdgeGAN and ContextualGAN take as input only edge maps and do not use any freehand sketches for training. For a fair and thorough evaluation, we set up several different training modes for SketchyGAN, pix2pix, and ContextualGAN. We next introduce these modes for each model.

• EdgeGAN: we train a single model using foreground images and only the extracted edge maps for all 14 foreground object categories.

• ContextualGAN [26]: we use foreground images and their edge maps to separately train a model for each foreground object category, since the original method cannot use a single model to learn the sketch-to-image correspondence for multiple categories.
• SketchyGAN [9]: we train the original SketchyGAN in two modes. The first mode, denoted SketchyGAN-E, uses foreground images and only their edge maps for training. Since SketchyGAN may use both edge maps and freehand sketches as training data in its experiments, we also train SketchyGAN in another mode, using foreground images and {their edge maps + sketches} for training. In this mode, called SketchyGAN-E&S, we follow the same training strategy as SketchyGAN, feeding edge maps to the model first and then fine-tuning it with sketches.

• pix2pix [21]: we train the original pix2pix architecture in four modes. The first two modes are denoted pix2pix-E-SEP and pix2pix-S-SEP, in which we separately train 14 models using only edge maps or only sketches, respectively, from the 14 foreground categories. The other two modes are denoted pix2pix-E-MIX and pix2pix-S-MIX, in which we train a single model using only edge maps or only sketches, respectively, from all 14 categories.

[Figure 5: Illustration of the five-tuple ground truth data of SketchyCOCO, i.e., (a) {foreground image, foreground sketch, foreground edge maps} (training: 18,869, test: 1,329), (b) {background image, background sketch} (training: 11,265, test: 2,816), (c) {scene image, foreground image & background sketch} (training: 11,265, test: 2,816), (d) {scene image, scene sketch} (training: 11,265, test: 2,816), and (e) sketch segmentation (training: 11,265, test: 2,816).]

Qualitative results. We show representative results of the four comparison methods in Fig. 6. In general, EdgeGAN provides much more realistic results than ContextualGAN. In terms of faithfulness (i.e., whether the input sketches depict the generated images), EdgeGAN is also superior to ContextualGAN. This can be explained by the fact that EdgeGAN uses the learned attribute vector, which captures reliable high-level attribute information from the cross-domain data, to supervise image generation. In contrast, ContextualGAN uses a low-level sketch-edge similarity metric for the supervision of image generation, which is sensitive to the abstractness level of the input sketch.

Compared to EdgeGAN, which produces realistic images, pix2pix and SketchyGAN merely colorize the input sketches and do not change their original shapes when the two models are trained with only edge maps (e.g., see Fig. 6 (b1), (c1), and (c2)). This may be because the outputs of both SketchyGAN and pix2pix are strongly constrained by the input (i.e., the one-to-one correspondence provided by the training data). When the input is a freehand sketch from another domain, these two models struggle to produce realistic results, since they only see edge maps during training. In contrast, the output of EdgeGAN is only weakly constrained by the input sketch, since its generator takes as input the attribute vector learned from cross-domain data rather than the input sketch itself. Therefore, EdgeGAN can achieve better results than pix2pix and SketchyGAN because it is relatively insensitive to cross-domain input data.

By augmenting or replacing the training data with freehand sketches, both SketchyGAN and pix2pix can produce realistic local patches for some categories but fail to preserve the global shape information, as the distorted shapes of the results in Fig. 6 (b2), (c3), and (c4) show.

[Figure 6: From left to right: input sketches, results from EdgeGAN (Ours), ContextualGAN (a), the two training modes of SketchyGAN, i.e., SketchyGAN-E (b1) and SketchyGAN-E&S (b2), and the four training modes of pix2pix, i.e., pix2pix-E-SEP (c1), pix2pix-E-MIX (c2), pix2pix-S-MIX (c3), and pix2pix-S-SEP (c4).]

Quantitative results. We carry out both realism and faithfulness evaluations for the quantitative comparison. We use FID [19] and Accuracy [2] as the realism metrics; a lower FID value and a higher accuracy value indicate better image realism. It is worth mentioning that the Inception Score [30] is not suitable for our task, as several recent studies including [3] find that it is basically only reliable for models trained on ImageNet. We measure the faithfulness of a generated image by computing the similarity between the edge map of the generated image and the corresponding input sketch. Specifically, we use Shape Similarity (SS), the L2 Gabor-feature [14] distance between the input sketch and the edge map extracted from the generated image by the Canny edge detector (a lower value indicates higher faithfulness).
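A simplified, illustrative computation of such a shape-similarity score is sketched below. It uses a small Gabor filter bank as a stand-in for the full descriptor of [14], so the filter parameters, pooling, and Canny thresholds here are assumptions rather than the paper's protocol.

```python
# Simplified shape-similarity (SS) sketch: L2 distance between Gabor-filter
# responses of the input sketch and the Canny edge map of the generated image.
# Filter-bank parameters are illustrative, not those of [14].
import cv2
import numpy as np

def gabor_features(line_map, n_orientations=4, ksize=31):
    m = line_map.astype(np.float32)
    m = m / (m.max() + 1e-8)                       # normalize strokes to [0, 1]
    feats = []
    for i in range(n_orientations):
        theta = np.pi * i / n_orientations
        # getGaborKernel(ksize, sigma, theta, lambd, gamma)
        kernel = cv2.getGaborKernel((ksize, ksize), 4.0, theta, 10.0, 0.5)
        response = cv2.filter2D(m, -1, kernel)
        # Coarse spatial pooling keeps the descriptor small and jitter-tolerant.
        pooled = cv2.resize(np.abs(response), (8, 8), interpolation=cv2.INTER_AREA)
        feats.append(pooled.ravel())
    return np.concatenate(feats)

def shape_similarity(input_sketch_gray, generated_image_bgr):
    """Lower is better: the generated image's edges should match the sketch."""
    gray = cv2.cvtColor(generated_image_bgr, cv2.COLOR_BGR2GRAY)
    edge_map = cv2.Canny(gray, 100, 200)
    sketch = cv2.resize(input_sketch_gray, edge_map.shape[::-1])
    return float(np.linalg.norm(gabor_features(sketch) - gabor_features(edge_map)))
```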
Table 1: The results of the quantitative experiments and the human evaluation (Realism and Faithfulness columns).

Model (object)   | FID   | Acc.  | SS (e+04) | Realism | Faithfulness
Ours             | 87.6  | 0.887 | 2.294     | 0.637   | 0.576
ContextualGAN    | 225.2 | 0.377 | 2.660     | 0.038   | 0.273
SketchyGAN-E     | 141.5 | 0.277 | 1.996     | 0.093   | 0.945
SketchyGAN-E&S   | 137.9 | 0.127 | 2.315     | 0.023   | 0.691
pix2pix-E-SEP    | 143.1 | 0.613 | 2.136     | 0.071   | 0.918
pix2pix-E-MIX    | 128.8 | 0.499 | 2.103     | 0.058   | 0.889
pix2pix-S-MIX    | 163.3 | 0.223 | 2.569     | 0.047   | 0.353
pix2pix-S-SEP    | 196.0 | 0.458 | 2.527     | 0.033   | 0.310

Model (scene)                  | FID   | SSIM  | FID (local) | Realism | Faithfulness
Ashual et al. [2]-layout       | 123.1 | 0.304 | 183.6       | 0.083   | 1.874
Ashual et al. [2]-scene graph  | 167.7 | 0.280 | 181.9       | 0.118   | 1.570
GauGAN-semantic map            | 80.3  | 0.306 | 123.0       | 0.208   | 2.894
GauGAN-semantic sketch         | 215.1 | 0.285 | 239.5       | 0.000   | 1.210
Ours                           | 164.8 | 0.288 | 112.0       | 0.591   | 2.168

The quantitative results are summarized in Table 1, where we can see that the proposed EdgeGAN achieves the best results in terms of the realism metrics. However, in terms of the faithfulness metric, our method is better than most of the competitors but not as good as pix2pix-E-SEP, pix2pix-E-MIX, and SketchyGAN-E. This is because the results generated by these methods look more like a colorization of the input sketches, whose shapes are almost the same as the input sketch (see Fig. 6 (b1), (c1), (c2)), rather than being realistic. The quantitative results basically confirm our observations in the qualitative study.

5.2. Scene-level Image Generation

Baselines. There is no existing approach specifically designed for image generation from scene-level freehand sketches. SketchyGAN was originally proposed for object-level image generation from freehand sketches; in principle, it can also be applied to scene-level freehand sketches. pix2pix [21] is a popular general image-to-image model intended to apply to all image translation tasks. We therefore use SketchyGAN [9] and pix2pix [21] as the baseline methods.

Since we have 14,081 pairs of {scene sketch, scene image} examples, it is natural to directly train the pix2pix and SketchyGAN models to learn the mapping from sketches to images. We conducted these experiments at lower resolutions, e.g., 128×128. We found that the training of either pix2pix or SketchyGAN was prone to mode collapse, often after 60 epochs (80 epochs for SketchyGAN), even when all 14,081 pairs of {scene sketch, scene image} examples from the SketchyCOCO dataset were used. The reason may be that the data variety is too large to be modeled; even 14K pairs are insufficient for successful training. However, with 80% of the 14,081 pairs of {foreground image & background sketch, scene image} examples, we can still use the same pix2pix model for background generation without any mode collapse. This may be because the pix2pix model in this case avoids the challenging mapping between the foreground sketches and the corresponding foreground image content. More importantly, the training converges quickly because the foreground image provides sufficient prior information and constraints for background generation.

Comparison with other systems. We also compare our approach with advanced approaches that generate images using constraints from other modalities.

• GauGAN [28]: The original GauGAN model takes semantic maps as input. We found that the GauGAN model can also be used to generate images from semantic sketches, where the edges of the sketches carry category labels, as shown in the 7th column of Fig. 7. In our experiments, we test the public model pre-trained on the COCO Stuff dataset. In addition, we trained a model on our collected SketchyCOCO dataset, taking the semantic sketches as input. The results are shown in Fig. 7, columns 6 and 8.

• Ashual et al. [2]: the approach proposed by Ashual et al. can use either layouts or scene graphs as input. We therefore compared both modes with their pre-trained model. To ensure fairness, we test only the categories included in the SketchyCOCO dataset and set the minimal-object-number parameter to 1. The results are shown in Fig. 7, columns 2 and 4.

Qualitative results. From Fig. 7, we can see that the images generated from freehand sketches are much more realistic than those generated from scene graphs or layouts by Ashual et al. [2], especially in the foreground object regions. This is because freehand sketches provide a harder constraint than scene graphs or layouts (they provide more information, including pose and shape). Compared to GauGAN with semantic sketches as input, our approach generally produces more realistic images. Moreover, compared to the GauGAN model trained using semantic maps, our approach also achieves better results, evidence of which can be found in the generated foreground object regions (the cows and elephants generated by GauGAN have blurred or unreasonable textures).

In general, our approach produces much better results in terms of the overall visual quality and the realism of the foreground objects than both GauGAN and Ashual et al.'s method. The overall visual quality of the whole image is also comparable to these state-of-the-art systems.
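Relating to the background-generation setup discussed above (Section 3.2 and the {foreground image & background sketch, scene image} pairs), one plausible way to assemble the conditional input for a pix2pix-style background model is sketched below. The single-canvas composition and the channel layout are assumptions made for illustration; the authors' exact data format is not specified here.

```python
# Hypothetical assembly of the conditional input for the background model:
# generated foreground pixels in the RGB channels, remaining background strokes
# in a fourth channel. This layout is illustrative, not the released format.
import numpy as np

def build_background_condition(background_sketch, foreground_rgb, foreground_mask):
    """background_sketch: HxW strokes in [0,1]; foreground_rgb: HxWx3 uint8;
    foreground_mask: HxW bool marking generated foreground pixels."""
    h, w = background_sketch.shape
    cond = np.zeros((h, w, 4), dtype=np.float32)
    cond[..., :3] = foreground_rgb.astype(np.float32) / 255.0
    cond[..., :3][~foreground_mask] = 0.0                 # keep only generated foreground
    cond[..., 3] = np.where(foreground_mask, 0.0, background_sketch)  # background strokes
    return cond                                           # HxWx4 conditional input
```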
[Figure 7: Scene-level comparison. Columns, left to right: layout input / Ashual et al. result, scene graph input / Ashual et al. result, semantic map input / GauGAN result, semantic sketch input / GauGAN result, sketch input / our result, and the ground truth. Please see the text in Section 5.2 for details.]

Quantitative results. We adopt three metrics to evaluate the faithfulness and realism of the generated scene-level images. Apart from FID, the structural similarity metric (SSIM) [35] is used to quantify how similar the generated images and the ground truth images are; a higher SSIM value means they are closer. The last metric, called FID (local), computes the FID value over the foreground object regions of the generated images. From Table 1 we can see that most comparison results confirm our observations and conclusions in the qualitative study, except for the comparisons with the GauGAN-semantic map model and the Ashual et al. [2]-layout model on some metrics.

There are several reasons why the GauGAN model trained using semantic maps is superior to our model in terms of FID and SSIM. Apart from the inherent advantage of the semantic map data as a tighter constraint, the GauGAN model trained using semantic maps covers all the categories in the COCO Stuff dataset, while our model sees only the 17 categories in the SketchyCOCO dataset. Therefore, the categories and number of instances in the images generated by GauGAN match the ground truth, while our results can contain only a part of them. The Ashual et al. [2]-layout model is superior to ours in terms of FID and SSIM. This may be because the input layout information provides a more explicit spatial constraint than sketches when generating the background. However, our method has a clear advantage on the FID (local) metric, which confirms our observation in the qualitative analysis that our method generates more realistic foreground content. Because our approach takes as input freehand sketches, which may be much more accessible than the semantic maps used by GauGAN, we believe that our approach is still a competitive image-generation tool compared to the GauGAN model.

5.3. Human Evaluation

We carry out a human evaluation study for both object-level and scene-level results. As shown in Table 1, we evaluate the realism and faithfulness of the results from eight object-level and five scene-level comparison models. We select 51 sets of object-level test samples and 37 sets of scene-level test samples, respectively. In the realism evaluation, 30 participants are asked to pick out the resulting image that they think is most "realistic" from the images generated by the comparison models for each test sample. For the faithfulness evaluation, we follow SketchyGAN [9] for the eight object-level comparison models. Specifically, for each sample image, the same 30 participants see six random sketches of the same category, one of which is the actual input/query sketch, and are asked to select the sketch that they think prompted the output image. For the five scene-level comparison models, the 30 participants are asked to rate the similarity between the ground truth image and the resulting images on a scale of 1 to 4, with 4 meaning very satisfied and 1 meaning very dissatisfied. In total, 51 × 8 × 30 = 12,240 and 51 × 30 = 1,530 trials are collected for the object-level faithfulness and realism evaluations, respectively, and 37 × 5 × 30 = 5,550 and 37 × 30 = 1,110 trials are collected for the scene-level faithfulness and realism evaluations, respectively.

The object-level statistics in Table 1 generally confirm the quantitative results on faithfulness. The scene-level evaluation shows that our method has the best score on realism, which is not consistent with the quantitative results measured by FID. This may be because the participants care more about the visual quality of foreground objects than about that of background regions. In terms of scene-level faithfulness, GauGAN is superior to our method because the input semantic map, generated from the ground truth image, provides more accurate constraints.
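As a complement to the scene-level numbers above, the SSIM and FID (local) protocol of Section 5.2 can be approximated with off-the-shelf tools. The snippet below is a hedged sketch, not the authors' evaluation code: the crop size, box format, and resizing choices are assumptions, and FID itself would be computed by feeding the collected crops to any standard FID implementation.

```python
# Illustrative sketch of the scene-level metrics: SSIM between generated and
# ground-truth images, plus foreground crops that a standard FID implementation
# could consume for the FID (local) score. Box format and sizes are assumptions.
import cv2
import numpy as np
from skimage.metrics import structural_similarity

def scene_ssim(generated_rgb, ground_truth_rgb):
    # channel_axis=-1 treats the last axis as color channels (skimage >= 0.19).
    return structural_similarity(generated_rgb, ground_truth_rgb, channel_axis=-1)

def foreground_crops(image_rgb, boxes, size=128):
    """Crop and resize each foreground bounding box (x0, y0, x1, y1); the crops
    from generated and real images would then go to an off-the-shelf FID scorer."""
    crops = []
    for x0, y0, x1, y1 in boxes:
        patch = image_rgb[y0:y1, x0:x1]
        crops.append(cv2.resize(patch, (size, size), interpolation=cv2.INTER_AREA))
    return np.stack(crops) if crops else np.empty((0, size, size, 3), np.uint8)
```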
5.4. Ablation Study

We conduct comprehensive experiments to analyze each component of our approach, including: a) whether the encoder E has learned the high-level cross-domain attribute information, b) how the joint discriminator DJ works, c) which GAN model suits our approach the most, and d) whether multi-scale discriminators can improve the results. Due to limited space, in this section we only present our investigation of the most important study, i.e., study a), and put the other studies in the supplementary materials.

We test different styles of drawings, including sketches and edge maps generated by various filters, as input. The results are shown in Fig. 8. We can see that our model works for a large variety of line drawing styles, although some of them are not included in the training dataset. We believe that the attribute vector from the encoder E can extract the high-level attribute information of the line drawings no matter what style they are.

[Figure 8: Results from edges or sketches with different styles. Columns 1 to 4: different freehand sketches. Columns 5 to 9: edges from Canny, FDoG [22], Photocopy (PC), Photo-sketch [13], and XDoG [36].]

6. Discussion and Limitation

Background generation. We study the controllability and robustness of background generation. As shown in Fig. 9 (a) to (c), we progressively add background categories to the blank background. As a result, the output images change reasonably according to the newly added background sketches, which indicates that these sketches do control the generation of different regions of the image. It can also be seen that although there is a large unconstrained blank area in the background, the output image is still reasonable. We further study our approach's capability of producing diverse results. As shown in Fig. 9 (c) to (e), we change the location and size of the foreground object in the scene sketch while keeping the background unchanged. As a result, there are significant changes in the background generation. Since the foreground is taken as a constraint for background training, the foreground and background blend well; the approach even generates a shadow under the giraffe.

[Figure 9: From top to bottom: input sketches (a)–(e) and the images generated by our approach.]

Dataset Bias. In the current version of SketchyCOCO, all the foreground images for object-level training are collected from the COCO Stuff dataset. We discard only the foreground objects with major parts occluded in COCO Stuff during the data collection phase. To measure the view diversity of the foreground objects, we randomly sample 50 examples from each class in the training data and quantify the views into eight ranges according to the view angles on the x-y plane. The result is shown in Fig. 10. As we can see, there are some dominant view angles, such as the side views. We are considering augmenting SketchyCOCO to create a more balanced dataset.

[Figure 10: Statistical results of the view angles of foreground objects in SketchyCOCO: front 15%, left front 13%, left 27%, left back 4%, back 4%, right back 3%, right 24%, right front 10%.]

Sketch Segmentation. We currently employ the instance segmentation algorithm of [40] in the instance segmentation step for the scene sketch. Our experiments find that the adopted segmentation algorithm may fail to segment some objects in scene sketches in which the object-level sketches are too abstract. To address this problem, we are considering tailoring a more effective algorithm for the task of scene sketch segmentation in the future.

7. Conclusion

For the first time, this paper has presented a neural network based framework to tackle the problem of generating scene-level images from freehand sketches. We have built a large-scale composite dataset called SketchyCOCO based on MS COCO Stuff for the evaluation of our solution. Comprehensive experiments demonstrate that the proposed approach can generate realistic and faithful images from a wide range of freehand sketches.

Acknowledgement

We thank all the reviewers for their valuable comments and feedback. We owe our gratitude to Jiajun Wu for his valuable suggestions and fruitful discussions that led to the EdgeGAN model. This work was supported by the Natural Science Foundation of Guangdong Province, China (Grant No. 2019A1515011075), and the National Natural Science Foundation of China (Grant Nos. 61972433, 61921006).
References

[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[2] Oron Ashual and Lior Wolf. Specifying object attributes and relations in interactive scene generation. In Proceedings of the IEEE International Conference on Computer Vision, pages 4561–4569, 2019.
[3] Ali Borji. Pros and cons of GAN evaluation measures. Computer Vision and Image Understanding, 179:41–65, 2019.
[4] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[5] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. COCO-Stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1209–1218, 2018.
[6] Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1511–1520, 2017.
[7] Tao Chen, Ming-Ming Cheng, Ping Tan, Ariel Shamir, and Shi-Min Hu. Sketch2Photo: Internet image montage. ACM Transactions on Graphics (TOG), 28(5):1–10, 2009.
[8] Tao Chen, Ping Tan, Li-Qian Ma, Ming-Ming Cheng, Ariel Shamir, and Shi-Min Hu. PoseShop: Human image database construction and personalized content synthesis. IEEE Transactions on Visualization and Computer Graphics, 19(5):824–837, 2012.
[9] Wengling Chen and James Hays. SketchyGAN: Towards diverse and realistic sketch to image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9416–9425, 2018.
[10] Zezhou Cheng, Qingxiong Yang, and Bin Sheng. Deep colorization. In Proceedings of the IEEE International Conference on Computer Vision, pages 415–423, 2015.
[11] Aditya Deshpande, Jason Rock, and David Forsyth. Learning large-scale automatic image colorization. In Proceedings of the IEEE International Conference on Computer Vision, pages 567–575, 2015.
[12] Mathias Eitz, James Hays, and Marc Alexa. How do humans sketch objects? ACM Transactions on Graphics (TOG), 31(4):1–10, 2012.
[13] Mathias Eitz, Kristian Hildebrand, Tamy Boubekeur, and Marc Alexa. PhotoSketch: A sketch based image query and compositing system. In SIGGRAPH 2009: Talks, pages 1–1, 2009.
[14] Mathias Eitz, Ronald Richter, Tamy Boubekeur, Kristian Hildebrand, and Marc Alexa. Sketch-based shape retrieval. ACM Transactions on Graphics (TOG), 31(4):31, 2012.
[15] Mathias Eitz, Ronald Richter, Kristian Hildebrand, Tamy Boubekeur, and Marc Alexa. Photosketcher: interactive sketch-based image synthesis. IEEE Computer Graphics and Applications, 31(6):56–66, 2011.
[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[17] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
[18] David Ha and Douglas Eck. A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477, 2017.
[19] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
[20] Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee. Inferring semantic layout for hierarchical text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7986–7994, 2018.
[21] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
[22] Chenfanfu Jiang, Yixin Zhu, Siyuan Qi, Siyuan Huang, Jenny Lin, Xiongwen Guo, Lap-Fai Yu, Demetri Terzopoulos, and Song-Chun Zhu. Configurable, photorealistic image rendering and ground truth synthesis by sampling stochastic grammars representing indoor scenes. arXiv preprint arXiv:1704.00112, 2017.
[23] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1219–1228, 2018.
[24] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Learning representations for automatic colorization. In European Conference on Computer Vision, pages 577–593. Springer, 2016.
[25] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
[26] Yongyi Lu, Shangzhe Wu, Yu-Wing Tai, and Chi-Keung Tang. Image generation from sketch constraint using contextual GAN. In Proceedings of the European Conference on Computer Vision, pages 205–220, 2018.
[27] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2337–2346, 2019.
[28] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2337–2346, 2019.
[29] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[30] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
[31] Patsorn Sangkloy, Nathan Burnell, Cusuh Ham, and James Hays. The Sketchy database: learning to retrieve badly drawn bunnies. ACM Transactions on Graphics (TOG), 35(4):1–12, 2016.
[32] Tamar Rott Shaham, Tali Dekel, and Tomer Michaeli. SinGAN: Learning a generative model from a single natural image. In Proceedings of the IEEE International Conference on Computer Vision, pages 4570–4580, 2019.
[33] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision, pages 746–760. Springer, 2012.
[34] Mehmet Ozgur Turkoglu, William Thong, Luuk Spreeuwers, and Berkay Kicanaoglu. A layer-based sequential framework for scene generation with GANs. arXiv preprint arXiv:1902.00671, 2019.
[35] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[36] Holger Winnemöller, Jan Eric Kyprianidis, and Sven C Olsen. XDoG: an extended difference-of-Gaussians compendium including advanced image stylization. Computers & Graphics, 36(6):740–753, 2012.
[37] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5907–5915, 2017.
[38] Bo Zhao, Lili Meng, Weidong Yin, and Leonid Sigal. Image generation from layout. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8584–8593, 2019.
[39] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.
[40] Changqing Zou, Haoran Mo, Chengying Gao, Ruofei Du, and Hongbo Fu. Language-based colorization of scene sketches. ACM Transactions on Graphics (TOG), 38(6):1–16, 2019.
