SketchyGAN: Towards Diverse and Realistic Sketch to Image Synthesis

Wengling Chen
Georgia Institute of Technology
[email protected]

James Hays
Georgia Institute of Technology, Argo AI
[email protected]

Figure 1: A sample of sketch-to-photo synthesis results from our 50 categories. Best viewed in color.

Abstract

Synthesizing realistic images from human-drawn sketches is a challenging problem in computer graphics and vision. Existing approaches either need exact edge maps, or rely on retrieval of existing photographs. In this work, we propose a novel Generative Adversarial Network (GAN) approach that synthesizes plausible images from 50 categories including motorcycles, horses and couches. We demonstrate a data augmentation technique for sketches which is fully automatic, and we show that the augmented data is helpful to our task. We introduce a new network building block suitable for both the generator and discriminator which improves the information flow by injecting the input image at multiple scales. Compared to state-of-the-art image translation methods, our approach generates more realistic images and achieves significantly higher Inception Scores.

1. Introduction

How can we visualize a scene or object quickly? One of the easiest ways is to draw a sketch. Compared to photography, drawing a sketch does not require any capture devices and is not limited to faithfully sampling reality. However, sketches are often simple and imperfect, so it is challenging to synthesize realistic images from novice sketches. Sketch-based image synthesis enables non-artists to create realistic images without significant artistic skill or domain expertise in image synthesis. It is generally hard because sketches are sparse, and novice human artists cannot draw sketches that precisely reflect object boundaries. A real-looking image synthesized from a sketch should respect the intent of the artist as much as possible, but might need to deviate from the coarse strokes in order to stay on the natural image manifold. In the past 30 years, the most popular sketch-based image synthesis techniques have been driven by image retrieval methods such as Photosketcher [14] and Sketch2photo [6]. Such approaches often require carefully designed feature representations which are invariant between sketches and photos. They also involve complicated post-processing procedures like graph cut compositing and gradient domain blending in order to make the synthesized images realistic.

The recent emergence of deep convolutional neural networks [34, 33, 19] has provided enticing methods for image synthesis, among which Generative Adversarial Networks (GANs) [15] have shown great potential. A GAN frames its training as a zero-sum game between the generator and the discriminator. The goal of the discriminator is to decide whether a given image is fake or real, while the generator tries to generate realistic images so the discriminator will misclassify them as real. Sketch-based image synthesis can be formulated as an image translation problem conditioned on an input sketch. There exist several methods that use GANs to translate images from one domain to another [26, 60]. However, none of them is specifically designed for image synthesis from sketches.

In this paper, we propose SketchyGAN, a GAN-based, end-to-end trainable sketch to image synthesis approach that can generate objects from 50 classes. The input is a sketch illustrating an object and the output is a realistic image containing that object in a similar pose. This is challenging because: (i) paired photos and sketches are difficult to acquire, so there is no massive database to learn from; (ii) there is no established neural network method for sketch to image synthesis for diverse categories. Previous works train models for single or few categories [29, 49].

We resolve the first challenge by augmenting the Sketchy database [48], which contains nearly 75,000 actual human sketches paired with photos, with a larger dataset of paired edge maps and photos. This augmentation dataset is obtained by collecting 2,299,144 Flickr images from 50 categories and synthesizing edge maps from them.
During training, we adjust the ratio between edge map-image and sketch-image pairs so that the network can transfer its knowledge gradually from edge-image synthesis to sketch-image synthesis. For the second challenge, we build a GAN-based model, conditioned on an input sketch, with several additional loss terms which improve synthesis quality. We also introduce a new building block called the Masked Residual Unit (MRU) which helps generate higher quality images. This block takes an extra image input and utilizes its internal mask to dynamically decide the information flow of the network. By chaining these blocks we are able to input a pyramid of images at different scales. We show that this structure outperforms naive convolutional approaches and ResNet blocks on our sketch to image synthesis tasks.

Figure 2: Comparison between an edge map and sketches of the same image: (a) photo, (b) edge map, (c) sample sketches of (a). The photo and sketches are from the Sketchy database. Compared to sketches, the edge map contains more background information. The sketches, in contrast, do not precisely reflect actual object boundaries and are not spatially aligned with the object.

Our main contributions are:

• We present SketchyGAN, a deep learning approach to sketch to image synthesis. Unlike previous non-parametric approaches, we do not do image retrieval at test time. Unlike previous deep image translation methods, our network does not learn to directly copy input edges (effectively colorizing instead of converting sketches to photos). Our method is capable of generating plausible objects from 50 diverse categories. Sketch-based image synthesis is very challenging and our results are not generally photorealistic, but we demonstrate an increase in quality compared to existing deep generative models.

• We demonstrate a data augmentation technique for sketch data that addresses the lack of sufficient human-annotated training data.

• We formulate a GAN model with additional objective functions and a new network building block. We show that all of them are beneficial for our task, and lacking any of them will reduce the quality of our results.

2. Related Work

Sketch-Based Image Retrieval and Synthesis. There exist numerous works on sketch-based image retrieval [12, 13, 22, 3, 4, 53, 24, 23, 27, 52, 38, 54, 35]. Most methods use bag-of-words representations and edge detection to build features that are (ideally) invariant across both domains. Common shortcomings include the inability to perform fine-grained retrieval and the inability to map from badly drawn sketch edges to photo boundaries. To address these problems, Yu et al. [58] and Sangkloy et al. [48] train deep convolutional neural networks (CNNs) to relate sketches and photos, treating sketch-based image retrieval as a search in the learned feature embedding space. They show that using CNNs greatly improves performance and that they are able to do fine-grained and instance-level retrieval. Beyond the task of retrieval, Sketch2Photo [6] and PhotoSketcher [14] synthesize realistic images by compositing objects and backgrounds retrieved from a given sketch. PoseShop [7] composites images of people by letting users input an additional 2D skeleton into the query so that the retrieval will be more precise.

Sketch-Based Datasets. There are only a few datasets of human-drawn sketches and they are generally small due to the effort needed to collect drawings. One of the most commonly used sketch datasets is the TU-Berlin dataset [11], which contains 20,000 human sketches spanning 250 categories. Yu et al. [58] introduced a new dataset with paired sketches and images, but there are only two categories – shoes and chairs. There is also the CUHK Face Sketches dataset [55] containing 606 face sketches drawn by artists. The newly published QuickDraw dataset [17] has an impressive 50 million sketches. However, the sketches are particularly crude because of a 10 second time limit; they lack detail and tend to be iconic or canonical views. The Sketchy database [48], in contrast, has more detailed drawings in a greater variety of poses. It spans 125 categories with a total of 75,471 sketches of 12,500 objects. Critically, it is the only substantial dataset of paired sketches and photographs spanning diverse categories, so we choose to use this dataset.

Image-to-Image Translation with GANs. Generative Adversarial Networks (GANs) have shown great potential in generating natural, realistic images [1, 16, 42]. Instead of directly optimizing per-pixel reconstruction error, which often leads to blurry and conservative results, GANs use a discriminator to distinguish unrealistic images from real ones, thus forcing the generator to produce sharper images. The “pix2pix” work of Isola et al. [26] demonstrates a straightforward approach to translate one image to another using conditional GANs. Conditional settings are also adopted in other image translation tasks, including sketch coloring [49], style transformation [57] and domain adaptation [2]. In contrast with using conditional GANs and paired data, Liu et al. [39] introduce an unsupervised image translation framework consisting of CoupledGAN [40] and a pair of variational autoencoders [31]. More recently, CycleGAN [60] shows promising results on unsupervised image translation by enforcing cycle-consistency losses.
Figure 4: Images synthesized from the same input sketch with different noise vectors. The network learned to change a significant portion of the image (the flower), which is not conditioned by the input sketch. In each case, the bee remains plausible.

Figure 3: Pipeline of edge map creation: (a) input, (b) HED, (c) binarization and thinning, (d) small component removal, (e) erosion, (f) spur removal, (g) distance field. Images from intermediate steps show that each step helps remove some artifacts and make the edge maps more sketch-like.

3. Sketchy Database Augmentation

In this section, we discuss how we augment the Sketchy database [48] with Flickr images and synthesize edge maps which we hope approximate human sketches. The dataset is publicly available. Section 3.2 describes image collection, image content filtering, and category selection. Section 3.3 describes our edge map synthesis. Section 3.4 describes the way we use the augmented dataset.

3.1. Edges vs Sketches

Figure 2 visualizes the difference between image edges and sketches. A sketch is a set of human-drawn strokes mimicking the approximate boundary and internal contours of an object, and an edge map is a machine-generated array of pixels that precisely correspond to photo intensity boundaries. Generating photos from sketches is considerably harder than from edges. Unlike edge maps, sketches are not precisely aligned to object boundaries, so a generative model needs to learn spatial transformations to correct deformed strokes. Second, edge maps usually contain more information about backgrounds and details, while sketches do not, so a generative model must insert more information itself. Finally, sketches may contain caricatured or iconic features, like the “tiger” stripes on the cat’s face in Figure 2c, which a model must learn to handle. Despite these considerable differences, edge maps are still a valuable augmentation to the limited Sketchy database.

3.2. Data Collection

Learning the mapping from edges or sketches to photos requires significant training data. We want thousands of images per category. ImageNet only has around 1,000 images per class, and photos in COCO tend to be cluttered and thus not ideal as object sketch exemplars. Ideally we want photographs with one dominant object, as is the case for the Sketchy database photographs. Accordingly, we collect images directly from Flickr through the Flickr API by querying category names as keywords. 100,000 images are gathered for each category, sorted by “relevance”. Two different models are used for filtering out unrelated images. We use an Inception-ResNet-v2 network [50] to filter images from the 38 ImageNet [46] categories that overlap with Sketchy, and a Single Shot MultiBox Detector (SSD) [41] to detect whether an image contains an object in the 18 COCO [37] categories that overlap with Sketchy. For SSD, the bounding box of a detected object must cover more than 5% of the image area or the image is discarded. After filtering, we obtain a dataset with an average of 46,265 images per ImageNet category and 61,365 images per COCO category. For the remainder of the paper, we use 50 out of the 56 available categories after excluding six categories that often have a human as a main object. The excluded classes are harp, violin, umbrella, saxophone, racket, and trumpet.
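As a concrete illustration of the SSD-based filtering rule above, the sketch below checks whether any detected object from a Sketchy-overlapping COCO category covers more than 5% of the image area. It is a minimal sketch under assumed inputs: the paper does not specify the detection format, so the (label, box) tuples and the helper name are hypothetical, and the detector itself is outside the snippet.

```python
def passes_area_filter(detections, image_w, image_h, sketchy_coco_labels, min_frac=0.05):
    """Keep an image only if some detected object from a Sketchy-overlapping
    COCO category covers more than `min_frac` of the image area.
    `detections` is assumed to be a list of (label, (x1, y1, x2, y2)) tuples
    produced by an off-the-shelf detector such as SSD (hypothetical format)."""
    image_area = float(image_w * image_h)
    for label, (x1, y1, x2, y2) in detections:
        if label not in sketchy_coco_labels:
            continue
        box_area = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        if box_area / image_area > min_frac:
            return True
    return False
```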
3.3. Edge Map Creation

We use edge detection and several post-processing steps to obtain sketch-like edge maps. The pipeline is illustrated in Figure 3. The first step is to detect edges with Holistically-nested Edge Detection (HED) [56], as in Isola et al. [26]. After binarizing the output and thinning all edges [59], we clean isolated pixels and remove small connected components. Next we perform erosion with a threshold on all edges, further decreasing the number of edge fragments. Remaining spurs are then removed. Because edges are very sparse, we calculate an unsigned Euclidean distance field for each edge map to obtain a dense representation (see Figure 3g). Similar distance-field representations are used in recent works on 3D shape recovery [51, 18]. We also calculate distance fields for sketches in the Sketchy database.
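A rough sketch of this post-processing chain is shown below, assuming an HED edge-probability map is already available. It uses scikit-image and SciPy as stand-ins; the erosion-with-threshold and spur-removal steps of the full pipeline are omitted for brevity, and the threshold values are illustrative rather than the paper's.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt
from skimage.morphology import remove_small_objects, thin

def hed_to_distance_field(hed_prob, bin_thresh=0.5, min_size=10):
    """Convert an HED edge-probability map (H x W, values in [0, 1]) into a
    dense distance-field representation, loosely following the paper's
    pipeline: binarize, thin, drop small components, then compute an
    unsigned Euclidean distance field. Erosion and spur removal from the
    original pipeline are not reproduced here."""
    edges = hed_prob > bin_thresh                             # binarization
    edges = thin(edges)                                       # thin strokes to 1 px
    edges = remove_small_objects(edges, min_size=min_size)    # drop isolated fragments
    # Distance from every pixel to the nearest remaining edge pixel.
    dist = distance_transform_edt(~edges)
    return dist.astype(np.float32)
```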
3.4. Training Adaptation from Edges to Sketches

Because our final goal is a network that generates images from sketches, it is necessary to train the network on both edge maps and sketches.
To simplify the training process, we use a strategy that gradually shifts the inputs from edge maps to sketches: at the beginning of training, the training data are mostly pairs of images and edge maps. During training, we slowly increase the proportion of sketch-image pairs. Let i_max be the maximum number of training iterations and i_cur be the number of the current iteration; the proportions of sketches and edge maps at the current iteration are then given by:

P_sk = 0.1 + min(0.8, (i_cur / i_max)^λ)   (1)
P_edge = 1 − P_sk   (2)

respectively, where λ is an adjustable hyperparameter indicating how fast the portion of sketches grows. We use λ = 1 in our experiments. It is easy to see that P_sk grows slowly from 0.1 to 0.9. Using this training schedule, we eliminate the need for separate pre-training on edge maps, so the whole training process is unified. We compare this method to training on edge maps first and then fine-tuning on sketches. We find that discrete pre-training and then fine-tuning leads to lower Inception Scores on the test set compared to a gradual ramp from edges to sketches (6.73 vs 7.90).
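The schedule of Equations 1 and 2 can be written down directly; the sketch below is a minimal Python rendering of it. The sampling helper and the pair lists are illustrative additions, not part of the paper.

```python
import random

def sketch_proportion(i_cur, i_max, lam=1.0):
    """Proportion of sketch-image pairs at iteration i_cur (Equations 1-2).
    Starts at 0.1 and ramps up to 0.9; the remainder are edge map-image pairs."""
    p_sk = 0.1 + min(0.8, (i_cur / i_max) ** lam)
    return p_sk, 1.0 - p_sk

def sample_pair(i_cur, i_max, sketch_pairs, edge_pairs):
    """Pick one training pair according to the current schedule (illustrative only)."""
    p_sk, _ = sketch_proportion(i_cur, i_max)
    pool = sketch_pairs if random.random() < p_sk else edge_pairs
    return random.choice(pool)
```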
4. SketchyGAN

In this section we present a Generative Adversarial Network framework that transforms input sketches into images. Our GAN learns a mapping from an input sketch x to an output image y, so that G : x → y. The GAN has two parts, a generator G and a discriminator D. Section 4.1 introduces the Masked Residual Unit (MRU), Section 4.2 illustrates the network structure, and Section 4.3 discusses the objective functions.

Figure 5: Complete structure of our network. Since we are using MRU blocks, both the generator and the discriminator can take multi-scale inputs.

4.1. Masked Residual Unit (MRU)

Figure 6: Structure of a Masked Residual Unit (MRU). It takes in feature maps x_i and an extra image I, then outputs new feature maps y_i.

We introduce a network module which allows a ConvNet to be repeatedly conditioned on an input image. The module uses a learned internal mask to selectively extract new features from the input image to combine with the feature maps computed by the network thus far. We call this module the Masked Residual Unit, or MRU. Figure 6 shows the structure of the MRU. Qualitative and quantitative comparisons to DCGAN [45] and ResNet generative architectures can be found in Section 5.3. An MRU block takes two inputs, input feature maps x_i and an image I, and outputs feature maps y_i. For convenience we only discuss the case in which inputs and outputs have the same spatial dimension. Let [·, ·] denote concatenation, Conv(x) denote convolution on x, and f(x) be an activation function. We want to first merge the information in the input image I into the input feature maps x_i. A naive approach would be to concatenate them along the feature depth dimension and perform convolution:

z_i = f(Conv([x_i, I]))   (3)

However, it is better if the block can decide how much information it wants to preserve upon receiving the new image. So instead we use the following approach:

z_i = f(Conv([m_i ⊙ x_i, I]))   (4)
where

m_i = σ(Conv([x_i, I]))   (5)

is a mask over the input feature maps. Multiple convolutional layers can be stacked here to increase performance. We then want to dynamically combine the information from the newly convolved feature maps and the original input feature maps, so we use another mask

n_i = σ(Conv([x_i, I]))   (6)

to combine the input feature maps with the new feature maps and get the final output:

y_i = (1 − n_i) ⊙ z_i + n_i ⊙ x_i   (7)

The second term in Equation 7 serves as a residual connection. Because there are internal masks to determine information flow, we call this structure a masked residual unit. We can stack multiple of these units and input the same image at different scales repetitively so that the network can retrieve information from the input image dynamically along its computation path.

The MRU formulation is similar to that of the Gated Recurrent Unit (GRU) [8]. However, we are driven by different motivations and there are several crucial differences: 1) We are motivated by repetitively inputting the same image to improve the information flow, while GRU is designed to address vanishing gradients in recurrent neural networks. 2) GRU cells are recurrent, so part of the output is fed back into the same cell, while MRU blocks are cascaded, so the outputs of a previous block are fed into the next block. 3) GRU shares weights for each step so it can only receive fixed-length inputs. No two MRU blocks share weights, so we can shrink or expand the size of the output feature maps like normal convolutional layers.
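Putting Equations 4–7 together, one MRU block can be sketched as follows. This is a minimal PyTorch sketch under assumptions: the kernel sizes, the LeakyReLU activation, the 1×1 convolution used to match channels on the residual path, and the requirement that the conditioning image is already resized to the feature-map resolution are all illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MRU(nn.Module):
    """One Masked Residual Unit (Equations 4-7): the block is conditioned on
    an extra image I and uses two learned masks, m_i to gate the incoming
    feature maps and n_i to blend new and old features. The image is assumed
    to be resized to the same spatial size as the feature maps."""

    def __init__(self, in_ch, img_ch, out_ch):
        super().__init__()
        self.conv_m = nn.Conv2d(in_ch + img_ch, in_ch, 3, padding=1)   # mask m_i (Eq. 5)
        self.conv_n = nn.Conv2d(in_ch + img_ch, out_ch, 3, padding=1)  # mask n_i (Eq. 6)
        self.conv_z = nn.Conv2d(in_ch + img_ch, out_ch, 3, padding=1)  # new features z_i (Eq. 4)
        self.conv_x = nn.Conv2d(in_ch, out_ch, 1)  # 1x1 conv to match channels on the residual path
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x, image):
        xi = torch.cat([x, image], dim=1)
        m = torch.sigmoid(self.conv_m(xi))                            # Eq. 5
        z = self.act(self.conv_z(torch.cat([m * x, image], dim=1)))   # Eq. 4
        n = torch.sigmoid(self.conv_n(xi))                            # Eq. 6
        return (1 - n) * z + n * self.conv_x(x)                       # Eq. 7
```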

Model                     | Inception Score
pix2pix, Sketchy only     | 3.94
pix2pix, Augmented        | 4.53
pix2pix, Augmented+Label  | 5.49
Ours                      | 7.90
Real Image                | 15.46

Table 1: Comparison of our method to baseline methods. We compared to three variants of pix2pix, and our method shows a much higher score on test images.

Figure 7: Images generated by pix2pix variations and our method. The four columns labeled (a) to (d) are: (a) pix2pix on Sketchy, (b) pix2pix on Augmented Sketchy, (c) label-supervised pix2pix on Augmented Sketchy, and (d) our method. Compared to our method, pix2pix results are blurry and noisy, often containing color patches and unwanted artifacts.

4.2. Network Structure

Our complete network structure is shown in Figure 5. The generator uses an encoder-decoder structure. Both the encoder and the decoder are built with MRU blocks, where the sketches are resized and fed into every MRU block on the path. In our best results in Figure 9, we also apply skip-connections between encoder and decoder blocks, so the output feature maps from encoder blocks are concatenated to the outputs of the corresponding decoder blocks. The discriminator is also built with MRU blocks but shrinks in spatial dimension. At the end of the discriminator, we output two logits, one for the GAN loss and one for the classification loss.

4.3. Objective Function

Let x, y be either an image or a sketch, z be a noise vector, and c be a class label. Our GAN objective function can be expressed as

L_GAN(D, G) = E_{y∼P_image}[log D(y)] + E_{x∼P_sketch, z∼P_z}[log(1 − D(G(x, z)))]   (8)

and the objective of the generator, L_GAN(G), will be to minimize the second term.

It is shown that giving the model side information will improve the quality of generated images [43], so we use conditional instance normalization [10] in the generator and pass in the labels of the input sketches.
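Conditional instance normalization as used here can be thought of as a normalization layer whose per-channel scale and shift are looked up by class label. The snippet below is a generic PyTorch rendering of that idea, not the paper's exact layer; the initialization scheme is an assumption.

```python
import torch
import torch.nn as nn

class ConditionalInstanceNorm2d(nn.Module):
    """Instance normalization whose affine parameters are selected by the
    class label, in the spirit of [10]; one (gamma, beta) pair is learned
    per class."""

    def __init__(self, num_features, num_classes):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        self.embed = nn.Embedding(num_classes, num_features * 2)
        self.embed.weight.data[:, :num_features].fill_(1.0)  # initialize gamma to 1
        self.embed.weight.data[:, num_features:].zero_()     # initialize beta to 0

    def forward(self, x, labels):
        gamma, beta = self.embed(labels).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return gamma * self.norm(x) + beta
```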

In addition, we let the discriminator predict class labels for the images it sees. The auxiliary classification loss of the discriminator maximizes the log-likelihood between predicted and ground-truth labels:

L_ac(D) = E[log P(C = c | y)]   (9)

and the generator maximizes the same log-likelihood, L_ac(G) = L_ac(D), with the discriminator fixed.

Since we have paired image data, we are able to provide direct supervision to the network with the L1 distance between generated images and ground truth images:

L_sup(G) = ‖G(x, z) − y‖_1   (10)

However, directly minimizing the L1 loss between the generated image and the ground truth image discourages diversity, so we add a perceptual loss to encourage the network to generate diverse images [9, 28, 5]. We use four intermediate layers from an Inception-V4 [50] to calculate the perceptual loss. Let φ_i be the filter response of a layer in the Inception model. We define the perceptual loss on the generator as:

L_p(G) = λ_p Σ_i ‖φ_i(G(x, z)) − φ_i(y)‖_1   (11)

To further encourage diversity, we concatenate Gaussian noise to the feature maps at the bottleneck of the generator. Previous works reach the conclusion that conditional GANs tend to ignore the noise completely [26] or produce worse results because of noise [44]. A simple diversity loss

L_div(G) = −λ_div ‖G(x, z_1) − G(x, z_2)‖_1   (12)

will improve both the quality and the diversity of generated images. The interpretation is straightforward: given a pair of different noise vectors z_1 and z_2 conditioned on the same image, the generator should output a pair of slightly different images.

Our complete discriminator and generator losses are thus

L(D) = L_GAN(D, G) + L_ac(D)   (13)
L(G) = L_GAN(G) − L_ac(G) + L_sup(G) + L_p(G) + L_div(G)   (14)

where the discriminator maximizes Equation 13 and the generator minimizes Equation 14. In practice, we use the DRAGAN loss [32] in order to stabilize training and use the focal loss [36] as the classification loss.
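The generator side of Equation 14 can be assembled as in the sketch below. This is illustrative only: the signatures of G and D (G takes a sketch, a noise vector, and a label; D returns a real/fake logit and a class logit), the noise dimensionality, and the loss weights are assumptions, and the DRAGAN gradient penalty and focal loss used in practice are replaced here by a plain non-saturating GAN loss and cross-entropy for brevity.

```python
import torch
import torch.nn.functional as F

def generator_loss(G, D, sketch, photo, labels, phi_layers,
                   lam_p=1.0, lam_div=1.0):
    """Illustrative assembly of the generator objective (Equations 10-12, 14).
    `phi_layers` is assumed to be a list of frozen feature extractors
    (e.g. intermediate Inception layers) for the perceptual loss."""
    z1, z2 = torch.randn(2, sketch.size(0), 128, device=sketch.device)
    fake1, fake2 = G(sketch, z1, labels), G(sketch, z2, labels)

    gan_logit, class_logit = D(fake1)
    l_gan = F.binary_cross_entropy_with_logits(gan_logit, torch.ones_like(gan_logit))
    l_ac = F.cross_entropy(class_logit, labels)                          # Eq. 9, generator side
    l_sup = F.l1_loss(fake1, photo)                                      # Eq. 10
    l_p = sum(F.l1_loss(phi(fake1), phi(photo)) for phi in phi_layers)   # Eq. 11
    l_div = -lam_div * F.l1_loss(fake1, fake2)                           # Eq. 12

    return l_gan + l_ac + l_sup + lam_p * l_p + l_div                    # Eq. 14, up to signs/weights
```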
Figure 8: Visual results from DCGAN, CRN, ResNet and MRU. The MRU structure emphasizes the main object more than the other three.

Model                | Num of params       | Inception Score
DCGAN                | G: 35.1M, D: 4.3M   | 4.73
CRN                  | G: 21.4M, D: 22.3M  | 4.56
Improved ResNet      | G: 33.0M, D: 31.2M  | 5.76
MRU (GAN loss only)  | G: 28.1M, D: 29.9M  | 8.31
MRU                  | G: 28.1M, D: 29.9M  | 7.90

Table 2: Comparison of MRU, CRN, ResNet and DCGAN under the same setting. The DCGAN structure is included for completeness. With a similar number of parameters, MRU outperforms the ResNet block significantly on our generative task.

5. Experiments

5.1. Experiment settings

Dataset splitting. We use the sketch-image pairs in the selected 50 categories from the training split of Sketchy as basic training data, and augment them with edge map-image pairs. In the following sections, we call data from the Sketchy database “Sketchy”, and Sketchy augmented with edge maps “Augmented Sketchy”. Since we are only interested in sketch to image synthesis, all models are tested on the test split of Sketchy. All images are resized to 64×64 regardless of the original aspect ratio. Both sketches and edge maps are converted into distance fields.

Implementation Details. In all experiments, we use a batch size of 8, except for Figure 9 which uses a batch size of 32. We use random horizontal flipping during training. We use the Adam optimizer [30], and set the initial learning rate of the generator at 0.0001 and that of the discriminator at 0.0002 [21].

Evaluation Metrics. For our task of image synthesis, we use Inception Scores [47] to measure the quality of synthesized images. The intuition behind the Inception Score is that a good synthesized image should have objects easily recognizable by an off-the-shelf recognition system. Beyond Inception Scores, we also perform a perceptual study evaluating how realistic the generated images are and how faithful they are to the input sketches.
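For reference, the Inception Score can be computed from the class-probability outputs of a pretrained classifier as in the short sketch below; the split count and epsilon are conventional choices rather than values stated in the paper.

```python
import numpy as np

def inception_score(probs, n_splits=10, eps=1e-12):
    """Inception Score from class-probability predictions `probs` (N x C),
    e.g. softmax outputs of a pretrained Inception network on the synthesized
    images: exp(mean KL(p(y|x) || p(y))), averaged over splits."""
    scores = []
    for chunk in np.array_split(probs, n_splits):
        p_y = chunk.mean(axis=0, keepdims=True)  # marginal p(y) within the split
        kl = (chunk * (np.log(chunk + eps) - np.log(p_y + eps))).sum(axis=1)
        scores.append(np.exp(kl.mean()))
    return float(np.mean(scores)), float(np.std(scores))
```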
5.2. Comparison to Baselines

Our comparisons focus on the popular pix2pix and its variations. All models are trained for 300k iterations except for the first model.
We include three baselines:

pix2pix on Sketchy. This is the simplest model. We directly take the authors’ pix2pix code and train it on the 50 categories from Sketchy. Since we find the image quality stops improving after 100k iterations, we stop early at 150k iterations and report the results.

pix2pix on Augmented Sketchy. In this model, we train pix2pix on both the image-edge map and image-sketch pairs, as we do in our method. The network structure and loss functions remain unchanged.

Label-Supervised pix2pix on Augmented Sketchy. In this model, we modify pix2pix to pass class labels into the generator using conditional instance normalization, and also add an auxiliary classification loss to its discriminator. This is a much stronger baseline, since the label information helps the network decide the object type and in turn improves the generated image quality [16, 43].

The comparison of Inception Scores can be found in Table 1 and visual results can be found in Figure 7. Our observations are as follows: (i) pix2pix trained on Sketchy fails, generating unidentifiable color patches. The model is unable to translate from sketches to images. Since pix2pix has been successful with edge-to-image translations, this implies that sketch-to-image synthesis is more difficult. (ii) pix2pix trained on Augmented Sketchy performs slightly better, starting to produce the general shape of the object. This shows that edge maps help the training. (iii) The label-supervised pix2pix on Augmented Sketchy is better than the previous two baselines. It correctly colors the object more often and starts to generate some meaningful backgrounds. The results are still blurry, and many artifacts can be observed. (iv) Compared to the baselines, our method generates sharper images, gets the object color correct, puts more detailed textures on the object, and outputs some meaningful backgrounds. The whole images are also more realistic and colorful.

Input | Full | -GAN | -L-AC | -P   | -DIV
None  | 7.90 | 1.49 | 6.64  | 6.70 | 7.29

Table 5: Inception scores for models with particular components removed. “Full” is the full model described in this work. “-GAN” means no GAN loss and no discriminator. “-L-AC” means no label supervision on the generator and no auxiliary loss on the discriminator. “-P” means no L1 and no perceptual loss, and “-DIV” means no diversity loss.

5.3. Component Analysis

Here we analyze which parts of our model are most important. We decouple our objective function and analyze the influence of each part of it. All models are trained on Augmented Sketchy with the same set of parameters. A detailed comparison can be found in Table 5. We first remove the GAN loss and the discriminator. The result is surprisingly poor, as the images are extremely vague. This observation is consistent with that of Isola et al. [26]. Next we remove the auxiliary loss and substitute conditional instance normalization with batch normalization [25]. This leads to a significant decrease in image quality as well as wrong colors and misplaced textures. This indicates that class information helps a lot, which makes sense because we are generating 50 categories from a single model. We then remove the L1 loss and the perceptual loss. We find they also have a large impact on image quality: from sample images we can see the model uses incorrect colors, and object boundaries are unrealistic or missing. Finally, we remove the diversity loss, and doing so also decreases image quality slightly. This can be related to how we apply this diversity loss, which forces the generator to generate image pairs that are realistic but different. This encourages generalization because the generator needs to find a solution that, when given different noise vectors, only makes changes in unconstrained areas (e.g. the background).

Comparison between MRU and other structures. To demonstrate the effectiveness of our MRU blocks, we compare the performance of MRU, ResNet, Cascaded Refinement Network (CRN) [5] and DCGAN structures on our image synthesis task.
We train several additional models: one uses improved ResNet blocks [20], which is the best variant published [19], in both the generator and the discriminator; one is a weak baseline using the DCGAN structure; one uses CRN in the generator instead of MRU; and one MRU model uses only the GAN loss and the ACGAN loss. We keep the number of parameters of the MRU model and that of the ResNet model roughly the same by reducing feature depth in MRU. Detailed parameter counts can be found in Table 2. Judging from both visual quality and the Inception Scores, the MRU model generates better images than both the ResNet and CRN models, and we show that even using only standard GAN losses, MRU outperforms the other structures significantly. From Figure 8, we notice that the MRU model tends to produce higher quality foreground objects. This can be due to the internal masks of MRU serving as an attention mechanism, causing the network to selectively focus on the main object. In our task this is helpful, since we are mainly interested in generating a specific object from a sketch.

Figure 9: Some of the best output images from our full model. For each input sketch, we show a pair of output images to demonstrate the diversity of our model.

5.4. Human Evaluation of Realism and Faithfulness

We do two human evaluations to measure how our model compares against baselines in terms of realism and faithfulness to the input sketch. In the “faithfulness” test, a participant sees the output of either pix2pix, SketchyGAN or 1-nearest-neighbor retrieval using the representation learned in the Sketchy database [48]. With each image, the participant also sees 9 random sketches of the same category, one of which is the actual input/query sketch. The participant is asked to pick the sketch that prompted the output image. We then count how often participants pick the correct input sketch, so a higher correct selection rate indicates the model produces a more “faithful” output. In the “realism” test, a participant sees the outputs of pix2pix variants and SketchyGAN compared in pairs, alongside the corresponding input sketch. The participant is asked to pick the image that they think is more realistic. For each model we calculate how often participants think it is more realistic. The image retrieval baseline is not evaluated for realism since it only returns existing, realistic photographs. We conducted 696 trials for the “faithfulness” test and 348 trials for the “realism” test. The results show that SketchyGAN is more faithful than the retrieval model, but is less faithful than pix2pix, which often preserves the input edges precisely (Table 3). Meanwhile, SketchyGAN is considered more realistic than the pix2pix variants (Table 4). The results are consistent with our goal that our model should respect the intent of input sketches, but at the same time deviate from the strokes if necessary in order to produce realistic images.

Model                     | Input correctly identified?
Sketchy 1-NN retrieval    | 35.3%
pix2pix, Augmented+Label  | 65.9%
Ours                      | 47.4%

Table 3: Faithfulness test on three models. Models for which participants could pick the input sketch are considered more “faithful”.

Model                     | Picked as more realistic?
pix2pix, Sketchy only     | 6.03%
pix2pix, Augmented        | 18.4%
pix2pix, Augmented+Label  | 21.8%
Ours                      | 53.7%

Table 4: Realism test on four generative models. We report how often results from each model were chosen by participants to be more “realistic” than a competing model.

6. Conclusion

In this work, we presented a novel approach to the sketch-to-image synthesis problem. The problem is challenging given the nature of sketches, and we introduced a deep generative model that is promising for sketch to image synthesis. We introduced a data augmentation technique for sketch-image pairs to encourage research in this direction. The demonstrated GAN framework can synthesize more realistic images than popular generative models, and the generated images are diverse. Currently, the main focus in GAN research is to find better probability metrics as objective functions, but there have been very few works searching for better network structures in GANs. We proposed a new network structure for our generative task, and we showed that it performs better than existing structures.

Limitations. Ideally, we want our results to be both realistic and faithful to the intent of the input sketch. For many sketches, we fail to meet one or both of these goals. Results generally aren’t photorealistic, nor are they of high enough resolution. Sometimes realism is lost by being overly faithful to the sketch – e.g. skinny horse legs that too closely follow the badly drawn input boundaries (Figure 9). In other cases, we do deviate from the user sketch to make the output more realistic (motorcycle and plane in Figure 1; mushroom, church, geyser, and castle in Figure 9) but still respect the pose and position of the object in the input sketch. This is more desirable. Human intent is hard to learn, and SketchyGAN failures that treat the input sketch too literally may be due to the lack of sketch-photo training pairs. Despite the fact that our results are not yet photorealistic, we think they show a substantial improvement over previous methods.

Acknowledgements. This work was funded by NSF award 1561968.
References

[1] D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
[2] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[3] Y. Cao, C. Wang, L. Zhang, and L. Zhang. Edgel index for large-scale sketch-based image search. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 761–768. IEEE, 2011.
[4] Y. Cao, H. Wang, C. Wang, Z. Li, L. Zhang, and L. Zhang. MindFinder: Interactive sketch-based image search on millions of images. In Proceedings of the 18th ACM International Conference on Multimedia, pages 1605–1608. ACM, 2010.
[5] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[6] T. Chen, M.-M. Cheng, P. Tan, A. Shamir, and S.-M. Hu. Sketch2Photo: Internet image montage. ACM Transactions on Graphics (TOG), 28(5):124, 2009.
[7] T. Chen, P. Tan, L.-Q. Ma, M.-M. Cheng, A. Shamir, and S.-M. Hu. PoseShop: Human image database construction and personalized content synthesis. IEEE Transactions on Visualization and Computer Graphics, 19(5):824–837, 2013.
[8] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.
[9] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, pages 658–666. Curran Associates, Inc., 2016.
[10] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. ICLR, 2017.
[11] M. Eitz, J. Hays, and M. Alexa. How do humans sketch objects? ACM Transactions on Graphics (proceedings of SIGGRAPH), 31(4):44:1–44:10, 2012.
[12] M. Eitz, K. Hildebrand, T. Boubekeur, and M. Alexa. An evaluation of descriptors for large-scale image retrieval from sketched feature lines. Computers & Graphics, 34(5):482–498, 2010.
[13] M. Eitz, K. Hildebrand, T. Boubekeur, and M. Alexa. Sketch-based image retrieval: Benchmark and bag-of-features descriptors. IEEE Transactions on Visualization and Computer Graphics, 17(11):1624–1636, 2011.
[14] M. Eitz, R. Richter, K. Hildebrand, T. Boubekeur, and M. Alexa. Photosketcher: Interactive sketch-based image synthesis. IEEE Computer Graphics and Applications, 31(6):56–66, Nov 2011.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[16] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.
[17] D. Ha and D. Eck. A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477, 2017.
[18] X. Han, Z. Li, H. Huang, E. Kalogerakis, and Y. Yu. High-resolution shape completion using deep neural networks for global structure and local geometry inference. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[20] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645, 2016.
[21] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6629–6640, 2017.
[22] R. Hu, M. Barnard, and J. Collomosse. Gradient field descriptor for sketch based retrieval and localization. In Image Processing (ICIP), 2010 17th IEEE International Conference on, pages 1025–1028. IEEE, 2010.
[23] R. Hu and J. Collomosse. A performance evaluation of gradient field HOG descriptor for sketch based image retrieval. Computer Vision and Image Understanding, 117(7):790–806, 2013.
[24] R. Hu, T. Wang, and J. Collomosse. A bag-of-regions approach to sketch-based image retrieval. In Image Processing (ICIP), 2011 18th IEEE International Conference on, pages 3661–3664. IEEE, 2011.
[25] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[26] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[27] S. James, M. J. Fonseca, and J. Collomosse. ReEnact: Sketch based choreographic design from archival dance footage. In Proceedings of the International Conference on Multimedia Retrieval, page 313. ACM, 2014.
[28] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, 2016.
[29] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
[30] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[31] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, Apr. 2014.
[32] N. Kodali, J. Abernethy, J. Hays, and Z. Kira. How to train your DRAGAN. arXiv preprint arXiv:1705.07215, 2017.
[33] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105. Curran Associates, Inc., 2012.
[34] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[35] K. Li, K. Pang, Y. Z. Song, T. Hospedales, H. Zhang, and Y. Hu. Fine-grained sketch-based image retrieval: The role of part-aware attributes. In The IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–9, March 2016.
[36] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal loss for dense object detection. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[37] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[38] Y.-L. Lin, C.-Y. Huang, H.-J. Wang, and W. Hsu. 3D sub-query expansion for improving sketch-based multi-view image retrieval. In The IEEE International Conference on Computer Vision (ICCV), December 2013.
[39] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
[40] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pages 469–477, 2016.
[41] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37, 2016.
[42] A. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski. Plug & play generative networks: Conditional iterative generation of images in latent space. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[43] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2642–2651, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
[44] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[45] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[46] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[47] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
[48] P. Sangkloy, N. Burnell, C. Ham, and J. Hays. The sketchy database: Learning to retrieve badly drawn bunnies. ACM Transactions on Graphics (proceedings of SIGGRAPH), 2016.
[49] P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays. Scribbler: Controlling deep image synthesis with sketch and color. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.
[50] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 2017.
[51] D. Thanh Nguyen, B.-S. Hua, K. Tran, Q.-H. Pham, and S.-K. Yeung. A field model for repairing 3D shapes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[52] D. Turmukhambetov, N. D. Campbell, D. B. Goldman, and J. Kautz. Interactive sketch-driven image synthesis. Comput. Graph. Forum, 34(8):130–142, Dec. 2015.
[53] C. Wang, Z. Li, and L. Zhang. MindFinder: Image search by interactive sketching and tagging. In Proceedings of the 19th International Conference on World Wide Web, pages 1309–1312. ACM, 2010.
[54] F. Wang, L. Kang, and Y. Li. Sketch-based 3D shape retrieval using convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[55] X. Wang and X. Tang. Face photo-sketch synthesis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11):1955–1967, 2009.
[56] S. Xie and Z. Tu. Holistically-nested edge detection. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
[57] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon. Pixel-level domain transfer. In European Conference on Computer Vision, 2016.
[58] Q. Yu, F. Liu, Y.-Z. Song, T. Xiang, T. M. Hospedales, and C.-C. Loy. Sketch me that shoe. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[59] T. Zhang and C. Y. Suen. A fast parallel algorithm for thinning digital patterns. Communications of the ACM, 27(3):236–239, 1984.
[60] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.