SketchyGAN: Towards Diverse and Realistic Sketch to Image Synthesis
Figure 1: A sample of sketch-to-photo synthesis results from our 50 categories. Best viewed in color.
(a) Photo  (b) Edge map  (c) Sample sketches of (a)

Figure 2: Comparison between an edge map and sketches of the same image. The photo and sketches are from the Sketchy Database. Compared to sketches, the edge map contains more background information. The sketches, in contrast, do not precisely reflect actual object boundaries and are not spatially aligned with the object.

categories and synthesizing edge maps from them. During training, we adjust the ratio between edge map-image and sketch-image pairs so that the network can transfer its knowledge gradually from edge-image synthesis to sketch-image synthesis. For the second challenge, we build a GAN-based model, conditioned on an input sketch, with several additional loss terms which improve synthesis quality. We also introduce a new building block called the Masked Residual Unit (MRU), which helps generate higher-quality images. This block takes an extra image input and utilizes its internal mask to dynamically decide the information flow of the network. By chaining these blocks we are able to input a pyramid of images at different scales. We show that this structure outperforms naive convolutional approaches and ResNet blocks on our sketch-to-image synthesis task.

Our main contributions are:

• We present SketchyGAN, a deep learning approach to sketch-to-image synthesis. Unlike previous non-parametric approaches, we do not do image retrieval at test time. Unlike previous deep image translation methods, our network does not learn to directly copy input edges (effectively colorizing instead of converting sketches to photos). Our method is capable of generating plausible objects from 50 diverse categories. Sketch-based image synthesis is very challenging and our results are not generally photorealistic, but we demonstrate an increase in quality compared to existing deep generative models.

• We demonstrate a data augmentation technique for sketch data that addresses the lack of sufficient human-annotated training data.

• We formulate a GAN model with additional objective functions and a new network building block. We show that all of them are beneficial for our task, and lacking any of them will reduce the quality of our results.

2. Related Work

Sketch-Based Image Retrieval and Synthesis. There exist numerous works on sketch-based image retrieval [12, 13, 22, 3, 4, 53, 24, 23, 27, 52, 38, 54, 35]. Most methods use bag-of-words representations and edge detection to build features that are (ideally) invariant across both domains. Common shortcomings include the inability to perform fine-grained retrieval and the inability to map from badly drawn sketch edges to photo boundaries. To address these problems, Yu et al. [58] and Sangkloy et al. [48] train deep convolutional neural networks (CNNs) to relate sketches and photos, treating sketch-based image retrieval as a search in the learned feature embedding space. They show that using CNNs greatly improves performance and that they are able to do fine-grained and instance-level retrieval. Beyond the task of retrieval, Sketch2Photo [6] and PhotoSketcher [14] synthesize realistic images by compositing objects and backgrounds retrieved from a given sketch. PoseShop [7] composites images of people by letting users input an additional 2D skeleton into the query so that the retrieval will be more precise.

Sketch-Based Datasets. There are only a few datasets of human-drawn sketches, and they are generally small due to the effort needed to collect drawings. One of the most commonly used sketch datasets is the TU-Berlin dataset [11], which contains 20,000 human sketches spanning 250 categories. Yu et al. [58] introduced a new dataset with paired sketches and images, but there are only two categories – shoes and chairs. There is also the CUHK Face Sketches [55], containing 606 face sketches drawn by artists. The newly published QuickDraw dataset [17] has an impressive 50 million sketches. However, the sketches are particularly crude because of a 10 second time limit. The sketches lack detail and tend to be iconic or canonical views. The Sketchy database [48], in contrast, has more detailed drawings in a greater variety of poses. It spans 125 categories with a total of 75,471 sketches of 12,500 objects. Critically, it is the only substantial dataset of paired sketches and photographs spanning diverse categories, so we choose to use this dataset.

Image-to-Image Translation with GANs. Generative Adversarial Networks (GANs) have shown great potential in generating natural, realistic images [1, 16, 42]. Instead of directly optimizing per-pixel reconstruction error, which often leads to blurry and conservative results, GANs use a discriminator to distinguish unrealistic images from real ones, thus forcing the generator to produce sharper images. The "pix2pix" work of Isola et al. [26] demonstrates a straightforward approach to translating one image to another using conditional GANs. Conditional settings are also adopted in other image translation tasks, including sketch coloring [49], style transformation [57], and domain adaptation [2]. In contrast with using conditional GANs and paired data, Liu et al. [39] introduce an unsupervised image translation framework consisting of CoupledGAN [40] and a pair of variational autoencoders [31]. More recently, CycleGAN [60] shows promising results on unsupervised image translation.
Figure 4: Images synthesized from the same input sketch with different noise vectors. The network learned to change a significant portion of the image (the flower), which is not conditioned by the input sketch. In each case, the bee remains plausible.

Figure 3 (panels): (a) input, (b) HED, (c) binarization and thinning.

Figure 2 visualizes the difference between image edges and sketches. A sketch is a set of human-drawn strokes mimicking the approximate boundary and internal contours of an object, while an edge map is a machine-generated array of pixels that precisely corresponds to photo intensity boundaries. Generating photos from sketches is considerably harder than from edges. Unlike edge maps, sketches are not precisely aligned to object boundaries, so a generative model needs to learn spatial transformations to correct deformed strokes. Second, edge maps usually contain more information about backgrounds and details, while sketches do not, so a generative model must insert more information itself. Finally, sketches may contain caricatured or iconic features, like the "tiger" stripes on the cat's face in Figure 2c, which a model must learn to handle. Despite these considerable differences, edge maps are still a valuable augmentation to the limited Sketchy database.

3.2. Data Collection

Learning the mapping from edges or sketches to photos requires significant training data. We want thousands of

3.3. Edge Map Creation

We use edge detection and several post-processing steps to obtain sketch-like edge maps. The pipeline is illustrated in Figure 3. The first step is to detect edges with Holistically-nested edge detection (HED) [56], as in Isola et al. [26]. After binarizing the output and thinning all edges [59], we clean isolated pixels and remove small connected components. Next we perform erosion with a threshold on all edges, further decreasing the number of edge fragments. Remaining spurs are then removed. Because edges are very sparse, we calculate an unsigned Euclidean distance field for each edge map to obtain a dense representation (see Figure 3g). Similar distance-field representations are used in recent works on 3D shape recovery [51, 18]. We also calculate distance fields for sketches in the Sketchy database.
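A rough sketch of this post-processing, using off-the-shelf scikit-image and SciPy routines, is shown below. It assumes the HED probability map has already been computed by a pretrained network; the binarization threshold and minimum component size are illustrative values rather than the paper's settings, and the thresholded erosion and spur-removal steps have no single library call and are omitted here.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt
from skimage.morphology import thin, remove_small_objects

def edge_map_to_distance_field(hed_prob, bin_thresh=0.5, min_component=10):
    """Convert an HED edge-probability map into a dense distance-field input.

    hed_prob: 2D float array in [0, 1] produced by a pretrained HED model
    (running HED itself is outside this sketch). Thresholds are illustrative.
    """
    # Binarize the soft edge map, then thin strokes to one-pixel width.
    edges = hed_prob > bin_thresh
    edges = thin(edges)

    # Drop isolated pixels and small connected components of edge fragments.
    edges = remove_small_objects(edges, min_size=min_component)

    # Dense representation: unsigned Euclidean distance to the nearest edge
    # pixel. distance_transform_edt measures distance to the nearest zero,
    # so the edge mask is inverted first.
    dist_field = distance_transform_edt(~edges)
    return edges, dist_field
```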
siderable differences, edge maps are still a valuable aug-
mentation to the limited Sketchy database. 3.4. Training Adaptation from Edges to Sketches
Because our final goal is a network that generates im-
3.2. Data Collection
ages from sketches, it is necessary to train the network on
Learning the mapping between edges or sketches to pho- both edge maps and sketches. To simplify training process,
tos requires significant training data. We want thousands of we use a strategy that gradually shifts the inputs from edge
Figure 5: Complete structure of our network. Since we are
using MRU blocks, both the generator and the discriminator
can take multi-scale inputs.
Model                      Inception Score
pix2pix, Sketchy only      3.94
pix2pix, Augmented         4.53
pix2pix, Augmented+Label   5.49
Ours                       7.90
Real Image                 15.46

Table 1: Comparison of our method to baseline methods. We compare to three variants of pix2pix, and our method shows a much higher score on test images.

Figure 7: Images generated by pix2pix variations and our method. The four columns labeled (a) to (d) are: (a) pix2pix on Sketchy, (b) pix2pix on Augmented Sketchy, (c) label-supervised pix2pix on Augmented Sketchy, and (d) our method. Compared to our method, pix2pix results are blurry and noisy, often containing color patches and unwanted artifacts.

where

m_i = \sigma(\mathrm{Conv}([x_i, I]))    (5)

is a mask over the input feature maps. Multiple convolutional layers can be stacked here to increase performance. We then want to dynamically combine the information from the newly convolved feature maps and the original input feature maps, so we use another mask

n_i = \sigma(\mathrm{Conv}([x_i, I]))    (6)

to combine the input feature maps with the new feature maps to get the final output:

y_i = (1 - n_i) \odot z_i + n_i \odot x_i    (7)

The second term in Equation 7 serves as a residual connection. Because there are internal masks to determine information flow, we call this structure the Masked Residual Unit. We can stack multiple of these units and input the same image at different scales repetitively, so that the network can retrieve information from the input image dynamically along its computation path.
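As a concrete illustration, the following PyTorch-style sketch implements one such unit following Equations 5–7. The equation defining the intermediate feature maps z_i is cut off in the text above, so the sketch assumes z_i comes from a convolution over the masked input concatenated with the image; the kernel sizes, activation, and the 1×1 projection used to match channels in the residual term are likewise illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MRU(nn.Module):
    """Masked Residual Unit sketch following Eqs. (5)-(7).

    Assumes z_i = activation(Conv([m_i * x_i, I])); channel widths, kernel
    sizes, and the activation are illustrative choices.
    """
    def __init__(self, in_ch, out_ch, img_ch=1):
        super().__init__()
        self.mask_m = nn.Conv2d(in_ch + img_ch, in_ch, 3, padding=1)
        self.mask_n = nn.Conv2d(in_ch + img_ch, out_ch, 3, padding=1)
        self.conv_z = nn.Conv2d(in_ch + img_ch, out_ch, 3, padding=1)
        self.skip = nn.Conv2d(in_ch, out_ch, 1)  # channel-matching projection (assumed)

    def forward(self, x, img):
        xi = torch.cat([x, img], dim=1)
        m = torch.sigmoid(self.mask_m(xi))              # Eq. (5): mask over input features
        z = F.leaky_relu(self.conv_z(torch.cat([m * x, img], dim=1)))  # assumed z_i
        n = torch.sigmoid(self.mask_n(xi))              # Eq. (6): combination mask
        return (1.0 - n) * z + n * self.skip(x)         # Eq. (7): masked residual output
```

Chaining such units while resizing the sketch to each block's resolution gives the kind of multi-scale input path used by the generator and discriminator described in the next section:

```python
# Hypothetical two-block chain: the sketch is resized to each feature resolution.
sketch = torch.rand(1, 1, 64, 64)
feats = torch.rand(1, 32, 64, 64)
block1, block2 = MRU(32, 64), MRU(64, 128)
h = block1(feats, sketch)
h = F.avg_pool2d(h, 2)  # downsample between blocks
h = block2(h, F.interpolate(sketch, size=h.shape[-2:]))
```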
computation path. It is shown that giving the model side information will
The MRU formulation is similar to that of the Gated Re- improve the quality of generated images [43], so we use
current Unit (GRU) [8]. However, we are driven by differ- conditional instance normalization [10] in the generator and
ent motivations and there are several crucial differences: 1) pass in labels of input sketches. In addition, we let the dis-
We are motivated by repetitively inputting the same image criminator predict class labels out of the images it sees. The
to improve the information flow. GRU is designed to ad- auxiliary classification loss of discriminator maximize the
L_{ac}(D) = \mathbb{E}[\log P(C = c \mid y)]    (9)

and the generator maximizes the same log-likelihood, L_{ac}(G) = L_{ac}(D), with the discriminator fixed.

Since we have paired image data, we are able to provide direct supervision to the network with the L1 distance between generated images and ground-truth images:

L_{sup}(G) = \|G(x, z) - y\|_1    (10)

However, directly minimizing the L1 loss between the generated image and the ground-truth image discourages diversity, so we add a perceptual loss to encourage the network to generate diverse images [9, 28, 5]. We use four intermediate layers from an Inception-V4 [50] to calculate the perceptual loss. Let \phi_i be the filter response of a layer in the Inception model. We define the perceptual loss on the generator as:

L_p(G) = \lambda_p \sum_i \|\phi_i(G(x, z)) - \phi_i(y)\|_1    (11)

To further encourage diversity, we concatenate Gaussian noise to the feature maps at the bottleneck of the generator. Previous works reach the conclusion that conditional GANs tend to ignore the noise completely [26] or produce worse results because of noise [44]. A simple diversity loss

L_{div}(G) = -\lambda_{div} \|G(x, z_1) - G(x, z_2)\|_1    (12)

will improve both the quality and diversity of generated images. The interpretation is straightforward: with a pair of different noise vectors z_1 and z_2 conditioned on the same image, the generator should output a pair of slightly different images.

Our complete discriminator and generator losses are thus

L(D) = L_{GAN}(D, G) + L_{ac}(D)    (13)

L(G) = L_{GAN}(G) - L_{ac}(G) + L_{sup}(G) + L_p(G) + L_{div}(G)    (14)

where the discriminator maximizes Equation 13 and the generator minimizes Equation 14. In practice, we use the DRAGAN loss [32] in order to stabilize training and use the focal loss [36] as the classification loss.
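To make the interplay of these terms concrete, here is a schematic PyTorch-style sketch of Equations 9–14. It is written under several assumptions: the discriminator is assumed to return a (real/fake logit, class logits) pair, the noise dimension and λ weights are placeholders, the adversarial term uses the common non-saturating binary cross-entropy form rather than the literal minimax of Equation 8, the perceptual features are abstracted as a list of callables, plain cross-entropy stands in for the focal loss, and only a DRAGAN-style gradient penalty is sketched for the stabilization term.

```python
import torch
import torch.nn.functional as F

def generator_loss(G, D, sketch, photo, labels, feat_layers,
                   lambda_p=1.0, lambda_div=1.0):
    """Schematic generator objective, Eq. (14) built from Eqs. (9)-(12)."""
    z1, z2 = torch.randn(2, sketch.size(0), 128).unbind(0)  # noise dim assumed
    fake1, fake2 = G(sketch, z1), G(sketch, z2)

    gan_logit, class_logits = D(fake1)
    # Non-saturating stand-in for the generator's adversarial term in Eq. (8).
    l_gan = F.binary_cross_entropy_with_logits(gan_logit, torch.ones_like(gan_logit))
    # Cross-entropy is a negative log-likelihood, so adding it realizes -L_ac(G).
    l_ac = F.cross_entropy(class_logits, labels)
    l_sup = F.l1_loss(fake1, photo)                                         # Eq. (10)
    l_p = lambda_p * sum(F.l1_loss(f(fake1), f(photo)) for f in feat_layers)  # Eq. (11)
    l_div = -lambda_div * F.l1_loss(fake1, fake2)                           # Eq. (12)
    return l_gan + l_ac + l_sup + l_p + l_div                               # Eq. (14)

def dragan_penalty(D, real, lambda_gp=10.0):
    """DRAGAN-style gradient penalty [32] (sketch); perturbation scale assumed."""
    alpha = torch.rand(real.size(0), 1, 1, 1)
    perturbed = (real + alpha * 0.5 * real.std() * torch.rand_like(real)).requires_grad_(True)
    logit, _ = D(perturbed)
    grads = torch.autograd.grad(logit.sum(), perturbed, create_graph=True)[0]
    return lambda_gp * ((grads.view(grads.size(0), -1).norm(2, dim=1) - 1) ** 2).mean()

def discriminator_loss(D, G, sketch, photo, labels):
    """Schematic discriminator objective: Eq. (13), written as a minimization."""
    z = torch.randn(sketch.size(0), 128)
    real_logit, real_cls = D(photo)
    fake_logit, _ = D(G(sketch, z).detach())
    l_gan = (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
             + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit)))
    l_ac = F.cross_entropy(real_cls, labels)                                # Eq. (9)
    return l_gan + l_ac + dragan_penalty(D, photo)
```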
Figure 8: Visual results from DCGAN, CRN, ResNet and MRU. The MRU structure emphasizes the main object more than the other three.

Model                  Num of params         Inception Score
DCGAN                  G: 35.1M  D: 4.3M     4.73
CRN                    G: 21.4M  D: 22.3M    4.56
Improved ResNet        G: 33.0M  D: 31.2M    5.76
MRU (GAN loss only)    G: 28.1M  D: 29.9M    8.31
MRU                    G: 28.1M  D: 29.9M    7.90

Table 2: Comparison of MRU, CRN, ResNet and DCGAN under the same setting. The DCGAN structure is included for completeness. With a similar number of parameters, MRU outperforms the ResNet block significantly on our generative task.

5. Experiments

5.1. Experiment settings

Dataset splitting. We use the sketch-image pairs in 50 selected categories from the training split of Sketchy as basic training data, and augment them with edge map-image pairs. In the following sections, we call data from the Sketchy Database "Sketchy", and Sketchy augmented with edge maps "Augmented Sketchy". Since we are only interested in sketch-to-image synthesis, all models are tested on the test split of Sketchy. All images are resized to 64×64 regardless of the original aspect ratio. Both sketches and edge maps are converted into distance fields.

Implementation Details. In all experiments, we use a batch size of 8, except for Figure 9, which uses a batch size of 32. We use random horizontal flipping during training. We use the Adam optimizer [30], and set the initial learning rate of the generator at 0.0001 and that of the discriminator at 0.0002 [21].
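In code, this two-timescale setting [21] amounts to nothing more than two Adam optimizers with different learning rates; the β values below are placeholders, since the text does not state them.

```python
import torch

def build_optimizers(generator, discriminator):
    # Two-timescale update rule [21]: the discriminator learns at twice the
    # generator's rate. Beta values are assumed, not taken from the paper.
    g_optim = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.999))
    d_optim = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
    return g_optim, d_optim
```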
Evaluation Metrics. For our task of image synthesis, we use Inception Scores [47] to measure the quality of synthesized images. The intuition behind the Inception Score is that a good synthesized image should contain objects that are easily recognizable by an off-the-shelf recognition system. Beyond Inception Scores, we also perform a perceptual study evaluating how realistic the generated images are and how faithful they are to the input sketches.
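For reference, the Inception Score can be computed as exp(E_x[KL(p(y|x) || p(y))]) from the classifier's softmax outputs; the sketch below assumes those probabilities have already been produced by a pretrained Inception network, which is not shown.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score [47] from an (N, num_classes) array of softmax outputs.

    probs[i] is p(y | x_i) for generated image x_i, as produced by a pretrained
    Inception classifier (running that network is outside this sketch).
    """
    p_y = probs.mean(axis=0, keepdims=True)                  # marginal p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))   # per-image KL terms
    return float(np.exp(kl.sum(axis=1).mean()))              # exp of the mean KL
```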
5.2. Comparison to Baselines

Our comparisons focus on the popular pix2pix and its variations. All models are trained for 300k iterations except
Model                       Input correctly identified?
Sketchy 1-NN retrieval      35.3%
pix2pix, Augmented+Label    65.9%
Ours                        47.4%

Table 3: Faithfulness test on three models. Models for which participants could pick the input sketch are considered more "faithful".
Figure 9: Some of the best output images from our full model. For each input sketch, we show a pair of output images to demonstrate the diversity of our model.

Cascaded Refinement Network (CRN) [5] and DCGAN structures in our image synthesis task. We train several additional models: one uses improved ResNet blocks [20], which is the best published variant [19], in both the generator and the discriminator; one is a weak baseline using the DCGAN structure; one uses CRN in the generator instead of MRU; and one is an MRU model using only the GAN loss and the ACGAN loss. We keep the number of parameters of the MRU model and that of the ResNet model roughly the same by reducing the feature depth in MRU. Detailed parameter counts can be found in Table 2. Judging from both visual quality and the Inception Scores, the MRU model generates better images than both the ResNet and CRN models, and we show that even using only standard GAN losses, MRU outperforms the other structures significantly. From Figure 8, we notice that the MRU model tends to produce higher quality foreground objects. This can be due to the internal masks of MRU serving as an attention mechanism, causing the network to selectively focus on the main object. In our task this is helpful, since we are mainly interested in generating a specific object from a sketch.

5.4. Human Evaluation of Realism and Faithfulness

We do two human evaluations to measure how our model compares against baselines in terms of realism and faithfulness to the input sketch. In the "faithfulness" test, a participant sees the output of either pix2pix, SketchyGAN, or 1-nearest-neighbor retrieval using the representation learned in the Sketchy Database [48]. With each image, the participant also sees 9 random sketches of the same category, one of which is the actual input/query sketch. The participant is asked to pick the sketch that prompted the output image. We then count how often participants pick the correct input sketch, so a higher correct selection rate indicates the model produces a more "faithful" output. In the "realism" test, a participant sees the outputs of pix2pix variants and SketchyGAN compared in pairs, alongside the corresponding input sketch. The participant is asked to pick the image that they think is more realistic. For each model we calculate how often participants think it is more realistic. The image retrieval baseline is not evaluated for realism since it only returns existing, realistic photographs. We conducted 696 trials for the "faithfulness" test and 348 trials for the "realism" test. The results show that SketchyGAN is more faithful than the retrieval model, but is less faithful than pix2pix, which often preserves the input edges precisely (Table 3). Meanwhile, SketchyGAN is considered more realistic than the pix2pix variants (Table 4). The results are consistent with our goal that our model should respect the intent of input sketches, but at the same time deviate from the strokes if necessary in order to produce realistic images.

6. Conclusion

In this work, we presented a novel approach to the sketch-to-image synthesis problem. The problem is challenging given the nature of sketches, and we introduced a deep generative model that is promising for sketch-to-image synthesis. We introduced a data augmentation technique for sketch-image pairs to encourage research in this direction. The demonstrated GAN framework can synthesize more realistic images than popular generative models, and the generated images are diverse. Currently, the main focus in GAN research is on finding better probability metrics as objective functions, but there have been very few works searching for better network structures in GANs. We proposed a new network structure for our generative task, and we showed that it performs better than existing structures.

Limitations. Ideally, we want our results to be both realistic and faithful to the intent of the input sketch. For many sketches, we fail to meet one or both of these goals. Results generally aren't photorealistic, nor are they of high enough resolution. Sometimes realism is lost by being overly faithful to the sketch – e.g., skinny horse legs that too closely follow the badly drawn input boundaries (Figure 9). In other cases, we do deviate from the user sketch to make the output more realistic (motorcycle and plane in Figure 1; mushroom, church, geyser, and castle in Figure 9) but still respect the pose and position of the object in the input sketch. This is more desirable. Human intent is hard to learn, and SketchyGAN failures that treat the input sketch too literally may be due to a lack of sketch-photo training pairs. Despite the fact that our results are not yet photorealistic, we think they show a substantial improvement over previous methods.

Acknowledgements. This work was funded by NSF award 1561968.
References

[1] D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
[2] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[3] Y. Cao, C. Wang, L. Zhang, and L. Zhang. Edgel index for large-scale sketch-based image search. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 761–768. IEEE, 2011.
[4] Y. Cao, H. Wang, C. Wang, Z. Li, L. Zhang, and L. Zhang. MindFinder: interactive sketch-based image search on millions of images. In Proceedings of the 18th ACM International Conference on Multimedia, pages 1605–1608. ACM, 2010.
[5] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[6] T. Chen, M.-M. Cheng, P. Tan, A. Shamir, and S.-M. Hu. Sketch2Photo: Internet image montage. ACM Transactions on Graphics (TOG), 28(5):124, 2009.
[7] T. Chen, P. Tan, L.-Q. Ma, M.-M. Cheng, A. Shamir, and S.-M. Hu. PoseShop: Human image database construction and personalized content synthesis. IEEE Transactions on Visualization and Computer Graphics, 19(5):824–837, 2013.
[8] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.
[9] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, pages 658–666. Curran Associates, Inc., 2016.
[10] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. ICLR, 2017.
[11] M. Eitz, J. Hays, and M. Alexa. How do humans sketch objects? ACM Transactions on Graphics (proceedings of SIGGRAPH), 31(4):44:1–44:10, 2012.
[12] M. Eitz, K. Hildebrand, T. Boubekeur, and M. Alexa. An evaluation of descriptors for large-scale image retrieval from sketched feature lines. Computers & Graphics, 34(5):482–498, 2010.
[13] M. Eitz, K. Hildebrand, T. Boubekeur, and M. Alexa. Sketch-based image retrieval: Benchmark and bag-of-features descriptors. IEEE Transactions on Visualization and Computer Graphics, 17(11):1624–1636, 2011.
[14] M. Eitz, R. Richter, K. Hildebrand, T. Boubekeur, and M. Alexa. Photosketcher: Interactive sketch-based image synthesis. IEEE Computer Graphics and Applications, 31(6):56–66, Nov 2011.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[16] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.
[17] D. Ha and D. Eck. A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477, 2017.
[18] X. Han, Z. Li, H. Huang, E. Kalogerakis, and Y. Yu. High-resolution shape completion using deep neural networks for global structure and local geometry inference. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[20] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645, 2016.
[21] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6629–6640, 2017.
[22] R. Hu, M. Barnard, and J. Collomosse. Gradient field descriptor for sketch based retrieval and localization. In Image Processing (ICIP), 2010 17th IEEE International Conference on, pages 1025–1028. IEEE, 2010.
[23] R. Hu and J. Collomosse. A performance evaluation of gradient field HOG descriptor for sketch based image retrieval. Computer Vision and Image Understanding, 117(7):790–806, 2013.
[24] R. Hu, T. Wang, and J. Collomosse. A bag-of-regions approach to sketch-based image retrieval. In Image Processing (ICIP), 2011 18th IEEE International Conference on, pages 3661–3664. IEEE, 2011.
[25] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[26] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[27] S. James, M. J. Fonseca, and J. Collomosse. ReEnact: Sketch based choreographic design from archival dance footage. In Proceedings of International Conference on Multimedia Retrieval, page 313. ACM, 2014.
[28] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, 2016.
[29] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
[30] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[31] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, Apr. 2014.
[32] N. Kodali, J. Abernethy, J. Hays, and Z. Kira. How to train your DRAGAN. arXiv preprint arXiv:1705.07215, 2017.
[33] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105. Curran Associates, Inc., 2012.
[34] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[35] K. Li, K. Pang, Y. Z. Song, T. Hospedales, H. Zhang, and Y. Hu. Fine-grained sketch-based image retrieval: The role of part-aware attributes. In The IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–9, March 2016.
[36] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[37] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[38] Y.-L. Lin, C.-Y. Huang, H.-J. Wang, and W. Hsu. 3D sub-query expansion for improving sketch-based multi-view image retrieval. In The IEEE International Conference on Computer Vision (ICCV), December 2013.
[39] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
[40] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pages 469–477, 2016.
[41] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37, 2016.
[42] A. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski. Plug & play generative networks: Conditional iterative generation of images in latent space. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[43] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2642–2651, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
[44] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[45] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[46] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[47] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
[48] P. Sangkloy, N. Burnell, C. Ham, and J. Hays. The Sketchy database: Learning to retrieve badly drawn bunnies. ACM Transactions on Graphics (proceedings of SIGGRAPH), 2016.
[49] P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays. Scribbler: Controlling deep image synthesis with sketch and color. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.
[50] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 2017.
[51] D. Thanh Nguyen, B.-S. Hua, K. Tran, Q.-H. Pham, and S.-K. Yeung. A field model for repairing 3D shapes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[52] D. Turmukhambetov, N. D. Campbell, D. B. Goldman, and J. Kautz. Interactive sketch-driven image synthesis. Computer Graphics Forum, 34(8):130–142, Dec. 2015.
[53] C. Wang, Z. Li, and L. Zhang. MindFinder: image search by interactive sketching and tagging. In Proceedings of the 19th International Conference on World Wide Web, pages 1309–1312. ACM, 2010.
[54] F. Wang, L. Kang, and Y. Li. Sketch-based 3D shape retrieval using convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[55] X. Wang and X. Tang. Face photo-sketch synthesis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11):1955–1967, 2009.
[56] S. Xie and Z. Tu. Holistically-nested edge detection. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
[57] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon. Pixel-level domain transfer. In European Conference on Computer Vision, 2016.
[58] Q. Yu, F. Liu, Y.-Z. Song, T. Xiang, T. M. Hospedales, and C.-C. Loy. Sketch me that shoe. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[59] T. Zhang and C. Y. Suen. A fast parallel algorithm for thinning digital patterns. Communications of the ACM, 27(3):236–239, 1984.
[60] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.