CycleGAN_CVPR2017
Abstract

Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, for many tasks, paired training data will not be available. We present an approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples. Our goal is to learn a mapping G : X → Y such that the distribution of images from G(X) is indistinguishable from the distribution Y using an adversarial loss. Because this mapping is highly under-constrained, we couple it with an inverse mapping F : Y → X and introduce a cycle consistency loss to enforce F(G(X)) ≈ X (and vice versa). Qualitative results are presented on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, photo enhancement, etc. Quantitative comparisons against several prior methods demonstrate the superiority of our approach.

1. Introduction

What did Claude Monet see as he placed his easel by the bank of the Seine near Argenteuil on a lovely spring day in 1873 (Figure 1, top-left)? A color photograph, had it been invented, may have documented a crisp blue sky and a glassy river reflecting it. Monet conveyed his impression of this same scene through wispy brush strokes and a bright palette.

What if Monet had happened upon the little harbor in Cassis on a cool summer evening (Figure 1, bottom-left)? A brief stroll through a gallery of Monet paintings makes it possible to imagine how he would have rendered the scene: perhaps in pastel shades, with abrupt dabs of paint, and a somewhat flattened dynamic range.

We can imagine all this despite never having seen a side by side example of a Monet painting next to a photo of the scene he painted. Instead, we have knowledge of the set of Monet paintings and of the set of landscape photographs. We can reason about the stylistic differences between these two sets, and thereby imagine what a scene might look like if we were to "translate" it from one set into the other.

In this paper, we present a method that can learn to do the same: capturing special characteristics of one image collection and figuring out how these characteristics could be translated into the other image collection, all in the absence of any paired training examples.

This problem can be more broadly described as image-to-image translation [22], converting an image from one representation of a given scene, x, to another, y, e.g., grayscale to color, image to semantic labels, edge-map to photograph. Years of research in computer vision, image processing, computational photography, and graphics have produced powerful translation systems in the supervised setting, where example image pairs {x_i, y_i}_{i=1}^N are available (Figure 2, left), e.g., [11, 19, 22, 23, 28, 33, 45, 56, 58, 62]. However, obtaining paired training data can be difficult and expensive. For example, only a couple of datasets exist for tasks like semantic segmentation (e.g., [4]), and they are relatively small. Obtaining input-output pairs for graphics tasks like artistic stylization can be even more difficult since the desired output is highly complex, typically requiring artistic authoring. For many tasks, like object transfiguration (e.g., zebra↔horse, Figure 1 top-middle), the desired output is not even well-defined.

Figure 2: Paired training data (left) consists of training examples {x_i, y_i}_{i=1}^N, where the correspondence between x_i and y_i exists [22]. We instead consider unpaired training data (right), consisting of a source set {x_i}_{i=1}^N (x_i ∈ X) and a target set {y_j}_{j=1}^M (y_j ∈ Y), with no information provided as to which x_i matches which y_j.

We therefore seek an algorithm that can learn to translate between domains without paired input-output examples (Figure 2, right). We assume there is some underlying relationship between the domains – for example, that they are two different renderings of the same underlying scene – and seek to learn that relationship. Although we lack supervision in the form of paired examples, we can exploit supervision at the level of sets: we are given one set of images in domain X and a different set in domain Y. We may train a mapping G : X → Y such that the output ŷ = G(x), x ∈ X, is indistinguishable from images y ∈ Y by an adversary trained to classify ŷ apart from y. In theory, this objective can induce an output distribution over ŷ that matches the empirical distribution p_data(y) (in general, this requires G to be stochastic) [16]. The optimal G thereby translates the domain X to a domain Ŷ distributed identically to Y. However, such a translation does not guarantee that an individual input x and output y are paired up in a meaningful way – there are infinitely many mappings G that will induce the same distribution over ŷ. Moreover, in practice, we have found it difficult to optimize the adversarial objective in isolation: standard procedures often lead to the well-known problem of mode collapse, where all input images map to the same output image and the optimization fails to make progress [15].

These issues call for adding more structure to our objective. Therefore, we exploit the property that translation should be "cycle consistent", in the sense that if we translate, e.g., a sentence from English to French, and then translate it back from French to English, we should arrive back at the original sentence [3]. Mathematically, if we have a translator G : X → Y and another translator F : Y → X, then G and F should be inverses of each other, and both mappings should be bijections. We apply this structural assumption by training both the mapping G and F simultaneously, and adding a cycle consistency loss [64] that encourages F(G(x)) ≈ x and G(F(y)) ≈ y. Combining this loss with adversarial losses on domains X and Y yields our full objective for unpaired image-to-image translation.

We apply our method to a wide range of applications, including collection style transfer, object transfiguration, season transfer and photo enhancement. We also compare against previous approaches that rely either on hand-defined factorizations of style and content, or on shared embedding functions, and show that our method outperforms these baselines. We provide both PyTorch and Torch implementations. Check out more results at our website.

2. Related work

Generative Adversarial Networks (GANs) [16, 63] have achieved impressive results in image generation [6, 39], image editing [66], and representation learning [39, 43, 37]. Recent methods adopt the same idea for conditional image generation applications, such as text2image [41], image inpainting [38], and future prediction [36], as well as to other domains like videos [54] and 3D data [57]. The key to GANs' success is the idea of an adversarial loss that forces the generated images to be, in principle, indistinguishable from real photos. This loss is particularly powerful for image generation tasks, as this is exactly the objective that much of computer graphics aims to optimize. We adopt an adversarial loss to learn the mapping such that the translated images cannot be distinguished from images in the target domain.
Figure 3: (a) Our model contains two mapping functions G : X → Y and F : Y → X, and associated adversarial discriminators DY and DX. DY encourages G to translate X into outputs indistinguishable from domain Y, and vice versa for DX and F. To further regularize the mappings, we introduce two cycle consistency losses that capture the intuition that if we translate from one domain to the other and back again we should arrive at where we started: (b) forward cycle-consistency loss: x → G(x) → F(G(x)) ≈ x, and (c) backward cycle-consistency loss: y → F(y) → G(F(y)) ≈ y.
Image-to-Image Translation The idea of image-to-image translation goes back at least to Hertzmann et al.'s Image Analogies [19], who employ a non-parametric texture model [10] on a single input-output training image pair. More recent approaches use a dataset of input-output examples to learn a parametric translation function using CNNs (e.g., [33]). Our approach builds on the "pix2pix" framework of Isola et al. [22], which uses a conditional generative adversarial network [16] to learn a mapping from input to output images. Similar ideas have been applied to various tasks such as generating photographs from sketches [44] or from attribute and semantic layouts [25]. However, unlike the above prior work, we learn the mapping without paired training examples.

Unpaired Image-to-Image Translation Several other methods also tackle the unpaired setting, where the goal is to relate two data domains: X and Y. Rosales et al. [42] propose a Bayesian framework that includes a prior based on a patch-based Markov random field computed from a source image and a likelihood term obtained from multiple style images. More recently, CoGAN [32] and cross-modal scene networks [1] use a weight-sharing strategy to learn a common representation across domains. Concurrent to our method, Liu et al. [31] extends the above framework with a combination of variational autoencoders [27] and generative adversarial networks [16]. Another line of concurrent work [46, 49, 2] encourages the input and output to share specific "content" features even though they may differ in "style". These methods also use adversarial networks, with additional terms to enforce the output to be close to the input in a predefined metric space, such as class label space [2], image pixel space [46], and image feature space [49].

Unlike the above approaches, our formulation does not rely on any task-specific, predefined similarity function between the input and output, nor do we assume that the input and output have to lie in the same low-dimensional embedding space. This makes our method a general-purpose solution for many vision and graphics tasks. We directly compare against several prior and contemporary approaches in Section 5.1.

Cycle Consistency The idea of using transitivity as a way to regularize structured data has a long history. In visual tracking, enforcing simple forward-backward consistency has been a standard trick for decades [24, 48]. In the language domain, verifying and improving translations via "back translation and reconciliation" is a technique used by human translators [3] (including, humorously, by Mark Twain [51]), as well as by machines [17]. More recently, higher-order cycle consistency has been used in structure from motion [61], 3D shape matching [21], co-segmentation [55], dense semantic alignment [65, 64], and depth estimation [14]. Of these, Zhou et al. [64] and Godard et al. [14] are most similar to our work, as they use a cycle consistency loss as a way of using transitivity to supervise CNN training. In this work, we are introducing a similar loss to push G and F to be consistent with each other. Concurrent with our work, in these same proceedings, Yi et al. [59] independently use a similar objective for unpaired image-to-image translation, inspired by dual learning in machine translation [17].

Neural Style Transfer [13, 23, 52, 12] is another way to perform image-to-image translation, which synthesizes a novel image by combining the content of one image with the style of another image (typically a painting) based on matching the Gram matrix statistics of pre-trained deep features. Our primary focus, on the other hand, is learning the mapping between two image collections, rather than between two specific images, by trying to capture correspondences between higher-level appearance structures. Therefore, our method can be applied to other tasks, such as painting→photo, object transfiguration, etc., where single-sample transfer methods do not perform well. We compare these two methods in Section 5.2.

Figure 4: Input x, output G(x), and reconstruction F(G(x)).
3. Formulation

Our goal is to learn mapping functions between two domains X and Y given training samples {x_i}_{i=1}^N where x_i ∈ X and {y_j}_{j=1}^M where y_j ∈ Y. We denote the data distribution as x ∼ p_data(x) and y ∼ p_data(y). As illustrated in Figure 3 (a), our model includes two mappings G : X → Y and F : Y → X. In addition, we introduce two adversarial discriminators DX and DY, where DX aims to distinguish between images {x} and translated images {F(y)}; in the same way, DY aims to discriminate between {y} and {G(x)}. Our objective contains two types of terms: adversarial losses [16] for matching the distribution of generated images to the data distribution in the target domain; and cycle consistency losses to prevent the learned mappings G and F from contradicting each other.
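The loss equations themselves appear in subsections not reproduced above (the cycle term is referenced later as Equation 2), but the structure described here is enough to sketch how the objective fits together. The PyTorch fragment below is a minimal illustration, not the released implementation: the placeholder modules standing in for G, F, D_X, and D_Y, the least-squares form of the adversarial loss, and the weight lambda_cyc = 10 (stated in Appendix 7.1) are assumptions or simplifications.

```python
import torch
import torch.nn as nn

# Placeholder networks; any image-to-image generators G, F and
# patch-level discriminators D_X, D_Y would play these roles.
G = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))    # G: X -> Y
F = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))    # F: Y -> X
D_X = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1))  # discriminator on domain X
D_Y = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1))  # discriminator on domain Y

gan_loss = nn.MSELoss()   # least-squares adversarial loss (an assumption here)
cyc_loss = nn.L1Loss()    # L1 cycle-consistency loss
lambda_cyc = 10.0         # cycle-consistency weight (Appendix 7.1: lambda = 10)

def generator_objective(real_x, real_y):
    """Adversarial + cycle-consistency terms used to update G and F."""
    fake_y = G(real_x)   # G(x), should fool D_Y
    fake_x = F(real_y)   # F(y), should fool D_X
    rec_x = F(fake_y)    # F(G(x)) ~ x  (forward cycle)
    rec_y = G(fake_x)    # G(F(y)) ~ y  (backward cycle)

    adv = gan_loss(D_Y(fake_y), torch.ones_like(D_Y(fake_y))) + \
          gan_loss(D_X(fake_x), torch.ones_like(D_X(fake_x)))
    cyc = cyc_loss(rec_x, real_x) + cyc_loss(rec_y, real_y)
    return adv + lambda_cyc * cyc

def discriminator_objective(D, real, fake):
    """Each discriminator scores real images as 1 and translated images as 0."""
    return 0.5 * (gan_loss(D(real), torch.ones_like(D(real))) +
                  gan_loss(D(fake.detach()), torch.zeros_like(D(fake.detach()))))
```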
Figure 5: Different methods for mapping labels↔photos trained on Cityscapes images. From left to right: input, BiGAN/ALI [7, 9], CoGAN [32], feature loss + GAN, SimGAN [46], CycleGAN (ours), pix2pix [22] trained on paired data, and ground truth.

Figure 6: Different methods for mapping aerial photos↔maps on Google Maps. From left to right: input, BiGAN/ALI [7, 9], CoGAN [32], feature loss + GAN, SimGAN [46], CycleGAN (ours), pix2pix [22] trained on paired data, and ground truth.

…distributed from those tested in [22] (due to running the experiment at a different date and time). Therefore, our numbers should only be used to compare our current method against the baselines (which were run under identical conditions), rather than against [22]. (Footnote: the model was trained on 256 × 256 patches of 512 × 512 images, and run convolutionally on the 512 × 512 images at test time. We choose 256 × 256 in our experiments as many baselines cannot scale up to high-resolution images, and CoGAN cannot be tested fully convolutionally.)

FCN score Although perceptual studies may be the gold standard for assessing graphical realism, we also seek an automatic quantitative measure that does not require human experiments. For this, we adopt the "FCN score" from [22], and use it to evaluate the Cityscapes labels→photo task. The FCN metric evaluates how interpretable the generated photos are according to an off-the-shelf semantic segmentation algorithm (the fully-convolutional network, FCN, from [33]). The FCN predicts a label map for a generated photo. This label map can then be compared against the input ground truth labels using the standard semantic segmentation metrics described below. The intuition is that if we generate a photo from a label map of "car on the road", then we have succeeded if the FCN applied to the generated photo detects "car on the road".

Semantic segmentation metrics To evaluate the performance of photo→labels, we use the standard metrics from the Cityscapes benchmark [4], including per-pixel accuracy, per-class accuracy, and mean class Intersection-Over-Union (Class IOU) [4].
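For reference, the sketch below computes these three numbers from a confusion matrix over predicted and ground-truth label maps. It is a generic implementation of the standard Cityscapes-style metrics, not the exact evaluation code used in the paper; the FCN model that produces the predictions is assumed to be available separately.

```python
import numpy as np

def segmentation_scores(pred, gt, num_classes):
    """Per-pixel accuracy, per-class accuracy, and mean class IoU from
    predicted and ground-truth label maps (integer arrays of equal shape)."""
    # Confusion matrix: rows = ground-truth class, columns = predicted class.
    mask = (gt >= 0) & (gt < num_classes)
    conf = np.bincount(num_classes * gt[mask].astype(int) + pred[mask].astype(int),
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)

    tp = np.diag(conf)                               # correctly labeled pixels per class
    per_pixel_acc = tp.sum() / conf.sum()            # fraction of all pixels correct
    per_class_acc = np.nanmean(tp / conf.sum(axis=1))
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)
    return per_pixel_acc, per_class_acc, np.nanmean(iou)
```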
5.1.2 Baselines

CoGAN [32] This method learns one GAN generator for domain X and one for domain Y, with tied weights on the first few layers for shared latent representations. Translation from X to Y can be achieved by finding a latent representation that generates image X and then rendering this latent representation into style Y.

SimGAN [46] Like our method, Shrivastava et al. [46] uses an adversarial loss to train a translation from X to Y. The regularization term ||x − G(x)||_1 is used to penalize making large changes at the pixel level.

Feature loss + GAN We also test a variant of SimGAN [46] where the L1 loss is computed over deep image features using a pretrained network (VGG-16 relu4_2 [47]), rather than over RGB pixel values. Computing distances in deep feature space, like this, is also sometimes referred to as using a "perceptual loss" [8, 23].
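A hedged sketch of this feature-space variant is given below. The torchvision VGG-16 slice used to approximate relu4_2, the frozen weights, and the omission of ImageNet input normalization are illustrative assumptions, not the baseline's exact configuration.

```python
import torch.nn as nn
import torchvision

# Frozen VGG-16 feature extractor up to (approximately) relu4_2.  The slice
# index 21 follows torchvision's layer ordering for vgg16 and should be
# double-checked against the exact layer naming.
vgg = torchvision.models.vgg16(pretrained=True).features[:21].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

l1 = nn.L1Loss()

def feature_l1(translated, source):
    """'Feature loss + GAN' regularizer: L1 distance between the translated
    image G(x) and its input x in deep feature space instead of RGB space."""
    return l1(vgg(translated), vgg(source))
```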
BiGAN/ALI [9, 7] Unconditional GANs [16] learn a generator G : Z → X that maps a random noise z to an image x. The BiGAN [9] and ALI [7] propose to also learn the inverse mapping function F : X → Z. Though they were originally designed for mapping a latent vector z to an image x, we implemented the same objective for mapping a source image x to a target image y.

pix2pix [22] We also compare against pix2pix [22], which is trained on paired data, to see how close we can get to this "upper bound" without using any paired data.

For a fair comparison, we implement all the baselines using the same architecture and details as our method, except for CoGAN [32]. CoGAN builds on generators that produce images from a shared latent representation, which is incompatible with our image-to-image network. We use the public implementation of CoGAN instead.

5.1.3 Comparison against baselines

As can be seen in Figure 5 and Figure 6, we were unable to achieve compelling results with any of the baselines. Our method, on the other hand, can produce translations that are often of similar quality to the fully supervised pix2pix.

Table 1 reports performance regarding the AMT perceptual realism task. Here, we see that our method can fool participants on around a quarter of trials, in both the maps→aerial photos direction and the aerial photos→maps direction at 256 × 256 resolution. All the baselines almost never fooled participants. (Footnote: we also train CycleGAN and pix2pix at 512 × 512 resolution, and observe comparable performance: maps→aerial photos: CycleGAN: 37.5% ± 3.6% and pix2pix: 33.9% ± 3.1%; aerial photos→maps: CycleGAN: 16.5% ± 4.1% and pix2pix: 8.5% ± 2.6%.)

Table 1: AMT "real vs fake" test on maps↔aerial photos at 256 × 256 resolution.
Loss               | Map → Photo (% Turkers labeled real) | Photo → Map (% Turkers labeled real)
CoGAN [32]         | 0.6% ± 0.5%  | 0.9% ± 0.5%
BiGAN/ALI [9, 7]   | 2.1% ± 1.0%  | 1.9% ± 0.9%
SimGAN [46]        | 0.7% ± 0.5%  | 2.6% ± 1.1%
Feature loss + GAN | 1.2% ± 0.6%  | 0.3% ± 0.2%
CycleGAN (ours)    | 26.8% ± 2.8% | 23.2% ± 3.4%

Table 2 assesses the performance of the labels→photo task on the Cityscapes dataset, and Table 3 evaluates the opposite mapping (photos→labels). In both cases, our method again outperforms the baselines.

Table 2: FCN-scores for different methods, evaluated on Cityscapes labels→photo.
Loss               | Per-pixel acc. | Per-class acc. | Class IOU
CoGAN [32]         | 0.40 | 0.10 | 0.06
BiGAN/ALI [9, 7]   | 0.19 | 0.06 | 0.02
SimGAN [46]        | 0.20 | 0.10 | 0.04
Feature loss + GAN | 0.06 | 0.04 | 0.01
CycleGAN (ours)    | 0.52 | 0.17 | 0.11
pix2pix [22]       | 0.71 | 0.25 | 0.18

Table 3: Classification performance of photo→labels for different methods on Cityscapes.
Loss               | Per-pixel acc. | Per-class acc. | Class IOU
CoGAN [32]         | 0.45 | 0.11 | 0.08
BiGAN/ALI [9, 7]   | 0.41 | 0.13 | 0.07
SimGAN [46]        | 0.47 | 0.11 | 0.07
Feature loss + GAN | 0.50 | 0.10 | 0.06
CycleGAN (ours)    | 0.58 | 0.22 | 0.16
pix2pix [22]       | 0.85 | 0.40 | 0.32

5.1.4 Analysis of the loss function

In Table 4 and Table 5, we compare against ablations of our full loss. Removing the GAN loss substantially degrades results, as does removing the cycle-consistency loss. We therefore conclude that both terms are critical to our results. We also evaluate our method with the cycle loss in only one direction: GAN + forward cycle loss E_{x∼p_data(x)}[||F(G(x)) − x||_1], or GAN + backward cycle loss E_{y∼p_data(y)}[||G(F(y)) − y||_1] (Equation 2), and find that it often incurs training instability and causes mode collapse, especially for the direction of the mapping that was removed. Figure 7 shows several qualitative examples.

Table 4: Ablation study: FCN-scores for different variants of our method, evaluated on Cityscapes labels→photo.
Loss                 | Per-pixel acc. | Per-class acc. | Class IOU
Cycle alone          | 0.22 | 0.07 | 0.02
GAN alone            | 0.51 | 0.11 | 0.08
GAN + forward cycle  | 0.55 | 0.18 | 0.12
GAN + backward cycle | 0.39 | 0.14 | 0.06
CycleGAN (ours)      | 0.52 | 0.17 | 0.11

Table 5: Ablation study: classification performance of photo→labels for different losses, evaluated on Cityscapes.
Loss                 | Per-pixel acc. | Per-class acc. | Class IOU
Cycle alone          | 0.10 | 0.05 | 0.02
GAN alone            | 0.53 | 0.11 | 0.07
GAN + forward cycle  | 0.49 | 0.11 | 0.07
GAN + backward cycle | 0.01 | 0.06 | 0.01
CycleGAN (ours)      | 0.58 | 0.22 | 0.16

5.1.5 Image reconstruction quality

In Figure 4, we show a few random samples of the reconstructed images F(G(x)). We observed that the reconstructed images were often close to the original inputs x, at both training and testing time, even in cases where one domain represents significantly more diverse information, such as map↔aerial photos.
Figure 7: Different variants of our method for mapping labels↔photos trained on cityscapes. From left to right: input, cycle-
consistency loss alone, adversarial loss alone, GAN + forward cycle-consistency loss (F (G(x)) ≈ x), GAN + backward
cycle-consistency loss (G(F (y)) ≈ y), CycleGAN (our full method), and ground truth. Both Cycle alone and GAN +
backward fail to produce images similar to the target domain. GAN alone and GAN + forward suffer from mode collapse,
producing identical label maps regardless of the input photo.
Figure 8: Example results of CycleGAN on paired datasets used in "pix2pix" [22] such as architectural labels↔photos and edges↔shoes (rows: label → facade, facade → label, edges → shoes, shoes → edges).

5.1.6 Additional results on paired datasets

Figure 8 shows some example results on other paired datasets used in "pix2pix" [22], such as architectural labels↔photos from the CMP Facade Database [40], and edges↔shoes from the UT Zappos50K dataset [60]. The image quality of our results is close to those produced by the fully supervised pix2pix while our method learns the mapping without paired supervision.

5.2. Applications

We demonstrate our method on several applications where paired training data does not exist. Please refer to the appendix (Section 7) for more details about the datasets. We observe that translations on training data are often more appealing than those on test data, and full results of all applications on both training and test data can be viewed on our project website.

Collection style transfer (Figure 10 and Figure 11) We train the model on landscape photographs downloaded from Flickr and WikiArt. Unlike recent work on "neural style transfer" [13], our method learns to mimic the style of an entire collection of artworks, rather than transferring the style of a single selected piece of art. Therefore, we can learn to generate photos in the style of, e.g., Van Gogh, rather than just in the style of Starry Night. The size of the dataset for each artist/style was 526, 1073, 400, and 563 for Cezanne, Monet, Van Gogh, and Ukiyo-e.

Object transfiguration (Figure 13) The model is trained to translate one object class from ImageNet [5] to another (each class contains around 1000 training images). Turmukhambetov et al. [50] propose a subspace model to translate one object into another object of the same category, while our method focuses on object transfiguration between two visually similar categories.

Season transfer (Figure 13) The model is trained on 854 winter photos and 1273 summer photos of Yosemite downloaded from Flickr.

Photo generation from paintings (Figure 12) For painting→photo, we find that it is helpful to introduce an additional loss to encourage the mapping to preserve color composition between the input and output. In particular, we adopt the technique of Taigman et al. [49] and regularize the generator to be near an identity mapping when real samples of the target domain are provided as the input to the generator: i.e., L_identity(G, F) = E_{y∼p_data(y)}[||G(y) − y||_1] + E_{x∼p_data(x)}[||F(x) − x||_1].
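A minimal PyTorch sketch of this identity term, assuming generators G and F like those in the earlier objective sketch and the 0.5λ weighting reported in Appendix 7.1:

```python
import torch.nn as nn

l1 = nn.L1Loss()

def identity_loss(G, F, real_x, real_y, lambda_cyc=10.0, weight=0.5):
    """L_identity(G, F) = E_y[||G(y) - y||_1] + E_x[||F(x) - x||_1],
    weighted by 0.5 * lambda as described in Appendix 7.1."""
    return weight * lambda_cyc * (l1(G(real_y), real_y) + l1(F(real_x), real_x))
```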
(Figure panels: input, CycleGAN, CycleGAN + L_identity.)

…collection, we compute the average Gram Matrix across the target domain and use this matrix to transfer the "average style" with Gatys et al. [13].

Figure 16 demonstrates similar comparisons for other translation tasks. We observe that Gatys et al. [13] requires finding target style images that closely match the desired output, but still often fails to produce photorealistic results, while our method succeeds in generating natural-looking results, similar to the target domain.
Figure 10: Collection style transfer I: we transfer input images into the artistic styles of Monet, Van Gogh, Cezanne, and
Ukiyo-e. Please see our website for additional examples.
Figure 11: Collection style transfer II: we transfer input images into the artistic styles of Monet, Van Gogh, Cezanne, Ukiyo-e.
Please see our website for additional examples.
Figure 12: Relatively successful results on mapping Monet’s paintings to a photographic style. Please see our website for
additional examples.
Figure 13: Our method applied to several translation problems. These images are selected as relatively successful results
– please see our website for more comprehensive and random results. In the top two rows, we show results on object
transfiguration between horses and zebras, trained on 939 images from the wild horse class and 1177 images from the zebra
class in Imagenet [5]. Also check out the horse→zebra demo video. The middle two rows show results on season transfer,
trained on winter and summer photos of Yosemite from Flickr. In the bottom two rows, we train our method on 996 apple
images and 1020 navel orange images from ImageNet.
Figure 14: Photo enhancement: mapping from a set of smartphone snaps to professional DSLR photographs; the system often learns to produce shallow focus. Here we show some of the most successful results in our test set – average performance is considerably worse. Please see our website for more comprehensive and random examples.
Figure 15: We compare our method with neural style transfer [13] on photo stylization (rows include photo → Ukiyo-e and photo → Cezanne). Left to right: input image, results from Gatys et al. [13] using two different representative artworks as style images, results from Gatys et al. [13] using the entire collection of the artist, and CycleGAN (ours).
Figure 16: We compare our method with neural style transfer [13] on various applications. From top to bottom:
apple→orange, horse→zebra, and Monet→photo. Left to right: input image, results from Gatys et al. [13] using two different
images as style images, results from Gatys et al. [13] using all the images from the target domain, and CycleGAN (ours).
(Panel labels: photo → Ukiyo-e, photo → Van Gogh, iPhone photo → DSLR photo, ImageNet "wild horse" training images.)
Figure 17: Typical failure cases of our method. Left: in the task of dog→cat transfiguration, CycleGAN can only make
minimal changes to the input. Right: CycleGAN also fails in this horse → zebra example as our model has not seen images
of horseback riding during training. Please see our website for more comprehensive results.
References

[1] Y. Aytar, L. Castrejon, C. Vondrick, H. Pirsiavash, and A. Torralba. Cross-modal scene networks. PAMI, 2016.
[2] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR, 2017.
[3] R. W. Brislin. Back-translation for cross-cultural research. Journal of Cross-Cultural Psychology, 1(3):185–216, 1970.
[4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[6] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, 2015.
[7] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. In ICLR, 2017.
[8] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In NIPS, 2016.
[9] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville. Adversarially learned inference. In ICLR, 2017.
[10] A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In ICCV, 1999.
[11] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015.
[12] L. A. Gatys, M. Bethge, A. Hertzmann, and E. Shechtman. Preserving color in neural artistic style transfer. arXiv preprint arXiv:1606.05897, 2016.
[13] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
[14] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017.
[15] I. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[17] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W.-Y. Ma. Dual learning for machine translation. In NIPS, 2016.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[19] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. In SIGGRAPH, 2001.
[20] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[21] Q.-X. Huang and L. Guibas. Consistent shape maps via semidefinite programming. In Symposium on Geometry Processing, 2013.
[22] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[23] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
[24] Z. Kalal, K. Mikolajczyk, and J. Matas. Forward-backward error: Automatic detection of tracking failures. In ICPR, 2010.
[25] L. Karacan, Z. Akata, A. Erdem, and E. Erdem. Learning to generate images of outdoor scenes from attributes and semantic layouts. arXiv preprint arXiv:1612.00215, 2016.
[26] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[27] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[28] P.-Y. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM TOG, 33(4):149, 2014.
[29] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
[30] C. Li and M. Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In ECCV, 2016.
[31] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.
[32] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In NIPS, 2016.
[33] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[34] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. In ICLR, 2016.
[35] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In CVPR, 2017.
[36] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
[37] M. F. Mathieu, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun. Disentangling factors of variation in deep representation using adversarial training. In NIPS, 2016.
[38] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
[39] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
[40] R. Tyleček and R. Šára. Spatial pattern templates for recognition of objects with regular structure. In Proc. GCPR, Saarbrücken, Germany, 2013.
[41] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016.
[42] R. Rosales, K. Achan, and B. J. Frey. Unsupervised image translation. In ICCV, 2003.
[43] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
[44] P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays. Scribbler: Controlling deep image synthesis with sketch and color. In CVPR, 2017.
[45] Y. Shih, S. Paris, F. Durand, and W. T. Freeman. Data-driven hallucination of different times of day from a single outdoor photo. ACM TOG, 32(6):200, 2013.
[46] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, 2017.
[47] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[48] N. Sundaram, T. Brox, and K. Keutzer. Dense point trajectories by GPU-accelerated large displacement optical flow. In ECCV, 2010.
[49] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. In ICLR, 2017.
[50] D. Turmukhambetov, N. D. Campbell, S. J. Prince, and J. Kautz. Modeling object appearance using context-conditioned component analysis. In CVPR, 2015.
[51] M. Twain. The jumping frog: in English, then in French, and then clawed back into a civilized language once more by patient, unremunerated toil. 1903.
[52] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, 2016.
[53] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[54] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NIPS, 2016.
[55] F. Wang, Q. Huang, and L. J. Guibas. Image co-segmentation via consistent functional maps. In ICCV, 2013.
[56] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In ECCV, 2016.
[57] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In NIPS, 2016.
[58] S. Xie and Z. Tu. Holistically-nested edge detection. In ICCV, 2015.
[59] Z. Yi, H. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In ICCV, 2017.
[60] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In CVPR, 2014.
[61] C. Zach, M. Klopschitz, and M. Pollefeys. Disambiguating visual relations using loop constraints. In CVPR, 2010.
[62] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, 2016.
[63] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2017.
[64] T. Zhou, P. Krahenbuhl, M. Aubry, Q. Huang, and A. A. Efros. Learning dense correspondence via 3D-guided cycle consistency. In CVPR, 2016.
[65] T. Zhou, Y. J. Lee, S. Yu, and A. A. Efros. FlowWeb: Joint image set alignment by weaving consistent, pixel-wise correspondences. In CVPR, 2015.
[66] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In ECCV, 2016.
7. Appendix

7.1. Training details

We train our networks from scratch, with a learning rate of 0.0002. In practice, we divide the objective by 2 while optimizing D, which slows down the rate at which D learns, relative to the rate of G. We keep the same learning rate for the first 100 epochs and linearly decay the rate to zero over the next 100 epochs. Weights are initialized from a Gaussian distribution N(0, 0.02).
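A minimal PyTorch sketch of this schedule and initialization is shown below. The optimizer choice (Adam, which appears in the bibliography as [26]) and its momentum terms are assumptions, since the implementation section describing them is not reproduced here; the learning rate, decay schedule, and N(0, 0.02) initialization follow the text above.

```python
import torch
import torch.nn as nn

def init_weights(m):
    """Initialize conv weights from a Gaussian N(0, 0.02), as in Section 7.1."""
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)

def lr_lambda(epoch):
    """Constant learning rate for the first 100 epochs, then (approximately)
    linear decay to zero over the next 100 epochs."""
    return 1.0 - max(0, epoch - 99) / 101.0

netG = nn.Conv2d(3, 3, 3, padding=1)  # placeholder standing in for a generator
netG.apply(init_weights)

# Adam is an assumption here; lr = 0.0002 as stated in the training details.
optimizer = torch.optim.Adam(netG.parameters(), lr=0.0002, betas=(0.5, 0.999))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)

for epoch in range(200):
    # ... one training epoch over the dataset would run here ...
    scheduler.step()
```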
Cityscapes label↔Photo 2975 training images from the Cityscapes training set [4] with image size 128 × 128. We used the Cityscapes val set for testing.

Maps↔aerial photograph 1096 training images were scraped from Google Maps [22] with image size 256 × 256. Images were sampled from in and around New York City. Data was then split into train and test about the median latitude of the sampling region (with a buffer region added to ensure that no training pixel appeared in the test set).

Architectural facades labels↔photo 400 training images from the CMP Facade Database [40].

Edges→shoes around 50,000 training images from the UT Zappos50K dataset [60]. The model was trained for 5 epochs.

Horse↔Zebra and Apple↔Orange We downloaded the images from ImageNet [5] using keywords wild horse, zebra, apple, and navel orange. The images were scaled to 256 × 256 pixels. The training set size of each class: 939 (horse), 1177 (zebra), 996 (apple), and 1020 (orange).

Summer↔Winter Yosemite The images were downloaded using the Flickr API with the tag yosemite and the datetaken field. Black-and-white photos were pruned. The images were scaled to 256 × 256 pixels. The training size of each class: 1273 (summer) and 854 (winter).

Photo↔Art for style transfer The art images were downloaded from Wikiart.org. Some artworks that were sketches or too obscene were pruned by hand. The photos were downloaded from Flickr using the combination of tags landscape and landscapephotography. Black-and-white photos were pruned. The images were scaled to 256 × 256 pixels. The training set size of each class was 1074 (Monet), 584 (Cezanne), 401 (Van Gogh), 1433 (Ukiyo-e), and 6853 (Photographs). The Monet dataset was particularly pruned to include only landscape paintings, and the Van Gogh dataset included only his later works that represent his most recognizable artistic style.

Monet's paintings→photos To achieve high resolution while conserving memory, we used random square crops of the original images for training. To generate results, we passed images of width 512 pixels with correct aspect ratio to the generator network as input. The weight for the identity mapping loss was 0.5λ, where λ was the weight for the cycle consistency loss. We set λ = 10.

Flower photo enhancement Flower images taken on smartphones were downloaded from Flickr by searching for the photos taken by Apple iPhone 5, 5s, or 6, with search text flower. DSLR images with shallow DoF were also downloaded from Flickr by search tag flower, dof. The images were scaled to 360 pixels by width. The identity mapping loss of weight 0.5λ was used. The training set sizes of the smartphone and DSLR datasets were 1813 and 3326, respectively. We set λ = 10.

7.2. Network architectures

We provide both PyTorch and Torch implementations.

Generator architectures We adopt our architectures from Johnson et al. [23]. We use 6 residual blocks for 128 × 128 training images, and 9 residual blocks for 256 × 256 or higher-resolution training images. Below, we follow the naming convention used in Johnson et al.'s GitHub repository.

Let c7s1-k denote a 7 × 7 Convolution-InstanceNorm-ReLU layer with k filters and stride 1. dk denotes a 3 × 3 Convolution-InstanceNorm-ReLU layer with k filters and stride 2. Reflection padding was used to reduce artifacts. Rk denotes a residual block that contains two 3 × 3 convolutional layers with the same number of filters on both layers. uk denotes a 3 × 3 fractional-strided-Convolution-InstanceNorm-ReLU layer with k filters and stride 1/2.

The network with 6 residual blocks consists of:
c7s1-64,d128,d256,R256,R256,R256,R256,R256,R256,u128,u64,c7s1-3

The network with 9 residual blocks consists of:
c7s1-64,d128,d256,R256,R256,R256,R256,R256,R256,R256,R256,R256,u128,u64,c7s1-3

Discriminator architectures For discriminator networks, we use 70 × 70 PatchGAN [22]. Let Ck denote a 4 × 4 Convolution-InstanceNorm-LeakyReLU layer with k filters and stride 2. After the last layer, we apply a convolution to produce a 1-dimensional output. We do not use InstanceNorm for the first C64 layer. We use leaky ReLUs with a slope of 0.2. The discriminator architecture is:
C64-C128-C256-C512
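To make the c7s1-k / dk / Rk / uk and Ck notation concrete, here is a hedged PyTorch sketch of the 9-block generator and the C64-C128-C256-C512 discriminator. Padding choices, the final tanh on the last c7s1-3 layer, and the output-layer details follow common re-implementations and my reading of the text; they are assumptions rather than a verified copy of the released code.

```python
import torch.nn as nn

def c7s1(in_ch, k, final=False):
    """c7s1-k: 7x7 Convolution-InstanceNorm-ReLU, stride 1, reflection padding.
    The final c7s1-3 layer is assumed to end in tanh instead of InstanceNorm-ReLU."""
    layers = [nn.ReflectionPad2d(3), nn.Conv2d(in_ch, k, kernel_size=7)]
    layers += [nn.Tanh()] if final else [nn.InstanceNorm2d(k), nn.ReLU(True)]
    return layers

def d(in_ch, k):
    """dk: 3x3 Convolution-InstanceNorm-ReLU with k filters and stride 2."""
    return [nn.Conv2d(in_ch, k, 3, stride=2, padding=1),
            nn.InstanceNorm2d(k), nn.ReLU(True)]

class R(nn.Module):
    """Rk: residual block with two 3x3 conv layers of k filters each."""
    def __init__(self, k):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(k, k, 3), nn.InstanceNorm2d(k), nn.ReLU(True),
            nn.ReflectionPad2d(1), nn.Conv2d(k, k, 3), nn.InstanceNorm2d(k))
    def forward(self, x):
        return x + self.block(x)

def u(in_ch, k):
    """uk: 3x3 fractional-strided (stride 1/2) Convolution-InstanceNorm-ReLU."""
    return [nn.ConvTranspose2d(in_ch, k, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(k), nn.ReLU(True)]

# Generator with 9 residual blocks:
# c7s1-64, d128, d256, R256 x 9, u128, u64, c7s1-3
generator = nn.Sequential(
    *c7s1(3, 64), *d(64, 128), *d(128, 256),
    *[R(256) for _ in range(9)],
    *u(256, 128), *u(128, 64), *c7s1(64, 3, final=True))

def C(in_ch, k, norm=True):
    """Ck: 4x4 Convolution-InstanceNorm-LeakyReLU(0.2), stride 2.
    InstanceNorm is omitted for the first C64 layer, per the text."""
    layers = [nn.Conv2d(in_ch, k, 4, stride=2, padding=1)]
    if norm:
        layers += [nn.InstanceNorm2d(k)]
    return layers + [nn.LeakyReLU(0.2, True)]

# 70x70 PatchGAN discriminator: C64-C128-C256-C512, then a 1-channel conv.
discriminator = nn.Sequential(
    *C(3, 64, norm=False), *C(64, 128), *C(128, 256), *C(256, 512),
    nn.Conv2d(512, 1, 4, stride=1, padding=1))
```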