
Conditional Image-to-Image Translation

Jianxin Lin¹  Yingce Xia¹  Tao Qin²  Zhibo Chen¹  Tie-Yan Liu²
¹University of Science and Technology of China  ²Microsoft Research Asia
[email protected]  [email protected]  {taoqin, tie-yan.liu}@microsoft.com  [email protected]
arXiv:1805.00251v1 [cs.CV] 1 May 2018

Abstract

Image-to-image translation tasks have been widely investigated with Generative Adversarial Networks (GANs) and dual learning. However, existing models lack the ability to control the translated results in the target domain, and their results usually lack diversity, in the sense that a fixed input image usually leads to an (almost) deterministic translation result. In this paper, we study a new problem, conditional image-to-image translation, which is to translate an image from the source domain to the target domain conditioned on a given image in the target domain. It requires that the generated image inherit some domain-specific features of the conditional image from the target domain. Therefore, changing the conditional image in the target domain leads to diverse translation results for a fixed input image from the source domain, and the conditional input image thus helps to control the translation results. We tackle this problem with unpaired data based on GANs and dual learning. We twist two conditional translation models (one from domain A to domain B, and the other from domain B to domain A) together for input combination and reconstruction while preserving domain-independent features. We carry out experiments on men's-to-women's face translation and edges-to-shoes&bags translations. The results demonstrate the effectiveness of our proposed method.

1. Introduction

Image-to-image translation covers a large variety of computer vision problems, including image stylization [4], segmentation [13] and saliency detection [5]. It aims at learning a mapping that converts an image from a source domain to a target domain while preserving the main presentation of the input image. For example, in the aforementioned three tasks, an input image might be converted to a portrait in Van Gogh's style, a heat map split into different regions, or a pencil sketch, while the edges and outlines remain unchanged. Since it is usually hard to collect a large amount of parallel data for such tasks, unsupervised learning algorithms have been widely adopted. In particular, generative adversarial networks (GANs) [6] and dual learning [7, 21] have been extensively studied for image-to-image translation. [22, 9, 25] tackle image-to-image translation with these two techniques, where the GANs ensure that the generated images belong to the target domain, and dual learning helps improve image quality by minimizing a reconstruction loss.

An implicit assumption of image-to-image translation is that an image contains two kinds of features¹: domain-independent features, which are preserved during the translation (e.g., the edges of the face, eyes, nose and mouth when translating a man's face to a woman's face), and domain-specific features, which are changed during the translation (e.g., the color and style of the hair in face image translation). Image-to-image translation aims at transferring images from the source domain to the target domain by preserving the domain-independent features while replacing the domain-specific features.

While it is not difficult for existing image-to-image translation methods to convert an image from a source domain to a target domain, it is not easy for them to control or manipulate, in fine granularity, the style of the generated image in the target domain. Consider the gender transform problem studied in [9], which is to translate a man's photo to a woman's. Can we translate Hillary's photo to a man's photo with the hair style and color of Trump? DiscoGAN [9] can indeed output a woman's photo given a man's photo as input, but it cannot control the hair style or color of the output image. DualGAN [22] and CycleGAN [25] cannot implement this kind of fine-granularity control either. To fill this gap in image translation, we propose the concept of conditional image-to-image translation, which can specify the domain-specific features in the target domain, carried by another input image from the target domain. An example of conditional image-to-image translation is shown in Figure 1, in which we want to convert Hillary's photo to a man's photo. As shown in the figure, with an additional man's photo as input, we can control the translated image (e.g., the hair color and style).

¹ Note that the two kinds of features are relative concepts, and domain-specific features in one task might be domain-independent features in another task, depending on which domains one focuses on in the task.
Figure 1. Conditional image-to-image translation. (a) Conditional women-to-men photo translation. (b) Conditional edges-to-handbags translation. The purple arrow represents the translation flow and the green arrow represents the conditional information flow.

1.1. Problem Setup

We first define some notations. Suppose there are two image domains D_A and D_B. Following the implicit assumption above, an image x_A ∈ D_A can be represented as x_A = x^i_A ⊕ x^s_A, where x^i_A denotes the domain-independent features, x^s_A denotes the domain-specific features, and ⊕ is the operator that merges the two kinds of features into a complete image. Similarly, for an image x_B ∈ D_B, we have x_B = x^i_B ⊕ x^s_B. Take the images in Figure 1 as examples: (1) If the two domains are men's and women's photos, the domain-independent features are the individual facial organs like eyes and mouths, and the domain-specific features are beard and hair style. (2) If the two domains are real bags and the edges of bags, the domain-independent features are exactly the edges of the bags themselves, and the domain-specific features are the colors and textures.

The problem of conditional image-to-image translation from domain D_A to D_B is as follows: taking an image x_A ∈ D_A as input and an image x_B ∈ D_B as conditional input, output an image x_AB in domain D_B that keeps the domain-independent features of x_A and combines them with the domain-specific features carried in x_B, i.e.,

x_AB = G_A→B(x_A, x_B) = x^i_A ⊕ x^s_B,    (1)

where G_A→B denotes the translation function. Similarly, we have the reverse conditional translation

x_BA = G_B→A(x_B, x_A) = x^i_B ⊕ x^s_A.    (2)

For simplicity, we call G_A→B the forward translation and G_B→A the reverse translation. In this work we study how to learn these two translations.

1.2. Our Results

There are three main challenges in solving the conditional image translation problem. The first is how to extract the domain-independent and domain-specific features for a given image. The second is how to merge the features from two different domains into a natural image in the target domain. The third is that there is no parallel data from which to learn such mappings.

To tackle these challenges, we propose the conditional dual-GAN (briefly, cd-GAN), which can leverage the strengths of both GANs and dual learning. Under this framework, the mappings of the two directions, G_A→B and G_B→A, are jointly learned. The cd-GAN model follows an encoder-decoder framework: the encoder is used to extract the domain-independent and domain-specific features, and the decoder is used to merge the two kinds of features to generate images. We chose GANs and dual learning due to the following considerations: (1) The dual learning framework can help learn to extract and merge the domain-specific and domain-independent features by minimizing carefully designed reconstruction errors, including reconstruction errors of the whole image, of the domain-independent features, and of the domain-specific features. (2) GANs can ensure that the generated images well mimic the natural images in the target domain. (3) Both dual learning [7, 22, 25] and GANs [6, 19, 1] work well under unsupervised settings.

We carry out experiments on different tasks, including face-to-face translation, edges-to-shoes translation, and edges-to-handbags translation. The results demonstrate that our network can effectively translate images with conditional information and is robust across applications.

Our main contributions are twofold: (1) We define a new problem, conditional image-to-image translation, which is a more general framework than conventional image translation. (2) We propose the cd-GAN algorithm to solve the problem in an end-to-end way.

The remaining parts are organized as follows. We introduce related work in Section 2 and present the details of cd-GAN in Section 3, including the network architecture and the training algorithm. We then report experimental results in Section 4 and conclude in Section 5.

2. Related Work

Image generation has been widely explored in recent years. Models based on the variational autoencoder (VAE) [11] aim to improve the quality and efficiency of image generation by learning an inference network. GANs [6] were first proposed to generate images from random variables via a two-player minimax game. Researchers have since explored the capability of GANs for various image generation tasks.
[1] proposed to synthesize images at multiple resolutions with a Laplacian pyramid of adversarial generators and discriminators, which can condition on class labels for controllable generation. [19] introduced a class of deep convolutional generative adversarial networks (DCGANs) for high-quality image generation and unsupervised image classification tasks.

Instead of learning to generate image samples from scratch (i.e., from random vectors), the basic idea of image-to-image translation is to learn a parametric translation function that transforms an input image in a source domain to an image in a target domain. [13] proposed a fully convolutional network (FCN) for image-to-segmentation translation. Pix2pix [8] extended the basic FCN framework to other image-to-image translation tasks, including label-to-street-scene and aerial-to-map, and utilized adversarial training to ensure high-level domain similarity of the translation results.

The image-to-image models mentioned above require paired training data between the source and target domains. There is another line of work studying unpaired domain translation. Based on adversarial training, [3] and [2] proposed algorithms to jointly learn to map a latent space to the data space and project the data space back to the latent space. [20] presented a domain transfer network (DTN) for unsupervised cross-domain image generation employing a compound loss function, including a multiclass adversarial loss and an f-constancy component, which could generate convincing novel images of previously unseen entities and preserve their identity. [7] developed a dual learning mechanism which enables a neural machine translation system to automatically learn from unlabeled data through a dual learning game. Following the idea of dual learning, DualGAN [22], DiscoGAN [9] and CycleGAN [25] were proposed to tackle the unpaired image translation problem by training two cross-domain transfer GANs at the same time. [15] proposed to utilize dual learning for semantic image segmentation. [14] further proposed a conditional CycleGAN for face super-resolution by adding facial attributes obtained from human annotation. However, collecting a large amount of such human-annotated data can be hard and expensive.

In this work, we study a new setting of image-to-image translation, in which we hope to control the generated images in fine granularity with unpaired data. We call this new problem conditional image-to-image translation.

3. Conditional Dual GAN

Figure 2 shows the overall architecture of the proposed model, in which the left part is an encoder-decoder based framework for image translation and the right part includes additional components introduced to train the encoders and decoders.

Figure 2. Architecture of the proposed conditional dual GAN (cd-GAN).

3.1. The Encoder-Decoder Framework

As shown in the figure, there are two encoders e_A and e_B and two decoders g_A and g_B.

The encoders serve as feature extractors, which take an image as input and output the two kinds of features, domain-independent features and domain-specific features, through the corresponding modules in the encoders. In particular, given two images x_A and x_B, we have

(x^i_A, x^s_A) = e_A(x_A);  (x^i_B, x^s_B) = e_B(x_B).    (3)

If we only look at the encoder, there is no difference between the two kinds of features. It is the remaining parts of the overall model and the training process that differentiate the two kinds of features. More details are discussed in Section 3.3.

The decoders serve as generators, which take as inputs the domain-independent features from the image in the source domain and the domain-specific features from the image in the target domain, and output a generated image in the target domain. That is,

x_AB = g_B(x^i_A, x^s_B);  x_BA = g_A(x^i_B, x^s_A).    (4)
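In code, the forward translation of Eqn.(3,4) is just two encode calls followed by a cross-wired decode. The following is a minimal sketch, assuming encoder modules that return an (x^i, x^s) pair and decoder modules that take such a pair; these interfaces are our own assumption, and a concrete architecture is only sketched later in Section 4.1.

```python
def conditional_translate(e_A, e_B, g_A, g_B, x_a, x_b):
    """Eqn.(3): extract both feature types; Eqn.(4): swap the domain-specific parts."""
    xi_a, xs_a = e_A(x_a)      # domain-independent / domain-specific features of x_A
    xi_b, xs_b = e_B(x_b)
    x_ab = g_B(xi_a, xs_b)     # x_A's content rendered with x_B's style
    x_ba = g_A(xi_b, xs_a)
    return x_ab, x_ba, (xi_a, xs_a, xi_b, xs_b)
```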
3.2. Training Algorithm

We leverage dual learning techniques and GAN techniques to train the encoders and decoders. The optimization process is shown in the right part of Figure 2.

3.2.1 GAN loss

To ensure that the generated x_AB and x_BA are in the corresponding domains, we employ two discriminators d_A and d_B to differentiate real images from synthetic ones. d_A (or d_B) takes an image as input and outputs a probability indicating how likely the input is a natural image from domain D_A (or D_B). The objective function is

ℓ_GAN = log(d_A(x_A)) + log(1 − d_A(x_BA)) + log(d_B(x_B)) + log(1 − d_B(x_AB)).    (5)

The goal of the encoders and decoders e_A, e_B, g_A, g_B is to generate images as similar to natural images as possible and fool the discriminators d_A and d_B, i.e., they try to minimize ℓ_GAN. The goal of d_A and d_B is to differentiate generated images from natural images, i.e., they try to maximize ℓ_GAN.
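A direct transcription of Eqn.(5) in PyTorch could look as follows. It assumes d_A and d_B are callables that map an image batch to probabilities in (0, 1); this is an illustrative sketch, not the authors' released code.

```python
import torch

def gan_loss(d_A, d_B, x_a, x_b, x_ba, x_ab, eps=1e-8):
    """ℓ_GAN of Eqn.(5): real images should score high, translated ones low."""
    return (torch.log(d_A(x_a) + eps) + torch.log(1.0 - d_A(x_ba) + eps) +
            torch.log(d_B(x_b) + eps) + torch.log(1.0 - d_B(x_ab) + eps)).mean()

# Discriminators try to maximize ℓ_GAN, so they minimize its negation; the
# translated images are detached so only d_A and d_B receive gradients here:
#   loss_d = -gan_loss(d_A, d_B, x_a, x_b, x_ba.detach(), x_ab.detach())
# Encoders/decoders try to minimize ℓ_GAN directly:
#   loss_g = gan_loss(d_A, d_B, x_a, x_b, x_ba, x_ab)
```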
3.2.2 Dual learning loss

The key idea of dual learning is to improve the performance of a model by minimizing the reconstruction error.

To reconstruct the two images x̂_A and x̂_B, as shown in Figure 2, we first extract the two kinds of features of the generated images:

(x̂^i_A, x̂^s_B) = e_B(x_AB);  (x̂^i_B, x̂^s_A) = e_A(x_BA),    (6)

and then reconstruct the images as follows:

x̂_A = g_A(x̂^i_A, x^s_A);  x̂_B = g_B(x̂^i_B, x^s_B).    (7)

We evaluate the reconstruction quality from three aspects: the image-level reconstruction error ℓ^im_dual, the reconstruction error ℓ^di_dual of the domain-independent features, and the reconstruction error ℓ^ds_dual of the domain-specific features, as follows:

ℓ^im_dual(x_A, x_B) = ||x_A − x̂_A||² + ||x_B − x̂_B||²,    (8)

ℓ^di_dual(x_A, x_B) = ||x^i_A − x̂^i_A||² + ||x^i_B − x̂^i_B||²,    (9)

ℓ^ds_dual(x_A, x_B) = ||x^s_A − x̂^s_A||² + ||x^s_B − x̂^s_B||².    (10)

Compared with existing dual learning approaches [22], which only consider the image-level reconstruction error, our method considers more aspects and is therefore expected to achieve better accuracy.
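The three reconstruction terms can be written compactly. Below is a minimal sketch under the same assumed encoder/decoder interfaces as before (not the paper's implementation); the squared L2 norms are expressed as mean-squared errors.

```python
import torch.nn.functional as F

def dual_losses(e_A, e_B, g_A, g_B, x_a, x_b, x_ab, x_ba,
                xi_a, xs_a, xi_b, xs_b):
    """Reconstruction losses of Eqn.(6)-(10). The xi_*/xs_* tensors are the
    encoder outputs of Eqn.(3) for the original images."""
    # Eqn.(6): re-encode the translated images.
    xi_a_hat, xs_b_hat = e_B(x_ab)
    xi_b_hat, xs_a_hat = e_A(x_ba)
    # Eqn.(7): reconstruct the originals from re-encoded content plus the
    # *original* domain-specific features.
    x_a_hat = g_A(xi_a_hat, xs_a)
    x_b_hat = g_B(xi_b_hat, xs_b)
    # Eqn.(8)-(10): image-level, domain-independent, domain-specific errors.
    l_im = F.mse_loss(x_a_hat, x_a) + F.mse_loss(x_b_hat, x_b)
    l_di = F.mse_loss(xi_a_hat, xi_a) + F.mse_loss(xi_b_hat, xi_b)
    l_ds = F.mse_loss(xs_a_hat, xs_a) + F.mse_loss(xs_b_hat, xs_b)
    return l_im, l_di, l_ds
```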
3.2.3 Overall training process

Since the discriminators only impact the GAN loss ℓ_GAN, we use only this loss to compute the gradients and update d_A and d_B. In contrast, the encoders and decoders impact all 4 losses (i.e., the GAN loss and the three reconstruction errors), so we use all 4 objectives to compute gradients and update their models. Note that since the 4 objectives are of different magnitudes, their gradients may vary a lot in magnitude. To smooth the training process, we normalize the gradients so that their magnitudes are comparable across the 4 losses. We summarize the training process in Algorithm 1.

Algorithm 1 cd-GAN training process
Require: Training images {x_A,i}_{i=1}^m ⊂ D_A, {x_B,j}_{j=1}^m ⊂ D_B, batch size K, optimizer Opt(·,·)
1: Randomly initialize e_A, e_B, g_A, g_B, d_A and d_B.
2: Randomly sample a minibatch of images and prepare the data pairs S = {(x_A,k, x_B,k)}_{k=1}^K.
3: For each data pair (x_A,k, x_B,k) ∈ S, generate conditional translations by Eqn.(3,4) and reconstruct the images by Eqn.(6,7).
4: Update the discriminators as follows:
   d_A ← Opt(d_A, (1/K) ∇_{d_A} Σ_{k=1}^K ℓ_GAN(x_A,k, x_B,k)),
   d_B ← Opt(d_B, (1/K) ∇_{d_B} Σ_{k=1}^K ℓ_GAN(x_A,k, x_B,k)).
5: For each Θ ∈ {e_A, e_B, g_A, g_B}, compute the gradients
   Δ_GAN = (1/K) ∇_Θ Σ_{k=1}^K ℓ_GAN(x_A,k, x_B,k),
   Δ_im = (1/K) ∇_Θ Σ_{k=1}^K ℓ^im_dual(x_A,k, x_B,k),
   Δ_di = (1/K) ∇_Θ Σ_{k=1}^K ℓ^di_dual(x_A,k, x_B,k),
   Δ_ds = (1/K) ∇_Θ Σ_{k=1}^K ℓ^ds_dual(x_A,k, x_B,k);
   normalize the four gradients so that their magnitudes are comparable, sum them to obtain Δ, and update Θ ← Opt(Θ, Δ).
6: Repeat steps 2 to 5 until convergence.

In Algorithm 1, the choice of optimizer Opt(·,·) is quite flexible; its two inputs are the parameters to be optimized and the corresponding gradients. One can choose different optimizers (e.g., Adam [10] or Nesterov gradient descent [18]) for different tasks, depending on common practice for the specific task and personal preference. Besides, e_A, e_B, g_A, g_B, d_A, d_B may refer to either the models themselves or their parameters, depending on the context.
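A minimal sketch of step 5, computing the four gradient sets separately, rescaling them to comparable magnitude, summing them, and taking one optimizer step, might look as follows. The per-loss backward passes and the rescale-to-a-reference-norm rule are our assumptions about one reasonable realization; the paper only states that the gradients are normalized to comparable magnitudes.

```python
import torch

def normalized_update(losses, params, optimizer, eps=1e-12):
    """losses: [l_gan, l_im, l_di, l_ds]; params: encoder/decoder parameters only.
    Computes each loss's gradients separately, rescales every gradient set to the
    magnitude of the first one, sums them, and takes an optimizer step."""
    per_loss_grads = []
    for loss in losses:
        grads = torch.autograd.grad(loss, params, retain_graph=True,
                                    allow_unused=True)
        grads = [g if g is not None else torch.zeros_like(p)
                 for g, p in zip(grads, params)]
        per_loss_grads.append(grads)

    # Global L2 norm of each gradient set; rescale to match the GAN gradient.
    norms = [torch.sqrt(sum(g.pow(2).sum() for g in grads)) + eps
             for grads in per_loss_grads]
    ref = norms[0]
    summed = [sum((ref / n) * grads[i] for grads, n in zip(per_loss_grads, norms))
              for i in range(len(params))]

    optimizer.zero_grad()
    for p, g in zip(params, summed):
        p.grad = g
    optimizer.step()

# Example wiring (hypothetical names):
#   params = [p for m in (e_A, e_B, g_A, g_B) for p in m.parameters()]
#   opt_g = torch.optim.Adam(params, lr=2e-4)
#   normalized_update([l_gan, l_im, l_di, l_ds], params, opt_g)
```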
3.3. Discussions

Our proposed framework can learn to separate the domain-independent features from the domain-specific features. In Figure 2, consider the path x_A → e_A → x^i_A → g_B → x_AB. Note that after training we ensure that x_AB is an image in domain D_B and that the features x^i_A are still preserved in x_AB. Thus, x^i_A should inherit the features that are independent of domain D_A. Given that x^i_A is domain-independent, it is x^s_B that carries the information about domain D_B; thus, x^s_B holds domain-specific features. Similarly, we can see that x^s_A is domain-specific and x^i_B is domain-independent.

DualGAN [22], DiscoGAN [9] and CycleGAN [25] can be treated as simplified versions of our cd-GAN, obtained by removing the domain-specific features. For example, in CycleGAN, given an x_A ∈ D_A, any x_AB ∈ D_B is a legal translation, no matter what x_B ∈ D_B is. In our work, we require that the generated images match the inputs from both domains, which is more difficult.

Furthermore, cd-GAN works for both symmetric translations and asymmetric translations. In symmetric translations, both directions of translation need conditional inputs (illustrated in Figure 1(a)). In asymmetric translations, only one direction of translation needs a conditional image as input (illustrated in Figure 1(b)). That is, the translation from bag to edge does not need another edge image as input; even given an additional edge image as the conditional input, it does not change or help to control the translation result.

For asymmetric translations, we only need to slightly modify the objectives for cd-GAN training. Suppose the translation direction G_B→A does not need conditional input. Then we do not need to reconstruct the domain-specific features x^s_A. Accordingly, we modify the error of the domain-specific features as follows, while the other 3 losses do not change:

ℓ^ds_dual(x_A, x_B) = ||x^s_B − x̂^s_B||².    (11)
4. Experiments noted as domain DA ) and 118165 women’s images (de-
noted as domain DB ). We randomly choose 4732 men’s
We conduct a set of experiments to test the proposed images and 6379 women’s images for testing, and use the
model. We first describe experimental settings, and then re- rest for training. In this task, the domain-independent fea-
port results for both symmetric translations and asymmetric tures are organs (e.g., eyes, nose, mouse) and domain-
translations. Finally we study individual components and specific features refer to hair-style, beard, the usage of lip-
loss functions of the proposed model. stick. For asymmetric translations, we work on edges-to-
shoes and edges-to-bags translations with datasets used in
4.1. Settings
[23] and [24] respectively. In these two tasks, the domain-
For all experiments, the networks take images of 64 × 64 independent features are edges and domain-specific features
resolution as inputs. The encoders eA and eB start with 3 are colors, textures, etc.
Figure 5. Results of conditional edges→shoes translation.

4.2. Results

The translation results for face-to-face, edges-to-handbags and edges-to-shoes are shown in Figures 3-5, respectively.

Figure 3. Conditional face-to-face translation. (a) Results of conditional men→women translation. (b) Results of conditional women→men translation.

For men-to-women translations, from Figure 3(a) we have several observations. (1) DualGAN can indeed generate women's photos, but its results are based purely on the men's photos, since it does not take the conditional images as inputs. (2) Although it takes the conditional image as input, DualGAN-c fails to integrate the information (e.g., style) from the conditional input into its translation output. (3) For GAN-c, sometimes its translation result is not relevant to the original source-domain input, e.g., the 4-th row of Figure 3(a). This is because in training it is required to generate a target-domain image, but its output is not required to be similar (in certain aspects) to the original input. (4) cd-GAN works best among all the models, preserving the domain-independent features of the source-domain input and combining them with the domain-specific features of the target-domain conditional input. Here are two examples: (1) in the 6-th column of the 1-st row, the woman is wearing red lipstick; (2) in the 6-th column of the 5-th row, the hair style of the generated image is the most similar to the conditional input.

We get similar observations for women-to-men translations, as shown in Figure 3(b), especially for domain-specific features such as hair style and beard.

Figure 4. Results of conditional edges→handbags translation.

From Figures 4 and 5, we find that cd-GAN can well leverage the domain-specific information carried in the conditional inputs and control the generated target-domain images accordingly. DualGAN, DualGAN-c and GAN-c do not effectively utilize the conditional inputs.

One important characteristic of a conditional image-to-image translation model is that it can generate diverse target-domain images for a fixed source-domain image, provided that different target-domain images are given as conditional inputs.
To verify this ability of cd-GAN, we conduct two experiments: (1) for each woman's photo, we run women-to-men translation with different men's photos as conditional inputs; (2) for each edge image of a bag, we run edges-to-bags translation with different bags as conditional inputs. The results are shown in Figure 6. Figure 6(b) shows that cd-GAN can fill the edges with the colors and textures provided by the conditional inputs. Besides, cd-GAN also achieves reasonable improvements on most face translations: the domain-independent features such as the woman's facial outline, orientation and expression are preserved, while the women-specific features such as hair style and the use of lipstick are replaced with the men's. An example is the second row of Figure 6(a), where the pointed chins, serious expressions and forward gazes are preserved in the generated images, while the hairstyles (bald vs. short hair) and the beards (no beard vs. short beard) reflect the corresponding men's images. Similar translations of the other images can also be found. Note that there are several failure cases in face translation, such as the first column of Figure 6(a) and the last column of Figure 6(b). Most translated results demonstrate the effectiveness of our model. More examples can be found in our supplementary document.

Figure 6. Our cd-GAN model can produce diverse results with different conditional images. (a) Results of women→men translation with two different men's images as conditional inputs. (b) Results of edges→handbags translation with two different handbags as conditional inputs.

4.3. Component Study

In this subsection, we study other possible design choices for the model architecture in Figure 2 and the losses used in training. We compare cd-GAN with four other models, as follows:

• cd-GAN-rec. The inputs are reconstructed as

x̂_A = g_A(x̂^i_A, x̂^s_A);  x̂_B = g_B(x̂^i_B, x̂^s_B)    (12)

instead of Eqn.(7). That is, the connection from x^s_A to g_A in the right box of Figure 2 is replaced by the connection from x̂^s_A to g_A, and the connection from x^s_B to g_B in the right box of Figure 2 is replaced by the connection from x̂^s_B to g_B.

• cd-GAN-nof. Both the domain-specific and the domain-independent feature reconstruction losses, i.e., Eqn.(10) and Eqn.(9), are removed from the dual learning losses.

• cd-GAN-nos. The domain-specific feature reconstruction loss, i.e., Eqn.(10), is removed from the dual learning losses.

• cd-GAN-noi. The domain-independent feature reconstruction loss, i.e., Eqn.(9), is removed from the dual learning losses.

Figure 7. Results produced by different connections and losses of cd-GANs.

The comparison experiments are conducted on the edges-to-handbags task. The results are shown in Figure 7. Our cd-GAN outperforms the other four candidate models with better color schemes. The failure of cd-GAN-rec demonstrates the necessity of the "skip connections" (i.e., the connections from x^s_A to g_A and from x^s_B to g_B) for image reconstruction. Since the domain-specific feature level and image level reconstruction losses already implicitly constrain the domain-specific features to some extent, the results produced by cd-GAN-noi are the closest to those of cd-GAN among the four candidate models.
So far, we have shown the translation results of cd-GAN generated from the combination of domain-specific and domain-independent features. One may be interested in what the two kinds of features actually learn. Here we try to understand them by generating translation results using each kind of feature separately:

• We generate an image using the domain-specific features only:

x^{A=0}_{AB} = g_B(x^i_A = 0, x^s_B),

in which we set the domain-independent features to 0.

• We generate an image using the domain-independent features only:

x^{B=0}_{AB} = g_B(x^i_A, x^s_B = 0),

in which we set the domain-specific features to 0.

Figure 8. Images generated using only domain-independent features or domain-specific features.

The results are shown in Figure 8. As we can see, the image x^{A=0}_{AB} has a style similar to x_B, which indicates that our cd-GAN can indeed extract domain-specific features. While x^{B=0}_{AB} loses the conditional information of x_B, it still preserves the main shape of x_A, which demonstrates that cd-GAN indeed extracts domain-independent features.
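A minimal sketch of this probe, again under the assumed encoder/decoder interfaces from the earlier sketches, zeroes one feature tensor at a time:

```python
import torch

def probe_features(e_A, e_B, g_B, x_a, x_b):
    """Generate x^{A=0}_{AB} (style only) and x^{B=0}_{AB} (content only)."""
    xi_a, _ = e_A(x_a)
    _, xs_b = e_B(x_b)
    style_only = g_B(torch.zeros_like(xi_a), xs_b)    # domain-independent features zeroed
    content_only = g_B(xi_a, torch.zeros_like(xs_b))  # domain-specific features zeroed
    return style_only, content_only
```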
4.4. User Study

We conducted a user study to compare the similarity of domain-specific features between generated images and conditional images. In total, 17 subjects (10 male, 7 female, age range 20-35) from different backgrounds were asked to compare 32 sets of images. We showed the subjects the source image, the conditional image, our result, and the results from the other methods. Each subject then selected the generated image most similar to the conditional image. The result of the user study shows that our model clearly outperforms the other methods.

Figure 9. The result of the user study.

5. Conclusions and Future Work

In this paper, we have studied the problem of conditional image-to-image translation, in which we translate an image from a source domain to a target domain conditioned on another target-domain image given as input. We have proposed a new model based on GANs and dual learning. The model can well leverage the conditional inputs to control and diversify the translation results. Experiments on two settings (symmetric translations and asymmetric translations) and three tasks (face-to-face, edges-to-shoes and edges-to-handbags translations) have demonstrated the effectiveness of the proposed model.

There are multiple aspects to explore for conditional image translation. First, we will apply the proposed model to more image translation tasks. Second, it is interesting to design better models for this translation problem. Third, the problem of conditional translation may be extended to other applications, such as conditional video translation and conditional text translation.

6. Acknowledgement

This work was supported in part by the National Key Research and Development Program of China under Grant No. 2016YFC0801001, NSFC under Grants 61571413, 61632001, 61390514, and Intel ICRI MNC.
References

[1] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pages 1486–1494, 2015.
[2] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[3] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
[4] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
[5] S. Goferman, L. Zelnik-Manor, and A. Tal. Context-aware saliency detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(10):1915–1926, 2012.
[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[7] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W.-Y. Ma. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828, 2016.
[8] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
[9] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
[10] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[11] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[12] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[13] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[14] Y. Lu, Y.-W. Tai, and C.-K. Tang. Conditional CycleGAN for attribute guided face image generation. arXiv preprint arXiv:1705.09966, 2017.
[15] P. Luo, G. Wang, L. Lin, and X. Wang. Deep dual learning for semantic image segmentation. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[16] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, 2013.
[17] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proc. ICML, pages 807–814, 2010.
[18] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.
[19] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[20] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200, 2016.
[21] Y. Xia, T. Qin, W. Chen, J. Bian, N. Yu, and T.-Y. Liu. Dual supervised learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3789–3798. PMLR, 2017.
[22] Z. Yi, H. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[23] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 192–199, 2014.
[24] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
[25] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
