Conditional Image-to-Image Translation
Jianxin Lin^1, Yingce Xia^1, Tao Qin^2, Zhibo Chen^1, Tie-Yan Liu^2
^1 University of Science and Technology of China    ^2 Microsoft Research Asia
[email protected], [email protected], {taoqin, tie-yan.liu}@microsoft.com, [email protected]
Abstract

Image-to-image translation tasks have been widely investigated with Generative Adversarial Networks (GANs) and dual learning. However, existing models lack the ability to control the translated results in the target domain, and their results usually lack diversity, in the sense that a fixed input image leads to an (almost) deterministic translation result. In this paper, we study a new problem, conditional image-to-image translation: translating an image from the source domain to the target domain conditioned on a given image in the target domain. The generated image is required to inherit some domain-specific features of the conditional image from the target domain. Therefore, changing the conditional image in the target domain leads to diverse translation results for a fixed input image from the source domain, and the conditional input image thus helps to control the translation results. We tackle this problem with unpaired data based on GANs and dual learning. We twist two conditional translation models (one translating from domain A to domain B, the other from domain B to domain A) together for input combination and reconstruction while preserving domain-independent features. We carry out experiments on translation between men's and women's faces and on edges-to-shoes and edges-to-handbags translation. The results demonstrate the effectiveness of our proposed method.

1. Introduction

Image-to-image translation covers a large variety of computer vision problems, including image stylization [4], segmentation [13] and saliency detection [5]. It aims at learning a mapping that converts an image from a source domain to a target domain while preserving the main presentation of the input image. For example, in the aforementioned three tasks, an input image might be converted to a portrait in Van Gogh's style, a heat map split into different regions, or a pencil sketch, while the edges and outlines remain unchanged. Since it is usually hard to collect a large amount of parallel data for such tasks, unsupervised learning algorithms have been widely adopted. In particular, generative adversarial networks (GANs) [6] and dual learning [7, 21] are extensively studied in image-to-image translation. [22, 9, 25] tackle image-to-image translation with these two techniques, where the GANs ensure that the generated images belong to the target domain, and dual learning helps improve image quality by minimizing reconstruction loss.

An implicit assumption of image-to-image translation is that an image contains two kinds of features^1: domain-independent features, which are preserved during the translation (e.g., the edges of the face, eyes, nose and mouth when translating a man's face to a woman's face), and domain-specific features, which are changed during the translation (e.g., the color and style of the hair in face image translation). Image-to-image translation aims at transferring images from the source domain to the target domain by preserving domain-independent features while replacing domain-specific features.

While it is not difficult for existing image-to-image translation methods to convert an image from a source domain to a target domain, it is not easy for them to control or manipulate, at a fine granularity, the style of the generated image in the target domain. Consider the gender transformation problem studied in [9], which is to translate a man's photo to a woman's. Can we translate Hillary's photo to a man's photo with the hair style and color of Trump? DiscoGAN [9] can indeed output a woman's photo given a man's photo as input, but it cannot control the hair style or color of the output image. DualGAN [22, 25] cannot implement this kind of fine-granularity control either. To fill this gap in image translation, we propose the concept of conditional image-to-image translation, which can specify domain-specific features in the target domain, carried by another input image from the target domain. An example of conditional image-to-image translation is shown in Figure 1, in which we want to convert Hillary's photo to a man's photo. As shown in the figure, with an additional man's photo as input, we can control the translated image (e.g., the hair color and style).

^1 Note that the two kinds of features are relative concepts; domain-specific features in one task might be domain-independent features in another task, depending on which domains one focuses on in the task.
Figure 1. Conditional image-to-image translation. (a) Conditional women-to-men photo translation. (b) Conditional edges-to-handbags translation. The purple arrow represents the translation flow and the green arrow represents the conditional information flow.

1.1. Problem Setup

We first define some notation. Suppose there are two image domains $D_A$ and $D_B$. Following the implicit assumption, an image $x_A \in D_A$ can be represented as $x_A = x_A^i \oplus x_A^s$, where $x_A^i$ denotes the domain-independent features, $x_A^s$ denotes the domain-specific features, and $\oplus$ is the operator that merges the two kinds of features into a complete image. Similarly, for an image $x_B \in D_B$, we have $x_B = x_B^i \oplus x_B^s$. Take the images in Figure 1 as examples: (1) If the two domains are men's and women's photos, the domain-independent features are the individual facial organs such as the eyes and mouth, and the domain-specific features are the beard and hair style. (2) If the two domains are real bags and the edges of bags, the domain-independent features are exactly the edges of the bags themselves, and the domain-specific features are the colors and textures.

The problem of conditional image-to-image translation from domain $D_A$ to $D_B$ is as follows: given an image $x_A \in D_A$ as input and an image $x_B \in D_B$ as conditional input, output an image $x_{AB}$ in domain $D_B$ that keeps the domain-independent features of $x_A$ and combines them with the domain-specific features carried in $x_B$, i.e.,

$$x_{AB} = G_{A \to B}(x_A, x_B) = x_A^i \oplus x_B^s.$$

1.2. Our Results

There are three main challenges in solving the conditional image translation problem. The first is how to extract the domain-independent and domain-specific features of a given image. The second is how to merge the features from two different domains into a natural image in the target domain. The third is that there is no parallel data from which to learn such mappings.

To tackle these challenges, we propose the conditional dual-GAN (briefly, cd-GAN), which leverages the strengths of both GANs and dual learning. Under this framework, the mappings of the two directions, $G_{A \to B}$ and $G_{B \to A}$, are jointly learned. The model of cd-GAN follows the encoder-decoder framework: the encoder extracts the domain-independent and domain-specific features, and the decoder merges the two kinds of features to generate images. We chose GANs and dual learning for the following reasons: (1) The dual learning framework can help learn to extract and merge the domain-specific and domain-independent features by minimizing carefully designed reconstruction errors, including reconstruction errors of the whole image, the domain-independent features, and the domain-specific features. (2) GANs can ensure that the generated images mimic natural images in the target domain well. (3) Both dual learning [7, 22, 25] and GANs [6, 19, 1] work well in unsupervised settings.

We carry out experiments on different tasks, including face-to-face translation, edge-to-shoe translation, and edge-to-handbag translation. The results demonstrate that our network can effectively translate images with conditional information and is robust across applications.

Our main contributions are twofold: (1) We define a new problem, conditional image-to-image translation, which is a more general framework than conventional image translation. (2) We propose the cd-GAN algorithm to solve the problem in an end-to-end way.

The remaining parts are organized as follows. We introduce related work in Section 2 and present the details of cd-GAN in Section 3, including the network architecture and the training algorithm. We then report experimental results in Section 4 and conclude in Section 5.
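To make the problem setup and the encoder-decoder interface of cd-GAN concrete, here is a minimal PyTorch-style sketch. The encoder names e_A and e_B, the tiny convolutional stacks, and the even channel split between the two feature types are our illustrative assumptions; only the decoder names (g_A, g_B, as written later in the paper) and the overall extract-then-merge interface come from the text. This is a sketch, not the architecture specified in Section 3.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an image to (domain-independent, domain-specific) features.
    The even channel split between the two feature types is an
    illustrative assumption, not the paper's specification."""
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, 2 * feat_ch, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.feat_ch = feat_ch

    def forward(self, x):
        h = self.net(x)
        return h[:, :self.feat_ch], h[:, self.feat_ch:]  # (x^i, x^s)

class Decoder(nn.Module):
    """Merges the two kinds of features into an image (the role of the
    merge operator in x = x^i (+) x^s)."""
    def __init__(self, out_ch=3, feat_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(2 * feat_ch, feat_ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(feat_ch, out_ch, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, feat_i, feat_s):
        return self.net(torch.cat([feat_i, feat_s], dim=1))

# Conditional translation G_{A->B}: keep x_A's domain-independent
# features and take the domain-specific features from x_B.
e_A, e_B, g_B = Encoder(), Encoder(), Decoder()
x_A, x_B = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
xi_A, _ = e_A(x_A)
_, xs_B = e_B(x_B)
x_AB = g_B(xi_A, xs_B)  # x_AB = x_A^i (+) x_B^s
```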
and then reconstruct images as follows:

$$\hat{x}_A = g_A(\hat{x}_A^i, x_A^s); \quad \hat{x}_B = g_B(\hat{x}_B^i, x_B^s). \qquad (7)$$

We normalize the gradients so that their magnitudes are comparable across the four losses. We summarize the training process in Algorithm 1.
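One plausible reading of this normalization is sketched below: each of the four losses contributes a gradient rescaled to unit global norm before the parameter update. This is our illustration, not the paper's exact recipe, and the helper name step_with_normalized_grads is hypothetical.

```python
import torch

def step_with_normalized_grads(losses, params, optimizer, eps=1e-8):
    """Accumulate each loss's gradient scaled to unit global norm,
    so that no single loss dominates the update. Illustrative only."""
    optimizer.zero_grad()
    accum = [torch.zeros_like(p) for p in params]
    for loss in losses:  # e.g., the four reconstruction/GAN losses
        grads = torch.autograd.grad(loss, params, retain_graph=True)
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        for a, g in zip(accum, grads):
            a.add_(g / (norm + eps))
    for p, g in zip(params, accum):
        p.grad = g
    optimizer.step()
```

Under this reading, fixed loss weights could still be applied on top; the sketch only equalizes the raw gradient magnitudes so that the four terms remain comparable during training.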
4.2. Results

The translation results for face-to-face, edges-to-handbags and edges-to-shoes translation are shown in Figures 3-5, respectively.

Figure 3. Conditional face-to-face translation. (a) Results of conditional men→women translation. (b) Results of conditional women→men translation.

For men-to-women translation, we make several observations from Figure 3(a). (1) DualGAN can indeed generate women's photos, but its results are based purely on the men's photos, since it does not take the conditional images as inputs. (2) Although it takes the conditional image as input, DualGAN-c fails to integrate the information (e.g., style) from the conditional input into its translation output. (3) For GAN-c, the translation result is sometimes not relevant to the original source-domain input, e.g., the 4th row of Figure 3(a). This is because, during training, it is required to generate a target-domain image, but its output is not required to be similar (in certain aspects) to the original input. (4) cd-GAN works best among all the models, preserving the domain-independent features from the source-domain input and combining them with the domain-specific features from the target-domain conditional input. Here are two examples. (1) In the 6th column of the 1st row, the woman wears red lipstick. (2) In the 6th column of the 5th row, the hair style of the generated image is the most similar to that of the conditional input.

We observe similar results for women-to-men translation, as shown in Figure 3(b), especially for domain-specific features such as hair style and beard.

From Figures 4 and 5, we find that cd-GAN can leverage the domain-specific information carried in the conditional inputs well and control the generated target-domain images accordingly. DualGAN, DualGAN-c and GAN-c do not effectively utilize the conditional inputs.
Figure 4. Results of conditional edges→handbags translation.

One important characteristic of a conditional image-to-image translation model is that it can generate diverse target-domain images for a fixed source-domain image, so long as different target-domain images are provided as conditional inputs.
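In terms of the hypothetical encoder/decoder stubs from the sketch in Section 1.1 (e_A, e_B, g_B are assumed names), this diversity amounts to sweeping the conditional input while holding the source image fixed:

```python
import torch

# Reuses e_A, e_B, g_B from the earlier sketch; purely illustrative.
x_A = torch.randn(1, 3, 64, 64)                  # fixed source image
conditions = [torch.randn(1, 3, 64, 64) for _ in range(4)]

xi_A, _ = e_A(x_A)                               # content extracted once
translations = []
for x_B in conditions:
    _, xs_B = e_B(x_B)                           # style varies per condition
    translations.append(g_B(xi_A, xs_B))         # diverse outputs for one x_A
```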
Figure 7. Results produced by different connections and losses of cd-GANs.
$$x_{AB}^{x_B=0} = g_B(x_A^i, x_B^s = 0),$$
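Continuing the earlier hypothetical sketch, this ablation simply replaces the conditional domain-specific features with zeros before decoding (that the two feature tensors share a shape is an assumption of the sketch, not a statement about the paper's architecture):

```python
# x_AB^{x_B=0}: decode x_A's domain-independent features together with
# an all-zero domain-specific code instead of features taken from x_B.
xi_A, _ = e_A(x_A)
x_AB_zero = g_B(xi_A, torch.zeros_like(xi_A))  # assumes x^i and x^s share a shape
```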