
                                                                            Unpaired Image-to-Image Translation
                                                                        using Cycle-Consistent Adversarial Networks

                                                             Jun-Yan Zhu∗      Taesung Park∗     Phillip Isola      Alexei A. Efros
                                                                      Berkeley AI Research (BAIR) laboratory, UC Berkeley

arXiv:1703.10593v7 [cs.CV] 24 Aug 2020

[Figure 1 panels. Top: Monet ↔ photos, zebras ↔ horses, summer ↔ winter. Bottom: a photograph rendered in the styles of Monet, Van Gogh, Cezanne, and Ukiyo-e.]
                                         Figure 1: Given any two unordered image collections X and Y , our algorithm learns to automatically “translate” an image
                                         from one into the other and vice versa: (left) Monet paintings and landscape photos from Flickr; (center) zebras and horses
                                         from ImageNet; (right) summer and winter Yosemite photos from Flickr. Example application (bottom): using a collection
                                         of paintings of famous artists, our method learns to render natural photographs into the respective styles.

∗ indicates equal contribution

Abstract

Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, for many tasks, paired training data will not be available. We present an approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples. Our goal is to learn a mapping G : X → Y such that the distribution of images from G(X) is indistinguishable from the distribution Y using an adversarial loss. Because this mapping is highly under-constrained, we couple it with an inverse mapping F : Y → X and introduce a cycle consistency loss to enforce F(G(X)) ≈ X (and vice versa). Qualitative results are presented on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, photo enhancement, etc. Quantitative comparisons against several prior methods demonstrate the superiority of our approach.

1. Introduction

What did Claude Monet see as he placed his easel by the bank of the Seine near Argenteuil on a lovely spring day in 1873 (Figure 1, top-left)? A color photograph, had it been invented, may have documented a crisp blue sky and a glassy river reflecting it. Monet conveyed his impression of this same scene through wispy brush strokes and a bright palette.

What if Monet had happened upon the little harbor in Cassis on a cool summer evening (Figure 1, bottom-left)? A brief stroll through a gallery of Monet paintings makes it possible to imagine how he would have rendered the scene: perhaps in pastel shades, with abrupt dabs of paint, and a somewhat flattened dynamic range.

We can imagine all this despite never having seen a side by side example of a Monet painting next to a photo of the scene he painted. Instead, we have knowledge of the set of Monet paintings and of the set of landscape photographs. We can reason about the stylistic differences between these


two sets, and thereby imagine what a scene might look like if we were to “translate” it from one set into the other.

In this paper, we present a method that can learn to do the same: capturing special characteristics of one image collection and figuring out how these characteristics could be translated into the other image collection, all in the absence of any paired training examples.

This problem can be more broadly described as image-to-image translation [22], converting an image from one representation of a given scene, x, to another, y, e.g., grayscale to color, image to semantic labels, edge-map to photograph. Years of research in computer vision, image processing, computational photography, and graphics have produced powerful translation systems in the supervised setting, where example image pairs {x_i, y_i}_{i=1}^N are available (Figure 2, left), e.g., [11, 19, 22, 23, 28, 33, 45, 56, 58, 62]. However, obtaining paired training data can be difficult and expensive. For example, only a couple of datasets exist for tasks like semantic segmentation (e.g., [4]), and they are relatively small. Obtaining input-output pairs for graphics tasks like artistic stylization can be even more difficult since the desired output is highly complex, typically requiring artistic authoring. For many tasks, like object transfiguration (e.g., zebra↔horse, Figure 1 top-middle), the desired output is not even well-defined.

Figure 2: Paired training data (left) consists of training examples {x_i, y_i}_{i=1}^N, where the correspondence between x_i and y_i exists [22]. We instead consider unpaired training data (right), consisting of a source set {x_i}_{i=1}^N (x_i ∈ X) and a target set {y_j}_{j=1}^M (y_j ∈ Y), with no information provided as to which x_i matches which y_j.

We therefore seek an algorithm that can learn to translate between domains without paired input-output examples (Figure 2, right). We assume there is some underlying relationship between the domains – for example, that they are two different renderings of the same underlying scene – and seek to learn that relationship. Although we lack supervision in the form of paired examples, we can exploit supervision at the level of sets: we are given one set of images in domain X and a different set in domain Y. We may train a mapping G : X → Y such that the output ŷ = G(x), x ∈ X, is indistinguishable from images y ∈ Y by an adversary trained to classify ŷ apart from y. In theory, this objective can induce an output distribution over ŷ that matches the empirical distribution p_data(y) (in general, this requires G to be stochastic) [16]. The optimal G thereby translates the domain X to a domain Ŷ distributed identically to Y. However, such a translation does not guarantee that an individual input x and output y are paired up in a meaningful way – there are infinitely many mappings G that will induce the same distribution over ŷ. Moreover, in practice, we have found it difficult to optimize the adversarial objective in isolation: standard procedures often lead to the well-known problem of mode collapse, where all input images map to the same output image and the optimization fails to make progress [15].

These issues call for adding more structure to our objective. Therefore, we exploit the property that translation should be “cycle consistent”, in the sense that if we translate, e.g., a sentence from English to French, and then translate it back from French to English, we should arrive back at the original sentence [3]. Mathematically, if we have a translator G : X → Y and another translator F : Y → X, then G and F should be inverses of each other, and both mappings should be bijections. We apply this structural assumption by training both the mapping G and F simultaneously, and adding a cycle consistency loss [64] that encourages F(G(x)) ≈ x and G(F(y)) ≈ y. Combining this loss with adversarial losses on domains X and Y yields our full objective for unpaired image-to-image translation.

We apply our method to a wide range of applications, including collection style transfer, object transfiguration, season transfer and photo enhancement. We also compare against previous approaches that rely either on hand-defined factorizations of style and content, or on shared embedding functions, and show that our method outperforms these baselines. We provide both PyTorch and Torch implementations. Check out more results at our website.

2. Related work

Generative Adversarial Networks (GANs) [16, 63] have achieved impressive results in image generation [6, 39], image editing [66], and representation learning [39, 43, 37]. Recent methods adopt the same idea for conditional image generation applications, such as text2image [41], image inpainting [38], and future prediction [36], as well as to other domains like videos [54] and 3D data [57]. The key to GANs’ success is the idea of an adversarial loss that forces the generated images to be, in principle, indistinguishable from real photos. This loss is particularly powerful for image generation tasks, as this is exactly the objective that much of computer graphics aims to optimize. We adopt an adversarial loss to learn the mapping such that the translated
images cannot be distinguished from images in the target domain.

Figure 3: (a) Our model contains two mapping functions G : X → Y and F : Y → X, and associated adversarial discriminators D_Y and D_X. D_Y encourages G to translate X into outputs indistinguishable from domain Y, and vice versa for D_X and F. To further regularize the mappings, we introduce two cycle consistency losses that capture the intuition that if we translate from one domain to the other and back again we should arrive at where we started: (b) forward cycle-consistency loss: x → G(x) → F(G(x)) ≈ x, and (c) backward cycle-consistency loss: y → F(y) → G(F(y)) ≈ y.

Image-to-Image Translation The idea of image-to-image translation goes back at least to Hertzmann et al.’s Image Analogies [19], who employ a non-parametric texture model [10] on a single input-output training image pair. More recent approaches use a dataset of input-output examples to learn a parametric translation function using CNNs (e.g., [33]). Our approach builds on the “pix2pix” framework of Isola et al. [22], which uses a conditional generative adversarial network [16] to learn a mapping from input to output images. Similar ideas have been applied to various tasks such as generating photographs from sketches [44] or from attribute and semantic layouts [25]. However, unlike the above prior work, we learn the mapping without paired training examples.

Unpaired Image-to-Image Translation Several other methods also tackle the unpaired setting, where the goal is to relate two data domains: X and Y. Rosales et al. [42] propose a Bayesian framework that includes a prior based on a patch-based Markov random field computed from a source image and a likelihood term obtained from multiple style images. More recently, CoGAN [32] and cross-modal scene networks [1] use a weight-sharing strategy to learn a common representation across domains. Concurrent to our method, Liu et al. [31] extend the above framework with a combination of variational autoencoders [27] and generative adversarial networks [16]. Another line of concurrent work [46, 49, 2] encourages the input and output to share specific “content” features even though they may differ in “style”. These methods also use adversarial networks, with additional terms to enforce the output to be close to the input in a predefined metric space, such as class label space [2], image pixel space [46], and image feature space [49].

Unlike the above approaches, our formulation does not rely on any task-specific, predefined similarity function between the input and output, nor do we assume that the input and output have to lie in the same low-dimensional embedding space. This makes our method a general-purpose solution for many vision and graphics tasks. We directly compare against several prior and contemporary approaches in Section 5.1.

Cycle Consistency The idea of using transitivity as a way to regularize structured data has a long history. In visual tracking, enforcing simple forward-backward consistency has been a standard trick for decades [24, 48]. In the language domain, verifying and improving translations via “back translation and reconciliation” is a technique used by human translators [3] (including, humorously, by Mark Twain [51]), as well as by machines [17]. More recently, higher-order cycle consistency has been used in structure from motion [61], 3D shape matching [21], co-segmentation [55], dense semantic alignment [65, 64], and depth estimation [14]. Of these, Zhou et al. [64] and Godard et al. [14] are most similar to our work, as they use a cycle consistency loss as a way of using transitivity to supervise CNN training. In this work, we are introducing a similar loss to push G and F to be consistent with each other. Concurrent with our work, in these same proceedings, Yi et al. [59] independently use a similar objective for unpaired image-to-image translation, inspired by dual learning in machine translation [17].

Neural Style Transfer [13, 23, 52, 12] is another way to perform image-to-image translation, which synthesizes a novel image by combining the content of one image with the style of another image (typically a painting) based on matching the Gram matrix statistics of pre-trained deep features. Our primary focus, on the other hand, is learning the mapping between two image collections, rather than between two specific images, by trying to capture correspondences between higher-level appearance structures. Therefore, our method can be applied to other tasks, such as painting→photo, object transfiguration, etc. where single sample transfer methods do not perform well. We compare these two methods in Section 5.2.

3. Formulation

Our goal is to learn mapping functions between two domains X and Y given training samples {x_i}_{i=1}^N where x_i ∈ X and {y_j}_{j=1}^M where y_j ∈ Y.¹ We denote the data distribution as x ∼ p_data(x) and y ∼ p_data(y). As illustrated in Figure 3 (a), our model includes two mappings G : X → Y and F : Y → X. In addition, we introduce two adversarial discriminators D_X and D_Y, where D_X aims to distinguish between images {x} and translated images {F(y)}; in the same way, D_Y aims to discriminate between {y} and {G(x)}. Our objective contains two types of terms: adversarial losses [16] for matching the distribution of generated images to the data distribution in the target domain; and cycle consistency losses to prevent the learned mappings G and F from contradicting each other.

¹ We often omit the subscript i and j for simplicity.

3.1. Adversarial Loss

We apply adversarial losses [16] to both mapping functions. For the mapping function G : X → Y and its discriminator D_Y, we express the objective as:

\[
\mathcal{L}_{\text{GAN}}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{\text{data}}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log(1 - D_Y(G(x)))], \tag{1}
\]

where G tries to generate images G(x) that look similar to images from domain Y, while D_Y aims to distinguish between translated samples G(x) and real samples y. G aims to minimize this objective against an adversary D that tries to maximize it, i.e., min_G max_{D_Y} L_GAN(G, D_Y, X, Y). We introduce a similar adversarial loss for the mapping function F : Y → X and its discriminator D_X as well: i.e., min_F max_{D_X} L_GAN(F, D_X, Y, X).

3.2. Cycle Consistency Loss

Adversarial training can, in theory, learn mappings G and F that produce outputs identically distributed as target domains Y and X respectively (strictly speaking, this requires G and F to be stochastic functions) [15]. However, with large enough capacity, a network can map the same set of input images to any random permutation of images in the target domain, where any of the learned mappings can induce an output distribution that matches the target distribution. Thus, adversarial losses alone cannot guarantee that the learned function can map an individual input x_i to a desired output y_i. To further reduce the space of possible mapping functions, we argue that the learned mapping functions should be cycle-consistent: as shown in Figure 3 (b), for each image x from domain X, the image translation cycle should be able to bring x back to the original image, i.e., x → G(x) → F(G(x)) ≈ x. We call this forward cycle consistency. Similarly, as illustrated in Figure 3 (c), for each image y from domain Y, G and F should also satisfy backward cycle consistency: y → F(y) → G(F(y)) ≈ y. We incentivize this behavior using a cycle consistency loss:

\[
\mathcal{L}_{\text{cyc}}(G, F) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\lVert F(G(x)) - x \rVert_1] + \mathbb{E}_{y \sim p_{\text{data}}(y)}[\lVert G(F(y)) - y \rVert_1]. \tag{2}
\]

In preliminary experiments, we also tried replacing the L1 norm in this loss with an adversarial loss between F(G(x)) and x, and between G(F(y)) and y, but did not observe improved performance.

The behavior induced by the cycle consistency loss can be observed in Figure 4: the reconstructed images F(G(x)) end up matching closely to the input images x.

Figure 4: The input images x, output images G(x) and the reconstructed images F(G(x)) from various experiments. From top to bottom: photo↔Cezanne, horses↔zebras, winter→summer Yosemite, aerial photos↔Google maps.

3.3. Full Objective

Our full objective is:

\[
\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{\text{GAN}}(G, D_Y, X, Y) + \mathcal{L}_{\text{GAN}}(F, D_X, Y, X) + \lambda \mathcal{L}_{\text{cyc}}(G, F). \tag{3}
\]
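To make Equations 1–3 concrete, the following is a minimal sketch in plain Python that evaluates them as sample averages on toy data. Everything in it is an illustrative assumption rather than the released PyTorch or Torch code: the function names (`l1_distance`, `gan_loss`, `cycle_loss`, `full_objective`), the representation of an "image" as a short list of floats, and the stand-in mappings and constant discriminators. The sketch uses the log-likelihood form of Equation 1 exactly as written, and the default `lam=10.0` mirrors the value of λ reported for the experiments.

```python
import math


def l1_distance(a, b):
    """L1 norm of the difference between two equally sized 'images' (lists of floats)."""
    return sum(abs(p - q) for p, q in zip(a, b))


def gan_loss(gen, disc, sources, targets):
    """Sample estimate of Eq. (1): E_y[log D(y)] + E_x[log(1 - D(G(x)))]."""
    real_term = sum(math.log(disc(y)) for y in targets) / len(targets)
    fake_term = sum(math.log(1.0 - disc(gen(x))) for x in sources) / len(sources)
    return real_term + fake_term


def cycle_loss(G, F, xs, ys):
    """Sample estimate of Eq. (2): forward plus backward L1 reconstruction error."""
    forward = sum(l1_distance(F(G(x)), x) for x in xs) / len(xs)
    backward = sum(l1_distance(G(F(y)), y) for y in ys) / len(ys)
    return forward + backward


def full_objective(G, F, D_X, D_Y, xs, ys, lam=10.0):
    """Eq. (3): two adversarial terms plus the lambda-weighted cycle term."""
    return (gan_loss(G, D_Y, xs, ys)
            + gan_loss(F, D_X, ys, xs)
            + lam * cycle_loss(G, F, xs, ys))


# Hypothetical toy setup: G and F are exact inverses, and both
# discriminators are maximally unsure (they always output 0.5).
G = lambda x: [v + 1.0 for v in x]
F = lambda y: [v - 1.0 for v in y]
D_X = lambda img: 0.5
D_Y = lambda img: 0.5
xs = [[0.0, 0.5], [0.25, 0.75]]
ys = [[1.0, 2.0], [1.5, 1.25]]

print("cycle loss:", cycle_loss(G, F, xs, ys))
print("full objective:", full_objective(G, F, D_X, D_Y, xs, ys))
```

Because the toy G and F invert each other exactly, the cycle term vanishes and only the two adversarial terms contribute. Replacing F with a mapping that ignores its input makes the cycle term positive, which is the penalty the cycle consistency loss is meant to impose.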
In Equation 3, λ controls the relative importance of the two objectives. We aim to solve:

\[
G^*, F^* = \arg\min_{G, F} \max_{D_X, D_Y} \mathcal{L}(G, F, D_X, D_Y). \tag{4}
\]

Notice that our model can be viewed as training two “autoencoders” [20]: we learn one autoencoder F ◦ G : X → X jointly with another G ◦ F : Y → Y. However, these autoencoders each have special internal structures: they map an image to itself via an intermediate representation that is a translation of the image into another domain. Such a setup can also be seen as a special case of “adversarial autoencoders” [34], which use an adversarial loss to train the bottleneck layer of an autoencoder to match an arbitrary target distribution. In our case, the target distribution for the X → X autoencoder is that of the domain Y.

In Section 5.1.4, we compare our method against ablations of the full objective, including the adversarial loss L_GAN alone and the cycle consistency loss L_cyc alone, and empirically show that both objectives play critical roles in arriving at high-quality results. We also evaluate our method with only cycle loss in one direction and show that a single cycle is not sufficient to regularize the training for this under-constrained problem.

4. Implementation

Network Architecture We adopt the architecture for our generative networks from Johnson et al. [23] who have shown impressive results for neural style transfer and super-resolution. This network contains three convolutions, several residual blocks [18], two fractionally-strided convolutions with stride 1/2, and one convolution that maps features to RGB. We use 6 blocks for 128 × 128 images and 9 blocks for 256 × 256 and higher-resolution training images. Similar to Johnson et al. [23], we use instance normalization [53]. For the discriminator networks we use 70 × 70 PatchGANs [22, 30, 29], which aim to classify whether 70 × 70 overlapping image patches are real or fake. Such a patch-level discriminator architecture has fewer parameters than a full-image discriminator and can work on arbitrarily-sized images in a fully convolutional fashion [22].

Training details We apply two techniques from recent works to stabilize our model training procedure. First, for L_GAN (Equation 1), we replace the negative log likelihood objective by a least-squares loss [35]. This loss is more stable during training and generates higher quality results. In particular, for a GAN loss L_GAN(G, D, X, Y), we train the G to minimize E_{x∼p_data(x)}[(D(G(x)) − 1)²] and train the D to minimize E_{y∼p_data(y)}[(D(y) − 1)²] + E_{x∼p_data(x)}[D(G(x))²].

Second, to reduce model oscillation [15], we follow Shrivastava et al.’s strategy [46] and update the discriminators using a history of generated images rather than the ones produced by the latest generators. We keep an image buffer that stores the 50 previously created images.

For all the experiments, we set λ = 10 in Equation 3. We use the Adam solver [26] with a batch size of 1. All networks were trained from scratch with a learning rate of 0.0002. We keep the same learning rate for the first 100 epochs and linearly decay the rate to zero over the next 100 epochs. Please see the appendix (Section 7) for more details about the datasets, architectures, and training procedures.

5. Results

We first compare our approach against recent methods for unpaired image-to-image translation on paired datasets where ground truth input-output pairs are available for evaluation. We then study the importance of both the adversarial loss and the cycle consistency loss and compare our full method against several variants. Finally, we demonstrate the generality of our algorithm on a wide range of applications where paired data does not exist. For brevity, we refer to our method as CycleGAN. The PyTorch and Torch code, models, and full results can be found at our website.

5.1. Evaluation

Using the same evaluation datasets and metrics as “pix2pix” [22], we compare our method against several baselines both qualitatively and quantitatively. The tasks include semantic labels↔photo on the Cityscapes dataset [4], and map↔aerial photo on data scraped from Google Maps. We also perform an ablation study on the full loss function.

5.1.1 Evaluation Metrics

AMT perceptual studies On the map↔aerial photo task, we run “real vs fake” perceptual studies on Amazon Mechanical Turk (AMT) to assess the realism of our outputs. We follow the same perceptual study protocol from Isola et al. [22], except we only gather data from 25 participants per algorithm we tested. Participants were shown a sequence of pairs of images, one a real photo or map and one fake (generated by our algorithm or a baseline), and asked to click on the image they thought was real. The first 10 trials of each session were practice and feedback was given as to whether the participant’s response was correct or incorrect. The remaining 40 trials were used to assess the rate at which each algorithm fooled participants. Each session only tested a single algorithm, and participants were only allowed to complete a single session. The numbers we report here are not directly comparable to those in [22] as our ground truth images were processed slightly differently² and the participant pool we tested may be differently distributed from those tested in [22] (due to running the experiment at a different date and time). Therefore, our numbers should only be used to compare our current method against the baselines (which were run under identical conditions), rather than against [22].

² We train all the models on 256 × 256 images while in pix2pix [22], the model was trained on 256 × 256 patches of 512 × 512 images, and

Figure 5: Different methods for mapping labels↔photos trained on Cityscapes images. From left to right: input, BiGAN/ALI [7, 9], CoGAN [32], feature loss + GAN, SimGAN [46], CycleGAN (ours), pix2pix [22] trained on paired data, and ground truth.

Figure 6: Different methods for mapping aerial photos↔maps on Google Maps. From left to right: input, BiGAN/ALI [7, 9], CoGAN [32], feature loss + GAN, SimGAN [46], CycleGAN (ours), pix2pix [22] trained on paired data, and ground truth.

FCN score Although perceptual studies may be the gold standard for assessing graphical realism, we also seek an automatic quantitative measure that does not require human experiments. For this, we adopt the “FCN score” from [22], and use it to evaluate the Cityscapes labels→photo task. The FCN metric evaluates how interpretable the generated photos are according to an off-the-shelf semantic segmentation algorithm (the fully-convolutional network, FCN, from [33]). The FCN predicts a label map for a generated photo. This label map can then be compared against the input ground truth labels using standard semantic segmentation metrics described below. The intuition is that if we generate a photo from a label map of “car on the road”, then we have succeeded if the FCN applied to the generated photo detects “car on the road”.

Semantic segmentation metrics To evaluate the performance of photo→labels, we use the standard metrics from the Cityscapes benchmark [4], including per-pixel accuracy, per-class accuracy, and mean class Intersection-Over-Union (Class IOU) [4].

5.1.2 Baselines

CoGAN [32] This method learns one GAN generator for domain X and one for domain Y, with tied weights on the first few layers for shared latent representations. Translation from X to Y can be achieved by finding a latent representation that generates image X and then rendering this latent representation into style Y.
run convolutionally on the 512 × 512 images at test time. We choose
256 × 256 in our experiments as many baselines cannot scale up to high-       SimGAN [46] Like our method, Shrivastava et al.[46]
resolution images, and CoGAN cannot be tested fully convolutionally.      uses an adversarial loss to train a translation from X to Y .
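
As an illustration of the least-squares objective used in place of the negative log likelihood in the training details, the two terms can be written directly from their definitions. This is a minimal sketch, not the authors' implementation: the function names and the toy discriminator scores are assumptions, and the discriminator outputs are treated as plain lists of numbers.

```python
from statistics import fmean

def lsgan_generator_loss(d_on_fake):
    """Mean of (D(G(x)) - 1)^2 over discriminator outputs on generated images."""
    return fmean([(d - 1.0) ** 2 for d in d_on_fake])

def lsgan_discriminator_loss(d_on_real, d_on_fake):
    """Mean of (D(y) - 1)^2 on real images plus mean of D(G(x))^2 on generated ones."""
    real_term = fmean([(d - 1.0) ** 2 for d in d_on_real])
    fake_term = fmean([d ** 2 for d in d_on_fake])
    return real_term + fake_term

# Toy discriminator scores (hypothetical values, not taken from the paper).
d_real = [0.9, 1.1, 1.0]   # D(y) for real target-domain images
d_fake = [0.2, 0.0, 0.1]   # D(G(x)) for translated images

g_loss = lsgan_generator_loss(d_fake)
d_loss = lsgan_discriminator_loss(d_real, d_fake)
print(f"generator loss: {g_loss:.4f}, discriminator loss: {d_loss:.4f}")
```

Both terms are quadratic in the discriminator output, so the discriminator loss is zero only when real images score exactly 1 and generated images score exactly 0.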
                         Map → Photo               Photo → Map
  Loss                   % Turkers labeled real    % Turkers labeled real
  CoGAN [32]             0.6% ± 0.5%               0.9% ± 0.5%
  BiGAN/ALI [9, 7]       2.1% ± 1.0%               1.9% ± 0.9%
  SimGAN [46]            0.7% ± 0.5%               2.6% ± 1.1%
  Feature loss + GAN     1.2% ± 0.6%               0.3% ± 0.2%
  CycleGAN (ours)        26.8% ± 2.8%              23.2% ± 3.4%
Table 1: AMT “real vs fake” test on maps↔aerial photos at 256 × 256 resolution.

  Loss                   Per-pixel acc.   Per-class acc.   Class IOU
  CoGAN [32]             0.40             0.10             0.06
  BiGAN/ALI [9, 7]       0.19             0.06             0.02
  SimGAN [46]            0.20             0.10             0.04
  Feature loss + GAN     0.06             0.04             0.01
  CycleGAN (ours)        0.52             0.17             0.11
  pix2pix [22]           0.71             0.25             0.18
Table 2: FCN-scores for different methods, evaluated on Cityscapes labels→photo.

  Loss                   Per-pixel acc.   Per-class acc.   Class IOU
  Cycle alone            0.22             0.07             0.02
  GAN alone              0.51             0.11             0.08
  GAN + forward cycle    0.55             0.18             0.12
  GAN + backward cycle   0.39             0.14             0.06
  CycleGAN (ours)        0.52             0.17             0.11
Table 4: Ablation study: FCN-scores for different variants of our method, evaluated on Cityscapes labels→photo.

  Loss                   Per-pixel acc.   Per-class acc.   Class IOU
  Cycle alone            0.10             0.05             0.02
  GAN alone              0.53             0.11             0.07
  GAN + forward cycle    0.49             0.11             0.07
  GAN + backward cycle   0.01             0.06             0.01
  CycleGAN (ours)        0.58             0.22             0.16
Table 5: Ablation study: classification performance of photo→labels for different losses, evaluated on Cityscapes.
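
The three columns reported in these tables (per-pixel accuracy, per-class accuracy, and class IOU) can all be derived from a confusion matrix between ground-truth and predicted labels. The sketch below is illustrative only: it is not the Cityscapes evaluation code, the function names are hypothetical, and skipping classes that never occur when averaging is an assumption made here.

```python
def confusion_matrix(true_labels, pred_labels, num_classes):
    """conf[t][p] counts pixels with ground-truth class t predicted as class p."""
    conf = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, pred_labels):
        conf[t][p] += 1
    return conf

def segmentation_scores(conf):
    """Return (per-pixel accuracy, mean per-class accuracy, mean class IOU)."""
    n = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[c][c] for c in range(n))
    class_accs, class_ious = [], []
    for c in range(n):
        true_c = sum(conf[c])                        # pixels whose ground truth is c
        pred_c = sum(conf[r][c] for r in range(n))   # pixels predicted as c
        tp = conf[c][c]
        if true_c > 0:                               # assumption: skip absent classes
            class_accs.append(tp / true_c)
        union = true_c + pred_c - tp
        if union > 0:
            class_ious.append(tp / union)
    return (correct / total,
            sum(class_accs) / len(class_accs),
            sum(class_ious) / len(class_ious))

# Toy flattened label maps with three classes (hypothetical data).
truth = [0, 0, 0, 1, 1, 2]
pred = [0, 0, 1, 1, 1, 0]
pixel_acc, class_acc, class_iou = segmentation_scores(confusion_matrix(truth, pred, 3))
print(pixel_acc, class_acc, class_iou)
```

For the FCN score, `pred` would be the FCN's label map for a generated photo and `truth` the input label map; for photo→labels, `pred` would be the translated label map itself.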
  Loss                   Per-pixel acc.   Per-class acc.   Class IOU
  CoGAN [32]             0.45             0.11             0.08
  BiGAN/ALI [9, 7]       0.41             0.13             0.07
  SimGAN [46]            0.47             0.11             0.07
  Feature loss + GAN     0.50             0.10             0.06
  CycleGAN (ours)        0.58             0.22             0.16
  pix2pix [22]           0.85             0.40             0.32
Table 3: Classification performance of photo→labels for different methods on Cityscapes.

The regularization term $\|x - G(x)\|_1$ is used to penalize making large changes at pixel level.

Feature loss + GAN We also test a variant of SimGAN [46] where the L1 loss is computed over deep image features
using a pretrained network (VGG-16 relu4_2 [47]), rather than over RGB pixel values. Computing distances in deep
feature space, like this, is also sometimes referred to as using a “perceptual loss” [8, 23].

BiGAN/ALI [9, 7] Unconditional GANs [16] learn a generator G : Z → X that maps a random noise z to an image x.
BiGAN [9] and ALI [7] propose to also learn the inverse mapping function F : X → Z. Though they were originally
designed for mapping a latent vector z to an image x, we implemented the same objective for mapping a source
image x to a target image y.

pix2pix [22] We also compare against pix2pix [22], which is trained on paired data, to see how close we can get
to this “upper bound” without using any paired data.

For a fair comparison, we implement all the baselines using the same architecture and details as our method,
except for CoGAN [32]. CoGAN builds on generators that produce images from a shared latent representation,
which is incompatible with our image-to-image network. We use the public implementation of CoGAN instead.

5.1.3   Comparison against baselines

As can be seen in Figure 5 and Figure 6, we were unable to achieve compelling results with any of the baselines.
Our method, on the other hand, can produce translations that are often of similar quality to the fully
supervised pix2pix.

Table 1 reports performance regarding the AMT perceptual realism task. Here, we see that our method can fool
participants on around a quarter of trials, in both the maps→aerial photos direction and the aerial photos→maps
direction at 256 × 256 resolution³. All the baselines almost never fooled participants.

Table 2 assesses the performance of the labels→photo task on Cityscapes and Table 3 evaluates the opposite
mapping (photos→labels). In both cases, our method again outperforms the baselines.

5.1.4   Analysis of the loss function

In Table 4 and Table 5, we compare against ablations of our full loss. Removing the GAN loss substantially
degrades results, as does removing the cycle-consistency loss. We therefore conclude that both terms are
critical to our results. We also evaluate our method with the cycle loss in only one direction: GAN + forward
cycle loss $\mathbb{E}_{x \sim p_{\text{data}}(x)}[\|F(G(x)) - x\|_1]$, or GAN + backward cycle loss
$\mathbb{E}_{y \sim p_{\text{data}}(y)}[\|G(F(y)) - y\|_1]$ (Equation 2), and find that it often incurs training
instability and causes mode collapse, especially for the direction of the mapping that was removed. Figure 7
shows several qualitative examples.

5.1.5   Image reconstruction quality

In Figure 4, we show a few random samples of the reconstructed images F(G(x)). We observed that the
reconstructed images were often close to the original inputs x, at both training and testing time, even in
cases where one domain represents significantly more diverse information, such as map↔aerial photos.

³ We also train CycleGAN and pix2pix at 512 × 512 resolution, and observe comparable performance: maps→aerial
photos: CycleGAN: 37.5% ± 3.6% and pix2pix: 33.9% ± 3.1%; aerial photos→maps: CycleGAN: 16.5% ± 4.1% and
pix2pix: 8.5% ± 2.6%.
Figure 7: Different variants of our method for mapping labels↔photos trained on Cityscapes. From left to right:
input, cycle-consistency loss alone, adversarial loss alone, GAN + forward cycle-consistency loss (F(G(x)) ≈ x),
GAN + backward cycle-consistency loss (G(F(y)) ≈ y), CycleGAN (our full method), and ground truth. Both Cycle
alone and GAN + backward fail to produce images similar to the target domain. GAN alone and GAN + forward suffer
from mode collapse, producing identical label maps regardless of the input photo.
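
To make the two one-directional variants concrete, the forward term F(G(x)) ≈ x and the backward term G(F(y)) ≈ y can each be computed on its own. The following is a minimal sketch under stated assumptions: images are flat lists of floats, `G` and `F` are toy stand-in functions rather than trained networks, and the L1 distance is averaged per pixel (a normalization choice made here, not taken from the paper).

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized flat images."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

def forward_cycle_loss(G, F, xs):
    """Average of ||F(G(x)) - x||_1 over samples x from domain X."""
    return sum(l1_distance(F(G(x)), x) for x in xs) / len(xs)

def backward_cycle_loss(G, F, ys):
    """Average of ||G(F(y)) - y||_1 over samples y from domain Y."""
    return sum(l1_distance(G(F(y)), y) for y in ys) / len(ys)

# Toy stand-ins for the two mappings (hypothetical, for illustration only).
def G(img):                      # X -> Y
    return [2.0 * v for v in img]

def F(img):                      # Y -> X, an imperfect inverse of G
    return [0.5 * v + 0.1 for v in img]

xs = [[0.0, 1.0], [2.0, 4.0]]
ys = [[1.0, 3.0]]

fwd = forward_cycle_loss(G, F, xs)    # F(G(x)) = x + 0.1 for every pixel
bwd = backward_cycle_loss(G, F, ys)   # G(F(y)) = y + 0.2 for every pixel
print(f"forward: {fwd:.3f}, backward: {bwd:.3f}")
```

In this sketch, "GAN + forward cycle" would add only `fwd` to the adversarial terms, "GAN + backward cycle" only `bwd`, and the full method both.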
Figure 8: Example results of CycleGAN on paired datasets used in “pix2pix” [22] such as architectural
labels↔photos and edges↔shoes.

5.1.6   Additional results on paired datasets

Figure 8 shows some example results on other paired datasets used in “pix2pix” [22], such as architectural
labels↔photos from the CMP Facade Database [40], and edges↔shoes from the UT Zappos50K dataset [60]. The image
quality of our results is close to those produced by the fully supervised pix2pix while our method learns the
mapping without paired supervision.

5.2. Applications

We demonstrate our method on several applications where paired training data does not exist. Please refer to
the appendix (Section 7) for more details about the datasets. We observe that translations on training data are
often more appealing than those on test data, and full results of all applications on both training and test
data can be viewed on our project website.

Collection style transfer (Figure 10 and Figure 11) We train the model on landscape photographs downloaded from
Flickr and WikiArt. Unlike recent work on “neural style transfer” [13], our method learns to mimic the style of
an entire collection of artworks, rather than transferring the style of a single selected piece of art.
Therefore, we can learn to generate photos in the style of, e.g., Van Gogh, rather than just in the style of
Starry Night. The size of the dataset for each artist/style was 526, 1073, 400, and 563 for Cezanne, Monet,
Van Gogh, and Ukiyo-e.

Object transfiguration (Figure 13) The model is trained to translate one object class from ImageNet [5] to
another (each class contains around 1000 training images). Turmukhambetov et al. [50] propose a subspace model
to translate one object into another object of the same category, while our method focuses on object
transfiguration between two visually similar categories.

Season transfer (Figure 13) The model is trained on 854 winter photos and 1273 summer photos of Yosemite
downloaded from Flickr.

Photo generation from paintings (Figure 12) For painting→photo, we find that it is helpful to introduce an
additional loss to encourage the mapping to preserve color composition between the input and output. In
particular, we adopt the technique of Taigman et al. [49] and regularize the generator to be near an identity
mapping when real samples of the target domain are provided as the input to the generator: i.e.,
$\mathcal{L}_{\text{identity}}(G, F) = \mathbb{E}_{y \sim p_{\text{data}}(y)}[\|G(y) - y\|_1] + \mathbb{E}_{x \sim p_{\text{data}}(x)}[\|F(x) - x\|_1]$.
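
The identity mapping loss defined above can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: images are flat lists of floats, the L1 distance is averaged per pixel, and `tinting_G` and `identity_F` are hypothetical stand-ins for the generators G and F.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized flat images."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

def identity_loss(G, F, xs, ys):
    """E_y[||G(y) - y||_1] + E_x[||F(x) - x||_1].

    G maps X -> Y but is fed real samples y from its *target* domain Y;
    F maps Y -> X but is fed real samples x from its target domain X.
    """
    g_term = sum(l1_distance(G(y), y) for y in ys) / len(ys)
    f_term = sum(l1_distance(F(x), x) for x in xs) / len(xs)
    return g_term + f_term

# Hypothetical generators: G shifts every pixel (a global "tint"), F leaves
# its input unchanged.
def tinting_G(img):
    return [v + 0.25 for v in img]

def identity_F(img):
    return list(img)

ys = [[0.0, 0.5], [1.0, 1.0]]   # real samples from domain Y
xs = [[0.25, 0.5]]              # real samples from domain X

loss = identity_loss(tinting_G, identity_F, xs, ys)
print(loss)
```

A generator that shifts the colors of images already in its target domain is penalized in proportion to the shift, while one that leaves them unchanged contributes zero.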
Figure 9: The effect of the identity mapping loss on Monet’s painting→photos. From left to right: input
paintings, CycleGAN without identity mapping loss, CycleGAN with identity mapping loss. The identity mapping
loss helps preserve the color of the input paintings.

Without $\mathcal{L}_{\text{identity}}$, the generators G and F are free to change the tint of input images when
there is no need to. For example, when learning the mapping between Monet’s paintings and Flickr photographs,
the generator often maps paintings of daytime to photographs taken during sunset, because such a mapping may be
equally valid under the adversarial loss and cycle consistency loss. The effect of this identity mapping loss
is shown in Figure 9.

In Figure 12, we show additional results translating Monet’s paintings to photographs. This figure and Figure 9
show results on paintings that were included in the training set, whereas for all other experiments in the
paper, we only evaluate and show test set results. Because the training set does not include paired data,
coming up with a plausible translation for a training set painting is a nontrivial task. Indeed, since Monet is
no longer able to create new paintings, generalization to unseen, “test set”, paintings is not a pressing
problem.

Photo enhancement (Figure 14) We show that our method can be used to generate photos with shallower depth of
field. We train the model on flower photos downloaded from Flickr. The source domain consists of flower photos
taken by smartphones, which usually have deep DoF due to a small aperture. The target contains photos captured
by DSLRs with a larger aperture. Our model successfully generates photos with shallower depth of field from the
photos taken by smartphones.

Comparison with Gatys et al. [13] In Figure 15, we compare our results with neural style transfer [13] on photo
stylization. For each row, we first use two representative artworks as the style images for [13]. Our method,
on the other hand, can produce photos in the style of an entire collection. To compare against neural style
transfer of an entire collection, we compute the average Gram Matrix across the target domain and use this
matrix to transfer the “average style” with Gatys et al. [13].

Figure 16 demonstrates similar comparisons for other translation tasks. We observe that Gatys et al. [13]
requires finding target style images that closely match the desired output, but still often fails to produce
photorealistic results, while our method succeeds in generating natural-looking results, similar to the target
domain.

6. Limitations and Discussion

Although our method can achieve compelling results in many cases, the results are far from uniformly positive.
Figure 17 shows several typical failure cases. On translation tasks that involve color and texture changes,
like many of those reported above, the method often succeeds. We have also explored tasks that require
geometric changes, with little success. For example, on the task of dog→cat transfiguration, the learned
translation degenerates into making minimal changes to the input (Figure 17). This failure might be caused by
our generator architectures, which are tailored for good performance on appearance changes. Handling more
varied and extreme transformations, especially geometric changes, is an important problem for future work.

Some failure cases are caused by the distribution characteristics of the training datasets. For example, our
method is confused in the horse → zebra example (Figure 17, right), because our model was trained on the wild
horse and zebra synsets of ImageNet, which do not contain images of a person riding a horse or zebra.

We also observe a lingering gap between the results achievable with paired training data and those achieved by
our unpaired method. In some cases, this gap may be very hard – or even impossible – to close: for example, our
method sometimes permutes the labels for tree and building in the output of the photos→labels task. Resolving
this ambiguity may require some form of weak semantic supervision. Integrating weak or semi-supervised data may
lead to substantially more powerful translators, still at a fraction of the annotation cost of the
fully-supervised systems.

Nonetheless, in many cases completely unpaired data is plentifully available and should be made use of. This
paper pushes the boundaries of what is possible in this “unsupervised” setting.

Acknowledgments: We thank Aaron Hertzmann, Shiry Ginosar, Deepak Pathak, Bryan Russell, Eli Shechtman, Richard
Zhang, and Tinghui Zhou for many helpful comments. This work was supported in part by NSF SMA-1514512, NSF
IIS-1633310, a Google Research Award, Intel Corp, and hardware donations from NVIDIA. JYZ is supported by the
Facebook Graduate Fellowship and TP is supported by the Samsung Scholarship. The photographs used for style
transfer were taken by AE, mostly in France.
Figure 10: Collection style transfer I: we transfer input images into the artistic styles of Monet, Van Gogh,
Cezanne, and Ukiyo-e. Please see our website for additional examples.
Figure 11: Collection style transfer II: we transfer input images into the artistic styles of Monet, Van Gogh,
Cezanne, and Ukiyo-e. Please see our website for additional examples.
Figure 12: Relatively successful results on mapping Monet’s paintings to a photographic style. Please see our
website for additional examples.
[Rows, top to bottom: horse → zebra; zebra → horse; winter Yosemite → summer Yosemite; summer Yosemite → winter
Yosemite; apple → orange; orange → apple]

Figure 13: Our method applied to several translation problems. These images are selected as relatively
successful results – please see our website for more comprehensive and random results. In the top two rows, we
show results on object transfiguration between horses and zebras, trained on 939 images from the wild horse
class and 1177 images from the zebra class in ImageNet [5]. Also check out the horse→zebra demo video. The
middle two rows show results on season transfer, trained on winter and summer photos of Yosemite from Flickr.
In the bottom two rows, we train our method on 996 apple images and 1020 navel orange images from ImageNet.
Figure 14: Photo enhancement: mapping from a set of smartphone snaps to professional DSLR photographs, the
system often learns to produce shallow focus. Here we show some of the most successful results in our test set –
average performance is considerably worse. Please see our website for more comprehensive and random examples.



[Rows, top to bottom: Photo → Van Gogh; Photo → Ukiyo-e; Photo → Cezanne]

Figure 15: We compare our method with neural style transfer [13] on photo stylization. Left to right: input
image, results from Gatys et al. [13] using two different representative artworks as style images, results from
Gatys et al. [13] using the entire collection of the artist, and CycleGAN (ours).
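
The collection-level baseline in this comparison relies on averaging Gram matrices across the target domain. A minimal sketch of that averaging step follows. It assumes each image has already been reduced to a feature map of C channels by N spatial positions (in practice these would come from a pretrained network, which is not modeled here), and it leaves the Gram matrix unnormalized; both choices are assumptions of this sketch.

```python
def gram_matrix(features):
    """Gram matrix of a feature map given as C rows (channels) of N activations.

    Entry (i, j) is the inner product between channel i and channel j.
    """
    c = len(features)
    return [[sum(a * b for a, b in zip(features[i], features[j]))
             for j in range(c)]
            for i in range(c)]

def average_gram(feature_maps):
    """Element-wise mean of the Gram matrices of several images."""
    grams = [gram_matrix(f) for f in feature_maps]
    c = len(grams[0])
    return [[sum(g[i][j] for g in grams) / len(grams) for j in range(c)]
            for i in range(c)]

# Two toy "style images", each with 2 channels and 2 spatial positions
# (hypothetical activations, not real network features).
collection = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, -1.0]],
]
avg_style = average_gram(collection)
print(avg_style)
```

The averaged matrix would then play the role that a single style image's Gram matrix plays in the style loss of [13].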
Figure 16: We compare our method with neural style transfer [13] on various applications. From top to bottom:
apple→orange, horse→zebra, and Monet→photo. Left to right: input image, results from Gatys et al. [13] using two
different images as style images, results from Gatys et al. [13] using all the images from the target domain,
and CycleGAN (ours).



[Panels: apple → orange; zebra → horse; winter → summer; dog → cat; cat → dog; Monet → photo; horse → zebra;
photo → Ukiyo-e; photo → Van Gogh; iPhone photo → DSLR photo; ImageNet “wild horse” training images]

Figure 17: Typical failure cases of our method. Left: in the task of dog→cat transfiguration, CycleGAN can only
make minimal changes to the input. Right: CycleGAN also fails in this horse → zebra example as our model has not
seen images of horseback riding during training. Please see our website for more comprehensive results.
References

 [1] Y. Aytar, L. Castrejon, C. Vondrick, H. Pirsiavash, and A. Torralba. Cross-modal scene networks. PAMI,
     2016. 3
 [2] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain
     adaptation with generative adversarial networks. In CVPR, 2017. 3
 [3] R. W. Brislin. Back-translation for cross-cultural research. Journal of cross-cultural psychology,
     1(3):185–216, 1970. 2, 3
 [4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele.
     The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016. 2, 5, 6, 18
 [5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image
     database. In CVPR, 2009. 8, 13, 18
 [6] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a laplacian pyramid of
     adversarial networks. In NIPS, 2015. 2
 [7] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. In ICLR, 2017. 6, 7
 [8] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks.
     In NIPS, 2016. 7
 [9] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville. Adversarially
     learned inference. In ICLR, 2017. 6, 7
[10] A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In ICCV, 1999. 3
[11] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale
     convolutional architecture. In ICCV, 2015. 2
[12] L. A. Gatys, M. Bethge, A. Hertzmann, and E. Shechtman. Preserving color in neural artistic style transfer.
     arXiv preprint arXiv:1606.05897, 2016. 3
[13] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. CVPR,
     2016. 3, 8, 9, 14, 15
[14] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right
     consistency. In CVPR, 2017. 3
[15] I. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160,
     2016. 2, 4, 5
[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio.
     Generative adversarial nets. In NIPS, 2014. 2, 3, 4, 7
[17] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W.-Y. Ma. Dual learning for machine translation. In
     NIPS, 2016. 3
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 5
[19] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. In SIGGRAPH,
     2001. 2, 3
[20] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science,
     313(5786):504–507, 2006. 5
[21] Q.-X. Huang and L. Guibas. Consistent shape maps via semidefinite programming. In Symposium on Geometry
     Processing, 2013. 3
[22] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial
     networks. In CVPR, 2017. 2, 3, 5, 6, 7, 8, 18
[23] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution.
     In ECCV, 2016. 2, 3, 5, 7, 18
[24] Z. Kalal, K. Mikolajczyk, and J. Matas. Forward-backward error: Automatic detection of tracking failures.
     In ICPR, 2010. 3
[25] L. Karacan, Z. Akata, A. Erdem, and E. Erdem. Learning to generate images of outdoor scenes from
     attributes and semantic layouts. arXiv preprint arXiv:1612.00215, 2016. 3
[26] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 5
[27] D. P. Kingma and M. Welling. Auto-encoding variational bayes. ICLR, 2014. 3
[28] P.-Y. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient attributes for high-level understanding
     and editing of outdoor scenes. ACM TOG, 33(4):149, 2014. 2
[29] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz,
     Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In
     CVPR, 2017. 5
[30] C. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks.
     ECCV, 2016. 5
[31] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017. 3
[32] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In NIPS, 2016. 3, 6, 7
[33] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR,
     2015. 2, 3, 6
[34] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. In ICLR, 2016. 5
[35] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial
     networks. In CVPR. IEEE, 2017. 5
[36] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In
     ICLR, 2016. 2
[37] M. F. Mathieu, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun. Disentangling factors of variation in
     deep representation using adversarial training. In NIPS, 2016. 2
[38] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by
     inpainting. CVPR, 2016. 2
[39] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional
     generative adversarial networks. In ICLR, 2016. 2
[50] D. Turmukhambetov, N. D. Campbell, S. J. Prince, and J. Kautz. Modeling object appearance using
     context-conditioned component analysis. In CVPR, 2015. 8
[51] M. Twain. The jumping frog: in english, then in french, and then clawed back into a civilized language
     once more by patient. Unremunerated Toil, 3, 1903. 3
[52] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: Feed-forward synthesis of textures
     and stylized images. In ICML, 2016. 3
[53] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast
     stylization. arXiv preprint arXiv:1607.08022, 2016. 5
[54] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NIPS, 2016. 2
[55] F. Wang, Q. Huang, and L. J. Guibas. Image co-segmentation via consistent functional maps. In ICCV,
     2013. 3
[56] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In ECCV,
     2016. 2
[40] R. Š. Radim Tyleček. Spatial pattern templates for      [57] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenen-
     recognition of objects with regular structure. In Proc.        baum. Learning a probabilistic latent space of ob-
     GCPR, Saarbrucken, Germany, 2013. 8, 18                        ject shapes via 3d generative-adversarial modeling. In
[41] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele,          NIPS, 2016. 2
     and H. Lee. Generative adversarial text to image syn-     [58] S. Xie and Z. Tu. Holistically-nested edge detection.
     thesis. In ICML, 2016. 2                                       In ICCV, 2015. 2
[42] R. Rosales, K. Achan, and B. J. Frey. Unsupervised        [59] Z. Yi, H. Zhang, T. Gong, Tan, and M. Gong. Dual-
     image translation. In ICCV, 2003. 3                            gan: Unsupervised dual learning for image-to-image
[43] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung,             translation. In ICCV, 2017. 3
     A. Radford, and X. Chen. Improved techniques for          [60] A. Yu and K. Grauman. Fine-grained visual compar-
     training GANs. In NIPS, 2016. 2                                isons with local learning. In CVPR, 2014. 8, 18
[44] P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays. Scrib-   [61] C. Zach, M. Klopschitz, and M. Pollefeys. Disam-
     bler: Controlling deep image synthesis with sketch             biguating visual relations using loop constraints. In
     and color. In CVPR, 2017. 3                                    CVPR, 2010. 3
[45] Y. Shih, S. Paris, F. Durand, and W. T. Freeman. Data-    [62] R. Zhang, P. Isola, and A. A. Efros. Colorful image
     driven hallucination of different times of day from a          colorization. In ECCV, 2016. 2
     single outdoor photo. ACM TOG, 32(6):200, 2013. 2
                                                               [63] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based
[46] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind,             generative adversarial network. In ICLR, 2017. 2
     W. Wang, and R. Webb. Learning from simulated and
                                                               [64] T. Zhou, P. Krahenbuhl, M. Aubry, Q. Huang, and
     unsupervised images through adversarial training. In
                                                                    A. A. Efros. Learning dense correspondence via 3d-
     CVPR, 2017. 3, 5, 6, 7
                                                                    guided cycle consistency. In CVPR, 2016. 2, 3
[47] K. Simonyan and A. Zisserman. Very deep convolu-
     tional networks for large-scale image recognition. In     [65] T. Zhou, Y. J. Lee, S. Yu, and A. A. Efros. Flowweb:
     ICLR, 2015. 7                                                  Joint image set alignment by weaving consistent,
                                                                    pixel-wise correspondences. In CVPR, 2015. 3
[48] N. Sundaram, T. Brox, and K. Keutzer. Dense point
     trajectories by gpu-accelerated large displacement op-    [66] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A.
     tical flow. In ECCV, 2010. 3                                   Efros. Generative visual manipulation on the natural
                                                                    image manifold. In ECCV, 2016. 2
[49] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised
     cross-domain image generation. In ICLR, 2017. 3, 8
7. Appendix

7.1. Training details
   We train our networks from scratch, with a learning rate of 0.0002. In practice, we divide the objective by 2 while optimizing D, which slows down the rate at which D learns, relative to the rate of G. We keep the same learning rate for the first 100 epochs and linearly decay the rate to zero over the next 100 epochs. Weights are initialized from a Gaussian distribution N(0, 0.02).
   Cityscapes label↔Photo. 2975 training images from the Cityscapes training set [4] with image size 128 × 128. We used the Cityscapes val set for testing.
   Maps↔aerial photograph. 1096 training images were scraped from Google Maps [22] with image size 256 × 256. Images were sampled from in and around New York City. Data was then split into train and test about the median latitude of the sampling region (with a buffer region added to ensure that no training pixel appeared in the test set).
   Architectural facades labels↔photo. 400 training images from the CMP Facade Database [40].
   Edges→shoes. Around 50,000 training images from the UT Zappos50K dataset [60]. The model was trained for 5 epochs.
   Horse↔Zebra and Apple↔Orange. We downloaded the images from ImageNet [5] using the keywords wild horse, zebra, apple, and navel orange. The images were scaled to 256 × 256 pixels. The training set size of each class: 939 (horse), 1177 (zebra), 996 (apple), and 1020 (orange).
   Summer↔Winter Yosemite. The images were downloaded using the Flickr API with the tag yosemite and the date-taken field. Black-and-white photos were pruned. The images were scaled to 256 × 256 pixels. The training set size of each class: 1273 (summer) and 854 (winter).
   Photo↔Art for style transfer. The art images were downloaded from Wikiart.org. Some artworks that were sketches or too obscene were pruned by hand. The photos were downloaded from Flickr using the combination of tags landscape and landscapephotography. Black-and-white photos were pruned. The images were scaled to 256 × 256 pixels. The training set size of each class was 1074 (Monet), 584 (Cezanne), 401 (Van Gogh), 1433 (Ukiyo-e), and 6853 (Photographs). The Monet dataset was particularly pruned to include only landscape paintings, and the Van Gogh dataset included only his later works that represent his most recognizable artistic style.
   Monet's paintings→photos. To achieve high resolution while conserving memory, we used random square crops of the original images for training. To generate results, we passed images of width 512 pixels with correct aspect ratio to the generator network as input. The weight for the identity mapping loss was 0.5λ, where λ was the weight for the cycle consistency loss. We set λ = 10.
   Flower photo enhancement. Flower images taken on smartphones were downloaded from Flickr by searching for photos taken by Apple iPhone 5, 5s, or 6, with the search text flower. DSLR images with shallow DoF were also downloaded from Flickr using the search tags flower, dof. The images were scaled to a width of 360 pixels. The identity mapping loss of weight 0.5λ was used. The training set sizes of the smartphone and DSLR datasets were 1813 and 3326, respectively. We set λ = 10.

7.2. Network architectures
   We provide both PyTorch and Torch implementations.
   Generator architectures. We adopt our architectures from Johnson et al. [23]. We use 6 residual blocks for 128 × 128 training images, and 9 residual blocks for 256 × 256 or higher-resolution training images. Below, we follow the naming convention used in Johnson et al.'s GitHub repository.
   Let c7s1-k denote a 7 × 7 Convolution-InstanceNorm-ReLU layer with k filters and stride 1. dk denotes a 3 × 3 Convolution-InstanceNorm-ReLU layer with k filters and stride 2. Reflection padding was used to reduce artifacts. Rk denotes a residual block that contains two 3 × 3 convolutional layers with the same number of filters on both layers. uk denotes a 3 × 3 fractional-strided-Convolution-InstanceNorm-ReLU layer with k filters and stride 1/2.
   The network with 6 residual blocks consists of:
c7s1-64,d128,d256,R256,R256,R256,R256,R256,R256,u128,u64,c7s1-3
   The network with 9 residual blocks consists of:
c7s1-64,d128,d256,R256,R256,R256,R256,R256,R256,R256,R256,R256,u128,u64,c7s1-3
   Discriminator architectures. For discriminator networks, we use 70 × 70 PatchGAN [22]. Let Ck denote a 4 × 4 Convolution-InstanceNorm-LeakyReLU layer with k filters and stride 2. After the last layer, we apply a convolution to produce a 1-dimensional output. We do not use InstanceNorm for the first C64 layer. We use leaky ReLUs with a slope of 0.2. The discriminator architecture is:
C64-C128-C256-C512
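
   The learning-rate schedule described in the training details (a constant rate for 100 epochs, then a linear decay to zero over the next 100) can be sketched as a small function. This is only an illustrative sketch: the function name, the 0-indexed epoch convention, and the choice that the rate first reaches exactly zero at epoch 200 are assumptions made here, not details stated in the text.

```python
def learning_rate(epoch, base_lr=0.0002, n_constant=100, n_decay=100):
    """Learning rate for a 0-indexed epoch (hypothetical helper).

    Assumption: epochs 0..n_constant-1 use base_lr, and the rate then
    falls linearly so that it is zero at epoch n_constant + n_decay.
    """
    if epoch < n_constant:
        return base_lr
    # Fraction of the decay phase still remaining, clipped at zero.
    remaining = max(0.0, 1.0 - (epoch - n_constant) / n_decay)
    return base_lr * remaining


if __name__ == "__main__":
    for e in (0, 99, 100, 150, 199, 200):
        print(e, learning_rate(e))
```

   A different discretization (for example, starting the decay one epoch earlier) would shift the values by one step without changing the overall shape.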
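
   The generator naming convention (c7s1-k, dk, Rk, uk) can be made concrete with a short parser that expands a specification string into per-layer descriptions. The helper names and the tuple layout below are hypothetical, and the stride of 1 inside residual blocks is an assumption, since the text does not state it. The sketch also checks that the strides multiply to 1, meaning the output would have the same spatial size as the input under these assumptions.

```python
import re
from fractions import Fraction


def generator_spec(n_blocks):
    """Build the spec string for a generator with n_blocks residual blocks."""
    tokens = ["c7s1-64", "d128", "d256"] + ["R256"] * n_blocks
    tokens += ["u128", "u64", "c7s1-3"]
    return ",".join(tokens)


def parse_generator_spec(spec):
    """Expand a spec string into (kind, kernel, stride, filters) tuples."""
    layers = []
    for token in spec.split(","):
        token = token.strip()
        m = re.fullmatch(r"c7s1-(\d+)", token)
        if m:  # 7x7 Convolution-InstanceNorm-ReLU, stride 1
            layers.append(("conv", 7, Fraction(1), int(m.group(1))))
            continue
        m = re.fullmatch(r"([dRu])(\d+)", token)
        if m is None:
            raise ValueError(f"unrecognized token: {token!r}")
        kind, k = m.group(1), int(m.group(2))
        if kind == "d":    # 3x3 downsampling convolution, stride 2
            layers.append(("conv", 3, Fraction(2), k))
        elif kind == "R":  # residual block; stride 1 is an assumption
            layers.append(("resblock", 3, Fraction(1), k))
        else:              # 3x3 fractional-strided convolution, stride 1/2
            layers.append(("upconv", 3, Fraction(1, 2), k))
    return layers


def net_stride(layers):
    """Product of all strides; 1 means output size equals input size."""
    total = Fraction(1)
    for _, _, stride, _ in layers:
        total *= stride
    return total


if __name__ == "__main__":
    spec = generator_spec(6)
    print(spec)
    print(net_stride(parse_generator_spec(spec)))
```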
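
   The "70 × 70" in the PatchGAN name refers to the receptive field of each output unit, which follows from the kernel sizes and strides of the stacked convolutions. The sketch below applies the standard recurrence rf = (rf - 1) * stride + kernel from the output back to the input. The text does not give the kernel size of the final convolution, so a 4 × 4 kernel with stride 1 is assumed here. Under that assumption, four stride-2 layers as literally described would give 94 rather than 70. A 70-pixel receptive field results if the C512 layer uses stride 1, which, to our understanding, is what the publicly released code does. Both configurations are shown, and the second should be read as an assumption, not as a statement from the text.

```python
def receptive_field(layers):
    """Receptive field of one output unit.

    layers: list of (kernel, stride) pairs ordered from input to output.
    """
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


# C64-C128-C256-C512 with stride 2 everywhere, plus an assumed
# 4x4, stride-1 final convolution.
as_described = [(4, 2), (4, 2), (4, 2), (4, 2), (4, 1)]

# Assumed variant: C512 uses stride 1 instead of 2.
stride1_last = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]

if __name__ == "__main__":
    print(receptive_field(as_described))
    print(receptive_field(stride1_last))
```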