Source: History of AI Computing Architectures

Unpaired Image-to-Image Translation
using Cycle-Consistent Adversarial Networks

Jun-Yan Zhu∗ Taesung Park∗ Phillip Isola Alexei A. Efros
Berkeley AI Research (BAIR) laboratory, UC Berkeley

arXiv:1703.10593v7 [cs.CV] 24 Aug 2020

[Figure 1 panels. Top: Monet ↔ photos, zebras ↔ horses, summer ↔ winter. Bottom: a photograph rendered in the styles of Monet, Van Gogh, Cezanne, and Ukiyo-e.]
Figure 1: Given any two unordered image collections X and Y , our algorithm learns to automatically “translate” an image
from one into the other and vice versa: (left) Monet paintings and landscape photos from Flickr; (center) zebras and horses
from ImageNet; (right) summer and winter Yosemite photos from Flickr. Example application (bottom): using a collection
of paintings of famous artists, our method learns to render natural photographs into the respective styles.

* indicates equal contribution

Abstract

Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, for many tasks, paired training data will not be available. We present an approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples. Our goal is to learn a mapping G : X → Y such that the distribution of images from G(X) is indistinguishable from the distribution Y using an adversarial loss. Because this mapping is highly under-constrained, we couple it with an inverse mapping F : Y → X and introduce a cycle consistency loss to enforce F(G(X)) ≈ X (and vice versa). Qualitative results are presented on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, photo enhancement, etc. Quantitative comparisons against several prior methods demonstrate the superiority of our approach.

1. Introduction

What did Claude Monet see as he placed his easel by the bank of the Seine near Argenteuil on a lovely spring day in 1873 (Figure 1, top-left)? A color photograph, had it been invented, may have documented a crisp blue sky and a glassy river reflecting it. Monet conveyed his impression of this same scene through wispy brush strokes and a bright palette.

What if Monet had happened upon the little harbor in Cassis on a cool summer evening (Figure 1, bottom-left)? A brief stroll through a gallery of Monet paintings makes it possible to imagine how he would have rendered the scene: perhaps in pastel shades, with abrupt dabs of paint, and a somewhat flattened dynamic range.

We can imagine all this despite never having seen a side by side example of a Monet painting next to a photo of the scene he painted. Instead, we have knowledge of the set of Monet paintings and of the set of landscape photographs. We can reason about the stylistic differences between these
two sets, and thereby imagine what a scene might look like if we were to “translate” it from one set into the other.

In this paper, we present a method that can learn to do the same: capturing special characteristics of one image collection and figuring out how these characteristics could be translated into the other image collection, all in the absence of any paired training examples.

This problem can be more broadly described as image-to-image translation [22], converting an image from one representation of a given scene, x, to another, y, e.g., grayscale to color, image to semantic labels, edge-map to photograph. Years of research in computer vision, image processing, computational photography, and graphics have produced powerful translation systems in the supervised setting, where example image pairs $\{x_i, y_i\}_{i=1}^{N}$ are available (Figure 2, left), e.g., [11, 19, 22, 23, 28, 33, 45, 56, 58, 62]. However, obtaining paired training data can be difficult and expensive. For example, only a couple of datasets exist for tasks like semantic segmentation (e.g., [4]), and they are relatively small. Obtaining input-output pairs for graphics tasks like artistic stylization can be even more difficult since the desired output is highly complex, typically requiring artistic authoring. For many tasks, like object transfiguration (e.g., zebra↔horse, Figure 1 top-middle), the desired output is not even well-defined.

[Figure 2 panels: Paired (left), Unpaired (right)]
Figure 2: Paired training data (left) consists of training examples $\{x_i, y_i\}_{i=1}^{N}$, where the correspondence between $x_i$ and $y_i$ exists [22]. We instead consider unpaired training data (right), consisting of a source set $\{x_i\}_{i=1}^{N}$ ($x_i \in X$) and a target set $\{y_j\}_{j=1}^{M}$ ($y_j \in Y$), with no information provided as to which $x_i$ matches which $y_j$.

We therefore seek an algorithm that can learn to translate between domains without paired input-output examples (Figure 2, right). We assume there is some underlying relationship between the domains – for example, that they are two different renderings of the same underlying scene – and seek to learn that relationship. Although we lack supervision in the form of paired examples, we can exploit supervision at the level of sets: we are given one set of images in domain X and a different set in domain Y. We may train a mapping G : X → Y such that the output ŷ = G(x), x ∈ X, is indistinguishable from images y ∈ Y by an adversary trained to classify ŷ apart from y. In theory, this objective can induce an output distribution over ŷ that matches the empirical distribution $p_{data}(y)$ (in general, this requires G to be stochastic) [16]. The optimal G thereby translates the domain X to a domain Ŷ distributed identically to Y. However, such a translation does not guarantee that an individual input x and output y are paired up in a meaningful way – there are infinitely many mappings G that will induce the same distribution over ŷ. Moreover, in practice, we have found it difficult to optimize the adversarial objective in isolation: standard procedures often lead to the well-known problem of mode collapse, where all input images map to the same output image and the optimization fails to make progress [15].

These issues call for adding more structure to our objective. Therefore, we exploit the property that translation should be “cycle consistent”, in the sense that if we translate, e.g., a sentence from English to French, and then translate it back from French to English, we should arrive back at the original sentence [3]. Mathematically, if we have a translator G : X → Y and another translator F : Y → X, then G and F should be inverses of each other, and both mappings should be bijections. We apply this structural assumption by training both the mapping G and F simultaneously, and adding a cycle consistency loss [64] that encourages F(G(x)) ≈ x and G(F(y)) ≈ y. Combining this loss with adversarial losses on domains X and Y yields our full objective for unpaired image-to-image translation.

We apply our method to a wide range of applications, including collection style transfer, object transfiguration, season transfer and photo enhancement. We also compare against previous approaches that rely either on hand-defined factorizations of style and content, or on shared embedding functions, and show that our method outperforms these baselines. We provide both PyTorch and Torch implementations. Check out more results at our website.

2. Related work

Generative Adversarial Networks (GANs) [16, 63] have achieved impressive results in image generation [6, 39], image editing [66], and representation learning [39, 43, 37]. Recent methods adopt the same idea for conditional image generation applications, such as text2image [41], image inpainting [38], and future prediction [36], as well as to other domains like videos [54] and 3D data [57]. The key to GANs’ success is the idea of an adversarial loss that forces the generated images to be, in principle, indistinguishable from real photos. This loss is particularly powerful for image generation tasks, as this is exactly the objective that much of computer graphics aims to optimize. We adopt an adversarial loss to learn the mapping such that the translated
images cannot be distinguished from images in the target domain.

Image-to-Image Translation The idea of image-to-image translation goes back at least to Hertzmann et al.’s Image Analogies [19], who employ a non-parametric texture model [10] on a single input-output training image pair. More recent approaches use a dataset of input-output examples to learn a parametric translation function using CNNs (e.g., [33]). Our approach builds on the “pix2pix” framework of Isola et al. [22], which uses a conditional generative adversarial network [16] to learn a mapping from input to output images. Similar ideas have been applied to various tasks such as generating photographs from sketches [44] or from attribute and semantic layouts [25]. However, unlike the above prior work, we learn the mapping without paired training examples.

Unpaired Image-to-Image Translation Several other methods also tackle the unpaired setting, where the goal is to relate two data domains: X and Y. Rosales et al. [42] propose a Bayesian framework that includes a prior based on a patch-based Markov random field computed from a source image and a likelihood term obtained from multiple style images. More recently, CoGAN [32] and cross-modal scene networks [1] use a weight-sharing strategy to learn a common representation across domains. Concurrent to our method, Liu et al. [31] extends the above framework with a combination of variational autoencoders [27] and generative adversarial networks [16]. Another line of concurrent work [46, 49, 2] encourages the input and output to share specific “content” features even though they may differ in “style”. These methods also use adversarial networks, with additional terms to enforce the output to be close to the input in a predefined metric space, such as class label space [2], image pixel space [46], and image feature space [49].

Unlike the above approaches, our formulation does not rely on any task-specific, predefined similarity function between the input and output, nor do we assume that the input and output have to lie in the same low-dimensional embedding space. This makes our method a general-purpose solution for many vision and graphics tasks. We directly compare against several prior and contemporary approaches in Section 5.1.

Cycle Consistency The idea of using transitivity as a way to regularize structured data has a long history. In visual tracking, enforcing simple forward-backward consistency has been a standard trick for decades [24, 48]. In the language domain, verifying and improving translations via “back translation and reconciliation” is a technique used by human translators [3] (including, humorously, by Mark Twain [51]), as well as by machines [17]. More recently, higher-order cycle consistency has been used in structure from motion [61], 3D shape matching [21], co-segmentation [55], dense semantic alignment [65, 64], and depth estimation [14]. Of these, Zhou et al. [64] and Godard et al. [14] are most similar to our work, as they use a cycle consistency loss as a way of using transitivity to supervise CNN training. In this work, we are introducing a similar loss to push G and F to be consistent with each other. Concurrent with our work, in these same proceedings, Yi et al. [59] independently use a similar objective for unpaired image-to-image translation, inspired by dual learning in machine translation [17].

Neural Style Transfer [13, 23, 52, 12] is another way to perform image-to-image translation, which synthesizes a novel image by combining the content of one image with the style of another image (typically a painting) based on matching the Gram matrix statistics of pre-trained deep features. Our primary focus, on the other hand, is learning the mapping between two image collections, rather than between two specific images, by trying to capture correspondences between higher-level appearance structures. Therefore, our method can be applied to other tasks, such as painting→photo, object transfiguration, etc. where single sample transfer methods do not perform well. We compare these two methods in Section 5.2.

[Figure 3 panels (a), (b), (c): diagram of the mappings G and F, the discriminators D_X and D_Y, and the two cycle-consistency losses]
Figure 3: (a) Our model contains two mapping functions G : X → Y and F : Y → X, and associated adversarial discriminators D_Y and D_X. D_Y encourages G to translate X into outputs indistinguishable from domain Y, and vice versa for D_X and F. To further regularize the mappings, we introduce two cycle consistency losses that capture the intuition that if we translate from one domain to the other and back again we should arrive at where we started: (b) forward cycle-consistency loss: x → G(x) → F(G(x)) ≈ x, and (c) backward cycle-consistency loss: y → F(y) → G(F(y)) ≈ y.

[Figure 4 column headers: Input x, Output G(x), Reconstruction F(G(x))]
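
As a toy illustration of the forward and backward cycles described in the Figure 3 caption, the following Python sketch uses scalar functions as stand-ins for the two translators. Everything here is an assumption made for illustration: the functions `G`, `F_good`, `F_bad` and the helper `cycle_l1` are hypothetical and are not the authors' networks, which are CNNs operating on images.

```python
# Toy stand-ins for the two translators; the real mappings are CNNs.
def G(x):            # X -> Y
    return 2.0 * x + 1.0

def F_good(y):       # Y -> X, exact inverse of G
    return (y - 1.0) / 2.0

def F_bad(y):        # Y -> X, NOT an inverse of G
    return y - 1.0

def cycle_l1(first, second, samples):
    """Mean absolute gap between each sample and its round trip."""
    return sum(abs(second(first(s)) - s) for s in samples) / len(samples)

xs = [0.0, 1.0, 2.0]   # samples from domain X
ys = [1.0, 3.0, 5.0]   # samples from domain Y

forward_good = cycle_l1(G, F_good, xs)    # x -> G(x) -> F(G(x))
backward_good = cycle_l1(F_good, G, ys)   # y -> F(y) -> G(F(y))
forward_bad = cycle_l1(G, F_bad, xs)      # round trip drifts away from x
```

When the second mapping is a true inverse of the first, both round trips return the starting point and the measured gap is zero; with `F_bad` the gap is positive, which is the kind of inconsistency a cycle-consistency penalty is meant to discourage.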

3. Formulation
Our goal is to learn mapping functions between two
domains X and Y given training samples $\{x_i\}_{i=1}^{N}$ where
$x_i \in X$ and $\{y_j\}_{j=1}^{M}$ where $y_j \in Y$.¹ We denote the data
distribution as $x \sim p_{data}(x)$ and $y \sim p_{data}(y)$. As illustrated
in Figure 3 (a), our model includes two mappings
G : X → Y and F : Y → X. In addition, we in-
troduce two adversarial discriminators DX and DY , where
DX aims to distinguish between images {x} and translated
images {F (y)}; in the same way, DY aims to discriminate
between {y} and {G(x)}. Our objective contains two types
of terms: adversarial losses [16] for matching the distribu-
tion of generated images to the data distribution in the target
domain; and cycle consistency losses to prevent the learned
mappings G and F from contradicting each other.
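
To make the first kind of term concrete, here is a minimal numerical sketch of the standard GAN value from [16], i.e. the average of log D(y) over real samples plus the average of log(1 − D(G(x))) over translated samples. The function name `gan_value` and the example probabilities are illustrative assumptions; in practice the inputs would be the outputs of a trained discriminator network.

```python
import math

def gan_value(d_real, d_fake):
    """Sample estimate of E[log D(y)] + E[log(1 - D(G(x)))].

    d_real: discriminator outputs in (0, 1) on real target-domain images.
    d_fake: discriminator outputs in (0, 1) on translated images.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from translated images well.
confident = gan_value([0.99, 0.98], [0.01, 0.02])
# A discriminator that cannot tell them apart and outputs 0.5 everywhere.
fooled = gan_value([0.5, 0.5], [0.5, 0.5])
```

The discriminator tries to push this value up (towards 0), while the generator tries to push it down; when the discriminator is reduced to guessing, the value equals 2 log 0.5.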

3.1. Adversarial Loss
We apply adversarial losses [16] to both mapping functions. For the mapping function G : X → Y and its discriminator D_Y, we express the objective as:

$$\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))], \tag{1}$$

where G tries to generate images G(x) that look similar to images from domain Y, while D_Y aims to distinguish between translated samples G(x) and real samples y. G aims to minimize this objective against an adversary D that tries to maximize it, i.e., $\min_G \max_{D_Y} \mathcal{L}_{GAN}(G, D_Y, X, Y)$. We introduce a similar adversarial loss for the mapping function F : Y → X and its discriminator D_X as well: i.e., $\min_F \max_{D_X} \mathcal{L}_{GAN}(F, D_X, Y, X)$.

3.2. Cycle Consistency Loss

Adversarial training can, in theory, learn mappings G and F that produce outputs identically distributed as target domains Y and X respectively (strictly speaking, this requires G and F to be stochastic functions) [15]. However, with large enough capacity, a network can map the same set of input images to any random permutation of images in the target domain, where any of the learned mappings can induce an output distribution that matches the target distribution. Thus, adversarial losses alone cannot guarantee that the learned function can map an individual input $x_i$ to a desired output $y_i$. To further reduce the space of possible mapping functions, we argue that the learned mapping functions should be cycle-consistent: as shown in Figure 3 (b), for each image x from domain X, the image translation cycle should be able to bring x back to the original image, i.e., x → G(x) → F(G(x)) ≈ x. We call this forward cycle consistency. Similarly, as illustrated in Figure 3 (c), for each image y from domain Y, G and F should also satisfy backward cycle consistency: y → F(y) → G(F(y)) ≈ y. We incentivize this behavior using a cycle consistency loss:

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim p_{data}(y)}[\|G(F(y)) - y\|_1]. \tag{2}$$

In preliminary experiments, we also tried replacing the L1 norm in this loss with an adversarial loss between F(G(x)) and x, and between G(F(y)) and y, but did not observe improved performance.

The behavior induced by the cycle consistency loss can be observed in Figure 4: the reconstructed images F(G(x)) end up matching closely to the input images x.

Figure 4: The input images x, output images G(x) and the reconstructed images F(G(x)) from various experiments. From top to bottom: photo↔Cezanne, horses↔zebras, winter→summer Yosemite, aerial photos↔Google maps.

1 We often omit the subscript i and j for simplicity.

3.3. Full Objective

Our full objective is:

$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda \mathcal{L}_{cyc}(G, F), \tag{3}$$
where λ controls the relative importance of the two objectives. We aim to solve:

$$G^*, F^* = \arg\min_{G, F} \max_{D_X, D_Y} \mathcal{L}(G, F, D_X, D_Y). \tag{4}$$

Notice that our model can be viewed as training two “autoencoders” [20]: we learn one autoencoder F ◦ G : X → X jointly with another G ◦ F : Y → Y. However, these autoencoders each have special internal structures: they map an image to itself via an intermediate representation that is a translation of the image into another domain. Such a setup can also be seen as a special case of “adversarial autoencoders” [34], which use an adversarial loss to train the bottleneck layer of an autoencoder to match an arbitrary target distribution. In our case, the target distribution for the X → X autoencoder is that of the domain Y.

In Section 5.1.4, we compare our method against ablations of the full objective, including the adversarial loss $\mathcal{L}_{GAN}$ alone and the cycle consistency loss $\mathcal{L}_{cyc}$ alone, and empirically show that both objectives play critical roles in arriving at high-quality results. We also evaluate our method with only cycle loss in one direction and show that a single cycle is not sufficient to regularize the training for this under-constrained problem.

4. Implementation

Network Architecture We adopt the architecture for our generative networks from Johnson et al. [23] who have shown impressive results for neural style transfer and super-resolution. This network contains three convolutions, several residual blocks [18], two fractionally-strided convolutions with stride 1/2, and one convolution that maps features to RGB. We use 6 blocks for 128 × 128 images and 9 blocks for 256 × 256 and higher-resolution training images. Similar to Johnson et al. [23], we use instance normalization [53]. For the discriminator networks we use 70 × 70 PatchGANs [22, 30, 29], which aim to classify whether 70 × 70 overlapping image patches are real or fake. Such a patch-level discriminator architecture has fewer parameters than a full-image discriminator and can work on arbitrarily-sized images in a fully convolutional fashion [22].

Training details We apply two techniques from recent works to stabilize our model training procedure. First, for $\mathcal{L}_{GAN}$ (Equation 1), we replace the negative log likelihood objective by a least-squares loss [35]. This loss is more stable during training and generates higher quality results. In particular, for a GAN loss $\mathcal{L}_{GAN}(G, D, X, Y)$, we train the G to minimize $\mathbb{E}_{x \sim p_{data}(x)}[(D(G(x)) - 1)^2]$ and train the D to minimize $\mathbb{E}_{y \sim p_{data}(y)}[(D(y) - 1)^2] + \mathbb{E}_{x \sim p_{data}(x)}[D(G(x))^2]$.

Second, to reduce model oscillation [15], we follow Shrivastava et al.’s strategy [46] and update the discriminators using a history of generated images rather than the ones produced by the latest generators. We keep an image buffer that stores the 50 previously created images.

For all the experiments, we set λ = 10 in Equation 3. We use the Adam solver [26] with a batch size of 1. All networks were trained from scratch with a learning rate of 0.0002. We keep the same learning rate for the first 100 epochs and linearly decay the rate to zero over the next 100 epochs. Please see the appendix (Section 7) for more details about the datasets, architectures, and training procedures.

5. Results

We first compare our approach against recent methods for unpaired image-to-image translation on paired datasets where ground truth input-output pairs are available for evaluation. We then study the importance of both the adversarial loss and the cycle consistency loss and compare our full method against several variants. Finally, we demonstrate the generality of our algorithm on a wide range of applications where paired data does not exist. For brevity, we refer to our method as CycleGAN. The PyTorch and Torch code, models, and full results can be found at our website.

5.1. Evaluation

Using the same evaluation datasets and metrics as “pix2pix” [22], we compare our method against several baselines both qualitatively and quantitatively. The tasks include semantic labels↔photo on the Cityscapes dataset [4], and map↔aerial photo on data scraped from Google Maps. We also perform ablation study on the full loss function.

5.1.1 Evaluation Metrics

AMT perceptual studies On the map↔aerial photo task, we run “real vs fake” perceptual studies on Amazon Mechanical Turk (AMT) to assess the realism of our outputs. We follow the same perceptual study protocol from Isola et al. [22], except we only gather data from 25 participants per algorithm we tested. Participants were shown a sequence of pairs of images, one a real photo or map and one fake (generated by our algorithm or a baseline), and asked to click on the image they thought was real. The first 10 trials of each session were practice and feedback was given as to whether the participant’s response was correct or incorrect. The remaining 40 trials were used to assess the rate at which each algorithm fooled participants. Each session only tested a single algorithm, and participants were only allowed to complete a single session. The numbers we report here are not directly comparable to those in [22] as our ground truth images were processed slightly differently² and the participant pool we tested may be differently distributed from those tested in [22] (due to running the experiment at a different date and time). Therefore, our numbers should only be used to compare our current method against the baselines (which were run under identical conditions), rather than against [22].

2 We train all the models on 256 × 256 images while in pix2pix [22], the model was trained on 256 × 256 patches of 512 × 512 images, and run convolutionally on the 512 × 512 images at test time. We choose 256 × 256 in our experiments as many baselines cannot scale up to high-resolution images, and CoGAN cannot be tested fully convolutionally.

[Figure 5 column headers: Input, BiGAN, CoGAN, feature loss + GAN, SimGAN, CycleGAN, pix2pix, Ground truth]
Figure 5: Different methods for mapping labels↔photos trained on Cityscapes images. From left to right: input, BiGAN/ALI [7, 9], CoGAN [32], feature loss + GAN, SimGAN [46], CycleGAN (ours), pix2pix [22] trained on paired data, and ground truth.

[Figure 6 column headers: Input, BiGAN, CoGAN, feature loss + GAN, SimGAN, CycleGAN, pix2pix, Ground truth]
Figure 6: Different methods for mapping aerial photos↔maps on Google Maps. From left to right: input, BiGAN/ALI [7, 9], CoGAN [32], feature loss + GAN, SimGAN [46], CycleGAN (ours), pix2pix [22] trained on paired data, and ground truth.

FCN score Although perceptual studies may be the gold standard for assessing graphical realism, we also seek an automatic quantitative measure that does not require human experiments. For this, we adopt the “FCN score” from [22], and use it to evaluate the Cityscapes labels→photo task. The FCN metric evaluates how interpretable the generated photos are according to an off-the-shelf semantic segmentation algorithm (the fully-convolutional network, FCN, from [33]). The FCN predicts a label map for a generated photo. This label map can then be compared against the input ground truth labels using standard semantic segmentation metrics described below. The intuition is that if we generate a photo from a label map of “car on the road”, then we have succeeded if the FCN applied to the generated photo detects “car on the road”.

Semantic segmentation metrics To evaluate the performance of photo→labels, we use the standard metrics from the Cityscapes benchmark [4], including per-pixel accuracy, per-class accuracy, and mean class Intersection-Over-Union (Class IOU) [4].

5.1.2 Baselines

CoGAN [32] This method learns one GAN generator for domain X and one for domain Y, with tied weights on the first few layers for shared latent representations. Translation from X to Y can be achieved by finding a latent representation that generates image X and then rendering this latent representation into style Y.

SimGAN [46] Like our method, Shrivastava et al. [46] uses an adversarial loss to train a translation from X to Y.
The regularization term $\|x - G(x)\|_1$ is used to penalize making large changes at pixel level.

Feature loss + GAN We also test a variant of SimGAN [46] where the L1 loss is computed over deep image features using a pretrained network (VGG-16 relu4_2 [47]), rather than over RGB pixel values. Computing distances in deep feature space, like this, is also sometimes referred to as using a “perceptual loss” [8, 23].

BiGAN/ALI [9, 7] Unconditional GANs [16] learn a generator G : Z → X, that maps a random noise z to an image x. The BiGAN [9] and ALI [7] propose to also learn the inverse mapping function F : X → Z. Though they were originally designed for mapping a latent vector z to an image x, we implemented the same objective for mapping a source image x to a target image y.

pix2pix [22] We also compare against pix2pix [22], which is trained on paired data, to see how close we can get to this “upper bound” without using any paired data.

For a fair comparison, we implement all the baselines using the same architecture and details as our method, except for CoGAN [32]. CoGAN builds on generators that produce images from a shared latent representation, which is incompatible with our image-to-image network. We use the public implementation of CoGAN instead.

5.1.3 Comparison against baselines

As can be seen in Figure 5 and Figure 6, we were unable to achieve compelling results with any of the baselines. Our method, on the other hand, can produce translations that are often of similar quality to the fully supervised pix2pix.

Table 1 reports performance regarding the AMT perceptual realism task. Here, we see that our method can fool participants on around a quarter of trials, in both the maps→aerial photos direction and the aerial photos→maps direction at 256 × 256 resolution³. All the baselines almost never fooled participants.

Table 2 assesses the performance of the labels→photo task on the Cityscapes and Table 3 evaluates the opposite mapping (photos→labels). In both cases, our method again outperforms the baselines.

5.1.4 Analysis of the loss function

In Table 4 and Table 5, we compare against ablations of our full loss. Removing the GAN loss substantially degrades results, as does removing the cycle-consistency loss. We therefore conclude that both terms are critical to our results. We also evaluate our method with the cycle loss in only one direction: GAN + forward cycle loss $\mathbb{E}_{x \sim p_{data}(x)}[\|F(G(x)) - x\|_1]$, or GAN + backward cycle loss $\mathbb{E}_{y \sim p_{data}(y)}[\|G(F(y)) - y\|_1]$ (Equation 2) and find that it often incurs training instability and causes mode collapse, especially for the direction of the mapping that was removed. Figure 7 shows several qualitative examples.

5.1.5 Image reconstruction quality

In Figure 4, we show a few random samples of the reconstructed images F(G(x)). We observed that the reconstructed images were often close to the original inputs x, at both training and testing time, even in cases where one domain represents significantly more diverse information, such as map↔aerial photos.

3 We also train CycleGAN and pix2pix at 512 × 512 resolution, and observe the comparable performance: maps→aerial photos: CycleGAN: 37.5% ± 3.6% and pix2pix: 33.9% ± 3.1%; aerial photos→maps: CycleGAN: 16.5% ± 4.1% and pix2pix: 8.5% ± 2.6%

                        Map → Photo               Photo → Map
Loss                    % Turkers labeled real    % Turkers labeled real
CoGAN [32]              0.6% ± 0.5%               0.9% ± 0.5%
BiGAN/ALI [9, 7]        2.1% ± 1.0%               1.9% ± 0.9%
SimGAN [46]             0.7% ± 0.5%               2.6% ± 1.1%
Feature loss + GAN      1.2% ± 0.6%               0.3% ± 0.2%
CycleGAN (ours)         26.8% ± 2.8%              23.2% ± 3.4%
Table 1: AMT “real vs fake” test on maps↔aerial photos at 256 × 256 resolution.

Loss                    Per-pixel acc.    Per-class acc.    Class IOU
CoGAN [32]              0.40              0.10              0.06
BiGAN/ALI [9, 7]        0.19              0.06              0.02
SimGAN [46]             0.20              0.10              0.04
Feature loss + GAN      0.06              0.04              0.01
CycleGAN (ours)         0.52              0.17              0.11
pix2pix [22]            0.71              0.25              0.18
Table 2: FCN-scores for different methods, evaluated on Cityscapes labels→photo.

Loss                    Per-pixel acc.    Per-class acc.    Class IOU
CoGAN [32]              0.45              0.11              0.08
BiGAN/ALI [9, 7]        0.41              0.13              0.07
SimGAN [46]             0.47              0.11              0.07
Feature loss + GAN      0.50              0.10              0.06
CycleGAN (ours)         0.58              0.22              0.16
pix2pix [22]            0.85              0.40              0.32
Table 3: Classification performance of photo→labels for different methods on Cityscapes.

Loss                    Per-pixel acc.    Per-class acc.    Class IOU
Cycle alone             0.22              0.07              0.02
GAN alone               0.51              0.11              0.08
GAN + forward cycle     0.55              0.18              0.12
GAN + backward cycle    0.39              0.14              0.06
CycleGAN (ours)         0.52              0.17              0.11
Table 4: Ablation study: FCN-scores for different variants of our method, evaluated on Cityscapes labels→photo.

Loss                    Per-pixel acc.    Per-class acc.    Class IOU
Cycle alone             0.10              0.05              0.02
GAN alone               0.53              0.11              0.07
GAN + forward cycle     0.49              0.11              0.07
GAN + backward cycle    0.01              0.06              0.01
CycleGAN (ours)         0.58              0.22              0.16
Table 5: Ablation study: classification performance of photo→labels for different losses, evaluated on Cityscapes.
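
The three columns reported in Tables 2 to 5 (per-pixel accuracy, per-class accuracy, and class IOU) can all be derived from a confusion matrix between ground-truth and predicted labels. The sketch below uses the usual textbook definitions on flat label lists; the function names are hypothetical, and the official Cityscapes evaluation scripts may differ in details such as ignored labels, so this is an illustration rather than the exact protocol used for the tables.

```python
def confusion_matrix(truth, pred, num_classes):
    """m[t][p] counts pixels with true label t that were predicted as p."""
    m = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, pred):
        m[t][p] += 1
    return m

def segmentation_scores(m):
    """Return (per-pixel accuracy, per-class accuracy, mean class IoU)."""
    n = len(m)
    total = sum(sum(row) for row in m)
    pixel_acc = sum(m[c][c] for c in range(n)) / total
    class_accs, ious = [], []
    for c in range(n):
        tp = m[c][c]
        gt = sum(m[c])                              # pixels truly of class c
        predicted = sum(m[r][c] for r in range(n))  # pixels predicted as c
        if gt == 0:
            continue  # assumption: skip classes absent from the ground truth
        class_accs.append(tp / gt)
        ious.append(tp / (gt + predicted - tp))     # TP / (TP + FP + FN)
    return pixel_acc, sum(class_accs) / len(class_accs), sum(ious) / len(ious)

# Six "pixels", two classes; two of the six predictions are wrong.
truth = [0, 0, 0, 0, 1, 1]
pred = [0, 0, 0, 1, 1, 0]
scores = segmentation_scores(confusion_matrix(truth, pred, 2))
```

Per-pixel accuracy is dominated by frequent classes, whereas the per-class and IoU averages weight every class equally, which is one reason the three columns can rank methods differently.
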
[Figure 7 column headers: Input, Cycle alone, GAN alone, GAN+forward, GAN+backward, CycleGAN, Ground truth]

Figure 7: Different variants of our method for mapping labels↔photos trained on cityscapes. From left to right: input, cycle-
consistency loss alone, adversarial loss alone, GAN + forward cycle-consistency loss (F (G(x)) ≈ x), GAN + backward
cycle-consistency loss (G(F (y)) ≈ y), CycleGAN (our full method), and ground truth. Both Cycle alone and GAN +
backward fail to produce images similar to the target domain. GAN alone and GAN + forward suffer from mode collapse,
producing identical label maps regardless of the input photo.
[Figure 8 panels: Input/Output pairs for label → facade, facade → label, edges → shoes, and shoes → edges]

Figure 8: Example results of CycleGAN on paired datasets used in “pix2pix” [22] such as architectural
labels↔photos and edges↔shoes.

5.1.6 Additional results on paired datasets
Figure 8 shows some example results on other paired datasets used in “pix2pix” [22], such as
architectural labels↔photos from the CMP Facade Database [40], and edges↔shoes from the
UT Zappos50K dataset [60]. The image quality of our results is close to those produced by the
fully supervised pix2pix while our method learns the mapping without paired supervision.

5.2. Applications
We demonstrate our method on several applications where paired training data does not exist.
Please refer to the appendix (Section 7) for more details about the datasets. We observe that
translations on training data are often more appealing than those on test data, and full results
of all applications on both training and test data can be viewed on our project website.
Collection style transfer (Figure 10 and Figure 11) We train the model on landscape photographs
downloaded from Flickr and WikiArt. Unlike recent work on “neural style transfer” [13], our method
learns to mimic the style of an entire collection of artworks, rather than transferring the style
of a single selected piece of art. Therefore, we can learn to generate photos in the style of,
e.g., Van Gogh, rather than just in the style of Starry Night. The size of the dataset for each
artist/style was 526, 1073, 400, and 563 for Cezanne, Monet, Van Gogh, and Ukiyo-e.
Object transfiguration (Figure 13) The model is trained to translate one object class from
ImageNet [5] to another (each class contains around 1000 training images). Turmukhambetov et
al. [50] propose a subspace model to translate one object into another object of the same category,
while our method focuses on object transfiguration between two visually similar categories.
Season transfer (Figure 13) The model is trained on 854 winter photos and 1273 summer photos of
Yosemite downloaded from Flickr.
Photo generation from paintings (Figure 12) For painting→photo, we find that it is helpful to
introduce an additional loss to encourage the mapping to preserve color composition between the
input and output. In particular, we adopt the technique of Taigman et al. [49] and regularize the
generator to be near an identity mapping when real samples of the target domain are provided as
the input to the generator: i.e.,
$$\mathcal{L}_{\text{identity}}(G, F) = \mathbb{E}_{y \sim p_{\text{data}}(y)}\left[\lVert G(y) - y \rVert_1\right] + \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\lVert F(x) - x \rVert_1\right].$$
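As a minimal sketch of the identity mapping loss defined above, the following pure-Python snippet estimates both expectations by averaging the L1 distance over small batches. Images are represented here as flat lists of floats, and the generators `G` and `F` are stand-in callables; `l1_norm`, `identity_loss`, `shift_tint`, `keep`, and the toy data are hypothetical names introduced for illustration and are not taken from the authors' implementation.

```python
from typing import Callable, List, Sequence

Image = List[float]  # toy stand-in: an image flattened to a list of pixel values


def l1_norm(a: Sequence[float], b: Sequence[float]) -> float:
    """Return ||a - b||_1, the sum of absolute pixel differences."""
    return sum(abs(p - q) for p, q in zip(a, b))


def identity_loss(
    G: Callable[[Image], Image],
    F: Callable[[Image], Image],
    xs: Sequence[Image],
    ys: Sequence[Image],
) -> float:
    """Sample estimate of E_y[||G(y) - y||_1] + E_x[||F(x) - x||_1]."""
    term_g = sum(l1_norm(G(y), y) for y in ys) / len(ys)  # G fed target-domain images
    term_f = sum(l1_norm(F(x), x) for x in xs) / len(xs)  # F fed source-domain images
    return term_g + term_f


# Toy generators (assumptions for illustration only).
def shift_tint(img: Image) -> Image:
    """Change every pixel by 0.5, mimicking an unnecessary tint shift."""
    return [p + 0.5 for p in img]


def keep(img: Image) -> Image:
    """Behave as an exact identity mapping."""
    return list(img)


xs = [[0.0, 1.0], [2.0, 3.0]]
ys = [[1.0, 1.0], [0.0, 2.0]]

print(identity_loss(keep, keep, xs, ys))        # 0.0
print(identity_loss(shift_tint, keep, xs, ys))  # 1.0
```

A generator that leaves target-domain inputs untouched incurs no penalty, while one that shifts the tint is penalized in proportion to the shift, which is the behavior the regularizer is meant to encourage.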
[Figure 9 panels, left to right: Input; CycleGAN; CycleGAN + L_identity]

Figure 9: The effect of the identity mapping loss on Monet’s painting→photos. From left to right:
input paintings, CycleGAN without identity mapping loss, CycleGAN with identity mapping loss. The
identity mapping loss helps preserve the color of the input paintings.

Without $\mathcal{L}_{\text{identity}}$, the generators G and F are free to change the tint of
input images when there is no need to. For example, when learning the mapping between Monet’s
paintings and Flickr photographs, the generator often maps paintings of daytime to photographs
taken during sunset, because such a mapping may be equally valid under the adversarial loss and
cycle consistency loss. The effect of this identity mapping loss is shown in Figure 9.
In Figure 12, we show additional results translating Monet’s paintings to photographs. This figure
and Figure 9 show results on paintings that were included in the training set, whereas for all
other experiments in the paper, we only evaluate and show test set results. Because the training
set does not include paired data, coming up with a plausible translation for a training set
painting is a nontrivial task. Indeed, since Monet is no longer able to create new paintings,
generalization to unseen, “test set”, paintings is not a pressing problem.
Photo enhancement (Figure 14) We show that our method can be used to generate photos with shallower
depth of field. We train the model on flower photos downloaded from Flickr. The source domain
consists of flower photos taken by smartphones, which usually have deep DoF due to a small
aperture. The target contains photos captured by DSLRs with a larger aperture. Our model
successfully generates photos with shallower depth of field from the photos taken by smartphones.
Comparison with Gatys et al. [13] In Figure 15, we compare our results with neural style
transfer [13] on photo stylization. For each row, we first use two representative artworks as the
style images for [13]. Our method, on the other hand, can produce photos in the style of an entire
collection. To compare against neural style transfer of an entire collection, we compute the
average Gram Matrix across the target domain and use this matrix to transfer the “average style”
with Gatys et al. [13].
Figure 16 demonstrates similar comparisons for other translation tasks. We observe that Gatys et
al. [13] requires finding target style images that closely match the desired output, but still
often fails to produce photorealistic results, while our method succeeds in generating
natural-looking results, similar to the target domain.

6. Limitations and Discussion
Although our method can achieve compelling results in many cases, the results are far from
uniformly positive. Figure 17 shows several typical failure cases. On translation tasks that
involve color and texture changes, as many of those reported above, the method often succeeds. We
have also explored tasks that require geometric changes, with little success. For example, on the
task of dog→cat transfiguration, the learned translation degenerates into making minimal changes
to the input (Figure 17). This failure might be caused by our generator architectures, which are
tailored for good performance on appearance changes. Handling more varied and extreme
transformations, especially geometric changes, is an important problem for future work.
Some failure cases are caused by the distribution characteristics of the training datasets. For
example, our method gets confused in the horse → zebra example (Figure 17, right), because our
model was trained on the wild horse and zebra synsets of ImageNet, which do not contain images of
a person riding a horse or zebra.
We also observe a lingering gap between the results achievable with paired training data and those
achieved by our unpaired method. In some cases, this gap may be very hard, or even impossible, to
close: for example, our method sometimes permutes the labels for tree and building in the output
of the photos→labels task. Resolving this ambiguity may require some form of weak semantic
supervision. Integrating weak or semi-supervised data may lead to substantially more powerful
translators, still at a fraction of the annotation cost of the fully-supervised systems.
Nonetheless, in many cases completely unpaired data is plentifully available and should be made
use of. This paper pushes the boundaries of what is possible in this “unsupervised” setting.
Acknowledgments: We thank Aaron Hertzmann, Shiry Ginosar, Deepak Pathak, Bryan Russell, Eli
Shechtman, Richard Zhang, and Tinghui Zhou for many helpful comments. This work was supported in
part by NSF SMA-1514512, NSF IIS-1633310, a Google Research Award, Intel Corp, and hardware
donations from NVIDIA. JYZ is supported by the Facebook Graduate Fellowship and TP is supported by
the Samsung Scholarship. The photographs used for style transfer were taken by AE, mostly in France.
[Figure 10 panels, left to right: Input; Monet; Van Gogh; Cezanne; Ukiyo-e]

Figure 10: Collection style transfer I: we transfer input images into the artistic styles of Monet, Van Gogh, Cezanne, and
Ukiyo-e. Please see our website for additional examples.
[Figure 11 panels, left to right: Input; Monet; Van Gogh; Cezanne; Ukiyo-e]

Figure 11: Collection style transfer II: we transfer input images into the artistic styles of Monet, Van Gogh, Cezanne, and Ukiyo-e.
Please see our website for additional examples.
[Figure 12 panels: Input/Output pairs]

Figure 12: Relatively successful results on mapping Monet’s paintings to a photographic style. Please see our website for
additional examples.
[Figure 13 panels: Input/Output pairs, one task per row: horse → zebra; zebra → horse; winter Yosemite → summer Yosemite; summer Yosemite → winter Yosemite; apple → orange; orange → apple]

Figure 13: Our method applied to several translation problems. These images are selected as relatively successful results
– please see our website for more comprehensive and random results. In the top two rows, we show results on object
transfiguration between horses and zebras, trained on 939 images from the wild horse class and 1177 images from the zebra
class in ImageNet [5]. Also check out the horse→zebra demo video. The middle two rows show results on season transfer,
trained on winter and summer photos of Yosemite from Flickr. In the bottom two rows, we train our method on 996 apple
images and 1020 navel orange images from ImageNet.
[Figure 14 panels: Input/Output pairs]

Figure 14: Photo enhancement: mapping from a set of smartphone snaps to professional DSLR photographs, the system often
learns to produce shallow focus. Here we show some of the most successful results in our test set – average performance is
considerably worse. Please see our website for more comprehensive and random examples.

[Figure 15 panels. Columns: Input; Gatys et al. (image I); Gatys et al. (image II); Gatys et al. (collection); CycleGAN. Rows: Photo → Van Gogh; Photo → Ukiyo-e; Photo → Cezanne]

Figure 15: We compare our method with neural style transfer [13] on photo stylization. Left to right: input image, results
from Gatys et al. [13] using two different representative artworks as style images, results from Gatys et al. [13] using the
entire collection of the artist, and CycleGAN (ours).
[Figure 16 panels. Columns: Input; Gatys et al. (image I); Gatys et al. (image II); Gatys et al. (collection); CycleGAN. Rows: apple → orange; horse → zebra; Monet → photo]

Figure 16: We compare our method with neural style transfer [13] on various applications. From top to bottom:
apple→orange, horse→zebra, and Monet→photo. Left to right: input image, results from Gatys et al. [13] using two different
images as style images, results from Gatys et al. [13] using all the images from the target domain, and CycleGAN (ours).

[Figure 17 panels: Input/Output pairs for apple → orange, zebra → horse, winter → summer, dog → cat, cat → dog, Monet → photo, horse → zebra, photo → Ukiyo-e, photo → Van Gogh, and iPhone photo → DSLR photo; plus ImageNet “wild horse” training images]

Figure 17: Typical failure cases of our method. Left: in the task of dog→cat transfiguration, CycleGAN can only make
minimal changes to the input. Right: CycleGAN also fails in this horse → zebra example as our model has not seen images
of horseback riding during training. Please see our website for more comprehensive results.
References
[1] Y. Aytar, L. Castrejon, C. Vondrick, H. Pirsiavash, and A. Torralba. Cross-modal scene networks. PAMI, 2016. 3
[2] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR, 2017. 3
[3] R. W. Brislin. Back-translation for cross-cultural research. Journal of cross-cultural psychology, 1(3):185–216, 1970. 2, 3
[4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016. 2, 5, 6, 18
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009. 8, 13, 18
[6] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, 2015. 2
[7] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. In ICLR, 2017. 6, 7
[8] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In NIPS, 2016. 7
[9] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville. Adversarially learned inference. In ICLR, 2017. 6, 7
[10] A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In ICCV, 1999. 3
[11] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015. 2
[12] L. A. Gatys, M. Bethge, A. Hertzmann, and E. Shechtman. Preserving color in neural artistic style transfer. arXiv preprint arXiv:1606.05897, 2016. 3
[13] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. CVPR, 2016. 3, 8, 9, 14, 15
[14] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017. 3
[15] I. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016. 2, 4, 5
[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014. 2, 3, 4, 7
[17] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W.-Y. Ma. Dual learning for machine translation. In NIPS, 2016. 3
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 5
[19] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. In SIGGRAPH, 2001. 2, 3
[20] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006. 5
[21] Q.-X. Huang and L. Guibas. Consistent shape maps via semidefinite programming. In Symposium on Geometry Processing, 2013. 3
[22] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017. 2, 3, 5, 6, 7, 8, 18
[23] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016. 2, 3, 5, 7, 18
[24] Z. Kalal, K. Mikolajczyk, and J. Matas. Forward-backward error: Automatic detection of tracking failures. In ICPR, 2010. 3
[25] L. Karacan, Z. Akata, A. Erdem, and E. Erdem. Learning to generate images of outdoor scenes from attributes and semantic layouts. arXiv preprint arXiv:1612.00215, 2016. 3
[26] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 5
[27] D. P. Kingma and M. Welling. Auto-encoding variational bayes. ICLR, 2014. 3
[28] P.-Y. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM TOG, 33(4):149, 2014. 2
[29] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017. 5
[30] C. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. ECCV, 2016. 5
[31] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017. 3
[32] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In NIPS, 2016. 3, 6, 7
[33] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. 2, 3, 6
[34] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. In ICLR, 2016. 5
[35] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In CVPR. IEEE, 2017. 5
[36] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016. 2
[37] M. F. Mathieu, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun. Disentangling factors of variation in deep representation using adversarial training. In NIPS, 2016. 2
[38] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. CVPR, 2016. 2
[39] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016. 2
[40] R. Tyleček and R. Šára. Spatial pattern templates for recognition of objects with regular structure. In Proc. GCPR, Saarbrücken, Germany, 2013. 8, 18
[41] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016. 2
[42] R. Rosales, K. Achan, and B. J. Frey. Unsupervised image translation. In ICCV, 2003. 3
[43] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016. 2
[44] P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays. Scribbler: Controlling deep image synthesis with sketch and color. In CVPR, 2017. 3
[45] Y. Shih, S. Paris, F. Durand, and W. T. Freeman. Data-driven hallucination of different times of day from a single outdoor photo. ACM TOG, 32(6):200, 2013. 2
[46] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, 2017. 3, 5, 6, 7
[47] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. 7
[48] N. Sundaram, T. Brox, and K. Keutzer. Dense point trajectories by gpu-accelerated large displacement optical flow. In ECCV, 2010. 3
[49] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. In ICLR, 2017. 3, 8
[50] D. Turmukhambetov, N. D. Campbell, S. J. Prince, and J. Kautz. Modeling object appearance using context-conditioned component analysis. In CVPR, 2015. 8
[51] M. Twain. The jumping frog: in english, then in french, and then clawed back into a civilized language once more by patient. Unremunerated Toil, 3, 1903. 3
[52] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, 2016. 3
[53] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016. 5
[54] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NIPS, 2016. 2
[55] F. Wang, Q. Huang, and L. J. Guibas. Image co-segmentation via consistent functional maps. In ICCV, 2013. 3
[56] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In ECCV, 2016. 2
[57] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In NIPS, 2016. 2
[58] S. Xie and Z. Tu. Holistically-nested edge detection. In ICCV, 2015. 2
[59] Z. Yi, H. Zhang, P. Tan, and M. Gong. Dualgan: Unsupervised dual learning for image-to-image translation. In ICCV, 2017. 3
[60] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In CVPR, 2014. 8, 18
[61] C. Zach, M. Klopschitz, and M. Pollefeys. Disambiguating visual relations using loop constraints. In CVPR, 2010. 3
[62] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, 2016. 2
[63] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2017. 2
[64] T. Zhou, P. Krähenbühl, M. Aubry, Q. Huang, and A. A. Efros. Learning dense correspondence via 3d-guided cycle consistency. In CVPR, 2016. 2, 3
[65] T. Zhou, Y. J. Lee, S. Yu, and A. A. Efros. Flowweb: Joint image set alignment by weaving consistent, pixel-wise correspondences. In CVPR, 2015. 3
[66] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In ECCV, 2016. 2
7. Appendix

7.1. Training details
We train our networks from scratch, with a learning rate of 0.0002. In practice, we divide the
objective by 2 while optimizing D, which slows down the rate at which D learns, relative to the
rate of G. We keep the same learning rate for the first 100 epochs and linearly decay the rate to
zero over the next 100 epochs. Weights are initialized from a Gaussian distribution
$\mathcal{N}(0, 0.02)$.
Flower photo enhancement Flower images taken on smartphones were downloaded from Flickr by
searching for the photos taken by Apple iPhone 5, 5s, or 6, with search text flower. DSLR images
with shallow DoF were also downloaded from Flickr by search tag flower, dof. The images were
scaled to 360 pixels by width. The identity mapping loss of weight 0.5λ was used. The training set
sizes of the smartphone and DSLR datasets were 1813 and 3326, respectively. We set λ = 10.
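The learning-rate schedule described in the training details of Section 7.1 (a constant rate of 0.0002 for 100 epochs, then a linear decay to zero over the next 100) can be sketched as a small function. This is an illustrative sketch only: the function name `learning_rate`, its parameter names, and the 0-indexed epoch convention at the boundary are assumptions, since the text does not specify how the decay is discretized.

```python
def learning_rate(
    epoch: int,
    base_lr: float = 0.0002,
    n_constant: int = 100,
    n_decay: int = 100,
) -> float:
    """Return the learning rate for a 0-indexed epoch.

    The rate stays at base_lr for the first n_constant epochs and then
    falls linearly, reaching zero n_decay epochs later.
    """
    if epoch < n_constant:
        return base_lr
    remaining = n_constant + n_decay - epoch  # epochs left until the rate hits zero
    return base_lr * max(0, remaining) / n_decay


# Print the rate at a few points in the 200-epoch schedule.
for epoch in (0, 99, 100, 150, 199, 200):
    print(epoch, learning_rate(epoch))
```

Under this convention the rate at epoch 150 is half the base rate, and any epoch at or beyond 200 yields zero.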
Cityscapes label↔Photo 2975 training images from the Cityscapes training set [4] with image size
128 × 128. We used the Cityscapes val set for testing.
Maps↔aerial photograph 1096 training images were scraped from Google Maps [22] with image size
256 × 256. Images were sampled from in and around New York City. Data was then split into train
and test about the median latitude of the sampling region (with a buffer region added to ensure
that no training pixel appeared in the test set).
Architectural facades labels↔photo 400 training images from the CMP Facade Database [40].
Edges→shoes around 50,000 training images from UT Zappos50K dataset [60]. The model was trained
for 5 epochs.
Horse↔Zebra and Apple↔Orange We downloaded the images from ImageNet [5] using keywords wild horse,
zebra, apple, and navel orange. The images were scaled to 256 × 256 pixels. The training set size
of each class: 939 (horse), 1177 (zebra), 996 (apple), and 1020 (orange).
Summer↔Winter Yosemite The images were downloaded using Flickr API with the tag yosemite and the
date-taken field. Black-and-white photos were pruned. The images were scaled to 256 × 256 pixels.
The training size of each class: 1273 (summer) and 854 (winter).
Photo↔Art for style transfer The art images were downloaded from Wikiart.org. Some artworks that
were sketches or too obscene were pruned by hand. The photos were downloaded from Flickr using the
combination of tags landscape and landscapephotography. Black-and-white photos were pruned. The
images were scaled to 256 × 256 pixels. The training set size of each class was 1074 (Monet),
584 (Cezanne), 401 (Van Gogh), 1433 (Ukiyo-e), and 6853 (Photographs). The Monet dataset was
particularly pruned to include only landscape paintings, and the Van Gogh dataset included only
his later works that represent his most recognizable artistic style.
Monet’s paintings→photos To achieve high resolution while conserving memory, we used random square
crops of the original images for training. To generate results, we passed images of width 512
pixels with correct aspect ratio to the generator network as input. The weight for the identity
mapping loss was 0.5λ where λ was the weight for cycle consistency loss. We set λ = 10.

7.2. Network architectures
We provide both PyTorch and Torch implementations.
Generator architectures We adopt our architectures from Johnson et al. [23]. We use 6 residual
blocks for 128 × 128 training images, and 9 residual blocks for 256 × 256 or higher-resolution
training images. Below, we follow the naming convention used in Johnson et al.’s GitHub repository.
Let c7s1-k denote a 7 × 7 Convolution-InstanceNorm-ReLU layer with k filters and stride 1. dk
denotes a 3 × 3 Convolution-InstanceNorm-ReLU layer with k filters and stride 2. Reflection padding
was used to reduce artifacts. Rk denotes a residual block that contains two 3 × 3 convolutional
layers with the same number of filters in both layers. uk denotes a 3 × 3
fractional-strided-Convolution-InstanceNorm-ReLU layer with k filters and stride 1/2.
The network with 6 residual blocks consists of:
c7s1-64,d128,d256,R256,R256,R256,R256,R256,R256,u128,u64,c7s1-3
The network with 9 residual blocks consists of:
c7s1-64,d128,d256,R256,R256,R256,R256,R256,R256,R256,R256,R256,u128,u64,c7s1-3
Discriminator architectures For discriminator networks, we use 70 × 70 PatchGAN [22]. Let Ck denote
a 4 × 4 Convolution-InstanceNorm-LeakyReLU layer with k filters and stride 2. After the last layer,
we apply a convolution to produce a 1-dimensional output. We do not use InstanceNorm for the first
C64 layer. We use leaky ReLUs with a slope of 0.2. The discriminator architecture is:
C64-C128-C256-C512
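The layer notation of Section 7.2 lends itself to some simple bookkeeping. The sketch below expands the generator strings and tracks channel count and spatial size, then computes the receptive field of a stack of convolutions for the discriminator. All function and variable names are hypothetical. It assumes padding is chosen so that only the stride changes the spatial size, and it assumes the final discriminator convolution is 4 × 4 with stride 1, which the text does not state.

```python
from typing import List, Tuple

# Generator specifications in the notation of Section 7.2.
GENERATOR_6 = "c7s1-64,d128,d256," + "R256," * 6 + "u128,u64,c7s1-3"
GENERATOR_9 = "c7s1-64,d128,d256," + "R256," * 9 + "u128,u64,c7s1-3"


def parse_layer(token: str) -> Tuple[str, int, float]:
    """Map one token to (kind, filters, stride)."""
    if token.startswith("c7s1-"):
        return ("conv7", int(token[5:]), 1.0)
    if token.startswith("d"):
        return ("down", int(token[1:]), 2.0)
    if token.startswith("R"):
        return ("res", int(token[1:]), 1.0)
    if token.startswith("u"):
        return ("up", int(token[1:]), 0.5)  # fractional stride 1/2 doubles the size
    raise ValueError(f"unknown layer token: {token}")


def output_shape(spec: str, size: int) -> Tuple[int, int]:
    """Return (channels, spatial size) after applying every layer in spec."""
    channels = 3  # RGB input
    for token in spec.split(","):
        _, channels, stride = parse_layer(token)
        size = int(size / stride)
    return channels, size


def receptive_field(layers: List[Tuple[int, int]]) -> int:
    """Receptive field of a stack of (kernel, stride) convolutions."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride             # distance in input pixels between adjacent outputs
    return rf


# C64-C128-C256-C512 followed by the assumed final 4x4, stride-1 convolution.
all_stride_2 = [(4, 2)] * 4 + [(4, 1)]
last_ck_stride_1 = [(4, 2)] * 3 + [(4, 1), (4, 1)]

print(output_shape(GENERATOR_6, 128))     # (3, 128)
print(output_shape(GENERATOR_9, 256))     # (3, 256)
print(receptive_field(all_stride_2))      # 94
print(receptive_field(last_ck_stride_1))  # 70
```

Both generators return an image of the input size with 3 channels. For the discriminator, stride 2 in every Ck layer gives a 94-pixel receptive field under these assumptions, whereas stride 1 in the last Ck layer reproduces the stated 70 × 70 patch; which configuration was used is not specified in this section, so the second case should be read as an assumption.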