
                                                   Image-to-Image Translation with Conditional Adversarial Networks

                                                            Phillip Isola                Jun-Yan Zhu             Tinghui Zhou             Alexei A. Efros

                                                                          Berkeley AI Research (BAIR) Laboratory, UC Berkeley
                                                                            {isola,junyanz,tinghuiz,efros}@eecs.berkeley.edu
arXiv:1611.07004v3 [cs.CV] 26 Nov 2018

[Figure 1 panels, each shown as an input/output pair: Labels to Street Scene, Labels to Facade, BW to Color, Aerial to Map, Day to Night, Edges to Photo]

                                         Figure 1: Many problems in image processing, graphics, and vision involve translating an input image into a corresponding output image.
                                         These problems are often treated with application-specific algorithms, even though the setting is always the same: map pixels to pixels.
                                         Conditional adversarial nets are a general-purpose solution that appears to work well on a wide variety of these problems. Here we show
                                         results of the method on several. In each case we use the same architecture and objective, and simply train on different data.

Abstract

We investigate conditional adversarial networks as a general-purpose solution to image-to-image translation problems. These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping. This makes it possible to apply the same generic approach to problems that traditionally would require very different loss formulations. We demonstrate that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks. Indeed, since the release of the pix2pix software associated with this paper, a large number of internet users (many of them artists) have posted their own experiments with our system, further demonstrating its wide applicability and ease of adoption without the need for parameter tweaking. As a community, we no longer hand-engineer our mapping functions, and this work suggests we can achieve reasonable results without hand-engineering our loss functions either.

1. Introduction

Many problems in image processing, computer graphics, and computer vision can be posed as “translating” an input image into a corresponding output image. Just as a concept may be expressed in either English or French, a scene may be rendered as an RGB image, a gradient field, an edge map, a semantic label map, etc. In analogy to automatic language translation, we define automatic image-to-image translation as the task of translating one possible representation of a scene into another, given sufficient training data (see Figure 1). Traditionally, each of these tasks has been tackled with separate, special-purpose machinery (e.g., [16, 25, 20, 9, 11, 53, 33, 39, 18, 58, 62]), despite the fact that the setting is always the same: predict pixels from pixels. Our goal in this paper is to develop a common framework for all these problems.

The community has already taken significant steps in this direction, with convolutional neural nets (CNNs) becoming the common workhorse behind a wide variety of image prediction problems. CNNs learn to minimize a loss function – an objective that scores the quality of results – and although the learning process is automatic, a lot of manual effort still
goes into designing effective losses. In other words, we still have to tell the CNN what we wish it to minimize. But, just like King Midas, we must be careful what we wish for! If we take a naive approach and ask the CNN to minimize the Euclidean distance between predicted and ground truth pixels, it will tend to produce blurry results [43, 62]. This is because Euclidean distance is minimized by averaging all plausible outputs, which causes blurring. Coming up with loss functions that force the CNN to do what we really want – e.g., output sharp, realistic images – is an open problem and generally requires expert knowledge.

It would be highly desirable if we could instead specify only a high-level goal, like “make the output indistinguishable from reality”, and then automatically learn a loss function appropriate for satisfying this goal. Fortunately, this is exactly what is done by the recently proposed Generative Adversarial Networks (GANs) [24, 13, 44, 52, 63]. GANs learn a loss that tries to classify if the output image is real or fake, while simultaneously training a generative model to minimize this loss. Blurry images will not be tolerated since they look obviously fake. Because GANs learn a loss that adapts to the data, they can be applied to a multitude of tasks that traditionally would require very different kinds of loss functions.

In this paper, we explore GANs in the conditional setting. Just as GANs learn a generative model of data, conditional GANs (cGANs) learn a conditional generative model [24]. This makes cGANs suitable for image-to-image translation tasks, where we condition on an input image and generate a corresponding output image.

GANs have been vigorously studied in the last two years and many of the techniques we explore in this paper have been previously proposed. Nonetheless, earlier papers have focused on specific applications, and it has remained unclear how effective image-conditional GANs can be as a general-purpose solution for image-to-image translation. Our primary contribution is to demonstrate that on a wide variety of problems, conditional GANs produce reasonable results. Our second contribution is to present a simple framework sufficient to achieve good results, and to analyze the effects of several important architectural choices. Code is available at https://github.com/phillipi/pix2pix.

[Figure 2 diagram: the generator G maps the input x to G(x); the discriminator D labels the pair (x, G(x)) as fake and the pair (x, y) as real.]
Figure 2: Training a conditional GAN to map edges→photo. The discriminator, D, learns to classify between fake (synthesized by the generator) and real {edge, photo} tuples. The generator, G, learns to fool the discriminator. Unlike an unconditional GAN, both the generator and discriminator observe the input edge map.

2. Related work

Structured losses for image modeling. Image-to-image translation problems are often formulated as per-pixel classification or regression (e.g., [39, 58, 28, 35, 62]). These formulations treat the output space as “unstructured” in the sense that each output pixel is considered conditionally independent from all others given the input image. Conditional GANs instead learn a structured loss. Structured losses penalize the joint configuration of the output. A large body of literature has considered losses of this kind, with methods including conditional random fields [10], the SSIM metric [56], feature matching [15], nonparametric losses [37], the convolutional pseudo-prior [57], and losses based on matching covariance statistics [30]. The conditional GAN is different in that the loss is learned, and can, in theory, penalize any possible structure that differs between output and target.

Conditional GANs. We are not the first to apply GANs in the conditional setting. Prior and concurrent works have conditioned GANs on discrete labels [41, 23, 13], text [46], and, indeed, images. The image-conditional models have tackled image prediction from a normal map [55], future frame prediction [40], product photo generation [59], and image generation from sparse annotations [31, 48] (cf. [47] for an autoregressive approach to the same problem). Several other papers have also used GANs for image-to-image mappings, but only applied the GAN unconditionally, relying on other terms (such as L2 regression) to force the output to be conditioned on the input. These papers have achieved impressive results on inpainting [43], future state prediction [64], image manipulation guided by user constraints [65], style transfer [38], and superresolution [36]. Each of the methods was tailored for a specific application. Our framework differs in that nothing is application-specific. This makes our setup considerably simpler than most others.

Our method also differs from the prior works in several architectural choices for the generator and discriminator. Unlike past work, for our generator we use a “U-Net”-based architecture [50], and for our discriminator we use a convolutional “PatchGAN” classifier, which only penalizes structure at the scale of image patches. A similar PatchGAN architecture was previously proposed in [38] to capture local style statistics. Here we show that this approach is effective on a wider range of problems, and we investigate the effect of changing the patch size.

3. Method

GANs are generative models that learn a mapping from random noise vector z to output image y, G : z → y [24]. In
contrast, conditional GANs learn a mapping from observed image x and random noise vector z, to y, G : {x, z} → y. The generator G is trained to produce outputs that cannot be distinguished from “real” images by an adversarially trained discriminator, D, which is trained to do as well as possible at detecting the generator’s “fakes”. This training procedure is diagrammed in Figure 2.

3.1. Objective

The objective of a conditional GAN can be expressed as

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))], \tag{1}$$

where G tries to minimize this objective against an adversarial D that tries to maximize it, i.e. $G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D)$.

To test the importance of conditioning the discriminator, we also compare to an unconditional variant in which the discriminator does not observe x:

$$\mathcal{L}_{GAN}(G, D) = \mathbb{E}_{y}[\log D(y)] + \mathbb{E}_{x,z}[\log(1 - D(G(x, z)))]. \tag{2}$$

Previous approaches have found it beneficial to mix the GAN objective with a more traditional loss, such as L2 distance [43]. The discriminator’s job remains unchanged, but the generator is tasked to not only fool the discriminator but also to be near the ground truth output in an L2 sense. We also explore this option, using L1 distance rather than L2 as L1 encourages less blurring:

$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}[\lVert y - G(x, z) \rVert_1]. \tag{3}$$

Our final objective is

$$G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G). \tag{4}$$

Without z, the net could still learn a mapping from x to y, but would produce deterministic outputs, and therefore fail to match any distribution other than a delta function. Past conditional GANs have acknowledged this and provided Gaussian noise z as an input to the generator, in addition to x (e.g., [55]). In initial experiments, we did not find this strategy effective – the generator simply learned to ignore the noise – which is consistent with Mathieu et al. [40]. Instead, for our final models, we provide noise only in the form of dropout, applied on several layers of our generator at both training and test time. Despite the dropout noise, we observe only minor stochasticity in the output of our nets. Designing conditional GANs that produce highly stochastic output, and thereby capture the full entropy of the conditional distributions they model, is an important question left open by the present work.

[Figure 3 diagram: an encoder-decoder (left) and a U-Net (right), each mapping x to y.]
Figure 3: Two choices for the architecture of the generator. The “U-Net” [50] is an encoder-decoder with skip connections between mirrored layers in the encoder and decoder stacks.

3.2. Network architectures

We adapt our generator and discriminator architectures from those in [44]. Both generator and discriminator use modules of the form convolution-BatchNorm-ReLU [29]. Details of the architecture are provided in the supplemental materials online, with key features discussed below.

3.2.1 Generator with skips

A defining feature of image-to-image translation problems is that they map a high resolution input grid to a high resolution output grid. In addition, for the problems we consider, the input and output differ in surface appearance, but both are renderings of the same underlying structure. Therefore, structure in the input is roughly aligned with structure in the output. We design the generator architecture around these considerations.

Many previous solutions [43, 55, 30, 64, 59] to problems in this area have used an encoder-decoder network [26]. In such a network, the input is passed through a series of layers that progressively downsample, until a bottleneck layer, at which point the process is reversed. Such a network requires that all information flow pass through all the layers, including the bottleneck. For many image translation problems, there is a great deal of low-level information shared between the input and output, and it would be desirable to shuttle this information directly across the net. For example, in the case of image colorization, the input and output share the location of prominent edges.

To give the generator a means to circumvent the bottleneck for information like this, we add skip connections, following the general shape of a “U-Net” [50]. Specifically, we add skip connections between each layer i and layer n − i, where n is the total number of layers. Each skip connection simply concatenates all channels at layer i with those at layer n − i.

3.2.2 Markovian discriminator (PatchGAN)

It is well known that the L2 loss – and L1, see Figure 4 – produces blurry results on image generation problems [34]. Although these losses fail to encourage high-
frequency crispness, in many cases they nonetheless accurately capture the low frequencies. For problems where this is the case, we do not need an entirely new framework to enforce correctness at the low frequencies. L1 will already do.

This motivates restricting the GAN discriminator to only model high-frequency structure, relying on an L1 term to force low-frequency correctness (Eqn. 4). In order to model high frequencies, it is sufficient to restrict our attention to the structure in local image patches. Therefore, we design a discriminator architecture – which we term a PatchGAN – that only penalizes structure at the scale of patches. This discriminator tries to classify if each N × N patch in an image is real or fake. We run this discriminator convolutionally across the image, averaging all responses to provide the ultimate output of D.

In Section 4.4, we demonstrate that N can be much smaller than the full size of the image and still produce high quality results. This is advantageous because a smaller PatchGAN has fewer parameters, runs faster, and can be applied to arbitrarily large images.

Such a discriminator effectively models the image as a Markov random field, assuming independence between pixels separated by more than a patch diameter. This connection was previously explored in [38], and is also the common assumption in models of texture [17, 21] and style [16, 25, 22, 37]. Therefore, our PatchGAN can be understood as a form of texture/style loss.

3.3. Optimization and inference

To optimize our networks, we follow the standard approach from [24]: we alternate between one gradient descent step on D, then one step on G. As suggested in the original GAN paper, rather than training G to minimize log(1 − D(x, G(x, z))), we instead train to maximize log D(x, G(x, z)) [24]. In addition, we divide the objective by 2 while optimizing D, which slows down the rate at which D learns relative to G. We use minibatch SGD and apply the Adam solver [32], with a learning rate of 0.0002, and momentum parameters β1 = 0.5, β2 = 0.999.

At inference time, we run the generator net in exactly the same manner as during the training phase. This differs from the usual protocol in that we apply dropout at test time, and we apply batch normalization [29] using the statistics of the test batch, rather than aggregated statistics of the training batch. This approach to batch normalization, when the batch size is set to 1, has been termed “instance normalization” and has been demonstrated to be effective at image generation tasks [54]. In our experiments, we use batch sizes between 1 and 10 depending on the experiment.

4. Experiments

To explore the generality of conditional GANs, we test the method on a variety of tasks and datasets, including both graphics tasks, like photo generation, and vision tasks, like semantic segmentation:

  • Semantic labels↔photo, trained on the Cityscapes dataset [12].
  • Architectural labels→photo, trained on CMP Facades [45].
  • Map↔aerial photo, trained on data scraped from Google Maps.
  • BW→color photos, trained on [51].
  • Edges→photo, trained on data from [65] and [60]; binary edges generated using the HED edge detector [58] plus postprocessing.
  • Sketch→photo: tests edges→photo models on human-drawn sketches from [19].
  • Day→night, trained on [33].
  • Thermal→color photos, trained on data from [27].
  • Photo with missing pixels→inpainted photo, trained on Paris StreetView from [14].

Details of training on each of these datasets are provided in the supplemental materials online. In all cases, the input and output are simply 1-3 channel images. Qualitative results are shown in Figures 8, 9, 11, 10, 13, 14, 15, 16, 17, 18, 19, 20. Several failure cases are highlighted in Figure 21. More comprehensive results are available at https://phillipi.github.io/pix2pix/.

Data requirements and speed. We note that decent results can often be obtained even on small datasets. Our facade training set consists of just 400 images (see results in Figure 14), and the day to night training set consists of only 91 unique webcams (see results in Figure 15). On datasets of this size, training can be very fast: for example, the results shown in Figure 14 took less than two hours of training on a single Pascal Titan X GPU. At test time, all models run in well under a second on this GPU.

4.1. Evaluation metrics

Evaluating the quality of synthesized images is an open and difficult problem [52]. Traditional metrics such as per-pixel mean-squared error do not assess joint statistics of the result, and therefore do not measure the very structure that structured losses aim to capture.

To more holistically evaluate the visual quality of our results, we employ two tactics. First, we run “real vs. fake” perceptual studies on Amazon Mechanical Turk (AMT). For graphics problems like colorization and photo generation, plausibility to a human observer is often the ultimate goal. Therefore, we test our map generation, aerial photo generation, and image colorization using this approach.
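
The limitation of per-pixel error noted at the start of this section can be illustrated with a toy example. The following is a minimal sketch and not part of the experiments reported here: the 8 × 8 striped "ground truth", the two candidate outputs, and the helper name mse are all assumptions chosen purely for illustration. It shows that a structureless gray image can score better under mean-squared error than an equally sharp but misaligned output, and that the score is unchanged when all spatial structure is destroyed by shuffling pixels.

```python
import numpy as np

def mse(a, b):
    """Per-pixel mean-squared error between two arrays of equal shape."""
    return float(np.mean((a - b) ** 2))

# Hypothetical 8x8 "ground truth": vertical stripes alternating 0 and 1.
target = np.tile(np.array([0.0, 1.0]), (8, 4))

# Candidate A: a flat gray image with no structure at all.
gray = np.full_like(target, 0.5)

# Candidate B: equally sharp stripes, shifted sideways by one pixel.
shifted = np.roll(target, 1, axis=1)

print(mse(target, gray))     # 0.25
print(mse(target, shifted))  # 1.0

# Applying the same random pixel shuffle to both images destroys every
# spatial relationship, yet the per-pixel error does not change.
rng = np.random.default_rng(0)
perm = rng.permutation(target.size)
shuffled_target = target.ravel()[perm]
shuffled_gray = gray.ravel()[perm]
print(mse(shuffled_target, shuffled_gray))  # 0.25
```

In this toy setting the metric prefers the gray image by a factor of four, even though the shifted stripes share the sharpness and texture of the target.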
[Figure 4 columns: Input, Ground truth, L1, cGAN, L1 + cGAN]

Figure 4: Different losses induce different quality of results. Each column shows results trained under a different loss. Please see
https://phillipi.github.io/pix2pix/ for additional examples.

Second, we measure whether or not our synthesized cityscapes are realistic enough that an off-the-shelf recognition system can recognize the objects in them. This metric is similar to the “inception score” from [52], the object detection evaluation in [55], and the “semantic interpretability” measures in [62] and [42].

AMT perceptual studies. For our AMT experiments, we followed the protocol from [62]: Turkers were presented with a series of trials that pitted a “real” image against a “fake” image generated by our algorithm. On each trial, each image appeared for 1 second, after which the images disappeared and Turkers were given unlimited time to respond as to which was fake. The first 10 images of each session were practice and Turkers were given feedback. No feedback was provided on the 40 trials of the main experiment. Each session tested just one algorithm at a time, and Turkers were not allowed to complete more than one session. ∼50 Turkers evaluated each algorithm. Unlike [62], we did not include vigilance trials. For our colorization experiments, the real and fake images were generated from the same grayscale input. For map↔aerial photo, the real and fake images were not generated from the same input, in order to make the task more difficult and avoid floor-level results. For map↔aerial photo, we trained on 256 × 256 resolution images, but exploited fully-convolutional translation (described above) to test on 512 × 512 images, which were then downsampled and presented to Turkers at 256 × 256 resolution. For colorization, we trained and tested on 256 × 256 resolution images and presented the results to Turkers at this same resolution.

“FCN-score”. While quantitative evaluation of generative models is known to be challenging, recent works [52, 55, 62, 42] have tried using pre-trained semantic classifiers to measure the discriminability of the generated stimuli as a pseudo-metric. The intuition is that if the generated images are realistic, classifiers trained on real images will be able to classify the synthesized image correctly as well. To this end, we adopt the popular FCN-8s [39] architecture for semantic segmentation, and train it on the cityscapes dataset. We then score synthesized photos by the classification accuracy against the labels these photos were synthesized from.

4.2. Analysis of the objective function

Which components of the objective in Eqn. 4 are important? We run ablation studies to isolate the effect of the L1 term, the GAN term, and to compare using a discriminator conditioned on the input (cGAN, Eqn. 1) against using an unconditional discriminator (GAN, Eqn. 2).
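
These ablations can be read as switching individual terms of Eqn. 4 on and off. The following is a minimal NumPy sketch of that idea rather than the released implementation. It assumes the discriminator outputs are already available as probabilities; the function names, the toy arrays, and the value of lam are hypothetical; and the L1 term is averaged over pixels, which differs from the norm in Eqn. 3 only by a constant factor that can be absorbed into λ.

```python
import numpy as np

def cgan_value(d_real, d_fake):
    """Value of Eqn. 1 (or Eqn. 2) given discriminator probabilities.

    d_real holds D's outputs on real samples, d_fake its outputs on
    generated samples. Whether D was shown the input x (cGAN) or not
    (GAN) is decided where these probabilities are produced, not here.
    """
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

def l1_term(y, g_out):
    """Eqn. 3, averaged over pixels (an assumption of this sketch)."""
    return float(np.mean(np.abs(y - g_out)))

def generator_objective(d_fake, y, g_out, lam, use_gan=True, use_l1=True):
    """Generator-side part of Eqn. 4, with switches for the ablations."""
    total = 0.0
    if use_gan:
        # Only the second term of Eqn. 1 depends on G.
        total += float(np.mean(np.log(1.0 - d_fake)))
    if use_l1:
        total += lam * l1_term(y, g_out)
    return total

# Toy values, for illustration only.
d_real = np.array([0.9, 0.8])            # D's outputs on real pairs
d_fake = np.array([0.1, 0.2])            # D's outputs on generated pairs
y = np.array([[0.0, 1.0], [1.0, 0.0]])   # ground-truth output
g_out = np.full_like(y, 0.5)             # a flat, "blurry" generator output
lam = 10.0                               # placeholder weight, not a reported setting

l1_only = generator_objective(d_fake, y, g_out, lam, use_gan=False)
gan_only = generator_objective(d_fake, y, g_out, lam, use_l1=False)
combined = generator_objective(d_fake, y, g_out, lam)
print(l1_only, gan_only, combined)
```

The sketch uses the minimax form of Eqn. 1 for the generator term. The alternative described in Section 3.3, maximizing log D(x, G(x, z)), would replace that term with the negative mean of log(d_fake).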
[...] that the output look realistic. This variant results in poor performance; examining the results reveals that the generator collapsed into producing nearly the exact same output regardless of input photograph. Clearly, it is important, in this case, that the loss measure the quality of the match between input and output, and indeed cGAN performs much better than GAN. Note, however, that adding an L1 term also encourages that the output respect the input, since the L1 loss penalizes the distance between ground truth outputs, which correctly match the input, and synthesized outputs, which may not. Correspondingly, L1+GAN is also effective at creating realistic renderings that respect the input label maps. Combining all terms, L1+cGAN, performs similarly well.

Colorfulness. A striking effect of conditional GANs is that they produce sharp images, hallucinating spatial structure even where it does not exist in the input label map. One might imagine cGANs have a similar effect on “sharpening” in the spectral dimension – i.e. making images more colorful. Just as L1 will incentivize a blur when it is uncertain where exactly to locate an edge, it will also incentivize an average, grayish color when it is uncertain which of several plausible color values a pixel should take on. Specifically, L1 will be minimized by choosing the median of the conditional probability density function over possible colors. An adversarial loss, on the other hand, can in principle become aware that grayish outputs are unrealistic, and encourage matching the true color distribution [24]. In Figure 7, we investigate whether our cGANs actually achieve this effect on the Cityscapes dataset. The plots show the marginal distributions over output color values in Lab color space. The ground truth distributions are shown with a dotted line. It is apparent that L1 leads to a narrower distribution than the ground truth, confirming the hypothesis that L1 encourages average, grayish colors. Using a cGAN, on the other hand, pushes the output distribution closer to the ground truth.

[Figure 5 panels: rows Encoder-decoder and U-Net; columns L1 and L1+cGAN]
Figure 5: Adding skip connections to an encoder-decoder to create a “U-Net” results in much higher quality results.

Loss            Per-pixel acc.   Per-class acc.   Class IOU
L1                  0.42             0.15            0.11
GAN                 0.22             0.05            0.01
cGAN                0.57             0.22            0.16
L1+GAN              0.64             0.20            0.15
L1+cGAN             0.66             0.23            0.17
Ground truth        0.80             0.26            0.21
Table 1: FCN-scores for different losses, evaluated on Cityscapes labels↔photos.

Loss                         Per-pixel acc.   Per-class acc.   Class IOU
Encoder-decoder (L1)             0.35             0.12            0.08
Encoder-decoder (L1+cGAN)        0.29             0.09            0.05
U-net (L1)                       0.48             0.18            0.13
U-net (L1+cGAN)                  0.55             0.20            0.14
Table 2: FCN-scores for different generator architectures (and objectives), evaluated on Cityscapes labels↔photos. (U-net (L1+cGAN) scores differ from those reported in other tables since batch size was 10 for this experiment and 1 for other tables, and random variation between training runs.)

Discriminator receptive field   Per-pixel acc.   Per-class acc.   Class IOU
1×1                                 0.39             0.15            0.10
16×16                               0.65             0.21            0.17
70×70                               0.66             0.23            0.17
286×286                             0.42             0.16            0.11
Table 3: FCN-scores for different receptive field sizes of the dis-
criminator, evaluated on Cityscapes labels→photos. Note that in-                              4.3. Analysis of the generator architecture
put images are 256 × 256 pixels and larger receptive fields are
padded with zeros.                                                                               A U-Net architecture allows low-level information to
                                                                                              shortcut across the network. Does this lead to better results?
    Figure 4 shows the qualitative effects of these variations                                Figure 5 and Table 2 compare the U-Net against an encoder-
on two labels→photo problems. L1 alone leads to reason-                                       decoder on cityscape generation. The encoder-decoder is
able but blurry results. The cGAN alone (setting λ = 0 in                                     created simply by severing the skip connections in the U-
Eqn. 4) gives much sharper results but introduces visual ar-                                  Net. The encoder-decoder is unable to learn to generate
tifacts on certain applications. Adding both terms together                                   realistic images in our experiments. The advantages of the
(with λ = 100) reduces these artifacts.                                                       U-Net appear not to be specific to conditional GANs: when
    We quantify these observations using the FCN-score on                                     both U-Net and encoder-decoder are trained with an L1 loss,
the cityscapes labels→photo task (Table 1): the GAN-based                                     the U-Net again achieves the superior results.
objectives achieve higher scores, indicating that the synthe-
sized images include more recognizable structure. We also                                     4.4. From PixelGANs to PatchGANs to ImageGANs
test the effect of removing conditioning from the discrimi-
nator (labeled as GAN). In this case, the loss does not pe-                                      We test the effect of varying the patch size N of our dis-
nalize mismatch between the input and output; it only cares                                   criminator receptive fields, from a 1 × 1 “PixelGAN” to a
[Figure 6 panels, left to right: L1, 1×1, 16×16, 70×70, 286×286]

Figure 6: Patch size variations. Uncertainty in the output manifests itself differently for different loss functions. Uncertain regions become blurry and desaturated under L1. The 1×1 PixelGAN encourages greater color diversity but has no effect on spatial statistics. The 16×16 PatchGAN creates locally sharp results, but also leads to tiling artifacts beyond the scale it can observe. The 70×70 PatchGAN forces outputs that are sharp, even if incorrect, in both the spatial and spectral (colorfulness) dimensions. The full 286×286 ImageGAN produces results that are visually similar to the 70×70 PatchGAN, but somewhat lower quality according to our FCN-score metric (Table 3). Please see https://phillipi.github.io/pix2pix/ for additional examples.
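
To make the patch sizes in Figure 6 concrete, the following minimal sketch computes the receptive field of one discriminator output unit for a stack of convolutions. The layer stacks are assumptions chosen for illustration (4×4 kernels with the strides shown, or 1×1 kernels for the PixelGAN); this section does not state the discriminator layer configurations, and `receptive_field` is a hypothetical helper name. Under these assumptions the stacks give receptive fields of 1, 16, 70, and 286 pixels.

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels) of one output unit.

    `layers` is a list of (kernel_size, stride) pairs, ordered from the
    input of the network to its output.
    """
    rf = 1
    # Walk backwards from the output: each layer widens the input region
    # seen by a single unit of the layer above it.
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


# Assumed layer stacks as (kernel, stride); illustrative, not from the text.
PIXEL_GAN = [(1, 1), (1, 1), (1, 1)]
PATCH_16 = [(4, 2), (4, 1), (4, 1)]
PATCH_70 = [(4, 2)] * 3 + [(4, 1)] * 2
IMAGE_286 = [(4, 2)] * 5 + [(4, 1)] * 2

if __name__ == "__main__":
    for name, stack in [("PixelGAN", PIXEL_GAN), ("16x16", PATCH_16),
                        ("70x70", PATCH_70), ("286x286", IMAGE_286)]:
        print(name, receptive_field(stack))
```

The recurrence shows why adding stride-2 layers grows the patch quickly: each one roughly doubles the region that a single discriminator output can observe.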
[Figure 7 plots: log-probability distributions log P(L), log P(a), and log P(b) of output color values in the L, a, and b channels of Lab color space, for L1, cGAN, L1+cGAN, L1+pixelcGAN, and ground truth.]

           Histogram intersection against ground truth
           Loss          L       a       b
           L1            0.81    0.69    0.70
           cGAN          0.87    0.74    0.84
           L1+cGAN       0.86    0.84    0.82
           PixelGAN      0.83    0.68    0.78
  656
  656                             LL              bb
                                                              a(b)a
                                                                                                         a a                                                                                       PixelGAN 0.83 0.68                     0.78               710710
   656                         (a)
                                (a)b
                                 (a)                             (b)
                                                                  (b)                             (c)
                                                                                                   (c)                                 (d)(d)                (c)                                                                                                 710
  657            Figure
                 Figure
                   −1
                        5:    (a) distribution
                         5: Color
                            Color   distributionmatching
                                                 matching
                                                −1−1
                                                               (b)ofofthe
                                                         property
                                                          property      thecGAN,
                                                                            cGAN,tested           (c)
                                                                                   testedononCityscapes.
                                                                                              Cityscapes.  (c.f.Figure
                                                                                                         (c.f.         11
                                                                                                                  Figure         original
                                                                                                                            thethe
                                                                                                                         of of     original(d)GAN
                                                                                                                                           GAN  paper [14]).
                                                                                                                                                  paper      Note
                                                                                                                                                        [14]). Note                                                                                          711711
  657
   657
                    −1
                                                                                                                                           (d)
               Figure 5: Color distribution matching property of the cGAN, tested on Cityscapes. (c.f. Figure 1 of the original GAN paper [14]). Note
                     −1                            −1
                                                                                                                                                                                                                                                                 711
  658
  658        that
              that−3 the histogram intersection
                 −3
               that−3 the histogram intersection
                        the   histogram             intersection
                                                                    −3−3
                                                                         scores
                                                                           scores
                                                                            scores   arearedominated
                                                                                       are    dominated
                                                                                               dominated        byby
                                                                                                                  bydifferences
                                                                                                                       differences
                                                                                                                        differences     inin the
                                                                                                                                            in thehigh
                                                                                                                                                the  high
                                                                                                                                                      highprobability
                                                                                                                                                             probability
                                                                                                                                                              probability    region,
                                                                                                                                                                                region,   which
                                                                                                                                                                                             which   areareimperceptible
                                                                                                                                                                                                               imperceptible     in inthe
                                                                                                                                                                                                                                       inthe plots,
                                                                                                                                                                                                                                                 plots,      712712
   658 Figure                                                                                                                                                         Figure region,
                                                                                                                                                                                   1 of thewhich          are   imperceptible               the   plots,         712
                                                                       −3
                 7:
             which      Color
                         show       distribution
                                    log     probability     matching
                                                                  and           property        of the cGAN,              tested     on   Cityscapes.        (c.f.                               original      GAN      paper [24]).          Note
              which       show        log    probability           and −5therefore
                                                                            therefore     emphasize
                                                                                            emphasize         differences
                                                                                                                 differences      in the
                                                                                                                                    in  thelowlow probability
                                                                                                                                                    probability    regions.
                                                                                                                                                                      regions.
                 −5                                                 −5−5
  659
  659
   659 that thewhich
                  −5
                   −5
                           show        log    probability           and      therefore       emphasize           differences         in  the   low   probability       regions.                                                                              713713
                                                                                                                                                                                                                                                                 713
                 −7 histogram intersection −7−7                   scores are dominated by differences in the high probability region, which are imperceptible in the plots,
  660
  660
                  −7
                   −7                                                  −7                                                                                                                                                                                    714714
   660 which show           log probability
                                    L1                     and therefore             emphasize
                                                                                      1x1                differences in the             low probability regions.70x70
                                                                                                                                    16x16                                                                              256x256                                   714
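The contrast drawn in the Figure 7 caption, between what the histogram intersection score measures and what a log-probability plot shows, can be illustrated with a toy calculation. This is a minimal sketch, assuming histograms stored as plain lists of bin counts and natural logarithms; the bin counts and function names are hypothetical and are not taken from the paper's evaluation code.

```python
import math


def normalize(counts):
    """Turn raw bin counts into a probability distribution."""
    total = float(sum(counts))
    if total <= 0:
        raise ValueError("histogram must contain at least one count")
    return [c / total for c in counts]


def histogram_intersection(p, q):
    """Sum of bin-wise minima of two normalized histograms (1.0 = identical)."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(a, b) for a, b in zip(p, q))


def log_prob(p, floor=1e-12):
    """Natural log of each bin, with a floor so empty bins stay finite."""
    return [math.log(max(x, floor)) for x in p]


# Hypothetical 3-bin example: one dominant bin and two rare bins.
truth = normalize([90, 8, 2])
pred = normalize([96, 4, 0])

score = histogram_intersection(truth, pred)

# Share of the score contributed by the dominant (high probability) bin.
dominant_share = min(truth[0], pred[0]) / score

# Log-probability gaps, which is what a log-scale plot makes visible.
gaps = [abs(a - b) for a, b in zip(log_prob(truth), log_prob(pred))]
```

In this toy case almost all of the intersection score comes from the dominant bin, while the largest gaps on a log scale sit in the rare bins, which is the mismatch the caption describes.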
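The patch sizes compared in this section (1×1, 16×16, 70×70, and the full 286×286) can be read as discriminator receptive fields. The sketch below is only an illustration, assuming each discriminator is a plain stack of convolutions described by (kernel size, stride) pairs. The specific layer stacks are assumptions chosen so that the arithmetic reproduces those four sizes; they are not a configuration stated in this section.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    `layers` is a list of (kernel_size, stride) pairs ordered from the
    input to the output. Walking backwards from a single output unit,
    each layer widens the field to (rf - 1) * stride + kernel_size.
    """
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


# Assumed (hypothetical) layer stacks, 4x4 kernels unless noted.
PIXEL_GAN = [(1, 1), (1, 1), (1, 1)]                   # 1x1 kernels only
PATCH_16 = [(4, 2), (4, 1), (4, 1)]
PATCH_70 = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
IMAGE_286 = [(4, 2)] * 5 + [(4, 1), (4, 1)]

sizes = {
    "PixelGAN": receptive_field(PIXEL_GAN),
    "PatchGAN 16": receptive_field(PATCH_16),
    "PatchGAN 70": receptive_field(PATCH_70),
    "ImageGAN 286": receptive_field(IMAGE_286),
}
```

Under these assumptions, adding stride-2 layers roughly doubles the receptive field each time, which is one way a discriminator can cover a larger patch without changing its kernel size.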
Figure 6: Patch size variations. Uncertainty in the output manifests itself differently for different loss functions. Uncertain regions become blurry and desaturated under L1. The 1x1 PixelGAN encourages greater color diversity but has no effect on spatial statistics. The 16x16 PatchGAN creates locally sharp results, but also leads to tiling artifacts beyond the scale it can observe. The 70x70 PatchGAN forces outputs that are sharp, even if incorrect, in both the spatial and spectral (colorfulness) dimensions. The full 256x256 ImageGAN produces results that are visually similar to the 70x70 PatchGAN, but somewhat lower quality according to our FCN-score metric (Table 2).

             Photo → Map               Map → Photo
Loss         % Turkers labeled real    % Turkers labeled real
L1           2.8% ± 1.0%               0.8% ± 0.3%
L1+cGAN      6.1% ± 1.3%               18.9% ± 2.5%

Table 4: AMT “real vs fake” test on maps↔aerial photos.

Method                     % Turkers labeled real
L2 regression from [62]    16.3% ± 2.4%
Zhang et al. 2016 [62]     27.8% ± 2.7%
Ours                       22.5% ± 1.6%

Table 5: AMT “real vs fake” test on colorization.

[Figure panel labels: L2 [44], Classification (rebal.) [44], Ours; Input, Ground truth, L1, cGAN]

full 286 × 286 “ImageGAN”¹. Figure 6 shows qualitative results of this analysis and Table 3 quantifies the effects using the FCN-score. Note that elsewhere in this paper, unless specified, all experiments use 70 × 70 PatchGANs, and for this section all experiments use an L1+cGAN loss.

The PixelGAN has no effect on spatial sharpness but does increase the colorfulness of the results (quantified in Figure 7). For example, the bus in Figure 6 is painted gray when the net is trained with an L1 loss, but becomes red with the PixelGAN loss. Color
                                                                  [44]       (L1
                                                                               (L1  +
                                                                              histogram Ours
                                                                                       ++cGAN)
                                                                                          cGAN)           Ground
                                                                                                     matching
                                                                                                            Ground      istrutha
                                                                                                                             truth                      Input               Groundtruth     truth                L1                    cGAN
                                                                                                                                                                                                                                        cGAN                 727
                                                                                                                                                                                                                                                                 726
  673                  L2    [44]                 (rebal.)
   673 common problem in image processing [49], and PixelGANs
                                                                   [44]         (L1        cGAN)             Ground           truth                                                                                                                             727
                                                                                                                                                                                                                                                                 727
  674                                                                                                                                     generator convolutionally, on larger images than those on                                                          728
  674
   674                                                                                                                                                                                                                                                          728
                                                                                                                                                                                                                                                                 728
  675 may be a promising lightweight solution.                                                                                            which it was trained. We test this on the map↔aerial photo                                                         729
  675
   675                                                                                                                                                                                                                                                          729
                                                                                                                                                                                                                                                                 729
  676      Using         a  16×16            PatchGAN                 is    sufficient         to   promote          sharp                task. After training a generator on 256×256 images, we test                                                        730
  676
   676                                                                                                                                                                                                                                                          730
                                                                                                                                                                                                                                                                 730
  677 outputs, and achieves good FCN-scores, but also leads to
  677
                                                                                                                                          it on 512 × 512 images. The results in Figure 8 demonstrate                                                        731
                                                                                                                                                                                                                                                                731
   677                                                                                                                                                                                                                                                       732 731
  678
  678
       tiling    artifacts.              The        70     ×      70     PatchGAN                alleviates          these                the    effectiveness
                                                                                                                                             Figure    8: Applying      ofathis       approach.
                                                                                                                                                                               conditional        GAN to semantic segmentation.                                 732
   678                                                                                                                                          Figure
                                                                                                                                                 Figure   8:
                                                                                                                                                           8: Applying
                                                                                                                                                                Applyingasharp  aconditional
                                                                                                                                                                                   conditional        GAN totosemantic             segmentation.             733 732
  679
  679
       artifacts        and       achieves             slightly           better       scores.           Scaling         be-                 The    cGAN        produces                   images GAN   that looksemantic
                                                                                                                                                                                                                       at glance     segmentation.
                                                                                                                                                                                                                                         like the               733
   679                                                                                                                                          The    cGAN        produces         sharp     images        that   look    atat glance       like
                                                                                                                                                                                                                                              like the
  680 yond this, to the full 286 × 286 ImageGAN, does not ap-                                                                             4.5.     Perceptual
                                                                                                                                                 The    cGAN
                                                                                                                                             ground truth,                validation
                                                                                                                                                                    produces
                                                                                                                                                                 but in fact         sharp     images        that   look
                                                                                                                                                                                   include many mall, hallucinated objects.      glance              the     734 733
  680
   680                                                                                                                                          ground     truth,   but    in  fact   include       many      small,   hallucinated          objects.           734
  681  pear to improve the visual quality of the results, and in                                                                                 ground     truth,   but    in  fact    include      many      small,    hallucinated         objects.       735 734
  681
   681                                                                                                                                          We validate the perceptual realism of our results on the                                                        735
  682  fact gets a considerably lower FCN-score (Table 3). This                                                                                                                                                                                              736 735
  682
   682                                                                                                                                    tasks     of map↔aerial photograph and grayscale→color. Re-                                                           736
  683  may be because the ImageGAN has many more parameters                                                                                  nearly discrete, rather than “images”, with their continuous-                                                   737 736
  683                                                                                                                                     sults    of
                                                                                                                                                nearly  our    AMT rather
                                                                                                                                                           discrete,       experiment             for map↔photo                 arecontinuous-
                                                                                                                                                                                                                                       given in                 737
   683
  684 and greater depth than the 70 × 70 PatchGAN, and may be
  684
                                                                                                                                                 nearly
                                                                                                                                             valued         discrete,
                                                                                                                                                         variation.         ratherthan
                                                                                                                                                                            Although
                                                                                                                                                                                                “images”,
                                                                                                                                                                                        thancGANs“images”,        with
                                                                                                                                                                                                                   withtheir
                                                                                                                                                                                                              achieve      their
                                                                                                                                                                                                                           some       continuous-
                                                                                                                                                                                                                                       success,              738 737
                                                                                                                                                                                                                                                                738
   684                                                                                                                                    Table      4.
                                                                                                                                                valued
                                                                                                                                                 valued  The     aerial
                                                                                                                                                            variation.
                                                                                                                                                             variation.       photos
                                                                                                                                                                                Although    generated
                                                                                                                                                                                                  cGANs        by   our
                                                                                                                                                                                                                 achieve   method
                                                                                                                                                                                                                                somefooled
                                                                                                                                                                                                                               some         success,         739 738
  685
  685  harder to train.                                                                                                                      they    are far     from theAlthough best available    cGANs   methodachieve for solving        success,
                                                                                                                                                                                                                                               this             739
   685                                                                                                                                    participants
                                                                                                                                                they    are     on
                                                                                                                                                              far   from
                                                                                                                                                                      18.9%    the  ofbesttrials,     significantly
                                                                                                                                                                                              available         method       above
                                                                                                                                                                                                                              for          the
                                                                                                                                                                                                                                     solving     L1this      740 739
  686
  686      Fully-convolutional                            translation                An      advantage             of     the                problem: simply using L1 regression gets better scores thanthis
                                                                                                                                                 they    are   far   from       the    best    available         method        for    solving                   740
   686
  687                                                                                                                                     baseline,
                                                                                                                                                problem:
                                                                                                                                                 problem:
                                                                                                                                             using
                                                                                                                                                           which
                                                                                                                                                      a cGAN,
                                                                                                                                                                simply
                                                                                                                                                                 simply produces
                                                                                                                                                                              using
                                                                                                                                                                       as shownusinginL1 L1 blurry
                                                                                                                                                                                               regression
                                                                                                                                                                                             Table
                                                                                                                                                                                                         results
                                                                                                                                                                                                regression
                                                                                                                                                                                                       4. We argue
                                                                                                                                                                                                                  getsand      nearly
                                                                                                                                                                                                                          better
                                                                                                                                                                                                                   gets better
                                                                                                                                                                                                                            that for
                                                                                                                                                                                                                                      scores
                                                                                                                                                                                                                                           visionthan
                                                                                                                                                                                                                                       scoresneverthan       741 740
  687
   687
       PatchGAN is that a fixed-size patch discriminator can be                                                                           fooledusing    a  cGAN,
                                                                                                                                                      participants.       as   shown
                                                                                                                                                                                In          in
                                                                                                                                                                                     contrast,  Table  in 4.4.
                                                                                                                                                                                                             theWe   argue
                                                                                                                                                                                                                   photo→map    that     for   vision
                                                                                                                                                                                                                                            direc-              741
  688                                                                                                                                            using
                                                                                                                                             problems,    a  cGAN,
                                                                                                                                                              the    goal  as   shown
                                                                                                                                                                               (i.e.         in  Table
                                                                                                                                                                                         predicting              We
                                                                                                                                                                                                             output   argue
                                                                                                                                                                                                                        close    that
                                                                                                                                                                                                                                    to     for
                                                                                                                                                                                                                                         ground vision       742 741
   688 applied
  688
  689        Figure   to 7:arbitrarily
                               Colorization        large       images.
                                                         results                  We mayGANs
                                                                      of conditional                also versus
                                                                                                              applythe    theL2           tion  problems,
                                                                                                                                                   our
                                                                                                                                                 problems,method  the
                                                                                                                                                                   the   goal
                                                                                                                                                                         only
                                                                                                                                                                          goal     (i.e.
                                                                                                                                                                                    fooled
                                                                                                                                                                                    (i.e.    predicting
                                                                                                                                                                                                  participants
[Figure 8 panels: Map to aerial photo (input, output); Aerial photo to map (input, output)]

Figure 8: Example results on Google Maps at 512 × 512 resolution (model was trained on images at
256 × 256 resolution, and run convolutionally on the larger images at test time). Contrast adjusted
for clarity.
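
The caption's remark about running "convolutionally" on larger images relies on a general property of convolutional layers: their weights do not depend on the input size. The sketch below illustrates only that property. It is not the generator from the paper: `toy_generator`, `conv2d_same`, `upsample2x`, and the fixed box-blur `KERNEL` are hypothetical stand-ins, and the 8 × 8 and 16 × 16 inputs stand in for 256 × 256 and 512 × 512 images.

```python
def conv2d_same(image, kernel, stride=1):
    """Zero-padded 2-D cross-correlation on a list-of-lists image."""
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    h, w = len(image), len(image[0])
    out = []
    for i in range(0, h, stride):
        row = []
        for j in range(0, w, stride):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    y, x = i + di - ph, j + dj - pw
                    if 0 <= y < h and 0 <= x < w:  # outside pixels count as zero
                        acc += kernel[di][dj] * image[y][x]
            row.append(acc)
        out.append(row)
    return out


def upsample2x(image):
    """Nearest-neighbour 2x upsampling."""
    out = []
    for row in image:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def toy_generator(image, kernel):
    """Downsample with a stride-2 convolution, then upsample back."""
    return upsample2x(conv2d_same(image, kernel, stride=2))


def shape(image):
    return (len(image), len(image[0]))


# Hypothetical fixed weights (a 3x3 box blur); a real model would learn these.
KERNEL = [[1.0 / 9.0] * 3 for _ in range(3)]

small = [[float((r + c) % 2) for c in range(8)] for r in range(8)]
large = [[float((r + c) % 2) for c in range(16)] for r in range(16)]

# The same weights are applied to both sizes without any change.
print(shape(toy_generator(small, KERNEL)))
print(shape(toy_generator(large, KERNEL)))
```

Because nothing in the layers fixes the spatial extent, the output size simply follows the input size, which is what allows training at one resolution and testing at another.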
[Figure 9 panel columns: L2 [62]; Classification (rebal.) [62]; Ours (L1 + cGAN); Ground truth]

Figure 9: Colorization results of conditional GANs versus the L2 regression from [62] and the full
method (classification with rebalancing) from [64]. The cGANs can produce compelling colorizations
(first two rows), but have a common failure mode of producing a grayscale or desaturated result
(last row).

[Figure 10 panel columns: Input; Ground truth; L1; cGAN]

Figure 10: Applying a conditional GAN to semantic segmentation. The cGAN produces sharp images that
look at a glance like the ground truth, but in fact include many small, hallucinated objects.

in maps, which have rigid geometry, than in aerial photographs, which are more chaotic.
   We trained colorization on ImageNet [51], and tested on the test split introduced by [62, 35].
Our method, with L1+cGAN loss, fooled participants on 22.5% of trials (Table 5). We also tested the
results of [62] and a variant of their method that used an L2 loss (see [62] for details). The
conditional GAN scored similarly to the L2 variant of [62] (difference insignificant by bootstrap
test), but fell short of [62]’s full method, which fooled participants on 27.8% of trials in our
experiment. We note that their method was specifically engineered to do well on colorization.

4.6. Semantic segmentation

   Conditional GANs appear to be effective on problems where the output is highly detailed or
photographic, as is common in image processing and graphics tasks. What
[Figure 11 panel credits: sketch by Ivy Tsai; by Kaihu Chen; by Jack Qiao; by Mario Klingemann;
by Bertrand Gondouin; by Brannon Dorsey; sketch by Yann LeCun; #fotogenerator]

Figure 11: Example applications developed by the online community based on our pix2pix codebase:
#edges2cats [3] by Christopher Hesse, Background removal [6] by Kaihu Chen, Palette generation [5]
by Jack Qiao, Sketch → Portrait [7] by Mario Klingemann, Sketch → Pokemon [1] by Bertrand Gondouin,
“Do As I Do” pose transfer [2] by Brannon Dorsey, and #fotogenerator by Bosman et al. [4].

        Loss                Per-pixel acc.   Per-class acc.   Class IOU
        L1                      0.86             0.42            0.35
        cGAN                    0.74             0.28            0.22
        L1+cGAN                 0.83             0.36            0.29
      Table 6: Performance of photo→labels on cityscapes.
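
The three columns of Table 6 are commonly derived from a confusion matrix. The following is a minimal sketch assuming the standard definitions of per-pixel accuracy, mean per-class accuracy, and mean class IOU; it is not the evaluation code used to produce the table. The function names and the tiny label lists are hypothetical.

```python
def confusion_matrix(truth, pred, num_classes):
    """cm[t][p] counts pixels whose true class is t and predicted class is p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, pred):
        cm[t][p] += 1
    return cm


def segmentation_scores(cm):
    """Return (per-pixel accuracy, mean per-class accuracy, mean class IOU)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))
    per_pixel = correct / total

    class_accs, ious = [], []
    for k in range(n):
        tp = cm[k][k]
        gt = sum(cm[k])                              # pixels truly of class k
        predicted = sum(cm[r][k] for r in range(n))  # pixels predicted as class k
        union = gt + predicted - tp
        if gt > 0:       # skip classes absent from the ground truth
            class_accs.append(tp / gt)
        if union > 0:    # skip classes absent from both truth and prediction
            ious.append(tp / union)
    per_class = sum(class_accs) / len(class_accs)
    mean_iou = sum(ious) / len(ious)
    return per_pixel, per_class, mean_iou


# Hypothetical flattened label maps with three classes.
truth = [0, 0, 0, 0, 1, 1, 2, 2]
pred = [0, 0, 0, 1, 1, 1, 2, 0]

per_pixel, per_class, mean_iou = segmentation_scores(confusion_matrix(truth, pred, 3))
print(per_pixel, per_class, mean_iou)
```

Per-pixel accuracy is dominated by frequent classes, while the per-class and IOU averages weight every class equally, so they penalize errors on rare classes more strongly.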

about vision problems, like semantic segmentation, where the output is instead less complex than
the input?
   To begin to test this, we train a cGAN (with/without L1 loss) on Cityscapes photo→labels.
Figure 10 shows qualitative results, and quantitative classification accuracies are reported in
Table 6. Interestingly, cGANs, trained without the L1 loss, are able to solve this problem to a
reasonable degree of accuracy. To our knowledge, this is the first demonstration of GANs
successfully generating “labels”, which are nearly discrete, rather than “images”, with their
continuous-valued variation². Although cGANs achieve some success, they are far from the best
available method for solving this problem: simply using L1 regression gets better scores than
using a cGAN, as shown in Table 6. We argue that for vision problems, the goal (i.e. predicting
output close to the ground truth) may be less ambiguous than in graphics tasks, and reconstruction
losses like L1 are mostly sufficient.

4.7. Community-driven Research

   Since the initial release of the paper and our pix2pix codebase, the Twitter community,
including computer vision and graphics practitioners as well as visual artists, have successfully
applied our framework to a variety of novel image-to-image translation tasks, far beyond the scope
of the original paper. Figure 11 and Figure 12 show just a few examples from the #pix2pix hashtag,
including Background removal, Palette generation, Sketch → Portrait, Sketch → Pokemon, “Do As I Do”
pose transfer, Learning to see: Gloomy Sunday, as well as the bizarrely popular #edges2cats and
#fotogenerator. Note that these applications are creative projects, were not obtained under
controlled, scientific conditions, and may rely on some modifications to the pix2pix code we
released. Nonetheless, they demonstrate the promise of our approach as a generic commodity tool
for image-to-image translation problems.

Figure 12: Learning to see: Gloomy Sunday: An interactive artistic demo developed by Memo Akten [8]
based on our pix2pix codebase. Please click the image to play the video in a browser.

5. Conclusion

   The results in this paper suggest that conditional adversarial networks are a promising approach
for many image-to-image translation tasks, especially those involving highly structured graphical
outputs. These networks learn a loss adapted to the task and data at hand, which makes them
applicable in a wide variety of settings.

Acknowledgments: We thank Richard Zhang, Deepak Pathak, and Shubham Tulsiani for helpful
discussions, Saining Xie for help with the HED edge detector, and the online community for
exploring many applications and suggesting improvements. Thanks to Christopher Hesse, Memo Akten,
Kaihu Chen, Jack Qiao, Mario Klingemann, Brannon Dorsey, Gerda Bosman, Ivy Tsai, and Yann LeCun for
allowing the use of their creations in Figure 11 and Figure 12. This work was supported in part by
NSF SMA-1514512, NGA NURI, IARPA via Air Force Research Laboratory, Intel Corp, Berkeley Deep
Drive, and hardware donations by Nvidia. J.-Y.Z. is supported by the Facebook Graduate Fellowship.
Disclaimer: The views and conclusions contained herein are those of the authors and should not be
interpreted as necessarily representing the official policies or endorsements, either expressed or
implied, of IARPA, AFRL or the U.S. Government.

    ² Note that the label maps we train on are not exactly discrete valued, as they are resized
from the original maps using bilinear interpolation and saved as jpeg images, with some compression
artifacts.
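
As a small illustration of the point in footnote 2, the sketch below shows how interpolating a label map can produce values that match no class id. It uses 1-D linear interpolation as a stand-in for bilinear resizing and ignores JPEG compression entirely. The function names, the class ids, and the pixel-centre alignment convention are assumptions for illustration, not details from the paper.

```python
def resize_linear(row, new_len):
    """Resize a 1-D row by linear interpolation (1-D analogue of bilinear)."""
    old_len = len(row)
    out = []
    for i in range(new_len):
        # Map the output pixel centre to a source coordinate, clamped to the row.
        x = (i + 0.5) * old_len / new_len - 0.5
        x = min(max(x, 0.0), old_len - 1.0)
        lo = int(x)
        hi = min(lo + 1, old_len - 1)
        frac = x - lo
        out.append((1.0 - frac) * row[lo] + frac * row[hi])
    return out


def resize_nearest(row, new_len):
    """Resize a 1-D row by nearest-neighbour sampling (keeps values discrete)."""
    old_len = len(row)
    return [row[int((i + 0.5) * old_len / new_len)] for i in range(new_len)]


labels = [0, 0, 0, 7, 7, 7]  # hypothetical class ids along one image row

blended = resize_linear(labels, 3)
discrete = resize_nearest(labels, 3)
print(blended)   # [0.0, 3.5, 7.0]
print(discrete)  # [0, 7, 7]
```

The value 3.5 at the class boundary corresponds to neither class, which is why interpolated label maps are only "nearly discrete".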
[Figure 13 panel columns: Input; Ground truth; Output (two sets side by side)]

Figure 13: Example results of our method on Cityscapes labels→photo, compared to ground truth.

[Figure 14 panel columns: Input; Ground truth; Output (two sets side by side)]

Figure 14: Example results of our method on facades labels→photo, compared to ground truth.
[Figure 15 panel columns: Input; Ground truth; Output (two sets side by side)]

Figure 15: Example results of our method on day→night, compared to ground truth.

[Figure 16 panel columns: Input; Ground truth; Output (two sets side by side)]

Figure 16: Example results of our method on automatically detected edges→handbags, compared to ground truth.
[Figure 17 panel columns: Input; Ground truth; Output (two sets side by side)]

Figure 17: Example results of our method on automatically detected edges→shoes, compared to ground truth.

[Figure 18 panel columns: Input; Output (three sets side by side)]

Figure 18: Additional results of the edges→photo models applied to human-drawn sketches from [19]. Note that the models were trained
on automatically detected edges, but generalize to human drawings.
Figure 19: Example results on photo inpainting, compared to [43], on the Paris StreetView dataset [14]. This experiment demonstrates that
the U-net architecture can be effective even when the predicted pixels are not geometrically aligned with the information in the input – the
information used to fill in the central hole has to be found in the periphery of these photos.


Figure 20: Example results on translating thermal images to RGB photos, on the dataset from [27]. (Columns: Input, Ground truth, Output.)


Figure 21: Example failure cases. Each pair of images shows input on the left and output on the right. These examples are selected as some
of the worst results on our tasks. Common failures include artifacts in regions where the input image is sparse, and difficulty in handling
unusual inputs. Please see https://phillipi.github.io/pix2pix/ for more comprehensive results. (Pairs shown: Day→Night, Labels→Facade,
Labels→Street scene, Edges→Shoe, Edges→Handbag, Sketch→Shoe, Sketch→Handbag.)

References

 [1] Bertrand Gondouin. https://twitter.com/bgondouin/status/818571935529377792. Accessed, 2017-04-21. 9
 [2] Brannon Dorsey. https://twitter.com/brannondorsey/status/806283494041223168. Accessed, 2017-04-21. 9
 [3] Christopher Hesse. https://affinelayer.com/pixsrv/. Accessed: 2017-04-21. 9
 [4] Gerda Bosman, Tom Kenter, Rolf Jagerman, and Daan Gosman. https://dekennisvannu.nl/site/artikel/Help-ons-kunstmatige-intelligentie-testen/9163. Accessed: 2017-08-31. 9
 [5] Jack Qiao. http://colormind.io/blog/. Accessed: 2017-04-21. 9
 [6] Kaihu Chen. http://www.terraai.org/imageops/index.html. Accessed, 2017-04-21. 9
 [7] Mario Klingemann. https://twitter.com/quasimondo/status/826065030944870400. Accessed, 2017-04-21. 9
 [8] Memo Akten. https://vimeo.com/260612034. Accessed, 2018-11-07. 9
 [9] A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In CVPR, 2005. 1
[10] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015. 2
[11] T. Chen, M.-M. Cheng, P. Tan, A. Shamir, and S.-M. Hu. Sketch2photo: internet image montage. ACM Transactions on Graphics (TOG), 28(5):124, 2009. 1
[12] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016. 4, 16
[13] E. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, 2015. 2
[14] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros. What makes paris look like paris? ACM Transactions on Graphics, 31(4), 2012. 4, 13, 17
[15] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In NIPS, 2016. 2
[16] A. A. Efros and W. T. Freeman. Image quilting for texture synthesis and transfer. In SIGGRAPH, 2001. 1, 4
[17] A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In ICCV, 1999. 4
[18] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015. 1
[19] M. Eitz, J. Hays, and M. Alexa. How do humans sketch objects? In SIGGRAPH, 2012. 4, 12
[20] R. Fergus, B. Singh, A. Hertzmann, S. T. Roweis, and W. T. Freeman. Removing camera shake from a single photograph. ACM Transactions on Graphics (TOG), 25(3):787–794, 2006. 1
[21] L. A. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In NIPS, 2015. 4
[22] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. CVPR, 2016. 4
[23] J. Gauthier. Conditional generative adversarial nets for convolutional face generation. Class Project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter semester, (5):2, 2014. 2
[24] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014. 2, 4, 6, 7
[25] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. In SIGGRAPH, 2001. 1, 4
[26] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006. 3
[27] S. Hwang, J. Park, N. Kim, Y. Choi, and I. So Kweon. Multispectral pedestrian detection: Benchmark dataset and baseline. In CVPR, 2015. 4, 13, 16
[28] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification. ACM Transactions on Graphics (TOG), 35(4), 2016. 2
[29] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015. 3, 4
[30] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016. 2, 3
[31] L. Karacan, Z. Akata, A. Erdem, and E. Erdem. Learning to generate images of outdoor scenes from attributes and semantic layouts. arXiv preprint arXiv:1612.00215, 2016. 2
[32] D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015. 4
[33] P.-Y. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics (TOG), 33(4):149, 2014. 1, 4, 16
[34] A. B. L. Larsen, S. K. Sønderby, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016. 3
[35] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. ECCV, 2016. 2, 8, 16
[36] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017. 2
[37] C. Li and M. Wand. Combining markov random fields and convolutional neural networks for image synthesis. CVPR, 2016. 2, 4
[38] C. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. ECCV, 2016. 2, 4
[39] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. 1, 2, 5
[40] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. ICLR, 2016. 2, 3
[41] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014. 2
[42] A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman. Visually indicated sounds. In CVPR, 2016. 5
[43] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016. 2, 3, 13, 17
[44] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016. 2, 3, 16
[45] R. Tyleček and R. Šára. Spatial pattern templates for recognition of objects with regular structure. In German Conference on Pattern Recognition, 2013. 4, 16
[46] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016. 2
[47] S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst, M. Botvinick, and N. de Freitas. Generating interpretable images with controllable structure. In ICLR Workshop, 2017. 2
[48] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In NIPS, 2016. 2
[49] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley. Color transfer between images. IEEE Computer Graphics and Applications, 21:34–41, 2001. 7
[50] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015. 2, 3
[51] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015. 4, 8, 16
[52] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In NIPS, 2016. 2, 4, 5
[53] Y. Shih, S. Paris, F. Durand, and W. T. Freeman. Data-driven hallucination of different times of day from a single outdoor photo. ACM Transactions on Graphics (TOG), 32(6):200, 2013. 1
[54] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016. 4
[55] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In ECCV, 2016. 2, 3, 5
[56] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004. 2
[57] S. Xie, X. Huang, and Z. Tu. Top-down learning for structured labeling with convolutional pseudoprior. In ECCV, 2015. 2
[58] S. Xie and Z. Tu. Holistically-nested edge detection. In ICCV, 2015. 1, 2, 4
[59] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon. Pixel-level domain transfer. ECCV, 2016. 2, 3
[60] A. Yu and K. Grauman. Fine-Grained Visual Comparisons with Local Learning. In CVPR, 2014. 4
[61] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In CVPR, 2014. 16
[62] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. ECCV, 2016. 1, 2, 5, 7, 8, 16
[63] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2017. 2
[64] Y. Zhou and T. L. Berg. Learning temporal transformations from time-lapse videos. In ECCV, 2016. 2, 3, 8
[65] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In ECCV, 2016. 2, 4, 16

6. Appendix

6.1. Network architectures
    We adapt our network architectures from those in [44]. Code for the models is available at
https://github.com/phillipi/pix2pix.
    Let Ck denote a Convolution-BatchNorm-ReLU layer with k filters. CDk denotes a
Convolution-BatchNorm-Dropout-ReLU layer with a dropout rate of 50%. All convolutions are 4 × 4
spatial filters applied with stride 2. Convolutions in the encoder, and in the discriminator,
downsample by a factor of 2, whereas in the decoder they upsample by a factor of 2.

6.1.1   Generator architectures

The encoder-decoder architecture consists of:
encoder:
C64-C128-C256-C512-C512-C512-C512-C512
decoder:
CD512-CD512-CD512-C512-C256-C128-C64
   After the last layer in the decoder, a convolution is applied to map to the number of output
channels (3 in general, except in colorization, where it is 2), followed by a Tanh function. As an
exception to the above notation, BatchNorm is not applied to the first C64 layer in the encoder.
All ReLUs in the encoder are leaky, with slope 0.2, while ReLUs in the decoder are not leaky.
   The U-Net architecture is identical except with skip connections between each layer i in the
encoder and layer n − i in the decoder, where n is the total number of layers. The skip connections
concatenate activations from layer i to layer n − i. This changes the number of channels in the
decoder:
   U-Net decoder:
CD512-CD1024-CD1024-C1024-C1024-C512-C256-C128

6.1.2   Discriminator architectures

The 70 × 70 discriminator architecture is:
C64-C128-C256-C512
   After the last layer, a convolution is applied to map to a 1-dimensional output, followed by a
Sigmoid function. As an exception to the above notation, BatchNorm is not applied to the first C64
layer. All ReLUs are leaky, with slope 0.2.
   All other discriminators follow the same basic architecture, with depth varied to modify the
receptive field size:
   1 × 1 discriminator:
C64-C128 (note, in this special case, all convolutions are 1 × 1 spatial filters)
   16 × 16 discriminator:
C64-C128
   286 × 286 discriminator:
C64-C128-C256-C512-C512-C512

6.2. Training details
    Random jitter was applied by resizing the 256 × 256 input images to 286 × 286 and then randomly
cropping back to size 256 × 256.
    All networks were trained from scratch. Weights were initialized from a Gaussian distribution
with mean 0 and standard deviation 0.02.
    Cityscapes labels→photo: 2975 training images from the Cityscapes training set [12], trained
for 200 epochs, with random jitter and mirroring. We used the Cityscapes validation set for
testing. To compare the U-net against an encoder-decoder, we used a batch size of 10, whereas for
the objective function experiments we used batch size 1. We find that batch size 1 produces better
results for the U-net, but is inappropriate for the encoder-decoder. This is because we apply
batchnorm on all layers of our network, and for batch size 1 this operation zeros the activations
on the bottleneck layer. The U-net can skip over the bottleneck, but the encoder-decoder cannot,
and so the encoder-decoder requires a batch size greater than 1. Note, an alternative strategy is
to remove batchnorm from the bottleneck layer. See errata for more details.
    Architectural labels→photo: 400 training images from [45], trained for 200 epochs, batch size 1,
with random jitter and mirroring. Data were split into train and test randomly.
    Maps↔aerial photograph: 1096 training images scraped from Google Maps, trained for 200 epochs,
batch size 1, with random jitter and mirroring. Images were sampled from in and around New York
City. Data were then split into train and test about the median latitude of the sampling region
(with a buffer region added to ensure that no training pixel appeared in the test set).
    BW→color: 1.2 million training images (Imagenet training set [51]), trained for ∼6 epochs,
batch size 4, with only mirroring, no random jitter. Tested on subset of Imagenet val set,
following protocol of [62] and [35].
    Edges→shoes: 50k training images from UT Zappos50K dataset [61], trained for 15 epochs, batch
size 4. Data were split into train and test randomly.
    Edges→Handbag: 137K Amazon Handbag images from [65], trained for 15 epochs, batch size 4. Data
were split into train and test randomly.
    Day→night: 17823 training images extracted from 91 webcams, from [33], trained for 17 epochs,
batch size 4, with random jitter and mirroring. We use 91 webcams as training, and 10 webcams for
test.
    Thermal→color photos: 36609 training images from set 00–05 of [27], trained for 10 epochs,
batch size 4. Images from set 06–11 are used for testing.
    Photo with missing pixels→inpainted photo: 14900 training images from [14], trained for 25
epochs, batch size 4, and tested on 100 held out images following the split of [43].

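The channel counts in the U-Net decoder listing of Section 6.1.1 can be checked with a few lines of arithmetic. The sketch below assumes one particular reading that is not spelled out in the text: that each number in the U-Net decoder listing is the number of channels entering a decoder layer after a skip connection has been concatenated. The function names are hypothetical and only the Python standard library is used.

```python
ENCODER = "C64-C128-C256-C512-C512-C512-C512-C512"
DECODER = "CD512-CD512-CD512-C512-C256-C128-C64"


def parse_filters(spec):
    """Turn a string such as 'C64-CD512' into filter counts [64, 512]."""
    return [int(token.lstrip("CD")) for token in spec.split("-")]


def unet_decoder_inputs(encoder_spec, decoder_spec):
    """Channels entering each decoder-side layer when skips are concatenated.

    Assumption (ours): decoder layer j outputs the filter count given in
    decoder_spec, and that output is concatenated with the mirrored encoder
    layer before it is passed to the next layer.
    """
    enc = parse_filters(encoder_spec)
    dec = parse_filters(decoder_spec)
    n = len(enc)
    # The first decoder layer reads the bottleneck (last encoder layer) directly.
    inputs = [enc[-1]]
    for j, out_channels in enumerate(dec, start=1):
        # Mirrored encoder layer: the j-th layer counted back from the bottleneck.
        skip_channels = enc[n - j - 1]
        inputs.append(out_channels + skip_channels)
    return inputs


print(unet_decoder_inputs(ENCODER, DECODER))
```

Under this reading the computed sequence has eight entries, one for each of the seven decoder layers plus the final output convolution, which is consistent with the U-Net decoder listing having one more entry than the encoder-decoder one.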
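The receptive field sizes that name the discriminators in Section 6.1.2 follow from the standard recurrence r_in = (r_out − 1) · stride + kernel, applied from the output back to the input. The sketch below is a worked check under an assumption that is ours and is not stated in the text: the last Ck layer and the final output convolution use stride 1, while the earlier Ck layers use stride 2. The helper names are hypothetical.

```python
def receptive_field(layers):
    """Receptive field of one output unit for a stack of (kernel, stride) layers.

    Layers are listed from input to output; the recurrence runs backwards.
    """
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


def patch_discriminator(num_ck_layers, kernel=4):
    """(kernel, stride) list for a discriminator with num_ck_layers Ck layers.

    Assumption (ours): every Ck layer has stride 2 except the last one, and
    the final convolution that maps to a 1-dimensional output has stride 1.
    """
    layers = [(kernel, 2)] * (num_ck_layers - 1) + [(kernel, 1)]
    layers.append((kernel, 1))  # output convolution
    return layers


for name, depth in [("C64-C128", 2),
                    ("C64-C128-C256-C512", 4),
                    ("C64-C128-C256-C512-C512-C512", 6)]:
    print(name, receptive_field(patch_discriminator(depth)))

# For comparison: the same four Ck layers plus output conv, all with stride 2.
print(receptive_field([(4, 2)] * 5))
```

With this stride assumption the three stacks give 16, 70 and 286, matching the names used above. If every layer is given stride 2 instead, the C64-C128-C256-C512 stack comes out at 94, so the stride-1 assumption appears necessary to reproduce the quoted sizes.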
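The random jitter and mirroring of Section 6.2 can be sketched on plain nested lists. Several details here are assumptions for illustration, since the text does not specify them: nearest-neighbour interpolation for the resize, a flip probability of 0.5, and applying the same crop and flip to the input and its target so that the pair stays aligned. The function names are hypothetical.

```python
import random


def resize_nearest(img, size):
    """Resize a square-or-rectangular image (list of rows) to size x size.

    Nearest-neighbour sampling is an assumption; the text does not name
    the interpolation method.
    """
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]


def random_jitter(inp, target, load_size=286, crop_size=256, rng=random):
    """Resize both images to load_size, take one shared random crop, maybe mirror."""
    inp = resize_nearest(inp, load_size)
    target = resize_nearest(target, load_size)

    # One crop offset shared by input and target keeps the pair aligned.
    top = rng.randint(0, load_size - crop_size)
    left = rng.randint(0, load_size - crop_size)

    def crop(im):
        return [row[left:left + crop_size] for row in im[top:top + crop_size]]

    inp, target = crop(inp), crop(target)

    # Mirroring: horizontal flip with probability 0.5 (assumed value).
    if rng.random() < 0.5:
        inp = [row[::-1] for row in inp]
        target = [row[::-1] for row in target]
    return inp, target


if __name__ == "__main__":
    image = [[(r, c) for c in range(256)] for r in range(256)]
    a, b = random_jitter(image, image, rng=random.Random(0))
    print(len(a), len(a[0]))
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which is convenient when inspecting individual augmented pairs.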
6.3. Errata
   For all experiments reported in this paper with batch size 1, the activations of the bottleneck
layer are zeroed by the batchnorm operation, so the innermost layer is effectively skipped. This
issue can be fixed by removing batchnorm from this layer, as has been done in the public code. We
observe little difference with this change and therefore leave the experiments in the paper as
they are.
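The zeroing effect can be reproduced with a small standard-library sketch. With a 256 × 256 input and eight stride-2 encoder layers, the bottleneck is 1 × 1 spatially (256 / 2^8 = 1), so at batch size 1 each channel holds a single value, and subtracting the per-channel mean leaves zero. The function below is a simplified illustration, not the public implementation: it omits the learned scale and shift, with which the output would be a constant that still carries no information about the input.

```python
def batchnorm(batch, eps=1e-5):
    """Normalize each channel over the batch and spatial positions.

    batch: list of samples; each sample is a list of channels; each channel
    is a flat list of spatial values. Learned scale and shift are omitted.
    """
    n_channels = len(batch[0])
    out = [[None] * n_channels for _ in batch]
    for ch in range(n_channels):
        values = [v for sample in batch for v in sample[ch]]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        for i, sample in enumerate(batch):
            out[i][ch] = [(v - mean) / (var + eps) ** 0.5 for v in sample[ch]]
    return out


# Batch size 1, three channels, 1 x 1 spatial extent: every activation becomes 0.
print(batchnorm([[[3.7], [-1.2], [0.5]]]))

# Batch size 2 with the same 1 x 1 extent: the two samples remain distinguishable.
print(batchnorm([[[1.0]], [[3.0]]]))
```

The second call illustrates why a batch size greater than 1 is sufficient for the encoder-decoder: the per-channel statistics are then computed over more than one value.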

6.4. Change log
   arXiv v2: Reran generator architecture comparisons (Section 4.3) with batch size equal to 10
rather than 1, so that the bottleneck layer is not zeroed (see Errata). Reran FCN-scores with minor
details cleaned up (results saved losslessly as pngs, removed unnecessary downsampling). FCN-scores
computed using scripts at
https://github.com/phillipi/pix2pix/tree/master/scripts/eval_cityscapes, commit d7e7b8b. Updated
several figures and text. Added additional results on thermal→color photos and inpainting, as well
as community contributions.
   arXiv v3: Added additional results on community contributions. Fixed minor typos.