Improved Denoising Diffusion Probabilistic Models

Alex Nichol * 1   Prafulla Dhariwal * 1

* Equal contribution. 1 OpenAI, San Francisco, USA. Correspondence to: <alex@openai.com>, <prafulla@openai.com>.

arXiv:2102.09672v1 [cs.LG] 18 Feb 2021

Abstract

Denoising diffusion probabilistic models (DDPM) are a class of generative models which have recently been shown to produce excellent samples. We show that with a few simple modifications, DDPMs can also achieve competitive log-likelihoods while maintaining high sample quality. Additionally, we find that learning variances of the reverse diffusion process allows sampling with an order of magnitude fewer forward passes with a negligible difference in sample quality, which is important for the practical deployment of these models. We additionally use precision and recall to compare how well DDPMs and GANs cover the target distribution. Finally, we show that the sample quality and likelihood of these models scale smoothly with model capacity and training compute, making them easily scalable. We release our code at https://github.com/openai/improved-diffusion.

1. Introduction

Sohl-Dickstein et al. (2015) introduced diffusion probabilistic models, a class of generative models which match a data distribution by learning to reverse a gradual, multi-step noising process. More recently, Ho et al. (2020) showed an equivalence between denoising diffusion probabilistic models (DDPM) and score based generative models (Song & Ermon, 2019; 2020), which learn a gradient of the log-density of the data distribution using denoising score matching (Hyvärinen, 2005). It has recently been shown that this class of models can produce high-quality images (Ho et al., 2020; Song & Ermon, 2020; Jolicoeur-Martineau et al., 2020) and audio (Chen et al., 2020b; Kong et al., 2020), but it has yet to be shown that DDPMs can achieve log-likelihoods competitive with other likelihood-based models such as autoregressive models (van den Oord et al., 2016c) and VAEs (Kingma & Welling, 2013). This raises various questions, such as whether DDPMs are capable of capturing all the modes of a distribution. Furthermore, while Ho et al. (2020) showed extremely good results on the CIFAR-10 (Krizhevsky, 2009) and LSUN (Yu et al., 2015) datasets, it is unclear how well DDPMs scale to datasets with higher diversity such as ImageNet. Finally, while Chen et al. (2020b) found that DDPMs can efficiently generate audio using a small number of sampling steps, it has yet to be shown that the same is true for images.

In this paper, we show that DDPMs can achieve log-likelihoods competitive with other likelihood-based models, even on high-diversity datasets like ImageNet. To more tightly optimize the variational lower-bound (VLB), we learn the reverse process variances using a simple reparameterization and a hybrid learning objective that combines the VLB with the simplified objective from Ho et al. (2020).

We find surprisingly that, with our hybrid objective, our models obtain better log-likelihoods than those obtained by optimizing the log-likelihood directly, and discover that the latter objective has much more gradient noise during training. We show that a simple importance sampling technique reduces this noise and allows us to achieve better log-likelihoods than with the hybrid objective.

After incorporating learned variances into our model, we surprisingly discovered that we could sample in fewer steps from our models with very little change in sample quality. While DDPM (Ho et al., 2020) requires hundreds of forward passes to produce good samples, we can achieve good samples with as few as 50 forward passes, thus speeding up sampling for use in practical applications. In parallel to our work, Song et al. (2020a) develop a different approach to fast sampling, and we compare against their approach, DDIM, in our experiments.

While likelihood is a good metric to compare against other likelihood-based models, we also wanted to compare the distribution coverage of these models with GANs. We use the improved precision and recall metrics (Kynkäänniemi et al., 2019) and discover that diffusion models achieve much higher recall for similar FID, suggesting that they do indeed cover a much larger portion of the target distribution.

Finally, since we expect machine learning models to consume more computational resources in the future, we evaluate the performance of these models as we increase model size and training compute. Similar to (Henighan et al.,
2020), we observe trends that suggest predictable improvements in performance as we increase training compute.

2. Denoising Diffusion Probabilistic Models

We briefly review the formulation of DDPMs from Ho et al. (2020). This formulation makes various simplifying assumptions, such as a fixed noising process $q$ which adds diagonal Gaussian noise at each timestep. For a more general derivation, see Sohl-Dickstein et al. (2015).

2.1. Definitions

Given a data distribution $x_0 \sim q(x_0)$, we define a forward noising process $q$ which produces latents $x_1$ through $x_T$ by adding Gaussian noise at time $t$ with variance $\beta_t \in (0, 1)$ as follows:

$$ q(x_1, \ldots, x_T \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}) \tag{1} $$

$$ q(x_t \mid x_{t-1}) := \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I) \tag{2} $$

Given sufficiently large $T$ and a well behaved schedule of $\beta_t$, the latent $x_T$ is nearly an isotropic Gaussian distribution. Thus, if we know the exact reverse distribution $q(x_{t-1} \mid x_t)$, we can sample $x_T \sim \mathcal{N}(0, I)$ and run the process in reverse to get a sample from $q(x_0)$. However, since $q(x_{t-1} \mid x_t)$ depends on the entire data distribution, we approximate it using a neural network as follows:

$$ p_\theta(x_{t-1} \mid x_t) := \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)) \tag{3} $$

The combination of $q$ and $p$ is a variational auto-encoder (Kingma & Welling, 2013), and we can write the variational lower bound (VLB) as follows:

$$ L_{\text{vlb}} := L_0 + L_1 + \ldots + L_{T-1} + L_T \tag{4} $$

$$ L_0 := -\log p_\theta(x_0 \mid x_1) \tag{5} $$

$$ L_{t-1} := D_{KL}(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)) \tag{6} $$

$$ L_T := D_{KL}(q(x_T \mid x_0) \,\|\, p(x_T)) \tag{7} $$

Aside from $L_0$, each term of Equation 4 is a KL divergence between two Gaussians, and can thus be evaluated in closed form. To evaluate $L_0$ for images, we assume that each color component is divided into 256 bins, and we compute the probability of $p_\theta(x_0 \mid x_1)$ landing in the correct bin (which is tractable using the CDF of the Gaussian distribution). Also note that while $L_T$ does not depend on $\theta$, it will be close to zero if the forward noising process adequately destroys the data distribution so that $q(x_T \mid x_0) \approx \mathcal{N}(0, I)$.

As noted in (Ho et al., 2020), the noising process defined in Equation 2 allows us to sample an arbitrary step of the noised latents directly conditioned on the input $x_0$. With $\alpha_t := 1 - \beta_t$ and $\bar{\alpha}_t := \prod_{s=0}^{t} \alpha_s$, we can write the marginal

$$ q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I) \tag{8} $$

$$ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \tag{9} $$

where $\epsilon \sim \mathcal{N}(0, I)$. Here, $1 - \bar{\alpha}_t$ tells us the variance of the noise for an arbitrary timestep, and we could equivalently use this to define the noise schedule instead of $\beta_t$.

Using Bayes' theorem, one can calculate the posterior $q(x_{t-1} \mid x_t, x_0)$ in terms of $\tilde{\beta}_t$ and $\tilde{\mu}_t(x_t, x_0)$ which are defined as follows:

$$ \tilde{\beta}_t := \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t \tag{10} $$

$$ \tilde{\mu}_t(x_t, x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t \tag{11} $$

$$ q(x_{t-1} \mid x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I) \tag{12} $$

2.2. Training in Practice

The objective in Equation 4 is a sum of independent terms $L_{t-1}$, and Equation 9 provides an efficient way to sample from an arbitrary step of the forward noising process and estimate $L_{t-1}$ using the posterior (Equation 12) and prior (Equation 3). We can thus randomly sample $t$ and use the expectation $E_{t, x_0, \epsilon}[L_{t-1}]$ to estimate $L_{\text{vlb}}$. Ho et al. (2020) uniformly sample $t$ for each image in each mini-batch.

There are many different ways to parameterize $\mu_\theta(x_t, t)$ in the prior. The most obvious option is to predict $\mu_\theta(x_t, t)$ directly with a neural network. Alternatively, the network could predict $x_0$, and this output could be used in Equation 11 to produce $\mu_\theta(x_t, t)$. The network could also predict the noise $\epsilon$ and use Equations 9 and 11 to derive

$$ \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) \tag{13} $$

Ho et al. (2020) found that predicting $\epsilon$ worked best, especially when combined with a reweighted loss function:

$$ L_{\text{simple}} = E_{t, x_0, \epsilon}\left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right] \tag{14} $$

This objective can be seen as a reweighted form of $L_{\text{vlb}}$ (without the terms affecting $\Sigma_\theta$). The authors found that optimizing this reweighted objective resulted in much better sample quality than optimizing $L_{\text{vlb}}$ directly, and explain this by drawing a connection to generative score matching (Song & Ermon, 2019; 2020).

One subtlety is that $L_{\text{simple}}$ provides no learning signal for $\Sigma_\theta(x_t, t)$. This is irrelevant, however, since Ho et al. (2020) achieved their best results by fixing the variance to $\sigma_t^2 I$ rather than learning it. They found that they achieve similar sample quality using either $\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde{\beta}_t$, which are the upper and lower bounds on the variance given by $q(x_0)$ being either isotropic Gaussian noise or a delta function, respectively.
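As an illustration of Equations 9 to 13, the following NumPy sketch draws a noised latent in closed form, computes the posterior mean and variance, and evaluates the ε-parameterized mean. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the linear schedule endpoints (`1e-4` and `0.02`), and the 0-based array indexing of timesteps are all choices made here for illustration.

```python
import numpy as np

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear schedule; the endpoints are illustrative assumptions."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) using Equation 9; returns (x_t, eps)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

def posterior(x0, xt, t, betas, alphas, alpha_bar):
    """Mean and variance of q(x_{t-1} | x_t, x_0), Equations 10 and 11 (needs t >= 1)."""
    ab_prev = alpha_bar[t - 1]
    beta_tilde = (1.0 - ab_prev) / (1.0 - alpha_bar[t]) * betas[t]
    mean = (np.sqrt(ab_prev) * betas[t] / (1.0 - alpha_bar[t]) * x0
            + np.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - alpha_bar[t]) * xt)
    return mean, beta_tilde

def mean_from_eps(xt, eps, t, betas, alphas, alpha_bar):
    """Equation 13, with a given eps standing in for the network output."""
    return (xt - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])

# Usage: with the true noise, Equation 13 reproduces the posterior mean of Equation 11.
rng = np.random.default_rng(0)
betas = linear_betas(1000)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
x0 = rng.standard_normal((4, 3))
xt, eps = q_sample(x0, 500, alpha_bar, rng)
post_mean, post_var = posterior(x0, xt, 500, betas, alphas, alpha_bar)
eps_mean = mean_from_eps(xt, eps, 500, betas, alphas, alpha_bar)
```

Substituting $x_0 = (x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon) / \sqrt{\bar{\alpha}_t}$ from Equation 9 into Equation 11 gives Equation 13, which is why `post_mean` and `eps_mean` agree up to floating-point error in the usage lines.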
3. Improving the Log-likelihood

While Ho et al. (2020) found that DDPMs can generate high-fidelity samples according to FID (Heusel et al., 2017) and Inception Score (Salimans et al., 2016), they were unable to achieve competitive log-likelihoods with these models. Log-likelihood is a widely used metric in generative modeling, and it is generally believed that optimizing log-likelihood forces generative models to capture all of the modes of the data distribution (Razavi et al., 2019). Additionally, recent work (Henighan et al., 2020) has shown that small improvements in log-likelihood can have a dramatic impact on sample quality and learnt feature representations. Thus, it is important to explore why DDPMs seem to perform poorly on this metric, since this may suggest a fundamental shortcoming such as bad mode coverage. This section explores several modifications to the algorithm described in Section 2 that, when combined, allow DDPMs to achieve much better log-likelihoods on image datasets, suggesting that these models enjoy the same benefits as other likelihood-based generative models.

To study the effects of different modifications, we train fixed model architectures with fixed hyperparameters on the ImageNet 64 × 64 (van den Oord et al., 2016b) and CIFAR-10 (Krizhevsky, 2009) datasets. While CIFAR-10 has seen more usage for this class of models, we chose to study ImageNet 64 × 64 as well because it provides a good trade-off between diversity and resolution, allowing us to train models quickly without worrying about overfitting. Additionally, ImageNet 64 × 64 has been studied extensively in the context of generative modeling (van den Oord et al., 2016c; Menick & Kalchbrenner, 2018; Child et al., 2019; Roy et al., 2020), allowing us to compare DDPMs directly to many other generative models.

The setup from Ho et al. (2020) (optimizing $L_{\text{simple}}$ while setting $\sigma_t^2 = \beta_t$ and $T = 1000$) achieves a log-likelihood of 3.99 (bits/dim) on ImageNet 64 × 64 after 200K training iterations. We found in early experiments that we could get a boost in log-likelihood by increasing $T$ from 1000 to 4000; with this change, the log-likelihood improves to 3.77. For the remainder of this section, we use $T = 4000$, but we explore this choice in Section 4.

3.1. Learning $\Sigma_\theta(x_t, t)$

In Ho et al. (2020), the authors set $\Sigma_\theta(x_t, t) = \sigma_t^2 I$, where $\sigma_t$ is not learned. Oddly, they found that fixing $\sigma_t^2$ to $\beta_t$ yielded roughly the same sample quality as fixing it to $\tilde{\beta}_t$.

[Figure 1. The ratio $\tilde{\beta}_t / \beta_t$ for every diffusion step for diffusion processes of different lengths. x-axis: diffusion step (t/T); y-axis: $\tilde{\beta}_t / \beta_t$ (log scale); curves: T = 100, 1000, and 10000 steps.]

[Figure 2. Terms of the VLB vs diffusion step. The first few terms contribute most to NLL. x-axis: diffusion step (0 to 4000); y-axis: loss term (bits), log scale.]

Considering that $\beta_t$ and $\tilde{\beta}_t$ represent two opposite extremes, it is reasonable to ask why this choice doesn't affect samples. One clue is given by Figure 1, which shows that $\beta_t$ and $\tilde{\beta}_t$ are almost equal except near $t = 0$, i.e. where the model is dealing with imperceptible details. Furthermore, as we increase the number of diffusion steps, $\beta_t$ and $\tilde{\beta}_t$ seem to remain close to one another for more of the diffusion process. This suggests that, in the limit of infinite diffusion steps, the choice of $\sigma_t$ might not matter at all for sample quality. In other words, as we add more diffusion steps, the model mean $\mu_\theta(x_t, t)$ determines the distribution much more than $\Sigma_\theta(x_t, t)$.

While the above argument suggests that fixing $\sigma_t$ is a reasonable choice for the sake of sample quality, it says nothing about log-likelihood. In fact, Figure 2 shows that the first few steps of the diffusion process contribute the most to the variational lower bound. Thus, it seems likely that we could improve log-likelihood by using a better choice of $\Sigma_\theta(x_t, t)$. To achieve this, we must learn $\Sigma_\theta(x_t, t)$ without the instabilities encountered by Ho et al. (2020).

Since Figure 1 shows that the reasonable range for $\Sigma_\theta(x_t, t)$ is very small, it would be hard for a neural network to predict $\Sigma_\theta(x_t, t)$ directly, even in the log domain, as observed by Ho et al. (2020). Instead, we found it better to parameterize the variance as an interpolation between $\beta_t$ and $\tilde{\beta}_t$ in the
log domain. In particular, our model outputs a vector $v$ containing one component per dimension, and we turn this output into variances as follows:

$$ \Sigma_\theta(x_t, t) = \exp(v \log \beta_t + (1 - v) \log \tilde{\beta}_t) \tag{15} $$

We did not apply any constraints on $v$, theoretically allowing the model to predict variances outside of the interpolated range. However, we did not observe the network doing this in practice, suggesting that the bounds for $\Sigma_\theta(x_t, t)$ are indeed expressive enough.

Since $L_{\text{simple}}$ doesn't depend on $\Sigma_\theta(x_t, t)$, we define a new hybrid objective:

$$ L_{\text{hybrid}} = L_{\text{simple}} + \lambda L_{\text{vlb}} \tag{16} $$

For our experiments, we set $\lambda = 0.001$ to prevent $L_{\text{vlb}}$ from overwhelming $L_{\text{simple}}$. Along this same line of reasoning, we also apply a stop-gradient to the $\mu_\theta(x_t, t)$ output for the $L_{\text{vlb}}$ term. This way, $L_{\text{vlb}}$ can guide $\Sigma_\theta(x_t, t)$ while $L_{\text{simple}}$ is still the main source of influence over $\mu_\theta(x_t, t)$.

3.2. Improving the Noise Schedule

We found that while the linear noise schedule used in Ho et al. (2020) worked well for high resolution images, it was sub-optimal for images of resolution 64 × 64 and 32 × 32. In particular, the end of the forward noising process is too noisy, and so doesn't contribute very much to sample quality. This can be seen visually in Figure 3. The result of this effect is studied in Figure 4, where we see that a model trained with the linear schedule does not get much worse (as measured by FID) when we skip up to 20% of the reverse diffusion process.

[Figure 3. Latent samples from linear (top) and cosine (bottom) schedules respectively at linearly spaced values of $t$ from 0 to $T$. The latents in the last quarter of the linear schedule are almost purely noise, whereas the cosine schedule adds noise more slowly.]

[Figure 4. FID when skipping a prefix of the reverse diffusion process on ImageNet 64 × 64. x-axis: fraction of reverse diffusion process skipped; y-axis: FID; curves: cosine schedule, linear schedule.]

To address this problem, we construct a different noise schedule in terms of $\bar{\alpha}_t$:

$$ \bar{\alpha}_t = \frac{f(t)}{f(0)}, \quad f(t) = \cos\left( \frac{t/T + s}{1 + s} \cdot \frac{\pi}{2} \right)^2 \tag{17} $$

To go from this definition to variances $\beta_t$, we note that $\beta_t = 1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}$. In practice, we clip $\beta_t$ to be no larger than 0.999 to prevent singularities at the end of the diffusion process near $t = T$.

Our cosine schedule is designed to have a linear drop-off of $\bar{\alpha}_t$ in the middle of the process, while changing very little near the extremes of $t = 0$ and $t = T$ to prevent abrupt changes in noise level. Figure 5 shows how $\bar{\alpha}_t$ progresses for both schedules. We can see that the linear schedule from Ho et al. (2020) falls towards zero much faster, destroying information more quickly than necessary.

[Figure 5. $\bar{\alpha}_t$ throughout diffusion in the linear schedule and our proposed cosine schedule. x-axis: diffusion step (t/T); y-axis: $\bar{\alpha}_t$; curves: linear, cosine.]

We use a small offset $s$ to prevent $\beta_t$ from being too small near $t = 0$, since we found that having tiny amounts of noise at the beginning of the process made it hard for the network to predict $\epsilon$ accurately enough. In particular, we selected $s$ such that $\sqrt{\beta_0}$ was slightly smaller than the pixel bin size 1/127.5, which gives $s = 0.008$. We chose to use $\cos^2$ in particular because it is a common mathematical function with the shape we were looking for. This choice was arbitrary, and we expect that many other functions with similar shapes would work as well.

3.3. Reducing Gradient Noise

We expected to achieve the best log-likelihoods by optimizing $L_{\text{vlb}}$ directly, rather than by optimizing $L_{\text{hybrid}}$. However,
[Figure 6. Learning curves comparing the log-likelihoods achieved by different objectives on ImageNet 64 × 64. x-axis: training iterations (1e3); y-axis: training log loss (bits/dim); curves: L_vlb, L_hybrid, L_vlb (resampled).]

Table 1. Ablating schedule and objective on ImageNet 64 × 64.

Iters   T    Schedule   Objective   NLL    FID
200K    1K   linear     L_simple    3.99   32.5
200K    4K   linear     L_simple    3.77   31.3
200K    4K   linear     L_hybrid    3.66   32.2
200K    4K   cosine     L_simple    3.68   27.0
200K    4K   cosine     L_hybrid    3.62   28.0
200K    4K   cosine     L_vlb       3.57   56.7
1.5M    4K   cosine     L_hybrid    3.57   19.2
1.5M    4K   cosine     L_vlb       3.53   40.1
Table 2. Ablating schedule and objective on CIFAR-10.

 Iters    T     Schedule    Objective    NLL (bits/dim)    FID
 500K     1K    linear      Lsimple      3.73              3.29
 500K     4K    linear      Lsimple      3.37              2.90
 500K     4K    linear      Lhybrid      3.26              3.07
 500K     4K    cosine      Lsimple      3.26              3.05
 500K     4K    cosine      Lhybrid      3.17              3.19
 500K     4K    cosine      Lvlb         2.94              11.47

[Figure 7 plot: gradient noise scale (log axis, 10^-1 to 10^5) versus training step (1e3); curves for Lvlb and Lhybrid.]
Figure 7. Gradient noise scales for the Lvlb and Lhybrid objectives on ImageNet 64 × 64.

we were surprised to find that Lvlb was actually quite difficult to optimize in practice, at least on the diverse ImageNet 64 × 64 dataset. Figure 6 shows the learning curves for both Lvlb and Lhybrid. Both curves are noisy, but the hybrid objective clearly achieves better log-likelihoods on the training set given the same amount of training time.

We hypothesized that the gradient of Lvlb was much noisier than that of Lhybrid. We confirmed this by evaluating the gradient noise scales (McCandlish et al., 2018) for models trained with both objectives, as shown in Figure 7. Thus, we sought out a way to reduce the variance of Lvlb in order to optimize directly for log-likelihood.

Noting that different terms of Lvlb have greatly different magnitudes (Figure 2), we hypothesized that sampling t uniformly causes unnecessary noise in the Lvlb objective. To address this, we employ importance sampling:

$$L_{\mathrm{vlb}} = \mathbb{E}_{t \sim p_t}\left[\frac{L_t}{p_t}\right], \quad \text{where } p_t \propto \sqrt{\mathbb{E}[L_t^2]} \text{ and } \sum p_t = 1 \tag{18}$$

Since E[L_t^2] is unknown beforehand and may change throughout training, we maintain a history of the previous 10 values for each loss term, and update this dynamically during training. At the beginning of training, we sample t uniformly until we draw 10 samples for every t ∈ [0, T − 1].

With this importance sampled objective, we are able to achieve our best log-likelihoods by optimizing Lvlb. This can be seen in Figure 6 as the Lvlb (resampled) curve. The figure also shows that the importance sampled objective is considerably less noisy than the original, uniformly sampled objective. We found that the importance sampling technique was not helpful when optimizing the less-noisy Lhybrid objective directly.

3.4. Results and Ablations

In this section, we ablate the changes we have made to achieve better log-likelihoods. Table 1 summarizes the results of our ablations on ImageNet 64 × 64, and Table 2 shows them for CIFAR-10. We also trained our best ImageNet 64 × 64 models for 1.5M iterations, and report these results as well. Lvlb and Lhybrid were trained with learned sigmas using the parameterization from Section 3.1. For Lvlb, we used the resampling scheme from Section 3.3.

Based on our ablations, using Lhybrid and our cosine schedule improves log-likelihood while keeping similar FID as the baseline from Ho et al. (2020). Optimizing Lvlb further improves log-likelihood at the cost of a higher FID. We generally prefer to use Lhybrid over Lvlb as it gives a boost in likelihood without sacrificing sample quality.

In Table 3 we compare our best likelihood models against prior work, showing that these models are competitive with the best conventional methods in terms of log-likelihood.
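
The resampling scheme of Section 3.3 (Equation 18) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `LossSecondMomentSampler` and its methods are hypothetical, the per-timestep losses L_t are assumed to be supplied by the training loop, and E[L_t^2] is estimated as the mean of the squared values in each timestep's history.

```python
import math
import random


class LossSecondMomentSampler:
    """Hypothetical helper: draws t with p_t proportional to sqrt(E[L_t^2])."""

    def __init__(self, num_timesteps, history_per_term=10, seed=0):
        self.num_timesteps = num_timesteps
        self.history_per_term = history_per_term
        # Most recent losses observed for each timestep (oldest dropped first).
        self.history = [[] for _ in range(num_timesteps)]
        self.rng = random.Random(seed)

    def warmed_up(self):
        # True once every timestep has a full history of loss values.
        return all(len(h) == self.history_per_term for h in self.history)

    def probabilities(self):
        if not self.warmed_up():
            # Beginning of training: sample t uniformly.
            return [1.0 / self.num_timesteps] * self.num_timesteps
        weights = [math.sqrt(sum(l * l for l in h) / len(h)) for h in self.history]
        total = sum(weights)
        return [w / total for w in weights]

    def sample(self, batch_size):
        p = self.probabilities()
        ts = self.rng.choices(range(self.num_timesteps), weights=p, k=batch_size)
        # Equation 18 divides each sampled term by p_t, so the batch mean of
        # loss_weight * L_t estimates the sum of L_t over all timesteps.
        loss_weights = [1.0 / p[t] for t in ts]
        return ts, loss_weights

    def update(self, timesteps, losses):
        # Record the newest loss for each sampled timestep.
        for t, loss in zip(timesteps, losses):
            self.history[t].append(float(loss))
            if len(self.history[t]) > self.history_per_term:
                self.history[t].pop(0)
```

In a training loop one would call `sample` to pick timesteps, weight each per-example L_t by the returned factor before averaging, and then pass the unweighted losses to `update`. The sketch does not guard against the degenerate case where every recorded loss is zero.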

Table 3. Comparison of DDPMs to other likelihood-based models on CIFAR-10 and Unconditional ImageNet 64 × 64. NLL is reported in bits/dim. On ImageNet 64 × 64, our model is competitive with the best convolutional models, but is worse than fully transformer-based architectures.

 Model                                       ImageNet     CIFAR
 Glow (Kingma & Dhariwal, 2018)                3.81        3.35
 Flow++ (Ho et al., 2019)                      3.69        3.08
 PixelCNN (van den Oord et al., 2016c)         3.57        3.14
 SPN (Menick & Kalchbrenner, 2018)             3.52          -
 NVAE (Vahdat & Kautz, 2020)                     -         2.91
 Very Deep VAE (Child, 2020)                   3.52        2.87
 PixelSNAIL (Chen et al., 2018)                3.52        2.85
 Image Transformer (Parmar et al., 2018)       3.48        2.90
 Sparse Transformer (Child et al., 2019)       3.44        2.80
 Routing Transformer (Roy et al., 2020)        3.43          -

 DDPM (Ho et al., 2020)                        3.77        3.70
 DDPM (cont flow) (Song et al., 2020b)           -         2.99
 Improved DDPM (ours)                          3.53        2.94

[Figure 8 plots: FID versus sampling steps (log axis, 10^2 to 10^3), two panels. Legend: Lsimple (σ_t^2 = β_t, mid-training); Lsimple (σ_t^2 = β_t, fully trained); Lsimple (σ_t^2 = β̃_t, mid-training); Lsimple (σ_t^2 = β̃_t, fully trained); Lsimple (DDIM, mid-training); Lsimple (DDIM, fully trained); Lhybrid (ours, mid-training); Lhybrid (ours, fully trained).]


Figure 8. FID versus number of sampling steps, for models trained on ImageNet 64 × 64 (top) and CIFAR-10 (bottom). All models were trained with 4000 diffusion steps.

4. Improving Sampling Speed

All of our models were trained with 4000 diffusion steps, and thus producing a single sample takes several minutes on a modern GPU. In this section, we explore how performance scales if we reduce the steps used during sampling, and find that our pre-trained Lhybrid models can produce high-quality samples with many fewer diffusion steps than they were trained with (without any fine-tuning). Reducing the steps in this way makes it possible to sample from our models in a number of seconds rather than minutes, and greatly improves the practical applicability of image DDPMs.

For a model trained with T diffusion steps, we would typically sample using the same sequence of t values (1, 2, ..., T) as used during training. However, it is also possible to sample using an arbitrary subsequence S of t values. Given the training noise schedule ᾱ_t, for a given sequence S we can obtain the sampling noise schedule ᾱ_{S_t}, which can be then used to obtain corresponding sampling variances

$$\beta_{S_t} = 1 - \frac{\bar\alpha_{S_t}}{\bar\alpha_{S_{t-1}}}, \qquad \tilde\beta_{S_t} = \frac{1 - \bar\alpha_{S_{t-1}}}{1 - \bar\alpha_{S_t}} \, \beta_{S_t} \tag{19}$$

Now, since Σθ(x_{S_t}, S_t) is parameterized as a range between β_{S_t} and β̃_{S_t}, it will automatically be rescaled for the shorter diffusion process. We can thus compute p(x_{S_{t−1}} | x_{S_t}) as N(µθ(x_{S_t}, S_t), Σθ(x_{S_t}, S_t)).

To reduce the number of sampling steps from T to K, we use K evenly spaced real numbers between 1 and T (inclusive), and then round each resulting number to the nearest integer. In Figure 8, we evaluate FIDs for an Lhybrid model and an Lsimple model that were trained with 4000 diffusion steps, using 25, 50, 100, 200, 400, 1000, and 4000 sampling steps. We do this for both a fully-trained checkpoint, and a checkpoint mid-way through training. For CIFAR-10 we used 200K and 500K training iterations, and for ImageNet-64 we used 500K and 1500K training iterations. We find that the Lsimple models with fixed sigmas (with both the larger σ_t^2 = β_t and the smaller σ_t^2 = β̃_t) suffer much more in sample quality when using a reduced number of sampling steps, whereas our Lhybrid model with learnt sigmas maintains high sample quality. With this model, 100 sampling steps is sufficient to achieve near-optimal FIDs for our fully trained models.

Parallel to our work, Song et al. (2020a) propose a fast sampling algorithm for DDPMs by producing a new implicit model that has the same marginal noise distributions, but deterministically maps noise to images. We include their algorithm, DDIM, in Figure 8, finding that DDIM produces better samples with fewer than 50 sampling steps, but worse samples when using 50 or more steps. Interestingly, DDIM performs worse at the start of training, but closes the gap to other samplers as training continues. We found that our striding technique drastically reduced performance of DDIM, so our DDIM results instead use the constant strid-
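
The evenly spaced subsequence and the rescaled variances of Equation 19 in Section 4 can be illustrated with a short Python sketch. The function names are hypothetical, and the linear training schedule (betas from 1e-4 to 0.02) is only a stand-in so the example runs on its own; it is not the schedule used for the models in this section. The sketch also assumes ᾱ = 1 before the first sampled step, and rounds halves upward.

```python
import math


def linear_alphas_cumprod(T, beta_start=1e-4, beta_end=0.02):
    """Illustrative training schedule only: returns [ᾱ_1, ..., ᾱ_T] for linear betas."""
    alphas_cumprod, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / max(T - 1, 1)
        running *= 1.0 - beta
        alphas_cumprod.append(running)
    return alphas_cumprod


def evenly_spaced_timesteps(T, K):
    """K evenly spaced reals between 1 and T (inclusive), rounded to the nearest integer."""
    if K == 1:
        return [T]
    return [int(math.floor(1 + i * (T - 1) / (K - 1) + 0.5)) for i in range(K)]


def strided_variances(alphas_cumprod, timesteps):
    """Apply Equation 19 along the subsequence S; alphas_cumprod[t - 1] holds ᾱ_t."""
    betas, betas_tilde = [], []
    prev = 1.0  # assumption of this sketch: ᾱ is 1 before the first sampled step
    for t in timesteps:
        cur = alphas_cumprod[t - 1]
        beta = 1.0 - cur / prev
        betas.append(beta)
        betas_tilde.append((1.0 - prev) / (1.0 - cur) * beta)
        prev = cur
    return betas, betas_tilde


# Example: shorten a 4000-step process to 100 sampling steps.
T, K = 4000, 100
steps = evenly_spaced_timesteps(T, K)
betas_S, betas_tilde_S = strided_variances(linear_alphas_cumprod(T), steps)
```

Because each β_{S_t} is derived from the ratio of consecutive ᾱ values along S, the product of (1 − β_{S_t}) over the subsequence recovers the ᾱ of the final timestep, so the shortened process ends at the same noise level as the full one.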

ing¹ from Song et al. (2020a), wherein the final timestep is T − T/K + 1 rather than T. The other samplers performed slightly better with our striding.

5. Comparison to GANs

While likelihood is a good proxy for mode-coverage, it is difficult to compare to GANs with this metric. Instead, we turn to precision and recall (Kynkäänniemi et al., 2019). Since it is common in the GAN literature to train class-conditional models, we do the same for this experiment. To make our models class-conditional, we inject class information through the same pathway as the timestep t. In particular, we add a class embedding v_i to the timestep embedding e_t, and pass this embedding to residual blocks throughout the model. We train using the Lhybrid objective and use 250 sampling steps. We train two models: a "small" model with 100M parameters for 1.7M training steps, and a larger model with 270 million parameters for 250K iterations. We train one BigGAN-deep model with 100M parameters across the generator and discriminator.

When computing metrics for this task, we generated 50K samples (rather than 10K) to be directly comparable to other works.² This is the only ImageNet 64 × 64 FID we report that was computed using 50K samples. For FID, the reference distribution features were computed over the full training set, following (Brock et al., 2018).

Figure 9 shows our samples from the larger model, and Table 4 summarizes our results. We find that BigGAN-deep outperforms our smaller model in terms of FID, but struggles in terms of recall. This suggests that diffusion models are better at covering the modes of the distribution than comparable GANs.

Table 4. Sample quality comparison on class-conditional ImageNet 64 × 64. Precision and recall (Kynkäänniemi et al., 2019) are measured using Inception-V3 features and K = 5. We trained BigGAN-deep for 125K iterations, and did not use truncation for sampling to maximize recall for the GAN.

  Model                                   FID    Prec.    Recall
  BigGAN-deep (Brock et al., 2018)       4.06     0.86     0.59
  Improved Diffusion (small)             6.92     0.77     0.72
  Improved Diffusion (large)             2.92     0.82     0.71

Figure 9. Class-conditional ImageNet 64 × 64 samples generated using 250 sampling steps from Lhybrid model (FID 2.92). The classes are 9: ostrich, 11: goldfinch, 130: flamingo, 141: redshank, 154: pekinese, 157: papillon, 97: drake and 28: spotted salamander. We see that there is a high diversity in each class, suggesting good coverage of the target distribution.

6. Scaling Model Size

In the previous sections, we showed algorithmic changes that improved log-likelihood and FID without changing the amount of training compute. However, a trend in modern machine learning is that larger models and more training time tend to improve model performance (Kaplan et al., 2020; Chen et al., 2020a; Brown et al., 2020). Given this observation, we investigate how FID and NLL scale as a function of training compute. Our results, while preliminary, suggest that DDPMs improve in a predictable way as training compute increases.

To measure how performance scales with training compute, we train four different models on ImageNet 64 × 64 with the Lhybrid objective described in Section 3.1. To change model capacity, we apply a depth multiplier across all layers, such that the first layer has either 64, 96, 128, or 192 channels. Note that our previous experiments used 128 channels in the first layer. Since the depth of each layer affects the scale of the initial weights, we scale the Adam (Kingma & Ba, 2014) learning rate for each model by 1/√(channel multiplier), such that the 128 channel model has a learning rate of 0.0001 (as in our other experiments).

Figure 10 shows how FID and NLL improve relative to theoretical training compute.³ The FID curve looks approximately linear on a log-log plot, suggesting that FID scales according to a power law (plotted as the black dashed line). The NLL curve does not fit a power law as cleanly, suggesting that validation NLL scales in a less-favorable manner

¹ We additionally tried the quadratic stride from Song et al. (2020a), but found that it hurt sample quality when combined with our cosine schedule.
² We found that using more samples led to a decrease in estimated FID of roughly 2 points.
³ The x-axis assumes full hardware utilization.
[Figure 10 plots: FID (top) and NLL in bits/dim (bottom) versus compute (FLOPs, 10^17 to 10^20) for the 64 ch (30M params), 96 ch (68M params), 128 ch (120M params), and 192 ch (270M params) models. Trend lines in the legend: 4.00 + (2.500e-25*C)^-0.22 for FID and 3.40 + (3.000e-15*C)^-0.17 for NLL.]

Figure 10. FID and validation NLL throughout training on ImageNet 64 × 64 for different model sizes. The constant for the FID trend line was approximated using the FID of in-distribution data. For the NLL trend line, the constant was approximated by rounding down the current state-of-the-art NLL (Roy et al., 2020) on this dataset.

than FID. This could be caused by a variety of factors, such as 1) an unexpectedly high irreducible loss (Henighan et al., 2020) for this type of diffusion model, or 2) the model overfitting to the training distribution. We also note that these models do not achieve optimal log-likelihoods in general because they were trained with our Lhybrid objective and not directly with Lvlb to keep both good log-likelihoods and sample quality.

7. Related Work

Chen et al. (2020b) and Kong et al. (2020) are two recent works that use DDPMs to produce high fidelity audio conditioned on mel-spectrograms. Concurrent to our work, Chen et al. (2020b) use a combination of improved schedule and L1 loss to allow sampling with fewer steps with very little reduction in sample quality. However, compared to our unconditional image generation task, their generative task has a strong input conditioning signal provided by the mel-spectrograms, and we hypothesize that this makes it easier to sample with fewer diffusion steps.

Jolicoeur-Martineau et al. (2020) explored score matching in the image domain, and constructed an adversarial training objective to produce better x0 predictions. However, they found that choosing a better network architecture removed the need for this adversarial objective, suggesting that the adversarial objective is not necessary for powerful generative modeling.

Parallel to our work, Song et al. (2020a) and Song et al. (2020b) propose fast sampling algorithms for models trained with the DDPM objective by leveraging different sampling processes. Song et al. (2020a) do this by deriving an implicit generative model that has the same marginal noise distributions as DDPMs while deterministically mapping noise to images. Song et al. (2020b) model the diffusion process as the discretization of a continuous SDE, and observe that there exists an ODE that corresponds to sampling from the reverse SDE. By varying the numerical precision of an ODE solver, they can sample with fewer function evaluations. However, they note that this technique obtains worse samples than ancestral sampling when used directly, and only achieves better FID when combined with Langevin corrector steps. This in turn requires hand-tuning of a signal-to-noise ratio for the Langevin steps. Our method allows fast sampling directly from the ancestral process, which removes the need for extra hyperparameters.

Also in parallel, Gao et al. (2020) develop a diffusion model with reverse diffusion steps modeled by an energy-based model. A potential implication of this approach is that fewer diffusion steps should be needed to achieve good samples.

8. Conclusion

We have shown that, with a few modifications, DDPMs can sample much faster and achieve better log-likelihoods with little impact on sample quality. The likelihood is improved by learning Σθ using our parameterization and Lhybrid objective. This brings the likelihood of these models much closer to other likelihood-based models. We surprisingly discover that this change also allows sampling from these models with many fewer steps.

We have also found that DDPMs can match the sample quality of GANs while achieving much better mode coverage as measured by recall. Furthermore, we have investigated how DDPMs scale with the amount of available training compute, and found that more training compute trivially leads to better sample quality and log-likelihood.

The combination of these results makes DDPMs an attractive choice for generative modeling, since they combine good log-likelihoods, high-quality samples, and reasonably fast sampling with a well-grounded, stationary training objective that scales easily with training compute. These results indicate that DDPMs are a promising direction for future research.
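
As a small worked illustration of the scaling trend lines from Section 6, the two formulas printed in the Figure 10 legend can be evaluated directly. This sketch assumes that C in the legend is total training compute in FLOPs and that the formulas are exactly as printed; the function names are hypothetical, and the helper only checks the log-log slope rather than refitting any data.

```python
import math


def fid_trend(compute_flops):
    """FID trend line as printed in the Figure 10 legend (C assumed to be FLOPs)."""
    return 4.00 + (2.500e-25 * compute_flops) ** -0.22


def nll_trend(compute_flops):
    """NLL trend line (bits/dim) as printed in the Figure 10 legend."""
    return 3.40 + (3.000e-15 * compute_flops) ** -0.17


def log_log_slope(trend, constant, c_low, c_high):
    """Slope of log(trend - constant) against log(compute) between two compute values."""
    rise = math.log(trend(c_high) - constant) - math.log(trend(c_low) - constant)
    return rise / (math.log(c_high) - math.log(c_low))


# After subtracting the constant, each trend is a straight line on a log-log plot.
fid_slope = log_log_slope(fid_trend, 4.00, 1e18, 1e19)
nll_slope = log_log_slope(nll_trend, 3.40, 1e18, 1e19)
```

The slopes recovered this way are the exponents in the legend, which is what "approximately linear on a log-log plot" means for the excess over the constant term.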

References

Brock, A., Donahue, J., and Simonyan, K. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners, 2020.

Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Dhariwal, P., Luan, D., and Sutskever, I. Generative pretraining from pixels, 2020a. URL https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf.

Chen, N., Zhang, Y., Zen, H., Weiss, R. J., Norouzi, M., and Chan, W. Wavegrad: Estimating gradients for waveform generation, 2020b.

Chen, X., Mishra, N., Rohaninejad, M., and Abbeel, P. Pixelsnail: An improved autoregressive generative model. In International Conference on Machine Learning, pp. 864–872. PMLR, 2018.

Child, R. Very deep vaes generalize autoregressive models and can outperform them on images. arXiv preprint arXiv:2011.10650, 2020.

Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers, 2019.

Gao, R., Song, Y., Poole, B., Wu, Y. N., and Kingma, D. P. Learning energy-based models by diffusion recovery likelihood, 2020.

Ho, J., Chen, X., Srinivas, A., Duan, Y., and Abbeel, P. Flow++: Improving flow-based generative models with variational dequantization and architecture design. arXiv preprint arXiv:1902.00275, 2019.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models, 2020.

Hyvärinen, A. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709, 2005.

Jolicoeur-Martineau, A., Piché-Taillefer, R., des Combes, R. T., and Mitliagkas, I. Adversarial score matching and improved sampling for image generation, 2020.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models, 2020.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2014.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in neural information processing systems, pp. 10215–10224, 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes, 2013.

Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B. Diffwave: A versatile diffusion model for audio synthesis, 2020.

Krizhevsky, A. Learning multiple layers of features from tiny images, 2009. URL http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.

Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., and Aila, T. Improved precision and recall metric for assess-
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual              ing generative models, 2019.
  learning for image recognition, 2015.
                                                                 McCandlish, S., Kaplan, J., Amodei, D., and Team, O. D.
Henighan, T., Kaplan, J., Katz, M., Chen, M., Hesse, C.,          An empirical model of large-batch training, 2018.
  Jackson, J., Jun, H., Brown, T. B., Dhariwal, P., Gray, S.,    Menick, J. and Kalchbrenner, N. Generating high fidelity im-
  Hallacy, C., Mann, B., Radford, A., Ramesh, A., Ryder,          ages with subscale pixel networks and multidimensional
  N., Ziegler, D. M., Schulman, J., Amodei, D., and Mc-           upscaling, 2018.
  Candlish, S. Scaling laws for autoregressive generative
  modeling, 2020.                                                Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer,
                                                                   N., Ku, A., and Tran, D. Image transformer. arXiv
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and        preprint arXiv:1802.05751, 2018.
  Hochreiter, S. Gans trained by a two time-scale update
  rule converge to a local nash equilibrium. Advances in         Ravuri, S. and Vinyals, O. Classification accuracy score
  Neural Information Processing Systems 30 (NIPS 2017),            for conditional generative models. arXiv preprint
  2017.                                                            arXiv:1905.10887, 2019.

Razavi, A., van den Oord, A., and Vinyals, O. Generating
  diverse high-fidelity images with vq-vae-2, 2019.
Roy, A., Saffar, M., Vaswani, A., and Grangier, D. Efficient
  content-based sparse attention with routing transformers,
  2020.
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V.,
  Radford, A., and Chen, X. Improved techniques for
  training gans, 2016.
Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. Pix-
  elcnn++: Improving the pixelcnn with discretized logistic
  mixture likelihood and other modifications, 2017.

Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and
  Ganguli, S. Deep unsupervised learning using nonequi-
  librium thermodynamics, 2015.
Song, J., Meng, C., and Ermon, S. Denoising diffusion
  implicit models, 2020a.

Song, Y. and Ermon, S. Generative modeling by estimating
  gradients of the data distribution. In Advances in Neural
  Information Processing Systems, pp. 11918–11930, 2019.
Song, Y. and Ermon, S. Improved techniques for train-
  ing score-based generative models. arXiv preprint
  arXiv:2006.09011, 2020.
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Er-
  mon, S., and Poole, B. Score-based generative modeling
  through stochastic differential equations, 2020b.

Vahdat, A. and Kautz, J. Nvae: A deep hierarchical vari-
  ational autoencoder. arXiv preprint arXiv:2007.03898,
  2020.
van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K.
  Pixel recurrent neural networks, 2016a.

van den Oord, A., Kalchbrenner, N., Vinyals, O., Espeholt,
  L., Graves, A., and Kavukcuoglu, K. Conditional image
  generation with pixelcnn decoders, 2016b. URL http:
  //image-net.org/small/download.php.
van den Oord, A., Kalchbrenner, N., Vinyals, O., Espeholt,
  L., Graves, A., and Kavukcuoglu, K. Conditional image
  generation with pixelcnn decoders, 2016c.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
  L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention
  is all you need, 2017.

Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., and
  Xiao, J. Lsun: Construction of a large-scale image dataset
  using deep learning with humans in the loop, 2015.

A. Hyperparameters

For all of our experiments, we use a UNet model architecture [4] similar to that used by
Ho et al. (2020). We changed the attention layers to use multi-head attention (Vaswani
et al., 2017), and opted to use four attention heads rather than one (while keeping the
same total number of channels). We employ attention not only at the 16x16 resolution, but
also at the 8x8 resolution. Additionally, we changed the way the model conditions on t.
In particular, instead of computing a conditioning vector v and injecting it into hidden
state h as GroupNorm(h + v), we compute conditioning vectors w and b and inject them into
the hidden state as GroupNorm(h)(w + 1) + b. We found in preliminary experiments on
ImageNet 64 × 64 that these modifications slightly improved FID.

[4] In initial experiments, we found that a ResNet-style architecture with no
downsampling achieved better log-likelihoods but worse FIDs than the UNet architecture.

For ImageNet 64 × 64 the architecture we use is described as follows. The downsampling
stack performs four steps of downsampling, each with three residual blocks (He et al.,
2015). The upsampling stack is set up as a mirror image of the downsampling stack. From
highest to lowest resolution, the UNet stages use [C, 2C, 3C, 4C] channels, respectively.
In our ImageNet 64 × 64 ablations, we set C = 128, but we experiment with scaling C in a
later section. We estimate that, with C = 128, our model comprises 120M parameters and
requires roughly 39 billion FLOPs in the forward pass.

For our CIFAR-10 experiments, we use a smaller model with three resblocks per
downsampling stage and layer widths [C, 2C, 2C, 2C] with C = 128. We swept over dropout
values {0.1, 0.2, 0.3} and found that 0.1 worked best for the linear schedule while 0.3
worked best for our cosine schedule. We expand upon this in Section F.

We use Adam (Kingma & Ba, 2014) for all of our experiments. For most experiments, we use
a batch size of 128, a learning rate of 10^-4, and an exponential moving average (EMA)
over model parameters with a rate of 0.9999. For our scaling experiments, we vary the
learning rate to accommodate different model sizes. For our larger class-conditional
ImageNet 64 × 64 experiments, we scaled up the batch size to 2048 for faster training on
more GPUs.

When using the linear noise schedule from Ho et al. (2020), we linearly interpolate from
β1 = 0.0001/4 to β4000 = 0.02/4 to preserve the shape of ᾱt for the T = 4000 schedule.

When computing FID we produce 50K samples from our models, except for unconditional
ImageNet 64 × 64 where we produce 10K samples. Using only 10K samples biases the FID to
be higher, but requires much less compute for sampling and makes large ablations
practical. Since we mainly use FID for relative comparisons on unconditional ImageNet
64 × 64, this bias is acceptable. For computing the reference distribution statistics we
follow prior work (Ho et al., 2020; Brock et al., 2018) and use the full training set for
CIFAR-10 and ImageNet, and 50K training samples for LSUN. Note that unconditional
ImageNet 64 × 64 models are trained and evaluated using the official ImageNet-64 dataset
(van den Oord et al., 2016a), whereas for class conditional ImageNet 64 × 64 and
256 × 256 we center crop and area downsample images (Brock et al., 2018).

B. Fast Sampling on LSUN 256 × 256

[Plot: FID (y-axis, 0 to 20) vs. sampling steps (x-axis, log scale). Curves: Lsimple
(σt² = βt), Lsimple (σt² = β̃t), Lsimple (DDIM), and Lhybrid, each trained with
(batch=64, lr=2e-5) and with (batch=128, lr=1e-4).]

Figure 11. FID vs. number of sampling steps from an LSUN 256 × 256 bedroom model.

To test the effectiveness of our Lhybrid models on a high-resolution domain, we trained
both Lhybrid and Lsimple models on the LSUN bedroom (Yu et al., 2015) dataset. We train
two models: one with batch size 64 and learning rate 2 × 10^-5 as in Ho et al. (2020),
and another with a larger batch size 128 and learning rate 10^-4. All models were trained
with 153.6M examples, which is 2.4M training iterations with batch size 64.

Our results are displayed in Figure 11. We find that DDIM outperforms our Lhybrid model
when using fewer than 50 diffusion steps, while our Lhybrid model outperforms DDIM with
more than 50 diffusion steps. Interestingly, we note that DDIM benefits from a smaller
learning rate and batch size, whereas our method is able to take advantage of a larger
learning rate and batch size.
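
The change to timestep conditioning described in Appendix A, from GroupNorm(h + v) to GroupNorm(h)(w + 1) + b, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the source does not say how v, w, and b are computed from t, so they are passed in as plain arrays here, the learned affine parameters of GroupNorm are omitted, and the group count and tensor shapes are hypothetical.

```python
import numpy as np

def group_norm(h, num_groups, eps=1e-5):
    """GroupNorm without learned affine parameters; h has shape (N, C, H, W)."""
    n, c, height, width = h.shape
    g = h.reshape(n, num_groups, c // num_groups, height, width)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, height, width)

def condition_additive(h, v, num_groups):
    """Baseline conditioning: GroupNorm(h + v), with v of shape (N, C)."""
    return group_norm(h + v[:, :, None, None], num_groups)

def condition_scale_shift(h, w, b, num_groups):
    """Modified conditioning: GroupNorm(h) * (w + 1) + b, with w, b of shape (N, C)."""
    normed = group_norm(h, num_groups)
    return normed * (w[:, :, None, None] + 1.0) + b[:, :, None, None]

# Hypothetical shapes: batch of 2, 8 channels, 4x4 feature maps, 4 groups.
rng = np.random.default_rng(0)
h = rng.normal(size=(2, 8, 4, 4))
w = np.zeros((2, 8))
b = np.zeros((2, 8))
out = condition_scale_shift(h, w, b, num_groups=4)
```

With w = b = 0 the modified form reduces to plain GroupNorm(h), which is why the scale is written as (w + 1): a conditioning vector near zero leaves the normalized activations unchanged.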

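The rescaling of the linear noise schedule in Appendix A (β1 = 0.0001/4 to β4000 = 0.02/4 for T = 4000) can be checked numerically. The sketch below assumes the reference schedule is the 1000-step linear schedule of Ho et al. (2020) running from 0.0001 to 0.02, and that ᾱt is the cumulative product of (1 − βt); the function names are illustrative, not from the released code.

```python
import numpy as np

def linear_betas(num_steps, base_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear schedule whose endpoints are scaled by base_steps / num_steps."""
    scale = base_steps / num_steps
    return np.linspace(scale * beta_start, scale * beta_end, num_steps)

def alpha_bar(betas):
    """Cumulative product of (1 - beta_t), i.e. the signal level after t steps."""
    return np.cumprod(1.0 - betas)

ab_1k = alpha_bar(linear_betas(1000))
ab_4k = alpha_bar(linear_betas(4000))

# Compare the two curves at matching fractions of the process:
# step 4k of the 4000-step schedule against step k of the 1000-step schedule.
max_gap = float(np.max(np.abs(ab_4k[3::4] - ab_1k)))
print(f"largest difference in alpha_bar at matched fractions: {max_gap:.2e}")
```

Dividing both endpoints by 4 keeps the total amount of noise added over the whole process roughly constant, so the ᾱt curve as a function of t/T stays close to the original.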
C. Sample Quality on ImageNet 256 × 256
We trained two models on class conditional ImageNet
256 × 256. The first is a usual diffusion model that directly
models the 256 × 256 images. The second model reduces
compute by chaining a pretrained 64 × 64 model p(x64 |y)
with another upsampling diffusion model p(x256 |x64 , y) to
upsample images to 256 × 256. For the upsampling model,
the downsampled image x64 is passed as extra conditioning
input to the UNet. This is similar to VQ-VAE-2 (Razavi
et al., 2019), which uses two stages of priors at different
latent resolutions to more efficiently learn global and local
features. The linear schedule worked better for 256 × 256
images, so we used that for these results. Table 5 summa-
rizes our results. For VQ-VAE-2, we use the FIDs reported
in (Ravuri & Vinyals, 2019). Diffusion models still obtain
the best FIDs for a likelihood-based model, and close the
gap to GANs considerably.

   MODEL                                            FID
   VQ-VAE-2 (Razavi et al., 2019, two-stage)        38.1
   Improved Diffusion (ours, single-stage)          31.5
   Improved Diffusion (ours, two-stage)             12.3
   BigGAN (Brock et al., 2018)                       7.7
   BigGAN-deep (Brock et al., 2018)                  7.0

Table 5. Sample quality comparison on class conditional Ima-
geNet 256 × 256. BigGAN FIDs are reported for the truncation
that results in the best FID.
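
The two-stage pipeline in this section chains a draw from p(x64 | y) with a draw from p(x256 | x64, y). The sketch below shows only that chaining. The two samplers are hypothetical callables, and the toy stand-ins (uniform noise and nearest-neighbour upsampling) are placeholders so the code runs; they are not diffusion models and say nothing about how the UNet consumes x64.

```python
import numpy as np

def sample_two_stage(sample_base, sample_upsampler, label, rng):
    """Draw x64 ~ p(x64 | y), then x256 ~ p(x256 | x64, y)."""
    x64 = sample_base(label, rng)
    x256 = sample_upsampler(x64, label, rng)
    return x64, x256

def toy_base(label, rng):
    """Placeholder for the pretrained 64x64 sampler (hypothetical)."""
    return rng.uniform(-1.0, 1.0, size=(3, 64, 64))

def toy_upsampler(x64, label, rng):
    """Placeholder for the upsampling sampler: nearest-neighbour 4x upsampling."""
    return np.repeat(np.repeat(x64, 4, axis=1), 4, axis=2)

rng = np.random.default_rng(0)
x64, x256 = sample_two_stage(toy_base, toy_upsampler, label=207, rng=rng)
```

Because the second stage only needs a 64 × 64 image and a label as input, the low-resolution model can be trained and reused independently of the upsampler.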




Figure 12. Random samples from two-stage class conditional ImageNet 256 × 256 model.
On top are random samples from the 64 × 64 model (FID 2.92), whereas on bottom are the
results after upsampling them to 256 × 256 (FID 12.3). Each model uses 250 sampling
steps.

D. Combining Lhybrid and Lvlb Models

[Plot: Lt(θhybrid)/Lt(θvlb) (y-axis, 0.900 to 1.100) vs. diffusion step t (x-axis, 0 to
4000).]

Figure 13. The ratio between VLB terms for each diffusion step of θhybrid and θvlb.
Values less than 1.0 indicate that θhybrid is "better" than θvlb for that timestep of the
diffusion process.

Figure 14. Samples from θvlb and θhybrid, as well as an ensemble produced by using θvlb
for the first and last 100 diffusion steps. For these samples, the seed was fixed,
allowing a direct comparison between models.

To understand the trade-off between Lhybrid and Lvlb, we show in Figure 13 that the model
resulting from Lvlb (referred to as θvlb) is better at the start and end of the diffusion
process, while the model resulting from Lhybrid (referred to as θhybrid) is better
throughout the middle of the diffusion process. This suggests that θvlb is focusing more
on imperceptible details, hence the lower sample quality.

Given the above observation, we performed an experiment on ImageNet 64 × 64 to combine
the two models by constructing an ensemble that uses θhybrid for t ∈ [100, T − 100) and
θvlb elsewhere. We found that this model achieved an FID of 19.9 and an NLL of 3.52
bits/dim. This is only slightly worse than θhybrid in terms of FID, while being better
than both models in terms of NLL.

E. Log-likelihood with Fewer Diffusion Steps

[Plot, two panels: NLL in bits/dim (y-axis) vs. evaluation steps (x-axis, log scale).
Legend entries distinguish Lsimple variants and Lhybrid (ours), each shown mid-training
and fully trained.]

Figure 15. NLL versus number of evaluation steps, for models trained on ImageNet 64 × 64
(top) and CIFAR-10 (bottom). All models were trained with 4000 diffusion steps.

Figure 15 plots negative log-likelihood as a function of number of sampling steps for
both ImageNet 64 × 64 and CIFAR-10. In initial experiments, we found that although
constant striding did not significantly affect FID, it drastically reduced
log-likelihood. To address this, we use a strided subset of timesteps as for FID, but we
also include every t from 1 to T/K. This requires T/K extra evaluation steps, but greatly
improves log-likelihood compared to the uniformly strided schedule. We did not attempt to
calculate NLL using DDIM, since Song et al. (2020a) do not present NLL results or a
simple way of estimating likelihood under DDIM.
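
The ensemble from Appendix D, which uses θhybrid for t ∈ [100, T − 100) and θvlb elsewhere, amounts to dispatching on the timestep. The sketch below assumes t is 0-indexed in [0, T), so that θvlb handles exactly the first and last 100 steps as in the Figure 14 caption; the two models are hypothetical callables taking (x, t).

```python
def make_ensemble(model_hybrid, model_vlb, num_steps, edge=100):
    """Route each diffusion step to one of two models based on t.

    Assumes t is 0-indexed in [0, num_steps). The hybrid model handles
    t in [edge, num_steps - edge); the VLB model handles the rest.
    """
    def ensemble(x, t):
        use_hybrid = edge <= t < num_steps - edge
        model = model_hybrid if use_hybrid else model_vlb
        return model(x, t)
    return ensemble

# Toy stand-ins that only report which model was called.
ensemble = make_ensemble(lambda x, t: "hybrid", lambda x, t: "vlb", num_steps=4000)
choices = [ensemble(None, t) for t in range(4000)]
print(choices.count("vlb"), choices.count("hybrid"))
```

With T = 4000 this routes 200 steps to θvlb and the remaining 3800 to θhybrid.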

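The evaluation schedule from Appendix E, a strided subset plus every t from 1 to T/K, can be sketched as follows. Two assumptions are made here: the strided subset is taken to be K evenly spaced values between 1 and T rounded to the nearest integer, and T/K is rounded down when it is not an integer. The function name is illustrative.

```python
import numpy as np

def nll_timesteps(num_steps, k):
    """Strided timesteps plus every t in 1..num_steps // k, sorted and unique."""
    strided = np.rint(np.linspace(1, num_steps, k)).astype(int)
    dense = np.arange(1, num_steps // k + 1)
    return np.union1d(strided, dense)

steps = nll_timesteps(4000, 100)
print(len(steps), steps[:5], steps[-3:])
```

The dense prefix covers the low-noise end of the process, where a uniform stride would otherwise jump straight from t = 1 to roughly t = T/K.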
F. Overfitting on CIFAR-10

[Plot, two panels: FID (top) and NLL (bottom) vs. training iters (thousands, 100 to 500).
Top panel curves: linear, cosine. Bottom panel curves: linear (test), linear (train),
cosine (test), cosine (train).]

Figure 16. FID (top) and NLL (bottom) over the course of training for two CIFAR-10
models, both with dropout 0.1. The model trained with the linear schedule learns more
slowly, but does not overfit as quickly. When too much overfitting occurs, we observed
overfitting artifacts similar to those from Salimans et al. (2017), which is reflected by
increasing FID.

On CIFAR-10, we noticed that all models overfit, but tended to reach similar optimal FID
at some point during training. Holding dropout constant, we found that models trained
with our cosine schedule tended to reach optimal performance (and then overfit) more
quickly than those trained with the linear schedule (Figure 16). In our experiments, we
corrected for this difference by using more dropout for our cosine models than the linear
models. We suspect that the overfitting from the cosine schedule is either due to 1) less
noise in the cosine schedule providing less regularization, or 2) the cosine schedule
making optimization, and thus overfitting, easier.

G. Early stopping for FID

[Plot: FID (y-axis, 5 to 10) vs. training iters (thousands, 200 to 1400). Legend entries
are pairs of values, 0.0, 0.1, or 0.3 combined with 0.99, 0.999, 0.9999, 0.99995, or
0.99999, matching the dropout and EMA rates swept, plus an "original best" entry.]

Figure 17. A sweep of dropout and EMA hyperparameters on class conditional ImageNet-64.

As on CIFAR-10, we surprisingly observed overfitting on class-conditional ImageNet
64 × 64, despite it being a much larger and more diverse dataset. The main observable
result of this overfitting was that FID started becoming worse over the course of
training. We initially tried a sweep (Figure 17) over the EMA hyperparameter to make sure
it was well tuned, and found that 0.9999 and 0.99995 worked best. We then tried runs with
dropout 0.1 and 0.3, and found that models with a small amount of dropout improved the
best attainable FID but took longer to get to the same performance and still eventually
overfit. We concluded that the best way to train, given what we know, is to early stop
and instead increase model size if we want to use additional training compute.

H. Samples with Varying Steps and Objectives

Figures 18 through 23 show unconditional ImageNet 64 × 64 samples as we reduce the number
of sampling steps for an Lhybrid model with 4K diffusion steps trained for 1.5M training
iterations.

Figures 24 through 29 show unconditional CIFAR-10 samples as we reduce the number of
sampling steps for an Lhybrid model with 4K diffusion steps trained for 500K training
iterations.

Figures 30 and 31 highlight the difference in sample quality between models trained with
Lhybrid and Lvlb.




Figure 18. 50 sampling steps on unconditional ImageNet 64 × 64

Figure 19. 100 sampling steps on unconditional ImageNet 64 × 64

Figure 20. 200 sampling steps on unconditional ImageNet 64 × 64

Figure 21. 400 sampling steps on unconditional ImageNet 64 × 64

Figure 22. 1000 sampling steps on unconditional ImageNet 64 × 64

Figure 23. 4K sampling steps on unconditional ImageNet 64 × 64




Figure 24. 50 sampling steps on unconditional CIFAR-10

Figure 25. 100 sampling steps on unconditional CIFAR-10

Figure 26. 200 sampling steps on unconditional CIFAR-10

Figure 27. 400 sampling steps on unconditional CIFAR-10

Figure 28. 1000 sampling steps on unconditional CIFAR-10

Figure 29. 4000 sampling steps on unconditional CIFAR-10




Figure 30. Unconditional ImageNet 64 × 64 samples generated from Lhybrid (top) and Lvlb
(bottom) models using the exact same random noise. Both models were trained for 1.5M
iterations.

Figure 31. Unconditional CIFAR-10 samples generated from Lhybrid (top) and Lvlb (bottom)
models using the exact same random noise. Both models were trained for 500K iterations.