                                                                   Scalable Diffusion Models with Transformers

                                                                         William Peebles*                     Saining Xie
                                                                          UC Berkeley                     New York University




arXiv:2212.09748v2 [cs.CV] 2 Mar 2023
                                        Figure 1. Diffusion models with transformer backbones achieve state-of-the-art image quality. We show selected samples from two
                                        of our class-conditional DiT-XL/2 models trained on ImageNet at 512×512 and 256×256 resolution, respectively.

Abstract

We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops, through increased transformer depth/width or increased number of input tokens, consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512×512 and 256×256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.

* Work done during an internship at Meta AI, FAIR Team.

Code and project page available here.

1. Introduction

Machine learning is experiencing a renaissance powered by transformers. Over the past five years, neural architectures for natural language processing [8, 42], vision [10] and several other domains have largely been subsumed by transformers [60]. Many classes of image-level generative models remain holdouts to the trend, though: while transformers see widespread use in autoregressive models [3,6,43,47], they have seen less adoption in other generative modeling frameworks. For example, diffusion models have been at the forefront of recent advances in image-level generative models [9,46]; yet, they all adopt a convolutional U-Net architecture as the de-facto choice of backbone.


[Figure 2 plots. Bubble-size legend: diameter key at 5, 20, 80 and 320 Gflops.]




Figure 2. ImageNet generation with Diffusion Transformers (DiTs). Bubble area indicates the flops of the diffusion model. Left:
FID-50K (lower is better) of our DiT models at 400K training iterations. Performance steadily improves in FID as model flops increase.
Right: Our best model, DiT-XL/2, is compute-efficient and outperforms all prior U-Net-based diffusion models, like ADM and LDM.


The seminal work of Ho et al. [19] first introduced the U-Net backbone for diffusion models. Having initially seen success within pixel-level autoregressive models and conditional GANs [23], the U-Net was inherited from PixelCNN++ [52, 58] with a few changes. The model is convolutional, comprised primarily of ResNet [15] blocks. In contrast to the standard U-Net [49], additional spatial self-attention blocks, which are essential components in transformers, are interspersed at lower resolutions. Dhariwal and Nichol [9] ablated several architecture choices for the U-Net, such as the use of adaptive normalization layers [40] to inject conditional information and channel counts for convolutional layers. However, the high-level design of the U-Net from Ho et al. has largely remained intact.

With this work, we aim to demystify the significance of architectural choices in diffusion models and offer empirical baselines for future generative modeling research. We show that the U-Net inductive bias is not crucial to the performance of diffusion models, and that U-Nets can be readily replaced with standard designs such as transformers. As a result, diffusion models are well-poised to benefit from the recent trend of architecture unification, e.g., by inheriting best practices and training recipes from other domains, as well as retaining favorable properties like scalability, robustness and efficiency. A standardized architecture would also open up new possibilities for cross-domain research.

In this paper, we focus on a new class of diffusion models based on transformers. We call them Diffusion Transformers, or DiTs for short. DiTs adhere to the best practices of Vision Transformers (ViTs) [10], which have been shown to scale more effectively for visual recognition than traditional convolutional networks (e.g., ResNet [15]).

More specifically, we study the scaling behavior of transformers with respect to network complexity vs. sample quality. We show that by constructing and benchmarking the DiT design space under the Latent Diffusion Models (LDMs) [48] framework, where diffusion models are trained within a VAE’s latent space, we can successfully replace the U-Net backbone with a transformer. We further show that DiTs are scalable architectures for diffusion models: there is a strong correlation between network complexity (measured by Gflops) and sample quality (measured by FID). By simply scaling-up DiT and training an LDM with a high-capacity backbone (118.6 Gflops), we are able to achieve a state-of-the-art result of 2.27 FID on the class-conditional 256 × 256 ImageNet generation benchmark.

2. Related Work

Transformers. Transformers [60] have replaced domain-specific architectures across language, vision [10], reinforcement learning [5, 25] and meta-learning [39]. They have shown remarkable scaling properties under increasing model size, training compute and data in the language domain [26], as generic autoregressive models [17] and as ViTs [63]. Beyond language, transformers have been trained to autoregressively predict pixels [6, 7, 38]. They have also been trained on discrete codebooks [59] as both autoregressive models [11,47] and masked generative models [4, 14]; the former has shown excellent scaling behavior up to 20B parameters [62]. Finally, transformers have been explored in DDPMs to synthesize non-spatial data; e.g., to generate CLIP image embeddings in DALL·E 2 [41, 46]. In this paper, we study the scaling properties of transformers when used as the backbone of diffusion models of images.


[Figure 3 diagram, four panels.
 Latent Diffusion Transformer: Noised Latent (32 x 32 x 4) -> Patchify -> N x DiT Block -> Layer Norm -> Linear and Reshape -> Noise (32 x 32 x 4) and Σ (32 x 32 x 4); Timestep t and Label y -> Embed -> conditioning.
 DiT Block with adaLN-Zero: Layer Norm -> Scale, Shift (γ1, β1) -> Multi-Head Self-Attention -> Scale (α1) -> residual add; Layer Norm -> Scale, Shift (γ2, β2) -> Pointwise Feedforward -> Scale (α2) -> residual add; γ, β, α come from an MLP on the conditioning.
 DiT Block with Cross-Attention: Layer Norm -> Multi-Head Self-Attention -> residual add; Layer Norm -> Multi-Head Cross-Attention (over the conditioning) -> residual add; Layer Norm -> Pointwise Feedforward -> residual add.
 DiT Block with In-Context Conditioning: Concatenate conditioning with input tokens on the sequence dimension; Layer Norm -> Multi-Head Self-Attention -> residual add; Layer Norm -> Pointwise Feedforward -> residual add.]

Figure 3. The Diffusion Transformer (DiT) architecture. Left: We train conditional latent DiT models. The input latent is decomposed
into patches and processed by several DiT blocks. Right: Details of our DiT blocks. We experiment with variants of standard transformer
blocks that incorporate conditioning via adaptive layer norm, cross-attention and extra input tokens. Adaptive layer norm works best.
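
To make the adaLN-Zero panel of Figure 3 more concrete, here is a minimal sketch of its data flow for a single token vector. It is an illustration under stated assumptions, not the authors' implementation: the sublayers are hypothetical stand-ins (real self-attention mixes information across all tokens), the scale is applied as a plain multiplication by γ, and the six conditioning vectors are passed in directly rather than regressed by an MLP from the embeddings of t and c.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def gated_sublayer(x, sublayer, gamma, beta, alpha):
    """One residual branch: x + alpha * sublayer(gamma * LayerNorm(x) + beta)."""
    h = [g * v + b for v, g, b in zip(layer_norm(x), gamma, beta)]  # Scale, Shift
    h = sublayer(h)  # stand-in for attention or feedforward
    return [xi + a * hi for xi, a, hi in zip(x, alpha, h)]  # Scale, then residual add


def adaln_zero_block(x, attn, mlp, cond):
    """cond holds six vectors: (gamma1, beta1, alpha1, gamma2, beta2, alpha2)."""
    g1, b1, a1, g2, b2, a2 = cond
    x = gated_sublayer(x, attn, g1, b1, a1)
    return gated_sublayer(x, mlp, g2, b2, a2)


# Hypothetical stand-ins for the real sublayers, acting on one 4-dimensional token.
attn = lambda h: [2.0 * v for v in h]
mlp = lambda h: [v + 1.0 for v in h]

x = [0.5, -1.0, 2.0, 0.0]
ones, zeros = [1.0] * 4, [0.0] * 4

# With every alpha at zero, as at initialization, the block returns its input unchanged.
identity_out = adaln_zero_block(x, attn, mlp, (ones, zeros, zeros, ones, zeros, zeros))
# With non-zero alpha, the sublayers start to contribute to the output.
active_out = adaln_zero_block(x, attn, mlp, (ones, zeros, ones, ones, zeros, ones))
```

The gate α multiplies each branch just before the residual addition, so zeroing it removes the branch regardless of the values of γ and β.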

Denoising diffusion probabilistic models (DDPMs). Diffusion [19, 54] and score-based generative models [22, 56] have been particularly successful as generative models of images [35,46,48,50], in many cases outperforming generative adversarial networks (GANs) [12] which had previously been state-of-the-art. Improvements in DDPMs over the past two years have largely been driven by improved sampling techniques [19, 27, 55], most notably classifier-free guidance [21], reformulating diffusion models to predict noise instead of pixels [19] and using cascaded DDPM pipelines where low-resolution base diffusion models are trained in parallel with upsamplers [9, 20]. For all the diffusion models listed above, convolutional U-Nets [49] are the de-facto choice of backbone architecture. Concurrent work [24] introduced a novel, efficient architecture based on attention for DDPMs; we explore pure transformers.

Architecture complexity. When evaluating architecture complexity in the image generation literature, it is fairly common practice to use parameter counts. In general, parameter counts can be poor proxies for the complexity of image models since they do not account for, e.g., image resolution which significantly impacts performance [44, 45]. Instead, much of the model complexity analysis in this paper is through the lens of theoretical Gflops. This brings us in-line with the architecture design literature where Gflops are widely-used to gauge complexity. In practice, the golden complexity metric is still up for debate as it frequently depends on particular application scenarios. Nichol and Dhariwal’s seminal work improving diffusion models [9, 36] is most related to us; there, they analyzed the scalability and Gflop properties of the U-Net architecture class. In this paper, we focus on the transformer class.

3. Diffusion Transformers

3.1. Preliminaries

Diffusion formulation. Before introducing our architecture, we briefly review some basic concepts needed to understand diffusion models (DDPMs) [19, 54]. Gaussian diffusion models assume a forward noising process which gradually applies noise to real data $x_0$: $q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I)$, where constants $\bar{\alpha}_t$ are hyperparameters. By applying the reparameterization trick, we can sample $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon_t$, where $\epsilon_t \sim \mathcal{N}(0, I)$.

Diffusion models are trained to learn the reverse process that inverts forward process corruptions: $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(\mu_\theta(x_t), \Sigma_\theta(x_t))$, where neural networks are used to predict the statistics of $p_\theta$. The reverse process model is trained with the variational lower bound [30] of the log-likelihood of $x_0$, which reduces to $\mathcal{L}(\theta) = -p(x_0 \mid x_1) + \sum_t \mathcal{D}_{KL}(q^*(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t))$, excluding an additional term irrelevant for training. Since both $q^*$ and $p_\theta$ are Gaussian, $\mathcal{D}_{KL}$ can be evaluated with the mean and covariance of the two distributions. By reparameterizing $\mu_\theta$ as a noise prediction network $\epsilon_\theta$, the model can be trained using simple mean-squared error between the predicted noise $\epsilon_\theta(x_t)$ and the ground truth sampled Gaussian noise $\epsilon_t$: $\mathcal{L}_{simple}(\theta) = \|\epsilon_\theta(x_t) - \epsilon_t\|_2^2$. But, in order to train diffusion models with a learned reverse process covariance $\Sigma_\theta$, the full $\mathcal{D}_{KL}$ term needs to be optimized. We follow Nichol and Dhariwal’s approach [36]: train $\epsilon_\theta$ with $\mathcal{L}_{simple}$, and train $\Sigma_\theta$ with the full $\mathcal{L}$. Once $p_\theta$ is trained, new images can be sampled by initializing $x_{t_{\max}} \sim \mathcal{N}(0, I)$ and sampling $x_{t-1} \sim p_\theta(x_{t-1} \mid x_t)$ via the reparameterization trick.
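
As a rough illustration of the forward noising step and the simple noise-prediction loss described above, the sketch below draws a noised sample from a toy four-entry vector. Several details are assumptions made for the example: the function names are hypothetical, the constants follow a 1000-step linear variance schedule from 1e-4 to 2e-2 (the ADM-style setting), and the cumulative product is computed with the usual DDPM convention, in which each step's factor is one minus that step's variance.

```python
import math
import random


def linear_alpha_bar(t_max=1000, beta_start=1e-4, beta_end=2e-2):
    """Return [alpha_bar_1, ..., alpha_bar_T] for a linear variance schedule."""
    alpha_bar, running = [], 1.0
    for i in range(t_max):
        beta = beta_start + (beta_end - beta_start) * i / (t_max - 1)
        running *= 1.0 - beta  # alpha_bar_t is the product of (1 - beta_s) for s <= t
        alpha_bar.append(running)
    return alpha_bar


def q_sample(x0, alpha_bar_t, rng):
    """Draw x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * v + b * e for v, e in zip(x0, eps)], eps


def simple_loss(eps_pred, eps):
    """L_simple: squared L2 distance between predicted and sampled noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps))


alpha_bar = linear_alpha_bar()
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]  # toy "latent" with four entries
xt, eps = q_sample(x0, alpha_bar[499], rng)

perfect = simple_loss(eps, eps)  # a perfect noise predictor has zero loss
trivial = simple_loss([0.0] * len(eps), eps)  # predicting zeros pays the squared norm of eps
```

Because the noised sample is a deterministic function of the clean input and the sampled noise, a network that recovers the noise also recovers the clean input, which is why the regression target can be the noise itself.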


Figure 4. Input specifications for DiT. Given patch size p × p, a spatial representation (the noised latent from the VAE) of shape I × I × C is “patchified” into a sequence of length T = (I/p)^2 with hidden dimension d. A smaller patch size p results in a longer sequence length and thus more Gflops.

Classifier-free guidance. Conditional diffusion models take extra information as input, such as a class label c. In this case, the reverse process becomes $p_\theta(x_{t-1} \mid x_t, c)$, where $\epsilon_\theta$ and $\Sigma_\theta$ are conditioned on c. In this setting, classifier-free guidance can be used to encourage the sampling procedure to find x such that $\log p(c \mid x)$ is high [21]. By Bayes Rule, $\log p(c \mid x) \propto \log p(x \mid c) - \log p(x)$, and hence $\nabla_x \log p(c \mid x) \propto \nabla_x \log p(x \mid c) - \nabla_x \log p(x)$. By interpreting the output of diffusion models as the score function, the DDPM sampling procedure can be guided to sample x with high $p(x \mid c)$ by: $\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \emptyset) + s \cdot \nabla_x \log p(x \mid c) \propto \epsilon_\theta(x_t, \emptyset) + s \cdot (\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \emptyset))$, where s > 1 indicates the scale of the guidance (note that s = 1 recovers standard sampling). Evaluating the diffusion model with c = ∅ is done by randomly dropping out c during training and replacing it with a learned “null” embedding ∅. Classifier-free guidance is widely-known to yield significantly improved samples over generic sampling techniques [21, 35, 46], and the trend holds for our DiT models.
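
The classifier-free guidance combination can be sketched in a few lines. This is a toy illustration rather than the authors' sampler: the two noise predictions are hard-coded lists instead of network outputs, the function names are hypothetical, and both the label-dropout probability and the choice of index 1000 for the null class are assumptions made only for this example.

```python
import random


def guided_eps(eps_cond, eps_uncond, s):
    """eps_hat = eps(x_t, null) + s * (eps(x_t, c) - eps(x_t, null))."""
    return [u + s * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def drop_label(label, null_label, drop_prob, rng):
    """Training-time label dropout: swap c for the null class with probability drop_prob."""
    return null_label if rng.random() < drop_prob else label


eps_c = [1.0, 2.0]  # toy model output given the class label
eps_null = [0.5, 1.0]  # toy model output given the null label

standard = guided_eps(eps_c, eps_null, s=1.0)  # s = 1 recovers the conditional prediction
guided = guided_eps(eps_c, eps_null, s=4.0)  # s > 1 moves further along (cond - uncond)

rng = random.Random(0)
NULL = 1000  # hypothetical index for the null class (ImageNet labels are 0..999)
labels = [drop_label(7, NULL, drop_prob=0.1, rng=rng) for _ in range(1000)]
```

Each guided sampling step needs both a conditional and an unconditional prediction, which is why the model has to see the null label during training.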
Latent diffusion models. Training diffusion models directly in high-resolution pixel space can be computationally prohibitive. Latent diffusion models (LDMs) [48] tackle this issue with a two-stage approach: (1) learn an autoencoder that compresses images into smaller spatial representations with a learned encoder E; (2) train a diffusion model of representations z = E(x) instead of a diffusion model of images x (E is frozen). New images can then be generated by sampling a representation z from the diffusion model and subsequently decoding it to an image with the learned decoder x = D(z).

As shown in Figure 2, LDMs achieve good performance while using a fraction of the Gflops of pixel space diffusion models like ADM. Since we are concerned with compute efficiency, this makes them an appealing starting point for architecture exploration. In this paper, we apply DiTs to latent space, although they could be applied to pixel space without modification as well. This makes our image generation pipeline a hybrid-based approach; we use off-the-shelf convolutional VAEs and transformer-based DDPMs.

3.2. Diffusion Transformer Design Space

We introduce Diffusion Transformers (DiTs), a new architecture for diffusion models. We aim to be as faithful to the standard transformer architecture as possible to retain its scaling properties. Since our focus is training DDPMs of images (specifically, spatial representations of images), DiT is based on the Vision Transformer (ViT) architecture which operates on sequences of patches [10]. DiT retains many of the best practices of ViTs. Figure 3 shows an overview of the complete DiT architecture. In this section, we describe the forward pass of DiT, as well as the components of the design space of the DiT class.

Patchify. The input to DiT is a spatial representation z (for 256 × 256 × 3 images, z has shape 32 × 32 × 4). The first layer of DiT is “patchify,” which converts the spatial input into a sequence of T tokens, each of dimension d, by linearly embedding each patch in the input. Following patchify, we apply standard ViT frequency-based positional embeddings (the sine-cosine version) to all input tokens. The number of tokens T created by patchify is determined by the patch size hyperparameter p. As shown in Figure 4, halving p will quadruple T, and thus at least quadruple total transformer Gflops. Although it has a significant impact on Gflops, note that changing p has no meaningful impact on downstream parameter counts.

We add p = 2, 4, 8 to the DiT design space.

DiT block design. Following patchify, the input tokens are processed by a sequence of transformer blocks. In addition to noised image inputs, diffusion models sometimes process additional conditional information such as noise timesteps t, class labels c, natural language, etc. We explore four variants of transformer blocks that process conditional inputs differently. The designs introduce small, but important, modifications to the standard ViT block design. The designs of all blocks are shown in Figure 3.

  – In-context conditioning. We simply append the vector embeddings of t and c as two additional tokens in the input sequence, treating them no differently from the image tokens. This is similar to cls tokens in ViTs, and it allows us to use standard ViT blocks without modification. After the final block, we remove the conditioning tokens from the sequence. This approach introduces negligible new Gflops to the model.


  – Cross-attention block. We concatenate the embeddings of t and c into a length-two sequence, separate from the image token sequence. The transformer block is modified to include an additional multi-head cross-attention layer following the multi-head self-attention block, similar to the original design from Vaswani et al. [60], and also similar to the one used by LDM for conditioning on class labels. Cross-attention adds the most Gflops to the model, roughly a 15% overhead.

  – Adaptive layer norm (adaLN) block. Following the widespread usage of adaptive normalization layers [40] in GANs [2, 28] and diffusion models with U-Net backbones [9], we explore replacing standard layer norm layers in transformer blocks with adaptive layer norm (adaLN). Rather than directly learn dimension-wise scale and shift parameters γ and β, we regress them from the sum of the embedding vectors of t and c. Of the three block designs we explore, adaLN adds the least Gflops and is thus the most compute-efficient. It is also the only conditioning mechanism that is restricted to apply the same function to all tokens.

  – adaLN-Zero block. Prior work on ResNets has found that initializing each residual block as the identity function is beneficial. For example, Goyal et al. found that zero-initializing the final batch norm scale factor γ in each block accelerates large-scale training in the supervised learning setting [13]. Diffusion U-Net models use a similar initialization strategy, zero-initializing the final convolutional layer in each block prior to any residual connections. We explore a modification of the adaLN DiT block which does the same. In addition to regressing γ and β, we also regress dimension-wise scaling parameters α that are applied immediately prior to any residual connections within the DiT block. We initialize the MLP to output the zero-vector for all α; this initializes the full DiT block as the identity function. As with the vanilla adaLN block, adaLN-Zero adds negligible Gflops to the model.

We include the in-context, cross-attention, adaptive layer norm and adaLN-Zero blocks in the DiT design space.

[Figure 5 plot: FID-50K (y-axis, 20 to 100) versus training steps (x-axis, 100K to 400K) for XL/2 In-Context, XL/2 Cross-Attention, XL/2 adaLN and XL/2 adaLN-Zero.]
Figure 5. Comparing different conditioning strategies. adaLN-Zero outperforms cross-attention and in-context conditioning at all stages of training.

Model size. We apply a sequence of N DiT blocks, each operating at the hidden dimension size d. Following ViT, we use standard transformer configs that jointly scale N, d and attention heads [10, 63]. Specifically, we use four configs: DiT-S, DiT-B, DiT-L and DiT-XL. They cover a wide range of model sizes and flop allocations, from 0.3 to 118.6 Gflops, allowing us to gauge scaling performance. Table 1 gives details of the configs.

We add B, S, L and XL configs to the DiT design space.

Model     Layers N    Hidden size d    Heads    Gflops (I=32, p=4)
DiT-S     12          384              6        1.4
DiT-B     12          768              12       5.6
DiT-L     24          1024             16       19.7
DiT-XL    28          1152             16       29.1

Table 1. Details of DiT models. We follow ViT [10] model configurations for the Small (S), Base (B) and Large (L) variants; we also introduce an XLarge (XL) config as our largest model.

Transformer decoder. After the final DiT block, we need to decode our sequence of image tokens into an output noise prediction and an output diagonal covariance prediction. Both of these outputs have shape equal to the original spatial input. We use a standard linear decoder to do this; we apply the final layer norm (adaptive if using adaLN) and linearly decode each token into a p×p×2C tensor, where C is the number of channels in the spatial input to DiT. Finally, we rearrange the decoded tokens into their original spatial layout to get the predicted noise and covariance.

The complete DiT design space we explore is patch size, transformer block architecture and model size.

4. Experimental Setup

We explore the DiT design space and study the scaling properties of our model class. Our models are named according to their configs and latent patch sizes p; for example, DiT-XL/2 refers to the XLarge config and p = 2.

Training. We train class-conditional latent DiT models at 256 × 256 and 512 × 512 image resolution on the ImageNet dataset [31], a highly-competitive generative modeling benchmark. We initialize the final linear layer with zeros and otherwise use standard weight initialization techniques from ViT. We train all models with AdamW [29,33].
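
As a small worked example of the naming convention and design space, the sketch below expands a model name such as DiT-XL/2 into its Table 1 config, the token count T = (I/p)^2 and the per-token decoder output size p × p × 2C. The helper name `describe_dit` and the returned dictionary keys are hypothetical, and the defaults assume the 32 × 32 × 4 latent used for 256 × 256 images.

```python
# (layers N, hidden size d, heads) per config, copied from Table 1.
DIT_CONFIGS = {
    "S": (12, 384, 6),
    "B": (12, 768, 12),
    "L": (24, 1024, 16),
    "XL": (28, 1152, 16),
}


def describe_dit(name, latent_size=32, channels=4):
    """Expand a name such as 'DiT-XL/2' into config values and sequence shapes."""
    prefix, _, rest = name.partition("-")
    size, _, patch = rest.partition("/")
    if prefix != "DiT" or size not in DIT_CONFIGS or not patch.isdigit():
        raise ValueError(f"unrecognized model name: {name!r}")
    p = int(patch)
    if p == 0 or latent_size % p != 0:
        raise ValueError(f"patch size {p} does not divide latent size {latent_size}")
    layers, hidden, heads = DIT_CONFIGS[size]
    return {
        "layers": layers,
        "hidden_size": hidden,
        "heads": heads,
        "patch_size": p,
        "tokens": (latent_size // p) ** 2,  # T = (I / p)^2
        "decoder_out_per_token": p * p * 2 * channels,  # noise plus diagonal covariance
    }


xl2 = describe_dit("DiT-XL/2")
token_counts = {p: describe_dit(f"DiT-XL/{p}")["tokens"] for p in (2, 4, 8)}
```

For DiT-XL/2 this gives 256 tokens, each decoded into 32 values, which together fill two 32 × 32 × 4 outputs; moving from p = 4 to p = 2 takes the sequence from 64 to 256 tokens, the fourfold increase noted in the patchify discussion.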


Figure 6. Scaling the DiT model improves FID at all stages of training. We show FID-50K over training iterations for 12 of our DiT
models. Top row: We compare FID holding patch size constant. Bottom row: We compare FID holding model size constant. Scaling the
transformer backbone yields better generative models across all model sizes and patch sizes.


We use a constant learning rate of $1 \times 10^{-4}$, no weight decay and a batch size of 256.
The only data augmentation we use is horizontal flips. Unlike much prior work with ViTs [57, 61],
we did not find learning rate warmup nor regularization necessary to train DiTs to high
performance. Even without these techniques, training was highly stable across all model configs
and we did not observe any loss spikes commonly seen when training transformers. Following common
practice in the generative modeling literature, we maintain an exponential moving average (EMA)
of DiT weights over training with a decay of 0.9999. All results reported use the EMA model. We
use identical training hyperparameters across all DiT model sizes and patch sizes. Our training
hyperparameters are almost entirely retained from ADM. We did not tune learning rates,
decay/warm-up schedules, Adam $\beta_1/\beta_2$ or weight decays.

Diffusion. We use an off-the-shelf pre-trained variational autoencoder (VAE) model [30] from
Stable Diffusion [48]. The VAE encoder has a downsample factor of 8: given an RGB image x with
shape 256 × 256 × 3, z = E(x) has shape 32 × 32 × 4. Across all experiments in this section, our
diffusion models operate in this Z-space. After sampling a new latent from our diffusion model,
we decode it to pixels using the VAE decoder x = D(z). We retain diffusion hyperparameters from
ADM [9]; specifically, we use a $t_{\max} = 1000$ linear variance schedule ranging from
$1 \times 10^{-4}$ to $2 \times 10^{-2}$, ADM's parameterization of the covariance
$\Sigma_\theta$ and their method for embedding input timesteps and labels.

Evaluation metrics. We measure scaling performance with Fréchet Inception Distance (FID) [18],
the standard metric for evaluating generative models of images. We follow convention when
comparing against prior works and report FID-50K using 250 DDPM sampling steps. FID is known to
be sensitive to small implementation details [37]; to ensure accurate comparisons, all values
reported in this paper are obtained by exporting samples and using ADM's TensorFlow evaluation
suite [9]. FID numbers reported in this section do not use classifier-free guidance except where
otherwise stated. We additionally report Inception Score [51], sFID [34] and Precision/Recall
[32] as secondary metrics.

Compute. We implement all models in JAX [1] and train them using TPU-v3 pods. DiT-XL/2, our most
compute-intensive model, trains at roughly 5.7 iterations/second on a TPU v3-256 pod with a
global batch size of 256.

5. Experiments

DiT block design. We train four of our highest Gflop DiT-XL/2 models, each using a different
block design: in-context (119.4 Gflops), cross-attention (137.6 Gflops), adaptive layer norm
(adaLN, 118.6 Gflops) or adaLN-Zero (118.6 Gflops). We measure FID over the course of training.
Figure 5 shows the results. The adaLN-Zero block yields lower FID than both cross-attention and
in-context conditioning while being the most compute-efficient. At 400K training iterations, the
FID achieved with the adaLN-Zero model is nearly half that of the in-context model, demonstrating
that the conditioning mechanism critically affects model quality. Initialization is also
important: adaLN-Zero, which initializes each DiT block as the identity function, significantly
outperforms vanilla adaLN. For the rest of the paper, all models will use adaLN-Zero DiT blocks.
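
To make the identity-initialization property of adaLN-Zero concrete, the following is a minimal
pure-Python sketch of one residual sub-block. It assumes that the conditioning network emits
per-dimension shift, scale and gate vectors, and that the gate multiplies the branch output just
before the residual addition. The function names, the toy branch and the (1 + scale) modulation
form are illustrative assumptions, not the authors' implementation.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln_zero_subblock(x, shift, scale, gate, branch):
    """Residual sub-block with adaptive layer norm and a gated branch.

    x, shift, scale, gate: lists of floats of equal length.
    branch: callable standing in for self-attention or the MLP (assumption).
    """
    # Modulate the normalized input with conditioning-dependent shift and scale.
    h = [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]
    # Gate the branch output immediately before the residual connection.
    out = branch(h)
    return [xi + g * oi for xi, g, oi in zip(x, gate, out)]


def toy_branch(h):
    """Arbitrary stand-in for the learned branch (hypothetical)."""
    return [3.0 * v + 1.0 for v in h]


x = [0.5, -1.0, 2.0]
zeros = [0.0, 0.0, 0.0]

# At initialization the conditioning outputs are all zero, so the gate is zero.
y_init = adaln_zero_subblock(x, zeros, zeros, zeros, toy_branch)

# With a non-zero gate the branch contributes to the output.
y_gated = adaln_zero_subblock(x, zeros, zeros, [0.1, 0.1, 0.1], toy_branch)

print(y_init == x)   # True: the sub-block is the identity
print(y_gated == x)  # False
```

With a zero gate the output equals the input whatever the branch computes, which is the sense in
which each block starts as the identity function.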

[Figure 7 image: grid of samples from the 12 DiT models. Axis labels: "Increasing transformer
size" and "Decreasing patch size".]

Figure 7. Increasing transformer forward pass Gflops increases sample quality. Best viewed
zoomed-in. We sample from all 12 of our DiT models after 400K training steps using the same input
latent noise and class label. Increasing the Gflops in the model (either by increasing
transformer depth/width or increasing the number of input tokens) yields significant improvements
in visual fidelity.
[Figure 8 plot: FID-50K versus Transformer Gflops (log scale) for the 12 DiT models, S/8 through
XL/2. Correlation: -0.93.]

Figure 8. Transformer Gflops are strongly correlated with FID. We plot the Gflops of each of our
DiT models and each model's FID-50K after 400K training steps.

[Figure 9 plot: FID-50K versus Training Compute (Gflops, log scale) for the 12 DiT models, S/8
through XL/2.]

Figure 9. Larger DiT models use large compute more efficiently. We plot FID as a function of
total training compute.
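
The training-compute axis of Figure 9 follows the estimate stated in the text: model Gflops ·
batch size · training steps · 3. The sketch below is a minimal illustration of that arithmetic,
using the DiT-XL/2 forward-pass cost (118.6 Gflops) and the batch size of 256 reported in this
paper. The function name is hypothetical.

```python
def training_compute_gflops(model_gflops, batch_size, training_steps):
    """Estimate total training compute in Gflops.

    The factor of 3 counts one forward pass plus a backward pass that is
    assumed to cost twice as much as the forward pass.
    """
    return model_gflops * batch_size * training_steps * 3


# DiT-XL/2 forward pass (118.6 Gflops), batch size 256, 400K training steps.
xl2_400k = training_compute_gflops(118.6, 256, 400_000)
print(f"{xl2_400k:.3e} Gflops")  # 3.643e+10 Gflops
```

Under these assumptions, 400K steps of DiT-XL/2 come to roughly $3.6 \times 10^{10}$ Gflops.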

Scaling model size and patch size. We train 12 DiT models, sweeping over model configs (S, B, L,
XL) and patch sizes (8, 4, 2). Note that DiT-L and DiT-XL are significantly closer to each other
in terms of relative Gflops than other configs. Figure 2 (left) gives an overview of the Gflops
of each model and their FID at 400K training iterations. In all cases, we find that increasing
model size and decreasing patch size yields considerably improved diffusion models.
    Figure 6 (top) demonstrates how FID changes as model size is increased and patch size is held
constant. Across all four configs, significant improvements in FID are obtained over all stages
of training by making the transformer deeper and wider. Similarly, Figure 6 (bottom) shows FID as
patch size is decreased and model size is held constant. We again observe considerable FID
improvements throughout training by simply scaling the number of tokens processed by DiT, holding
parameters approximately fixed.

DiT Gflops are critical to improving performance. The results of Figure 6 suggest that parameter
counts do not uniquely determine the quality of a DiT model. As model size is held constant and
patch size is decreased, the transformer's total parameters are effectively unchanged (actually,
total parameters slightly decrease), and only Gflops are increased. These results indicate that
scaling model Gflops is actually the key to improved performance. To investigate this further, we
plot the FID-50K at 400K training steps against model Gflops in Figure 8. The results demonstrate
that different DiT configs obtain similar FID values when their total Gflops are similar (e.g.,
DiT-S/2 and DiT-B/4). We find a strong negative correlation between model Gflops and FID-50K,
suggesting that additional model compute is the critical ingredient for improved DiT models. In
Figure 12 (appendix), we find that this trend holds for other metrics such as Inception Score.

Larger DiT models are more compute-efficient. In Figure 9, we plot FID as a function of total
training compute for all DiT models. We estimate training compute as model Gflops · batch size ·
training steps · 3, where the factor of 3 roughly approximates the backwards pass as being twice
as compute-heavy as the forward pass. We find that small DiT models, even when trained longer,
eventually become compute-inefficient relative to larger DiT models trained for fewer steps.
Similarly, we find that models that are identical except for patch size have different
performance profiles even when controlling for training Gflops. For example, XL/4 is outperformed
by XL/2 after roughly $10^{10}$ Gflops.

Visualizing scaling. We visualize the effect of scaling on sample quality in Figure 7. At 400K
training steps, we sample an image from each of our 12 DiT models using identical starting noise
$x_{t_{\max}}$, sampling noise and class labels. This lets us visually interpret how scaling
affects DiT sample quality. Indeed, scaling both model size and the number of tokens yields
notable improvements in visual quality.

5.1. State-of-the-Art Diffusion Models

256×256 ImageNet. Following our scaling analysis, we continue training our highest Gflop model,
DiT-XL/2, for 7M steps. We show samples from the model in Figure 1, and we compare against
state-of-the-art class-conditional generative models. We report results in Table 2. When using
classifier-free guidance, DiT-XL/2 outperforms all prior diffusion models, decreasing the
previous best FID-50K of 3.60 achieved by LDM to 2.27. Figure 2 (right) shows that DiT-XL/2
(118.6 Gflops) is compute-efficient relative to latent space U-Net models like LDM-4 (103.6
Gflops) and substantially more efficient than pixel space U-Net models such as ADM (1120 Gflops)
or ADM-U (742 Gflops).


Our method achieves the lowest FID of all prior generative models, including the previous
state-of-the-art StyleGAN-XL [53]. Finally, we also observe that DiT-XL/2 achieves higher recall
values at all tested classifier-free guidance scales compared to LDM-4 and LDM-8. When trained
for only 2.35M steps (similar to ADM), XL/2 still outperforms all prior diffusion models with an
FID of 2.55.

 Class-Conditional ImageNet 256×256
 Model                    FID↓    sFID↓   IS↑      Precision↑   Recall↑
 BigGAN-deep [2]           6.95    7.36   171.4      0.87        0.28
 StyleGAN-XL [53]          2.30    4.02   265.12     0.78        0.53
 ADM [9]                  10.94    6.02   100.98     0.69        0.63
 ADM-U                     7.49    5.13   127.49     0.72        0.63
 ADM-G                     4.59    5.25   186.70     0.82        0.52
 ADM-G, ADM-U              3.94    6.14   215.84     0.83        0.53
 CDM [20]                  4.88     -     158.71      -           -
 LDM-8 [48]               15.51     -      79.03     0.65        0.63
 LDM-8-G                   7.76     -     209.52     0.84        0.35
 LDM-4                    10.56     -     103.49     0.71        0.62
 LDM-4-G (cfg=1.25)        3.95     -     178.22     0.81        0.55
 LDM-4-G (cfg=1.50)        3.60     -     247.67     0.87        0.48
 DiT-XL/2                  9.62    6.85   121.50     0.67        0.67
 DiT-XL/2-G (cfg=1.25)     3.22    5.28   201.77     0.76        0.62
 DiT-XL/2-G (cfg=1.50)     2.27    4.60   278.24     0.83        0.57

Table 2. Benchmarking class-conditional image generation on ImageNet 256×256. DiT-XL/2 achieves
state-of-the-art FID.

512×512 ImageNet. We train a new DiT-XL/2 model on ImageNet at 512 × 512 resolution for 3M
iterations with identical hyperparameters as the 256 × 256 model. With a patch size of 2, this
XL/2 model processes a total of 1024 tokens after patchifying the 64 × 64 × 4 input latent (524.6
Gflops). Table 3 shows comparisons against state-of-the-art methods. XL/2 again outperforms all
prior diffusion models at this resolution, improving the previous best FID of 3.85 achieved by
ADM to 3.04. Even with the increased number of tokens, XL/2 remains compute-efficient. For
example, ADM uses 1983 Gflops and ADM-U uses 2813 Gflops; XL/2 uses 524.6 Gflops. We show samples
from the high-resolution XL/2 model in Figure 1 and the appendix.

 Class-Conditional ImageNet 512×512
 Model                    FID↓    sFID↓   IS↑      Precision↑   Recall↑
 BigGAN-deep [2]           8.43    8.13   177.90     0.88        0.29
 StyleGAN-XL [53]          2.41    4.06   267.75     0.77        0.52
 ADM [9]                  23.24   10.19    58.06     0.73        0.60
 ADM-U                     9.96    5.62   121.78     0.75        0.64
 ADM-G                     7.72    6.57   172.71     0.87        0.42
 ADM-G, ADM-U              3.85    5.86   221.72     0.84        0.53
 DiT-XL/2                 12.03    7.12   105.25     0.75        0.64
 DiT-XL/2-G (cfg=1.25)     4.64    5.77   174.77     0.81        0.57
 DiT-XL/2-G (cfg=1.50)     3.04    5.02   240.82     0.84        0.54

Table 3. Benchmarking class-conditional image generation on ImageNet 512×512. Note that prior
work [9] measures Precision and Recall using 1000 real samples for 512 × 512 resolution; for
consistency, we do the same.

[Figure 10 plot: FID-10K versus Sampling Compute (Gflops, log scale) for the 12 DiT models, S/8
through XL/2.]

Figure 10. Scaling-up sampling compute does not compensate for a lack of model compute. For each
of our DiT models trained for 400K iterations, we compute FID-10K using [16, 32, 64, 128, 256,
1000] sampling steps. For each number of steps, we plot the FID as well as the Gflops used to
sample each image. Small models cannot close the performance gap with our large models, even if
they sample with more test-time Gflops than the large models.

5.2. Scaling Model vs. Sampling Compute

   Diffusion models are unique in that they can use additional compute after training by
increasing the number of sampling steps when generating an image. Given the impact of model
Gflops on sample quality, in this section we study if smaller-model compute DiTs can outperform
larger ones by using more sampling compute. We compute FID for all 12 of our DiT models after
400K training steps, using [16, 32, 64, 128, 256, 1000] sampling steps per-image. The main
results are in Figure 10. Consider DiT-L/2 using 1000 sampling steps versus DiT-XL/2 using 128
steps. In this case, L/2 uses 80.7 Tflops to sample each image; XL/2 uses 5× less compute (15.2
Tflops) to sample each image. Nonetheless, XL/2 has the better FID-10K (23.7 vs 25.9). In
general, scaling-up sampling compute cannot compensate for a lack of model compute.

6. Conclusion

   We introduce Diffusion Transformers (DiTs), a simple transformer-based backbone for diffusion
models that outperforms prior U-Net models and inherits the excellent scaling properties of the
transformer model class. Given the promising scaling results in this paper, future work should
continue to scale DiTs to larger models and token counts. DiT could also be explored as a drop-in
backbone for text-to-image models like DALL·E 2 and Stable Diffusion.

Acknowledgements. We thank Kaiming He, Ronghang Hu, Alexander Berg, Shoubhik Debnath, Tim Brooks,
Ilija Radosavovic and Tete Xiao for helpful discussions. William Peebles is supported by the NSF
GRFP.
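
The per-image sampling costs quoted in Section 5.2 can be reproduced with simple arithmetic. The
sketch below assumes that sampling cost is one model forward pass per sampling step and ignores
everything else, such as the VAE decoder. The 80.7 Gflops value for DiT-L/2 is inferred from the
quoted 80.7 Tflops over 1000 steps, and the function name is hypothetical.

```python
def sampling_tflops(model_gflops, sampling_steps):
    """Per-image sampling compute in Tflops, assuming one forward pass per step."""
    return model_gflops * sampling_steps / 1000.0


# DiT-L/2: 80.7 Gflops per forward pass (inferred from 80.7 Tflops / 1000 steps).
l2 = sampling_tflops(80.7, 1000)
# DiT-XL/2: 118.6 Gflops per forward pass, as reported in the paper.
xl2 = sampling_tflops(118.6, 128)

print(round(l2, 1), round(xl2, 1), round(l2 / xl2, 1))  # 80.7 15.2 5.3
```

The ratio of about 5.3 is consistent with the "5× less compute" stated for XL/2 at 128 steps.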


References

 [1] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal
     Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao
     Zhang. JAX: composable transformations of Python+NumPy programs, 2018. 6
 [2] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity
     natural image synthesis. In ICLR, 2019. 5, 9
 [3] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
     Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
     few-shot learners. In NeurIPS, 2020. 1
 [4] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative
     image transformer. In CVPR, pages 11315–11325, 2022. 2
 [5] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter
     Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning
     via sequence modeling. In NeurIPS, 2021. 2
 [6] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya
     Sutskever. Generative pretraining from pixels. In ICML, 2020. 1, 2
 [7] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with
     sparse transformers. arXiv preprint arXiv:1904.10509, 2019. 2
 [8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
     bidirectional transformers for language understanding. In NAACL-HLT, 2019. 1
 [9] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In
     NeurIPS, 2021. 1, 2, 3, 5, 6, 9, 12
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
     Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et
     al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR,
     2020. 1, 2, 4, 5
[11] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image
     synthesis, 2020. 2
[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
     Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014. 3
[13] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola,
     Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training
     imagenet in 1 hour. arXiv:1706.02677, 2017. 5
[14] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining
     Guo. Vector quantized diffusion model for text-to-image synthesis. In CVPR, pages
     10696–10706, 2022. 2
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
     recognition. In CVPR, 2016. 2
[16] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint
     arXiv:1606.08415, 2016. 12
[17] Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo
     Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive
     generative modeling. arXiv preprint arXiv:2010.14701, 2020. 2
[18] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
     Gans trained by a two time-scale update rule converge to a local nash equilibrium. 2017. 6
[19] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In
     NeurIPS, 2020. 2, 3
[20] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim
     Salimans. Cascaded diffusion models for high fidelity image generation. arXiv:2106.15282,
     2021. 3, 9
[21] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop
     on Deep Generative Models and Downstream Applications, 2021. 3, 4
[22] Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score
     matching. Journal of Machine Learning Research, 6(4), 2005. 3
[23] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation
     with conditional adversarial networks. In Proceedings of the IEEE conference on computer
     vision and pattern recognition, pages 1125–1134, 2017. 2
[24] Allan Jabri, David Fleet, and Ting Chen. Scalable adaptive computation for iterative
     generation. arXiv preprint arXiv:2212.11972, 2022. 3
[25] Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big
     sequence modeling problem. In NeurIPS, 2021. 2
[26] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott
     Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.
     arXiv:2001.08361, 2020. 2, 13
[27] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of
     diffusion-based generative models. In Proc. NeurIPS, 2022. 3
[28] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for
     generative adversarial networks. In CVPR, 2019. 5
[29] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 5
[30] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint
     arXiv:1312.6114, 2013. 3, 6
[31] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
     convolutional neural networks. In NeurIPS, 2012. 5
[32] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved
     precision and recall metric for assessing generative models. In NeurIPS, 2019. 6
[33] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv:1711.05101,
     2017. 5
[34] Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W             [50] Chitwan Saharia, William Chan, Saurabh Saxena, Lala
     Battaglia. Generating images with sparse representations.                 Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed
     arXiv preprint arXiv:2103.03841, 2021. 6                                  Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi,
[35] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav                     Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J
     Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever,                        Fleet, and Mohammad Norouzi. Photorealistic text-to-
     and Mark Chen. Glide: Towards photorealistic image                        image diffusion models with deep language understanding.
     generation and editing with text-guided diffusion models.                 arXiv:2205.11487, 2022. 3
     arXiv:2112.10741, 2021. 3, 4                                         [51] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki
[36] Alexander Quinn Nichol and Prafulla Dhariwal. Improved                    Cheung, Alec Radford, Xi Chen, and Xi Chen. Improved
     denoising diffusion probabilistic models. In ICML, 2021. 3                techniques for training GANs. In NeurIPS, 2016. 6
[37] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On                    [52] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P
     aliased resizing and surprising subtleties in gan evaluation.             Kingma. PixelCNN++: Improving the pixelcnn with dis-
     In CVPR, 2022. 6                                                          cretized logistic mixture likelihood and other modifications.
                                                                               arXiv preprint arXiv:1701.05517, 2017. 2
[38] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz
     Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Im-             [53] Axel Sauer, Katja Schwarz, and Andreas Geiger. Stylegan-
     age transformer. In International conference on machine                   xl: Scaling stylegan to large diverse datasets. In SIGGRAPH,
     learning, pages 4055–4064. PMLR, 2018. 2                                  2022. 9
                                                                          [54] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan,
[39] William Peebles, Ilija Radosavovic, Tim Brooks, Alexei
                                                                               and Surya Ganguli. Deep unsupervised learning using
     Efros, and Jitendra Malik. Learning to learn with genera-
                                                                               nonequilibrium thermodynamics. In ICML, 2015. 3
     tive models of neural network checkpoints. arXiv preprint
     arXiv:2209.12892, 2022. 2                                            [55] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois-
                                                                               ing diffusion implicit models. arXiv:2010.02502, 2020. 3
[40] Ethan Perez, Florian Strub, Harm De Vries, Vincent Du-
                                                                          [56] Yang Song and Stefano Ermon. Generative modeling by es-
     moulin, and Aaron Courville. Film: Visual reasoning with a
                                                                               timating gradients of the data distribution. In NeurIPS, 2019.
     general conditioning layer. In AAAI, 2018. 2, 5
                                                                               3
[41] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
                                                                          [57] Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross
     Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
                                                                               Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train
     Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn-
                                                                               your ViT? data, augmentation, and regularization in vision
     ing transferable visual models from natural language super-
                                                                               transformers. TMLR, 2022. 6
     vision. In ICML, 2021. 2
[42] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language
     understanding by generative pre-training. 2018. 1
[43] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al.
     Language models are unsupervised multitask learners. 2019. 1
[44] Ilija Radosavovic, Justin Johnson, Saining Xie, Wan-Yen Lo, and Piotr Dollár. On network
     design spaces for visual recognition. In ICCV, 2019. 3
[45] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár.
     Designing network design spaces. In CVPR, 2020. 3
[46] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical
     text-conditional image generation with clip latents. arXiv:2204.06125, 2022. 1, 2, 3, 4
[47] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford,
     Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021. 1, 2
[48] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer.
     High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 2, 3, 4, 6, 9
[49] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for
     biomedical image segmentation. In International Conference on Medical image computing
     and computer-assisted intervention, pages 234–241. Springer, 2015. 2, 3
[58] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al.
     Conditional image generation with pixelcnn decoders. Advances in neural information
     processing systems, 29, 2016. 2
[59] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.
     Advances in neural information processing systems, 30, 2017. 2
[60] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
     Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 1, 2, 5
[61] Tete Xiao, Piotr Dollar, Mannat Singh, Eric Mintun, Trevor Darrell, and Ross Girshick.
     Early convolutions help transformers see better. In NeurIPS, 2021. 6
[62] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay
     Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive
     models for content-rich text-to-image generation. arXiv:2206.10789, 2022. 2
[63] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision
     transformers. In CVPR, 2022. 2, 5


Figure 11. Additional selected samples from our 512×512 and 256×256 resolution DiT-XL/2 models. We use a classifier-free guidance
scale of 6.0 for the 512 × 512 model and 4.0 for the 256 × 256 model. Both models use the ft-EMA VAE decoder.

A. Additional Implementation Details

   We include detailed information about all of our DiT models in Table 4, including both 256 × 256 and 512 × 512
models. In Figure 13, we report DiT training loss curves. Finally, we also include Gflop counts for DDPM U-Net
models from ADM and LDM in Table 6.

DiT model details. To embed input timesteps, we use a 256-dimensional frequency embedding [9] followed by
a two-layer MLP with dimensionality equal to the transformer’s hidden size and SiLU activations. Each adaLN
layer feeds the sum of the timestep and class embeddings into a SiLU nonlinearity and a linear layer with output
neurons equal to either 4× (adaLN) or 6× (adaLN-Zero) the transformer’s hidden size. We use GELU nonlinearities
(approximated with tanh) in the core transformer [16].

Classifier-free guidance on a subset of channels. In our experiments using classifier-free guidance, we applied
guidance only to the first three channels of the latents instead of all four channels. Upon investigating, we found
that three-channel guidance and four-channel guidance give similar results (in terms of FID) when simply adjusting
the scale factor. Specifically, three-channel guidance with a scale of (1 + x) appears reasonably well-approximated
by four-channel guidance with a scale of (1 + (3/4)x) (e.g., three-channel guidance with a scale of 1.5 gives an
FID-50K of 2.27, and four-channel guidance with a scale of 1.375 gives an FID-50K of 2.20). It is somewhat
interesting that applying guidance to a subset of elements can still yield good performance, and we leave it to
future work to explore this phenomenon further.

B. Model Samples

   We show samples from our two DiT-XL/2 models at 512 × 512 and 256 × 256 resolution trained for 3M and 7M
steps, respectively. Figures 1 and 11 show selected samples from both models. Figures 14 through 33 show uncurated
samples from the two models across a range of classifier-free guidance scales and input class labels (generated with
250 DDPM sampling steps and the ft-EMA VAE decoder). As with prior work using guidance, we observe that larger
scales increase visual fidelity and decrease sample diversity.
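The conditioning path described under "DiT model details" in Appendix A can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the cos/sin layout and `max_period` of the frequency embedding, the placement of a single SiLU between the two MLP layers, the toy hidden size, the random weights, and all function names are hypothetical. Only the sizes (256-dimensional embedding, hidden-size MLP, 4× or 6× output width) come from the text.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def frequency_embedding(t, dim=256, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep into `dim` features.

    Assumption: the cos/sin layout and max_period follow common diffusion code.
    """
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

def timestep_mlp(freq_emb, w1, b1, w2, b2):
    """Two-layer MLP whose width equals the transformer hidden size."""
    return w2 @ silu(w1 @ freq_emb + b1) + b2

def adaln_modulation(t_emb, y_emb, weight, bias, num_chunks):
    """SiLU, then a linear layer, applied to the summed conditioning vector.

    num_chunks is 4 for adaLN and 6 for adaLN-Zero, so the linear layer
    has num_chunks * hidden output neurons.
    """
    params = weight @ silu(t_emb + y_emb) + bias
    return np.split(params, num_chunks)

hidden = 8  # hypothetical toy hidden size
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(hidden, 256)) * 0.02, np.zeros(hidden)
w2, b2 = rng.normal(size=(hidden, hidden)) * 0.02, np.zeros(hidden)

t_emb = timestep_mlp(frequency_embedding(500.0), w1, b1, w2, b2)
y_emb = rng.normal(size=hidden)  # stand-in for a learned class embedding

# Toy check: an all-zero linear layer yields six all-zero modulation vectors.
chunks = adaln_modulation(
    t_emb, y_emb, np.zeros((6 * hidden, hidden)), np.zeros(6 * hidden), 6
)
print(len(chunks), chunks[0].shape)
```

The zero weight matrix here is only a toy choice that makes the output easy to check; the split into equal-width vectors is one plausible way to consume a 4× or 6× hidden-size output.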
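The three-channel guidance described in Appendix A can be illustrated with a short NumPy sketch. It assumes the usual classifier-free guidance combination, uncond + scale × (cond − uncond), and assumes that the unguided fourth channel simply keeps the conditional prediction; the text does not say what is done with that channel, and the function names are hypothetical.

```python
import numpy as np

def guide_first_channels(eps_cond, eps_uncond, scale, num_guided=3):
    """Apply classifier-free guidance to the first `num_guided` channels only.

    eps_cond, eps_uncond: arrays of shape (channels, height, width).
    Assumption: remaining channels keep the conditional prediction.
    """
    guided = eps_uncond + scale * (eps_cond - eps_uncond)
    out = eps_cond.copy()
    out[:num_guided] = guided[:num_guided]
    return out

def approx_four_channel_scale(three_channel_scale):
    """Four-channel scale 1 + (3/4)x matching a three-channel scale 1 + x."""
    x = three_channel_scale - 1.0
    return 1.0 + 0.75 * x

eps_cond = np.ones((4, 2, 2))
eps_uncond = np.zeros((4, 2, 2))
out = guide_first_channels(eps_cond, eps_uncond, scale=1.5)
print(out[:, 0, 0])                    # per-channel values after guidance
print(approx_four_channel_scale(1.5))  # the 1.5 -> 1.375 example from the text
```

Under the assumption above, the fourth channel effectively has a scale of 1, so the average per-channel scale is (3(1 + x) + 1)/4 = 1 + (3/4)x. That is one possible reading of the reported factor, not an explanation given in the text.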


Model     Image Resolution  Flops (G)  Params (M)  Training Steps (K)  Batch Size  Learning Rate  DiT Block        FID-50K (no guidance)
DiT-S/8   256 × 256         0.36       33          400                 256         1 × 10^-4      adaLN-Zero       153.60
DiT-S/4   256 × 256         1.41       33          400                 256         1 × 10^-4      adaLN-Zero       100.41
DiT-S/2   256 × 256         6.06       33          400                 256         1 × 10^-4      adaLN-Zero       68.40
DiT-B/8   256 × 256         1.42       131         400                 256         1 × 10^-4      adaLN-Zero       122.74
DiT-B/4   256 × 256         5.56       130         400                 256         1 × 10^-4      adaLN-Zero       68.38
DiT-B/2   256 × 256         23.01      130         400                 256         1 × 10^-4      adaLN-Zero       43.47
DiT-L/8   256 × 256         5.01       459         400                 256         1 × 10^-4      adaLN-Zero       118.87
DiT-L/4   256 × 256         19.70      458         400                 256         1 × 10^-4      adaLN-Zero       45.64
DiT-L/2   256 × 256         80.71      458         400                 256         1 × 10^-4      adaLN-Zero       23.33
DiT-XL/8  256 × 256         7.39       676         400                 256         1 × 10^-4      adaLN-Zero       106.41
DiT-XL/4  256 × 256         29.05      675         400                 256         1 × 10^-4      adaLN-Zero       43.01
DiT-XL/2  256 × 256         118.64     675         400                 256         1 × 10^-4      adaLN-Zero       19.47
DiT-XL/2  256 × 256         119.37     449         400                 256         1 × 10^-4      in-context       35.24
DiT-XL/2  256 × 256         137.62     598         400                 256         1 × 10^-4      cross-attention  26.14
DiT-XL/2  256 × 256         118.56     600         400                 256         1 × 10^-4      adaLN            25.21
DiT-XL/2  256 × 256         118.64     675         2352                256         1 × 10^-4      adaLN-Zero       10.67
DiT-XL/2  256 × 256         118.64     675         7000                256         1 × 10^-4      adaLN-Zero       9.62
DiT-XL/2  512 × 512         524.60     675         1301                256         1 × 10^-4      adaLN-Zero       13.78
DiT-XL/2  512 × 512         524.60     675         3000                256         1 × 10^-4      adaLN-Zero       11.93


Table 4. Details of all DiT models. We report detailed information about every DiT model in our paper. Note that FID-50K here is
computed without classifier-free guidance. Parameter and flop counts exclude the VAE model which contains 84M parameters across the
encoder and decoder. For both the 256 × 256 and 512 × 512 DiT-XL/2 models, we never observed FID saturate and continued training
them as long as possible. Numbers reported in this table use the ft-MSE VAE decoder.
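As a rough check of the relationship between compute and sample quality, the twelve adaLN-Zero rows of Table 4 trained for 400K steps can be fed into a Pearson correlation. The sketch below uses only the Python standard library. Correlating log Gflops with log FID is an assumption made here for illustration (the table itself does not prescribe a transform), and the helper name `pearson` is hypothetical.

```python
import math

# (model, Gflops, FID-50K without guidance) for the 400K-step adaLN-Zero rows of Table 4.
TABLE4_400K = [
    ("DiT-S/8", 0.36, 153.60), ("DiT-S/4", 1.41, 100.41), ("DiT-S/2", 6.06, 68.40),
    ("DiT-B/8", 1.42, 122.74), ("DiT-B/4", 5.56, 68.38), ("DiT-B/2", 23.01, 43.47),
    ("DiT-L/8", 5.01, 118.87), ("DiT-L/4", 19.70, 45.64), ("DiT-L/2", 80.71, 23.33),
    ("DiT-XL/8", 7.39, 106.41), ("DiT-XL/4", 29.05, 43.01), ("DiT-XL/2", 118.64, 19.47),
]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

log_gflops = [math.log(gflops) for _, gflops, _ in TABLE4_400K]
log_fid = [math.log(fid) for _, _, fid in TABLE4_400K]
r = pearson(log_gflops, log_fid)
print(f"Pearson r (log Gflops vs. log FID-50K): {r:.3f}")
```

A strongly negative value means that, across these models, more forward-pass compute goes with lower (better) FID at a fixed number of training steps.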


C. Additional Scaling Results

Impact of scaling on metrics beyond FID. In Figure 12, we show the effects of DiT scale on a suite of evaluation
metrics: FID, sFID, Inception Score, Precision and Recall. We find that our FID-driven analysis in the main paper
generalizes to the other metrics: across every metric, scaled-up DiT models are more compute-efficient and model
Gflops are highly-correlated with performance. In particular, Inception Score and Precision benefit heavily from
increased model scale.

Impact of scaling on training loss. We also examine the impact of scale on training loss in Figure 13. Increasing
DiT model Gflops (via transformer size or number of input tokens) causes the training loss to decrease more rapidly
and saturate at a lower value. This phenomenon is consistent with trends observed with language models, where
scaled-up transformers demonstrate both improved loss curves as well as improved performance on downstream
evaluation suites [26].

Class-Conditional ImageNet 256×256, DiT-XL/2-G (cfg=1.5)
Decoder     FID↓    sFID↓    IS↑       Precision↑    Recall↑
original    2.46    5.18     271.56    0.82          0.57
ft-MSE      2.30    4.73     276.09    0.83          0.57
ft-EMA      2.27    4.60     278.24    0.83          0.57

Table 5. Decoder ablation. We tested different pre-trained VAE decoder weights available at
https://huggingface.co/stabilityai/sd-vae-ft-mse. Different pre-trained decoder weights yield comparable results
on ImageNet 256 × 256.

Diffusion U-Net Model Complexities
Model    Image Resolution    Base Flops (G)    Upsampler Flops (G)    Total Flops (G)
ADM      128 × 128           307               -                      307
ADM      256 × 256           1120              -                      1120
ADM      512 × 512           1983              -                      1983
ADM-U    256 × 256           110               632                    742
ADM-U    512 × 512           307               2506                   2813
LDM-4    256 × 256           104               -                      104
LDM-8    256 × 256           57                -                      57

Table 6. Gflop counts for baseline diffusion models that use U-Net backbones. Note that we only count Flops for
DDPM components.
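The Gflop counts in Table 6 can be set against the DiT-XL/2 counts from Table 4 with a few lines of arithmetic. The sketch below is only a convenience for reading the two tables together: the dictionary layout and function names are hypothetical, and a missing upsampler is encoded as 0.

```python
# Table 6: (base Gflops, upsampler Gflops) keyed by (model, image resolution).
UNET_GFLOPS = {
    ("ADM", 256): (1120, 0),
    ("ADM", 512): (1983, 0),
    ("ADM-U", 256): (110, 632),
    ("ADM-U", 512): (307, 2506),
    ("LDM-4", 256): (104, 0),
    ("LDM-8", 256): (57, 0),
}

# Table 4: DiT-XL/2 Gflops keyed by image resolution.
DIT_XL2_GFLOPS = {256: 118.64, 512: 524.60}

def total_gflops(model, resolution):
    """Base plus upsampler Gflops for a U-Net baseline."""
    base, upsampler = UNET_GFLOPS[(model, resolution)]
    return base + upsampler

def ratio_to_dit_xl2(model, resolution):
    """How many times more Gflops the baseline uses than DiT-XL/2."""
    return total_gflops(model, resolution) / DIT_XL2_GFLOPS[resolution]

for model, resolution in UNET_GFLOPS:
    ratio = ratio_to_dit_xl2(model, resolution)
    print(f"{model:6s} {resolution}x{resolution}: {ratio:.2f}x DiT-XL/2")
```

A ratio above 1 means the U-Net baseline spends more forward-pass compute than DiT-XL/2 at that resolution, and a ratio below 1 means it spends less.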


D. VAE Decoder Ablations

   We used off-the-shelf, pre-trained VAEs across our experiments. The VAE models (ft-MSE and ft-EMA) are
fine-tuned versions of the original LDM “f8” model (only the decoder weights are fine-tuned). We monitored metrics
for our scaling analysis in Section 5 using the ft-MSE decoder, and we used the ft-EMA decoder for our final
metrics reported in Tables 2 and 3. In this section, we ablate three different choices of the VAE decoder: the
original one used by LDM and the two fine-tuned decoders used by Stable Diffusion. Because the encoders are
identical across models, the decoders can be swapped in without retraining the diffusion model. Table 5 shows
results; XL/2 continues to outperform all prior diffusion models when using the LDM decoder.
ported in Tables 2 and 3. In this section, we ablate three


Figure 12. DiT scaling behavior on several generative modeling metrics. Left: We plot model performance as a function of total training
compute for FID, sFID, Inception Score, Precision and Recall. Right: We plot model performance at 400K training steps for all 12 DiT
variants against transformer Gflops, finding strong correlations across metrics. All values were computed using the ft-MSE VAE decoder.




[Figure 13 plots: five panels of Training Loss (y-axis, 0.13 to 0.21) against Training Iterations (x-axis). Panels:
S/8, S/4, S/2; B/8, B/4, B/2; L/8, L/4, L/2; XL/8, XL/4, XL/2 (each from 0 to 1M iterations); and XL/2 (256x256),
XL/2 (512x512) (from 0 to 7M iterations). Each panel includes an inset of the first 100K iterations.]
Figure 13. Training loss curves for all DiT models. We plot the loss over training for all DiT models (the sum of the noise prediction
mean-squared error and D_KL). We also highlight early training behavior. Note that scaled-up DiT models exhibit lower training losses.
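The quantity plotted in Figure 13 is described as the sum of a noise-prediction mean-squared error and a KL term. A minimal NumPy sketch of such a sum is given below, using the closed-form KL divergence between diagonal Gaussians. The unweighted sum, the mean reduction, the use of nats, and all function and argument names are assumptions made for illustration; the caption does not specify them.

```python
import numpy as np

def gaussian_kl(mean_q, logvar_q, mean_p, logvar_p):
    """Elementwise KL(q || p) between diagonal Gaussians, in nats."""
    return 0.5 * (
        logvar_p - logvar_q
        + np.exp(logvar_q - logvar_p)
        + (mean_q - mean_p) ** 2 * np.exp(-logvar_p)
        - 1.0
    )

def training_loss(eps, eps_pred, mean_q, logvar_q, mean_p, logvar_p):
    """Noise-prediction MSE plus the mean KL term (assumed unweighted)."""
    mse = np.mean((eps - eps_pred) ** 2)
    kl = np.mean(gaussian_kl(mean_q, logvar_q, mean_p, logvar_p))
    return mse + kl

shape = (4, 2, 2)  # hypothetical latent shape
zeros, ones = np.zeros(shape), np.ones(shape)

# Identical Gaussians give zero KL, so only the MSE term remains.
print(training_loss(zeros, ones, zeros, zeros, zeros, zeros))
```

Here `mean_q` and `logvar_q` stand for the target Gaussian and `mean_p` and `logvar_p` for the model's predicted Gaussian at a given timestep.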

DiT-XL/2 512 × 512 samples, classifier-free guidance scale = 4.0

Figure 14. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = “arctic wolf” (270).
Figure 15. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = “volcano” (980).
Figure 16. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = “husky” (250).
Figure 17. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = “sulphur-crested cockatoo” (89).
Figure 18. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = “cliff drop-off” (972).
Figure 19. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = “balloon” (417).
Figure 20. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = “lion” (291).
Figure 21. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = “otter” (360).

DiT-XL/2 512 × 512 samples, classifier-free guidance scale = 2.0

Figure 22. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 2.0. Class label = “red panda” (387).
Figure 23. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 2.0. Class label = “panda” (388).

DiT-XL/2 512 × 512 samples, classifier-free guidance scale = 1.5

Figure 24. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 1.5. Class label = “coral reef” (973).
Figure 25. Uncurated 512 × 512 DiT-XL/2 samples. Classifier-free guidance scale = 1.5. Class label = “macaw” (88).

DiT-XL/2 256 × 256 samples, classifier-free guidance scale = 4.0

Figure 26. Uncurated 256 × 256 DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = “macaw” (88).
Figure 27. Uncurated 256 × 256 DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = “dog sled” (537).
Figure 28. Uncurated 256 × 256 DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = “arctic fox” (279).
Figure 29. Uncurated 256 × 256 DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = “loggerhead sea turtle” (33).

DiT-XL/2 256 × 256 samples, classifier-free guidance scale = 2.0

Figure 30. Uncurated 256 × 256 DiT-XL/2 samples. Classifier-free guidance scale = 2.0. Class label = “golden retriever” (207).
Figure 31. Uncurated 256 × 256 DiT-XL/2 samples. Classifier-free guidance scale = 2.0. Class label = “lake shore” (975).

DiT-XL/2 256 × 256 samples, classifier-free guidance scale = 1.5

Figure 32. Uncurated 256 × 256 DiT-XL/2 samples. Classifier-free guidance scale = 1.5. Class label = “space shuttle” (812).
Figure 33. Uncurated 256 × 256 DiT-XL/2 samples. Classifier-free guidance scale = 1.5. Class label = “ice cream” (928).