Improved Denoising Diffusion Probabilistic Models

Metadata

Slug: improved_ddpm_2021
Year: 2021
Venue: ICML
Authors: Alex Nichol, Prafulla Dhariwal
Reading status: read complete
Compute regime: Generative media compute (generative_media_compute)
Primary sources: PDF, extracted text
Reading card created: 2026-06-15

Compute Setup

The paper does not name the exact training hardware, GPU model, accelerator count, or cluster topology. It does say that sampling with the full 4000-step chain takes several minutes on a "modern GPU," and the appendix says larger class-conditional ImageNet 64x64 experiments scaled the batch size to 2048 "for faster training on more GPUs." Under the project rule, the device context is inferred as 2021 OpenAI GPU training infrastructure, likely V100-class multi-GPU systems, but that hardware inference is not explicitly listed in the paper.

Compute scale is specified through model size, diffusion steps, and training iterations. Most models use 4000 diffusion steps. The ImageNet 64x64 U-Net with C=128 has about 120M parameters and roughly 39B FLOPs per forward pass. The scaling study uses 30M, 68M, 120M, and 270M parameter models. Class-conditional ImageNet 64x64 uses 100M and 270M parameter models, sampling with 250 steps. Standard experiments mostly use batch size 128, Adam, learning rate 1e-4, and EMA 0.9999; larger class-conditional runs use batch size 2048.

Bottleneck

The bottleneck is the number of denoising network evaluations. Sampling a 39B-FLOP U-Net for 4000 reverse steps costs about 1.6e14 forward-pass FLOPs per image, and the steps must run one after another. The source states this directly: all models are trained with 4000 diffusion steps, and producing one sample takes several minutes on a modern GPU. This is the deployment bottleneck, especially compared with one-pass GAN generation.
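The arithmetic behind that bottleneck can be sketched in a few lines. This is a back-of-the-envelope estimate, not a measurement: it takes the 39B-FLOP forward pass and the 4000-step chain from the paper, and assumes exactly one network evaluation per reverse step. The function name sampling_flops and the list of step counts are illustrative choices.

```python
# Forward-pass cost of the C=128 ImageNet 64x64 U-Net, as reported in the paper.
FLOPS_PER_FORWARD = 39e9
FULL_CHAIN_STEPS = 4000


def sampling_flops(num_steps: int, flops_per_forward: float = FLOPS_PER_FORWARD) -> float:
    """Forward-pass FLOPs to draw one image with `num_steps` reverse steps.

    Assumes one network evaluation per reverse step and ignores everything
    outside the network (noise sampling, schedule arithmetic).
    """
    return num_steps * flops_per_forward


if __name__ == "__main__":
    for k in (25, 50, 100, 250, 1000, 4000):
        saving = FULL_CHAIN_STEPS / k
        print(f"K={k:>4}: {sampling_flops(k):.2e} FLOPs per image, "
              f"{saving:.0f}x fewer than the full chain")
```

Because the cost is a plain product, every reduction in step count translates one-for-one into inference savings. Wall-clock time may not track this exactly, since it also depends on hardware utilization.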

There is also a training-objective bottleneck. Directly optimizing the variational lower bound (L_vlb) should improve likelihood, but the paper finds it difficult to optimize on diverse ImageNet 64x64 because the gradient is much noisier than the hybrid objective. Noisy gradients waste training compute: more iterations or larger batches are needed to extract the same signal. The paper's compute question is therefore twofold: reduce sampling passes at inference time and reduce gradient noise during likelihood training.

Method Adaptation

The key adaptation is to learn the reverse-process variance in addition to the mean. Ho et al.'s DDPM fixed the reverse variance; this paper has the network output an interpolation coefficient v and sets the variance to a log-domain interpolation between two bounds: the forward-process variance beta_t (upper) and the posterior variance beta-tilde_t (lower). This has a compute side effect: when the sampler uses a shorter subsequence of timesteps, both bounds are recomputed for that shorter process, so the learned variance is rescaled automatically and fewer reverse steps become viable without retraining.
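A minimal sketch of that variance parameterization follows, using the paper's form exp(v * log(beta_t) + (1 - v) * log(beta-tilde_t)). Several things here are assumptions of the sketch rather than details taken from this card: the linear noise schedule and its endpoints, the helper names, and the handling of the first timestep, where beta-tilde is exactly zero and its logarithm is undefined.

```python
import math


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Illustrative linear noise schedule (an assumption of this sketch)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def posterior_variances(betas):
    """Lower bound beta_tilde_t = (1 - abar_{t-1}) / (1 - abar_t) * beta_t."""
    tilde, abar_prev = [], 1.0
    for beta in betas:
        abar = abar_prev * (1.0 - beta)
        tilde.append((1.0 - abar_prev) / (1.0 - abar) * beta)
        abar_prev = abar
    # beta_tilde is exactly 0 at the first step, so its log is undefined.
    # Reusing the second entry is a choice made for this sketch.
    tilde[0] = tilde[1]
    return tilde


def learned_variance(v, beta, beta_tilde):
    """Interpolate in the log domain between the two variance bounds.

    v = 1 gives the upper bound beta, v = 0 gives the lower bound beta_tilde.
    """
    return math.exp(v * math.log(beta) + (1.0 - v) * math.log(beta_tilde))


if __name__ == "__main__":
    betas = linear_betas(1000)
    tilde = posterior_variances(betas)
    for v in (0.0, 0.5, 1.0):
        print(v, learned_variance(v, betas[500], tilde[500]))
```

In a real model v would be a per-pixel network output. Here it is a scalar so the interpolation itself is easy to inspect.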

The second adaptation is objective design. The paper uses a hybrid objective, L_hybrid = L_simple + lambda * L_vlb with lambda = 0.001, and applies a stop-gradient to the mean inside the L_vlb term so that the VLB term trains only the variance. This improves log-likelihood without sacrificing sample quality as much as direct VLB optimization. For direct VLB, the authors add importance sampling over timesteps: they keep a history of the previous 10 values of each loss term and sample t in proportion to the square root of the estimated second moment, sqrt(E[L_t^2]). That concentrates gradient effort where VLB terms matter and reduces objective noise.
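The timestep importance sampler can be sketched as below. It follows the paper's description (sampling probability proportional to the square root of the running second moment of each loss term, with a history of 10 values per term and uniform sampling until every history is full). The class name, method names, and the uniform fallback when all recorded losses are zero are assumptions of this sketch, not the authors' code.

```python
import math
import random


class LossSecondMomentSampler:
    """Sample timesteps with probability proportional to sqrt(E[L_t^2])."""

    def __init__(self, num_timesteps, history_per_term=10):
        self.num_timesteps = num_timesteps
        self.history_per_term = history_per_term
        self.history = [[] for _ in range(num_timesteps)]

    def probabilities(self):
        uniform = [1.0 / self.num_timesteps] * self.num_timesteps
        # Stay uniform until every timestep has a full loss history.
        if any(len(h) < self.history_per_term for h in self.history):
            return uniform
        raw = [math.sqrt(sum(x * x for x in h) / len(h)) for h in self.history]
        total = sum(raw)
        if total == 0.0:  # degenerate case, handled here by assumption
            return uniform
        return [r / total for r in raw]

    def sample(self, rng=random):
        """Return (t, weight); weight = 1 / (T * p_t) keeps the reweighted
        loss an unbiased estimate of the mean of L_t over timesteps."""
        probs = self.probabilities()
        t = rng.choices(range(self.num_timesteps), weights=probs, k=1)[0]
        return t, 1.0 / (self.num_timesteps * probs[t])

    def update(self, t, loss):
        """Record a new loss value for timestep t, dropping the oldest."""
        h = self.history[t]
        h.append(float(loss))
        if len(h) > self.history_per_term:
            h.pop(0)
```

In a training loop, one would call sample() to pick t, multiply that step's VLB term by the returned weight before backpropagation, and then call update(t, loss) with the unweighted loss value.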

The third adaptation is strided sampling. A model trained with 4000 steps is sampled using an evenly spaced subset of K timesteps, with K tested at 25, 50, 100, 200, 400, 1000, and 4000. This changes inference cost linearly in K. Learned-sigma L_hybrid models preserve FID far better than fixed-sigma models when K is reduced.
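A sketch of the strided schedule is shown below. The paper picks K evenly spaced real numbers between 1 and T and rounds them to integers, then derives the noise schedule of the shorter chain from the cumulative products of the original one: beta for the i-th kept step is 1 - abar(S_i) / abar(S_{i-1}). The function names are hypothetical, and Python's round() (which rounds exact halves to the nearest even integer) stands in for the paper's rounding.

```python
def strided_timesteps(num_train_steps, num_sample_steps):
    """K evenly spaced timesteps in [1, T], 1-indexed and increasing."""
    if not 2 <= num_sample_steps <= num_train_steps:
        raise ValueError("need 2 <= num_sample_steps <= num_train_steps")
    stride = (num_train_steps - 1) / (num_sample_steps - 1)
    return [round(1 + i * stride) for i in range(num_sample_steps)]


def respaced_betas(betas, timesteps):
    """Betas of the shortened chain that visits only `timesteps`.

    Each new beta is 1 - abar[S_i] / abar[S_{i-1}], so the shortened chain
    reaches the same cumulative noise level at every kept timestep.
    """
    # Cumulative products abar_t of the full chain; abar[t - 1] is step t.
    abar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        abar.append(running)

    new_betas, abar_prev = [], 1.0
    for t in timesteps:
        new_betas.append(1.0 - abar[t - 1] / abar_prev)
        abar_prev = abar[t - 1]
    return new_betas


if __name__ == "__main__":
    print(strided_timesteps(4000, 25))
```

The respaced betas are larger than the originals because each kept step has to account for the noise of all the skipped steps before it. The two variance bounds recomputed from these betas are what a learned-sigma model interpolates between when sampling on the shorter chain.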

Evidence

The likelihood evidence starts with the 4000-step schedule. On ImageNet 64x64, the Ho et al. setup with 1000 steps reaches 3.99 bits/dim after 200K iterations; increasing to 4000 steps improves it to 3.77. In the final comparison, Improved DDPM reaches 3.53 bits/dim on ImageNet 64x64 and 2.94 on CIFAR-10, competitive with convolutional likelihood models but behind some Transformer image models.

The sampling-speed evidence is the most device-relevant. Full 4000-step sampling takes minutes on a modern GPU; reducing steps makes sampling take seconds. Figure 8 evaluates 25 through 4000 steps. Learned-sigma L_hybrid models maintain quality under reduced-step sampling, and 100 steps is enough for near-optimal FID in fully trained models. DDIM is better below 50 steps, but worse at 50 or more in this comparison.

Quality/diversity evidence comes from class-conditional ImageNet 64x64 versus BigGAN-deep. The 270M-parameter diffusion model trains for 250K iterations, uses 250 sampling steps, and reaches FID 2.92, precision 0.82, and recall 0.71. The 100M-parameter model trained for 1.7M steps reaches FID 6.92, precision 0.77, and recall 0.72. The recall values support the claim that diffusion covers more of the target distribution at similar FID.

The scaling evidence uses theoretical training compute. For ImageNet 64x64, the authors train models with first-layer channel counts 64, 96, 128, and 192, corresponding to about 30M, 68M, 120M, and 270M parameters. FID improves approximately linearly on a log-log plot against compute, while NLL improves less cleanly. The x-axis assumes full hardware utilization, so the result is best read as a compute-scaling analysis rather than a measured wall-clock scaling law.

Historical Effect

Improved DDPM helped shift diffusion from an elegant but slow likelihood model into a practical generative-media compute path. The historical move was to expose inference steps as a controllable budget. Learned variances and strided schedules turned the reverse chain from "always run the training horizon" into "trade steps for quality," which later diffusion systems exploited through samplers, distillation, cascades, and latent models.

The paper also joined diffusion quality to training compute. By showing smooth FID scaling with model size and theoretical FLOPs, it made diffusion look like a scalable accelerator workload rather than a brittle sampling trick, setting up later large U-Nets, latent diffusion, and Transformer diffusion backbones.

Limits

Exact hardware is not listed; GPU era and likely infrastructure are inferred, not sourced.
Even 50-100 denoising steps remain much slower than single-pass generation.
Full 4000-step sampling is described as minutes per sample on a modern GPU.
The paper notes FID can overfit or move differently from likelihood, so compute improvements should not be read from FID alone.
The ImageNet 256x256 extension uses cascaded 64x64 plus upsampler models, adding another compute stage.