Denoising Diffusion Probabilistic Models

Metadata

Reading status: read complete
Year: 2020
Compute regime: Generative media compute (generative_media_compute)
PDF: 2020-ddpm_2020.pdf
Extracted text: 2020-ddpm_2020.txt
PDF URL: https://arxiv.org/pdf/2006.11239.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper explicitly states the accelerator class: all experiments used TPU v3-8, described by the authors as similar to 8 V100 GPUs. Because the hardware is stated outright, the main training runs need no guesswork based on the publication era. The appendix gives model sizes and throughput. The CIFAR-10 model has 35.7M parameters; the LSUN and CelebA-HQ 256x256 models have 114M parameters; and a larger LSUN Bedroom variant has about 256M parameters. The CIFAR model trains at 21 steps per second with batch size 128, reaching 800K steps in about 10.6 hours. Sampling a batch of 256 CIFAR images takes 17 seconds.

For 256x256 images, the compute profile changes sharply. The CelebA-HQ/LSUN models train at 2.2 steps per second with batch size 64, and sampling 128 images takes 300 seconds. The paper reports training CelebA-HQ for 0.5M steps, LSUN Bedroom for 2.4M steps, LSUN Cat for 1.8M steps, LSUN Church for 1.2M steps, and the larger LSUN Bedroom model for 1.15M steps. These numbers make the card's device story concrete: a single TPU v3-8 can train the models, but the sequential denoising sampler is already the dominant inference-time cost.
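The step counts and throughput figures above can be turned into rough wall-clock estimates. The sketch below is only arithmetic on the numbers quoted in this card, and it assumes throughput stays constant for the whole run, which is an idealization. The larger LSUN Bedroom model is left out because the card does not give a separate throughput for it. The function and variable names are illustrative.

```python
# Rough wall-clock estimates from the step counts and throughput quoted above.
# Assumption: throughput is constant over the whole run (an idealization).

def train_hours(steps: float, steps_per_sec: float) -> float:
    """Hours needed to run `steps` optimizer steps at a constant rate."""
    return steps / steps_per_sec / 3600.0

# (training steps, steps per second) as reported in the card.
runs = {
    "CIFAR-10": (800_000, 21.0),
    "CelebA-HQ": (500_000, 2.2),
    "LSUN Bedroom": (2_400_000, 2.2),
    "LSUN Cat": (1_800_000, 2.2),
    "LSUN Church": (1_200_000, 2.2),
}

hours = {name: train_hours(steps, rate) for name, (steps, rate) in runs.items()}

for name, h in hours.items():
    print(f"{name:13s} {h:7.1f} h  ({h / 24:5.1f} days)")
```

Under that assumption the CIFAR-10 estimate reproduces the roughly 10.6 hours stated in the paper, while the 256x256 runs land in the range of days to about two weeks on the same device.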

Bottleneck

DDPM's bottleneck is the long reverse Markov chain. The paper sets T = 1000 for all experiments, so generating one sample requires 1000 sequential neural-network evaluations. The U-Net evaluations are accelerator-friendly convolutional workloads, but the chain is temporally serial: step t depends on the output of step t+1, so ordinary batching helps throughput but does not reduce per-sample latency. This is why the appendix can report 21 CIFAR training steps per second yet still need 17 seconds to sample 256 CIFAR images: a training step is a single forward and backward pass, while every sampled batch needs 1000 forward passes in sequence.
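The serial dependency is easiest to see in a minimal ancestral sampling loop. The sketch below is an illustration, not the authors' code: `dummy_eps_model` is a hypothetical stand-in for the trained U-Net, the linear beta schedule from 1e-4 to 0.02 is assumed here rather than taken from this card, and the reverse variance is fixed to beta_t as one possible choice.

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng):
    """Ancestral sampling: one eps_model call per timestep, from t = T-1 down to 0."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                 # start from pure Gaussian noise
    calls = 0
    for t in range(len(betas) - 1, -1, -1):        # each step needs the previous x
        eps = eps_model(x, t)                      # one network evaluation
        calls += 1
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise       # variance fixed to beta_t (assumed choice)
    return x, calls

def dummy_eps_model(x, t):
    # Hypothetical placeholder for the trained U-Net: predicts zero noise.
    return np.zeros_like(x)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)              # assumed linear schedule, T = 1000
samples, n_calls = ddpm_sample(dummy_eps_model, (4, 32, 32, 3), betas, rng)
print(samples.shape, n_calls)
```

The batch dimension (4 here) can grow without changing the number of loop iterations, which is the sense in which batching raises throughput but leaves latency untouched.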

The second bottleneck is resolution. Moving from 32x32 CIFAR to 256x256 LSUN/CelebA-HQ, together with the larger 114M-parameter network, increases activation memory and convolutional FLOPs enough that batch size drops from 128 to 64 and training throughput drops from 21 to 2.2 steps per second. Sampling also stretches to 300 seconds for 128 images at 256x256. The method therefore trades GAN-style one-pass generation for a compute structure that is easier to train but expensive to sample from.
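The size of the resolution penalty can be read off the quoted figures with a little arithmetic. The sketch below uses only numbers from this card. The per-image values are amortized over a batch and are not single-image latencies, and the comparison mixes resolution with model size, since the 256x256 network is also larger. Variable names are illustrative.

```python
# Amortized sampling and training cost, using only the figures quoted above.
T = 1000                                   # reverse steps per sample

cifar_batch, cifar_secs = 256, 17.0        # 32x32 sampling
large_batch, large_secs = 128, 300.0       # 256x256 sampling

cifar_per_image = cifar_secs / cifar_batch           # seconds per image, amortized
large_per_image = large_secs / large_batch
cifar_per_step_ms = cifar_secs / T * 1000.0          # one batched network call
large_per_step_ms = large_secs / T * 1000.0

sampling_slowdown = large_per_image / cifar_per_image
pixel_ratio = (256 * 256) / (32 * 32)

# Training throughput in images per second (steps/s times batch size).
cifar_train_ips = 21.0 * 128
large_train_ips = 2.2 * 64
training_slowdown = cifar_train_ips / large_train_ips

print(f"per image: {cifar_per_image:.3f} s vs {large_per_image:.3f} s")
print(f"per step:  {cifar_per_step_ms:.0f} ms vs {large_per_step_ms:.0f} ms")
print(f"sampling slowdown {sampling_slowdown:.1f}x, "
      f"training slowdown {training_slowdown:.1f}x, pixel ratio {pixel_ratio:.0f}x")
```

Per image, sampling at 256x256 comes out roughly 35 times slower than at 32x32, against a 64-fold increase in pixel count.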

Method Adaptation

The model adapts to this compute structure by making every reverse step a shared-parameter denoising problem. The reverse process is represented with a U-Net backbone similar to an unmasked PixelCNN++, itself built on a Wide ResNet, with group normalization, residual blocks at multiple resolutions, self-attention at the 16x16 feature-map resolution, and Transformer sinusoidal timestep embeddings injected into the residual blocks. Sharing one network across all timesteps avoids storing 1000 separate denoisers; the timestep embedding lets a single accelerator-resident model condition on the noise level.
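The timestep conditioning can be illustrated with a Transformer-style sinusoidal embedding. This is a minimal sketch under assumptions: the embedding width of 128, the `max_period` of 10000, and the exact frequency spacing follow a common convention and are not taken from this card. The function name is hypothetical.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of integer timesteps.

    t: array of shape (batch,); dim: even embedding width (assumed).
    Returns an array of shape (batch, dim): sines first, then cosines.
    """
    half = dim // 2
    # Geometrically spaced frequencies from 1 down towards 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

emb = timestep_embedding(np.array([0, 1, 999]), dim=128)
print(emb.shape)
```

In a real network this vector would typically pass through a small MLP before being added inside each residual block; that wiring is omitted here.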

The simplified training objective is also a compute adaptation. Instead of optimizing the full variational bound in its original weighted form, the paper trains the network to predict the added noise with a simple squared-error loss. The authors report that this objective is simpler and improves sample quality, and it turns training into standard minibatched supervised denoising over random timesteps. Closed-form sampling of noisy x_t from x_0 makes the forward process cheap during training: the model does not have to simulate all previous noising steps to train at a particular t. The price is paid at inference, where the reverse chain remains sequential.
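The closed-form forward sample and the noise-prediction loss fit in a few lines. The sketch below is illustrative rather than the authors' implementation: `zero_model` is a hypothetical placeholder for the U-Net, the linear beta schedule from 1e-4 to 0.02 is assumed here, and all function names are made up for this example.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear beta schedule and its cumulative products alpha_bar_t."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, eps):
    """Draw x_t directly from x_0: sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bars[t].reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast over image dims
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def simple_loss(eps_model, x0, alpha_bars, rng):
    """Squared error between true and predicted noise at uniformly random timesteps."""
    t = rng.integers(0, len(alpha_bars), size=x0.shape[0])  # one timestep per example
    eps = rng.standard_normal(x0.shape)
    x_t = q_sample(x0, t, alpha_bars, eps)                  # no step-by-step noising
    return np.mean((eps - eps_model(x_t, t)) ** 2)

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x0 = rng.uniform(-1.0, 1.0, size=(8, 32, 32, 3))            # stand-in image batch
zero_model = lambda x, t: np.zeros_like(x)                  # hypothetical untrained model
loss = simple_loss(zero_model, x0, alpha_bars, rng)
print(loss)
```

Each training example costs one network call regardless of t, which is what makes training a standard minibatched regression while sampling still has to walk the whole chain.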

Evidence

The paper's main CIFAR-10 evidence is that the simplified objective reaches Inception Score 9.46 +/- 0.11 and FID 3.17, with NLL at no more than 3.75 bits/dim. The authors note that the FID is computed against the training set, following standard practice, and that the test-set FID is 5.24. The ablation table shows why the objective mattered: epsilon prediction trained with L_simple reaches a far better FID than epsilon prediction trained on the variational bound with fixed isotropic variance.

The larger-image evidence shows both quality and cost. The reported LSUN FIDs are 6.36 for Bedroom, 7.89 for Church, and 19.75 for Cat. These are tied to the throughput numbers above: 2.2 training steps per second at batch 64 and 300 seconds to sample 128 images at 256x256. The paper also says most hyperparameter search was performed on CIFAR-10 because larger models were constrained by memory and compute, which is a direct statement about how device budget shaped the experimental design.

Historical Effect

DDPM re-established diffusion as a high-fidelity image generation route by showing that many small denoising steps could beat or approach the sample quality of contemporary GANs, even with a much slower sampler. Its compute structure became the template for later diffusion work: train a large denoiser with random timestep supervision, then improve the sampling side with learned variances, fewer-step solvers, distillation, latent spaces, or stronger backbones.

The historical shift is important because the original paper did not solve the inference budget problem. It made the quality case so strongly that the next wave of work could treat sampling acceleration as the obvious bottleneck.

Limits

The limits are explicit in the device numbers. A 1000-step chain makes generation much slower than a GAN forward pass, and the 256x256 sampler takes five minutes per batch of 128 images on the stated TPU v3-8. Batching amortizes the cost across images, but it cannot shorten the 1000-step chain that each individual image must pass through.

Likelihoods are also not the main strength. The CIFAR model produces strong samples, but the paper notes that its codelengths are not competitive with specialized likelihood models such as autoregressive density estimators. The experiments also rely on a fixed T = 1000 without an extensive sweep, and the authors report that larger-model hyperparameter search was limited by memory and compute. DDPM's result is therefore a quality breakthrough with a clear open systems problem: make diffusion sampling cheaper without losing sample quality.