High-Resolution Image Synthesis with Latent Diffusion Models

Metadata

Reading status: read complete
Year: 2021
Compute regime: Generative media compute (generative_media_compute)
PDF: 2021-latent_diffusion_2021.pdf
Extracted text: 2021-latent_diffusion_2021.txt
PDF URL: https://arxiv.org/pdf/2112.10752.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper explicitly reports the accelerator setup for most experiments: "All models trained on a single NVIDIA A100" appears in the hyperparameter tables for unconditional and conditional LDMs, with one exception in Table 15 where the inpainting model was trained on eight V100 GPUs. The compute comparison table converts the authors' A100 time into V100-days using an assumed 2.2x A100-over-V100 U-Net speedup. This makes the paper unusually direct about single-accelerator feasibility for high-resolution diffusion work.
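The V100-day conversion described above is plain arithmetic and can be sketched as follows. This is a minimal sketch, assuming only the 2.2x factor the paper states; the function names are hypothetical and do not come from the authors' code.

```python
# Assumed A100-over-V100 U-Net speedup, as stated in the paper's compute comparison.
A100_OVER_V100_SPEEDUP = 2.2


def a100_to_v100_days(a100_days, speedup=A100_OVER_V100_SPEEDUP):
    """Convert measured A100-days into equivalent V100-days."""
    return a100_days * speedup


def v100_to_a100_days(v100_days, speedup=A100_OVER_V100_SPEEDUP):
    """Invert the conversion: recover A100-days from a reported V100-day figure."""
    return v100_days / speedup


if __name__ == "__main__":
    # 79 V100-days is the LDM-8 figure quoted from Table 18 in this card.
    print(f"{v100_to_a100_days(79):.1f} A100-days")
```

Read this way, a reported 79 V100-days corresponds to roughly 36 days of wall-clock time on the single A100 the authors used, under the assumed speedup.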

The setup spans several model sizes and batch regimes. Unconditional 256x256 models use latent downsampling factors such as f=4 or f=8 and train with batches from 42 to 96 depending on dataset and latent size. The ImageNet compression analysis trains LDM-1, -2, -4, -8, -16, and -32 for 2M iterations on a single A100, with model sizes around 391-395M and batch size rising as the latent representation shrinks. Conditional tasks include a 1.45B text-to-image LDM-8 trained on LAION for 390K iterations with batch size 680, and an ImageNet class-conditional LDM-4 trained for 178K iterations with batch size 1200.
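To compare the two conditional runs, the iteration counts and batch sizes above can be multiplied into a rough "samples processed" figure. The sketch below is illustrative only: the helper name is hypothetical, and the product counts training examples seen with repetition, not unique images.

```python
def samples_processed(iterations, batch_size):
    """Training examples consumed over a run (counted with repetition)."""
    return iterations * batch_size


# Iterations and batch sizes as quoted in this card.
RUNS = {
    "text-to-image LDM-8 (LAION)": (390_000, 680),
    "class-conditional LDM-4 (ImageNet)": (178_000, 1200),
}

if __name__ == "__main__":
    for name, (iterations, batch_size) in RUNS.items():
        total = samples_processed(iterations, batch_size)
        print(f"{name}: {total / 1e6:.1f}M samples processed")
```

Despite very different iteration counts, the two runs land within the same order of magnitude of examples processed, because the class-conditional run uses a much larger batch.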

Bottleneck

The bottleneck is repeated U-Net evaluation in pixel space. Diffusion models can produce strong images, but training requires gradients through many noisy pixel-space examples, and sampling requires sequential denoising steps. The paper states that powerful pixel-space diffusion models often take 150-1000 V100-days to train and that producing 50K samples can take about five days on a single A100 in the cited ADM setting.

This is a memory-bandwidth and spatial-resolution problem as much as a FLOP problem. A 256x256 or 512x512 RGB image carries many high-frequency details that are visually imperceptible or semantically unimportant, yet a pixel-space diffusion U-Net still evaluates convolutions and attention over the full grid. A reweighted training objective can downweight the loss terms that correspond to such details, but the backbone and its gradients are still computed over all pixels. Sampling then repeats the same expensive backbone for tens to thousands of steps.
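A rough way to see how resolution and step count compound is to count grid positions per backbone call and multiply by the number of sampling steps. The sketch below is only a proxy under stated assumptions: it ignores channel widths, the U-Net's internal downsampling, and the extra cost of attention, and the function names are hypothetical.

```python
def grid_positions(height, width):
    """Spatial positions the backbone's input grid covers in one call."""
    return height * width


def position_evaluations(height, width, num_steps):
    """Grid positions visited across a full sampling run (positions x steps)."""
    return grid_positions(height, width) * num_steps


if __name__ == "__main__":
    for side in (256, 512):
        for steps in (50, 250, 1000):
            total = position_evaluations(side, side, steps)
            print(f"{side}x{side}, {steps} steps: {total:,} position evaluations")
```

Doubling the image side quadruples the per-call grid, and that factor is paid again at every sampling step.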

The paper's goal is to reduce both training and inference cost without giving up diffusion's stable likelihood-based training and mode coverage. The question is therefore not "can diffusion work" but "can high-resolution diffusion become accessible outside labs that can spend hundreds of GPU-days per model."

Method Adaptation

Latent Diffusion splits image synthesis into two compute stages. First, an autoencoder maps an RGB image x into a latent z with spatial downsampling factor f, using perceptual and adversarial losses so that reconstructions stay on the image manifold. Second, the diffusion model operates on z rather than on pixels. A sampled latent can be decoded back to image space with one decoder pass, so the expensive iterative denoising loop happens on a smaller tensor.
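The control flow of the second stage can be sketched as an iterative loop over the latent followed by a single decoder call. This is a toy illustration, not the authors' implementation: the denoiser and decoder below are hypothetical stand-ins (a scalar shrink and nearest-neighbour upsampling), and the latent shape (3, 32, 32) with f=8 is an assumption chosen only so the output is 256x256.

```python
import random


def sample_latent(denoise_step, decode, latent_shape, num_steps, seed=0):
    """Run the iterative loop on the latent, then decode once."""
    rng = random.Random(seed)
    channels, height, width = latent_shape
    # Start from Gaussian noise in latent space, not in pixel space.
    z = [[[rng.gauss(0.0, 1.0) for _ in range(width)]
          for _ in range(height)]
         for _ in range(channels)]
    for t in reversed(range(num_steps)):
        z = denoise_step(z, t)  # one backbone call per step, on the small tensor
    return decode(z)            # single decoder pass back to image space


# --- Toy stand-ins (hypothetical; NOT the paper's U-Net or autoencoder) ---
calls = {"denoise": 0, "decode": 0}


def toy_denoise_step(z, t):
    """Placeholder for a learned denoiser: just shrinks every value."""
    calls["denoise"] += 1
    return [[[0.9 * v for v in row] for row in plane] for plane in z]


def toy_decode(z, f=8):
    """Nearest-neighbour upsampling by f, standing in for a learned decoder."""
    calls["decode"] += 1
    return [[[v for v in row for _ in range(f)]
             for row in plane for _ in range(f)]
            for plane in z]


image = sample_latent(toy_denoise_step, toy_decode, (3, 32, 32), num_steps=50)
```

The call counter makes the cost structure explicit: the step count multiplies only the latent-space work, while the decoder runs once per sample.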

The key adaptation is mild compression rather than extreme compression. Previous latent autoregressive systems needed aggressive spatial reduction because their priors were expensive; LDM can keep a two-dimensional latent grid and use a convolutional U-Net, preserving spatial inductive bias. The paper studies f values from 1 to 32 and finds that f=4 to f=8 is the useful compute-quality region. Too little compression leaves most pixel cost in place; too much compression loses information and caps fidelity.
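The f sweep can be made concrete by tabulating the latent grid for each downsampling factor on a 256x256 input. This is a minimal sketch: the helper name is hypothetical, and it tracks spatial size only, not latent channel count or reconstruction quality.

```python
def latent_grid(image_side, f):
    """Latent grid side, position count, and spatial reduction for factor f."""
    if image_side % f != 0:
        raise ValueError("image_side must be divisible by f")
    side = image_side // f
    return {
        "f": f,
        "latent_side": side,
        "positions": side * side,
        "spatial_reduction": f * f,  # pixel positions per latent position
    }


# The factors studied in the paper's sweep, applied to a 256x256 image.
SWEEP = [latent_grid(256, f) for f in (1, 2, 4, 8, 16, 32)]

if __name__ == "__main__":
    for row in SWEEP:
        print(row)
```

At f=4 and f=8 the grid shrinks by 16x and 64x in positions while remaining a 64x64 or 32x32 map that a convolutional U-Net can still exploit; at f=32 only an 8x8 grid remains.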

Conditioning is also designed around compute reuse. Cross-attention layers inject text, class labels, layouts, or other conditioning into the U-Net without building a separate architecture for each modality. This makes the same latent-space denoising backbone reusable for text-to-image, layout-to-image, class-conditional ImageNet, super-resolution, semantic synthesis, and inpainting.
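The cross-attention mechanism can be sketched in a few lines: queries come from the flattened latent feature map, while keys and values come from the conditioning tokens. This is a simplified sketch in pure Python, assuming a single head and omitting the learned projection matrices and the conditioning encoder; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d)) V, with Q from the latent and K, V from the condition.

    queries: one vector per latent position
    keys, values: one vector per conditioning token
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


if __name__ == "__main__":
    latent_positions = [[1.0, 0.0], [0.0, 2.0]]   # two latent positions
    token_keys = [[1.0, 0.0], [0.0, 1.0]]         # two conditioning tokens
    token_values = [[10.0, 0.0], [0.0, 10.0]]
    print(cross_attention(latent_positions, token_keys, token_values))
```

Because only the keys and values depend on the conditioning input, swapping the modality changes what produces the tokens, not the denoising backbone itself.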

Evidence

The strongest compute evidence is Table 18. On ImageNet 256x256, ADM at 250 DDIM steps is listed at 916 V100-days, 0.12 samples/sec on a single A100, FID 10.94, and 554M parameters. LDM-8 at 200 DDIM steps is listed at 79 V100-days, 1.9 samples/sec, FID 17.41, and 395M parameters. The guided LDM-8 variant adds classifier compute for 91 total V100-days and reaches FID 8.11. LDM-4-G with classifier-free guidance reaches FID 3.60 with 271 V100-days and 0.4 samples/sec, competitive with or better than the cited ADM-guided rows while using substantially less training compute than ADM-G's 962 V100-day total.
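The ratios implied by these Table 18 figures can be checked with a short calculation. The sketch below uses only the numbers quoted in this card; the dictionary layout and helper names are hypothetical.

```python
SECONDS_PER_DAY = 86_400

# Figures as quoted in this card from Table 18 (ImageNet 256x256).
TABLE_18 = {
    "ADM": {"v100_days": 916, "samples_per_sec": 0.12, "fid": 10.94},
    "LDM-8": {"v100_days": 79, "samples_per_sec": 1.9, "fid": 17.41},
    "LDM-4-G": {"v100_days": 271, "samples_per_sec": 0.4, "fid": 3.60},
}
ADM_G_TOTAL_V100_DAYS = 962


def ratio(numerator, denominator):
    """Plain ratio, kept as a helper so each comparison reads clearly."""
    return numerator / denominator


def days_to_sample(num_samples, samples_per_sec):
    """Wall-clock days needed to draw num_samples at a given throughput."""
    return num_samples / samples_per_sec / SECONDS_PER_DAY


if __name__ == "__main__":
    adm, ldm8, ldm4g = TABLE_18["ADM"], TABLE_18["LDM-8"], TABLE_18["LDM-4-G"]
    print(f"ADM / LDM-8 training compute: {ratio(adm['v100_days'], ldm8['v100_days']):.1f}x")
    print(f"LDM-8 / ADM throughput: {ratio(ldm8['samples_per_sec'], adm['samples_per_sec']):.1f}x")
    print(f"ADM-G / LDM-4-G training compute: {ratio(ADM_G_TOTAL_V100_DAYS, ldm4g['v100_days']):.1f}x")
    print(f"50K ADM samples: {days_to_sample(50_000, adm['samples_per_sec']):.1f} days")
```

At 0.12 samples/sec, 50K samples take about 4.8 days, which agrees with the "about five days" figure cited in the Bottleneck section. The unguided LDM-8 row buys its roughly 11.6x training saving at a worse FID than ADM, so the guided rows are the fairer quality comparison.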

The paper also shows that the choice of f matters. In the ImageNet 2M-step analysis, the small downsampling factors of LDM-1 and LDM-2 lead to slow training progress, while overly large f causes fidelity to stagnate after comparatively few steps. The authors describe a FID gap of 38 between pixel-based LDM-1 and LDM-8 after 2M steps. On the sampling-speed-versus-FID plots, LDM-4 and LDM-8 improve both sample quality and throughput relative to pixel-space LDM-1.

Task evidence is broad. Unconditional LDM-4 reports FID 5.11 on CelebA-HQ and 4.98 on FFHQ; LDM-8 reports 4.02 on LSUN-Churches; LDM-4 reports 2.95 on LSUN-Bedrooms. In text-to-image on MS-COCO 256x256, LDM-KL-8-G reaches FID 12.63 and IS 30.29 with a 1.45B model and 250 DDIM steps, close to GLIDE's FID of 12.24 while using far fewer parameters than GLIDE's 6B. For super-resolution, LDM-4 with 100 steps and additional training reaches FID 2.6/4.6 versus 5.1/7.1 for a pixel-DM baseline given the same additional training; the two values in each pair use FID reference features from different data splits.

Historical Effect

Latent Diffusion changed the compute economics of image generation. Instead of scaling pixel-space denoising until it was too expensive for most groups, it showed that a reusable perceptual compressor could remove much of the spatial burden while leaving the generative model with enough information to synthesize high-quality images. This directly set up the Stable-Diffusion-style pattern: train an autoencoder, run a conditional U-Net in latent space, use cross-attention for text, and sample with DDIM or related samplers.

Historically, the paper is also a bridge between GAN-era high-resolution image synthesis and diffusion-era controllability. GANs were fast at inference but hard to train and condition reliably; pixel diffusion was stable and controllable but slow. LDM preserved diffusion's conditioning flexibility while making training and sampling fit a single-A100 research workflow for many experiments.

Limits

The autoencoder imposes a new quality ceiling. If compression discards information, the diffusion prior cannot recover it reliably; if compression is too mild, the compute savings disappear. The paper's f sweep makes this tradeoff explicit: LDM-32 is too compressed for quality, while LDM-1 and LDM-2 keep too much of the pixel-space burden.

The method also adds components and evaluation complexity. The first-stage autoencoder has its own reconstruction losses, regularization, and possible artifacts. Some results depend on guidance, classifier compute, or task-specific conditioning. Text-to-image quality is competitive but not uniformly best, and global scene coherence can still fail for complicated prompts. Finally, although many experiments fit on a single A100, frontier-scale text-to-image training still accumulates large total compute when data, ablations, and larger resolutions are included.