Score-Based Generative Modeling through Stochastic Differential Equations

Metadata

Reading status: read complete
Year: 2020
Compute regime: Generative media compute (generative_media_compute)
PDF: 2020-score_sde_2020.pdf
Extracted text: 2020-score_sde_2020.txt
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper does not list exact hardware or accelerator counts. Under the project rule, the research-time setup should be treated as inferred 2020-era accelerator infrastructure, plausibly V100/TPU-class multi-GPU or TPU training, but the card should not attribute a specific device to the authors. The source does give unusually concrete scale proxies. CIFAR-10 and 64 x 64 CelebA architecture searches trained models for 1.3M iterations unless otherwise noted, saved one checkpoint every 50k iterations, and computed FID on 50k generated samples. The Appendix says the batch size is fixed to 128 on CIFAR-10 and 64 on LSUN bedroom/church at 256 x 256.

The high-resolution experiment is the strongest compute signal. For 1024 x 1024 CelebA-HQ, the authors trained a modified NCSN++-style model with the continuous objective, batch size 8, EMA rate 0.9999, and about 2.4M training iterations. The model was sampled with a predictor-corrector sampler discretized at 2000 steps, using a reverse-diffusion predictor and one Langevin corrector step per predictor update.

Bottleneck

The bottleneck is not only fitting the score network; it is also spending enough neural-network evaluations at sampling time to turn noise into a high-quality image. Each reverse-time SDE solver step calls the score network, and predictor-corrector (PC) variants add MCMC correction calls on top of the predictor calls. The paper compares samplers by their number of score function evaluations, which is the right compute currency here because each evaluation is one forward pass through the network. A PC1000 sampler runs 1000 predictor and 1000 corrector steps, roughly doubling the score-network evaluations relative to P1000.
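
The evaluation count can be made concrete with a toy predictor-corrector loop. The sketch below is not the authors' code: it assumes 1-D Gaussian data so that the score of every noised marginal is known in closed form, uses a VE-style geometric noise schedule, and wraps the score in a hypothetical CountingScore class to tally calls. The values of sigma_min, sigma_max, and snr are illustrative choices.

```python
import numpy as np


class CountingScore:
    """Closed-form score of N(0, data_var + sigma^2) with a call counter.

    Toy stand-in for a trained score network (an assumption of this sketch).
    """

    def __init__(self, data_std=1.0):
        self.data_var = data_std ** 2
        self.calls = 0

    def __call__(self, x, sigma):
        self.calls += 1
        return -x / (self.data_var + sigma ** 2)


def pc_sample(score, n_steps, n_corrector, n_samples=4096,
              sigma_min=0.01, sigma_max=50.0, snr=0.16, seed=0):
    """Reverse-diffusion predictor plus Langevin corrector for a VE-style schedule."""
    rng = np.random.default_rng(seed)
    # Geometric noise levels, indexed from smallest to largest sigma.
    sigmas = sigma_min * (sigma_max / sigma_min) ** np.linspace(0.0, 1.0, n_steps + 1)
    x = sigma_max * rng.standard_normal(n_samples)  # draw from the wide prior
    for i in range(n_steps, 0, -1):
        s_hi, s_lo = sigmas[i], sigmas[i - 1]
        # Predictor: one reverse-diffusion step (one score evaluation).
        dvar = s_hi ** 2 - s_lo ** 2
        x = x + dvar * score(x, s_hi) + np.sqrt(dvar) * rng.standard_normal(n_samples)
        # Corrector: Langevin steps at the new noise level (one evaluation each).
        for _ in range(n_corrector):
            z = rng.standard_normal(n_samples)
            g = score(x, s_lo)
            eps = 2.0 * (snr * np.linalg.norm(z) / np.linalg.norm(g)) ** 2
            x = x + eps * g + np.sqrt(2.0 * eps) * z
    return x


score_p = CountingScore()
x_p = pc_sample(score_p, n_steps=1000, n_corrector=0)    # "P1000"
score_pc = CountingScore()
x_pc = pc_sample(score_pc, n_steps=1000, n_corrector=1)  # "PC1000"
print(score_p.calls, score_pc.calls)  # 1000 2000
print(x_p.std(), x_pc.std())          # both should land near the data std of 1
```

In this sketch the evaluation count is n_steps * (1 + n_corrector), so adding one corrector step per predictor step doubles the cost of a run.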

The second bottleneck is numerical accuracy versus wall-clock cost. The probability-flow ODE allows exact likelihood computation by treating the model as a neural ODE, but ODE solvers trade error tolerance against the number of function evaluations (NFE). Appendix settings use SciPy solve_ivp with absolute and relative tolerances of 1e-5 for likelihood reporting. Looser tolerances reduce NFE at the cost of accuracy, while tighter tolerances and more sampling steps raise latency.
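
The tolerance-versus-NFE tradeoff can be illustrated on a toy problem where the answer is known. The sketch below is an assumption-laden stand-in, not the paper's pipeline: data are 1-D Gaussian, the score is analytic, the terminal density is taken to be the exact noised marginal, and the divergence term is integrated alongside the state. The constants and the helper names (rate, ode, log_likelihood) are hypothetical.

```python
import numpy as np
from scipy.integrate import solve_ivp

DATA_VAR = 1.0                      # toy data distribution: N(0, 1)
SIGMA_MIN, SIGMA_MAX = 0.01, 50.0   # illustrative VE-style noise range
LOG_RATIO = np.log(SIGMA_MAX / SIGMA_MIN)


def sigma2(t):
    """Noise variance sigma(t)^2 for a geometric schedule on t in [0, 1]."""
    return (SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t) ** 2


def rate(t):
    """Probability-flow drift is rate(t) * x; in 1-D its divergence is rate(t)."""
    s2 = sigma2(t)
    return s2 * LOG_RATIO / (DATA_VAR + s2)


def ode(t, state):
    x, _ = state
    return [rate(t) * x, rate(t)]   # d(x)/dt and d(divergence integral)/dt


def log_likelihood(x0, tol):
    """Integrate data -> noise, then apply the instantaneous change of variables."""
    sol = solve_ivp(ode, (0.0, 1.0), [x0, 0.0], method="RK45", rtol=tol, atol=tol)
    x1, div_integral = sol.y[:, -1]
    var_T = DATA_VAR + sigma2(1.0)
    log_p_T = -0.5 * np.log(2.0 * np.pi * var_T) - x1 ** 2 / (2.0 * var_T)
    return log_p_T + div_integral, sol.nfev


x0 = 0.7
var_0 = DATA_VAR + sigma2(0.0)
exact = -0.5 * np.log(2.0 * np.pi * var_0) - x0 ** 2 / (2.0 * var_0)
tight, nfev_tight = log_likelihood(x0, tol=1e-5)
loose, nfev_loose = log_likelihood(x0, tol=1e-2)
print("exact:", exact)
print("tol=1e-5:", tight, "NFE:", nfev_tight)
print("tol=1e-2:", loose, "NFE:", nfev_loose)
```

With a real score network each function evaluation is a forward pass (plus a divergence estimate), so the NFE gap between the two tolerances translates directly into latency.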

Memory also scales sharply with resolution. The paper does not spell out activation memory, but its reported batch choices show the pressure: batch 128 for CIFAR-10, batch 64 for 256 x 256 LSUN, and batch 8 for 1024 x 1024 CelebA-HQ. That is consistent with convolutional score networks whose per-example activation tensors grow with image area, and with sampling runs that need thousands of network evaluations, which makes large batches impractical.
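
A quick back-of-envelope calculation shows how much of the resolution growth the smaller batches absorb. It uses only the batch sizes quoted above plus the standard 32 x 32 CIFAR-10 resolution, and it treats pixel count as a rough proxy for activation memory, which is an assumption rather than a measurement.

```python
# (image side length, training batch size) as quoted in the card
configs = {
    "CIFAR-10 (32 x 32)": (32, 128),
    "LSUN (256 x 256)": (256, 64),
    "CelebA-HQ (1024 x 1024)": (1024, 8),
}

base_side, _ = configs["CIFAR-10 (32 x 32)"]
pixel_stats = {}
for name, (side, batch) in configs.items():
    area_ratio = (side * side) // (base_side * base_side)  # per-example growth
    pixels_per_batch = side * side * batch                 # per-batch pixel count
    pixel_stats[name] = (area_ratio, pixels_per_batch)
    print(f"{name}: area x{area_ratio}, pixels per batch = {pixels_per_batch:,}")
# CIFAR-10 (32 x 32): area x1, pixels per batch = 131,072
# LSUN (256 x 256): area x64, pixels per batch = 4,194,304
# CelebA-HQ (1024 x 1024): area x1024, pixels per batch = 8,388,608
```

Even after a 16-fold batch reduction, the 1024 x 1024 configuration carries 64 times more pixels per batch than the CIFAR-10 one.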

Method Adaptation

The core adaptation is to expose score-based generation as a solver problem. SMLD and DDPM become special discretizations of continuous-time SDEs. Instead of committing to one fixed ancestral sampler, the paper trains a time-dependent score network and then allows multiple reverse-time numerical methods to be swapped in: ancestral sampling, reverse diffusion, Euler-Maruyama, Langevin correctors, and probability-flow ODE solvers.
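
For reference, the continuous-time objects behind this solver view can be written out as below. The notation follows the paper's standard form: f is the drift, g the diffusion coefficient, w and \bar{w} forward and reverse-time Wiener processes, and p_t the marginal density at time t. The score network stands in for \nabla_x \log p_t(x).

```latex
\begin{align*}
\text{Forward SDE:}\quad
  & \mathrm{d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w} \\
\text{Reverse-time SDE:}\quad
  & \mathrm{d}\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - g(t)^2\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right]\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}} \\
\text{Probability-flow ODE:}\quad
  & \mathrm{d}\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - \tfrac{1}{2}\,g(t)^2\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right]\mathrm{d}t \\
\text{VE SDE (SMLD):}\quad
  & \mathrm{d}\mathbf{x} = \sqrt{\frac{\mathrm{d}\,[\sigma^2(t)]}{\mathrm{d}t}}\;\mathrm{d}\mathbf{w} \\
\text{VP SDE (DDPM):}\quad
  & \mathrm{d}\mathbf{x} = -\tfrac{1}{2}\,\beta(t)\,\mathbf{x}\,\mathrm{d}t + \sqrt{\beta(t)}\;\mathrm{d}\mathbf{w}
\end{align*}
```

The reverse-time SDE and the probability-flow ODE share the same marginals and differ only in the factor of one half on the score term and the absence of noise, which is why one trained score network can serve both stochastic samplers and deterministic likelihood evaluation.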

This design fits accelerator-era constraints in three ways. First, sample quality can be tuned by spending more batched forward passes rather than retraining from scratch; CIFAR-10 generation uses batch size 1024. Second, continuous training with random Fourier feature time embeddings improves compatibility with arbitrary time steps, which helps when solvers need non-integer times or when the authors switch from discrete DDPM-style steps to Euler-Maruyama predictors. Third, the probability-flow ODE reuses the learned score field for likelihoods, converting a generative model into a flow-like density estimator without maximum-likelihood training.
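
The second point, conditioning on arbitrary real-valued times, can be sketched with a Gaussian random Fourier feature embedding. This is a minimal NumPy illustration, not the authors' implementation: the class name, embed_dim, and scale are hypothetical choices, and the frequencies are assumed to be sampled once and kept fixed.

```python
import numpy as np


class GaussianFourierTimeEmbedding:
    """Map scalar times to [sin(2*pi*w*t), cos(2*pi*w*t)] with fixed random w."""

    def __init__(self, embed_dim=256, scale=16.0, seed=0):
        rng = np.random.default_rng(seed)
        # Frequencies are drawn once and never updated (assumed, not trained).
        self.freqs = rng.standard_normal(embed_dim // 2) * scale

    def __call__(self, t):
        t = np.atleast_1d(np.asarray(t, dtype=np.float64))
        angles = 2.0 * np.pi * t[:, None] * self.freqs[None, :]
        return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)


embed = GaussianFourierTimeEmbedding()
# Any real-valued time works, including values between discrete step indices.
emb = embed([0.0, 0.1234, 0.5, 1.0])
print(emb.shape)  # (4, 256)
```

Because the embedding is a smooth function of t rather than a lookup over integer step indices, a solver can query the network at whatever times its step-size control selects.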

The method also adapts to dataset-specific compute. For VP SDEs, the authors only ran CIFAR-10 "to save computation" and compared checkpoints between 0.25M and 0.5M iterations because FID worsened after 0.5M. For improved continuous models, they reduced training to 0.95M iterations, then doubled residual blocks per resolution to buy quality with more per-step compute.

Evidence

The sampler tables directly support the compute tradeoff. On CIFAR-10, PC samplers consistently improve FID over predictor-only variants at the same base discretization. For example, with the probability-flow predictor under the VP SDE, P1000 gives FID 3.59, P2000 gives 3.23, and PC1000 gives 3.06. Under the VE SDE, probability-flow P1000 is poor at 15.41, P2000 improves to 10.54, but PC1000 reaches 3.51. The paper's interpretation is that splitting compute between prediction and correction is often better than just doubling predictor steps, and that corrector-only sampling can require far more computation to match mixed samplers.

Training and evaluation scale is also concrete. Batch sizes are 128 on CIFAR-10 and 64 on LSUN; generation uses batch 1024 on CIFAR-10 and 8 on LSUN. Most models train for 1.3M iterations; improved continuous CIFAR models train for 0.95M iterations; 1024 x 1024 CelebA-HQ uses batch 8 and around 2.4M iterations. FIDs are computed on 50k samples with TensorFlow GAN.
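
Multiplying iterations by batch size gives a rough count of training examples processed per run. The figures below use only the numbers quoted above; the 2.4M value is approximate in the source, so the last row is an estimate.

```python
# (training iterations, batch size) as quoted in the card
runs = {
    "CIFAR-10, default": (1_300_000, 128),
    "CIFAR-10, improved continuous": (950_000, 128),
    "CelebA-HQ 1024 x 1024": (2_400_000, 8),  # "around 2.4M" in the source
}

examples_seen = {name: iters * batch for name, (iters, batch) in runs.items()}
for name, count in examples_seen.items():
    print(f"{name}: {count:,} examples processed")
# CIFAR-10, default: 166,400,000 examples processed
# CIFAR-10, improved continuous: 121,600,000 examples processed
# CelebA-HQ 1024 x 1024: 19,200,000 examples processed
```

The high-resolution run processes far fewer examples than the CIFAR-10 runs, but each example has 1024 times the pixel area, so example counts alone understate its cost.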

The quantitative payoff is large. On CIFAR-10, NCSN++ continuous deep VE reaches FID 2.20 and Inception Score 9.89, compared with DDPM's FID 3.17 and IS 9.46 as listed in the paper's CIFAR-10 results table. The likelihood-oriented DDPM++ continuous deep sub-VP model reaches 2.99 bits/dim on uniformly dequantized CIFAR-10 without maximum-likelihood training.

Historical Effect

Score SDE reframed diffusion and score models as a solver-and-compute design space. Earlier score models and DDPMs could be read as separate algorithms; this paper showed they were points inside a shared continuous-time formulation. Later diffusion work often treated sampler design, NFE reduction, ODE/SDE choice, and guidance as first-class compute levers rather than incidental implementation details.

The paper also made the inference budget visible: train a score field, then decide how many neural function evaluations to spend and how to distribute them. That framing led naturally to later work on fast samplers, distillation, latent diffusion, and scheduler design.

Limits

The main limit for this card is hardware opacity: no exact GPU/TPU type, accelerator count, wall-clock training time, or memory footprint is reported. Device claims therefore remain inferences under the project rule. The method also has a steep inference-cost profile. PC samplers improve sample quality, but they spend extra score evaluations; high-resolution generation uses very small batch sizes; and likelihood computation depends on neural ODE tolerances and potentially high NFE.

The reported improvements are tightly coupled to architecture and sampler choices: continuous time embeddings, deeper residual stacks, predictor choice, signal-to-noise ratio grids, and dataset-specific step counts. Moving to a different architecture, resolution, or hardware budget can change the optimal sampler and quality-cost point.