Compute Regimes
Ten stages of AI compute
Each regime is defined by its hardware environment: the device, memory, interconnect, and data-pipeline constraints that shaped which AI methods could be trained, scaled, or deployed.
Pre-2012 CPU and statistical foundations
CPU-centric training, small datasets, and hand-engineered features establish the measurement and optimization culture that later regimes depend on.
Single-GPU deep learning
Commodity GPUs make high-throughput dense tensor training practical. CNNs, dropout, and batch normalization become dominant.
Multi-GPU dense training
The bottleneck shifts to synchronization, batch size, depth, and memory stability across multiple GPUs.
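As a rough illustration of why synchronization becomes the bottleneck, the sketch below estimates per-GPU gradient traffic for one training step. It assumes gradients are averaged with a bandwidth-optimal ring all-reduce, which the text above does not specify; the function name and the 1 GB gradient size are hypothetical.

```python
def ring_allreduce_bytes_per_gpu(gradient_bytes: float, num_gpus: int) -> float:
    """Bytes each GPU sends in one ring all-reduce of the gradient buffer.

    Assumption: a ring all-reduce (reduce-scatter followed by all-gather),
    in which each GPU sends 2 * (p - 1) / p of the buffer.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return 2.0 * (num_gpus - 1) / num_gpus * gradient_bytes


# Hypothetical example: 1 GB of gradients exchanged per step.
for p in (1, 2, 8):
    sent = ring_allreduce_bytes_per_gpu(1e9, p)
    print(f"{p} GPUs: {sent / 1e9:.2f} GB sent per GPU per step")
```

Under this model, per-GPU traffic approaches twice the gradient size as GPUs are added, so step time depends increasingly on interconnect bandwidth rather than arithmetic throughput.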
TPU and accelerator Transformer era
Accelerators reward large matrix multiplies and sequence batching. Transformers, BERT, and T5 fit this structure.
Hyperscale dense LLM training
Training becomes a datacenter-scale problem: model/data parallelism, optimizer state sharding, compute-optimal scaling.
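Compute-optimal scaling can be sketched as a budget split between model size and training data. The example below is a minimal sketch that assumes two widely quoted rules of thumb that are not stated in the text above: training cost of roughly 6 FLOPs per parameter per token, and roughly 20 training tokens per parameter. Both constants and the function name are illustrative assumptions.

```python
import math

# Assumed rules of thumb (not taken from the text above):
FLOPS_PER_PARAM_TOKEN = 6.0   # training FLOPs C ~ 6 * N * D
TOKENS_PER_PARAM = 20.0       # compute-optimal ratio D ~ 20 * N


def compute_optimal_split(flops_budget: float) -> tuple[float, float]:
    """Split a training FLOP budget into (parameters N, tokens D).

    With C = 6 * N * D and D = r * N, solving for N gives
    N = sqrt(C / (6 * r)).
    """
    if flops_budget <= 0:
        raise ValueError("flops_budget must be positive")
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


# Hypothetical budget of 1.2e23 FLOPs.
n, d = compute_optimal_split(1.2e23)
print(f"parameters: {n:.3e}, tokens: {d:.3e}")
```

Because both quantities grow with the square root of the budget, a 100x increase in compute implies roughly 10x more parameters and 10x more tokens under these assumptions.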
Sparse and memory-efficient scaling
Memory, activation cost, and communication pressure drive the adoption of mixture-of-experts (MoE) models, memory-efficient attention kernels, sharding, and activation recomputation.
Generative media compute
Image and video generation depend on GPU throughput, denoising iteration cost, and latent-space compression.
Inference-time compute and post-training
The frontier shifts to post-training and to how compute is allocated at inference time: RLHF, chain-of-thought, verifiers, retrieval, tools, and agents.
Efficient and edge inference
Deployment constraints dominate: latency, memory footprint, quantization, adapter size, KV-cache pressure.
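KV-cache pressure can be made concrete with a back-of-the-envelope size estimate. The sketch below assumes a standard decoder-only Transformer that caches one key and one value vector per layer, per KV head, per token; the configuration values are hypothetical, and the 8-bit figure ignores any storage for quantization scales.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_value: int) -> int:
    """Approximate KV-cache size in bytes.

    The leading factor of 2 counts keys and values separately.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


GIB = 1024 ** 3

# Hypothetical model configuration, chosen only for illustration.
config = dict(num_layers=32, num_kv_heads=32, head_dim=128,
              seq_len=4096, batch_size=1)

fp16_bytes = kv_cache_bytes(**config, bytes_per_value=2)  # 16-bit cache
int8_bytes = kv_cache_bytes(**config, bytes_per_value=1)  # 8-bit quantized cache

print(f"16-bit cache: {fp16_bytes / GIB:.2f} GiB")
print(f"8-bit cache:  {int8_bytes / GIB:.2f} GiB")
```

The cache grows linearly with sequence length and batch size, which is why quantizing it or reducing the number of KV heads directly raises how many concurrent requests fit on a device.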
Search, simulation, and science compute
Search, simulation, self-play, and scientific structure prediction combine neural networks with structured inference.