Compute Spine

This history uses compute regimes as the main branches. The method story is secondary: algorithms become important when they fit the current device, memory, interconnect, data, and inference constraints.

The numbered order is a reading scaffold, not a claim that one regime cleanly replaces the previous one. Multi-GPU dense training continues through the TPU and LLM eras; sparse/MoE work overlaps dense hyperscale training; inference-time compute co-evolves with pretraining scale. When a paper sits across regimes, the reading card should state the primary device constraint and link the secondary method branches.
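
One way to make the cross-regime rule concrete is a small record type for reading cards. The sketch below is only an illustration: the `ReadingCard` class, its field names, and the short regime labels are assumptions invented here, not a schema this project defines.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical short labels for the ten numbered regimes in this document.
REGIMES = (
    "cpu-statistical",
    "single-gpu",
    "multi-gpu-dense",
    "tpu-transformer",
    "hyperscale-dense-llm",
    "sparse-memory-efficient",
    "generative-media",
    "inference-time-post-training",
    "efficient-edge-inference",
    "search-simulation-science",
)


@dataclass
class ReadingCard:
    """Illustrative card: one primary device constraint plus linked method branches."""

    title: str
    primary_regime: str  # the primary device constraint
    secondary_branches: List[str] = field(default_factory=list)  # free-form links

    def __post_init__(self) -> None:
        # A card must name exactly one known regime as its primary constraint.
        if self.primary_regime not in REGIMES:
            raise ValueError(f"unknown regime label: {self.primary_regime}")


# A placeholder card for a paper that sits across two regimes.
card = ReadingCard(
    title="(placeholder paper title)",
    primary_regime="sparse-memory-efficient",
    secondary_branches=["hyperscale-dense-llm"],
)
```

Keeping the primary regime as a single validated field and the secondary branches as a free list mirrors the rule above: one device constraint decides where the card lives, and everything else is a link.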

1. Pre-2012 CPU And Statistical Foundations

The practical compute setup is CPU-centric training, small to medium datasets, hand-engineered features, and limited neural-network scale. Backpropagation, stochastic gradient descent, MNIST/LeNet, SVMs, and ImageNet create the prerequisite measurement and optimization culture.

This regime deliberately compresses several CPU sub-periods: workstation-scale toy neural networks in the 1980s, statistical learning and kernel methods on servers in the 1990s, deep-belief and autoencoder revival work on CPU-era research machines in the 2000s, and large-scale SGD and dataset construction on CPU clusters around 2009-2011. The compression is kept because the first main branch break for this project is the shift to commodity GPU tensor training.

2. Single-GPU Deep Learning

Commodity GPUs make high-throughput dense tensor training practical. CNNs, ReLU-like activations, dropout, batch normalization, Adam, and image classification benchmarks become the dominant demonstration path.

3. Multi-GPU Dense Training

The bottleneck moves from single-device throughput to synchronization, batch size, depth, memory, and training stability. Residual connections, normalization, mixed precision, and distributed SGD become central adaptations.

4. TPU And Accelerator Transformer Era

Accelerators reward large matrix multiplies, sequence batching, and compiler-friendly dense computation. Transformers, BERT-style pretraining, T5-style text-to-text training, and early ViT-style workloads fit this structure.

5. Hyperscale Dense LLM Training

The training stack becomes a datacenter-scale problem: dense transformer scaling, model/data parallelism, optimizer state sharding, data mixture design, and compute-optimal training laws.
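
The scale of this regime is often summarized with a back-of-envelope estimate. The sketch below assumes the widely used approximation that training a dense transformer costs about 6 FLOPs per parameter per token, and treats roughly 20 training tokens per parameter as a rule of thumb associated with compute-optimal scaling results. Both constants are approximations supplied here for illustration, not figures from this document, and the function names are hypothetical.

```python
import math


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate dense training cost: C ~ 6 * N * D (assumed rule of thumb)."""
    return 6.0 * n_params * n_tokens


def compute_optimal_split(flops_budget: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (parameters, tokens) at a fixed token/param ratio.

    With C = 6 * N * D and D = r * N, it follows that N = sqrt(C / (6 * r)).
    The default ratio r = 20 is an assumption, not a fitted value.
    """
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Example: a 1e9-parameter model trained on 2e10 tokens.
budget = training_flops(1e9, 2e10)
n_params, n_tokens = compute_optimal_split(budget)
```

The point of the arithmetic is that parameters and tokens trade off under a fixed budget, which is why data mixture design and parallelism strategy appear in the same regime as the scaling laws.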

6. Sparse And Memory-Efficient Scaling

Memory, activation cost, and communication pressure become first-order bottlenecks. Mixture-of-experts, fused and memory-efficient attention kernels, long-context approximations, sharding, and recomputation are compute-structure responses.
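
To show why mixture-of-experts counts as a compute-structure response, here is a minimal sketch of top-k routing for a single token. It assumes one common formulation (select the k highest router logits, then softmax over the selected logits); real systems differ in details such as load balancing and capacity limits, and the function name is hypothetical.

```python
import math
from typing import List, Tuple


def top_k_route(router_logits: List[float], k: int = 2) -> List[Tuple[int, float]]:
    """Return (expert_index, gate_weight) pairs for the k highest-scoring experts."""
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (shifted by the max for stability).
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Four experts, but only two of them run for this token.
routes = top_k_route([1.0, 3.0, 2.0, -1.0], k=2)
```

Only the selected experts execute for a given token, so per-token compute grows with k while total parameter count grows with the number of experts. That decoupling is what shifts the pressure onto memory and communication.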

7. Generative Media Compute

Image and video generation depend on GPU/accelerator throughput, denoising iteration cost, latent-space compression, and high-bandwidth training data. GANs, VAEs, diffusion, latent diffusion, and diffusion transformers are read through those constraints.

8. Inference-Time Compute And Post-Training

As base models become expensive and general, the frontier shifts to inference allocation and behavior shaping: RLHF, instruction tuning, chain-of-thought, self-consistency, verifiers, retrieval, tools, and agents.
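
Self-consistency is the simplest of these to sketch as inference allocation: spend more samples on one question, then take the most common final answer. The example below is a minimal illustration under stated assumptions; `sample_answer` is a hypothetical stand-in for a stochastic model call, and the canned answers are made-up placeholders rather than model output.

```python
from collections import Counter
from typing import Callable


def self_consistency(sample_answer: Callable[[], str], n_samples: int = 5) -> str:
    """Draw n_samples answers and return the most frequent one (majority vote)."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [sample_answer() for _ in range(n_samples)]
    # most_common(1) returns [(answer, count)] for the top answer.
    return Counter(answers).most_common(1)[0][0]


# Stand-in sampler that replays fixed placeholder answers instead of calling a model.
_canned = iter(["42", "41", "42", "42", "17"])
result = self_consistency(lambda: next(_canned), n_samples=5)
```

The only tunable here is `n_samples`, which makes the compute trade explicit: each extra sample costs one more forward pass at inference time and no additional training.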

9. Efficient And Edge Inference

Deployment constraints dominate: latency, memory footprint, quantization error, adapter size, KV-cache pressure, and speculative execution. Distillation, quantization, LoRA-style adapters, and on-device architectures are the main adaptations.
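
Two of these constraints reduce to simple arithmetic. The sketch below assumes the standard accounting for a decoder-only transformer (one key and one value vector per layer, per KV head, per token) and the standard LoRA parameter count for a rank-r update to a d_in by d_out weight matrix. The configuration numbers are hypothetical round values, not a specific model.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """KV-cache size: keys and values (factor 2) for every layer, head, and token."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in a rank-r adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical configuration: 32 layers, 32 KV heads of width 128,
# a 4096-token context, and 16-bit cache values.
cache = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
cache_gib = cache / 2**30

# Rank-8 adapter on one 4096 x 4096 projection versus the full matrix.
adapter = lora_params(d_in=4096, d_out=4096, rank=8)
full = 4096 * 4096
```

Under these assumed numbers the cache for a single sequence is 2 GiB and the adapter is 256 times smaller than the matrix it modifies, which is the kind of ratio that makes KV-cache pressure and adapter size first-order deployment concerns.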

10. Search, Simulation, And Science Compute

The compute loop is not just supervised training. Search, simulation, self-play, and scientific structure prediction combine neural networks with structured inference or domain simulators.

Frontier Provisional Layer

The first completed corpus was intentionally biased toward historically stable signals: test-of-time-style recognition, high citation counts, and papers whose downstream role is already visible. That selection naturally ended around 2024 because papers from 2025 and 2026 are too recent to have mature citation and award evidence.

The frontier addendum keeps the same compute-structure rule but labels recent papers as provisional. Hardware/device setup remains the branch axis, while the method itself must still have technical value: each card should say what the method contributes, then explain why that contribution fits the available accelerators, memory systems, interconnects, rollout/evaluator loops, and inference-time budgets.

Reading Constraint

These sections are working scaffolds. A claim becomes supported only after it links to a reading card, source report, or ledger entry.
