Compute Bottlenecks Ledger

This ledger tracks cross-regime bottlenecks. Entries below are draft syntheses backed by reading cards or source reports; use the linked cards for evidence details.

Bottleneck	Regime	Evidence	Method adaptation	Status
CPU-era optimization and feature scale	Pre-2012 CPU and statistical foundations	Support-vector networks, large-scale SGD	SVMs, backpropagation, and SGD fit smaller CPU-era datasets and feature pipelines before accelerator-scale dense training	card-backed draft
Dense convolution throughput and GPU memory	Single-GPU deep learning	AlexNet, VGG, GoogLeNet	CNNs, ReLU-style training, dropout, and compact convolution modules exploit commodity GPU dense arithmetic	card-backed draft
Training stability under depth and batch-size constraints	Multi-GPU dense training	ResNet, batch normalization, group normalization	Residual connections and normalization methods make deeper or distributed dense networks trainable	card-backed draft
Distributed communication and model fit	Multi-GPU dense training / hyperscale dense LLM training	ImageNet in 1 hour, Megatron-LM, ZeRO	Large-batch SGD, tensor parallelism, pipeline/model parallelism, and optimizer-state partitioning trade communication for feasible scale	card-backed draft
Accelerator-friendly dense matrix multiplication	TPU and accelerator Transformer era	Attention Is All You Need, BERT, T5, TPU datacenter analysis	Transformers and TPU-style workloads emphasize batched dense matmul and compiler-friendly layouts	card-backed draft
Compute/data/model allocation	Hyperscale dense LLM training	GPT-3, Scaling Laws, Chinchilla, PaLM	Scaling laws and compute-optimal training decide whether to spend budget on parameters, tokens, or longer training	card-backed draft
Conditional compute and sparse activation	Sparse and memory-efficient scaling	MoE, GShard, Switch Transformer	Mixture-of-experts increases total parameters while activating a sparse subset per token	card-backed draft
Attention IO and memory hierarchy	Sparse and memory-efficient scaling	FlashAttention, FlashAttention-2	IO-aware exact attention tiles the computation across the SRAM/HBM hierarchy and reduces HBM traffic by not materializing the full attention matrix	card-backed draft
Sampling cost and generative-model throughput	Generative media compute	DDPM, latent diffusion, DiT, StyleGAN	Diffusion, GAN, VAE, and autoregressive image models trade accelerator training throughput, latent compression, and sampling cost differently	card-backed draft
Inference-time allocation and behavior shaping	Inference-time compute and post-training	RAG, InstructGPT, chain-of-thought, ReAct	Retrieval, preference optimization, reasoning samples, and tool calls spend runtime or post-training compute after base-model pretraining	card-backed draft
Deployment memory and latency	Efficient and edge inference	distillation, MobileNet, LoRA, GPTQ, speculative decoding	Compression, efficient architectures, adapters, quantization, and draft-model decoding reduce serving cost or adaptation memory	card-backed draft
Hardware parameter anchors	Cross-regime	mainstream accelerator era map, paper compute device extraction	Source reports define the device, memory, interconnect, and pod/GPU scale used to interpret reading cards	sourced draft
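
Worked sketch for the "Compute/data/model allocation" row: the parameters-versus-tokens decision can be illustrated with a few lines of arithmetic. The sketch below assumes two commonly cited rules of thumb that are not stated in this ledger: training compute of roughly 6 FLOPs per parameter per token, and a compute-optimal ratio of roughly 20 tokens per parameter. The function name and its defaults are hypothetical and should be checked against the linked reading cards before being relied on.

```python
import math


def compute_optimal_split(flops_budget, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a training FLOP budget into a parameter count N and a token count D.

    Assumes C ~= k * N * D with D = r * N, which gives N = sqrt(C / (k * r)).
    Both k (flops_per_param_token) and r (tokens_per_param) are rule-of-thumb
    assumptions, not values taken from the ledger.
    """
    if flops_budget <= 0:
        raise ValueError("flops_budget must be positive")
    n_params = math.sqrt(flops_budget / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Illustrative budget only; not a figure from any reading card.
    n, d = compute_optimal_split(1.2e23)
    print(f"parameters ~ {n:.3e}, tokens ~ {d:.3e}")
```

Changing tokens_per_param shifts the same budget toward more data or more parameters, which is the trade-off that row describes.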

Cross-regime comparisons
