Evidence and contrasts across compute regimes

Cross-regime comparisons

Compute Bottlenecks Ledger

This ledger tracks cross-regime bottlenecks. Entries below are draft syntheses backed by reading cards or source reports; use the linked cards for evidence details.

| Bottleneck | Regime | Evidence | Method adaptation | Status |
| --- | --- | --- | --- | --- |
| CPU-era optimization and feature scale | Pre-2012 CPU and statistical foundations | Support-vector networks, large-scale SGD | SVMs, backpropagation, and SGD fit smaller CPU-era datasets and feature pipelines before accelerator-scale dense training | card-backed draft |
| Dense convolution throughput and GPU memory | Single-GPU deep learning | AlexNet, VGG, GoogLeNet | CNNs, ReLU-style training, dropout, and compact convolution modules exploit commodity GPU dense arithmetic | card-backed draft |
| Training stability under depth and batch-size constraints | Multi-GPU dense training | ResNet, batch normalization, group normalization | Residual connections and normalization methods make deeper or distributed dense networks trainable | card-backed draft |
| Distributed communication and model fit | Multi-GPU dense training / hyperscale dense LLM training | ImageNet in 1 hour, Megatron-LM, ZeRO | Large-batch SGD, tensor parallelism, pipeline/model parallelism, and optimizer-state partitioning trade communication for feasible scale | card-backed draft |
| Accelerator-friendly dense matrix multiplication | TPU and accelerator Transformer era | Attention Is All You Need, BERT, T5, TPU datacenter analysis | Transformers and TPU-style workloads emphasize batched dense matmul and compiler-friendly layouts | card-backed draft |
| Compute/data/model allocation | Hyperscale dense LLM training | GPT-3, Scaling Laws, Chinchilla, PaLM | Scaling laws and compute-optimal training decide whether to spend budget on parameters, tokens, or longer training | card-backed draft |
| Conditional compute and sparse activation | Sparse and memory-efficient scaling | MoE, GShard, Switch Transformer | Mixture-of-experts increases total parameters while activating a sparse subset per token | card-backed draft |
| Attention IO and memory hierarchy | Sparse and memory-efficient scaling | FlashAttention, FlashAttention-2 | IO-aware exact attention tiles its work through the SRAM/HBM hierarchy and reduces materialized attention traffic | card-backed draft |
| Sampling cost and generative-model throughput | Generative media compute | DDPM, latent diffusion, DiT, StyleGAN | Diffusion, GAN, VAE, and autoregressive image models trade accelerator training throughput, latent compression, and sampling cost differently | card-backed draft |
| Inference-time allocation and behavior shaping | Inference-time compute and post-training | RAG, InstructGPT, chain-of-thought, ReAct | Retrieval, preference optimization, reasoning samples, and tool calls spend runtime or post-training compute after base-model pretraining | card-backed draft |
| Deployment memory and latency | Efficient and edge inference | distillation, MobileNet, LoRA, GPTQ, speculative decoding | Compression, efficient architectures, adapters, quantization, and draft-model decoding reduce serving cost or adaptation memory | card-backed draft |
| Hardware parameter anchors | Cross-regime | mainstream accelerator era map, paper compute device extraction | Source reports define the device, memory, interconnect, and pod/GPU scale used to interpret reading cards | sourced draft |
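
The "Distributed communication and model fit" entry can be made concrete with a rough memory estimate. The sketch below is illustrative only and is not taken from the reading cards. It assumes the mixed-precision Adam accounting commonly associated with ZeRO: 2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter. It ignores activations, buffers, and fragmentation. The function name `model_state_bytes_per_gpu` and the example sizes are hypothetical.

```python
# Assumed per-parameter costs for mixed-precision Adam (an assumption of this
# sketch): fp16 weights, fp16 gradients, and fp32 optimizer state
# (master weights, momentum, variance).
PARAM_BYTES = 2
GRAD_BYTES = 2
OPTIM_BYTES = 12


def model_state_bytes_per_gpu(num_params, num_gpus, stage=0):
    """Estimate per-GPU model-state bytes under data parallelism.

    stage 0: everything replicated on every GPU
    stage 1: optimizer states partitioned across GPUs
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and parameters partitioned
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    params = PARAM_BYTES * num_params
    grads = GRAD_BYTES * num_params
    optim = OPTIM_BYTES * num_params

    # Each stage shards one more class of state across the GPUs.
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return params + grads + optim


if __name__ == "__main__":
    n, gpus = 7.5e9, 64  # illustrative model size and GPU count
    for stage in range(4):
        gb = model_state_bytes_per_gpu(n, gpus, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB of model state per GPU")
```

Under these assumptions, fully replicated state for a 7.5B-parameter model needs about 120 GB per GPU, which exceeds a single accelerator's memory, while full partitioning across 64 GPUs brings it under 2 GB. The saving is paid for with extra collective communication, which is the trade the ledger entry describes.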