Hardware dataset

AI Accelerator Timeline

A hardware timeline for AI methods, from the GTX 580 used for AlexNet to Blackwell Ultra and Ironwood. The 17 entries track device class, memory, interconnect, and access model across mainstream AI accelerators.

Source: history/sources/data/mainstream_accelerator_eras.csv

Citations checked against NVIDIA and Google primary sources.

2012

NVIDIA GTX 580 3GB

2-GPU paper setup

NVIDIA
Memory 3 GB GDDR5 per GPU
Scale-up network PCIe/model split
Largest reported scale 2 GPUs in AlexNet

paper anchor . AlexNet reports two GTX 580 3GB GPUs and identifies GPU memory as the limiting factor. Full board specs need a separate archived vendor/board source.

2014

NVIDIA Tesla K40

single data-center GPU

NVIDIA
Peak Compute 5 FP32 TFLOPS; 1.7 FP64 TFLOPS
Memory up to 12 GB GDDR5
Scale-up network PCIe
Power 235 W
Largest reported scale single GPU or small PCIe servers

widely adopted pre-HBM research GPU . Representative of 2014-2015 CUDA research servers, before HBM and tensor cores. Figures follow the GPU Boost peak convention used in NVIDIA's Volta comparison table; base-clock board datasheets can report lower p…

2015

NVIDIA Tesla K80

dual-GPU data-center board / research cluster GPU

NVIDIA
Memory 24 GB GDDR5 per board (12 GB per GPU), based on repo reading-card statements and the commonly cited K80 board configuration; exact vendor spec not yet archived locally
Scale-up network PCIe multi-GPU servers
Largest reported scale 96 K80 GPUs in GNMT; 60 K80 GPUs in Xception; K80 GPU baseline in TPU v1 datacenter paper

paper-verified mid-2010s research GPU . Added because local reading cards repeatedly identify K80 as a 2015-2016 training and inference comparison device. Numeric board parameters should not be used for quantitative plots until an archived …

2016

NVIDIA Tesla P100

research cluster GPU

NVIDIA
Peak Compute 9.3 FP32 TFLOPS (PCIe); 10.6 FP32 TFLOPS (SXM2); 4.7 FP64 TFLOPS (PCIe); 5.3 FP64 TFLOPS (SXM2); 18.7 FP16 TFLOPS (PCIe)
Memory 12-16 GB HBM2
Memory bandwidth 549-732 GB/s
Scale-up network PCIe Gen3 or first-generation NVLink
Network bandwidth up to 160 GB/s NVLink generation reference
Power 250-300 W
Largest reported scale 8-GPU servers and 256-GPU ImageNet training reports

widely adopted Pascal/HBM GPU . The PCIe datasheet gives lower P100 figures; NVIDIA's Volta comparison table gives the SXM2-class peak figures.

2017

Google Cloud TPU v2

Cloud TPU training accelerator / TPUv2 mesh

TPU
Peak Compute 11.5 PFLOP/s maximum reported for 512 TPUv2 cores in Mesh TensorFlow local card; chip-level vendor peak not recorded in selected source
Memory 8 GB per accelerator in GPipe local card; Google docs list v2 TPU configurations up to v2-512
Scale-up network TPUv2 mesh / high-speed links
Largest reported scale v2-512 configuration in Google Cloud docs; 512-core TPUv2 mesh in Mesh TensorFlow card

paper-verified Google Cloud TPU training anchor . Added because GPipe and Mesh TensorFlow both use TPUv2; keep exact chip-level peak conservative because current Google docs point to an ACM architecture paper for detailed specs.

NVIDIA Tesla V100 SXM2/SXM3

training GPU

NVIDIA
Peak Compute 15.7 FP32 TFLOPS; 7.8 FP64 TFLOPS; 125 Tensor TFLOPS
Memory 16-32 GB HBM2
Memory bandwidth 900 GB/s
Scale-up network NVLink 2
Network bandwidth 300 GB/s per GPU
Power 300 W
Largest reported scale 512 GPUs in Megatron-LM report

widely adopted Volta tensor-core GPU . Primary 2017 anchor for Volta tensor-core multi-GPU training; later 2019-2020 V100 clusters support hyperscale dense LLM cards such as Megatron-LM and GPT-3.

2018

Google Cloud TPU v3

TPU pod chip

TPU
Peak Compute 123 BF16 TFLOPS per chip
Memory 32 GiB HBM2
Memory bandwidth 900 GB/s
Scale-up network 2D torus
Network bandwidth pod all-reduce 340 TB/s; pod bisection 6.4 TB/s
Power 123/220/262 W min/mean/max
Largest reported scale 1024-chip pod

Google TPU transformer-era anchor . Good anchor for BERT-era and early large transformer TPU training.

2020

NVIDIA A100 40GB/80GB

training GPU

NVIDIA
Peak Compute 19.5 FP32 TFLOPS; 9.7 FP64 TFLOPS; 312 BF16/FP16 Tensor TFLOPS; 624 sparse BF16/FP16 Tensor TFLOPS
Memory 40 GB HBM2 or 80 GB HBM2e
Memory bandwidth 1.6 TB/s or over 2 TB/s
Scale-up network NVLink 3
Network bandwidth 600 GB/s per GPU
Power 400 W
Largest reported scale 384 GPUs in BLOOM report; larger MLPerf/HGX clusters

widely adopted Ampere LLM GPU . BF16 and 80 GB HBM2e made this the main open-report LLM training GPU for 2021-2023.

Google Cloud TPU v4

TPU pod chip

TPU
Peak Compute 275 BF16 or INT8 TFLOPS per chip
Memory 32 GiB HBM2
Memory bandwidth 1200 GB/s
Scale-up network 3D mesh or twisted torus
Network bandwidth pod all-reduce 1.1 PB/s; pod bisection 24 TB/s
Power 90/170/192 W min/mean/max
Largest reported scale 4096-chip pod

Google TPU pod training anchor . TPU v4 was deployed in 2020 and is a strong source for large-scale TPU interconnect/topology constraints.
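The TPU v3 and TPU v4 entries both report peak BF16 compute and min/mean/max power, so a rough peak-per-watt comparison is possible. The sketch below divides peak BF16 TFLOPS by mean power; the dictionary keys and function name are illustrative, and the result is an upper bound on efficiency rather than a measured figure, since peak compute and mean power are not taken under the same workload.

```python
# Peak BF16 TFLOPS and mean power (W), copied from the TPU v3 and TPU v4 entries.
chips = {
    "TPU v3": {"peak_bf16_tflops": 123.0, "mean_power_w": 220.0},
    "TPU v4": {"peak_bf16_tflops": 275.0, "mean_power_w": 170.0},
}


def tflops_per_watt(peak_tflops: float, power_w: float) -> float:
    """Peak throughput divided by power: an optimistic bound, not a measurement."""
    return peak_tflops / power_w


efficiency = {
    name: tflops_per_watt(c["peak_bf16_tflops"], c["mean_power_w"])
    for name, c in chips.items()
}

for name, value in efficiency.items():
    print(f"{name}: {value:.2f} peak BF16 TFLOPS per mean watt")

# Generation-over-generation change in the peak-per-mean-watt ratio.
improvement = efficiency["TPU v4"] / efficiency["TPU v3"]
print(f"TPU v4 / TPU v3: {improvement:.1f}x")
```

On these figures the ratio rises from about 0.56 to about 1.62 TFLOPS per watt, roughly a 2.9x change between the two generations.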

2022

NVIDIA H100 SXM

Hopper training/inference GPU

NVIDIA
Peak Compute 67 FP32 TFLOPS; 34 FP64 TFLOPS; 1979 sparse BF16/FP16 Tensor TFLOPS; 3958 sparse FP8 Tensor TFLOPS
Memory 80 GB HBM3
Memory bandwidth 3.35 TB/s
Scale-up network NVLink 4
Network bandwidth 900 GB/s per GPU
Power up to 700 W
Largest reported scale HGX/DGX H100 4-8 GPU nodes and larger clusters

widely adopted Hopper frontier GPU . Transformer Engine and FP8 shift the bottleneck toward memory capacity, KV cache, and scale-up fabric.

2023

Google Cloud TPU v5e

TPU cloud chip

TPU
Peak Compute 197 BF16 TFLOPS; 393 INT8 TOPS per chip
Memory 16 GB HBM
Memory bandwidth 800 GiB/s
Scale-up network 2D torus; 4 ICI ports
Network bandwidth 400 GB/s bidirectional ICI per chip
Largest reported scale 256-chip pod

Google Cloud cost-efficient TPU anchor . Useful anchor for economical training/inference, not the frontier-memory configuration.
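Several TPU entries in this timeline give memory bandwidth in GiB/s, while the NVIDIA entries and the network bandwidth lines use GB/s or TB/s. Before plotting these on one axis, the binary-prefix values need converting. A minimal sketch, assuming GiB means 2^30 bytes and GB means 10^9 bytes (the function name is illustrative):

```python
GIB = 2**30  # bytes in one gibibyte (binary prefix)
GB = 10**9   # bytes in one gigabyte (decimal prefix)


def gib_per_s_to_gb_per_s(value_gib_s: float) -> float:
    """Convert a bandwidth from GiB/s to GB/s."""
    return value_gib_s * GIB / GB


# Memory bandwidth values as listed in the TPU entries of this timeline.
listed_gib_s = {
    "TPU v5e": 800,
    "TPU v5p": 2575,
    "TPU v6e": 1638,
    "TPU7x": 7380,
}

converted = {name: gib_per_s_to_gb_per_s(v) for name, v in listed_gib_s.items()}

for name, gb_s in converted.items():
    print(f"{name}: {listed_gib_s[name]} GiB/s = {gb_s:.0f} GB/s")
```

The difference is about 7.4%, which is large enough to matter when comparing against a GPU figure quoted in decimal units.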

Google Cloud TPU v5p

TPU pod chip

TPU
Peak Compute 459 BF16 TFLOPS; 459 FP8 TFLOPS per chip
Memory 95 GiB HBM
Memory bandwidth 2575 GiB/s
Scale-up network 3D torus
Network bandwidth 1200 GB/s bidirectional ICI per chip
Largest reported scale 8960-chip pod; largest single slice 6144 chips

Google frontier TPU pod anchor . High-memory TPU generation for large dense and MoE training. Google comparison docs list FP8 at the same peak as BF16; do not infer an Ironwood-style 2x FP8/BF16 ratio.
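The caution about FP8 ratios can be checked directly against the peak figures listed in this timeline. The short sketch below computes FP8/BF16 per chip for the three TPU entries that report both; the variable names are illustrative.

```python
# (BF16 TFLOPS, FP8 TFLOPS) per chip, copied from the TPU entries in this timeline.
tpu_peaks = {
    "TPU v5p": (459, 459),
    "TPU v6e": (918, 918),
    "TPU7x": (2307, 4614),
}

# Ratio of FP8 peak to BF16 peak for each generation.
fp8_to_bf16 = {name: fp8 / bf16 for name, (bf16, fp8) in tpu_peaks.items()}

for name, ratio in fp8_to_bf16.items():
    print(f"{name}: FP8/BF16 = {ratio:.1f}")
```

Only TPU7x shows a 2x ratio; v5p and v6e list FP8 at parity with BF16, so a blanket "FP8 doubles throughput" rule would overstate the older generations.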

2024

NVIDIA H200 SXM

Hopper memory-expanded GPU

NVIDIA
Peak Compute 67 FP32 TFLOPS; 34 FP64 TFLOPS; 1979 sparse BF16/FP16 Tensor TFLOPS; 3958 sparse FP8 Tensor TFLOPS
Memory 141 GB HBM3e
Memory bandwidth 4.8 TB/s
Scale-up network NVLink 4
Network bandwidth 900 GB/s per GPU
Power up to 700 W
Largest reported scale HGX H200 4-8 GPU nodes

widely adopted memory-heavy inference GPU . Same broad compute class as H100, but HBM capacity/bandwidth changes long-context and inference economics.
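One way to see why extra bandwidth matters at equal compute is the roofline ridge point: peak FLOP/s divided by memory bandwidth, the arithmetic intensity below which a kernel is bandwidth-bound. The sketch below uses the H100 SXM and H200 SXM figures from this timeline. It assumes the dense BF16 peak is half the listed sparse figure (NVIDIA's 2:4 structured-sparsity convention); the entries themselves list only sparse values, so treat the dense number as derived, not sourced.

```python
# Sparse BF16 Tensor TFLOPS and memory bandwidth (TB/s) from the H100/H200 entries.
gpus = {
    "H100 SXM": {"sparse_bf16_tflops": 1979.0, "mem_bw_tb_s": 3.35},
    "H200 SXM": {"sparse_bf16_tflops": 1979.0, "mem_bw_tb_s": 4.8},
}


def ridge_point(sparse_tflops: float, mem_bw_tb_s: float) -> float:
    """Roofline ridge point in FLOP per byte.

    ASSUMPTION: dense peak = sparse peak / 2 (2:4 structured sparsity).
    TFLOP/s divided by TB/s cancels the 1e12 factors, leaving FLOP/byte.
    """
    dense_tflops = sparse_tflops / 2.0
    return dense_tflops / mem_bw_tb_s


ridge = {
    name: ridge_point(g["sparse_bf16_tflops"], g["mem_bw_tb_s"])
    for name, g in gpus.items()
}

for name, value in ridge.items():
    print(f"{name}: ridge point ~ {value:.0f} FLOP/byte (dense BF16, derived)")
```

Under that assumption the ridge point drops from roughly 295 to roughly 206 FLOP/byte, so low-intensity work such as decode-time attention moves closer to the compute roof on H200 without any change in peak compute.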

Google Cloud TPU v6e Trillium

TPU cloud chip

TPU
Peak Compute 918 BF16 TFLOPS; 918 FP8 TFLOPS; 1836 INT8 TOPS per chip
Memory 32 GB HBM
Memory bandwidth 1638 GiB/s
Scale-up network 2D torus; 4 ICI ports
Network bandwidth 800 GB/s bidirectional ICI per chip
Largest reported scale 256-chip pod

Google TPU training/fine-tuning/serving anchor . Google positions v6e for transformer, text-to-image, CNN training, fine-tuning, and serving. The v6e page lists BF16 and INT8; the TPU7x comparison table also lists v6e FP8 at 918 TFLOPS.

2025

NVIDIA HGX B200

8-GPU Blackwell system

NVIDIA
Peak Compute 600 FP32 TFLOPS system; 72 dense FP4 PFLOPS or 144 sparse FP4 PFLOPS system; 36 sparse FP16/BF16 PFLOPS system
Memory 1.4 TB total HBM
Scale-up network NVLink 5 plus NVLink Switch
Network bandwidth 1.8 TB/s GPU-to-GPU; 14.4 TB/s total NVLink bandwidth
Largest reported scale 8 Blackwell SXM GPUs per HGX baseboard

frontier candidate Blackwell system . Treat as system-level because NVIDIA's current public HGX table reports B200 in 8-GPU units.
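Because the HGX B200 figures are system-level, comparing them with per-chip entries elsewhere in the timeline requires dividing by the 8 GPUs on the baseboard. The sketch below does that division and checks one internal consistency point: 8 GPUs at 1.8 TB/s each should equal the listed 14.4 TB/s total. The per-GPU values are derived from rounded system totals, not vendor per-GPU specifications, and the names are illustrative.

```python
GPUS_PER_BASEBOARD = 8  # from the "8 Blackwell SXM GPUs per HGX baseboard" line

# System-level figures copied from the HGX B200 entry.
system = {
    "dense_fp4_pflops": 72.0,
    "fp32_tflops": 600.0,
    "total_hbm_tb": 1.4,
    "total_nvlink_tb_s": 14.4,
}

# Even split across GPUs; rounded system totals make these approximate.
per_gpu = {key: value / GPUS_PER_BASEBOARD for key, value in system.items()}

for key, value in per_gpu.items():
    print(f"{key} per GPU (derived): {value:.3g}")

# Consistency check against the listed 1.8 TB/s GPU-to-GPU figure.
listed_gpu_to_gpu_tb_s = 1.8
nvlink_consistent = abs(per_gpu["total_nvlink_tb_s"] - listed_gpu_to_gpu_tb_s) < 1e-9
print("NVLink total matches 8 x per-GPU:", nvlink_consistent)
```

The derived memory figure (about 0.175 TB per GPU) comes from a total rounded to one decimal place, which is one reason to keep the entry at system level rather than publish derived per-GPU numbers.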

Google Cloud TPU7x Ironwood

TPU pod chip

TPU
Peak Compute 2307 BF16 TFLOPS; 4614 FP8 TFLOPS per chip
Memory 192 GiB HBM
Memory bandwidth 7380 GiB/s
Scale-up network 3D torus
Network bandwidth 1200 GB/s bidirectional ICI per chip; 200 GB/s per axis
Largest reported scale 9216-chip pod

frontier candidate TPU7x . Google calls TPU7x the latest Cloud TPU and positions it for dense and MoE training plus decode-heavy inference.

2026

NVIDIA HGX B300

8-GPU Blackwell Ultra system

NVIDIA
Peak Compute 600 FP32 TFLOPS system; 144 sparse FP4 PFLOPS or 108 dense FP4 PFLOPS system; 36 sparse FP16/BF16 PFLOPS system
Memory 2.1 TB total HBM
Scale-up network NVLink 5 plus NVLink Switch
Network bandwidth 1.8 TB/s GPU-to-GPU; 14.4 TB/s total NVLink bandwidth
Largest reported scale 8 Blackwell Ultra SXM GPUs per HGX baseboard

frontier candidate Blackwell Ultra system . Included for 2026 tracking, but most published research may still report H100/H200/A100 until B300 systems are broadly used. NVIDIA HGX lists FP4 as Sparse | Dense = 144 | 108, a 1.33:1 ratio, so this…
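The sparse-to-dense FP4 ratio differs between the two HGX entries in this timeline, which the arithmetic below makes explicit. It uses only the system-level FP4 figures listed for HGX B200 and HGX B300; the names are illustrative.

```python
# (sparse FP4 PFLOPS, dense FP4 PFLOPS) per 8-GPU system, from the HGX entries.
hgx_fp4 = {
    "HGX B200": (144.0, 72.0),
    "HGX B300": (144.0, 108.0),
}

# Sparse:dense ratio per system.
sparse_to_dense = {name: sparse / dense for name, (sparse, dense) in hgx_fp4.items()}

for name, ratio in sparse_to_dense.items():
    print(f"{name}: sparse/dense FP4 = {ratio:.2f}")

# Dense FP4 uplift from B200 to B300 at the same sparse peak.
dense_uplift = hgx_fp4["HGX B300"][1] / hgx_fp4["HGX B200"][1]
print(f"B300 dense FP4 vs B200: {dense_uplift:.2f}x")
```

The sparse peak is identical across the two systems; the B300 gain is entirely in dense FP4 (1.5x), so halving a sparse figure to estimate dense throughput would understate B300.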