Hardware dataset

AI Accelerator Timeline

A hardware timeline for AI methods, from the GTX 580 used for AlexNet to Blackwell Ultra and Ironwood. The 17 entries track device class, memory, interconnect, and access model across mainstream AI accelerators.

Source: history/sources/data/mainstream_accelerator_eras.csv. Citations checked against NVIDIA and Google primary sources.


2012

NVIDIA GTX 580 3GB

2-GPU paper setup

NVIDIA

Memory 3 GB GDDR5 per GPU

Scale-up network PCIe/model split

Largest reported scale 2 GPUs in AlexNet

paper anchor. AlexNet reports two GTX 580 3GB GPUs and identifies GPU memory as the limiting factor. Full board specs need a separate archived vendor/board source.

2014

NVIDIA Tesla K40

single data-center GPU

NVIDIA

Peak Compute 5 FP32 TFLOPS; 1.7 FP64 TFLOPS

Memory up to 12 GB GDDR5

Scale-up network PCIe

Power 235 W

Largest reported scale single GPU or small PCIe servers

widely adopted pre-HBM research GPU. Representative for 2014-2015 CUDA research servers before HBM and tensor cores. Figures follow the GPU Boost peak convention of NVIDIA's Volta comparison table; base-clock board datasheets can report lower p…

2015

NVIDIA Tesla K80

dual-GPU data-center board / research cluster GPU

NVIDIA

Memory 24 GB GDDR5 per board / 12 GB per GPU, from repo reading-card statements and common K80 board identity; exact vendor spec not yet locally archived

Scale-up network PCIe multi-GPU servers

Largest reported scale 96 K80 GPUs in GNMT; 60 K80 GPUs in Xception; K80 GPU baseline in TPU v1 datacenter paper

paper-verified mid-2010s research GPU. Added because local reading cards repeatedly identify K80 as a 2015-2016 training and inference comparison device. Numeric board parameters should not be used for quantitative plots until an archived …

2016

NVIDIA Tesla P100

research cluster GPU

NVIDIA

Peak Compute 9.3 FP32 TFLOPS (PCIe); 10.6 FP32 TFLOPS (SXM2); 4.7 FP64 TFLOPS (PCIe); 5.3 FP64 TFLOPS (SXM2); 18.7 FP16 TFLOPS (PCIe)

Memory 12-16 GB HBM2

Memory bandwidth 549-732 GB/s

Scale-up network PCIe Gen3 or first-generation NVLink

Network bandwidth up to 160 GB/s NVLink generation reference

Power 250-300 W

Largest reported scale 8-GPU servers and 256-GPU ImageNet training reports

widely adopted Pascal/HBM GPU. The PCIe datasheet gives lower P100 figures; NVIDIA's Volta comparison table gives the SXM2-class peak figures.

2017

Google Cloud TPU v2

Cloud TPU training accelerator / TPUv2 mesh

TPU

Peak Compute 11.5 PFLOP/s maximum reported for 512 TPUv2 cores in Mesh TensorFlow local card; chip-level vendor peak not recorded in selected source

Memory 8 GB per accelerator in GPipe local card; Google docs list v2 TPU configurations up to v2-512

Scale-up network TPUv2 mesh / high-speed links

Largest reported scale v2-512 configuration in Google Cloud docs; 512-core TPUv2 mesh in Mesh TensorFlow card

paper-verified Google Cloud TPU training anchor. Added because GPipe and Mesh TensorFlow both use TPUv2. The exact chip-level peak is kept conservative because current Google docs point to an ACM architecture paper for detailed specs.

NVIDIA Tesla V100 SXM2/SXM3

training GPU

NVIDIA

Peak Compute 15.7 FP32 TFLOPS; 7.8 FP64 TFLOPS; 125 Tensor TFLOPS

Memory 16-32 GB HBM2

Memory bandwidth 900 GB/s

Scale-up network NVLink 2

Network bandwidth 300 GB/s per GPU

Power 300 W

Largest reported scale 512 GPUs in Megatron-LM report

widely adopted Volta tensor-core GPU. Primary 2017 anchor for Volta tensor-core multi-GPU training; later 2019-2020 V100 clusters support hyperscale dense LLM cards such as Megatron-LM and GPT-3.

2018

Google Cloud TPU v3

TPU pod chip

TPU

Peak Compute 123 BF16 TFLOPS per chip

Memory 32 GiB HBM2

Memory bandwidth 900 GB/s

Scale-up network 2D torus

Network bandwidth pod all-reduce 340 TB/s; pod bisection 6.4 TB/s

Power 123/220/262 W min/mean/max

Largest reported scale 1024-chip pod

Google TPU transformer-era anchor. Good anchor for BERT-era and early large transformer TPU training.

2020

NVIDIA A100 40GB/80GB

training GPU

NVIDIA

Peak Compute 19.5 FP32 TFLOPS; 9.7 FP64 TFLOPS; 312 BF16/FP16 Tensor TFLOPS; 624 sparse BF16/FP16 Tensor TFLOPS

Memory 40 GB HBM2 or 80 GB HBM2e

Memory bandwidth 1.6 TB/s or over 2 TB/s

Scale-up network NVLink 3

Network bandwidth 600 GB/s per GPU

Power 400 W

Largest reported scale 384 GPUs in BLOOM report; larger MLPerf/HGX clusters

widely adopted Ampere LLM GPU. BF16 and 80 GB HBM2e made this the main open-report LLM training GPU for 2021-2023.

Google Cloud TPU v4

TPU pod chip

TPU

Peak Compute 275 BF16 or INT8 TFLOPS per chip

Memory 32 GiB HBM2

Memory bandwidth 1200 GB/s

Scale-up network 3D mesh or twisted torus

Network bandwidth pod all-reduce 1.1 PB/s; pod bisection 24 TB/s

Power 90/170/192 W min/mean/max

Largest reported scale 4096-chip pod

Google TPU pod training anchor. TPU v4 was deployed in 2020 and is a strong source for large-scale TPU interconnect/topology constraints.

2022

NVIDIA H100 SXM

Hopper training/inference GPU

NVIDIA

Peak Compute 67 FP32 TFLOPS; 34 FP64 TFLOPS; 1979 sparse BF16/FP16 Tensor TFLOPS; 3958 sparse FP8 Tensor TFLOPS

Memory 80 GB HBM3

Memory bandwidth 3.35 TB/s

Scale-up network NVLink 4

Network bandwidth 900 GB/s per GPU

Power up to 700 W

Largest reported scale HGX/DGX H100 4-8 GPU nodes and larger clusters

widely adopted Hopper frontier GPU. Transformer Engine and FP8 shift the bottleneck toward memory capacity, KV cache, and scale-up fabric.

2023

Google Cloud TPU v5e

TPU cloud chip

TPU

Peak Compute 197 BF16 TFLOPS; 393 INT8 TOPS per chip

Memory 16 GB HBM

Memory bandwidth 800 GiB/s

Scale-up network 2D torus; 4 ICI ports

Network bandwidth 400 GB/s bidirectional ICI per chip

Largest reported scale 256-chip pod

Google Cloud cost-efficient TPU anchor. Useful anchor for economical training/inference, not the frontier-memory configuration.
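The TPU entries from v5e onward list memory bandwidth in GiB/s while network bandwidth is in GB/s, so the two are not directly comparable. The sketch below converts between them, assuming the GiB/s labels are meant literally as binary units (2^30 bytes); the function name `gib_per_s_to_gb_per_s` is illustrative and not part of the dataset.

```python
GIB = 2**30  # bytes in one gibibyte (binary prefix)
GB = 10**9   # bytes in one gigabyte (decimal prefix)


def gib_per_s_to_gb_per_s(value_gib_per_s: float) -> float:
    """Convert a bandwidth from GiB/s to GB/s."""
    return value_gib_per_s * GIB / GB


# Memory bandwidth figures as listed in the TPU entries of this timeline.
listed_gib_per_s = {
    "TPU v5e": 800,
    "TPU v5p": 2575,
    "TPU v6e": 1638,
    "TPU7x": 7380,
}

for chip, gib in listed_gib_per_s.items():
    print(f"{chip}: {gib} GiB/s = {gib_per_s_to_gb_per_s(gib):.0f} GB/s")
```

The conversion factor is about 1.074, so the difference matters when a GiB/s figure is plotted next to GB/s or TB/s figures from the NVIDIA entries.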

Google Cloud TPU v5p

TPU pod chip

TPU

Peak Compute 459 BF16 TFLOPS; 459 FP8 TFLOPS per chip

Memory 95 GiB HBM

Memory bandwidth 2575 GiB/s

Scale-up network 3D torus

Network bandwidth 1200 GB/s bidirectional ICI per chip

Largest reported scale 8960-chip pod; largest single slice 6144 chips

Google frontier TPU pod anchor. High-memory TPU generation for large dense and MoE training. Google comparison docs list FP8 at the same peak as BF16; do not infer an Ironwood-style 2x FP8/BF16 ratio.
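The FP8/BF16 caveat can be checked directly from the per-chip peaks listed in this timeline. The sketch below is illustrative only: the dictionary layout and the helper `fp8_to_bf16_ratio` are assumptions, and the numbers are copied from the v5p, v6e, and TPU7x entries.

```python
from fractions import Fraction

# Per-chip peaks in TFLOPS, copied from the TPU entries in this timeline.
TPU_PEAKS = {
    "TPU v5p": {"bf16": 459, "fp8": 459},
    "TPU v6e": {"bf16": 918, "fp8": 918},
    "TPU7x": {"bf16": 2307, "fp8": 4614},
}


def fp8_to_bf16_ratio(peaks: dict) -> Fraction:
    """Return the listed FP8 peak divided by the listed BF16 peak."""
    return Fraction(peaks["fp8"], peaks["bf16"])


for chip, peaks in TPU_PEAKS.items():
    print(f"{chip}: FP8/BF16 = {fp8_to_bf16_ratio(peaks)}")
```

Only TPU7x shows a ratio of 2 in these listed figures; v5p and v6e come out at 1, which is why an FP8 peak should be read from the entry rather than derived from BF16.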

2024

NVIDIA H200 SXM

Hopper memory-expanded GPU

NVIDIA

Peak Compute 67 FP32 TFLOPS; 34 FP64 TFLOPS; 1979 sparse BF16/FP16 Tensor TFLOPS; 3958 sparse FP8 Tensor TFLOPS

Memory 141 GB HBM3e

Memory bandwidth 4.8 TB/s

Scale-up network NVLink 4

Network bandwidth 900 GB/s per GPU

Power up to 700 W

Largest reported scale HGX H200 4-8 GPU nodes

widely adopted memory-heavy inference GPU. Same broad compute class as H100, but HBM capacity/bandwidth changes long-context and inference economics.

Google Cloud TPU v6e Trillium

TPU cloud chip

TPU

Peak Compute 918 BF16 TFLOPS; 918 FP8 TFLOPS; 1836 INT8 TOPS per chip

Memory 32 GB HBM

Memory bandwidth 1638 GiB/s

Scale-up network 2D torus; 4 ICI ports

Network bandwidth 800 GB/s bidirectional ICI per chip

Largest reported scale 256-chip pod

Google TPU training/fine-tuning/serving anchor. Google positions v6e for transformer, text-to-image, CNN training, fine-tuning, and serving. The v6e page lists BF16 and INT8; the TPU7x comparison table also lists v6e FP8 at 918 TFLOPS.

2025

NVIDIA HGX B200

8-GPU Blackwell system

NVIDIA

Peak Compute 600 FP32 TFLOPS system; 72 dense FP4 PFLOPS or 144 sparse FP4 PFLOPS system; 36 sparse FP16/BF16 PFLOPS system

Memory 1.4 TB total HBM

Scale-up network NVLink 5 plus NVLink Switch

Network bandwidth 1.8 TB/s GPU-to-GPU; 14.4 TB/s total NVLink bandwidth

Largest reported scale 8 Blackwell SXM GPUs per HGX baseboard

frontier candidate Blackwell system. Treat as system-level because NVIDIA's current public HGX table reports B200 in 8-GPU units.
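Because this entry is system-level, its figures cannot be placed next to per-chip entries without normalizing. A minimal sketch of that normalization follows, assuming an even split across the 8 GPUs; the per-GPU values it produces are derived here and are not stated in the entry, and the helper `per_gpu` is hypothetical.

```python
GPUS_PER_BASEBOARD = 8  # from the "8 Blackwell SXM GPUs per HGX baseboard" line

# System-level figures copied from the HGX B200 entry.
hgx_b200_system = {
    "dense_fp4_pflops": 72,
    "sparse_fp4_pflops": 144,
    "total_nvlink_tb_per_s": 14.4,
}


def per_gpu(system: dict, gpus: int = GPUS_PER_BASEBOARD) -> dict:
    """Divide every system-level figure evenly across the GPUs (an assumption)."""
    return {key: value / gpus for key, value in system.items()}


b200_per_gpu = per_gpu(hgx_b200_system)
print(b200_per_gpu)
```

The NVLink result is a useful consistency check, since 14.4 TB/s divided by 8 matches the listed 1.8 TB/s GPU-to-GPU figure. Rounded totals such as the memory figure may not divide back to an exact per-GPU spec, so they are left out of the sketch.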

Google Cloud TPU7x Ironwood

TPU pod chip

TPU

Peak Compute 2307 BF16 TFLOPS; 4614 FP8 TFLOPS per chip

Memory 192 GiB HBM

Memory bandwidth 7380 GiB/s

Scale-up network 3D torus

Network bandwidth 1200 GB/s bidirectional ICI per chip; 200 GB/s per axis

Largest reported scale 9216-chip pod

frontier candidate TPU7x. Google calls TPU7x the latest Cloud TPU and positions it for dense and MoE training plus decode-heavy inference.
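Per-chip peaks and pod sizes in the TPU entries can be combined into a rough pod-level ceiling. The sketch below multiplies the two, which assumes every chip runs at its listed peak simultaneously; the resulting numbers are theoretical upper bounds computed here, not figures taken from the dataset, and `pod_peak_pflops` is a hypothetical helper.

```python
def pod_peak_pflops(chips: int, per_chip_tflops: float) -> float:
    """Theoretical pod peak in PFLOPS: chip count times per-chip peak."""
    return chips * per_chip_tflops / 1000


# (name, largest pod in chips, per-chip peak in TFLOPS, precision),
# copied from the TPU entries in this timeline.
PODS = [
    ("TPU v3", 1024, 123, "BF16"),
    ("TPU v4", 4096, 275, "BF16"),
    ("TPU v5e", 256, 197, "BF16"),
    ("TPU v5p", 8960, 459, "BF16"),
    ("TPU v6e", 256, 918, "BF16"),
    ("TPU7x", 9216, 2307, "BF16"),
    ("TPU7x", 9216, 4614, "FP8"),
]

for name, chips, tflops, dtype in PODS:
    print(f"{name} {dtype}: {pod_peak_pflops(chips, tflops):,.1f} PFLOPS")
```

Sustained throughput in practice depends on interconnect, memory bandwidth, and utilization, so these products are best read as an ordering aid across generations.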

2026

NVIDIA HGX B300

8-GPU Blackwell Ultra system

NVIDIA

Peak Compute 600 FP32 TFLOPS system; 144 sparse FP4 PFLOPS or 108 dense FP4 PFLOPS system; 36 sparse FP16/BF16 PFLOPS system

Memory 2.1 TB total HBM

Scale-up network NVLink 5 plus NVLink Switch

Network bandwidth 1.8 TB/s GPU-to-GPU; 14.4 TB/s total NVLink bandwidth

Largest reported scale 8 Blackwell Ultra SXM GPUs per HGX baseboard

frontier candidate Blackwell Ultra system. Included for 2026 tracking, but most published research may still report H100/H200/A100 until B300 systems are broadly used. NVIDIA HGX lists FP4 as Sparse | Dense = 144 | 108, a 1.33:1 ratio, so this…
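The sparse-to-dense FP4 ratio quoted above can be reproduced from the system-level figures in the two HGX entries. This is a small sketch using exact fractions; the dictionary and the helper `sparse_to_dense_ratio` are illustrative names, and the values are copied from the B200 and B300 entries.

```python
from fractions import Fraction

# System-level FP4 peaks in PFLOPS, copied from the HGX B200 and B300 entries.
FP4_PFLOPS = {
    "HGX B200": {"sparse": 144, "dense": 72},
    "HGX B300": {"sparse": 144, "dense": 108},
}


def sparse_to_dense_ratio(peaks: dict) -> Fraction:
    """Return the listed sparse peak divided by the listed dense peak."""
    return Fraction(peaks["sparse"], peaks["dense"])


for system, peaks in FP4_PFLOPS.items():
    ratio = sparse_to_dense_ratio(peaks)
    print(f"{system}: sparse/dense = {ratio} ({float(ratio):.2f}:1)")
```

On these listed figures B200 comes out at 2:1 and B300 at 4/3, or about 1.33:1, so a dense FP4 peak for B300 should be taken from the table rather than assumed to be half the sparse figure.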
