Mastering the game of Go with deep neural networks and tree search

Metadata

Reading status: read complete
Year: 2016
Compute regime: Search, simulation, and science compute (search_simulation_science_compute)
PDF: 2016-alphago_2016.pdf
Extracted text: 2016-alphago_2016.txt
PDF URL: https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper is explicit about both training hardware and match-time search hardware. The final single-machine AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs. The distributed version used multiple machines with 40 search threads, 1,202 CPUs, and 176 GPUs. The hardware split mirrors the algorithm: CPUs execute asynchronous Monte Carlo tree search and rollouts, while GPUs run the expensive policy and value network evaluations that the search threads queue for them.

Training also used substantial accelerator parallelism for its time. The supervised policy network was trained on 29.4M positions from 160,000 KGS 6-9 dan games, augmented by board symmetries, with asynchronous SGD in DistBelief on 50 GPUs. It used minibatch size 16 and took around 3 weeks for 340M training steps. The reinforcement-learning policy stage trained 10,000 minibatches of 128 self-play games on 50 GPUs for one day, and the value network trained from over 30M distinct self-play positions on 50 GPUs for one week. No TPU hardware is stated; the paper's concrete devices are CPUs and GPUs.
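The training figures above imply some rough totals. The arithmetic below is a back-of-the-envelope sketch, not a set of numbers reported by the paper. It assumes that each of the 340M steps consumed exactly one minibatch of 16 positions, and that symmetry augmentation multiplies the dataset by the 8 rotations and reflections of a square board. All variable names are illustrative.

```python
# Back-of-the-envelope totals derived from the training figures quoted above.
# Assumption: every training step consumes exactly one minibatch.

SL_POSITIONS = 29_400_000      # KGS positions in the supervised dataset
BOARD_SYMMETRIES = 8           # rotations and reflections of a square board
SL_STEPS = 340_000_000         # supervised training steps
SL_MINIBATCH = 16              # positions per supervised minibatch

RL_MINIBATCHES = 10_000        # reinforcement-learning policy updates
RL_GAMES_PER_MINIBATCH = 128   # self-play games per update

augmented_positions = SL_POSITIONS * BOARD_SYMMETRIES
sl_samples_seen = SL_STEPS * SL_MINIBATCH
approx_passes = sl_samples_seen / augmented_positions
rl_self_play_games = RL_MINIBATCHES * RL_GAMES_PER_MINIBATCH

print(f"augmented supervised positions: {augmented_positions:,}")
print(f"supervised samples processed:   {sl_samples_seen:,}")
print(f"approx. passes over the data:   {approx_passes:.1f}")
print(f"RL self-play games:             {rl_self_play_games:,}")
```

Under these assumptions the supervised stage makes on the order of tens of passes over the augmented data, and the one-day reinforcement-learning stage plays on the order of a million self-play games.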

Bottleneck

The bottleneck is the mismatch between Go's enormous search tree and the cost of deep neural network evaluation. The paper states that evaluating policy and value networks requires several orders of magnitude more computation than traditional search heuristics. A pure neural evaluator would bottleneck on GPU latency; a pure rollout search would waste CPU simulations on a very broad tree and inaccurate terminal playouts. AlphaGo therefore has a dual bottleneck: choosing which positions deserve expensive GPU evaluation and keeping enough CPU simulations in flight to hide that cost.

Rollout throughput is the other side of the same constraint. The fast rollout policy executes at about 1,000 simulations per second per CPU thread on an empty board, while the deeper policy and value networks are evaluated asynchronously on GPUs from queues filled by the search. CPUs supply breadth and rollout statistics; GPUs supply informed priors and stronger leaf evaluations.

Method Adaptation

AlphaGo adapts the method to the hardware by separating fast and slow evaluators. The supervised policy network narrows the search tree by assigning prior probabilities to moves. The reinforcement-learned policy is improved through self-play and is used mainly to generate the self-play games on which the value network is trained. The value network predicts the winner from a board position, so a single learned evaluation can stand in for many expensive full-game rollouts; in the final system its estimate is combined with rollout outcomes rather than replacing them entirely. A separate fast rollout policy is much cheaper and uses local pattern features, making it suitable for CPU playouts to terminal states.
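A minimal sketch of how these evaluators meet inside the search may help. In the paper's search, a leaf is scored by mixing the value network's estimate with a rollout outcome using a weight lambda, and moves are selected by adding a prior-weighted exploration bonus to the mean action value. The function names, the dictionary layout, and the default constants below are illustrative assumptions, not AlphaGo's code.

```python
import math


def leaf_value(value_net_estimate: float, rollout_outcome: float,
               lam: float = 0.5) -> float:
    """Mix the slow evaluator (value network) with the fast one (rollout).

    lam = 0 trusts only the value network; lam = 1 trusts only the rollout.
    """
    return (1.0 - lam) * value_net_estimate + lam * rollout_outcome


def select_move(priors: dict, q_values: dict, visit_counts: dict,
                c_puct: float = 5.0):
    """Pick the move maximising Q + u, where the bonus u is proportional
    to the policy prior and decays as the move accumulates visits."""
    total_visits = sum(visit_counts.values())

    def score(move):
        bonus = (c_puct * priors[move] * math.sqrt(total_visits)
                 / (1 + visit_counts[move]))
        return q_values[move] + bonus

    return max(priors, key=score)


# Hypothetical three-move position: "a" has the highest prior but has
# already absorbed every visit, so its exploration bonus has decayed.
priors = {"a": 0.6, "b": 0.3, "c": 0.1}
q_values = {"a": 0.0, "b": 0.0, "c": 0.0}
visit_counts = {"a": 10, "b": 0, "c": 0}
chosen = select_move(priors, q_values, visit_counts)
mixed = leaf_value(value_net_estimate=0.2, rollout_outcome=1.0)
```

The design point is that the policy prior costs one GPU evaluation per expanded node yet shapes every later selection through that node, while the rollout term comes from cheap CPU playouts.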

The MCTS implementation is explicitly asynchronous. Multiple search threads traverse the tree with virtual loss to discourage collisions. Leaf positions are inserted into queues for GPU policy and value evaluation; the paper reports a minibatch size of 1 for these match-time evaluations, trading accelerator throughput for lower end-to-end latency. The distributed version keeps the tree on a master, sends rollout work to CPU workers, and sends neural network work to GPU workers. This is the compute structure behind the result: search is not replaced by a neural net; search is reorganized so the scarce GPU calls are concentrated on high-value positions.
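The virtual-loss mechanism mentioned above can be sketched as follows. While a thread is still working below an edge, the edge is temporarily scored as if it had lost several games, which lowers its mean value and steers other threads elsewhere; the penalty is removed when the real outcome is backed up. This is a simplified sketch under stated assumptions: the class and method names are hypothetical, a single visit/value pair stands in for the separate rollout and value-network statistics kept in the paper, and the default of 3 follows the virtual-loss setting given in the paper's methods.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class EdgeStats:
    """Statistics for one (state, action) edge shared by search threads."""
    visits: int = 0
    total_value: float = 0.0
    lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def mean_value(self) -> float:
        return self.total_value / self.visits if self.visits else 0.0

    def apply_virtual_loss(self, n_vl: int = 3) -> None:
        """Called during selection: pretend n_vl games were played and lost."""
        with self.lock:
            self.visits += n_vl
            self.total_value -= n_vl

    def backup(self, outcome: float, n_vl: int = 3) -> None:
        """Called at backup: undo the virtual loss and record the real result."""
        with self.lock:
            self.visits += 1 - n_vl
            self.total_value += n_vl + outcome


edge = EdgeStats(visits=10, total_value=6.0)
before = edge.mean_value()          # mean value before any thread descends
edge.apply_virtual_loss()
during = edge.mean_value()          # temporarily depressed while in flight
edge.backup(outcome=1.0)
after = edge.mean_value()           # one real win has been added
```

The lock is a stand-in for whatever synchronisation the real implementation uses; the point is only that the penalty and its removal are applied to shared statistics that every thread reads during selection.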

The method also adapts to GPU memory through its input representation. The policy and value networks take 19x19 stacks of binary feature planes (48 planes for the policy network) rather than raw rendered images, so GPU memory is spent on board-state features and convolutional filters, not vision preprocessing.
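To make the feature-plane idea concrete, the sketch below encodes only the stone-colour group of features: three binary 19x19 planes marking the current player's stones, the opponent's stones, and empty points. The remaining planes described in the paper (liberties, ladders, move history, and so on) are omitted. The function name and the dictionary-based board format are assumptions made for illustration.

```python
BOARD_SIZE = 19


def encode_stone_planes(stones: dict, to_play: str) -> list:
    """Return three binary planes: [own stones, opponent stones, empty].

    stones maps (row, col) -> 'B' or 'W'; to_play is 'B' or 'W'.
    Planes are relative to the player to move, not to absolute colour.
    """
    own = [[0] * BOARD_SIZE for _ in range(BOARD_SIZE)]
    opponent = [[0] * BOARD_SIZE for _ in range(BOARD_SIZE)]
    empty = [[1] * BOARD_SIZE for _ in range(BOARD_SIZE)]

    for (row, col), colour in stones.items():
        plane = own if colour == to_play else opponent
        plane[row][col] = 1
        empty[row][col] = 0

    return [own, opponent, empty]


# Hypothetical position with one black and one white stone, black to play.
planes = encode_stone_planes({(3, 3): "B", (15, 15): "W"}, to_play="B")
```

Each point is set in exactly one of the three planes, so the stack is a one-hot encoding of the board that a convolutional network can consume directly.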

Evidence

The paper gives direct benchmark evidence for both strength and hardware scaling. In an internal tournament with 5 seconds per move, single-machine AlphaGo scored Elo 2890 using 48 CPUs and 8 GPUs, while distributed AlphaGo scored Elo 3140 with 1,202 CPUs and 176 GPUs. The paper states that single-machine AlphaGo won 494 out of 495 games, a 99.8% win rate, against other Go programs. Distributed AlphaGo defeated Fan Hui, European champion and professional 2 dan, 5-0 in the formal match.

The ablation and scalability tables are especially compute-relevant. With 48 CPUs and 8 GPUs, asynchronous AlphaGo rises from Elo 2203 with 1 search thread to 2890 with 40 search threads. Holding 40 threads and 48 CPUs fixed, reducing GPUs from 8 to 4 lowers Elo to 2850, to 2 lowers it to 2738, and to 1 lowers it to 2181. The distributed variants continue upward: 428 CPUs/64 GPUs scores 2937, 764 CPUs/112 GPUs scores 3079, and 1,202 CPUs/176 GPUs scores 3140.
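To read these Elo gaps as win probabilities, the standard logistic Elo formula can be applied; it is consistent with the paper's remark that a 230-point gap corresponds to roughly a 79% chance of winning. The sketch below applies it to the GPU ablation figures quoted above. The function and variable names are illustrative, and the results are expectations implied by the rating scale, not measured head-to-head results.

```python
def expected_score(elo_gap: float) -> float:
    """Expected score of the stronger side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


# Elo by GPU count with 40 search threads and 48 CPUs, as quoted above.
ELO_BY_GPUS = {8: 2890, 4: 2850, 2: 2738, 1: 2181}

baseline = ELO_BY_GPUS[8]
for gpus, elo in ELO_BY_GPUS.items():
    gap = baseline - elo
    print(f"8 GPUs vs {gpus} GPU(s): gap {gap:4d} Elo, "
          f"expected score {expected_score(gap):.3f}")
```

Read this way, halving from 8 to 4 GPUs costs little, while dropping to a single GPU leaves the 8-GPU configuration expected to win almost every game, which suggests the search threads become starved of network evaluations.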

Historical Effect

AlphaGo established neural policy/value networks plus tree search as a viable compute structure for enormous combinatorial spaces. Before this paper, Go programs relied heavily on handcrafted patterns and Monte Carlo search. AlphaGo showed that GPU-trained networks could make search selective enough that, by the paper's own comparison, the program evaluated thousands of times fewer positions than Deep Blue did in its chess match against Kasparov, while still reaching professional-level play.

Historically, its importance is also architectural. It is an early high-profile example of AI performance coming from a heterogeneous runtime: offline distributed neural training, online CPU search, asynchronous GPU inference, and engineered dataflow between them.

Limits

The limits are clear from the same compute structure. AlphaGo depends on human expert games for the supervised policy stage, handcrafted rollout features for cheap CPU playouts, Go-specific board symmetries and feature planes, and a large amount of match-time search hardware. Its strongest distributed configuration is not merely a trained neural net served on one accelerator; it is a 1,202-CPU/176-GPU search system.

The paper also does not claim a general game-playing architecture beyond two-player perfect-information Go. It assumes legal move generation, deterministic transitions, and a domain where rollouts and value networks combine cleanly. The compute lesson is powerful but specialized: deep networks reshape search, yet strong play still came from substantial online simulation budget and domain engineering.