Mastering the game of Go without human knowledge

Metadata

Reading status: read complete
Year: 2017
Compute regime: Search, simulation, and science compute (search_simulation_science_compute)
PDF: 2017-alphago_zero_2017.pdf
Extracted text: 2017-alphago_zero_2017.txt
PDF URL: https://discovery.ucl.ac.uk/id/eprint/10045895/1/agz_unformatted_nature.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper gives several exact compute facts, though it does not name the TPU or GPU model generation. AlphaGo Zero's search/play system uses "a single machine in the Google Cloud with 4 TPUs"; the authors contrast this with AlphaGo Lee, which was distributed over many machines and used 48 TPUs, and AlphaGo Fan, which used 176 GPUs. The neural-network optimization pipeline is separate: TensorFlow optimization runs on Google Cloud with 64 GPU workers and 19 CPU parameter servers, with batch size 32 per worker for a total mini-batch size of 2048.

The 20-block run trains for about 3 days from random behavior, generates 4.9 million self-play games, uses 1600 MCTS simulations per move, and updates parameters from 700000 mini-batches of 2048 positions. The larger 40-block run trains for about 40 days, generates 29 million games, and uses 3.1 million mini-batches of 2048 positions. During final evaluation, AlphaGo Zero and AlphaGo Master each run on a single machine with 4 TPUs; AlphaGo Lee remains a distributed 48-TPU comparison point.
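The figures above imply some derived totals that are easy to check by hand. A minimal sketch follows; the variable and function names are illustrative, and the "positions sampled" totals count every position drawn into a mini-batch, so the same position may be counted more than once.

```python
# Figures quoted in this card; names are illustrative, not from the paper.
GPU_WORKERS = 64
BATCH_PER_WORKER = 32
MINI_BATCH = GPU_WORKERS * BATCH_PER_WORKER  # 2048 positions per update

RUNS = {
    "20-block": {"days": 3, "games": 4_900_000, "mini_batches": 700_000},
    "40-block": {"days": 40, "games": 29_000_000, "mini_batches": 3_100_000},
}


def training_volume(run):
    """Derive totals for one run from the reported figures."""
    sampled = run["mini_batches"] * MINI_BATCH
    return {
        # Positions drawn into mini-batches (with possible reuse).
        "positions_sampled": sampled,
        # Average self-play throughput over the whole run.
        "games_per_day": run["games"] / run["days"],
        # How many sampled positions each generated game supports on average.
        "positions_sampled_per_game": sampled / run["games"],
    }


for name, run in RUNS.items():
    print(name, training_volume(run))
```

Under these figures the 20-block run samples about 1.43 billion positions and the 40-block run about 6.35 billion, while average self-play throughput falls from roughly 1.6 million games per day to roughly 0.7 million.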

Bottleneck

The bottleneck is search-generated data, not supervised dataset size. AlphaGo Zero has no human games to imitate and no rollout policy, so every improvement must come from repeatedly using the current network inside MCTS, producing higher-quality self-play targets, and then training the network to match those targets. Each move in self-play uses 1600 simulations, which the paper says corresponds to about 0.4 seconds of thinking time per move in the 20-block run. That is a large simulation budget multiplied across millions of games.
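The size of the search budget can be made concrete with a rough calculation. This is a sketch only: the card does not give the average number of moves per game, so avg_moves_per_game is a hypothetical input, and the value 200 used below is an illustrative assumption rather than a figure from the paper.

```python
SIMS_PER_MOVE = 1600      # reported MCTS simulations per self-play move
SECONDS_PER_MOVE = 0.4    # reported thinking time per move, 20-block run


def sims_per_second():
    """Implied search throughput for one self-play move."""
    return SIMS_PER_MOVE / SECONDS_PER_MOVE


def total_self_play_simulations(games, avg_moves_per_game):
    """Total MCTS simulations over a run.

    avg_moves_per_game is NOT reported in this card; callers must supply
    their own estimate.
    """
    return games * avg_moves_per_game * SIMS_PER_MOVE


print(sims_per_second())
# 200 moves per game is a hypothetical value, used only for illustration.
print(total_self_play_simulations(4_900_000, 200))
```

The implied throughput is about 4000 simulations per second, or 0.25 ms per simulation. With the assumed game length, the 20-block run would need on the order of 10^12 network-evaluated simulations, which is why the card treats search as the dominant cost.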

The system also has a pipeline bottleneck. Search produces data, optimization updates checkpoints, and an evaluator decides whether a checkpoint is strong enough to generate future self-play. If generation outruns training, data quality stalls; if training outruns generation, the replay buffer lacks fresh positions. The paper addresses this with an asynchronous pipeline that continually optimizes from recent self-play, continually evaluates checkpoints, and uses the best player so far for data generation.

Method Adaptation

AlphaGo Zero adapts the algorithm to spend expensive TPU search on fewer, higher-quality neural evaluations. A single residual network outputs both policy and value, replacing AlphaGo Lee's separate policy and value networks and eliminating rollout evaluation. In each MCTS simulation, a leaf is evaluated once by the network to provide move priors and a value. The search probabilities become an improved policy target, and the final game result becomes the value target. The loss combines squared value error, policy cross-entropy against the search probabilities (derived from MCTS visit counts), and L2 regularization.
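The two training targets and the combined loss can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the default regularization weight c = 1e-4 and the visit-count temperature are recalled from the paper's Methods rather than taken from this card, and real training operates on batched tensors rather than Python lists.

```python
import math


def search_policy(visit_counts, temperature=1.0):
    """Turn MCTS root visit counts into a policy target.

    pi(a) is proportional to N(a) ** (1 / temperature).
    """
    powered = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(powered)
    return [x / total for x in powered]


def combined_loss(z, v, pi, p, weights, c=1e-4, eps=1e-12):
    """Value error + policy cross-entropy + L2 penalty for one position.

    z: game outcome from the current player's perspective (+1 or -1)
    v: network value prediction
    pi: search-derived policy target
    p: network policy output (probabilities over the same moves as pi)
    weights: flat list of parameters; c is an assumed default (see lead-in)
    """
    value_loss = (z - v) ** 2
    policy_loss = -sum(t * math.log(q + eps) for t, q in zip(pi, p))
    l2_penalty = c * sum(w * w for w in weights)
    return value_loss + policy_loss + l2_penalty


pi = search_policy([1200, 300, 100])   # 1600 simulations over three moves
loss = combined_loss(z=1.0, v=0.5, pi=pi, p=[0.5, 0.3, 0.2], weights=[0.1, -0.2])
print(pi, loss)
```

The point of the sketch is the data flow: search produces pi, the finished game produces z, and a single network is trained against both at once.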

The compute-saving choice is to make the network do all leaf evaluation and to remove rollout playouts. The paper explicitly says AlphaGo Zero does not use a rollout policy or tree policy and adds no search heuristics beyond the domain knowledge it lists. The evaluator plays 400 games at 1600 simulations per move and promotes a new checkpoint only if it wins more than 55% of them, preventing noisy checkpoint churn. Self-play generates 25000 games per iteration using the same 1600-simulation move selection. Resignation saves compute on clearly lost games, but it is disabled in 10% of games so that false-positive resignations can be estimated.
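The checkpoint gate and the resignation hold-out are simple rules, sketched below. The function and constant names are hypothetical; only the numbers (400 games, the 55% threshold, the 10% no-resignation fraction) come from the text above.

```python
import random

EVAL_GAMES = 400             # evaluation games per candidate checkpoint
WIN_RATE_THRESHOLD = 0.55    # candidate must win strictly more than this
NO_RESIGN_FRACTION = 0.10    # share of self-play games with resignation off


def should_promote(candidate_wins, games_played=EVAL_GAMES,
                   threshold=WIN_RATE_THRESHOLD):
    """Return True if the candidate's win rate exceeds the threshold."""
    return candidate_wins / games_played > threshold


def resignation_allowed(rng):
    """Decide, per self-play game, whether resignation is enabled.

    Roughly 10% of games are played to the end regardless, so the rate of
    resignations that would have been wrong can be measured.
    """
    return rng.random() >= NO_RESIGN_FRACTION


rng = random.Random(0)
allowed = sum(resignation_allowed(rng) for _ in range(25_000))
print(should_promote(221), should_promote(220), allowed / 25_000)
```

With 400 games, the gate requires at least 221 wins. The margin trades some promotion latency for protection against replacing the self-play generator with a checkpoint that only looks better by chance.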

The residual architecture is also a compute-structure choice. The network treats the 19 x 19 Go board as an image stack and uses either 20 or 40 residual blocks. Combining policy and value into one residual tower improves computational efficiency and regularizes shared representations.
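A rough weight count shows how the 20-block and 40-block towers compare. This sketch rests on assumptions not stated in this card: 256 filters, 3 x 3 kernels, and two convolutions per residual block, as recalled from the paper's Methods. It counts only the tower's convolution weights and ignores the input convolution, batch normalization, and the policy and value heads. The function name is hypothetical.

```python
def residual_tower_conv_weights(blocks, filters=256, kernel=3,
                                convs_per_block=2):
    """Count convolution weights in the residual tower only.

    Defaults are assumptions (see lead-in), not figures from this card.
    """
    weights_per_conv = kernel * kernel * filters * filters
    return blocks * convs_per_block * weights_per_conv


for blocks in (20, 40):
    print(blocks, residual_tower_conv_weights(blocks))
```

Under these assumptions the tower holds about 23.6 million convolution weights at 20 blocks and about 47.2 million at 40 blocks, so the larger network roughly doubles the convolution work of every leaf evaluation during search.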

Evidence

The 3-day result is the main evidence. AlphaGo Zero outperforms AlphaGo Lee after 36 hours, and after 72 hours it defeats the exact version of AlphaGo Lee that beat Lee Sedol by 100 games to 0, under the same 2-hour match conditions. AlphaGo Zero runs its search on one 4-TPU machine, compared with AlphaGo Lee's distributed 48-TPU system. The paper emphasizes that AlphaGo Lee had been trained over several months, while AlphaGo Zero learns from random play.

The architecture ablation supports the compute allocation. Networks trained on a fixed 72-hour self-play dataset show that residual networks are more accurate and improve playing performance by over 600 Elo compared with the previous convolutional architecture. Combining policy and value into one network slightly reduces move prediction accuracy but reduces value error and boosts playing performance by around another 600 Elo, partly due to improved computational efficiency.

The 40-block run extends the scale evidence. It trains for 40 days, generates 29 million games, and reaches a final tournament Elo around 5185 in the paper's internal comparison, ahead of AlphaGo Master at 4858, AlphaGo Lee at 3739, and AlphaGo Fan at 3144. In a 100-game match with 2-hour time controls, the 40-block AlphaGo Zero beats AlphaGo Master 89-11.
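The Elo gaps can be translated into expected win rates as a sanity check against the match result. The sketch assumes the standard logistic Elo formula with a 400-point scale, which may not match the paper's rating procedure exactly, and the tournament ratings need not have been measured under the same time controls as the 100-game match. The function name is hypothetical.

```python
# Final tournament ratings quoted above.
ELO = {
    "AlphaGo Zero (40 blocks)": 5185,
    "AlphaGo Master": 4858,
    "AlphaGo Lee": 3739,
    "AlphaGo Fan": 3144,
}


def expected_score(elo_a, elo_b):
    """Expected win probability of A over B under standard logistic Elo."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))


zero = ELO["AlphaGo Zero (40 blocks)"]
for name, rating in ELO.items():
    if rating != zero:
        print(name, round(expected_score(zero, rating), 4))
```

Under that formula the 327-point gap to AlphaGo Master corresponds to an expected win rate of about 87%, in line with the observed 89-11 score. The same formula puts a 600-point gain, like those in the architecture ablation, at about a 97% expected win rate.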

Historical Effect

AlphaGo Zero changed the compute recipe for high-performance game AI. Instead of combining human data, handcrafted rollouts, and distributed search, it showed that enough self-play search and a single policy-value residual network could bootstrap superhuman performance from random initialization. The paper also made search a training operator: MCTS is not just an inference-time add-on, but the mechanism that creates the policy-improvement targets used in every update.

Limits

The method is not free of domain structure or compute. The paper lists the domain knowledge still used: perfect game rules, legal moves, Tromp-Taylor scoring during self-play/search, 19 x 19 board geometry, perspective normalization, and rotation/reflection symmetry. It is specialized to zero-sum perfect-information games. The exact TPU and GPU models are not named, limiting hardware reproducibility. Most importantly, the self-play budget is enormous: millions of games, 1600 simulations per move, 64 GPU workers for optimization, and a 4-TPU machine for play and evaluation.