A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play
Metadata
- Reading status: read complete
- Year: 2018
- Compute regime: Search, simulation, and science compute (search_simulation_science_compute)
- PDF: 2018-alphazero_2018.pdf
- Extracted text: 2018-alphazero_2018.txt
- PDF URL: https://arxiv.org/pdf/1712.01815.pdf
- OpenAlex:
- Citation count source/date:
- Citation count:
- Reading card created: 2026-06-15
Compute Setup
The paper is explicit about the training fleet. Each game-specific AlphaZero instance trains for 700000 steps with mini-batches of size 4096, starting from random parameters, using 5000 first-generation TPUs to generate self-play games and 64 second-generation TPUs to train the neural networks. This is a much larger generation fleet than AlphaGo Zero's single-machine self-play setup, reflecting the need to train separate systems for chess, shogi, and Go at high throughput.
Appendix Table S3 gives the per-game scale: 44 million chess games, 24 million shogi games, and 21 million Go games; training times of 9 hours for chess, 12 hours for shogi, and 34 hours for Go; and 800 MCTS simulations per move. Those 800 simulations correspond to about 40 ms per move in chess, 80 ms in shogi, and 200 ms in Go during training. Evaluation MCTS is executed on a single machine with 4 TPUs. The baselines are CPU engines: Stockfish 8 and Elmo/YaneuraOu use 64 CPU threads and a 1GB hash size.
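The figures in this section combine into a few derived quantities by plain arithmetic. The sketch below is illustrative only: the constant and function names are made up for this card, the games-per-hour value assumes the Table S3 game counts were generated over the full quoted training time, and the sample count is steps times batch size, which counts draws from the self-play data rather than distinct positions.

```python
# Figures quoted in this card; everything derived below is plain arithmetic.
TRAINING_STEPS = 700_000
BATCH_SIZE = 4_096
SIMS_PER_MOVE = 800

GAMES = {"chess": 44_000_000, "shogi": 24_000_000, "go": 21_000_000}
HOURS = {"chess": 9, "shogi": 12, "go": 34}
MS_PER_MOVE = {"chess": 40, "shogi": 80, "go": 200}


def positions_sampled(steps=TRAINING_STEPS, batch=BATCH_SIZE):
    """Training samples consumed: one mini-batch per step (not unique positions)."""
    return steps * batch


def games_per_hour(game):
    """Average self-play rate, ASSUMING games were spread over the whole run."""
    return GAMES[game] / HOURS[game]


def ms_per_simulation(game):
    """Average wall-clock time per MCTS simulation during training."""
    return MS_PER_MOVE[game] / SIMS_PER_MOVE


if __name__ == "__main__":
    print(f"samples consumed: {positions_sampled():,}")
    for game in GAMES:
        print(game, f"{games_per_hour(game):,.0f} games/hour,",
              f"{ms_per_simulation(game):.3f} ms/simulation")
```

Under these assumptions each run consumes roughly 2.9 billion training samples, and the per-simulation time grows from chess to shogi to Go in step with the quoted per-move times.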
Bottleneck
The bottleneck is making search plus self-play general enough to replace decades of handcrafted game-engine heuristics while still producing enough high-quality training data. Chess and shogi engines such as Stockfish and Elmo use alpha-beta search, handcrafted evaluation, move ordering, pruning, quiescence search, opening/endgame databases, and other domain adaptations. AlphaZero replaces these with a neural policy-value network and MCTS. That shifts compute from cheap handcrafted position evaluation to expensive neural-network evaluation.
The throughput contrast is stark. The paper reports that AlphaZero searches about 80 thousand positions per second in chess and 40 thousand in shogi, compared with 70 million positions per second for Stockfish and 35 million for Elmo. AlphaZero therefore cannot win by brute-force node count. It must spend much more compute per evaluated node but use the policy-value network to focus search on better variations. The training bottleneck is then self-play volume: tens of millions of games must be generated with neural MCTS, and the network must be updated fast enough for the latest parameters to improve future games.
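The size of the gap follows directly from the quoted search speeds. The snippet below is a small sketch of that arithmetic; the dictionary layout and the function name are hypothetical, and only the four throughput figures come from the paper.

```python
# Positions searched per second, as quoted above.
POSITIONS_PER_SECOND = {
    "chess": {"alphazero": 80_000, "baseline": 70_000_000},  # baseline: Stockfish
    "shogi": {"alphazero": 40_000, "baseline": 35_000_000},  # baseline: Elmo
}


def node_ratio(game):
    """How many positions the alpha-beta engine searches per AlphaZero position."""
    speeds = POSITIONS_PER_SECOND[game]
    return speeds["baseline"] / speeds["alphazero"]


if __name__ == "__main__":
    for game in POSITIONS_PER_SECOND:
        print(game, node_ratio(game))
```

Both ratios come out at 875, so in each game AlphaZero searches close to a thousand times fewer positions per second than the alpha-beta baseline.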
Method Adaptation
AlphaZero adapts AlphaGo Zero's compute structure to multiple games by removing Go-specific assumptions. It uses a general MCTS guided by a deep network f_theta(s) = (p, v), where p gives move probabilities and v estimates expected outcome. Unlike AlphaGo Zero's binary win/loss value, AlphaZero optimizes expected outcome so draws can be represented. It also removes rotation/reflection augmentation and random board transformations during MCTS, because chess and shogi are not symmetric in the same way as Go.
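To make the draw-aware value target concrete, the sketch below assumes the usual coding of +1 for a win, 0 for a draw, and -1 for a loss. That coding is not stated in this card, so treat it as an assumption. The names are hypothetical. The point is only that averaging such outcomes gives an expected outcome that can sit anywhere between -1 and +1, which a binary win/loss target cannot express.

```python
# ASSUMED outcome coding (not stated in this card): win=+1, draw=0, loss=-1.
OUTCOME_VALUE = {"win": 1.0, "draw": 0.0, "loss": -1.0}


def expected_outcome(results):
    """Mean outcome over a list of results from one player's perspective."""
    if not results:
        raise ValueError("need at least one result")
    return sum(OUTCOME_VALUE[r] for r in results) / len(results)


if __name__ == "__main__":
    # A position that is usually drawn but sometimes won.
    print(expected_outcome(["win", "draw", "draw", "draw"]))
```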
The training loop is simplified for throughput. AlphaGo Zero promoted checkpoints after evaluator matches; AlphaZero maintains a single continually updated network and generates self-play from the latest parameters. This removes an iteration-level evaluator gate from data generation. During training, each MCTS uses 800 simulations, Dirichlet noise is scaled in inverse proportion to the approximate number of legal moves in a typical position for each game, and illegal moves are masked by setting their probabilities to zero and renormalizing over the remaining moves. Board states and actions are encoded as spatial planes derived from the basic rules: 119 input planes and 4672 possible policy moves for chess, 362 input planes and 11259 policy moves for shogi, and the AlphaGo Zero Go representation for Go.
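The illegal-move masking step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name is hypothetical, the policy is a plain Python list rather than a network output over thousands of moves, and the uniform fallback for the case where all legal moves have zero probability is an assumption added here for robustness.

```python
def mask_and_renormalize(probs, legal):
    """Zero out illegal moves, then rescale the rest to sum to 1.

    probs: move probabilities from the policy head (any non-negative floats).
    legal: booleans of the same length, True where the move is legal.
    """
    if len(probs) != len(legal):
        raise ValueError("probs and legal must have the same length")
    masked = [p if ok else 0.0 for p, ok in zip(probs, legal)]
    total = sum(masked)
    if total <= 0.0:
        # ASSUMPTION (not from the paper): fall back to uniform over legal moves.
        n_legal = sum(1 for ok in legal if ok)
        if n_legal == 0:
            raise ValueError("no legal moves")
        return [1.0 / n_legal if ok else 0.0 for ok in legal]
    return [p / total for p in masked]


if __name__ == "__main__":
    policy = [0.5, 0.25, 0.25]
    legal = [True, False, True]
    print(mask_and_renormalize(policy, legal))
```

In the example, the illegal middle move drops to zero and the two legal moves keep their 2:1 ratio after rescaling.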
Evidence
The learning-curve evidence is anchored to wall-clock time and training steps. AlphaZero surpasses Stockfish after 4 hours, or 300K steps, surpasses Elmo after less than 2 hours, or 110K steps, and surpasses AlphaGo Lee after 8 hours, or 165K steps. In the final 100-game matches at 1 minute per move, it scores 25 wins, 25 draws, and 0 losses as White against Stockfish, and 3 wins, 47 draws, and 0 losses as Black. Against Elmo, it scores 43 wins, 2 draws, and 5 losses as first player, and 47 wins, 0 draws, and 3 losses as second player. Against the 3-day AlphaGo Zero, it wins 60-40 across the two Go color assignments shown in the table.
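The win/draw/loss tallies are easier to compare as match points. The sketch below assumes the standard chess scoring of 1 point per win and half a point per draw, and applies the same convention to shogi for comparability. That scoring choice and all names are assumptions of this card, not something the paper prescribes.

```python
# (wins, draws, losses) for AlphaZero, as quoted above.
RESULTS = {
    "Stockfish": {"white": (25, 25, 0), "black": (3, 47, 0)},
    "Elmo": {"first": (43, 2, 5), "second": (47, 0, 3)},
}


def match_points(wins, draws, losses):
    """ASSUMED scoring: 1 per win, 0.5 per draw, 0 per loss."""
    return wins + 0.5 * draws


def total_points(opponent):
    """AlphaZero's points and games played across both seatings."""
    legs = RESULTS[opponent].values()
    points = sum(match_points(*leg) for leg in legs)
    games = sum(sum(leg) for leg in legs)
    return points, games


if __name__ == "__main__":
    for opponent in RESULTS:
        points, games = total_points(opponent)
        print(f"vs {opponent}: {points} / {games}")
```

Under that scoring the tallies work out to 64 of 100 points against Stockfish and 91 of 100 against Elmo.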
The search evidence is equally important. Even though AlphaZero evaluates around a thousand times fewer positions than alpha-beta engines, its MCTS scales more effectively with thinking time in the paper's Elo plots. That supports the central compute claim: neural evaluation is expensive, but a learned policy/value prior can make each evaluated node more useful.
The training-scale evidence from Table S3 also matters. Chess is the cheapest of the three in per-move thinking time, shogi is intermediate, and Go is slowest, reflecting board size, game length, and search cost differences. Yet the same algorithm and hyperparameters work across all three, with only game-rule encodings and exploration-noise scale adjusted.
Historical Effect
AlphaZero generalized the AlphaGo Zero recipe into a canonical compute-intensive pattern: large-scale self-play generation, neural MCTS for policy improvement, and a policy-value network trained from search targets and outcomes. Its historical importance is partly algorithmic and partly infrastructural. It showed that with enough TPU self-play throughput, a single general recipe could beat specialized engines in domains whose search traditions were very different.
Limits
The limits are clear in the paper's own domain-knowledge list. AlphaZero is given perfect game rules, legal moves, termination/scoring rules, board geometry, and feature/action encodings. It handles perfect-information board games, not imperfect-information games. The compute budget is also extraordinary: 5000 TPUs for self-play plus 64 TPUs for network training, with tens of millions of games generated. The paper notes that AlphaGo Master and the larger AlphaGo Zero effort were ultimately trained for 100 times the length of time, so this result is large but still not the highest-compute run in the AlphaGo line.
Links
- Compute regime: history/compute_regimes/search_simulation_science_compute/README.md
- Source PDF and extracted text are listed in metadata above.
- Queue status: read_complete.