Mastering Atari, Go, chess and shogi by planning with a learned model

Metadata

Reading status: read complete
Year: 2019
Compute regime: Search, simulation, and science compute (search_simulation_science_compute)
PDF: 2019-muzero_2019.pdf
Extracted text: 2019-muzero_2019.txt
PDF URL: https://arxiv.org/pdf/1911.08265.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper explicitly states that all experiments ran on third-generation Google Cloud TPUs. The allocation differs sharply by domain. For each board game, Go, chess, and shogi, MuZero uses 16 TPUs for training and 1000 TPUs for self-play. For each Atari game, it uses 8 TPUs for training and 32 TPUs for self-play. The appendix explains the smaller Atari acting fleet: Atari uses only 50 simulations per move instead of 800, and its dynamics function is smaller than the representation function.

Training proceeds for 1 million mini-batches, with batch size 2048 in board games and 1024 in Atari. All reported main experiments unroll the learned model for K = 5 hypothetical steps. During both training and evaluation, MuZero uses 800 simulations per search in board games and 50 simulations per search in Atari. The Atari comparison table lists MuZero at 20.0B environment frames, 12 hours of training, and 1M training steps in the large-data setting; MuZero Reanalyze is listed at 200M frames, 12 hours of training, and 1M training steps in the sample-efficient setting.
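The budgets in this section reduce to a few ratios. The following is a minimal sketch of that arithmetic, using only figures quoted above. The `Budget` class and its field names are hypothetical bookkeeping, and `samples_drawn` counts batch elements drawn over training, not unique positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    """Per-game compute settings as quoted in this card (names are illustrative)."""
    train_tpus: int
    selfplay_tpus: int
    batch_size: int
    simulations: int
    train_steps: int = 1_000_000

    @property
    def acting_to_training_ratio(self) -> float:
        # How many acting TPUs are provisioned per training TPU.
        return self.selfplay_tpus / self.train_tpus

    @property
    def samples_drawn(self) -> int:
        # Batch elements drawn over the whole run (with reuse, so not unique positions).
        return self.batch_size * self.train_steps


BOARD = Budget(train_tpus=16, selfplay_tpus=1000, batch_size=2048, simulations=800)
ATARI = Budget(train_tpus=8, selfplay_tpus=32, batch_size=1024, simulations=50)

if __name__ == "__main__":
    for name, budget in (("board games", BOARD), ("Atari", ATARI)):
        print(name, budget.acting_to_training_ratio, budget.samples_drawn)
    print("simulation ratio:", BOARD.simulations / ATARI.simulations)
```

Read this way, a board game provisions 62.5 acting TPUs per training TPU against 4 for Atari, and each board-game search runs 16 times as many simulations as an Atari search.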

Bottleneck

MuZero's bottleneck is planning without a supplied simulator. AlphaZero can use game rules to move from one board state to the next inside MCTS. MuZero removes that assumption: the model must learn a representation, dynamics function, reward prediction, policy, and value good enough for search. This creates two coupled costs. First, every MCTS simulation calls the learned dynamics and prediction functions, so search is still expensive. Second, model errors can accumulate when search goes deeper, especially in visually complex Atari.

The paper sets different search budgets for board games and Atari around this bottleneck. Board games keep AlphaZero's 800 simulations per move because the branching factor and tactical precision demand deeper search. Atari has a smaller action space and simpler policies, so the authors use 50 simulations per move to speed up experiments. However, the Atari scaling analysis shows that performance plateaus at around 100 simulations per search, probably because learned-model inaccuracy limits the value of more search. In Go, by contrast, the learned model scales well to searches far longer than those used during training.

Method Adaptation

MuZero adapts model-based planning by learning only what search needs. The representation function maps past observations to a hidden state. The dynamics function takes a hidden state and hypothetical action and returns a new hidden state plus immediate reward. The prediction function returns policy and value. There is no requirement that the hidden state reconstruct pixels or match true environment state. This is a direct compute choice: predicting full observations would spend model capacity and search compute on irrelevant visual detail, while predicting reward, value, and policy focuses the latent model on planning.
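The three-function interface described above can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's architecture: `ToyMuZeroModel`, its random linear weights, and the `unroll` helper are all hypothetical stand-ins for the residual networks, and nothing is trained here. The point is only the data flow: the observation is encoded once, and every later step happens in hidden-state space with no pixel reconstruction.

```python
import math
import random
from typing import List, NamedTuple, Sequence, Tuple


class NetworkOutput(NamedTuple):
    hidden_state: List[float]
    reward: float
    policy: List[float]
    value: float


class ToyMuZeroModel:
    """Random linear stand-ins for representation, dynamics, and prediction."""

    def __init__(self, obs_dim: int, hidden_dim: int, num_actions: int, seed: int = 0) -> None:
        rng = random.Random(seed)

        def mat(rows: int, cols: int) -> List[List[float]]:
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.num_actions = num_actions
        self.w_repr = mat(hidden_dim, obs_dim)
        self.w_dyn = mat(hidden_dim, hidden_dim + num_actions)
        self.w_reward = mat(1, hidden_dim + num_actions)
        self.w_policy = mat(num_actions, hidden_dim)
        self.w_value = mat(1, hidden_dim)

    @staticmethod
    def _matvec(w: List[List[float]], x: Sequence[float]) -> List[float]:
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

    def representation(self, observation: Sequence[float]) -> List[float]:
        # Observation -> hidden state; no constraint that it match the true state.
        return [math.tanh(v) for v in self._matvec(self.w_repr, observation)]

    def dynamics(self, hidden: Sequence[float], action: int) -> Tuple[List[float], float]:
        # (hidden state, hypothetical action) -> (next hidden state, immediate reward).
        one_hot = [1.0 if i == action else 0.0 for i in range(self.num_actions)]
        x = list(hidden) + one_hot
        next_hidden = [math.tanh(v) for v in self._matvec(self.w_dyn, x)]
        return next_hidden, self._matvec(self.w_reward, x)[0]

    def prediction(self, hidden: Sequence[float]) -> Tuple[List[float], float]:
        # Hidden state -> (softmax policy, scalar value).
        logits = self._matvec(self.w_policy, hidden)
        top = max(logits)
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        return [e / total for e in exps], self._matvec(self.w_value, hidden)[0]


def unroll(model: ToyMuZeroModel, observation: Sequence[float],
           actions: Sequence[int]) -> List[NetworkOutput]:
    """Encode once, then step forward purely in hidden-state space."""
    hidden = model.representation(observation)
    policy, value = model.prediction(hidden)
    outputs = [NetworkOutput(hidden, 0.0, policy, value)]
    for action in actions:
        hidden, reward = model.dynamics(hidden, action)
        policy, value = model.prediction(hidden)
        outputs.append(NetworkOutput(hidden, reward, policy, value))
    return outputs
```

Calling `unroll` with five actions mirrors the K = 5 unroll quoted earlier: one representation call, five dynamics calls, and six prediction calls.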

The search algorithm remains close to AlphaZero. Each simulation selects actions with an upper-confidence rule, expands a leaf by calling the dynamics and prediction functions, and backs up discounted rewards and value estimates. The paper notes that each simulation makes at most one dynamics call and one prediction call, keeping computational order similar to AlphaZero even though the simulator is learned. For board games, the model uses AlphaZero-like representations and 16 residual blocks rather than AlphaZero's 20, with 256 hidden planes. For Atari, the representation network downsamples 96 x 96 RGB frame histories through stride-2 convolutions and residual blocks to a 6 x 6 hidden state.
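The select, expand, and back-up loop described above can be sketched in a few dozen lines. This is a simplified illustration, not the paper's implementation: the exploration constants `C1` and `C2` and the discount are assumed values, value normalization and root exploration noise are omitted, Q is taken as the child's reward plus its discounted mean value (an assumption of this sketch), and `toy_dynamics`, `toy_prediction`, and `search` are hypothetical stand-ins for a learned model.

```python
import math
from typing import Callable, Dict, List, Tuple

DISCOUNT = 0.997        # assumed Atari-style discount; board games would use 1.0
C1, C2 = 1.25, 19652.0  # assumed exploration constants for the upper-confidence rule


class Node:
    def __init__(self, prior: float) -> None:
        self.prior = prior
        self.visit_count = 0
        self.value_sum = 0.0
        self.reward = 0.0
        self.hidden_state = None
        self.children: Dict[int, "Node"] = {}

    def value(self) -> float:
        return self.value_sum / self.visit_count if self.visit_count else 0.0


def ucb_score(parent: Node, child: Node) -> float:
    # Prior-weighted exploration bonus plus the current action-value estimate.
    explore = C1 + math.log((parent.visit_count + C2 + 1) / C2)
    explore *= math.sqrt(parent.visit_count) / (1 + child.visit_count)
    q = child.reward + DISCOUNT * child.value() if child.visit_count else 0.0
    return q + child.prior * explore


def expand(node: Node, hidden_state, reward: float, policy: List[float]) -> None:
    node.hidden_state, node.reward = hidden_state, reward
    node.children = {a: Node(p) for a, p in enumerate(policy)}


def run_simulation(root: Node, dynamics: Callable, prediction: Callable) -> None:
    """One simulation; the root must already be expanded."""
    # 1. Selection: descend by upper-confidence score until an unexpanded leaf.
    node, path, action = root, [root], -1
    while node.children:
        parent = node
        action, node = max(parent.children.items(),
                           key=lambda kv: ucb_score(parent, kv[1]))
        path.append(node)
    # 2. Expansion: exactly one dynamics call and one prediction call.
    hidden, reward = dynamics(path[-2].hidden_state, action)
    policy, value = prediction(hidden)
    expand(node, hidden, reward, policy)
    # 3. Backup: propagate the discounted return toward the root.
    for visited in reversed(path):
        visited.value_sum += value
        visited.visit_count += 1
        value = visited.reward + DISCOUNT * value


def toy_dynamics(hidden: int, action: int) -> Tuple[int, float]:
    # Hidden state is a depth counter; only action 1 at the root is rewarded.
    return hidden + 1, (1.0 if hidden == 0 and action == 1 else 0.0)


def toy_prediction(hidden: int) -> Tuple[List[float], float]:
    return [0.5, 0.5], 0.0  # uniform prior over two actions, zero value


def search(num_simulations: int) -> Node:
    root = Node(prior=1.0)
    policy, _ = toy_prediction(0)
    expand(root, 0, 0.0, policy)
    for _ in range(num_simulations):
        run_simulation(root, toy_dynamics, toy_prediction)
    return root
```

With the toy model, `search(50)` concentrates most root visits on action 1, the only rewarded move. The cost per simulation is one dynamics call and one prediction call regardless of tree depth, which is the property the paper relies on to keep search cost comparable to AlphaZero.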

MuZero Reanalyze is another compute adaptation. It revisits old time steps and reruns MCTS with the latest network to provide fresher policy targets for 80% of updates. This increases compute per stored sample but improves sample efficiency by turning old experience into newer search targets.
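The 80% figure can be read as a per-update choice between a stored and a refreshed policy target. The sketch below assumes that reading; `policy_target`, `fresh_search`, and the per-sample coin flip are hypothetical, and the paper may schedule the refresh differently.

```python
import random
from typing import Callable, List, Sequence, Tuple


def policy_target(
    stored_policy: Sequence[float],
    observation: object,
    fresh_search: Callable[[object], List[float]],
    rng: random.Random,
    reanalyze_fraction: float = 0.8,
) -> Tuple[List[float], bool]:
    """Return (policy target, was_reanalyzed) for one sampled time step.

    With probability `reanalyze_fraction`, rerun search using the latest
    network (via the caller-supplied `fresh_search`); otherwise reuse the
    search policy recorded when the data was generated.
    """
    if rng.random() < reanalyze_fraction:
        return fresh_search(observation), True
    return list(stored_policy), False


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for MCTS with the latest parameters.
    fake_search = lambda obs: [0.9, 0.1]
    refreshed = sum(
        policy_target([0.5, 0.5], None, fake_search, rng)[1] for _ in range(10_000)
    )
    print(f"refreshed fraction: {refreshed / 10_000:.3f}")
```

The extra cost sits entirely in `fresh_search`: each refreshed sample pays for a full search, which is the compute-for-samples trade described above.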

Evidence

The board-game evidence shows that a learned dynamics model can substitute for a perfect simulator at high playing strength. Figure 2 evaluates MuZero against AlphaZero using 800 simulations per move for both players. The text says MuZero matches AlphaZero on chess and shogi and slightly exceeds AlphaZero on Go, despite using less computation per search node (16 residual blocks per evaluation rather than AlphaZero's 20). The Go search-scaling experiment is especially compute-relevant: both networks are trained at 800 simulations per search, about 0.1 seconds per search, yet MuZero's learned model scales to searches of up to 10 seconds, two orders of magnitude longer than the training search time.

The Atari evidence is broader. In the large-data setting, MuZero reports median human-normalized score 2041.1% and mean 4999.2% with 20.0B frames and 12 hours, compared with R2D2 at 1920.6% median and 4024.9% mean with 37.5B frames and 5 days. The paper states that MuZero outperforms R2D2 in 42 of 57 Atari games and outperforms the previous model-based method SimPLe in all games. In the 200M-frame setting, MuZero Reanalyze reaches 731.1% median and 2168.9% mean, above the listed IMPALA, Rainbow, and LASER medians.
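The percentages above are human-normalized scores aggregated over games. A minimal sketch of the usual convention follows, assuming the paper uses the standard definition in which random play scores 0% and the human reference scores 100%. The per-game numbers in `HYPOTHETICAL` are invented for illustration and are not results from the paper.

```python
from statistics import mean, median
from typing import Dict


def human_normalized(agent: float, random_score: float, human: float) -> float:
    """Percent score where random play maps to 0 and the human reference to 100."""
    return 100.0 * (agent - random_score) / (human - random_score)


def summarize(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Median and mean of per-game human-normalized scores."""
    normalized = [
        human_normalized(s["agent"], s["random"], s["human"]) for s in scores.values()
    ]
    return {"median": median(normalized), "mean": mean(normalized)}


# Invented example values, not taken from the paper.
HYPOTHETICAL = {
    "game_a": {"agent": 1100.0, "random": 100.0, "human": 600.0},
    "game_b": {"agent": 50.0, "random": 0.0, "human": 100.0},
    "game_c": {"agent": 9000.0, "random": 0.0, "human": 1000.0},
}

if __name__ == "__main__":
    print(summarize(HYPOTHETICAL))
```

The example also shows why the mean can sit far above the median: a single game with a very large normalized score pulls the mean up while leaving the median unchanged.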

The simulation-count ablations show the search-budget tradeoff. In Atari, final performance improves with more simulations up to about 100 and then mostly plateaus. In Ms. Pacman, MuZero learns an effective policy with as few as 6 simulations per move, but it learns faster with more simulations. This supports the paper's smaller Atari acting TPU allocation while also showing that search remains part of the learning signal.

Historical Effect

MuZero extended the AlphaZero compute recipe beyond domains with known rules. Its historical effect was to make model-based planning competitive in both precise board games and visually rich Atari without requiring a perfect simulator. It reframed the learned model as a value-equivalent planning device rather than a pixel predictor, which helped separate model-based RL from expensive observation reconstruction.

Limits

The paper explicitly says imperfect-information games such as poker are not directly addressed. The learned dynamics are deterministic in the reported model, with stochastic extensions left for future work. MuZero also remains compute-heavy: board games use 1000 acting TPUs per game, Atari still uses 32 acting TPUs per game, and every action is selected by running MCTS except in ablation runs. Finally, the Atari planning plateau suggests that learned-model error limits how much additional inference compute can help in visually complex environments.