Source: History of AI Computing Architecture

Mastering the Game of Go without Human Knowledge
David Silver*, Julian Schrittwieser*, Karen Simonyan*, Ioannis Antonoglou, Aja Huang, Arthur
Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy
Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, Demis Hassabis.

DeepMind, 5 New Street Square, London EC4A 3TW.

*These authors contributed equally to this work.

      A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, su-
perhuman proficiency in challenging domains. Recently, AlphaGo became the first program
to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated posi-
tions and selected moves using deep neural networks. These neural networks were trained
by supervised learning from human expert moves, and by reinforcement learning from self-
play. Here, we introduce an algorithm based solely on reinforcement learning, without hu-
man data, guidance, or domain knowledge beyond game rules. AlphaGo becomes its own
teacher: a neural network is trained to predict AlphaGo’s own move selections and also the
winner of AlphaGo’s games. This neural network improves the strength of tree search, re-
sulting in higher quality move selection and stronger self-play in the next iteration. Starting
tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning
100-0 against the previously published, champion-defeating AlphaGo.


      Much progress towards artificial intelligence has been made using supervised learning sys-
tems that are trained to replicate the decisions of human experts 1–4 . However, expert data is often
expensive, unreliable, or simply unavailable. Even when reliable data is available it may impose a
ceiling on the performance of systems trained in this manner 5 . In contrast, reinforcement learn-
ing systems are trained from their own experience, in principle allowing them to exceed human
capabilities, and to operate in domains where human expertise is lacking. Recently, there has been
rapid progress towards this goal, using deep neural networks trained by reinforcement learning.
These systems have outperformed humans in computer games such as Atari 6, 7 and 3D virtual
environments 8–10 . However, the most challenging domains in terms of human intellect – such as the
game of Go, widely viewed as a grand challenge for artificial intelligence 11 – require precise and
sophisticated lookahead in vast search spaces. Fully general methods have not previously achieved
human-level performance in these domains.


      AlphaGo was the first program to achieve superhuman performance in Go. The published
version 12 , which we refer to as AlphaGo Fan, defeated the European champion Fan Hui in October
2015. AlphaGo Fan utilised two deep neural networks: a policy network that outputs move prob-
abilities, and a value network that outputs a position evaluation. The policy network was trained
initially by supervised learning to accurately predict human expert moves, and was subsequently
refined by policy-gradient reinforcement learning. The value network was trained to predict the
winner of games played by the policy network against itself. Once trained, these networks were
combined with a Monte-Carlo Tree Search (MCTS) 13–15 to provide a lookahead search, using the
policy network to narrow down the search to high-probability moves, and using the value net-
work (in conjunction with Monte-Carlo rollouts using a fast rollout policy) to evaluate positions in
the tree. A subsequent version, which we refer to as AlphaGo Lee, used a similar approach (see
Methods), and defeated Lee Sedol, the winner of 18 international titles, in March 2016.


      Our program, AlphaGo Zero, differs from AlphaGo Fan and AlphaGo Lee 12 in several im-
portant aspects. First and foremost, it is trained solely by self-play reinforcement learning, starting
from random play, without any supervision or use of human data. Second, it only uses the black
and white stones from the board as input features. Third, it uses a single neural network, rather
than separate policy and value networks. Finally, it uses a simpler tree search that relies upon this
single neural network to evaluate positions and sample moves, without performing any Monte-
Carlo rollouts. To achieve these results, we introduce a new reinforcement learning algorithm that
incorporates lookahead search inside the training loop, resulting in rapid improvement and precise
and stable learning. Further technical differences in the search algorithm, training procedure and
network architecture are described in Methods.




1   Reinforcement Learning in AlphaGo Zero


Our new method uses a deep neural network f_θ with parameters θ. This neural network takes as an
input the raw board representation s of the position and its history, and outputs both move
probabilities and a value, (p, v) = f_θ(s). The vector of move probabilities p represents the
probability of selecting each move (including pass), p_a = Pr(a|s). The value v is a scalar
evaluation, estimating the probability of the current player winning from position s. This neural
network combines the roles of both policy network and value network 12 into a single architecture.
The neural network consists of many residual blocks 4 of convolutional layers 16, 17 with batch
normalisation 18 and rectifier non-linearities 19 (see Methods).
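
As a rough illustration of the two outputs (p, v) = f_θ(s), the sketch below maps a feature vector to a move distribution and a scalar value. Everything in it is an assumption made for illustration: the names `dual_head`, `w_policy` and `w_value` are hypothetical, the random `features` vector stands in for the output of the residual tower, and the single linear layer per head and the tanh squashing of the value are simplifications that this section does not specify.

```python
import numpy as np

BOARD_POINTS = 19 * 19          # intersections on a Go board
NUM_MOVES = BOARD_POINTS + 1    # every point plus "pass"

def dual_head(features, w_policy, w_value):
    """Map shared features to (p, v); a toy stand-in for f_theta(s)."""
    logits = features @ w_policy                 # one logit per move
    logits = logits - logits.max()               # stabilise the softmax
    p = np.exp(logits) / np.exp(logits).sum()    # move probabilities, sum to 1
    v = float(np.tanh(features @ w_value))       # scalar squashed into [-1, 1] (assumed)
    return p, v

rng = np.random.default_rng(0)
features = rng.normal(size=64)                   # hypothetical trunk output
p, v = dual_head(features,
                 rng.normal(size=(64, NUM_MOVES)) * 0.1,
                 rng.normal(size=64) * 0.1)
```

Both heads read the same features, which is the sense in which a single network plays the roles of policy and value network.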


      The neural network in AlphaGo Zero is trained from games of self-play by a novel reinforce-
ment learning algorithm. In each position s, an MCTS search is executed, guided by the neural
network fθ . The MCTS search outputs probabilities π of playing each move. These search proba-
bilities usually select much stronger moves than the raw move probabilities p of the neural network
fθ (s); MCTS may therefore be viewed as a powerful policy improvement operator 20, 21 . Self-play
with search – using the improved MCTS-based policy to select each move, then using the game
winner z as a sample of the value – may be viewed as a powerful policy evaluation operator. The
main idea of our reinforcement learning algorithm is to use these search operators repeatedly in
a policy iteration procedure 22, 23 : the neural network’s parameters are updated to make the move
probabilities and value (p, v) = fθ (s) more closely match the improved search probabilities and
self-play winner (π , z); these new parameters are used in the next iteration of self-play to make the
search even stronger. Figure 1 illustrates the self-play training pipeline.


      The Monte-Carlo tree search uses the neural network f_θ to guide its simulations (see Figure
2). Each edge (s, a) in the search tree stores a prior probability P(s, a), a visit count N(s, a),
and an action-value Q(s, a). Each simulation starts from the root state and iteratively selects
moves that maximise an upper confidence bound Q(s, a) + U(s, a), where U(s, a) ∝ P(s, a)/(1 +
N(s, a)) 12, 24 , until a leaf node s′ is encountered. This leaf position is expanded and evaluated just


Figure 1: Self-play reinforcement learning in AlphaGo Zero. a The program plays a game s_1, ..., s_T against itself.
In each position s_t, a Monte-Carlo tree search (MCTS) α_θ is executed (see Figure 2) using the latest neural network
f_θ. Moves are selected according to the search probabilities computed by the MCTS, a_t ∼ π_t. The terminal position
s_T is scored according to the rules of the game to compute the game winner z. b Neural network training in AlphaGo
Zero. The neural network takes the raw board position s_t as its input, passes it through many convolutional layers
with parameters θ, and outputs both a vector p_t, representing a probability distribution over moves, and a scalar value
v_t, representing the probability of the current player winning in position s_t. The neural network parameters θ are
updated so as to maximise the similarity of the policy vector p_t to the search probabilities π_t, and to minimise the
error between the predicted winner v_t and the game winner z (see Equation 1). The new parameters are used in the
next iteration of self-play a.




Figure 2: Monte-Carlo tree search in AlphaGo Zero. a Each simulation traverses the tree by selecting the edge
with maximum action-value Q, plus an upper confidence bound U that depends on a stored prior probability P and
visit count N for that edge (which is incremented once traversed). b The leaf node is expanded and the associated
position s is evaluated by the neural network (P(s, ·), V(s)) = f_θ(s); the vector of P values is stored in the outgoing
edges from s. c Action-values Q are updated to track the mean of all evaluations V in the subtree below that action. d
Once the search is complete, search probabilities π are returned, proportional to N^{1/τ}, where N is the visit count of
each move from the root state and τ is a parameter controlling temperature.


once by the network to generate both prior probabilities and evaluation, (P(s′, ·), V(s′)) = f_θ(s′).
Each edge (s, a) traversed in the simulation is updated to increment its visit count N(s, a), and to
update its action-value to the mean evaluation over these simulations,

    Q(s, a) = \frac{1}{N(s, a)} \sum_{s' \mid s, a \to s'} V(s'),

where s, a → s′ indicates that a simulation eventually reached s′ after taking move a from position
s.
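
The selection and backup steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Edge`, `select_action`, `backup` and `c_puct` are hypothetical, the constant of proportionality in U(s, a) is taken to be a plain constant because the text only states U(s, a) ∝ P(s, a)/(1 + N(s, a)), and the sketch keeps all values in a single perspective rather than alternating between the two players.

```python
from dataclasses import dataclass

@dataclass
class Edge:
    prior: float            # P(s, a), supplied by the network
    visits: int = 0         # N(s, a)
    value_sum: float = 0.0  # running sum of leaf evaluations V(s') below this edge

    @property
    def q(self) -> float:
        # Q(s, a) is the mean evaluation; taken as 0 for unvisited edges (assumed).
        return self.value_sum / self.visits if self.visits else 0.0

def select_action(edges, c_puct=1.0):
    """Return the action maximising Q(s, a) + U(s, a)."""
    def score(item):
        _, edge = item
        u = c_puct * edge.prior / (1 + edge.visits)  # U proportional to P / (1 + N)
        return edge.q + u
    return max(edges.items(), key=score)[0]

def backup(path, leaf_value):
    """Increment N and update the mean Q on every edge traversed in one simulation."""
    for edge in path:
        edge.visits += 1
        edge.value_sum += leaf_value

edges = {"a": Edge(prior=0.7), "b": Edge(prior=0.3)}
first = select_action(edges)           # the higher prior wins while nothing is visited
backup([edges[first]], leaf_value=-1.0)
second = select_action(edges)          # a poor evaluation shifts the choice elsewhere
```

Storing the sum of evaluations rather than the mean keeps the backup to two additions per edge, with Q recovered on demand.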


       MCTS may be viewed as a self-play algorithm that, given neural network parameters θ and
a root position s, computes a vector of search probabilities recommending moves to play, π =
α_θ(s), proportional to the exponentiated visit count for each move, π_a ∝ N(s, a)^{1/τ}, where τ is a
temperature parameter.
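
A small sketch of how root visit counts could be turned into search probabilities with a temperature. The function name `search_probabilities` is hypothetical; dividing by the largest count before exponentiating and treating τ = 0 as the greedy limit are choices made for this sketch, and at least one visit is assumed.

```python
import numpy as np

def search_probabilities(visit_counts, tau=1.0):
    """Return pi with pi_a proportional to N(s, a) ** (1 / tau)."""
    counts = np.asarray(visit_counts, dtype=np.float64)
    if tau == 0:
        # Limit tau -> 0: all probability on the most-visited move(s).
        best = counts == counts.max()
        return best / best.sum()
    # Dividing by the maximum leaves the ratios unchanged and avoids overflow
    # when 1 / tau is large.
    scaled = (counts / counts.max()) ** (1.0 / tau)
    return scaled / scaled.sum()

pi_proportional = search_probabilities([10, 30, 60], tau=1.0)
pi_sharper = search_probabilities([10, 30, 60], tau=0.5)
pi_greedy = search_probabilities([10, 30, 60], tau=0)
```

With τ = 1 the probabilities are simply the normalised visit counts; smaller τ concentrates probability on the most-visited move.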


       The neural network is trained by a self-play reinforcement learning algorithm that uses
MCTS to play each move. First, the neural network is initialised to random weights θ_0. At each
subsequent iteration i ≥ 1, games of self-play are generated (Figure 1a). At each time-step t,
an MCTS search π_t = α_{θ_{i−1}}(s_t) is executed using the previous iteration of neural network
f_{θ_{i−1}}, and a move is played by sampling the search probabilities π_t. A game terminates at step T
when both players pass, when the search value drops below a resignation threshold, or when the game
exceeds a maximum length; the game is then scored to give a final reward of r_T ∈ {−1, +1} (see
Methods for details). The data for each time-step t is stored as (s_t, π_t, z_t) where z_t = ±r_T is
the game winner from the perspective of the current player at step t. In parallel (Figure 1b), new
network parameters θ_i are trained from data (s, π, z) sampled uniformly among all time-steps of
the last iteration(s) of self-play. The neural network (p, v) = f_{θ_i}(s) is adjusted to minimise the
error between the predicted value v and the self-play winner z, and to maximise the similarity of
the neural network move probabilities p to the search probabilities π. Specifically, the parameters
θ are adjusted by gradient descent on a loss function l that sums over mean-squared error and
cross-entropy losses respectively,




    (p, v) = f_\theta(s), \qquad l = (z - v)^2 - \pi^{\top} \log p + c \lVert \theta \rVert^2 \qquad (1)

where c is a parameter controlling the level of L2 weight regularisation (to prevent overfitting).
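
To make the three terms of Equation 1 concrete, the following sketch evaluates the loss for a single training example. The function name `alphago_zero_loss`, the default value of `c` and the small `eps` added inside the logarithm are assumptions for illustration; the equation itself only fixes the form of the squared-error, cross-entropy and L2 terms.

```python
import numpy as np

def alphago_zero_loss(p, v, pi, z, theta, c=1e-4, eps=1e-12):
    """Evaluate l = (z - v)^2 - pi^T log p + c * ||theta||^2 for one example."""
    p = np.asarray(p, dtype=np.float64)
    pi = np.asarray(pi, dtype=np.float64)
    value_loss = (z - v) ** 2                      # mean-squared error term
    policy_loss = -np.sum(pi * np.log(p + eps))    # cross-entropy term
    l2_penalty = c * sum(np.sum(np.square(w)) for w in theta)  # weight regularisation
    return value_loss + policy_loss + l2_penalty

# Uniform prediction over two moves, search prefers the first move, and the
# predicted value is 0 while the current player went on to win (z = +1).
loss = alphago_zero_loss(p=[0.5, 0.5], v=0.0, pi=[1.0, 0.0], z=1.0,
                         theta=[np.zeros(3)])
```

In this example the value term contributes 1, the policy term contributes log 2, and the regulariser contributes nothing because the weights are zero.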


2   Empirical Analysis of AlphaGo Zero Training


We applied our reinforcement learning pipeline to train our program AlphaGo Zero. Training
started from completely random behaviour and continued without human intervention for approx-
imately 3 days.


      Over the course of training, 4.9 million games of self-play were generated, using 1,600 simu-
lations for each MCTS, which corresponds to approximately 0.4s thinking time per move. Param-
eters were updated from 700,000 mini-batches of 2,048 positions. The neural network contained
20 residual blocks (see Methods for further details).


      Figure 3a shows the performance of AlphaGo Zero during self-play reinforcement learning,
as a function of training time, on an Elo scale 25 . Learning progressed smoothly throughout train-
ing, and did not suffer from the oscillations or catastrophic forgetting suggested in prior literature

Figure 3: Empirical evaluation of AlphaGo Zero. a Performance of self-play reinforcement learning. The plot
shows the performance of each MCTS player α_{θ_i} from each iteration i of reinforcement learning in AlphaGo Zero.
Elo ratings were computed from evaluation games between different players, using 0.4 seconds of thinking time per
move (see Methods). For comparison, a similar player trained by supervised learning from human data, using the
KGS data-set, is also shown. b Prediction accuracy on human professional moves. The plot shows the accuracy of the
neural network f_{θ_i}, at each iteration of self-play i, in predicting human professional moves from the GoKifu data-set.
The accuracy measures the percentage of positions in which the neural network assigns the highest probability to the
human move. The accuracy of a neural network trained by supervised learning is also shown. c Mean-squared error
(MSE) on human professional game outcomes. The plot shows the MSE of the neural network f_{θ_i}, at each iteration
of self-play i, in predicting the outcome of human professional games from the GoKifu data-set. The MSE is between
the actual outcome z ∈ {−1, +1} and the neural network value v, scaled by a factor of 1/4 to the range [0, 1]. The MSE
of a neural network trained by supervised learning is also shown.




26–28. Surprisingly, AlphaGo Zero outperformed AlphaGo Lee after just 36 hours; for comparison,
AlphaGo Lee was trained over several months. After 72 hours, we evaluated AlphaGo Zero
against the exact version of AlphaGo Lee that defeated Lee Sedol, under the 2 hour time controls
and match conditions as were used in the man-machine match in Seoul (see Methods). AlphaGo
Zero used a single machine with 4 Tensor Processing Units (TPUs) 29 , while AlphaGo Lee was
distributed over many machines and used 48 TPUs. AlphaGo Zero defeated AlphaGo Lee by 100
games to 0 (see Extended Data Figure 5 and Supplementary Information).


         To assess the merits of self-play reinforcement learning, compared to learning from human
data, we trained a second neural network (using the same architecture) to predict expert moves
in the KGS data-set; this achieved state-of-the-art prediction accuracy compared to prior work
12, 30–33 (see Extended Data Table 1 and 2 respectively). Supervised learning achieved better initial
performance, and was better at predicting the outcome of human professional games (Figure 3).
Notably, although supervised learning achieved higher move prediction accuracy, the self-learned
player performed much better overall, defeating the human-trained player within the first 24 hours
of training. This suggests that AlphaGo Zero may be learning a strategy that is qualitatively differ-
ent to human play.


         To separate the contributions of architecture and algorithm, we compared the performance
of the neural network architecture in AlphaGo Zero with the previous neural network architecture
used in AlphaGo Lee (see Figure 4). Four neural networks were created, using either separate
policy and value networks, as in AlphaGo Lee, or combined policy and value networks, as in
AlphaGo Zero; and using either the convolutional network architecture from AlphaGo Lee or the
residual network architecture from AlphaGo Zero. Each network was trained to minimise the
same loss function (Equation 1) using a fixed data-set of self-play games generated by AlphaGo
Zero after 72 hours of self-play training. Using a residual network was more accurate, achieved
lower error, and improved performance in AlphaGo by over 600 Elo. Combining policy and value
together into a single network slightly reduced the move prediction accuracy, but reduced the value
error and boosted playing performance in AlphaGo by around another 600 Elo. This is partly due to


Figure 4: Comparison of neural network architectures in AlphaGo Zero and AlphaGo Lee. Comparison of
neural network architectures using either separate (“sep”) or combined policy and value networks (“dual”), and using
either convolutional (“conv”) or residual networks (“res”). The combinations “dual-res” and “sep-conv” correspond
to the neural network architectures used in AlphaGo Zero and AlphaGo Lee respectively. Each network was trained on
a fixed data-set generated by a previous run of AlphaGo Zero. a Each trained network was combined with AlphaGo
Zero’s search to obtain a different player. Elo ratings were computed from evaluation games between these different
players, using 5 seconds of thinking time per move. b Prediction accuracy on human professional moves (from the
GoKifu data-set) for each network architecture. c Mean-squared error on human professional game outcomes (from
the GoKifu data-set) for each network architecture.




improved computational efficiency, but more importantly the dual objective regularises the network
to a common representation that supports multiple use cases.


3   Knowledge Learned by AlphaGo Zero


AlphaGo Zero discovered a remarkable level of Go knowledge during its self-play training process.
This included fundamental elements of human Go knowledge, and also non-standard strategies
beyond the scope of traditional Go knowledge.


      Figure 5 shows a timeline indicating when professional joseki (corner sequences) were dis-
covered (Figure 5a, Extended Data Figure 1); ultimately AlphaGo Zero preferred new joseki vari-
ants that were previously unknown (Figure 5b, Extended Data Figure 2). Figure 5c and the Sup-
plementary Information show several fast self-play games played at different stages of training.
Tournament length games played at regular intervals throughout training are shown in Extended
Data Figure 3 and Supplementary Information. AlphaGo Zero rapidly progressed from entirely
random moves towards a sophisticated understanding of Go concepts including fuseki (opening),
tesuji (tactics), life-and-death, ko (repeated board situations), yose (endgame), capturing races,
sente (initiative), shape, influence and territory, all discovered from first principles. Surprisingly,
shicho (“ladder” capture sequences that may span the whole board) – one of the first elements of
Go knowledge learned by humans – were only understood by AlphaGo Zero much later in training.


4   Final Performance of AlphaGo Zero


We subsequently applied our reinforcement learning pipeline to a second instance of AlphaGo Zero
using a larger neural network and over a longer duration. Training again started from completely
random behaviour and continued for approximately 40 days.


      Over the course of training, 29 million games of self-play were generated. Parameters were
updated from 3.1 million mini-batches of 2,048 positions each. The neural network contained


Figure 5: Go knowledge learned by AlphaGo Zero. a Five human joseki (common corner sequences) discovered
during AlphaGo Zero training. The associated timestamps indicate the first time each sequence occurred (taking account
of rotation and reflection) during self-play training. Extended Data Figure 1 provides the frequency of occurrence over
training for each sequence. b Five joseki favoured at different stages of self-play training. Each displayed corner
sequence was played with the greatest frequency, among all corner sequences, during an iteration of self-play training.
The timestamp of that iteration is indicated on the timeline. At 10 hours a weak corner move was preferred. At 47
hours the 3-3 invasion was most frequently played. This joseki is also common in human professional play; however
AlphaGo Zero later discovered and preferred a new variation. Extended Data Figure 2 provides the frequency of
occurrence over time for all five sequences and the new variation. c The first 80 moves of three self-play games that
were played at different stages of training, using 1,600 simulations (around 0.4s) per search. At 3 hours, the game
focuses greedily on capturing stones, much like a human beginner. At 19 hours, the game exhibits the fundamentals
of life-and-death, influence and territory. At 70 hours, the game is beautifully balanced, involving multiple battles and
a complicated ko fight, eventually resolving into a half-point win for white. See Supplementary Information for the
full games.
40 residual blocks. The learning curve is shown in Figure 6a. Games played at regular intervals
throughout training are shown in Extended Data Figure 4 and Supplementary Information.


       We evaluated the fully trained AlphaGo Zero using an internal tournament against AlphaGo
Fan, AlphaGo Lee, and several previous Go programs. We also played games against the strongest
existing program, AlphaGo Master – a program based on the algorithm and architecture presented
in this paper but utilising human data and features (see Methods) – which defeated the strongest
human professional players 60–0 in online games 34 in January 2017. In our evaluation, all pro-
grams were allowed 5 seconds of thinking time per move; AlphaGo Zero and AlphaGo Master
each played on a single machine with 4 TPUs; AlphaGo Fan and AlphaGo Lee were distributed
over 176 GPUs and 48 TPUs respectively. We also included a player based solely on the raw neural
network of AlphaGo Zero; this player simply selected the move with maximum probability.


       Figure 6b shows the performance of each program on an Elo scale. The raw neural network,
without using any lookahead, achieved an Elo rating of 3,055. AlphaGo Zero achieved a rating
of 5,185, compared to 4,858 for AlphaGo Master, 3,739 for AlphaGo Lee and 3,144 for AlphaGo
Fan.
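
To give a feel for what these rating gaps mean, the sketch below converts an Elo difference into an expected win probability. It assumes the conventional base-10 logistic Elo curve with a 400-point scale, which this section does not spell out; the function name `elo_win_probability` is hypothetical. Under that assumption a 200 point gap gives a win probability of about 0.76, in line with the roughly 75% quoted in the Figure 6 caption.

```python
def elo_win_probability(rating_a, rating_b):
    """Expected score of player A against player B under the standard Elo curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# Ratings reported in the text for the 5-seconds-per-move tournament.
ratings = {
    "AlphaGo Zero": 5185,
    "AlphaGo Master": 4858,
    "AlphaGo Lee": 3739,
    "AlphaGo Fan": 3144,
    "Raw network": 3055,
}

gap_200 = elo_win_probability(200, 0)
zero_vs_master = elo_win_probability(ratings["AlphaGo Zero"], ratings["AlphaGo Master"])
zero_vs_lee = elo_win_probability(ratings["AlphaGo Zero"], ratings["AlphaGo Lee"])
```

The curve saturates quickly: once the gap exceeds about a thousand points, as between AlphaGo Zero and AlphaGo Lee, the weaker player is expected to win almost no games.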


       Finally, we evaluated AlphaGo Zero head to head against AlphaGo Master in a 100 game
match with 2 hour time controls. AlphaGo Zero won by 89 games to 11 (see Extended Data Figure
6 and Supplementary Information).


5   Conclusion


Our results comprehensively demonstrate that a pure reinforcement learning approach is fully
feasible, even in the most challenging of domains: it is possible to train to superhuman level, without
human examples or guidance, given no knowledge of the domain beyond basic rules. Furthermore,
a pure reinforcement learning approach requires just a few more hours to train, and achieves
much better asymptotic performance, compared to training on human expert data. Using this
approach, AlphaGo Zero defeated the strongest previous versions of AlphaGo, which were trained
from human data using handcrafted features, by a large margin.

Figure 6: Performance of AlphaGo Zero. a Learning curve for AlphaGo Zero using larger 40 block residual network
over 40 days. The plot shows the performance of each player α_{θ_i} from each iteration i of our reinforcement learning
algorithm. Elo ratings were computed from evaluation games between different players, using 0.4 seconds per search
(see Methods). b Final performance of AlphaGo Zero. AlphaGo Zero was trained for 40 days using a 40 residual block
neural network. The plot shows the results of a tournament between: AlphaGo Zero, AlphaGo Master (defeated top
human professionals 60-0 in online games), AlphaGo Lee (defeated Lee Sedol), AlphaGo Fan (defeated Fan Hui), as
well as previous Go programs Crazy Stone, Pachi and GnuGo. Each program was given 5 seconds of thinking time
per move. AlphaGo Zero and AlphaGo Master played on a single machine on the Google Cloud; AlphaGo Fan and
AlphaGo Lee were distributed over many machines. The raw neural network from AlphaGo Zero is also included,
which directly selects the move a with maximum probability p_a, without using MCTS. Programs were evaluated on
an Elo scale 25 : a 200 point gap corresponds to a 75% probability of winning.


      Humankind has accumulated Go knowledge from millions of games played over thousands
of years, collectively distilled into patterns, proverbs and books. In the space of a few days, starting
tabula rasa, AlphaGo Zero was able to rediscover much of this Go knowledge, as well as novel
strategies that provide new insights into the oldest of games.




References


 1. Friedman, J., Hastie, T. & Tibshirani, R. The Elements of Statistical Learning: Data Mining,
    Inference, and Prediction (Springer-Verlag, 2009).

 2. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).

 3. Krizhevsky, A., Sutskever, I. & Hinton, G. ImageNet classification with deep convolutional
    neural networks. In Advances in Neural Information Processing Systems, 1097–1105 (2012).

 4. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In IEEE
    Conference on Computer Vision and Pattern Recognition, 770–778 (2016).

 5. Hayes-Roth, F., Waterman, D. & Lenat, D. Building expert systems (Addison-Wesley, 1984).

 6. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–
    533 (2015).

 7. Guo, X., Singh, S. P., Lee, H., Lewis, R. L. & Wang, X. Deep learning for real-time Atari
    game play using offline Monte-Carlo tree search planning. In Advances in Neural Information
    Processing Systems, 3338–3346 (2014).

 8. Mnih, V. et al. Asynchronous methods for deep reinforcement learning. In International
    Conference on Machine Learning, 1928–1937 (2016).

 9. Jaderberg, M. et al. Reinforcement learning with unsupervised auxiliary tasks. International
    Conference on Learning Representations (2017).

10. Dosovitskiy, A. & Koltun, V. Learning to act by predicting the future. In International Con-
    ference on Learning Representations (2017).

11. Mandziuk, J. Computational intelligence in mind games. In Challenges for Computational
    Intelligence, 407–442 (2007).




12. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature
    529, 484–489 (2016).

13. Coulom, R. Efficient selectivity and backup operators in Monte-Carlo tree search. In Interna-
    tional Conference on Computers and Games, 72–83 (2006).

14. Kocsis, L. & Szepesvári, C. Bandit based Monte-Carlo planning. In 15th European Conference
    on Machine Learning, 282–293 (2006).

15. Browne, C. et al. A survey of Monte-Carlo tree search methods. IEEE Transactions on
    Computational Intelligence and AI in Games 4, 1–43 (2012).

16. Fukushima, K. Neocognitron: A self-organizing neural network model for a mechanism of
    pattern recognition unaffected by shift in position. Biological Cybernetics 36, 193–202 (1980).

17. LeCun, Y. & Bengio, Y. Convolutional networks for images, speech, and time series. In Arbib,
    M. (ed.) The Handbook of Brain Theory and Neural Networks, chap. 3, 276–278 (MIT Press,
    1995).

18. Ioffe, S. & Szegedy, C. Batch normalization: Accelerating deep network training by reducing
    internal covariate shift. In International Conference on Machine Learning, 448–456 (2015).

19. Hahnloser, R. H. R., Sarpeshkar, R., Mahowald, M. A., Douglas, R. J. & Seung, H. S. Digital
    selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405,
    947–951 (2000).

20. Howard, R. Dynamic Programming and Markov Processes (MIT Press, 1960).

21. Sutton, R. & Barto, A. Reinforcement Learning: an Introduction (MIT Press, 1998).

22. Bertsekas, D. P. Approximate policy iteration: a survey and some new methods. Journal of
    Control Theory and Applications 9, 310–335 (2011).

23. Scherrer, B. Approximate policy iteration schemes: A comparison. In International Confer-
    ence on Machine Learning, 1314–1322 (2014).

24. Rosin, C. D. Multi-armed bandits with episode context. Annals of Mathematics and Artificial
    Intelligence 61, 203–230 (2011).

25. Coulom, R. Whole-history rating: A Bayesian rating system for players of time-varying
    strength. In International Conference on Computers and Games, 113–124 (2008).

26. Laurent, G. J., Matignon, L. & Le Fort-Piat, N. The world of Independent learners is not
    Markovian. International Journal of Knowledge-Based and Intelligent Engineering Systems
    15, 55–64 (2011).

27. Foerster, J. N. et al. Stabilising experience replay for deep multi-agent reinforcement learning.
    In International Conference on Machine Learning (2017).

28. Heinrich, J. & Silver, D. Deep reinforcement learning from self-play in imperfect-information
    games. In NIPS Deep Reinforcement Learning Workshop (2016).

29. Jouppi, N. P., Young, C., Patil, N. et al. In-datacenter performance analysis of a tensor pro-
    cessing unit. In Proceedings of the 44th Annual International Symposium on Computer Archi-
    tecture, ISCA ’17, 1–12 (ACM, 2017).

30. Maddison, C. J., Huang, A., Sutskever, I. & Silver, D. Move evaluation in Go using deep con-
    volutional neural networks. In International Conference on Learning Representations (2015).

31. Clark, C. & Storkey, A. J. Training deep convolutional neural networks to play Go. In Inter-
    national Conference on Machine Learning, 1766–1774 (2015).

32. Tian, Y. & Zhu, Y. Better computer Go player with neural network and long-term prediction.
    In International Conference on Learning Representations (2016).

33. Cazenave, T. Residual networks for computer Go. IEEE Transactions on Computational
    Intelligence and AI in Games (2017).

34. Huang, A. AlphaGo Master online series of games (2017). URL:
    https://deepmind.com/research/alphago/match-archive/master.


Supplementary Information


Supplementary Information is available in the online version of the paper.


Acknowledgements


We thank A. Cain for work on the visuals; A. Barreto, G. Ostrovski, T. Ewalds, T. Schaul, J. Oh
and N. Heess for reviewing the paper; and the rest of the DeepMind team for their support.


Author Contributions


D.S., J.S., K.S., I.A., A.G., L.S. and T.H. designed and implemented the reinforcement learning al-
gorithm in AlphaGo Zero. A.H., J.S., M.L., D.S. designed and implemented the search in AlphaGo
Zero. L.B., J.S., A.H., F.H., T.H., Y.C., D.S. designed and implemented the evaluation framework
for AlphaGo Zero. D.S., A.B., F.H., A.G., T.L., T.G., L.S., G.v.d.D., D.H. managed and advised
on the project. D.S., T.G., A.G. wrote the paper.


Author Information


Reprints and permissions information is available at www.nature.com/reprints. The authors
declare no competing financial interests. Readers are welcome to comment on the online version
of the paper. Correspondence and requests for materials should be addressed to D.S.
(davidsilver@google.com).




Methods

Reinforcement learning Policy iteration 20, 21 is a classic algorithm that generates a sequence of
improving policies, by alternating between policy evaluation – estimating the value function of
the current policy – and policy improvement – using the current value function to generate a better
policy. A simple approach to policy evaluation is to estimate the value function from the outcomes
of sampled trajectories 35, 36 . A simple approach to policy improvement is to select actions greedily
with respect to the value function 20 . In large state spaces, approximations are necessary to evaluate
each policy and to represent its improvement 22, 23 .

         Classification-based reinforcement learning 37 improves the policy using a simple Monte-
Carlo search. Many rollouts are executed for each action; the action with the maximum mean value
provides a positive training example, while all other actions provide negative training examples; a
policy is then trained to classify actions as positive or negative, and used in subsequent rollouts.
This may be viewed as a precursor to the policy component of AlphaGo Zero’s training algorithm
when τ → 0.

         A more recent instantiation, classification-based modified policy iteration (CBMPI), also
performs policy evaluation by regressing a value function towards truncated rollout values, similar
to the value component of AlphaGo Zero; this achieved state-of-the-art results in the game of Tetris 38 .
However, this prior work was limited to simple rollouts and linear function approximation using
handcrafted features.

         The AlphaGo Zero self-play algorithm can similarly be understood as an approximate pol-
icy iteration scheme in which MCTS is used for both policy improvement and policy evaluation.
Policy improvement starts with a neural network policy, executes an MCTS based on that policy’s
recommendations, and then projects the (much stronger) search policy back into the function space
of the neural network. Policy evaluation is applied to the (much stronger) search policy: the out-
comes of self-play games are also projected back into the function space of the neural network.
These projection steps are achieved by training the neural network parameters to match the search
probabilities and self-play game outcome respectively.

          Guo et al. 7 also project the output of MCTS into a neural network, either by regressing
a value network towards the search value, or by classifying the action selected by MCTS. This
approach was used to train a neural network for playing Atari games; however, the MCTS was
fixed — there was no policy iteration — and did not make any use of the trained networks.

Self-play reinforcement learning in games Our approach is most directly applicable to zero-sum
games of perfect information. We follow the formalism of alternating Markov games described in
previous work 12 , noting that algorithms based on value or policy iteration extend naturally to this
setting 39 .

          Self-play reinforcement learning has previously been applied to the game of Go. NeuroGo 40, 41
used a neural network to represent a value function, using a sophisticated architecture based
on Go knowledge regarding connectivity, territory and eyes. This neural network was trained by
temporal-difference learning 42 to predict territory in games of self-play, building on prior work 43 .
A related approach, RLGO 44 , represented the value function instead by a linear combination
of features, exhaustively enumerating all 3 × 3 patterns of stones; it was trained by temporal-
difference learning to predict the winner in games of self-play. Both NeuroGo and RLGO achieved
a weak amateur level of play.

          Monte-Carlo tree search (MCTS) may also be viewed as a form of self-play reinforcement
learning 45 . The nodes of the search tree contain the value function for the positions encountered
during search; these values are updated to predict the winner of simulated games of self-play.
MCTS programs have previously achieved strong amateur level in Go 46, 47 , but used substantial
domain expertise: a fast rollout policy, based on handcrafted features 13, 48 , that evaluates positions
by running simulations until the end of the game; and a tree policy, also based on handcrafted
features, that selects moves within the search tree 47 .

          Self-play reinforcement learning approaches have achieved high levels of performance in
other games: chess 49–51 , checkers 52 , backgammon 53 , othello 54 , Scrabble 55 and most recently
poker 56 . In all of these examples, a value function was trained by regression 54–56 or temporal-
difference learning 49–53 from training data generated by self-play. The trained value function was
used as an evaluation function in an alpha-beta search 49–54 , a simple Monte-Carlo search 55, 57 , or
counterfactual regret minimisation 56 . However, these methods utilised handcrafted input features 49–53, 56
or handcrafted feature templates 54, 55 . In addition, the learning process used supervised
learning to initialise weights 58 , hand-selected weights for piece values 49, 51, 52 , handcrafted restric-
tions on the action space 56 , or used pre-existing computer programs as training opponents 49, 50 or
to generate game records 51 .

         Many of the most successful and widely used reinforcement learning methods were first
introduced in the context of zero-sum games: temporal-difference learning was first introduced
for a checkers-playing program 59 , while MCTS was introduced for the game of Go 13 . However,
very similar algorithms have subsequently proven highly effective in video games 6–8, 10 , robotics 60 ,
industrial control 61–63 , and online recommendation systems 64, 65 .

AlphaGo versions We compare four distinct versions of AlphaGo:


      1. AlphaGo Fan is the previously published program 12 that played against Fan Hui in October
         2015. This program was distributed over many machines using 176 GPUs.

      2. AlphaGo Lee is the program that defeated Lee Sedol 4–1 in March, 2016. It was previously
         unpublished but is similar in most regards to AlphaGo Fan 12 . However, we highlight several
         key differences to facilitate a fair comparison. First, the value network was trained from
         the outcomes of fast games of self-play by AlphaGo, rather than games of self-play by the
         policy network; this procedure was iterated several times – an initial step towards the tabula
         rasa algorithm presented in this paper. Second, the policy and value networks were larger
         than those described in the original paper – using 12 convolutional layers of 256 planes
         respectively – and were trained for more iterations. This player was also distributed over
         many machines using 48 TPUs, rather than GPUs, enabling it to evaluate neural networks
         faster during search.

   3. AlphaGo Master is the program that defeated top human players by 60–0 in January, 2017 34 .
      It was previously unpublished but uses the same neural network architecture, reinforcement
      learning algorithm, and MCTS algorithm as described in this paper. However, it uses the
      same handcrafted features and rollouts as AlphaGo Lee 12 and training was initialised by
      supervised learning from human data.

   4. AlphaGo Zero is the program described in this paper. It learns from self-play reinforcement
      learning, starting from random initial weights, without using rollouts, with no human super-
      vision, and using only the raw board history as input features. It uses just a single machine
      in the Google Cloud with 4 TPUs (AlphaGo Zero could also be distributed but we chose to
      use the simplest possible search algorithm).


Domain Knowledge Our primary contribution is to demonstrate that superhuman performance
can be achieved without human domain knowledge. To clarify this contribution, we enumerate the
domain knowledge that AlphaGo Zero uses, explicitly or implicitly, either in its training procedure
or its Monte-Carlo tree search; these are the items of knowledge that would need to be replaced for
AlphaGo Zero to learn a different (alternating Markov) game.


   1. AlphaGo Zero is provided with perfect knowledge of the game rules. These are used dur-
      ing MCTS, to simulate the positions resulting from a sequence of moves, and to score any
      simulations that reach a terminal state. Games terminate when both players pass, or after
      19 · 19 · 2 = 722 moves. In addition, the player is provided with the set of legal moves in
      each position.

   2. AlphaGo Zero uses Tromp-Taylor scoring 66 during MCTS simulations and self-play train-
      ing. This is because human scores (Chinese, Japanese or Korean rules) are not well-defined
      if the game terminates before territorial boundaries are resolved. However, all tournament
      and evaluation games were scored using Chinese rules.

   3. The input features describing the position are structured as a 19 × 19 image; i.e. the neural
      network architecture is matched to the grid-structure of the board.

   4. The rules of Go are invariant under rotation and reflection; this knowledge has been utilised
      in AlphaGo Zero both by augmenting the data set during training to include rotations and
      reflections of each position, and by sampling random rotations or reflections of the position
      during MCTS (see Search Algorithm). Aside from komi, the rules of Go are also invari-
      ant to colour transposition; this knowledge is exploited by representing the board from the
      perspective of the current player (see Neural Network Architecture).
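
The rotation and reflection symmetry in item 4 can be illustrated with a minimal sketch that enumerates the eight dihedral transforms of a square board. The list-of-lists board encoding and the function names are assumptions made for this illustration; they are not taken from the paper.

```python
from typing import List

Board = List[List[int]]  # square grid; the cell encoding is an assumption


def rotate90(board: Board) -> Board:
    """Rotate a square board 90 degrees clockwise."""
    return [list(row) for row in zip(*board[::-1])]


def reflect(board: Board) -> Board:
    """Mirror a board left-to-right."""
    return [row[::-1] for row in board]


def dihedral_transforms(board: Board) -> List[Board]:
    """Return the 8 rotations and reflections of a square board."""
    transforms = []
    current = [list(row) for row in board]
    for _ in range(4):
        transforms.append(current)           # one of the four rotations
        transforms.append(reflect(current))  # the same rotation, mirrored
        current = rotate90(current)
    return transforms
```

During training, each stored position could be expanded with these transforms; during search, one transform would be drawn at random for each leaf evaluation.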


        AlphaGo Zero does not use any form of domain knowledge beyond the points listed above.
It only uses its deep neural network to evaluate leaf nodes and to select moves (see Search Algorithm).
It does not use any rollout policy or tree policy, and the MCTS is not augmented by any other
heuristics or domain-specific rules. No legal moves are excluded – even those filling in the player’s
own eyes (a standard heuristic used in all previous programs 67 ).

        The algorithm was started with random initial parameters for the neural network. The neural
network architecture (see Neural Network Architecture) is based on the current state of the art
in image recognition 4, 18 , and hyperparameters for training were chosen accordingly (see Self-Play
Training Pipeline). MCTS search parameters were selected by Gaussian process optimisation 68 ,
so as to optimise self-play performance of AlphaGo Zero using a neural network trained in
a preliminary run. For the larger run (40 block, 40 days), MCTS search parameters were re-
optimised using the neural network trained in the smaller run (20 block, 3 days). The training
algorithm was executed autonomously without human intervention.

Self-Play Training Pipeline AlphaGo Zero’s self-play training pipeline consists of three main
components, all executed asynchronously in parallel. Neural network parameters θi are continually
optimised from recent self-play data; AlphaGo Zero players αθi are continually evaluated; and the
best performing player so far, αθ∗ , is used to generate new self-play data.

        Optimisation Each neural network fθi is optimised on the Google Cloud using TensorFlow,
with 64 GPU workers and 19 CPU parameter servers. The batch-size is 32 per worker, for a
total mini-batch size of 2,048. Each mini-batch of data is sampled uniformly at random from
all positions from the most recent 500,000 games of self-play. Neural network parameters are
optimised by stochastic gradient descent with momentum and learning rate annealing, using the
loss in Equation 1. The learning rate is annealed according to the standard schedule in Extended
Data Table 3. The momentum parameter is set to 0.9. The cross-entropy and mean-squared error
losses are weighted equally (this is reasonable because rewards are unit scaled, r ∈ {−1, +1})
and the L2 regularisation parameter is set to $c = 10^{-4}$. The optimisation process produces a new
checkpoint every 1,000 training steps. This checkpoint is evaluated by the evaluator and it may be
used for generating the next batch of self-play games, as we explain next.
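
As a rough illustration of the objective being optimised, the sketch below evaluates a per-example loss of the form (z − v)² − πᵀ log p + c‖θ‖², which is how the loss in Equation 1 of the main text is usually stated. The function name, the flat list of weights and the `value_weight` argument are assumptions for illustration; `value_weight = 1.0` corresponds to the equal weighting described above.

```python
import math
from typing import Sequence


def alphago_zero_loss(
    p: Sequence[float],        # predicted move probabilities
    pi: Sequence[float],       # MCTS search probabilities (training target)
    v: float,                  # predicted value in [-1, 1]
    z: float,                  # game outcome, -1 or +1
    weights: Sequence[float],  # network parameters, flattened (illustrative)
    c: float = 1e-4,           # L2 regularisation parameter
    value_weight: float = 1.0, # relative weight of the mean-squared error term
) -> float:
    """Combined value, policy and regularisation loss for one example."""
    value_loss = (z - v) ** 2
    # Cross-entropy between search probabilities and predicted probabilities;
    # terms with pi_a == 0 contribute nothing and are skipped.
    policy_loss = -sum(t * math.log(q) for t, q in zip(pi, p) if t > 0)
    l2_penalty = c * sum(w * w for w in weights)
    return value_weight * value_loss + policy_loss + l2_penalty


# Batch arithmetic from the text: 64 workers x 32 positions per worker.
TOTAL_MINI_BATCH = 64 * 32
```

In an actual training run the loss would be averaged over the mini-batch and differentiated by the framework; this sketch only shows how the three terms combine.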

      Evaluator To ensure we always generate the best quality data, we evaluate each new neural
network checkpoint against the current best network fθ∗ before using it for data generation. The
neural network fθi is evaluated by the performance of an MCTS search αθi that uses fθi to evalu-
ate leaf positions and prior probabilities (see Search Algorithm). Each evaluation consists of 400
games, using an MCTS with 1,600 simulations to select each move, using an infinitesimal tem-
perature τ → 0 (i.e. we deterministically select the move with maximum visit count, to give the
strongest possible play). If the new player wins by a margin of > 55% (to avoid selecting on noise
alone) then it becomes the best player αθ∗ , and is subsequently used for self-play generation, and
also becomes the baseline for subsequent comparisons.
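
The gating rule can be sketched as a single comparison. This assumes that "a margin of > 55%" means the candidate must win strictly more than 55% of the 400 evaluation games; the function and parameter names are hypothetical.

```python
def should_promote(wins: int, games: int = 400, threshold_percent: int = 55) -> bool:
    """Return True if the candidate's win rate strictly exceeds the threshold.

    Integer arithmetic avoids floating-point edge cases at exactly 55%.
    """
    if games <= 0:
        raise ValueError("games must be positive")
    return wins * 100 > threshold_percent * games
```

Under this reading, a candidate needs at least 221 wins out of 400 to replace the current best player.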

      Self-Play The best current player αθ∗ , as selected by the evaluator, is used to generate data.
In each iteration, αθ∗ plays 25,000 games of self-play, using 1,600 simulations of MCTS to select
each move (this requires approximately 0.4s per search). For the first 30 moves of each game, the
temperature is set to τ = 1; this selects moves proportionally to their visit count in MCTS, and
ensures a diverse set of positions are encountered. For the remainder of the game, an infinitesimal
temperature is used, τ → 0. Additional exploration is achieved by adding Dirichlet noise to the
prior probabilities in the root node $s_0$, specifically $P(s, a) = (1 - \epsilon) p_a + \epsilon \eta_a$, where $\eta \sim \mathrm{Dir}(0.03)$
and $\epsilon = 0.25$; this noise ensures that all moves may be tried, but the search may still overrule bad
moves. In order to save computation, clearly lost games are resigned. The resignation threshold
$v_{resign}$ is selected automatically to keep the fraction of false positives (games that could have been
won if AlphaGo had not resigned) below 5%. To measure false positives, we disable resignation
in 10% of self-play games and play until termination.
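
The root-noise step described above can be sketched with the standard library alone. The Dirichlet sample is drawn by normalising independent Gamma variates, which is a standard construction; the function names and the uniform fallback for numerical underflow are assumptions of this sketch.

```python
import random
from typing import List, Optional


def sample_dirichlet(alpha: float, n: int, rng: random.Random) -> List[float]:
    """Draw one sample from a symmetric Dirichlet(alpha) over n outcomes."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(draws)
    if total == 0.0:
        # Every draw underflowed to zero (very unlikely); fall back to uniform.
        return [1.0 / n] * n
    return [d / total for d in draws]


def add_root_noise(
    priors: List[float],
    alpha: float = 0.03,
    epsilon: float = 0.25,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Mix the root prior probabilities with Dirichlet noise."""
    rng = rng or random.Random()
    eta = sample_dirichlet(alpha, len(priors), rng)
    return [(1.0 - epsilon) * p + epsilon * e for p, e in zip(priors, eta)]
```

Because at least 75% of each original prior is retained, the noise can surface otherwise neglected moves without preventing the search from rejecting them.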

Supervised Learning For comparison, we also trained neural network parameters θSL by super-
vised learning. The neural network architecture was identical to AlphaGo Zero. Mini-batches of
data (s, π , z) were sampled at random from the KGS data-set, setting πa = 1 for the human expert
move a. Parameters were optimised by stochastic gradient descent with momentum and learning
rate annealing, using the same loss as in Equation 1, but weighting the mean-squared error com-
ponent by a factor of 0.01. The learning rate was annealed according to the standard schedule
in Extended Data Table 3. The momentum parameter was set to 0.9, and the L2 regularisation
parameter was set to $c = 10^{-4}$.

      By using a combined policy and value network architecture, and by using a low weight on
the value component, it was possible to avoid overfitting to the values (a problem described in
prior work 12 ). After 72 hours the move prediction accuracy exceeded the state of the art reported
in previous work 12, 30–33 , reaching 60.4% on the KGS test set; the value prediction error was also
substantially better than previously reported 12 . The validation set was composed of professional
games from GoKifu. Accuracies and mean squared errors are reported in Extended Data Table 1
and Extended Data Table 2 respectively.

Search Algorithm AlphaGo Zero uses a much simpler variant of the asynchronous policy and
value MCTS algorithm (APV-MCTS) used in AlphaGo Fan and AlphaGo Lee.

      Each node s in the search tree contains edges (s, a) for all legal actions a ∈ A(s). Each edge
stores a set of statistics,

                                {N (s, a), W (s, a), Q(s, a), P (s, a)},

where N (s, a) is the visit count, W (s, a) is the total action-value, Q(s, a) is the mean action-value,
and P (s, a) is the prior probability of selecting that edge. Multiple simulations are executed in
parallel on separate search threads. The algorithm proceeds by iterating over three phases (a–c in
Figure 2), and then selects a move to play (d in Figure 2).

      Select (Figure 2a). The selection phase is almost identical to AlphaGo Fan 12 ; we recapitulate
here for completeness. The first in-tree phase of each simulation begins at the root node of the
search tree, s0 , and finishes when the simulation reaches a leaf node sL at time-step L. At each
of these time-steps, t < L, an action is selected according to the statistics in the search tree,
$a_t = \operatorname{argmax}_a \big( Q(s_t, a) + U(s_t, a) \big)$, using a variant of the PUCT algorithm 24 ,

$$U(s, a) = c_{puct} \, P(s, a) \, \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)}$$

where $c_{puct}$ is a constant determining the level of exploration; this search control strategy initially
prefers actions with high prior probability and low visit count, but asymptotically prefers actions
with high action-value.
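
The selection rule can be sketched as follows. The `Edge` container, the dictionary of edges keyed by action, and the default `c_puct = 1.0` are assumptions for illustration; the text does not state the value of the constant.

```python
import math
from dataclasses import dataclass
from typing import Dict, Hashable


@dataclass
class Edge:
    N: int = 0      # visit count
    W: float = 0.0  # total action-value
    Q: float = 0.0  # mean action-value
    P: float = 0.0  # prior probability


def puct_score(edge: Edge, total_visits: int, c_puct: float) -> float:
    """Q(s, a) + U(s, a), with U = c_puct * P * sqrt(sum_b N(s, b)) / (1 + N)."""
    u = c_puct * edge.P * math.sqrt(total_visits) / (1 + edge.N)
    return edge.Q + u


def select_action(edges: Dict[Hashable, Edge], c_puct: float = 1.0) -> Hashable:
    """Pick the action maximising Q + U among the edges of one node."""
    total_visits = sum(e.N for e in edges.values())
    return max(edges, key=lambda a: puct_score(edges[a], total_visits, c_puct))
```

With few visits the prior term dominates, so a rarely visited move with a high prior is chosen; once visit counts are large, the U term shrinks and the move with the higher mean action-value wins. When a node has no visits at all, U is zero for every edge and this sketch simply returns the first action in the dictionary.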

      Expand and evaluate (Figure 2b). The leaf node $s_L$ is added to a queue for neural network
evaluation, $(d_i(p), v) = f_\theta(d_i(s_L))$, where $d_i$ is a dihedral reflection or rotation selected uniformly
at random from $i \in [1..8]$.

      Positions in the queue are evaluated by the neural network using a mini-batch size of 8; the
search thread is locked until evaluation completes. The leaf node is expanded and each edge $(s_L, a)$
is initialised to $\{N(s_L, a) = 0, W(s_L, a) = 0, Q(s_L, a) = 0, P(s_L, a) = p_a\}$; the value $v$ is then
backed up.

      Backup (Figure 2c). The edge statistics are updated in a backward pass through each step
$t \le L$. The visit counts are incremented, $N(s_t, a_t) = N(s_t, a_t) + 1$, and the action-value is updated
to the mean value, $W(s_t, a_t) = W(s_t, a_t) + v$, $Q(s_t, a_t) = \frac{W(s_t, a_t)}{N(s_t, a_t)}$. We use virtual loss to ensure
each thread evaluates different nodes 69 .
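
A minimal sketch of the backup pass, following the update equations exactly as written above. The `Edge` container and the list-of-edges path are assumptions; virtual loss and any sign handling for the alternating players are deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Edge:
    N: int = 0      # visit count
    W: float = 0.0  # total action-value
    Q: float = 0.0  # mean action-value


def backup(path: List[Edge], v: float) -> None:
    """Update every edge traversed in one simulation with the leaf value v."""
    for edge in reversed(path):  # backward pass from the leaf to the root
        edge.N += 1
        edge.W += v
        edge.Q = edge.W / edge.N
```

After two simulations through the same edge with values 1.0 and −0.5, that edge would hold N = 2, W = 0.5 and Q = 0.25.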

      Play (Figure 2d). At the end of the search AlphaGo Zero selects a move a to play in the root
position $s_0$, proportional to its exponentiated visit count, $\pi(a \mid s_0) = N(s_0, a)^{1/\tau} / \sum_b N(s_0, b)^{1/\tau}$,
where τ is a temperature parameter that controls the level of exploration. The search tree is reused
at subsequent time-steps: the child node corresponding to the played action becomes the new root
node; the subtree below this child is retained along with all its statistics, while the remainder of
the tree is discarded. AlphaGo Zero resigns if its root value and best child value are lower than a
threshold value $v_{resign}$.
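
The move-selection distribution can be sketched as below. Dividing the counts by their maximum before exponentiating is an implementation choice of this sketch, made to avoid overflow at small temperatures; it leaves the normalised result unchanged. The function names are hypothetical.

```python
from typing import List


def search_policy(visit_counts: List[int], tau: float = 1.0) -> List[float]:
    """Return probabilities proportional to N(s0, a) ** (1 / tau)."""
    if tau <= 0:
        raise ValueError("tau must be positive; use greedy_move for tau -> 0")
    largest = max(visit_counts)
    if largest <= 0:
        raise ValueError("at least one move must have been visited")
    scaled = [(n / largest) ** (1.0 / tau) for n in visit_counts]
    total = sum(scaled)
    return [s / total for s in scaled]


def greedy_move(visit_counts: List[int]) -> int:
    """Limit tau -> 0: index of the move with the maximum visit count."""
    return max(range(len(visit_counts)), key=visit_counts.__getitem__)
```

With τ = 1 the probabilities are simply the normalised visit counts; as τ shrinks, the probability mass concentrates on the most visited move.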

      Compared to the MCTS in AlphaGo Fan and AlphaGo Lee, the principal differences are that
AlphaGo Zero does not use any rollouts; it uses a single neural network instead of separate policy
and value networks; leaf nodes are always expanded, rather than using dynamic expansion; each
search thread simply waits for the neural network evaluation, rather than performing evaluation
and backup asynchronously; and there is no tree policy. A transposition table was also used in the
large (40 block, 40 day) instance of AlphaGo Zero.

Neural Network Architecture The input to the neural network is a 19 × 19 × 17 image stack
comprising 17 binary feature planes. 8 feature planes $X_t$ consist of binary values indicating the
presence of the current player’s stones ($X_t^i = 1$ if intersection $i$ contains a stone of the player’s
colour at time-step $t$; 0 if the intersection is empty, contains an opponent stone, or if $t < 0$). A
further 8 feature planes, $Y_t$, represent the corresponding features for the opponent’s stones. The
final feature plane, $C$, represents the colour to play, and has a constant value of either 1 if black
is to play or 0 if white is to play. These planes are concatenated together to give input features
$s_t = [X_t, Y_t, X_{t-1}, Y_{t-1}, \ldots, X_{t-7}, Y_{t-7}, C]$. History features $X_t, Y_t$ are necessary because Go is
not fully observable solely from the current stones, as repetitions are forbidden; similarly, the
colour feature $C$ is necessary because the komi is not observable.
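
The construction of the 17 input planes can be sketched as follows. The board encoding (0 for empty, 1 for black, −1 for white), the history list and the plane-major output layout are assumptions made for illustration; only the ordering of the planes follows the text.

```python
from typing import List, Optional

SIZE = 19
Board = List[List[int]]  # assumed encoding: 0 empty, 1 black, -1 white


def stone_plane(board: Optional[Board], colour: int) -> Board:
    """Binary plane marking the stones of one colour; all zeros when t < 0."""
    if board is None:
        return [[0] * SIZE for _ in range(SIZE)]
    return [[1 if cell == colour else 0 for cell in row] for row in board]


def input_planes(history: List[Board], to_play: int) -> List[Board]:
    """Build [X_t, Y_t, ..., X_{t-7}, Y_{t-7}, C] as a list of 17 planes.

    history[-1] is the current position; to_play is 1 (black) or -1 (white).
    """
    planes = []
    for k in range(8):
        idx = len(history) - 1 - k
        board = history[idx] if idx >= 0 else None     # before move 0: empty
        planes.append(stone_plane(board, to_play))     # X: current player
        planes.append(stone_plane(board, -to_play))    # Y: opponent
    colour_value = 1 if to_play == 1 else 0
    planes.append([[colour_value] * SIZE for _ in range(SIZE)])  # C
    return planes
```

A real implementation would more likely store the result as a single 19 × 19 × 17 array; nested lists are used here only to keep the sketch dependency-free.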

      The input features st are processed by a residual tower that consists of a single convolutional
block followed by either 19 or 39 residual blocks 4 .

      The convolutional block applies the following modules:


   1. A convolution of 256 filters of kernel size 3 × 3 with stride 1

   2. Batch normalisation 18

   3. A rectifier non-linearity



     Each residual block applies the following modules sequentially to its input:


  1. A convolution of 256 filters of kernel size 3 × 3 with stride 1

  2. Batch normalisation

  3. A rectifier non-linearity

  4. A convolution of 256 filters of kernel size 3 × 3 with stride 1

  5. Batch normalisation

  6. A skip connection that adds the input to the block

  7. A rectifier non-linearity


     The output of the residual tower is passed into two separate “heads” for computing the policy
and value respectively. The policy head applies the following modules:


  1. A convolution of 2 filters of kernel size 1 × 1 with stride 1

  2. Batch normalisation

  3. A rectifier non-linearity

  4. A fully connected linear layer that outputs a vector of size $19^2 + 1 = 362$ corresponding to
      logit probabilities for all intersections and the pass move


     The value head applies the following modules:


  1. A convolution of 1 filter of kernel size 1 × 1 with stride 1

  2. Batch normalisation

  3. A rectifier non-linearity

   4. A fully connected linear layer to a hidden layer of size 256

   5. A rectifier non-linearity

   6. A fully connected linear layer to a scalar

   7. A tanh non-linearity outputting a scalar in the range [−1, 1]


      The overall network depth, in the 20 or 40 block network, is 39 or 79 parameterised layers
respectively for the residual tower, plus an additional 2 layers for the policy head and 3 layers for
the value head.
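
The layer counts quoted above follow from the block structure: one convolutional block contributes one parameterised layer, and each residual block contributes two. The short sketch below reproduces that arithmetic; the function names are hypothetical, and the head depths are taken directly from the text.

```python
from typing import Dict

BOARD_SIZE = 19
POLICY_OUTPUTS = BOARD_SIZE ** 2 + 1  # every intersection plus the pass move


def tower_layers(total_blocks: int) -> int:
    """Parameterised layers in a tower of 1 conv block + residual blocks."""
    residual_blocks = total_blocks - 1  # 19 or 39 residual blocks
    return 1 + 2 * residual_blocks      # two convolutions per residual block


def network_depths(total_blocks: int) -> Dict[str, int]:
    """Depth of the tower and of each path through the two heads."""
    tower = tower_layers(total_blocks)
    return {
        "tower": tower,
        "policy_path": tower + 2,  # 1x1 convolution + linear layer
        "value_path": tower + 3,   # 1x1 convolution + two linear layers
    }
```

For example, `network_depths(20)` gives a 39-layer tower, and `network_depths(40)` gives a 79-layer tower.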

      We note that a different variant of residual networks was simultaneously applied to computer
Go 33 and achieved amateur dan-level performance; however this was restricted to a single-headed
policy network trained solely by supervised learning.

Neural Network Architecture Comparison Figure 4 shows the results of a comparison between
network architectures. Specifically, we compared four different neural networks:


   1. dual-res: The network contains a 20-block residual tower, as described above, followed by
      both a policy head and a value head. This is the architecture used in AlphaGo Zero.

   2. sep-res: The network contains two 20-block residual towers. The first tower is followed by
      a policy head and the second tower is followed by a value head.

   3. dual-conv: The network contains a non-residual tower of 12 convolutional blocks, followed
      by both a policy head and a value head.

   4. sep-conv: The network contains two non-residual towers of 12 convolutional blocks. The
      first tower is followed by a policy head and the second tower is followed by a value head.
      This is the architecture used in AlphaGo Lee.


      Each network was trained on a fixed data-set containing the final 2 million games of self-
play data generated by a previous run of AlphaGo Zero, using stochastic gradient descent with
the annealing rate, momentum, and regularisation hyperparameters described for the supervised
learning experiment; however, cross-entropy and mean-squared error components were weighted
equally, since more data was available.

Evaluation We evaluated the relative strength of AlphaGo Zero (Figure 3a and 6) by measuring
the Elo rating of each player. We estimate the probability that player a will defeat player b by
a logistic function $p(a \text{ defeats } b) = \frac{1}{1 + \exp(c_{elo}(e(b) - e(a)))}$, and estimate the ratings $e(\cdot)$ by Bayesian
logistic regression, computed by the BayesElo program 25 using the standard constant $c_{elo} = 1/400$.
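
The logistic win-probability model can be sketched directly from the formula as written, using the natural exponential. The function name is hypothetical, and this sketch covers only the prediction step, not the Bayesian rating estimation performed by BayesElo.

```python
import math


def win_probability(e_a: float, e_b: float, c_elo: float = 1.0 / 400.0) -> float:
    """Probability that player a defeats player b, given Elo ratings e(a), e(b)."""
    return 1.0 / (1.0 + math.exp(c_elo * (e_b - e_a)))
```

Under this form, equally rated players each win with probability 0.5, and a 400-point advantage corresponds to an expected win rate of roughly 73%.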

      Elo ratings were computed from the results of a 5 second per move tournament between
AlphaGo Zero, AlphaGo Master, AlphaGo Lee, and AlphaGo Fan. The raw neural network from
AlphaGo Zero was also included in the tournament. The Elo ratings of AlphaGo Fan, Crazy Stone,
Pachi and GnuGo were anchored to the tournament values from prior work 12 , and correspond to
the players reported in that work. The results of the matches of AlphaGo Fan against Fan Hui and
AlphaGo Lee against Lee Sedol were also included to ground the scale to human references, as
otherwise the Elo ratings of AlphaGo are unrealistically high due to self-play bias.

      The Elo ratings in Figure 3a, 4a and 6a were computed from the results of evaluation games
between each iteration of player αθi during self-play training. Further evaluations were also per-
formed against baseline players with Elo ratings anchored to the previously published values 12 .

      We measured the head-to-head performance of AlphaGo Zero against AlphaGo Lee, and
the 40 block instance of AlphaGo Zero against AlphaGo Master, using the same player and match
conditions as were used against Lee Sedol in Seoul, 2016. Each player received 2 hours of thinking
time plus 3 byoyomi periods of 60 seconds per move. All games were scored using Chinese rules
with a komi of 7.5 points.

Data Availability The datasets used for validation and testing are the GoKifu dataset (available
from http://gokifu.com/ ) and the KGS dataset (available from https://u-go.net/gamerecords/ ).




References


35. Barto, A. G. & Duff, M. Monte Carlo matrix inversion and reinforcement learning. Advances
    in Neural Information Processing Systems 687–694 (1994).

36. Singh, S. P. & Sutton, R. S. Reinforcement learning with replacing eligibility traces. Machine
    learning 22, 123–158 (1996).

37. Lagoudakis, M. G. & Parr, R. Reinforcement learning as classification: Leveraging modern
    classifiers. In International Conference on Machine Learning, 424–431 (2003).

38. Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B. & Geist, M. Approximate modi-
    fied policy iteration and its application to the game of Tetris. Journal of Machine Learning
    Research 16, 1629–1676 (2015).

39. Littman, M. L. Markov games as a framework for multi-agent reinforcement learning. In
    International Conference on Machine Learning, 157–163 (1994).

40. Enzenberger, M. The integration of a priori knowledge into a Go playing neural network
    (1996). URL: http://www.cgl.ucsf.edu/go/Programs/neurogo-html/neurogo.html.

41. Enzenberger, M. Evaluation in Go by a neural network using soft segmentation. In Advances
    in Computer Games Conference, 97–108 (2003).

42. Sutton, R. Learning to predict by the method of temporal differences. Machine Learning 3,
    9–44 (1988).

43. Schraudolph, N. N., Dayan, P. & Sejnowski, T. J. Temporal difference learning of position
    evaluation in the game of Go. Advances in Neural Information Processing Systems 817–824
    (1994).

44. Silver, D., Sutton, R. & Müller, M. Temporal-difference search in computer Go. Machine
    Learning 87, 183–219 (2012).



45. Silver, D. Reinforcement Learning and Simulation-Based Search in Computer Go. Ph.D.
    thesis, University of Alberta, Edmonton, Canada (2009).

46. Gelly, S. & Silver, D. Monte-Carlo tree search and rapid action value estimation in computer
    Go. Artificial Intelligence 175, 1856–1875 (2011).

47. Coulom, R. Computing Elo ratings of move patterns in the game of Go. International Com-
    puter Games Association Journal 30, 198–208 (2007).

48. Gelly, S., Wang, Y., Munos, R. & Teytaud, O. Modification of UCT with patterns in Monte-
    Carlo Go. Tech. Rep. 6062, INRIA (2006).

49. Baxter, J., Tridgell, A. & Weaver, L. Learning to play chess using temporal differences.
    Machine Learning 40, 243–263 (2000).

50. Veness, J., Silver, D., Blair, A. & Uther, W. Bootstrapping from game tree search. In Advances
    in Neural Information Processing Systems, 1937–1945 (2009).

51. Lai, M. Giraffe: Using Deep Reinforcement Learning to Play Chess. Master’s thesis, Imperial
    College London (2015).

52. Schaeffer, J., Hlynka, M. & Jussila, V. Temporal difference learning applied to a high-
    performance game-playing program. In International Joint Conference on Artificial Intelli-
    gence, 529–534 (2001).

53. Tesauro, G. TD-gammon, a self-teaching backgammon program, achieves master-level play.
    Neural Computation 6, 215–219 (1994).

54. Buro, M. From simple features to sophisticated evaluation functions. In International Confer-
    ence on Computers and Games, 126–145 (1999).

55. Sheppard, B. World-championship-caliber Scrabble. Artificial Intelligence 134, 241–275
    (2002).



56. Moravčík, M. et al. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker.
    Science (2017).

57. Tesauro, G. & Galperin, G. On-line policy improvement using Monte-Carlo search. In Ad-
    vances in Neural Information Processing Systems, 1068–1074 (1996).

58. Tesauro, G. Neurogammon: a neural-network backgammon program. In International Joint
    Conference on Neural Networks, vol. 3, 33–39 (1990).

59. Samuel, A. L. Some studies in machine learning using the game of checkers II - recent
    progress. IBM Journal of Research and Development 11, 601–617 (1967).

60. Kober, J., Bagnell, J. A. & Peters, J. Reinforcement learning in robotics: A survey. The
    International Journal of Robotics Research 32, 1238–1274 (2013).

61. Zhang, W. & Dietterich, T. G. A reinforcement learning approach to job-shop scheduling. In
    International Joint Conference on Artificial Intelligence, 1114–1120 (1995).

62. Cazenave, T., Balbo, F. & Pinson, S. Using a Monte-Carlo approach for bus regulation. In
    International IEEE Conference on Intelligent Transportation Systems, 1–6 (2009).

63. Evans, R. & Gao, J. Deepmind AI reduces Google data centre cooling bill by 40% (2016).
    URL: https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/.

64. Abe, N. et al. Empirical comparison of various reinforcement learning strategies for sequential
    targeted marketing. In IEEE International Conference on Data Mining, 3–10 (2002).

65. Silver, D., Newnham, L., Barker, D., Weller, S. & McFall, J. Concurrent reinforcement learn-
    ing from customer interactions. In International Conference on Machine Learning, 924–932
    (2013).

66. Tromp, J. Tromp-Taylor rules (1995). URL: http://tromp.github.io/go.html.

67. Müller, M. Computer Go. Artificial Intelligence 134, 145–179 (2002).



68. Shahriari, B., Swersky, K., Wang, Z., Adams, R. P. & de Freitas, N. Taking the human out of
    the loop: A review of Bayesian optimization. Proceedings of the IEEE 104, 148–175 (2016).

69. Segal, R. B. On the scalability of parallel UCT. Computers and Games 6515, 36–47 (2011).




 Network                                   KGS train (%)    KGS test (%)    GoKifu validation (%)

 Supervised learning (20 block)            62.0             60.4            54.3
 Supervised learning (12 layer, ref. 12)   59.1             55.9            -
 Reinforcement learning (20 block)         -                -               49.0
 Reinforcement learning (40 block)         -                -               51.3

Extended Data Table 1: Move prediction accuracy. Percentage accuracies of move prediction for neural networks
trained by reinforcement learning (i.e. AlphaGo Zero) or supervised learning respectively. For supervised
learning, the network was trained for 3 days on KGS data (amateur games); comparative results are also shown from
Silver et al. 12. For reinforcement learning, the 20 block network was trained for 3 days and the 40 block network was
trained for 40 days. Networks were also evaluated on a validation set based on professional games from the GoKifu
data set.




 Network                                   KGS train        KGS test        GoKifu validation

 Supervised learning (20 block)            0.177            0.185           0.207
 Supervised learning (12 layer, ref. 12)   0.19             0.37            -
 Reinforcement learning (20 block)         -                -               0.177
 Reinforcement learning (40 block)         -                -               0.180

Extended Data Table 2: Game outcome prediction error. Mean squared error on game outcome predictions
for neural networks trained by reinforcement learning (i.e. AlphaGo Zero) or supervised learning respectively. For
supervised learning, the network was trained for 3 days on KGS data (amateur games); comparative results are also
shown from Silver et al. 12. For reinforcement learning, the 20 block network was trained for 3 days and the 40 block
network was trained for 40 days. Networks were also evaluated on a validation set based on professional games from
the GoKifu data set.
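
The two quantities reported in Extended Data Tables 1 and 2 can be illustrated with a minimal sketch. It assumes that move prediction accuracy means the percentage of positions in which the network's highest-probability move equals the move actually played, and that outcomes are encoded as -1 or +1; the function names and input layout are hypothetical, and any rescaling applied to the published error values is not reproduced here.

```python
from typing import Sequence


def move_prediction_accuracy(
    policies: Sequence[Sequence[float]], played_moves: Sequence[int]
) -> float:
    """Percentage of positions whose top-probability move matches the played move.

    Assumption: 'accuracy' is top-1 agreement; policies[i][a] is the predicted
    probability of move index a in position i.
    """
    if not policies or len(policies) != len(played_moves):
        raise ValueError("need one played move per non-empty policy list")
    hits = 0
    for policy, move in zip(policies, played_moves):
        predicted = max(range(len(policy)), key=policy.__getitem__)
        hits += predicted == move
    return 100.0 * hits / len(policies)


def outcome_mse(values: Sequence[float], outcomes: Sequence[float]) -> float:
    """Mean squared error between predicted values and game outcomes.

    Assumption: outcomes are -1 (loss) or +1 (win) from the same player's
    perspective as the predicted value.
    """
    if not values or len(values) != len(outcomes):
        raise ValueError("need one outcome per non-empty value list")
    return sum((v - z) ** 2 for v, z in zip(values, outcomes)) / len(values)


if __name__ == "__main__":
    toy_policies = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6], [0.5, 0.4, 0.1]]
    toy_moves = [1, 0, 2, 1]
    print(move_prediction_accuracy(toy_policies, toy_moves))
    print(outcome_mse([0.5, -0.5], [1.0, -1.0]))
```

The toy inputs in the `__main__` block are made up for illustration and are unrelated to the KGS or GoKifu data.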

               Thousands of steps        Reinforcement learning         Supervised learning

                       0–200                        10^-2                          10^-1
                      200–400                       10^-2                          10^-2
                      400–600                       10^-3                          10^-3
                      600–700                       10^-4                          10^-4
                      700–800                       10^-4                          10^-5
                       >800                         10^-4                            -

Extended Data Table 3: Learning rate schedule. Learning rate used during reinforcement learning and
supervised learning experiments, measured in thousands of steps (mini-batch updates).
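
The schedule in Extended Data Table 3 is piecewise constant, so it can be sketched as a lookup over interval edges. This is a minimal illustration rather than the authors' implementation: the function and variable names are hypothetical, and because the table does not say which interval a boundary step belongs to, the sketch assumes each interval includes its upper edge (so that ">800" starts strictly after step 800,000).

```python
import bisect
from typing import Optional

# Upper edges of the table's intervals, converted from thousands of steps to steps.
EDGES = [200_000, 400_000, 600_000, 700_000, 800_000]

# One rate per interval: 0-200, 200-400, 400-600, 600-700, 700-800, >800.
# None marks the ">800" cell that the table leaves empty for supervised learning.
RATES = {
    "reinforcement": [1e-2, 1e-2, 1e-3, 1e-4, 1e-4, 1e-4],
    "supervised": [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, None],
}


def learning_rate(step: int, regime: str = "reinforcement") -> Optional[float]:
    """Return the tabulated learning rate for a given mini-batch update count.

    Assumption: intervals are closed on the right, e.g. step 200,000 still
    uses the 0-200 rate.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if regime not in RATES:
        raise ValueError(f"unknown regime: {regime!r}")
    # bisect_left maps a step equal to an edge to the interval ending at that edge.
    interval = bisect.bisect_left(EDGES, step)
    return RATES[regime][interval]


if __name__ == "__main__":
    for s in (0, 450_000, 750_000, 900_000):
        print(s, learning_rate(s, "reinforcement"), learning_rate(s, "supervised"))
```

Returning `None` for the supervised regime beyond 800,000 steps mirrors the empty table cell instead of guessing a value.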

[Figure: five panels, each plotting Frequency against Hours of training (0 to 70), with the corresponding joseki shown as an inset of numbered moves. Panels and frequency-axis maxima: 5-3 point press (4.00e-03); Small avalanche (4.00e-04); Attach and draw back (2.50e-04); Knight's move pincer (2.50e-05); Pincer 3-3 point (4.00e-04).]

Extended Data Figure 1: Frequency of occurrence over time during training, for each joseki from Figure 5a
(corner sequences common in professional play that were discovered by AlphaGo Zero). The corresponding joseki are
reproduced in this figure as insets.

[Figure: six panels, each plotting Frequency against Hours of training (0 to 70), with the corresponding joseki shown as an inset of numbered moves. Panels and frequency-axis maxima: 1-1 point (4.50e-04); Outside attachment (6.00e-04); Knight's move approach (6.00e-04); One-space jump (6.00e-03); 3-3 invasion (3.50e-02); 3-3 point knight's move (2.50e-03).]

Extended Data Figure 2: Frequency of occurrence over time during training, for each joseki from Figure 5b
(corner sequences that AlphaGo Zero favoured for at least one iteration), and one additional variation. The
corresponding joseki are reproduced in this figure as insets.
 Game 1, B: AG Zero, W: AG Zero, Result: B+R                          Game 2, B: AG Zero, W: AG Zero, Result: B+R                         Game 3, B: AG Zero, W: AG Zero, Result: W+R                  Game 4, B: AG Zero, W: AG Zero, Result: B+R
 2     90 63 13 65             51                                                                                                                                                    77                                                                93
                   23 48 88          98 28                           10044 42 48                     82 81                                                               60 58 30 7 76                   59                                         89 70 94
             50 22                      55         76                 98 43 23 22 18 16              80 79                                        8 99 72 70 69 21 61 56 57       4 3     33             58 45 47                          26 24 18 65 67 90
    96 86                92          17       94                      99 87 86 21 24 15 14 20 52 78 49 45 77                                        97 98 74 71 68 87 59 31       6 2 1 35                  44 21                             25 17 20 66
    59             56          83       30    8                          94 95 25 26    7    64 58         67                                 37       80 79 75 73 86 88 91       26 25 5 34                46                                      23 68
                19 93    3 5 81 12 52            9                       96 27 28 11 2 9                      41                                       84                55 29 27 22 32                                                             69
 79                         41     4 25            77                    51 29 30    5       13                                                  83 81 78 66 64 62 54 89 90 39 36 28                                                          99
       39                            14 10 38                            50 59 31 32            56      12                                       85 82 67 65 63 53 52 49 43 41 40                           48 50                          92 95 98 61
    32          33 44 57                      1                          97 33 34                  6                                             93 92 94          51 50 48 42 44                        76 49 72                       88 91 96 84
 54       46                16             91                            37 35 10       8                     46                              38 96      10095 11                                        77 74 43 73                   10097
      100          80                   40 31 6                          39 36 38                                                                                                                           75          78                          82
    68 29                87       69 99 60 75 64 24                      40 54               60          4    47 91                                                                                                                     87 85 83 79 60
                      67 26                66 61                                                                                                                                         46 47           15 13                             86 62 80 81
    73 42                85             11       18 15                       53                                            17                                                            24 45           12 11                                      41
                         71 21 84 62 78          7                      83                                      61                                                                       16 23           10 5 6 14 30 36 34                      40 39
          20 45       27       35             95 53                               57                68      75 69          19 65                    10                       20 18 14 12 15                 4 1 32 29 16 35 56 52 54             22 37
    97                                                                            55           66        72 76 84 3        62 63 93                                             19 17 13 9                  2 3      33 31 57 27 53 55           42 38 51
                      47 72          70 58            49                               73 71        70      74 1 85           88 89 92                                                                   8 7 9          19 28                    64 63 71
          43 89 36                82 37 34         74                                                       90



 Game 5, B: AG Zero, W: AG Zero, Result: B+R                          Game 6, B: AG Zero, W: AG Zero, Result: W+R                         Game 7, B: AG Zero, W: AG Zero, Result: B+R                  Game 8, B: AG Zero, W: AG Zero, Result: B+R
                                                                                                                                                                       38                                            15
    21 71                                                30 73                 17                                              25 39                                36 35 28 37 39                       48 16 14 7 10 13             80          76
    20 13 70                                       26 24 25              16 13 29 28           40          30   96 27 95 23 21 22              85 16                   30 25 26 24 20 33                 47 3 5 6 11         94             50 17 9 77
       12 3 11                                     22 1                  71 12 3                                   84 41 24 2 26            97 80 79 86                32 31 27 1 23                     96 4 2 12 79                 45 20 8 74 75
       14 10 19                                    74 23                 69 14 15                               78 47 46                       88 87 93                                                           95                  55 54          21
          64 62 65                                                       66 65 19                                  58 48       97              95 18                   34                                   91                  97 49          53
       72 27 66 67                                                    70 67 18 64                                           77 42                                                                              9810099             59 58 56 46 57
             69 68 75                                                    68                                                 81 83                                                                           92 93               61 60          72 73
                76              100                                         55 60                                        57 80 31                                                                                               63 62 90 78 71
                                 97 94 98          87                       61 72                                        63 62 82 93                                                   29                                       65 64       89
                                    93 92       82 90 91                10098 73 74            76               59          45 56                                                   99 98                                       67 66
                                 95 81 78 86 61 63                             99                                           79 44 90                                               100                                          69 68          29
                                       79 77 60 80 59 44 48              51                                                 89 85 86                                92                                   83                     88 70          27 28
                                       56 55 57 40 52 51 47              38 35 49                                              88 87                72           96 91 94                                44 41 87                              23 26
       9 5                             54 49 34 58 50 53                    36 37                                              92 91                 71       83 53 90 63 45                                42 43                        33 25 24
       7 2 6                           36 28 33 18 4 37 39                  34 1                                   75 20 8 4 10 94              4 2 10 19 74 82 59 44 41 17 77 15                           40 1                         31 30 32 18
    29 8                   96 88 84 32 31 35 17 15 16 45                    32 33                                  11    7 5 6               21 3 5 6 9       69 52 62 66 40 65 49 43 76 51 57              22 39                     35 34 19
                           99 89 83 42 41 43          38 46              50 43 52 53                                           9             22 14 12 7 8 11 58 67 68 60 42 47 46 50 54 55               82 81 84 85                  37 36 38 52
                                    85                                      54                                                                          13 70 61 64             48          56              86                           51
                                                                                                                                         73 at 61 75 at 67 78 at 58 81 at 67 84 at 58
                                                                                                                                         89 at 67
 Game 9, B: AG Zero, W: AG Zero, Result: W+R                          Game 10, B: AG Zero, W: AG Zero, Result: W+R                         Game 11, B: AG Zero, W: AG Zero, Result: W+R                Game 12, B: AG Zero, W: AG Zero, Result: W+R
                          79                                                       15                                                                                                                                                         59
      70 78 77 66 64 65            61 62                                18 16 14 9 10 13                                                                                                                                                   45 58 57 61
   69 68 76 20 71 60 55             5 46                                17 5 7 8 11                   40 98 4                                       1         94                       3                                 43 42 44 17 28 54 53
   84 3 80 73 72 74 58 59          47    4                                 6 2 12                        97 38                                           7          93                       89 97          3       5                      29    2 55
   87 82 57 81 75 67 63                  52 54                                                             39 19                                6                                      87 83 84 95                       95 89 78 71 67 56 51 52
   85 83 88                              49 48                                                                                                                                         5 82 92 96          32               87 88 48 83 70 60 18
      5610086                            43 50 53                                                                                               99100                                  77 76                             96 92 98 90 85 84
         99                                 51                                                                                                                                         79 78 88                                97 91 86             50
                                      45                                                                             29                            8                                22 80 11 81               36 82                  93 99 49 46
                                                                                                                  83                                                                         98            37 13                              66 64 27 62
                                                   44 41 39                                                       96 28                                                    91       12 10                                                           65 63
                                                   42 40 94                                              82 80 74 41 42                                                    68 75 66          90               31                                    47
     14                                                                 33                57 55 64 81 71 70 72                                                       70 74 73 65 67 13    9                                             94       21 69 68
  16 11 12                                            93 36             26 23 37       63 62 6510056 75 73 43 30                                   43                   69 72             71                                         79 34 22 20 19
     10 9 13                                    37          32             24 25          99 52 66 93 91 89 86 88 45 44 67                   26 24 44          47 40 38             85 14                                         7610077        40 25 81
     15 8 1                                     33 26 2 22 31              22 1              54 53 92 58 60 90 87 68 3 46 69                 52 23 4 28 48 46 41 39 31 30 20 18 16     2                    7 1 11 15 30          38 75 33       4 23 80
     17 6 7 96                     38        34 24 25 23 21 28 35          20 21 51 50       48 94 49 59 61 27 95 78 76 47                      25 27 55 45 42 49 36 21 19 15 32 86                      16 6 8 10 9 14        74 73 35          26 24
     19 18 97 91 90                          30 27          29          32 31 34 35                               84 79 85                   51 53 29 50 57 56 37 34 33 35 17                                      12          72 41          39
                                                                           36                                                                   54 64 63 62 60 59 58 61
89 at 3     92 at 82       95 at 3        98 at 82                  77 at 46

 Game 13, B: AG Zero, W: AG Zero, Result: W+R                         Game 14, B: AG Zero, W: AG Zero, Result: W+R                        Game 15, B: AG Zero, W: AG Zero, Result: W+R                 Game 16, B: AG Zero, W: AG Zero, Result: W+R
                                                58                       62                                                                  67 57
      27 29                               36 34 57                    87 57 56 58 97 88 91                                                   56 55 44 45 59                                                 16 15 18 20
      26 25 23                24 38       32 33                             47 48 13 92 14    46             1                            66 20 17 19 11 58 39 53                   1                          6 7                           14

[Board diagrams for Games 17 to 20 (numbered move sequences) not reproduced.]

Game 17, B: AG Zero, W: AG Zero, Result: W+R
Game 18, B: AG Zero, W: AG Zero, Result: B+R
Game 19, B: AG Zero, W: AG Zero, Result: B+R
Game 20, B: AG Zero, W: AG Zero, Result: W+R




          Extended Data Figure 3: AlphaGo Zero (20 block) self-play games. The 3 day training run was subdivided into
          20 periods. The best player from each period (as selected by the evaluator) played a single game against itself, with 2
          hour time controls. 100 moves are shown for each game; full games are provided in Supplementary Information.
[Board diagrams for Games 1 to 20 (numbered move sequences) not reproduced.]

Game 1, B: AG Zero, W: AG Zero, Result: B+R
Game 2, B: AG Zero, W: AG Zero, Result: W+R
Game 3, B: AG Zero, W: AG Zero, Result: B+R
Game 4, B: AG Zero, W: AG Zero, Result: W+R
Game 5, B: AG Zero, W: AG Zero, Result: W+R
Game 6, B: AG Zero, W: AG Zero, Result: W+R
Game 7, B: AG Zero, W: AG Zero, Result: W+R
Game 8, B: AG Zero, W: AG Zero, Result: B+R
Game 9, B: AG Zero, W: AG Zero, Result: W+R
Game 10, B: AG Zero, W: AG Zero, Result: W+R
Game 11, B: AG Zero, W: AG Zero, Result: B+R
Game 12, B: AG Zero, W: AG Zero, Result: W+R
Game 13, B: AG Zero, W: AG Zero, Result: B+R
Game 14, B: AG Zero, W: AG Zero, Result: W+R
Game 15, B: AG Zero, W: AG Zero, Result: W+R
Game 16, B: AG Zero, W: AG Zero, Result: W+R
Game 17, B: AG Zero, W: AG Zero, Result: B+R
Game 18, B: AG Zero, W: AG Zero, Result: W+R
Game 19, B: AG Zero, W: AG Zero, Result: W+R
Game 20, B: AG Zero, W: AG Zero, Result: W+R




Extended Data Figure 4: AlphaGo Zero (40 block) self-play games. The 40-day training run was subdivided into
20 periods. The best player from each period (as selected by the evaluator) played a single game against itself, with
2-hour time controls. 100 moves are shown for each game; full games are provided in Supplementary Information.

[Board diagrams for Extended Data Figure 5: move-number grids not recoverable from extraction. Panel titles:]
Game 1, B: AG Lee, W: AG Zero, Result: W+R
Game 2, B: AG Lee, W: AG Zero, Result: W+R
Game 3, B: AG Lee, W: AG Zero, Result: W+R
Game 4, B: AG Lee, W: AG Zero, Result: W+0.50
Game 5, B: AG Lee, W: AG Zero, Result: W+R
Game 6, B: AG Lee, W: AG Zero, Result: W+0.50
Game 7, B: AG Lee, W: AG Zero, Result: W+R
Game 8, B: AG Lee, W: AG Zero, Result: W+R
Game 9, B: AG Lee, W: AG Zero, Result: W+R
Game 10, B: AG Lee, W: AG Zero, Result: W+R
Game 11, B: AG Zero, W: AG Lee, Result: B+R
Game 12, B: AG Zero, W: AG Lee, Result: B+1.50
Game 13, B: AG Zero, W: AG Lee, Result: B+R
Game 14, B: AG Zero, W: AG Lee, Result: B+R
Game 15, B: AG Zero, W: AG Lee, Result: B+R
Game 16, B: AG Zero, W: AG Lee, Result: B+R
Game 17, B: AG Zero, W: AG Lee, Result: B+R
Game 18, B: AG Zero, W: AG Lee, Result: B+R
Game 19, B: AG Zero, W: AG Lee, Result: B+1.50
Game 20, B: AG Zero, W: AG Lee, Result: B+R

Extended Data Figure 5: Tournament games between AlphaGo Zero (20 block, 3 day) and AlphaGo Lee,
using 2-hour time controls. 100 moves of the first 20 games are shown; full games are provided in Supplementary
Information.

[Board diagrams, AG Master versus AG Zero: move-number grids not recoverable from extraction. Panel titles:]
Game 1, B: AG Master, W: AG Zero, Result: W+R
Game 2, B: AG Zero, W: AG Master, Result: B+R
Game 3, B: AG Master, W: AG Zero, Result: W+R
Game 4, B: AG Zero, W: AG Master, Result: B+R
Game 5, B: AG Master, W: AG Zero, Result: W+R
Game 6, B: AG Zero, W: AG Master, Result: B+R
Game 7, B: AG Master, W: AG Zero, Result: W+R
Game 8, B: AG Zero, W: AG Master, Result: B+R
Game 9, B: AG Master, W: AG Zero, Result: W+R
Game 10, B: AG Zero, W: AG Master, Result: B+R
Game 11, B: AG Master, W: AG Zero, Result: B+R
Game 12, B: AG Zero, W: AG Master, Result: B+R
    80 79 5              54             63 58 9 12                  17       59 60          52 50 48 47 21 20 26                         14 42 45             67100                                  22                   17 27             55 56                   82
       34             65             92 73 72 59 57                 88 70 64                85 51 49       25 27 29                   30 29 15       27 80 84      90 92                                   23             22 25                52
       16 35                52 77 68 75 74 78                                                                                         47                83 91 85 93 35                                                    26 30 88 32
          37          64 83 69 66 67 70 76                           93 91 92                                                            28                                               36                                 98 87      100
       36                82 53       71                              89 90 94                                                                                                          63 86                           48 95 96 97 94 86 93
    38          94                                                97 96                                                                     69               71                        87                              47 34 40 49 99 91                   92 73
                                                                        99 45100                                          16             16 59                    74                   89 88 82                           39 43                                     16
         23 15 25           50                                       43 98 95                                             14 15       58          70                                      94 81 64                     89             58                      67 14 15
         14 24        47 45 51                          21           5 42                                                 12 13       57 60 5                        77             95       65 9 12                      5 33 83                    60 70 71 12 13
    99                46 44 43 49               20                41 38 40                                                10 11       56 62 61                    75 72          76          11 10                  90 45 24 42 44 35             69 72       61 10 11
        95 2 42       48 39 40                  4 19              39 35 4                        44                    2 9               50 48 49                 73 13                97    1 8                       46    2 37        36       68 59 65       4 9
     97 96 41               22                  18 17             37 36       6                                        8 7            52 51 3     55                                   79 96 7 6                          41 38 6                    66 64       8 7
    10098                                                                                                              86 87                                                              78 66                                                                  62 63




  Game 13, B: AG Master, W: AG Zero, Result: W+R               Game 14, B: AG Zero, W: AG Master, Result: W+R                       Game 15, B: AG Master, W: AG Zero, Result: W+R                                  Game 16, B: AG Zero, W: AG Master, Result: W+R

                12                                                       11 13 15                          29 28 33 34                    63 31 32     66 84                                                           77 37 39 41                               52
         6 8 10 9 38                    48 88                     69 7 9 10 12 14 16                       27 26 31 32                 64 30 27 28 34 24 65           83    18 17                                   83 74 36 35 38 5 43                    15 51 50
         7 1 11 40                         87 4     52            68 8 2                                   25 20 23 1 30                  59 2 29 33                        4 19                                       75 73 4 42 40 93                          3     48
                                  86 84                                                                       21 22                          60                             20 21                                      82 76             45 72                      47 46
                               89 85       49 51 47                                                           61 24 18                    14 26 36 38                 67    22 93 92                                      6                 66       94                53
                                     96       50                                                              94         96            58 25 15 35 37                 87 91 94 23                                            79       54               100          49
         35 33 90                 83 97 53                                   92                                       95               54 61           41       89 85 82                                                  64 63 81                      95
         34 30 31                 81 80                                 44   70 91     93                                           56 53 52                       86 88 80                                                  62 80 61 86                      96 85 97
         32 13                 82 74 73 99                                                                        97     19            57 55                                95                                                           58                      84 19
         39 37              72 79 98                                          67 99                                         58         51                                                                                    44 57             78                   92
                   95 94         100                              66 65 77 89 98                                  60     47 59            16 43 81                       96                                                              65
         36        92 91 66 65 67             17                     72 81 76 80 90                                                    50          40                             79                                      60 59 71          88
         23              93 58 57                55 71 77            6 71 82 83 86                                    46               39 46 5                                 9 12                                                68 69          98       87 24 16
            21              70       69 68       56 60 75            84 87 88 45100                   64           54 52               48 44 45 42 97                       11 10 77 90                                         67 70 89 91                21 22
         3 18 41 5    27 29 45 54 15 63 62 2           76         75 73 4 40 42                               62 56 3 48 55            49 47 73 72     62 99 13 71          1 8 78                                     56 8 2 90                        25 20 23 1 30
         19 20 43 25 16 28 42 46 59 14 61           78            74 36 35 38 5 43                         63 17 57 49 50 53                 3 98     10070 68 69           7 6                                        55 7 9 10 12 14 18 99            27 26 31 32
         22 44 24 26                          64               85 78 37 39 41                                            51                     74                             75 76                                            11 13 17                29 28 33 34
                                                                  79



  Game 17, B: AG Master, W: AG Zero, Result: W+R               Game 18, B: AG Zero, W: AG Master, Result: B+R                       Game 19, B: AG Master, W: AG Zero, Result: W+R                                  Game 20, B: AG Zero, W: AG Master, Result: B+R
                                                  50
        34 31 51                            47 46 43 44                                                            88 89                        12 62 65                    95 78 77 79                  56              45 44
        30 3 28                                7 6                62 34 32 16                    20          21 37 36 81              66 6 8 10 9 61 63                           76 43            50 48 55 72           7 8         52                     6
     38 35 22 23 29          13                1 8                35 33 3              19                          1 42 84               7 1 11                             94 52 53               4 47 60               9 4                                      2
  56 37 36 32 26 33                            11 10              30 25 24 26               44                     85 43                                                                           70 49 71 74        11 10 68 51
     52 39 5 27                             53 48 9 12               28 22 23 39 40                                   15                                                              51        46 44 17              13 12                 46                         5
     55 57 19 18                                  49 45           31 29 27       38                                   80                                97                                         45 54 58           17 14 67
        16 20 21 25       83 80                                               41 61                                                      81    80 91 96 93 67                         75           59 57 68 73           18
              24 41       79 82      10063 99                                                                                               87 85 88 90 92                                               69
                                   62 98                                                                     60          86                 13 82 86
     70                   78 68 92             81 90                                                                                        99 83 84       89
           69             84 77 67 93                                18                            58 57 87 83                                 98                                                                                  41 43       85
        42 15          97                      61 64              17 14                               59 82 55                                   100                                                                   31 29 27 89 88 38 60 86
        14                   91                   54              13 12       64       70 90 92 91       54 5                               29 33       25 41                                                             28 22 23 39 40 50 55                       15
                                               88 66 86           11 10 66 63 65             93 72100 50 48 51                        64 28 23 26 27 18 21 24                                                          30 25 24 26 47 59 56 42                       37 99
           2        17             95 75 89 87 4 65                  9 4         96 56 77 69 94 67 78    2 45 49                      31 3 22 32 5 19 20                              15             2                 35 33 3 48 53 19 91 92               96 63 1 36 98
        72 71 73                74       58 60 59 85                 7 8      97 95 74       71 68 6        46 47                     34 30 35 36 39 16                               42 14                            49 34 32 72 16 69 58 61 66 20        21 94 62100
                             96 76             94                    53 52 98 76 75 73 79                                                   38 37                                                                   81 80 95 74 71 73 70 54 57 64
                                                                                                                                                                                                                    84 83 82 76 78 77 75 65
40 at 35                                                     99 at 67                                                             40 at 35                                                                        79 at 73    87 at 83    90 at 84     93 at 83   97 at 80




           Extended Data Figure 6: AlphaGo Zero (40 block, 40 day) versus AlphaGo Master tournament games using 2-hour
           time controls. The first 100 moves of each of the first 20 games are shown; full games are provided in Supplementary Information.