Grandmaster level in StarCraft II using multi-agent reinforcement learning

Metadata

Slug: alphastar_2019
Year: 2019
Venue: Nature
Authors: Oriol Vinyals et al.
Reading status: read complete
Compute regime: Search, simulation, and science compute
Primary sources: Nature article text, reporting summary, detailed architecture, supplementary pseudocode

Compute Setup

The accessible Nature article text does not name the accelerator hardware used for reinforcement learning. The local supplementary supervised pseudocode does state a device for imitation learning: the supervised learner "uses a 128 core TPUv3 slice", with batch size 512 and trajectory length 64. The RL league pseudocode does not name its learner hardware, but it exposes the distributed shape: BATCH_SIZE = 512, TRAJECTORY_LENGTH = 64, 12 learning players, and 16,000 actor loops per player. It also notes that batched inference was used in practice, although the simplified pseudocode omits it.
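
The distributed shape quoted from the RL pseudocode can be turned into two rough size figures. The sketch below is illustrative only: the class name LeagueShape and its method names are hypothetical, and it assumes each trajectory element corresponds to one agent step and that the 16,000 actor loops apply to each of the 12 learning players.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeagueShape:
    """Hypothetical container for the constants quoted from the RL pseudocode."""
    batch_size: int = 512              # BATCH_SIZE
    trajectory_length: int = 64        # TRAJECTORY_LENGTH
    learning_players: int = 12         # learning players in the league
    actor_loops_per_player: int = 16_000

    def steps_per_batch(self) -> int:
        # Assumption: one trajectory element equals one agent step.
        return self.batch_size * self.trajectory_length

    def total_actor_loops(self) -> int:
        # Assumption: every learning player runs the same number of actor loops.
        return self.learning_players * self.actor_loops_per_player


shape = LeagueShape()
print(shape.steps_per_batch())    # 32768
print(shape.total_actor_loops())  # 192000
```

These figures describe the shape of the system, not its hardware; they say nothing about how many accelerators served the learners.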

The reporting summary gives the clearest indication of compute scale: 360 online evaluation games on European Battle.net servers (90 games per race, plus 30 per race for supervised agents); around 130,000,000 full agent-vs-agent games for league analysis; and around 20,000,000 ablation games. Under the project rule, the undisclosed RL hardware is inferred as 2018-2019 DeepMind distributed accelerator learning plus large CPU/game-simulation actor infrastructure, consistent with the paper's IMPALA-style actor-learner references and the pseudocode's actor/learner split. Exact accelerator counts for RL should not be claimed from the local sources.
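
As a consistency check on the figures above, the per-race counts can be summed and compared with the offline simulation volume. This is a small arithmetic sketch; it assumes the 90 and 30 game counts both apply to each of the three races, and the constant names are illustrative labels rather than terms from the paper.

```python
RACES = ("Protoss", "Terran", "Zerg")
MAIN_GAMES_PER_RACE = 90         # non-supervised agents, per race (assumed reading)
SUPERVISED_GAMES_PER_RACE = 30   # supervised agents, per race

total_online_games = len(RACES) * (MAIN_GAMES_PER_RACE + SUPERVISED_GAMES_PER_RACE)
print(total_online_games)  # 360

LEAGUE_GAMES = 130_000_000    # approximate, from the reporting summary
ABLATION_GAMES = 20_000_000   # approximate, from the reporting summary
offline_games = LEAGUE_GAMES + ABLATION_GAMES

# Offline simulated games per online evaluation game (integer division).
print(offline_games // total_online_games)  # 416666
```

The ratio shows how heavily the evidence base leans on simulated games: several hundred thousand agent-vs-agent games for every game played against humans.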

Bottleneck

StarCraft II combines partial observability, long horizons, real-time action, large structured observations, and a non-transitive multiplayer strategy space. The compute bottleneck is not just neural-network training; it is experience production and opponent diversity. A single policy can overfit to exploitable behaviors, while a small self-play population can miss counter-strategies. The reporting summary's 130M league games and 20M ablation games show that simulation volume is a first-class compute resource.

There is also an inference-budget bottleneck. The agent must act under human-comparable action-rate and delay constraints. The extended-data text says APM limits did not affect supervised performance and were acceptable compared with humans, and delay distributions were measured between game observation and executed action. The architecture therefore has to process entity lists, maps, scalar features, recurrent state, and autoregressive action arguments quickly enough for real-time play.

Method Adaptation

AlphaStar adapts to this compute structure by splitting the problem into supervised imitation, distributed RL, and league maintenance. The article states that AlphaStar uses data from both human and agent games within a diverse league of continually adapting strategies and counter-strategies. The supervised pseudocode samples trajectories from human replay data, applies MLE plus L2 regularization, and uses Adam with learning rate 1e-3 before fine-tuning and 1e-5 during fine-tuning. The RL pseudocode uses Adam with learning rate 3e-5, beta1 0, beta2 0.99, and epsilon 1e-5, and sends length-64 trajectories from actors to learners in batches of 512.
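
The RL optimizer settings have one notable consequence: with beta1 set to 0, Adam keeps no momentum, so the first-moment estimate is simply the current gradient and only the second-moment scaling remains. The scalar sketch below uses the textbook Adam update with bias correction. Whether the pseudocode's optimizer applies bias correction, and where exactly it adds epsilon, are assumptions here, and the function name adam_step is hypothetical.

```python
import math


def adam_step(param, grad, m, v, t, lr=3e-5, beta1=0.0, beta2=0.99, eps=1e-5):
    """One textbook Adam update on a scalar parameter (t counts from 1)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first moment; equals grad when beta1 == 0
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second moment (running mean of grad^2)
    m_hat = m / (1.0 - beta1 ** t)                # bias correction (assumed)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


param, m, v = adam_step(param=1.0, grad=2.0, m=0.0, v=0.0, t=1)
print(m)  # 2.0: with beta1 = 0 the first moment is just the latest gradient
```

On the first step the update size is close to the learning rate itself regardless of gradient magnitude, which is the usual Adam behavior.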

The league mechanism is explicitly compute-directed. For each race, the pseudocode creates main players, main exploiters, league exploiters, and historical checkpoints. Main players draw their matches from three sources: historical opponents sampled with prioritized fictitious self-play (PFSP), self-play against other main agents, and verification matches that check exploitability and forgetting. Checkpoints are gated by billions of agent steps and win-rate thresholds. Simulation is thus used not only to improve a policy but also to search for weaknesses in the population.
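
The idea behind PFSP opponent sampling can be sketched in a few lines: opponents the learner rarely beats get more of the simulation budget. The weighting function used below, (1 - p) raised to a power, is an assumption for illustration and is not stated in these notes; the function names and the example checkpoint names are likewise hypothetical.

```python
import random


def pfsp_weights(win_rates, power=2.0):
    """Turn the learner's win rates against each opponent into sampling weights.

    Assumed weighting: (1 - p) ** power, so hard opponents are sampled more often.
    """
    raw = [(1.0 - p) ** power for p in win_rates]
    total = sum(raw)
    if total == 0.0:
        # The learner beats everyone: fall back to uniform sampling.
        return [1.0 / len(raw)] * len(raw)
    return [w / total for w in raw]


def sample_opponent(names, win_rates, rng):
    """Draw one opponent name according to the PFSP-style weights."""
    return rng.choices(names, weights=pfsp_weights(win_rates), k=1)[0]


names = ["checkpoint_a", "checkpoint_b", "checkpoint_c"]
win_rates = [0.9, 0.5, 0.1]   # learner's estimated win rate against each
weights = pfsp_weights(win_rates)
opponent = sample_opponent(names, win_rates, random.Random(0))
```

With these example win rates, the opponent the learner beats only 10% of the time receives most of the probability mass, while the one it beats 90% of the time is sampled rarely.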

The neural architecture is shaped around StarCraft observations and action latency. Up to 512 entities are preprocessed; entities beyond the first 512 are ignored, and missing entries use a large negative bias. A 3-layer, 2-headed self-attention Transformer embeds entities with key/query/value size 128 and MLP hidden size 1024. Spatial maps are downsampled from 128x128 to 16x16 through convolutions and ResBlocks. Entity, spatial, and scalar embeddings feed a 3-layer LSTM core with hidden size 384. The action is decoded autoregressively: action type, delay, queue flag, up to 64 selected units, target unit, and target location. This factorization keeps a huge combinatorial action space tractable by reusing state and masking invalid arguments.
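
The entity truncation and the large negative bias for missing entries can be illustrated with a padded softmax. This is a minimal sketch in plain Python, not the paper's implementation: it assumes the bias is applied to attention-style logits, the value -1e9 is an illustrative choice for "large negative", and the function names are hypothetical.

```python
import math

MAX_ENTITIES = 512
MASK_BIAS = -1e9   # assumed value; the notes only say "large negative bias"


def pad_and_mask(scores):
    """Keep the first 512 entity scores and fill missing slots with the mask bias."""
    kept = list(scores[:MAX_ENTITIES])
    n_missing = MAX_ENTITIES - len(kept)
    return kept + [MASK_BIAS] * n_missing


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(pad_and_mask([1.0, 2.0, 3.0]))   # three real entities, 509 empty slots
print(len(probs))   # 512
print(probs[3])     # 0.0: masked slots receive no probability mass
```

The fixed slot count keeps tensor shapes static, which is convenient for batched inference, at the cost of dropping any entities past the limit.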

Evidence

The headline result is in the abstract: AlphaStar was rated Grandmaster level for all three StarCraft II races and above 99.8% of officially ranked human players. The reporting summary documents the online evaluation sample size, with 360 games total and race-specific splits. The Battle.net extended-data text says human and AlphaStar win rates closely follow the MMR model, supporting the use of ladder rating as an evaluation lens.

The paper also provides evidence for the league compute design. Extended Data Fig. 5 text states that PFSP-based training outperforms FSP under all measures considered: stronger population, less exploitable solution, and better final agent performance against the corresponding league. Extended Data Fig. 8 reports around 3,000,000 rock-paper-scissors cycles involving at least one exploiter and around 200 involving only main agents, using a 70% win-rate threshold. That is direct evidence that non-transitivity was not a minor edge case; the system needed a population and payoff matrix rather than a single scalar self-play curriculum.
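
A rock-paper-scissors cycle in a payoff matrix can be counted as follows: three agents form a cycle when each beats the next at or above the win-rate threshold. The sketch below is one plausible reading of that definition; whether the paper's count uses a strict or non-strict threshold, and how it enumerates triples, are assumptions, and the function name is hypothetical.

```python
from itertools import combinations


def count_rps_cycles(win_rate, threshold=0.7):
    """Count unordered triples (a, b, c) that form a dominance cycle.

    win_rate[i][j] is the probability that agent i beats agent j.
    """
    n = len(win_rate)
    beats = [[win_rate[i][j] >= threshold for j in range(n)] for i in range(n)]
    cycles = 0
    for a, b, c in combinations(range(n), 3):
        if beats[a][b] and beats[b][c] and beats[c][a]:
            cycles += 1
        elif beats[a][c] and beats[c][b] and beats[b][a]:
            cycles += 1
    return cycles


# Toy payoff matrix: agent 0 beats 2, agent 2 beats 1, agent 1 beats 0.
toy = [
    [0.5, 0.2, 0.8],
    [0.8, 0.5, 0.2],
    [0.2, 0.8, 0.5],
]
print(count_rps_cycles(toy))  # 1
```

The enumeration is cubic in the number of agents, so for a league with many checkpoints a count in the millions is plausible only when such triples are common.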

Historical Effect

AlphaStar made large-scale multi-agent league training a mainstream compute structure for open-ended strategic domains, complementing single-agent Atari and board-game self-play. Its historical role is less about a new accelerator kernel and more about organizing compute: human replay imitation for a starting policy, distributed actors for game generation, centralized learners for neural updates, payoff matrices for opponent selection, and exploiters for robustness. It also shows that interactive inference constraints can coexist with very large offline simulation budgets.

Limits

The accessible article and local supplementary files do not disclose exact RL accelerator counts, host counts, or total training time, so those must remain unspecified. Only the supervised 128-core TPUv3 slice is directly stated. The system relies on StarCraft-specific interfaces, human replay data, action-space engineering, and massive simulation volume. The architecture also truncates or discretizes parts of the environment, such as ignoring entities after 512 and factorizing action arguments, so it is not a generic plug-in policy for arbitrary real-time worlds. Finally, the compute budget is large enough that replication is a systems project, not just a model implementation.