Dota 2 with Large Scale Deep Reinforcement Learning

Compute Setup

The paper describes a large distributed CPU/GPU system but does not name the GPU model. It says OpenAI Five used thousands of GPUs over multiple months, with infrastructure on Google Cloud Platform and fast GPU kernels from the blocksparse library. The final resource table gives ranges rather than a single fixed configuration: 480-1,536 optimizer GPUs, 500-1,440 rollout GPUs, and 80,000-172,800 rollout CPUs. The LSTM size increased from 2048 to 4096 units during the run.

The paper's compute accounting is explicit about optimization but warns that optimization is only part of the total cost. By the April 13, 2019 match against Team OG, OpenAI Five had used 770 +/- 50 PFLOP/s-days of optimization compute on GPUs; by the end of training on April 22, 2019, it had used 820 +/- 50 PFLOP/s-days. The authors estimate optimizer GPUs were roughly 30% of dollar cost, rollout forward-pass GPUs another 30%, rollout CPUs another 30%, and controllers/evaluation/overhead the remaining 10%. The headline PFLOP/s-days number therefore covers only about a third of the estimated dollar cost and understates the full heterogeneous system budget.
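To make the units concrete, the following is a minimal sketch that converts the quoted PFLOP/s-days figures into raw floating-point operations and records the rough cost shares from the paragraph above. The function and variable names are illustrative, and the cost shares are the authors' rough estimates, so any extrapolation from them is indicative only.

```python
SECONDS_PER_DAY = 86_400
FLOPS_PER_PFLOPS = 1e15  # one PFLOP/s is 10^15 floating-point operations per second


def pflops_days_to_flops(pflops_days: float) -> float:
    """Convert PFLOP/s-days into a total count of floating-point operations."""
    return pflops_days * FLOPS_PER_PFLOPS * SECONDS_PER_DAY


# Central estimates quoted in the text (each is +/- 50 PFLOP/s-days).
flops_at_og_match = pflops_days_to_flops(770)
flops_full_run = pflops_days_to_flops(820)

# Rough dollar-cost shares quoted in the text (hypothetical key names).
cost_shares = {
    "optimizer_gpus": 0.30,
    "rollout_forward_pass_gpus": 0.30,
    "rollout_cpus": 0.30,
    "controllers_eval_overhead": 0.10,
}

print(f"OG match:  {flops_at_og_match:.3e} FLOPs of optimizer compute")
print(f"Full run:  {flops_full_run:.3e} FLOPs of optimizer compute")
print(f"Share of cost covered by the headline number: {cost_shares['optimizer_gpus']:.0%}")
```

One PFLOP/s-day is 8.64e19 operations, so the headline figures are on the order of 10^22 operations for optimization alone.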

Bottleneck

Dota 2 makes simulation throughput the central bottleneck. A game runs at 30 frames per second for about 45 minutes, with partial observability, long horizons, a high-dimensional action space, and a multiplayer self-play distribution that changes as the agent improves. The paper reports that the system learned from batches of approximately 2M frames every 2 seconds, a figure that reflects the scale of environment simulation as much as that of neural-network optimization.
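The scale of those numbers can be checked with a few lines of arithmetic. This is a rough sketch only: it assumes each batch frame corresponds to one game frame and ignores sample reuse, so the result is an order-of-magnitude illustration and not a figure from the paper.

```python
FRAMES_PER_SECOND = 30
GAME_MINUTES = 45

# Frames in one game of typical length.
frames_per_game = FRAMES_PER_SECOND * GAME_MINUTES * 60

BATCH_FRAMES = 2_000_000       # approximate batch size quoted in the text
BATCH_INTERVAL_SECONDS = 2     # one batch roughly every two seconds

# Frames consumed by the optimizers per wall-clock second.
frames_consumed_per_second = BATCH_FRAMES / BATCH_INTERVAL_SECONDS

# Equivalent number of full games consumed per wall-clock second,
# under the simplifying assumptions stated above.
games_worth_per_second = frames_consumed_per_second / frames_per_game

print(frames_per_game, frames_consumed_per_second, round(games_worth_per_second, 2))
```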

The second bottleneck is freshness of data. Early systems that waited for full episodes produced rollout data that was hours and thousands of gradient steps old, often useless or harmful. Sample reuse was another constraint: if optimizers consumed data faster than rollouts produced it, the same experience was reused too often and training slowed. OpenAI Five is therefore a tight asynchronous loop between CPUs simulating games, GPUs doing policy inference, and GPUs doing optimization.
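Both constraints can be expressed as simple quantities. The sketch below assumes staleness is measured as the gap between the current parameter version and the version that generated a sample, and sample reuse as the ratio of consumption rate to production rate. The `Sample` and `ExperienceBuffer` classes and the `max_staleness` cutoff are hypothetical illustrations, not the paper's implementation.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Sample:
    policy_version: int      # parameter version that generated this sample
    payload: object = None   # placeholder for observations, actions, rewards


def staleness(current_version: int, sample: Sample) -> int:
    """Number of parameter versions between data generation and consumption."""
    return current_version - sample.policy_version


def sample_reuse(consumed_per_second: float, produced_per_second: float) -> float:
    """Values above 1 mean optimizers see each sample more than once on average."""
    return consumed_per_second / produced_per_second


class ExperienceBuffer:
    """Bounded buffer that only serves samples below a staleness cutoff."""

    def __init__(self, capacity: int, max_staleness: int):
        self._items = deque(maxlen=capacity)  # oldest samples fall off automatically
        self.max_staleness = max_staleness

    def add(self, sample: Sample) -> None:
        self._items.append(sample)

    def draw(self, current_version: int, batch_size: int, rng: random.Random):
        fresh = [s for s in self._items
                 if staleness(current_version, s) <= self.max_staleness]
        return rng.sample(fresh, min(batch_size, len(fresh)))
```

In a design like this, shortening the rollout segments sent to the optimizers lowers staleness, while matching optimizer throughput to rollout throughput keeps the reuse ratio near 1.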

Method Adaptation

OpenAI Five adapts PPO to the hardware by separating simulation, forward passes, and optimization. Rollout worker machines run Dota 2 self-play games at about half real time because that allowed more games in parallel and higher total throughput. The rollout machines run the game engine but not the policy; they communicate with a pool of forward-pass GPUs that execute policy inference in batches of about 60. Optimizer GPUs receive game data into local experience buffers, sample minibatches, compute gradients, and allreduce them with NCCL before synchronously updating parameters.
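The synchronous update at the end of that pipeline can be illustrated with a toy, single-process sketch. It stands in for the NCCL allreduce with a plain Python average and uses a plain gradient-descent step instead of Adam, both of which are simplifying assumptions; the function names are hypothetical.

```python
def allreduce_mean(grads_per_worker):
    """Average gradients element-wise across workers (stand-in for an allreduce)."""
    n_workers = len(grads_per_worker)
    n_params = len(grads_per_worker[0])
    return [sum(g[i] for g in grads_per_worker) / n_workers for i in range(n_params)]


def synchronous_step(params, grads_per_worker, learning_rate):
    """Apply one identical update on every replica using the averaged gradient."""
    avg_grad = allreduce_mean(grads_per_worker)
    return [p - learning_rate * g for p, g in zip(params, avg_grad)]


# Two hypothetical optimizer workers, each with a gradient from its own minibatch.
worker_grads = [[1.0, 2.0], [3.0, 4.0]]
new_params = synchronous_step([1.0, 1.0], worker_grads, learning_rate=0.5)
print(new_params)
```

Because every replica applies the same averaged gradient, all optimizer GPUs hold identical parameters after each step, which is what lets the total batch be treated as one large batch.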

The effective batch follows directly from the device count. Each optimizer GPU uses 120 samples of 16 timesteps, and at peak the system uses up to 1,536 optimizer GPUs, for a total batch size of 120 x 16 x 1,536 = 2,949,120 timesteps, each with five hero policy replicas. Training uses PPO with GAE (lambda 0.95) and gradient clipping, optimized by Adam with truncated backpropagation through time. The model is a central shared LSTM feeding separate policy and value heads.
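The batch arithmetic and the advantage estimator can be sketched briefly. The GAE function below is a standard textbook form of generalized advantage estimation, not code from the paper; the discount `gamma` is left as a required argument because its value is not given above, and the rewards and values in the usage lines are made up for illustration.

```python
def total_batch_timesteps(samples_per_gpu: int, timesteps_per_sample: int,
                          optimizer_gpus: int) -> int:
    """Total timesteps per optimization step across all optimizer GPUs."""
    return samples_per_gpu * timesteps_per_sample * optimizer_gpus


def gae(rewards, values, bootstrap_value, gamma, lam=0.95):
    """Generalized advantage estimation over one trajectory segment.

    rewards[t] and values[t] are per-timestep; bootstrap_value is the value
    estimate for the state following the final timestep of the segment.
    """
    advantages = [0.0] * len(rewards)
    next_value = bootstrap_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * lam * running              # discounted sum of TD errors
        advantages[t] = running
        next_value = values[t]
    return advantages


peak_batch = total_batch_timesteps(120, 16, 1536)
print(peak_batch)

# Hypothetical 16-step segment with a reward only at the end.
example = gae(rewards=[0.0] * 15 + [1.0], values=[0.0] * 16,
              bootstrap_value=0.0, gamma=0.99)
print(example[-1], example[0])
```

With lambda below 1, advantages for early timesteps discount the late reward more heavily than gamma alone would, trading some bias for lower variance.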

The "surgery" mechanism is also a compute adaptation. Because training lasted about 10 months while Dota 2, observations, actions, and model architecture changed, restarting from scratch after each change would have been prohibitively expensive. Surgery transforms an old policy into a compatible new policy so training can continue across environment and architecture shifts.
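One way to see how a policy can change shape without losing its behavior is a function-preserving layer widening, in the spirit of Net2Net-style transformations. The sketch below is an assumption-laden toy: a two-layer ReLU network in plain Python, where new hidden units get random incoming weights and zero outgoing weights. It is not the paper's surgery code, and the real system also had to handle LSTM state and changes to observations and actions.

```python
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(v):
    return [max(0.0, a) for a in v]


def forward(w1, w2, x):
    """Two-layer network: y = W2 * relu(W1 * x)."""
    return matvec(w2, relu(matvec(w1, x)))


def widen_hidden(w1, w2, extra_units, rng):
    """Add hidden units while leaving the network's output unchanged."""
    n_inputs = len(w1[0])
    # New units receive small random incoming weights so they can start learning.
    new_rows = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
                for _ in range(extra_units)]
    w1_new = [row[:] for row in w1] + new_rows
    # Zero outgoing weights mean the new units do not affect the output yet.
    w2_new = [row[:] + [0.0] * extra_units for row in w2]
    return w1_new, w2_new


rng = random.Random(0)
w1 = [[0.5, -0.2], [0.1, 0.3]]
w2 = [[1.0, -1.0]]
x = [1.0, 2.0]
w1_wide, w2_wide = widen_hidden(w1, w2, extra_units=3, rng=rng)
print(forward(w1, w2, x), forward(w1_wide, w2_wide, x))
```

The old and widened networks produce the same output for the same input, so training can resume from the existing skill level while the extra capacity is gradually put to use.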

Evidence

The main performance evidence is that OpenAI Five defeated Team OG, the reigning Dota 2 world champions, on April 13, 2019, and then won 99.4% of more than 7,000 public games during the online showcase. The paper ties this to the resource scale above: by the OG match, training had consumed 770 +/- 50 PFLOP/s-days of GPU optimization compute and months of CPU/GPU rollout infrastructure.

The systems evidence is equally important. The paper reports that larger batch sizes sped up training throughout the tested regime, up to batches of millions of observations. Rerun's 983K-timestep batch, roughly 8x the 123K-timestep baseline, gave about a 2.5x speedup, so the gains were real but sublinear in batch size. The authors also show that stale rollout data and excessive sample reuse slow learning, explaining why optimizer GPUs had to scale with rollout CPUs and forward-pass GPUs.
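The batch-size result can be restated as a scaling efficiency, meaning the speedup obtained per unit of extra batch. This is a back-of-the-envelope sketch using the rounded figures quoted above; the variable names are illustrative and the efficiency measure is a convenience for this note, not a metric reported by the paper.

```python
BASELINE_BATCH = 123_000    # timesteps, rounded as quoted in the text
LARGE_BATCH = 983_000       # timesteps, rounded as quoted in the text
REPORTED_SPEEDUP = 2.5      # approximate speedup over the baseline

# How much larger the big batch is than the baseline.
batch_ratio = LARGE_BATCH / BASELINE_BATCH

# Fraction of ideal (linear) scaling that was actually realized.
scaling_efficiency = REPORTED_SPEEDUP / batch_ratio

print(f"batch ratio: {batch_ratio:.2f}x, scaling efficiency: {scaling_efficiency:.0%}")
```

An efficiency well below 100% means each additional unit of compute bought less wall-clock speedup, which is consistent with a regime where shortening training time was judged worth the extra cost.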

Historical Effect

OpenAI Five showed that large-scale self-play RL could reach world-champion esports performance when simulation and optimization are treated as a distributed system. It extended the AlphaGo/AlphaZero lesson into a real-time, partially observable, multi-agent environment where experience generation, policy inference, and gradient updates all had to be scaled together.

Historically, it is a search/simulation/science-compute milestone because the limiting resource was not simply matrix multiplication. The project depended on CPU simulation workers, GPUs split between inference and optimization, large-batch PPO, and training-continuation tools for a changing environment. It made the systems side of RL impossible to separate from the algorithmic result.

Limits

The limits are substantial and mostly compute-structure limits. OpenAI Five does not learn from pixels; it uses semantic observations and a custom Dota 2 gym environment. The system plays with restrictions relative to full Dota 2, including a limited hero pool and the exclusion of some mechanics that require controlling multiple units. It also relies on scripted or engineered subsystems and substantial environment-integration work.

The hardware disclosure is incomplete at the device-SKU level. The paper provides GPU and CPU counts, batch sizes, throughput, and PFLOP/s-days, but not GPU model names or detailed cluster topology. The reported optimization compute also excludes large parts of the total system, including rollout CPUs and forward-pass serving GPUs. Finally, this approach assumes a simulatable environment that can run in massive parallelism. That is exactly what made OpenAI Five possible, and also what limits direct transfer to domains where real-world interaction rather than simulation is the bottleneck.

Links

  • Compute regime: history/compute_regimes/search_simulation_science_compute/README.md