Human-level control through deep reinforcement learning

Metadata

Compute Setup

The paper does not report the training hardware, and the local device extraction file also marks 2015-dqn_2015 as not reported. Under the project rule, the device is therefore inferred from the research period and device era rather than claimed as paper-stated. The conservative inference from the accelerator-era map is a 2014 NVIDIA Tesla K40-class CUDA research server, most likely a single-GPU or small local GPU setup. The paper supports the workload scale but not the exact device.

The workload is per-game deep reinforcement learning from Atari pixels. The authors train a different network for each of 49 Atari 2600 games. Each network is trained for 50 million frames, described as about 38 days of game experience, using RMSProp with minibatches of 32 and a replay memory of the 1 million most recent frames. The model is small by later standards, but the experiment is compute-heavy because the forward/backward loop is embedded in a long simulator data-generation loop.
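
The frame budget and the "38 days" figure can be reconciled with a short calculation. This is a sketch under two assumptions that this note does not state: an action repeat of k = 4, and that the 50 million count refers to frames the agent observes rather than raw emulator frames. The function name is hypothetical.

```python
# Assumptions (not stated in this note): action repeat k = 4, and the
# 50 million figure counts agent-observed frames, not raw emulator frames.
AGENT_FRAMES = 50_000_000
FRAME_SKIP = 4            # assumed action-repeat value
EMULATOR_HZ = 60          # Atari frame rate given in the note
SECONDS_PER_DAY = 86_400


def game_experience_days(agent_frames=AGENT_FRAMES,
                         frame_skip=FRAME_SKIP,
                         hz=EMULATOR_HZ):
    """Convert a frame budget into days of real-time game play."""
    emulator_frames = agent_frames * frame_skip
    return emulator_frames / hz / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{game_experience_days():.1f} days of game experience")
```

Without the factor of k, the same budget at 60 Hz would come to under 10 days, so the action repeat appears to be needed to match the stated figure.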

Bottleneck

The bottleneck is not only neural-network FLOPs. It is the interaction between unstable bootstrapped value learning, correlated simulator samples, long per-game training runs, and limited 2014 GPU memory. The input begins as 210 x 160 color video at 60 Hz, but the network consumes an 84 x 84 x 4 history. That stack of four frames preserves short-term motion information without feeding a recurrent model or long sequence into the GPU.
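
The four-frame history can be sketched as a small fixed-length buffer. This is a minimal illustration, not the authors' code: the class name is hypothetical, frames are treated as opaque objects (for example, 84 x 84 arrays), and filling the history with copies of the first frame at episode start is an assumption.

```python
from collections import deque


class FrameHistory:
    """Keeps the most recent `length` preprocessed frames as one observation."""

    def __init__(self, length=4):
        self.length = length
        self.frames = deque(maxlen=length)

    def reset(self, first_frame):
        # Assumption: pad with copies of the first frame so the history
        # is full from the first step of an episode.
        self.frames.clear()
        for _ in range(self.length):
            self.frames.append(first_frame)
        return self.observation()

    def push(self, frame):
        # The deque drops the oldest frame automatically once it is full.
        self.frames.append(frame)
        return self.observation()

    def observation(self):
        # Oldest frame first, newest frame last.
        return tuple(self.frames)
```

The network input is then the stacked tuple rather than a single frame, which is how motion becomes visible without a recurrent model.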

Deep Q-learning with a nonlinear function approximator can diverge because the model is trained toward targets that depend on its own changing parameters. Atari adds another device-shaped problem: consecutive frames are highly correlated, so naive online SGD wastes minibatches on redundant states and creates unstable feedback loops. The solution has to make the network trainable on a GPU from off-policy replay while retaining enough state history to learn action values from pixels.

Method Adaptation

DQN adapts the method to fit the single-GPU deep-learning regime in several concrete ways. Experience replay stores transitions and samples minibatches uniformly, which turns a temporally correlated stream into a more IID-like training set and lets each emulator transition be reused in many weight updates. The 1-million-frame replay memory is a memory-side design choice: it is large enough to decorrelate and smooth the behavior distribution, but bounded so that training keeps cycling through recent experience.
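
A bounded, uniformly sampled replay memory can be sketched as a circular buffer. This is a minimal sketch, not the authors' implementation: the class and method names are hypothetical, and transitions are stored as plain tuples.

```python
import random


class ReplayMemory:
    """Fixed-capacity buffer that overwrites the oldest transition when full."""

    def __init__(self, capacity=1_000_000, seed=None):
        self.capacity = capacity
        self.buffer = []
        self.position = 0                 # next slot to write
        self.rng = random.Random(seed)

    def store(self, transition):
        # transition is assumed to be (state, action, reward, next_state, done).
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.position] = transition
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size=32):
        # Uniform sampling without replacement within one minibatch.
        return self.rng.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Because sampling is uniform over the buffer, a transition can appear in many minibatches before it is overwritten, which is the sample-reuse property described above.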

The target network is another compute-stability adaptation. Instead of using a target that changes every gradient step, the method periodically clones the Q-network and holds that target fixed for a block of updates. Reward clipping and error clipping further keep gradients in a numerically manageable range.
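
The target computation and the two clipping steps can be sketched in plain Python. All names are hypothetical, the clipping range of [-1, 1] and the discount default of 0.99 are assumptions not given in this note, and the Q-values are passed in as plain lists rather than produced by a real network.

```python
def clip(value, low=-1.0, high=1.0):
    """Clamp a scalar to [low, high]."""
    return max(low, min(high, value))


def td_target(reward, next_q_values_target, done, gamma=0.99):
    """Target y = r for terminal steps, else r + gamma * max_a' Q_target(s', a').

    next_q_values_target must come from the frozen target network,
    not from the network currently being updated.
    """
    r = clip(reward)                      # reward clipping (assumed range)
    if done:
        return r
    return r + gamma * max(next_q_values_target)


def clipped_td_error(q_value, target):
    """Error clipping: bound the TD error before it drives the update."""
    return clip(target - q_value)


class TargetSync:
    """Counts updates and signals when to clone the online network."""

    def __init__(self, period):
        self.period = period              # hypothetical sync interval
        self.updates = 0

    def step(self):
        self.updates += 1
        return self.updates % self.period == 0
```

Between sync points the target values change only because the sampled transitions change, not because the parameters that generate them moved.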

The network architecture is also built around inference cost. Some previous Q architectures take the state and an action as input and therefore require a separate forward pass for every action, so cost scales linearly with the action count. DQN instead has one shared convolutional trunk and a separate output unit for each valid action, so the full action-value vector is produced in one pass. Frame skipping is similarly compute-aware: the agent selects an action only on every kth frame and repeats it on the skipped frames. Because stepping the emulator is much cheaper than running the network to select an action, the agent can play roughly k times more frames without a proportional increase in runtime.
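
Both ideas can be sketched without a real emulator or network. The function names and the step_fn interface (returning frame, reward, done) are hypothetical, k = 4 is an assumed default, and summing rewards over the repeated frames is one reasonable choice rather than a detail confirmed by this note.

```python
def repeat_action(step_fn, action, k=4):
    """Apply `action` for up to k emulator frames.

    step_fn(action) is assumed to return (frame, reward, done).
    Returns the last frame seen, the summed reward, and the done flag.
    """
    total_reward = 0.0
    frame, done = None, False
    for _ in range(k):
        frame, reward, done = step_fn(action)
        total_reward += reward
        if done:
            break                         # stop repeating once the episode ends
    return frame, total_reward, done


def greedy_action(q_values):
    """Pick the best action index from one forward pass's output vector."""
    return max(range(len(q_values)), key=q_values.__getitem__)
```

With this structure the network runs once per k emulator frames, and one run yields the value of every action at the same time.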

Evidence

The methods section gives the main training scale: 49 Atari games, a different network per game, RMSProp minibatches of 32, epsilon-greedy exploration annealed from 1.0 to 0.1 over the first million frames, 50 million total frames per game, and replay memory of 1 million most recent frames. The input is an 84 x 84 x 4 preprocessed image history.
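
The exploration schedule can be written as a small function. This sketch assumes the annealing is linear and that epsilon stays at its final value afterwards; the function name is hypothetical.

```python
def epsilon_at(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Exploration rate at a given frame count.

    Assumption: linear interpolation from `start` to `end` over
    `anneal_frames`, then held constant at `end`.
    """
    fraction = min(max(frame, 0) / anneal_frames, 1.0)
    return start + fraction * (end - start)


if __name__ == "__main__":
    for f in (0, 250_000, 500_000, 1_000_000, 50_000_000):
        print(f, round(epsilon_at(f), 3))
```

Under these assumptions the agent spends only the first 2% of a 50-million-frame run annealing, and the remaining 98% at the final exploration rate.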

The benchmark evidence is broad rather than one-off. DQN outperforms the best previous reinforcement-learning methods on 43 of 49 games. It reaches more than 75% of the professional human tester's score on 29 games, using the same architecture, learning algorithm, and hyperparameters across the suite. Figure 3 normalizes DQN scores between random play (0%) and a professional human games tester (100%) and shows human-level or above performance on a majority of games.
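
The normalization between random play and the human tester can be expressed as a one-line formula. The function name is hypothetical, and the scores in the usage lines are made-up illustrative numbers, not results from the paper.

```python
def human_normalized_score(agent, random_play, human):
    """Scale a raw game score so random play maps to 0 and the human tester to 100."""
    if human == random_play:
        raise ValueError("human and random scores must differ")
    return 100.0 * (agent - random_play) / (human - random_play)


if __name__ == "__main__":
    # Illustrative numbers only: random = 10, human = 50.
    print(human_normalized_score(40, 10, 50))   # agent between random and human
    print(human_normalized_score(70, 10, 50))   # agent above the human tester
```

On this scale the 75% threshold is a statement about how much of the gap between random and human play the agent closes, not about raw score ratios.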

The ablations are important for the compute-structure claim. Extended Data Table 3 reports agents trained for 10 million frames with replay and the separate target Q-network each toggled on and off. Extended Data Table 4 compares the convolutional network to a linear function approximator under the same replay and target-network machinery, supporting the value of GPU-suitable convolutional representation learning from pixels.

Historical Effect

DQN made deep reinforcement learning from raw pixels a practical accelerator-era baseline. The historical compute effect was to show that a fixed GPU-trained recipe could absorb large simulator workloads and produce one general benchmark table across dozens of games. This shifted deep RL from hand-engineered features and small domains toward standardized simulator throughput, replay buffers, and convolutional policies.

It also foreshadowed a pattern later simulation papers would amplify: the model may be modest, but capability comes from many environment interactions plus careful reuse of samples. The expensive object is not just a static dataset; it is the loop that creates data, stores it, samples it, and updates a policy/value function.

Limits

The exact hardware is inferred, not paper-stated, and the paper does not report wall-clock training time. The method also remains sample-inefficient by human standards: 50 million frames is about 38 days of game experience per title. Long-horizon exploration remains weak, and the paper explicitly calls out temporally extended planning tasks such as Montezuma's Revenge as a major challenge.

The replay buffer is uniformly sampled and finite, so it can overwrite rare important transitions and does not prioritize high-learning-value experience. The Atari benchmark disables audio and uses minimal prior knowledge, which helps standardization but also narrows the domain. DQN's compute adaptation solved the 2015 single-GPU stability problem; it did not solve exploration, planning, or efficient transfer across games.

Links

  • Compute regime: history/compute_regimes/search_simulation_science_compute/README.md
  • Source PDF and extracted text are listed in metadata above.
  • Queue status: read_complete.