Voyager: An Open-Ended Embodied Agent with Large Language Models

Metadata

Reading status: read complete
Year: 2023
Compute regime: Inference-time compute and post-training (inference_time_compute_post_training)
PDF: 2023-voyager_2023.pdf
Extracted text: 2023-voyager_2023.txt
PDF URL: https://arxiv.org/pdf/2305.16291.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper does not list accelerator hardware, GPU type, host count, or wall-clock training cost. It explicitly says Voyager interacts with GPT-4 through black-box queries and bypasses model parameter finetuning. The implementation uses OpenAI gpt-4-0314 and gpt-3.5-turbo-0301 APIs for text completion and text-embedding-ada-002 for embeddings. Temperatures are set to 0 except the automatic curriculum, which uses temperature 0.1 to encourage task diversity. Under the project rule, the LLM compute is inferred to be OpenAI-hosted inference on the provider's 2023 accelerator infrastructure; the paper itself does not identify those devices.
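The model and temperature choices above can be written down as a small configuration table. This is only a sketch: the component key names and the helper function are hypothetical labels chosen for this card, not names from the Voyager code. The model identifiers, the temperatures, and the assignment of models to components follow what this card reports.

```python
# Hypothetical component names; model identifiers and temperatures follow the card.
LLM_CONFIG = {
    "curriculum": {"model": "gpt-4-0314", "temperature": 0.1},
    "code_generation": {"model": "gpt-4-0314", "temperature": 0.0},
    "self_verification": {"model": "gpt-4-0314", "temperature": 0.0},
    "task_context": {"model": "gpt-3.5-turbo-0301", "temperature": 0.0},
}

# Embedding model used to index skill descriptions.
EMBEDDING_MODEL = "text-embedding-ada-002"


def nonzero_temperature_components(config):
    """Return the sorted names of components that sample with temperature > 0."""
    return sorted(name for name, spec in config.items() if spec["temperature"] > 0)
```

Only the curriculum entry samples with nonzero temperature, which matches the stated goal of encouraging task diversity while keeping every other call deterministic.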

The non-LLM side is a local or server-side Minecraft simulation stack. The source says the simulation environment is built on MineDojo and uses Mineflayer JavaScript APIs for motor controls. The agent is therefore not training a vision policy or low-level controller. Its expensive loop is API inference plus execution in a symbolic/high-level game interface: propose a curriculum task, retrieve skills by embedding, ask GPT-4 for code, execute that code in Minecraft, feed back environment messages and interpreter errors, and repeat until self-verification accepts the skill or the system gives up after four rounds of code generation.
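The loop described above can be sketched as a bounded retry around three pluggable steps. This is a minimal illustration, not the authors' implementation: `attempt_task`, `Execution`, `Outcome`, and the `generate`, `execute`, and `verify` callables are hypothetical names, and the real system issues GPT-4 calls and Mineflayer execution where this sketch takes plain Python functions. Only the four-round cap and the feedback channels come from the paper as summarized here.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ROUNDS = 4  # cap on code-generation rounds per task, as reported in the card


@dataclass
class Execution:
    env_feedback: str      # chat/log messages returned by the simulator
    error: Optional[str]   # interpreter error text, or None if the code ran


@dataclass
class Outcome:
    task: str
    code: Optional[str]    # verified program, or None if the task was abandoned
    rounds: int            # number of code-generation rounds consumed


def attempt_task(
    task: str,
    retrieved_skills: list[str],
    generate: Callable[[str, list[str], str], str],
    execute: Callable[[str], Execution],
    verify: Callable[[str, Execution], bool],
) -> Outcome:
    """Generate code, run it, and retry with feedback until verified or capped."""
    feedback = ""  # the first round has no prior feedback
    for round_no in range(1, MAX_ROUNDS + 1):
        code = generate(task, retrieved_skills, feedback)
        result = execute(code)
        # Commit only when the code ran cleanly and the verifier accepts it.
        if result.error is None and verify(task, result):
            return Outcome(task, code, round_no)
        # Otherwise feed both channels back into the next generation round.
        feedback = f"env: {result.env_feedback}\nerror: {result.error}"
    return Outcome(task, None, MAX_ROUNDS)
```

Passing the model and simulator in as callables keeps the control flow testable without API access. A returned `code` of `None` marks the point where the system would give up on the task.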

Bottleneck

The bottleneck is not accelerator memory for backpropagation; it is long-horizon inference control under API cost and simulator feedback. Minecraft exploration requires sequencing resource acquisition, crafting prerequisites, travel, combat, and recovery from failed code. A one-shot prompt is too brittle because a program can call a nonexistent API, use an invalid Minecraft item, or fail after environmental state changes. Repeating GPT-4 calls can fix those errors, but every round adds latency and paid inference. The paper explicitly flags cost in its discussion: GPT-4 API usage incurs significant cost, and GPT-4 is 15 times more expensive than GPT-3.5.
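A rough way to see how retries compound with the price gap is to count code-generation calls in GPT-3.5-equivalent units. This is back-of-envelope arithmetic under loud assumptions: every call is treated as having the same token count, only code-generation calls are counted (curriculum and self-verification calls are ignored), and `relative_task_cost` is a hypothetical helper. The 15x ratio and the four-round cap are the only figures taken from the paper.

```python
# Relative price per call, assuming equal token counts per call (an assumption).
PRICE_RATIO = {"gpt-3.5": 1.0, "gpt-4": 15.0}
MAX_ROUNDS = 4  # cap on code-generation rounds per task


def relative_task_cost(rounds: int, model: str = "gpt-4") -> float:
    """Cost of one task's code-generation calls, in GPT-3.5-equivalent calls."""
    if not 1 <= rounds <= MAX_ROUNDS:
        raise ValueError(f"rounds must be between 1 and {MAX_ROUNDS}")
    return rounds * PRICE_RATIO[model]
```

Under these assumptions, a task that exhausts all four GPT-4 rounds costs as much as 60 GPT-3.5 calls, which is why a skill that succeeds on the first round and is reused later matters so much for the budget.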

The second bottleneck is memory across tasks. Without gradient updates, the agent needs a way to store useful behaviors so it does not repeatedly rediscover how to craft or collect basic resources. Plain ReAct-style loops spend inference again and again on similar plans. Voyager's open-ended setting makes that waste central: success is measured over 160 prompting iterations, so reusable skills and a curriculum that avoids impossible or premature tasks are the difference between progress and plateau.

Method Adaptation

Voyager adapts by converting action selection into code synthesis. GPT-4 writes Mineflayer JavaScript functions that call high-level control primitives; MineDojo/Mineflayer executes them in the environment. This is a compute-device choice as much as an algorithmic one: the system spends large-model inference on program generation and uses ordinary program execution for repeated low-level action sequences. A successful behavior becomes executable code in the skill library rather than more hidden tokens in context.

The skill library is an explicit memory/bandwidth optimization. When a new skill is verified, Voyager stores the program and indexes it by a text description embedding generated through text-embedding-ada-002. For a new task, GPT-3.5 helps generate task context and the system retrieves the top 5 relevant skills for the GPT-4 prompt. This keeps the prompt focused on reusable procedures instead of all past experience. The automatic curriculum spends GPT-4 inference on picking tasks matched to the current inventory, completed tasks, failed tasks, and nearby state. Iterative prompting then uses three feedback channels: environment feedback from chat/logs, execution errors from the interpreter, and a separate GPT-4 self-verification agent. The four-round cap bounds runaway inference.
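The store-and-retrieve behavior of the skill library can be sketched with an in-memory index and cosine similarity. This is an illustration under stated assumptions: `SkillLibrary`, `cosine`, and the injected `embed` callable are hypothetical names, and the paper's system embeds descriptions with text-embedding-ada-002 where this sketch accepts any function that maps text to a vector. Cosine ranking is an assumption of this sketch; only the top-5 retrieval size comes from the source.

```python
import math
from typing import Callable, Sequence

TOP_K = 5  # number of skills retrieved per task, as reported in the card


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SkillLibrary:
    """In-memory index of verified programs keyed by description embeddings."""

    def __init__(self, embed: Callable[[str], Sequence[float]]):
        self._embed = embed
        self._entries = []  # list of (description vector, skill name, code)

    def add(self, name: str, description: str, code: str) -> None:
        """Store a verified skill under the embedding of its description."""
        self._entries.append((self._embed(description), name, code))

    def retrieve(self, query: str, k: int = TOP_K):
        """Return up to k (name, code) pairs, most similar to the query first."""
        q = self._embed(query)
        ranked = sorted(self._entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [(name, code) for _, name, code in ranked[:k]]
```

Only the retrieved programs enter the code-generation prompt, so prompt length stays bounded by `k` however many skills the library accumulates.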

Evidence

The main exploration result is device-relevant because the x-axis is prompting iterations, not gradient steps. Voyager discovers 63 unique items within 160 prompting iterations, reported as 3.3x more than counterpart agents. It also travels 2.3x longer distances than baselines. In the tech-tree table, ReAct and Reflexion fail to unlock wooden, stone, iron, or diamond tools in all three trials. AutoGPT unlocks wooden, stone, and iron tools but fails diamond. Voyager reaches wooden tools in 6 +/- 2 prompting iterations, stone tools in 11 +/- 2, iron tools in 21 +/- 7, and is the only method to unlock a diamond tool, doing so in one of three trials at 102 iterations.

The ablations show where inference compute pays off. Replacing the automatic curriculum with a random one drops the discovered item count by 93%, which means exploration compute is wasted on tasks attempted out of order. Removing the skill library causes later plateauing, indicating that context-only memory is not enough for open-ended embodied progress. Removing self-verification drops discovered items by 73%, because the system loses the gate that decides when to commit a skill and move on. For code generation, the paper reports that GPT-4 obtains 5.7x more unique items than GPT-3.5, so the expensive model is not an incidental choice in this architecture.

Historical Effect

Voyager is historically important because it makes the agent loop look like a software system: a frontier model writes code, execution supplies structured errors, another model verifies success, and an embedding index provides long-term memory. Compared with RL agents that amortize behavior into weights through many environment steps, Voyager spends inference at task time and stores successes as programs. That shifted later agent work toward code-as-action, tool feedback, curriculum generation, and retrieval over skill libraries.

Limits

The compute economy depends on abstractions the paper does not solve. Voyager does not support visual perception in the reported version and does not compare directly with pixel-in, low-level-control Minecraft agents because it relies on high-level Mineflayer APIs. The source is explicit that the focus is GPT-4 lifelong embodied learning, not 3D perception or sensorimotor control. Hardware and serving details for GPT-4/GPT-3.5 are not disclosed, so cost is discussed only at the API level. The architecture still fails when GPT-4 proposes nonexistent items such as a copper sword, uses invalid resources such as cobblestone as fuel, or calls functions absent from the provided API. The iterative loop catches some of these errors, but each retry costs additional paid inference and simulator time.