Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Metadata

Reading status: read complete
Year: 2023
Compute regime: Inference-time compute and post-training (inference_time_compute_post_training)
PDF: 2023-tree_of_thoughts_2023.pdf
Extracted text: 2023-tree_of_thoughts_2023.txt
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper does not disclose hardware. The experiments use Chat Completion mode GPT-4 for the main results and GPT-3.5-turbo for appendix comparisons, so accelerator devices, batch sizes, and serving topology are hidden behind API inference. Under the project rule, infer only broad 2023 API-hosted frontier-model inference on provider datacenter accelerators, not a paper-specified GPU or TPU.

The paper does provide concrete inference-cost proxies. Experiments were run between May 5 and May 16, 2023, with GPT-4 sampling temperature 0.7 unless otherwise stated. For Game of 24, ToT uses about 5.5K generated tokens and 1.4K prompt tokens per case, costing about $0.74. Best-of-100 CoT uses 6.7K completion and 2.2K prompt tokens, costing $0.47. For Creative Writing, ToT uses about 4K completion and 2.9K prompt tokens, about $0.32 per case, around 5x the IO/CoT token and dollar cost.

Bottleneck

The bottleneck is not model training; it is inference-time search over reasoning states. Standard input-output prompting and chain-of-thought decoding generate a single left-to-right trajectory. If an early step is wrong, later tokens inherit the error. The Game of 24 error analysis makes this concrete: around 60% of CoT samples fail after the first step, effectively after the first small arithmetic move. Repeating CoT 100 times helps, but it is a flat sampling strategy with no local pruning or backtracking.

Tree of Thoughts shifts the bottleneck from one-pass generation to search control. It spends API calls on proposing candidate thoughts, evaluating or voting on partial states, pruning bad branches, and keeping a bounded frontier. The compute unit becomes a node expansion in a thought tree. The method can spend 5 to 100 times more generated tokens than CoT, but spends them on structured exploration rather than independent restarts.

The method is also prompt-bandwidth heavy. Evaluation prompts include the problem, current state, candidate thoughts, and instructions for value or voting. As search grows, prompt tokens can become a substantial part of cost, especially for BFS where many states are evaluated and for DFS where constraints are repeatedly translated into prompts.

Method Adaptation

The method adapts frozen LMs to deliberate reasoning by introducing an external controller around inference. A state is the original input plus a sequence of thoughts. A thought can be sized to the task: an intermediate equation for Game of 24, a short writing plan for Creative Writing, or a word placement for Mini Crosswords. This is a compute adaptation because thought granularity determines branching factor, prompt length, evaluator cost, and whether pruning is possible.

Two generation modes fit different search spaces. Independent sampling works for rich spaces such as writing plans, where diversity matters. Sequential proposal prompts work for constrained spaces like arithmetic steps or crossword words, where proposing several candidates in one context reduces duplication. Evaluation is also modular: value prompts score states independently as sure/maybe/impossible or on a scalar scale, while vote prompts compare multiple states and select the most promising.

The search algorithm is chosen to match depth and branching. Game of 24 and Creative Writing use breadth-first search with bounded breadth because the trees are shallow. In Game of 24, there are three thought steps and the paper keeps the best b = 5 candidates per step, sampling values three times per thought. Creative Writing uses depth 2: sample five plans, vote five times, then sample five passages from the best plan and vote again. Mini Crosswords uses depth-first search because the task can involve up to ten word-filling steps; pruning and backtracking keep the search within a 100-step limit.

This is inference-time compute as an algorithmic budget. Users can change beam size, vote count, model choice, prompt style, early stopping, or pruning thresholds to trade cost for accuracy without retraining the LM.

Evidence

The paper's strongest result is Game of 24 on 100 hard games indexed 901-1000 from 4nums.com. IO prompting reaches 7.3% success, CoT 4.0%, CoT self-consistency with 100 samples 9.0%, and IO plus refine 27%. Best-of-100 IO reaches 33% and best-of-100 CoT reaches 49%. ToT with breadth 1 reaches 45%, and ToT with breadth 5 reaches 74%. This is the compute-structure result: a similar or lower completion-token budget than best-of-100 CoT is much more effective when organized as local search.

Creative Writing shows a softer tradeoff: GPT-4 coherence scoring gives IO 6.19, CoT 6.93, and ToT 7.56, and humans prefer ToT over CoT in 41 cases versus 21 for CoT. Mini Crosswords shows the importance of backtracking: IO gets 14% words and 0 solved games; CoT gets 15.6% words and 1 game; ToT gets 60% words and 4 of 20 games. Removing pruning drops word success to 41.5%, and removing backtracking drops it to 20%.

The appendix tests model substitution. On Game of 24, GPT-3.5 ToT reaches 19% versus GPT-4 ToT at 74%. GPT-4 generation plus GPT-3.5 evaluation reaches 64%, while GPT-3.5 generation plus GPT-4 evaluation reaches 31%, suggesting generation quality is the primary bottleneck and cheaper evaluators may be viable.

Historical Effect

ToT made explicit test-time search a mainstream LLM method. It translated classical AI search ideas into an API-era pattern: generate candidate thoughts, evaluate them with the same or another LM, and use BFS/DFS control logic outside the model. Historically, this helped distinguish "reasoning capability in weights" from "reasoning obtained by spending inference budget."

The paper also made cost accounting part of reasoning evaluation. Its token and dollar tables show that better performance often arrives by spending more inference, but not all spending is equal. Structured search beats independent sampling on the showcased tasks because it allocates computation to branch selection, lookahead, and pruning.

Limits

Hardware is not reported, and the API setting hides batching, latency, and serving details. The token/dollar estimates are therefore the only reliable compute measurements in the local source. ToT can be expensive: the authors state it may require 5 to 100 times more generated tokens than CoT depending on prompts and search algorithms. It also requires task-specific thought decomposition, proposal prompts, evaluators, and search limits.

The method works best when intermediate states can be evaluated. It is less clearly useful for tasks GPT-4 already solves with CoT, tasks where external knowledge is the bottleneck, or open-ended tasks where value prompts are noisy.