ReAct: Synergizing Reasoning and Acting in Language Models

Metadata

Reading status: read complete
Year: 2022
Compute regime: Inference-time compute and post-training (inference_time_compute_post_training)
PDF: 2022-react_2022.pdf
Extracted text: 2022-react_2022.txt
PDF URL: https://arxiv.org/pdf/2210.03629.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper does not list exact accelerator hardware, host counts, or wall-clock cost. Its main setup is a frozen large language model used at inference time: the source says the main experiments prompt PaLM-540B with few-shot in-context examples, and Appendix A.1 reports GPT-3 text-davinci-002 experiments with greedy decoding. Under the project rule, the PaLM runs are inferred to use Google's 2022-era TPU-backed PaLM serving/training infrastructure, while the GPT-3 runs are inferred to use OpenAI-hosted API inference. Those are inferences from the model/provider era, not hardware disclosures in the paper.

The reported gradient-training component is smaller than the main PaLM-540B prompting setup. The authors bootstrap 3,000 ReAct-generated trajectories with correct answers and finetune PaLM-8B and PaLM-62B to decode full trajectories containing thoughts, actions, and observations. Appendix B.1 states that all finetuning uses batch size 64; on PaLM-8B, ReAct and Act are finetuned for 4,000 steps while Standard and CoT are finetuned for 2,000 steps; on PaLM-62B, ReAct and Act again use 4,000 steps while Standard and CoT use 1,000 steps. The compute setup is therefore primarily inference-time token generation over a tool/environment loop, plus a modest post-training pass on smaller PaLM models.
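The step counts and batch size above can be turned into a rough data budget. The sketch below is back-of-envelope arithmetic, not something the paper reports: it assumes each batch element is one trajectory and that every method draws from a pool of 3,000 examples. The names `sequences_seen` and `approx_passes` are illustrative.

```python
BATCH_SIZE = 64    # reported for all finetuning runs
POOL_SIZE = 3_000  # bootstrapped trajectories (assumed to be the pool for every method)

# Finetuning steps as reported in Appendix B.1.
FINETUNE_STEPS = {
    ("PaLM-8B", "ReAct"): 4_000,
    ("PaLM-8B", "Act"): 4_000,
    ("PaLM-8B", "Standard"): 2_000,
    ("PaLM-8B", "CoT"): 2_000,
    ("PaLM-62B", "ReAct"): 4_000,
    ("PaLM-62B", "Act"): 4_000,
    ("PaLM-62B", "Standard"): 1_000,
    ("PaLM-62B", "CoT"): 1_000,
}


def sequences_seen(steps: int, batch_size: int = BATCH_SIZE) -> int:
    """Total training sequences processed, assuming one trajectory per batch slot."""
    return steps * batch_size


def approx_passes(steps: int, pool_size: int = POOL_SIZE,
                  batch_size: int = BATCH_SIZE) -> float:
    """Approximate number of passes over the trajectory pool."""
    return sequences_seen(steps, batch_size) / pool_size


if __name__ == "__main__":
    for (model, method), steps in FINETUNE_STEPS.items():
        print(f"{model:9s} {method:9s} {sequences_seen(steps):>8,d} sequences, "
              f"~{approx_passes(steps):.1f} passes")
```

Under these assumptions the 4,000-step runs process 256,000 sequences, roughly 85 passes over the pool, which is consistent with reading the finetuning as a small post-training pass rather than a large training run.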

Bottleneck

ReAct targets a bottleneck that plain chain-of-thought leaves unaddressed: a reasoning chain sampled in one pass from the model's internal knowledge cannot fetch missing facts, check an environment, or repair a bad search path. Once actions are admitted, the limiting resource changes from a single forward generation to a bounded sequence of thought, action, and observation tokens. That sequence consumes context window, incurs tool/environment latency, and creates a branching search problem because every action changes the next observation.
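A minimal sketch of such a bounded loop may make the budget structure easier to see. It is not the authors' implementation: `run_react`, `Trajectory`, the `finish[...]` stop convention, the toy policy and environment, and the whitespace token proxy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str, str]  # (thought, action, observation)


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    answer: Optional[str] = None

    def token_count(self) -> int:
        # Crude whitespace proxy for context-window use; a real tokenizer differs.
        return sum(len(part.split()) for step in self.steps for part in step)


def run_react(question: str,
              policy: Callable[[str], Tuple[str, str]],
              env: Callable[[str], str],
              max_steps: int) -> Trajectory:
    """Alternate thought/action/observation until finish[...] or the step cap."""
    traj = Trajectory()
    context = question
    for _ in range(max_steps):
        thought, action = policy(context)  # one model generation
        if action.startswith("finish[") and action.endswith("]"):
            traj.steps.append((thought, action, ""))
            traj.answer = action[len("finish["):-1]
            return traj
        observation = env(action)          # one tool/environment call
        traj.steps.append((thought, action, observation))
        # Every step grows the context the next generation must read.
        context += (f"\nThought: {thought}\nAction: {action}"
                    f"\nObservation: {observation}")
    return traj  # answer stays None: the caller decides how to back off


def toy_policy(context: str) -> Tuple[str, str]:
    # Hypothetical stand-in for the frozen LLM.
    if "Observation:" not in context:
        return ("I need to look something up.", "search[topic]")
    return ("The observation contains the answer.", "finish[42]")


def toy_env(action: str) -> str:
    # Hypothetical stand-in for the tool or text environment.
    return "The answer is 42."
```

With the toy stand-ins, `run_react("question", toy_policy, toy_env, max_steps=7)` finishes in two steps, while `max_steps=1` returns a trajectory whose `answer` is `None`, which is the case a fallback strategy has to handle.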

The paper makes this concrete by capping the Wikipedia interaction horizon: if ReAct fails to return an answer within 7 steps on HotpotQA or 5 steps on FEVER, it backs off to CoT self-consistency (CoT-SC), and the authors report that more steps did not improve performance. In ALFWorld, a task instance can contain more than 50 locations and take an expert policy more than 50 steps to solve, so a prompted LLM has to spend scarce context and generations tracking subgoals. In WebShop, the environment contains 1.18M products and 12k human instructions, so search reformulation and product-option filtering become inference-time compute rather than learned parametric memory.

Method Adaptation

The method adapts the language model to this compute regime by forcing the trajectory format to expose intermediate state. For HotpotQA and FEVER, the authors manually compose ReAct-format exemplars: 6 training cases for HotpotQA and 3 for FEVER, with each trajectory alternating thought, action, and observation. Thoughts are used to decompose questions, extract evidence from Wikipedia observations, perform arithmetic or commonsense reasoning, and reformulate searches. Actions are constrained to a simple Wikipedia API, so the model pays only for short textual observations rather than a full web page or dense retrieval stack.
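To illustrate what a constrained action space looks like in practice, the sketch below parses model output into one of three bracketed action forms, `search[entity]`, `lookup[string]`, and `finish[answer]`. Those three forms are recalled from the paper rather than taken from this card, which only says "a simple Wikipedia API", so treat them as an assumption; `parse_action` and `Action` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Assumed action grammar: name[argument], with one of three allowed names.
ACTION_RE = re.compile(r"^\s*(search|lookup|finish)\[(.*)\]\s*$",
                       re.IGNORECASE | re.DOTALL)


class Action(NamedTuple):
    name: str
    argument: str


def parse_action(text: str) -> Optional[Action]:
    """Return the parsed action, or None if the text is not a valid action."""
    match = ACTION_RE.match(text)
    if match is None:
        return None
    return Action(match.group(1).lower(), match.group(2).strip())
```

Rejecting anything outside the grammar (returning `None`) keeps the environment side simple: an invalid action can be answered with a short error observation instead of an expensive retrieval call.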

For longer-horizon decision tasks, the adaptation is sparser. The paper notes that ALFWorld and WebShop can involve many actions, so thoughts need only appear at the most relevant positions. This reduces token overhead while preserving a working-memory channel for subgoal tracking. ReAct also combines with CoT-SC as a compute router: one direction backs off to ReAct when internal-knowledge self-consistency is weak; the other uses CoT-SC as a fallback when the action loop stalls. In the reported plots, the hybrid matches the performance of CoT-SC with 21 samples while using only 3-5 samples, showing a direct exchange between external-tool compute and repeated internal sampling.
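The two routing directions can be sketched as follows. This is a hedged illustration, not the paper's code: the "majority answer occurs fewer than n/2 times" threshold is recalled from the paper and is not stated in this card, the behaviour when ReAct also fails in the first router is a choice made here, and `cot_sample_fn` and `react_answer_fn` are hypothetical callables standing in for model calls.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def majority_vote(answers: List[str]) -> Tuple[str, int]:
    """Most common answer and how many samples produced it."""
    return Counter(answers).most_common(1)[0]


def cot_sc_then_react(question: str,
                      cot_sample_fn: Callable[[str], str],
                      react_answer_fn: Callable[[str], Optional[str]],
                      n_samples: int = 21) -> str:
    """Use CoT-SC first; back off to ReAct when the vote is weak."""
    answers = [cot_sample_fn(question) for _ in range(n_samples)]
    answer, count = majority_vote(answers)
    if count < n_samples / 2:  # assumed weakness threshold
        react_answer = react_answer_fn(question)
        if react_answer is not None:
            return react_answer
    # Confident vote, or ReAct also failed: keep the majority answer.
    return answer


def react_then_cot_sc(question: str,
                      cot_sample_fn: Callable[[str], str],
                      react_answer_fn: Callable[[str], Optional[str]],
                      n_samples: int = 21) -> str:
    """Use ReAct first; back off to CoT-SC when no answer is returned."""
    react_answer = react_answer_fn(question)
    if react_answer is not None:
        return react_answer
    answers = [cot_sample_fn(question) for _ in range(n_samples)]
    return majority_vote(answers)[0]
```

Here `react_answer_fn` is expected to return `None` when the action loop hits its step cap without finishing, and `n_samples` is the knob that trades repeated internal sampling against tool calls.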

Evidence

On PaLM-540B prompting, Table 1 reports HotpotQA exact match of 27.4 for ReAct versus 25.7 for Act; on FEVER, ReAct reaches 60.9 accuracy versus 58.9 for Act. The hybrids are stronger: ReAct -> CoT-SC reaches 35.1 HotpotQA EM, while CoT-SC -> ReAct reaches 64.6 FEVER accuracy. The failure analysis is compute-structural: ReAct has fewer hallucination failures than CoT because external observations ground the trajectory, but it has more reasoning errors because the fixed action/observation structure can trap the model in repetitive or poorly chosen searches.

The embodied/text-environment evidence is sharper. On ALFWorld, ReAct best-of-6 reaches 71% success across tasks, compared with Act best-of-6 at 45% and BUTLER best-of-8 at 37%; BUTLER is an imitation-learning baseline trained on 100k expert trajectories for each task type. On WebShop, ReAct reaches a 66.6 score and 40.0% success rate, compared with Act at 62.3/30.1, imitation learning at 59.9/29.1, and IL+RL at 62.4/28.7. These numbers support the paper's claim that spending inference on natural-language subgoal reasoning can beat heavier task-specific training when the environment is language-heavy.
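The claim that the decision-task evidence is sharper can be checked by lining up the ReAct-minus-Act margins from the figures quoted in this section. The snippet below only restates those numbers; the metrics differ across benchmarks (exact match, accuracy, score, success rate), so the margins are indicative rather than directly comparable, and `react_margin` is an illustrative name.

```python
# (ReAct, Act) pairs as quoted in this card.
RESULTS = {
    "HotpotQA EM": (27.4, 25.7),
    "FEVER accuracy": (60.9, 58.9),
    "ALFWorld success % (best of 6)": (71.0, 45.0),
    "WebShop score": (66.6, 62.3),
    "WebShop success %": (40.0, 30.1),
}


def react_margin(metric: str) -> float:
    """ReAct minus Act on the given metric, rounded to one decimal place."""
    react, act = RESULTS[metric]
    return round(react - act, 1)


if __name__ == "__main__":
    for metric in RESULTS:
        print(f"{metric:32s} {react_margin(metric):+.1f}")
```

The knowledge-task margins come out at about two points, while the ALFWorld and WebShop success-rate margins are 26 and 9.9 points respectively.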

The finetuning evidence shows the post-training side of the same structure. With just 3,000 bootstrapped examples, finetuned ReAct on PaLM-8B outperforms all PaLM-62B prompting methods, and finetuned ReAct on PaLM-62B outperforms all PaLM-540B prompting methods in the reported HotpotQA scaling figure. That is not a claim that 8B is generally stronger than 540B; it is evidence that training the trajectory interface can substitute for some scale on this specific tool-use behavior.

Historical Effect

ReAct helped turn tool use from an external wrapper into a first-class decoding pattern: the model generates reasoning, issues actions, receives observations, and continues. Its historical role is not a new accelerator or distributed-training trick, but a new placement of compute. Instead of spending all budget on parameters or samples of a hidden chain, it spends tokens on recoverable interaction with search APIs and text environments. That pattern directly anticipates later LLM agents that combine planner state, tool calls, retrieval, execution feedback, and fallback sampling.

Limits

The paper's limits also follow from this compute structure. Hardware and serving cost are not disclosed, so the card cannot attach a precise device cost to PaLM-540B or GPT-3 use. ReAct is sensitive to action-horizon choices: additional Wikipedia steps beyond the chosen caps did not help, and bad or empty search results can derail the whole trajectory. Prompted traces must fit in context, and every observation competes with reasoning tokens for that space. In interactive environments, tool latency and environment API design become part of the model's effective inference budget. Finally, ReAct does not solve low-level perception or control; it works best when the environment can be represented as compact text observations and valid text actions.