Toolformer: Language Models Can Teach Themselves to Use Tools

Metadata

Reading status: read complete
Year: 2023
Compute regime: Inference-time compute and post-training (inference_time_compute_post_training)
PDF: 2023-toolformer_2023.pdf
Extracted text: 2023-toolformer_2023.txt
PDF URL: https://arxiv.org/pdf/2302.04761.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper explicitly states its training hardware and several memory choices. Toolformer is based on GPT-J with 6.7B parameters. Appendix B says the authors use up to 25k examples per API, maximum sequence length 1,024, effective batch size 128, DeepSpeed ZeRO-3, 8 NVIDIA A100 40GB GPUs, BF16, and training up to 2,000 steps. They evaluate perplexity on a 1,000-example CCNet development set every 500 steps and pick the best checkpoint.
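The figures above bound how much data the finetuning run can touch. The short calculation below is a sketch using only the numbers listed in this card; how the per-GPU share of each batch is split between micro-batch size and gradient accumulation is not stated in the source, so it is left unspecified.

```python
# Values copied from the Appendix B summary above.
MAX_SEQ_LEN = 1024      # maximum sequence length in tokens
EFFECTIVE_BATCH = 128   # sequences per optimizer step
MAX_STEPS = 2000        # upper bound on training steps
EVAL_EVERY = 500        # perplexity check interval in steps
NUM_GPUS = 8

# Sequences each GPU contributes to one optimizer step
# (micro-batch size times accumulation steps; the split is not reported).
seqs_per_gpu_per_step = EFFECTIVE_BATCH // NUM_GPUS

# Upper bounds: sequences can be shorter than MAX_SEQ_LEN,
# and the selected checkpoint may come from an earlier step.
max_sequences_seen = EFFECTIVE_BATCH * MAX_STEPS
max_tokens_seen = max_sequences_seen * MAX_SEQ_LEN

# Checkpoints compared on the CCNet development set.
num_checkpoints = MAX_STEPS // EVAL_EVERY

print(f"sequences per GPU per step: {seqs_per_gpu_per_step}")
print(f"max sequences seen:         {max_sequences_seen:,}")
print(f"max tokens seen:            {max_tokens_seen:,}")
print(f"checkpoints compared:       {num_checkpoints}")
```

At most about 262 million tokens pass through the model during finetuning, which is why the job is small next to pretraining-scale runs.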

The setup is a compact multi-GPU finetuning job rather than frontier-model pretraining. ZeRO-3 partitions optimizer state, gradients, and parameters across devices, which is the key memory adaptation that lets a 6.7B-parameter model train on eight 40GB A100s with an effective batch size of 128 and 1,024-token sequences. BF16 halves the memory and bandwidth per value relative to FP32 while keeping the same exponent range. The source does not give wall-clock time or inference hardware for downstream evaluations.
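A rough memory estimate shows why sharding matters here. The sketch below assumes the common mixed-precision Adam accounting of 16 bytes per parameter; the paper does not state its optimizer or whether FP32 master weights are kept, so the byte counts are an assumption, and activations, buffers, and fragmentation are ignored.

```python
PARAMS = 6.7e9        # model size from the paper
NUM_GPUS = 8
GPU_MEMORY_GB = 40

# ASSUMPTION: mixed-precision Adam layout (not stated in the source).
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "bf16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}

bytes_per_param = sum(BYTES_PER_PARAM.values())

# Model state if every GPU held a full replica (no sharding).
total_gb = PARAMS * bytes_per_param / 1e9

# ZeRO-3 shards weights, gradients, and optimizer state across all GPUs.
per_gpu_gb = total_gb / NUM_GPUS

print(f"bytes per parameter:      {bytes_per_param}")
print(f"unsharded model state:    {total_gb:.1f} GB")
print(f"ZeRO-3 share per GPU:     {per_gpu_gb:.1f} GB")
print(f"left for activations etc: {GPU_MEMORY_GB - per_gpu_gb:.1f} GB")
```

Under these assumptions the unsharded state is well above a single 40GB card, while the sharded share leaves most of each GPU free for activations.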

Bottleneck

The bottleneck is generating useful tool-use supervision without human annotation for every API call. Tool calls are sparse: most text positions do not need a calculator, calendar, QA system, translation system, or Wikipedia search. Naively finetuning on every sampled call would waste context, teach bad habits, and degrade ordinary language modeling. But executing and filtering candidate calls is itself expensive because the system must sample possible insertion points, generate candidate API calls, execute the APIs, and score whether the result reduces future-token loss.

At inference time, the bottleneck shifts to branching and latency. A model that can call tools can also over-call them, get stuck in a loop, or call a tool when the API cost exceeds the expected gain. The paper controls this by allowing at most one API call per input and by starting an API call whenever the API token is among the top-k next tokens rather than only when it is the single most likely token. A larger k raises the call rate, so the tool-call budget becomes a decoding hyperparameter.
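The two decoding controls can be sketched as a single decision function. This is an illustration, not the authors' code: the marker string `<API>`, the function name, and the toy probability table are all hypothetical.

```python
from typing import Dict

API_TOKEN = "<API>"  # hypothetical name for the API-start marker


def should_start_api_call(
    next_token_probs: Dict[str, float],
    k: int,
    calls_made: int,
    max_calls: int = 1,
) -> bool:
    """Return True if decoding should open an API call at this position.

    k=0 disables tool use, k=1 requires the API token to be the single
    most likely token, and larger k makes calls progressively easier to
    trigger. max_calls=1 mirrors the one-call-per-input limit.
    """
    if k <= 0 or calls_made >= max_calls:
        return False
    ranked = sorted(next_token_probs, key=next_token_probs.get, reverse=True)
    return API_TOKEN in ranked[:k]


# Toy distribution in which the API token is only the 4th most likely token.
probs = {"the": 0.40, "a": 0.25, "in": 0.15, API_TOKEN: 0.08, "of": 0.07, "and": 0.05}

for k in (0, 1, 3, 10):
    print(k, should_start_api_call(probs, k=k, calls_made=0))

# A second call is refused once the budget of one call is spent.
print(should_start_api_call(probs, k=10, calls_made=1))
```

In this toy case the call fires only at k=10, which illustrates how k trades extra tool executions for coverage.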

Method Adaptation

Toolformer adapts by using the language model's own loss as the supervision filter. For each API, the authors write a few demonstrations that prompt GPT-J to insert calls into plain text. They sample candidate positions where the model assigns sufficient probability to starting an API call, sample up to five API calls per position, execute each call, and compare a weighted cross-entropy over the following tokens with and without the API result. The baseline is the lower of two losses: the loss with no call at all and the loss with the call but without its result. A call is retained only if the result lowers the loss by at least the filtering threshold relative to that baseline. The retained calls are then inserted into the original CCNet text, producing an augmented corpus that preserves ordinary language data while adding calls where they help.
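The retention rule can be sketched in a few lines. The function names and toy log-probabilities are hypothetical, and the linearly decaying token weights are an assumption of this sketch; a real pipeline would obtain the log-probabilities from forward passes of the language model under each prefix.

```python
from typing import List, Sequence


def decay_weights(n: int, slope: float = 0.2) -> List[float]:
    """Normalized weights that emphasize tokens right after the call.

    ASSUMPTION: linear decay max(0, 1 - slope * t), chosen for illustration.
    """
    if n < 1:
        raise ValueError("need at least one future token")
    raw = [max(0.0, 1.0 - slope * t) for t in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_loss(token_logprobs: Sequence[float]) -> float:
    """Weighted cross-entropy over the tokens that follow the call position."""
    weights = decay_weights(len(token_logprobs))
    return -sum(w * lp for w, lp in zip(weights, token_logprobs))


def keep_call(
    lp_with_result: Sequence[float],
    lp_call_only: Sequence[float],
    lp_no_call: Sequence[float],
    threshold: float = 1.0,
) -> bool:
    """Keep the call only if its result beats the best result-free baseline."""
    loss_plus = weighted_loss(lp_with_result)
    loss_minus = min(weighted_loss(lp_no_call), weighted_loss(lp_call_only))
    return loss_minus - loss_plus >= threshold


# Toy log-probabilities of the next three tokens under each prefix.
with_result = [-0.1, -0.2, -0.3]   # call and result in the prefix
call_only = [-2.0, -1.5, -1.0]     # call in the prefix, result withheld
no_call = [-2.5, -1.0, -1.0]       # plain text prefix

print(keep_call(with_result, call_only, no_call, threshold=1.0))
print(keep_call(with_result, call_only, no_call, threshold=2.0))
```

Taking the minimum over the two baselines means a call is not rewarded merely because the call text itself, rather than the returned result, makes the continuation easier to predict.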

The tools are chosen to be text-in/text-out so they fit a normal LM context: a question answering model, a simple arithmetic calculator, a BM25 Wikipedia search engine over KILT Wikipedia, an NLLB-based translation system, and a calendar API. At inference, the model generates the API call, the external system returns a text result, and the model continues generation. The method therefore spends GPU compute once in finetuning to teach call placement and spends tool/API latency only when the model decides to call a tool.
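The text-in/text-out loop can be illustrated with the calculator. The bracket-and-arrow linearization and the two-decimal rounding are modeled on the paper's examples; the tool registry, function names, and regular expression are hypothetical, and the expression evaluator is a stand-in for whatever the authors used.

```python
import ast
import operator
import re

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str) -> str:
    """Evaluate +, -, *, / over numeric literals and return text."""

    def evaluate(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")

    value = evaluate(ast.parse(expression, mode="eval").body)
    return f"{value:.2f}"


TOOLS = {"Calculator": calculator}  # hypothetical registry

# Matches an open call at the end of the generated text, e.g. "[Calculator(1 + 2)".
OPEN_CALL = re.compile(r"\[(\w+)\((.*)\)$")


def complete_call(generated: str) -> str:
    """Run the pending call and append its result so decoding can resume."""
    match = OPEN_CALL.search(generated)
    if match is None or match.group(1) not in TOOLS:
        return generated
    name, argument = match.groups()
    return f"{generated} -> {TOOLS[name](argument)}]"


prefix = "Out of 1400 participants, 400 (or [Calculator(400 / 1400)"
print(complete_call(prefix))
```

After the result is appended, the language model simply continues generating with the tool output in its context, which is why only tools with short text responses fit this design.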

Evidence

The filtering process produces very different amounts of supervision by tool. At filtering threshold 1.0, Table 2 reports 18,526 question-answering examples, 60,974 Wikipedia Search examples, 994 calculator examples, 20,587 calendar examples, and 1,034 machine-translation examples. This is a direct sign of the sample-efficiency bottleneck: the calculator and machine-translation counts are roughly 60 times smaller than the Wikipedia Search count, so processing many documents yields only a small number of useful calls for those tools.
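The imbalance is easier to see as shares of the total. The snippet below only re-expresses the Table 2 counts quoted above; it does not account for how many documents were processed per tool, which this card does not record.

```python
# Counts at filtering threshold 1.0, as quoted from Table 2.
examples = {
    "Question Answering": 18526,
    "Wikipedia Search": 60974,
    "Calculator": 994,
    "Calendar": 20587,
    "Machine Translation": 1034,
}

total = sum(examples.values())
largest = max(examples.values())
rarest_tool = min(examples, key=examples.get)

for tool, count in sorted(examples.items(), key=lambda item: -item[1]):
    share = 100 * count / total
    ratio = largest / count
    print(f"{tool:<20} {count:>6}  {share:5.1f}% of total  1/{ratio:.0f} of largest")

print(f"total examples: {total}")
print(f"rarest tool:    {rarest_tool}")
```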

Downstream results show the benefit when calls are used. On LAMA, Toolformer reaches 33.8 on SQuAD, 11.5 on Google-RE, and 53.5 on T-REx, compared with Toolformer-disabled at 22.1, 6.3, and 34.9. It also exceeds OPT 66B and GPT-3 175B on these reported LAMA subsets despite being based on GPT-J 6.7B. On math benchmarks, enabling calculator calls improves Toolformer from 14.8 to 40.4 on ASDiv, 6.3 to 29.4 on SVAMP, and 15.0 to 44.0 on MAWPS; the paper notes that the model asks the calculator for help on 97.9% of examples across those benchmarks.
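The size of the tool effect is clearer as point gains over the same model with calls disabled. The sketch below uses only the scores quoted above; the variable names are illustrative.

```python
# (tools enabled, tools disabled) scores as quoted above.
scores = {
    "LAMA SQuAD": (33.8, 22.1),
    "LAMA Google-RE": (11.5, 6.3),
    "LAMA T-REx": (53.5, 34.9),
    "ASDiv": (40.4, 14.8),
    "SVAMP": (29.4, 6.3),
    "MAWPS": (44.0, 15.0),
}

# Absolute gain in points from enabling API calls, rounded to one decimal.
gains = {name: round(on - off, 1) for name, (on, off) in scores.items()}

for name, gain in gains.items():
    print(f"{name:<15} +{gain:.1f} points")

smallest = min(gains, key=gains.get)
largest = max(gains, key=gains.get)
print(f"smallest gain: {smallest}, largest gain: {largest}")
```

The math benchmarks gain more points than the LAMA subsets, consistent with the calculator being invoked on nearly every math example.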

The language-modeling cost is small in the reported metrics. Table 8 gives WikiText perplexity of 10.3 for Toolformer-disabled, the same as GPT-J+CC, and CCNet perplexity of 10.5 for both. On DATESET, Toolformer reaches 27.3 versus 5.9 with API calls disabled and 0.8 for GPT-3 175B, showing the calendar API can buy task performance without broad perplexity regression. Table 9 also shows the inference compute knob: for T-REx, moving from k=0 to k=10 increases API calls from 0% to 98.1% and improves overall score from 34.9 to 53.5.

Historical Effect

Toolformer made tool use look like a learned language-model behavior rather than a prompt-specific wrapper. Its contribution is not just that a model can use a calculator or search engine, but that a 6.7B model can create, filter, and learn from its own tool-call traces on ordinary text. In compute-history terms, it is a bridge from inference-only prompting to post-training that internalizes the decision of when to spend external compute.

Limits

The limitations are also compute-structural. Toolformer cannot chain tools because API calls for each tool are generated independently, so the finetuning data contains no examples where one tool's output becomes another tool's input. It does not browse interactively through many search results or refine a query in response to bad output. The model is sensitive to input wording when deciding whether to call an API, and the paper explicitly notes that it does not account for the tool-dependent computational cost of an API call. Finally, the self-supervised filtering process is sample-inefficient for some tools, so moving from a few APIs to a large tool ecosystem would multiply preprocessing and execution cost.