
Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick Jane Dwivedi-Yu Roberto Dessì† Roberta Raileanu
Maria Lomeli Luke Zettlemoyer Nicola Cancedda Thomas Scialom
Meta AI Research † Universitat Pompeu Fabra

Abstract

Language models (LMs) exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller models excel. In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds. We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. This is done in a self-supervised way, requiring nothing more than a handful of demonstrations for each API. We incorporate a range of tools, including a calculator, a Q&A system, a search engine, a translation system, and a calendar. Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.

arXiv:2302.04761v1 [cs.CL] 9 Feb 2023

The New England Journal of Medicine is a registered trademark of [QA(“Who is the publisher of The New England Journal of Medicine?”) → Massachusetts Medical Society] the MMS.

Out of 1400 participants, 400 (or [Calculator(400 / 1400) → 0.29] 29%) passed the test.

The name derives from “la tortuga”, the Spanish word for [MT(“tortuga”) → turtle] turtle.

The Brown Act is California’s law [WikiSearch(“Brown Act”) → The Ralph M. Brown Act is an act of the California State Legislature that guarantees the public's right to attend and participate in meetings of local legislative bodies.] that requires legislative bodies, like city councils, to hold their meetings open to the public.

Figure 1: Exemplary predictions of Toolformer. The model autonomously decides to call different APIs (from top to bottom: a question answering system, a calculator, a machine translation system, and a Wikipedia search engine) to obtain information that is useful for completing a piece of text.

1 Introduction
Large language models achieve impressive zero- and few-shot results on a variety of natural language processing tasks (Brown et al., 2020; Chowdhery et al., 2022, i.a.) and show several emergent capabilities (Wei et al., 2022). However, all of these models have several inherent limitations that can at best be partially addressed by further scaling. These limitations include an inability to access up-to-date information on recent events (Komeili et al., 2022) and the related tendency to hallucinate facts (Maynez et al., 2020; Ji et al., 2022), difficulties in understanding low-resource languages (Lin et al., 2021), a lack of mathematical skills to perform precise calculations (Patel et al., 2021) and an unawareness of the progression of time (Dhingra et al., 2022).

A simple way to overcome these limitations of today’s language models is to give them the ability to use external tools such as search engines, calculators, or calendars. However, existing approaches either rely on large amounts of human annotations (Komeili et al., 2022; Thoppilan et al., 2022) or limit tool use to task-specific settings only (e.g., Gao et al., 2022; Parisi et al., 2022), hindering a more widespread adoption of tool use in LMs. Therefore, we propose Toolformer, a model that learns to use tools in a novel way, which fulfills the following desiderata:

• The use of tools should be learned in a self-supervised way without requiring large amounts of human annotations. This is important not only because of the costs associated with such annotations, but also because what humans find useful may be different from what a model finds useful.
• The LM should not lose any of its generality and should be able to decide for itself when and how to use which tool. In contrast to existing approaches, this enables a much more comprehensive use of tools that is not tied to specific tasks.

Our approach for achieving these goals is based on the recent idea of using large LMs with in-context learning (Brown et al., 2020) to generate entire datasets from scratch (Schick and Schütze, 2021b; Honovich et al., 2022; Wang et al., 2022): Given just a handful of human-written examples of how an API can be used, we let a LM annotate a huge language modeling dataset with potential API calls. We then use a self-supervised loss to determine which of these API calls actually help the model in predicting future tokens. Finally, we finetune the LM itself on the API calls that it considers useful. As illustrated in Figure 1, through this simple approach, LMs can learn to control a variety of tools, and to choose for themselves which tool to use when and how.

As our approach is agnostic of the dataset being used, we can apply it to the exact same dataset that was used to pretrain a model in the first place. This ensures that the model does not lose any of its generality and language modeling abilities. We conduct experiments on a variety of different downstream tasks, demonstrating that after learning to use tools, Toolformer, which is based on a pretrained GPT-J model (Wang and Komatsuzaki, 2021) with 6.7B parameters, achieves much stronger zero-shot results, clearly outperforming a much larger GPT-3 model (Brown et al., 2020) and several other baselines on various tasks.

[Figure 2 diagram: LM Dataset → (1) Sample API Calls → (2) Execute API Calls → (3) Filter API Calls → LM Dataset with API Calls]
Input: $x_{1:i-1}$ = Pittsburgh is also known as; $x_{i:n}$ = the Steel City
(1) Sample API Calls: $c_i^1$ = What other name is Pittsburgh known by?; $c_i^2$ = Which country is Pittsburgh in?
(2) Execute API Calls: $r_i^1$ = Steel City; $r_i^2$ = United States
(3) Filter API Calls: $L_i(c_i^1 \to \text{Steel City}) < \min(L_i(c_i^1 \to \varepsilon), L_i(\varepsilon))$; $L_i(c_i^2 \to \text{United States}) > \min(L_i(c_i^2 \to \varepsilon), L_i(\varepsilon))$
Output: $x^*$ = Pittsburgh is also known as [QA(What …? → Steel City)] the Steel City.

Figure 2: Key steps in our approach, illustrated for a question answering tool: Given an input text $x$, we first sample a position $i$ and corresponding API call candidates $c_i^1, c_i^2, \ldots, c_i^k$. We then execute these API calls and filter out all calls which do not reduce the loss $L_i$ over the next tokens. All remaining API calls are interleaved with the original text, resulting in a new text $x^*$.

2 Approach

Our aim is to equip a language model $M$ with the ability to use different tools by means of API calls. We require that inputs and outputs for each API can be represented as text sequences. This allows seamless insertion of API calls into any given text, using special tokens to mark the start and end of each such call.

We represent each API call as a tuple $c = (a_c, i_c)$ where $a_c$ is the name of the API and $i_c$ is the corresponding input. Given an API call $c$ with a corresponding result $r$, we denote the linearized sequences of the API call not including and including its result, respectively, as:

$$e(c) = \texttt{<API>}\; a_c(i_c)\; \texttt{</API>}$$
$$e(c, r) = \texttt{<API>}\; a_c(i_c) \to r\; \texttt{</API>}$$

where “<API>”, “</API>” and “→” are special tokens.¹ Some examples of linearized API calls inserted into text sequences are shown in Figure 1.

Given a dataset $\mathcal{C} = \{x^1, \ldots, x^{|\mathcal{C}|}\}$ of plain texts, we first convert this dataset into a dataset $\mathcal{C}^*$ augmented with API calls. This is done in three steps, illustrated in Figure 2: First, we exploit the in-context learning ability of $M$ to sample a large number of potential API calls. We then execute these API calls and finally check whether the obtained responses are helpful for predicting future tokens; this is used as a filtering criterion. After filtering, we merge API calls for different tools, resulting in the augmented dataset $\mathcal{C}^*$, and finetune

¹ In practice, we use the token sequences “ [”, “]” and “->” to represent “<API>”, “</API>” and “→”, respectively. This enables our approach to work without modifying the existing LM’s vocabulary. For reasons of readability, we still refer to them as “<API>”, “</API>” and “→” throughout this section.
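
The linearization $e(c)$ and $e(c, r)$ defined in Section 2 can be illustrated with a minimal Python sketch. It assumes the plain-text stand-ins from footnote 1 ("[", "]" and "->"); the `APICall` type and the `linearize` helper are hypothetical names introduced here for illustration, and the exact whitespace around the tokens is an assumption.

```python
from typing import NamedTuple, Optional

# Stand-ins from footnote 1: "[" for <API>, "]" for </API>, "->" for the arrow.
# Exact spacing around these tokens is an assumption of this sketch.
API_START, API_END, ARROW = "[", "]", "->"


class APICall(NamedTuple):
    name: str   # a_c, the name of the API
    input: str  # i_c, the input passed to the API, as text


def linearize(call: APICall, result: Optional[str] = None) -> str:
    """Return e(c) if result is None, otherwise e(c, r)."""
    body = f"{call.name}({call.input})"
    if result is not None:
        body = f"{body} {ARROW} {result}"
    return f"{API_START}{body}{API_END}"


if __name__ == "__main__":
    call = APICall("Calculator", "400 / 1400")
    print(linearize(call))          # call without its result
    print(linearize(call, "0.29"))  # call together with its result
```

Keeping the call and its result as plain text is what allows an annotated call to be spliced into any position of an existing text sequence.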
$M$ itself on this dataset. Each of these steps is described in more detail below.

Sampling API Calls. For each API, we write a prompt $P(x)$ that encourages the LM to annotate an example $x = x_1, \ldots, x_n$ with API calls. An example of such a prompt for a question answering tool is shown in Figure 3; all prompts used are shown in Appendix A.2. Let $p_M(z_{n+1} \mid z_1, \ldots, z_n)$ be the probability that $M$ assigns to token $z_{n+1}$ as a continuation for the sequence $z_1, \ldots, z_n$. We first sample up to $k$ candidate positions for doing API calls by computing, for each $i \in \{1, \ldots, n\}$, the probability

$$p_i = p_M(\texttt{<API>} \mid P(x), x_{1:i-1})$$

that $M$ assigns to starting an API call at position $i$. Given a sampling threshold $\tau_s$, we keep all positions $I = \{i \mid p_i > \tau_s\}$; if there are more than $k$ such positions, we only keep the top $k$.

For each position $i \in I$, we then obtain up to $m$ API calls $c_i^1, \ldots, c_i^m$ by sampling from $M$ given the sequence $[P(x), x_1, \ldots, x_{i-1}, \texttt{<API>}]$ as a prefix and </API> as an end-of-sequence token.²

Your task is to add calls to a Question Answering API to a piece of text. The questions should help you get information required to complete the text. You can call the API by writing "[QA(question)]" where "question" is the question you want to ask. Here are some examples of API calls:

Input: Joe Biden was born in Scranton, Pennsylvania.
Output: Joe Biden was born in [QA("Where was Joe Biden born?")] Scranton, [QA("In which state is Scranton?")] Pennsylvania.

Input: Coca-Cola, or Coke, is a carbonated soft drink manufactured by the Coca-Cola Company.
Output: Coca-Cola, or [QA("What other name is Coca-Cola known by?")] Coke, is a carbonated soft drink manufactured by [QA("Who manufactures Coca-Cola?")] the Coca-Cola Company.

Input: x
Output:

Figure 3: An exemplary prompt $P(x)$ used to generate API calls for the question answering tool.

Executing API Calls. As a next step, we execute all API calls generated by $M$ to obtain the corresponding results. How this is done depends entirely on the API itself – for example, it can involve calling another neural network, executing a Python script or using a retrieval system to perform search over a large corpus. The response for each API call $c_i$ needs to be a single text sequence $r_i$.

Filtering API Calls. Let $i$ be the position of the API call $c_i$ in the sequence $x = x_1, \ldots, x_n$, and let $r_i$ be the response from the API. Further, given a sequence $(w_i \mid i \in \mathbb{N})$ of weights, let

$$L_i(z) = -\sum_{j=i}^{n} w_{j-i} \cdot \log p_M(x_j \mid z, x_{1:j-1})$$

be the weighted cross entropy loss for $M$ over the tokens $x_i, \ldots, x_n$ if the model is prefixed with $z$. We compare two different instantiations of this loss:

$$L_i^+ = L_i(e(c_i, r_i))$$
$$L_i^- = \min\left(L_i(\varepsilon),\, L_i(e(c_i, \varepsilon))\right)$$

where $\varepsilon$ denotes an empty sequence. The former is the weighted loss over all tokens $x_i, \ldots, x_n$ if the API call and its result are given to $M$ as a prefix;³ the latter is the minimum of the losses obtained from (i) doing no API call at all and (ii) doing an API call, but not providing the response. Intuitively, an API call is helpful to $M$ if providing it with both the input and the output of this call makes it easier for the model to predict future tokens, compared to not receiving the API call at all, or receiving only its input. Given a filtering threshold $\tau_f$, we thus only keep API calls for which

$$L_i^- - L_i^+ \geq \tau_f$$

holds, i.e., adding the API call and its result reduces the loss by at least $\tau_f$, compared to not doing any API call or obtaining no result from it.

Model Finetuning. After sampling and filtering calls for all APIs, we finally merge the remaining API calls and interleave them with the original inputs. That is, for an input text $x = x_1, \ldots, x_n$ with a corresponding API call and result $(c_i, r_i)$ at position $i$, we construct the new sequence $x^* =$

² We discard all examples where $M$ does not generate the </API> token.

³ We provide $e(c_i, r_i)$ as a prefix instead of inserting it at position $i$ because $M$ is not yet finetuned on any examples containing API calls, so inserting it in the middle of $x$ would interrupt the flow and not align with patterns in the pretraining corpus, thus hurting perplexity.
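
The position sampling rule and the filtering criterion described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the per-token log-probabilities $\log p_M(x_j \mid z, x_{1:j-1})$ have already been computed by some model and are passed in as lists, and the function names `select_positions`, `weighted_loss` and `keep_call` are hypothetical.

```python
from typing import Dict, List, Sequence


def select_positions(p_api: Dict[int, float], tau_s: float, k: int) -> List[int]:
    """Keep positions i with p_i > tau_s; if more than k remain, keep the top k."""
    kept = [i for i, p in p_api.items() if p > tau_s]
    kept.sort(key=lambda i: p_api[i], reverse=True)
    return sorted(kept[:k])


def weighted_loss(token_logprobs: Sequence[float], weights: Sequence[float]) -> float:
    """L_i(z): token_logprobs[t] is log p_M(x_{i+t} | z, x_{1:i+t-1}).

    Weights beyond the end of `weights` are treated as zero (an assumption).
    """
    return -sum(w * lp for w, lp in zip(weights, token_logprobs))


def keep_call(
    lp_call_and_result: Sequence[float],  # prefix z = e(c_i, r_i)
    lp_call_only: Sequence[float],        # prefix z = e(c_i, empty)
    lp_no_call: Sequence[float],          # prefix z = empty
    weights: Sequence[float],
    tau_f: float,
) -> bool:
    """Return True if L_i^- - L_i^+ >= tau_f."""
    l_plus = weighted_loss(lp_call_and_result, weights)
    l_minus = min(
        weighted_loss(lp_no_call, weights),
        weighted_loss(lp_call_only, weights),
    )
    return l_minus - l_plus >= tau_f


if __name__ == "__main__":
    print(select_positions({1: 0.01, 2: 0.2, 3: 0.5, 4: 0.3}, tau_s=0.05, k=2))
    print(keep_call([-0.1, -0.1], [-2.0, -0.1], [-1.5, -0.1], [0.5, 0.5], tau_f=0.5))
```

Taking the minimum over the "no call" and "call without result" losses means a call survives only if the result itself, not merely the presence of a question, makes the following tokens easier to predict.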
$x_{1:i-1}, e(c_i, r_i), x_{i:n}$; we proceed analogously for texts with multiple API calls. Doing this for all $x \in \mathcal{C}$ results in the new dataset $\mathcal{C}^*$ augmented with API calls. We use this new dataset to finetune $M$, using a standard language modeling objective. Crucially, apart from inserted API calls the augmented dataset $\mathcal{C}^*$ contains the exact same texts as $\mathcal{C}$, the original dataset. As a consequence, finetuning $M$ on $\mathcal{C}^*$ exposes it to the same content as finetuning on $\mathcal{C}$. Moreover, as API calls are inserted in exactly those positions and with exactly those inputs that help $M$ predict future tokens, finetuning on $\mathcal{C}^*$ enables the language model to decide when and how to use which tool, based purely on its own feedback.

Inference. When generating text with $M$ after finetuning with our approach, we perform regular decoding until $M$ produces the “→” token, indicating that it next expects the response for an API call. At this point, we interrupt the decoding process, call the appropriate API to get a response, and continue the decoding process after inserting both the response and the </API> token.

3 Tools

We explore a variety of tools to address different shortcomings of regular LMs. The only constraints we impose on these tools are that (i) both their inputs and outputs can be represented as text sequences, and (ii) we can obtain a few demonstrations of their intended use. Concretely, we explore the following five tools: a question answering system, a Wikipedia search engine, a calculator, a calendar, and a machine translation system. Some examples of potential calls and return strings for the APIs associated with each of these tools are shown in Table 1. We briefly discuss all tools below; further details can be found in Appendix A.

Question Answering. Our first tool is a question answering system based on another LM that can answer simple factoid questions. Specifically, we use Atlas (Izacard et al., 2022), a retrieval-augmented LM finetuned on Natural Questions (Kwiatkowski et al., 2019).

Calculator. As a second tool, we use a calculator that can perform simple numeric calculations; we only support the four basic arithmetic operations. Results are always rounded to two decimal places.

Wikipedia Search. Our third tool is a search engine that, given a search term, returns short text snippets from Wikipedia. Compared to our question answering tool, this search enables a model to get more comprehensive information on a subject, but requires it to extract the relevant parts by itself. As our search engine, we use a BM25 retriever (Robertson et al., 1995; Baeza-Yates et al., 1999) that indexes the Wikipedia dump from KILT (Petroni et al., 2021).

Machine Translation System. Our fourth tool is a machine translation system based on a LM that can translate a phrase from any language into English. More concretely, we use the 600M parameter NLLB (Costa-jussà et al., 2022) as our multilingual machine translation model that works for 200 languages (including low-resource ones). The source language is automatically detected using the fastText classifier (Joulin et al., 2016), while the target language is always set to English.

Calendar. Our final tool is a calendar API that, when queried, returns the current date without taking any input. This provides temporal context for predictions that require some awareness of time.

4 Experiments

We investigate whether our approach enables a model to use tools without any further supervision and to decide for itself when and how to call which of the available tools. To test this, we select a variety of downstream tasks where we assume at least one of the considered tools to be useful, and evaluate performance in zero-shot settings (Section 4.2). Beyond that, we also ensure that our approach does not hurt the model’s core language modeling abilities; we verify this by looking at perplexity on two language modeling datasets (Section 4.3). Finally, we investigate how the ability to learn using tools is affected by model size (Section 4.4).

4.1 Experimental Setup

Dataset Generation. Throughout all of our experiments, we use a subset of CCNet (Wenzek et al., 2020) as our language modeling dataset $\mathcal{C}$ and GPT-J (Wang and Komatsuzaki, 2021) as our language model $M$. To reduce the computational cost of annotating $\mathcal{C}$ with API calls, we define heuristics for some APIs to get a subset of $\mathcal{C}$ for which API calls are more likely to be helpful than for an average text. For example, we only consider texts for the calculator tool if they contain at least three numbers. Details of the heuristics used are given in
Appendix A. For obtaining $\mathcal{C}^*$ from $\mathcal{C}$, we perform all steps described in Section 2 and additionally filter out all examples for which all API calls were eliminated in the filtering step.⁴ For the weighting function, we use

$$w_t = \frac{\tilde{w}_t}{\sum_{s \in \mathbb{N}} \tilde{w}_s} \quad \text{with} \quad \tilde{w}_t = \max(0, 1 - 0.2 \cdot t)$$

to make sure that API calls happen close to where the information provided by the API is actually helpful for the model. The thresholds $\tau_s$ and $\tau_f$ are chosen individually for each tool to ensure a sufficiently large number of examples; see Appendix A for details. Table 2 shows relevant statistics of our final dataset augmented with API calls.

API Name | Example Input | Example Output
Question Answering | Where was the Knights of Columbus founded? | New Haven, Connecticut
Wikipedia Search | Fishing Reel Types | Spin fishing > Spin fishing is distinguished between fly fishing and bait cast fishing by the type of rod and reel used. There are two types of reels used when spin fishing, the open faced reel and the closed faced reel.
Calculator | 27 + 4 * 2 | 35
Calendar | ε | Today is Monday, January 30, 2023.
Machine Translation | sûreté nucléaire | nuclear safety

Table 1: Examples of inputs and outputs for all APIs used.

Number of Examples
API | $\tau_f = 0.5$ | $\tau_f = 1.0$ | $\tau_f = 2.0$
Question Answering | 51,987 | 18,526 | 5,135
Wikipedia Search | 207,241 | 60,974 | 13,944
Calculator | 3,680 | 994 | 138
Calendar | 61,811 | 20,587 | 3,007
Machine Translation | 3,156 | 1,034 | 229

Table 2: Number of examples with API calls in $\mathcal{C}^*$ for different values of our filtering threshold $\tau_f$.

Model Finetuning. We finetune $M$ on $\mathcal{C}^*$ using a batch size of 128 and a learning rate of $1 \cdot 10^{-5}$ with linear warmup for the first 10% of training. Details of our finetuning procedure are given in Appendix B.

Baseline Models. Throughout the remainder of this section, we mainly compare the following models:

• GPT-J: A regular GPT-J model without any finetuning.
• GPT-J + CC: GPT-J finetuned on $\mathcal{C}$, our subset of CCNet without any API calls.
• Toolformer: GPT-J finetuned on $\mathcal{C}^*$, our subset of CCNet augmented with API calls.
• Toolformer (disabled): The same model as Toolformer, but API calls are disabled during decoding.⁵

For most tasks, we additionally compare to OPT (66B) (Zhang et al., 2022) and GPT-3⁶ (175B) (Brown et al., 2020), two models that are about 10 and 25 times larger than our other baseline models, respectively.

4.2 Downstream Tasks

We evaluate all models on a variety of downstream tasks. In all cases, we consider a prompted zero-shot setup – i.e., models are instructed to solve each task in natural language, but we do not provide any in-context examples. This is in contrast to prior work on tool use (e.g., Gao et al., 2022; Parisi et al., 2022), where models are provided with dataset-specific examples of how a tool can be used to solve a concrete task. We choose the more challenging zero-shot setup as we are interested in seeing whether Toolformer works in precisely those cases where a user does not specify in advance which tools should be used in which way for solving a specific problem.

We use standard greedy decoding, but with one modification for Toolformer: We let the model start an API call not just when <API> is the most likely

⁴ While this filtering alters the distribution of training examples, we assume that the remaining examples are close enough to the original distribution so that $M$’s language modeling abilities remain unaffected. This assumption is empirically validated in Section 4.3.

⁵ This is achieved by manually setting the probability of the <API> token to 0.

⁶ We use the original davinci variant that is not finetuned on any instructions.
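
To make the weighting function from the dataset generation paragraph concrete, the following Python sketch evaluates $\tilde{w}_t = \max(0, 1 - 0.2 \cdot t)$ and normalizes it. The function name `loss_weights` and the `horizon` cut-off are assumptions of this sketch; the cut-off is harmless because all unnormalized weights are zero from $t = 5$ onward, so the sum over $s \in \mathbb{N}$ is finite.

```python
from typing import List


def loss_weights(horizon: int = 10, slope: float = 0.2) -> List[float]:
    """Normalized weights w_t for t = 0, ..., horizon - 1.

    `horizon` only truncates the list; with slope 0.2 every weight
    from t = 5 onward is already zero.
    """
    raw = [max(0.0, 1.0 - slope * t) for t in range(horizon)]
    total = sum(raw)
    return [r / total for r in raw]


if __name__ == "__main__":
    for t, w in enumerate(loss_weights()):
        print(t, round(w, 4))
```

With these weights, only the first five tokens after the candidate position contribute to the filtering loss, and the token immediately following the call counts the most.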
token, but whenever it is one of the $k$ most likely tokens. For $k = 1$, this corresponds to regular greedy decoding; we instead use $k = 10$ to increase the disposition of our model to make use of the APIs that it has access to. At the same time, we only allow at most one API call per input to make sure the model does not get stuck in a loop where it constantly calls APIs without producing any actual output. The effect of these modifications is explored in Section 5.

4.2.1 LAMA

We evaluate our models on the SQuAD, Google-RE and T-REx subsets of the LAMA benchmark (Petroni et al., 2019). For each of these subsets, the task is to complete a short statement with a missing fact (e.g., a date or a place). As LAMA was originally designed to evaluate masked language models (e.g., Devlin et al., 2019), we filter out examples where the mask token is not the final token, so that the remaining examples can be processed in a left-to-right fashion. To account for different tokenizations and added complexity from not informing the model that a single word is required, we use a slightly more lenient evaluation criterion than exact match and simply check whether the correct word is within the first five words predicted by the model. As LAMA is based on statements obtained directly from Wikipedia, we prevent Toolformer from using the Wikipedia Search API to avoid giving it an unfair advantage.

Results for all models can be seen in Table 3. All GPT-J models without tool use achieve similar performance. Crucially, Toolformer clearly outperforms these baseline models, improving upon the best baseline by 11.7, 5.2 and 18.6 points, respectively. It also clearly outperforms OPT (66B) and GPT-3 (175B), despite both models being much larger. This is achieved because the model independently decides to ask the question answering tool for the required information in almost all cases (98.1%); for only very few examples, it uses a different tool (0.7%) or no tool at all (1.2%).

Model | SQuAD | Google-RE | T-REx
GPT-J | 17.8 | 4.9 | 31.9
GPT-J + CC | 19.2 | 5.6 | 33.2
Toolformer (disabled) | 22.1 | 6.3 | 34.9
Toolformer | 33.8 | 11.5 | 53.5
OPT (66B) | 21.6 | 2.9 | 30.1
GPT-3 (175B) | 26.8 | 7.0 | 39.8

Table 3: Results on subsets of LAMA. Toolformer uses the question answering tool for most examples, clearly outperforming all baselines of the same size and achieving results competitive with GPT-3 (175B).

4.2.2 Math Datasets

We test mathematical reasoning abilities on ASDiv (Miao et al., 2020), SVAMP (Patel et al., 2021) and the MAWPS benchmark (Koncel-Kedziorski et al., 2016). We again account for the fact that we test all models in a zero-shot setup by using a more lenient evaluation criterion: As the required output is always a number, we simply check for the first number predicted by the model.⁷

Table 4 shows results for all benchmarks. While GPT-J and GPT-J + CC perform about the same, Toolformer achieves stronger results even when API calls are disabled. We surmise that this is because the model is finetuned on many examples of API calls and their results, improving its own mathematical capabilities. Nonetheless, allowing the model to make API calls more than doubles performance for all tasks, and also clearly outperforms the much larger OPT and GPT-3 models. This is because across all benchmarks, for 97.9% of all examples the model decides to ask the calculator tool for help.

Model | ASDiv | SVAMP | MAWPS
GPT-J | 7.5 | 5.2 | 9.9
GPT-J + CC | 9.6 | 5.0 | 9.3
Toolformer (disabled) | 14.8 | 6.3 | 15.0
Toolformer | 40.4 | 29.4 | 44.0
OPT (66B) | 6.0 | 4.9 | 7.9
GPT-3 (175B) | 14.0 | 10.0 | 19.8

Table 4: Results for various benchmarks requiring mathematical reasoning. Toolformer makes use of the calculator tool for most examples, clearly outperforming even OPT (66B) and GPT-3 (175B).

4.2.3 Question Answering

We look at Web Questions (Berant et al., 2013), Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017), the three question answering datasets considered by Brown et al. (2020). For evaluation, we check whether the first 20 words predicted by a model contain the correct answer instead of requiring an exact match. For Toolformer, we disable the question answering tool as

⁷ An exception to this is if the model’s prediction contains an equation (e.g., “The correct answer is 5+3=8”), in which case we consider the first number after the “=” sign to be its prediction.
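
The two lenient evaluation criteria described above (the correct word within the first five predicted words for LAMA, and the first predicted number for the math datasets, with the "=" exception from footnote 7) can be sketched as follows. The helper names, the number pattern and the punctuation stripping are assumptions of this sketch; in particular, any "=" sign is treated as marking an equation, and thousands separators are not handled.

```python
import re
from typing import Optional

# Assumed number format: optional minus sign, digits, optional decimal part.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def first_number(prediction: str) -> Optional[float]:
    """First number in the prediction, or the first number after '=' if present."""
    if "=" in prediction:
        prediction = prediction.split("=", 1)[1]
    match = _NUMBER.search(prediction)
    return float(match.group()) if match else None


def within_first_words(prediction: str, answer: str, n: int = 5) -> bool:
    """True if the single-word answer is among the first n predicted words."""
    words = prediction.lower().split()[:n]
    cleaned = [w.strip(".,;:!?\"'") for w in words]
    return answer.lower() in cleaned


if __name__ == "__main__":
    print(first_number("The correct answer is 5+3=8"))
    print(within_first_words("Paris, the capital of France", "Paris"))
```

Both checks are deliberately more permissive than exact match, which fits a zero-shot setup where the model is not told the expected output format.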
this would make solving the tasks trivial, especially given that the underlying QA system was finetuned on Natural Questions.

Results are shown in Table 5. Once again, Toolformer clearly outperforms all other models based on GPT-J, this time mostly relying on the Wikipedia search API (99.3%) to find relevant information. However, Toolformer still lags behind the much larger GPT-3 (175B) model. This is likely due to both the simplicity of our search engine (in many cases, it returns results that are clearly not a good match for a given query) and the inability of Toolformer to interact with it, e.g., by reformulating its query if results are not helpful or by browsing through multiple of the top results. We believe that adding this functionality is an exciting direction for future work.

Model | WebQS | NQ | TriviaQA
GPT-J | 18.5 | 12.8 | 43.9
GPT-J + CC | 18.4 | 12.2 | 45.6
Toolformer (disabled) | 18.9 | 12.6 | 46.7
Toolformer | 26.3 | 17.7 | 48.8
OPT (66B) | 18.6 | 11.4 | 45.7
GPT-3 (175B) | 29.0 | 22.6 | 65.9

Table 5: Results for various question answering datasets. Using the Wikipedia search tool for most examples, Toolformer clearly outperforms baselines of the same size, but falls short of GPT-3 (175B).

4.2.4 Multilingual Question Answering

We evaluate Toolformer and all baseline models on MLQA (Lewis et al., 2019), a multilingual question-answering benchmark. A context paragraph for each question is provided in English, while the question can be in Arabic, German, Spanish, Hindi, Vietnamese, or Simplified Chinese. In order to solve the task, the model needs to be able to understand both the paragraph and the question, so it may benefit from translating the question into English. Our evaluation metric is the percentage of times the model’s generation, capped at 10 words, contains the correct answer.

Results are shown in Table 6. Using API calls consistently improves Toolformer’s performance for all languages, suggesting that it has learned to make use of the machine translation tool. Depending on the language, this tool is used for 63.8% to 94.9% of all examples; the only exception to this is Hindi, for which the machine translation tool is used in only 7.3% of cases. However, Toolformer does not consistently outperform vanilla GPT-J. This is mainly because for some languages, finetuning on CCNet deteriorates performance; this might be due to a distribution shift compared to GPT-J’s original pretraining data.

OPT and GPT-3 perform surprisingly poorly across all languages, mostly because they fail to provide an answer in English despite being instructed to do so. A potential reason for GPT-J not suffering from this problem is that it was trained on more multilingual data than both OPT and GPT-3, including the EuroParl corpus (Koehn, 2005; Gao et al., 2020). As an upper bound, we also evaluate GPT-J and GPT-3 on a variant of MLQA where both the context and the question are provided in English. In this setup, GPT-3 performs better than all other models, supporting our hypothesis that its subpar performance on MLQA is due to the multilingual aspect of the task.

Model | Es | De | Hi | Vi | Zh | Ar
GPT-J | 15.2 | 16.5 | 1.3 | 8.2 | 18.2 | 8.2
GPT-J + CC | 15.7 | 14.9 | 0.5 | 8.3 | 13.7 | 4.6
Toolformer (disabled) | 19.8 | 11.9 | 1.2 | 10.1 | 15.0 | 3.1
Toolformer | 20.6 | 13.5 | 1.4 | 10.6 | 16.8 | 3.7
OPT (66B) | 0.3 | 0.1 | 1.1 | 0.2 | 0.7 | 0.1
GPT-3 (175B) | 3.4 | 1.1 | 0.1 | 1.7 | 17.7 | 0.1
GPT-J (All En) | 24.3 | 27.0 | 23.9 | 23.3 | 23.1 | 23.6
GPT-3 (All En) | 24.7 | 27.2 | 26.1 | 24.9 | 23.6 | 24.0

Table 6: Results on MLQA for Spanish (Es), German (De), Hindi (Hi), Vietnamese (Vi), Chinese (Zh) and Arabic (Ar). While using the machine translation tool to translate questions is helpful across all languages, further pretraining on CCNet deteriorates performance; consequently, Toolformer does not consistently outperform GPT-J. The final two rows correspond to models that are given contexts and questions in English.

4.2.5 Temporal Datasets

To investigate the calendar API’s utility, we evaluate all models on TempLAMA (Dhingra et al., 2022) and a new dataset that we call DATESET. TempLAMA is a dataset built from Wikidata that contains cloze queries about facts that change with time (e.g., “Cristiano Ronaldo plays for ___”) as well as the correct answer for the years between 2010 and 2020. DATESET, described in Appendix D, is also generated through a series of templates, but populated using a combination of random dates/durations (e.g., “What day of the week was it 30 days ago?”). Critically, knowing the current date is required to answer these questions.
For both tasks, we use the same evaluation as for the original LAMA dataset.

Results shown in Table 7 illustrate that Toolformer outperforms all baselines for both TempLAMA and DATESET. However, closer inspection shows that improvements on TempLAMA cannot be attributed to the calendar tool, which is only used for 0.2% of all examples, but mostly to the Wikipedia search and question answering tools, which Toolformer calls the most. This makes sense given that named entities in TempLAMA are often so specific and rare that even knowing the exact date alone would be of little help. The best course of action for this dataset – first querying the calendar API to get the current date, and then querying the question answering system with this date – is not only prohibited by our restriction of using at most one API call per example, but also hard to learn for Toolformer given that all API calls in its training data are sampled independently.

For DATESET, on the other hand, the considerable improvement of Toolformer compared to other models can be fully attributed to the calendar tool, which it makes use of for 54.8% of all examples.

Model | TempLAMA | DATESET
GPT-J | 13.7 | 3.9
GPT-J + CC | 12.9 | 2.9
Toolformer (disabled) | 12.7 | 5.9
Toolformer | 16.3 | 27.3
OPT (66B) | 14.5 | 1.3
GPT-3 (175B) | 15.5 | 0.8

Table 7: Results for the temporal datasets. Toolformer outperforms all baselines, but does not make use of the calendar tool for TempLAMA.

4.3 Language Modeling

In addition to verifying improved performance on various downstream tasks, we also want to ensure that language modeling performance of Toolformer does not degrade through our finetuning with API calls. To this end, we evaluate our models on two language modeling datasets: WikiText (Merity et al., 2017) and a subset of 10,000 randomly selected documents from CCNet (Wenzek et al., 2020) that were not used during training. Perplexities of various models are shown in Table 8. As one would expect, finetuning on CCNet leads to slightly improved performance on a different CCNet subset, but it slightly deteriorates performance on WikiText, presumably because the original pretraining data for GPT-J is more similar to WikiText than our randomly selected subset of CCNet. Most importantly, however, training on $\mathcal{C}^*$ (our dataset annotated with API calls) does not lead to an increase in perplexity compared to training on $\mathcal{C}$ when API calls are disabled at inference time.⁸

Model | WikiText | CCNet
GPT-J | 9.9 | 10.6
GPT-J + CC | 10.3 | 10.5
Toolformer (disabled) | 10.3 | 10.5

Table 8: Perplexities of different models on WikiText and our validation subset of CCNet. Adding API calls comes without a cost in terms of perplexity for language modeling without any API calls.

4.4 Scaling Laws

We investigate how the ability to ask external tools for help affects performance as we vary the size of our LM. To this end, we apply our approach not just to GPT-J, but also to four smaller models from the GPT-2 family (Radford et al., 2019), with 124M, 355M, 775M and 1.6B parameters, respectively. We do so using only a subset of three tools: the question answering system, the calculator, and the Wikipedia search engine. Apart from this, we follow the experimental setup described in Section 4.1.

Figure 4 shows that the ability to leverage the provided tools only emerges at around 775M parameters: smaller models achieve similar performance both with and without tools. An exception to this is the Wikipedia search engine used mostly for QA benchmarks; we hypothesize that this is because the API is comparably easy to use. While models become better at solving tasks without API calls as they grow in size, their ability to make good use of the provided API improves at the same time. As a consequence, there remains a large gap between predictions with and without API calls even for our biggest model.

5 Analysis

Decoding Strategy. We investigate the effect of our modified decoding strategy introduced in Section 4.2, where instead of always generating the

⁸ We do not evaluate the perplexity of Toolformer with API calls enabled as computing the probability $p_M(x_t \mid x_1, \ldots, x_{t-1})$ of token $x_t$ given $x_1, \ldots, x_{t-1}$ would require marginalizing over all potential API calls that the model could make at position $t$, which is intractable.
[Figure 4 plot: three panels (LAMA, Math Benchmarks, QA Benchmarks); x-axis: Model Parameters (M); series: Toolformer, Toolformer (disabled), GPT3.]

Figure 4: Average performance on LAMA, our math benchmarks and our QA benchmarks for GPT-2 models of
different sizes and GPT-J finetuned with our approach, both with and without API calls. While API calls are not
helpful to the smallest models, larger models learn how to make good use of them. Even for bigger models, the
gap between model predictions with and without API calls remains high.
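
The modified decoding rule studied in the Decoding Strategy paragraph (and in Table 9) can be illustrated with a minimal sketch. It assumes the next-token distribution is available as a plain dictionary from token string to probability; the function name `next_token` and this representation are illustrative assumptions, not part of the paper's implementation.

```python
from typing import Dict

API_TOKEN = "<API>"  # special token that opens an API call


def next_token(probs: Dict[str, float], k: int = 10) -> str:
    """Pick the next token under the modified decoding rule.

    probs: hypothetical mapping from candidate token to its probability.
    k:     <API> is emitted whenever it is among the k most likely tokens;
           k = 1 reduces to regular greedy decoding, k = 0 never calls an API.
    """
    # Rank candidate tokens from most to least likely.
    ranked = sorted(probs, key=probs.get, reverse=True)
    if API_TOKEN in ranked[:k]:
        return API_TOKEN
    # Otherwise fall back to the greedy choice.
    return ranked[0]


# Toy distribution in which <API> is only the third most likely token.
example = {"the": 0.5, "a": 0.3, API_TOKEN: 0.15, "of": 0.05}
greedy_choice = next_token(example, k=1)
relaxed_choice = next_token(example, k=3)
```

In this toy distribution, greedy decoding never starts an API call, while k = 3 does, which mirrors the rise in the share of examples with API calls as k grows.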

most likely token, we generate the <API> token if it is one of the k most likely tokens. Table 9 shows performance on the T-REx subset of LAMA and on WebQS for different values of k. As expected, increasing k leads to the model doing API calls for more examples – from 40.3% and 8.5% with k = 1 (i.e., regular greedy decoding) to 98.1% and 100% for k = 10. While for T-REx, there is already a clear improvement in performance with greedy decoding, on WebQS our model only starts to make a substantial number of API calls as we slightly increase k. Interestingly, for k = 1 the model is calibrated to some extent: It decides to call APIs for examples that it would perform particularly badly on without making API calls. This can be seen from the fact that performance on examples where it decides not to make an API call (44.3 and 19.9) is higher than average performance if no API calls are made at all (34.9 and 18.9). However, this calibration is lost for higher values of k.

            T-REx                        WebQS
k      All    AC     NC     %       All    AC     NC     %
0      34.9   –      34.9   0.0     18.9   –      18.9   0.0
1      47.8   53.0   44.3   40.3    19.3   17.1   19.9   8.5
3      52.9   58.0   29.0   82.8    26.3   26.5   6.6    99.3
10     53.5   54.0   22.5   98.1    26.3   26.4   –      100.0

Table 9: Toolformer results on the T-REx subset of LAMA and on WebQS for different values of k used during decoding. Numbers shown are overall performance (All), performance on the subset where the model decides to make an API call (AC) and all remaining examples (NC), as well as the percentage of examples for which the model decides to call an API (%).

Data Quality We qualitatively analyze some API calls generated with our approach for different APIs. Table 10 shows some examples of texts from CCNet augmented with API calls, as well as the corresponding score $L_i^- - L_i^+$ that is used as a filtering criterion, and whether the API calls made by the model are intuitively useful in the given context. As can be seen, high values of $L_i^- - L_i^+$ typically correspond to useful API calls, whereas low values correspond to API calls that do not provide any information that is useful for predicting future tokens. There are some exceptions, e.g., an API call for “Fast train success” in the fourth example that does not give any relevant information but still reduces perplexity. However, some amount of noise in the API calls that are not filtered can actually be useful as it forces the model finetuned on $C^*$ to not always blindly follow the results of each call it makes.

6 Related Work

Language Model Pretraining There are various approaches that augment language models with some form of additional textual information during pretraining, including various forms of metadata (Keskar et al., 2019), HTML tags (Aghajanyan et al., 2021), Wikipedia markup (Schick et al., 2022), or related texts obtained from an information retrieval system (Guu et al., 2020; Borgeaud et al., 2021; Izacard et al., 2022). For all of these
Example | $L_i^- - L_i^+$ | Useful

The Flodden Window (a war memorial dedicated to The Middleton Archers), in the Grade I-listed Church of St Leonard in Middleton is said to be the oldest war memorial in the United Kingdom. <API> WikiSearch(War memorial Flodden) → Battle of Flodden > Commemoration > The stained-glass Flodden Window in Middleton Parish Church [...] was constructed by Sir Richard Assheton in memory of the Battle of Flodden and the archers from Middleton who fought in it. </API> Sir Richard Assheton of Middleton (who built St Leonard) was granted knighthood [...] | 5.49 | ✓

Note: The WL will be open on Friday, <API> Calendar() → Today is Thursday, March 9, 2017. </API> March 10, and Sunday, March 19 for regular hours. | 2.11 | ✓

The Nile has an approximate length of <API> QA(What is the approximate length of the Nile?) → 6,853 km </API> 6,853 kilometers, the White Nile being its main source. | 2.08 | ✓

If Venus had an atmosphere similar to Earth’s then you would expect Venus’ mean temperature to be 499 K (1.74 x 287) rather than 735 K which is <API> Calculator(735 / 499) → 1.47 </API> 1.47 (735 / 499) times hotter than it should be. | 1.59 | ✓

You are here: Home / Featured / Catch this fast train to success! <API> WikiSearch(Fast train success) → Fast Train > It also peaked at #23 on the Canadian CHUM singles chart, on June 26, 1971. The success of this single established Myles Goodwyn as the band’s main songwriter, and made it possible for April Wine to record a second album. </API> Don’t wait weeks and [...] | 0.92 | ✗

Os Melhores Escolas em Jersey 2020 <API> MT(Os Melhores Escolas em Jersey) → The Best Schools in Jersey </API> On this page you can search for Universities, Colleges and Business schools in Jersey | 0.70 | ✓

Enjoy these pictures from the <API> Calendar() → Today is Friday, April 19, 2013. </API> Easter Egg Hunt. | 0.33 | ✓

85 patients (23%) were hospitalised alive and admitted to a hospital ward. Of them, <API> Calculator(85 / 23) → 3.70 </API> 65% had a cardiac aetiology [...] | −0.02 | ✗

But hey, after the <API> Calendar() → Today is Saturday, June 25, 2011. </API> Disneyland fiasco with the fire drill, I think it’s safe to say Chewey won’t let anyone die in a fire. | −0.41 | ✗

The last time I was with <API> QA(Who was last time I was with?) → The Last Time </API> him I asked what he likes about me and he said he would tell me one day. | −1.23 | ✗

Table 10: Examples of API calls for different tools, sorted by the value of $L_i^- - L_i^+$ that is used as a filtering criterion. High values typically correspond to API calls that are intuitively useful for predicting future tokens.
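
The filtering criterion behind Table 10 can be sketched as follows. This is a minimal illustration, assuming the two losses $L_i^-$ and $L_i^+$ have already been computed for each sampled call (their definition is given earlier in the paper and is not reproduced here). The class `ScoredCall`, the function `filter_calls`, and the toy loss values are hypothetical; the default threshold of 1.0 is the default value of the filtering threshold reported in Appendix A.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredCall:
    call: str          # text of the sampled API call (illustrative only)
    loss_minus: float  # L_i^-, assumed to be precomputed
    loss_plus: float   # L_i^+, assumed to be precomputed

    @property
    def score(self) -> float:
        # Filtering score L_i^- - L_i^+; larger means the call helped more.
        return self.loss_minus - self.loss_plus


def filter_calls(calls: List[ScoredCall], tau_f: float = 1.0) -> List[ScoredCall]:
    """Keep only the calls whose score reaches the filtering threshold tau_f."""
    return [c for c in calls if c.score >= tau_f]


# Toy losses chosen for illustration; they are not taken from the paper.
candidates = [
    ScoredCall("call_a", loss_minus=3.0, loss_plus=1.0),  # score  2.0
    ScoredCall("call_b", loss_minus=2.5, loss_plus=2.0),  # score  0.5
    ScoredCall("call_c", loss_minus=1.0, loss_plus=2.0),  # score -1.0
]
kept_default = [c.call for c in filter_calls(candidates)]
kept_relaxed = [c.call for c in filter_calls(candidates, tau_f=0.5)]
```

Lowering the threshold keeps more calls at the price of more noise, which is the trade-off the Data Quality paragraph discusses.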

approaches, additional information is always provided, regardless of whether it is helpful or not. In contrast, Toolformer learns for itself to explicitly ask for the right information.

Tool Use Several approaches aim to equip LMs with the ability to use external tools such as search engines (Komeili et al., 2022; Thoppilan et al., 2022; Lazaridou et al., 2022; Shuster et al., 2022; Yao et al., 2022), web browsers (Nakano et al., 2021), calculators (Cobbe et al., 2021; Thoppilan et al., 2022), translation systems (Thoppilan et al., 2022) and Python interpreters (Gao et al., 2022). The way these models learn to use tools can roughly be divided into two approaches: Either they rely on large amounts of human supervision (Komeili et al., 2022; Nakano et al., 2021; Thoppilan et al., 2022) or they work by prompting the language model in a few-shot setup tailored towards a specific task where it is known a priori which tools need to be used (Gao et al., 2022; Lazaridou et al., 2022; Yao et al., 2022). In contrast, the self-supervised nature of Toolformer enables it to learn how and when to use tools without requiring a specific prompt that shows task-specific examples of how a tool could be used. Perhaps most closely related to our work is TALM (Parisi et al., 2022), an approach that uses a similar self-supervised objective for teaching a model to use a calculator and a search engine, but explores this only in settings where a model is finetuned for downstream tasks.

Bootstrapping The idea of using self-training and bootstrapping techniques to improve models has been investigated in various contexts, ranging from word sense disambiguation (Yarowsky, 1995), relation extraction (Brin, 1999; Agichtein and Gravano, 2000), parsing (McClosky et al., 2006; Reichart and Rappoport, 2007), sequence generation (He et al., 2020), few-shot text classification (Schick and Schütze, 2021a) and retrieval (Izacard and Grave, 2021) to reasoning (Zelikman et al., 2022). In a similar spirit to these approaches, Toolformer is trained on its own predictions after applying a perplexity-based filtering step.

7 Limitations

While our approach enables LMs to learn how to use a variety of tools in a self-supervised way, there are some clear limitations to what can be achieved with our method in its current form. One such limitation is the inability of Toolformer to use tools in a chain (i.e., using the output of one tool as an input for another tool). This is due to the fact that API calls for each tool are generated independently; as a consequence, there are no examples of chained tool use in the finetuning dataset. Our current approach also does not allow the LM to use a tool in an interactive way. Especially for tools such as search engines, which could potentially return hundreds of different results, enabling an LM to browse through these results or to refine its search query in a similar spirit to Nakano et al. (2021) can be crucial for certain applications. Beyond this, we found models trained with Toolformer to often be sensitive to the exact wording of their input when deciding whether or not to call an API; this is perhaps unsurprising given that LMs are known to be very sensitive to the prompt they are provided with in both zero- and few-shot settings (Jiang et al., 2020; Schick and Schütze, 2021a). Depending on the tool, our method is also very sample-inefficient; for example, processing more than a million documents results in only a few thousand examples of useful calls to the calculator API. A potential solution to this problem might be to iteratively apply our approach, similar to how this is done in related bootstrapping approaches (Schick and Schütze, 2021a; Izacard and Grave, 2021; Parisi et al., 2022). Finally, when deciding whether or not to make an API call, Toolformer currently does not take into account the tool-dependent, computational cost incurred from making an API call.

8 Conclusion

We have introduced Toolformer, a language model that learns in a self-supervised way how to use different tools such as search engines, calculators, and translation systems via simple API calls. This is done by finetuning on a large number of sampled API calls that are filtered based on whether they reduce perplexity on future tokens. Toolformer considerably improves zero-shot performance of a 6.7B parameter GPT-J model, enabling it to even outperform a much larger GPT-3 model on a range of different downstream tasks.

References

Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh, and Luke Zettlemoyer. 2021. Htlm: Hyper-text pre-training and prompting of language models.

Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In Proceedings of the Fifth ACM Conference on Digital Libraries, DL ’00, page 85–94, New York, NY, USA. Association for Computing Machinery.

Ricardo Baeza-Yates, Berthier Ribeiro-Neto, et al. 1999. Modern information retrieval, volume 463. ACM press New York.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA. Association for Computational Linguistics.

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and Laurent Sifre. 2021. Improving language models by retrieving from trillions of tokens.

Sergey Brin. 1999. Extracting patterns and relations from the world wide web. In The World Wide Web and Databases, pages 172–183, Berlin, Heidelberg. Springer Berlin Heidelberg.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasu-
Maarten Bosma, Gaurav Mishra, Adam Roberts, pat, and Ming-Wei Chang. 2020. Realm: Retrieval-
Paul Barham, Hyung Won Chung, Charles Sutton, augmented language model pre-training.
Sebastian Gehrmann, Parker Schuh, Kensen Shi,
Sasha Tsvyashchenko, Joshua Maynez, Abhishek Junxian He, Jiatao Gu, Jiajun Shen, and Marc’Aurelio
Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vin- Ranzato. 2020. Revisiting self-training for neural
odkumar Prabhakaran, Emily Reif, Nan Du, Ben sequence generation. In International Conference
Hutchinson, Reiner Pope, James Bradbury, Jacob on Learning Representations.
Austin, Michael Isard, Guy Gur-Ari, Pengcheng
Yin, Toju Duke, Anselm Levskaya, Sanjay Ghe- Or Honovich, Thomas Scialom, Omer Levy, and Timo
mawat, Sunipa Dev, Henryk Michalewski, Xavier Schick. 2022. Unnatural instructions: Tuning lan-
Garcia, Vedant Misra, Kevin Robinson, Liam Fe- guage models with (almost) no human labor.
dus, Denny Zhou, Daphne Ippolito, David Luan,
Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Gautier Izacard and Edouard Grave. 2021. Distilling
Ryan Sepassi, David Dohan, Shivani Agrawal, Mark knowledge from reader to retriever for question an-
Omernick, Andrew M. Dai, Thanumalayan Sankara- swering. In International Conference on Learning
narayana Pillai, Marie Pellat, Aitor Lewkowycz, Representations.
Erica Moreira, Rewon Child, Oleksandr Polozov,
Katherine Lee, Zongwei Zhou, Xuezhi Wang, Bren- Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas
nan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-
Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Yu, Armand Joulin, Sebastian Riedel, and Edouard
Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. Grave. 2022. Atlas: Few-shot learning with retrieval
Palm: Scaling language modeling with pathways. augmented language models.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu,
Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea
Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Madotto, and Pascale Fung. 2022. Survey of hallu-
Nakano, et al. 2021. Training verifiers to solve math cination in natural language generation. ACM Com-
word problems. arXiv preprint arXiv:2110.14168. puting Surveys.

Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham
Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Neubig. 2020. How can we know what language
Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, models know? Transactions of the Association for
et al. 2022. No language left behind: Scaling Computational Linguistics, 8:423–438.
human-centered machine translation. arXiv preprint
arXiv:2207.04672. Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke
Zettlemoyer. 2017. TriviaQA: A large scale dis-
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and tantly supervised challenge dataset for reading com-
Kristina Toutanova. 2019. BERT: Pre-training of prehension. In Proceedings of the 55th Annual Meet-
deep bidirectional transformers for language under- ing of the Association for Computational Linguistics
standing. In Proceedings of the 2019 Conference (Volume 1: Long Papers), pages 1601–1611, Van-
of the North American Chapter of the Association couver, Canada. Association for Computational Lin-
for Computational Linguistics: Human Language guistics.
Technologies, Volume 1 (Long and Short Papers),
pages 4171–4186, Minneapolis, Minnesota. Associ- Armand Joulin, Edouard Grave, Piotr Bojanowski,
ation for Computational Linguistics. Matthijs Douze, Hérve Jégou, and Tomas Mikolov.
2016. Fasttext. zip: Compressing text classification
Bhuwan Dhingra, Jeremy R. Cole, Julian Martin models. arXiv preprint arXiv:1612.03651.
Eisenschlos, Daniel Gillick, Jacob Eisenstein, and
William W. Cohen. 2022. Time-aware language Nitish Shirish Keskar, Bryan McCann, Lav R. Varsh-
models as temporal knowledge bases. Transactions ney, Caiming Xiong, and Richard Socher. 2019.
of the Association for Computational Linguistics, Ctrl: A conditional transformer language model for
10:257–273. controllable generation.

Leo Gao, Stella Biderman, Sid Black, Laurence Gold- Philipp Koehn. 2005. Europarl: A parallel corpus for
ing, Travis Hoppe, Charles Foster, Jason Phang, Ho- statistical machine translation. In Proceedings of
race He, Anish Thite, Noa Nabeshima, et al. 2020. machine translation summit x: papers, pages 79–86.
The pile: An 800gb dataset of diverse text for lan-
guage modeling. arXiv preprint arXiv:2101.00027. Mojtaba Komeili, Kurt Shuster, and Jason Weston.
2022. Internet-augmented dialogue generation. In
Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Proceedings of the 60th Annual Meeting of the As-
Pengfei Liu, Yiming Yang, Jamie Callan, and Gra- sociation for Computational Linguistics (Volume 1:
ham Neubig. 2022. Pal: Program-aided language Long Papers), pages 8460–8478, Dublin, Ireland.
models. Association for Computational Linguistics.
Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen
Nate Kushman, and Hannaneh Hajishirzi. 2016. Krueger, Kevin Button, Matthew Knight, Benjamin
MAWPS: A math word problem repository. In Pro- Chess, and John Schulman. 2021. Webgpt: Browser-
ceedings of the 2016 Conference of the North Amer- assisted question-answering with human feedback.
ican Chapter of the Association for Computational
Linguistics: Human Language Technologies, pages Aaron Parisi, Yao Zhao, and Noah Fiedel. 2022. Talm:
1152–1157, San Diego, California. Association for Tool augmented language models.
Computational Linguistics.
Arkil Patel, Satwik Bhattamishra, and Navin Goyal.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Red- 2021. Are NLP models really able to solve simple
field, Michael Collins, Ankur Parikh, Chris Al- math word problems? In Proceedings of the 2021
berti, Danielle Epstein, Illia Polosukhin, Jacob De- Conference of the North American Chapter of the
vlin, Kenton Lee, Kristina Toutanova, Llion Jones, Association for Computational Linguistics: Human
Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Language Technologies, pages 2080–2094, Online.
Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Association for Computational Linguistics.
Natural questions: A benchmark for question an-
swering research. Transactions of the Association Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick
for Computational Linguistics, 7:452–466. Lewis, Majid Yazdani, Nicola De Cao, James
Thorne, Yacine Jernite, Vladimir Karpukhin, Jean
Angeliki Lazaridou, Elena Gribovskaya, Wojciech Maillard, Vassilis Plachouras, Tim Rocktäschel, and
Stokowiec, and Nikolai Grigorev. 2022. Internet- Sebastian Riedel. 2021. KILT: a benchmark for
augmented language models through few-shot knowledge intensive language tasks. In Proceedings
prompting for open-domain question answering. of the 2021 Conference of the North American Chap-
arXiv preprint arXiv:2203.05115. ter of the Association for Computational Linguistics:
Human Language Technologies, pages 2523–2544,
Patrick Lewis, Barlas Oğuz, Ruty Rinott, Sebastian Online. Association for Computational Linguistics.
Riedel, and Holger Schwenk. 2019. Mlqa: Eval-
uating cross-lingual extractive question answering. Fabio Petroni, Tim Rocktäschel, Sebastian Riedel,
arXiv preprint arXiv:1910.07475. Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and
Alexander Miller. 2019. Language models as knowl-
Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu edge bases? In Proceedings of the 2019 Confer-
Wang, Shuohui Chen, Daniel Simig, Myle Ott, Na- ence on Empirical Methods in Natural Language
man Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Processing and the 9th International Joint Confer-
Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav ence on Natural Language Processing (EMNLP-
Chaudhary, Brian O’Horo, Jeff Wang, Luke Zettle- IJCNLP), pages 2463–2473, Hong Kong, China. As-
moyer, Zornitsa Kozareva, Mona Diab, Veselin Stoy- sociation for Computational Linguistics.
anov, and Xian Li. 2021. Few-shot learning with
multilingual language models. Alec Radford, Jeffrey Wu, Rewon Child, David Luan,
Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Dario Amodei, Ilya Sutskever, et al. 2019. Lan-
Ryan McDonald. 2020. On faithfulness and factual- guage models are unsupervised multitask learners.
ity in abstractive summarization. OpenAI blog, 1(8):9.

David McClosky, Eugene Charniak, and Mark Johnson. Roi Reichart and Ari Rappoport. 2007. Self-training
2006. Effective self-training for parsing. In Pro- for enhancement and domain adaptation of statisti-
ceedings of the Human Language Technology Con- cal parsers trained on small datasets. In Proceed-
ference of the NAACL, Main Conference, pages 152– ings of the 45th Annual Meeting of the Association of
159, New York City, USA. Association for Compu- Computational Linguistics, pages 616–623, Prague,
tational Linguistics. Czech Republic. Association for Computational Lin-
guistics.
Stephen Merity, Caiming Xiong, James Bradbury, and
Richard Socher. 2017. Pointer sentinel mixture mod- Stephen E Robertson, Steve Walker, Susan Jones,
els. In International Conference on Learning Repre- Micheline M Hancock-Beaulieu, Mike Gatford, et al.
sentations. 1995. Okapi at trec-3. Nist Special Publication Sp,
109:109.
Shen-yun Miao, Chao-Chun Liang, and Keh-Yih Su.
2020. A diverse corpus for evaluating and develop- Timo Schick, Jane Dwivedi-Yu, Zhengbao Jiang, Fabio
ing English math word problem solvers. In Proceed- Petroni, Patrick Lewis, Gautier Izacard, Qingfei You,
ings of the 58th Annual Meeting of the Association Christoforos Nalmpantis, Edouard Grave, and Se-
for Computational Linguistics, pages 975–984, On- bastian Riedel. 2022. Peer: A collaborative lan-
line. Association for Computational Linguistics. guage model.

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Timo Schick and Hinrich Schütze. 2021a. Exploiting
Long Ouyang, Christina Kim, Christopher Hesse, cloze-questions for few-shot text classification and
Shantanu Jain, Vineet Kosaraju, William Saunders, natural language inference. In Proceedings of the
16th Conference of the European Chapter of the As- web crawl data. In Proceedings of the Twelfth Lan-
sociation for Computational Linguistics: Main Vol- guage Resources and Evaluation Conference, pages
ume, pages 255–269, Online. Association for Com- 4003–4012, Marseille, France. European Language
putational Linguistics. Resources Association.

Timo Schick and Hinrich Schütze. 2021b. Generating Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak
datasets with pretrained language models. In Pro- Shafran, Karthik Narasimhan, and Yuan Cao. 2022.
ceedings of the 2021 Conference on Empirical Meth- React: Synergizing reasoning and acting in language
ods in Natural Language Processing, pages 6943– models.
6951, Online and Punta Cana, Dominican Republic.
Association for Computational Linguistics. David Yarowsky. 1995. Unsupervised word sense dis-
ambiguation rivaling supervised methods. In 33rd
Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju, Annual Meeting of the Association for Computa-
Eric Michael Smith, Stephen Roller, Megan Ung, tional Linguistics, pages 189–196, Cambridge, Mas-
Moya Chen, Kushal Arora, Joshua Lane, Morteza sachusetts, USA. Association for Computational
Behrooz, William Ngan, Spencer Poff, Naman Linguistics.
Goyal, Arthur Szlam, Y-Lan Boureau, Melanie Kam-
Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D.
badur, and Jason Weston. 2022. Blenderbot 3: a de-
Goodman. 2022. Star: Bootstrapping reasoning
ployed conversational agent that continually learns
with reasoning.
to responsibly engage.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel
Romal Thoppilan, Daniel De Freitas, Jamie Hall, Artetxe, Moya Chen, Shuohui Chen, Christopher De-
Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze wan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mi-
Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, haylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel
YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Simig, Punit Singh Koura, Anjali Sridhar, Tianlu
Amin Ghafouri, Marcelo Menegali, Yanping Huang, Wang, and Luke Zettlemoyer. 2022. Opt: Open pre-
Maxim Krikun, Dmitry Lepikhin, James Qin, De- trained transformer language models.
hao Chen, Yuanzhong Xu, Zhifeng Chen, Adam
Roberts, Maarten Bosma, Vincent Zhao, Yanqi
Zhou, Chung-Ching Chang, Igor Krivokon, Will
Rusch, Marc Pickett, Pranesh Srinivasan, Laichee
Man, Kathleen Meier-Hellstern, Meredith Ringel
Morris, Tulsee Doshi, Renelito Delos Santos, Toju
Duke, Johnny Soraker, Ben Zevenbergen, Vinod-
kumar Prabhakaran, Mark Diaz, Ben Hutchinson,
Kristen Olson, Alejandra Molina, Erin Hoffman-
John, Josh Lee, Lora Aroyo, Ravi Rajakumar,
Alena Butryna, Matthew Lamm, Viktoriya Kuzmina,
Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray
Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian
Croak, Ed Chi, and Quoc Le. 2022. Lamda: Lan-
guage models for dialog applications.

Ben Wang and Aran Komatsuzaki. 2021. GPT-
J-6B: A 6 Billion Parameter Autoregressive
Language Model. https://github.com/
kingoflolz/mesh-transformer-jax.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Al-
isa Liu, Noah A. Smith, Daniel Khashabi, and Han-
naneh Hajishirzi. 2022. Self-instruct: Aligning lan-
guage model with self generated instructions.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raf-
fel, Barret Zoph, Sebastian Borgeaud, Dani Yo-
gatama, Maarten Bosma, Denny Zhou, Donald Met-
zler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals,
Percy Liang, Jeff Dean, and William Fedus. 2022.
Emergent abilities of large language models.

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Con-
neau, Vishrav Chaudhary, Francisco Guzmán, Ar-
mand Joulin, and Edouard Grave. 2020. CCNet:
Extracting high quality monolingual datasets from
A API Details

When sampling and filtering API calls, by default we use values of $\tau_s = 0.05$ and $\tau_f = 1.0$ – i.e., we only make API calls at positions where the probability of the <API> token is at least 5%, and we keep API calls if they reduce the loss by at least 1.0. We only keep the top k = 5 such positions and sample up to m = 5 API calls for each position identified in a piece of text. Due to the heuristic filtering described below, we generate API calls for the calculator and machine translation system on only a small subset of $C$; to compensate for this, we set $\tau_s = 0.0$, k = 20 and m = 10 for these tools. As the resulting sets of API calls are still comparably small, we additionally set $\tau_f = 0.5$.

A.1 Implementation

Question Answering We use the Atlas model of Izacard et al. (2022) finetuned on Natural Questions (Kwiatkowski et al., 2019) as our question answering system. For creating $C^*$ we use Atlas-large, enabling us to efficiently process millions of API calls; during inference, we use the larger Atlas-xxl model.

Calculator Our calculator is based on a simple Python script and only supports the operators “+”, “−”, “∗”, and “/”. It does not return any result for syntactically invalid equations. For sampling API calls, we apply heuristic filters to our subset of CCNet and only process documents that either (i) contain at least three numbers within a window of 100 tokens, where one of these numbers is the result of applying a mathematical operation to the other two, (ii) contain one of the sequences “=”, “equals”, “equal to”, “total of”, “average of” followed by a number, or (iii) contain at least three numbers; for texts that only match the last criterion, we only keep a random subset of 1%.

Calendar For creating our dataset $C^*$, we operate under the assumption that the calendar date in such cases should be the date that the document was created. We approximate this by extracting the date from the URL, if it is present. We filter out texts for which a date cannot be extracted, leaving around 18% of the documents.

Machine Translation For both training and inference, we use the 600M parameter NLLB (Costa-jussà et al., 2022) as our machine translation (MT) model. The source language is automatically detected using the fastText classifier (Joulin et al., 2016), while the target language is always set to English. Since most of the CCNet dataset is in English, we filter out the parts that contain only English text before generating API calls. More specifically, we only keep those paragraphs which contain text chunks in a language other than English preceded and followed by English text. We use text chunks of size 10 tokens. To determine whether the middle text chunk is in a language different than English we again use the fastText classifier with a confidence greater than 0.8. We also filter out any text chunks that contain only numbers or special symbols. This filtering mechanism allows us to generate data more efficiently by focusing our API call generations in places where the MT tool is likely to be helpful. After generating the MT API calls, we additionally remove from our training set those where the input to the MT tool appears after the API call but not before it. While during data generation the model can look ahead to generate API calls, this is not possible at inference time, so we want to dissuade the model from calling the API in such cases.

A.2 Prompts

Below, we list the prompts used to sample API calls for each tool considered.

Question Answering We use the following prompt for the question answering tool:

Your task is to add calls to a Question
Answering API to a piece of text.
The questions should help you get
information required to complete the
text. You can call the API by writing
"[QA(question)]" where "question" is the
question you want to ask. Here are some
examples of API calls:
Input: Joe Biden was born in Scranton,
Pennsylvania.
Output: Joe Biden was born in [QA("Where
was Joe Biden born?")] Scranton,
[QA("In which state is Scranton?")]
Pennsylvania.

Input: Coca-Cola, or Coke, is a
carbonated soft drink manufactured by
the Coca-Cola Company.
Output: Coca-Cola, or [QA("What other
name is Coca-Cola known by?")] Coke, is
a carbonated soft drink manufactured by
[QA("Who manufactures Coca-Cola?")] the
Coca-Cola Company.

Input: x
Output:

Calculator We use the following prompt for the calculator:

Your task is to add calls to a
Calculator API to a piece of text.
The calls should help you get
information required to complete the
text. You can call the API by writing
"[Calculator(expression)]" where
"expression" is the expression to be
computed. Here are some examples of API
calls:

Input: The number in the next term is 18
+ 12 x 3 = 54.
Output: The number in the next term is
18 + 12 x 3 = [Calculator(18 + 12 * 3)]
54.

Input: The population is 658,893 people.
This is 11.4% of the national average of
5,763,868 people.
Output: The population is 658,893 people.
This is 11.4% of the national average of
[Calculator(658,893 / 11.4%)] 5,763,868
people.

Input: A total of 252 qualifying matches
were played, and 723 goals were scored
(an average of 2.87 per match). This is
three times less than the 2169 goals
last year.
Output: A total of 252 qualifying
matches were played, and 723 goals were
scored (an average of [Calculator(723
/ 252)] 2.87 per match). This is twenty
goals more than the [Calculator(723 -
20)] 703 goals last year.

Input: I went to Paris in 1994 and
stayed there until 2011, so in total,
it was 17 years.
Output: I went to Paris in 1994 and
stayed there until 2011, so in total, it
was [Calculator(2011 - 1994)] 17 years.

Input: From this, we have 4 * 30 minutes
= 120 minutes.
Output: From this, we have 4 * 30
minutes = [Calculator(4 * 30)] 120
minutes.

Input: x
Output:

Wikipedia Search We use the following prompt for the Wikipedia search
tool:

Your task is to complete a given piece
of text. You can use a Wikipedia Search
API to look up information. You can do
so by writing "[WikiSearch(term)]" where
"term" is the search term you want to
look up. Here are some examples of API
calls:

Input: The colors on the flag of Ghana
have the following meanings: red is for
the blood of martyrs, green for forests,
and gold for mineral wealth.
Output: The colors on the flag of Ghana
have the following meanings: red is for
[WikiSearch("Ghana flag red meaning")]
the blood of martyrs, green for forests,
and gold for mineral wealth.

Input: But what are the risks during
production of nanomaterials? Some
nanomaterials may give rise to various
kinds of lung damage.
Output: But what are the risks
during production of nanomaterials?
[WikiSearch("nanomaterial production
risks")] Some nanomaterials may give
rise to various kinds of lung damage.

Input: Metformin is the first-line drug
for patients with type 2 diabetes and
obesity.
Output: Metformin is the first-line drug
for [WikiSearch("Metformin first-line
drug")] patients with type 2 diabetes
and obesity.

Input: x
Output:

Machine Translation We use the following prompt for the machine
translation tool:

Your task is to complete a given piece
of text by using a Machine Translation
API.
You can do so by writing "[MT(text)]"
where text is the text to be translated
into English.
Here are some examples:

Input: He has published one book: O
homem suprimido (“The Supressed Man”)
Output: He has published one book: O
homem suprimido [MT(O homem suprimido)]
(“The Supressed Man”)

Input: In Morris de Jonge’s Jeschuah,
der klassische jüdische Mann, there is a
description of a Jewish writer
Output: In Morris de Jonge’s Jeschuah,
der klassische jüdische Mann [MT(der
klassische jüdische Mann)], there is a
description of a Jewish writer

Input: 南京高淳县住房和城乡建设局 城市新区设计
a plane of reference Gaochun is
one of seven districts of the provincial
capital Nanjing
Output: [MT(南京高淳县住房和城乡建设局 城市新区设计)]
a plane of reference Gaochun is
one of seven districts of the provincial
capital Nanjing

Input: x
Output:

Calendar We use the following prompt for the calendar tool:

Your task is to add calls to a Calendar
API to a piece of text. The API calls
should help you get information required
to complete the text. You can call the
API by writing "[Calendar()]" Here are
some examples of API calls:

Input: Today is the first Friday of the
year.
Output: Today is the first [Calendar()]
Friday of the year.

Input: The president of the United
States is Joe Biden.
Output: The president of the United
States is [Calendar()] Joe Biden.

Input: The current day of the week is
Wednesday.
Output: The current day of the week is
[Calendar()] Wednesday.

Input: The number of days from now until
Christmas is 30.
Output: The number of days from now
until Christmas is [Calendar()] 30.

Input: The store is never open on the
weekend, so today it is closed.
Output: The store is never open on the
weekend, so today [Calendar()] it is
closed.

Input: x
Output:

B Toolformer Training

We use up to 25k examples per API. Max sequence length 1,024. Effective
batch size of 128. All models are trained using DeepSpeed’s ZeRO-3 (Rasley
et al., 2020). We used 8 NVIDIA A100 40GB GPUs with BF16. Training up to
2k steps, where we evaluate PPL on a small development set from CCNet
containing 1,000 examples every 500 steps. We pick the checkpoint that
performs best.

C Zero-Shot Prompts

C.1 LAMA and TEMPLAMA

For both LAMA and TEMPLAMA, given an input text x, we use the following
prompt: Please complete the following text so that it is factually
correct: x.

C.2 Math Benchmarks

For all math benchmarks, given a context x and a question q, our prompt
is: x q The answer is.

C.3 Question Answering

For all question answering datasets, including DATESET, we simply prefix
the question with Answer the following question:. We append a question
mark if the question does not already end with one.

C.4 Multilingual Question Answering

For MLQA, given a context x and a question q, our prompt is: Your task is
to answer a question based on the following paragraph: x Now answer the
following question in English: q.

D DATESET

DATESET is created by first randomly selecting 500 “current dates”. For
each current date, another relatively past/future date is randomly
selected within a four-year range, and the two dates are used to fill the
query templates in Table 11. An example of one such query using the first
template would be, “How many days ago was August 14, 2020?” If called, the
Calendar tool would return the presumed current date (e.g., “Today is
Sunday, November 20, 2020”).

Template                                                        Size
How many days {ago was, are there until}                         400
  {past_date, future_date}?
What {day of the week, day of the month, month, year} was it     800
  (current_date – past_date) {days, weeks, months, years} ago?
What {day of the week, day of the month, month, year} will it    800
  be in (future_date – current_date) days?
What day of the week {is, was} it on {past_date, future_date}?   400
What {day of the week, day of the month, month, year}          4,000
  {is, was} it {the day before yesterday, yesterday, today,
  tomorrow, the day after tomorrow}?
What {day of the week, day of the month, month} {is, was}      1,800
  holiday this year?
How many {days, weeks, months, years} {ago was, are there      1,200
  until} holiday this year?
Total                                                          9,400

Table 11: Templates used to create DATESET where a current_date is
randomly selected. For each current_date, a random past_date and
future_date are generated and used to fill each template, if relevant. The
federal holidays in the United States (e.g., Thanksgiving) were used in
the templates involving holidays.
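
The construction described in Appendix D can be illustrated for the first template of Table 11 ("How many days ago was past_date?"). The following is only a minimal sketch under stated assumptions, not the authors' code: the function names, the window from which current dates are drawn, the seed, and the reading of "four-year range" as 1,460 days are all hypothetical choices made for this example.

```python
import random
from datetime import date, timedelta

# Assumption: "within a four-year range" is read as at most 4 * 365 days.
FOUR_YEARS_IN_DAYS = 4 * 365

# Explicit month names keep the output independent of the system locale.
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def format_date(d):
    """Render a date in the style of the paper's example, e.g. 'August 14, 2020'."""
    return f"{MONTHS[d.month - 1]} {d.day}, {d.year}"


def random_date(rng, start, end):
    """Pick a date uniformly at random from the inclusive range [start, end]."""
    return start + timedelta(days=rng.randint(0, (end - start).days))


def make_days_ago_query(rng, current_date):
    """Fill the 'How many days ago was {past_date}?' template for one current date."""
    offset = rng.randint(1, FOUR_YEARS_IN_DAYS)
    past_date = current_date - timedelta(days=offset)
    return {
        "current_date": current_date,
        "question": f"How many days ago was {format_date(past_date)}?",
        "answer": offset,  # gold answer: number of days between the two dates
    }


def build_queries(num_current_dates=500, seed=0):
    """Sample current dates and build one query per date (first template only)."""
    rng = random.Random(seed)
    queries = []
    for _ in range(num_current_dates):
        # Hypothetical sampling window; the paper does not state one.
        current = random_date(rng, date(2000, 1, 1), date(2022, 12, 31))
        queries.append(make_days_ago_query(rng, current))
    return queries
```

Calling build_queries() yields 500 question/answer pairs. A question of this kind can only be answered correctly if the current date is known, which is the information the Calendar tool supplies.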