                                             Toolformer: Language Models Can Teach Themselves to Use Tools

                                                    Timo Schick         Jane Dwivedi-Yu           Roberto Dessì†           Roberta Raileanu
                                                 Maria Lomeli          Luke Zettlemoyer           Nicola Cancedda               Thomas Scialom
                                                                     Meta AI Research † Universitat Pompeu Fabra




                                                             Abstract                                The New England Journal of Medicine is a registered
                                                                                                     trademark of [QA(“Who is the publisher of The New
                                            Language models (LMs) exhibit remarkable                 England Journal of Medicine?”) → Massachusetts
                                            abilities to solve new tasks from just a few




arXiv:2302.04761v1 [cs.CL] 9 Feb 2023
                                                                                                     Medical Society] the MMS.
                                            examples or textual instructions, especially at
                                            scale. They also, paradoxically, struggle with
                                                                                                     Out of 1400 participants, 400 (or [Calculator(400 / 1400)
                                            basic functionality, such as arithmetic or fac-
                                            tual lookup, where much simpler and smaller              → 0.29] 29%) passed the test.
                                            models excel. In this paper, we show that
                                            LMs can teach themselves to use external tools           The name derives from “la tortuga”, the Spanish word for
                                            via simple APIs and achieve the best of both             [MT(“tortuga”) → turtle] turtle.
                                            worlds. We introduce Toolformer, a model
                                            trained to decide which APIs to call, when to            The Brown Act is California’s law [WikiSearch(“Brown
                                            call them, what arguments to pass, and how to            Act”) → The Ralph M. Brown Act is an act of the
                                            best incorporate the results into future token
                                                                                                     California State Legislature that guarantees the public's
                                            prediction. This is done in a self-supervised
                                                                                                     right to attend and participate in meetings of local
                                            way, requiring nothing more than a handful of
                                                                                                     legislative bodies.] that requires legislative bodies, like
                                            demonstrations for each API. We incorporate
                                                                                                     city councils, to hold their meetings open to the public.
                                            a range of tools, including a calculator, a Q&A
                                            system, a search engine, a translation system,
                                            and a calendar. Toolformer achieves substan-           Figure 1: Exemplary predictions of Toolformer. The
                                            tially improved zero-shot performance across           model autonomously decides to call different APIs
                                            a variety of downstream tasks, often competi-          (from top to bottom: a question answering system,
                                            tive with much larger models, without sacrific-        a calculator, a machine translation system, and a
                                            ing its core language modeling abilities.              Wikipedia search engine) to obtain information that is
                                                                                                   useful for completing a piece of text.
                                        1   Introduction
                                        Large language models achieve impressive zero-                A simple way to overcome these limitations of
                                        and few-shot results on a variety of natural lan-          today’s language models is to give them the abil-
                                        guage processing tasks (Brown et al., 2020; Chowd-         ity to use external tools such as search engines,
                                        hery et al., 2022, i.a.) and show several emergent         calculators, or calendars. However, existing ap-
                                        capabilities (Wei et al., 2022). However, all of           proaches either rely on large amounts of human
                                        these models have several inherent limitations that        annotations (Komeili et al., 2022; Thoppilan et al.,
                                        can at best be partially addressed by further scal-        2022) or limit tool use to task-specific settings only
                                        ing. These limitations include an inability to access      (e.g., Gao et al., 2022; Parisi et al., 2022), hinder-
                                        up-to-date information on recent events (Komeili           ing a more widespread adoption of tool use in LMs.
                                        et al., 2022) and the related tendency to hallucinate      Therefore, we propose Toolformer, a model that
                                        facts (Maynez et al., 2020; Ji et al., 2022), difficul-    learns to use tools in a novel way, which fulfills the
                                        ties in understanding low-resource languages (Lin          following desiderata:
                                        et al., 2021), a lack of mathematical skills to per-
                                        form precise calculations (Patel et al., 2021) and an         • The use of tools should be learned in a
                                        unawareness of the progression of time (Dhingra                 self-supervised way without requiring large
                                        et al., 2022).                                                  amounts of human annotations. This is impor-
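[Figure 2 diagram: LM Dataset → (1) Sample API Calls → (2) Execute API Calls → (3) Filter API Calls → LM Dataset with API Calls.
 Example input: $x_{1:i-1}$ = "Pittsburgh is also known as", $x_{i:n}$ = "the Steel City".
 Candidate $c_i^1$ = "What other name is Pittsburgh known by?" returns $r_i^1$ = "Steel City"; here $L_i(c_i^1 \to \text{Steel City}) < \min(L_i(c_i^1 \to \varepsilon), L_i(\varepsilon))$.
 Candidate $c_i^2$ = "Which country is Pittsburgh in?" returns $r_i^2$ = "United States"; here $L_i(c_i^2 \to \text{United States}) > \min(L_i(c_i^2 \to \varepsilon), L_i(\varepsilon))$.
 Result: $x^*$ = "Pittsburgh is also known as [QA(What …? → Steel City)] the Steel City."]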


Figure 2: Key steps in our approach, illustrated for a question answering tool: Given an input text $x$, we first
sample a position $i$ and corresponding API call candidates $c_i^1, c_i^2, \ldots, c_i^k$. We then execute these API calls and
filter out all calls which do not reduce the loss $L_i$ over the next tokens. All remaining API calls are interleaved
with the original text, resulting in a new text $x^*$.
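
Step 1 of Figure 2 can be made concrete with a small sketch. Section 2 specifies that a position is kept when the model's probability of starting an API call there exceeds a sampling threshold τs, and that at most the top k positions are retained. The Python below assumes those per-position probabilities have already been computed; the function name, the 0-based indexing, and the example numbers are illustrative and not taken from the paper.

```python
from typing import List, Sequence


def select_positions(probs: Sequence[float], tau_s: float, k: int) -> List[int]:
    """Return up to k positions whose API-start probability exceeds tau_s.

    probs[i] stands in for p_i, the probability of starting an API call at
    position i (0-based here; the paper counts positions from 1).
    """
    # Keep every position whose probability is strictly above the threshold.
    candidates = [i for i, p in enumerate(probs) if p > tau_s]
    # If more than k positions survive, keep only the k most probable ones.
    candidates.sort(key=lambda i: probs[i], reverse=True)
    return sorted(candidates[:k])


# Hypothetical probabilities for a six-token text.
example_probs = [0.01, 0.20, 0.03, 0.45, 0.08, 0.30]
selected = select_positions(example_probs, tau_s=0.05, k=3)
print(selected)
```

Four positions clear the threshold in this example, so the cap k = 3 drops the least probable of them.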


      tant not only because of the costs associated             several other baselines on various tasks.
      with such annotations, but also because what
      humans find useful may be different from                  2      Approach
      what a model finds useful.                                Our aim is to equip a language model M with the
   • The LM should not lose any of its generality               ability to use different tools by means of API calls.
     and should be able to decide for itself when               We require that inputs and outputs for each API
     and how to use which tool. In contrast to                  can be represented as text sequences. This allows
     existing approaches, this enables a much more              seamless insertion of API calls into any given text,
     comprehensive use of tools that is not tied to             using special tokens to mark the start and end of
     specific tasks.                                            each such call.
                                                                   We represent each API call as a tuple c = (ac , ic )
   Our approach for achieving these goals is based              where ac is the name of the API and ic is the cor-
on the recent idea of using large LMs with in-                  responding input. Given an API call c with a cor-
context learning (Brown et al., 2020) to generate               responding result r, we denote the linearized se-
entire datasets from scratch (Schick and Schütze,               quences of the API call not including and including
2021b; Honovich et al., 2022; Wang et al., 2022):               its result, respectively, as:
Given just a handful of human-written examples
of how an API can be used, we let a LM annotate                            $e(c) = \texttt{<API>}\; a_c(i_c)\; \texttt{</API>}$
a huge language modeling dataset with potential                          $e(c, r) = \texttt{<API>}\; a_c(i_c) \rightarrow r\; \texttt{</API>}$
API calls. We then use a self-supervised loss to
determine which of these API calls actually help                where “<API>”, “</API>” and “→” are special
the model in predicting future tokens. Finally, we              tokens.1 Some examples of linearized API calls
finetune the LM itself on the API calls that it con-            inserted into text sequences are shown in Figure 1.
siders useful. As illustrated in Figure 1, through                  Given a dataset C = {x1 , . . . , x|C| } of plain
this simple approach, LMs can learn to control a va-            texts, we first convert this dataset into a dataset
riety of tools, and to choose for themselves which              C ∗ augmented with API calls. This is done in three
tool to use when and how.                                       steps, illustrated in Figure 2: First, we exploit the
   As our approach is agnostic of the dataset be-               in-context learning ability of M to sample a large
ing used, we can apply it to the exact same dataset             number of potential API calls. We then execute
that was used to pretrain a model in the first place.           these API calls and finally check whether the ob-
This ensures that the model does not lose any                   tained responses are helpful for predicting future
of its generality and language modeling abilities.              tokens; this is used as a filtering criterion. After
We conduct experiments on a variety of differ-                  filtering, we merge API calls for different tools,
ent downstream tasks, demonstrating that after                  resulting in the augmented dataset C ∗ , and finetune
learning to use tools, Toolformer, which is based                    1
                                                                       In practice, we use the token sequences “ [”, “]” and
on a pretrained GPT-J model (Wang and Komat-                    “->” to represent “<API>”, “</API>” and “→”, respec-
suzaki, 2021) with 6.7B parameters, achieves much               tively. This enables our approach to work without modifying
                                                                the existing LM’s vocabulary. For reasons of readability, we
stronger zero-shot results, clearly outperforming a             still refer to them as “<API>”, “</API>” and “→” through-
much larger GPT-3 model (Brown et al., 2020) and                out this section.
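
The linearized forms e(c) and e(c, r) from Section 2, written with the bracket and arrow token sequences mentioned in footnote 1, can be sketched in a few lines of Python. This is only an illustration: the function names, the whitespace tokenization, and the exact spacing around the arrow are assumptions made here, not details given in the paper.

```python
from typing import List, Optional

# Stand-ins for <API>, </API> and the arrow token (see footnote 1).
API_START, API_END, ARROW = "[", "]", "->"


def linearize(api_name: str, api_input: str, result: Optional[str] = None) -> str:
    """Build e(c) when result is None, otherwise e(c, r)."""
    call = f"{api_name}({api_input})"
    if result is None:
        return f"{API_START}{call}{API_END}"
    return f"{API_START}{call} {ARROW} {result}{API_END}"


def insert_call(tokens: List[str], i: int, linearized: str) -> str:
    """Place a linearized call before tokens[i] (0-based) and join with spaces."""
    return " ".join(tokens[:i] + [linearized] + tokens[i:])


tokens = "Pittsburgh is also known as the Steel City.".split()
call = linearize("QA", "What other name is Pittsburgh known by?", "Steel City")
augmented = insert_call(tokens, 5, call)
print(augmented)
```

The example mirrors the Pittsburgh text of Figure 2, with the call placed directly before the span it helps to predict.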
  Your task is to add calls to a Question Answering API to a    Executing API Calls As a next step, we execute
  piece of text. The questions should help you get              all API calls generated by M to obtain the corre-
  information required to complete the text. You can call the   sponding results. How this is done depends entirely
  API by writing "[QA(question)]" where "question" is the       on the API itself – for example, it can involve call-
  question you want to ask. Here are some examples of API       ing another neural network, executing a Python
  calls:
                                                                script or using a retrieval system to perform search
                                                                over a large corpus. The response for each API call
  Input: Joe Biden was born in Scranton, Pennsylvania.
                                                                ci needs to be a single text sequence ri .
  Output: Joe Biden was born in [QA("Where was Joe
  Biden born?")] Scranton, [QA("In which state is               Filtering API Calls Let i be the position of the
  Scranton?")] Pennsylvania.                                    API call ci in the sequence x = x1 , . . . , xn , and let
                                                                ri be the response from the API. Further, given a
  Input: Coca-Cola, or Coke, is a carbonated soft drink         sequence (wi | i ∈ N) of weights, let
  manufactured by the Coca-Cola Company.
  Output: Coca-Cola, or [QA("What other name is                    $$L_i(z) = -\sum_{j=i}^{n} w_{j-i} \cdot \log p_M(x_j \mid z, x_{1:j-1})$$
  Coca-Cola known by?")] Coke, is a carbonated soft drink
  manufactured by [QA("Who manufactures Coca-Cola?")]
  the Coca-Cola Company.                                        be the weighted cross entropy loss for M over the
                                                                tokens xi , . . . , xn if the model is prefixed with z.
  Input: x                                                      We compare two different instantiations of this loss:
  Output:
                                                                            $$L_i^+ = L_i(e(c_i, r_i))$$
                                                                            $$L_i^- = \min\left(L_i(\varepsilon),\; L_i(e(c_i, \varepsilon))\right)$$
Figure 3: An exemplary prompt P (x) used to generate
API calls for the question answering tool.
                                                                where ε denotes an empty sequence. The former is
                                                                the weighted loss over all tokens xi , . . . , xn if the
M itself on this dataset. Each of these steps is                API call and its result are given to M as a prefix;3
described in more detail below.                                 the latter is the minimum of the losses obtained
Sampling API Calls For each API, we write a                     from (i) doing no API call at all and (ii) doing an
prompt P (x) that encourages the LM to anno-                    API call, but not providing the response. Intuitively,
tate an example x = x1 , . . . , xn with API calls.             an API call is helpful to M if providing it with both
An example of such a prompt for a question an-                  the input and the output of this call makes it easier
swering tool is shown in Figure 3; all prompts                  for the model to predict future tokens, compared to
used are shown in Appendix A.2. Let pM (zn+1 |                  not receiving the API call at all, or receiving only
z1 , . . . , zn ) be the probability that M assigns to          its input. Given a filtering threshold τf , we thus
token zn+1 as a continuation for the sequence                   only keep API calls for which
z1 , . . . , zn . We first sample up to k candidate posi-
                                                                            $$L_i^- - L_i^+ \geq \tau_f$$
tions for doing API calls by computing, for each
i ∈ {1, . . . , n}, the probability                             holds, i.e., adding the API call and its result reduces
                                                                the loss by at least τf , compared to not doing any
             $$p_i = p_M(\texttt{<API>} \mid P(x), x_{1:i-1})$$                   API call or obtaining no result from it.
that M assigns to starting an API call at position              Model Finetuning After sampling and filtering
i. Given a sampling threshold τs , we keep all po-              calls for all APIs, we finally merge the remaining
sitions I = {i | pi > τs }; if there are more than k            API calls and interleave them with the original
such positions, we only keep the top k.                         inputs. That is, for an input text x = x1 , . . . , xn
   For each position i ∈ I, we then obtain up to m              with a corresponding API call and result (ci , ri ) at
API calls $c_i^1, \ldots, c_i^m$ by sampling from M given the         position i, we construct the new sequence x∗ =
sequence [P (x), x1 , . . . , xi−1 , <API>] as a prefix            3
                                                                     We provide e(ci , ri ) as a prefix instead of inserting it at
and </API> as an end-of-sequence token.2                        position i because M is not yet finetuned on any examples
                                                                containing API calls, so inserting it in the middle of x would
   2
     We discard all examples where M does not generate the      interrupt the flow and not align with patterns in the pretraining
</API> token.                                                   corpus, thus hurting perplexity.
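
The filtering criterion of Section 2 can be illustrated with a short Python sketch. It assumes that the per-token probabilities under each of the three prefixes (call with result, no call, call without result) have already been obtained from the model; the function names and all numbers in the example are hypothetical.

```python
import math
from typing import Sequence


def weighted_loss(token_probs: Sequence[float], weights: Sequence[float]) -> float:
    """L_i(z): weighted negative log-likelihood of x_i..x_n for one prefix z.

    token_probs[t] stands in for p_M(x_{i+t} | z, x_{1:i+t-1}); weights[t] is w_t.
    Tokens beyond the end of `weights` are ignored, i.e. treated as weight 0.
    """
    return -sum(w * math.log(p) for w, p in zip(weights, token_probs))


def keep_api_call(loss_with_result: float, loss_no_call: float,
                  loss_call_only: float, tau_f: float) -> bool:
    """Keep the call if L_i^- - L_i^+ >= tau_f."""
    l_plus = loss_with_result
    l_minus = min(loss_no_call, loss_call_only)
    return l_minus - l_plus >= tau_f


# Hypothetical weights and per-token probabilities for three future tokens.
weights = [0.5, 0.3, 0.2]
l_plus = weighted_loss([0.80, 0.90, 0.90], weights)       # call and result given
l_no_call = weighted_loss([0.10, 0.50, 0.90], weights)    # empty prefix
l_call_only = weighted_loss([0.15, 0.50, 0.90], weights)  # call without result
print(keep_api_call(l_plus, l_no_call, l_call_only, tau_f=0.5))
```

Taking the minimum over the two baselines means a call is only kept when its result, and not merely the presence of the call text, is what lowers the loss.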
x1:i−1 , e(ci , ri ), xi:n ; we proceed analogously for   snippets from Wikipedia. Compared to our ques-
texts with multiple API calls. Doing this for all x ∈     tion answering tool, this search enables a model
C results in the new dataset C ∗ augmented with API       to get more comprehensive information on a sub-
calls. We use this new dataset to finetune M , using      ject, but requires it to extract the relevant parts by
a standard language modeling objective. Crucially,        itself. As our search engine, we use a BM25 re-
apart from inserted API calls the augmented dataset       triever (Robertson et al., 1995; Baeza-Yates et al.,
C ∗ contains the exact same texts as C, the original      1999) that indexes the Wikipedia dump from KILT
dataset. As a consequence, finetuning M on C ∗            (Petroni et al., 2021).
exposes it to the same content as finetuning on C.
Moreover, as API calls are inserted in exactly those      Machine Translation System Our fourth tool is
positions and with exactly those inputs that help         a machine translation system based on a LM that
M predict future tokens, finetuning on C ∗ enables        can translate a phrase from any language into En-
the language model to decide when and how to use          glish. More concretely, we use the 600M parameter
which tool, based purely on its own feedback.             NLLB (Costa-jussà et al., 2022) as our multilingual
                                                          machine translation model that works for 200 lan-
Inference When generating text with M after               guages (including low-resource ones). The source
finetuning with our approach, we perform regular          language is automatically detected using the fast-
decoding until M produces the “→” token, indicat-         Text classifier (Joulin et al., 2016), while the target
ing that it next expects the response for an API call.    language is always set to English.
At this point, we interrupt the decoding process,
call the appropriate API to get a response, and con-      Calendar Our final tool is a calendar API that,
tinue the decoding process after inserting both the       when queried, returns the current date without tak-
response and the </API> token.                            ing any input. This provides temporal context for
                                                          predictions that require some awareness of time.
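
The inference procedure described above (decode until the arrow token appears, call the API, insert the response and the closing token, then continue) can be sketched as a simple loop. The sketch below is an assumption-laden illustration: the model is replaced by a scripted token source, tokens are whole whitespace-separated strings, the calculator is a stub that returns a fixed string, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Optional

API_START, API_END, ARROW = "[", "]", "->"


def decode_with_tools(
    next_token: Callable[[List[str]], Optional[str]],
    tools: Dict[str, Callable[[str], str]],
    prompt: List[str],
    max_steps: int = 50,
) -> List[str]:
    """Regular decoding, interrupted whenever the arrow token is produced."""
    tokens = list(prompt)
    for _ in range(max_steps):
        token = next_token(tokens)
        if token is None:  # the stand-in model signals end of text
            break
        tokens.append(token)
        if token == ARROW:
            # Recover the pending call text between the last "[" and "->".
            start = len(tokens) - 1 - tokens[::-1].index(API_START)
            call_text = " ".join(tokens[start + 1:-1])
            name, _, rest = call_text.partition("(")
            argument = rest.rsplit(")", 1)[0]
            # Query the tool, then insert its response and the closing token.
            tokens.append(tools[name](argument))
            tokens.append(API_END)
    return tokens


received = []


def calculator_stub(expression: str) -> str:
    """Stub tool: records its input and returns a fixed response."""
    received.append(expression)
    return "0.29"


script = iter(["[", "Calculator(400", "/", "1400)", "->",
               "29%)", "passed", "the", "test."])
output = decode_with_tools(
    next_token=lambda tokens: next(script, None),
    tools={"Calculator": calculator_stub},
    prompt="Out of 1400 participants, 400 (or".split(),
)
print(" ".join(output))
```

The scripted tokens follow the calculator example of Figure 1; a real system would replace the lambda with a call to the finetuned model.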
3   Tools
                                                          4     Experiments
We explore a variety of tools to address different
shortcomings of regular LMs. The only constraints         We investigate whether our approach enables a
we impose on these tools is that (i) both their inputs    model to use tools without any further supervision
and outputs can be represented as text sequences,         and to decide for itself when and how to call which
and (ii) we can obtain a few demonstrations of            of the available tools. To test this, we select a vari-
their intended use. Concretely, we explore the fol-       ety of downstream tasks where we assume at least
lowing five tools: a question answering system, a         one of the considered tools to be useful, and evalu-
Wikipedia search engine, a calculator, a calendar,        ate performance in zero-shot settings (Section 4.2).
and a machine translation system. Some examples           Beyond that, we also ensure that our approach does
of potential calls and return strings for the APIs        not hurt the model’s core language modeling abili-
associated with each of these tools are shown in          ties; we verify this by looking at perplexity on two
Table 1. We briefly discuss all tools below; further      language modeling datasets (Section 4.3). Finally,
details can be found in Appendix A.                       we investigate how the ability to learn using tools
                                                          is affected by model size (Section 4.4).
Question Answering Our first tool is a question
answering system based on another LM that can an-         4.1    Experimental Setup
swer simple factoid questions. Specifically, we use       Dataset Generation Throughout all of our ex-
Atlas (Izacard et al., 2022), a retrieval-augmented       periments, we use a subset of CCNet (Wenzek et al.,
LM finetuned on Natural Questions (Kwiatkowski            2020) as our language modeling dataset C and GPT-
et al., 2019).                                            J (Wang and Komatsuzaki, 2021) as our language
Calculator As a second tool, we use a calculator          model M . To reduce the computational cost of
that can perform simple numeric calculations; we          annotating C with API calls, we define heuristics
only support the four basic arithmetic operations.        for some APIs to get a subset of C for which API
Results are always rounded to two decimal places.         calls are more likely to be helpful than for an av-
                                                          erage text. For example, we only consider texts
Wikipedia Search Our third tool is a search en-           for the calculator tool if they contain at least three
gine that, given a search term, returns short text        numbers. Details of the heuristics used are given in
 API Name                   Example Input                Example Output
 Question Answering         Where was the Knights        New Haven, Connecticut
                            of Columbus founded?
 Wikipedia Search           Fishing Reel Types           Spin fishing > Spin fishing is distinguished between fly fishing and bait
                                                         cast fishing by the type of rod and reel used. There are two types of reels
                                                         used when spin fishing, the open faced reel and the closed faced reel.
 Calculator                 27 + 4 * 2                   35
 Calendar                   ε                            Today is Monday, January 30, 2023.
 Machine Translation        sûreté nucléaire             nuclear safety

                                   Table 1: Examples of inputs and outputs for all APIs used.
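
As an illustration of how simple such a tool can be, the following Python sketch implements a calculator restricted to the four basic arithmetic operations with results rounded to two decimal places, as described in Section 3. It is not the authors' implementation: the use of Python's `ast` module, the support for parentheses and unary minus, and the trimming of trailing zeros in the output string are assumptions made here.

```python
import ast
import operator

# Only the four basic arithmetic operations are supported.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node: ast.AST) -> float:
    """Recursively evaluate a parsed arithmetic expression."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    raise ValueError("unsupported expression")


def calculator(expression: str) -> str:
    """Evaluate the expression and return the result rounded to two decimals."""
    value = round(_evaluate(ast.parse(expression, mode="eval").body), 2)
    # Format with two decimals, then drop trailing zeros (an assumption).
    return f"{value:.2f}".rstrip("0").rstrip(".")


print(calculator("27 + 4 * 2"))
print(calculator("400 / 1400"))
```

Walking the syntax tree instead of calling `eval` keeps the tool limited to arithmetic, so any other expression raises `ValueError`.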


                                Number of Examples                        • GPT-J: A regular GPT-J model without any
 API                        τf = 0.5 τf = 1.0 τf = 2.0                      finetuning.
 Question Answering              51,987    18,526        5,135
 Wikipedia Search               207,241    60,974       13,944            • GPT-J + CC: GPT-J finetuned on C, our sub-
 Calculator                       3,680       994          138              set of CCNet without any API calls.
 Calendar                        61,811    20,587        3,007
 Machine Translation              3,156     1,034          229
                                                                          • Toolformer: GPT-J finetuned on C ∗ , our sub-
Table 2: Number of examples with API calls in C ∗ for                       set of CCNet augmented with API calls.
different values of our filtering threshold τf .
                                                                          • Toolformer (disabled): The same model as
                                                                            Toolformer, but API calls are disabled during
Appendix A. For obtaining C ∗ from C, we perform                            decoding.5
all steps described in Section 2 and additionally
filter out all examples for which all API calls were                  For most tasks, we additionally compare to OPT
eliminated in the filtering step.4 For the weighting                  (66B) (Zhang et al., 2022) and GPT-36 (175B)
function, we use                                                      (Brown et al., 2020), two models that are about
                                                                      10 and 25 times larger than our other baseline models, respectively.
  $$w_t = \frac{\tilde{w}_t}{\sum_{s \in \mathbb{N}} \tilde{w}_s} \quad \text{with} \quad \tilde{w}_t = \max(0,\, 1 - 0.2 \cdot t)$$
                                                                      4.2    Downstream Tasks
to make sure that API calls happen close to where                     We evaluate all models on a variety of downstream
the information provided by the API is actually                       tasks. In all cases, we consider a prompted zero-
helpful for the model. The thresholds τs and τf are                   shot setup – i.e., models are instructed to solve
chosen individually for each tool to ensure a sufficiently            shot setup – i.e., models are instructed to solve
large number of examples; see Appendix A                              each task in natural language, but we do not provide
for details. Table 2 shows relevant statistics of our                 any in-context examples. This is in contrast
final dataset augmented with API calls.                               Parisi et al., 2022), where models are provided
                                                                      with dataset-specific examples of how a tool can be
Model Finetuning We finetune M on C ∗ using                           used to solve a concrete task. We choose the more
a batch size of 128 and a learning rate of 1 · 10−5                   challenging zero-shot setup as we are interested
with linear warmup for the first 10% of training.                     in seeing whether Toolformer works in precisely
Details of our finetuning procedure are given in                      those cases where a user does not specify in ad-
Appendix B.                                                           vance which tools should be used in which way for
Baseline Models Throughout the remainder of                           solving a specific problem.
this section, we mainly compare the following mod-                       We use standard greedy decoding, but with one
els:                                                                  modification for Toolformer: We let the model start
                                                                      an API call not just when <API> is the most likely
   4
     While this filtering alters the distribution of training exam-
                                                                          5
ples, we assume that the remaining examples are close enough                This is achieved by manually setting the probability of
to the original distribution so that M ’s language modeling           the <API> token to 0.
                                                                          6
abilities remain unaffected. This assumption is empirically                 We use the original davinci variant that is not finetuned
validated in Section 4.3.                                             on any instructions.
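
The weighting function given in Section 4.1 can be computed directly. The Python sketch below assumes that the index t starts at 0, which matches the loss in Section 2 where the weight index j − i begins at 0; the function names and the `horizon` parameter are introduced here for illustration only.

```python
from typing import List


def raw_weight(t: int) -> float:
    """Unnormalized weight: max(0, 1 - 0.2 * t)."""
    return max(0.0, 1.0 - 0.2 * t)


def normalized_weights(horizon: int = 10) -> List[float]:
    """Weights w_0..w_{horizon-1}, normalized to sum to 1.

    The raw weights vanish from t = 5 onward, so any horizon >= 5
    already covers every nonzero term of the normalizing sum.
    """
    raw = [raw_weight(t) for t in range(horizon)]
    total = sum(raw)
    return [w / total for w in raw]


w = normalized_weights()
print([round(x, 3) for x in w])
```

Under this assumption only the five tokens directly following the API call position receive nonzero weight, with the first token weighted most heavily.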
token, but whenever it is one of the k most likely       Model                         SQuAD Google-RE T-REx
tokens. For k = 1, this corresponds to regular           GPT-J                           17.8         4.9        31.9
greedy decoding; we instead use k = 10 to in-            GPT-J + CC                      19.2         5.6        33.2
                                                         Toolformer (disabled)           22.1         6.3        34.9
crease the disposition of our model to make use of       Toolformer                      33.8        11.5        53.5
the APIs that it has access to. At the same time,
                                                         OPT (66B)                       21.6          2.9       30.1
we only allow at most one API call per input to make     GPT-3 (175B)                    26.8          7.0       39.8
sure the model does not get stuck in a loop where
it constantly calls APIs without producing any ac-       Table 3: Results on subsets of LAMA. Toolformer uses
tual output. The effect of these modifications is        the question answering tool for most examples, clearly
explored in Section 5.                                   outperforming all baselines of the same size and achiev-
                                                         ing results competitive with GPT-3 (175B).
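
The modified greedy decoding rule from Section 4.2 can be sketched as follows. The sketch assumes the next-token distribution is available as a dictionary and that, once the single permitted API call has been made, the API start token is simply excluded from selection; that second point is an interpretation made here, and the function name and example probabilities are hypothetical.

```python
from typing import Dict

API_START = "["  # stand-in for the <API> token


def pick_next_token(probs: Dict[str, float], k: int = 10,
                    api_call_used: bool = False) -> str:
    """Greedy choice, except that an API call starts whenever the API start
    token is among the k most likely tokens and no call has been made yet."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    if not api_call_used:
        if API_START in ranked[:k]:
            return API_START
        return ranked[0]
    # After one API call, never start another (assumed handling).
    return next(token for token in ranked if token != API_START)


# Hypothetical next-token distribution.
probs = {"the": 0.50, "a": 0.30, "[": 0.15, "of": 0.05}
print(pick_next_token(probs, k=1))
print(pick_next_token(probs, k=3))
print(pick_next_token(probs, k=3, api_call_used=True))
```

With k = 1 the rule reduces to regular greedy decoding, while a larger k lets a lower-ranked API start token win.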
4.2.1   LAMA
We evaluate our models on the SQuAD, Google-RE and T-REx subsets of the LAMA benchmark
(Petroni et al., 2019). For each of these subsets, the task is to complete a short statement
with a missing fact (e.g., a date or a place). As LAMA was originally designed to evaluate
masked language models (e.g., Devlin et al., 2019), we filter out examples where the mask
token is not the final token, so that the remaining examples can be processed in a
left-to-right fashion. To account for different tokenizations and added complexity from not
informing the model that a single word is required, we use a slightly more lenient evaluation
criterion than exact match and simply check whether the correct word is within the first five
words predicted by the model. As LAMA is based on statements obtained directly from Wikipedia,
we prevent Toolformer from using the Wikipedia Search API to avoid giving it an unfair
advantage.
   Results for all models can be seen in Table 3. All GPT-J models without tool use achieve
similar performance. Crucially, Toolformer clearly outperforms these baseline models,
improving upon the best baseline by 11.7, 5.2 and 18.6 points, respectively. It also clearly
outperforms OPT (66B) and GPT-3 (175B), despite both models being much larger. This is
achieved because the model independently decides to ask the question answering tool for the
required information in almost all cases (98.1%); for only very few examples, it uses a
different tool (0.7%) or no tool at all (1.2%).

4.2.2   Math Datasets
We test mathematical reasoning abilities on ASDiv (Miao et al., 2020), SVAMP (Patel et al.,
2021) and the MAWPS benchmark (Koncel-Kedziorski et al., 2016). We again account for the fact
that we test all models in a zero-shot setup by using a more lenient evaluation criterion: As
the required output is always a number, we simply check for the first number predicted by the
model.⁷
   Table 4 shows results for all benchmarks. While GPT-J and GPT-J + CC perform about the
same, Toolformer achieves stronger results even when API calls are disabled. We surmise that
this is because the model is finetuned on many examples of API calls and their results,
improving its own mathematical capabilities. Nonetheless, allowing the model to make API calls
more than doubles performance for all tasks, and also clearly outperforms the much larger OPT
and GPT-3 models. This is because across all benchmarks, for 97.9% of all examples the model
decides to ask the calculator tool for help.

   ⁷ An exception to this is if the model’s prediction contains an equation (e.g., “The
correct answer is 5+3=8”), in which case we consider the first number after the “=” sign to be
its prediction.

 Model                        ASDiv     SVAMP     MAWPS
 GPT-J                          7.5       5.2       9.9
 GPT-J + CC                     9.6       5.0       9.3
 Toolformer (disabled)         14.8       6.3      15.0
 Toolformer                    40.4      29.4      44.0
 OPT (66B)                      6.0       4.9       7.9
 GPT-3 (175B)                  14.0      10.0      19.8

Table 4: Results for various benchmarks requiring mathematical reasoning. Toolformer makes use
of the calculator tool for most examples, clearly outperforming even OPT (66B) and GPT-3
(175B).

4.2.3   Question Answering
We look at Web Questions (Berant et al., 2013), Natural Questions (Kwiatkowski et al., 2019)
and TriviaQA (Joshi et al., 2017), the three question answering datasets considered by Brown
et al. (2020). For evaluation, we check whether the first 20 words predicted by a model
contain the correct answer instead of requiring an exact match. For Toolformer, we disable the
question answering tool as this would make solving the tasks trivial, especially given that
the underlying QA system was finetuned on Natural Questions.

 Model                      WebQS      NQ     TriviaQA
 GPT-J                       18.5     12.8      43.9
 GPT-J + CC                  18.4     12.2      45.6
 Toolformer (disabled)       18.9     12.6      46.7
 Toolformer                  26.3     17.7      48.8
 OPT (66B)                   18.6     11.4      45.7
 GPT-3 (175B)                29.0     22.6      65.9

Table 5: Results for various question answering datasets. Using the Wikipedia search tool for
most examples, Toolformer clearly outperforms baselines of the same size, but falls short of
GPT-3 (175B).

 Model                       Es      De      Hi      Vi      Zh      Ar
 GPT-J                      15.2    16.5     1.3     8.2    18.2     8.2
 GPT-J + CC                 15.7    14.9     0.5     8.3    13.7     4.6
 Toolformer (disabled)      19.8    11.9     1.2    10.1    15.0     3.1
 Toolformer                 20.6    13.5     1.4    10.6    16.8     3.7
 OPT (66B)                   0.3     0.1     1.1     0.2     0.7     0.1
 GPT-3 (175B)                3.4     1.1     0.1     1.7    17.7     0.1
 GPT-J (All En)             24.3    27.0    23.9    23.3    23.1    23.6
 GPT-3 (All En)             24.7    27.2    26.1    24.9    23.6    24.0

Table 6: Results on MLQA for Spanish (Es), German (De), Hindi (Hi), Vietnamese (Vi), Chinese
(Zh) and Arabic (Ar). While using the machine translation tool to translate questions is
helpful across all languages, further pretraining on CCNet deteriorates performance;
consequently, Toolformer does not consistently outperform GPT-J. The final two rows correspond
to models that are given contexts and questions in English.
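The lenient evaluation criteria used in Sections 4.2.1 to 4.2.3 (correct word within the first five predicted words for LAMA, first predicted number for the math benchmarks, correct answer within the first 20 predicted words for question answering) could be implemented roughly as below. This is a sketch under stated assumptions: the function names, the regex-based word splitting and the simple number pattern are illustrative choices, not the evaluation code used in the paper.

```python
import re

# Assumption: a "number" is an optionally signed integer or decimal.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def _words(text):
    # Assumption: lowercase and keep alphanumeric word tokens only.
    return re.findall(r"\w+", text.lower())


def answer_in_first_words(prediction, answer, max_words):
    """True if `answer` occurs as a word sequence in the first `max_words` words."""
    pred = _words(prediction)[:max_words]
    gold = _words(answer)
    n = len(gold)
    return n > 0 and any(pred[i:i + n] == gold for i in range(len(pred) - n + 1))


def first_predicted_number(prediction):
    """First number in the prediction; after "=" if an equation is present."""
    if "=" in prediction:
        match = _NUMBER.search(prediction.split("=", 1)[1])
        if match:
            return float(match.group())
    match = _NUMBER.search(prediction)
    return float(match.group()) if match else None


# LAMA-style check (five words) and QA-style check (20 words).
print(answer_in_first_words("It was born in Paris, France", "Paris", 5))    # True
print(answer_in_first_words("It was surely born in Paris", "Paris", 5))     # False
print(answer_in_first_words("The capital is New York City", "New York", 20))  # True
# Math-style check, including the equation exception.
print(first_predicted_number("The correct answer is 5+3=8"))   # 8.0
print(first_predicted_number("She has 12 apples left, not 15"))  # 12.0
```

Matching on whole words rather than substrings avoids counting a short answer such as "in" as present inside a longer word; whether the original evaluation does the same is not stated in the text.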
   Results are shown in Table 5. Once again, Toolformer clearly outperforms all other models
based on GPT-J, this time mostly relying on the Wikipedia search API (99.3%) to find relevant
information. However, Toolformer still lags behind the much larger GPT-3 (175B) model. This is
likely due to both the simplicity of our search engine (in many cases, it returns results that
are clearly not a good match for a given query) and the inability of Toolformer to interact
with it, e.g., by reformulating its query if results are not helpful or by browsing through
several of the top results. We believe that adding this functionality is an exciting direction
for future work.

4.2.4    Multilingual Question Answering
We evaluate Toolformer and all baseline models on MLQA (Lewis et al., 2019), a multilingual
question-answering benchmark. A context paragraph for each question is provided in English,
while the question can be in Arabic, German, Spanish, Hindi, Vietnamese, or Simplified
Chinese. In order to solve the task, the model needs to be able to understand both the
paragraph and the question, so it may benefit from translating the question into English. Our
evaluation metric is the percentage of times the model’s generation, capped at 10 words,
contains the correct answer.
   Results are shown in Table 6. Using API calls consistently improves Toolformer’s
performance for all languages, suggesting that it has learned to make use of the machine
translation tool. Depending on the language, this tool is used for 63.8% to 94.9% of all
examples; the only exception to this is Hindi, for which the machine translation tool is used
in only 7.3% of cases. However, Toolformer does not consistently outperform vanilla GPT-J.
This is mainly because for some languages, finetuning on CCNet deteriorates performance; this
might be due to a distribution shift compared to GPT-J’s original pretraining data.
   OPT and GPT-3 perform surprisingly poorly across all languages, mostly because they fail to
provide an answer in English despite being instructed to do so. A potential reason for GPT-J
not suffering from this problem is that it was trained on more multilingual data than both OPT
and GPT-3, including the EuroParl corpus (Koehn, 2005; Gao et al., 2020). As an upper bound,
we also evaluate GPT-J and GPT-3 on a variant of MLQA where both the context and the question
are provided in English. In this setup, GPT-3 performs better than all other models,
supporting our hypothesis that its subpar performance on MLQA is due to the multilingual
aspect of the task.

4.2.5   Temporal Datasets
To investigate the calendar API’s utility, we evaluate all models on TEMPLAMA (Dhingra et al.,
2022) and a new dataset that we call DATESET. TEMPLAMA is a dataset built from Wikidata that
contains cloze queries about facts that change with time (e.g., “Cristiano Ronaldo plays for
___”) as well as the correct answer for the years between 2010 and 2020. DATESET, described in
Appendix D, is also generated through a series of templates, but populated using a combination
of random dates/durations (e.g., “What day of the week was it 30 days ago?”). Critically,
knowing the current date is required to answer these questions.
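To see why a DATESET-style question such as “What day of the week was it 30 days ago?” cannot be answered without the current date, consider the following small sketch. The helper names are hypothetical, and the response format merely mirrors the Calendar() outputs shown in Table 10; the actual calendar tool and the DATESET templates are described elsewhere in the paper.

```python
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def calendar_response(today):
    """Format a date like the calendar outputs in Table 10 (assumed format)."""
    return (f"Today is {WEEKDAYS[today.weekday()]}, "
            f"{MONTHS[today.month - 1]} {today.day}, {today.year}.")


def weekday_days_ago(today, days):
    """Answer 'What day of the week was it `days` days ago?' for a known date."""
    return WEEKDAYS[(today - timedelta(days=days)).weekday()]


today = date(2017, 3, 9)  # example date taken from Table 10
print(calendar_response(today))     # Today is Thursday, March 9, 2017.
print(weekday_days_ago(today, 30))  # Tuesday
```

The answer depends entirely on `today`, which a language model without a calendar tool has no way of knowing at inference time.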
For both tasks, we use the same evaluation as for the original LAMA dataset.

 Model                     TEMPLAMA     DATESET
 GPT-J                       13.7          3.9
 GPT-J + CC                  12.9          2.9
 Toolformer (disabled)       12.7          5.9
 Toolformer                  16.3         27.3
 OPT (66B)                   14.5          1.3
 GPT-3 (175B)                15.5          0.8

Table 7: Results for the temporal datasets. Toolformer outperforms all baselines, but does not
make use of the calendar tool for TEMPLAMA.

   Results shown in Table 7 illustrate that Toolformer outperforms all baselines for both
TEMPLAMA and DATESET. However, closer inspection shows that improvements on TEMPLAMA cannot be
attributed to the calendar tool, which is only used for 0.2% of all examples, but mostly to
the Wikipedia search and question answering tools, which Toolformer calls the most. This makes
sense given that named entities in TEMPLAMA are often so specific and rare that even knowing
the exact date alone would be of little help. The best course of action for this dataset
(first querying the calendar API to get the current date, and then querying the question
answering system with this date) is not only prohibited by our restriction of using at most
one API call per example, but also hard to learn for Toolformer given that all API calls in
its training data are sampled independently.
   For DATESET, on the other hand, the considerable improvement of Toolformer compared to
other models can be fully attributed to the calendar tool, which it makes use of for 54.8% of
all examples.

4.3   Language Modeling
In addition to verifying improved performance on various downstream tasks, we also want to
ensure that language modeling performance of Toolformer does not degrade through our
finetuning with API calls. To this end, we evaluate our models on two language modeling
datasets: WikiText (Merity et al., 2017) and a subset of 10,000 randomly selected documents
from CCNet (Wenzek et al., 2020) that were not used during training. Perplexities of various
models are shown in Table 8. As one would expect, finetuning on CCNet leads to slightly
improved performance on a different CCNet subset, but it slightly deteriorates performance on
WikiText, presumably because the original pretraining data for GPT-J is more similar to
WikiText than our randomly selected subset of CCNet. Most importantly, however, training on
$C^*$ (our dataset annotated with API calls) does not lead to an increase in perplexity
compared to training on $C$ when API calls are disabled at inference time.⁸

   ⁸ We do not evaluate the perplexity of Toolformer with API calls enabled as computing the
probability $p_M(x_t \mid x_1, \ldots, x_{t-1})$ of token $x_t$ given $x_1, \ldots, x_{t-1}$
would require marginalizing over all potential API calls that the model could make at position
$t$, which is intractable.

 Model                      WikiText     CCNet
 GPT-J                         9.9        10.6
 GPT-J + CC                   10.3        10.5
 Toolformer (disabled)        10.3        10.5

Table 8: Perplexities of different models on WikiText and our validation subset of CCNet.
Adding API calls comes without a cost in terms of perplexity for language modeling without any
API calls.

4.4      Scaling Laws
We investigate how the ability to ask external tools for help affects performance as we vary
the size of our LM. To this end, we apply our approach not just to GPT-J, but also to four
smaller models from the GPT-2 family (Radford et al., 2019), with 124M, 355M, 775M and 1.6B
parameters, respectively. We do so using only a subset of three tools: the question answering
system, the calculator, and the Wikipedia search engine. Apart from this, we follow the
experimental setup described in Section 4.1.
   Figure 4 shows that the ability to leverage the provided tools only emerges at around 775M
parameters: smaller models achieve similar performance both with and without tools. An
exception to this is the Wikipedia search engine used mostly for QA benchmarks; we hypothesize
that this is because the API is comparably easy to use. While models become better at solving
tasks without API calls as they grow in size, their ability to make good use of the provided
API improves at the same time. As a consequence, there remains a large gap between predictions
with and without API calls even for our biggest model.

5       Analysis

Decoding Strategy We investigate the effect of our modified decoding strategy introduced in
Section 4.2, where instead of always generating the
[Figure 4 plots: three panels (LAMA, Math Benchmarks, QA Benchmarks), each showing performance against Model Parameters (M) from 0 to about 6000 for Toolformer, Toolformer (disabled) and GPT3.]


Figure 4: Average performance on LAMA, our math benchmarks and our QA benchmarks for GPT-2 models of
different sizes and GPT-J finetuned with our approach, both with and without API calls. While API calls are not
helpful to the smallest models, larger models learn how to make good use of them. Even for bigger models, the
gap between model predictions with and without API calls remains high.


most likely token, we generate the <API> token if it is one of the k most likely tokens.
Table 9 shows performance on the T-REx subset of LAMA and on WebQS for different values of k.
As expected, increasing k leads to the model doing API calls for more examples, from 40.3% and
8.5% with k = 1 (i.e., regular greedy decoding) to 98.1% and 100% for k = 10. While for T-REx,
there is already a clear improvement in performance with greedy decoding, on WebQS our model
only starts to make a substantial number of API calls as we slightly increase k.
Interestingly, for k = 1 the model is calibrated to some extent: It decides to call APIs for
examples that it would perform particularly badly on without making API calls. This can be
seen from the fact that performance on examples where it decides not to make an API call (44.3
and 19.9) is higher than average performance if no API calls are made at all (34.9 and 18.9).
However, this calibration is lost for higher values of k.

                     T-REx                             WebQS
  k      All     AC     NC      %         All     AC     NC      %
  0     34.9      –   34.9    0.0        18.9      –   18.9    0.0
  1     47.8   53.0   44.3   40.3        19.3   17.1   19.9    8.5
  3     52.9   58.0   29.0   82.8        26.3   26.5    6.6   99.3
 10     53.5   54.0   22.5   98.1        26.3   26.4      –  100.0

Table 9: Toolformer results on the T-REx subset of LAMA and on WebQS for different values of k
used during decoding. Numbers shown are overall performance (All), performance on the subset
where the model decides to make an API call (AC) and all remaining examples (NC), as well as
the percentage of examples for which the model decides to call an API (%).

Data Quality We qualitatively analyze some API calls generated with our approach for different
APIs. Table 10 shows some examples of texts from CCNet augmented with API calls, as well as
the corresponding score $L_i^- - L_i^+$ that is used as a filtering criterion, and whether the
API calls made by the model are intuitively useful in the given context. As can be seen, high
values of $L_i^- - L_i^+$ typically correspond to useful API calls, whereas low values
correspond to API calls that do not provide any information that is useful for predicting
future tokens. There are some exceptions, e.g., an API call for “Fast train success” in the
fifth example that does not give any relevant information but still reduces perplexity.
However, some amount of noise in the API calls that are not filtered can actually be useful as
it forces the model finetuned on $C^*$ to not always blindly follow the results of each call
it makes.

6      Related Work

Language Model Pretraining There are various approaches that augment language models with some
form of additional textual information during pretraining, including various forms of metadata
(Keskar et al., 2019), HTML tags (Aghajanyan et al., 2021), Wikipedia markup (Schick et al.,
2022), or related texts obtained from an information retrieval system (Guu et al., 2020;
Borgeaud et al., 2021; Izacard et al., 2022). For all of these
 Example                                                                                       $L_i^- - L_i^+$   Useful

 The Flodden Window (a war memorial dedicated to The Middleton Archers), in the Grade I-listed
 Church of St Leonard in Middleton is said to be the oldest war memorial in the United Kingdom.
 <API> WikiSearch(War memorial Flodden) → Battle of Flodden > Commemoration >
 The stained-glass Flodden Window in Middleton Parish Church [. . . ] was constructed by Sir
 Richard Assheton in memory of the Battle of Flodden and the archers from Middleton who
 fought in it. </API> Sir Richard Assheton of Middleton (who built St Leonard) was granted
 knighthood [. . . ]                                                                                 5.49            ✓

 Note: The WL will be open on Friday, <API> Calendar() → Today is Thursday, March 9, 2017.
 </API> March 10, and Sunday, March 19 for regular hours.                                            2.11            ✓

 The Nile has an approximate length of <API> QA(What is the approximate length of the Nile?)
 → 6,853 km </API> 6,853 kilometers, the White Nile being its main source.                           2.08            ✓

 If Venus had an atmosphere similar to Earth’s then you would expect Venus’ mean temperature to be
 499 K (1.74 x 287) rather than 735 K which is <API> Calculator(735 / 499) → 1.47 </API> 1.47
 (735 / 499) times hotter than it should be.                                                         1.59            ✓

 You are here: Home / Featured / Catch this fast train to success! <API> WikiSearch(Fast train
 success) → Fast Train > It also peaked at #23 on the Canadian CHUM singles chart, on June
 26, 1971. The success of this single established Myles Goodwyn as the band’s main songwriter,
 and made it possible for April Wine to record a second album. </API> Don’t wait weeks and
 [. . .]                                                                                             0.92            ✗

 Os Melhores Escolas em Jersey 2020 <API> MT(Os Melhores Escolas em Jersey) → The Best
 Schools in Jersey </API> On this page you can search for Universities, Colleges and Business
 schools in Jersey                                                                                   0.70            ✓

 Enjoy these pictures from the <API> Calendar() → Today is Friday, April 19, 2013. </API>
 Easter Egg Hunt.                                                                                    0.33            ✓

 85 patients (23%) were hospitalised alive and admitted to a hospital ward. Of them,
 <API> Calculator(85 / 23) → 3.70 </API> 65% had a cardiac aetiology [. . .]                        −0.02            ✗

 But hey, after the <API> Calendar() → Today is Saturday, June 25, 2011. </API> Disneyland
 fiasco with the fire drill, I think it’s safe to say Chewey won’t let anyone die in a fire.        −0.41            ✗

 The last time I was with <API> QA(Who was last time I was with?) → The Last Time </API>
 him I asked what he likes about me and he said he would tell me one day.                           −1.23            ✗

Table 10: Examples of API calls for different tools, sorted by the value of $L_i^- - L_i^+$ that is used as a filtering
criterion. High values typically correspond to API calls that are intuitively useful for predicting future tokens.
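As a small illustration of how the score in Table 10 can act as a filter, the sketch below keeps only those API calls whose value of $L_i^- - L_i^+$ reaches a threshold. The scores and usefulness labels are copied from Table 10; the threshold values (1.0 and 0.5) are arbitrary choices for illustration, since the thresholds actually used are not given in this part of the paper, and the class and function names are hypothetical.

```python
from typing import NamedTuple


class ScoredCall(NamedTuple):
    call: str      # abbreviated API call from Table 10
    score: float   # L_i^- - L_i^+ as reported in Table 10
    useful: bool   # the "Useful" column of Table 10


TABLE_10 = [
    ScoredCall("WikiSearch(War memorial Flodden)", 5.49, True),
    ScoredCall("Calendar() [Thursday, March 9, 2017]", 2.11, True),
    ScoredCall("QA(What is the approximate length of the Nile?)", 2.08, True),
    ScoredCall("Calculator(735 / 499)", 1.59, True),
    ScoredCall("WikiSearch(Fast train success)", 0.92, False),
    ScoredCall("MT(Os Melhores Escolas em Jersey)", 0.70, True),
    ScoredCall("Calendar() [Friday, April 19, 2013]", 0.33, True),
    ScoredCall("Calculator(85 / 23)", -0.02, False),
    ScoredCall("Calendar() [Saturday, June 25, 2011]", -0.41, False),
    ScoredCall("QA(Who was last time I was with?)", -1.23, False),
]


def filter_calls(calls, threshold):
    """Keep calls whose score is at least `threshold` (hypothetical value)."""
    return [c for c in calls if c.score >= threshold]


strict = filter_calls(TABLE_10, threshold=1.0)
loose = filter_calls(TABLE_10, threshold=0.5)
print(len(strict), all(c.useful for c in strict))  # 4 True
print(len(loose), all(c.useful for c in loose))    # 6 False
```

On these ten examples, the stricter threshold keeps only calls marked as useful, while the looser one also admits the “Fast train success” call, which matches the observation that the score is a good but imperfect proxy for usefulness.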


approaches, additional information is always provided, regardless of whether it is helpful or
not. In contrast, Toolformer learns for itself to explicitly ask for the right information.

Tool Use Several approaches aim to equip LMs with the ability to use external tools such as
search engines (Komeili et al., 2022; Thoppilan et al., 2022; Lazaridou et al., 2022; Shuster
et al., 2022; Yao et al., 2022), web browsers (Nakano et al., 2021), calculators (Cobbe et
al., 2021; Thoppilan et al., 2022), translation systems (Thoppilan et al., 2022) and Python
interpreters (Gao et al., 2022). The way these models learn to use tools can roughly be
divided into two approaches: Either they rely on large amounts of human supervision (Komeili
et al., 2022; Nakano et al., 2021; Thoppilan et al., 2022) or they work by prompting the
language model in a few-shot setup tailored towards a specific task where it is known a priori
which tools need to be used (Gao et al., 2022; Lazaridou et al., 2022; Yao et al., 2022). In
contrast, the self-supervised nature of Toolformer enables it to learn how and when to use
tools without requiring a specific prompt that shows task-specific examples of how a tool
could be used. Perhaps most closely related to our work is TALM (Parisi et al., 2022), an
approach that uses a similar self-supervised objective for teaching a model to use a
calculator and a search engine, but explores this only in settings where a model is finetuned
for downstream tasks.

Bootstrapping The idea of using self-training and bootstrapping techniques to improve models
has been investigated in various contexts, ranging from word sense disambiguation (Yarowsky,
1995), relation extraction (Brin, 1999; Agichtein and Gravano, 2000), parsing (McClosky et
al., 2006; Reichart and Rappoport, 2007), sequence generation (He et al., 2020), few-shot text
classification (Schick and Schütze, 2021a) and retrieval (Izacard and Grave, 2021) to
reasoning (Zelikman et al., 2022). In a similar spirit to these approaches, Toolformer is
trained on its own predictions after applying a perplexity-based filtering step.

7   Limitations

While our approach enables LMs to learn how to use a variety of tools in a self-supervised
way, there are some clear limitations to what can be achieved with our method in its current
form. One such limitation is the inability of Toolformer to use tools in a chain (i.e., using
the output of one tool as an input for another tool). This is due to the fact that API calls
for each tool are generated independently; as a consequence, there are no examples of chained
tool use in the finetuning dataset. Our current approach also does not allow the LM to use a
tool in an interactive way. Especially for tools such as search engines, which could
potentially return hundreds of different results, enabling an LM to browse through these
results or to refine its search query in a similar spirit to Nakano et al. (2021) can be
crucial for certain applications. Beyond this, we found models trained with Toolformer to
often be sensitive to the exact wording of their input when deciding whether or not to call an
API; this is perhaps unsurprising given that LMs are known to be very sensitive to the prompt
they are provided with in both zero- and few-shot settings (Jiang et al., 2020; Schick and
Schütze, 2021a). Depending on the tool, our method is also very sample-inefficient; for
example, processing more than a million documents results in only a few thousand examples of
useful calls to the calculator API. A potential solution to this problem might be to
iteratively apply our approach, similar to how this is done in related bootstrapping
approaches (Schick and Schütze, 2021a; Izacard and Grave, 2021; Parisi et al., 2022). Finally,
when deciding whether or not to make an API call, Toolformer currently does not take into
account the tool-dependent, computational cost incurred from making an API call.

8   Conclusion

We have introduced Toolformer, a language model that learns in a self-supervised way how to
use different tools such as search engines, calculators, and translation systems via simple
API calls. This is done by finetuning on a large number of sampled API calls that are filtered
based on whether they reduce perplexity on future tokens. Toolformer considerably improves
zero-shot performance of a 6.7B parameter GPT-J model, enabling it to even outperform a much
larger GPT-3 model on a range of different downstream tasks.

References

Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh, and Luke
  Zettlemoyer. 2021. Htlm: Hyper-text pre-training and prompting of language models.

Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting relations from large plain-text
  collections. In Proceedings of the Fifth ACM Conference on Digital Libraries, DL ’00,
  page 85–94, New York, NY, USA. Association for Computing Machinery.

Ricardo Baeza-Yates, Berthier Ribeiro-Neto, et al. 1999. Modern information retrieval,
  volume 463. ACM press New York.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase
  from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in
  Natural Language Processing, pages 1533–1544, Seattle, Washington, USA. Association for
  Computational Linguistics.

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie
  Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego
  de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren
  Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol
  Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and Laurent Sifre. 2021.
  Improving language models by retrieving from trillions of tokens.

Sergey Brin. 1999. Extracting patterns and relations from the world wide web. In The World
  Wide Web and Databases, pages 172–183, Berlin, Heidelberg. Springer Berlin Heidelberg.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
  Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel
  Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler,
  Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray,
  Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya
  Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in
  Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. Palm: Scaling language modeling with pathways.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Bhuwan Dhingra, Jeremy R. Cole, Julian Martin Eisenschlos, Daniel Gillick, Jacob Eisenstein, and William W. Cohen. 2022. Time-aware language models as temporal knowledge bases. Transactions of the Association for Computational Linguistics, 10:257–273.

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. 2020. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2022. Pal: Program-aided language models.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. Realm: Retrieval-augmented language model pre-training.

Junxian He, Jiatao Gu, Jiajun Shen, and Marc’Aurelio Ranzato. 2020. Revisiting self-training for neural sequence generation. In International Conference on Learning Representations.

Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. 2022. Unnatural instructions: Tuning language models with (almost) no human labor.

Gautier Izacard and Edouard Grave. 2021. Distilling knowledge from reader to retriever for question answering. In International Conference on Learning Representations.

Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2022. Atlas: Few-shot learning with retrieval augmented language models.

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. 2022. Survey of hallucination in natural language generation. ACM Computing Surveys.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. Ctrl: A conditional transformer language model for controllable generation.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of machine translation summit x: papers, pages 79–86.

Mojtaba Komeili, Kurt Shuster, and Jason Weston. 2022. Internet-augmented dialogue generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8460–8478, Dublin, Ireland. Association for Computational Linguistics.

Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. 2016. MAWPS: A math word problem repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1152–1157, San Diego, California. Association for Computational Linguistics.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.

Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. 2022. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115.

Patrick Lewis, Barlas Oğuz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2019. Mlqa: Evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475.

Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O’Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, and Xian Li. 2021. Few-shot learning with multilingual language models.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization.

David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 152–159, New York City, USA. Association for Computational Linguistics.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. In International Conference on Learning Representations.

Shen-yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A diverse corpus for evaluating and developing English math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 975–984, Online. Association for Computational Linguistics.

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. 2021. Webgpt: Browser-assisted question-answering with human feedback.

Aaron Parisi, Yao Zhao, and Noah Fiedel. 2022. Talm: Tool augmented language models.

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, Online. Association for Computational Linguistics.

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. 2021. KILT: a benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2523–2544, Online. Association for Computational Linguistics.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Roi Reichart and Ari Rappoport. 2007. Self-training for enhancement and domain adaptation of statistical parsers trained on small datasets. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 616–623, Prague, Czech Republic. Association for Computational Linguistics.

Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at trec-3. Nist Special Publication Sp, 109:109.

Timo Schick, Jane Dwivedi-Yu, Zhengbao Jiang, Fabio Petroni, Patrick Lewis, Gautier Izacard, Qingfei You, Christoforos Nalmpantis, Edouard Grave, and Sebastian Riedel. 2022. Peer: A collaborative language model.

Timo Schick and Hinrich Schütze. 2021a. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021b. Generating datasets with pretrained language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6943–6951, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju, Eric Michael Smith, Stephen Roller, Megan Ung, Moya Chen, Kushal Arora, Joshua Lane, Morteza Behrooz, William Ngan, Spencer Poff, Naman Goyal, Arthur Szlam, Y-Lan Boureau, Melanie Kambadur, and Jason Weston. 2022. Blenderbot 3: a deployed conversational agent that continually learns to responsibly engage.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent Zhao, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Pranesh Srinivasan, Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Chi, and Quoc Le. 2022. Lamda: Language models for dialog applications.

Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-instruct: Aligning language model with self generated instructions.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022. Emergent abilities of large language models.

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2020. CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4003–4012, Marseille, France. European Language Resources Association.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models.

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196, Cambridge, Massachusetts, USA. Association for Computational Linguistics.

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. 2022. Star: Bootstrapping reasoning with reasoning.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. Opt: Open pre-trained transformer language models.

A     API Details

When sampling and filtering API calls, by default we use values of τs = 0.05 and τf = 1.0 – i.e., we only make API calls at positions where the probability of the <API> token is at least 5%, and we keep API calls if they reduce the loss by at least 1.0. We only keep the top k = 5 such positions and sample up to m = 5 API calls for each position identified in a piece of text. Due to the heuristic filtering described below, we generate API calls for the calculator and machine translation system on only a small subset of C; to compensate for this, we set τs = 0.0, k = 20 and m = 10 for these tools. As the resulting sets of API calls are still comparably small, we additionally set τf = 0.5.

A.1    Implementation

Question Answering We use the Atlas model of Izacard et al. (2022) finetuned on Natural Questions (Kwiatkowski et al., 2019) as our question answering system. For creating C∗ we use Atlas-large, enabling us to efficiently process millions of API calls; during inference, we use the larger Atlas-xxl model.

Calculator Our calculator is based on a simple Python script and only supports the operators “+”, “−”, “∗”, and “/”. It does not return any result for syntactically invalid equations. For sampling API calls, we apply heuristic filters to our subset of CCNet and only process documents that either (i) contain at least three numbers within a window of 100 tokens, where one of these numbers is the result of applying a mathematical operation to the other two, (ii) contain one of the sequences “=”, “equals”, “equal to”, “total of”, “average of” followed by a number, or (iii) contain at least three numbers; for texts that only match the last criterion, we only keep a random subset of 1%.

Calendar For creating our dataset C∗, we operate under the assumption that the calendar date in such cases should be the date that the document was created. We approximate this by extracting the date from the URL, if it is present. We filter out texts for which a date cannot be extracted, leaving around 18% of the documents.

Machine Translation For both training and inference, we use the 600M parameter NLLB (Costa-jussà et al., 2022) as our machine translation (MT) model. The source language is automatically detected using the fastText classifier (Joulin et al., 2016), while the target language is always set to English. Since most of the CCNet dataset is in English, we filter out the parts that contain only English text before generating API calls. More specifically, we only keep those paragraphs which contain text chunks in a language other than English preceded and followed by English text. We use text chunks of size 10 tokens. To determine whether the middle text chunk is in a language different than English we again use the fastText classifier with a confidence greater than 0.8. We also filter out any text chunks that contain only numbers or special symbols. This filtering mechanism allows us to generate data more efficiently by focusing our API call generations in places where the MT tool is likely to be helpful. After generating the MT API calls, we additionally remove from our training set those where the input to the MT tool appears after the API call but not before it. While during data generation the model can look ahead to generate API calls, this is not possible at inference time, so we want to dissuade the model from calling the API in such cases.

A.2   Prompts

Below, we list the prompts used to sample API calls for each tool considered.

Question Answering We use the following prompt for the question answering tool:

Your task is to add calls to a Question Answering API to a piece of text. The questions should help you get information required to complete the text. You can call the API by writing "[QA(question)]" where "question" is the question you want to ask. Here are some examples of API calls:

Input: Joe Biden was born in Scranton, Pennsylvania.
Output: Joe Biden was born in [QA("Where was Joe Biden born?")] Scranton, [QA("In which state is Scranton?")] Pennsylvania.

Input: Coca-Cola, or Coke, is a carbonated soft drink manufactured by the Coca-Cola Company.
Output: Coca-Cola, or [QA("What other name is Coca-Cola known by?")] Coke, is a carbonated soft drink manufactured by [QA("Who manufactures Coca-Cola?")] the Coca-Cola Company.

Input: x
Output:

Calculator We use the following prompt for the calculator:

Your task is to add calls to a Calculator API to a piece of text.
The calls should help you get information required to complete the text. You can call the API by writing "[Calculator(expression)]" where "expression" is the expression to be computed. Here are some examples of API calls:

Input: The number in the next term is 18 + 12 x 3 = 54.
Output: The number in the next term is 18 + 12 x 3 = [Calculator(18 + 12 * 3)] 54.

Input: The population is 658,893 people. This is 11.4% of the national average of 5,763,868 people.
Output: The population is 658,893 people. This is 11.4% of the national average of [Calculator(658,893 / 11.4%)] 5,763,868 people.

Input: A total of 252 qualifying matches were played, and 723 goals were scored (an average of 2.87 per match). This is three times less than the 2169 goals last year.
Output: A total of 252 qualifying matches were played, and 723 goals were scored (an average of [Calculator(723 / 252)] 2.87 per match). This is twenty goals more than the [Calculator(723 - 20)] 703 goals last year.

Input: I went to Paris in 1994 and stayed there until 2011, so in total, it was 17 years.
Output: I went to Paris in 1994 and stayed there until 2011, so in total, it was [Calculator(2011 - 1994)] 17 years.

Input: From this, we have 4 * 30 minutes = 120 minutes.
Output: From this, we have 4 * 30 minutes = [Calculator(4 * 30)] 120 minutes.

Input: x
Output:

Wikipedia Search We use the following prompt for the Wikipedia search tool:

Your task is to complete a given piece of text. You can use a Wikipedia Search API to look up information. You can do so by writing "[WikiSearch(term)]" where "term" is the search term you want to look up. Here are some examples of API calls:

Input: The colors on the flag of Ghana have the following meanings: red is for the blood of martyrs, green for forests, and gold for mineral wealth.
Output: The colors on the flag of Ghana have the following meanings: red is for [WikiSearch("Ghana flag red meaning")] the blood of martyrs, green for forests, and gold for mineral wealth.

Input: But what are the risks during production of nanomaterials? Some nanomaterials may give rise to various kinds of lung damage.
Output: But what are the risks during production of nanomaterials? [WikiSearch("nanomaterial production risks")] Some nanomaterials may give rise to various kinds of lung damage.

Input: Metformin is the first-line drug for patients with type 2 diabetes and obesity.
Output: Metformin is the first-line drug for [WikiSearch("Metformin first-line drug")] patients with type 2 diabetes and obesity.

Input: x
Output:

Machine Translation We use the following prompt for the machine translation tool:

Your task is to complete a given piece of text by using a Machine Translation API.
You can do so by writing "[MT(text)]" where text is the text to be translated into English.
Here are some examples:

Input: He has published one book: O homem suprimido (“The Supressed Man”)
Output: He has published one book: O homem suprimido [MT(O homem suprimido)] (“The Supressed Man”)

Input: In Morris de Jonge’s Jeschuah, der klassische jüdische Mann, there is a description of a Jewish writer
Output: In Morris de Jonge’s Jeschuah, der klassische jüdische Mann [MT(der klassische jüdische Mann)], there is a description of a Jewish writer

Input: 南京高淳县住房和城乡建设局 城市新区设计 a plane of reference Gaochun is one of seven districts of the provincial capital Nanjing
Output: [MT(南京高淳县住房和城乡建设局 城市新区设计)] a plane of reference Gaochun is one of seven districts of the provincial capital Nanjing

Input: x
Output:

Calendar We use the following prompt for the calendar tool:

Your task is to add calls to a Calendar API to a piece of text. The API calls should help you get information required to complete the text. You can call the API by writing "[Calendar()]" Here are some examples of API calls:

Input: Today is the first Friday of the year.
Output: Today is the first [Calendar()] Friday of the year.

Input: The president of the United States is Joe Biden.
Output: The president of the United States is [Calendar()] Joe Biden.

Input: The current day of the week is Wednesday.
Output: The current day of the week is [Calendar()] Wednesday.

Input: The number of days from now until Christmas is 30.
Output: The number of days from now until Christmas is [Calendar()] 30.

Input: The store is never open on the weekend, so today it is closed.
Output: The store is never open on the weekend, so today [Calendar()] it is closed.

Input: x
Output:

B     Toolformer Training

We use up to 25k examples per API, a maximum sequence length of 1,024, and an effective batch size of 128. All models are trained using DeepSpeed’s ZeRO-3 (Rasley et al., 2020) on 8 NVIDIA A100 40GB GPUs with BF16. We train for up to 2k steps, evaluating perplexity (PPL) every 500 steps on a small development set from CCNet containing 1,000 examples, and pick the checkpoint that performs best.

C     Zero-Shot Prompts

C.1    LAMA and TEMPLAMA
For both LAMA and TEMPLAMA, given an input text x, we use the following prompt: Please complete the following text so that it is factually correct: x.

C.2    Math Benchmarks
For all math benchmarks, given a context x and a question q, our prompt is: x q The answer is.

C.3    Question Answering
For all question answering datasets, including DATESET, we simply prefix the question with Answer the following question:. We append a question mark if the question does not already end with one.

C.4    Multilingual Question Answering
For MLQA, given a context x and a question q, our prompt is: Your task is to answer a question based on the following paragraph: x Now answer the following question in English: q.

D    DATESET

DATESET is created by first randomly selecting 500 “current dates”. For each current date, another relatively past/future date is randomly selected within a four-year range, and the two dates are used to fill the query templates in Table 11. An example of one such query using the first template would be, “How many days ago was August 14, 2020?” If called, the Calendar tool would return the presumed current date (e.g., “Today is Sunday, November 20, 2020”).

Template                                                                Size

How many days {ago was, are there until} {past_date, future_date}?       400
What {day of the week, day of the month, month, year} was it
  (current_date – past_date) {days, weeks, months, years} ago?           800
What {day of the week, day of the month, month, year} will it be
  in (future_date – current_date) days?                                  800
What day of the week {is, was} it on {past_date, future_date}?           400
What {day of the week, day of the month, month, year} {is, was} it
  {the day before yesterday, yesterday, today, tomorrow,
  the day after tomorrow}?                                             4,000
What {day of the week, day of the month, month} {is, was}
  holiday this year?                                                   1,800
How many {days, weeks, months, years} {ago was, are there until}
  holiday this year?                                                   1,200

Total                                                                  9,400

Table 11: Templates used to create DATESET where a current_date is randomly selected. For each current_date, a random past_date and future_date is generated and used to fill each template, if relevant. The federal holidays in the United States (e.g., Thanksgiving) were used in the templates involving holidays.
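
The DATESET construction described in Appendix D can be sketched roughly as follows. This is only an illustration for the first template of Table 11, not the authors' script: the function names, the window from which current dates are drawn, and the reading of "a four-year range" as at most 4 * 365 days are all assumptions made for this example.

```python
import random
from datetime import date, timedelta

# Assumption: "within a four-year range" is read as an offset of at most 4 * 365 days.
MAX_OFFSET_DAYS = 4 * 365


def format_date(d: date) -> str:
    """Render a date like 'August 14, 2020' (English names in the default locale)."""
    return f"{d:%B} {d.day}, {d.year}"


def days_ago_query(current: date, past: date) -> tuple[str, int]:
    """Fill the first template of Table 11 and compute its gold answer."""
    question = f"How many days ago was {format_date(past)}?"
    return question, (current - past).days


def calendar_output(current: date) -> str:
    """Hypothetical rendering of the text the Calendar tool would return."""
    return f"Today is {current:%A}, {format_date(current)}"


def sample_queries(n_current_dates: int, seed: int = 0) -> list[dict]:
    """Draw random current dates and pair each with a random earlier date."""
    rng = random.Random(seed)
    first = date(2000, 1, 1)  # assumed sampling window, not taken from the paper
    samples = []
    for _ in range(n_current_dates):
        current = first + timedelta(days=rng.randrange(365 * 30))
        past = current - timedelta(days=rng.randint(1, MAX_OFFSET_DAYS))
        question, answer = days_ago_query(current, past)
        samples.append({
            "tool_output": calendar_output(current),
            "question": question,
            "answer": answer,
        })
    return samples


if __name__ == "__main__":
    for sample in sample_queries(3):
        print(sample)
```

The remaining templates would follow the same pattern: each one is a format string filled from the sampled dates, with the gold answer computed by ordinary date arithmetic on the presumed current date.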