Large language models can write essays, answer questions, and even joke around, but they can't do math without help. They don't remember facts unless you feed them. They can't check the weather or translate a sentence in real time. For years, engineers tried to fix this by adding custom code, fine-tuning models for each task, or using prompts to guide them. None of it worked well at scale. Then came Toolformer.
Why LLMs Need Tools
Imagine you're a student who knows everything about history but can't do basic arithmetic. You can explain the causes of World War II, but you can't calculate 17 × 23. That's what large language models are like. They're incredibly good at understanding language, but terrible at tasks that require precision, speed, or access to live data. A calculator is simpler than a human brain, yet better at math. A search engine has more up-to-date facts than any model trained on old data. So why not let the model use these tools? The problem isn't just that models are bad at math or facts. It's that they make things up. Hallucinations. False citations. Wrong dates. These aren't bugs; they're structural. LLMs predict the next word based on patterns, not truth. So if you ask them to solve 892 ÷ 4, they'll guess. And they'll guess wrong about 30% of the time, even on simple problems. Toolformer flips this. Instead of forcing the model to memorize everything, it teaches the model when to say, "Wait, let me use a calculator." And it does this without human labels. No hand-crafted prompts. No task-specific training. Just raw text and a few example API calls.
How Toolformer Works
Toolformer is built on a 6.7 billion parameter version of GPT-J, a model already good at language. But instead of training it on more text, the researchers gave it access to five simple tools:
- A calculator
- A question-answering system (like a smart search)
- A Wikipedia search engine
- A translation API
- A calendar API
In the training text, a tool call appears inline, like this: API_CALL: calculator(17 * 23). The model sees it as part of the text stream, just like any other word.
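A toy sketch of how such an inline call could be executed by a post-processing step. The `API_CALL:` syntax follows the example above; the `TOOLS` registry and the `execute_inline_calls` helper are hypothetical illustrations, not the paper's code:

```python
import re

# Toy tool registry. The model emits calls as plain text; a
# post-processing step executes them and splices the result
# back into the stream.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def execute_inline_calls(text):
    """Replace each 'API_CALL: tool(args)' with 'call -> result'."""
    pattern = re.compile(r"API_CALL: (\w+)\((.*?)\)")
    def run(match):
        tool, args = match.group(1), match.group(2)
        return f"API_CALL: {tool}({args}) -> {TOOLS[tool](args)}"
    return pattern.sub(run, text)

print(execute_inline_calls("The product is API_CALL: calculator(17 * 23)."))
# -> The product is API_CALL: calculator(17 * 23) -> 391.
```

Because the call and its result are just more text, no special decoding machinery is needed: the model keeps predicting words as usual.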
Now comes the magic. The system doesn't tell the model when to use these tools. Instead, it randomly inserts possible API calls into a huge dataset of text. Then it asks: "Did this API call make the next few words easier to predict?" If yes, it keeps the call. If no, it throws it out.
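The keep-or-discard test can be sketched as follows. Assume a function `lm_loss(prefix, continuation)` that scores how hard the continuation is to predict given the prefix; the bracketed call format, the threshold `tau`, and the mock loss values are all illustrative, not the paper's numbers:

```python
def keep_api_call(lm_loss, prefix, call, call_with_result, continuation, tau=1.0):
    """Keep a candidate API call only if inserting it *with its
    result* makes the continuation easier to predict than both
    (a) no call at all and (b) the call without its result."""
    baseline = min(lm_loss(prefix, continuation),
                   lm_loss(prefix + call + " ", continuation))
    with_result = lm_loss(prefix + call_with_result + " ", continuation)
    return baseline - with_result >= tau

# Mock loss: pretend the continuation is easy to predict only
# when it already appears in the prefix (e.g. as a tool result).
def mock_loss(prefix, continuation):
    return 0.2 if continuation in prefix else 3.0

print(keep_api_call(
    mock_loss,
    prefix="17 * 23 equals ",
    call="[calculator(17 * 23)]",
    call_with_result="[calculator(17 * 23) -> 391]",
    continuation="391",
))
# -> True: the calculator's answer made the next words predictable
```

Calls that survive this filter become training data, so the model gradually learns which situations call for which tool.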
Think of it like a kid learning to use a dictionary. You don't teach them every word. You just hand them a dictionary and say, "Try using it when you're stuck." Over time, they learn: "Oh, when I don't know how to spell 'necessary,' I should look it up." Toolformer does the same thing, but automatically, at scale.
The result? A model that learns, on its own, when to call a calculator for math, when to search for a fact, and when to translate a phrase. And it does all this without forgetting how to write essays or answer open-ended questions.
Why This Is a Big Deal
Before Toolformer, most tools were added manually. You'd build a chatbot that uses a weather API only for weather questions. You'd train a separate model for math. You'd write rules. It was messy, limited, and didn't scale. Toolformer doesn't need rules. It doesn't need labels. It doesn't even need a lot of examples. Just a few dozen demonstrations per tool, and it learns from billions of text samples. That's the breakthrough. In tests, Toolformer outperformed GPT-3 (which has 175 billion parameters) on math, fact-checking, and translation tasks, despite being roughly 25 times smaller. That's like a compact car beating a truck in a race because it knows when to use the highway. And it doesn't lose its language skills. It still writes poetry. Still explains quantum physics. Still jokes about cats. The tools just become part of its thinking process. This isn't just about better performance. It's about flexibility. A model that can use tools on its own can adapt to new tasks without retraining. Need a new API? Just give it a few examples. The model figures out the rest.
What Toolformer Canât Do (Yet)
Toolformer isn't perfect. It only works with stateless tools: ones that give an answer and forget everything after. That means it can use a calculator, a search engine, or a dictionary. But it can't book a flight. It can't order pizza. It can't manage a conversation that spans multiple steps. Why? Because tools like hotel booking require memory. You need to remember the user's name, dates, preferences, payment info. Toolformer's internal state is "blurry." It doesn't track context over time. So if you ask it to "book a flight from Denver to Chicago next Tuesday," it might call the API, but it won't remember what it said five minutes ago. This is a hard problem. Most AI systems today struggle with stateful interactions. Toolformer's approach doesn't solve that. But it does show a path: if you can represent state as text, maybe you can teach models to manage it too.
How It Compares to Other Approaches
There are other ways to make LLMs use tools. ReAct, for example, uses prompts like "Think, then Act." The model writes a thought, then picks an action, then observes the result. It works, but it's slow. And it needs humans to design the action space. What actions are allowed? What format should they be in? Toolformer doesn't need that. It learns the action space itself. The model decides what to call, when, and how. No human-designed rules. No fixed set of actions. Just text, APIs, and self-supervision. Another approach, ASTRO, trains models to reason like search engines, iteratively refining answers by exploring links. It's powerful, but it's also complex. Toolformer is simpler. More elegant. Less engineering. More learning.
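For contrast, a generic ReAct-style loop might look like the sketch below. The Thought/Action/Observation format and the `TOOLS` action space are hand-designed, which is exactly the engineering Toolformer sidesteps; `ask_model` stands in for a real LLM call:

```python
# Hand-designed action space: humans decide which tools exist
# and how actions must be written ("Action: tool[argument]").
TOOLS = {"search": lambda query: f"(stub search result for {query!r})"}

def react_loop(ask_model, question, max_steps=5):
    """Generic ReAct-style loop: the model alternates written
    thoughts and actions; the loop executes each action and
    feeds the observation back into the transcript."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = ask_model(transcript)          # model writes one step
        transcript += step + "\n"
        if step.startswith("Answer:"):        # model is done
            return step
        if step.startswith("Action: "):       # execute the tool
            tool, _, arg = step[len("Action: "):].partition("[")
            transcript += f"Observation: {TOOLS[tool](arg.rstrip(']'))}\n"
    return None
```

Every element here (the allowed tools, the step format, the stopping convention) had to be specified by a person; Toolformer instead lets the model discover where calls belong from raw text.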
The Bigger Picture
Toolformer isn't a product you can download. It's a research paper. A proof of concept. But its implications are huge. If models can learn to use tools on their own, we don't need to build specialized AI for every task. We don't need to train a new model for finance, medicine, or law. We just need to give them access to the right APIs, and let them figure out how to use them. This could change how AI is deployed. Instead of huge, expensive models trained on massive datasets, we could have smaller, smarter models that reach out to the world for help when they need it. Cheaper. Faster. More accurate. It also means AI systems become more transparent. If a model says, "I looked up the population of Tokyo," you can check the source. If it used a calculator, you can see the math. No more black boxes.
What's Next?
The next step? Teaching models to handle state. To remember. To manage conversations over time. To use tools that change the world, not just answer questions. Researchers are already working on it. Some are experimenting with memory buffers. Others are trying to represent state as text snippets. Maybe the next version of Toolformer will book your flight, not because it was told to, but because it learned that when you say "I need to fly to New York next week," you're asking for help. The goal isn't to replace tools. It's to make models better at using them. And Toolformer shows that's possible, not with more data, not with more computing power, but with smarter learning.
Frequently Asked Questions
What is Toolformer?
Toolformer is a language model trained to use external tools like calculators, search engines, and translation APIs without human supervision. It learns when and how to call these tools by analyzing which API calls improve its ability to predict the next word in text. Developed by researchers at Meta AI and presented at NeurIPS 2023, it's designed to enhance large language models without requiring task-specific fine-tuning.
How is Toolformer different from other AI tools?
Unlike systems that rely on human-written prompts or fixed action lists (like ReAct), Toolformer learns to use tools through self-supervision. It doesn't need labeled examples for every task. Instead, it tests thousands of possible API calls on its own and keeps only the ones that help it predict future text better. This makes it more flexible and scalable than previous methods.
Can Toolformer book flights or make purchases?
No, not yet. Toolformer only works with stateless APIs: tools that return an answer and don't require memory. It can calculate math, search Wikipedia, or translate text. But it can't handle multi-step tasks like booking a hotel, where it needs to remember your preferences, dates, or payment details. That's a known limitation, and researchers are working on ways to add state management.
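The stateless/stateful distinction can be made concrete with a small sketch. The `BookingSession` class is hypothetical, purely to illustrate the kind of accumulated memory Toolformer lacks:

```python
def calculator(expr):
    """Stateless: returns an answer and forgets everything,
    which is the only kind of tool Toolformer can use."""
    return eval(expr, {"__builtins__": {}})

class BookingSession:
    """Stateful (hypothetical): a booking flow must accumulate
    details across multiple turns before it can act."""
    def __init__(self):
        self.memory = {}
    def tell(self, key, value):       # e.g. "origin" -> "Denver"
        self.memory[key] = value
    def ready_to_book(self):
        return {"origin", "destination", "date"} <= self.memory.keys()
```

A calculator call is complete the moment it returns; a booking session is only useful if something remembers what was said several turns ago, and Toolformer has no mechanism for that.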
Does Toolformer need a lot of training data?
Surprisingly, no. It only needs a few dozen example prompts per tool, like "What is 12 times 15?" → "API_CALL: calculator(12 * 15)". Then it learns from billions of text samples. This makes it much cheaper and easier to adapt than models that require thousands of labeled examples.
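Such a per-tool demonstration might look like the sketch below; the exact prompt wording is hypothetical, loosely following the article's `API_CALL:` notation:

```python
# Hypothetical few-shot prompt for the calculator tool: a couple
# of demonstrations showing where a call belongs, then the input
# text the model should annotate with candidate calls.
CALCULATOR_PROMPT = """\
Your task is to add calls to a calculator API to a piece of text.
Example: "What is 12 times 15? It is API_CALL: calculator(12 * 15) 180."
Example: "Out of 1400 participants, API_CALL: calculator(400 / 1400) 29% passed."
Input: "{text}"
Output:"""

def build_annotation_prompt(text):
    return CALCULATOR_PROMPT.format(text=text)
```

A handful of demonstrations like these is enough to get the model proposing candidate calls; the self-supervised filter then decides which proposals to keep.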
Is Toolformer better than GPT-3?
On specific tasks like math, fact-checking, and translation, yes, even though Toolformer is based on a 6.7B parameter model and GPT-3 has 175B. Toolformer outperforms GPT-3 on these tasks because it knows when to use tools instead of guessing. But GPT-3 is still better at open-ended writing and creativity. Toolformer doesn't replace language models; it makes them smarter.
Can I use Toolformer today?
Not as a public product. Toolformer is a research model released as a paper in 2023. The code and training details are available to researchers, but there's no official API or app. However, its ideas are already influencing new models like ASTRO and other tool-augmented systems, so expect to see its principles in commercial AI tools soon.
Mark Brantner
So let me get this straight... we trained a model to use a calculator instead of just giving it one? Like, wow. Next they'll teach it to use a fork instead of eating with its hands.