This is the third installment of a three-part blog series. To read the first part, click here. And to read the second part, click here.
In this explainer:
- Why LLMs are more than just chatbots.
- How “multi-modal” models can process images, video, audio, and more.
- How AI developers are building LLMs that can take action in the real world.
When people think of large language models (LLMs), they often think of chatbots: conversational AI systems that can answer questions, write poems, and so on. But with relatively minimal adaptation, LLMs can assist in writing code, processing images and audio, and controlling virtual interfaces and physical machines. Some of these capabilities are already deployed, and still more are on the horizon as researchers work to enable LLMs and related systems to take autonomous actions without human supervision. This explainer covers research and deployments that make LLMs more than just chatbots.
Text Is More than Natural Language
In the previous explainers, we saw that LLMs accept text as input and produce text as output. This has naturally led to their application and use as chatbots, where humans and AI systems communicate by writing in regular language (or “natural language,” as computer scientists call it). But although chatbots are one of the most widely known applications of LLMs at the moment, this does not mean that LLMs are primarily useful as conversation partners. Text can express much more than natural language.
Consider computer code. Although computer code is used to control computers, not (usually) to communicate between humans, it consists of the same readable words and punctuation that LLMs are designed to process. And like natural-language text, a very large quantity of computer code on the internet has been scraped into pre-training datasets. Much of this code is “commented” with human-written descriptions to make it easier for other programmers to understand. In just the same way that pre-trained LLMs can learn to predict what words will likely come next, they can also learn to predict what code will come next, including the code that will come after a plain-English description of that code. Today, several commercially available code-generation models, such as GitHub Copilot, are used by software developers.
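To make the prediction process concrete, here is an invented illustration (not the output of any particular model): a prompt that mixes a plain-English comment with the beginning of a function, followed by the kind of continuation a code-generation model would typically predict.

```python
# A prompt of the kind a code-generation model might receive: a plain-English
# comment describing the goal, followed by the start of a function.
prompt = """
# Return the average of a list of numbers, or 0.0 if the list is empty.
def average(numbers):
"""

# A plausible continuation, predicted one token at a time in the same way an
# LLM predicts the next word of a sentence. (Invented for illustration.)
completion = """
    if not numbers:
        return 0.0
    return sum(numbers) / len(numbers)
"""
```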
Code-generation capabilities in LLMs are not only used by software developers. OpenAI has developed a version of ChatGPT that is capable of writing code to perform simple data analysis tasks like plotting graphs in response to natural language instructions. In addition, third-party developers can write “plugins” for ChatGPT that allow it to write code to send and receive information from external services, such as adding items to an Instacart order or running a calculation using the scientific platform Wolfram Alpha.
In addition to code, there are other categories of information that can be expressed as text. For example, pre-trained LLMs can do basic math, and AI researchers have developed pre-training techniques and datasets to build improved math models. LLMs can even generate chess moves and compose simple melodies, since both chess games and simple musical notation can be written as text, although current LLMs do not perform as well as humans or specialized AI systems at these tasks.
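For instance, a chess opening and a simple melody can both be written as ordinary strings of text. The chess moves below use standard algebraic notation; the melody uses an informal note-name format invented here purely for illustration.

```python
# A chess opening in standard algebraic notation: just a string of text.
opening = "1. e4 e5 2. Nf3 Nc6 3. Bb5"  # the first moves of the Ruy Lopez

# A simple melody written as text using an informal note-name notation
# (pitch plus octave); richer text formats, such as ABC notation, also exist.
melody = "C4 C4 G4 G4 A4 A4 G4"  # the opening of "Twinkle, Twinkle, Little Star"
```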
Multimodal LLMs Can Accept More than Just Text
Despite having the word “language” in their name, the basic technology underpinning LLMs is not limited to processing information in the form of text. In our pre-training explainer, we described how LLMs process words by first splitting them into parts, called tokens, and then converting them into numbers that are processed by the model’s internals. Researchers have more recently developed methods to “tokenize” other kinds of data, such as images and audio, expanding the range of inputs the models can accept. Once this data has been processed by a special-purpose tokenizer, it can be fed into a model with the same underlying Transformer architecture as text-only LLMs. For example, the Vision Transformer is constructed very similarly to LLMs, but it can classify images: the pixels in the input images are translated into tokens just as text would be.

In addition, there are an increasing number of “multimodal” models that can accept text, visual, and other types of input at the same time. For example, while still referred to as “large language models,” OpenAI’s GPT-4 and Google’s Gemini can accept image inputs along with language inputs. Audio data can also be tokenized and fed into Transformer models, as in Whisper, an OpenAI audio-to-text model that is likely behind ChatGPT’s ability to accept auditory inputs. Google’s recently announced Gemini 1.5 is able to process audio, image, video, and text data.
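To give a rough sense of what image “tokenization” involves, the sketch below follows the original Vision Transformer recipe of cutting a 224 x 224 image into 16 x 16 patches. It is a simplified illustration, not a real model: the random matrix stands in for the learned projection a trained model would apply.

```python
import numpy as np

# A minimal sketch of ViT-style image "tokenization": cut an image into
# fixed-size patches and project each flattened patch to an embedding vector,
# yielding a sequence of "tokens" a Transformer can process like text tokens.

image = np.random.rand(224, 224, 3)   # a dummy 224 x 224 RGB image
patch_size, embed_dim = 16, 768
grid = 224 // patch_size              # 14 patches per side

# Rearrange the pixels into a (14 * 14) x (16 * 16 * 3) array: one row per patch.
patches = image.reshape(grid, patch_size, grid, patch_size, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(grid * grid, -1)

# A linear projection maps each flattened patch to an embedding vector; the
# random matrix here stands in for weights a real model learns during training.
projection = np.random.rand(patch_size * patch_size * 3, embed_dim)
tokens = patches @ projection

print(tokens.shape)  # (196, 768): a sequence of 196 image "tokens"
```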
Autonomous LLMs
It’s easy to think of LLMs as confined to a chat window on a website, unable to do anything but communicate with their users. But LLMs are beginning to control physical systems and make decisions in the real world. The research in this area is varied, but there are two broad strands: training LLMs to use different kinds of external tools and training LLMs to act as autonomous agents.
Tool Use
One way that researchers are working to extend the capabilities of LLMs is by granting them direct control of virtual and physical systems. Some simple mechanisms for this are already available. For example, OpenAI’s data analysis tool (described above) automatically generates and executes data-analysis code on a dedicated OpenAI server, so the user does not need to run the code themselves.
LLMs are also increasingly used to browse the internet. Researchers have developed open-source tools that map LLM outputs to clicks and keystrokes in a browser, and companies like OpenAI and Microsoft have equipped their models with web browsing capabilities that allow them to access websites and report back to the user about their contents.
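As a rough illustration of how such a mapping can work, the sketch below parses a made-up command format into browser actions. Both the command format and the browser functions are hypothetical stand-ins invented for this explainer; real browser-control tools define their own formats and call actual browser-automation libraries.

```python
# Hypothetical stand-ins for real browser-automation calls.
def click(selector: str) -> None:
    print(f"[browser] clicking element {selector}")

def type_text(selector: str, text: str) -> None:
    print(f"[browser] typing {text!r} into {selector}")

def execute_llm_command(output: str) -> None:
    """Interpret one line of model output, e.g. 'CLICK #search-button'."""
    action, _, rest = output.strip().partition(" ")
    if action == "CLICK":
        click(rest)
    elif action == "TYPE":
        selector, _, text = rest.partition(" ")
        type_text(selector, text)
    else:
        raise ValueError(f"Unrecognized command: {output!r}")

# Two lines an LLM might emit while helping a user search for flights.
execute_llm_command("TYPE #search-box flights from Washington to Boston")
execute_llm_command("CLICK #search-button")
```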
LLMs have also been trained to help control physical systems. Google’s PaLM-E can take plain-English instructions as input and produce commands to control simulated and physical robots. Microsoft has experimented with a system that uses ChatGPT to string together manually written control code, allowing drones to be directed with plain-language descriptions. And a system called ChemCrow connects an LLM to multiple external scientific tools in order to carry out chemistry tasks, including issuing control commands to robotic chemical-synthesis machines.
Much of this work is fairly early-stage, but it is progressing rapidly. In the next few years, we should expect to see LLMs being connected to many more kinds of tools.
Autonomous Agents
Today, LLMs generally respond to specific requests from their users. But many AI developers believe that a promising direction for the future is to build LLM agents: models that can autonomously carry out complex, multistep goals. To work well, LLM agents will need to be able to break a goal down into a sequence of steps, carry those steps out, and adjust course as needed. Let’s consider each of these capabilities in turn.
Breaking a goal down into sub-steps is a well-established area of computer science, known as planning. But pre-LLM approaches to planning typically require either a mathematical specification of the task, or reams of data showing how it can be accomplished. The promise of LLM agents lies in the hope that LLMs can be used to select and carry out actions in environments that are too complex or open-ended to represent formally, without needing large amounts of task-specific data to train a dedicated system “from scratch.” If LLMs could be used to operate autonomously in those settings—in other words, in most real-world environments—they would become much more valuable.
Enabling LLMs to actually take action in the real world is relatively straightforward, as described under Tool Use above. Simple “scaffolding” software can interpret the LLM’s text outputs as commands, and then execute them as appropriate.1
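The sketch below shows what this kind of scaffolding loop can look like. The model call and the tools are placeholders invented for this explainer, not any particular company’s API; the key point is that ordinary software, not the LLM itself, executes each action and feeds the result back into the next prompt.

```python
# A minimal agent scaffolding loop: send the goal and history to an LLM,
# interpret the reply as a tool command, run the tool, and report the result
# back to the model. Everything here is a simplified placeholder.

def call_llm(prompt: str) -> str:
    """Stand-in for a request to a language model; returns canned replies here."""
    if "Result:" not in prompt:
        return "search_web: cheapest flights from Washington to Boston"
    return "DONE"

TOOLS = {
    "search_web": lambda query: f"(search results for {query!r})",
    "send_email": lambda body: "(email sent)",
}

def run_agent(goal: str, max_steps: int = 10) -> str:
    history = f"Goal: {goal}\n"
    for _ in range(max_steps):
        reply = call_llm(history)
        if reply.startswith("DONE"):
            break
        tool_name, _, argument = reply.partition(": ")
        result = TOOLS[tool_name](argument)   # the scaffolding, not the LLM, runs the tool
        history += f"Action: {reply}\nResult: {result}\n"
    return history

print(run_agent("Find me a cheap flight to Boston"))
```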
Where an action would require logging into an account (e.g., to make a financial transaction), this can be handled by having the human user step in, or by giving the LLM agent pre-approval (e.g., access to credentials, or a browser session where the user is already logged in to the relevant accounts). AI developers have begun to consider what kinds of agent actions should require human involvement, but they do not have settled answers yet.
Adapting to unanticipated difficulties while carrying out a task is one major obstacle holding LLM agents back at present. While LLMs have some planning ability without any special fine-tuning or training, it is fairly limited. The PaLM-E model described above can successfully generate and carry out a four-step plan describing how a robot should pick up an object, but LLMs cannot reliably construct more complex plans, such as identifying which mouse movements and keystrokes are required to purchase the cheapest flight between two cities. One problem is that if an LLM has some probability of failing at each step of a plan, and a single error can undermine the whole plan, then longer plans will fail more often overall, especially if the LLM cannot recover from its mistakes. Researchers are working to improve on this, including by prompting LLMs to write out their reasoning for choosing an action, critique their own plans, and delegate actions to other (perhaps specialized) LLM agents. These kinds of prompts are often built into the same scaffolding software that allows LLMs to browse the web and take other actions.
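A quick calculation illustrates how quickly these errors compound. If we assume, purely for illustration, that an agent completes each individual step correctly 95 percent of the time and cannot recover from a mistake, the chance of the whole plan succeeding falls sharply as plans get longer.

```python
# Illustrative only: the 95% per-step success rate is an assumption chosen
# for this example, not a measured property of any model.
per_step_success = 0.95

for steps in (4, 10, 20, 50):
    plan_success = per_step_success ** steps
    print(f"{steps:>2}-step plan succeeds about {plan_success:.0%} of the time")

# Prints roughly 81% for 4 steps, 60% for 10, 36% for 20, and 8% for 50:
# without the ability to detect and recover from errors, longer plans fail
# far more often.
```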
The Future of Autonomous LLMs
While LLM agents have been successful in playing Minecraft and interacting in virtual worlds, they have largely not been reliable enough to deploy in real-life use cases. Nevertheless, these early demonstrations have driven major interest in the area, with both open-source developers and large companies like Google DeepMind and OpenAI working to build LLM agents. The LLM agent startup Adept, whose AI agent is still behind a waitlist, was most recently valued at more than $1 billion. In January 2024, the startup MultiOn unveiled a prototype AI agent that can schedule calendar invites, browse the internet, and perform other simple tasks, although like other LLM agents it suffers from slow speeds and a lack of reliability.
Experts disagree about whether Transformer-based systems alone, even with scaffolding, will be sufficient to produce commercially viable autonomous AI systems.2
If they are, it is unclear whether this will require dedicated attention to agent-specific questions such as how best to design scaffolding software, or whether most progress will come simply from continuing to develop LLMs that are broadly more capable. In the past, LLMs have dramatically improved in their ability to solve problems simply through increased scale. Some researchers anticipate that future generations of LLMs will be able to act autonomously with only fine-tuning and relatively simple scaffolding.
Today, research often focuses on getting autonomous LLMs to perform specific, defined tasks like booking flights. However, the goal of much of this research is to eventually produce agents that can carry out much more open-ended tasks. As AI agents increase in reliability, they may be able to go from performing simple tasks (e.g., “book me a flight”) to much more complex ones (e.g., “organize an event,” “run my payroll,” or even “act as the CEO of my business”). Research is unpredictable, and we don’t know when, or if, these capabilities will be developed. But given the uncertainty involved, we should consider the possibility that these developments will happen soon.
Implications
In this explainer, we described how LLMs and related systems are already much more than conversationalists, and can be used to generate executable code, analyze images, control robots, and more. We also covered the rapidly developing research into AI agents that plan and execute actions on their own.
These research areas, many of which are still in development but are rapidly coming into production, raise new governance questions that conversational LLMs do not. While the goal of this explainer is not to create an exhaustive list of such issues, we will list a few for illustrative purposes:
- Attack surfaces: Autonomous and multimodal LLMs present many more attack surfaces through which they can be exploited by malicious actors. How should these risks be managed?
- AI agent liability: If an LLM-powered AI agent malfunctions and causes harm, how should legal liability be divided between the LLM developer, the scaffolding developer, and the user?
- Disclosure: Should AI agents be required to disclose themselves to those they interact with?
- Monitoring: Should governments, companies, and civil society develop mechanisms to track and respond to problems arising from AI agents acting autonomously?
- AI for risk management and oversight: Can AI systems, including AI agents, be used to manage risks arising from other AI systems?
- Tool restrictions: Should LLMs be allowed to autonomously control financial, biological, chemical, or other sensitive systems? Is having a “human in the loop” sufficient oversight?
- Emergent risks: How should we monitor and protect against emergent risks, such as risks arising from interactions among AI agents and between AI agents and humans, and risks from out-of-control or power-seeking agents?
While some of these questions, such as those around cybersecurity vulnerabilities, are relevant to deployed systems today, others are mostly relevant to systems still in development. However, these questions are still important because of the rapid progress in AI deployments over the last year; language models like GPT-3 were a niche interest until around a year ago, and vision-language multimodal models have only been widely deployed since May 2023. We hope that this explainer can provide a preview of the research areas and capabilities that may be relevant for policymakers in the near future.
Thanks to John Bansemer, Matt Burtell, Hanna Dohmen, James Dunham, Rachel Freedman, Krystal Jackson, Ben Murphy, and Vikram Venkatram for their feedback on this post.
- AutoGPT is an early example of scaffolding software that converts an LLM into an agent that can move around the web.
- The Transformer, developed in 2017, is the deep learning architecture that spurred the development of LLMs. To date, essentially all LLMs are designed as some type of Transformer. However, this may change in the future if different architectures are discovered that improve on the performance of Transformer-based models.