In August, during the hacker conference DEFCON, the White House co-hosted the first-ever Generative AI Red Team competition where thousands of attendees signed up to test eight large language models (LLMs) provided by various AI companies. They faced a variety of different challenges that effectively centered around coaxing the models into behaving badly, from getting them to confidently state an incorrect answer to a math problem to exposing a (fake) credit card number.
As the name of the event suggests, participants were supposed to be acting as AI red-teamers. The term became popularized in the United States by Cold War-era military simulations, where adversarial “enemy” teams were represented by the color red and “home” teams were represented by the color blue. The same terminology was later adopted by the security and cybersecurity communities, where professional red-teamers try to attack (or gain access to) a physical location or computer network while blue-teamers attempt to defend against the intrusion. In turn, the AI community has adopted “red-teaming” from cybersecurity.
AI Red-Teaming as Prompt Hacking
However, what does AI red-teaming actually mean? In cybersecurity, the practice of red-teaming implies an adversarial relationship with a system or network. A red-teamer’s objective is to break into, hack, or simulate damage to a system in a way that emulates an actual attack.
For an AI system, however, “red-teaming” might not involve actual “hacking” at all. For example, one way to attack an LLM is to prompt it in a way that bypasses any restrictions or guardrails that its developers may have placed on it. Most LLM chatbots are purposefully designed not to output harmful or toxic content such as hate speech. However, many users have discovered various prompt hacks, or “jailbreaks,” that subvert these controls. These prompt hacks take the form of natural-language instructions that are given to the AI model, usually once it has been fully trained and deployed for use in a software application like a chatbot. For users without access to the inner workings of the models themselves, experimenting with prompts is an accessible, attention-grabbing way to explore the systems’ capabilities and limitations.
DEF CON attendees were asked to attack the various LLMs this way, suggesting that AI red-teaming is primarily done via prompting. High-profile announcements about powerful AI systems, such as OpenAI’s GPT-4, being tested by expert “red teams” have also linked AI red-teaming and prompt hacking.
However, in its focus on prompts as the primary method of attacking AI systems, this narrow definition of red-teaming abandons some of its original connotations from cybersecurity. As some experts have pointed out, prompt hacking is more akin to software boundary or stress testing than traditional cyber red-teaming. Furthermore, it implicitly excludes AI systems that are not language models. Many other AI systems do not accept natural language inputs or produce outputs that are as easily interpreted by humans, but it is equally important to ensure that these systems are working as intended even under stressful or unusual conditions. Even some of the Generative AI Red Team competition organizers have acknowledged the ambiguity of their usage of “AI red-teaming.”
This narrow definition tends to focus more on AI safety (reducing harm caused by misuse or malfunction) than AI security (defending systems against malicious actors). Comprehensive AI testing and evaluation processes, however, need to ensure that they address both of these categories of concern. One way to do so is to bring red-teaming’s cybersecurity origins back into the spotlight.
Bringing Security Back Into AI Red-Teaming
A broader and more inclusive definition of “red-teaming” needs to include security concerns, as AI models can be hacked like any other piece of software. For example, hackers can steal the model itself, allowing them to use it for their own purposes and subverting any controls the developers might have placed on it, or they could exfiltrate data that the model has access to via a third-party plugin like those currently available for ChatGPT. AI systems also possess unique vulnerabilities, such as susceptibility to adversarial attacks and data poisoning. These are widely acknowledged vulnerabilities that are not exclusive to language or image generation models.
More transparency may help clarify what “AI red-teaming” means in practice or whether it’s even the right term in an AI context. Several of the leading AI companies—including Google, Microsoft, and NVIDIA—recently published blog posts about their approaches to AI red-teaming, which include adversarial cybersecurity testing of both prompt-based applications and the underlying models themselves. For instance, Google’s report details a range of red-teaming attacks that go beyond just prompt-based attacks, from exfiltration (trying to copy or steal a model) to backdoor attacks (manipulating the model so it behaves a certain way when given a “trigger” word or input). Similarly, NVIDIA’s red-teaming approach takes a more holistic view of attacks across the entire AI development pipeline, and also incorporates practices such as tabletop exercises to simulate risk or failure scenarios.
While this increased transparency about how some major AI companies actually define and implement red-teaming is an excellent start, it is important to remember that not all AI companies may have the same resources to devote to red-teaming, or a similarly deep pool of cybersecurity talent to draw from in order to secure their AI systems. Definitions and implementations of “AI red-teaming” are likely to vary across different organizations, meaning that just because a model has been red-teamed does not necessarily mean it is both safe and secure.
AI Red-Teaming Alone ≠ Robust AI Testing
This is especially important to understand for regulators looking into proposing testing requirements for AI systems. Conflating “red-teaming” with the broader category of “AI testing” and failing to establish a common definition of AI red-teaming is likely to leave such a requirement wide open for interpretation. Borrowed terminology can only go so far, and the way in which “red-teaming” is used to encompass all kinds of AI testing and evaluation is often misleading; red-teaming is a single phase in what should be a broader process of identifying different AI vulnerabilities and failure modes. Discussions around AI testing should center around a shared definition of red-teaming, preferably one that accounts for both security and safety.
However, the fact that red-teaming has become an increasingly popular topic in AI policy and governance circles is a good thing: as the DEF CON competition demonstrated, a prompt hacking exercise can be a great way to educate the public about the need for more robust AI testing. Hopefully the competition returns at future conferences in a format that goes beyond prompting and lets the audience of hackers do what they do best—hack.