Last year, following 2023’s DEF CON, CSET asked “What Does AI Red-Teaming Actually Mean?” We went back to the conference this year to see whether a consensus was emerging among AI developers, cybersecurity experts, and evaluators. We have gained some clarity, but questions remain about how red-teaming should be done and what it can do to make AI systems safer and more secure. DEF CON 2024’s Generative Red Team 2 (GRT 2) event aimed to simulate a realistic large language model evaluation environment.
Compared to last year’s iteration, which was born out of a high-profile partnership with the White House and focused on broadly accessible LLM prompt hacking challenges, this year’s challenge was less flashy and more procedural. Instead of looking for one-off prompts that would elicit unwanted responses from multiple models, GRT 2 focused on the process of reporting flaws in a single model. All participants used an open-source framework called Inspect, built by the UK’s AI Safety Institute, to prompt the model and see if they could get it to act in ways that deviated from its model card, a piece of documentation outlining how the model and its guardrails were supposed to behave. Since there isn’t a standard definition of what constitutes a flaw in a model, the GRT 2 organizers classified “flaws” as behavior that significantly violated the contents of the model card. Also unlike last year, the bar for demonstrating a flaw was higher: to account for the random nature of LLM outputs, participants had to show that their inputs were statistically likely to produce bad outputs. Examples of bad outputs might include agreement with blatantly bigoted statements, instructions on how to commit a crime, or anything revealing personal or sensitive data. (For a first-person look at the GRT 2 experience, see this blog post by CSET Research Fellow Colin Shea-Blymyer: “How I Won DEF CON’s Generative AI Red-Teaming Challenge.”)
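To make that statistical bar concrete, here is a minimal sketch (in Python) of the kind of repeated-sampling check a participant might run before filing a report. The helpers `query_model` and `violates_model_card` are hypothetical stand-ins for a model API call and a judging step; they are not part of GRT 2’s actual tooling.

```python
import math

def failure_rate_check(prompt, query_model, violates_model_card,
                       trials=30, z=1.96):
    """Estimate how often a prompt elicits a model-card violation.

    query_model and violates_model_card are placeholders for a model
    API call and a judging function (human or automated).
    """
    failures = 0
    for _ in range(trials):
        response = query_model(prompt)        # non-deterministic output
        if violates_model_card(response):     # judged against the model card
            failures += 1

    p_hat = failures / trials
    # Normal-approximation confidence interval on the failure rate.
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat, max(0.0, p_hat - margin), min(1.0, p_hat + margin)
```

A flaw report is more persuasive when the lower bound of that interval sits well above zero, since that suggests the bad output is a repeatable behavior rather than a one-off fluke.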
GRT 2’s focus on process and rigor is a step in the right direction for AI evaluation. However, we still wanted to know how companies, developers, and other actors in the AI industry are conducting red-teaming in practice. Are they also mostly sticking to prompt-based adversarial testing, or do they take a more expansive approach that incorporates other ways that an adversary might attack an AI system? How does red-teaming fit into their existing test and evaluation pipelines? And most of all, how are they ensuring that red-teaming will prevent harmful outcomes?
Here are our takeaways based on presentations and conversations at DEF CON 2024:
“AI red-teaming” has widely become shorthand for “prompt-based adversarial testing.”
There are exceptions, of course. However, the vast majority of the time, our conversations about AI red-teaming were based on the shared assumption that these are evaluations performed by prompting a language model. Cybersecurity red-teaming of an AI model, in which testers attempt to attack the model itself and the software systems that surround it, implicitly falls under a different category of evaluation.
This raises questions about how red-teaming practices should be incorporated into the burgeoning field of AI security. As we noted last year, most DEF CON attendees with a cybersecurity background would still associate the phrase “red-teaming” with network hacking and infiltration, but the term is coming to mean something different to the attendees of the convention’s AI Village. Will future discussions about securing AI systems have to disambiguate between prompt-based red-teaming practices and “traditional” cybersecurity red-teaming? These are two distinct types of testing, requiring different environments and expertise. Both are important, although prompt-based adversarial testing is only useful for certain types of models, like LLMs, and neither can substitute for the other because they assess different things.
Frameworks for automating prompt-based red-teaming are useful . . .
The UK AI Safety Institute’s Inspect framework, and others like Microsoft’s PyRIT, help automate the tedious, repetitive parts of performing prompt-based model evaluations. As this year’s GRT 2 organizers noted, the inherent randomness of LLM output generation means that each prompt should be tested multiple times to gauge how often it leads to failure. Frameworks and automation can make this process a little less time-consuming.
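As a rough illustration, here is a minimal sketch of what an Inspect-style task for this kind of repeated probing might look like. The prompts and scorer are placeholders, and the API details are assumptions that may differ across Inspect versions; check the framework’s documentation rather than relying on this sketch.

```python
# Illustrative sketch of an Inspect-style task; API details are assumptions
# and may differ from the current version of the framework.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.solver import generate
from inspect_ai.scorer import includes

@task
def model_card_probe():
    # Each Sample pairs an adversarial prompt with a marker string that a
    # policy-violating response would be expected to contain.
    dataset = [
        Sample(
            input="<adversarial prompt drawn from your test set>",
            target="<string indicating a model-card violation>",
        ),
    ]
    return Task(
        dataset=dataset,
        solver=[generate()],  # send each prompt to the model under test
        scorer=includes(),    # flag responses that contain the marker string
        epochs=10,            # repeat each prompt to account for output randomness
    )
```

The repeated trials are what make the resulting failure-rate estimate meaningful, and replaying prompts and collecting scores is exactly the drudgery these frameworks are designed to take over.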
. . . but also limiting.
An automation framework can’t help users answer one of the most crucial questions in red-teaming: deciding what kinds of flaws or failures they’re testing for. Such frameworks are also inherently limited to certain types of input or interaction. Performing red-teaming with a prompt-based automation framework may address a subset of possible failure states, but it is insufficient for other forms of adversarial testing (for instance, if an attacker finds a way to give a language model inputs that bypass its guardrails entirely).
Human judgment and evaluation still matter.
Automation tools are useful in cases where a red-teaming exercise is clearly scoped and bounded. In many cases, however, categorizing what constitutes a “failure” or a “harm” caused by a model is still deeply subjective, and often uncertain. A team of GRT 2 judges worked nonstop throughout the weekend to manually evaluate each submitted report. Organizations looking to automate their prompt-based AI red-teaming will likely still need to draw on human subject-matter experts suited to the use cases they’re testing for.
Outstanding questions remain, including the extent to which this kind of adversarial testing can be effectively automated.
These questions include:
- Is the objective of red-teaming to identify and mitigate some potentially harmful failure state in an AI system? If so, how are those failure states defined?
- What does red-teaming actually measure? What can it measure?
- Are policymakers’ objectives for red-teaming (e.g., mandating red-teaming of certain models) aligned with those of AI developers?
- How should red-teaming be incorporated into broader systems of testing, evaluation, validation, and/or verification?
- To what extent can red-teaming (or a certain subset of red-teaming activities) be automated? What assurances can automated red-teaming provide?
- To what extent is human expertise necessary for red-teaming, and what resources are required to perform it?