Last year, following 2023’s DEF CON, CSET asked “What Does AI Red-Teaming Actually Mean?” We went back to the conference this year to see whether a consensus was emerging among AI developers, cybersecurity experts, and evaluators. We have gained some clarity, but questions remain about how red-teaming should be done and what it can do to make AI systems safer and more secure. DEF CON 2024’s Generative Red Team 2 (GRT 2) event aimed to simulate a realistic large language model evaluation environment.
Compared to last year’s iteration, which was born out of a high-profile partnership with the White House and focused on broadly accessible LLM prompt hacking challenges, this year’s challenge was less flashy and more procedural. Instead of looking for one-off prompts that would elicit unwanted responses from multiple models, GRT 2 focused on the process of reporting flaws in a single model. All participants used an open-source framework called Inspect, built by the UK’s AI Safety Institute, to prompt the model and see if they could get it to act in ways that deviated from its model card, a piece of documentation outlining how the model and its guardrails were supposed to behave. Since there isn’t a standard definition of what constitutes a flaw in a model, the GRT 2 organizers classified “flaws” as behavior that significantly violated the contents of the model card. Also unlike last year, the bar for demonstrating a flaw was higher: to account for the random nature of LLM outputs, participants had to show that their inputs were statistically likely to produce bad outputs. Examples of bad outputs might include agreement with blatantly bigoted statements, instructions on how to commit a crime, or anything revealing personal or sensitive data. (For a first-person look at the GRT 2 experience, see this blog post by CSET Research Fellow Colin Shea-Blymyer: “How I Won DEF CON’s Generative AI Red-Teaming Challenge.”)
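To make that statistical bar concrete, here is a minimal sketch (in Python) of the kind of repeated-sampling check a participant might run before filing a report. The helpers `query_model` and `violates_model_card` are hypothetical stand-ins for a model API call and a judging step; they are not part of GRT 2’s actual tooling.

```python
import math

def failure_rate_check(prompt, query_model, violates_model_card,
                       trials=30, z=1.96):
    """Estimate how often a prompt elicits a model-card violation.

    query_model and violates_model_card are placeholders for a model
    API call and a judging function (human or automated).
    """
    failures = 0
    for _ in range(trials):
        response = query_model(prompt)        # non-deterministic output
        if violates_model_card(response):     # judged against the model card
            failures += 1

    p_hat = failures / trials
    # Normal-approximation confidence interval on the failure rate.
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat, max(0.0, p_hat - margin), min(1.0, p_hat + margin)
```

A flaw report is more persuasive when the lower bound of that interval sits well above zero, since that suggests the bad output is a repeatable behavior rather than a one-off fluke.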
GRT 2’s focus on process and rigor is a step in the right direction for AI evaluation. However, we still wanted to know how companies, developers, and other actors in the AI industry are conducting red-teaming in practice. Are they also mostly sticking to prompt-based adversarial testing, or do they take a more expansive approach that incorporates other ways that an adversary might attack an AI system? How does red-teaming fit into their existing test and evaluation pipelines? And most of all, how are they ensuring that red-teaming will prevent harmful outcomes?
Here are our takeaways based on presentations and conversations at DEF CON 2024:
“AI red-teaming” has widely become shorthand for “prompt-based adversarial testing.”
There are exceptions, of course. However, the vast majority of the time, our conversations about AI red-teaming were based on the shared assumption that these are evaluations performed by prompting a language model. Cybersecurity red-teaming of an AI model, in which testers attempt to attack the model itself and the software systems that surround it, implicitly falls under a different category of evaluation.
This raises questions about how red-teaming practices should be incorporated into the burgeoning field of AI security. As we noted last year, most DEF CON attendees with a cybersecurity background would still associate the phrase “red-teaming” with network hacking and infiltration, but the term is coming to mean something different to the attendees of the convention’s AI Village. Will future discussions about securing AI systems have to disambiguate between prompt-based red-teaming practices and “traditional” cybersecurity red-teaming? These are two distinct types of testing, requiring different environments and expertise. Both are important, although prompt-based adversarial testing is only useful for certain types of models, like LLMs, and neither can substitute for the other because they assess different things.
Frameworks for automating prompt-based red-teaming are useful . . .
The UK AI Safety Institute’s Inspect framework, and others like Microsoft’s PyRIT, help automate the tedious, repetitive parts of performing prompt-based model evaluations. As this year’s GRT 2 organizers noted, the inherent randomness of LLM output generation means that each prompt should be tested multiple times to gauge how often it leads to failure. Frameworks and automation can make this process a little less time-consuming.
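As a rough illustration, here is a minimal sketch of what an Inspect-style task for this kind of repeated probing might look like. The prompts and scorer are placeholders, and the API details are assumptions that may differ across Inspect versions; check the framework’s documentation rather than relying on this sketch.

```python
# Illustrative sketch of an Inspect-style task; API details are assumptions
# and may differ from the current version of the framework.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.solver import generate
from inspect_ai.scorer import includes

@task
def model_card_probe():
    # Each Sample pairs an adversarial prompt with a marker string that a
    # policy-violating response would be expected to contain.
    dataset = [
        Sample(
            input="<adversarial prompt drawn from your test set>",
            target="<string indicating a model-card violation>",
        ),
    ]
    return Task(
        dataset=dataset,
        solver=[generate()],  # send each prompt to the model under test
        scorer=includes(),    # flag responses that contain the marker string
        epochs=10,            # repeat each prompt to account for output randomness
    )
```

The repeated trials are what make the resulting failure-rate estimate meaningful, and replaying prompts and collecting scores is exactly the drudgery these frameworks are designed to take over.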
. . . but also limiting.
An automation framework can’t help users answer one of the most crucial questions in red-teaming: deciding what kinds of flaws or failures they’re testing for. Such frameworks are also inherently limited to certain types of input or interaction. Performing red-teaming with a prompt-based automation framework may address a subset of possible failure states, but it is insufficient for other forms of adversarial testing (for instance, if an attacker finds a way to give a language model inputs that bypass its guardrails entirely).
Human judgment and evaluation still matter.
Automation tools are useful in cases where a red-teaming exercise is clearly scoped and bounded. In many cases, however, categorizing what constitutes a “failure” or a “harm” caused by a model is still deeply subjective, and often uncertain. A team of GRT 2 judges worked nonstop throughout the weekend to manually evaluate each submitted report. Organizations looking to automate their prompt-based AI red-teaming will likely still need to draw on human subject-matter experts suited to the use cases they’re testing for.
Outstanding questions remain, including the extent to which this kind of adversarial testing can be effectively automated.
These questions include:
- Is the objective of red-teaming to identify and mitigate some potentially harmful failure state in an AI system? If so, how are those failure states defined?
- What does red-teaming actually measure? What can it measure?
- Are policymakers’ objectives for red-teaming (e.g., mandating red-teaming of certain models) aligned with those of AI developers?
- How should red-teaming be incorporated into broader systems of testing, evaluation, validation, and/or verification?
- To what extent can red-teaming (or a certain subset of red-teaming activities) be automated? What assurances can automated red-teaming provide?
- To what extent is human expertise necessary for red-teaming, and what resources are required to perform it?