Introduction
The early months of 2025 have ushered in a changed AI landscape. On the policy and regulatory front, a new U.S. presidential administration signaled a clear willingness to prioritize AI innovation and infrastructure investment over a safety- and risk-focused approach; on the model development front, a new class of reasoning models have emerged, spurring anxieties over capabilities, risk, and geopolitical competitiveness.
Regardless of new policy or technological developments, AI evaluations, especially AI red-teaming, will remain fundamental to understanding the capabilities and impacts of these systems. Red-teaming emerged early on as a key way of evaluating generative AI systems and remains an important tool for assessing whether they present safety or security risks. However, advancements in evaluation methodologies have yet to match the pace of AI change and improvement. Given this, policymakers and researchers should explore how to promote better evaluations and improve the rigor and utility of testing processes like red-teaming.
This blog post incorporates takeaways and accompanying recommendations from a CSET workshop on AI testing held in December 2024 that brought together members of academia, government, and the private sector. It summarizes some of the challenges and limitations associated with red-teaming as a testing methodology and suggests ways various stakeholders might address those challenges.
Conceptualizing AI Red-Teaming
“Red-teaming” is a term the AI testing community borrowed from cybersecurity, where it refers to the practice of emulating the tactics and techniques of an attacker (performed by a hypothetical adversarial “red team”) on a specific computer system or network of systems. Historically, the terminology originates from military war games or strategic simulations, which in the United States pitted “blue teams” against enemy “red teams.” In cybersecurity, the practice of red-teaming is relatively mature and well-understood. In a cyber red-teaming exercise, red-teamers—individuals with offensive hacking skills—usually attempt to compromise the defenses of and gain unauthorized access to a specific physical or digital target.
In the AI world, however, the definition of red-teaming is fuzzier. The concept of an “AI red team” predates our current generative AI moment by several years, but, to many, an AI red team is a group of testers who try to “break” a generative AI model or large language model (LLM) by getting the model to produce unwanted outputs. An AI red-teaming exercise might involve testers being given instructions on what the model is not supposed to do and iteratively prompting it, trying different strategies to elicit the unwanted behaviors. The practice of deliberately trying to subvert the model’s built-in defenses or guardrails is a form of adversarial testing, where every tester takes on the role of adversary.
Participants at CSET’s AI testing workshop generally agreed that AI red-teaming involves adversarial testing methods.1 Participants also acknowledged that there are currently no best practices for performing red-teaming or adversarial testing, which creates confusion and ambiguity.
To some extent, ambiguity should be expected because AI red-teaming as a practice is still a work in progress, and language borrowed from the cybersecurity and software testing domains fails to map perfectly onto AI systems. There are also plenty of context-dependent definitional questions, such as: what does it mean to “break” a model, and what constitutes a model failure or vulnerability? The distinction between red-teaming and capability elicitation—the practice of testing an AI system in order to understand its capabilities—is particularly unclear. The terms are often used interchangeably, or the term “red-teaming” is used to imply that the practice is being used to measure some specific aspect of the AI model’s performance. (See the following section, “The Measurement Problem,” for why this is confusing.)
Furthermore, conceptualizations of what it means to red-team a model may vary among AI researchers and policymakers depending on what kind of AI system is being discussed. While it is most often referred to as a technique for testing generative AI models, certain forms of red-teaming (or adversarial testing more broadly) don’t always apply to other types of AI models, or have not yet been tried on those systems. For example, specialized AI models are widely used in scientific research domains, but are often excluded from conversations about red-teaming. This exclusion may be a byproduct of the way “AI safety” discussions commonly center around the potential risks of highly-capable general-purpose models, rather than more specialized models, which have much narrower use cases and are only used by relatively small groups of users compared to general-purpose LLMs.
The purpose of red-teaming can also vary depending on who is performing the testing. AI developers building commercial products may be primarily concerned about product safety, which for generative AI products may range from preventing false outputs to complying with regulations about certain types of banned content. Organizations prioritizing national security, meanwhile, may be interested mostly in outcomes that may lead to societal-scale harm. For instance, previous efforts to red-team for AI-related chemical, biological, radiological, and nuclear (CBRN) risk have been driven by national security concerns.
Despite these ambiguities, there remains a great deal of interest in red-teaming as a testing method that can help improve model safety and security. It is worth exploring challenges associated with improving the practice of red-teaming, and how we might address these challenges.
Challenges to Improving AI Red-Teaming
The Measurement Challenge
Broadly speaking, it is challenging to define and measure the variables we care about with respect to AI. What does it mean for an AI system to be “good,” “capable,” “general-purpose,” or “safe”? This challenge is especially salient when trying to red-team a model. Red-teaming provides an indicator of a model’s potential to fail by behaving in undesirable ways, but that indicator is often treated as a guarantee. For instance, a statement such as “we will red-team this AI model for safety” means that the process of red-teaming will be used to try and measure how safe the model is, but is often interpreted as an assurance that the model will be safe once it has been red-teamed. This is a feature, not a bug, of red-teaming as a testing methodology, similar to how it works in cybersecurity: red-teaming does not improve security, it only attempts to measure it.
LLM red-teaming as it is currently performed can’t provide concrete assurances about how safe or reliable a model is or any quantifiable results about risks or dangers associated with that model. Red-teaming can prove that a weakness or vulnerability exists, but does not guarantee that one does not exist. Given these characteristics, red-teaming results should not be treated as an assurance of safety or security, but instead as a snapshot of possible outcomes under specific conditions.
Red-teaming is also highly subjective. When AI companies organize a red-teaming exercise, they are making predictions about the impacts that their models might have on the world. These predictions inform the composition of the red team, the scenarios they are asked to test for, and the decisions that the developers will make after the results of the red-teaming exercise. Subjectivity is not necessarily a bad thing, but it can have significant implications. These implications include: unanticipated harms due to blindspots during testing; prioritization of certain types of harm over others; the goals of the exercise failing to align with testers’ subject matter expertise; and testers failing to accurately represent the demographics of a target group, whether that be ordinary users or highly skilled malicious actors.
How might we improve the scientific validity of AI red-teaming as an evaluation process? One method would be to improve standardization, or at least clarify what the red-teaming process is actually intended to evaluate or measure. For instance, if the objective of a red-teaming exercise is to test the efficacy of a model’s guardrails or defenses, perhaps the exercise should always include an unguarded version of the model so that the guarded and unguarded versions can be compared to each other. Another improvement might be adopting better metrics, such as using edit distance (a metric for measuring the difference between two sequences of characters or words) to quantitatively compare the prompts needed to subvert guardrails on different versions of the same LLM. Such a quantitative comparison would be useful because it might help researchers categorize guardrail-subverting prompts, identify features that help them improve model defenses, and assess whether or not the model guardrails are performing as expected. Overall, it would be extremely valuable to compare the results of red-teaming exercises and draw conclusions about how well red-teaming was performed: e.g., to be able to make claims such as “system A was red-teamed well in comparison to system B.” While this does not necessarily address the fact that red-teaming does not provide safety assurances, it would still be beneficial from a product safety standpoint.
The Scoping Challenge
One issue is that most red-teaming exercises are scoped to focus on individual models in isolation, separate from broader systems they might be embedded in or the downstream consequences of their outputs. This makes sense since testers’ time and capacity are inherently constrained, but many of the harms and vulnerabilities that researchers and policymakers are actually concerned about from security, safety, and sociotechnical standpoints are not limited to the scope of a single model. For example, a commonly discussed cybersecurity AI risk scenario involves malicious cyberattackers experiencing a capability “uplift” by getting access to a powerful LLM, but the degree to which they experience this uplift depends on (among other factors): their existing level of expertise, how they might choose to use the LLM, and the characteristics of the network they are targeting. Red-teaming just the LLM might help emulate the first two factors, but will not address the third, which concerns how the LLM outputs will be deployed in a different context. Similar concerns exist for other “high-risk” domains such as military applications or scientific research, where it is crucial to think about how these models are embedded in larger systems or processes. Red-teaming models in isolation will not improve our understanding of how these complex systems might interact with each other in unexpected ways.
Appropriately scoped red-teaming exercises are not the equivalent of continuous risk monitoring processes. A red-teaming exercise captures the state of the model at a particular moment in time. It is natural to expect that red-teaming outcomes might change over time depending on a number of different factors, such as: changes to the underlying AI model, improvements to the model’s guardrails, or changes to the composition of the red team. However, tracking changes over time will require red-teaming exercises to be performed regularly, or integrated into organizational test and evaluation procedures.
How might red-teaming processes be improved to address these scoping challenges? Depending on the goals of the exercise, the scope could be expanded to include more systems than just a single AI model. Red-teaming entire systems, while difficult, can better model real-world risks. For example, an LLM-based app might involve data flowing to and from other online services, like a weather app or a search engine. Red-teaming the LLM alone will not reveal potential failures or weaknesses that emerge from the model’s interactions with these other services. As complexity increases, so too does the need to test systems in concert rather than in isolation.
Another way to better model real-world conditions might be to supplement red-teaming with controlled model releases, similar to how software developers will allow small groups of users to beta test a new feature before they release it to the public. Beta testing, which is more open-ended than red-teaming, could provide feedback to red teamers about patterns of actual user behavior. A similar justification exists for public or participatory red-teaming exercises, which expand access to a broader range of testers and serve as a way to build expertise and awareness, especially among users who otherwise have limited ways to provide feedback about the way an AI model behaves.
Finally, there may also be ways to experiment with the way we design red-teaming exercises. For example, instead of defining a list of things a model can’t do, perhaps one way to reduce the scope of testing is to closely restrict what the model can do. Given these parameters, red-teaming could be used to help define the boundaries of what constitutes appropriate actions for the model to take.
The Transparency Challenge
In conjunction with the measurement problem, the lack of insight into private-sector red-teaming hinders the development of widely shared best practices for AI testing.
There are understandable reasons why companies do not publish details about their testing processes. In addition to concerns around proprietary data, there are security considerations that come with revealing how any software is evaluated for vulnerabilities. It is also reasonable to assume that it will take more time to create industry norms and standards, as the practice of red-teaming general-purpose AI systems is fairly new and many organizations are actively working on improving it.
However, CSET workshop participants from all backgrounds expressed a desire for greater transparency in general. Transparency can take different forms, but two of the most useful in this context are: a) transparency that allows external actors to assess various factors associated with a red-teaming exercise and b) transparency that allows for consensus-building around best practices for common test cases. On the first point, there may be more lessons to learn from cybersecurity, where public disclosure practices are essential for building a secure software ecosystem. On the second point, it would be broadly beneficial for a reputable standards body such as NIST (the National Institute of Standards and Technology) to collect and publish a list of AI testing best practices, or at the very least continue to serve as a hub for consensus-building activities.
The Incentives Challenge
In addition to challenges around transparency, AI red-teaming and AI testing more generally face problems related to incentives. How can relevant stakeholders, especially private-sector companies, be incentivized to improve their red-teaming processes?
One path might be to create more liability mechanisms for AI companies. This proposal hinges on a series of assumptions: if AI developers are liable for harm caused by their models, they will be more incentivized to test their models thoroughly and subsequently will invest more in improving red-teaming and other AI testing processes. However, one significant challenge on this front is that it’s difficult to quantify the harm associated with AI outputs. As mentioned above, companies are also not required to be transparent about their testing practices, so liability may not necessarily lead to improved transparency.
Regulatory action is another potential avenue for creating incentives, although attempts at regulating AI faced significant headwinds in 2024. A narrower approach that focuses on harm to specific audiences (such as children) or types of undesirable content (such as non-consensual intimate imagery or child sexual abuse material) may be a more politically feasible way to hold developers accountable for the outputs of their generative AI systems. Current regulatory proposals at the U.S. state and local level incorporate some of these concerns and are also attempting to address some of the outstanding questions around transparency, assurance, and liability.
Conclusion
Summarizing AI Red-Teaming Challenges and Recommendations | ||
---|---|---|
Theme | Challenges | Recommendations |
Measurement |
|
|
Scoping |
|
|
Transparency |
|
|
Incentives |
|
|
Despite recent turmoil in the AI world, all signs point to AI red-teaming remaining a prominent evaluation methodology, even if policy actors decide to pivot away from focusing on AI safety. Some of the leading AI companies are actively trying to formalize and streamline internal testing processes, while downstream stakeholders are trying to emulate their practices to the best of their abilities. AI developers may still be internally incentivized—even if external incentives are lacking—to continue to invest in improving AI testing practices for product safety or product reliability purposes. The incentives landscape may also change in the future—AI capabilities may improve, new risks may emerge, or new regulatory regimes may take shape across various jurisdictions.
More is needed to improve the practice of red-teaming, as CSET workshop participants and other researchers have highlighted. Firstly, improving the measurement practices underpinning current red-teaming practices—and being open about those improvements—can benefit everyone who develops and uses AI systems. Secondly, scoping red-teaming exercises to better model real-world conditions can improve our understanding of AI systems’ potential impacts, and other forms of testing can help improve red-teaming in turn. Thirdly, improved transparency can help strengthen the broader AI testing ecosystem and create clearer standards. Finally, policymakers can play a role in creating mechanisms to incentivize greater investment in red-teaming and other evaluation processes.
- There is disagreement on whether or not this is the case across the board: some red-teamers and researchers argue that red-teaming doesn’t always require adversarial methods because it has come to encompass both security- and safety-focused testing practices which don’t necessarily involve emulating an adversary.