Introduction
U.S. President Donald Trump’s AI Action Plan endorses the idea of “AI control,” recommending that the federal government “launch a technology development program . . . to advance AI interpretability, AI control systems, and adversarial robustness.”
“AI control” is a relatively new term for an approach to mitigating risks from advanced AI. The goal of AI control is to get useful work out of “AI agents”—AI models designed to take autonomous action rather than simply respond to queries—even when the agents intentionally attempt to misbehave. AI control is a complementary concept to the better-known “AI alignment”: AI alignment seeks to ensure that AI agents do not attempt undesirable behavior in the first place, whereas AI control seeks to ensure that if AI agents do pursue unwanted goals, they do not succeed.
AI control is most critical for organizations where AI agents pursuing unwanted goals could cause severe harm. It is immediately relevant to AI companies internally deploying AI agents for AI research and development (R&D), where AI misbehavior could threaten the security of the R&D process. AI control could also unlock value in other sectors: for example, national security agencies and critical infrastructure providers require layers of defense around their technologies, so AI control techniques could enable these organizations to safely adopt frontier AI technology.
This explainer lays out what AI control is, why it is necessary, how it relates to cybersecurity, and where there are opportunities and challenges around implementing AI control in the real world.
Why We Need AI Control
Frontier AI companies such as OpenAI, Google, Anthropic, and xAI are actively building AI agents that are increasingly capable and autonomous. These companies aim to develop agents that can independently achieve complex goals, such as running businesses or conducting AI research, with decreasing amounts of human oversight.
AI companies already use AI agents internally for AI R&D, which opens up many avenues for harm if these agents lack safeguards and have privileged access to sensitive company infrastructure. Even today’s rudimentary AI agents can wreak havoc when given free rein within tech companies: an AI agent recently deleted a software company’s live database, violating explicit instructions not to proceed without human approval. Looking ahead to more complex and serious risks, some involve human actors: an AI agent compromised by an adversary could help that adversary steal the company’s sensitive intellectual property or sabotage the company’s research. Other risks come from AI agents pursuing unintended goals: if a sufficiently capable agent is given sufficient freedom in the R&D process, it could use the company’s computational resources to run unauthorized experiments (“internal rogue deployment”) or covertly manipulate the training process for future AI agents such that they share its goals (“successor sabotage”) in ways that may be hard to detect. As other types of organizations adopt AI agents, more risks will emerge.
The first line of defense against risks such as these is typically called AI alignment. During the training process for an AI model, alignment techniques aim to prevent the model from exhibiting dangerous behaviors in the first place. Frontier AI models learn to behave in a way that cannot be mapped to pieces of code, but rather is a product of learned patterns that human experts cannot fully interpret or predict. Common alignment techniques take place once a model has already been pretrained on a large corpus of data: developers make minor adjustments to the model so that it is encouraged to generate “good” outputs and discouraged from generating “bad” outputs. The fundamental inscrutability of the model, however, makes it unclear in any given case whether alignment techniques successfully nudge the model away from dangerous behaviors or just suppress dangerous tendencies such that they appear in more sporadic ways. To assess how well their alignment techniques are working, AI companies test their models on a wide range of inputs.
Empirical evidence has shown that alignment and testing are not sufficient to manage AI misbehavior.1 Pre-deployment testing can fail to catch harmful behavioral patterns that emerge when the agent operates in a different environment. Testing an agent can be compared to conducting a background check on an employee—it is a valuable screening step that should complement other risk mitigation measures.
AI agents can behave more dangerously when deployed in the real world than in testing. Although AI control could reduce risks from a broad set of threat models, AI control researchers tend to be primarily concerned with threat models in which AI agents covertly pursue unwanted goals, which can happen for two main reasons.2 First, these goals can emerge in the training process. So-called “alignment faking” can occur because training intended to nudge a model away from dangerous behaviors can instead inadvertently teach a model to hide its dangerous tendencies. Second, human threat actors might purposefully sabotage AI agents, e.g., via “data poisoning.” It is possible to craft training data to insert backdoors into AI models such that they behave dangerously under certain conditions. In the real world, the risk of data poisoning depends on the capabilities of threat actors to insert adversarial training data into the training pipeline. Both alignment faking and data poisoning have been demonstrated in research settings with current AI models, and both would pose greater risks if exhibited by more advanced AI agents.
Methods for AI Control
Beyond the first line of defense of AI alignment, AI control provides a necessary “second line of defense,” seeking to use agents even when they may be acting adversarially or otherwise pursuing unwanted goals. Anticipating a world in which AI agents may function more like autonomous employees, AI control takes inspiration from how organizations mitigate risks from untrustworthy employees. AI control encompasses both established security techniques applied to settings with AI agents and cutting-edge methods designed specifically for AI agents.
How Traditional Cybersecurity Relates to AI Control
In both cybersecurity and AI control, the goal is to use computer systems while preventing threat actors from exploiting flaws in those systems; in the case of AI control, the immediate source of threats is the AI agent itself. Because of this broad overlap, the field of cybersecurity provides a wealth of best practices and useful concepts that can reduce the chance of AI misbehavior leading to harm. The following are examples of traditional cybersecurity concepts applied to AI control.
- Logging and limiting communications outside the data center: Restrictions can be placed on communications interfaces that are part of threat scenarios. For example, because frontier AI models are large, restricting the total amount of data that can be moved between data centers would make it more challenging for an AI agent to exfiltrate itself. Recording all of an agent’s external communications facilitates understanding and addressing past failures, just like in traditional incident response and forensics.
- Logging and limiting processing within the data center: Oversight mechanisms can track when computer systems deviate from expected internal behavior and apply hard limits to keep them in check. For example, the AI agent’s code could have a limited time to run as enforced by an automated hardware shutdown system, or a limited compute budget tracked with hardware performance counters.
- Making infrastructure more robust to mistakes and attacks: This raises the bar so that misbehavior would need to be especially sophisticated to cause harm, for example, an attacker (whether the AI agent itself or an outside threat actor working with the AI agent) discovering a novel bug or vulnerability.
- Software: Keeping software simple can reduce risks from AI agents deliberately exploiting the software (or accidentally causing harm) because simpler software has a smaller attack surface and is easier for humans to understand. Some programming languages can ensure that classes of potential vulnerabilities are not possible.
- Hardware: Organizations can rigorously test their infrastructure and use hardware specifically designed for greater security.
- Policies and operation: One basic practice from cybersecurity is the principle of least privilege access, defined by giving the minimum permissions to a given actor. In the AI control context, this could mean allowing an AI agent to propose changes to a company codebase but not allowing it to apply those changes permanently without human review.
Novel AI Control Protocols
Traditional cybersecurity measures alone may not be enough for companies to safely use misbehaving AI agents. Tried-and-true cybersecurity practices for human-centered workflows could weigh down fast-paced AI-centered workflows. Consider an AI company running thousands of AI agents nonstop to write code, run experiments, and more. It quickly becomes infeasible for human employees to analyze proposed code changes, sign off on experiment ideas, and scrutinize logs of agents’ internet queries. Yet completely handing over sensitive and important tasks to AI agents poses unacceptable security risks. Furthermore, AI agents with advanced capabilities in hacking and strategizing might be able to evade human oversight in traditional workflows.
Enter novel AI control protocols specifically designed to deal with threats from advanced AI agents. In practice, AI control often refers to research on techniques such as these. Control protocols that are sophisticated, fine-grained, and cheap could enable companies to safely allow large numbers of AI agents to perform sensitive tasks.
A representative recent paper called “Ctrl-Z: Controlling AI Agents Through Resampling,” by a team at Redwood Research (including one of the authors of this piece), describes AI control protocols that consist of two steps: (1) monitoring what the AI agent is doing and (2) responding appropriately when something looks suspicious. In AI control, AI models are often leveraged to monitor each other, taking inspiration from how companies mitigate risks from insider threats by leveraging other employees (e.g., by requiring sign-offs on sensitive actions). A foundational assumption behind Redwood’s control protocols is that more advanced AI agents might be more likely to form unwanted goals and less likely to show signs of this before deployment. In AI control, “untrusted models” are defined as newer, more-capable agents that might be deliberately subversive. “Trusted models” are older, less-capable AI models that are better understood by the research community. A trusted model might make mistakes because of its lower capabilities, but control protocols can account for this; it is most important for researchers to be confident that the trusted model is not working toward unintended goals.
For step one of this two-step AI control protocol, the “Ctrl-Z” paper demonstrates multiple ways to monitor AI agents:
- Trusted monitoring: A trusted AI model, such as a language model trained for specific monitoring tasks or a specially designed narrow machine learning model, can review the outputs of the more advanced untrusted agent.
- Untrusted monitoring: An untrusted advanced AI model can serve as a monitor as long as safeguards are in place to prevent the untrusted monitor from colluding with the monitored agent. One way to test for collusion uses “honeypots”—a concept from traditional cybersecurity referring to fake opportunities for malicious behavior—to see whether the untrusted monitor rates the monitored agent’s known dangerous code as safe. Additionally, untrusted monitoring protocols can use a trusted model to rewrite the monitored agent’s outputs in natural language to remove any subtle ways that it might communicate with the untrusted monitor.
For step two, when a monitoring system detects suspicious behavior, several control protocols from the “Ctrl-Z” paper are available to respond:
- Trusted implementation: A trusted model can suggest alternative ways to complete the same task.
- Resample protocols: A different output can easily be elicited from the same AI agent. When possible, this protocol automatically finds and executes a benign alternative output. Otherwise, this protocol automatically gathers incriminating evidence by sampling multiple outputs to determine whether the model consistently engages in the same suspicious behavior.
- AI-assisted auditing: A trusted model can investigate the context surrounding the suspicious behavior to comprehensively seek incriminating evidence.3 For example, this could involve analyzing all of the code files that a suspicious action would interact with.
- Human escalation: The issue can be flagged for human review.
Other works describe chain of thought monitoring and interpretability, which could fit within step one as alternative approaches for flagging potential misbehavior. “Chain of thought monitoring” involves examining the detailed “chain of thought” that the agent writes before providing its final answers, analogous to checking someone’s work on a math problem. Monitoring this text presents a “new and fragile opportunity” to catch agents in the act of planning dangerous actions. “Interpretability techniques,” using partial or complete access to the internal state of the agent, analyze it to detect any harmful intentions.
Besides the above control protocols based on monitoring, there are control protocols that take advantage of human control over outputs and inputs of AI agents. The agent’s output can be constrained to a specific style or format that encourages safety. For example, the agent can be required to output code that is especially legible to humans. Limiting the output space simplifies the task of automated monitoring and makes it more difficult for the agent to successfully misbehave. On the input side, human operators can selectively limit what information and context they provide to an AI agent, disrupting the agent’s ability to formulate sophisticated attack strategies. For example, in a control protocol called “factored cognition,” a trusted model handles as much of the task as possible, while the untrusted agent only works on subtasks that require its greater capabilities without seeing the full context. This approach has the added benefit of creating uncertainty for the agent about whether it is operating in a monitored test environment, which can deter attempts at deceptive or harmful behavior.
Building Blocks for Risk Management
AI control can be part of risk management for advanced AI agents. Within well-established safety science principles, alignment techniques and control protocols are both examples of risk mitigations. Additional components of risk management include risk assessment and risk governance. “Control evaluations” contribute to risk assessment: they measure how effectively a given control protocol prevents a misbehaving AI agent from causing harm. “Control safety cases” support risk governance: AI companies can argue for the safety of their AI agents by demonstrating that their control protocols sufficiently reduce the likelihood of an unacceptable outcome. Future work can leverage the toolkit of control evaluations, control safety cases, and control protocols themselves to harness the capabilities of frontier models while mitigating risks.
Putting AI Control into Practice
In 2024 and 2025, AI control has received rapidly increasing interest from researchers, AI companies, and governments. For AI control to reach its full potential as a tool kit for mitigating risks from AI agents, gaps will need to be filled across the entire development pipeline: researching new and improved control protocols with a growing research community; operationalizing research ideas into practical workflows, both at AI companies and at other types of organizations; and scaling control protocols for future AI technologies.
Growing the AI Control Community
AI control is a fast-growing research area. Researchers at the nonprofit Redwood Research coined the term in December 2023. The first conference on AI control, ControlConf, took place in March 2025. The Singapore Consensus on Global AI Safety Research Priorities includes a section on AI control.
Ideas from AI control are beginning to catch on at AI companies. Between January 2022 and July 2024, none of the research papers published by Anthropic, Google DeepMind, and OpenAI were about control. But now, companies show signs of taking control seriously: Google DeepMind’s paper “An Approach to Technical AGI Safety and Security” emphasizes control as a second “line of defense” to “mitigate harm even if the model is misaligned,” and Anthropic’s newly formed Safeguards Research Team has control within its portfolio. OpenAI has not signaled an interest in control as strongly as Google DeepMind or Anthropic, but researchers at all three companies have recently published on the closely related topic of monitoring models’ chains of thought. AI control has not yet made a significant appearance in leading AI companies’ frontier AI safety frameworks, which are voluntary commitments to implement certain safety and security measures upon reaching certain levels of frontier AI capabilities.
Beyond the research community, governments have begun to show interest in AI control. The UK AI Security Institute has released a request for proposals via the AISI Challenge Fund that encompasses AI control. Most recently, the US AI Action Plan called for a “technology development program” including “AI control systems.”
Implementing AI Control in the Real World
Even though AI companies have the resources and expertise to implement advanced control protocols for internal R&D, it is a technical and governance challenge to incentivize implementation. Naturally, AI companies would wish to avoid rogue AI agents wreaking havoc on their R&D workflows. However, like any security measures, control protocols add varying levels of friction to day-to-day R&D work. There is often a clear trade-off between how much a control protocol improves security, how little it impedes the performance of the AI agent, and how efficiently it uses the limited resource of skilled human oversight. Furthermore, novel control protocols that leverage large AI models require compute resources to run. In a highly competitive environment in which compute resources are precious and companies are racing toward the next breakthrough, adding any operational and financial cost to the R&D process can be difficult for AI companies to justify. Technical research could refine control protocols to offer greater security benefits for lower costs, and governments could incentivize companies to prioritize security in the national interest.
More work is needed to make advanced levels of AI control feasible for deployers other than AI companies, such as organizations deploying open-weight AI models. The cost-benefit analysis may turn out especially favorably for AI control in use cases for which security is crucial, such as in high reliability industries and government, because these organizations may not otherwise be able to safely adopt frontier AI technologies. As the AI community gains more information about the cost-benefit trade-offs of particular control protocols and iterates upon best practices, sophisticated control protocols could come within reach of more organizations.
Keeping Up with Future AI Agents
As AI capabilities scale and AI control research matures, it is unclear whether advancements will on balance favor the control protocols or the AI agents attempting to evade control. AI control, as it is currently conceptualized, might falter against AI agents with highly advanced capabilities in areas such as cybersecurity and strategizing. Many control protocols rely on less-advanced trusted models overseeing more-advanced untrusted AI agents; because verifying trustworthiness of more-advanced agents is generally more difficult, this paradigm could be challenging to scale. It may be difficult to evaluate when control protocols can no longer withstand threats from the most advanced agents, in part because current control evaluation workflows rely on human-level red teaming. Either way, implementing control protocols with current AI agents will yield lessons that could help to keep AI control effective over the long term.
In sum, the promise of AI control is to constrain misbehaving AI agents so that organizations can deploy cutting-edge AI for R&D and beyond. AI control already integrates smoothly into existing best practices for AI risk management. A mature AI control R&D ecosystem would produce and refine ideas for control protocols to maximize security benefits while minimizing costs, while businesses would make these benefits broadly accessible to organizations that want to deploy AI agents. Even if control faces challenges as capabilities advance, this would be highly informative and demonstrate that—once again—we need to rethink AI security best practices as the field evolves.
Acknowledgments
We thank Addie Foote, Alexa Pan, Aryan Bhatt, Ashwin Acharya, Evelyn Yee, Gabriel Kulp, Helen Toner, Jessica Ji, Julian Stastny, Kyle Miller, and Tyler Tracy for helpful comments.
- Anthropic’s Claude Opus 4 System Card explains that when Anthropic’s most advanced agent was placed in fictional scenarios in which it was told it would be shut down or replaced, it sometimes engaged in self-preservation behaviors such as attempting to blackmail engineers (threatening to expose personal information to prevent its replacement) or trying to copy itself to external servers. OpenAI’s o3 model, even when explicitly instructed to allow itself to be shut down, sometimes actively resisted shutdown mechanisms. Several models have been observed to fake alignment to prevent modification of their behavior.
- Besides alignment faking and data poisoning, reasons for AI agents to behave more dangerously in deployment than in testing include (1) jailbreaking and (2) rare misbehavior. First, to jailbreak a model, users deliberately put the model in a situation in which it is more likely to misbehave. Pre-deployment measures such as circuit breakers and red teaming can help to make jailbreaking more difficult, but it is still very common for users to find jailbreaks in deployed AI models. Second, developers may neglect to test for some unsafe patterns of behavior. A model may fail at a rare enough rate that the behavior is not observed during alignment training or caught in testing, but does occur in deployment. AI control techniques can help to mitigate risks from jailbreaking and rare misbehavior, but they are not the primary threat models that AI control researchers usually aim to address.
- The “Ctrl-Z” paper does not include implementation of AI-assisted auditing, but calls for future work in this area.