Introduction
Within the field of artificial intelligence, a growing area of focus is AI safety: research to identify, prevent, and mitigate unintended behavior in AI systems. If AI is ever to grow into its potential as a general-purpose technology and be deployed in high-stakes settings, substantial progress in tackling AI safety challenges will be required.
This brief investigates the development of AI safety as a research field (or set of research fields) using the CSET Map of Science, which is derived from CSET’s research clusters and merged corpus of scholarly literature.1 The map was constructed by grouping research publications into citation-based “research clusters,” then visualizing those clusters according to the strength of the citation connections between them. An interactive version of the Map of Science is available online at sciencemap.cset.tech. We used it to investigate how AI safety is beginning to emerge within the larger ecosystem of machine learning research: how research clusters related to AI safety have grown over time, how active different countries are in different areas, and which publications have been particularly influential.
What is AI Safety?
Terminology has proliferated in discussions of how—broadly speaking—to build and deploy AI systems in ways that benefit humanity: trustworthy AI, responsible AI, beneficial AI, friendly AI, ethical AI, safe AI . . . the list seems to grow longer each year.
This brief focuses on three sub-areas within “AI safety,” a term that has come to refer primarily to technical research (i.e., not legal, political, or social research) that aims to identify and avoid unintended AI behavior. AI safety research seeks to make progress on the technical aspects of the many socio-technical challenges that have accompanied advances in machine learning over the past decade.
Like all of the terms in question, AI safety does not have clearly defined boundaries; instead, it is a fuzzy catch-all for a range of different research problems and approaches. For the purposes of this brief, we searched for research clusters that relate to three types of work that are highly relevant to AI safety: robustness, interpretability, and reward learning.2 We think of these categories as follows:
- Robustness research focuses on AI systems that seem to work well in general but fail in certain circumstances. This includes identifying and defending against deliberate attacks, such as the use of adversarial examples (giving the system inputs intentionally designed to cause it to fail), data poisoning (manipulating training data in order to cause an AI model to learn the wrong thing), and other techniques.3 It also includes making models more robust to incidental (i.e., not deliberate) failures, such as a model being used in a different setting from the one it was trained for (known as being “out of distribution”). Robustness research can seek to identify failure modes, find ways to prevent them, or develop tools to show that a given system will work robustly under a given set of assumptions. (See the first sketch after this list for a minimal illustration of an adversarial example.)
- Interpretability research aims to help us understand the inner workings of machine learning models. Modern machine learning techniques (especially deep learning, also known as deep neural networks) are famous for being “black boxes” whose functioning is opaque to us. In reality, it’s easy enough to look inside the black box, but what we find there—rows upon rows of numbers (“parameters”)—is extremely difficult to make sense of. Interpretability research aims to build tools and approaches that convert the millions or billions of parameters in a machine learning model into forms that allow humans to grasp what’s going on.4 (See the second sketch after this list for one simple example of such a tool.)
- Reward learning research seeks to expand the toolbox for how we tell machine learning systems what we want them to do. The standard approach to training a model involves specifying an objective function (or reward function): typically, to maximize something, such as accuracy in labeling examples from a training dataset. This approach works well in settings where we can identify metrics and training data that closely track what we want, but it can lead to problems in more complex situations.5 Reward learning is one set of approaches that tries to mitigate these problems. Instead of requiring a designer to specify an objective directly, these approaches set up the machine learning model to learn not only how to meet its objective but also what that objective should be. (See the third sketch after this list for a minimal example of learning an objective from human preferences.)
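To make the adversarial-example idea concrete, the first sketch below shows a fast-gradient-sign-style perturbation applied to a toy logistic-regression classifier. Everything in it is an illustrative assumption rather than material from this brief: the weights and inputs are random stand-ins for a real model, and the code is a minimal sketch of the general technique, not an implementation from the literature discussed here.

```python
# Minimal sketch of an adversarial example (fast-gradient-sign style) on a toy
# logistic-regression classifier. All weights and inputs are illustrative
# stand-ins, not a real model or dataset.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, b):
    """Probability that input x belongs to class 1."""
    return sigmoid(np.dot(w, x) + b)

def fgsm_perturb(x, y, w, b, epsilon=0.3):
    """Return an adversarially perturbed copy of x.

    For logistic regression with cross-entropy loss, the gradient of the loss
    with respect to the input is (p - y) * w, so the attack nudges each input
    feature by +/- epsilon in the direction that increases the loss.
    """
    p = predict(x, w, b)
    grad_x = (p - y) * w
    return x + epsilon * np.sign(grad_x)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=20)                        # illustrative model weights
    b = 0.0
    x = rng.normal(size=20)                        # a "clean" input
    y = 1.0 if predict(x, w, b) >= 0.5 else 0.0    # the model's own label for x

    x_adv = fgsm_perturb(x, y, w, b)
    print("clean prediction:      ", round(float(predict(x, w, b)), 3))
    print("adversarial prediction:", round(float(predict(x_adv, w, b)), 3))
```

The point of the sketch is that a small, targeted change to the input can flip the model’s prediction even though the model itself is unchanged.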
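The second sketch illustrates one simple interpretability technique, gradient-times-input saliency, on the same kind of toy model. The model, input, and attribution scores are assumptions made for illustration; real interpretability tooling for deep networks is far more involved, but the goal is the same: translate raw parameters into per-feature attributions a human can inspect.

```python
# Minimal sketch of gradient-times-input saliency on a toy logistic model.
# The weights and input are illustrative; the output is a per-feature score
# indicating how much each input feature contributed to the prediction.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def saliency(x, w, b):
    """Attribute the prediction back to individual input features.

    For p = sigmoid(w.x + b), the gradient of p with respect to x is
    p * (1 - p) * w; multiplying by the input gives a common
    "gradient * input" attribution score per feature.
    """
    p = sigmoid(np.dot(w, x) + b)
    grad = p * (1.0 - p) * w
    return grad * x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    w = rng.normal(size=5)     # illustrative model weights
    x = rng.normal(size=5)     # an illustrative input
    for i, score in enumerate(saliency(x, w, 0.0)):
        print(f"feature {i}: attribution {score:+.3f}")
```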
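The third sketch illustrates reward learning from pairwise preferences: instead of being handed an objective, the learner infers a reward function from comparisons between outcomes, in the style of a Bradley-Terry preference model. The synthetic data, the linear reward model, and the training loop are all assumptions chosen to keep the example minimal.

```python
# Minimal sketch of reward learning from pairwise preferences. A hidden "true"
# reward generates preference labels over pairs of outcomes, and the learner
# recovers a reward function from those comparisons instead of being given an
# objective directly. All data here is synthetic and illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def learn_reward(features_a, features_b, prefers_a, lr=0.1, steps=2000):
    """Fit reward weights theta so sigmoid(r(a) - r(b)) matches the preferences."""
    theta = np.zeros(features_a.shape[1])
    diff_features = features_a - features_b
    for _ in range(steps):
        p_a = sigmoid(diff_features @ theta)                 # predicted P(a preferred)
        grad = diff_features.T @ (p_a - prefers_a) / len(prefers_a)
        theta -= lr * grad                                   # gradient step on log-loss
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    true_reward = np.array([2.0, -1.0, 0.5])      # hidden reward the learner must infer
    A = rng.normal(size=(500, 3))                 # features of outcome A in each comparison
    B = rng.normal(size=(500, 3))                 # features of outcome B
    prefers_a = (A @ true_reward > B @ true_reward).astype(float)

    theta = learn_reward(A, B, prefers_a)
    # The learned weights recover the hidden reward only up to scale.
    print("learned reward weights:", np.round(theta, 2))
    print("true reward direction: ", true_reward)
```

The design choice to learn from comparisons rather than a hand-written objective is what lets this family of methods sidestep some of the specification problems described above.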
In summary: robustness is concerned with the quality and reliability of outcomes in AI applications; interpretability helps us better predict outcomes and improve learning and accountability when poor outcomes are observed; and reward learning helps reduce the risk of disconnects between intended and observed outcomes. Note that these categories are not intended to represent an exhaustive breakdown of all types of AI safety work. Future analysis could consider additional areas.
Endnotes

1. The CSET Map of Science is derived from CSET’s research clusters and merged corpus of scholarly literature. CSET’s merged corpus of scholarly literature includes Digital Science’s Dimensions, Clarivate’s Web of Science, Microsoft Academic Graph, China National Knowledge Infrastructure, arXiv, and Papers With Code. All China National Knowledge Infrastructure content furnished for use in the United States by East View Information Services, Minneapolis, MN, USA. For more on the methodology used to create the research clusters, see Ilya Rahkovsky et al., “AI Research Funding Portfolios and Extreme Growth,” Frontiers in Research Metrics and Analytics, April 6, 2021, https://www.frontiersin.org/articles/10.3389/frma.2021.630124/full.
2. This breakdown is derived from the categories used in Pedro A. Ortega and Vishal Maini et al., “Building safe artificial intelligence: specification, robustness, and assurance,” DeepMind Safety Research (Medium), September 27, 2018, https://deepmindsafetyresearch.medium.com/building-safe-artificial-intelligence-52f5f75058f1; and “Example Research Topics,” in “The Open Phil AI Fellowship,” Open Philanthropy, https://www.openphilanthropy.org/focus/global-catastrophic-risks/potential-risks-advanced-artificial-intelligence/the-open-phil-ai-fellowship.
3. For an overview of methods to attack machine learning systems, see Andrew Lohn, “Hacking AI: A Primer for Policymakers on Machine Learning Cybersecurity” (Center for Security and Emerging Technology, December 2020), https://cset.georgetown.edu/publication/hacking-ai/.
4. Interpretability is sometimes also referred to as “explainability.” For a nice account of the complexity of what constitutes an “explanation” in the real world, see Finale Doshi-Velez and Mason A. Kortz, “Accountability of AI Under the Law: The Role of Explanation” (Berkman Klein Center Working Group on Explanation and the Law, 2017), https://dash.harvard.edu/handle/1/34372584.
5. Tim G. J. Rudner and Helen Toner, “Key Concepts in AI Safety: Specification in Machine Learning” (Center for Security and Emerging Technology, December 2021), https://cset.georgetown.edu/publication/key-concepts-in-ai-safety-specification-in-machine-learning/.