Other briefs in this series:
- Key Concepts in AI Safety: An Overview
- Key Concepts in AI Safety: Interpretability in Machine Learning
- Key Concepts in AI Safety: Specification in Machine Learning
- Key Concepts in AI Safety: Reliable Uncertainty Quantification in Machine Learning
Introduction
As machine learning becomes more widely used and applied to areas where safety and reliability are critical, the risk of system failures causing significant harm rises. To avoid such failures, machine learning systems will need to be much more reliable than they currently are, operating safely under a wide range of conditions. In this brief, we introduce adversarial examples—a particularly challenging type of input to machine learning systems—and describe an artificial intelligence (AI) safety approach for preventing system failures caused by such inputs.
Machine learning systems are designed to learn patterns and associations from data. Typically, a machine learning method consists of a statistical model of the relationship between inputs and outputs, as well as a learning algorithm. The algorithm specifies how the model should change as it receives more information (in the form of data) about the input–output relationship it is meant to represent. This process of updating the model with more data is called “training.”
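To make the split between model and learning algorithm concrete, here is a minimal sketch in Python (an illustration, not drawn from the brief): the "model" is a line with two parameters, and the "learning algorithm" is gradient descent, which updates those parameters each time a new data point arrives.

```python
import random

def model(x, w, b):
    """The statistical model: a guess that outputs relate to inputs as y ~ w*x + b."""
    return w * x + b

def train(data, steps=10000, lr=0.01):
    """The learning algorithm: adjust w and b to shrink the model's error on the data."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        x, y = random.choice(data)      # receive one more piece of information
        error = model(x, w, b) - y      # how wrong is the current model here?
        w -= lr * error * x             # nudge the weight to reduce the error
        b -= lr * error                 # nudge the bias to reduce the error
    return w, b

# Illustrative synthetic data drawn from the relationship y = 2x + 1, plus noise.
data = [(x, 2 * x + 1 + random.gauss(0, 0.1)) for x in [i / 10 for i in range(50)]]
w, b = train(data)
print(f"learned w={w:.2f}, b={b:.2f}")  # should approach w=2, b=1 as training proceeds
```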
Once a machine learning model has been trained, it can make predictions (such as whether an image depicts an object or a human), perform actions (such as autonomous navigation), or generate synthetic data (such as images, videos, speech, and text). An important trait of any machine learning system is its ability to generalize: to work well not only on the specific inputs it was shown during training, but also on new inputs. For example, many image classification models are trained using a dataset of millions of images called ImageNet; these models are only useful if they also work well on real-life images outside the training dataset.
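In practice, this ability is usually estimated by holding out data the model never sees during training and measuring accuracy on it. The sketch below illustrates the idea; the library (scikit-learn), its small built-in digits dataset, and the choice of model are illustrative assumptions standing in for the ImageNet-scale setting described above.

```python
# Held-out evaluation: accuracy is measured on images the model never saw
# in training, as a proxy for how it would perform on real-world inputs.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)  # small image-classification dataset (a stand-in for ImageNet)

# Hold out 25 percent of the images; the model is trained only on the rest.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("accuracy on training images:", clf.score(X_train, y_train))
print("accuracy on unseen images:  ", clf.score(X_test, y_test))
```

A large gap between the two numbers would indicate that the model has memorized its training images rather than learned patterns that carry over to new ones.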
Modern machine learning systems using deep neural networks—a prevalent type of statistical model—are much better at generalizing than many other approaches. For example, a deep neural network trained to classify black-and-white images of cats and dogs is likely to succeed at classifying similar images of cats and dogs in color. However, even the most sophisticated machine learning systems will fail when given inputs that are meaningfully different from the inputs they were trained on. A cat-and-dog classifier, for example, will not be able to classify a fish as such if it has never encountered an image of a fish during training. Furthermore, as the next section explores in detail, humans cannot always intuit which kinds of inputs will appear meaningfully different to the model.
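One way to see why such failures occur: a classifier's final layer can only distribute probability across the labels it was trained on, so it has no way to say "fish" or "none of the above." The sketch below makes this concrete for a hypothetical cat-and-dog classifier; the raw scores are made up for illustration.

```python
import math

CLASSES = ["cat", "dog"]  # the only labels this hypothetical model has ever seen

def softmax(logits):
    """Convert raw model scores into probabilities over the known classes."""
    exps = [math.exp(score) for score in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw scores the network might produce when shown a fish image.
fish_logits = [1.3, 0.9]

for label, p in zip(CLASSES, softmax(fish_logits)):
    print(f"{label}: {p:.0%}")
# Output: roughly 60% "cat", 40% "dog". The model must pick among the classes
# it knows, and nothing in its output signals that the input is unlike anything
# it saw in training.
```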