As machine learning is applied more widely, including in areas where safety and reliability are critical, the risk of system failures causing significant harm rises. To avoid such failures, machine learning systems will need to be much more reliable than they currently are, operating safely under a wide range of conditions. In this paper, we introduce adversarial examples—a particularly challenging type of input to machine learning systems—and describe an artificial intelligence (AI) safety approach for preventing system failures caused by such inputs.
Machine learning systems are designed to learn patterns and associations from data. Typically, a machine learning method consists of a statistical model of the relationship between inputs and outputs, as well as a learning algorithm. The algorithm specifies how the model should change as it receives more information (in the form of data) about the input–output relationship it is meant to represent. This process of updating the model with more data is called “training.”
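The training loop described above can be sketched with a toy example: a one-parameter model of the relationship y = w·x, updated by gradient descent as it sees more data. The data, learning rate, and model are illustrative only, not drawn from any real system.

```python
# Minimal sketch of "training": a one-parameter model y = w * x,
# fitted by gradient descent on squared error. The learning
# algorithm specifies how w changes as each data point arrives.

def train(data, lr=0.05, epochs=100):
    """Update the weight w as the model sees more (x, y) pairs."""
    w = 0.0  # initial model: knows nothing about the relationship
    for _ in range(epochs):
        for x, y in data:
            pred = w * x
            grad = 2 * (pred - y) * x   # d/dw of (w*x - y)^2
            w -= lr * grad              # update the model with this data point
    return w

# The true (hidden) input-output relationship here is y = 3x.
data = [(1, 3), (2, 6), (3, 9)]
w = train(data)
print(round(w, 2))  # after training, w is close to 3.0
```

Each pass over the data nudges the model closer to the input–output relationship it is meant to represent; this is the "updating with more data" that the text calls training.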
Once a machine learning model has been trained, it can make predictions (such as whether an image depicts an object or a human), perform actions (such as autonomous navigation), or generate synthetic data (such as images, videos, speech, and text). An important trait of any machine learning system is its ability to work well not only on the specific inputs it was shown in training, but also on other inputs. For example, many image classification models are trained using a dataset of millions of images called ImageNet; these models are only useful if they also work well on real-life images outside of the training dataset.
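This notion of generalizing beyond the training set can be illustrated with a deliberately simple classifier: fit class centroids on a handful of labeled 2D points, then evaluate on held-out points the model never saw. The "cat"/"dog" points are made up for illustration.

```python
# Sketch of generalization: a nearest-centroid classifier fitted on
# a small training set, then checked on held-out inputs.

def fit_centroids(points, labels):
    """Compute the mean point (centroid) of each class."""
    centroids = {}
    for label in set(labels):
        cluster = [p for p, l in zip(points, labels) if l == label]
        centroids[label] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centroids

def predict(centroids, point):
    """Label a point with the class whose centroid is nearest."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lbl: dist2(centroids[lbl], point))

train_points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
train_labels = ["cat", "cat", "cat", "dog", "dog", "dog"]
centroids = fit_centroids(train_points, train_labels)

# Held-out inputs: not in the training set, but from the same regions.
test_points = [(0.5, 0.5), (5.5, 5.5)]
print([predict(centroids, p) for p in test_points])  # ['cat', 'dog']
```

The model is only useful because it labels the held-out points correctly; a model that memorized the six training points and nothing else would be worthless.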
Modern machine learning systems using deep neural networks—a prevalent type of statistical model—are much better in this regard than many other approaches. For example, a deep neural network trained to classify black-and-white images of cats and dogs is likely to succeed at classifying similar images in color. However, even the most sophisticated machine learning systems will fail when given inputs that are meaningfully different from the inputs they were trained on. A cat-and-dog classifier, for example, will not be able to classify a fish as such if it has never encountered an image of a fish during training. Furthermore, as the next section explores in detail, humans cannot always intuit which kinds of inputs will appear meaningfully different to the model.
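The fish failure mode can be made concrete with the same kind of toy classifier: a model trained only on "cat" and "dog" has no way to answer "fish"—it must map every input, however unfamiliar, onto one of the labels it knows. The centroid values and the "fish" point are invented for illustration.

```python
# Sketch of the out-of-distribution failure described above: a
# two-class model is forced to pick one of its known labels even
# for an input unlike anything it saw in training.

def predict(centroids, point):
    """Label a point with the class whose centroid is nearest."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lbl: dist2(centroids[lbl], point))

# Centroids learned from cat/dog training data only (illustrative values).
centroids = {"cat": (0.3, 0.3), "dog": (5.3, 5.3)}

fish_point = (10, -2)  # nothing like any training input
result = predict(centroids, fish_point)
print(result)  # confidently answers "cat" or "dog", never "fish"
```

The failure is structural, not a matter of insufficient accuracy: the model's output space simply does not contain the right answer, and nothing in its mechanics flags the input as unfamiliar.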