Other briefs in this series:
- Key Concepts in AI Safety: An Overview
- Key Concepts in AI Safety: Robustness and Adversarial Examples
- Key Concepts in AI Safety: Specification in Machine Learning
- Key Concepts in AI Safety: Reliable Uncertainty Quantification in Machine Learning
Introduction
In artificial intelligence (AI), interpretability, also often referred to as explainability, is the study of how to understand the decisions of machine learning systems and how to design systems whose decisions are easily understood, or interpretable. With interpretable systems, human operators can verify that a system is working as intended and obtain explanations when it behaves unexpectedly.
Modern machine learning systems are becoming prevalent in automated decision making, spanning a variety of applications in both the private and public spheres. As this trend continues, machine learning systems are being deployed with increasingly limited human supervision, including in areas where their decisions may have significant impacts on people’s lives. Such areas include automated credit scoring, medical diagnoses, hiring, and autonomous driving, among many others. At the same time, machine learning systems are also becoming more complex, making it difficult to analyze and understand how they reach conclusions. This increase in complexity—and the lack of interpretability that comes with it—poses a fundamental challenge for using machine learning systems in high-stakes settings.
Furthermore, many of our laws and institutions are premised on the right to request an explanation for a decision, especially one that leads to negative consequences. From a job candidate suing for discrimination in a hiring process, to a bank customer asking why they received a low credit limit, to a soldier explaining their actions before a court-martial, we assume there is a process for assessing how a decision was made and whether it met the standards we have set. This assumption may not hold if the decision-maker in question is a machine learning system that cannot provide such an explanation. For modern machine learning systems to integrate safely into existing institutions in high-stakes settings, they must be interpretable by their human operators.