
Key Concepts in AI Safety: Interpretability in Machine Learning

Tim G. J. Rudner and Helen Toner

March 2021

This paper is the third installment in a series on “AI safety,” an area of machine learning research that aims to identify causes of unintended behavior in machine learning systems and develop tools to ensure these systems work safely and reliably. The first paper in the series, “Key Concepts in AI Safety: An Overview,” described three categories of AI safety issues: problems of robustness, assurance, and specification. This paper introduces interpretability as a means to enable assurance in modern machine learning systems.



Introduction

In artificial intelligence (AI), interpretability, often also called explainability, is the study of how to understand the decisions of machine learning systems and how to design systems whose decisions are easily understood, or interpretable. With interpretable systems, human operators can verify that a system works as intended and obtain an explanation for unexpected behavior.

Modern machine learning systems are becoming prevalent in automated decision making, spanning a variety of applications in both the private and public spheres. As this trend continues, machine learning systems are being deployed with increasingly limited human supervision, including in areas where their decisions may have significant impacts on people’s lives. Such areas include automated credit scoring, medical diagnoses, hiring, and autonomous driving, among many others. At the same time, machine learning systems are also becoming more complex, making it difficult to analyze and understand how they reach conclusions. This increase in complexity—and the lack of interpretability that comes with it—poses a fundamental challenge for using machine learning systems in high-stakes settings.

Furthermore, many of our laws and institutions are premised on the right to request an explanation for a decision, especially if the decision leads to negative consequences. From a job candidate suing for discrimination in a hiring process, to a bank customer inquiring about the reason for receiving a low credit limit, to a soldier explaining their actions before a court-martial, we assume that there is a process for assessing how a decision was made and whether it was in line with standards we have set. This assumption may not hold if the decisionmaker in question is a machine learning system that cannot provide such an explanation. For modern machine learning systems to integrate safely into existing institutions in high-stakes settings, they must be interpretable by human operators.


Other briefs in this series:

Key Concepts in AI Safety: An Overview (first installment, March 2021)

Key Concepts in AI Safety: Robustness and Adversarial Examples (second installment, March 2021)

Key Concepts in AI Safety: Specification in Machine Learning (fourth installment, November 2021)

Key Concepts in AI Safety: Reliable Uncertainty Quantification in Machine Learning (fifth installment, June 2024)
