Introduction: Upgrading AI Measurement for the 21st Century
This report demonstrates how contemporary approaches to bibliometric and technical analysis can give policymakers an adaptable tool to better understand ongoing AI developments.1 By combining bibliometric analysis (via CSET’s recently developed Map of Science derived from CSET’s research clusters and merged corpus of scholarly literature) with qualitative knowledge of specific AI subfields, we show how detailed pictures of AI progress can be created to help policymakers understand the current state of research for specific AI technologies. This report applies our methodology to three topics: re-identification, speaker recognition, and image synthesis. Additionally, we offer ideas for how policymakers can integrate this approach into existing and future measurement programs.2
By combining a versatile and frequently updated bibliometrics tool with a more hands-on analysis of technical developments within a particular AI subfield, we can build out a detailed picture of the state of published research into specific topics. In particular, tracking how AI subfields show improvements in performance benchmarks can send a powerful signal about the state of technical progress. Benchmarks have played a critical role in spurring AI development for topics as wide-ranging as image recognition, self-driving cars, and robotics,3 and they can be used to supplement bibliometric-based approaches, allowing us to see whether rapid research growth is resulting in genuine improvements in performance. This methodology differs from most contemporary approaches to technology assessment because it is structured to allow deeper system-led approaches, places a significant premium on the analysis of fast-moving AI publications (including bibliometric analysis of preprints), and is generically applicable to a variety of AI capabilities, rather than custom-designed for a specific area of study.
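The pairing described above can be illustrated with a short sketch. Everything in the snippet below is invented for demonstration purposes: the publication counts and benchmark scores are hypothetical, not drawn from CSET's Map of Science; a real pipeline would pull cluster-level counts from a merged scholarly corpus and best-reported scores from the literature.

```python
# Illustrative sketch: pair publication growth with benchmark progress so an
# analyst can check whether rapid research growth coincides with genuine
# performance improvements. All data below is invented for demonstration.

def yoy_growth(series):
    """Year-over-year growth rates for a list of yearly values."""
    return [(b - a) / a for a, b in zip(series, series[1:]) if a]

# Hypothetical yearly publication counts for one research cluster, 2017-2021.
pub_counts = [120, 150, 210, 340, 560]
# Hypothetical best benchmark accuracy reported each year (percent).
benchmark_best = [71.2, 74.8, 80.1, 84.3, 84.9]

pub_growth = yoy_growth(pub_counts)
bench_gain = [b - a for a, b in zip(benchmark_best, benchmark_best[1:])]

# Flag years where publications grew rapidly (here, more than 25 percent),
# alongside that year's benchmark gain, for closer qualitative review.
for year, growth, gain in zip(range(2018, 2022), pub_growth, bench_gain):
    if growth > 0.25:
        print(f"{year}: publications +{growth:.0%}, benchmark +{gain:.1f} pts")
```

The 25 percent threshold is arbitrary; the point is simply that a continuously updated signal like this flags where deeper expert analysis is worthwhile, rather than replacing it.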
Why Does Measurement Matter?
Measurement is uniquely intertwined with policymaking, especially in the case of AI. For the past few decades, attempts by policymakers and academic researchers to assess and measure AI have been linked. The MNIST dataset of handwritten digits, derived from National Institute of Standards and Technology (NIST) data, became a valuable resource for gauging the overall pace of AI progress as researchers adopted it as a simple benchmark for assessing capabilities. The xView dataset, designed by the U.S. Department of Defense's Defense Innovation Unit Experimental (DIUx), stimulated the development of computer vision capabilities for satellite imagery and served as a resource that generated data about AI progress. Similarly, the Defense Advanced Research Projects Agency's (DARPA) self-driving car and robotics competitions have themselves been exercises in measurement and assessment that catalyzed work in a technical area and gave researchers a sense of emerging capabilities in a policy-relevant field. Perhaps the most prominent example of how measurement and assessment have influenced policymakers is ImageNet, a Stanford University project that let researchers test the performance of computer vision systems against a large and, at the time, deliberately challenging dataset.
This report aims to go one step further. Today, the U.S. government periodically conducts tests or assessments of AI capabilities, or develops new tests and datasets at the behest of experts or agencies, to understand AI technologies. We instead propose a system for continuously monitoring publication patterns in AI-enabled capabilities that might highlight consequential trends to the government, and for continuously analyzing technical benchmarks that can help the government detect significant advancements. While many different organizations regularly assess and measure AI capabilities for specific policy purposes, we propose a system for measuring capabilities and patterns within AI as a whole.4
After outlining our methodology, this report discusses how qualitative expert knowledge combined with bibliometric tools, like CSET’s Map of Science, can generate insights regarding developments in the specific research areas of re-identification, speaker recognition, and image synthesis. We also discuss some limitations of this approach and end with recommendations for how policymakers can support the development of similar types of analytical tools.
Download the full report: Measuring AI Development: A Prototype Methodology to Inform Policy
- For more literature on the field of bibliometrics, see the following sources: Office of Management, “Learn More About Bibliometrics,” National Institutes of Health (NIH), https://www.nihlibrary.nih.gov/services/bibliometrics/learn-more-about-bibliometrics; Ashok Agarwal et al., “Bibliometrics: Tracking Research Impact by Selecting the Appropriate Metrics,” Asian Journal of Andrology 18, no. 2 (January 2016): 296-309, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4770502/; Anthony van Raan, “Measuring Science: Basic Principles and Application of Advanced Bibliometrics,” in Springer Handbook of Science and Technology Indicators (2019): 237-280, https://link.springer.com/chapter/10.1007/978-3-030-02511-3_10. Bibliometrics involves the “quantitative evaluation of scientific articles and other published works, including the authors of articles, the journals where the works were published, and the number of times they are later cited.” A.W. Jones, “Forensic Journals: Bibliometrics and Journal Impact Factors,” Encyclopedia of Forensic and Legal Medicine (Second Edition) (2016), https://www.sciencedirect.com/science/article/pii/B9780128000342001816#.
- This report is not an attempt to forecast future developments, an in-depth analysis of a particular aspect of AI, or a comprehensive review or critique of other research efforts to measure AI. Rather, it outlines a systematic process for measuring AI developments and offers a demonstration that we hope will inspire policymakers to implement and improve the methodology. Additionally, though none of the elements of the methodology are novel, by integrating them we can outline a process that improves measurement.
- National Institute of Standards and Technology (NIST), “The EMNIST Dataset,” U.S. Department of Commerce, https://www.nist.gov/itl/products-and-services/emnist-dataset; Defense Advanced Research Projects Agency (DARPA), “The DARPA Grand Challenge: Ten Years Later,” March 13, 2014, https://www.darpa.mil/news-events/2014-03-13; Defense Advanced Research Projects Agency (DARPA), “DARPA Robotics Challenge (DRC),” https://www.darpa.mil/program/darpa-robotics-challenge.
- The efforts of this report are similar to parts of the Intelligence Advanced Research Projects Activity’s (IARPA) Foresight and Understanding from Scientific Exposition (FUSE) program. The FUSE program ran from 2010 to 2017 and was intended to “enable reliable, early detection of emerging scientific and technical capabilities across disciplines and languages found within the full-text content of scientific, technical, and patent literature” and “discover patterns of emergence and connections between technical concepts at a speed, scale, and comprehensiveness that exceeds human capacity.” Like this report, FUSE emphasized the need for a methodology built on continuous, rather than ad hoc, bibliometric assessment. However, the FUSE program intended to design a fully refined and automated methodology involving six “research thrusts” (theory development, document features, indicator development, nomination quality, evidence representation, and system engineering), while this report’s objectives are far more limited: we intend only to demonstrate the potential for insight through a prototype methodology that can be adopted and updated by different end users. Additionally, the methodology in this report incorporates performance metrics assessment and an AI-curated corpus of scientific literature, elements not included in IARPA’s FUSE program. For more details, see Intelligence Advanced Research Projects Activity (IARPA), “Foresight and Understanding from Scientific Exposition (FUSE),” Office of the Director of National Intelligence, https://www.iarpa.gov/index.php/research-programs/fuse.