CSET

Evaluating Large Language Models

Thomas Woodside and Helen Toner

July 17, 2024

Researchers, companies, and policymakers have dedicated increasing attention to evaluating large language models (LLMs). This explainer covers why researchers are interested in evaluations, describes some common types of evaluations, and discusses their associated challenges. While evaluations can be helpful for monitoring progress, assessing risk, and determining whether to use a model for a specific purpose, the science of evaluation is still at a very early stage.
