Why Evaluate Large Language Models?
As covered in a previous CSET explainer, large language models can learn to perform a wide variety of tasks simply by being trained to predict the next word of internet text. Researchers cannot, however, easily predict which tasks an LLM will be able to perform, especially as models are trained at ever greater scales. It is therefore necessary to evaluate models for their capabilities and risks.
Evaluations are useful for:
- Deciding whether to use a model for a particular task. Evaluations help determine which tasks a model is and is not useful for.
- Understanding the trajectory of artificial intelligence. Observing the results of evaluations over time can give a sense of how the field of AI is progressing.
- Assessing and managing risk. For example, a model that can identify cybersecurity vulnerabilities poses a greater risk of misuse than one that cannot. Evaluations like this can help assess the risks a model poses and determine how to handle them. Anthropic, OpenAI, and Google DeepMind have published frameworks that commit them to taking certain policy actions if their models score highly on risk evaluations. The Biden administration’s October 30th executive order also requires companies to report evaluation results, and the government could decide to act on that information.
How Large Language Models Are Evaluated
There are hundreds, if not thousands, of different ways to evaluate LLMs, and many are never published. The commonly used evaluation methods below are meant to be illustrative, not exhaustive.
Benchmarks are human-curated sets of questions and answers aimed at assessing a model. These include benchmarks for assessing models’ broad capabilities as well as for identifying ethics and safety concerns. Red teaming aims to find holes in model guardrails and other problems with models. Custom risk evaluations may use any number of other experimental techniques to measure properties of interest. Lastly, there are evaluations for specific tasks.
General capabilities benchmarks are used to assess the progress of LLMs over time, rather than gauge a model’s fitness for any one task. Popular examples include:
- Measuring Massive Multitask Language Understanding (MMLU) includes multiple-choice questions from professional exams on topics ranging from the law to computer science.
- A Graduate-Level Google-Proof Q&A Benchmark (GPQA) includes multiple-choice questions written by human experts in biology, chemistry, and physics that nonexperts answer poorly even when given 30 minutes to search the internet for an answer.
- HumanEval tests whether an LLM can write working code based on an English-language description of what the code should do.
- MATH measures the ability of models to solve word problems from high school math competitions.
Figure 1: Two Questions from the MMLU Benchmark
Source: Dan Hendrycks et al. “Measuring Massive Multitask Language Understanding,” https://arxiv.org/abs/2009.03300.
MMLU and GPQA are considered general capabilities benchmarks because they measure a model’s ability to answer questions from many different domains. HumanEval and MATH measure performance on specific capabilities (programming and math, respectively), but are often considered to be proxies for general logical reasoning ability (though many researchers dispute this).
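To make this concrete, running a multiple-choice benchmark like MMLU boils down to prompting the model with each question and its answer choices, extracting the letter it picks, and computing accuracy against the answer key. The Python sketch below illustrates that basic loop; the example question and the `query_model` function are hypothetical stand-ins, not parts of any official evaluation harness.

```python
# Minimal sketch of multiple-choice benchmark scoring (MMLU-style). The question
# below is an illustrative example, not an item from the benchmark itself, and
# query_model is a hypothetical stand-in for a call to the model under test.

def query_model(model_name, prompt):
    """Placeholder for an API call to the model being evaluated (hypothetical)."""
    raise NotImplementedError("Replace with a real model call.")

QUESTIONS = [
    {
        "question": "Which constitutional amendment established the federal income tax?",
        "choices": {"A": "The 13th", "B": "The 16th", "C": "The 19th", "D": "The 22nd"},
        "answer": "B",
    },
    # ...a real benchmark contains thousands of such items
]

def format_prompt(item):
    """Render a question and its answer choices as a single prompt string."""
    lines = [item["question"]]
    lines += [f"{letter}. {text}" for letter, text in item["choices"].items()]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def score(model_name, questions):
    """Return the fraction of questions the model answers correctly."""
    correct = 0
    for item in questions:
        reply = query_model(model_name, format_prompt(item))
        predicted = reply.strip()[:1].upper()  # take the first character as the chosen letter
        if predicted == item["answer"]:
            correct += 1
    return correct / len(questions)
```

In practice, evaluation harnesses also handle details such as few-shot example formatting and answer extraction, which is part of why prompt choices can shift reported scores (see the Challenges section below).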
Ethics and safety benchmarks. There are many benchmarks for measuring ethics and safety properties of models. Illustrative examples include:
- Bias Benchmark for QA (BBQ) tests whether models exhibit social biases in their answers to questions and the extent to which this bias can lead them to give clearly incorrect answers.
- TruthfulQA tests whether models reproduce misconceptions commonly found on the internet, such as that cracking your knuckles can cause arthritis.
Figure 2: Two Questions from the BBQ Benchmark
Source: Alicia Parrish et al., “BBQ: A Hand-Built Bias Benchmark for Question Answering,” https://arxiv.org/abs/2110.08193.
Compared with general capability benchmarks, relatively few ethics and safety benchmarks are in wide use. Although some researchers are working to change that, safety and ethics are more often measured with other kinds of evaluations.
Benchmark evaluations of the kind described above are designed to require little human involvement to run (though they may take many hours to develop) and to remain relatively static in order to give comparable results between models and across time. For these reasons, benchmark results are usually the top-line numbers that developers highlight when releasing a new model.
One additional benchmark-like evaluation method that warrants mention is the LMSYS Chatbot Arena. The Arena aims to assess models’ general capabilities, but rather than using a static list of questions, it crowdsources votes from human users who make head-to-head comparisons of different chatbots’ responses to the users’ own questions. This approach makes it more difficult for developers to game their results (see below for more on “benchmark chasing”), but it also makes it challenging to know exactly what models are being tested on, since the questions are not fixed in the way a regular benchmark’s are.
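Arena-style leaderboards turn many of these pairwise votes into a single rating per model. The sketch below shows one simple way to do that, an Elo-style update applied to a stream of head-to-head outcomes; it is an illustration of the general idea rather than LMSYS’s exact methodology, and the model names are placeholders.

```python
# Simplified Elo-style rating from pairwise votes (illustration only; not the
# Chatbot Arena's exact methodology).
from collections import defaultdict

K = 32              # update step size
BASE_RATING = 1000  # rating assigned to a model before any votes

def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def rate(votes):
    """votes is a list of (winner, loser) model-name pairs from user comparisons."""
    ratings = defaultdict(lambda: BASE_RATING)
    for winner, loser in votes:
        exp_win = expected_score(ratings[winner], ratings[loser])
        ratings[winner] += K * (1 - exp_win)
        ratings[loser] -= K * (1 - exp_win)
    return dict(ratings)

# Example: three hypothetical votes between two anonymized chatbots.
print(rate([("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]))
```

Because the ratings are built from users’ own, ever-changing questions rather than a fixed answer key, a developer cannot simply train on the test, which is the gaming-resistance advantage described above.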
Red teaming. “Red teaming” evaluations aim to find holes in model guardrails or reveal ways that a model might unexpectedly fail. Red teaming is often conducted manually by contractors or even expert researchers: for example, OpenAI contracted external expert red teamers when testing GPT-4 before its release. Red teaming can also be conducted semi-automatically, for example, using other LLMs or automatic “jailbreaking” techniques. While red teaming is usually ad hoc, there have been efforts to build standard benchmarks. For more on red teaming, see this previous CSET explainer.
Custom risk evaluations. Researchers have also conducted more complex evaluations that are meant to inform risk assessments. These evaluations often require significant human effort to conduct and provide information that fixed benchmarks cannot. Examples include:
- Model Evaluation and Threat Research (METR) is a nonprofit that conducts third-party evaluations of frontier models for autonomous capabilities, such as a model’s ability to copy itself onto a new server. Current models cannot perform all of the steps necessary to pass METR’s evaluations, so to test whether a model can complete individual steps on its own, METR researchers manually complete some of the more difficult steps on its behalf.
- The RAND Corporation evaluated how useful LLMs were to teams of people aiming to create plans for a biological attack. RAND allocated up to 3600 hours for participants to spend on the study, which showed that present-day LLMs were not significantly more useful than the internet. OpenAI also conducted a related study.
- NYU researchers evaluated models’ ability to help students solve offensive cybersecurity problems.
- DeepMind researchers have developed a number of risk evaluations, ranging from models’ persuasion ability to self-proliferation, many of which require significant time from human experimenters.
Evaluations for specific tasks. When companies or individuals consider using a model for a particular task, they usually first perform evaluations for that task. For high-value tasks, these can be somewhat systematic, such as the one used by legal AI company Harvey to evaluate one of its custom models. But many are ad hoc: think of an individual user trying out how good a model is at drafting emails before deciding whether to keep using it.
Challenges
While rapid progress is being made in evaluating AI systems, many challenges remain:
- Coverage gaps remain in available evaluations, with many capabilities still not adequately measured. At the same time, benchmarks often “saturate” when models approach 100% within a few years or even months of a benchmark’s release, making it more difficult to compare models across time.
- Benchmark chasing refers to the common practice of training models specifically with the goal of doing well on a given benchmark—in other words, teaching to the test. Being able to claim state-of-the-art performance on a well-known benchmark makes it much easier to draw attention to a new model launch. As predicted by Goodhart’s Law (“when a measure becomes a target, it ceases to be a good measure”), this creates a strong incentive for AI developers to use whatever tricks they can to juice their benchmark results, which in turn makes benchmark results less meaningful.
- A lack of standards for evaluation makes results hard to interpret. For example, the instructions given to a model before it sees a question (its “prompt”) can greatly affect performance, which gives developers an incentive to optimize the prompt when running evaluations. This makes comparing evaluation results from different models more difficult (as new prompting techniques are invented, for example) and makes evaluations easier to game.
- Training data may be contaminated and include the very benchmarks being used to evaluate models. Recently, benchmarks have been accompanied by a randomly chosen string of text called a “canary.” Because this string should only appear alongside the benchmark, developers can search their training data for it and remove any documents that contain it (see the sketch after this list). However, the overall efficacy of this strategy is not well established.
- Emergent capabilities are barely or not at all present in models at one scale but seem to appear suddenly as models scale up. This can make it more challenging to use evaluations to assess the trajectory of the field and to know which evaluations are most needed.
- Situational awareness is a concern for future advanced models that might be able to tell the difference between testing and deployment environments and act differently in each. Model developers could also be tempted to encourage this behavior if regulatory action is premised on evaluation results, much as Volkswagen designed its vehicles to behave differently when they were being tested for emissions.
- The link between evaluation results and real-world outcomes is not clear, since evaluations cannot perfectly mimic the real-world tasks that models may be instructed to complete. Researchers do not yet have solid methods for assessing how well evaluations predict real-world behavior, a property often referred to as test validity.
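To illustrate the canary approach mentioned in the list above, the sketch below filters a small corpus of training documents by searching each one for a benchmark’s canary string. The canary value and the documents are made-up placeholders, and real decontamination pipelines operate at far larger scale.

```python
# Minimal sketch of canary-based decontamination: drop any training document that
# contains a benchmark's canary string. The canary value here is a hypothetical
# placeholder, not the identifier used by any real benchmark.

CANARY = "BENCHMARK-CANARY-00000000-0000-0000-0000-000000000000"

def decontaminate(documents, canary=CANARY):
    """Split documents into those kept for training and those containing the canary."""
    kept, removed = [], []
    for doc in documents:
        (removed if canary in doc else kept).append(doc)
    return kept, removed

corpus = [
    "An ordinary web page about cooking.",
    f"A page that reposts benchmark questions ... {CANARY}",
]
clean, flagged = decontaminate(corpus)
print(f"{len(clean)} kept, {len(flagged)} flagged as containing the canary")
```

A string match like this only catches documents that reproduce the canary itself; benchmark questions copied without the canary slip through, which is one reason the strategy’s overall efficacy remains uncertain.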
Conclusion
Evaluations can be useful for monitoring progress in LLM research, aiding with risk assessment, and deciding if an LLM is fit for a specific task, but model evaluation is still a nascent field. The evaluations that exist are limited and flawed, and there are many areas where evaluations do not exist at all. Researchers have not come to a consensus around standards for evaluation or how to tell whether evaluations are measuring what they claim to be measuring. Even as the field matures—for instance via investments by AI Safety Institutes in the United States, United Kingdom, and beyond—there will continue to be inherent limitations to extrapolating a model’s evaluation performance in a lab to how it will behave when interacting with users, other humans, and other AI and non-AI systems. And as LLMs continue to progress, building evaluations for them is a bit like building a plane’s altimeter while flying it ever higher. Evaluations have a long road ahead.