Concerns over risks from generative artificial intelligence (AI) systems have increased significantly over the past year, driven in large part by the advent of increasingly capable large language models (LLMs). Many of these potential risks stem from these models producing undesirable outputs, from hate speech to information that could be put to malicious use. However, the inherent complexity of LLMs makes controlling or steering their outputs a considerable technical challenge.
This issue brief presents three broad categories of potentially harmful outputs—inaccurate information, biased or toxic outputs, and outputs resulting from malicious use—that may motivate developers to control LLMs. It also explains four popular techniques that developers currently use to control LLM outputs, organized by the stage of the LLM development life cycle at which each is applied: 1) editing pre-training data, 2) supervised fine-tuning, 3) reinforcement learning from human feedback and Constitutional AI, and 4) prompt and output controls.
None of these techniques are perfect, and they are frequently used in concert with one another and with nontechnical controls such as content policies. Furthermore, the availability of open models—which anyone can download and modify for their own purposes—means that these controls or safeguards are unevenly distributed across various LLMs and AI-enabled products. Ultimately, this is a complex and novel problem that presents challenges for both policymakers and AI developers. Today's techniques are more like sledgehammers than scalpels, and even the most cutting-edge controls cannot guarantee that an LLM will never produce an undesirable output.