Measuring Steerability in Large Language Models

Abstract

Large language models (LLMs) are powerful instruction followers. However, many open-ended generation tasks have a large “solution space” that depends on a user’s needs. LLMs that are steerable towards such needs are critical for safe LLM systems that behave consistently with user expectations and goals. Despite continued improvement in LLM instruction following, such gains may not necessarily translate to steerability. This disconnect motivates a principled framework for measuring steerability. Thus, we propose a goal-oriented, quantitative definition of steerability. Our definition informs the design of an empirical steerability probe, in which we leverage text-rewriting tasks to measure the steerability of LLMs. We demonstrate that recent LLMs are not steerable. We attribute this lack of steerability to “side effects”: correlations between requested goals and non-requested LLM movement. Thus, despite advances in LLM instruction following, there remains significant room for improving LLM steerability.
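
To make the abstract's framing concrete, one can read steerability as movement in a space of measurable text attributes: a rewrite request asks for a change along one attribute, and “side effects” are correlated changes along attributes that were not requested. The sketch below is a minimal illustration under that reading; the attribute-vector representation, the correlation-based side-effect measure, and all function and variable names are assumptions for exposition, not the paper's actual definitions or code.

```python
import numpy as np

# Illustrative sketch (not the paper's definition): each text is scored on d
# attributes (e.g., reading level, formality). A rewrite prompt requests a
# change along exactly one attribute; we then compare requested movement with
# movement along the non-requested attributes.

def steerability_probe(befores, afters, requested_idx, requested_deltas):
    """befores, afters: (n, d) attribute-score matrices for n rewrites.
    requested_idx: (n,) index of the attribute each prompt asked to change.
    requested_deltas: (n,) signed change each prompt asked for (nonzero).
    Returns mean progress toward the requested goal and, per attribute, the
    correlation between requested changes and non-requested movement."""
    deltas = afters - befores                      # observed movement, (n, d)
    n, d = deltas.shape

    # Fraction of the requested change actually achieved, per rewrite.
    achieved = deltas[np.arange(n), requested_idx] / requested_deltas
    goal_progress = achieved.mean()

    # "Side effects": correlation between the requested change and movement
    # along each attribute, restricted to prompts that did NOT request it.
    side_effects = np.zeros(d)
    for j in range(d):
        mask = requested_idx != j
        if mask.sum() > 1:
            side_effects[j] = np.corrcoef(requested_deltas[mask],
                                          deltas[mask, j])[0, 1]
    return goal_progress, side_effects
```

Under this reading, a steerable model would show goal progress near 1 and side-effect correlations near 0; large correlations would indicate that asking for one change drags other, unrequested attributes along with it.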

Cite

Text

Chang et al. "Measuring Steerability in Large Language Models." NeurIPS 2024 Workshops: SafeGenAi, 2024.

Markdown

[Chang et al. "Measuring Steerability in Large Language Models." NeurIPS 2024 Workshops: SafeGenAi, 2024.](https://mlanthology.org/neuripsw/2024/chang2024neuripsw-measuring/)

BibTeX

@inproceedings{chang2024neuripsw-measuring,
  title     = {{Measuring Steerability in Large Language Models}},
  author    = {Chang, Trenton and Wiens, Jenna and Schnabel, Tobias and Swaminathan, Adith},
  booktitle = {NeurIPS 2024 Workshops: SafeGenAi},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/chang2024neuripsw-measuring/}
}