Evaluating the Prompt Steerability of Large Language Models

Abstract

Building pluralistic AI requires designing models that can be shaped to represent a wide range of value systems and cultures. Achieving this first requires the ability to evaluate the degree to which a given model is capable of reflecting various personas. To this end, we propose a benchmark for evaluating the steerability of model personas as a function of prompting. Our design is based on a formal definition of prompt steerability, which analyzes the degree to which a model's joint behavioral distribution can be shifted from its baseline behavior. By defining steerability indices and inspecting how these indices change as a function of steering effort, we can estimate the steerability of a model across various persona dimensions and directions. Our benchmark reveals that the steerability of many current models is limited -- due to both a skew in their baseline behavior and an asymmetry in their steerability across many persona dimensions. We release an implementation of our benchmark at https://github.com/IBM/prompt-steering.
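
As a rough illustration of the idea, the sketch below computes a toy steerability index for a single persona dimension: the signed shift of a steered score distribution away from the baseline, normalized by the headroom available in that direction. The scoring scale, the mean-shift statistic, and the normalization are assumptions made for this sketch; the paper's formal definition operates on the model's joint behavioral distribution, and the actual benchmark lives in the linked repository.

```python
# Illustrative sketch only (not the paper's exact formulation): treats
# "steerability" as how far prompting shifts a model's score distribution
# on one persona dimension away from its baseline, per direction,
# normalized by the maximum shift still available in that direction.

import numpy as np


def steerability_index(baseline: np.ndarray, steered: np.ndarray) -> float:
    """Signed, normalized shift of `steered` relative to `baseline`.

    Inputs are 1-D arrays of per-response scores in [0, 1] for one persona
    dimension. Returns a value in [-1, 1]: positive means the steered
    distribution moved toward 1, negative toward 0. Using mean shift over
    remaining headroom is an assumption of this sketch.
    """
    b, s = baseline.mean(), steered.mean()
    if s >= b:
        headroom = 1.0 - b  # maximum possible upward shift
        return (s - b) / headroom if headroom > 0 else 0.0
    return (s - b) / b if b > 0 else 0.0  # downward shift, negative


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    baseline = rng.beta(6, 3, size=500)    # skewed baseline behavior
    steer_up = rng.beta(9, 2, size=500)    # prompted toward the trait
    steer_down = rng.beta(4, 4, size=500)  # prompted away from the trait
    print(f"index (+): {steerability_index(baseline, steer_up):+.2f}")
    print(f"index (-): {steerability_index(baseline, steer_down):+.2f}")
```

On this synthetic data the two indices differ in magnitude: the skewed baseline leaves little headroom in one direction, mirroring the abstract's observation that baseline skew and directional asymmetry limit steerability. Sweeping the steered distributions over increasing steering effort (e.g., more persona statements in the prompt) would trace out the steerability curves the benchmark inspects.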

Cite

Text

Miehling et al. "Evaluating the Prompt Steerability of Large Language Models." NeurIPS 2024 Workshops: Pluralistic-Alignment, 2024.

Markdown

[Miehling et al. "Evaluating the Prompt Steerability of Large Language Models." NeurIPS 2024 Workshops: Pluralistic-Alignment, 2024.](https://mlanthology.org/neuripsw/2024/miehling2024neuripsw-evaluating/)

BibTeX

@inproceedings{miehling2024neuripsw-evaluating,
  title     = {{Evaluating the Prompt Steerability of Large Language Models}},
  author    = {Miehling, Erik and Desmond, Michael and Ramamurthy, Karthikeyan Natesan and Daly, Elizabeth M. and Dognin, Pierre and Rios, Jesus and Bouneffouf, Djallel and Liu, Miao},
  booktitle = {NeurIPS 2024 Workshops: Pluralistic-Alignment},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/miehling2024neuripsw-evaluating/}
}