Language Models Can Articulate Their Implicit Goals
Abstract
We investigate LLMs' awareness of newly acquired goals or policies. We find that a model finetuned on examples that exhibit a particular policy (e.g. preferring risky options) can describe this policy (e.g. "I take risky options"). This holds even when the model does not have any examples in-context, and without any descriptions of the policy appearing in the finetuning data. This capability extends to *many-persona scenarios*, where models internalize and report different learned policies for different simulated individuals (*personas*), as well as *trigger* scenarios, where models report policies that are triggered by particular token sequences in the prompt. This awareness enables models to acquire information about themselves that was only implicit in their training data. It could potentially help practitioners discover when a model's training data contains undesirable biases or backdoors.
Cite
Text
Betley et al. "Language Models Can Articulate Their Implicit Goals." NeurIPS 2024 Workshops: SafeGenAi, 2024.Markdown
[Betley et al. "Language Models Can Articulate Their Implicit Goals." NeurIPS 2024 Workshops: SafeGenAi, 2024.](https://mlanthology.org/neuripsw/2024/betley2024neuripsw-language/)BibTeX
@inproceedings{betley2024neuripsw-language,
title = {{Language Models Can Articulate Their Implicit Goals}},
author = {Betley, Jan and Bao, Xuchan and Soto, Martín and Sztyber-Betley, Anna and Chua, James and Evans, Owain},
booktitle = {NeurIPS 2024 Workshops: SafeGenAi},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/betley2024neuripsw-language/}
}