Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Abstract

We propose $SelfControl$, a novel method utilizing suffix gradients to control the behavior of large language models (LLMs) without explicit human annotations. Given a guideline expressed as a suffix string and the model's self-assessment of adherence, $SelfControl$ computes the gradient of this self-judgment with respect to the model's hidden states, directly steering the auto-regressive generation process toward desired behaviors. To enhance efficiency, we introduce $SelfControl_{Prefix}$, a compact module that encapsulates the learned representations from suffix gradients into a Prefix Controller, facilitating inference-time control of various LLM behaviors. Our experiments demonstrate $SelfControl$'s efficacy across multiple domains, including emotional modulation, ensuring harmlessness, and enhancing complex reasoning. Notably, $SelfControl_{Prefix}$ enables plug-and-play control and can jointly control multiple attributes, improving model outputs without altering model parameters or increasing inference-time cost.
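The core idea, gradient ascent on hidden states toward a higher self-judgment score, can be illustrated with a toy sketch. This is not the paper's implementation: the scalar `judgment`, the behavior direction `w`, and the helper names are all hypothetical stand-ins for the model's actual self-assessment, which would be backpropagated through the suffix prompt.

```python
import math

def judgment(h, w):
    """Toy self-assessed adherence score in (0, 1): sigmoid(w . h).
    Stand-in for the LLM's self-judgment of guideline adherence."""
    z = sum(hi * wi for hi, wi in zip(h, w))
    return 1.0 / (1.0 + math.exp(-z))

def suffix_gradient(h, w):
    """Analytic gradient of judgment(h, w) with respect to the hidden state h.
    In SelfControl this gradient would come from backprop through the model."""
    s = judgment(h, w)
    return [s * (1.0 - s) * wi for wi in w]

def control_step(h, w, step_size=0.5):
    """Nudge the hidden state along the suffix gradient to raise the score."""
    g = suffix_gradient(h, w)
    return [hi + step_size * gi for hi, gi in zip(h, g)]

h = [0.1, -0.2, 0.3]   # toy hidden state
w = [1.0, 0.5, -0.5]   # hypothetical "desired behavior" direction
before = judgment(h, w)
for _ in range(10):
    h = control_step(h, w)
after = judgment(h, w)  # higher than `before` after the control steps
```

$SelfControl_{Prefix}$ then amortizes these per-query gradient steps by distilling their effect into a fixed prefix, so no gradient computation is needed at inference time.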

Cite

Text

Cai et al. "Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller." ICML 2024 Workshops: FM-Wild, 2024.

Markdown

[Cai et al. "Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller." ICML 2024 Workshops: FM-Wild, 2024.](https://mlanthology.org/icmlw/2024/cai2024icmlw-selfcontrol/)

BibTeX

@inproceedings{cai2024icmlw-selfcontrol,
  title     = {{Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller}},
  author    = {Cai, Min and Zhang, Yuchen and Zhang, Shichang and Yin, Fan and Zou, Difan and Yue, Yisong and Hu, Ziniu},
  booktitle = {ICML 2024 Workshops: FM-Wild},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/cai2024icmlw-selfcontrol/}
}