Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
Abstract
We propose $SelfControl$, a novel method utilizing suffix gradients to control the behavior of large language models (LLMs) without explicit human annotations. Given a guideline expressed as a suffix string and the model's self-assessment of adherence, $SelfControl$ computes the gradient of this self-judgment with respect to the model's hidden states, directly steering the auto-regressive generation process toward desired behaviors. To enhance efficiency, we introduce $SelfControl_{Prefix}$, a compact module that encapsulates the learned representations from suffix gradients into a Prefix Controller, facilitating inference-time control over various LLM behaviors. Our experiments demonstrate $SelfControl$'s efficacy across multiple domains, including emotional modulation, ensuring harmlessness, and enhancing complex reasoning. Notably, $SelfControl_{Prefix}$ enables plug-and-play control and joint control of multiple attributes, improving model outputs without altering model parameters or increasing inference-time costs.
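The core mechanism in the abstract (taking the gradient of the model's self-judgment with respect to its hidden states and using it to steer generation) can be illustrated with a minimal toy sketch. This is not the authors' implementation: the linear head, dimensions, token id, and step size below are all hypothetical stand-ins for a real LLM's components.

```python
import torch

torch.manual_seed(0)

# Toy stand-ins for an LLM (hypothetical; the paper operates on a real LLM).
hidden_dim, vocab_size = 8, 16
lm_head = torch.nn.Linear(hidden_dim, vocab_size)  # hidden state -> token logits
YES_TOKEN = 3  # hypothetical id of the "Yes" token answering the suffix question

def suffix_score(h):
    # Model's self-assessment: log-probability of answering "Yes" to a
    # suffix question such as "Was the response harmless? Yes or No."
    return torch.log_softmax(lm_head(h), dim=-1)[YES_TOKEN]

# A hidden state at some layer/position during auto-regressive generation.
h = torch.randn(hidden_dim, requires_grad=True)

score_before = suffix_score(h)
score_before.backward()  # gradient of the self-judgment w.r.t. the hidden state

# Suffix-gradient control step: nudge the hidden state toward the guideline.
step_size = 0.1
h_controlled = (h + step_size * h.grad).detach()
score_after = suffix_score(h_controlled)

print(score_after.item() > score_before.item())
```

In the full method this update is applied to the actual transformer hidden states during decoding, and $SelfControl_{Prefix}$ then distills the resulting controlled representations into a reusable prefix so the gradient computation is not needed at inference time.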
Cite
Text
Cai et al. "Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller." ICML 2024 Workshops: FM-Wild, 2024.
Markdown
[Cai et al. "Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller." ICML 2024 Workshops: FM-Wild, 2024.](https://mlanthology.org/icmlw/2024/cai2024icmlw-selfcontrol/)
BibTeX
@inproceedings{cai2024icmlw-selfcontrol,
title = {{Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller}},
author = {Cai, Min and Zhang, Yuchen and Zhang, Shichang and Yin, Fan and Zou, Difan and Yue, Yisong and Hu, Ziniu},
booktitle = {ICML 2024 Workshops: FM-Wild},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/cai2024icmlw-selfcontrol/}
}