Aligning to Thousands of Preferences via System Message Generalization

Abstract

Current large language model (LLM) alignment methods often assume that aligning LLMs with general public preferences is optimal, overlooking the diversity of individual values. A major challenge in adopting a more individualized approach to LLM alignment is its lack of scalability, as it requires training a new model for each new value or user. We propose a new paradigm in which users specify their values within the system message, steering LLM behavior to align with their individual intentions. However, LLMs are typically trained on a generic system message (e.g., "You are a helpful assistant"). To improve generalization to diverse system messages, we create a system message dataset with 197k value combinations across 66k user instructions. We train a 7B LLM, Janus, and test it on five benchmarks augmented with various unseen system messages that reflect user preferences. Janus achieves high tie+win rates against leading models, including GPT-4. Janus also outperforms LLaMA 3 8B Instruct on general helpfulness benchmarks, suggesting that training with diverse system messages enhances alignment with both individual and general preferences. Code, dataset, benchmark, and models are available at https://github.com/kaistAI/Janus.
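To illustrate the proposed usage pattern, below is a minimal sketch (not from the paper) of querying a system-message-conditioned model with a user-specific value description via the Hugging Face `transformers` chat interface. The model identifier, the example system message, and the user prompt are all placeholders chosen for illustration; see the repository above for the released checkpoints and their exact chat templates.

```python
# Minimal sketch: steer a chat model with a value-laden system message.
# "PLACEHOLDER/janus-7b" is a hypothetical model ID, not the official one.
from transformers import pipeline

chat = pipeline("text-generation", model="PLACEHOLDER/janus-7b")

messages = [
    {
        "role": "system",
        "content": (
            "You are an assistant for a user who values concise, "
            "evidence-based answers and prefers metric units."
        ),
    },
    {"role": "user", "content": "How much water should I drink per day?"},
]

# The pipeline applies the model's chat template to the message list and
# returns the full conversation, with the assistant's reply appended last.
output = chat(messages, max_new_tokens=256)
print(output[0]["generated_text"][-1]["content"])
```

Changing only the system message (e.g., to favor detailed, beginner-friendly explanations) would, under this paradigm, shift the model's responses toward that stated preference without any retraining.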

Cite

Text

Lee et al. "Aligning to Thousands of Preferences via System Message Generalization." NeurIPS 2024 Workshops: Pluralistic-Alignment, 2024.

Markdown

[Lee et al. "Aligning to Thousands of Preferences via System Message Generalization." NeurIPS 2024 Workshops: Pluralistic-Alignment, 2024.](https://mlanthology.org/neuripsw/2024/lee2024neuripsw-aligning/)

BibTeX

@inproceedings{lee2024neuripsw-aligning,
  title     = {{Aligning to Thousands of Preferences via System Message Generalization}},
  author    = {Lee, Seongyun and Park, Sue Hyun and Kim, Seungone and Seo, Minjoon},
  booktitle = {NeurIPS 2024 Workshops: Pluralistic-Alignment},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/lee2024neuripsw-aligning/}
}