Outlier-Aware Preference Optimization for Large Language Models

Abstract

Aligning large language models (LLMs) to user preferences often relies on a reward model learned from feedback as a proxy for those preferences. However, such reward models can fail on out-of-distribution examples and, if kept static, may reinforce incorrect preferences. We propose a dynamic alignment method that uses an energy-based out-of-distribution (OOD) scoring mechanism to identify potential misjudgments and then judiciously collects oracle feedback to refine both the policy and the reward model. By focusing on OOD examples, our approach iteratively improves alignment and robustness in preference-based training. Empirically, our method enhances the policy model’s generative capabilities on the LM Eval Harness benchmark and improves the reward model’s judgment capability on RewardBench.
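
The abstract does not spell out the scoring function, but as a rough illustration of energy-based OOD scoring, the minimal Python sketch below computes the standard energy score, E(x) = -T · logsumexp(f(x)/T), over model logits and routes the highest-energy (most OOD-looking) examples to an oracle for fresh preference labels. The (chosen, rejected) logit layout, the batch shape, and the ood_budget are illustrative assumptions, not the authors' implementation.

import torch

def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # Standard energy-based OOD score: E(x) = -T * logsumexp(f(x) / T).
    # Higher (less negative) energy suggests the input is more likely
    # out-of-distribution for the model that produced the logits.
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

# Hypothetical usage: rank reward-model inputs by energy and send the
# highest-energy pairs to the oracle for new preference labels.
logits = torch.randn(8, 2)        # assumed per-pair (chosen, rejected) scores
scores = energy_score(logits)
ood_budget = 3                    # assumed labeling budget per round
query_idx = torch.topk(scores, ood_budget).indices  # examples to query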

Cite

Text

Srivastava et al. "Outlier-Aware Preference Optimization for Large Language Models." ICLR 2025 Workshops: Bi-Align, 2025.

Markdown

[Srivastava et al. "Outlier-Aware Preference Optimization for Large Language Models." ICLR 2025 Workshops: Bi-Align, 2025.](https://mlanthology.org/iclrw/2025/srivastava2025iclrw-outlieraware/)

BibTeX

@inproceedings{srivastava2025iclrw-outlieraware,
  title     = {{Outlier-Aware Preference Optimization for Large Language Models}},
  author    = {Srivastava, Pragya and Nalli, Sai Soumya and Deshpande, Amit and Sharma, Amit},
  booktitle = {ICLR 2025 Workshops: Bi-Align},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/srivastava2025iclrw-outlieraware/}
}