Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO
Abstract
Direct alignment methods typically train large language models (LLMs) by contrasting the likelihoods of preferred and dispreferred responses. While effective at capturing relative preferences, these methods are widely observed to suppress the absolute likelihoods of example responses. As a result, aligned models can deviate from expected patterns, exhibiting a reward‑hacking effect even without an explicit reward model. This fundamental limitation of contrastive alignment, termed likelihood underdetermination, motivates us to revisit direct preference optimization (DPO), the seminal direct alignment method. Interestingly, we show that the DPO loss admits a principled decomposition. The reformulated loss not only extends naturally to a broader range of feedback types, but also unveils the root cause of likelihood underdetermination. Specifically, we identify that standard DPO implicitly oversimplifies a regularizer in the reformulated loss; restoring this full term effectively resolves the underdetermination. Building on these insights, we introduce PRoximalized PReference Optimization (PRO), a unified alignment method that accommodates diverse feedback types while eliminating likelihood underdetermination through an efficient approximation of the full regularizer. Empirical evaluations demonstrate the consistent superiority of PRO over existing methods across pairwise, binary, and scalar feedback.
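For context, the contrastive objective the abstract refers to is the standard DPO loss from the prior literature (Rafailov et al., 2023), shown below for pairwise feedback; the decomposed and proximalized objectives introduced in this paper are defined in the paper itself and are not reproduced here.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

Here y_w and y_l are the preferred and dispreferred responses to prompt x, π_ref is the frozen reference policy, β sets the strength of the implicit KL regularization, and σ is the logistic function. Because the objective depends only on the difference of log-likelihood ratios, it constrains relative rather than absolute likelihoods, which is the sense in which the abstract describes the likelihoods of example responses as underdetermined.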
Cite
Text
Guo et al. "Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO." Advances in Neural Information Processing Systems, 2025.Markdown
[Guo et al. "Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/guo2025neurips-proximalized/)BibTeX
@inproceedings{guo2025neurips-proximalized,
title = {{Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO}},
author = {Guo, Kaiyang and Li, Yinchuan and Chen, Zhitang},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/guo2025neurips-proximalized/}
}