$\beta$-DPO: Direct Preference Optimization with Dynamic $\beta$

Abstract

Direct Preference Optimization (DPO) has emerged as a compelling approach for training Large Language Models (LLMs) to adhere to human preferences. However, the performance of DPO is sensitive to the tuning of its trade-off parameter $\beta$, as well as to the quality of the preference data. We analyze the impact of $\beta$ and data quality on DPO, uncovering that optimal $\beta$ values vary with the informativeness of pairwise data. Addressing the limitations of static $\beta$ values, we introduce a novel framework that dynamically calibrates $\beta$ at the batch level, informed by data quality considerations. Additionally, our method incorporates $\beta$-guided data filtering to safeguard against the influence of outliers. Through empirical evaluation, we demonstrate that our dynamic $\beta$ adjustment technique significantly improves DPO’s performance across a range of models and datasets, offering a more robust and adaptable training paradigm for aligning LLMs with human feedback. The code is available at \url{https://anonymous.4open.science/r/beta-DPO-EE6C}.
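
The abstract describes the batch-level $\beta$ calibration only at a high level, so the PyTorch sketch below is purely illustrative rather than the authors' exact formulation. It evaluates the standard DPO objective with a $\beta$ that is adjusted per batch from the implicit reward margin between chosen and rejected responses; the function name, the `alpha` scaling factor, and the `margin_ema` running statistic are assumptions introduced for illustration, and the $\beta$-guided data filtering step is omitted.

```python
import torch
import torch.nn.functional as F

def dynamic_beta_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                          ref_chosen_logps, ref_rejected_logps,
                          beta0=0.1, alpha=0.5, margin_ema=0.0):
    """DPO loss with a batch-level dynamic beta (illustrative sketch).

    Each *_logps tensor holds the summed log-probabilities of the chosen /
    rejected responses under the trainable policy and the frozen reference
    model, one entry per preference pair. `beta0` is the base trade-off
    parameter, `alpha` scales the batch-level adjustment, and `margin_ema`
    is a running estimate of the typical reward margin (hypothetical
    bookkeeping maintained by the caller).
    """
    # Implicit reward margins for each preference pair in the batch.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    margins = chosen_rewards - rejected_rewards            # shape: (batch,)

    # Batch-level beta: increase beta when the batch margin exceeds the
    # running estimate (pairs look more informative), decrease it otherwise.
    batch_margin = margins.mean().detach()
    beta = beta0 * (1.0 + alpha * (batch_margin - margin_ema))
    beta = beta.clamp(min=1e-3)                            # keep beta positive

    # Standard DPO objective evaluated with the calibrated beta.
    loss = -F.logsigmoid(beta * margins).mean()
    return loss, batch_margin
```

A caller would presumably update `margin_ema` (e.g., as an exponential moving average of the returned `batch_margin`) across training steps, but the abstract does not specify how such a statistic is tracked.
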

Cite

Text

Wu et al. "$\beta$-DPO: Direct Preference Optimization with Dynamic $\beta$." Neural Information Processing Systems, 2024. doi:10.52202/079017-4128

Markdown

[Wu et al. "$\beta$-DPO: Direct Preference Optimization with Dynamic $\beta$." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/wu2024neurips-dpo/) doi:10.52202/079017-4128

BibTeX

@inproceedings{wu2024neurips-dpo,
  title     = {{$\beta$-DPO: Direct Preference Optimization with Dynamic $\beta$}},
  author    = {Wu, Junkang and Xie, Yuexiang and Yang, Zhengyi and Wu, Jiancan and Gao, Jinyang and Ding, Bolin and Wang, Xiang and He, Xiangnan},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-4128},
  url       = {https://mlanthology.org/neurips/2024/wu2024neurips-dpo/}
}