Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF

Abstract

This paper investigates a basic question in reinforcement learning from human feedback (RLHF) from a theoretical perspective: how to explore efficiently in an online manner under preference feedback and general function approximation. We take a first step towards a theoretical understanding of this problem by proposing a novel algorithm, *Exploratory Preference Optimization* (XPO). This algorithm is elegantly simple, requiring only a one-line modification to (online) Direct Preference Optimization (DPO; Rafailov et al., 2023), yet provides the strongest known provable guarantees. XPO augments the DPO objective with a novel and principled *exploration bonus*, enabling the algorithm to strategically explore beyond the support of the initial model and preference feedback data. We prove that XPO is sample-efficient and converges to a near-optimal policy under natural exploration conditions, regardless of the initial model's coverage. Our analysis builds on the observation that DPO implicitly performs a form of *Bellman error minimization*. It synthesizes previously disparate techniques from language modeling and theoretical reinforcement learning in a serendipitous fashion through the lens of *KL-regularized Markov decision processes*.
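To make the "one-line modification" concrete, the sketch below computes the standard DPO loss from per-response log-probabilities and then adds a single extra term. The abstract does not state the form of the exploration bonus, so the form used here (a coefficient `alpha` times the mean policy log-probability of responses sampled from a reference policy) is an assumption for illustration only. The function names, the dictionary keys, and the parameters `beta` and `alpha` are likewise hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(batch, beta):
    """Mean DPO loss over a batch of preference pairs.

    Each item holds sequence log-probabilities of the chosen and rejected
    responses under the current policy and under the reference policy.
    """
    total = 0.0
    for ex in batch:
        chosen_ratio = ex["logp_chosen"] - ex["ref_logp_chosen"]
        rejected_ratio = ex["logp_rejected"] - ex["ref_logp_rejected"]
        total -= log_sigmoid(beta * (chosen_ratio - rejected_ratio))
    return total / len(batch)


def exploratory_loss(batch, beta, alpha, logp_sampled):
    """DPO loss plus an ASSUMED exploration term (not taken from the abstract).

    logp_sampled: policy log-probabilities of responses drawn from a
    reference policy. Minimizing alpha * mean(logp_sampled) pushes
    probability mass away from those responses.
    """
    bonus = alpha * sum(logp_sampled) / len(logp_sampled)  # the added line
    return dpo_loss(batch, beta) + bonus


if __name__ == "__main__":
    batch = [
        {"logp_chosen": -3.0, "logp_rejected": -5.0,
         "ref_logp_chosen": -4.0, "ref_logp_rejected": -4.5},
    ]
    print("DPO loss:", dpo_loss(batch, beta=0.1))
    print("With bonus:", exploratory_loss(batch, beta=0.1, alpha=0.01,
                                          logp_sampled=[-6.0, -7.5]))
```

Setting `alpha` to zero recovers plain DPO, which is one way to read the abstract's claim that the change amounts to a single added term. In practice the log-probabilities would come from a language model rather than hand-written numbers.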

Cite

Text

Xie et al. "Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF." International Conference on Learning Representations, 2025.

BibTeX

@inproceedings{xie2025iclr-exploratory,
  title     = {{Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF}},
  author    = {Xie, Tengyang and Foster, Dylan J and Krishnamurthy, Akshay and Rosset, Corby and Awadallah, Ahmed Hassan and Rakhlin, Alexander},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/xie2025iclr-exploratory/}
}