Reinforcement Learning from Human Feedback with Active Queries
Abstract
Aligning large language models (LLMs) with human preferences plays a key role in building modern generative models and can be achieved by reinforcement learning from human feedback (RLHF). Despite their superior performance, current RLHF approaches often require a large amount of human-labelled preference data, which is expensive to collect. In this paper, inspired by the success of active learning, we address this problem by proposing query-efficient RLHF methods. We first formalize the alignment problem as a contextual dueling bandit problem and design an active-query-based proximal policy optimization (APPO) algorithm with an $\tilde{O}(d^2/\Delta)$ instance-dependent regret bound and an $\tilde{O}(d^2/\Delta^2)$ query complexity, where $d$ is the dimension of the feature space and $\Delta$ is the sub-optimality gap over all contexts. We then propose ADPO, a practical version of our algorithm based on direct preference optimization (DPO), and apply it to fine-tuning LLMs. Our experiments show that ADPO, while issuing only about half as many human preference queries, matches the performance of DPO, establishing it as a data-efficient alternative to DPO. The code is available at https://github.com/jkx19/ActiveQuery.
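To illustrate the active-query idea described in the abstract, the following is a minimal sketch of a single training step that combines DPO's implicit reward with a query-on-uncertainty rule: a human label is requested only when the policy's preference margin is small, otherwise the model's own ranking is reused as a pseudo-label. The function names, the threshold `tau`, and the `query_human` callback are hypothetical illustrations under these assumptions, not the paper's exact ADPO procedure.

```python
import torch
import torch.nn.functional as F

def implicit_reward_margin(logp_policy_a, logp_policy_b,
                           logp_ref_a, logp_ref_b, beta=0.1):
    # DPO's implicit reward for a response is beta * (log pi_theta - log pi_ref);
    # the margin is the difference between the two candidate responses.
    reward_a = beta * (logp_policy_a - logp_ref_a)
    reward_b = beta * (logp_policy_b - logp_ref_b)
    return reward_a - reward_b

def active_dpo_step(logp_policy_a, logp_policy_b,
                    logp_ref_a, logp_ref_b,
                    query_human, beta=0.1, tau=1.0):
    """One step that queries the human labeller only when the current
    policy is uncertain about which of the two responses is preferred."""
    margin = implicit_reward_margin(logp_policy_a, logp_policy_b,
                                    logp_ref_a, logp_ref_b, beta)
    if margin.abs() < tau:
        # Uncertain: spend a (costly) human preference query.
        a_preferred = query_human()
    else:
        # Confident: reuse the model's own ranking as a pseudo-label, no query.
        a_preferred = bool(margin > 0)
    signed_margin = margin if a_preferred else -margin
    # Standard DPO logistic loss on the (possibly pseudo-)labelled pair.
    return -F.logsigmoid(signed_margin)

# Example usage with scalar log-probabilities (placeholders):
loss = active_dpo_step(torch.tensor(-12.3), torch.tensor(-11.8),
                       torch.tensor(-12.0), torch.tensor(-12.1),
                       query_human=lambda: True)
```

The query budget is controlled by `tau`: a larger threshold sends more pairs to the human labeller, while a smaller one relies more heavily on pseudo-labels.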
Cite
Text
Ji et al. "Reinforcement Learning from Human Feedback with Active Queries." Transactions on Machine Learning Research, 2025.
Markdown
[Ji et al. "Reinforcement Learning from Human Feedback with Active Queries." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/ji2025tmlr-reinforcement/)
BibTeX
@article{ji2025tmlr-reinforcement,
title = {{Reinforcement Learning from Human Feedback with Active Queries}},
author = {Ji, Kaixuan and He, Jiafan and Gu, Quanquan},
journal = {Transactions on Machine Learning Research},
year = {2025},
url = {https://mlanthology.org/tmlr/2025/ji2025tmlr-reinforcement/}
}