Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization

Abstract

Reinforcement Learning from Human Feedback (RLHF) has achieved impressive empirical successes while relying on a small amount of human feedback. However, there is limited theoretical justification for this phenomenon. Additionally, most existing theoretical studies focus on value-based algorithms despite the recent empirical successes of policy-based algorithms. In this work, we consider an RLHF algorithm based on policy optimization (PO-RLHF). The algorithm builds on the popular Policy Cover-Policy Gradient (PC-PG) algorithm, which assumes knowledge of the reward function. In PO-RLHF, knowledge of the reward function is not assumed, and the algorithm instead relies on trajectory-based comparison feedback to infer the reward function. We provide performance bounds for PO-RLHF with low query complexity, offering insight into why a small amount of human feedback may be sufficient for good performance with RLHF. A key novelty is our trajectory-level elliptical potential analysis technique, used to infer reward function parameters when comparison queries rather than reward observations are available. We present and analyze algorithms in two settings: PG-RLHF for linear function approximation and NN-PG-RLHF for neural function approximation.
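
To make the reward-inference step concrete, the sketch below shows one standard way a linear reward model can be fit from trajectory-level comparison feedback under a Bradley-Terry preference model, along with the trajectory-level design matrix that an elliptical potential argument tracks. This is a minimal illustrative Python sketch under those assumptions, not the paper's implementation; the function name fit_reward_from_comparisons and the constants lam, lr, and iters are hypothetical.

# Minimal sketch (illustrative, not the paper's code): regularized maximum-likelihood
# estimation of a linear trajectory-level reward from Bradley-Terry comparison feedback.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_reward_from_comparisons(phi_a, phi_b, prefs, lam=1.0, lr=0.1, iters=500):
    """Estimate reward parameters theta from trajectory comparisons.

    phi_a, phi_b : (m, d) cumulative feature vectors of the two trajectories
                   in each of m comparison queries (trajectory-level features).
    prefs        : (m,) binary labels, 1 if trajectory a was preferred.
    Returns a regularized MLE of theta and the trajectory-level design matrix
    whose growth an elliptical-potential-style argument bounds.
    """
    m, d = phi_a.shape
    diff = phi_a - phi_b                      # feature difference per query
    theta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(diff @ theta)             # Bradley-Terry preference probability
        grad = diff.T @ (prefs - p) - lam * theta   # gradient of regularized log-likelihood
        theta += lr * grad / m
    Sigma = lam * np.eye(d) + diff.T @ diff   # Sigma = lam*I + sum_i diff_i diff_i^T
    return theta, Sigma

# Usage with synthetic data (illustrative only).
rng = np.random.default_rng(0)
d, m = 5, 200
theta_true = rng.normal(size=d)
phi_a, phi_b = rng.normal(size=(m, d)), rng.normal(size=(m, d))
prefs = (rng.random(m) < sigmoid((phi_a - phi_b) @ theta_true)).astype(float)
theta_hat, Sigma = fit_reward_from_comparisons(phi_a, phi_b, prefs)

Here Sigma accumulates outer products of trajectory feature differences across comparison queries; bounding how long new queries can keep inflating this matrix is, roughly, the role the trajectory-level elliptical potential argument plays in controlling query complexity.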

Cite

Text

Du et al. "Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization." International Conference on Machine Learning, 2024.

Markdown

[Du et al. "Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/du2024icml-explorationdriven/)

BibTeX

@inproceedings{du2024icml-explorationdriven,
  title     = {{Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization}},
  author    = {Du, Yihan and Winnicki, Anna and Dalal, Gal and Mannor, Shie and Srikant, R.},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {11830--11887},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/du2024icml-explorationdriven/}
}