Offline RL by Reward-Weighted Fine-Tuning for Conversation Optimization
Abstract
Offline reinforcement learning (RL) is a variant of RL where the policy is learned from a previously collected dataset of trajectories and rewards. In our work, we propose a practical approach to offline RL with large language models (LLMs). We recast the problem as reward-weighted fine-tuning, which can be solved using techniques similar to supervised fine-tuning (SFT). To showcase the value of our approach, we apply it to learning short-horizon question-answering policies of a fixed length, where the agent reasons about potential answers or asks clarifying questions. Our work stands in stark contrast to state-of-the-art methods in this domain, which are based on SFT and direct preference optimization, have additional hyper-parameters, and do not directly optimize for rewards. We compare against them empirically and report major gains in both optimized rewards and language quality.
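To make the recasting concrete, the sketch below illustrates one way a reward-weighted fine-tuning objective can look: each logged conversation is treated as an SFT example whose negative log-likelihood is scaled by a weight derived from its observed reward. The function name `reward_weighted_loss`, the softmax reward-to-weight normalization, and the tensor shapes are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of reward-weighted fine-tuning (illustrative; not the paper's
# exact algorithm): scale each trajectory's SFT loss by a reward-derived weight.
import torch
import torch.nn.functional as F


def reward_weighted_loss(logits: torch.Tensor,
                         target_ids: torch.Tensor,
                         rewards: torch.Tensor,
                         pad_id: int = 0) -> torch.Tensor:
    """logits: (B, T, V) model outputs; target_ids: (B, T) tokens of the logged
    conversation; rewards: (B,) scalar reward per trajectory."""
    B, T, V = logits.shape
    # Per-token negative log-likelihood, with padding positions zeroed out.
    nll = F.cross_entropy(
        logits.reshape(B * T, V),
        target_ids.reshape(B * T),
        ignore_index=pad_id,
        reduction="none",
    ).reshape(B, T)
    token_mask = (target_ids != pad_id).float()
    per_traj_nll = (nll * token_mask).sum(dim=1) / token_mask.sum(dim=1).clamp(min=1.0)
    # Turn rewards into non-negative weights; softmax is one simple choice
    # (an assumption here), and uniform weights recover plain SFT.
    weights = torch.softmax(rewards, dim=0) * B
    return (weights * per_traj_nll).mean()
```

In this view, setting all weights equal recovers standard SFT, which is why the same training loop and tooling can be reused.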
Cite
Text
Mukherjee et al. "Offline RL by Reward-Weighted Fine-Tuning for Conversation Optimization." Advances in Neural Information Processing Systems, 2025.
Markdown
[Mukherjee et al. "Offline RL by Reward-Weighted Fine-Tuning for Conversation Optimization." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/mukherjee2025neurips-offline/)
BibTeX
@inproceedings{mukherjee2025neurips-offline,
  title     = {{Offline RL by Reward-Weighted Fine-Tuning for Conversation Optimization}},
  author    = {Mukherjee, Subhojyoti and Lai, Viet Dac and Addanki, Raghavendra and Rossi, Ryan A. and Yoon, Seunghyun and Bui, Trung and Rao, Anup and Subramanian, Jayakumar and Kveton, Branislav},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/mukherjee2025neurips-offline/}
}