From Words to Rewards: Leveraging Natural Language for Reinforcement Learning

Abstract

We explore the use of natural language to specify rewards in Reinforcement Learning from Human Feedback (RLHF). Unlike traditional approaches that rely on simple preference feedback, we harness Large Language Models (LLMs) to translate rich text feedback into state-level labels for training a reward model. Our empirical studies with human participants demonstrate that our method accurately approximates the reward function and achieves significant performance gains with fewer interactions than baseline methods.

Cite

Text

Urcelay et al. "From Words to Rewards: Leveraging Natural Language for Reinforcement Learning." Transactions on Machine Learning Research, 2026.

Markdown

[Urcelay et al. "From Words to Rewards: Leveraging Natural Language for Reinforcement Learning." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/urcelay2026tmlr-words/)

BibTeX

@article{urcelay2026tmlr-words,
  title     = {{From Words to Rewards: Leveraging Natural Language for Reinforcement Learning}},
  author    = {Urcelay, Belen Martin and Krause, Andreas and Ramponi, Giorgia},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/urcelay2026tmlr-words/}
}