VinePPO: Accurate Credit Assignment in RL for LLM Mathematical Reasoning
Abstract
Large language models (LLMs) are increasingly required to solve complex reasoning tasks, such as mathematical problems, that involve multiple reasoning steps before feedback is received. Accurately assigning credit to these intermediate steps, so that key steps can be identified and prioritized, is essential for enhancing model performance. Proximal Policy Optimization (PPO), a state-of-the-art reinforcement learning algorithm for finetuning LLMs, addresses the credit assignment problem by employing value networks to predict the expected cumulative rewards of intermediate states. In this work, we identify significant limitations of this value estimation method. To address them, we propose VinePPO, which leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates of the intermediate values. VinePPO consistently outperforms standard PPO, doing so more efficiently and with lower divergence from the reference model. Our findings underscore the critical importance of accurate credit assignment in LLM post-training and present a simple, yet effective solution.
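The Monte Carlo value estimation the abstract refers to can be sketched as follows: resample several completions from an intermediate reasoning state and average their final rewards. This is a minimal illustration, not the authors' implementation; `generate_completion`, `reward`, and `num_samples` are hypothetical placeholders.

```python
def mc_value_estimate(prompt, partial_solution, generate_completion, reward, num_samples=9):
    """Estimate V(state) as the mean final reward of rollouts continued from the state.

    Hypothetical sketch: `generate_completion(prompt, prefix)` samples a continuation
    from the current policy, and `reward(prompt, full_solution)` scores the finished
    trajectory (e.g., 1.0 if the final answer is correct, 0.0 otherwise).
    """
    total = 0.0
    for _ in range(num_samples):
        completion = generate_completion(prompt, partial_solution)   # sample a rollout from this state
        total += reward(prompt, partial_solution + completion)       # score the completed solution
    return total / num_samples
```

Because each rollout is drawn from the current policy, the average is an unbiased estimate of the state's value, at the cost of extra sampling instead of training a separate value network.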
Cite
Text
Kazemnejad et al. "VinePPO: Accurate Credit Assignment in RL for LLM Mathematical Reasoning." NeurIPS 2024 Workshops: MATH-AI, 2024.
Markdown
[Kazemnejad et al. "VinePPO: Accurate Credit Assignment in RL for LLM Mathematical Reasoning." NeurIPS 2024 Workshops: MATH-AI, 2024.](https://mlanthology.org/neuripsw/2024/kazemnejad2024neuripsw-vineppo/)
BibTeX
@inproceedings{kazemnejad2024neuripsw-vineppo,
title = {{VinePPO: Accurate Credit Assignment in RL for LLM Mathematical Reasoning}},
author = {Kazemnejad, Amirhossein and Aghajohari, Milad and Portelance, Eva and Sordoni, Alessandro and Reddy, Siva and Courville, Aaron and Le Roux, Nicolas},
booktitle = {NeurIPS 2024 Workshops: MATH-AI},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/kazemnejad2024neuripsw-vineppo/}
}