Language Models Can Self-Improve at State-Value Estimation for Better Search
Abstract
Collecting ground-truth rewards or human demonstrations for multi-step reasoning tasks is often prohibitively expensive, especially in interactive domains such as web tasks. We introduce Self-Taught Lookahead (STL), a reward-free framework that improves language model–based value functions by reasoning explicitly about state transitions. STL can be viewed as a chain-of-thought analogue of the value iteration algorithm: instead of regressing directly on numeric values, a value LLM is trained to simulate a step of lookahead in natural language—predicting the next action, resulting state, and rationale for its value. This process refines value estimates without any labeled data. The self-supervised procedure yields more accurate state-value predictions, which in turn enable lightweight search algorithms to expand fewer states while maintaining strong performance. Empirically, STL-trained value models built on moderately sized (8B-parameter) open-weight LLMs boost web agent success rates by over 39%, achieving performance comparable to proprietary models. STL also generalizes to multi-hop question answering and math puzzles. Overall, STL enables small open-source models to guide efficient search, reducing inference costs by integrating explicit reasoning with value learning.
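The abstract frames STL as value iteration carried out in natural language. To make that analogy concrete, below is a minimal Python sketch of one STL improvement step, assuming a generic llm callable that returns a structured lookahead and a fine_tune hook; the Lookahead fields, prompt wording, and function names are illustrative assumptions, not the authors' actual code.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Lookahead:
    action: str      # predicted next action, in natural language
    next_state: str  # predicted resulting state
    rationale: str   # chain-of-thought justification for the value
    value: float     # numeric value parsed out of the rationale

def generate_lookahead(llm: Callable[[str], Lookahead], state: str) -> Lookahead:
    # One simulated step of lookahead, expressed entirely in text: the value
    # LLM predicts an action, the state it leads to, and a value rationale.
    prompt = (
        f"Current state:\n{state}\n"
        "Predict the best next action, describe the resulting state, and "
        "explain how valuable that state is on a 0-1 scale."
    )
    return llm(prompt)

def stl_step(llm, states, fine_tune, k=3):
    # One reward-free STL iteration: a chain-of-thought analogue of the
    # value-iteration backup, trained on the model's own lookahead reasoning.
    examples = []
    for state in states:
        # Sample k lookaheads and keep the highest-valued one, mirroring
        # the max over actions in the classical Bellman update.
        best = max((generate_lookahead(llm, state) for _ in range(k)),
                   key=lambda la: la.value)
        # The (state, reasoning, backed-up value) pair becomes a
        # self-supervised fine-tuning target; no labeled rewards are needed.
        examples.append((state, best))
    fine_tune(examples)

Taking the max over sampled lookaheads mirrors the Bellman backup, while fine-tuning on the accompanying rationale rather than a bare number is what, per the abstract, distinguishes STL from direct numeric value regression.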
Cite
Text
Mendes and Ritter. "Language Models Can Self-Improve at State-Value Estimation for Better Search." Advances in Neural Information Processing Systems, 2025.

Markdown
[Mendes and Ritter. "Language Models Can Self-Improve at State-Value Estimation for Better Search." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/mendes2025neurips-language/)

BibTeX
@inproceedings{mendes2025neurips-language,
  title     = {{Language Models Can Self-Improve at State-Value Estimation for Better Search}},
  author    = {Mendes, Ethan and Ritter, Alan},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/mendes2025neurips-language/}
}