Language Models as Implicit Tree Search
Abstract
Although direct preference optimization (DPO) has advanced language model (LM) alignment, its freedom from reinforcement learning (RL) comes at the cost of LM reasoning ability. To close this gap, this work proposes a new RL-free preference optimization method that performs DPO while jointly learning an additional LM whose response-generation policy is asymptotically equivalent to AlphaZero-like search, the apex of algorithms for complex reasoning tasks such as chess and Go. While circumventing explicit value and reward modeling, the neural implicit tree search executed by this extra LM still equips DPO with a reasoning procedure technically akin to AlphaZero's. Our experiments demonstrate that our method outperforms regular DPO variants in human preference alignment and MCTS-based LMs in mathematical reasoning and planning tasks.
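For context, a brief sketch of the standard DPO objective that the abstract builds on (written in the notation common to the DPO literature, not taken from this paper): given a preference dataset D of prompts x with chosen and rejected responses (y_w, y_l), a trainable policy, a frozen reference policy, and a temperature beta,

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[
\log\sigma\!\left(
\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
-\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
\right)
\right].
\]

The paper's contribution, per the abstract, is to retain this RL-free objective while additionally learning a second LM whose generation policy implicitly performs AlphaZero-style tree search.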
Cite
Text
Chen et al. "Language Models as Implicit Tree Search." Proceedings of the 42nd International Conference on Machine Learning, 2025.
Markdown
[Chen et al. "Language Models as Implicit Tree Search." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/chen2025icml-language/)
BibTeX
@inproceedings{chen2025icml-language,
title = {{Language Models as Implicit Tree Search}},
author = {Chen, Ziliang and Lai, Zhao-Rong and Yang, Yufeng and Fang, Liangda and Yang, Zhanfu and Lin, Liang},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {8364--8385},
volume = {267},
url = {https://mlanthology.org/icml/2025/chen2025icml-language/}
}