Variational Reasoning for Language Models
Abstract
We introduce a **variational reasoning** framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where *an implicit weighting by model accuracy* naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models.
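For orientation, the quantities named in the abstract can be sketched in standard variational-inference notation. The symbols below are assumptions for illustration, not necessarily the paper's notation: $x$ is the question, $z$ a thinking trace, $y$ the answer, $y^\star$ the correct answer, $\pi_\theta$ the language model, $q_\phi$ the variational posterior, and $K$ the number of traces. The multi-trace form shown is the generic importance-weighted bound, which may differ from the paper's exact objective.

```latex
% Single-trace ELBO with the thinking trace z as the latent variable
\log \pi_\theta(y \mid x)
  \;\ge\; \mathbb{E}_{z \sim q_\phi(z \mid x, y)}
  \left[ \log \frac{\pi_\theta(z, y \mid x)}{q_\phi(z \mid x, y)} \right]

% Generic multi-trace (importance-weighted) bound; K = 1 recovers the ELBO
\log \pi_\theta(y \mid x)
  \;\ge\; \mathbb{E}_{z_1, \dots, z_K \sim q_\phi(z \mid x, y)}
  \left[ \log \frac{1}{K} \sum_{k=1}^{K}
  \frac{\pi_\theta(z_k, y \mid x)}{q_\phi(z_k \mid x, y)} \right]

% Binary-reward gradient: sampling from the model and keeping correct answers
% factors into (model accuracy) x (expected gradient under the true posterior)
\mathbb{E}_{(z, y) \sim \pi_\theta(\cdot \mid x)}
  \left[ \mathbf{1}[y = y^\star] \, \nabla_\theta \log \pi_\theta(z, y \mid x) \right]
  = \pi_\theta(y^\star \mid x) \;
    \mathbb{E}_{z \sim \pi_\theta(z \mid x, y^\star)}
    \left[ \nabla_\theta \log \pi_\theta(z, y^\star \mid x) \right]
```

The factor $\pi_\theta(y^\star \mid x)$ in the last identity is the model's accuracy on the question. Questions the model already answers correctly therefore contribute more to the update, which is one way to read the bias toward easier questions that the abstract describes.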
Cite
Text
Zhou et al. "Variational Reasoning for Language Models." International Conference on Learning Representations, 2026.
Markdown
[Zhou et al. "Variational Reasoning for Language Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhou2026iclr-variational/)
BibTeX
@inproceedings{zhou2026iclr-variational,
title = {{Variational Reasoning for Language Models}},
author = {Zhou, Xiangxin and Liu, Zichen and Wang, Haonan and Du, Chao and Lin, Min and Li, Chongxuan and Wang, Liang and Pang, Tianyu},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/zhou2026iclr-variational/}
}