Variational Reasoning for Language Models

Abstract

We introduce a **variational reasoning** framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where *an implicit weighting by model accuracy* naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models.
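
The quantities named in the abstract can be sketched in standard textbook form. The notation below is an assumption made for illustration and is not taken from the paper: $x$ is the question, $z$ a thinking trace, $y$ the answer, $\pi_\theta$ the language model, and $q_\phi$ the variational posterior. The multi-trace bound shown is the usual importance-weighted form, and the last line is a generic identity for binary rewards $R(z) \in \{0, 1\}$ in which $z$ stands for the whole sampled response. The paper's exact objectives may differ.

```latex
\begin{align}
% Single-trace ELBO: the thinking trace z is the latent variable
\log \pi_\theta(y \mid x)
  &\ge \mathbb{E}_{z \sim q_\phi(z \mid x, y)}
     \left[ \log \frac{\pi_\theta(z \mid x)\, \pi_\theta(y \mid x, z)}
                      {q_\phi(z \mid x, y)} \right] \\
% Multi-trace (importance-weighted) bound with K traces, tighter for larger K
\log \pi_\theta(y \mid x)
  &\ge \mathbb{E}_{z_1, \dots, z_K \sim q_\phi(z \mid x, y)}
     \left[ \log \frac{1}{K} \sum_{k=1}^{K}
       \frac{\pi_\theta(z_k \mid x)\, \pi_\theta(y \mid x, z_k)}
            {q_\phi(z_k \mid x, y)} \right] \\
% Forward KL for the variational posterior: its gradient is a
% maximum-likelihood term under the true posterior
\nabla_\phi \, \mathrm{KL}\!\left( \pi_\theta(z \mid x, y) \,\|\, q_\phi(z \mid x, y) \right)
  &= - \mathbb{E}_{z \sim \pi_\theta(z \mid x, y)}
       \left[ \nabla_\phi \log q_\phi(z \mid x, y) \right] \\
% Binary reward: the accuracy factor appears when conditioning on correctness
\mathbb{E}_{z \sim \pi_\theta(z \mid x)}
  \left[ R(z)\, \nabla_\theta \log \pi_\theta(z \mid x) \right]
  &= \underbrace{\mathbb{E}_{z \sim \pi_\theta(z \mid x)}[R(z)]}_{\text{accuracy on } x}
     \; \mathbb{E}_{z \sim \pi_\theta(z \mid x,\, R = 1)}
     \left[ \nabla_\theta \log \pi_\theta(z \mid x) \right]
\end{align}
```

In the last identity, which holds whenever the accuracy on $x$ is nonzero, the per-question update is scaled by the model's accuracy on that question. Questions the model already answers correctly more often therefore contribute larger updates, which is one way to read the bias toward easier questions that the abstract describes.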

Cite

Text

Zhou et al. "Variational Reasoning for Language Models." International Conference on Learning Representations, 2026.

Markdown

[Zhou et al. "Variational Reasoning for Language Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhou2026iclr-variational/)

BibTeX

@inproceedings{zhou2026iclr-variational,
  title     = {{Variational Reasoning for Language Models}},
  author    = {Zhou, Xiangxin and Liu, Zichen and Wang, Haonan and Du, Chao and Lin, Min and Li, Chongxuan and Wang, Liang and Pang, Tianyu},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/zhou2026iclr-variational/}
}