Learning to Reason over Continuous Tokens with Reinforcement Learning
Abstract
Large Language Models (LLMs) have shown strong performance on complex reasoning tasks, especially when guided by Chain-of-Thought (CoT) prompting. However, conventional CoT reasoning in the discrete token space incurs high computational and memory costs because of its verbose intermediate steps. Recent work has explored latent reasoning in the embedding space to improve efficiency, but often at the cost of clarity and performance. In this work, we propose Hybrid Reasoning (HyRea), a unified framework that enables LLMs to dynamically switch between explicit (token-based) and latent (embedding-based) reasoning during inference. To train the model to make these decisions effectively, we introduce a two-stage training pipeline: (1) a supervised cold-start phase that introduces latent reasoning by replacing low-entropy CoT steps with embeddings, and (2) a reinforcement learning phase that uses Group Relative Policy Optimization (GRPO) to fine-tune the model's reasoning strategy based on task-specific rewards. Experiments on mathematical reasoning benchmarks show that HyRea substantially reduces token usage while maintaining or improving accuracy, offering an effective and scalable solution for efficient multi-step reasoning in LLMs.
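The abstract names two training mechanisms without giving their exact formulas. The following is a minimal sketch of both under stated assumptions. `step_entropy` and `select_latent_steps` are hypothetical helpers: they average per-token Shannon entropy over one CoT step and flag steps below an arbitrary `threshold` as candidates for replacement by embeddings during the cold-start phase (the paper's actual selection rule may differ). `group_relative_advantages` shows the standard GRPO advantage, in which each sampled completion's reward is normalized by the mean and standard deviation of the rewards in its group. Rewards are plain floats here; the task-specific reward used by HyRea is not specified in the abstract.

```python
import math
from typing import List, Sequence


def step_entropy(token_probs: Sequence[Sequence[float]]) -> float:
    """Mean Shannon entropy (in nats) over the tokens of one CoT step.

    token_probs[t] is the model's next-token distribution at position t.
    Assumes the step contains at least one token.
    """
    entropies = []
    for dist in token_probs:
        # Skip zero-probability entries, since 0 * log(0) is taken as 0.
        h = -sum(p * math.log(p) for p in dist if p > 0.0)
        entropies.append(h)
    return sum(entropies) / len(entropies)


def select_latent_steps(step_entropies: Sequence[float], threshold: float) -> List[bool]:
    """Hypothetical cold-start rule: mark a step for compression into a
    latent embedding when its mean entropy is below `threshold`."""
    return [h < threshold for h in step_entropies]


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: normalize each completion's reward by the mean
    and (population) standard deviation of its group for the same prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # A near-deterministic step followed by a maximally uncertain step.
    steps = [
        [[0.97, 0.01, 0.01, 0.01]],
        [[0.25, 0.25, 0.25, 0.25]],
    ]
    ents = [step_entropy(s) for s in steps]
    print(select_latent_steps(ents, threshold=0.5))  # [True, False]

    # Four sampled answers to one prompt: two correct (1.0), two wrong (0.0).
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

In this toy example the near-deterministic step has low mean entropy and would be compressed, while the uniform step stays explicit. The sketch uses the population standard deviation; some GRPO implementations use the sample estimate instead.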
Cite
Zhao et al. "Learning to Reason over Continuous Tokens with Reinforcement Learning." International Conference on Learning Representations, 2026.
@inproceedings{zhao2026iclr-learning,
title = {{Learning to Reason over Continuous Tokens with Reinforcement Learning}},
author = {Zhao, Yiran and Xu, Yuhui and Sahoo, Doyen and Xiong, Caiming and Li, Junnan},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/zhao2026iclr-learning/}
}