VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Abstract

Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with those of fast-thinking models. For instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to that of fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow thinking, we introduce Forced Rethinking, which appends a rethinking trigger token to the end of rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances the state-of-the-art scores on MathVista and MathVerse to 80.4% and 63.5%, respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MathVision, MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with OpenAI-o1. We conduct comprehensive ablations and analysis to provide insights into the effectiveness of our approach.
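
The two techniques named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how SSR selects samples or what the rethinking trigger looks like, so the helper names (`group_advantages`, `select_for_replay`, `force_rethink`), the weighting by absolute advantage, and the trigger string are all assumptions made for illustration. The sketch only shows why group-normalized advantages vanish when every rollout in a group gets the same reward, one plausible way to replay the samples that still carry signal, and how a trigger can be appended to a rollout.

```python
import random
from statistics import mean, pstdev

# Hypothetical trigger text; the abstract only says "a rethinking trigger token".
RETHINK_TRIGGER = "\n\nWait, let me re-check my reasoning."


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: each reward normalized by its group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_for_replay(rollouts, advantages, k, rng=random, tol=1e-8):
    """Assumed SSR-style selection: drop zero-advantage samples, then draw k
    samples with probability proportional to |advantage|."""
    kept = [(r, a) for r, a in zip(rollouts, advantages) if abs(a) > tol]
    if not kept:
        return []  # the whole group had identical rewards: no learning signal
    weights = [abs(a) for _, a in kept]
    return rng.choices(kept, weights=weights, k=k)


def force_rethink(rollout_text, trigger=RETHINK_TRIGGER):
    """Append the trigger so that generation continues with a reflection step."""
    return rollout_text + trigger


# A group where every rollout is correct yields all-zero advantages.
uniform = group_advantages([1, 1, 1, 1])
# A mixed group keeps a non-zero signal.
mixed = group_advantages([1, 0, 0, 0])
replayed = select_for_replay(["a", "b", "c", "d"], mixed, k=2, rng=random.Random(0))
```

In this sketch, a group with identical rewards contributes nothing to the policy gradient, which is the situation the abstract calls vanishing advantages; replaying only non-zero-advantage samples is one way to keep each batch informative.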

Cite

Text

Wang et al. "VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning." Advances in Neural Information Processing Systems, 2025.

Markdown

[Wang et al. "VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/wang2025neurips-vlrethinker/)

BibTeX

@inproceedings{wang2025neurips-vlrethinker,
  title     = {{VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning}},
  author    = {Wang, Haozhe and Qu, Chao and Huang, Zuming and Chu, Wei and Lin, Fangzhen and Chen, Wenhu},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/wang2025neurips-vlrethinker/}
}