Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models

Abstract

Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision–language models (VLMs). Group Relative Policy Optimization (GRPO) is a prominent recent method that encourages models to generate complete reasoning traces before answering, which increases token usage and computational cost. Inspired by human thinking, where people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide *when reasoning is necessary*. To realize this, we propose TON, a two-stage training strategy: **(i)** a supervised fine-tuning (SFT) stage with a simple yet effective “**thought dropout**” operation, in which reasoning traces are randomly replaced with empty thoughts, introducing a think-or-not format that serves as a cold start for selective reasoning; and **(ii)** a GRPO stage that lets the model freely explore when to think or not while maximizing task-aware outcome rewards. Experimental results show that TON can *reduce the completion length by up to **90%** compared to vanilla GRPO, without sacrificing performance or even improving it*. Further evaluations across LLM (GSM8K), VLM (CLEVR, Super-CLEVR, GeoQA), and agentic (AITZ) tasks, covering a range of reasoning difficulties with both 3B and 7B models, consistently reveal that the *model progressively learns to bypass unnecessary reasoning steps as training advances*. These findings shed light on the path toward human-like reasoning patterns in RL approaches. Our code is available at https://github.com/kokolerk/TON.
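
As a rough illustration of the thought dropout operation described in the abstract, the following Python sketch randomly replaces reasoning traces with empty thoughts when building SFT targets. It is a minimal sketch and not the authors' implementation: the `<think>`/`<answer>` tag format, the sample field names, the function name `thought_dropout`, and the `drop_prob` parameter are all assumptions made for illustration.

```python
import random


def thought_dropout(samples, drop_prob=0.5, seed=0):
    """Build SFT targets, randomly replacing reasoning traces with empty thoughts.

    Assumed (hypothetical) sample schema: {"prompt", "thought", "answer"}.
    Assumed (hypothetical) target format: <think>...</think><answer>...</answer>.
    """
    rng = random.Random(seed)  # local RNG so the dropout pattern is reproducible
    targets = []
    for sample in samples:
        # With probability drop_prob, drop the reasoning trace entirely.
        dropped = rng.random() < drop_prob
        thought = "" if dropped else sample["thought"]
        completion = f"<think>{thought}</think><answer>{sample['answer']}</answer>"
        targets.append({"prompt": sample["prompt"], "completion": completion})
    return targets
```

Under these assumptions, the SFT data contains both completions with a full trace and completions with an empty thought, which is the think-or-not format that the subsequent GRPO stage can then explore.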

Cite

Text

Wang et al. "Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models." Advances in Neural Information Processing Systems, 2025.

Markdown

[Wang et al. "Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/wang2025neurips-think/)

BibTeX

@inproceedings{wang2025neurips-think,
  title     = {{Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models}},
  author    = {Wang, Jiaqi and Lin, Kevin Qinghong and Cheng, James and Shou, Mike Zheng},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/wang2025neurips-think/}
}