SFT or RL? An Early Investigation into Training R1-like Reasoning Large Vision-Language Models

Abstract

This work explores two distinct approaches for enhancing reasoning abilities in Large Vision-Language Models (LVLMs): supervised fine-tuning (SFT) and reinforcement learning (RL). To support the SFT approach, we curate a multimodal reasoning dataset with complete reasoning traces guided by DeepSeek-R1. For the RL approach, we focus on GRPO and develop a training framework tailored to vision-language tasks, with a composite reward system comprising four signals that address both visual perception and reasoning challenges. Our extensive experiments reveal that RL is a significantly more effective strategy than SFT for training reasoning LVLMs. While SFT can assist models that initially struggle to follow reasoning instructions, it often induces "pseudo aha moments" that degrade overall reasoning performance, implying that only a minimal amount of SFT data is necessary. In contrast, RL leads to substantial improvements, outperforming recent baseline models on a range of math reasoning tasks by at least 2% on average. We also present several intriguing findings: for example, combining SFT and GRPO also hurts model performance, and stronger instruction-aligned LVLMs consistently lead to better results in RL. We hope these findings provide valuable insights into the development of reasoning-capable LVLMs and guide future research in this area.
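As a rough illustration of the RL setup mentioned above, the sketch below shows how a composite reward might be combined into a scalar and turned into GRPO-style group-relative advantages. It is not the authors' implementation. The abstract does not name the four reward signals or say how they are combined, so the equal-weight sum, the function names `composite_reward` and `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def composite_reward(signals, weights=None):
    """Combine several reward signals for one response into a scalar.

    Assumption: a weighted sum with equal weights by default. The actual
    signals and their weighting are not specified in the abstract.
    """
    if weights is None:
        weights = [1.0] * len(signals)
    if len(weights) != len(signals):
        raise ValueError("signals and weights must have the same length")
    return sum(w * s for w, s in zip(weights, signals))


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for a group of responses to the same prompt.

    Each reward is normalized by the group's mean and standard deviation,
    so no learned value function is needed. The population standard
    deviation and eps are illustrative choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, each scored by four
# hypothetical (unnamed) reward signals in [0, 1].
group_signals = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]
rewards = [composite_reward(s) for s in group_signals]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and those below receive negative ones. When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.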

Cite

Text

Chen et al. "SFT or RL? An Early Investigation into Training R1-like Reasoning Large Vision-Language Models." Transactions on Machine Learning Research, 2025.

Markdown

[Chen et al. "SFT or RL? An Early Investigation into Training R1-like Reasoning Large Vision-Language Models." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/chen2025tmlr-sft/)

BibTeX

@article{chen2025tmlr-sft,
  title     = {{SFT or RL? An Early Investigation into Training R1-like Reasoning Large Vision-Language Models}},
  author    = {Chen, Hardy and Tu, Haoqin and Wang, Fali and Liu, Hui and Tang, Xianfeng and Du, Xinya and Zhou, Yuyin and Xie, Cihang},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/chen2025tmlr-sft/}
}