Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
Abstract
Recent advances in large language models have significantly improved textual reasoning through the effective use of Chain-of-Thought (CoT) and reinforcement learning. However, extending these successes to vision-language tasks remains challenging due to inherent limitations of text-only CoT, such as visual hallucinations and insufficient multimodal integration. In this paper, we introduce Point-RFT, a multimodal reasoning framework explicitly designed to leverage visually grounded CoT reasoning for visual document understanding. Our approach consists of two stages: first, we perform format finetuning on a curated dataset of 71K diverse visual reasoning problems, each annotated with detailed, step-by-step rationales explicitly grounded to corresponding visual elements; second, we apply reinforcement finetuning targeting visual document understanding. On ChartQA, our approach improves accuracy from 70.88% (format-finetuned baseline) to 90.04%, surpassing the 83.92% achieved by reinforcement finetuning that relies solely on text-based CoT. This result shows that grounded CoT is more effective for multimodal reasoning than text-only CoT. Moreover, Point-RFT exhibits superior generalization across several out-of-domain visual document reasoning benchmarks, including CharXiv, PlotQA, IconQA, and TabMWP, highlighting its potential in complex real-world scenarios.
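This page does not include code, so the sketch below is only a rough illustration of the reinforcement-finetuning stage described in the abstract: a reward that combines answer correctness with a bonus for reasoning steps grounded to visual points. The <point> tag format, the 0.8/0.2 weighting, and the answer normalization are all illustrative assumptions, not the authors' published recipe.

# Hypothetical sketch of a grounded-CoT reward for the RFT stage.
# Every rule below (tag format, weights, matching) is an assumption
# made for illustration; the paper's actual reward is not shown here.
import re

# Assumed markup: each grounded step cites a pixel coordinate like
# <point>(312, 148)</point>.
POINT_TAG = re.compile(r"<point>\s*\(?\s*\d+\s*,\s*\d+\s*\)?\s*</point>")

def grounding_reward(rollout: str, gold_answer: str) -> float:
    """Score one sampled rationale: answer correctness plus a bonus
    when reasoning steps cite visual points."""
    # Treat each non-empty line of the rationale as one reasoning step.
    steps = [s for s in rollout.splitlines() if s.strip()]
    grounded = sum(1 for s in steps if POINT_TAG.search(s))
    grounding_score = grounded / max(len(steps), 1)

    # Simple containment check on the final line after normalization
    # (an assumption; ChartQA-style scoring often allows numeric tolerance).
    predicted = steps[-1].lower().strip() if steps else ""
    answer_score = 1.0 if gold_answer.lower().strip() in predicted else 0.0

    # Weighted sum: correctness dominates, grounding shapes the policy.
    return 0.8 * answer_score + 0.2 * grounding_score

if __name__ == "__main__":
    sample = (
        "The 2020 bar peaks at <point>(312, 148)</point> with value 42.\n"
        "The 2021 bar at <point>(410, 190)</point> shows 35.\n"
        "Answer: 42"
    )
    print(grounding_reward(sample, "42"))  # 0.8 + 0.2 * (2/3) ≈ 0.933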
Cite
Text
Ni et al. "Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning." Advances in Neural Information Processing Systems, 2025.
Markdown
[Ni et al. "Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/ni2025neurips-pointrft/)
BibTeX
@inproceedings{ni2025neurips-pointrft,
  title     = {{Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning}},
  author    = {Ni, Minheng and Yang, Zhengyuan and Li, Linjie and Lin, Chung-Ching and Lin, Kevin and Zuo, Wangmeng and Wang, Lijuan},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/ni2025neurips-pointrft/}
}