FineRS: Fine-Grained Reasoning and Segmentation of Small Objects with Reinforcement Learning

Abstract

Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities across a wide range of vision-language tasks. However, due to restricted input resolution, MLLMs face significant challenges in precisely understanding and localizing visual details in high-resolution images, particularly when dealing with extra-small objects embedded in cluttered contexts. To address this issue, we propose FineRS, a two-stage MLLM-based reinforcement learning framework for jointly reasoning about and segmenting extremely small objects within high-resolution scenes. FineRS adopts a coarse-to-fine pipeline comprising Global Semantic Exploration (GSE) and Localized Perceptual Refinement (LPR). Specifically, GSE performs instruction-guided reasoning to generate a textual response and a coarse target region, while LPR refines this region to produce an accurate bounding box and segmentation mask. To couple the two stages, we introduce a locate-informed retrospective reward, where LPR's outputs are used to optimize GSE for more robust coarse region exploration. Additionally, we present FineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation of subtle, small-scale targets in complex high-resolution scenes. Experimental results on FineRS-4k and public datasets demonstrate that our method consistently outperforms state-of-the-art MLLM-based approaches on both instruction-guided segmentation and visual reasoning tasks.
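
The coarse-to-fine flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Box`, `crop_to_image`, and `coarse_to_fine` are hypothetical, the two stages are passed in as plain callables standing in for GSE and LPR, and it is assumed that the refinement stage reports its box in the coordinates of the cropped region. The segmentation mask and the retrospective reward are not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned box as (x0, y0, x1, y1) in pixels."""
    x0: float
    y0: float
    x1: float
    y1: float


def crop_to_image(local: Box, region: Box) -> Box:
    """Map a box given in crop coordinates back to full-image coordinates.

    Assumes the crop is the unscaled sub-image whose top-left corner is
    (region.x0, region.y0), so the mapping is a pure translation.
    """
    return Box(
        local.x0 + region.x0,
        local.y0 + region.y0,
        local.x1 + region.x0,
        local.y1 + region.y0,
    )


def coarse_to_fine(
    coarse_stage: Callable[[str], Tuple[str, Box]],
    fine_stage: Callable[[Box, str], Box],
    instruction: str,
) -> Tuple[str, Box, Box]:
    """Run a two-stage pipeline: coarse region first, refined box second.

    coarse_stage stands in for GSE: it returns a text answer and a coarse
    region in full-image coordinates.
    fine_stage stands in for LPR: it returns a box in crop coordinates.
    """
    answer, region = coarse_stage(instruction)
    local_box = fine_stage(region, instruction)
    return answer, region, crop_to_image(local_box, region)


# Usage with stub stages (fixed outputs, purely illustrative values).
def stub_coarse(instruction: str) -> Tuple[str, Box]:
    return "red", Box(1000, 2000, 1512, 2512)


def stub_fine(region: Box, instruction: str) -> Box:
    return Box(10, 20, 40, 60)


answer, region, final_box = coarse_to_fine(
    stub_coarse, stub_fine, "What color is the sign?"
)
```

Splitting the work this way lets the second stage operate on a small crop at native resolution, while the first stage only needs to be accurate enough for the crop to contain the target.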

Cite

Text

Zhang et al. "FineRS: Fine-Grained Reasoning and Segmentation of Small Objects with Reinforcement Learning." Advances in Neural Information Processing Systems, 2025.

Markdown

[Zhang et al. "FineRS: Fine-Grained Reasoning and Segmentation of Small Objects with Reinforcement Learning." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/zhang2025neurips-finers/)

BibTeX

@inproceedings{zhang2025neurips-finers,
  title     = {{FineRS: Fine-Grained Reasoning and Segmentation of Small Objects with Reinforcement Learning}},
  author    = {Zhang, Lu and Yu, Jiazuo and Xiong, Haomiao and Hu, Ping and Zhuge, Yunzhi and Lu, Huchuan and He, You},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/zhang2025neurips-finers/}
}