FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity

Abstract

The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal tasks, enabling more sophisticated and accurate integration of visual and textual information across various applications, including image and video captioning, visual question answering, and cross-modal retrieval. Despite their superior capabilities, VLMs still struggle with fine-grained compositional image region descriptions. Specifically, they have difficulty recognizing arbitrary segmentation masks as referential inputs, interpreting compositional aspect instructions for referencing, and precisely describing the compositional aspects of a region. However, compositionality, the ability to understand and generate novel combinations of known visual and textual components, is critical for facilitating coherent reasoning and understanding across modalities in VLMs. To address this issue, we propose OpenCompositionCap, a new dataset for multi-grained region compositional image captioning that distinguishes itself from prior works by introducing the new task of compositional aspect-aware regional image captioning. To support this endeavor, we also introduce a new VLM, FINECAPTION. The empirical results illustrate the effectiveness of our proposed model compared with other strong VLMs. In addition, we analyze the capabilities of current VLMs in recognizing various visual prompts for compositional region image captioning, highlighting areas for improvement in VLM design and training.

Cite

Text

Hua et al. "FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02306

Markdown

[Hua et al. "FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/hua2025cvpr-finecaption/) doi:10.1109/CVPR52734.2025.02306

BibTeX

@inproceedings{hua2025cvpr-finecaption,
  title     = {{FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity}},
  author    = {Hua, Hang and Liu, Qing and Zhang, Lingzhi and Shi, Jing and Kim, Soo Ye and Zhang, Zhifei and Wang, Yilin and Zhang, Jianming and Lin, Zhe and Luo, Jiebo},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {24763--24773},
  doi       = {10.1109/CVPR52734.2025.02306},
  url       = {https://mlanthology.org/cvpr/2025/hua2025cvpr-finecaption/}
}