Patch Matters: Training-Free Fine-Grained Image Caption Enhancement via Local Perception

Abstract

High-quality image captions play a crucial role in improving the performance of cross-modal applications such as text-to-image generation, text-to-video generation, and text-image retrieval. To generate long-form, high-quality captions, many recent studies have employed multimodal large language models (MLLMs). However, current MLLMs often produce captions that lack fine-grained details or suffer from hallucinations, a challenge that persists in both open-source and closed-source models. Inspired by Feature Integration Theory, which suggests that attention must focus on specific regions to integrate visual information effectively, we propose a divide-then-aggregate strategy. Our method first divides the image into semantic and spatial patches to extract fine-grained details, enhancing the model's local perception of the image. These local details are then hierarchically aggregated into a comprehensive global description. To address hallucinations and inconsistencies in the generated captions, we apply semantic-level filtering during hierarchical aggregation. This training-free pipeline can be applied to both open-source models (LLaVA-1.5, LLaVA-1.6, Mini-Gemini) and closed-source models (Claude-3.5-Sonnet, GPT-4o, GLM-4V-Plus). Extensive experiments demonstrate that our method generates more detailed and reliable captions, advancing multimodal description generation without requiring model retraining.
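
The sketch below illustrates the divide-then-aggregate idea described in the abstract. It is a minimal illustration, not the authors' implementation: the callables caption_fn, merge_fn, and filter_fn are hypothetical wrappers around an MLLM/LLM, the fixed 2x2 spatial grid is an assumption, and the semantic patches and the paper's full hierarchical scheme are omitted.

from typing import Callable, List
from PIL import Image

def spatial_patches(image: Image.Image, grid: int = 2) -> List[Image.Image]:
    """Split the image into a grid x grid set of non-overlapping crops."""
    w, h = image.size
    pw, ph = w // grid, h // grid
    return [
        image.crop((c * pw, r * ph, (c + 1) * pw, (r + 1) * ph))
        for r in range(grid)
        for c in range(grid)
    ]

def divide_then_aggregate(
    image: Image.Image,
    caption_fn: Callable[[Image.Image], str],  # hypothetical MLLM call: image -> caption
    merge_fn: Callable[[List[str]], str],      # hypothetical LLM call: captions -> merged text
    filter_fn: Callable[[str, str], str],      # hypothetical filter: drop claims in arg 1 unsupported by arg 2
) -> str:
    """Caption local patches first, then hierarchically merge them into one description."""
    # Local perception: caption each spatial patch to surface fine-grained detail.
    local = [caption_fn(p) for p in spatial_patches(image)]
    # First aggregation level: merge neighbouring patch captions pairwise.
    mid = [merge_fn(local[i : i + 2]) for i in range(0, len(local), 2)]
    # A whole-image caption provides global context for consistency checking.
    global_caption = caption_fn(image)
    # Semantic-level filtering: keep only patch-level claims that are
    # consistent with the global view before the final merge.
    mid = [filter_fn(m, global_caption) for m in mid]
    return merge_fn(mid + [global_caption])

In use, caption_fn and merge_fn could wrap any of the MLLMs named in the abstract; since the pipeline only composes model calls, it requires no retraining.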

Cite

Text

Peng et al. "Patch Matters: Training-Free Fine-Grained Image Caption Enhancement via Local Perception." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00375

Markdown

[Peng et al. "Patch Matters: Training-Free Fine-Grained Image Caption Enhancement via Local Perception." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/peng2025cvpr-patch/) doi:10.1109/CVPR52734.2025.00375

BibTeX

@inproceedings{peng2025cvpr-patch,
  title     = {{Patch Matters: Training-Free Fine-Grained Image Caption Enhancement via Local Perception}},
  author    = {Peng, Ruotian and He, Haiying and Wei, Yake and Wen, Yandong and Hu, Di},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {3963--3973},
  doi       = {10.1109/CVPR52734.2025.00375},
  url       = {https://mlanthology.org/cvpr/2025/peng2025cvpr-patch/}
}