Can Large Vision-Language Models Correct Semantic Grounding Errors by Themselves?

Abstract

Improving semantic grounding in Vision-Language Models (VLMs) often involves collecting domain-specific training data, refining the network architecture, or modifying the training recipe. In this work, we venture in an orthogonal direction and explore self-correction in VLMs, focusing on semantic grounding. We find that VLMs can correct their own semantic grounding mistakes when properly prompted and framed for the task, without any fine-tuning or even access to oracle feedback. Building on this, we introduce an iterative self-correction framework that consistently improves semantic grounding performance by up to 8.4 accuracy points across all models investigated, without requiring fine-tuning, architectural changes, or external data. Our exploration also reveals that, even after several rounds of feedback, strong models like GPT-4V and GPT-4o remain limited in their ability to leverage oracle feedback, suggesting promising directions for further research.

Cite

Text

Liao et al. "Can Large Vision-Language Models Correct Semantic Grounding Errors by Themselves?" Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01367

Markdown

[Liao et al. "Can Large Vision-Language Models Correct Semantic Grounding Errors by Themselves?" Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/liao2025cvpr-large/) doi:10.1109/CVPR52734.2025.01367

BibTeX

@inproceedings{liao2025cvpr-large,
  title     = {{Can Large Vision-Language Models Correct Semantic Grounding Errors by Themselves?}},
  author    = {Liao, Yuan-Hong and Mahmood, Rafid and Fidler, Sanja and Acuna, David},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {14667--14678},
  doi       = {10.1109/CVPR52734.2025.01367},
  url       = {https://mlanthology.org/cvpr/2025/liao2025cvpr-large/}
}