Progressive Multi-Granular Alignments for Grounded Reasoning in Large Vision-Language Models

Abstract

Existing Large Vision-Language Models (LVLMs) excel at matching concepts across multi-modal inputs but struggle with compositional concepts and high-level relationships between entities. This paper introduces Progressive multi-granular Vision-Language alignments (PromViL), a novel framework that enhances LVLMs' ability to perform grounded compositional visual reasoning tasks. Our approach constructs a hierarchical structure of multi-modal alignments, ranging from simple to complex concepts. By progressively aligning textual descriptions with their corresponding visual regions, our model learns to leverage contextual information from lower levels to inform higher-level reasoning. To facilitate this learning process, we introduce a data generation process that creates a novel dataset derived from Visual Genome, providing a wide range of nested compositional vision-language pairs. Experimental results demonstrate that our PromViL framework significantly outperforms baselines on various visual grounding and compositional question answering tasks.
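
As a rough illustration of what "nested compositional vision-language pairs" ordered from simple to complex could look like, the sketch below pairs phrases with image regions and lists them from lowest to highest level. It is not the paper's implementation or data format: the `Alignment` class, the `progressive_order` helper, the box layout, and the example phrases and coordinates are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Alignment:
    """One phrase paired with one image region, plus the simpler pairs it is built from."""
    phrase: str
    box: Box
    parts: List["Alignment"] = field(default_factory=list)

    def level(self) -> int:
        # Level 0 is a simple concept; each layer of composition adds one.
        if not self.parts:
            return 0
        return 1 + max(part.level() for part in self.parts)


def progressive_order(root: Alignment) -> List[Alignment]:
    """Return every pair in the hierarchy, simplest first (illustrative helper)."""
    nodes: List[Alignment] = []

    def visit(node: Alignment) -> None:
        for part in node.parts:
            visit(part)
        nodes.append(node)

    visit(root)
    # sorted() is stable, so pairs on the same level keep their traversal order.
    return sorted(nodes, key=lambda node: node.level())


# Made-up example values, for illustration only.
man = Alignment("the man", (90.0, 60.0, 200.0, 320.0))
umbrella = Alignment("a red umbrella", (120.0, 40.0, 220.0, 130.0))
scene = Alignment("the man holding a red umbrella", (90.0, 40.0, 220.0, 320.0), [man, umbrella])

for pair in progressive_order(scene):
    print(pair.level(), pair.phrase, pair.box)
```

In this toy ordering the two simple phrases come before the composite phrase that contains them, which mirrors the idea of using lower-level alignments as context for higher-level ones.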

Cite

Text

Le et al. "Progressive Multi-Granular Alignments for Grounded Reasoning in Large Vision-Language Models." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I4.32471

Markdown

[Le et al. "Progressive Multi-Granular Alignments for Grounded Reasoning in Large Vision-Language Models." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/le2025aaai-progressive/) doi:10.1609/AAAI.V39I4.32471

BibTeX

@inproceedings{le2025aaai-progressive,
  title     = {{Progressive Multi-Granular Alignments for Grounded Reasoning in Large Vision-Language Models}},
  author    = {Le, Quang-Hung and Dang, Long Hoang and Le, Ngan Hoang and Tran, Truyen and Le, Thao Minh},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {4473--4481},
  doi       = {10.1609/AAAI.V39I4.32471},
  url       = {https://mlanthology.org/aaai/2025/le2025aaai-progressive/}
}