LIRA: Inferring Segmentation in Large Multi-Modal Models with Local Interleaved Region Assistance

ICCV 2025 pp. 24056-24067

Abstract

While large multi-modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still struggle with two limitations: inaccurate segmentation and hallucinated comprehension. These limitations stem primarily from weak visual comprehension and a lack of fine-grained perception. To alleviate them, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) the Semantic-Enhanced Feature Extractor (SEFE), which improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; and (2) Interleaved Local Visual Coupling (ILVC), which extracts local features based on segmentation masks and then autoregressively generates local descriptions, offering fine-grained supervision to mitigate hallucinations. Furthermore, we find that the precision of object segmentation is positively correlated with the latent related semantics of the <seg> token. To quantify this relationship and the model's potential semantic inference ability, we introduce the Attributes Evaluation (AttrEval) dataset. Our experiments show that LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks. Code will be available at https://github.com/echo840/LIRA.
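
The abstract does not say how ILVC extracts local features from a segmentation mask. As a rough illustration of the general idea only, the sketch below uses masked average pooling, a common stand-in technique that is not confirmed to be the authors' method. The function name `masked_average_pool`, the H x W x C list layout, and the toy values are all assumptions made for this example.

```python
from typing import List, Sequence


def masked_average_pool(
    features: Sequence[Sequence[Sequence[float]]],
    mask: Sequence[Sequence[int]],
) -> List[float]:
    """Average the feature vectors at positions where the mask is nonzero.

    features: H x W grid of C-dimensional vectors (hypothetical layout).
    mask:     H x W grid of 0/1 values, e.g. a predicted segmentation mask.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, keep in zip(feat_row, mask_row):
            if keep:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        # Empty mask: return a zero vector rather than dividing by zero.
        return total
    return [t / count for t in total]


# Toy 2 x 2 feature map with 2 channels (illustrative values only).
features = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[5.0, 4.0], [7.0, 6.0]],
]
# Toy mask selecting the top-left and bottom-right positions.
mask = [
    [1, 0],
    [0, 1],
]
local_feature = masked_average_pool(features, mask)
print(local_feature)  # [4.0, 3.0]
```

In a pipeline of the kind the abstract describes, a pooled vector like `local_feature` would presumably be passed to the language model as the representation of the segmented region before it generates that region's local description. How the features are actually encoded and injected is left to the paper and the released code.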

Cite

Text

Li et al. "LIRA: Inferring Segmentation in Large Multi-Modal Models with Local Interleaved Region Assistance." International Conference on Computer Vision, 2025.

Markdown

[Li et al. "LIRA: Inferring Segmentation in Large Multi-Modal Models with Local Interleaved Region Assistance." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/li2025iccv-lira/)

BibTeX

@inproceedings{li2025iccv-lira,
  title     = {{LIRA: Inferring Segmentation in Large Multi-Modal Models with Local Interleaved Region Assistance}},
  author    = {Li, Zhang and Yang, Biao and Liu, Qiang and Zhang, Shuo and Ma, Zhiyin and Yin, Liang and Deng, Linger and Sun, Yabo and Liu, Yuliang and Bai, Xiang},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {24056-24067},
  url       = {https://mlanthology.org/iccv/2025/li2025iccv-lira/}
}