LIRA: Inferring Segmentation in Large Multi-Modal Models with Local Interleaved Region Assistance

Li, Zhang; Yang, Biao; Liu, Qiang; Zhang, Shuo; Ma, Zhiyin; Yin, Liang; Deng, Linger; Sun, Yabo; Liu, Yuliang; Bai, Xiang

ICCV 2025 pp. 24056-24067

Abstract

While large multi-modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still suffer from two limitations: inaccurate segmentation and hallucinated comprehension. These limitations stem primarily from weak visual comprehension and a lack of fine-grained perception. To alleviate them, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) the Semantic-Enhanced Feature Extractor (SEFE), which improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; and (2) Interleaved Local Visual Coupling (ILVC), which extracts local features based on segmentation masks and then autoregressively generates local descriptions, offering fine-grained supervision to mitigate hallucinations. Furthermore, we find that the precision of object segmentation is positively correlated with the latent related semantics of the <seg> token. To quantify this relationship and the model's potential semantic inference ability, we introduce the Attributes Evaluation (AttrEval) dataset. Our experiments show that LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks. Code will be available at https://github.com/echo840/LIRA.
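
To make "local features based on segmentation masks" concrete, the sketch below averages the feature vectors at the positions a binary mask selects. This is masked average pooling, a common generic technique used here only for illustration; the abstract does not say how LIRA extracts its local features, and the function name `masked_average_pool` and the toy data are hypothetical.

```python
def masked_average_pool(features, mask):
    """Average the feature vectors at positions where mask is nonzero.

    features: H x W x C nested lists of floats (a toy feature map).
    mask:     H x W nested lists of 0/1 (a toy segmentation mask).
    Returns a length-C list; a zero vector if the mask is empty.
    NOTE: illustrative assumption, not the pooling used by LIRA.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, selected in zip(feat_row, mask_row):
            if selected:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        return total  # empty mask: nothing to average
    return [t / count for t in total]


# Toy 2 x 2 feature map with 2 channels per position.
features = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[5.0, 4.0], [7.0, 6.0]],
]
# The mask selects the top-left and bottom-right positions.
mask = [
    [1, 0],
    [0, 1],
]

local_feature = masked_average_pool(features, mask)
print(local_feature)  # [4.0, 3.0]
```

In a pipeline of the kind the abstract describes, a pooled vector like `local_feature` would stand in for the region representation from which a local description is generated.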

Cite

Text

Li et al. "LIRA: Inferring Segmentation in Large Multi-Modal Models with Local Interleaved Region Assistance." International Conference on Computer Vision, 2025.

Markdown

[Li et al. "LIRA: Inferring Segmentation in Large Multi-Modal Models with Local Interleaved Region Assistance." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/li2025iccv-lira/)

BibTeX

@inproceedings{li2025iccv-lira,
  title     = {{LIRA: Inferring Segmentation in Large Multi-Modal Models with Local Interleaved Region Assistance}},
  author    = {Li, Zhang and Yang, Biao and Liu, Qiang and Zhang, Shuo and Ma, Zhiyin and Yin, Liang and Deng, Linger and Sun, Yabo and Liu, Yuliang and Bai, Xiang},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {24056--24067},
  url       = {https://mlanthology.org/iccv/2025/li2025iccv-lira/}
}