CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models
Abstract
Recent advances in Large Vision-Language Models (LVLMs) have enabled general-purpose vision tasks through visual instruction tuning. While existing LVLMs can generate segmentation masks from text prompts for single images, they struggle with segmentation-grounded reasoning across images, especially at finer granularities such as object parts. In this paper, we introduce the new task of part-focused semantic co-segmentation, which involves identifying and segmenting common objects and their constituent common and unique parts across images. To address this task, we present CALICO, the first LVLM designed for multi-image part-level reasoning segmentation. CALICO features two key components: a novel Correspondence Extraction Module that identifies semantic part-level correspondences, and Correspondence Adaptation Modules that embed this information into the LVLM to facilitate multi-image understanding in a parameter-efficient manner. To support training and evaluation, we curate MixedParts, a large-scale multi-image segmentation dataset containing 2.4M samples across 44K images spanning diverse object and part categories. Experimental results demonstrate that CALICO, with just 0.3% of its parameters finetuned, achieves strong performance on this challenging task.
Cite
Text
Nguyen et al. "CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00429
Markdown
[Nguyen et al. "CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/nguyen2025cvpr-calico/) doi:10.1109/CVPR52734.2025.00429
BibTeX
@inproceedings{nguyen2025cvpr-calico,
title = {{CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models}},
author = {Nguyen, Kiet A. and Juvekar, Adheesh and Yu, Tianjiao and Wahed, Muntasir and Lourentzou, Ismini},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {4550-4561},
doi = {10.1109/CVPR52734.2025.00429},
url = {https://mlanthology.org/cvpr/2025/nguyen2025cvpr-calico/}
}