Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation

Abstract

Open-Vocabulary Part Segmentation (OVPS) is an emerging field for recognizing fine-grained parts in unseen categories. We identify two primary challenges in OVPS: (1) the difficulty in aligning part-level image-text correspondence, and (2) the lack of structural understanding in segmenting object parts. To address these issues, we propose PartCATSeg, a novel framework that integrates object-aware part-level cost aggregation, compositional loss, and structural guidance from DINO. Our approach employs a disentangled cost aggregation strategy that handles object and part-level costs separately, enhancing the precision of part-level segmentation. We also introduce a compositional loss to better capture part-object relationships, compensating for the limited part annotations. Additionally, structural guidance from DINO features improves boundary delineation and inter-part understanding. Extensive experiments on Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets demonstrate that our method significantly outperforms state-of-the-art approaches, setting a new baseline for robust generalization to unseen part categories.
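The abstract's central object is an image-text cost volume computed separately at the object level and the part level. As a rough illustration only, the NumPy sketch below builds two cosine-similarity cost volumes and combines them by gating each part score with the score of its parent object. The function names (`cost_volume`, `object_aware_part_scores`), the `part_to_object` mapping, the temperature `tau`, and the gating rule itself are all assumptions made for this example; they are not taken from PartCATSeg, whose learned aggregation modules, compositional loss, and DINO guidance are not reproduced here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def cost_volume(pixel_feats, text_embeds):
    """Cosine similarity between every pixel feature and every text embedding.

    pixel_feats: (H, W, D) dense image features
    text_embeds: (N, D) one embedding per class name
    returns:     (H, W, N) cost volume with values in [-1, 1]
    """
    p = l2_normalize(pixel_feats)
    t = l2_normalize(text_embeds)
    return np.einsum("hwd,nd->hwn", p, t)


def object_aware_part_scores(object_cost, part_cost, part_to_object, tau=0.1):
    """Hypothetical combination rule (an assumption, not the paper's module).

    Each part probability is multiplied by the probability of its parent
    object, so a part is only scored highly where its object is plausible.

    object_cost:    (H, W, O) object-level cost volume
    part_cost:      (H, W, P) part-level cost volume
    part_to_object: (P,) integer index of the parent object of each part
    returns:        (H, W, P) non-negative scores
    """
    object_prob = softmax(object_cost / tau)
    part_prob = softmax(part_cost / tau)
    # Fancy indexing on the last axis broadcasts parent scores to (H, W, P).
    return part_prob * object_prob[..., part_to_object]
```

In this sketch the two cost volumes are kept separate until the final multiplication, which loosely mirrors the idea of handling object-level and part-level costs independently; a real system would replace the fixed gating with learned aggregation layers.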

Cite

Text

Choi et al. "Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00914

Markdown

[Choi et al. "Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/choi2025cvpr-finegrained/) doi:10.1109/CVPR52734.2025.00914

BibTeX

@inproceedings{choi2025cvpr-finegrained,
  title     = {{Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation}},
  author    = {Choi, Jiho and Lee, Seonho and Lee, Minhyun and Lee, Seungho and Shim, Hyunjung},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {9782--9793},
  doi       = {10.1109/CVPR52734.2025.00914},
  url       = {https://mlanthology.org/cvpr/2025/choi2025cvpr-finegrained/}
}