DetailCLIP: Detail-Oriented CLIP for Fine-Grained Tasks

Abstract

In this paper, we introduce DetailCLIP, a self-improving vision-language foundation model designed to enhance fine-grained feature understanding through self-supervised learning. Foundation models like CLIP have demonstrated strong performance in global image-text alignment but often fail to capture detail-oriented features necessary for tasks such as segmentation. To address this, DetailCLIP integrates self-curated learning objectives that iteratively improve both high-level semantics and detailed visual representations. Specifically, our method employs patch-level self-distillation and pixel-level reconstruction losses to generate refined internal representations, while an attention-based token filtering mechanism curates semantically relevant information during training. By generating and refining self-curated learning signals, DetailCLIP improves segmentation performance and demonstrates superior generalization across diverse tasks. These task-agnostic objectives position DetailCLIP as a self-improving foundation model, enhancing multi-modal systems like CLIP with fine-grained feature understanding.
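To make the abstract's training signals concrete, below is a minimal, hedged sketch of how the three objectives it names (patch-level self-distillation, pixel-level reconstruction, and attention-based token filtering) could be combined into a single loss. This is not the authors' implementation; the function names, the top-k filtering rule, and the loss weights are illustrative assumptions.

```python
# Hedged sketch (not the authors' code): one plausible combination of the
# three training signals described in the abstract. All names and the
# keep-ratio filtering rule are assumptions for illustration only.
import torch
import torch.nn.functional as F


def token_filter(tokens, attn_scores, keep_ratio=0.5):
    """Attention-based token filtering (assumed rule): keep the top-k patch
    tokens ranked by their attention score, e.g. CLS-to-patch attention.

    tokens:      (B, N, D) patch embeddings
    attn_scores: (B, N)    per-token relevance scores
    """
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = attn_scores.topk(k, dim=1).indices                      # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])      # (B, k, D)
    return torch.gather(tokens, 1, idx)


def detail_losses(student_tokens, teacher_tokens, attn_scores,
                  recon_pred, pixels, w_distill=1.0, w_recon=1.0):
    # Patch-level self-distillation: student patch embeddings match the
    # (stop-gradient) teacher embeddings on the curated token subset.
    s = token_filter(student_tokens, attn_scores)
    t = token_filter(teacher_tokens.detach(), attn_scores)
    distill = 1.0 - F.cosine_similarity(s, t, dim=-1).mean()

    # Pixel-level reconstruction: decoder prediction compared against the
    # raw pixels, in the style of masked-image-modeling objectives.
    recon = F.mse_loss(recon_pred, pixels)

    return w_distill * distill + w_recon * recon
```

In this sketch the filtering step curates which patch tokens contribute to the distillation term, which mirrors the abstract's description of attention-based curation of semantically relevant information; how the scores are computed and how the losses are weighted against CLIP's image-text objective is left unspecified here.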

Cite

Text

Monsefi et al. "DetailCLIP: Detail-Oriented CLIP for Fine-Grained Tasks." ICLR 2025 Workshops: SSI-FM, 2025.

Markdown

[Monsefi et al. "DetailCLIP: Detail-Oriented CLIP for Fine-Grained Tasks." ICLR 2025 Workshops: SSI-FM, 2025.](https://mlanthology.org/iclrw/2025/monsefi2025iclrw-detailclip/)

BibTeX

@inproceedings{monsefi2025iclrw-detailclip,
  title     = {{DetailCLIP: Detail-Oriented CLIP for Fine-Grained Tasks}},
  author    = {Monsefi, Amin Karimi and Sailaja, Kishore Prakash and Alilooee, Ali and Lim, Ser-Nam and Ramnath, Rajiv},
  booktitle = {ICLR 2025 Workshops: SSI-FM},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/monsefi2025iclrw-detailclip/}
}