DetailCLIP: Detail-Oriented CLIP for Fine-Grained Tasks
Abstract
In this paper, we introduce DetailCLIP, a self-improving vision-language foundation model designed to enhance fine-grained feature understanding through self-supervised learning. Foundation models like CLIP have demonstrated strong performance in global image-text alignment but often fail to capture detail-oriented features necessary for tasks such as segmentation. To address this, DetailCLIP integrates self-curated learning objectives that iteratively improve both high-level semantics and detailed visual representations. Specifically, our method employs patch-level self-distillation and pixel-level reconstruction losses to generate refined internal representations, while an attention-based token filtering mechanism curates semantically relevant information during training. By generating and refining self-curated learning signals, DetailCLIP improves segmentation performance and demonstrates superior generalization across diverse tasks. These task-agnostic objectives position DetailCLIP as a self-improving foundation model, enhancing multi-modal systems like CLIP with fine-grained feature understanding.
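The abstract combines three ingredients: attention-based token filtering, patch-level self-distillation, and pixel-level reconstruction. The following is a minimal PyTorch sketch of how such objectives can be combined; the function names, shapes, keep ratio, and temperature are illustrative assumptions, not the authors' implementation.

```python
# Sketch of attention-filtered patch self-distillation + masked pixel reconstruction.
# All names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F


def filter_tokens_by_attention(tokens, cls_attention, keep_ratio=0.5):
    """Keep the patch tokens that receive the most [CLS] attention.

    tokens:        (B, N, D) patch embeddings
    cls_attention: (B, N) attention weights from the [CLS] token to each patch
    """
    num_keep = max(1, int(tokens.size(1) * keep_ratio))
    top_idx = cls_attention.topk(num_keep, dim=1).indices           # (B, num_keep)
    idx = top_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))     # (B, num_keep, D)
    return tokens.gather(1, idx)


def patch_self_distillation_loss(student_tokens, teacher_tokens, temperature=0.1):
    """Cross-entropy between teacher and student patch-token distributions."""
    s = F.log_softmax(student_tokens / temperature, dim=-1)
    t = F.softmax(teacher_tokens.detach() / temperature, dim=-1)    # stop-grad on teacher
    return -(t * s).sum(dim=-1).mean()


def pixel_reconstruction_loss(predicted_pixels, target_pixels, mask):
    """Mean-squared error on the masked patches only (MAE-style)."""
    per_patch = ((predicted_pixels - target_pixels) ** 2).mean(dim=-1)  # (B, N)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)


if __name__ == "__main__":
    B, N, D, P = 2, 196, 768, 16 * 16 * 3    # batch, patches, embed dim, pixels per patch
    student = torch.randn(B, N, D)           # student encoder patch tokens
    teacher = torch.randn(B, N, D)           # teacher (momentum) encoder patch tokens
    cls_attn = torch.rand(B, N)              # [CLS]-to-patch attention from the teacher
    pred_pix, tgt_pix = torch.randn(B, N, P), torch.randn(B, N, P)
    mask = (torch.rand(B, N) > 0.25).float() # 1 = masked patch to reconstruct

    kept_student = filter_tokens_by_attention(student, cls_attn)
    kept_teacher = filter_tokens_by_attention(teacher, cls_attn)
    loss = (patch_self_distillation_loss(kept_student, kept_teacher)
            + pixel_reconstruction_loss(pred_pix, tgt_pix, mask))
    print(float(loss))
```

In this sketch the teacher's [CLS] attention selects the semantically relevant patches before distillation, while the reconstruction term is applied at the pixel level on masked patches; how the two terms are weighted is left open here.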
Cite
Text
Monsefi et al. "DetailCLIP: Detail-Oriented CLIP for Fine-Grained Tasks." ICLR 2025 Workshops: SSI-FM, 2025.
Markdown
[Monsefi et al. "DetailCLIP: Detail-Oriented CLIP for Fine-Grained Tasks." ICLR 2025 Workshops: SSI-FM, 2025.](https://mlanthology.org/iclrw/2025/monsefi2025iclrw-detailclip/)
BibTeX
@inproceedings{monsefi2025iclrw-detailclip,
title = {{DetailCLIP: Detail-Oriented CLIP for Fine-Grained Tasks}},
author = {Monsefi, Amin Karimi and Sailaja, Kishore Prakash and Alilooee, Ali and Lim, Ser-Nam and Ramnath, Rajiv},
booktitle = {ICLR 2025 Workshops: SSI-FM},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/monsefi2025iclrw-detailclip/}
}