FineCLIP: Self-Distilled Region-Based CLIP for Better Fine-Grained Understanding
Abstract
Contrastive Language-Image Pre-training (CLIP) achieves impressive performance on tasks like image classification and image-text retrieval by learning on large-scale image-text datasets. However, CLIP struggles with dense prediction tasks due to its poor grasp of fine-grained details. Although existing works address this issue, they achieve limited improvements and usually sacrifice the important visual-semantic consistency. To overcome these limitations, we propose FineCLIP, which keeps the global contrastive learning to preserve the visual-semantic consistency and further enhances fine-grained understanding through two innovations: 1) A real-time self-distillation scheme that facilitates the transfer of representation capability from global to local features. 2) A semantically-rich regional contrastive learning paradigm with generated region-text pairs, boosting the local representation capabilities with abundant fine-grained knowledge. The two components cooperate to fully leverage diverse semantics and multi-grained complementary information. To validate the superiority of FineCLIP and the rationality of each design, we conduct extensive experiments on challenging dense prediction and image-level tasks. All the observations demonstrate the effectiveness of FineCLIP.
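The abstract names three training signals (global contrastive learning, self-distillation from global to local features, and regional contrastive learning on region-text pairs) but does not give their formulas. The following is a minimal NumPy sketch of how such terms could be combined, not the authors' implementation. The symmetric InfoNCE form, the cosine-based distillation loss, the temperature, the loss weights, and every function and argument name here are assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale each feature vector to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(a, b, temperature=0.07):
    # Symmetric InfoNCE: row i of `a` is the positive match for row i of `b`.
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (loss_ab + loss_ba))

def distillation_loss(local_feats, global_feats):
    # Assumed form: 1 - cosine similarity between a region's local feature
    # (pooled from the dense feature map) and the global embedding of the
    # same region crop, which acts as the teacher signal.
    s, t = l2_normalize(local_feats), l2_normalize(global_feats)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

def combined_loss(img, txt, region_local, region_global, region_txt,
                  w_distill=1.0, w_region=1.0):
    # Hypothetical weighted sum of the three terms described in the abstract.
    loss_global = contrastive_loss(img, txt)                  # image-text pairs
    loss_distill = distillation_loss(region_local, region_global)
    loss_region = contrastive_loss(region_local, region_txt)  # region-text pairs
    return loss_global + w_distill * loss_distill + w_region * loss_region

# Usage with random stand-in features: 4 images, 6 regions, 16 dimensions.
rng = np.random.default_rng(0)
total = combined_loss(
    img=rng.normal(size=(4, 16)),
    txt=rng.normal(size=(4, 16)),
    region_local=rng.normal(size=(6, 16)),
    region_global=rng.normal(size=(6, 16)),
    region_txt=rng.normal(size=(6, 16)),
)
```

In a real training loop the teacher features in the distillation term would typically be excluded from gradient computation, and the region features would come from an image encoder rather than random arrays; both details are outside what the abstract specifies.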
Cite
Text
Jing et al. "FineCLIP: Self-Distilled Region-Based CLIP for Better Fine-Grained Understanding." Neural Information Processing Systems, 2024. doi:10.52202/079017-0875
Markdown
[Jing et al. "FineCLIP: Self-Distilled Region-Based CLIP for Better Fine-Grained Understanding." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/jing2024neurips-fineclip/) doi:10.52202/079017-0875
BibTeX
@inproceedings{jing2024neurips-fineclip,
title = {{FineCLIP: Self-Distilled Region-Based CLIP for Better Fine-Grained Understanding}},
author = {Jing, Dong and He, Xiaolong and Luo, Yutian and Fei, Nanyi and Yang, Guoxing and Wei, Wei and Zhao, Huiwen and Lu, Zhiwu},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-0875},
url = {https://mlanthology.org/neurips/2024/jing2024neurips-fineclip/}
}