SAM-CLIP: Merging Vision Foundation Models Towards Semantic and Spatial Understanding
Abstract
The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe to efficiently merge VFMs into a unified model that absorbs their expertise. Our method integrates techniques of multi-task learning, continual learning, and distillation. Further, it demands significantly less computational cost than traditional multi-task training from scratch, and it needs only a small fraction of the pre-training datasets that were initially used to train the individual models. By applying our method to SAM and CLIP, we obtain SAM-CLIP: a unified model that combines the capabilities of SAM and CLIP into a single vision transformer. Compared with deploying SAM and CLIP independently, our merged model, SAM-CLIP, reduces storage and compute costs for inference, making it well-suited for edge device applications. We show that SAM-CLIP not only retains the foundational strengths of SAM and CLIP, but also introduces synergistic functionalities, notably in zero-shot semantic segmentation, where SAM-CLIP establishes new state-of-the-art results on 5 benchmarks. It outperforms previous models specifically designed for this task by a large margin, including +6.8% and +5.9% mean IoU improvements on the Pascal VOC and COCO-Stuff datasets, respectively.
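The abstract describes merging the two frozen teachers into one student via multi-task distillation, using only small subsets of each teacher's original pre-training data. The snippet below is a minimal, self-contained sketch of that general idea, not the paper's implementation: the backbone, heads, teacher modules, and loss weights (`lambda_sam`, `lambda_clip`) are hypothetical placeholders chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the real checkpoints: a shared student backbone
# plus a lightweight head per teacher (spatial / semantic).
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 256))
sam_head = nn.Linear(256, 256)    # maps into the SAM-style embedding space
clip_head = nn.Linear(256, 512)   # maps into the CLIP-style embedding space

# Frozen teachers (placeholders matching the head output sizes).
sam_teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 256)).eval()
clip_teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 512)).eval()
for p in list(sam_teacher.parameters()) + list(clip_teacher.parameters()):
    p.requires_grad_(False)

params = list(backbone.parameters()) + list(sam_head.parameters()) + list(clip_head.parameters())
opt = torch.optim.AdamW(params, lr=1e-4)

def merge_step(sam_images, clip_images, lambda_sam=1.0, lambda_clip=1.0):
    """One multi-task distillation step on small replay batches drawn from
    each teacher's pre-training data (hypothetical sampling not shown)."""
    # Distill SAM: match the student's SAM-space embeddings to the frozen teacher.
    loss_sam = F.mse_loss(sam_head(backbone(sam_images)), sam_teacher(sam_images))
    # Distill CLIP: align CLIP-space embeddings via cosine distance.
    s = F.normalize(clip_head(backbone(clip_images)), dim=-1)
    t = F.normalize(clip_teacher(clip_images), dim=-1)
    loss_clip = (1 - (s * t).sum(dim=-1)).mean()
    # Weighted multi-task objective over both teachers.
    loss = lambda_sam * loss_sam + lambda_clip * loss_clip
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example usage with random tensors standing in for image batches.
print(merge_step(torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224)))
```

Sharing a single backbone across per-teacher heads is what keeps inference storage and compute close to a single model, which is the deployment benefit the abstract highlights.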
Cite
Text
Wang et al. "SAM-CLIP: Merging Vision Foundation Models Towards Semantic and Spatial Understanding." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00367
Markdown
[Wang et al. "SAM-CLIP: Merging Vision Foundation Models Towards Semantic and Spatial Understanding." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/wang2024cvprw-samclip/) doi:10.1109/CVPRW63382.2024.00367
BibTeX
@inproceedings{wang2024cvprw-samclip,
title = {{SAM-CLIP: Merging Vision Foundation Models Towards Semantic and Spatial Understanding}},
author = {Wang, Haoxiang and Vasu, Pavan Kumar Anasosalu and Faghri, Fartash and Vemulapalli, Raviteja and Farajtabar, Mehrdad and Mehta, Sachin and Rastegari, Mohammad and Tuzel, Oncel and Pouransari, Hadi},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2024},
pages = {3635--3647},
doi = {10.1109/CVPRW63382.2024.00367},
url = {https://mlanthology.org/cvprw/2024/wang2024cvprw-samclip/}
}