CLIP the Gap: A Single Domain Generalization Approach for Object Detection

Abstract

Single Domain Generalization (SDG) tackles the problem of training a model on a single source domain so that it generalizes to any unseen target domain. While this has been well studied for image classification, the literature on SDG object detection remains almost non-existent. To address the challenges of simultaneously learning robust object localization and representation, we propose to leverage a pre-trained vision-language model to introduce semantic domain concepts via textual prompts. We achieve this via a semantic augmentation strategy acting on the features extracted by the detector backbone, as well as a text-based classification loss. Our experiments demonstrate the benefits of our approach, outperforming the only existing SDG object detection method, Single-DGOD [49], by 10% on their own diverse-weather driving benchmark.
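The text-based classification loss mentioned in the abstract can be sketched roughly as follows: region features from the detector are matched against CLIP-style text embeddings of class prompts, and the cosine-similarity logits are trained with cross-entropy. All names, shapes, and the temperature value here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project vectors onto the unit sphere, as CLIP does before matching.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def text_based_cls_loss(region_feats, text_embeds, labels, temperature=0.01):
    """Cross-entropy over cosine similarities between region features (N, D)
    and per-class text embeddings (C, D). Hypothetical sketch, not the
    authors' code."""
    r = l2_normalize(region_feats)           # (N, D) unit-norm region features
    t = l2_normalize(text_embeds)            # (C, D) unit-norm class prompts
    logits = r @ t.T / temperature           # (N, C) scaled similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # stable log-softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Toy example: 2 regions whose features nearly match their class prompts.
rng = np.random.default_rng(0)
text_embeds = rng.normal(size=(3, 4))                     # 3 classes, 4-dim
region_feats = text_embeds[[0, 2]] + 0.01 * rng.normal(size=(2, 4))
loss = text_based_cls_loss(region_feats, text_embeds, np.array([0, 2]))
print(loss)
```

Since each toy region feature nearly coincides with its class prompt, the loss comes out close to zero; in the real method the text encoder would be a frozen CLIP model and the region features would come from the detector head.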

Cite

Text

Vidit et al. "CLIP the Gap: A Single Domain Generalization Approach for Object Detection." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.00314

Markdown

[Vidit et al. "CLIP the Gap: A Single Domain Generalization Approach for Object Detection." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/vidit2023cvpr-clip/) doi:10.1109/CVPR52729.2023.00314

BibTeX

@inproceedings{vidit2023cvpr-clip,
  title     = {{CLIP the Gap: A Single Domain Generalization Approach for Object Detection}},
  author    = {Vidit, Vidit and Engilberge, Martin and Salzmann, Mathieu},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {3219--3229},
  doi       = {10.1109/CVPR52729.2023.00314},
  url       = {https://mlanthology.org/cvpr/2023/vidit2023cvpr-clip/}
}