CLIP-Art: Contrastive Pre-Training for Fine-Grained Art Classification

Abstract

Existing computer vision research on artwork struggles with recognizing fine-grained attributes and with the lack of curated annotated datasets, which are costly to create. In this work, we use CLIP (Contrastive Language-Image Pre-Training) [12] to train a neural network on a variety of art image and text pairs, learning directly from raw descriptions of images or, when available, curated labels. The model's zero-shot capability allows it to predict the most relevant natural language description for a given image without being directly optimized for the task. Our approach aims to solve two challenges: instance retrieval and fine-grained artwork attribute recognition. We use the iMet Dataset [20], which we consider to be the largest annotated artwork dataset. Our code and models will be available at https://github.com/KeremTurgutlu/clip_art
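The zero-shot prediction described above works by embedding an image and a set of candidate text prompts into a shared space and ranking the prompts by cosine similarity. Below is a minimal sketch using OpenAI's `clip` package; the checkpoint name, prompt template, attribute labels, and image path are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal zero-shot attribute-prediction sketch with OpenAI's CLIP package
# (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical fine-grained artwork attributes (iMet-style labels).
labels = ["porcelain vase", "bronze sculpture", "woodblock print", "oil painting"]
text = clip.tokenize([f"a photo of a {label}" for label in labels]).to(device)

image = preprocess(Image.open("artwork.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Cosine similarity: L2-normalize both embeddings, take the dot product,
# then softmax over the candidate labels to rank them.
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Instance retrieval uses the same similarity in the other direction: embed a text query once and rank a gallery of precomputed image embeddings by cosine similarity.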

Cite

Text

Conde and Turgutlu. "CLIP-Art: Contrastive Pre-Training for Fine-Grained Art Classification." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2021. doi:10.1109/CVPRW53098.2021.00444

Markdown

[Conde and Turgutlu. "CLIP-Art: Contrastive Pre-Training for Fine-Grained Art Classification." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2021.](https://mlanthology.org/cvprw/2021/conde2021cvprw-clipart/) doi:10.1109/CVPRW53098.2021.00444

BibTeX

@inproceedings{conde2021cvprw-clipart,
  title     = {{CLIP-Art: Contrastive Pre-Training for Fine-Grained Art Classification}},
  author    = {Conde, Marcos V. and Turgutlu, Kerem},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2021},
  pages     = {3956--3960},
  doi       = {10.1109/CVPRW53098.2021.00444},
  url       = {https://mlanthology.org/cvprw/2021/conde2021cvprw-clipart/}
}