Recognize Anything: A Strong Image Tagging Model

Abstract

We present the Recognize Anything Model (RAM): a strong foundation model for image tagging. RAM makes a substantial step for foundation models in computer vision, demonstrating the zero-shot ability to recognize any common category with high accuracy. By leveraging large-scale image-text pairs for training instead of manual annotations, RAM introduces a new paradigm for image tagging.The development of RAM comprises four key steps. Firstly, annotation-free image tags are obtained at scale through automatic text semantic parsing. Subsequently, a preliminary model is trained for automatic annotation by unifying the captioning and tagging tasks, supervised by the original texts and parsed tags, respectively. Thirdly, a data engine is employed to generate additional annotations and clean incorrect ones. Lastly, the model is retrained with the processed data and fine-tuned using a smaller but higher-quality dataset.We evaluate the tagging capability of RAM on numerous benchmarks and observe an impressive zero-shot performance, which significantly outperforms CLIP and BLIP. Remarkably, RAM even surpasses fully supervised models and exhibits a competitive performance compared with the Google tagging API. We have released RAM at https://recognize-anything.github.io/ to foster the advancement of foundation models in computer vision.

Cite

Text

Zhang et al. "Recognize Anything: A Strong Image Tagging Model." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00179

Markdown

[Zhang et al. "Recognize Anything: A Strong Image Tagging Model." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/zhang2024cvprw-recognize/) doi:10.1109/CVPRW63382.2024.00179

BibTeX

@inproceedings{zhang2024cvprw-recognize,
  title     = {{Recognize Anything: A Strong Image Tagging Model}},
  author    = {Zhang, Youcai and Huang, Xinyu and Ma, Jinyu and Li, Zhaoyang and Luo, Zhaochuan and Xie, Yanchun and Qin, Yuzhuo and Luo, Tong and Li, Yaqian and Liu, Shilong and Guo, Yandong and Zhang, Lei},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {1724-1732},
  doi       = {10.1109/CVPRW63382.2024.00179},
  url       = {https://mlanthology.org/cvprw/2024/zhang2024cvprw-recognize/}
}