VL-LTR: Learning Class-Wise Visual-Linguistic Representation for Long-Tailed Visual Recognition

Abstract

Recently, computer vision foundation models such as CLIP and ALIGN have shown impressive generalization capabilities on various downstream tasks, but their ability to handle long-tailed data remains to be verified. In this work, we present a novel framework based on pre-trained visual-linguistic models for long-tailed recognition (LTR), termed VL-LTR, and conduct empirical studies on the benefits of introducing the text modality for long-tailed recognition tasks. Compared to existing approaches, the proposed VL-LTR has the following merits: (1) it can learn not only visual representations from images but also the corresponding linguistic representations from noisy class-level text descriptions collected from the Internet; (2) it can effectively use the learned visual-linguistic representations to improve visual recognition performance, especially for classes with few image samples. We also conduct extensive experiments and set a new state of the art on widely used LTR benchmarks. Notably, our method achieves 77.2% overall accuracy on ImageNet-LT, which significantly outperforms the previous best method by over 17 points and is close to the prevailing performance of models trained on the full ImageNet. Code shall be released.
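The class-wise visual-linguistic idea described in the abstract can be sketched minimally as follows: each class is summarized by an anchor embedding pooled from its text descriptions, and an image is classified by cosine similarity between its visual embedding and these class-level text anchors. This is a hedged illustration, not the actual VL-LTR model; the encoders are replaced by stand-in random features, and all names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Suppose 3 classes, each with 5 noisy text descriptions already embedded
# into a shared 8-d space (stand-in random features, not real encoders).
text_embeddings = {c: rng.normal(size=(5, 8)) for c in range(3)}

# Class-level anchors: average the normalized sentence embeddings,
# then renormalize, so each class is a single unit vector.
class_anchors = np.stack([
    l2_normalize(l2_normalize(e).mean(axis=0))
    for e in text_embeddings.values()
])

def classify(image_embedding):
    """Predict the class whose text anchor is most similar to the image."""
    sims = class_anchors @ l2_normalize(image_embedding)
    return int(np.argmax(sims))

# An image embedding lying near class 1's anchor should be assigned class 1.
image = class_anchors[1] + 0.05 * rng.normal(size=8)
pred = classify(image)
```

In the paper's setting, the pooled text anchors would come from a pre-trained language encoder and be refined jointly with the visual branch; the sketch above only conveys the class-wise anchor-matching structure.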

Cite

Text

Tian et al. "VL-LTR: Learning Class-Wise Visual-Linguistic Representation for Long-Tailed Visual Recognition." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19806-9_5

Markdown

[Tian et al. "VL-LTR: Learning Class-Wise Visual-Linguistic Representation for Long-Tailed Visual Recognition." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/tian2022eccv-vlltr/) doi:10.1007/978-3-031-19806-9_5

BibTeX

@inproceedings{tian2022eccv-vlltr,
  title     = {{VL-LTR: Learning Class-Wise Visual-Linguistic Representation for Long-Tailed Visual Recognition}},
  author    = {Tian, Changyao and Wang, Wenhai and Zhu, Xizhou and Dai, Jifeng and Qiao, Yu},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-19806-9_5},
  url       = {https://mlanthology.org/eccv/2022/tian2022eccv-vlltr/}
}