Benchmarking Omni-Vision Representation Through the Lens of Visual Realms
Abstract
Though impressive performance has been achieved in specific visual realms (e.g., faces, dogs, and places), an omni-vision representation that generalizes to many natural visual domains is highly desirable. Nonetheless, existing benchmarks for evaluating visual representations, such as ImageNet, VTAB-natural, and the CLIP benchmark suite, are either limited in the spectrum of realms they cover or built by arbitrarily integrating existing datasets. In this paper, we propose the Omni-Realm Benchmark (OmniBenchmark), which enables systematically measuring generalization ability across a wide range of visual realms. OmniBenchmark first integrates concepts from Wikidata to enlarge the set of concepts under each sub-tree of WordNet. It then leverages expert knowledge from WordNet to define a comprehensive spectrum of 21 semantic realms in the natural domain, twice that of ImageNet. Finally, we manually annotate all 7,372 valid concepts, forming a 21-realm dataset with 1,074,346 images. With OmniBenchmark, we propose a hierarchical instance contrastive learning framework for learning a better omni-vision representation, i.e., Relational Contrastive learning (ReCo), which boosts the performance of representation learning across omni-realms. Since hierarchical semantic relations naturally emerge in the label systems of visual datasets, ReCo attracts representations within the same semantic realm during pre-training, enabling the model to converge faster than conventional contrastive learning when it is further fine-tuned on a specific realm. Extensive experiments demonstrate the superior performance of ReCo over state-of-the-art contrastive learning methods on both ImageNet and OmniBenchmark. Beyond that, we conduct a systematic investigation of recent advances in both architectures (from CNNs to transformers) and learning paradigms (from supervised learning to self-supervised learning) on our benchmark, revealing multiple practical observations to facilitate future research.
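To make the ReCo objective concrete, below is a minimal PyTorch sketch of a hierarchical contrastive loss in the spirit of the abstract: a standard instance-level InfoNCE term plus a term that softly attracts samples sharing a semantic realm. The function name, the supervised-contrastive-style realm term, and all hyper-parameters are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def relational_contrastive_loss(z1, z2, realm_labels, temperature=0.2, realm_weight=0.5):
    # Hypothetical sketch of a ReCo-style loss (not the authors' exact objective).
    # z1, z2: embeddings of two augmented views of the same batch, shape (N, D).
    # realm_labels: integer semantic-realm id per sample, shape (N,).
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    n = z1.size(0)

    # Instance-level InfoNCE: each sample's positive is its matching view.
    logits = z1 @ z2.t() / temperature  # (N, N) cross-view similarities
    targets = torch.arange(n, device=z1.device)
    instance_loss = F.cross_entropy(logits, targets)

    # Realm-level attraction (SupCon-style): treat every cross-view pair that
    # shares a realm as an extra positive, averaged per anchor.
    same_realm = (realm_labels.unsqueeze(0) == realm_labels.unsqueeze(1)).float()
    log_prob = F.log_softmax(logits, dim=1)
    realm_loss = (-(log_prob * same_realm).sum(1) / same_realm.sum(1).clamp(min=1)).mean()

    return instance_loss + realm_weight * realm_loss

Under these assumptions, the realm term pulls same-realm representations together during pre-training while the instance term preserves per-sample discrimination, consistent with the convergence behavior the abstract describes.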
Cite
Text
Zhang et al. "Benchmarking Omni-Vision Representation Through the Lens of Visual Realms." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-20071-7_35
Markdown
[Zhang et al. "Benchmarking Omni-Vision Representation Through the Lens of Visual Realms." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/zhang2022eccv-benchmarking/) doi:10.1007/978-3-031-20071-7_35
BibTeX
@inproceedings{zhang2022eccv-benchmarking,
title = {{Benchmarking Omni-Vision Representation Through the Lens of Visual Realms}},
author = {Zhang, Yuanhan and Yin, Zhenfei and Shao, Jing and Liu, Ziwei},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2022},
doi = {10.1007/978-3-031-20071-7_35},
url = {https://mlanthology.org/eccv/2022/zhang2022eccv-benchmarking/}
}