Benchmarking Omni-Vision Representation Through the Lens of Visual Realms
Abstract
Though impressive performance has been achieved in specific visual realms (e.g., faces, dogs, and places), an omni-vision representation that generalizes to many natural visual domains is highly desirable. Nonetheless, existing benchmarks for evaluating visual representations, such as ImageNet, VTAB-natural, and the CLIP benchmark suite, are either limited in the spectrum of realms they cover or built by arbitrarily integrating existing datasets. In this paper, we propose the Omni-Realm Benchmark (OmniBenchmark), which enables systematically measuring generalization ability across a wide range of visual realms. OmniBenchmark first integrates concepts from Wikidata to enlarge the set of concepts under each sub-tree of WordNet. It then leverages expert knowledge from WordNet to define a comprehensive spectrum of 21 semantic realms in the natural domain, twice that of ImageNet. Finally, we manually annotate all 7,372 valid concepts, forming a 21-realm dataset with 1,074,346 images. With OmniBenchmark, we propose a hierarchical instance contrastive learning framework for learning a better omni-vision representation, i.e., Relational Contrastive learning (ReCo), which boosts the performance of representation learning across omni-realms. Since hierarchical semantic relations naturally emerge in the label systems of visual datasets, ReCo attracts representations within the same semantic realm during pre-training, enabling the model to converge faster than conventional contrastive learning when it is further fine-tuned on a specific realm. Extensive experiments demonstrate the superior performance of ReCo over state-of-the-art contrastive learning methods on both ImageNet and OmniBenchmark. Beyond that, we conduct a systematic investigation of recent advances in both architectures (from CNNs to transformers) and learning paradigms (from supervised learning to self-supervised learning) on our benchmark, revealing multiple practical observations to facilitate future research.
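To make the ReCo objective concrete, below is a minimal PyTorch sketch of a hierarchical contrastive loss in the spirit of the abstract: a standard instance-level InfoNCE term plus a term that softly attracts samples sharing a semantic realm. The function name, the supervised-contrastive-style realm term, and all hyper-parameters are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def relational_contrastive_loss(z1, z2, realm_labels, temperature=0.2, realm_weight=0.5):
    # Hypothetical sketch of a ReCo-style loss (not the authors' exact objective).
    # z1, z2: embeddings of two augmented views of the same batch, shape (N, D).
    # realm_labels: integer semantic-realm id per sample, shape (N,).
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    n = z1.size(0)

    # Instance-level InfoNCE: each sample's positive is its matching view.
    logits = z1 @ z2.t() / temperature  # (N, N) cross-view similarities
    targets = torch.arange(n, device=z1.device)
    instance_loss = F.cross_entropy(logits, targets)

    # Realm-level attraction (SupCon-style): treat every cross-view pair that
    # shares a realm as an extra positive, averaged per anchor.
    same_realm = (realm_labels.unsqueeze(0) == realm_labels.unsqueeze(1)).float()
    log_prob = F.log_softmax(logits, dim=1)
    realm_loss = (-(log_prob * same_realm).sum(1) / same_realm.sum(1).clamp(min=1)).mean()

    return instance_loss + realm_weight * realm_loss

Under these assumptions, the realm term pulls same-realm representations together during pre-training while the instance term preserves per-sample discrimination, consistent with the convergence behavior the abstract describes.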
Cite
Text
Zhang et al. "Benchmarking Omni-Vision Representation Through the Lens of Visual Realms." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-20071-7_35
Markdown
[Zhang et al. "Benchmarking Omni-Vision Representation Through the Lens of Visual Realms." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/zhang2022eccv-benchmarking/) doi:10.1007/978-3-031-20071-7_35
BibTeX
@inproceedings{zhang2022eccv-benchmarking,
title = {{Benchmarking Omni-Vision Representation Through the Lens of Visual Realms}},
author = {Zhang, Yuanhan and Yin, Zhenfei and Shao, Jing and Liu, Ziwei},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2022},
doi = {10.1007/978-3-031-20071-7_35},
url = {https://mlanthology.org/eccv/2022/zhang2022eccv-benchmarking/}
}