Visual Atoms: Pre-Training Vision Transformers with Sinusoidal Waves

Abstract

Formula-driven supervised learning (FDSL) has been shown to be an effective method for pre-training vision transformers; in particular, ExFractalDB-21k was shown to exceed the pre-training effect of ImageNet-21k. These studies also indicate that contours matter more than textures when pre-training vision transformers. However, the lack of a systematic investigation into why these contour-oriented synthetic datasets can achieve the same accuracy as real datasets leaves much room for skepticism. In the present work, we develop a novel methodology based on circular harmonics for systematically investigating the design space of contour-oriented synthetic datasets. This allows us to efficiently search for the optimal range of FDSL parameters and maximize the variety of synthetic images in the dataset, which we found to be a critical factor. When the resulting new dataset, VisualAtom-21k, is used for pre-training ViT-Base, the top-1 accuracy reaches 83.7% after fine-tuning on ImageNet-1k. This is only 0.5 percentage points below the top-1 accuracy (84.2%) achieved by JFT-300M pre-training, even though the number of images is 1/14 that of JFT-300M. Unlike JFT-300M, which is a static dataset, the quality of synthetic datasets will continue to improve, and the current work is a testament to this possibility. FDSL is also free of the common issues associated with real images, e.g., privacy/copyright issues, labeling costs/errors, and ethical biases.
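
To give a concrete feel for what a contour built from sinusoidal waves can look like, the sketch below perturbs the radius of a circle with two sine waves of different frequencies. It is an illustration only: the abstract does not specify the generation procedure, so the function name `wavy_contour`, its parameters (`n1`, `n2`, `a1`, `a2`, `radius`, `num_points`), and the exact formula are assumptions, not the authors' method.

```python
import math


def wavy_contour(n1, n2, a1=0.1, a2=0.1, radius=1.0, num_points=512):
    """Return (x, y) points on a closed contour whose radius is modulated
    by two sine waves with integer frequencies n1 and n2.

    Hypothetical illustration; not the VisualAtom generation procedure.
    """
    points = []
    for k in range(num_points):
        theta = 2.0 * math.pi * k / num_points
        # Radius of a circle perturbed by a sum of two sinusoids.
        r = radius * (1.0 + a1 * math.sin(n1 * theta) + a2 * math.sin(n2 * theta))
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points


# Different (n1, n2) pairs give visibly different shapes, which is one
# simple way a formula could define distinct synthetic categories.
contour = wavy_contour(n1=3, n2=7)
```

Because `n1` and `n2` are integers, the contour closes on itself after one full turn. In this sketch, varying the frequencies and amplitudes is the only source of shape variety.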

Cite

Text

Takashima et al. "Visual Atoms: Pre-Training Vision Transformers with Sinusoidal Waves." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01782

Markdown

[Takashima et al. "Visual Atoms: Pre-Training Vision Transformers with Sinusoidal Waves." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/takashima2023cvpr-visual/) doi:10.1109/CVPR52729.2023.01782

BibTeX

@inproceedings{takashima2023cvpr-visual,
  title     = {{Visual Atoms: Pre-Training Vision Transformers with Sinusoidal Waves}},
  author    = {Takashima, Sora and Hayamizu, Ryo and Inoue, Nakamasa and Kataoka, Hirokatsu and Yokota, Rio},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {18579--18588},
  doi       = {10.1109/CVPR52729.2023.01782},
  url       = {https://mlanthology.org/cvpr/2023/takashima2023cvpr-visual/}
}