SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage

Abstract

We need billion-scale images to achieve more generalizable and ground-breaking vision models, as well as massive dataset storage to ship the images (e.g., the LAION-4B dataset needs 240TB of storage space). However, it has become challenging to handle ever-growing dataset storage with limited storage infrastructure. A number of storage-efficient training methods have been proposed to tackle the problem, but they are rarely scalable or suffer from severe performance degradation. In this paper, we propose a storage-efficient training strategy for vision classifiers on large-scale datasets (e.g., ImageNet) that uses only 1024 tokens per instance, without using the raw-level pixels; our token storage needs <1% of the storage of the original JPEG-compressed raw pixels. We also propose token augmentations and a Stem-adaptor module so that our approach can use the same architecture as pixel-based approaches, with only minimal modifications to the stem layer and carefully tuned optimization settings. Our experimental results on ImageNet-1K show that our method outperforms other storage-efficient training methods by a large margin. We further show the effectiveness of our method in other practical scenarios, namely storage-efficient pre-training and continual learning. We will make our implementation and tokenized dataset publicly available after acceptance.

Cite

Text

Park et al. "SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01582

Markdown

[Park et al. "SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/park2023iccv-seit/) doi:10.1109/ICCV51070.2023.01582

BibTeX

@inproceedings{park2023iccv-seit,
  title     = {{SeiT: Storage-Efficient Vision Training with Tokens Using 1\% of Pixel Storage}},
  author    = {Park, Song and Chun, Sanghyuk and Heo, Byeongho and Kim, Wonjae and Yun, Sangdoo},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {17248--17259},
  doi       = {10.1109/ICCV51070.2023.01582},
  url       = {https://mlanthology.org/iccv/2023/park2023iccv-seit/}
}