Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information

Abstract

To effectively exploit the potential of large-scale models, various pre-training strategies supported by massive data from different sources have been proposed, including supervised pre-training, weakly-supervised pre-training, and self-supervised pre-training. It has been shown that combining multiple pre-training strategies and data from various modalities/sources can greatly boost the training of large-scale models. However, current works adopt a multi-stage pre-training pipeline, whose complexity may increase the uncertainty and instability of pre-training. It is thus desirable to integrate these strategies in a single-stage manner. In this paper, we first propose a general multi-modal mutual information formula as a unified optimization target and demonstrate that all mainstream pre-training approaches are special cases of our framework. Under this unified perspective, we propose an all-in-one single-stage pre-training approach, named Maximizing Multi-modal Mutual Information Pre-training (M3I Pre-training). Our approach achieves better performance than previous pre-training methods on various vision benchmarks, including ImageNet classification, COCO object detection, LVIS long-tailed object detection, and ADE20K semantic segmentation. Notably, we successfully pre-train a billion-parameter image backbone and achieve state-of-the-art performance on various benchmarks under the public-data setting. Code shall be released at https://github.com/OpenGVLab/M3I-Pretraining.
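
As a minimal sketch of how a mutual-information objective can subsume these pre-training paradigms, consider the standard Barber-Agakov variational lower bound; the notation here ($z$ for the input representation, $t$ for the pre-training target, $q_\theta$ for a variational predictor) is illustrative and not necessarily the paper's exact formulation:

$$I(z; t) = H(t) - H(t \mid z) \ge H(t) + \mathbb{E}_{p(z, t)}\left[\log q_\theta(t \mid z)\right]$$

Since the target entropy $H(t)$ does not depend on the model, maximizing this bound amounts to minimizing the prediction loss $-\mathbb{E}\left[\log q_\theta(t \mid z)\right]$. Under this view, choosing $t$ to be a class label recovers supervised pre-training, choosing $t$ to be a paired text description recovers weakly-supervised image-text pre-training, and choosing $t$ to be an augmented view or masked region of the same image recovers self-supervised pre-training.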

Cite

Text

Su et al. "Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01525

Markdown

[Su et al. "Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/su2023cvpr-allinone/) doi:10.1109/CVPR52729.2023.01525

BibTeX

@inproceedings{su2023cvpr-allinone,
  title     = {{Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information}},
  author    = {Su, Weijie and Zhu, Xizhou and Tao, Chenxin and Lu, Lewei and Li, Bin and Huang, Gao and Qiao, Yu and Wang, Xiaogang and Zhou, Jie and Dai, Jifeng},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {15888--15899},
  doi       = {10.1109/CVPR52729.2023.01525},
  url       = {https://mlanthology.org/cvpr/2023/su2023cvpr-allinone/}
}