PolyViT: Co-Training Vision Transformers on Images, Videos and Audio

Abstract

Can we train a single transformer model capable of processing multiple modalities and datasets, whilst sharing almost all of its learnable parameters? We present PolyViT, a model trained on images, audio and video that answers this question. PolyViT consists of a single transformer backbone, modality-specific tokenizers and task-specific output heads. By co-training on different tasks of a single modality, we are able to achieve significant accuracy improvements on 5 standard video- and audio-classification datasets. Furthermore, co-training PolyViT on multiple modalities and tasks leads to a parameter-efficient model which generalizes across multiple domains. In particular, our multi-modal PolyViT trained on 9 datasets across 3 modalities uses 8.3 times fewer parameters and outperforms a state-of-the-art single-task baseline on 2 of these datasets, whilst achieving competitive performance on the others. Finally, this simple and practical approach requires less hyperparameter tuning, as the per-task hyperparameters can be readily reused.
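
The following is a minimal sketch of the architecture described in the abstract, assuming a PyTorch-style formulation: one transformer encoder shared across all modalities, modality-specific convolutional tokenizers (image patches, video tubelets, audio-spectrogram patches) and one linear classification head per dataset. All class names, dataset keys and hyperparameter values are illustrative assumptions, not the authors' implementation; positional embeddings and the co-training schedule are omitted for brevity.

# Illustrative sketch only; not the authors' code.
import torch
import torch.nn as nn

class PolyViTSketch(nn.Module):
    def __init__(self, d_model=768, n_layers=12, n_heads=12, task_classes=None):
        super().__init__()
        # Hypothetical task -> number-of-classes mapping (one head per dataset).
        task_classes = task_classes or {"imagenet": 1000, "kinetics400": 400, "audioset": 527}
        # Modality-specific tokenizers: 16x16 image patches, 2x16x16 video
        # tubelets, and 16x16 patches of a single-channel audio spectrogram.
        self.tokenizers = nn.ModuleDict({
            "image": nn.Conv2d(3, d_model, kernel_size=16, stride=16),
            "video": nn.Conv3d(3, d_model, kernel_size=(2, 16, 16), stride=(2, 16, 16)),
            "audio": nn.Conv2d(1, d_model, kernel_size=16, stride=16),
        })
        # Single transformer backbone shared by all modalities and tasks.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        # Task-specific output heads: a linear classifier per dataset.
        self.heads = nn.ModuleDict({t: nn.Linear(d_model, c) for t, c in task_classes.items()})

    def forward(self, x, modality, task):
        tokens = self.tokenizers[modality](x)        # (B, d_model, *spatial)
        tokens = tokens.flatten(2).transpose(1, 2)   # (B, num_tokens, d_model)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        z = self.encoder(torch.cat([cls, tokens], dim=1))
        return self.heads[task](z[:, 0])             # classify from the CLS token

A forward pass selects the tokenizer by modality and the head by task, e.g. model(video_batch, modality="video", task="kinetics400"), while every encoder parameter is shared across all datasets, which is the source of the parameter efficiency reported above.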

Cite

Text

Likhosherstov et al. "PolyViT: Co-Training Vision Transformers on Images, Videos and Audio." Transactions on Machine Learning Research, 2023.

Markdown

[Likhosherstov et al. "PolyViT: Co-Training Vision Transformers on Images, Videos and Audio." Transactions on Machine Learning Research, 2023.](https://mlanthology.org/tmlr/2023/likhosherstov2023tmlr-polyvit/)

BibTeX

@article{likhosherstov2023tmlr-polyvit,
  title     = {{PolyViT: Co-Training Vision Transformers on Images, Videos and Audio}},
  author    = {Likhosherstov, Valerii and Arnab, Anurag and Choromanski, Krzysztof Marcin and Lucic, Mario and Tay, Yi and Dehghani, Mostafa},
  journal   = {Transactions on Machine Learning Research},
  year      = {2023},
  url       = {https://mlanthology.org/tmlr/2023/likhosherstov2023tmlr-polyvit/}
}