Towards Efficient Audio-Visual Learners via Empowering Pre-Trained Vision Transformers with Cross-Modal Adaptation
Abstract
In this paper, we explore cross-modal adaptation of pre-trained Vision Transformers (ViTs) to the audio-visual domain by adding only a small set of trainable parameters. To this end, we propose Spatial-Temporal-Global Cross-Modal Adaptation (STG-CMA), which gradually equips frozen ViTs with the ability to learn audio-visual representations. It consists of three components: modality-specific temporal adaptation for temporal reasoning within each modality, cross-modal spatial adaptation for refining spatial information with cues from the counterpart modality, and cross-modal global adaptation for global interaction between the audio and visual modalities. STG-CMA shows that a shared pre-trained image model with inserted lightweight adapters is sufficient for spatial-temporal modeling and feature interaction across the audio and visual modalities. Extensive experiments indicate that STG-CMA achieves state-of-the-art performance on various audio-visual understanding tasks, including AVE, AVS, and AVQA, while using significantly fewer tunable parameters. The code is available at https://github.com/kaiw7/STG-CMA.
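The general idea of attaching lightweight adapters to a frozen backbone can be illustrated with a small sketch. This is not the STG-CMA module itself, whose internals the abstract does not specify. It is a generic residual bottleneck adapter, and every name and size in it (`FrozenLinear`, `BottleneckAdapter`, `dim=64`, `bottleneck=8`) is a hypothetical choice made only for illustration. The sketch assumes a zero-initialized up-projection, a common adapter convention, so that the adapted layer initially reproduces the frozen layer's output.

```python
import random


def matvec(weight, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def random_matrix(rows, cols, rng, scale=0.02):
    """Build a rows x cols matrix of small Gaussian values."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class FrozenLinear:
    """Stand-in for one pre-trained layer whose weights are never updated."""

    def __init__(self, dim, rng):
        self.weight = random_matrix(dim, dim, rng)

    def __call__(self, x):
        return matvec(self.weight, x)

    def num_params(self):
        return len(self.weight) * len(self.weight[0])


class BottleneckAdapter:
    """Generic residual adapter (hypothetical, not the paper's design)."""

    def __init__(self, dim, bottleneck, rng):
        # Down-projection: dim -> bottleneck (would be trainable).
        self.down = random_matrix(bottleneck, dim, rng)
        # Up-projection: bottleneck -> dim, zero-initialized (would be trainable).
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.down, x)]  # ReLU
        delta = matvec(self.up, hidden)
        # Residual connection: the adapter only adds a correction to x.
        return [xi + di for xi, di in zip(x, delta)]

    def num_params(self):
        return (len(self.down) * len(self.down[0])
                + len(self.up) * len(self.up[0]))


rng = random.Random(0)
dim, bottleneck = 64, 8  # illustrative sizes only

frozen = FrozenLinear(dim, rng)
adapter = BottleneckAdapter(dim, bottleneck, rng)

token = [rng.gauss(0.0, 1.0) for _ in range(dim)]
frozen_out = frozen(token)
adapted_out = adapter(frozen_out)

# Share of parameters that would be tuned in this toy setup.
trainable_fraction = adapter.num_params() / (
    adapter.num_params() + frozen.num_params()
)
```

In this toy setup the adapter holds 1,024 parameters against 4,096 frozen ones. The ratio depends entirely on the illustrative sizes chosen here and says nothing about the parameter counts reported in the paper.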
Cite
Text
Wang et al. "Towards Efficient Audio-Visual Learners via Empowering Pre-Trained Vision Transformers with Cross-Modal Adaptation." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00190
Markdown
[Wang et al. "Towards Efficient Audio-Visual Learners via Empowering Pre-Trained Vision Transformers with Cross-Modal Adaptation." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/wang2024cvprw-efficient/) doi:10.1109/CVPRW63382.2024.00190
BibTeX
@inproceedings{wang2024cvprw-efficient,
title = {{Towards Efficient Audio-Visual Learners via Empowering Pre-Trained Vision Transformers with Cross-Modal Adaptation}},
author = {Wang, Kai and Tian, Yapeng and Hatzinakos, Dimitrios},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2024},
pages = {1837-1846},
doi = {10.1109/CVPRW63382.2024.00190},
url = {https://mlanthology.org/cvprw/2024/wang2024cvprw-efficient/}
}