Mixtures of Experts for Audio-Visual Learning

Abstract

With the rapid development of multimedia technology, audio-visual learning has emerged as a promising research topic within the field of multimodal analysis. In this paper, we explore parameter-efficient transfer learning for audio-visual learning and propose the Audio-Visual Mixture of Experts (AVMoE) to flexibly inject adapters into pre-trained models. Specifically, we introduce unimodal and cross-modal adapters as multiple experts that specialize in intra-modal and inter-modal information, respectively, and employ a lightweight router to dynamically allocate the weight of each expert according to the specific demands of each task. Extensive experiments demonstrate that our proposed AVMoE achieves superior performance across multiple audio-visual tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, visual-only experimental results indicate that our approach can also handle challenging scenes where modality information is missing. The source code is available at https://github.com/yingchengy/AVMOE.
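The routing idea described in the abstract — unimodal and cross-modal adapter experts combined by a lightweight router — can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation (see the repository above for that); the module names (`BottleneckAdapter`, `CrossModalAdapter`, `AVMoELayer`), the bottleneck dimension, and the token-level softmax routing are illustrative assumptions.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Standard down-project / up-project adapter with a residual connection."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class CrossModalAdapter(nn.Module):
    """Adapter that attends to the other modality before the bottleneck."""

    def __init__(self, dim: int, bottleneck: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.adapter = BottleneckAdapter(dim, bottleneck)

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # Inject inter-modal information via cross-attention, then adapt.
        fused, _ = self.attn(query=x, key=other, value=other)
        return self.adapter(x + fused)


class AVMoELayer(nn.Module):
    """Mixes an intra-modal expert and an inter-modal expert per token."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.unimodal = BottleneckAdapter(dim, bottleneck)
        self.crossmodal = CrossModalAdapter(dim, bottleneck)
        self.router = nn.Linear(dim, 2)  # lightweight router: one logit per expert

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.router(x), dim=-1)   # (B, T, 2)
        uni = self.unimodal(x)                            # intra-modal expert
        cross = self.crossmodal(x, other)                 # inter-modal expert
        return weights[..., 0:1] * uni + weights[..., 1:2] * cross


# Usage: audio and visual token sequences from a frozen pre-trained backbone.
audio = torch.randn(2, 10, 256)   # (batch, time, dim)
visual = torch.randn(2, 10, 256)
layer = AVMoELayer(dim=256)
out = layer(audio, other=visual)  # audio features refined with visual context
print(out.shape)                  # torch.Size([2, 10, 256])
```

Token-level softmax routing is used here only to keep the sketch self-contained; a router could equally produce one weight vector per sequence or per task, as the task-dependent allocation in the abstract suggests.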

Cite

Text

Cheng et al. "Mixtures of Experts for Audio-Visual Learning." Neural Information Processing Systems, 2024. doi:10.52202/079017-0007

Markdown

[Cheng et al. "Mixtures of Experts for Audio-Visual Learning." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/cheng2024neurips-mixtures/) doi:10.52202/079017-0007

BibTeX

@inproceedings{cheng2024neurips-mixtures,
  title     = {{Mixtures of Experts for Audio-Visual Learning}},
  author    = {Cheng, Ying and Li, Yang and He, Junjie and Feng, Rui},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-0007},
  url       = {https://mlanthology.org/neurips/2024/cheng2024neurips-mixtures/}
}