MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers

Abstract

Recent advances in pre-trained vision transformers have shown promise in parameter-efficient audio-visual learning without audio pre-training. However, few studies have investigated effective methods for aligning multimodal features in parameter-efficient audio-visual transformers. In this paper, we propose MA-AVT, a new parameter-efficient audio-visual transformer that employs deep modality alignment to match corresponding multimodal semantic features. Specifically, we introduce joint unimodal and multimodal token learning to align the two modalities with a frozen modality-shared transformer. This allows the model to learn separate representations for each modality while also attending to the cross-modal relationships between them. In addition, unlike prior work that aligns only the coarse features at the output of unimodal encoders, we introduce blockwise contrastive learning to align coarse-to-fine-grained hierarchical features throughout the encoding phase. Furthermore, to separate the background features in each modality from the foreground, matched audio-visual features, we introduce a robust discriminative foreground mining scheme. Through extensive experiments on the AVE, VGGSound, and CREMA-D benchmark datasets, we achieve considerable performance improvements over state-of-the-art methods.
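The blockwise contrastive alignment mentioned in the abstract can be illustrated with a rough sketch. The abstract does not give the loss, so the following assumes a generic symmetric InfoNCE objective applied to pooled audio and visual features at every transformer block and averaged over blocks. The function names (`info_nce`, `blockwise_contrastive_loss`), the temperature value, and the toy inputs are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy in which anchors[i] should match targets[i].

    Assumption: a standard InfoNCE form, not necessarily the paper's loss.
    """
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of sample i against every target in the batch.
        logits = [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def blockwise_contrastive_loss(audio_blocks, visual_blocks, temperature=0.07):
    """Average a symmetric InfoNCE loss over per-block pooled features.

    audio_blocks[b][i] and visual_blocks[b][i] are the pooled features of
    sample i after block b (hypothetical layout chosen for this sketch).
    """
    assert len(audio_blocks) == len(visual_blocks)
    losses = []
    for a, v in zip(audio_blocks, visual_blocks):
        a2v = info_nce(a, v, temperature)
        v2a = info_nce(v, a, temperature)
        losses.append(0.5 * (a2v + v2a))
    return sum(losses) / len(losses)


# Toy check: two blocks, batch of two samples, 2-D features.
audio = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
visual_matched = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
visual_shuffled = [[[0.0, 1.0], [1.0, 0.0]], [[0.0, 1.0], [1.0, 0.0]]]

loss_aligned = blockwise_contrastive_loss(audio, visual_matched)
loss_shuffled = blockwise_contrastive_loss(audio, visual_shuffled)
print(loss_aligned < loss_shuffled)
```

In this toy setup the matched pairs yield a much lower loss than the shuffled pairs, which is the behavior a per-block alignment objective would be expected to reward. How the actual method pools tokens, weights blocks, or combines this with foreground mining is not specified in the abstract.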

Cite

Text

Mahmud et al. "MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00798

Markdown

[Mahmud et al. "MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/mahmud2024cvprw-maavt/) doi:10.1109/CVPRW63382.2024.00798

BibTeX

@inproceedings{mahmud2024cvprw-maavt,
  title     = {{MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers}},
  author    = {Mahmud, Tanvir and Mo, Shentong and Tian, Yapeng and Marculescu, Diana},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {7996--8005},
  doi       = {10.1109/CVPRW63382.2024.00798},
  url       = {https://mlanthology.org/cvprw/2024/mahmud2024cvprw-maavt/}
}