A Multimodal, Multi-Task Adapting Framework for Video Action Recognition

Wang, Mengmeng; Xing, Jiazheng; Jiang, Boyuan; Chen, Jun; Mei, Jianbiao; Zuo, Xingxing; Dai, Guang; Wang, Jingdong; Liu, Yong

doi:10.1609/AAAI.V38I6.28361

AAAI 2024 pp. 5517-5525

Abstract

Recently, the rise of large-scale vision-language pretrained models such as CLIP, coupled with Parameter-Efficient Fine-Tuning (PEFT), has attracted substantial attention in video action recognition. Nevertheless, prevailing approaches tend to prioritize strong supervised performance at the expense of the models' generalization capabilities during transfer. In this paper, we introduce a novel Multimodal, Multi-task CLIP adapting framework named M2-CLIP to address these challenges, preserving both high supervised performance and robust transferability. First, to enhance the individual modality architectures, we introduce multimodal adapters to both the visual and text branches. Specifically, we design a novel visual TED-Adapter that performs global Temporal Enhancement and local temporal Difference modeling to improve the temporal representation capabilities of the visual encoder. Moreover, we adopt text encoder adapters to strengthen the learning of semantic label information. Second, we design a multi-task decoder with a rich set of supervisory signals, including the original contrastive learning head, a cross-modal classification head, a cross-modal masked language modeling head, and a visual classification head. This multi-task decoder adeptly satisfies the need for strong supervised performance within a multimodal framework. Experimental results validate the efficacy of our approach, demonstrating exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios.
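
The abstract names two generic ingredients that are easy to picture in isolation: differences between neighbouring frame features, and several supervisory signals combined into one training objective. The sketch below illustrates only those generic ideas in plain Python. It is not the paper's TED-Adapter or decoder. The function names, the zero padding for the first frame, the temporal mean as a stand-in for global context, and the loss weights are all assumptions made for illustration, since the abstract does not specify any of them.

```python
from typing import Dict, List

Frame = List[float]  # one feature vector per video frame (hypothetical layout)


def temporal_difference(frames: List[Frame]) -> List[Frame]:
    """Local temporal cue: each frame's features minus the previous frame's.

    The first frame has no predecessor, so it gets zeros (an assumption).
    """
    diffs = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        diffs.append([c - p for p, c in zip(prev, cur)])
    return diffs


def temporal_mean(frames: List[Frame]) -> Frame:
    """Global temporal cue: average each feature dimension over all frames."""
    num_frames = len(frames)
    return [sum(column) / num_frames for column in zip(*frames)]


def combine_losses(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-head losses; the weights are hypothetical."""
    return sum(weights[name] * value for name, value in losses.items())


# Three frames with two feature dimensions each.
frames = [[1.0, 2.0], [3.0, 5.0], [6.0, 9.0]]
diffs = temporal_difference(frames)
context = temporal_mean(frames)

# One made-up loss value per head listed in the abstract.
head_losses = {
    "contrastive": 1.0,
    "cross_modal_cls": 2.0,
    "cross_modal_mlm": 3.0,
    "visual_cls": 4.0,
}
head_weights = {name: 0.5 for name in head_losses}
total_loss = combine_losses(head_losses, head_weights)
```

In a real model these operations would run on learned tensors inside the encoder and be trained end to end. The lists here only make the arithmetic visible.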

Cite

Text

Wang et al. "A Multimodal, Multi-Task Adapting Framework for Video Action Recognition." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I6.28361

Markdown

[Wang et al. "A Multimodal, Multi-Task Adapting Framework for Video Action Recognition." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/wang2024aaai-multimodal/) doi:10.1609/AAAI.V38I6.28361

BibTeX

@inproceedings{wang2024aaai-multimodal,
  title     = {{A Multimodal, Multi-Task Adapting Framework for Video Action Recognition}},
  author    = {Wang, Mengmeng and Xing, Jiazheng and Jiang, Boyuan and Chen, Jun and Mei, Jianbiao and Zuo, Xingxing and Dai, Guang and Wang, Jingdong and Liu, Yong},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {5517--5525},
  doi       = {10.1609/AAAI.V38I6.28361},
  url       = {https://mlanthology.org/aaai/2024/wang2024aaai-multimodal/}
}