Alignment and Generation Adapter for Efficient Video-Text Understanding

Han Fang, Zhifei Yang, Yuhan Wei, Xianghao Zang, Chao Ban, Zerun Feng, Zhongjiang He, Yongxiang Li, Hao Sun

ICCVW 2023 pp. 2783-2789

doi:10.1109/ICCVW60793.2023.00296

Abstract

Pre-trained models have demonstrated considerable performance, especially in enhancing cross-modal understanding between videos and text. However, fine-tuning them at scale becomes costly and poses challenges for adapting to various downstream tasks. To tackle these challenges, we propose the Alignment-generation Adapter (AGAdapter), establishing semantic coherence between alignment and generation models for efficient video-text adaptation across multiple tasks simultaneously. We propose an alignment adapter with knowledge-sharing to adapt the frozen CLIP model for fine-grained video-language interaction. Additionally, we introduce the generation adapter with prompt tuning to leverage the large language model for captioning. Furthermore, we introduce instruction joint tuning, combining textual and cross-modal instructions, to capture detailed descriptions. Our AGAdapter achieves state-of-the-art performance on video-text retrieval and video captioning tasks, including two benchmarks, MSR-VTT and ActivityNet.

Cite

Text

Fang et al. "Alignment and Generation Adapter for Efficient Video-Text Understanding." IEEE/CVF International Conference on Computer Vision Workshops, 2023. doi:10.1109/ICCVW60793.2023.00296

Markdown

[Fang et al. "Alignment and Generation Adapter for Efficient Video-Text Understanding." IEEE/CVF International Conference on Computer Vision Workshops, 2023.](https://mlanthology.org/iccvw/2023/fang2023iccvw-alignment/) doi:10.1109/ICCVW60793.2023.00296

BibTeX

@inproceedings{fang2023iccvw-alignment,
  title     = {{Alignment and Generation Adapter for Efficient Video-Text Understanding}},
  author    = {Fang, Han and Yang, Zhifei and Wei, Yuhan and Zang, Xianghao and Ban, Chao and Feng, Zerun and He, Zhongjiang and Li, Yongxiang and Sun, Hao},
  booktitle = {IEEE/CVF International Conference on Computer Vision Workshops},
  year      = {2023},
  pages     = {2783-2789},
  doi       = {10.1109/ICCVW60793.2023.00296},
  url       = {https://mlanthology.org/iccvw/2023/fang2023iccvw-alignment/}
}