ViLA: Efficient Video-Language Alignment for Video Question Answering

Abstract

Pre-trained large image-language models have shown promising results on problems such as visual question answering (VQA). However, how to efficiently and effectively sample video frames when adapting these models to video-language alignment remains a major challenge. We propose an efficient Video-Language Alignment (ViLA) network that addresses both efficient frame sampling and effective cross-modal alignment in a unified way. ViLA pairs a new learnable, text-guided Frame-Prompter with a cross-modal distillation (QFormer-Distiller) module. Compared with prior work, ViLA selects key frames with critical content, improving video-language alignment accuracy while reducing inference latency (+3.3% on NExT-QA Temporal with a 3.0× speed-up). Overall, ViLA outperforms state-of-the-art methods on video question-answering benchmarks: +4.6% on STAR Interaction and +2.2% on STAR average with a 3.0× speed-up, and our 2-frame model outperforms the 4-frame SeViLA on the VLEP dataset with a 4.2× speed-up. Code will be available at https://github.com/xijun-cs/ViLA.
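To make the two components named in the abstract concrete, below is a minimal PyTorch sketch of a text-guided frame selector in the spirit of the Frame-Prompter. All module names, dimensions, and the gradient-scaling trick are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedFramePrompter(nn.Module):
    """Scores candidate frames against the question text and keeps the top-k.

    A sketch only: the paper's Frame-Prompter is learnable and text-guided,
    but its exact selection mechanism may differ from what is shown here.
    """

    def __init__(self, frame_dim: int, text_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)

    def forward(self, frame_feats: torch.Tensor, text_feat: torch.Tensor, k: int):
        # frame_feats: (B, T, frame_dim) per-frame visual features
        # text_feat:   (B, text_dim)     pooled question embedding
        f = self.frame_proj(frame_feats)                 # (B, T, H)
        t = self.text_proj(text_feat).unsqueeze(1)       # (B, 1, H)
        scores = (f * t).sum(-1) / f.shape[-1] ** 0.5    # (B, T) relevance logits
        # Hard top-k, re-sorted to preserve temporal order.
        idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values
        batch = torch.arange(frame_feats.size(0), device=frame_feats.device).unsqueeze(-1)
        selected = frame_feats[batch, idx]               # (B, k, frame_dim)
        if self.training:
            # Scale selected frames by their softmax scores so gradients reach
            # the scorer (an assumed workaround for the non-differentiable top-k).
            w = scores.softmax(-1)[batch, idx].unsqueeze(-1)
            selected = selected * w
        return selected, idx
```

The QFormer-Distiller can likewise be summarized as a distillation objective in which a student QFormer that sees only the selected frames is trained to match a teacher that sees more frames. The specific loss below (MSE on query tokens) is an assumption; in training it would be summed with the downstream QA loss.

```python
def qformer_distill_loss(student_queries: torch.Tensor,
                         teacher_queries: torch.Tensor) -> torch.Tensor:
    """Match the student QFormer's query tokens to a frozen teacher's.

    Both tensors: (B, num_query_tokens, D). Loss choice is an assumption.
    """
    return F.mse_loss(student_queries, teacher_queries.detach())
```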

Cite

Text

Wang et al. "ViLA: Efficient Video-Language Alignment for Video Question Answering." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73033-7_11

Markdown

[Wang et al. "ViLA: Efficient Video-Language Alignment for Video Question Answering." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/wang2024eccv-vila/) doi:10.1007/978-3-031-73033-7_11

BibTeX

@inproceedings{wang2024eccv-vila,
  title     = {{ViLA: Efficient Video-Language Alignment for Video Question Answering}},
  author    = {Wang, Xijun and Liang, Junbang and Wang, Chun-Kai and Deng, Kenan and Lou, Yu and Lin, Ming C. and Yang, Shan},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73033-7_11},
  url       = {https://mlanthology.org/eccv/2024/wang2024eccv-vila/}
}