Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer
Abstract
This paper presents a new method for end-to-end Video Question Answering (VideoQA), in contrast to the current trend of relying on large-scale pre-training with heavy feature extractors. We achieve this with a pyramidal multimodal transformer (PMT) model, which simply incorporates a learnable word embedding layer and a small number of convolutional and transformer layers. We use an anisotropic pyramid to fulfill video-language interactions across different spatio-temporal scales. In addition to the canonical pyramid, which includes both bottom-up and top-down pathways with lateral connections, novel strategies are proposed to decompose the visual feature stream into spatial and temporal sub-streams at different scales and to implement their interactions with the linguistic semantics while preserving the integrity of local and global semantics. We demonstrate better or on-par performance with high computational efficiency against state-of-the-art methods on five VideoQA benchmarks. Our ablation study shows the scalability of our model, which achieves competitive results for text-to-video retrieval by leveraging feature extractors with reusable pre-trained weights, and also the effectiveness of the pyramid. Code available at: https://github.com/Trunpm/PMT-AAAI23.
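The paper's layer configurations are not reproduced here, but the canonical pyramid the abstract describes — a bottom-up pathway producing coarser temporal scales, followed by a top-down pathway that fuses them through lateral connections — can be illustrated with a minimal numpy sketch. All function names, the pooling factor of 2, and sum-based lateral fusion are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def bottom_up(x, num_levels=3):
    """Bottom-up pathway (assumed): average-pool frame pairs to halve the
    temporal resolution at each level, yielding a multi-scale feature list."""
    feats = [x]
    for _ in range(num_levels - 1):
        t = feats[-1]
        t = t[: (t.shape[0] // 2) * 2]            # drop odd trailing frame
        feats.append(t.reshape(-1, 2, t.shape[1]).mean(axis=1))
    return feats  # fine-to-coarse: [(T, D), (T/2, D), (T/4, D), ...]

def top_down(feats):
    """Top-down pathway (assumed): upsample the coarser level by frame
    repetition and fuse with the lateral (same-scale) feature by summation."""
    out = [feats[-1]]                              # start from coarsest level
    for f in reversed(feats[:-1]):
        up = np.repeat(out[0], 2, axis=0)[: f.shape[0]]
        out.insert(0, f + up)                      # lateral connection: sum
    return out

# Toy example: 8 frames of 4-dim features.
video = np.random.randn(8, 4)
pyramid = top_down(bottom_up(video))
print([p.shape for p in pyramid])  # [(8, 4), (4, 4), (2, 4)]
```

In the actual PMT model these pathways operate on learned convolutional/transformer features and additionally interact with the question embedding at each scale; the sketch only shows the pyramid's scale structure.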
Cite
Text
Peng et al. "Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer." AAAI Conference on Artificial Intelligence, 2023. doi:10.1609/AAAI.V37I2.25296
Markdown
[Peng et al. "Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer." AAAI Conference on Artificial Intelligence, 2023.](https://mlanthology.org/aaai/2023/peng2023aaai-efficient/) doi:10.1609/AAAI.V37I2.25296
BibTeX
@inproceedings{peng2023aaai-efficient,
title = {{Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer}},
author = {Peng, Min and Wang, Chongyang and Shi, Yu and Zhou, Xiang-Dong},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2023},
pages = {2038-2046},
doi = {10.1609/AAAI.V37I2.25296},
url = {https://mlanthology.org/aaai/2023/peng2023aaai-efficient/}
}