Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer
Abstract
This paper presents a new method for end-to-end Video Question Answering (VideoQA), in contrast to the current trend of relying on large-scale pre-training with heavy feature extractors. We achieve this with a pyramidal multimodal transformer (PMT) model, which simply incorporates a learnable word embedding layer and a small number of convolutional and transformer layers. We use an anisotropic pyramid to fulfill video-language interactions across different spatio-temporal scales. In addition to the canonical pyramid, which includes both bottom-up and top-down pathways with lateral connections, novel strategies are proposed to decompose the visual feature stream into spatial and temporal sub-streams at different scales and to implement their interactions with the linguistic semantics while preserving the integrity of local and global semantics. We demonstrate better or on-par performance with high computational efficiency against state-of-the-art methods on five VideoQA benchmarks. Our ablation study shows the scalability of our model, which achieves competitive results for text-to-video retrieval by leveraging feature extractors with reusable pre-trained weights, and also the effectiveness of the pyramid. Code available at: https://github.com/Trunpm/PMT-AAAI23.
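The paper's layer configurations are not reproduced here, but the canonical pyramid the abstract describes — a bottom-up pathway producing coarser temporal scales, followed by a top-down pathway that fuses them through lateral connections — can be illustrated with a minimal numpy sketch. All function names, the pooling factor of 2, and sum-based lateral fusion are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def bottom_up(x, num_levels=3):
    """Bottom-up pathway (assumed): average-pool frame pairs to halve the
    temporal resolution at each level, yielding a multi-scale feature list."""
    feats = [x]
    for _ in range(num_levels - 1):
        t = feats[-1]
        t = t[: (t.shape[0] // 2) * 2]            # drop odd trailing frame
        feats.append(t.reshape(-1, 2, t.shape[1]).mean(axis=1))
    return feats  # fine-to-coarse: [(T, D), (T/2, D), (T/4, D), ...]

def top_down(feats):
    """Top-down pathway (assumed): upsample the coarser level by frame
    repetition and fuse with the lateral (same-scale) feature by summation."""
    out = [feats[-1]]                              # start from coarsest level
    for f in reversed(feats[:-1]):
        up = np.repeat(out[0], 2, axis=0)[: f.shape[0]]
        out.insert(0, f + up)                      # lateral connection: sum
    return out

# Toy example: 8 frames of 4-dim features.
video = np.random.randn(8, 4)
pyramid = top_down(bottom_up(video))
print([p.shape for p in pyramid])  # [(8, 4), (4, 4), (2, 4)]
```

In the actual PMT model these pathways operate on learned convolutional/transformer features and additionally interact with the question embedding at each scale; the sketch only shows the pyramid's scale structure.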
Cite
Text
Peng et al. "Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer." AAAI Conference on Artificial Intelligence, 2023. doi:10.1609/AAAI.V37I2.25296
Markdown
[Peng et al. "Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer." AAAI Conference on Artificial Intelligence, 2023.](https://mlanthology.org/aaai/2023/peng2023aaai-efficient/) doi:10.1609/AAAI.V37I2.25296
BibTeX
@inproceedings{peng2023aaai-efficient,
title = {{Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer}},
author = {Peng, Min and Wang, Chongyang and Shi, Yu and Zhou, Xiang-Dong},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2023},
pages = {2038-2046},
doi = {10.1609/AAAI.V37I2.25296},
url = {https://mlanthology.org/aaai/2023/peng2023aaai-efficient/}
}