Video-Context Aligned Transformer for Video Question Answering

Abstract

Video question answering involves understanding video content to generate accurate answers to questions. Recent studies have successfully modeled video features and achieved diverse multimodal interaction, yielding impressive outcomes. However, they have overlooked the fact that a video contains richer instances and events beyond the scope of the stated question. This extreme imbalance in the information available from the two sides leads to significant instability in reasoning. To address this concern, we propose the Video-Context Aligned Transformer (V-CAT), which leverages the context to achieve semantic and content alignment between video and question. Specifically, the video and text are first encoded into a shared semantic space. We apply contrastive learning to the global video token and the context token to enhance semantic alignment. Then, the pooled context feature is utilized to obtain the corresponding visual content. Finally, the answer is decoded by integrating the refined video and question features. We evaluate V-CAT on the MSVD-QA and MSRVTT-QA datasets, achieving state-of-the-art performance on both. Extended experiments further analyze and demonstrate the effectiveness of each proposed module.
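
The contrastive step between the global video token and the context token can be illustrated with a small sketch. The abstract does not state the exact loss, so the symmetric InfoNCE form, the temperature value of 0.07, and all function names below are assumptions made for illustration; this is not the authors' implementation.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy_diag(logits):
    # Mean cross-entropy where the target for row i is column i,
    # i.e. the i-th video is paired with the i-th context.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_info_nce(video_tokens, context_tokens, temperature=0.07):
    # Hypothetical helper: video_tokens[i] and context_tokens[i] are the
    # global video token and context token of the same sample in a batch.
    v = [l2_normalize(x) for x in video_tokens]
    c = [l2_normalize(x) for x in context_tokens]
    # Cosine similarity between every video and every context, scaled.
    logits = [
        [sum(a * b for a, b in zip(vi, cj)) / temperature for cj in c]
        for vi in v
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the video-to-context and context-to-video directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy batch of two samples with 2-dimensional tokens.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]],
                             [[1.0, 0.0], [0.0, 1.0]])
mismatched = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]],
                                [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each video token matches its own context token and large when the pairs are swapped, which is the behavior a contrastive alignment objective is meant to encourage.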

Cite

Text

Zong et al. "Video-Context Aligned Transformer for Video Question Answering." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I17.29954

Markdown

[Zong et al. "Video-Context Aligned Transformer for Video Question Answering." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/zong2024aaai-video/) doi:10.1609/AAAI.V38I17.29954

BibTeX

@inproceedings{zong2024aaai-video,
  title     = {{Video-Context Aligned Transformer for Video Question Answering}},
  author    = {Zong, Linlin and Wan, Jiahui and Zhang, Xianchao and Liu, Xinyue and Liang, Wenxin and Xu, Bo},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {19795--19803},
  doi       = {10.1609/AAAI.V38I17.29954},
  url       = {https://mlanthology.org/aaai/2024/zong2024aaai-video/}
}