Question-Aware Gaussian Experts for Audio-Visual Question Answering

Hongyeob Kim, Inyoung Jung, Dayoon Suh, Youjia Zhang, Sangmin Lee, Sungeun Hong

CVPR 2025 pp. 13681-13690

doi:10.1109/CVPR52734.2025.01277

Abstract

Audio-Visual Question Answering (AVQA) requires not only question-based multimodal reasoning but also precise temporal grounding to capture subtle dynamics for accurate prediction. However, existing methods mainly use question information implicitly, limiting focus on question-specific details. Furthermore, most studies rely on uniform frame sampling, which can miss key question-relevant frames. Although recent Top-K frame selection methods aim to address this, their discrete nature still overlooks fine-grained temporal details. This paper proposes QA-TIGER, a novel framework that explicitly incorporates question information and models continuous temporal dynamics. Our key idea is to use Gaussian-based modeling to adaptively focus on both consecutive and non-consecutive frames based on the question, while explicitly injecting question information and applying progressive refinement. We leverage a Mixture of Experts (MoE) to flexibly implement multiple Gaussian models, activating temporal experts specifically tailored to the question. Extensive experiments on multiple AVQA benchmarks show that QA-TIGER consistently achieves state-of-the-art performance.
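The abstract's central idea, weighting frames with several question-conditioned Gaussians combined by a Mixture of Experts, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify how QA-TIGER parameterizes its Gaussians or routes between experts. The function name `gaussian_expert_pooling`, the projection matrices `w_center`, `w_width`, and `w_router`, the sigmoid and softplus parameterization, and the dense softmax gating are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gaussian_expert_pooling(frames, question, w_center, w_width, w_router):
    """Pool T frame features into one vector using question-conditioned Gaussians.

    frames:   (T, D) per-frame features
    question: (D,)   question embedding
    w_center, w_width, w_router: (D, E) hypothetical projections, one column
        per Gaussian expert (assumed for this sketch, not taken from the paper)
    """
    num_frames = frames.shape[0]
    # Normalized temporal position of every frame in [0, 1].
    t = np.linspace(0.0, 1.0, num_frames)

    # Each expert predicts a center in (0, 1) via a sigmoid and a strictly
    # positive width via softplus (plus a small floor), both from the question.
    centers = 1.0 / (1.0 + np.exp(-(question @ w_center)))   # (E,)
    widths = 0.05 + np.log1p(np.exp(question @ w_width))     # (E,)

    # Dense softmax gating: how much each expert contributes for this question.
    gates = softmax(question @ w_router)                     # (E,)

    # One Gaussian curve over time per expert, normalized to sum to 1.
    z = (t[None, :] - centers[:, None]) / widths[:, None]    # (E, T)
    curves = np.exp(-0.5 * z ** 2)
    curves = curves / curves.sum(axis=1, keepdims=True)

    # Mixture of the expert curves gives continuous per-frame weights.
    weights = gates @ curves                                 # (T,)
    pooled = weights @ frames                                # (D,)
    return pooled, weights
```

Because each expert places its own Gaussian, the mixture can put mass on several separated segments of the video at once, while every frame still receives a smooth, nonzero weight instead of the hard keep-or-drop decision of Top-K selection.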


Cite

Text

Kim et al. "Question-Aware Gaussian Experts for Audio-Visual Question Answering." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01277

Markdown

[Kim et al. "Question-Aware Gaussian Experts for Audio-Visual Question Answering." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/kim2025cvpr-questionaware/) doi:10.1109/CVPR52734.2025.01277

BibTeX

@inproceedings{kim2025cvpr-questionaware,
  title     = {{Question-Aware Gaussian Experts for Audio-Visual Question Answering}},
  author    = {Kim, Hongyeob and Jung, Inyoung and Suh, Dayoon and Zhang, Youjia and Lee, Sangmin and Hong, Sungeun},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {13681--13690},
  doi       = {10.1109/CVPR52734.2025.01277},
  url       = {https://mlanthology.org/cvpr/2025/kim2025cvpr-questionaware/}
}