Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering

Abstract

This paper focuses on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions derived from untrimmed audible videos. To generate accurate answers, an AVQA model is expected to find the most informative audio-visual clues relevant to the given questions. In this paper, we propose to explicitly consider fine-grained visual objects in video frames (object-level clues) and explore the multi-modal relations (i.e., among the object, audio, and question) in terms of feature interaction and model optimization. For the former, we present an end-to-end object-oriented network that adopts a question-conditioned clue discovery module to concentrate audio/visual modalities on respective keywords of the question and designs a modality-conditioned clue collection module to highlight closely associated audio segments or visual objects. For model optimization, we propose an object-aware adaptive-positivity learning strategy that selects the highly semantic-matched multi-modal pairs as "positivity". Specifically, we design two object-aware contrastive loss functions to identify the highly relevant question-object pairs and audio-object pairs, respectively. These selected pairs are constrained to have larger similarity values than the mismatched pairs. The positivity-selecting process is adaptive, as the positive pairs selected in each video frame may differ. These two object-aware objectives help the model understand which objects are exactly relevant to the question and which are making sounds. Extensive experiments on the MUSIC-AVQA dataset demonstrate that the proposed method is effective in finding favorable audio-visual clues and also achieves new state-of-the-art question-answering performance. The code is available at https://github.com/zhangbin-ai/APL.
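The core optimization idea, selecting the best-matched object as the positive and contrasting it against the rest, can be sketched as an InfoNCE-style loss. This is a minimal illustrative sketch, not the authors' exact formulation: the function name, shapes, and temperature value are assumptions, and the anchor stands in for either a question or an audio embedding.

```python
import numpy as np

def adaptive_positivity_loss(anchor, objects, tau=0.07):
    """Illustrative adaptive-positivity contrastive loss (hypothetical
    implementation). The object most similar to the anchor (a question
    or audio embedding) is adaptively chosen as the positive; the
    remaining objects act as negatives.

    anchor:  (d,)  embedding vector
    objects: (N, d) per-frame object embeddings
    Returns (loss, positive_index).
    """
    # cosine similarities between the anchor and every object
    a = anchor / np.linalg.norm(anchor)
    o = objects / np.linalg.norm(objects, axis=1, keepdims=True)
    sims = (o @ a) / tau

    # adaptive positivity: the best-matched object becomes the positive,
    # so the selection may differ from frame to frame
    pos = int(np.argmax(sims))

    # negative log-softmax at the positive index pushes the positive
    # pair's similarity above all mismatched pairs
    loss = -(sims[pos] - np.log(np.exp(sims - sims.max()).sum()) - sims.max())
    return loss, pos
```

Minimizing this loss raises the selected pair's similarity relative to the mismatched pairs; the paper applies one such objective to question-object pairs and another to audio-object pairs.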

Cite

Text

Li et al. "Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I4.28116

Markdown

[Li et al. "Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/li2024aaai-object/) doi:10.1609/AAAI.V38I4.28116

BibTeX

@inproceedings{li2024aaai-object,
  title     = {{Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering}},
  author    = {Li, Zhangbin and Guo, Dan and Zhou, Jinxing and Zhang, Jing and Wang, Meng},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {3306--3314},
  doi       = {10.1609/AAAI.V38I4.28116},
  url       = {https://mlanthology.org/aaai/2024/li2024aaai-object/}
}