OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs

Abstract

Recent advances in multimodal large language models (MLLMs) have demonstrated substantial potential in video understanding. However, existing benchmarks fail to comprehensively evaluate synergistic reasoning across audio and visual modalities, often neglecting one of the modalities or integrating them in a logically inconsistent manner. To bridge this gap, we introduce OmniVideoBench, a large-scale and rigorously designed benchmark dedicated to assessing synergistic audio-visual understanding, with a strong emphasis on modality complementarity and logical consistency. Specifically, OmniVideoBench comprises 1,000 high-quality question-answer (QA) pairs, each annotated with step-by-step reasoning traces, derived from 628 diverse videos ranging from several seconds to 30 minutes in length, and manually verified to guarantee complete correctness and uniqueness. Moreover, OmniVideoBench encompasses 13 carefully designed question types, covering temporal reasoning, spatial localization, counting, causal inference, summarization, and beyond, thereby capturing the essential challenges of video understanding. Evaluation of multiple MLLMs on OmniVideoBench reveals a pronounced gap between model performance and human reasoning, with open-source models lagging significantly behind their closed-source counterparts, underscoring the inherent difficulty of genuine audio-visual reasoning. We will release OmniVideoBench to foster the development of MLLMs with stronger and more generalizable reasoning capabilities.

Cite

Text

Li et al. "OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs." International Conference on Learning Representations, 2026.

Markdown

[Li et al. "OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/li2026iclr-omnivideobench/)

BibTeX

@inproceedings{li2026iclr-omnivideobench,
  title     = {{OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs}},
  author    = {Li, Caorui and Chen, Yu and Ji, Yiyan and Xu, Jin and Cui, Zhenyu and Li, Shihao and Zhang, Yuanxing and Song, Zhenghao and Zhang, Dingling and Heying,  and Liu, Haoxiang and Wang, Yuxuan and Wang, Qiufeng and Tang, Jiafu and Wu, Zhenhe and Luo, Jiehui and Pan, Zhiyu and Xie, Weihao and Zhang, Chenchen and Wang, Zhaohui and Tian, Jiayi and Wang, Yanghai and Cao, Zhe and Dai, Minxin and Wang, Ke and Wen, Runzhe and Ma, Yinghao and Pan, Yaning and Chang, Sungkyun and Taheri, Termeh and Xia, Haiwen and Plachouras, Christos and Benetos, Emmanouil and Li, Yizhi and Zhang, Ge and Yang, Jian and Peng, Tianhao and Wang, Zili and Liu, Minghao and Peng, Junran and Zhang, Zhaoxiang and Liu, Jiaheng},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/li2026iclr-omnivideobench/}
}