OmniMMI: A Comprehensive Multi-Modal Interaction Benchmark in Streaming Video Contexts

Abstract

The rapid advancement of multi-modal language models (MLLMs) such as GPT-4o has propelled the development of Omni language models (OmniLLMs), designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI comprises 1,121 videos and 2,290 questions, addressing two critical yet underexplored challenges in existing video benchmarks, streaming video understanding and proactive reasoning, across six distinct subtasks. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see and listen while generating.

Cite

Text

Wang et al. "OmniMMI: A Comprehensive Multi-Modal Interaction Benchmark in Streaming Video Contexts." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01763

Markdown

[Wang et al. "OmniMMI: A Comprehensive Multi-Modal Interaction Benchmark in Streaming Video Contexts." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/wang2025cvpr-omnimmi/) doi:10.1109/CVPR52734.2025.01763

BibTeX

@inproceedings{wang2025cvpr-omnimmi,
  title     = {{OmniMMI: A Comprehensive Multi-Modal Interaction Benchmark in Streaming Video Contexts}},
  author    = {Wang, Yuxuan and Wang, Yueqian and Chen, Bo and Wu, Tong and Zhao, Dongyan and Zheng, Zilong},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {18925--18935},
  doi       = {10.1109/CVPR52734.2025.01763},
  url       = {https://mlanthology.org/cvpr/2025/wang2025cvpr-omnimmi/}
}