RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning Through Real-Time Video

Abstract

Multimodal Large Language Models (MLLMs) have made rapid progress in perception, understanding, and reasoning, yet existing benchmarks fall short in evaluating these abilities under continuous and dynamic real-world video streams. Such settings require models to maintain coherent understanding and reasoning as visual scenes evolve over time. **We introduce RTV-Bench, a fine-grained benchmark for real-time video analysis with MLLMs**. It is built upon three key principles: multi-timestamp question answering, hierarchical question structures spanning perception and reasoning, and multi-dimensional evaluation of continuous perception, understanding, and reasoning. RTV-Bench comprises 552 diverse videos and 4,608 carefully curated QA pairs covering a wide range of dynamic scenarios. We evaluate a broad range of state-of-the-art MLLMs, including proprietary, open-source offline, and open-source real-time models. Our results show that real-time models generally outperform offline counterparts but still lag behind leading proprietary systems. While scaling model capacity generally yields performance gains, simply increasing the density of sampled input frames does not consistently translate into improved results. These observations suggest inherent limitations in current architectures when handling long-horizon video streams, underscoring the need for models explicitly designed for streaming video processing and analysis.
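To make the idea of multi-timestamp question answering concrete, here is a minimal sketch in Python. It is not RTV-Bench's actual data format or evaluation code: the `TimedQuestion` fields, the level labels `"perception"` and `"reasoning"`, the helper `accuracy_by_level`, and the example questions are all hypothetical and chosen only for illustration. The point it shows is that the same question can have different correct answers at different points in a stream, so a model that does not track the evolving scene is penalized.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple


@dataclass(frozen=True)
class TimedQuestion:
    """One multiple-choice question asked at a specific point in a video stream.

    All field names are hypothetical; they are not taken from RTV-Bench.
    """
    video_id: str
    timestamp_s: float          # moment in the stream at which the question is asked
    level: str                  # assumed labels: "perception" or "reasoning"
    question: str
    options: Tuple[str, ...]
    answer: str                 # correct option as of timestamp_s


def accuracy_by_level(
    questions: Iterable[TimedQuestion],
    predict: Callable[[TimedQuestion], str],
) -> Dict[str, float]:
    """Return the fraction of correctly answered questions for each level."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for q in questions:
        total[q.level] += 1
        if predict(q) == q.answer:
            correct[q.level] += 1
    return {level: correct[level] / total[level] for level in total}


# Toy data: the same perception question is asked twice, and the correct
# answer changes because the scene has changed between the two timestamps.
EXAMPLE_QUESTIONS = [
    TimedQuestion("v1", 10.0, "perception",
                  "How many people are in the room right now?",
                  ("1", "2", "3"), "1"),
    TimedQuestion("v1", 45.0, "perception",
                  "How many people are in the room right now?",
                  ("1", "2", "3"), "3"),
    TimedQuestion("v1", 45.0, "reasoning",
                  "Why did the number of people change?",
                  ("The camera moved", "Two people entered"),
                  "Two people entered"),
]


def always_first(q: TimedQuestion) -> str:
    """A trivial baseline that ignores the video and picks the first option."""
    return q.options[0]


if __name__ == "__main__":
    print(accuracy_by_level(EXAMPLE_QUESTIONS, always_first))
```

In this toy setup the baseline answers the first perception question correctly only by coincidence and fails once the scene changes, which is the kind of behavior a multi-timestamp design is meant to expose.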

Cite

Text

Xun et al. "RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning Through Real-Time Video." Advances in Neural Information Processing Systems, 2025.

Markdown

[Xun et al. "RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning Through Real-Time Video." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/xun2025neurips-rtvbench/)

BibTeX

@inproceedings{xun2025neurips-rtvbench,
  title     = {{RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning Through Real-Time Video}},
  author    = {Xun, ShuHang and Tao, Sicheng and Li, Jungang and Shi, Yibo and Lin, Zhixin and Zhu, Zhanhui and Yan, Yibo and Li, Hanqian and Zhang, LingHao and Wang, Shikang and Liu, Yixin and Zhang, Hanbo and Ma, Ying and Hu, Xuming},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/xun2025neurips-rtvbench/}
}