Online Video Understanding: OVBench and VideoChat-Online

Abstract

Multimodal Large Language Models (MLLMs) have progressed significantly in offline video understanding. However, applying these models to real-world scenarios, such as autonomous driving and human-computer interaction, presents unique challenges due to the need for real-time processing of continuous online video streams. To this end, this paper presents systematic efforts from three perspectives: evaluation benchmark, model architecture, and training strategy. First, we introduce OVBench, a comprehensive question-answering benchmark designed to evaluate models' ability to perceive, memorize, and reason within online video contexts. It features 6 core task types across three temporal contexts (past, current, and future), forming 16 subtasks from diverse datasets. Second, we propose a new Pyramid Memory Bank (PMB) that effectively retains key spatiotemporal information in video streams. Third, we propose an offline-to-online learning paradigm, designing an interleaved dialogue format for online video data and constructing an instruction-tuning dataset tailored for online video training. This framework led to the development of VideoChat-Online, a robust and efficient model for online video understanding. Despite its lower computational cost, VideoChat-Online outperforms existing state-of-the-art offline and online models across popular offline video benchmarks and OVBench, demonstrating the effectiveness of our model architecture and training strategy.

Cite

Text

Huang et al. "Online Video Understanding: OVBench and VideoChat-Online." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00316

Markdown

[Huang et al. "Online Video Understanding: OVBench and VideoChat-Online." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/huang2025cvpr-online/) doi:10.1109/CVPR52734.2025.00316

BibTeX

@inproceedings{huang2025cvpr-online,
  title     = {{Online Video Understanding: OVBench and VideoChat-Online}},
  author    = {Huang, Zhenpeng and Li, Xinhao and Li, Jiaqi and Wang, Jing and Zeng, Xiangyu and Liang, Cheng and Wu, Tao and Chen, Xi and Li, Liang and Wang, Limin},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {3328--3338},
  doi       = {10.1109/CVPR52734.2025.00316},
  url       = {https://mlanthology.org/cvpr/2025/huang2025cvpr-online/}
}