Online Video Understanding: OVBench and VideoChat-Online

Huang, Zhenpeng; Li, Xinhao; Li, Jiaqi; Wang, Jing; Zeng, Xiangyu; Liang, Cheng; Wu, Tao; Chen, Xi; Li, Liang; Wang, Limin

doi:10.1109/CVPR52734.2025.00316

CVPR 2025 pp. 3328-3338

Abstract

Multimodal Large Language Models (MLLMs) have progressed significantly in offline video understanding. However, applying these models to real-world scenarios, such as autonomous driving and human-computer interaction, presents unique challenges due to the need for real-time processing of continuous online video streams. To this end, this paper presents systematic efforts from three perspectives: evaluation benchmark, model architecture, and training strategy. First, we introduce OVBench, a comprehensive question-answering benchmark designed to evaluate models' ability to perceive, memorize, and reason within online video contexts. It features 6 core task types across three temporal contexts (past, current, and future), forming 16 subtasks from diverse datasets. Second, we propose a new Pyramid Memory Bank (PMB) that effectively retains key spatiotemporal information in video streams. Third, we propose an offline-to-online learning paradigm, designing an interleaved dialogue format for online video data and constructing an instruction-tuning dataset tailored for online video training. This framework led to the development of VideoChat-Online, a robust and efficient model for online video understanding. Despite its lower computational cost and higher efficiency, VideoChat-Online outperforms existing state-of-the-art offline and online models across popular offline video benchmarks and OVBench, demonstrating the effectiveness of our model architecture and training strategy.

Cite

Text

Huang et al. "Online Video Understanding: OVBench and VideoChat-Online." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00316

Markdown

[Huang et al. "Online Video Understanding: OVBench and VideoChat-Online." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/huang2025cvpr-online/) doi:10.1109/CVPR52734.2025.00316

BibTeX

@inproceedings{huang2025cvpr-online,
  title     = {{Online Video Understanding: OVBench and VideoChat-Online}},
  author    = {Huang, Zhenpeng and Li, Xinhao and Li, Jiaqi and Wang, Jing and Zeng, Xiangyu and Liang, Cheng and Wu, Tao and Chen, Xi and Li, Liang and Wang, Limin},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {3328--3338},
  doi       = {10.1109/CVPR52734.2025.00316},
  url       = {https://mlanthology.org/cvpr/2025/huang2025cvpr-online/}
}