Online Video Understanding: OVBench and VideoChat-Online
Abstract
Multimodal Large Language Models (MLLMs) have made significant progress in offline video understanding. However, applying these models to real-world scenarios, such as autonomous driving and human-computer interaction, presents unique challenges because continuous online video streams must be processed in real time. To this end, this paper presents systematic efforts from three perspectives: evaluation benchmark, model architecture, and training strategy. First, we introduce OVBench, a comprehensive question-answering benchmark designed to evaluate models' ability to perceive, memorize, and reason within online video contexts. It features six core task types across three temporal contexts (past, current, and future), forming 16 subtasks from diverse datasets. Second, we propose a new Pyramid Memory Bank (PMB) that effectively retains key spatiotemporal information in video streams. Third, we propose an offline-to-online learning paradigm, designing an interleaved dialogue format for online video data and constructing an instruction-tuning dataset tailored for online video training. This framework led to the development of VideoChat-Online, a robust and efficient model for online video understanding. Despite its lower computational cost and higher efficiency, VideoChat-Online outperforms existing state-of-the-art offline and online models across popular offline video benchmarks and OVBench, demonstrating the effectiveness of our model architecture and training strategy.
Cite
Text
Huang et al. "Online Video Understanding: OVBench and VideoChat-Online." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00316
Markdown
[Huang et al. "Online Video Understanding: OVBench and VideoChat-Online." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/huang2025cvpr-online/) doi:10.1109/CVPR52734.2025.00316
BibTeX
@inproceedings{huang2025cvpr-online,
title = {{Online Video Understanding: OVBench and VideoChat-Online}},
author = {Huang, Zhenpeng and Li, Xinhao and Li, Jiaqi and Wang, Jing and Zeng, Xiangyu and Liang, Cheng and Wu, Tao and Chen, Xi and Li, Liang and Wang, Limin},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {3328-3338},
doi = {10.1109/CVPR52734.2025.00316},
url = {https://mlanthology.org/cvpr/2025/huang2025cvpr-online/}
}