Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions

Zhang, Kecheng; Yang, Zongxin; Han, Mingfei; Hao, Haihong; Zhuge, Yunzhi; Li, Changlin; Zhao, Junhan; Li, Zhihui; Chang, Xiaojun

ICLR 2026

Abstract

Visual agents operating in the wild must respond to queries precisely when sufficient evidence first appears in a video stream, a critical capability that is overlooked by conventional video LLMs evaluated in offline settings. The shift to an online, streaming paradigm introduces significant challenges: a lack of decision transparency, the difficulty of aligning response timing with visual evidence, and the need to maintain a global, causally consistent understanding under tight computational budgets. To address these issues, we propose a novel framework that decouples reasoning control from memory integration. We introduce Thinking-QwenVL, an instantiation of this framework with two core components. First, the Active Thinking Decision Maker (ATDM) is a transparent reasoning controller that externalizes its decision process using observable progress ($\boldsymbol{\rho}$) and confidence ($\boldsymbol{c}$) metrics. This allows it to precisely time its response $t_r$ to match the first-sufficient-evidence timestamp $t^\star$ while streaming its reasoning to the user. Second, the Hierarchical Progressive Semantic Integration (HPSI) module acts as an efficient memory system. It employs a set of learnable, multi-level aggregation tokens that are propagated across clips to build a rich, global cognitive state without exceeding token budgets. Extensive experiments demonstrate the effectiveness of ATDM and HPSI, e.g., Thinking-QwenVL improves the accuracy of the previous state-of-the-art from 67.63% to 71.60% on the StreamingBench benchmark.
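
The timing idea in the abstract can be illustrated with a small sketch. The abstract does not state how ATDM turns progress ($\boldsymbol{\rho}$) and confidence ($\boldsymbol{c}$) into a decision, so the rule below (respond at the first clip where both values clear fixed thresholds) is purely an assumption for illustration. The names `ClipSignal`, `first_response_time`, `timing_error`, `rho_min`, and `c_min` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ClipSignal:
    """Per-clip controller outputs (hypothetical container, not from the paper)."""
    timestamp: float   # end time of the clip, in seconds
    progress: float    # rho, assumed to lie in [0, 1]
    confidence: float  # c, assumed to lie in [0, 1]


def first_response_time(
    signals: Iterable[ClipSignal],
    rho_min: float = 0.8,  # assumed threshold, not a value from the paper
    c_min: float = 0.7,    # assumed threshold, not a value from the paper
) -> Optional[float]:
    """Return t_r: the timestamp of the first clip whose progress and
    confidence both reach their thresholds, or None if no clip qualifies."""
    for s in signals:
        if s.progress >= rho_min and s.confidence >= c_min:
            return s.timestamp
    return None


def timing_error(t_r: Optional[float], t_star: float) -> Optional[float]:
    """Signed gap t_r - t_star: positive means a late response,
    negative means the response came before the evidence was sufficient."""
    if t_r is None:
        return None
    return t_r - t_star


# Toy stream: signals rise as more of the video is observed.
stream = [
    ClipSignal(timestamp=2.0, progress=0.2, confidence=0.3),
    ClipSignal(timestamp=4.0, progress=0.6, confidence=0.5),
    ClipSignal(timestamp=6.0, progress=0.9, confidence=0.8),
    ClipSignal(timestamp=8.0, progress=1.0, confidence=0.9),
]
t_r = first_response_time(stream)
```

With a ground-truth $t^\star$ available, `timing_error(t_r, t_star)` gives one simple way to measure how closely the response time tracks the first-sufficient-evidence timestamp. The paper's own evaluation protocol may differ.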

Cite

Text

Zhang et al. "Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions." International Conference on Learning Representations, 2026.

Markdown

[Zhang et al. "Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhang2026iclr-progressive/)

BibTeX

@inproceedings{zhang2026iclr-progressive,
  title     = {{Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions}},
  author    = {Zhang, Kecheng and Yang, Zongxin and Han, Mingfei and Hao, Haihong and Zhuge, Yunzhi and Li, Changlin and Zhao, Junhan and Li, Zhihui and Chang, Xiaojun},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/zhang2026iclr-progressive/}
}