StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue Through Event-Gated Cognition

Ding, Xin; Wu, Hao; Yang, Yifan; Jiang, Shiqi; Zhang, Qianxi; Bai, Donglin; Chen, Zhibo; Cao, Ting

ICCV 2025 pp. 13448-13459

Abstract

With the rise of real-world human-AI interaction applications such as AI assistants, Streaming Video Dialogue has become a critical need. To address it, we introduce StreamMind, a video LLM framework that achieves ultra-FPS streaming video processing (100 fps on a single A100) and enables proactive, always-on responses in real time, without explicit user intervention. The key challenge is the mismatch between the linear speed of the video stream and the quadratic computation cost of the transformer. To resolve it, we propose a novel perception-cognition interleaving paradigm named "event-gated LLM invocation", in contrast to the existing per-time-step LLM invocation. A Cognition Gate network placed between the video encoder and the LLM ensures that the LLM is invoked only when relevant events occur. To extract event features at constant cost, we propose the Event-Preserving Feature Extractor (EPFE), based on a state-space method, which generates a single perception token for spatiotemporal features. These techniques give the video LLM full-FPS perception and real-time cognition response. Experiments on Ego4D and SoccerNet streaming tasks, as well as standard offline benchmarks, demonstrate state-of-the-art performance in both model capability and real-time efficiency, paving the way for ultra-high-FPS applications such as Game AI Copilot and interactive media.
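
The control flow that the abstract calls event-gated LLM invocation can be illustrated with a toy sketch. It is not the authors' implementation. The exponential moving average below is only a stand-in for the state-space EPFE, the threshold test is a stand-in for the learned Cognition Gate, and `run_event_gated_stream`, `threshold_gate`, `llm`, and `decay` are hypothetical names chosen for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def run_event_gated_stream(
    frames: Sequence[Sequence[float]],
    gate: Callable[[List[float]], bool],
    llm: Callable[[List[float]], str],
    decay: float = 0.9,
) -> List[Tuple[int, str]]:
    """Process frames one at a time and call `llm` only when `gate` fires.

    Assumption: each frame is already a fixed-length feature vector.
    """
    if not frames:
        return []
    # One fixed-size state plays the role of the single perception token.
    state = [0.0] * len(frames[0])
    responses: List[Tuple[int, str]] = []
    for t, frame in enumerate(frames):
        # Constant-cost recurrent update: the state never grows with t.
        # (An exponential moving average, standing in for a state-space model.)
        state = [decay * s + (1.0 - decay) * x for s, x in zip(state, frame)]
        # Perception runs on every frame; cognition runs only on events.
        if gate(state):
            responses.append((t, llm(state)))
    return responses


def threshold_gate(state: List[float], threshold: float = 0.5) -> bool:
    """Hypothetical gate: fire when any state component exceeds a threshold."""
    return max(abs(s) for s in state) > threshold
```

In this sketch the per-frame work is a fixed-size update, so the cost of perception stays constant per frame, while the expensive `llm` callable runs only on the frames where the gate fires.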

Cite

Text

Ding et al. "StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue Through Event-Gated Cognition." International Conference on Computer Vision, 2025.

Markdown

[Ding et al. "StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue Through Event-Gated Cognition." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/ding2025iccv-streammind/)

BibTeX

@inproceedings{ding2025iccv-streammind,
  title     = {{StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue Through Event-Gated Cognition}},
  author    = {Ding, Xin and Wu, Hao and Yang, Yifan and Jiang, Shiqi and Zhang, Qianxi and Bai, Donglin and Chen, Zhibo and Cao, Ting},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {13448-13459},
  url       = {https://mlanthology.org/iccv/2025/ding2025iccv-streammind/}
}