LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant

Abstract

First-person video assistants are highly anticipated to enhance our daily lives through online video dialogue. However, existing online video assistants often sacrifice assistant efficacy for real-time efficiency by processing low-frame-rate videos with coarse-grained visual features. To overcome the trade-off between efficacy and efficiency, we propose "**F**ast & **S**low Video-Language Thinker" as on**LI**ne vide**O** assista**N**t, **LION-FS**, achieving real-time, proactive, temporally accurate, and contextually precise responses. LION-FS adopts a two-stage optimization strategy. **1) Fast Path: Routing-Based Response Determination** evaluates frame by frame whether an immediate response is necessary. To improve response determination accuracy and handle higher-frame-rate inputs efficiently, we employ Token Aggregation Routing to dynamically fuse spatiotemporal features without increasing the number of tokens, and Token Dropping Routing to eliminate redundant features. **2) Slow Path: Multi-granularity Keyframe Augmentation** optimizes keyframes during response generation. To provide comprehensive and detailed responses beyond atomic actions constrained by training data, fine-grained spatial features and human-environment interaction features are extracted through multi-granular pooling. These features are further integrated into a meticulously designed multimodal Thinking Template to guide more precise response generation. Comprehensive evaluations on online video tasks demonstrate that LION-FS achieves state-of-the-art efficacy and efficiency. The code will be released soon.
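
The abstract names two ideas that a toy example can make concrete: pooling a keyframe at several granularities, and dropping tokens that are redundant with the previous frame. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The abstract does not specify the pooling operator, the grid sizes, or the routing criterion, so average pooling, the `sizes` tuple, and the fixed `tol` threshold are all hypothetical stand-ins, as are the function names. In particular, the thresholded difference substitutes for whatever routing the paper actually uses.

```python
from typing import List, Tuple

Grid = List[List[float]]  # toy H x W feature map, one scalar per patch (assumption)


def avg_pool(grid: Grid, out_size: int) -> Grid:
    """Average-pool a square grid down to out_size x out_size cells."""
    n = len(grid)
    assert n % out_size == 0, "toy version needs an evenly divisible grid"
    step = n // out_size
    pooled = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            block = [grid[i][j]
                     for i in range(r * step, (r + 1) * step)
                     for j in range(c * step, (c + 1) * step)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def multi_granular_tokens(grid: Grid, sizes: Tuple[int, ...] = (1, 2)) -> List[float]:
    """Concatenate pooled grids from coarse to fine into one flat token list."""
    tokens: List[float] = []
    for size in sizes:
        for row in avg_pool(grid, size):
            tokens.extend(row)
    return tokens


def drop_redundant(prev: List[float], curr: List[float],
                   tol: float = 0.05) -> List[Tuple[int, float]]:
    """Keep (index, value) pairs of curr that changed by more than tol.

    A hand-written threshold stands in for a learned routing decision.
    """
    return [(i, v) for i, (p, v) in enumerate(zip(prev, curr)) if abs(v - p) > tol]


# Toy 4 x 4 keyframe with four uniform quadrants.
frame = [
    [1.0, 1.0, 3.0, 3.0],
    [1.0, 1.0, 3.0, 3.0],
    [5.0, 5.0, 7.0, 7.0],
    [5.0, 5.0, 7.0, 7.0],
]
print(multi_granular_tokens(frame))
print(drop_redundant([1.0, 3.0, 5.0, 7.0], [1.0, 3.5, 5.0, 7.0]))
```

In this toy setting the coarse token summarizes the whole frame while the finer tokens keep per-region detail, which is the intuition behind extracting features at more than one granularity. The dropping step shows how an unchanged region can be skipped so that a higher frame rate does not have to mean proportionally more tokens.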

Cite

Text

Li et al. "LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant." Conference on Computer Vision and Pattern Recognition, 2025.

Markdown

[Li et al. "LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/li2025cvpr-lionfs/)

BibTeX

@inproceedings{li2025cvpr-lionfs,
  title     = {{LION-FS: Fast \& Slow Video-Language Thinker as Online Video Assistant}},
  author    = {Li, Wei and Hu, Bing and Shao, Rui and Shen, Leyang and Nie, Liqiang},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {3240--3251},
  url       = {https://mlanthology.org/cvpr/2025/li2025cvpr-lionfs/}
}