SURGE: Surprise-Guided Token Reduction for Efficient Video Understanding with VLMs

Abstract

Videos contain rich information but also high redundancy, as consecutive frames often share similar backgrounds and predictable motions. Current video-language models (VLMs) are unable to exploit this redundancy and therefore perform a significant amount of superfluous computation, processing thousands of patch tokens even when little new information is present. What is missing is an on-the-fly, model-agnostic signal of temporal predictability to decide whether tokens carry unpredictable information that merits computation. We propose SURGE, a training-free and backbone-agnostic method that measures surprise in token space. Surprise scores are defined by the prediction error of each token from its recent history; high-surprise tokens are retained, while predictable ones are pruned. Aggregating scores over time produces a surprise curve that highlights key events, which can be further refined with CLIP-based query relevance to form a compact spatio-temporal mask. Experiments on multiple video understanding benchmarks show that SURGE reduces tokens by up to 7× and prefill cost by 86–98%, while maintaining accuracy within ±1 point of full-token baselines. By aligning computation with novelty, SURGE enables video VLMs to handle long contexts efficiently and without retraining.
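
The surprise-scoring idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the predictor, so this sketch assumes the simplest one (the mean of the same patch position over the last few frames, or a zero vector when there is no history) and an L2 prediction error. The function names `surprise_scores`, `keep_mask`, and `surprise_curve`, and the `history` and `threshold` parameters, are hypothetical. The CLIP-based query-relevance refinement is not covered.

```python
import math


def surprise_scores(frames, history=2):
    """Score each token by its L2 prediction error.

    `frames` is a list of frames; each frame is a list of patch tokens;
    each token is a list of floats. ASSUMPTION: the prediction for a patch
    is the mean of that patch over the previous `history` frames, and the
    zero vector when no earlier frame exists.
    """
    scores = []
    for t, frame in enumerate(frames):
        past = frames[max(0, t - history):t]
        row = []
        for p, token in enumerate(frame):
            if past:
                pred = [sum(f[p][d] for f in past) / len(past)
                        for d in range(len(token))]
            else:
                pred = [0.0] * len(token)
            err = math.sqrt(sum((a - b) ** 2 for a, b in zip(token, pred)))
            row.append(err)
        scores.append(row)
    return scores


def keep_mask(scores, threshold):
    """Retain tokens whose surprise exceeds `threshold`; prune the rest."""
    return [[s > threshold for s in row] for row in scores]


def surprise_curve(scores):
    """Aggregate per-token scores into one mean surprise value per frame."""
    return [sum(row) / len(row) for row in scores]
```

In this sketch a static patch scores zero once it has history and is pruned, while a patch that changes abruptly produces a spike both in its own score and in the per-frame curve. A real system would operate on the vision encoder's patch embeddings and would likely use a stronger predictor and a budgeted top-k selection rather than a fixed threshold.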

Cite

Text

Tang et al. "SURGE: Surprise-Guided Token Reduction for Efficient Video Understanding with VLMs." International Conference on Learning Representations, 2026.

Markdown

[Tang et al. "SURGE: Surprise-Guided Token Reduction for Efficient Video Understanding with VLMs." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/tang2026iclr-surge/)

BibTeX

@inproceedings{tang2026iclr-surge,
  title     = {{SURGE: Surprise-Guided Token Reduction for Efficient Video Understanding with VLMs}},
  author    = {Tang, Chong and Ek, Sannara and Koch, Dirk and Mullins, Robert D. and Weddell, Alex S. and Chauhan, Jagmohan},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/tang2026iclr-surge/}
}