FOCUS: Efficient Keyframe Selection for Long Video Understanding

Zhu, Zirui; Xu, Hailun; Luo, Yang; Liu, Yong; Sarkar, Kanchan; Yang, Zhenheng; You, Yang

ICLR 2026

Abstract

Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either subsample frames uniformly or select keyframes with retrieval-style scoring from smaller vision-language models. However, these keyframe selection methods still pre-filter frames before selection to reduce inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms and uses empirical means and Bernstein confidence radii to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure is derived from a sequential policy with theoretical guarantees: it first identifies high-value temporal regions, then selects the top-scoring frames within each region. Extensive experiments across four long-video question-answering benchmarks and four popular MLLMs demonstrate that FOCUS delivers substantial accuracy improvements while processing fewer than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs.
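
The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the parameters, the single non-adaptive probing round, and the particular empirical-Bernstein form of the radius are all assumptions made for illustration, and `score_fn` is a hypothetical stand-in for a query-frame relevance scorer that returns values in [0, 1].

```python
import math
import random


def bernstein_radius(variance, n, delta=0.05, value_range=1.0):
    """One common empirical-Bernstein radius (assumed form, not taken from the paper)."""
    log_term = math.log(3.0 / delta)
    return math.sqrt(2.0 * variance * log_term / n) + 3.0 * value_range * log_term / n


def select_keyframes(num_frames, score_fn, clip_len=8, probes_per_clip=2,
                     top_clips=2, frames_per_clip=2, seed=0):
    """Hypothetical two-stage selection: pick promising clips, then frames inside them."""
    rng = random.Random(seed)
    # Arms are short, non-overlapping temporal clips (lists of frame indices).
    clips = [list(range(start, min(start + clip_len, num_frames)))
             for start in range(0, num_frames, clip_len)]

    # Stage 1: probe a few frames per clip and rank clips by an optimistic
    # upper bound (empirical mean + confidence radius).
    upper_bounds = []
    for frames in clips:
        probes = rng.sample(frames, min(probes_per_clip, len(frames)))
        scores = [score_fn(i) for i in probes]
        mean = sum(scores) / len(scores)
        variance = sum((s - mean) ** 2 for s in scores) / len(scores)
        upper_bounds.append(mean + bernstein_radius(variance, len(scores)))
    chosen = sorted(range(len(clips)), key=lambda c: upper_bounds[c],
                    reverse=True)[:top_clips]

    # Stage 2: score every frame in the chosen clips and keep the best ones.
    selected = []
    for c in chosen:
        best = sorted(clips[c], key=score_fn, reverse=True)[:frames_per_clip]
        selected.extend(best)
    return sorted(selected)
```

Calling `select_keyframes(num_frames, score_fn)` returns a sorted list of frame indices. The radius shrinks as a clip receives more probes, so rarely sampled clips keep a large optimistic bonus; that is the sense in which ranking by an upper bound preserves exploration of uncertain regions.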

Cite

Text

Zhu et al. "FOCUS: Efficient Keyframe Selection for Long Video Understanding." International Conference on Learning Representations, 2026.

Markdown

[Zhu et al. "FOCUS: Efficient Keyframe Selection for Long Video Understanding." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhu2026iclr-focus/)

BibTeX

@inproceedings{zhu2026iclr-focus,
  title     = {{FOCUS: Efficient Keyframe Selection for Long Video Understanding}},
  author    = {Zhu, Zirui and Xu, Hailun and Luo, Yang and Liu, Yong and Sarkar, Kanchan and Yang, Zhenheng and You, Yang},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/zhu2026iclr-focus/}
}