ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding

Ma, David; Yuan, Huaqing; Wang, Xingjian; Zang, Qianbo; Liu, Tianci; He, Xinyang; Wei, Yanbin; Guo, Jiawei; Nijiahui; Yang, Zhenzhu; Cao, Meng; Quan, Shanghaoran; Li, Yizhi; Zhou, Wangchunshu; Liu, Jiaheng; Huang, Wenhao; Zhang, Ge; Ni, Shiwen; Jin, Xiaojie

ICLR 2026

Abstract

Long-video understanding requires models to capture hierarchical temporal information, from clip and shot to event and story. Existing benchmarks, however, either neglect this multi-scale design or scatter scale-specific questions across different videos, which prevents direct comparison of model performance across timescales on the same content. To address this, we introduce ScaleLong, the first benchmark to disentangle these factors by embedding questions that target four hierarchical timescales (clip, shot, event, and story) within the same video content. This within-content multi-timescale questioning design enables direct comparison of model performance across timescales on identical videos. ScaleLong features 269 long videos (avg. 86 min) from 5 main categories and 36 sub-categories; each video has 4–8 carefully designed questions, with at least one question targeting each timescale. Evaluating 23 MLLMs reveals a distinct U-shaped performance trend: accuracy is higher at the shortest (clip) and longest (story) timescales and dips at the intermediate levels. Furthermore, ablation studies demonstrate that increased visual token capacity consistently enhances reasoning across all timescales. ScaleLong offers a fine-grained, multi-timescale benchmark for advancing MLLM capabilities in long-video understanding. The code and dataset are available at https://github.com/multimodal-art-projection/ScaleLong
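The within-content design described in the abstract can be illustrated with a minimal sketch of per-timescale scoring. This is not the authors' evaluation code: the record fields (`video_id`, `timescale`, `correct`), the function names, and the toy answers are all hypothetical and chosen only for illustration. The sketch checks that every video has at least one question per timescale, then computes accuracy separately for each timescale over the same set of videos.

```python
from collections import defaultdict

# The four hierarchical timescales named in the abstract.
TIMESCALES = ("clip", "shot", "event", "story")


def covers_all_timescales(records):
    """Return True if every video has at least one question per timescale."""
    per_video = defaultdict(set)
    for r in records:
        per_video[r["video_id"]].add(r["timescale"])
    return all(set(TIMESCALES) <= scales for scales in per_video.values())


def accuracy_by_timescale(records):
    """Compute accuracy per timescale from graded question records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r["timescale"]] += 1
        hits[r["timescale"]] += int(r["correct"])
    return {t: hits[t] / totals[t] for t in TIMESCALES if totals[t]}


# Made-up toy answers for two hypothetical videos (not benchmark results).
toy_records = [
    {"video_id": "v1", "timescale": "clip", "correct": True},
    {"video_id": "v1", "timescale": "shot", "correct": False},
    {"video_id": "v1", "timescale": "event", "correct": False},
    {"video_id": "v1", "timescale": "story", "correct": True},
    {"video_id": "v2", "timescale": "clip", "correct": True},
    {"video_id": "v2", "timescale": "shot", "correct": True},
    {"video_id": "v2", "timescale": "event", "correct": False},
    {"video_id": "v2", "timescale": "story", "correct": True},
]

if __name__ == "__main__":
    print(covers_all_timescales(toy_records))
    print(accuracy_by_timescale(toy_records))
```

Because every video contributes questions at all four timescales, differences between the per-timescale accuracies cannot be attributed to differences in video content, which is the comparison the benchmark is designed to enable.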

Cite

Text

Ma et al. "ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding." International Conference on Learning Representations, 2026.

Markdown

[Ma et al. "ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/ma2026iclr-scalelong/)

BibTeX

@inproceedings{ma2026iclr-scalelong,
  title     = {{ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding}},
  author    = {Ma, David and Yuan, Huaqing and Wang, Xingjian and Zang, Qianbo and Liu, Tianci and He, Xinyang and Wei, Yanbin and Guo, Jiawei and Nijiahui and Yang, Zhenzhu and Cao, Meng and Quan, Shanghaoran and Li, Yizhi and Zhou, Wangchunshu and Liu, Jiaheng and Huang, Wenhao and Zhang, Ge and Ni, Shiwen and Jin, Xiaojie},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/ma2026iclr-scalelong/}
}