VideoNSA: Native Sparse Attention Scales Video Understanding
Abstract
Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We employ a hardware-aware hybrid approach to attention, preserving dense attention for text while employing NSA for video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global–local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined sparse attention helps induce dynamic attention sinks.
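The hybrid idea in the abstract (dense attention for text tokens, sparse attention for video tokens) can be illustrated with a toy attention mask. The sketch below is not the VideoNSA or NSA implementation: it assumes a single head, and it stands in a plain causal sliding window for the sparse video path purely for illustration. The names `hybrid_causal_mask`, `masked_attention`, `is_video`, and `window` are hypothetical.

```python
import numpy as np

def hybrid_causal_mask(is_video, window=4):
    """Boolean mask M[q, k]: True where query q may attend to key k.

    Assumption for illustration only: text queries use full causal
    attention, video queries use a causal sliding window of `window` keys.
    """
    is_video = np.asarray(is_video, dtype=bool)
    n = is_video.shape[0]
    q = np.arange(n)[:, None]   # query positions, column vector
    k = np.arange(n)[None, :]   # key positions, row vector
    causal = k <= q             # no attention to future tokens
    local = (q - k) < window    # keys within the last `window` positions
    # Pick the row pattern by the modality of the query token.
    return np.where(is_video[:, None], causal & local, causal)

def masked_attention(Q, K, V, mask):
    """Single-head scaled dot-product attention under a boolean mask."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Each row keeps its diagonal entry, so the row maximum is finite.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

With this mask, the number of keys a video query touches is bounded by `window` no matter how long the sequence is, while a text query still sees the full preceding context. That is the general cost trade-off a hybrid scheme aims for; the method described in the abstract uses learned sparse attention rather than a fixed window.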
Cite
Text
Song et al. "VideoNSA: Native Sparse Attention Scales Video Understanding." International Conference on Learning Representations, 2026.
Markdown
[Song et al. "VideoNSA: Native Sparse Attention Scales Video Understanding." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/song2026iclr-videonsa/)
BibTeX
@inproceedings{song2026iclr-videonsa,
title = {{VideoNSA: Native Sparse Attention Scales Video Understanding}},
author = {Song, Enxin and Chai, Wenhao and Yang, Shusheng and Armand, Ethan J. and Shan, Xiaojun and Xu, Haiyang and Xie, Jianwen and Tu, Zhuowen},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/song2026iclr-videonsa/}
}