Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

Abstract

The application of Large Vision-Language Models (LVLMs) to image and video analysis is an exciting and rapidly evolving field. Recent years have seen significant growth in high-quality image-text datasets for fine-tuning image understanding, but comparable datasets for video remain scarce. Moreover, many VideoLLMs are extensions of single-image VLMs and may not handle the complexity of longer videos efficiently. In this study, we introduce a large-scale synthetic dataset created from proprietary models, using carefully designed prompts to cover a wide range of question types. We also explore a dynamic visual token compression architecture that strikes a balance between computational efficiency and performance. Our proposed Dynamic-VLM achieves state-of-the-art results across various video tasks and shows strong generalization, setting new baselines in multi-image understanding. Notably, Dynamic-VLM delivers an absolute improvement of 2.7% over LLaVA-OneVision on VideoMME and 10.7% on MuirBench.
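To make the idea of dynamic visual token compression concrete, the minimal sketch below illustrates one common way such compression can work: the per-frame pooling grid shrinks as the number of frames grows, so the total visual token count stays near a fixed budget. This is an illustrative assumption, not the paper's exact design; the function name compress_visual_tokens, the token_budget value, and the use of adaptive average pooling are placeholders for exposition only.

import torch
import torch.nn.functional as F

def compress_visual_tokens(frame_tokens: torch.Tensor, token_budget: int = 2048) -> torch.Tensor:
    """Hypothetical sketch: pool per-frame visual tokens so that the total
    token count across all frames stays at or below `token_budget`.

    frame_tokens: (num_frames, num_tokens, dim), where num_tokens is a square
                  number of patch tokens per frame (e.g. 24 * 24 = 576).
    """
    num_frames, num_tokens, dim = frame_tokens.shape
    side = int(num_tokens ** 0.5)  # original spatial grid per frame, e.g. 24

    # Pick a per-frame grid so that num_frames * grid**2 <= token_budget.
    grid = max(1, min(side, int((token_budget / num_frames) ** 0.5)))

    # Restore the spatial layout (num_frames, dim, side, side) and pool each frame.
    x = frame_tokens.transpose(1, 2).reshape(num_frames, dim, side, side)
    x = F.adaptive_avg_pool2d(x, output_size=grid)

    # Flatten back to a token sequence: (num_frames, grid * grid, dim).
    return x.reshape(num_frames, dim, grid * grid).transpose(1, 2)

# Example: 64 frames of 576 tokens each are pooled to a 5x5 grid per frame,
# giving 64 * 25 = 1600 tokens in total, within the assumed 2048-token budget.
tokens = torch.randn(64, 576, 1024)
print(compress_visual_tokens(tokens).shape)  # torch.Size([64, 25, 1024])

With few frames the grid stays at the original resolution (little or no compression); with many frames each frame is compressed more aggressively, which is the efficiency-versus-performance trade-off the abstract refers to.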

Cite

Text

Wang et al. "Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM." International Conference on Computer Vision, 2025.

Markdown

[Wang et al. "Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/wang2025iccv-dynamicvlm/)

BibTeX

@inproceedings{wang2025iccv-dynamicvlm,
  title     = {{Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM}},
  author    = {Wang, Han and Nie, Yuxiang and Ye, Yongjie and Wang, Yanjie and Li, Shuai and Yu, Haiyang and Lu, Jinghui and Huang, Can},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {20812--20823},
  url       = {https://mlanthology.org/iccv/2025/wang2025iccv-dynamicvlm/}
}