DToMA: Training-Free Dynamic Token MAnipulation for Long Video Understanding
Abstract
Video Large Language Models (VideoLLMs) often require thousands of visual tokens to process long videos, leading to substantial computational costs that are further exacerbated by visual token inefficiency. Existing token reduction and alternative video representation methods improve efficiency but often compromise comprehension. In this work, we analyze the reasoning processes of VideoLLMs on the multiple-choice VideoQA task, identifying three reasoning stages (shallow, intermediate, and deep) that closely mimic human cognitive processing. Our analysis reveals specific inefficiencies at each stage: in shallow layers, VideoLLMs attempt to memorize all video details without prioritizing relevant content; in intermediate layers, models fail to re-examine uncertain content dynamically; and in deep layers, they continue processing video even when sufficiently confident. To bridge this gap, we propose DToMA, a training-free Dynamic Token MAnipulation method inspired by human adjustment mechanisms in three aspects: 1) Text-guided keyframe-aware reorganization to prioritize keyframes and reduce redundancy, 2) Uncertainty-based visual injection to revisit content dynamically, and 3) Early-exit pruning to halt visual-token processing once the model is sufficiently confident. Experiments on 6 long video understanding benchmarks show that DToMA enhances both efficiency and comprehension, outperforming state-of-the-art methods and generalizing well across 3 VideoLLM architectures and sizes. Code is available at https://github.com/yuanrr/DToMA.
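For intuition only, the sketch below illustrates the general idea behind text-guided keyframe selection: score each frame by its similarity to the query tokens and keep only the top-scoring fraction. The function name, tensor shapes, and max-similarity scoring rule are assumptions made for this illustration, not DToMA's actual algorithm (see the paper and repository for the real method, which also covers uncertainty-based injection and early-exit pruning).

```python
import torch
import torch.nn.functional as F

def select_keyframes(frame_feats: torch.Tensor,
                     text_feats: torch.Tensor,
                     keep_ratio: float = 0.5):
    """Illustrative sketch of text-guided keyframe selection.

    frame_feats: (num_frames, dim) pooled visual features, one row per frame (assumed shape).
    text_feats:  (num_text_tokens, dim) embeddings of the text query (assumed shape).
    keep_ratio:  fraction of frames to retain as keyframes.
    """
    # Cosine similarity between every frame and every text token.
    frames = F.normalize(frame_feats, dim=-1)
    text = F.normalize(text_feats, dim=-1)
    sim = frames @ text.T                         # (num_frames, num_text_tokens)

    # Score each frame by its best-matching text token.
    frame_scores = sim.max(dim=-1).values         # (num_frames,)

    # Keep the top-scoring fraction, preserving temporal order.
    k = max(1, int(keep_ratio * frame_feats.size(0)))
    keep_idx = frame_scores.topk(k).indices.sort().values
    return frame_feats[keep_idx], keep_idx
```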
Cite
Text
Yuan et al. "DToMA: Training-Free Dynamic Token MAnipulation for Long Video Understanding." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/258Markdown
[Yuan et al. "DToMA: Training-Free Dynamic Token MAnipulation for Long Video Understanding." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/yuan2025ijcai-dtoma/) doi:10.24963/IJCAI.2025/258BibTeX
@inproceedings{yuan2025ijcai-dtoma,
title = {{DToMA: Training-Free Dynamic Token MAnipulation for Long Video Understanding}},
author = {Yuan, Bowen and You, Sisi and Bao, Bing-Kun},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2025},
pages = {2314-2322},
doi = {10.24963/IJCAI.2025/258},
url = {https://mlanthology.org/ijcai/2025/yuan2025ijcai-dtoma/}
}