FastVGGT: Fast Visual Geometry Transformer
Abstract
Scaling visual geometry transformers for long image sequences poses a significant computational and memory challenge. In this work, we diagnose this issue in the state-of-the-art model VGGT, and trace the primary bottleneck to its Global Attention layer. Our analysis reveals a "token collapse" phenomenon, where many tokens attend to nearly identical regions, resulting in redundant computation and inefficiency. Motivated by this finding, we propose FastVGGT, a training-free framework that strategically prunes these redundant tokens. Instead of uniform merging, FastVGGT employs a tailored, three-part token partitioning strategy. It preserves initial-frame tokens as a stable global reference, retains salient tokens to maintain fine details, and utilizes region-based random sampling to ensure spatially balanced coverage. Extensive experiments on multiple 3D geometry benchmarks validate our approach's effectiveness. Notably, on sequences of 1000 images, FastVGGT achieves a 4$\times$ speedup over the original VGGT while simultaneously mitigating error accumulation, demonstrating its efficiency and robustness for long-sequence scenarios. For further details, please visit our project page: https://mystorm16.github.io/fastvggt/.
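The three-part partitioning described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `partition_tokens`, the `(frame, row, col)` token indexing, the `saliency` score dictionary, and the `region` and `salient_ratio` parameters are all hypothetical, and the abstract does not say how saliency is computed or how the selected tokens are used afterwards. The sketch only shows one plausible way to (1) keep every initial-frame token, (2) keep the highest-scoring tokens of later frames, and (3) draw one random token per spatial region so that the kept tokens cover each frame evenly.

```python
import random


def partition_tokens(num_frames, grid_h, grid_w, saliency,
                     region=2, salient_ratio=0.1, seed=0):
    """Split token indices into kept groups and merge candidates.

    Tokens are identified as (frame, row, col). `saliency` maps each token
    to a score; how that score is obtained is an assumption left open here.
    """
    rng = random.Random(seed)
    tokens = [(f, r, c)
              for f in range(num_frames)
              for r in range(grid_h)
              for c in range(grid_w)]

    # 1) Keep every token of the initial frame as a global reference.
    reference = {t for t in tokens if t[0] == 0}

    # 2) Among later frames, keep the top-scoring fraction of tokens.
    later = [t for t in tokens if t[0] > 0]
    k = int(len(later) * salient_ratio)
    ranked = sorted(later, key=lambda t: saliency[t], reverse=True)
    salient = set(ranked[:k])

    # 3) In each region x region cell of each later frame, pick one token
    #    at random so the kept tokens are spatially balanced.
    sampled = set()
    for f in range(1, num_frames):
        for r0 in range(0, grid_h, region):
            for c0 in range(0, grid_w, region):
                cell = [(f, r, c)
                        for r in range(r0, min(r0 + region, grid_h))
                        for c in range(c0, min(c0 + region, grid_w))]
                sampled.add(rng.choice(cell))

    # Everything not selected above is a candidate for merging.
    kept = reference | salient | sampled
    merge_candidates = [t for t in tokens if t not in kept]
    return {
        "reference": reference,
        "salient": salient,
        "sampled": sampled,
        "merge_candidates": merge_candidates,
    }


if __name__ == "__main__":
    # Toy input: 3 frames of 4x4 tokens with random placeholder scores.
    score_rng = random.Random(1)
    scores = {(f, r, c): score_rng.random()
              for f in range(3) for r in range(4) for c in range(4)}
    parts = partition_tokens(3, 4, 4, scores)
    for name, group in parts.items():
        print(name, len(group))
```

In this sketch the salient and region-sampled sets may overlap, so the number of kept tokens per later frame is at least one per region and at most that plus the salient quota.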
Cite
Text
Shen et al. "FastVGGT: Fast Visual Geometry Transformer." International Conference on Learning Representations, 2026.
Markdown
[Shen et al. "FastVGGT: Fast Visual Geometry Transformer." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/shen2026iclr-fastvggt/)
BibTeX
@inproceedings{shen2026iclr-fastvggt,
title = {{FastVGGT: Fast Visual Geometry Transformer}},
author = {Shen, You and Zhang, Zhipeng and Qu, Yansong and Zheng, Xiawu and Ji, Jiayi and Zhang, Shengchuan and Cao, Liujuan},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/shen2026iclr-fastvggt/}
}