FastVGGT: Fast Visual Geometry Transformer

Abstract

Scaling visual geometry transformers for long image sequences poses a significant computational and memory challenge. In this work, we diagnose this issue in the state-of-the-art model VGGT, and trace the primary bottleneck to its Global Attention layer. Our analysis reveals a "token collapse" phenomenon, where many tokens attend to nearly identical regions, resulting in redundant computation and inefficiency. Motivated by this finding, we propose FastVGGT, a training-free framework that strategically prunes these redundant tokens. Instead of uniform merging, FastVGGT employs a tailored, three-part token partitioning strategy. It preserves initial-frame tokens as a stable global reference, retains salient tokens to maintain fine details, and utilizes region-based random sampling to ensure spatially balanced coverage. Extensive experiments on multiple 3D geometry benchmarks validate our approach's effectiveness. Notably, on sequences of 1000 images, FastVGGT achieves a 4× speedup over the original VGGT while simultaneously mitigating error accumulation, demonstrating its efficiency and robustness for long-sequence scenarios. For further details, please visit our project page: https://mystorm16.github.io/fastvggt/.
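The three-part partitioning described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `partition_tokens`, the per-token `saliency` score, and the `region` and `salient_ratio` parameters are all assumptions introduced for illustration, and the abstract does not specify how saliency is computed or how the remaining tokens are subsequently handled. The sketch only marks which tokens would be kept on a per-frame patch grid.

```python
import numpy as np

def partition_tokens(saliency, region=2, salient_ratio=0.1, seed=0):
    """Return a boolean mask (frames, grid_h, grid_w); True marks a kept token.

    `saliency` is a hypothetical per-token score of the same shape;
    how FastVGGT actually scores tokens is not stated in the abstract.
    """
    num_frames, grid_h, grid_w = saliency.shape
    rng = np.random.default_rng(seed)
    keep = np.zeros(saliency.shape, dtype=bool)

    # Part 1: keep every token of the initial frame as a global reference.
    keep[0] = True

    # Part 2: keep the highest-scoring tokens among the later frames.
    later = saliency[1:]
    k = int(salient_ratio * later.size)
    if k > 0:
        top = np.argpartition(later.reshape(-1), -k)[-k:]
        f, r, c = np.unravel_index(top, later.shape)
        keep[f + 1, r, c] = True

    # Part 3: region-based random sampling. One random token is kept in
    # each region x region cell so that coverage stays spatially balanced.
    for f in range(1, num_frames):
        for r0 in range(0, grid_h, region):
            for c0 in range(0, grid_w, region):
                h = min(region, grid_h - r0)
                w = min(region, grid_w - c0)
                j = rng.integers(h * w)
                keep[f, r0 + j // w, c0 + j % w] = True

    return keep
```

Tokens left unmarked by the mask are the candidates for reduction. With `region=2`, at least one in four tokens of each later frame survives, and the salient tokens are kept on top of that.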

Cite

Text

Shen et al. "FastVGGT: Fast Visual Geometry Transformer." International Conference on Learning Representations, 2026.

Markdown

[Shen et al. "FastVGGT: Fast Visual Geometry Transformer." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/shen2026iclr-fastvggt/)

BibTeX

@inproceedings{shen2026iclr-fastvggt,
  title     = {{FastVGGT: Fast Visual Geometry Transformer}},
  author    = {Shen, You and Zhang, Zhipeng and Qu, Yansong and Zheng, Xiawu and Ji, Jiayi and Zhang, Shengchuan and Cao, Liujuan},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/shen2026iclr-fastvggt/}
}