WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool

Abstract

We present WinT3R, a feed-forward reconstruction model capable of online prediction of precise camera poses and high-quality point maps. Previous methods suffer from a trade-off between reconstruction quality and real-time performance. To address this, we first introduce a sliding window mechanism that ensures sufficient information exchange among frames within the window, thereby improving the quality of geometric predictions without introducing a large amount of extra computation. In addition, we leverage a compact representation of cameras and maintain a global camera token pool, which enhances the reliability of camera pose estimation without sacrificing efficiency. These designs enable WinT3R to achieve state-of-the-art performance in terms of online reconstruction quality, camera pose estimation, and reconstruction speed, as validated by extensive experiments on diverse datasets.
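
The two ideas named in the abstract, a sliding window over the incoming frames and a global pool of compact camera tokens, can be illustrated with a toy sketch. This is not the authors' implementation: the window size, the stride, the `camera_token` placeholder, and the dictionary used as the pool are all assumptions made for illustration, and no neural network is involved. It only shows how overlapping windows could be formed from a stream and how a pool with one small token per frame grows while each window still sees the whole history.

```python
from collections import deque
from typing import Dict, Iterable, Iterator, List, Tuple


def sliding_windows(frames: Iterable[int], size: int, stride: int) -> Iterator[List[int]]:
    """Yield windows of `size` consecutive frames, advancing by `stride` frames.

    Trailing frames that do not complete a window are not flushed in this sketch.
    """
    buf: deque = deque(maxlen=size)  # keeps only the most recent `size` frames
    since_last = 0                   # frames seen since the last emitted window
    for frame in frames:
        buf.append(frame)
        since_last += 1
        if len(buf) == size and since_last >= stride:
            yield list(buf)
            since_last = 0


def camera_token(frame_id: int) -> Tuple[float, float]:
    """Hypothetical stand-in for a learned compact camera representation."""
    return (float(frame_id), 0.0)


def run_stream(frames: Iterable[int], size: int = 4, stride: int = 2) -> List[Tuple[List[int], int]]:
    """Process a stream window by window while maintaining a global token pool.

    Returns, for each window, the frame ids it contains and the pool size
    available as global context at that point.
    """
    pool: Dict[int, Tuple[float, float]] = {}  # frame id -> compact camera token
    log: List[Tuple[List[int], int]] = []
    for window in sliding_windows(frames, size, stride):
        for frame_id in window:
            # Frames shared by overlapping windows refresh their token
            # instead of adding a duplicate entry.
            pool[frame_id] = camera_token(frame_id)
        log.append((window, len(pool)))
    return log


if __name__ == "__main__":
    for window, pool_size in run_stream(range(8)):
        print(window, "pool size:", pool_size)
```

Because the pool stores one small token per frame rather than full image features, its memory grows slowly with sequence length. In this toy setting, that is the sense in which global context can be kept without heavy extra computation.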

Cite

Text

Li et al. "WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool." International Conference on Learning Representations, 2026.

Markdown

[Li et al. "WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/li2026iclr-wint3r/)

BibTeX

@inproceedings{li2026iclr-wint3r,
  title     = {{WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool}},
  author    = {Li, Zizun and Zhou, Jianjun and Wang, Yifan and Guo, Haoyu and Chang, Wenzheng and Zhou, Yang and Zhu, Haoyi and Chen, Junyi and Shen, Chunhua and He, Tong},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/li2026iclr-wint3r/}
}