WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool
Abstract
We present WinT3R, a feed-forward reconstruction model capable of online prediction of precise camera poses and high-quality point maps. Previous methods suffer from a trade-off between reconstruction quality and real-time performance. To address this, we first introduce a sliding window mechanism that ensures sufficient information exchange among frames within the window, thereby improving the quality of geometric predictions without introducing a large amount of extra computation. In addition, we leverage a compact representation of cameras and maintain a global camera token pool, which enhances the reliability of camera pose estimation without sacrificing efficiency. These designs enable WinT3R to achieve state-of-the-art performance in terms of online reconstruction quality, camera pose estimation, and reconstruction speed, as validated by extensive experiments on diverse datasets.
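The two ideas named in the abstract, a sliding window over the incoming frames and a global pool of compact camera tokens, can be illustrated with a minimal sketch. This is not the authors' implementation: the window size, the stride, the `encode` callable, and the names `sliding_windows`, `CameraTokenPool`, and `run_stream` are all hypothetical, and the real model exchanges information among frames through a learned network rather than the placeholder used here.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Frame = Tuple[int, object]  # (frame index, image data); hypothetical layout


def sliding_windows(frames: Iterable[Frame], size: int, stride: int) -> Iterator[List[Frame]]:
    """Yield overlapping windows of up to `size` frames as they arrive."""
    if not 0 < stride <= size:
        raise ValueError("require 0 < stride <= size")
    buf: List[Frame] = []
    pending = 0  # frames received since the last emitted window
    for frame in frames:
        buf.append(frame)
        pending += 1
        if len(buf) == size:
            yield list(buf)
            buf = buf[stride:]  # keep the overlapping frames for the next window
            pending = 0
    if pending:
        yield list(buf)  # trailing partial window


class CameraTokenPool:
    """Stores one compact token per frame seen so far (assumed design)."""

    def __init__(self) -> None:
        self._tokens: Dict[int, object] = {}

    def put(self, frame_idx: int, token: object) -> None:
        self._tokens[frame_idx] = token

    def tokens(self) -> List[object]:
        return [self._tokens[i] for i in sorted(self._tokens)]

    def __len__(self) -> int:
        return len(self._tokens)


def run_stream(frames: Iterable[Frame], size: int, stride: int,
               encode: Callable[[object], object]):
    """Process a stream window by window, growing the global pool."""
    pool = CameraTokenPool()
    log = []  # (frame indices in the window, pool size visible after it)
    for window in sliding_windows(frames, size, stride):
        for idx, image in window:
            pool.put(idx, encode(image))  # `encode` stands in for a learned encoder
        log.append(([idx for idx, _ in window], len(pool)))
    return pool, log
```

Calling `run_stream([(i, f"img{i}") for i in range(7)], size=4, stride=2, encode=len)` processes three windows, and the pool grows with each one, so later windows can draw on camera tokens from every earlier frame while the per-window computation stays bounded by the window size.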
Cite
Text
Li et al. "WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool." International Conference on Learning Representations, 2026.
Markdown
[Li et al. "WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/li2026iclr-wint3r/)
BibTeX
@inproceedings{li2026iclr-wint3r,
title = {{WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool}},
author = {Li, Zizun and Zhou, Jianjun and Wang, Yifan and Guo, Haoyu and Chang, Wenzheng and Zhou, Yang and Zhu, Haoyi and Chen, Junyi and Shen, Chunhua and He, Tong},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/li2026iclr-wint3r/}
}