RUST: Latent Neural Scene Representations from Unposed Imagery

Abstract

Inferring the structure of 3D scenes from 2D observations is a fundamental challenge in computer vision. Recently popularized approaches based on neural scene representations have achieved tremendous impact and have been applied across a variety of applications. One of the major remaining challenges in this space is training a single model that provides latent representations which generalize effectively beyond a single scene. Scene Representation Transformer (SRT) has shown promise in this direction, but scaling it to a larger set of diverse scenes is challenging and requires accurately posed ground-truth data. To address this problem, we propose RUST (Really Unposed Scene representation Transformer), a pose-free approach to novel view synthesis trained on RGB images alone. Our main insight is that one can train a Pose Encoder that peeks at the target image and learns a latent pose embedding, which the decoder then uses for view synthesis. We perform an empirical investigation into the learned latent pose structure and show that it allows meaningful test-time camera transformations and accurate explicit pose readouts. Perhaps surprisingly, RUST achieves quality similar to that of methods with access to perfect camera poses, thereby unlocking the potential for large-scale training of amortized neural scene representations.
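The data flow described above (context images to scene representation, target image to latent pose, latent pose plus pixel query to rendered color) can be illustrated with a toy NumPy sketch. This is not the RUST architecture: random matrices and a single parameter-free attention step stand in for learned transformers, no training loop is shown, and every name (`patchify`, `pose_encoder`, `decode`, `D`, `POSE_DIM`, `PATCH`) is hypothetical. Letting the pose encoder also attend to the scene tokens is an assumption made so that the latent pose can be read relative to the scene. The sketch shows only where the target image enters (through a small latent pose bottleneck) and that the objective is a plain RGB reconstruction loss, with no camera poses anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 32        # hypothetical token width
POSE_DIM = 8  # hypothetical size of the latent pose bottleneck
PATCH = 8     # hypothetical patch size


def patchify(img, patch=PATCH):
    """Split an (H, W, 3) image into flattened, non-overlapping patches."""
    h, w, c = img.shape
    p = img[: h - h % patch, : w - w % patch]
    p = p.reshape(h // patch, patch, w // patch, patch, c)
    return p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)


def attend(queries, keys_values):
    """Single-head dot-product cross-attention without learned projections."""
    logits = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values


# Random matrices stand in for learned weights (assumption: no training here).
W_patch = rng.normal(0, 0.05, (PATCH * PATCH * 3, D))
W_pose = rng.normal(0, 0.1, (D, POSE_DIM))
W_query = rng.normal(0, 0.1, (POSE_DIM + 2, D))
W_rgb = rng.normal(0, 0.1, (D, 3))
pose_query = rng.normal(0, 1, (1, D))


def encode_scene(context_images):
    """Scene representation: patch tokens from all unposed context images."""
    return np.concatenate([patchify(im) @ W_patch for im in context_images])


def pose_encoder(target_image, scene_tokens):
    """Peek at the target image and squeeze it into a small latent pose vector."""
    target_tokens = patchify(target_image) @ W_patch
    pooled = attend(pose_query, np.concatenate([target_tokens, scene_tokens]))
    return (pooled @ W_pose)[0]  # shape (POSE_DIM,)


def decode(scene_tokens, latent_pose, pixel_coords):
    """Render RGB for pixel coordinates, conditioned only on the latent pose."""
    n = len(pixel_coords)
    q = np.concatenate([np.tile(latent_pose, (n, 1)), pixel_coords], axis=1) @ W_query
    return 1 / (1 + np.exp(-(attend(q, scene_tokens) @ W_rgb)))  # sigmoid -> (0, 1)


# Usage: three unposed context views and one target view, RGB only.
context = [rng.random((32, 32, 3)) for _ in range(3)]
target = rng.random((32, 32, 3))

scene = encode_scene(context)
z_pose = pose_encoder(target, scene)

ys, xs = np.mgrid[0:32, 0:32]
coords = np.stack([xs.ravel(), ys.ravel()], axis=1) / 31.0
rendered = decode(scene, z_pose, coords).reshape(32, 32, 3)
loss = np.mean((rendered - target) ** 2)  # RGB reconstruction objective
```

The point of the bottleneck in this sketch is that the decoder never sees target pixels directly; it only receives the `POSE_DIM`-dimensional latent, so (in a trained model) that vector would have to carry viewpoint information rather than a copy of the image.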

Cite

Text

Sajjadi et al. "RUST: Latent Neural Scene Representations from Unposed Imagery." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01659

Markdown

[Sajjadi et al. "RUST: Latent Neural Scene Representations from Unposed Imagery." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/sajjadi2023cvpr-rust/) doi:10.1109/CVPR52729.2023.01659

BibTeX

@inproceedings{sajjadi2023cvpr-rust,
  title     = {{RUST: Latent Neural Scene Representations from Unposed Imagery}},
  author    = {Sajjadi, Mehdi S. M. and Mahendran, Aravindh and Kipf, Thomas and Pot, Etienne and Duckworth, Daniel and Lučić, Mario and Greff, Klaus},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {17297--17306},
  doi       = {10.1109/CVPR52729.2023.01659},
  url       = {https://mlanthology.org/cvpr/2023/sajjadi2023cvpr-rust/}
}