NeRF as Pretraining at Scale: Generalizable 3D-Aware Semantic Representation Learning from View Prediction

Abstract

Cross-scene generalizable NeRF models, which can directly synthesize novel views from several source views of unseen scenes, are gaining prominence in the NeRF field. Observing signs of emergent capabilities in existing methods, we draw a parallel between BERT's "drop-and-predict" Masked Language Model (MLM) pretraining and novel view synthesis (NVS) in generalizable NeRF. In this work, we pioneer the scaling up of NVS as an effective pretraining strategy in a multi-view context. To bolster generalizability in pretraining, we incorporate a large-scale, minimally annotated dataset and proportionally increase the model size, revealing a neural scaling law akin to that observed in BERT. We also introduce hardness-aware training techniques to enhance robust feature learning. Our model, named "NPS", demonstrates remarkable generalizability in both zero-shot and few-shot novel view synthesis. It further shows emergent capabilities in downstream tasks such as few-shot multi-view semantic segmentation and depth estimation. Notably, NPS reduces the need to train separate models for each task, underlining its versatility and efficiency. This approach sets a new precedent in the NeRF field and highlights the vast possibilities opened up by scaling up generalizable novel view synthesis.
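To make the "drop-and-predict" analogy concrete, the sketch below shows one way such view-prediction pretraining could be set up: for each scene, one view is held out and the model must render it from the remaining source views and the held-out camera pose. This is a minimal illustration of the idea only; the GeneralizableRenderer module, its forward signature, and the averaging-based view aggregation are hypothetical placeholders and do not reflect the authors' actual architecture or training pipeline.

# Minimal sketch of "drop-and-predict" view-prediction pretraining, analogous
# to BERT's MLM. All module and function names here are illustrative assumptions.
import torch
import torch.nn as nn

class GeneralizableRenderer(nn.Module):
    """Placeholder for a cross-scene generalizable NeRF-style model (hypothetical)."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.encoder = nn.Conv2d(3, feat_dim, 3, padding=1)
        self.decoder = nn.Conv2d(feat_dim, 3, 3, padding=1)

    def forward(self, src_views: torch.Tensor, tgt_pose: torch.Tensor) -> torch.Tensor:
        # src_views: (B, V, 3, H, W); tgt_pose: (B, 4, 4) camera-to-world matrix.
        B, V, C, H, W = src_views.shape
        feats = self.encoder(src_views.flatten(0, 1)).view(B, V, -1, H, W)
        # A real generalizable NeRF would aggregate features along rays defined by
        # tgt_pose; averaging across source views is only a stand-in here.
        return self.decoder(feats.mean(dim=1))

def drop_and_predict_step(model, views, poses, optimizer):
    """One pretraining step: hold out a random view and predict it from the rest."""
    B, V = views.shape[:2]
    tgt_idx = torch.randint(V, (1,)).item()            # "drop" one view
    src_idx = [i for i in range(V) if i != tgt_idx]    # keep the others as sources
    pred = model(views[:, src_idx], poses[:, tgt_idx])
    loss = nn.functional.mse_loss(pred, views[:, tgt_idx])  # photometric loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = GeneralizableRenderer()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    views = torch.rand(2, 4, 3, 64, 64)            # toy batch: 2 scenes, 4 views each
    poses = torch.eye(4).expand(2, 4, 4, 4).clone()  # dummy camera poses
    print(drop_and_predict_step(model, views, poses, opt))

In this framing, the held-out view plays the role of a masked token and the photometric reconstruction loss plays the role of the MLM objective; the paper's contribution is scaling this kind of objective with large, minimally annotated multi-view data and larger models.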

Cite

Text

Cong et al. "NeRF as Pretraining at Scale: Generalizable 3D-Aware Semantic Representation Learning from View Prediction." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00293

Markdown

[Cong et al. "NeRF as Pretraining at Scale: Generalizable 3D-Aware Semantic Representation Learning from View Prediction." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/cong2024cvprw-nerf/) doi:10.1109/CVPRW63382.2024.00293

BibTeX

@inproceedings{cong2024cvprw-nerf,
  title     = {{NeRF as Pretraining at Scale: Generalizable 3D-Aware Semantic Representation Learning from View Prediction}},
  author    = {Cong, Wenyan and Liang, Hanxue and Fan, Zhiwen and Wang, Peihao and Jiang, Yifan and Xu, Dejia and Öztireli, A. Cengiz and Wang, Zhangyang},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {2872--2882},
  doi       = {10.1109/CVPRW63382.2024.00293},
  url       = {https://mlanthology.org/cvprw/2024/cong2024cvprw-nerf/}
}