Improving 2D Feature Representations by 3D-Aware Fine-Tuning

Abstract

Current visual foundation models are trained purely on unstructured 2D data, limiting their understanding of the 3D structure of objects and scenes. In this work, we show that fine-tuning on 3D-aware data improves the quality of emerging semantic features. We design a method to lift semantic 2D features into an efficient 3D Gaussian representation, which allows us to re-render them for arbitrary views. Using the rendered 3D-aware features, we design a fine-tuning strategy to transfer such 3D awareness into a 2D foundation model. We demonstrate that models fine-tuned in this way produce features that readily improve downstream task performance in semantic segmentation and depth estimation through simple linear probing. Notably, though fine-tuned on a single indoor dataset, the improvement transfers to a variety of other indoor datasets and to out-of-domain datasets. We hope our study encourages the community to consider injecting 3D awareness when training 2D foundation models. Project page: https://ywyue.github.io/FiT3D.
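
For readers unfamiliar with the evaluation protocol mentioned above, the following is a minimal sketch of linear probing in general, not the authors' implementation. The backbone is kept frozen and only a linear classifier is trained on its features. The function `frozen_backbone`, the toy data, and all hyperparameters are hypothetical stand-ins chosen for illustration; the paper's actual models, datasets, and training settings are not reproduced here.

```python
import math
import random


def frozen_backbone(x):
    """Hypothetical stand-in for a frozen feature extractor.

    It is a fixed map with no trainable parameters, so the probe
    below cannot change the features, only read them.
    """
    return [x[0] + x[1], x[0] - x[1], x[0] * x[1]]


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe on fixed features by gradient descent."""
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for f, y in zip(features, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - y
            for i in range(dim):
                grad_w[i] += err * f[i]
            grad_b += err
        # Only the probe parameters (w, b) are updated.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, f):
    """Return the predicted class (0 or 1) for one feature vector."""
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0


# Toy data (assumption): two classes split by the sign of x0 + x1.
rng = random.Random(0)
inputs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
labels = [1 if x0 + x1 > 0 else 0 for x0, x1 in inputs]

# Features are computed once and never updated.
features = [frozen_backbone(x) for x in inputs]
w, b = train_linear_probe(features, labels)
accuracy = sum(predict(w, b, f) == y for f, y in zip(features, labels)) / len(labels)
print(f"probe accuracy: {accuracy:.2f}")
```

Because the probe is linear, its accuracy reflects how linearly separable the frozen features already are, which is why this protocol is commonly used to compare feature quality between a base model and a fine-tuned one.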

Cite

Text

Yue et al. "Improving 2D Feature Representations by 3D-Aware Fine-Tuning." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72627-9_4

Markdown

[Yue et al. "Improving 2D Feature Representations by 3D-Aware Fine-Tuning." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/yue2024eccv-improving/) doi:10.1007/978-3-031-72627-9_4

BibTeX

@inproceedings{yue2024eccv-improving,
  title     = {{Improving 2D Feature Representations by 3D-Aware Fine-Tuning}},
  author    = {Yue, Yuanwen and Das, Anurag and Engelmann, Francis and Tang, Siyu and Lenssen, Jan Eric},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72627-9_4},
  url       = {https://mlanthology.org/eccv/2024/yue2024eccv-improving/}
}