g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks

Abstract

We introduce Generalizable 3D-Language Feature Fields (g3D-LF), a 3D representation model pre-trained on a large-scale 3D-language dataset for embodied tasks. Our g3D-LF processes posed RGB-D images from agents to encode feature fields that support: 1) predicting novel view representations from any position in the 3D scene; 2) generating BEV maps centered on the agent; and 3) querying targets with multi-granularity language within the above representations. Our representation generalizes to unseen environments, enabling real-time construction and dynamic updates. By volume rendering latent features along sampled rays and integrating semantic and spatial relationships through multiscale encoders, g3D-LF produces representations at different scales and perspectives that are aligned with multi-granularity language via multi-level contrastive learning. Furthermore, we prepare a large-scale 3D-language dataset to align the feature-field representations with language. Extensive experiments on Vision-and-Language Navigation under both panorama and monocular settings, Zero-shot Object Navigation, and Situated Question Answering highlight the significant advantages and effectiveness of g3D-LF for embodied tasks. Our source code and dataset will be made open-source upon paper acceptance.
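For readers unfamiliar with feature fields, the sketch below illustrates the core operation the abstract refers to: volume rendering latent features along sampled rays, i.e., alpha-compositing per-sample feature vectors with NeRF-style density weights. This is a minimal NumPy sketch of the generic technique under assumed inputs, not the authors' implementation; the function name and array shapes are hypothetical.

```python
import numpy as np

def render_latent_features(densities, features, deltas):
    """Alpha-composite latent feature vectors along each ray (hypothetical helper).

    densities: (R, S) non-negative volume densities at S samples per ray
    features:  (R, S, D) latent feature vector at each sample
    deltas:    (R, S) distances between adjacent samples along each ray
    Returns:   (R, D) one rendered latent feature per ray
    """
    # Per-sample opacity from density and sample spacing.
    alphas = 1.0 - np.exp(-densities * deltas)
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = np.cumprod(1.0 - alphas + 1e-10, axis=1)
    trans = np.concatenate([np.ones_like(trans[:, :1]), trans[:, :-1]], axis=1)
    # Compositing weights, then a weighted sum of features instead of colors.
    weights = alphas * trans
    return (weights[..., None] * features).sum(axis=1)
```

Replacing the RGB output of standard volume rendering with latent features is what lets the rendered novel-view and BEV representations be compared against language embeddings, e.g., via the contrastive objectives mentioned above.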

Cite

Text

Wang and Lee. "g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01324

Markdown

[Wang and Lee. "g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/wang2025cvpr-g3dlf/) doi:10.1109/CVPR52734.2025.01324

BibTeX

@inproceedings{wang2025cvpr-g3dlf,
  title     = {{g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks}},
  author    = {Wang, Zihan and Lee, Gim Hee},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {14191--14202},
  doi       = {10.1109/CVPR52734.2025.01324},
  url       = {https://mlanthology.org/cvpr/2025/wang2025cvpr-g3dlf/}
}