SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation

Abstract

We present SceneFactor, a diffusion-based approach for large-scale 3D scene generation that enables controllable generation and effortless editing. SceneFactor enables text-guided 3D scene synthesis through our factored diffusion formulation, leveraging latent semantic and geometric manifolds for generation of arbitrary-sized 3D scenes. While text input enables easy, controllable generation, text guidance remains imprecise for intuitive, localized editing and manipulation of the generated 3D scenes. Our factored semantic diffusion generates a proxy semantic space composed of semantic 3D boxes that enables controllable editing of generated scenes by adding, removing, or resizing the semantic 3D proxy boxes, which in turn guide high-fidelity, consistent 3D geometric editing. Extensive experiments demonstrate that our approach enables high-fidelity 3D scene synthesis with effective controllable editing through our factored diffusion approach.

Cite

Text

Bokhovkin et al. "SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00067

Markdown

[Bokhovkin et al. "SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/bokhovkin2025cvpr-scenefactor/) doi:10.1109/CVPR52734.2025.00067

BibTeX

@inproceedings{bokhovkin2025cvpr-scenefactor,
  title     = {{SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation}},
  author    = {Bokhovkin, Aleksey and Meng, Quan and Tulsiani, Shubham and Dai, Angela},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {628--639},
  doi       = {10.1109/CVPR52734.2025.00067},
  url       = {https://mlanthology.org/cvpr/2025/bokhovkin2025cvpr-scenefactor/}
}