Bolt3D: Generating 3D Scenes in Seconds

Abstract

We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model, Bolt3D, directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of 300. Project website: szymanowiczs.github.io/bolt3d

Cite

Text

Szymanowicz et al. "Bolt3D: Generating 3D Scenes in Seconds." International Conference on Computer Vision, 2025.

Markdown

[Szymanowicz et al. "Bolt3D: Generating 3D Scenes in Seconds." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/szymanowicz2025iccv-bolt3d/)

BibTeX

@inproceedings{szymanowicz2025iccv-bolt3d,
  title     = {{Bolt3D: Generating 3D Scenes in Seconds}},
  author    = {Szymanowicz, Stanislaw and Zhang, Jason Y. and Srinivasan, Pratul and Gao, Ruiqi and Brussee, Arthur and Holynski, Aleksander and Martin-Brualla, Ricardo and Barron, Jonathan T. and Henzler, Philipp},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {24846--24857},
  url       = {https://mlanthology.org/iccv/2025/szymanowicz2025iccv-bolt3d/}
}