Bolt3D: Generating 3D Scenes in Seconds
Abstract
We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of 300. Project website: szymanowiczs.github.io/bolt3d
Cite
Text
Szymanowicz et al. "Bolt3D: Generating 3D Scenes in Seconds." International Conference on Computer Vision, 2025.
Markdown
[Szymanowicz et al. "Bolt3D: Generating 3D Scenes in Seconds." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/szymanowicz2025iccv-bolt3d/)
BibTeX
@inproceedings{szymanowicz2025iccv-bolt3d,
title = {{Bolt3D: Generating 3D Scenes in Seconds}},
author = {Szymanowicz, Stanislaw and Zhang, Jason Y. and Srinivasan, Pratul and Gao, Ruiqi and Brussee, Arthur and Holynski, Aleksander and Martin-Brualla, Ricardo and Barron, Jonathan T. and Henzler, Philipp},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {24846--24857},
url = {https://mlanthology.org/iccv/2025/szymanowicz2025iccv-bolt3d/}
}