HexaGen3D: StableDiffusion Is One Step Away from Fast and Diverse Text-to-3D Generation

Abstract

Despite the latest remarkable advances in generative modeling, efficient generation of high-quality 3D objects from textual prompts remains a difficult task. A key challenge lies in data scarcity: the most extensive 3D datasets encompass merely millions of samples, while their 2D counterparts contain billions of text-image pairs. To address this, we propose a novel approach that harnesses the power of large pretrained 2D diffusion models. More specifically, our approach, HexaGen3D, fine-tunes a pretrained text-to-image model to jointly predict 6 orthographic projections and the corresponding 3D latent. We then decode these latents to generate a textured mesh. HexaGen3D does not require per-sample optimization and can infer high-quality, diverse objects from textual prompts in 7 seconds, offering significantly better quality-to-latency trade-offs than existing approaches. Furthermore, HexaGen3D demonstrates strong generalization to new objects or compositions.
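As a rough illustration of the pipeline the abstract describes, the sketch below walks through the two stages in Python: a fine-tuned StableDiffusion UNet jointly denoises latents for the 6 orthographic views and a 3D latent, and a decoder turns the 3D latent into a textured mesh. Everything here is an assumption for illustration only; the function names, the 7-slot latent layout, and the simplified denoising update are hypothetical placeholders, not the authors' released code.

```python
import torch

def hexagen3d_infer(prompt, unet, text_encoder, mesh_decoder,
                    num_steps=50, latent_hw=64):
    """Hypothetical sketch of the two-stage HexaGen3D pipeline.

    Stage 1: a fine-tuned StableDiffusion UNet jointly denoises latents
             for 6 orthographic views plus one 3D latent.
    Stage 2: the 3D latent is decoded into a textured mesh.
    All interfaces are assumptions, not the paper's actual API.
    """
    text_emb = text_encoder(prompt)

    # Assumed layout: 6 view latents + 1 slot for the 3D latent,
    # 4 channels each (StableDiffusion-style latents).
    z = torch.randn(7, 4, latent_hw, latent_hw)

    # Schematic denoising loop; a real sampler would use a proper
    # noise schedule rather than this simplified update.
    for t in reversed(range(num_steps)):
        noise_pred = unet(z, t, text_emb)
        z = z - noise_pred / num_steps

    view_latents, latent_3d = z[:6], z[6:]
    return mesh_decoder(latent_3d)  # textured mesh

# Toy stand-ins so the sketch executes end-to-end (all hypothetical):
unet = lambda z, t, emb: torch.zeros_like(z)
text_encoder = lambda p: torch.zeros(1, 77, 768)
mesh_decoder = lambda z3d: {"vertices": torch.empty(0, 3),
                            "faces": torch.empty(0, 3, dtype=torch.long),
                            "texture": torch.empty(0, 3)}

mesh = hexagen3d_infer("a ceramic teapot", unet, text_encoder, mesh_decoder)
```

The point of the sketch is the structure: because the six views and the 3D latent are denoised jointly by a single fine-tuned text-to-image model, the whole generation is feed-forward, which is what lets the method avoid per-sample optimization and hit the reported 7-second latency.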

Cite

Text

Mercier et al. "HexaGen3D: StableDiffusion Is One Step Away from Fast and Diverse Text-to-3D Generation." Winter Conference on Applications of Computer Vision, 2025.

Markdown

[Mercier et al. "HexaGen3D: StableDiffusion Is One Step Away from Fast and Diverse Text-to-3D Generation." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/mercier2025wacv-hexagen3d/)

BibTeX

@inproceedings{mercier2025wacv-hexagen3d,
  title     = {{HexaGen3D: StableDiffusion Is One Step Away from Fast and Diverse Text-to-3D Generation}},
  author    = {Mercier, Antoine and Nakhli, Ramin and Reddy, Mahesh and Yasarla, Rajeev and Cai, Hong and Porikli, Fatih and Berger, Guillaume},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2025},
  pages     = {1247--1257},
  url       = {https://mlanthology.org/wacv/2025/mercier2025wacv-hexagen3d/}
}