Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

Ding, Yanbo; Zhuang, Shaobin; Li, Kunchang; Yue, Zhengrong; Qiao, Yu; Wang, Yali

doi:10.1609/AAAI.V39I3.32280

AAAI 2025 pp. 2753-2761

Abstract

Despite recent advances in text-to-image generation, most existing methods struggle to create images containing multiple objects with complex spatial relationships in the 3D world. To address this limitation, we introduce MUSES, a generic AI system for 3D-controllable image generation from user queries. MUSES follows a progressive workflow with three key components: (1) a Layout Manager for 2D-to-3D layout lifting, (2) a Model Engineer for 3D object acquisition and calibration, and (3) an Image Artist for 3D-to-2D image rendering. By mimicking the collaboration of human professionals, this multi-modal agent pipeline enables effective, automatic creation of images with 3D-controllable objects through an explainable integration of top-down planning and bottom-up generation. Existing benchmarks also lack detailed descriptions of complex 3D spatial relationships among multiple objects. To fill this gap, we construct a new benchmark, T2I-3DisBench (3D image scene), which describes diverse 3D image scenes with 50 detailed prompts. Extensive experiments show that MUSES achieves state-of-the-art performance on both T2I-CompBench and T2I-3DisBench, outperforming recent strong competitors such as DALL-E 3 and Stable Diffusion 3. These results mark a significant step toward bridging natural language, 2D image generation, and the 3D world.
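To make the three-stage workflow concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a simple pinhole camera, normalised image coordinates, and hand-supplied depths and orientations. Every name in it (`Box2D`, `Object3D`, `lift_layout`, `calibrate`, `project`, `run_pipeline`) is hypothetical and does not come from the paper.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Box2D:
    """A 2D layout entry: box centre and size in normalised [0, 1] image coordinates."""
    name: str
    x: float
    y: float
    w: float
    h: float


@dataclass(frozen=True)
class Object3D:
    """A placed object: camera-space position (X, Y, Z) and yaw in degrees."""
    name: str
    position: Tuple[float, float, float]
    yaw_deg: float = 0.0


def lift_layout(boxes: List[Box2D], depths: Dict[str, float], focal: float = 1.0) -> List[Object3D]:
    """Stage 1 (assumed): back-project each 2D box centre to 3D using a given depth."""
    objects = []
    for b in boxes:
        z = depths[b.name]
        # Inverse pinhole projection with the principal point at the image centre.
        X = (b.x - 0.5) * z / focal
        Y = (b.y - 0.5) * z / focal
        objects.append(Object3D(b.name, (X, Y, z)))
    return objects


def calibrate(objects: List[Object3D], yaws: Dict[str, float]) -> List[Object3D]:
    """Stage 2 (assumed): attach an orientation to each object; the default yaw is 0."""
    return [replace(o, yaw_deg=yaws.get(o.name, 0.0)) for o in objects]


def project(obj: Object3D, focal: float = 1.0) -> Tuple[float, float]:
    """Stage 3 helper (assumed): pinhole projection of the object centre back to the image."""
    X, Y, Z = obj.position
    return (focal * X / Z + 0.5, focal * Y / Z + 0.5)


def run_pipeline(boxes: List[Box2D], depths: Dict[str, float], yaws: Dict[str, float]) -> List[Object3D]:
    """Chain the stages and return objects ordered far-to-near (painter's order)."""
    placed = calibrate(lift_layout(boxes, depths), yaws)
    return sorted(placed, key=lambda o: o.position[2], reverse=True)
```

In this sketch, a caller would pass a list of `Box2D` entries together with per-object depths and yaws, and receive the objects in back-to-front drawing order. The actual system described in the abstract delegates these decisions to collaborating agents rather than to fixed formulas.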

Cite

Text

Ding et al. "Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I3.32280

Markdown

[Ding et al. "Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/ding2025aaai-muses/) doi:10.1609/AAAI.V39I3.32280

BibTeX

@inproceedings{ding2025aaai-muses,
  title     = {{Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration}},
  author    = {Ding, Yanbo and Zhuang, Shaobin and Li, Kunchang and Yue, Zhengrong and Qiao, Yu and Wang, Yali},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {2753--2761},
  doi       = {10.1609/AAAI.V39I3.32280},
  url       = {https://mlanthology.org/aaai/2025/ding2025aaai-muses/}
}