LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models

Abstract

Spatial reasoning is a fundamental aspect of human cognition, enabling intuitive understanding and manipulation of objects in three-dimensional space. While foundation models demonstrate remarkable performance on some benchmarks, they still struggle with 3D reasoning tasks like arranging objects in space according to open-ended language instructions, particularly in dense and physically constrained environments. We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs) and supports differentiable optimization to ensure physical plausibility. LayoutVLM employs VLMs to generate two mutually reinforcing representations from visually marked images, and a self-consistent decoding process to improve VLMs' spatial planning. Our experiments show that LayoutVLM addresses the limitations of existing LLM and constraint-based approaches, producing physically plausible 3D layouts better aligned with the semantic intent of input language instructions. We also demonstrate that fine-tuning VLMs with the proposed scene layout representation extracted from existing scene datasets can improve their reasoning performance.

Cite

Text

Sun et al. "LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02744

Markdown

[Sun et al. "LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/sun2025cvpr-layoutvlm/) doi:10.1109/CVPR52734.2025.02744

BibTeX

@inproceedings{sun2025cvpr-layoutvlm,
  title     = {{LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models}},
  author    = {Sun, Fan-Yun and Liu, Weiyu and Gu, Siyi and Lim, Dylan and Bhat, Goutam and Tombari, Federico and Li, Manling and Haber, Nick and Wu, Jiajun},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {29469--29478},
  doi       = {10.1109/CVPR52734.2025.02744},
  url       = {https://mlanthology.org/cvpr/2025/sun2025cvpr-layoutvlm/}
}