SceneScript: Reconstructing Scenes with an Autoregressive Structured Language Model

Avetisyan, Armen; Xie, Christopher; Howard-Jenkins, Henry; Yang, Tsun-Yi; Aroudj, Samir; Patra, Suvam; Zhang, Fuyang; Holland, Luke; Frost, Duncan; Orme, Campbell; Engel, Jakob; Miller, Edward; Newcombe, Richard; Balntas, Vasileios

doi:10.1007/978-3-031-73030-6_14

ECCV 2024

Abstract

We introduce SceneScript, a method that directly produces full scene models as a sequence of structured language commands using an autoregressive, token-based approach. Our proposed scene representation is inspired by recent successes in transformers and LLMs, and departs from more traditional methods, which commonly describe scenes as meshes, voxel grids, point clouds, or radiance fields. Our method infers the set of structured language commands directly from encoded visual data using a scene language encoder-decoder architecture. To train SceneScript, we generate and release a large-scale synthetic dataset called Aria Synthetic Environments, consisting of 100k high-quality indoor scenes with photorealistic, ground-truth-annotated renders of egocentric scene walkthroughs. Our method gives state-of-the-art results in architectural layout estimation and competitive results in 3D object detection. Lastly, we explore an advantage of SceneScript: the ability to readily adapt to new commands via simple additions to the structured language, which we illustrate for tasks such as coarse 3D object part reconstruction.

† Work done while the author was an intern at Meta.

Cite

Text

Avetisyan et al. "SceneScript: Reconstructing Scenes with an Autoregressive Structured Language Model." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73030-6_14

Markdown

[Avetisyan et al. "SceneScript: Reconstructing Scenes with an Autoregressive Structured Language Model." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/avetisyan2024eccv-scenescript/) doi:10.1007/978-3-031-73030-6_14

BibTeX

@inproceedings{avetisyan2024eccv-scenescript,
  title     = {{SceneScript: Reconstructing Scenes with an Autoregressive Structured Language Model}},
  author    = {Avetisyan, Armen and Xie, Christopher and Howard-Jenkins, Henry and Yang, Tsun-Yi and Aroudj, Samir and Patra, Suvam and Zhang, Fuyang and Holland, Luke and Frost, Duncan and Orme, Campbell and Engel, Jakob and Miller, Edward and Newcombe, Richard and Balntas, Vasileios},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73030-6_14},
  url       = {https://mlanthology.org/eccv/2024/avetisyan2024eccv-scenescript/}
}