PanoContext-Former: Panoramic Total Scene Understanding with a Transformer

Abstract

Panoramic images enable a deeper understanding and more holistic perception of the 360° surrounding environment, and they naturally encode richer scene context than standard perspective images. Previous work has largely addressed scene understanding with hybrid solutions based on 2D-3D geometric reasoning, so each sub-task is processed separately and few correlations among them are explored. In this paper, we propose a fully 3D method for holistic indoor scene understanding that simultaneously recovers the objects' shapes, oriented bounding boxes, and the 3D room layout from a single panorama. To make full use of the rich context information, we design a transformer-based context module that predicts the representation of, and relationships among, the components of the scene. In addition, we introduce a new dataset for scene understanding that includes photo-realistic panoramas, high-fidelity depth images, accurately annotated room layouts, and oriented object bounding boxes and shapes. Experiments on the synthetic and new datasets demonstrate that our method outperforms previous panoramic scene understanding methods in both layout estimation and 3D object detection.

Cite

Text

Dong et al. "PanoContext-Former: Panoramic Total Scene Understanding with a Transformer." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02653

Markdown

[Dong et al. "PanoContext-Former: Panoramic Total Scene Understanding with a Transformer." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/dong2024cvpr-panocontextformer/) doi:10.1109/CVPR52733.2024.02653

BibTeX

@inproceedings{dong2024cvpr-panocontextformer,
  title     = {{PanoContext-Former: Panoramic Total Scene Understanding with a Transformer}},
  author    = {Dong, Yuan and Fang, Chuan and Bo, Liefeng and Dong, Zilong and Tan, Ping},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {28087-28097},
  doi       = {10.1109/CVPR52733.2024.02653},
  url       = {https://mlanthology.org/cvpr/2024/dong2024cvpr-panocontextformer/}
}