Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene

Abstract

The recently introduced 4D Panoptic Scene Graph (4D-PSG) provides the most advanced representation to date for comprehensively modeling the dynamic 4D visual world. Unfortunately, current pioneering 4D-PSG research suffers from severe data scarcity and the resulting out-of-vocabulary problems; in addition, the pipeline nature of the benchmark generation method can lead to suboptimal performance. To address these challenges, this paper investigates a novel framework for 4D-PSG generation that leverages rich 2D visual scene annotations to enhance 4D scene learning. First, we introduce a 4D Large Language Model (4D-LLM) integrated with a 3D mask decoder for end-to-end generation of 4D-PSG. A chained SG inference mechanism is further designed to exploit LLMs' open-vocabulary capabilities to iteratively infer accurate and comprehensive object and relation labels. Most importantly, we propose a 2D-to-4D visual scene transfer learning framework, in which a spatial-temporal scene transcending strategy transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes, compensating for data scarcity in 4D-PSG. Extensive experiments on the benchmark data demonstrate that our method outperforms baseline models by a large margin, highlighting its effectiveness.

Cite

Text

Wu et al. "Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02285

Markdown

[Wu et al. "Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/wu2025cvpr-learning-b/) doi:10.1109/CVPR52734.2025.02285

BibTeX

@inproceedings{wu2025cvpr-learning-b,
  title     = {{Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene}},
  author    = {Wu, Shengqiong and Fei, Hao and Yang, Jingkang and Li, Xiangtai and Li, Juncheng and Zhang, Hanwang and Chua, Tat-Seng},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {24539--24549},
  doi       = {10.1109/CVPR52734.2025.02285},
  url       = {https://mlanthology.org/cvpr/2025/wu2025cvpr-learning-b/}
}