EgoSG: Learning 3D Scene Graphs from Egocentric RGB-D Sequences
Abstract
Constructing a 3D scene graph of an environment is essential for agents and smart-glasses assistants to develop an understanding of their surroundings and predict relationships between the entities within it. 3D Scene Graph Prediction (3DSGP) is commonly adopted to predict the spatial and semantic relationships between objects, such as containment or adjacency, in a 3D environment reconstructed from posed (calibrated) RGB-D sequences. However, reconstructing a scene can be time-consuming and computationally intensive, and accurate poses often require specialized hardware such as IMUs. This reliance on (1) robust reconstruction algorithms and (2) accurate camera poses limits applicability. Unlike existing 3DSGP methods, we propose EgoSG, which performs perception and reasoning on each frame, without assuming available camera poses, to estimate 3D scene graphs directly from egocentric frame sequences. In our method, per-frame instance features are acquired from the partial point cloud of each frame. Object instances are then associated across the egocentric frames by globally optimizing the per-frame features, and graph representations are aggregated for 3D scene graph prediction. In contrast to state-of-the-art methods that rely heavily on 3D reconstruction, our approach is reconstruction-free and operates directly on unposed RGB-D sequences. We benchmark our EgoSG framework against existing reconstruction-based approaches on 3DSGP tasks. Our method outperforms the state of the art by a large margin, achieving +44.63 R@1 on Object and +22.74 R@1 on Predicate prediction from egocentric sequences, without any reliance on reconstruction algorithms or camera poses.
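The abstract describes a three-step, reconstruction-free pipeline: extract instance features from each frame's partial point cloud, associate instances across frames by optimizing agreement between per-frame features, and aggregate the resulting graph representations for scene graph prediction. The snippet below is a minimal, illustrative sketch of that flow, not the authors' implementation: the function names, the random placeholder features, and the cosine-similarity Hungarian matching used for association are all assumptions made for the example.

```python
# Illustrative sketch of a per-frame, reconstruction-free scene-graph pipeline.
# All names and the Hungarian-matching association step are assumptions for
# demonstration only, not the EgoSG implementation.
import numpy as np
from scipy.optimize import linear_sum_assignment


def extract_frame_instances(rgbd_frame, num_instances=5, feat_dim=32, rng=None):
    """Stand-in for per-frame instance feature extraction from a partial point cloud."""
    rng = rng if rng is not None else np.random.default_rng(0)
    return rng.normal(size=(num_instances, feat_dim))  # one feature per detected instance


def associate_instances(track_feats, frame_feats):
    """Match existing instance tracks to the current frame's instances by
    maximizing cosine similarity (solved as a Hungarian assignment)."""
    a = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    b = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    cost = -a @ b.T  # negate similarity so it becomes a minimization problem
    return linear_sum_assignment(cost)  # (track indices, frame indices)


def aggregate_scene_graph(tracks):
    """Toy aggregation: average each track's per-frame features into a node
    embedding; the toy edge feature is the difference of the two node embeddings."""
    nodes = {tid: np.mean(feats, axis=0) for tid, feats in tracks.items()}
    edges = {(i, j): nodes[i] - nodes[j] for i in nodes for j in nodes if i < j}
    return nodes, edges


# Driver over an unposed egocentric sequence (random data as a placeholder).
rng = np.random.default_rng(42)
tracks = {}
for t in range(4):  # four frames
    frame_feats = extract_frame_instances(None, rng=rng)
    if not tracks:
        tracks = {i: [f] for i, f in enumerate(frame_feats)}
        continue
    track_ids = list(tracks)
    track_feats = np.stack([np.mean(tracks[i], axis=0) for i in track_ids])
    rows, cols = associate_instances(track_feats, frame_feats)
    for r, c in zip(rows, cols):
        tracks[track_ids[r]].append(frame_feats[c])

nodes, edges = aggregate_scene_graph(tracks)
print(len(nodes), "object nodes,", len(edges), "candidate relations")
```

In this sketch the association step is greedy per frame; a global optimization over all frames, as the abstract describes, would jointly refine the per-frame features and assignments rather than matching frame by frame.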
Cite
Text
Zhang et al. "EgoSG: Learning 3D Scene Graphs from Egocentric RGB-D Sequences." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00260
Markdown
[Zhang et al. "EgoSG: Learning 3D Scene Graphs from Egocentric RGB-D Sequences." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/zhang2024cvprw-egosg/) doi:10.1109/CVPRW63382.2024.00260
BibTeX
@inproceedings{zhang2024cvprw-egosg,
title = {{EgoSG: Learning 3D Scene Graphs from Egocentric RGB-D Sequences}},
author = {Zhang, Chaoyi and Yang, Xitong and Hou, Ji and Kitani, Kris and Cai, Weidong and Chu, Fu-Jen},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2024},
pages = {2535-2545},
doi = {10.1109/CVPRW63382.2024.00260},
url = {https://mlanthology.org/cvprw/2024/zhang2024cvprw-egosg/}
}