ESCA: Contextualizing Embodied Agents via Scene-Graph Generation

Abstract

Multi-modal large language models (MLLMs) are making rapid progress toward general-purpose embodied agents. However, existing MLLMs do not reliably capture fine-grained links between low-level visual features and high-level textual semantics, leading to weak grounding and inaccurate perception. To overcome this challenge, we propose ESCA, a framework that contextualizes embodied agents by grounding their perception in spatial-temporal scene graphs. At its core is SGCLIP, a novel, open-domain, promptable foundation model for generating scene graphs that is based on CLIP. SGCLIP is trained on 87K+ open-domain videos using a neurosymbolic pipeline that aligns automatically generated captions with scene graphs produced by the model itself, eliminating the need for human-labeled annotations. We demonstrate that SGCLIP excels in both prompt-based inference and task-specific fine-tuning, achieving state-of-the-art results on scene graph generation and action localization benchmarks. ESCA with SGCLIP improves perception for embodied agents based on both open-source and commercial MLLMs, achieving state-of-the-art performance across two embodied environments. Notably, ESCA significantly reduces agent perception errors and enables open-source models to surpass proprietary baselines. We release the source code for SGCLIP model training at https://github.com/video-fm/LASER and for the embodied agent at https://github.com/video-fm/ESCA.

Cite

Text

Huang et al. "ESCA: Contextualizing Embodied Agents via Scene-Graph Generation." Advances in Neural Information Processing Systems, 2025.

Markdown

[Huang et al. "ESCA: Contextualizing Embodied Agents via Scene-Graph Generation." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/huang2025neurips-esca/)

BibTeX

@inproceedings{huang2025neurips-esca,
  title     = {{ESCA: Contextualizing Embodied Agents via Scene-Graph Generation}},
  author    = {Huang, Jiani and Sethi, Amish and Kuo, Matthew and Keoliya, Mayank and Velingker, Neelay and Jung, JungHo and Lim, Ser-Nam and Li, Ziyang and Naik, Mayur},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/huang2025neurips-esca/}
}