GECO: GPT-Driven Estimation of 3D Human-Scene Contact in the Wild

Abstract

Understanding human-scene contact remains a challenging task, as it requires detectors to simultaneously model the contacting body parts, their proximity to scene objects, and the overall scene context. In this work, we introduce GECO, a framework employing Large Language Models (LLMs) with the key insight that language offers a powerful prior to intuitively reason about 3D human-object and human-scene contact based on extensive multimodal world knowledge. By converting a body-vertex formulation to natural language descriptors, we enable zero-shot generation of vertex-level contact directly on the SMPL body. We show that GPT offers a surprisingly competitive baseline close to state-of-the-art detectors on the DAMON dataset. We apply and evaluate different emerging prompting paradigms, highlighting their potential and limitations for LLM-based human-scene contact estimation.
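The abstract's core idea, querying an LLM for body parts in contact (as natural-language descriptors) and mapping each named part back to SMPL vertex indices to obtain vertex-level contact, can be sketched as follows. This is a minimal illustration, not the paper's method: the part names and vertex index ranges below are hypothetical placeholders, and the LLM answer is mocked rather than produced by an actual GPT call.

```python
# Sketch: turn part-level contact descriptors (as an LLM might return them)
# into a per-vertex 0/1 labeling on the SMPL mesh (6890 vertices).
# NOTE: the part-to-vertex mapping here is an illustrative placeholder,
# not the actual SMPL part segmentation used in the paper.

PART_TO_VERTICES = {
    "left_hand": range(2000, 2040),
    "right_hand": range(5400, 5440),
    "buttocks": range(3100, 3160),
}

def parts_to_vertex_labels(contact_parts, part_map, num_vertices=6890):
    """Convert named contact parts into a binary per-vertex contact vector."""
    labels = [0] * num_vertices
    for part in contact_parts:
        for v in part_map.get(part, ()):  # unknown part names are ignored
            labels[v] = 1
    return labels

# Mocked LLM answer, e.g. for an image of a person seated with hands on a desk:
llm_answer = ["left_hand", "right_hand", "buttocks"]
labels = parts_to_vertex_labels(llm_answer, PART_TO_VERTICES)
```

In practice the LLM answer would come from a vision-capable GPT prompt over the image, and the mapping would use a real SMPL part segmentation; the point is only that language-level part descriptors translate directly into vertex-level supervision.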

Cite

Text

Lee et al. "GECO: GPT-Driven Estimation of 3D Human-Scene Contact in the Wild." European Conference on Computer Vision Workshops, 2024. doi:10.1007/978-3-031-92591-7_29

Markdown

[Lee et al. "GECO: GPT-Driven Estimation of 3D Human-Scene Contact in the Wild." European Conference on Computer Vision Workshops, 2024.](https://mlanthology.org/eccvw/2024/lee2024eccvw-geco/) doi:10.1007/978-3-031-92591-7_29

BibTeX

@inproceedings{lee2024eccvw-geco,
  title     = {{GECO: GPT-Driven Estimation of 3D Human-Scene Contact in the Wild}},
  author    = {Lee, Chaehong and Singh, Simranjit and Fore, Michael and Pavlakos, Georgios and Stamoulis, Dimitrios},
  booktitle = {European Conference on Computer Vision Workshops},
  year      = {2024},
  pages     = {436--450},
  doi       = {10.1007/978-3-031-92591-7_29},
  url       = {https://mlanthology.org/eccvw/2024/lee2024eccvw-geco/}
}