GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts
Abstract
The connection between our 3D surroundings and the descriptive language that characterizes them would be well-suited for localizing and generating human motion in context, but for one problem: the complexity introduced by multiple modalities makes capturing this connection challenging with a fixed set of descriptors. Specifically, closed-vocabulary scene encoders, which require learning text-scene associations from scratch, have been favored in the literature, often resulting in inaccurate motion grounding. In this paper, we propose a method that integrates an open-vocabulary scene encoder into the architecture, establishing a robust connection between text and scene. Our two-step approach starts with pretraining the scene encoder through knowledge distillation from an existing open-vocabulary semantic image segmentation model, ensuring a shared text-scene feature space. Subsequently, the scene encoder is fine-tuned for conditional motion generation, incorporating two novel regularization losses that regress the category and size of the goal object. Our methodology achieves up to a 30% reduction in the goal-object distance metric compared to the prior state-of-the-art baseline model on the HUMANISE dataset. We demonstrate this improvement through evaluations conducted with three implementations of our framework, a perceptual study, and an open-vocabulary experiment. Additionally, our method is designed to accommodate future 2D open-vocabulary segmentation methods for distillation in a plug-and-play manner.
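As a rough illustration of the two training stages outlined in the abstract, the PyTorch sketch below shows (i) a distillation loss that aligns scene-encoder features with features from a frozen open-vocabulary 2D teacher, and (ii) the two goal-object regularizers (category classification and size regression) that would accompany the motion generation objective in fine-tuning. The module names, feature dimensions, pooling strategy, and loss forms are illustrative assumptions, not the paper's actual implementation.

# Minimal sketch of the two training stages described in the abstract.
# All module names, dimensions, and loss choices are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SceneEncoder(nn.Module):
    """Hypothetical point-cloud scene encoder producing per-point features."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) -> per-point features (B, N, feat_dim)
        return self.mlp(points)


def distillation_loss(scene_feats, teacher_feats):
    """Stage 1: align scene features with features from a frozen open-vocabulary
    2D segmentation teacher (projected onto the 3D points), so the scene encoder
    inherits the teacher's text-aligned feature space."""
    return 1.0 - F.cosine_similarity(scene_feats, teacher_feats, dim=-1).mean()


def goal_object_regularizers(pooled_feat, category_head, size_head,
                             gt_category, gt_size):
    """Stage 2 regularizers: predict the goal object's category and size from the
    pooled conditioning features used for motion generation."""
    cat_loss = F.cross_entropy(category_head(pooled_feat), gt_category)
    size_loss = F.l1_loss(size_head(pooled_feat), gt_size)
    return cat_loss, size_loss


if __name__ == "__main__":
    # Example usage with random tensors; all shapes are assumptions.
    B, N, D, C = 2, 1024, 512, 40
    encoder = SceneEncoder(D)
    category_head = nn.Linear(D, C)   # goal-object category logits
    size_head = nn.Linear(D, 3)       # goal-object bounding-box extents

    points = torch.randn(B, N, 3)
    teacher_feats = torch.randn(B, N, D)       # stand-in for frozen 2D teacher features
    gt_category = torch.randint(0, C, (B,))
    gt_size = torch.rand(B, 3)

    feats = encoder(points)
    l_distill = distillation_loss(feats, teacher_feats)              # stage 1
    pooled = feats.mean(dim=1)
    l_cat, l_size = goal_object_regularizers(pooled, category_head,
                                             size_head, gt_category, gt_size)
    # The full stage-2 objective would also include the conditional motion
    # generation loss (not sketched here), plus weights on the regularizers.
    print(float(l_distill), float(l_cat + l_size))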
Cite
Text
Milacski et al. "GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts." Winter Conference on Applications of Computer Vision, 2025.
Markdown
[Milacski et al. "GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/milacski2025wacv-ghost/)
BibTeX
@inproceedings{milacski2025wacv-ghost,
title = {{GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts}},
author = {Milacski, Zoltán Á. and Niinuma, Koichiro and Kawamura, Ryosuke and de la Torre, Fernando and Jeni, László A.},
booktitle = {Winter Conference on Applications of Computer Vision},
year = {2025},
pages = {4108--4118},
url = {https://mlanthology.org/wacv/2025/milacski2025wacv-ghost/}
}