GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts
Abstract
The connection between our 3D surroundings and the descriptive language that characterizes them would be well-suited for localizing and generating human motion in context, but for one problem: the complexity introduced by multiple modalities makes capturing this connection challenging with a fixed set of descriptors. Specifically, closed-vocabulary scene encoders, which require learning text-scene associations from scratch, have been favored in the literature, often resulting in inaccurate motion grounding. In this paper, we propose a method that integrates an open-vocabulary scene encoder into the architecture, establishing a robust connection between text and scene. Our two-step approach starts with pretraining the scene encoder through knowledge distillation from an existing open-vocabulary semantic image segmentation model, ensuring a shared text-scene feature space. Subsequently, the scene encoder is fine-tuned for conditional motion generation, incorporating two novel regularization losses that regress the category and size of the goal object. Our methodology achieves up to a 30% reduction in the goal-object distance metric compared to the prior state-of-the-art baseline model on the HUMANISE dataset. We demonstrate this improvement through evaluations conducted with three implementations of our framework, a perceptual study, and an open-vocabulary experiment. Additionally, our method is designed to accommodate future 2D open-vocabulary segmentation methods for distillation in a plug-and-play manner.
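As a rough illustration of the two training stages outlined in the abstract, the PyTorch sketch below shows (i) a distillation loss that aligns scene-encoder features with features from a frozen open-vocabulary 2D teacher, and (ii) the two goal-object regularizers (category classification and size regression) that would accompany the motion generation objective in fine-tuning. The module names, feature dimensions, pooling strategy, and loss forms are illustrative assumptions, not the paper's actual implementation.

# Minimal sketch of the two training stages described in the abstract.
# All module names, dimensions, and loss choices are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SceneEncoder(nn.Module):
    """Hypothetical point-cloud scene encoder producing per-point features."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) -> per-point features (B, N, feat_dim)
        return self.mlp(points)


def distillation_loss(scene_feats, teacher_feats):
    """Stage 1: align scene features with features from a frozen open-vocabulary
    2D segmentation teacher (projected onto the 3D points), so the scene encoder
    inherits the teacher's text-aligned feature space."""
    return 1.0 - F.cosine_similarity(scene_feats, teacher_feats, dim=-1).mean()


def goal_object_regularizers(pooled_feat, category_head, size_head,
                             gt_category, gt_size):
    """Stage 2 regularizers: predict the goal object's category and size from the
    pooled conditioning features used for motion generation."""
    cat_loss = F.cross_entropy(category_head(pooled_feat), gt_category)
    size_loss = F.l1_loss(size_head(pooled_feat), gt_size)
    return cat_loss, size_loss


if __name__ == "__main__":
    # Example usage with random tensors; all shapes are assumptions.
    B, N, D, C = 2, 1024, 512, 40
    encoder = SceneEncoder(D)
    category_head = nn.Linear(D, C)   # goal-object category logits
    size_head = nn.Linear(D, 3)       # goal-object bounding-box extents

    points = torch.randn(B, N, 3)
    teacher_feats = torch.randn(B, N, D)       # stand-in for frozen 2D teacher features
    gt_category = torch.randint(0, C, (B,))
    gt_size = torch.rand(B, 3)

    feats = encoder(points)
    l_distill = distillation_loss(feats, teacher_feats)              # stage 1
    pooled = feats.mean(dim=1)
    l_cat, l_size = goal_object_regularizers(pooled, category_head,
                                             size_head, gt_category, gt_size)
    # The full stage-2 objective would also include the conditional motion
    # generation loss (not sketched here), plus weights on the regularizers.
    print(float(l_distill), float(l_cat + l_size))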
Cite
Text
Milacski et al. "GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts." Winter Conference on Applications of Computer Vision, 2025.
Markdown
[Milacski et al. "GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/milacski2025wacv-ghost/)
BibTeX
@inproceedings{milacski2025wacv-ghost,
title = {{GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts}},
author = {Milacski, Zoltán Á. and Niinuma, Koichiro and Kawamura, Ryosuke and de la Torre, Fernando and Jeni, László A.},
booktitle = {Winter Conference on Applications of Computer Vision},
year = {2025},
pages = {4108--4118},
url = {https://mlanthology.org/wacv/2025/milacski2025wacv-ghost/}
}