Motion-Aligned Word Embeddings for Text-to-Motion Generation

Abstract

Existing text-to-motion (T2M) generation models typically rely on pretrained large language models to encode textual inputs. However, these models, trained on generic text corpora, lack explicit alignment between motion-related words (e.g., "clockwise", "quickly") and human skeletal movements. This misalignment, fundamentally rooted in the word embedding layers, severely limits the ability of T2M models to understand and generalize fine-grained motion semantics. To tackle this issue, we propose Motion-Aligned Text Encoding (MATE), a novel framework that explicitly incorporates motion semantics into the word embedding layers of large language models to enhance text-motion alignment for motion generation. To address the challenge of inherent semantic entanglement in motion sequences, MATE introduces two key components: 1) a motion localization strategy that establishes localized correspondences between sub-texts and motion segments, enabling soft attention guidance for semantic localization; and 2) a motion disentanglement module that isolates word-specific motion semantics via contrastive kinematic prototypes, ensuring word-level alignment between linguistic and kinematic representations. Remarkably, language models enhanced with MATE can be seamlessly integrated into existing T2M methods, significantly surpassing state-of-the-art performance on two standard benchmarks with minimal modifications.

Cite

Text

Han et al. "Motion-Aligned Word Embeddings for Text-to-Motion Generation." International Conference on Learning Representations, 2026.

Markdown

[Han et al. "Motion-Aligned Word Embeddings for Text-to-Motion Generation." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/han2026iclr-motionaligned/)

BibTeX

@inproceedings{han2026iclr-motionaligned,
  title     = {{Motion-Aligned Word Embeddings for Text-to-Motion Generation}},
  author    = {Han, Ke and Lyu, Yueming and Sebe, Nicu},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/han2026iclr-motionaligned/}
}