Event-T2M: Event-Level Conditioning for Complex Text-to-Motion Synthesis

Abstract

Text-to-motion generation has advanced with diffusion models, yet existing systems often collapse complex multi-action prompts into a single embedding, leading to omissions, reordering, or unnatural transitions. In this work, we shift perspective by introducing a principled definition of an event as the smallest semantically self-contained action or state change in a text prompt that can be temporally aligned with a motion segment. Building on this definition, we propose Event-T2M, a diffusion-based framework that decomposes prompts into events, encodes each with a motion-aware retrieval model, and integrates them through event-based cross-attention in Conformer blocks. Existing benchmarks mix simple and multi-event prompts, making it unclear whether models that succeed on single actions generalize to multi-action cases. To address this, we construct HumanML3D-E, the first benchmark stratified by event count. Experiments on HumanML3D, KIT-ML, and HumanML3D-E show that Event-T2M matches state-of-the-art baselines on standard tests while outperforming them as event complexity increases. Human studies validate the plausibility of our event definition, the reliability of HumanML3D-E, and the superiority of Event-T2M in generating multi-event motions that preserve order and naturalness close to ground truth. These results establish event-level conditioning as a generalizable principle for advancing text-to-motion generation beyond single-action prompts. Code and data are available at https://tjswodud.github.io/EventT2M.
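As a rough illustration of the idea of conditioning on per-event tokens rather than a single prompt embedding, the sketch below splits a prompt into events and lets each motion frame attend over the event tokens. It is a toy under stated assumptions, not the Event-T2M implementation: `split_into_events`, `toy_embed`, and `event_cross_attention` are hypothetical names, the rule-based splitter and the character-hash embedding are stand-ins for the paper's event decomposition and motion-aware retrieval encoder, and the attention omits learned projections, multiple heads, and the surrounding Conformer block.

```python
import math
import re


def split_into_events(prompt):
    """Hypothetical splitter: break a prompt on commas, 'then', and 'and'.

    A stand-in for event decomposition; the paper's actual procedure is not shown here.
    """
    parts = re.split(r",|\bthen\b|\band\b", prompt.lower())
    return [p.strip(" .") for p in parts if p.strip(" .")]


def toy_embed(text, dim=8):
    """Hypothetical deterministic embedding (NOT a motion-aware retrieval model)."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += (ord(ch) % 31) / 31.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def event_cross_attention(frame_queries, event_tokens):
    """Scaled dot-product cross-attention: frames are queries, events are keys and values."""
    dim = len(event_tokens[0])
    outputs, weights = [], []
    for q in frame_queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in event_tokens
        ]
        w = softmax(scores)
        out = [
            sum(w[j] * event_tokens[j][d] for j in range(len(w))) for d in range(dim)
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


prompt = "A person walks forward, then sits down and waves."
events = split_into_events(prompt)
event_tokens = [toy_embed(e) for e in events]
# Placeholder frame features; in a diffusion model these would be noisy motion latents.
frame_queries = [toy_embed(f"frame {t}") for t in range(4)]
attended, attn = event_cross_attention(frame_queries, event_tokens)
print(events)
```

Each row of `attn` is a distribution over the events for one frame, which is the property that lets a model weight different events at different points in the sequence instead of reading one pooled prompt vector.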

Cite

Text

Hong et al. "Event-T2M: Event-Level Conditioning for Complex Text-to-Motion Synthesis." International Conference on Learning Representations, 2026.

Markdown

[Hong et al. "Event-T2M: Event-Level Conditioning for Complex Text-to-Motion Synthesis." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/hong2026iclr-eventt2m/)

BibTeX

@inproceedings{hong2026iclr-eventt2m,
  title     = {{Event-T2M: Event-Level Conditioning for Complex Text-to-Motion Synthesis}},
  author    = {Hong, Seong-Eun and Seon, Jaeyoung and Hwang, JuYeong and Shin, JongHwan and Kang, HyeongYeop},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/hong2026iclr-eventt2m/}
}