Text2Interact: High-Fidelity and Diverse Text-to-Two-Person Interaction Generation

Abstract

Modeling human–human interactions from text remains challenging because it requires not only realistic individual dynamics but also precise, text-consistent spatiotemporal coupling between agents. Progress is currently hindered by (1) limited two-person training data, which is inadequate to capture the diverse intricacies of two-person interactions; and (2) insufficiently fine-grained text-to-interaction modeling, where language conditioning collapses rich, structured prompts into a single sentence embedding. To address these limitations, we propose Text2Interact, a framework that generates realistic, text-aligned human–human interactions through a scalable, high-fidelity interaction data synthesizer and an effective spatiotemporal coordination pipeline. First, we present InterCompose, a scalable synthesis-by-composition pipeline that aligns LLM-generated interaction descriptions with strong single-person motion priors. Given a prompt and a motion for one agent, InterCompose retrieves candidate single-person motions, trains a conditional reaction generator for the other agent, and uses a neural motion evaluator to filter weak or misaligned samples, expanding interaction coverage without extra capture. Second, we propose InterActor, a text-to-interaction model with word-level conditioning that preserves token-level cues (initiation, response, contact ordering) and an adaptive interaction loss that emphasizes contextually relevant inter-person joint pairs, improving coupling and physical plausibility for fine-grained interaction modeling. Extensive experiments show consistent gains in motion diversity, fidelity, and generalization, including in out-of-distribution scenarios and in user studies. Code will be released at github.com/Qingxuan-Wu/Text2Interact.
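
The abstract does not give the form of the adaptive interaction loss. As a purely illustrative sketch, and not the authors' formulation, one way to "emphasize contextually relevant inter-person joint pairs" is to weight each pair's distance error by how close that pair is in the reference motion. The function names, the softmax weighting, and the `temperature` parameter below are all assumptions introduced for this example.

```python
import math


def pairwise_distances(joints_a, joints_b):
    """Euclidean distance between every joint of person A and every joint of person B."""
    return [math.dist(a, b) for a in joints_a for b in joints_b]


def adaptive_interaction_loss(pred_a, pred_b, gt_a, gt_b, temperature=0.5):
    """Hypothetical distance-weighted loss over inter-person joint pairs (single frame).

    Each argument is a list of (x, y, z) joint positions. Pairs that are close in
    the reference pose receive larger weights, so errors near contact dominate.
    This weighting scheme is an assumption, not taken from the paper.
    """
    d_pred = pairwise_distances(pred_a, pred_b)
    d_gt = pairwise_distances(gt_a, gt_b)

    # Softmax over negative reference distances; subtracting the minimum
    # keeps the exponentials numerically stable.
    d_min = min(d_gt)
    exps = [math.exp(-(d - d_min) / temperature) for d in d_gt]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted L1 error between predicted and reference pair distances.
    return sum(w * abs(p - g) for w, p, g in zip(weights, d_pred, d_gt))
```

With this weighting, an error on a joint pair that is nearly in contact contributes more than the same error on a distant pair; lowering `temperature` sharpens that emphasis.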

Cite

Text

Wu et al. "Text2Interact: High-Fidelity and Diverse Text-to-Two-Person Interaction Generation." International Conference on Learning Representations, 2026.

Markdown

[Wu et al. "Text2Interact: High-Fidelity and Diverse Text-to-Two-Person Interaction Generation." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wu2026iclr-text2interact/)

BibTeX

@inproceedings{wu2026iclr-text2interact,
  title     = {{Text2Interact: High-Fidelity and Diverse Text-to-Two-Person Interaction Generation}},
  author    = {Wu, Qingxuan and Dou, Zhiyang and Guo, Chuan and Huang, Yiming and Feng, Qiao and Zhou, Bing and Wang, Jian and Liu, Lingjie},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/wu2026iclr-text2interact/}
}