SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-Object Interaction Scenarios

Abstract

Hand-Object Interaction (HOI) generation has significant application potential. However, current 3D HOI motion generation approaches rely heavily on predefined 3D object models and lab-captured motion data, which limits their generalization. Meanwhile, HOI video generation methods prioritize pixel-level visual fidelity, often sacrificing physical plausibility. Recognizing that visual appearance and motion patterns are governed by the same fundamental physical laws in the real world, we propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate the HOI video and motion simultaneously. To integrate the heterogeneous semantics, appearance, and motion features, our method implements tri-modal adaptive modulation for feature alignment, coupled with 3D full-attention for modeling inter- and intra-modal dependencies. Furthermore, we introduce a vision-aware 3D interaction diffusion model that generates explicit 3D interaction sequences directly from the synchronized diffusion outputs and feeds them back to form a closed feedback loop. This architecture eliminates dependencies on predefined object models or explicit pose guidance while significantly enhancing video-motion consistency. Experimental results demonstrate our method's superiority over state-of-the-art approaches in generating high-fidelity, dynamically plausible HOI sequences, with notable generalization capabilities in unseen real-world scenarios. Project page: [https://droliven.github.io/SViMo_project](https://droliven.github.io/SViMo_project).

Cite

Text

Dang et al. "SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-Object Interaction Scenarios." Advances in Neural Information Processing Systems, 2025.

Markdown

[Dang et al. "SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-Object Interaction Scenarios." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/dang2025neurips-svimo/)

BibTeX

@inproceedings{dang2025neurips-svimo,
  title     = {{SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-Object Interaction Scenarios}},
  author    = {Dang, Lingwei and Shao, Ruizhi and Zhang, Hongwen and Min, Wei and Liu, Yebin and Wu, Qingyao},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/dang2025neurips-svimo/}
}