Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control

Abstract

Recent advances in video diffusion models show promise for generating robotic decision-making data, with trajectory conditions further enabling fine-grained control. However, existing methods primarily focus on individual object motion and struggle to capture the multi-object interactions crucial to complex manipulation. This limitation arises from entangled features in overlapping regions, leading to degraded visual fidelity. To address this, we present RoboMaster, a novel framework that models inter-object dynamics via a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages (pre-interaction, interaction, and post-interaction) and to model each stage using its dominant object: the robotic arm in the pre- and post-interaction stages, and the manipulated object during interaction. This design effectively alleviates the multi-object feature fusion issue in prior work. To further ensure subject semantic consistency across the video, we incorporate appearance- and shape-aware latent representations for objects. Extensive experiments on the challenging Bridge dataset, as well as the RLBench and SIMPLER benchmarks, demonstrate that our method establishes new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation. Project Page: https://fuxiao0719.github.io/projects/robomaster/
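
The three-stage decomposition described in the abstract can be illustrated with a minimal sketch. This is not the RoboMaster implementation: the function name `collaborative_trajectory`, the 2D point format, and the `start`/`end` frame indices marking the interaction window are all hypothetical, chosen only to show how a single per-frame trajectory might follow the arm before and after interaction and the manipulated object during it.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def collaborative_trajectory(
    arm_traj: List[Point],
    obj_traj: List[Point],
    start: int,
    end: int,
) -> List[Tuple[str, Point]]:
    """Merge two per-frame trajectories into one, stage by stage.

    Hypothetical helper: frames in [start, end) are treated as the
    interaction stage and follow the manipulated object; all other
    frames follow the robotic arm.
    """
    if len(arm_traj) != len(obj_traj):
        raise ValueError("trajectories must cover the same number of frames")
    if not 0 <= start <= end <= len(arm_traj):
        raise ValueError("interaction window must lie inside the video")

    merged = []
    for t, (arm_pt, obj_pt) in enumerate(zip(arm_traj, obj_traj)):
        if t < start:
            # Before contact: the arm is the dominant object.
            merged.append(("pre-interaction", arm_pt))
        elif t < end:
            # During contact: the manipulated object is dominant.
            merged.append(("interaction", obj_pt))
        else:
            # After release: the arm is dominant again.
            merged.append(("post-interaction", arm_pt))
    return merged


# Toy example with made-up coordinates over five frames.
arm = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
obj = [(2.0, 2.0), (2.0, 2.0), (2.0, 3.0), (3.0, 4.0), (3.0, 4.0)]
result = collaborative_trajectory(arm, obj, start=2, end=4)
```

In this sketch each frame carries exactly one control point, so the trajectories of the arm and the object never need to be fused in overlapping regions. How the actual method encodes trajectories and detects stage boundaries is not specified in the abstract.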

Cite

Text

Fu et al. "Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control." International Conference on Learning Representations, 2026.

Markdown

[Fu et al. "Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/fu2026iclr-learning/)

BibTeX

@inproceedings{fu2026iclr-learning,
  title     = {{Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control}},
  author    = {Fu, Xiao and Wang, Xintao and Liu, Xian and Bai, Jianhong and Xu, Runsen and Wan, Pengfei and Zhang, Di and Lin, Dahua},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/fu2026iclr-learning/}
}