Real-Time Motion-Controllable Autoregressive Video Diffusion

Abstract

Real-time motion-controllable video generation remains challenging because bidirectional diffusion models carry inherent latency and effective autoregressive (AR) approaches are lacking. Existing AR video diffusion models are limited to simple control signals or text-to-video generation, and often suffer from quality degradation and motion artifacts in few-step generation. To address these challenges, we propose AR-Drag, the first few-step AR video diffusion model enhanced with reinforcement learning (RL) for real-time image-to-video (I2V) generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via RL with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout mechanism and accelerates training by selectively introducing stochasticity in denoising steps. Extensive experiments demonstrate that AR-Drag achieves high visual fidelity and precise motion alignment while significantly reducing latency relative to state-of-the-art motion-controllable video diffusion models (VDMs), using only 1.3B parameters.
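To make the idea of a trajectory-based reward concrete, the following is a minimal sketch, not the reward model used in the paper (the abstract does not specify it). It assumes that the target drag trajectories and the point tracks recovered from a generated clip are both available as per-frame lists of (x, y) coordinates, and it scores their agreement with a hand-written formula. The function name `trajectory_reward` and the `scale` parameter are hypothetical.

```python
import math


def trajectory_reward(target, tracked, scale=10.0):
    """Illustrative alignment score in (0, 1]; higher means closer tracks.

    target, tracked: lists of frames, each frame a list of (x, y) points.
    scale: hypothetical tolerance in pixels controlling how fast the score decays.
    This is an assumed stand-in, not the paper's learned reward model.
    """
    if len(target) != len(tracked):
        raise ValueError("trajectories must cover the same number of frames")

    total_error = 0.0
    num_points = 0
    for target_frame, tracked_frame in zip(target, tracked):
        if len(target_frame) != len(tracked_frame):
            raise ValueError("each frame must contain the same number of points")
        for (xt, yt), (xp, yp) in zip(target_frame, tracked_frame):
            # Euclidean distance between the intended and the observed position.
            total_error += math.hypot(xt - xp, yt - yp)
            num_points += 1

    if num_points == 0:
        raise ValueError("trajectories must contain at least one point")

    mean_error = total_error / num_points
    # Map the mean pixel error to a bounded score; zero error gives 1.0.
    return math.exp(-mean_error / scale)


# Two frames, one control point moving right by 5 pixels per frame.
target = [[(0.0, 0.0)], [(5.0, 0.0)]]
perfect = [[(0.0, 0.0)], [(5.0, 0.0)]]
shifted = [[(3.0, 4.0)], [(8.0, 4.0)]]  # every point is off by 5 pixels

perfect_score = trajectory_reward(target, perfect)
shifted_score = trajectory_reward(target, shifted)
```

In an RL fine-tuning loop, a scalar of this kind would be computed per generated clip and used as the reward signal; a bounded, monotonically decreasing mapping is one simple way to keep rewards comparable across clips.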

Cite

Text

Zhao et al. "Real-Time Motion-Controllable Autoregressive Video Diffusion." International Conference on Learning Representations, 2026.

Markdown

[Zhao et al. "Real-Time Motion-Controllable Autoregressive Video Diffusion." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhao2026iclr-realtime/)

BibTeX

@inproceedings{zhao2026iclr-realtime,
  title     = {{Real-Time Motion-Controllable Autoregressive Video Diffusion}},
  author    = {Zhao, Kesen and Shi, Jiaxin and Zhu, Beier and Zhou, Junbao and Shen, Xiaolong and Zhou, Yuan and Sun, Qianru and Zhang, Hanwang},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/zhao2026iclr-realtime/}
}