Improving Autoregressive Video Modeling with History Understanding
Abstract
Video autoregressive generation (VideoAR) sequentially predicts future frames conditioned on history frames. Despite the advance of recent diffusion-based VideoAR, the role of conditioning signal—internal representations of history frames—remains underexplored. Inspired by the success of strong condition representations in text-conditioned generation, we investigate: \textit{Can better internal representations of history frames improve VideoAR performance?} Through systematic analysis, we show that history representation quality positively correlates with VideoAR, and that enhancing these representations provides gains that cannot be achieved by refining future frames representations alone. Based on these insights, we propose \textbf{MiMo} (Masked History Modeling), a novel framework that seamlessly integrates representation learning into diffusion-based VideoAR. MiMo applies masks to history frame tokens and trains the model to predict masked tokens of current and future frames alongside the diffusion objective, yielding predictive and robust history representations without relying on vision foundation models (VFMs) or heavy architectural changes. Extensive experiments demonstrate that MiMo achieves competitive performance in video prediction and generation tasks while substantially improving training efficiency. Our work underscores the importance of history representations in VideoAR.
Cite
Text
Luo et al. "Improving Autoregressive Video Modeling with History Understanding." International Conference on Learning Representations, 2026.Markdown
[Luo et al. "Improving Autoregressive Video Modeling with History Understanding." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/luo2026iclr-improving/)BibTeX
@inproceedings{luo2026iclr-improving,
title = {{Improving Autoregressive Video Modeling with History Understanding}},
author = {Luo, Wenyang and Qin, Haina and Li, Bing and Lu, Jiwen and Tao, Xin and Wan, Pengfei and Gai, Kun},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/luo2026iclr-improving/}
}