FreeViS: Training-Free Video Stylization with Inconsistent References
Abstract
Video stylization plays a key role in content creation, but it remains a challenging problem. Naïvely applying image stylization frame by frame hurts temporal consistency and reduces style richness. Alternatively, training a dedicated video stylization model typically requires paired video data and is computationally expensive. In this paper, we propose FreeViS, a training-free video stylization framework that generates stylized videos with rich style details and strong temporal coherence. Our method integrates multiple stylized references into a pretrained image-to-video (I2V) model, effectively mitigating the propagation errors observed in prior work without introducing flicker or stutter. In addition, it leverages high-frequency compensation to constrain the content layout and motion, together with flow-based motion cues to preserve style textures in low-saliency regions. In extensive evaluations, FreeViS delivers higher stylization fidelity and better temporal consistency than recent baselines and is strongly preferred by human evaluators. Our training-free pipeline offers a practical and economical solution for high-quality, temporally coherent video stylization.
Cite
Text
Xu et al. "FreeViS: Training-Free Video Stylization with Inconsistent References." International Conference on Learning Representations, 2026.
Markdown
[Xu et al. "FreeViS: Training-Free Video Stylization with Inconsistent References." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/xu2026iclr-freevis/)
BibTeX
@inproceedings{xu2026iclr-freevis,
title = {{FreeViS: Training-Free Video Stylization with Inconsistent References}},
author = {Xu, Jiacong and Mei, Yiqun and Zhang, Ke and Patel, Vishal M.},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/xu2026iclr-freevis/}
}