Adaptive Multi-Frame Sampling for Consistent Zero-Shot Text-to-Video Editing
Abstract
Achieving convincing temporal coherence is a fundamental challenge in zero-shot text-to-video editing. To address it, this paper introduces AMAC (Adaptive Multi-frame sAmpling for Consistent zero-shot text-to-video editing), a novel method that balances temporal consistency with detail preservation. We propose a theoretical framework with a fully adaptive sampling strategy that selects which frames are processed jointly by a pre-trained text-to-image diffusion model. By reformulating the sampling strategy as a stochastic permutation over frame indices and constructing its distribution from inter-frame similarities, we promote consistent processing of related content. The method shows superior robustness to temporal variations and shot transitions, making it particularly well suited to editing long, dynamic video sequences, as validated through experiments on the DAVIS and BDD100K datasets. Examples of generated videos are available in the following anonymous repository: https://anonymous.4open.science/r/AMAC-A406.
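To make the idea of a similarity-driven stochastic permutation concrete, the following is a minimal illustrative sketch, not the AMAC algorithm itself. The abstract does not specify the form of the distribution, so everything here is an assumption: frames are represented by hypothetical feature vectors, similarity is cosine similarity, and the permutation is drawn sequentially so that the next frame is chosen with probability proportional to a softmax of its similarity to the current one. The names `cosine_similarity_matrix`, `sample_permutation`, `group_frames`, and the `temperature` parameter are introduced for illustration only.

```python
import numpy as np


def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between per-frame feature vectors.

    features: array of shape (n_frames, dim); the features are hypothetical.
    """
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    return unit @ unit.T


def sample_permutation(similarity, temperature=0.1, rng=None):
    """Draw a random ordering of frame indices (assumed scheme, not AMAC's).

    Starts from a uniformly chosen frame, then repeatedly picks the next
    frame among those not yet used, with probability given by a softmax of
    its similarity to the current frame.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = similarity.shape[0]
    remaining = list(range(n))
    current = remaining.pop(int(rng.integers(n)))
    order = [current]
    while remaining:
        logits = similarity[current, remaining] / temperature
        probs = np.exp(logits - logits.max())  # numerically stable softmax
        probs /= probs.sum()
        pick = int(rng.choice(len(remaining), p=probs))
        current = remaining.pop(pick)
        order.append(current)
    return order


def group_frames(order, group_size):
    """Split the permuted indices into consecutive groups for joint processing."""
    return [order[i:i + group_size] for i in range(0, len(order), group_size)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_features = rng.normal(size=(10, 16))  # stand-in for real frame features
    sim = cosine_similarity_matrix(frame_features)
    order = sample_permutation(sim, temperature=0.1, rng=rng)
    print(group_frames(order, group_size=4))
```

Under these assumptions, a low `temperature` makes consecutive indices in the permutation likely to be similar frames, so each group handed to the image model tends to contain related content, while a high `temperature` approaches a uniform random permutation.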
Cite
Text
Escotais et al. "Adaptive Multi-Frame Sampling for Consistent Zero-Shot Text-to-Video Editing." Transactions on Machine Learning Research, 2026.
Markdown
[Escotais et al. "Adaptive Multi-Frame Sampling for Consistent Zero-Shot Text-to-Video Editing." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/escotais2026tmlr-adaptive/)
BibTeX
@article{escotais2026tmlr-adaptive,
title = {{Adaptive Multi-Frame Sampling for Consistent Zero-Shot Text-to-Video Editing}},
author = {Escotais, Thérèse Tisseau des and Rambour, Clément and Leroy, Bertrand and Breloy, Arnaud},
journal = {Transactions on Machine Learning Research},
year = {2026},
url = {https://mlanthology.org/tmlr/2026/escotais2026tmlr-adaptive/}
}