Hierarchical Patch Diffusion Models for High-Resolution Video Generation
Abstract
Diffusion models have demonstrated remarkable performance in image and video synthesis. However, scaling them to high-resolution inputs is challenging and requires restructuring the diffusion pipeline into multiple independent components, limiting scalability and complicating downstream applications. In this work, we study patch diffusion models (PDMs) -- a diffusion paradigm which models the distribution of patches rather than whole inputs, keeping up to 0.7% of the original pixels. This makes it very efficient during training and unlocks end-to-end optimization on high-resolution videos. We improve PDMs in two principled ways. First, to enforce consistency between patches, we develop deep context fusion -- an architectural technique that propagates context information from low-scale to high-scale patches in a hierarchical manner. Second, to accelerate training and inference, we propose adaptive computation, which allocates more network capacity and computation towards coarse image details. The resulting model sets a new state-of-the-art FVD score of 66.32 and Inception Score of 87.68 for class-conditional video generation on UCF-101 256x256, surpassing recent methods by more than 100%. We then show that it can be rapidly fine-tuned from a base 36x64 low-resolution generator for high-resolution 64x288x512 text-to-video synthesis. To the best of our knowledge, our model is the first diffusion-based architecture trained at such high resolutions entirely end-to-end. Project webpage: https://snap-research.github.io/hpdm.
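To make the patch-based training idea concrete, below is a minimal sketch of hierarchical patch sampling, assuming a PyTorch pipeline. It is not the authors' code: the function name `sample_patch_pyramid` and the patch/scale settings are hypothetical, and the exact pixel fraction depends on the chosen resolutions. The point it illustrates is that every scale is a fixed-size patch, with coarser scales covering more spatial context around the fine patch, so the network only ever processes small inputs regardless of the original resolution.

```python
# Minimal sketch (illustrative, not the authors' implementation) of sampling a
# hierarchical patch pyramid from a high-resolution video for patch diffusion.
import torch
import torch.nn.functional as F

def sample_patch_pyramid(video, patch_size=64, num_scales=3):
    """Extract a pyramid of fixed-size patches from a (C, T, H, W) video.

    Scale 0 covers the whole (downsampled) frame; each subsequent scale zooms
    into a random sub-region of the previous one, so all scales have the same
    spatial size `patch_size` while covering progressively finer detail.
    """
    c, t, h, w = video.shape
    patches = []
    top, left, cur_h, cur_w = 0, 0, h, w
    for scale in range(num_scales):
        region = video[:, :, top:top + cur_h, left:left + cur_w]
        # Resize the current region to the fixed patch size (frames as batch).
        patch = F.interpolate(region.permute(1, 0, 2, 3),
                              size=(patch_size, patch_size),
                              mode='bilinear', align_corners=False)
        patches.append(patch.permute(1, 0, 2, 3))
        # Zoom into a random sub-region for the next, finer scale.
        cur_h, cur_w = cur_h // 2, cur_w // 2
        top = top + torch.randint(0, cur_h + 1, (1,)).item()
        left = left + torch.randint(0, cur_w + 1, (1,)).item()
    return patches  # list of (C, T, patch_size, patch_size) tensors

# Toy usage: a 16-frame 512x512 clip yields three 64x64 patches per step.
video = torch.randn(3, 16, 512, 512)
pyramid = sample_patch_pyramid(video)
print([tuple(p.shape) for p in pyramid])
```

In the paper's design, consistency between these scales is enforced by deep context fusion (coarse-scale features conditioning finer-scale patches), and adaptive computation spends more capacity on the coarse scales; the sketch above only covers the sampling side.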
Cite
Text
Skorokhodov et al. "Hierarchical Patch Diffusion Models for High-Resolution Video Generation." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00723

BibTeX
@inproceedings{skorokhodov2024cvpr-hierarchical,
title = {{Hierarchical Patch Diffusion Models for High-Resolution Video Generation}},
author = {Skorokhodov, Ivan and Menapace, Willi and Siarohin, Aliaksandr and Tulyakov, Sergey},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {7569-7579},
doi = {10.1109/CVPR52733.2024.00723},
url = {https://mlanthology.org/cvpr/2024/skorokhodov2024cvpr-hierarchical/}
}