Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation
Abstract
Generating high-quality videos that synthesize desired realistic content is challenging due to their high dimensionality and complexity. Several recent diffusion-based methods have shown comparable performance by compressing videos into a lower-dimensional latent space using traditional video autoencoder architectures. However, such methods, which employ standard frame-wise 2D or 3D convolutions, fail to fully exploit the spatio-temporal nature of videos. To address this issue, we propose a novel hybrid video diffusion model, called HVDM, which can capture spatio-temporal dependencies more effectively. HVDM is trained with a hybrid video autoencoder that extracts a disentangled representation of the video, including: (i) global context information captured by a 2D projected latent, (ii) local volume information captured by 3D convolutions with wavelet decomposition, and (iii) frequency information for improving video reconstruction. Based on this disentangled representation, our hybrid autoencoder provides a more comprehensive video latent, enriching the generated videos with fine structures and details. Experiments on standard video generation benchmarks such as UCF101, SkyTimelapse, and TaiChi demonstrate that the proposed approach achieves state-of-the-art video generation quality and supports a wide range of video applications (e.g., long video generation, image-to-video, and video dynamics control). The source code and pre-trained models will be publicly available once the paper is accepted.
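To make the two representations mentioned in the abstract concrete, the following is a minimal, illustrative sketch (not the authors' code) of a 2D triplane-style projection for global context and a single-level 3D Haar wavelet decomposition for local volume and frequency detail. The tensor layout, mean-based projection, and Haar filters are assumptions made for illustration only.

```python
# Minimal sketch of (i) a 2D triplane projection and (ii) a 3D Haar DWT.
# Shapes and the averaging-based projection are illustrative assumptions.
import torch

def triplane_project(video: torch.Tensor):
    """video: (B, C, T, H, W) -> three 2D feature maps (assumed mean projection)."""
    xy = video.mean(dim=2)   # (B, C, H, W)  collapse time
    tw = video.mean(dim=3)   # (B, C, T, W)  collapse height
    th = video.mean(dim=4)   # (B, C, T, H)  collapse width
    return xy, tw, th

def haar_dwt3d(video: torch.Tensor):
    """Single-level 3D Haar DWT; returns 8 subbands, each half-size per axis.
    Assumes T, H, and W are even."""
    def split(x, dim):
        a = x.index_select(dim, torch.arange(0, x.size(dim), 2))
        b = x.index_select(dim, torch.arange(1, x.size(dim), 2))
        lo = (a + b) / 2 ** 0.5   # low-pass (coarse structure)
        hi = (a - b) / 2 ** 0.5   # high-pass (fine detail / frequency)
        return lo, hi
    bands = [video]
    for dim in (2, 3, 4):         # decompose along T, H, W in turn
        bands = [s for x in bands for s in split(x, dim)]
    return bands                  # [LLL, LLH, LHL, LHH, HLL, HLH, HHL, HHH]

if __name__ == "__main__":
    v = torch.randn(1, 3, 16, 64, 64)
    xy, tw, th = triplane_project(v)
    subbands = haar_dwt3d(v)
    print(xy.shape, tw.shape, th.shape, len(subbands), subbands[0].shape)
```

In this sketch the three projected planes summarize global content along each axis pair, while the eight wavelet subbands separate coarse structure from high-frequency detail; the paper's autoencoder combines these complementary signals into a single video latent.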
Cite
Text
Kim et al. "Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72943-0_9
Markdown
[Kim et al. "Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/kim2024eccv-hybrid/) doi:10.1007/978-3-031-72943-0_9
BibTeX
@inproceedings{kim2024eccv-hybrid,
title = {{Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation}},
author = {Kim, Kihong and Lee, Haneol and Park, Jihye and Kim, Seyeon and Lee, Kwang Hee and Kim, Seungryong and Yoo, Jaejun},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-72943-0_9},
url = {https://mlanthology.org/eccv/2024/kim2024eccv-hybrid/}
}