SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model

Abstract

While AI-generated content has garnered significant attention, achieving photo-realistic video synthesis remains a formidable challenge. Despite promising advances in the generation quality of video diffusion models, their complex architectures and substantial computational demands for both training and inference create a significant gap between these models and real-world applications. This paper presents SNED, a superposition network architecture search method for efficient video diffusion models. Our method employs a supernet training paradigm that targets various model cost and resolution options using a weight-sharing method. Moreover, we propose a supernet training sampling warm-up for fast training optimization. To showcase the flexibility of our method, we conduct experiments involving both pixel-space and latent-space video diffusion models. The results demonstrate that our framework consistently produces comparable results across different model options with high efficiency. In the pixel-space experiments, we achieve consistent video generation results simultaneously across resolutions from 64 x 64 to 256 x 256 and across a large range of model sizes, from 640M to 1.6B parameters.
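
The abstract names two ingredients, weight sharing across model-size and resolution options and a sampling warm-up during supernet training, without giving their details. The following is a minimal sketch of how such a scheme could look, not the authors' implementation. The search space (`WIDTHS`, `RESOLUTIONS`), the linear unlock rule in `candidate_pool`, and the leading-slice sharing in `shared_slice` are all assumptions introduced here for illustration.

```python
import random

# Hypothetical search space (not from the paper): channel-width
# multipliers and square resolutions, each ordered largest first.
WIDTHS = [1.0, 0.75, 0.5]
RESOLUTIONS = [256, 128, 64]


def candidate_pool(step, warmup_steps):
    """Return the (width, resolution) options that may be sampled at `step`.

    Assumed warm-up rule: only the largest option is available at step 0,
    and smaller options are unlocked linearly until `warmup_steps`.
    """
    frac = min(1.0, step / warmup_steps) if warmup_steps > 0 else 1.0
    n_widths = 1 + int(frac * (len(WIDTHS) - 1))
    n_res = 1 + int(frac * (len(RESOLUTIONS) - 1))
    return [(w, r) for w in WIDTHS[:n_widths] for r in RESOLUTIONS[:n_res]]


def sample_subnet(step, warmup_steps, rng):
    """Pick one sub-network configuration for the current training step."""
    return rng.choice(candidate_pool(step, warmup_steps))


def shared_slice(weight, width):
    """Assumed weight sharing: a sub-network reuses the leading rows and
    columns of the full weight matrix instead of owning separate weights."""
    rows = max(1, int(len(weight) * width))
    cols = max(1, int(len(weight[0]) * width))
    return [row[:cols] for row in weight[:rows]]


if __name__ == "__main__":
    rng = random.Random(0)
    full_weight = [[float(4 * i + j) for j in range(4)] for i in range(4)]
    for step in (0, 50, 100):
        width, res = sample_subnet(step, warmup_steps=100, rng=rng)
        sub = shared_slice(full_weight, width)
        print(step, width, res, f"{len(sub)}x{len(sub[0])}")
```

In a real training loop, each sampled configuration would select the sub-network that receives the gradient update for that step, so that all options are trained inside one set of shared weights.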

Cite

Text

Li et al. "SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00827

Markdown

[Li et al. "SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/li2024cvpr-sned/) doi:10.1109/CVPR52733.2024.00827

BibTeX

@inproceedings{li2024cvpr-sned,
  title     = {{SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model}},
  author    = {Li, Zhengang and Kang, Yan and Liu, Yuchen and Liu, Difan and Hinz, Tobias and Liu, Feng and Wang, Yanzhi},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {8661-8670},
  doi       = {10.1109/CVPR52733.2024.00827},
  url       = {https://mlanthology.org/cvpr/2024/li2024cvpr-sned/}
}