xGen-VideoSyn-1: High-Fidelity Text-to-Video Synthesis with Compressed Representations
Abstract
We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. We extend the latent diffusion model (LDM) architecture by introducing a video variational autoencoder (VidVAE). The VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens and the computational demands of generating long-sequence videos. To further address the computational cost, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different time frames and aspect ratios. We also design a data collection and processing pipeline, which helped us gather over 13 million high-quality video-text pairs. The pipeline includes steps such as clipping, text detection, motion estimation, aesthetics scoring, and dense captioning based on our xGen-MM video-language model. Training the VidVAE and DiT models required approximately 40 and 642 H100 days, respectively. Our model supports end-to-end generation of 720p videos over 14 seconds long and demonstrates competitive performance against state-of-the-art T2V models.
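To illustrate why spatiotemporal compression matters here, the sketch below works through the token-count arithmetic for a latent video diffusion model. The downsampling strides and patch size are illustrative assumptions, not figures reported in the paper.

```python
# Illustrative token-count arithmetic for a latent video diffusion model.
# All compression factors below are assumptions for the sketch, not values
# from xGen-VideoSyn-1 itself.

def latent_token_count(frames: int, height: int, width: int,
                       t_stride: int = 4, s_stride: int = 8,
                       patch: int = 2) -> int:
    """Number of DiT tokens after VAE compression and patchification.

    t_stride: assumed temporal downsampling factor of the video VAE
    s_stride: assumed spatial downsampling factor of the video VAE
    patch:    assumed DiT patch size applied to the latent grid
    """
    lt = frames // t_stride            # latent frames
    lh = height // s_stride            # latent height
    lw = width // s_stride             # latent width
    return lt * (lh // patch) * (lw // patch)

# A 14-second 720p clip at an assumed 24 fps:
frames = 14 * 24                       # 336 frames
tokens = latent_token_count(frames, 720, 1280)
print(tokens)                          # 84 * 45 * 80 = 302,400 tokens
```

Even with these modest assumed factors, the sequence shrinks by three orders of magnitude relative to raw pixels, which is what makes long-sequence 720p generation tractable for a transformer whose attention cost grows quadratically in sequence length.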
Cite
Text
Qin et al. "xGen-VideoSyn-1: High-Fidelity Text-to-Video Synthesis with Compressed Representations." European Conference on Computer Vision Workshops, 2024. doi:10.1007/978-3-031-92808-6_16
Markdown
[Qin et al. "xGen-VideoSyn-1: High-Fidelity Text-to-Video Synthesis with Compressed Representations." European Conference on Computer Vision Workshops, 2024.](https://mlanthology.org/eccvw/2024/qin2024eccvw-xgenvideosyn1/) doi:10.1007/978-3-031-92808-6_16
BibTeX
@inproceedings{qin2024eccvw-xgenvideosyn1,
title = {{xGen-VideoSyn-1: High-Fidelity Text-to-Video Synthesis with Compressed Representations}},
author = {Qin, Can and Xia, Congying and Ramakrishnan, Krithika and Ryoo, Michael S. and Tu, Lifu and Feng, Yihao and Shu, Manli and Zhou, Honglu and Awadalla, Anas and Wang, Jun and Purushwalkam, Senthil and Xue, Le and Zhou, Yingbo and Wang, Huan and Savarese, Silvio and Niebles, Juan Carlos and Chen, Zeyuan and Xu, Ran and Xiong, Caiming},
booktitle = {European Conference on Computer Vision Workshops},
year = {2024},
pages = {249-265},
doi = {10.1007/978-3-031-92808-6_16},
url = {https://mlanthology.org/eccvw/2024/qin2024eccvw-xgenvideosyn1/}
}