xGen-VideoSyn-1: High-Fidelity Text-to-Video Synthesis with Compressed Representations
Abstract
We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. We extend the latent diffusion model (LDM) architecture by introducing a video variational autoencoder (VidVAE). The VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens and the computational demands of generating long-sequence videos. To further address the computational cost, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different time frames and aspect ratios. We also design a data collection and processing pipeline, which helped us gather over 13 million high-quality video-text pairs. The pipeline includes steps such as clipping, text detection, motion estimation, aesthetics scoring, and dense captioning based on our xGen-MM video-language model. Training the VidVAE and DiT models required approximately 40 and 642 H100 days, respectively. Our model supports end-to-end generation of 720p videos over 14 seconds long and demonstrates competitive performance against state-of-the-art T2V models.
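To illustrate why spatiotemporal compression matters here, the sketch below works through the token-count arithmetic for a latent video diffusion model. The downsampling strides and patch size are illustrative assumptions, not figures reported in the paper.

```python
# Illustrative token-count arithmetic for a latent video diffusion model.
# All compression factors below are assumptions for the sketch, not values
# from xGen-VideoSyn-1 itself.

def latent_token_count(frames: int, height: int, width: int,
                       t_stride: int = 4, s_stride: int = 8,
                       patch: int = 2) -> int:
    """Number of DiT tokens after VAE compression and patchification.

    t_stride: assumed temporal downsampling factor of the video VAE
    s_stride: assumed spatial downsampling factor of the video VAE
    patch:    assumed DiT patch size applied to the latent grid
    """
    lt = frames // t_stride            # latent frames
    lh = height // s_stride            # latent height
    lw = width // s_stride             # latent width
    return lt * (lh // patch) * (lw // patch)

# A 14-second 720p clip at an assumed 24 fps:
frames = 14 * 24                       # 336 frames
tokens = latent_token_count(frames, 720, 1280)
print(tokens)                          # 84 * 45 * 80 = 302,400 tokens
```

Even with these modest assumed factors, the sequence shrinks by three orders of magnitude relative to raw pixels, which is what makes long-sequence 720p generation tractable for a transformer whose attention cost grows quadratically in sequence length.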
Cite
Text
Qin et al. "xGen-VideoSyn-1: High-Fidelity Text-to-Video Synthesis with Compressed Representations." European Conference on Computer Vision Workshops, 2024. doi:10.1007/978-3-031-92808-6_16
Markdown
[Qin et al. "xGen-VideoSyn-1: High-Fidelity Text-to-Video Synthesis with Compressed Representations." European Conference on Computer Vision Workshops, 2024.](https://mlanthology.org/eccvw/2024/qin2024eccvw-xgenvideosyn1/) doi:10.1007/978-3-031-92808-6_16
BibTeX
@inproceedings{qin2024eccvw-xgenvideosyn1,
title = {{xGen-VideoSyn-1: High-Fidelity Text-to-Video Synthesis with Compressed Representations}},
author = {Qin, Can and Xia, Congying and Ramakrishnan, Krithika and Ryoo, Michael S. and Tu, Lifu and Feng, Yihao and Shu, Manli and Zhou, Honglu and Awadalla, Anas and Wang, Jun and Purushwalkam, Senthil and Xue, Le and Zhou, Yingbo and Wang, Huan and Savarese, Silvio and Niebles, Juan Carlos and Chen, Zeyuan and Xu, Ran and Xiong, Caiming},
booktitle = {European Conference on Computer Vision Workshops},
year = {2024},
pages = {249-265},
doi = {10.1007/978-3-031-92808-6_16},
url = {https://mlanthology.org/eccvw/2024/qin2024eccvw-xgenvideosyn1/}
}