CogVideoX: Text-to-Video Diffusion Models with an Expert Transformer

Abstract

We present CogVideoX, a large-scale text-to-video generation model based on a diffusion transformer, which can generate 10-second continuous videos that align seamlessly with text prompts, at a frame rate of 16 fps and a resolution of 768×1360 pixels. Previous video generation models often struggled with limited motion and short durations, and generating videos with coherent, text-driven narratives is especially difficult. We propose several designs to address these issues. First, we introduce a 3D Variational Autoencoder (VAE) to compress videos across spatial and temporal dimensions, improving both the compression rate and video fidelity. Second, to improve text-video alignment, we propose an expert transformer with expert adaptive LayerNorm to facilitate deep fusion between the two modalities. Third, by employing progressive training and multi-resolution frame packing, CogVideoX excels at generating coherent, long-duration videos with diverse shapes and dynamic movements. In addition, we develop an effective pipeline that includes various pre-processing strategies for text and video data. Our innovative video captioning model significantly improves generation quality and semantic alignment. Results show that CogVideoX achieves state-of-the-art performance in both automated benchmarks and human evaluation. We publish the code and model checkpoints of CogVideoX, along with our VAE model and video captioning model, at https://github.com/THUDM/CogVideo.
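The "expert adaptive LayerNorm" mentioned above can be pictured as a single LayerNorm whose scale and shift are chosen per modality for a concatenated [text; video] token sequence. The sketch below is a minimal, hypothetical illustration of that idea in plain Python (in the actual model the scale/shift pairs are predicted from the diffusion timestep embedding, and everything runs on tensors); the function names and the `(scale, shift)` interface are assumptions for illustration only.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize a single token vector to zero mean, unit variance.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def expert_adaln(tokens, n_text, text_mod, video_mod):
    """Modality-specific adaptive LayerNorm over a [text; video] sequence.

    tokens:    list of token vectors, text tokens first, then video tokens
    n_text:    number of text tokens at the front of the sequence
    text_mod,
    video_mod: (scale, shift) pairs; in the real model these would be
               produced by separate "expert" projections of the timestep
               embedding (hypothetical interface here)
    """
    out = []
    for i, tok in enumerate(tokens):
        # Pick the expert modulation for this token's modality.
        scale, shift = text_mod if i < n_text else video_mod
        out.append([scale * v + shift for v in layer_norm(tok)])
    return out
```

The point of the per-modality modulation is that text and video tokens have very different statistics, so sharing one adaptive LayerNorm across both would force a compromise; giving each modality its own scale/shift lets them be fused in one attention sequence without one distribution dominating the other.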

Cite

Text

Yang et al. "CogVideoX: Text-to-Video Diffusion Models with an Expert Transformer." International Conference on Learning Representations, 2025.

Markdown

[Yang et al. "CogVideoX: Text-to-Video Diffusion Models with an Expert Transformer." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/yang2025iclr-cogvideox/)

BibTeX

@inproceedings{yang2025iclr-cogvideox,
  title     = {{CogVideoX: Text-to-Video Diffusion Models with an Expert Transformer}},
  author    = {Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and Yin, Da and Zhang, Yuxuan and Wang, Weihan and Cheng, Yean and Xu, Bin and Gu, Xiaotao and Dong, Yuxiao and Tang, Jie},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/yang2025iclr-cogvideox/}
}