OpenVid-1m: A Large-Scale High-Quality Dataset for Text-to-Video Generation

Abstract

Text-to-video (T2V) generation has recently garnered significant attention thanks to the large multi-modality model Sora. However, T2V generation still faces two important challenges: 1) The lack of a precise, open-source, high-quality dataset. Previously popular video datasets, e.g., WebVid-10M and Panda-70M, overly emphasized large scale, resulting in the inclusion of many low-quality videos and short, imprecise captions. It is therefore challenging but crucial to collect a precise, high-quality dataset while maintaining a scale of millions for T2V generation. 2) Underutilization of textual information. Recent T2V methods have focused on vision transformers, using a simple cross-attention module for video generation, which falls short of making full use of the semantic information in text tokens. To address these issues, we introduce OpenVid-1M, a precise high-quality dataset with expressive captions. This open-scenario dataset contains over 1 million text-video pairs, facilitating research on T2V generation. Furthermore, we curate 433K 1080p videos from OpenVid-1M to create OpenVidHD-0.4M, advancing high-definition video generation. Additionally, we propose a novel Multi-modal Video Diffusion Transformer (MVDiT) capable of mining both structural information from visual tokens and semantic information from text tokens. Extensive experiments and ablation studies verify the superiority of OpenVid-1M over previous datasets and the effectiveness of our MVDiT.
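To illustrate the distinction the abstract draws between simple cross-attention and a multi-modal transformer, the sketch below contrasts the two in NumPy: cross-attention lets video queries attend only to text keys, while joint attention over the concatenated token sequence lets visual and text tokens interact in both directions. This is a minimal illustrative sketch, not the paper's MVDiT implementation; the function names and single-head formulation are assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def make_proj(d, seed=0):
    # Random single-head query/key/value projections (illustrative only).
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]

def cross_attention(visual_tokens, text_tokens, d):
    # Video tokens are queries; text tokens supply keys/values.
    # Text tokens never attend back to the video.
    Wq, Wk, Wv = make_proj(d)
    q = visual_tokens @ Wq
    k, v = text_tokens @ Wk, text_tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))          # (Nv, Nt)
    return attn @ v                                # (Nv, d)

def joint_attention(visual_tokens, text_tokens, d):
    # Multi-modal self-attention: concatenate both modalities so every
    # token (visual or textual) attends to every other token.
    tokens = np.concatenate([visual_tokens, text_tokens], axis=0)
    Wq, Wk, Wv = make_proj(d)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))          # (Nv+Nt, Nv+Nt)
    return attn @ v                                # (Nv+Nt, d)
```

Note the output shapes: cross-attention updates only the video tokens, whereas joint attention produces updated representations for both modalities, which is how a multi-modal block can propagate semantics from captions into the visual stream and vice versa.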

Cite

Text

Nan et al. "OpenVid-1m: A Large-Scale High-Quality Dataset for Text-to-Video Generation." International Conference on Learning Representations, 2025.

Markdown

[Nan et al. "OpenVid-1m: A Large-Scale High-Quality Dataset for Text-to-Video Generation." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/nan2025iclr-openvid1m/)

BibTeX

@inproceedings{nan2025iclr-openvid1m,
  title     = {{OpenVid-1m: A Large-Scale High-Quality Dataset for Text-to-Video Generation}},
  author    = {Nan, Kepan and Xie, Rui and Zhou, Penghao and Fan, Tiehan and Yang, Zhenheng and Chen, Zhijie and Li, Xiang and Yang, Jian and Tai, Ying},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/nan2025iclr-openvid1m/}
}