Hallucination-Aware Prompt Optimization for Text-to-Video Synthesis

Abstract

The rapid advancement of AI-generated content (AIGC) has led to extensive research on and application of deep text-to-video (T2V) synthesis models, such as OpenAI's Sora. To produce high-quality videos, these models typically rely on high-quality prompt-video pairs and detailed text prompts during training. To boost the effectiveness of Sora-like T2V models, we introduce VidPrompter, an innovative large multi-modal model supporting T2V applications with three key functionalities: (1) generating detailed prompts from raw videos, (2) enhancing prompts from videos grounded with short descriptions, and (3) refining simple user-provided prompts to elevate T2V video quality. We train VidPrompter using a hybrid multi-task paradigm and propose the hallucination-aware direct preference optimization (HDPO) technique to improve the multi-modal, multi-task prompt optimization process. Experiments on various tasks show that our method surpasses strong baselines and other competitors.
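The abstract does not spell out the HDPO objective. For context only, the standard direct preference optimization (DPO) loss that such a technique would presumably build on (Rafailov et al., 2023) is sketched below; how hallucination awareness enters (e.g., via pair construction or weighting) is not specified here and the notation is an assumption, not the paper's formulation.

```latex
% Standard DPO objective (Rafailov et al., 2023); HDPO's exact form is not given in the abstract.
% x: input (e.g., a video and/or short description), y_w: preferred prompt, y_l: dispreferred prompt,
% \pi_\theta: policy being trained, \pi_{\mathrm{ref}}: frozen reference model, \beta: temperature.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
  \left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```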

Cite

Text

Wang et al. "Hallucination-Aware Prompt Optimization for Text-to-Video Synthesis." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/1133

Markdown

[Wang et al. "Hallucination-Aware Prompt Optimization for Text-to-Video Synthesis." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/wang2025ijcai-hallucination/) doi:10.24963/IJCAI.2025/1133

BibTeX

@inproceedings{wang2025ijcai-hallucination,
  title     = {{Hallucination-Aware Prompt Optimization for Text-to-Video Synthesis}},
  author    = {Wang, Jiapeng and Wang, Chengyu and Huang, Jun and Jin, Lianwen},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {10198--10206},
  doi       = {10.24963/IJCAI.2025/1133},
  url       = {https://mlanthology.org/ijcai/2025/wang2025ijcai-hallucination/}
}