Hallucination-Aware Prompt Optimization for Text-to-Video Synthesis
Abstract
The rapid advancement of AI-generated content (AIGC) has led to extensive research on and application of deep text-to-video (T2V) synthesis models, such as OpenAI's Sora. These models typically rely on high-quality prompt-video pairs with detailed text prompts during training to produce high-quality videos. To boost the effectiveness of Sora-like T2V models, we introduce VidPrompter, an innovative large multi-modal model supporting T2V applications with three key functionalities: (1) generating detailed prompts from raw videos, (2) enhancing prompts from videos grounded with short descriptions, and (3) refining simple user-provided prompts to elevate T2V video quality. We train VidPrompter under a hybrid multi-task paradigm and propose the hallucination-aware direct preference optimization (HDPO) technique to improve the multi-modal, multi-task prompt optimization process. Experiments on various tasks show that our method surpasses strong baselines and other competitors.
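The paper's exact HDPO objective is not reproduced on this page. As a rough illustration only, a standard direct preference optimization (DPO) loss over preferred vs. dispreferred prompt rewrites could be sketched in PyTorch as below; the optional hallucination_weight argument is a hypothetical stand-in for the paper's hallucination-aware component, not the authors' formulation.

import torch
import torch.nn.functional as F

def hdpo_style_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps,
                    beta=0.1, hallucination_weight=None):
    # Standard DPO: the margin between the preferred (chosen) and
    # dispreferred (rejected) rewrites, measured as policy-vs-reference
    # log-ratios of per-sequence summed token log-probabilities.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    losses = -F.logsigmoid(logits)
    # Hypothetical hallucination-aware term (an assumption): up-weight
    # pairs whose rejected rewrite was flagged as hallucinating content
    # absent from the source video.
    if hallucination_weight is not None:
        losses = losses * hallucination_weight
    return losses.mean()

# Toy usage: a batch of 4 preference pairs with random log-probs.
b = torch.randn(4)
loss = hdpo_style_loss(b, b - 1.0, torch.zeros(4), torch.zeros(4))

Each *_logps tensor holds one scalar per preference pair in the batch; how the preference pairs and any hallucination labels are constructed is described in the paper itself.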
Cite
Text
Wang et al. "Hallucination-Aware Prompt Optimization for Text-to-Video Synthesis." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/1133
Markdown
[Wang et al. "Hallucination-Aware Prompt Optimization for Text-to-Video Synthesis." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/wang2025ijcai-hallucination/) doi:10.24963/IJCAI.2025/1133
BibTeX
@inproceedings{wang2025ijcai-hallucination,
title = {{Hallucination-Aware Prompt Optimization for Text-to-Video Synthesis}},
author = {Wang, Jiapeng and Wang, Chengyu and Huang, Jun and Jin, Lianwen},
booktitle = {International Joint Conference on Artificial Intelligence},
year = {2025},
pages = {10198--10206},
doi = {10.24963/IJCAI.2025/1133},
url = {https://mlanthology.org/ijcai/2025/wang2025ijcai-hallucination/}
}