Rewind and Render: Towards Factually Accurate Text-to-Video Generation with Distilled Knowledge Retrieval

Abstract

Text-to-Video (T2V) models, despite recent advancements, struggle with factual accuracy, especially for knowledge-dense content. We introduce FACT-V (Factual Accuracy in Content Translation to Video), a system integrating multi-source knowledge retrieval into T2V pipelines. FACT-V offers two key benefits: i) improved factual accuracy of generated videos through dynamically retrieved information, and ii) increased interpretability by providing users with the augmented prompt information. A preliminary evaluation demonstrates the potential of knowledge-augmented approaches in improving the accuracy and reliability of T2V systems, particularly for entity-specific or time-sensitive prompts.

Cite

Text

Lee et al. "Rewind and Render: Towards Factually Accurate Text-to-Video Generation with Distilled Knowledge Retrieval." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I28.35356

Markdown

[Lee et al. "Rewind and Render: Towards Factually Accurate Text-to-Video Generation with Distilled Knowledge Retrieval." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/lee2025aaai-rewind/) doi:10.1609/AAAI.V39I28.35356

BibTeX

@inproceedings{lee2025aaai-rewind,
  title     = {{Rewind and Render: Towards Factually Accurate Text-to-Video Generation with Distilled Knowledge Retrieval}},
  author    = {Lee, Daniel and Chandra, Arjun and Zhou, Yang and Li, Yunyao and Conia, Simone},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {29652--29654},
  doi       = {10.1609/AAAI.V39I28.35356},
  url       = {https://mlanthology.org/aaai/2025/lee2025aaai-rewind/}
}