ETVA: Evaluation of Text-to-Video Alignment via Fine-Grained Question Generation and Answering

Abstract

Precisely evaluating semantic alignment between text prompts and generated videos remains a challenge in Text-to-Video (T2V) generation. Existing text-to-video alignment metrics such as CLIPScore produce only coarse-grained scores without fine-grained alignment details, and so fail to align with human preference. To address this limitation, we propose ETVA, a novel Evaluation method of Text-to-Video Alignment via fine-grained question generation and answering. First, a multi-agent system parses prompts into semantic scene graphs to generate atomic questions. Then we design a knowledge-augmented multi-stage reasoning framework for question answering, in which an auxiliary LLM first retrieves relevant common-sense knowledge (e.g., physical laws), and a video LLM then answers the generated questions through a multi-stage reasoning mechanism. Extensive experiments demonstrate that ETVA achieves a Spearman's correlation coefficient of 58.47, a much higher correlation with human judgment than existing metrics, which attain only 31.0. We also construct a comprehensive benchmark specifically designed for text-to-video alignment evaluation, featuring 2k diverse prompts and 12k atomic questions spanning 10 categories. Through a systematic evaluation of 15 existing text-to-video models, we identify their key capabilities and limitations, paving the way for next-generation T2V generation. All code and datasets will be made publicly available soon.
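
The abstract compares metrics by their Spearman's correlation with human judgment. As a minimal sketch of how such a figure can be computed, the Python below ranks two score lists (averaging ranks over ties) and takes the Pearson correlation of the ranks. The function names and the sample data are hypothetical and are not taken from the paper; multiplying by 100 assumes that the reported 58.47 and 31.0 are on a 0 to 100 scale, which the abstract does not state explicitly.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (sd_x * sd_y)


# Hypothetical data: one automatic alignment score and one human rating per video.
metric_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
human_ratings = [5, 2, 4, 1, 2]

rho = spearman(metric_scores, human_ratings)
print(f"Spearman's rho x 100: {100 * rho:.2f}")
```

Rank-based correlation only asks whether the metric orders videos the same way annotators do, so it is insensitive to the differing scales of automatic scores and human ratings.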

Cite

Text

Guan et al. "ETVA: Evaluation of Text-to-Video Alignment via Fine-Grained Question Generation and Answering." International Conference on Computer Vision, 2025.

Markdown

[Guan et al. "ETVA: Evaluation of Text-to-Video Alignment via Fine-Grained Question Generation and Answering." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/guan2025iccv-etva/)

BibTeX

@inproceedings{guan2025iccv-etva,
  title     = {{ETVA: Evaluation of Text-to-Video Alignment via Fine-Grained Question Generation and Answering}},
  author    = {Guan, Kaisi and Lai, Zhengfeng and Sun, Yuchong and Zhang, Peng and Liu, Wei and Liu, Kieran and Cao, Meng and Song, Ruihua},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {21299--21309},
  url       = {https://mlanthology.org/iccv/2025/guan2025iccv-etva/}
}