VEU-Bench: Towards Comprehensive Understanding of Video Editing
Abstract
Widely shared videos on the internet are often edited. Although Video Large Language Models (Vid-LLMs) have recently made great progress in general video understanding tasks, their capabilities in video editing understanding (VEU) tasks remain unexplored. To address this gap, we introduce VEU-Bench (Video Editing Understanding Benchmark), a comprehensive benchmark that categorizes video editing components across various dimensions, from intra-frame features like shot size to inter-shot attributes such as cut types and transitions. Unlike previous video editing understanding benchmarks that focus mainly on editing element classification, VEU-Bench encompasses 19 fine-grained tasks across three stages: recognition, reasoning, and judging. To automate and enhance VEU annotation, we built an annotation pipeline integrated with an ontology-based knowledge base. Extensive experiments with 11 state-of-the-art Vid-LLMs reveal that current Vid-LLMs face significant challenges in VEU tasks, with some performing worse than random choice. To alleviate this issue, we develop Oscars (named after the Academy Awards), a VEU expert model fine-tuned on the curated VEU-Bench dataset. It outperforms existing open-source Vid-LLMs on VEU-Bench by over 28.3% in accuracy and achieves performance comparable to commercial models like GPT-4o. We also demonstrate that incorporating VEU data significantly enhances the performance of Vid-LLMs on general video understanding benchmarks, with an average improvement of 8.3% across nine reasoning tasks. The code and data will be made available.
Cite
Text
Li et al. "VEU-Bench: Towards Comprehensive Understanding of Video Editing." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01276
Markdown
[Li et al. "VEU-Bench: Towards Comprehensive Understanding of Video Editing." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/li2025cvpr-veubench/) doi:10.1109/CVPR52734.2025.01276
BibTeX
@inproceedings{li2025cvpr-veubench,
title = {{VEU-Bench: Towards Comprehensive Understanding of Video Editing}},
author = {Li, Bozheng and Wu, Yongliang and Lu, Yi and Yu, Jiashuo and Tang, Licheng and Cao, Jiawang and Zhu, Wenqing and Sun, Yuyang and Wu, Jay and Zhu, Wenbo},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {13671--13680},
doi = {10.1109/CVPR52734.2025.01276},
url = {https://mlanthology.org/cvpr/2025/li2025cvpr-veubench/}
}