Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model

Xu, Lu; Zhu, Sijie; Li, Chunyuan; Kuo, Chia-Wen; Chen, Fan; Wang, Xinyao; Chen, Guang; Du, Dawei; Yuan, Ye; Wen, Longyin

CVPRW 2025 pp. 503-512

Abstract

Emerging video LMMs (Large Multimodal Models) have achieved strong performance on generic video understanding in the form of VQA (Visual Question Answering), which mainly focuses on raw videos captured with cameras. However, a large portion of videos in real-world applications are edited videos, e.g., users usually cut the raw video and add effects or modifications before publishing it on social media platforms. Edited videos usually have high view counts, but they are not covered by existing benchmarks for video LMMs, e.g., ActivityNet-QA or the VideoChatGPT benchmark. In this paper, we take advantage of edited videos on a popular short-video platform, i.e., TikTok, and build a video VQA benchmark (named EditVid-QA) covering four typical editing categories, i.e., effect, funny, meme, and game. Funny and meme videos test nuanced understanding and high-level reasoning, while effect and game videos evaluate the understanding of artificial design. Most open-source video LMMs perform poorly on the EditVid-QA benchmark, indicating a large domain gap between edited short videos on social media and regular raw videos. To improve the generalization ability of LMMs, we collect a training set for the proposed benchmark based on both Panda-70M/WebVid raw videos and small-scale TikTok/CapCut edited videos, which boosts performance on the proposed EditVid-QA benchmark, indicating the effectiveness of high-quality training data. We also identify an issue in the existing video evaluation protocol that uses a GPT-3.5 judge, namely a "sorry" attack, where a naive sorry-style answer can achieve an extremely high rating from the GPT judge, e.g., over 4.3 for the correctness score under the VideoChatGPT evaluation protocol. To avoid "sorry" attacks, we evaluate results with a GPT-4 judge and keyword filtering. The dataset is released at https://github.com/xenonlamb/editvid-qa.
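
The keyword-filtering idea mentioned above can be illustrated with a minimal sketch. The abstract does not give the actual keyword list or how filtered answers are scored, so everything here is an assumption made for illustration: the `SORRY_KEYWORDS` tuple, the function names `is_sorry_style` and `filtered_score`, and the choice to assign a floor score of 0.0 to a flagged answer. It is not the authors' implementation.

```python
import re

# Hypothetical refusal phrases; the paper's real keyword list is not given in the abstract.
SORRY_KEYWORDS = ("sorry", "i cannot", "i can't", "unable to", "apologize")


def is_sorry_style(answer: str) -> bool:
    """Return True if the answer contains any of the assumed refusal phrases."""
    # Normalize case, curly apostrophes, and whitespace before substring matching.
    text = answer.lower().replace("\u2019", "'")
    text = re.sub(r"\s+", " ", text).strip()
    return any(keyword in text for keyword in SORRY_KEYWORDS)


def filtered_score(answer: str, judge_score: float, floor: float = 0.0) -> float:
    """Override the judge's score with an assumed floor value for sorry-style answers."""
    return floor if is_sorry_style(answer) else judge_score


if __name__ == "__main__":
    # The judge scores below are made-up inputs, not results from the paper.
    print(filtered_score("Sorry, I cannot tell what happens in the video.", 4.3))
    print(filtered_score("A man applies a face-swap effect to his dog.", 3.0))
```

Plain substring matching is deliberately simple and can flag legitimate answers that happen to contain one of the phrases, so a real protocol would likely need a more careful keyword list or matching rule.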

Cite

Text

Xu et al. "Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.

Markdown

[Xu et al. "Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/xu2025cvprw-beyond/)

BibTeX

@inproceedings{xu2025cvprw-beyond,
  title     = {{Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model}},
  author    = {Xu, Lu and Zhu, Sijie and Li, Chunyuan and Kuo, Chia-Wen and Chen, Fan and Wang, Xinyao and Chen, Guang and Du, Dawei and Yuan, Ye and Wen, Longyin},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2025},
  pages     = {503--512},
  url       = {https://mlanthology.org/cvprw/2025/xu2025cvprw-beyond/}
}