COVER: A Comprehensive Video Quality Evaluator

Abstract

Video quality assessment, especially for user-generated content (UGC) at massive scale, is an essential yet challenging problem in computer vision and video analysis. Prior methods have been shown to mirror subjective human opinion scores effectively; however, they fail to capture the complex, multi-dimensional factors that affect overall perceptual quality. In this paper, we introduce COVER, a comprehensive video quality evaluator: a novel framework designed to evaluate video quality holistically, from technical, aesthetic, and semantic perspectives. Specifically, COVER leverages three parallel branches: (1) a Swin Transformer backbone applied to spatially sampled crops to predict technical quality; (2) a ConvNet applied to subsampled frames to derive aesthetic quality; and (3) a CLIP image encoder applied to resized frames to obtain semantic quality. We further propose a simplified cross-gating block that lets the three branches interact before they feed into the prediction head. The final quality score is a weighted sum of the sub-scores, yielding a multi-faceted metric. Our experimental results demonstrate that COVER outperforms state-of-the-art models on multiple UGC video quality datasets. Moreover, COVER offers a diagnosable quality report that explains the quality score along multiple pillars, and it can process 1080p videos 3x faster than the real-time requirement. To facilitate future work on efficient and explainable video quality assessment, the code is available at https://github.com/vztu/COVER.
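The final fusion step described above, a weighted sum of the technical, aesthetic, and semantic sub-scores, can be illustrated with a minimal Python sketch. This is not the released COVER code: the names `SubScores`, `fuse_scores`, and `quality_report`, the score scale, and the example weights are all hypothetical, and the sketch assumes the three branch scores have already been computed.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class SubScores:
    """Per-branch quality scores (hypothetical container; scale is arbitrary here)."""
    technical: float
    aesthetic: float
    semantic: float


def fuse_scores(scores: SubScores, weights: Tuple[float, float, float]) -> float:
    """Combine the three sub-scores into one score with a weighted sum."""
    w_tech, w_aes, w_sem = weights
    return (w_tech * scores.technical
            + w_aes * scores.aesthetic
            + w_sem * scores.semantic)


def quality_report(scores: SubScores, weights: Tuple[float, float, float]) -> Dict[str, float]:
    """Return the overall score together with each sub-score, so a low
    overall value can be traced back to the branch that caused it."""
    return {
        "technical": scores.technical,
        "aesthetic": scores.aesthetic,
        "semantic": scores.semantic,
        "overall": fuse_scores(scores, weights),
    }


# Illustrative values only; these are not weights or scores from the paper.
example_scores = SubScores(technical=1.0, aesthetic=2.0, semantic=3.0)
example_weights = (0.5, 0.25, 0.25)
report = quality_report(example_scores, example_weights)
print(report)
```

Keeping the sub-scores alongside the fused value is what makes such a report diagnosable: the overall number can be read together with the per-branch numbers that produced it.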

Cite

Text

He et al. "COVER: A Comprehensive Video Quality Evaluator." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00589

Markdown

[He et al. "COVER: A Comprehensive Video Quality Evaluator." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/he2024cvprw-cover/) doi:10.1109/CVPRW63382.2024.00589

BibTeX

@inproceedings{he2024cvprw-cover,
  title     = {{COVER: A Comprehensive Video Quality Evaluator}},
  author    = {He, Chenlong and Zheng, Qi and Zhu, Ruoxi and Zeng, Xiaoyang and Fan, Yibo and Tu, Zhengzhong},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {5799-5809},
  doi       = {10.1109/CVPRW63382.2024.00589},
  url       = {https://mlanthology.org/cvprw/2024/he2024cvprw-cover/}
}