Unbiasing Through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks

Abstract

We propose a new "Unbiased through Textual Description (UTD)" video benchmark based on unbiased subsets of existing video classification and retrieval datasets to enable a more robust assessment of video understanding capabilities. Namely, we tackle the problem that current video benchmarks may suffer from different representation biases, e.g., object bias or single-frame bias, where mere recognition of objects or utilization of only a single frame is sufficient for correct prediction. We leverage vision-language models (VLMs) and large language models (LLMs) to analyze and debias benchmarks from such representation biases. Specifically, we generate frame-wise textual descriptions of videos, filter them for specific information (e.g., only objects), and leverage them to examine representation biases across three dimensions: 1) concept bias -- determining if a specific concept (e.g., objects) alone suffices for prediction; 2) temporal bias -- assessing if temporal information contributes to prediction; and 3) common sense vs. dataset bias -- evaluating whether zero-shot reasoning or dataset-specific correlations contribute to prediction. We conduct a systematic analysis of 12 popular video classification and retrieval datasets and create new object-debiased test splits for these datasets. Moreover, we benchmark 30 state-of-the-art video models on the original and debiased splits and analyze biases in the models. To facilitate the future development of more robust video understanding benchmarks and models, we release "UTD-descriptions", a dataset with our rich structured descriptions for each dataset, and "UTD-splits", a dataset of object-debiased test splits.
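The object-debiasing idea above can be illustrated with a minimal sketch: samples whose class is predictable from object information alone are flagged as object-biased and excluded from the debiased split. This is not the paper's pipeline; the VLM/LLM description generation and filtering are replaced here by a hypothetical toy bag-of-words matcher over precomputed object lists, purely for illustration.

```python
# Hedged sketch of building an object-debiased split.
# Assumption: each sample carries a set of object words (in the paper these
# would be distilled from VLM-generated frame descriptions by an LLM).

def object_score(object_words, class_name):
    """Toy stand-in scorer: word overlap between objects and a class label."""
    label_words = set(class_name.lower().replace("_", " ").split())
    return len(set(object_words) & label_words) / max(len(label_words), 1)

def debiased_split(samples, classes):
    """Keep only samples that object information alone fails to classify."""
    kept = []
    for video_id, object_words, true_class in samples:
        pred = max(classes, key=lambda c: object_score(object_words, c))
        if pred != true_class:  # objects alone were insufficient
            kept.append(video_id)
    return kept

samples = [
    ("v1", {"guitar", "person"}, "playing guitar"),  # solvable from objects
    ("v2", {"person", "floor"}, "doing push ups"),   # needs motion cues
]
classes = ["playing guitar", "doing push ups"]
print(debiased_split(samples, classes))  # ["v2"] under this toy matcher
```

Under this sketch, "v1" is dropped because its object list already reveals the class, while "v2" survives into the debiased split; the actual UTD analysis applies the same keep-or-drop logic with far stronger text-based predictors.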

Cite

Text

Shvetsova et al. "Unbiasing Through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02705

Markdown

[Shvetsova et al. "Unbiasing Through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/shvetsova2025cvpr-unbiasing/) doi:10.1109/CVPR52734.2025.02705

BibTeX

@inproceedings{shvetsova2025cvpr-unbiasing,
  title     = {{Unbiasing Through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks}},
  author    = {Shvetsova, Nina and Nagrani, Arsha and Schiele, Bernt and Kuehne, Hilde and Rupprecht, Christian},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {29050--29059},
  doi       = {10.1109/CVPR52734.2025.02705},
  url       = {https://mlanthology.org/cvpr/2025/shvetsova2025cvpr-unbiasing/}
}