VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding

Abstract

Precisely evaluating video understanding models remains challenging: commonly used metrics such as BLEU, ROUGE, and BERTScore fail to capture the nuances of human judgment, while obtaining such judgments through manual evaluation is costly. Recent work has explored using large language models (LLMs) or multimodal LLMs (MLLMs) as evaluators, but their extension to video understanding remains relatively unexplored. In this work, we introduce VideoJudge, an MLLM judge available in 3B and 7B sizes and specialized to evaluate outputs from video understanding models (i.e., text responses conditioned on videos). To train VideoJudge, our recipe builds on the interplay between a generator and an evaluator: the generator is prompted to produce responses conditioned on a target rating, and responses whose rating from the evaluator does not match that target are discarded. On three out of four meta-evaluation benchmarks, VideoJudge-7B outperforms or is on par with larger MLLM judge baselines such as Qwen2.5-VL (32B and 72B). Notably, we find that LLM judges (Qwen3) perform worse than MLLM judges (Qwen2.5-VL), and long chain-of-thought reasoning does not improve performance, indicating that providing video inputs is crucial for the evaluation of video understanding tasks.
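
The generator-evaluator filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the integer rating scale, the exact-equality matching rule, and the toy stand-ins for the two models are all assumptions made for illustration. In practice both the generator and the evaluator would be MLLMs that take the video itself as input.

```python
from typing import Callable, List, NamedTuple


class Example(NamedTuple):
    video_id: str
    response: str
    rating: int


def bootstrap_filter(
    video_ids: List[str],
    target_ratings: List[int],
    generate: Callable[[str, int], str],
    evaluate: Callable[[str, str], int],
) -> List[Example]:
    """Keep only responses whose evaluator rating equals the target rating.

    Assumption: "matching" means exact equality of integer ratings.
    """
    kept = []
    for video_id in video_ids:
        for target in target_ratings:
            # Generator is asked for a response of a given target quality.
            response = generate(video_id, target)
            # Evaluator rates the response; mismatches are discarded.
            if evaluate(video_id, response) == target:
                kept.append(Example(video_id, response, target))
    return kept


# Hypothetical toy stand-ins so the sketch runs without any model.
def toy_generate(video_id: str, target: int) -> str:
    return f"{video_id}: response of quality {target}"


def toy_evaluate(video_id: str, response: str) -> int:
    # Pretend the evaluator recovers the quality but never awards a 5.
    score = int(response.rsplit(" ", 1)[-1])
    return min(score, 4)


data = bootstrap_filter(["v1", "v2"], [1, 3, 5], toy_generate, toy_evaluate)
for example in data:
    print(example)
```

With these toy functions, the responses generated for target rating 5 are dropped because the stand-in evaluator rates them 4, while the responses for targets 1 and 3 are kept as training pairs.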

Cite

Text

Waheed et al. "VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding." International Conference on Learning Representations, 2026.

Markdown

[Waheed et al. "VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/waheed2026iclr-videojudge/)

BibTeX

@inproceedings{waheed2026iclr-videojudge,
  title     = {{VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding}},
  author    = {Waheed, Abdul and Wu, Zhen and Alharthi, Dareen Safar and Kim, Seungone and Raj, Bhiksha},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/waheed2026iclr-videojudge/}
}