IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

Ma, David; Zhang, Yuanxing; Ren, JinCheng; Guo, Jiawei; Yao, Yifan; Wei, Zhenlin; Yang, Zhenzhu; Peng, Zhongyuan; Feng, Boyu; Ma, Jun; 顾潇,; Zhu, King; Wen, Zhoufutu; He, Yancheng; Cao, Meng; Zhou, Wangchunshu; Ni, Shiwen; Liu, Jiaheng; Huang, Wenhao; Zhang, Ge; Jin, Xiaojie

IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

ICLR 2026

/iclr/2026/ma2026iclr-ivbench/

Abstract

Current benchmarks for Multimodal Large Language Models (MLLMs) predominantly rely on text-only queries, overlooking the essential role of images as visual context for enhancing video comprehension and facilitating natural human-AI interaction. To bridge this gap, we introduce \textbf{IV-Bench}, the first comprehensive benchmark for evaluating MLLMs on Image-Grounded Video Perception and Reasoning. IV-Bench comprises 966 videos paired with 2,560 meticulously annotated image-text queries across 13 tasks (7 perception and 6 reasoning tasks) spanning 5 distinct categories. We extensively evaluate state-of-the-art MLLMs, including open-source models (e.g., InternVL2.5, Qwen2.5-VL) and closed-source models (e.g., GPT-4o, Gemini2.0 series), revealing substantial performance gaps, with the best-performing model achieving only 28.9\% accuracy. Ablation studies demonstrate that incorporating images significantly enhances video understanding and highlight key model design factors influencing performance. Our findings provide valuable insights and guidance for future research. The code and dataset are available at \url{https://github.com/multimodal-art-projection/IV-Bench}.

PDF ICLR OpenReview Semantic Scholar

Cite

Text

Ma et al. "IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs." International Conference on Learning Representations, 2026.

Markdown

[Ma et al. "IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/ma2026iclr-ivbench/)

BibTeX

@inproceedings{ma2026iclr-ivbench,
  title     = {{IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs}},
  author    = {Ma, David and Zhang, Yuanxing and Ren, JinCheng and Guo, Jiawei and Yao, Yifan and Wei, Zhenlin and Yang, Zhenzhu and Peng, Zhongyuan and Feng, Boyu and Ma, Jun and 顾潇,  and Zhu, King and Wen, Zhoufutu and He, Yancheng and Cao, Meng and Zhou, Wangchunshu and Ni, Shiwen and Liu, Jiaheng and Huang, Wenhao and Zhang, Ge and Jin, Xiaojie},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/ma2026iclr-ivbench/}
}