ExpVid: A Benchmark for Experiment Video Understanding & Reasoning

Xu, Yicheng; Wu, Yue; Yu, Jiashuo; Yan, Ziang; Jiang, Tianxiang; He, Yinan; Zhao, Qingsong; Chen, Kai; Qiao, Yu; Wang, Limin; Okumura, Manabu; Wang, Yi

ICLR 2026

Abstract

Multimodal Large Language Models (MLLMs) hold promise for accelerating scientific discovery by interpreting complex experimental procedures. However, their true capabilities are poorly understood, as existing benchmarks neglect the fine-grained and long-horizon nature of authentic laboratory work, especially in wet-lab settings. To bridge this gap, we introduce ExpVid, the first benchmark designed to systematically evaluate MLLMs on scientific experiment videos. Curated from peer-reviewed video publications, ExpVid features a new three-level task hierarchy that mirrors the scientific process: (1) Fine-grained Perception of tools, materials, and actions; (2) Procedural Understanding of step order and completeness; and (3) Scientific Reasoning that connects the full experiment to its published conclusions. Our vision-centric annotation pipeline, combining automated generation with multi-disciplinary expert validation, ensures that tasks require visual grounding. We evaluate 20 leading MLLMs on ExpVid and find that while they excel at coarse-grained recognition, they struggle with disambiguating fine details, tracking state changes over time, and linking experimental procedures to scientific outcomes. Our results reveal a notable performance gap between proprietary and open-source models, particularly in high-order reasoning. ExpVid not only provides a diagnostic tool but also charts a roadmap for developing MLLMs capable of becoming trustworthy partners in scientific experimentation.

Cite

Text

Xu et al. "ExpVid: A Benchmark for Experiment Video Understanding & Reasoning." International Conference on Learning Representations, 2026.

Markdown

[Xu et al. "ExpVid: A Benchmark for Experiment Video Understanding & Reasoning." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/xu2026iclr-expvid/)

BibTeX

@inproceedings{xu2026iclr-expvid,
  title     = {{ExpVid: A Benchmark for Experiment Video Understanding \& Reasoning}},
  author    = {Xu, Yicheng and Wu, Yue and Yu, Jiashuo and Yan, Ziang and Jiang, Tianxiang and He, Yinan and Zhao, Qingsong and Chen, Kai and Qiao, Yu and Wang, Limin and Okumura, Manabu and Wang, Yi},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/xu2026iclr-expvid/}
}