TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models

Abstract

Existing benchmarks often highlight the remarkable performance achieved by state-of-the-art Multimodal Foundation Models (MFMs) in leveraging temporal context for video understanding. However, *how well do the models truly perform visual temporal reasoning?* Our study of existing benchmarks shows that this capability of MFMs is likely overestimated, as many questions can be solved using a single frame, a few frames, or out-of-order frames. To systematically examine current visual temporal reasoning tasks, we propose three principles with corresponding metrics: (1) *Multi-Frame Gain*, (2) *Frame Order Sensitivity*, and (3) *Frame Information Disparity*. Following these principles, we introduce **TOMATO**, **T**emp**O**ral Reasoning **M**ultimod**A**l Evalua**T**i**O**n, a novel benchmark crafted to rigorously assess MFMs' temporal reasoning capabilities in video understanding. TOMATO comprises 1,484 carefully curated, *human-annotated* questions spanning *six* tasks (i.e., *action count, direction, rotation, shape & trend, velocity & frequency, and visual cues*), applied to 1,417 videos, including 805 self-recorded and -generated videos, which encompass human-centric, real-world, and simulated scenarios. Our comprehensive evaluation reveals a human-model performance gap of 57.3% with the best-performing model. Moreover, our in-depth analysis uncovers more fundamental limitations beyond this gap in current MFMs. While they can accurately recognize events in isolated frames, they fail to interpret these frames as a continuous sequence. We believe TOMATO will serve as a crucial testbed for evaluating next-generation MFMs and as a call to the community to develop AI systems capable of comprehending the dynamics of the human world through the video modality.
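The abstract names the three diagnostic metrics but does not define them here. The minimal Python sketch below shows one plausible way such diagnostics could be computed from per-question correctness under different frame conditions; the function names, the max-minus-min disparity measure, and the toy data are illustrative assumptions, not the paper's actual formulations.

```python
# Illustrative sketch only (hypothetical helpers, not the paper's definitions):
# one plausible way to compute frame-based diagnostics from per-question
# correctness flags gathered under different frame conditions.

def accuracy(correct_flags):
    """Mean accuracy over a list of 0/1 correctness flags."""
    return sum(correct_flags) / len(correct_flags) if correct_flags else 0.0

def multi_frame_gain(acc_multi_frame, acc_single_frame):
    """Benefit of seeing the full frame sequence over a single frame."""
    return acc_multi_frame - acc_single_frame

def frame_order_sensitivity(acc_ordered, acc_shuffled):
    """Accuracy drop when the temporal order of frames is shuffled."""
    return acc_ordered - acc_shuffled

def frame_information_disparity(per_frame_accuracies):
    """Spread of accuracy across individual frames; a small spread suggests
    no single frame carries disproportionate information."""
    return max(per_frame_accuracies) - min(per_frame_accuracies)

# Toy example: correctness on the same questions under three conditions.
ordered  = [1, 1, 0, 1, 1, 0, 1, 1]   # all frames, correct temporal order
shuffled = [1, 0, 0, 1, 0, 0, 1, 0]   # all frames, shuffled order
single   = [0, 1, 0, 0, 1, 0, 0, 1]   # single frame only

print("Multi-Frame Gain:",        multi_frame_gain(accuracy(ordered), accuracy(single)))
print("Frame Order Sensitivity:", frame_order_sensitivity(accuracy(ordered), accuracy(shuffled)))
print("Frame Info. Disparity:",   frame_information_disparity([0.2, 0.3, 0.25, 0.35]))
```

Under this reading, a question that genuinely requires temporal reasoning should show a large Multi-Frame Gain, a large Frame Order Sensitivity, and a small Frame Information Disparity; consult the paper for the metrics' exact definitions.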

Cite

Text

Shangguan et al. "TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models." International Conference on Learning Representations, 2025.

Markdown

[Shangguan et al. "TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/shangguan2025iclr-tomato/)

BibTeX

@inproceedings{shangguan2025iclr-tomato,
  title     = {{TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models}},
  author    = {Shangguan, Ziyao and Li, Chuhan and Ding, Yuxuan and Zheng, Yanan and Zhao, Yilun and Fitzgerald, Tesca and Cohan, Arman},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/shangguan2025iclr-tomato/}
}