IF-VidCap: Can Video Caption Models Follow Instructions?

Li, Shihao; Zhang, Yuanxing; Wu, Jiangtao; Lei, Zhide; He, Yiwen; Wen, Runzhe; Liao, Chenxi; Jiang, Chengkang; Ping, An; Gao, Shuo; Wang, Suhan; Bian, Zhaozhou; Zhou, Zijun; Xie, Jingyi; Zhou, Jiayi; Wang, Jing; Yao, Yifan; Xie, Weihao; Tan, Yingshui; Wang, Yanghai; Xie, Qianqian; Zhang, Zhaoxiang; Liu, Jiaheng

ICLR 2026

Abstract

Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of 26 prominent models reveals a nuanced landscape: although proprietary models remain dominant, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should advance descriptive richness and instruction-following fidelity together.

Cite

Text

Li et al. "IF-VidCap: Can Video Caption Models Follow Instructions?" International Conference on Learning Representations, 2026.

Markdown

[Li et al. "IF-VidCap: Can Video Caption Models Follow Instructions?" International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/li2026iclr-ifvidcap/)

BibTeX

@inproceedings{li2026iclr-ifvidcap,
  title     = {{IF-VidCap: Can Video Caption Models Follow Instructions?}},
  author    = {Li, Shihao and Zhang, Yuanxing and Wu, Jiangtao and Lei, Zhide and He, Yiwen and Wen, Runzhe and Liao, Chenxi and Jiang, Chengkang and Ping, An and Gao, Shuo and Wang, Suhan and Bian, Zhaozhou and Zhou, Zijun and Xie, Jingyi and Zhou, Jiayi and Wang, Jing and Yao, Yifan and Xie, Weihao and Tan, Yingshui and Wang, Yanghai and Xie, Qianqian and Zhang, Zhaoxiang and Liu, Jiaheng},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/li2026iclr-ifvidcap/}
}