MetaCaptioner: Towards Generalist Visual Captioning with Open-Source Suites

Abstract

Generalist visual captioning goes beyond simple appearance description: it requires integrating a series of visual cues into a caption and handling various visual domains. On this task, current open-source models lag well behind commercial ones, which limits applications such as data synthesis. To bridge the gap, this paper proposes CapFlow, a novel multi-agent collaboration workflow. CapFlow demonstrates for the first time that, by capitalizing on open-source models, it is possible to achieve caption quality on par with GPT-4.1 in various domains with an 89.5% reduction in cost. Using CapFlow as the data synthesizer, we produce high-quality visual captions for the image and video domains at scale and obtain a generalist visual captioner, MetaCaptioner, via fine-tuning. Through extensive experiments, we show that MetaCaptioner not only achieves captioning capabilities comparable to those of commercial models but also reaches top-tier multimodal performance in the open-source community. We hope CapFlow and MetaCaptioner can benefit future multimodal research by providing a strong and cost-effective visual captioning solution. Our source code and models will be publicly released.

Cite

Text

Lei et al. "MetaCaptioner: Towards Generalist Visual Captioning with Open-Source Suites." International Conference on Learning Representations, 2026.

Markdown

[Lei et al. "MetaCaptioner: Towards Generalist Visual Captioning with Open-Source Suites." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/lei2026iclr-metacaptioner/)

BibTeX

@inproceedings{lei2026iclr-metacaptioner,
  title     = {{MetaCaptioner: Towards Generalist Visual Captioning with Open-Source Suites}},
  author    = {Lei, Zhenxin and Gao, Zhangwei and Tian, Changyao and Cui, Erfei and Chen, Guanzhou and Yang, Danni and Duan, Yuchen and Wang, Zhaokai and Li, Wenhao and Wang, Weiyun and Zhao, Xiangyu and Ji, Jiayi and Qiao, Yu and Wang, Wenhai and Luo, Gen},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/lei2026iclr-metacaptioner/}
}