Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives

Sarto, Sara; Cornia, Marcella; Cucchiara, Rita

doi:10.24963/IJCAI.2025/1180

IJCAI 2025 pp. 10632-10640

Abstract

The evaluation of machine-generated captions is a complex and evolving challenge. With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics. This survey provides a comprehensive overview of advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. We assess these metrics across multiple dimensions, including correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. Additionally, we explore the challenges posed by the longer and more detailed captions generated by MLLMs and examine the adaptability of current metrics to these stylistic variations. Our analysis highlights some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment. For a comprehensive overview of captioning evaluation, refer to our project page at https://github.com/aimagelab/awesome-captioning-evaluation.

Cite

Text

Sarto et al. "Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/1180

Markdown

[Sarto et al. "Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/sarto2025ijcai-image/) doi:10.24963/IJCAI.2025/1180

BibTeX

@inproceedings{sarto2025ijcai-image,
  title     = {{Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives}},
  author    = {Sarto, Sara and Cornia, Marcella and Cucchiara, Rita},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {10632--10640},
  doi       = {10.24963/IJCAI.2025/1180},
  url       = {https://mlanthology.org/ijcai/2025/sarto2025ijcai-image/}
}