Embodied Image Captioning: Self-Supervised Learning Agents for Spatially Coherent Image Descriptions

Abstract

We present a self-supervised method to improve an agent's ability to describe arbitrary objects while actively exploring a generic environment. This is a challenging problem, as current models struggle to produce coherent image captions across different camera viewpoints and in the presence of clutter. We propose a three-phase framework for fine-tuning existing captioning models that enhances caption accuracy and consistency across views via a consensus mechanism. First, an agent explores the environment, collecting noisy image-caption pairs. Then, a consistent pseudo-caption for each object instance is distilled via consensus using a large language model. Finally, these pseudo-captions are used to fine-tune an off-the-shelf captioning model, with the addition of contrastive learning. We analyse the performance of combinations of captioning models, exploration policies, pseudo-labeling methods, and fine-tuning strategies on our manually labeled test set. Results show that a policy can be trained to mine samples with higher disagreement than classical baselines. Our pseudo-captioning method, in combination with all policies, achieves higher semantic similarity than existing methods, and fine-tuning improves caption accuracy and consistency by a significant margin. Code and test set annotations are available at https://hsp-iit.github.io/embodied-captioning.
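
The consensus phase described above can be illustrated with a minimal sketch. It is not the authors' implementation: the paper distills pseudo-captions with a large language model, whereas this example substitutes a simple medoid selection based on token-level Jaccard similarity. The function names (`jaccard`, `consensus_caption`, `distill_pseudo_captions`) and the `(object_id, caption)` input format are assumptions made for illustration only.

```python
from collections import defaultdict


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two captions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def consensus_caption(captions: list[str]) -> str:
    """Pick the medoid: the caption most similar, in total, to all captions.

    Stand-in for the LLM-based consensus step (an assumption, not the
    method used in the paper).
    """
    if not captions:
        raise ValueError("need at least one caption")
    return max(captions, key=lambda c: sum(jaccard(c, o) for o in captions))


def distill_pseudo_captions(observations):
    """Group (object_id, caption) pairs by object and keep one caption each."""
    by_object = defaultdict(list)
    for object_id, caption in observations:
        by_object[object_id].append(caption)
    return {oid: consensus_caption(caps) for oid, caps in by_object.items()}


# Hypothetical noisy captions collected from several viewpoints.
observations = [
    ("chair_1", "a red chair"),
    ("chair_1", "a red wooden chair"),
    ("chair_1", "a dog"),
    ("lamp_2", "a floor lamp"),
]
pseudo = distill_pseudo_captions(observations)
```

In this toy setup the outlier caption for `chair_1` is discarded because it shares few tokens with the other views. The resulting object-to-caption mapping is the kind of pseudo-label set that a later fine-tuning phase would consume.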

Cite

Text

Galliena et al. "Embodied Image Captioning: Self-Supervised Learning Agents for Spatially Coherent Image Descriptions." International Conference on Computer Vision, 2025.

Markdown

[Galliena et al. "Embodied Image Captioning: Self-Supervised Learning Agents for Spatially Coherent Image Descriptions." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/galliena2025iccv-embodied/)

BibTeX

@inproceedings{galliena2025iccv-embodied,
  title     = {{Embodied Image Captioning: Self-Supervised Learning Agents for Spatially Coherent Image Descriptions}},
  author    = {Galliena, Tommaso and Apicella, Tommaso and Rosa, Stefano and Morerio, Pietro and Del Bue, Alessio and Natale, Lorenzo},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {24370-24379},
  url       = {https://mlanthology.org/iccv/2025/galliena2025iccv-embodied/}
}