EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension

Jiaxuan Li, Duc Minh Vo, Akihiro Sugimoto, Hideki Nakayama

CVPR 2024 pp. 13733-13742

doi:10.1109/CVPR52733.2024.01303

Abstract

Large language model (LLM)-based image captioning can describe objects not explicitly observed in training data; yet novel objects occur frequently, so up-to-date object knowledge must be maintained for open-world comprehension. Instead of relying on large amounts of data and/or scaling up network parameters, we introduce a highly effective retrieval-augmented image captioning method that prompts LLMs with object names retrieved from an External Visual-name memory (EVCap). We build an ever-changing object knowledge memory using objects' visuals and names, enabling us to (i) update the memory at minimal cost and (ii) effortlessly augment LLMs with retrieved object names by utilizing a lightweight and fast-to-train model. Our model, which was trained only on the COCO dataset, can adapt to out-of-domain data without requiring additional fine-tuning or re-training. Our experiments, conducted on benchmarks and synthetic commonsense-violating data, show that EVCap, with only 3.97M trainable parameters, exhibits superior performance compared to other methods based on frozen pre-trained LLMs. Its performance is also competitive with specialist SOTAs that require extensive training.

Cite

Text

Li et al. "EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01303

Markdown

[Li et al. "EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/li2024cvpr-evcap/) doi:10.1109/CVPR52733.2024.01303

BibTeX

@inproceedings{li2024cvpr-evcap,
  title     = {{EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension}},
  author    = {Li, Jiaxuan and Vo, Duc Minh and Sugimoto, Akihiro and Nakayama, Hideki},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {13733--13742},
  doi       = {10.1109/CVPR52733.2024.01303},
  url       = {https://mlanthology.org/cvpr/2024/li2024cvpr-evcap/}
}