Noise-Aware Learning from Web-Crawled Image-Text Data for Image Captioning

Abstract

Image captioning is one of the tasks that can most directly take advantage of large-scale web-crawled data, which provides rich knowledge about the visual world for a captioning model. However, because web-crawled image-text pairs are aligned to varying degrees, the inherent noise (e.g., misaligned pairs) makes it difficult to learn a precise captioning model. Filtering can effectively remove noisy data, but it reduces the amount of learnable knowledge and sometimes introduces a new problem of data deficiency. To get the best of both worlds, we propose a Noise-aware Captioning (NoC) framework, which learns rich knowledge from the entire web-crawled dataset while being less affected by the noise. This is achieved by the proposed alignment-level-controllable captioner, which is trained using the alignment levels of the image-text pairs as a control signal. This alignment-level-conditioned training allows the model to generate high-quality captions by simply setting the control signal to the desired alignment level at inference time. An in-depth analysis shows the effectiveness of our framework in handling noise. On two tasks, zero-shot captioning and text-to-image retrieval using generated captions (i.e., self-retrieval), we also demonstrate that our model produces high-quality captions in terms of descriptiveness and distinctiveness. The code is available at https://github.com/kakaobrain/noc.
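
The conditioning idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how alignment levels are computed, so the sketch assumes each image-text pair already has a scalar alignment score (for example, a similarity from some pretrained image-text model), discretizes the scores into levels by quantile, and prepends a control token to the caption. The function names, the quantile bucketing, and the `<align_k>` token format are all hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def quantile_edges(scores: Sequence[float], num_levels: int) -> List[float]:
    """Return num_levels - 1 thresholds splitting scores into roughly equal buckets.

    Assumption: quantile bucketing is one plausible way to discretize
    alignment scores; the source does not specify the scheme.
    """
    ordered = sorted(scores)
    n = len(ordered)
    return [ordered[(n * k) // num_levels] for k in range(1, num_levels)]


def alignment_level(score: float, edges: Sequence[float]) -> int:
    """Map a score to a level in 1..len(edges)+1 (higher = better aligned)."""
    return bisect_right(edges, score) + 1


def control_token(level: int) -> str:
    """Hypothetical control-token format used as the conditioning signal."""
    return f"<align_{level}>"


def build_training_text(caption: str, score: float, edges: Sequence[float]) -> str:
    """Training: condition every caption on its own (possibly low) level."""
    return f"{control_token(alignment_level(score, edges))} {caption}"


def build_inference_prefix(num_levels: int) -> str:
    """Inference: request the highest alignment level as the control signal."""
    return control_token(num_levels)


if __name__ == "__main__":
    # Toy alignment scores for eight crawled pairs (made-up values).
    scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
    edges = quantile_edges(scores, num_levels=4)
    print(build_training_text("a dog on a beach", 0.8, edges))
    print(build_training_text("click here for more", 0.1, edges))
    print(build_inference_prefix(4))
```

In this sketch no pair is discarded: poorly aligned captions still contribute to training but are tagged with a low level, and at inference the prefix requests the best-aligned level.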

Cite

Text

Kang et al. "Noise-Aware Learning from Web-Crawled Image-Text Data for Image Captioning." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00275

Markdown

[Kang et al. "Noise-Aware Learning from Web-Crawled Image-Text Data for Image Captioning." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/kang2023iccv-noiseaware/) doi:10.1109/ICCV51070.2023.00275

BibTeX

@inproceedings{kang2023iccv-noiseaware,
  title     = {{Noise-Aware Learning from Web-Crawled Image-Text Data for Image Captioning}},
  author    = {Kang, Wooyoung and Mun, Jonghwan and Lee, Sungjun and Roh, Byungseok},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {2942--2952},
  doi       = {10.1109/ICCV51070.2023.00275},
  url       = {https://mlanthology.org/iccv/2023/kang2023iccv-noiseaware/}
}