Removing Distributional Discrepancies in Captions Improves Image-Text Alignment
Abstract
In this paper, we introduce a model designed to improve the prediction of image-text alignment, targeting the challenge of compositional understanding in current visual-language models. Our approach focuses on generating high-quality training datasets for the alignment task by producing mixed-type negative captions derived from positive ones. Critically, we address the distribution imbalance between positive and negative captions to ensure that the alignment model does not depend solely on textual information but also considers the associated images for predicting alignment accurately. By creating this enhanced training data, we fine-tune an existing leading visual-language model to boost its capability in understanding alignment. Our model significantly outperforms current top-performing methods across various datasets. We also demonstrate the applicability of our model by ranking the images generated by text-to-image models based on text alignment. Project page: https://yuheng-li.github.io/LLaVA-score/
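The abstract does not say how the distribution imbalance between positive and negative captions is removed, so the following is only a minimal sketch of one plausible reading: drop any caption whose label a text-only scorer can already guess with high confidence, so that the remaining pairs cannot be separated without looking at the image. The function name `drop_text_only_giveaways`, the `threshold` value, and the `toy_scorer` stand-in are all hypothetical and do not come from the paper.

```python
from typing import Callable, List, Tuple

# (caption text, label) where label 1 = matches the image, 0 = negative caption
Caption = Tuple[str, int]


def drop_text_only_giveaways(
    captions: List[Caption],
    text_only_prob_negative: Callable[[str], float],
    threshold: float = 0.9,  # assumed cutoff, not taken from the paper
) -> List[Caption]:
    """Keep captions whose label cannot be confidently guessed from text alone."""
    kept = []
    for text, label in captions:
        p_neg = text_only_prob_negative(text)
        # Confidence the text-only scorer assigns to the true label.
        confidence_in_truth = p_neg if label == 0 else 1.0 - p_neg
        if confidence_in_truth < threshold:
            kept.append((text, label))
    return kept


def toy_scorer(text: str) -> float:
    # Hypothetical stand-in for a trained text-only classifier:
    # treats one implausible word as a giveaway for a negative caption.
    return 0.95 if "purple" in text else 0.5


data = [
    ("a dog chasing a ball", 1),
    ("a ball chasing a dog", 0),
    ("a purple dog chasing a ball", 0),
]
filtered = drop_text_only_giveaways(data, toy_scorer)
```

In this toy run the third caption is removed because the scorer flags it without seeing any image, while the swapped-role negative survives because it reads as plausibly as the positive.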
Cite
Text
Cai et al. "Removing Distributional Discrepancies in Captions Improves Image-Text Alignment." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72664-4_23
Markdown
[Cai et al. "Removing Distributional Discrepancies in Captions Improves Image-Text Alignment." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/cai2024eccv-removing/) doi:10.1007/978-3-031-72664-4_23
BibTeX
@inproceedings{cai2024eccv-removing,
title = {{Removing Distributional Discrepancies in Captions Improves Image-Text Alignment}},
author = {Cai, Mu and Liu, Haotian and Li, Yuheng and Li, Yijun and Shechtman, Eli and Lin, Zhe and Lee, Yong Jae and Singh, Krishna Kumar},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-72664-4_23},
url = {https://mlanthology.org/eccv/2024/cai2024eccv-removing/}
}