Learning Visual Composition Through Improved Semantic Guidance
Abstract
Visual imagery does not consist of solitary objects, but instead reflects the composition of a multitude of fluid concepts. While there have been great advances in visual representation learning, such advances have focused on building better representations for a small number of discrete objects bereft of an understanding of how these objects are interacting. One can observe this limitation in representations learned through captions or contrastive learning - where the learned model treats an image essentially as a bag of words. Several works have attempted to address this limitation through the development of bespoke architectures. In this work, we focus on simple and scalable approaches. In particular, we demonstrate that by improving weakly labeled data, i.e. captions, we can vastly improve the performance of standard contrastive learning approaches. Previous CLIP models achieved near chance rate on challenging tasks probing compositional learning. However, our simple approach boosts performance of CLIP substantially and achieves state of the art results on compositional benchmarks such as ARO and SugarCrepe. Furthermore, we showcase our results on a relatively new captioning benchmark derived from DOCCI. We demonstrate through a series of ablations that a standard CLIP model trained with enhanced data may demonstrate impressive performance on image retrieval tasks.
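To make the "bag of words" limitation concrete, the following is a minimal toy sketch of why an order-insensitive model lands at chance on caption pairs that differ only in word order. Everything here is an illustrative assumption rather than the authors' method or any benchmark's actual protocol: the function names, the example captions, the tie-counts-as-half scoring rule, and the use of the true caption's word-count vector as a stand-in for the image embedding.

```python
import math
from collections import Counter


def bag_of_words_embed(caption, vocab):
    """Toy order-insensitive encoder: L2-normalized word counts (hypothetical)."""
    counts = Counter(caption.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Dot product of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


def pairwise_accuracy(pairs):
    """Score (true caption, hard negative) pairs; a tie counts as half correct.

    Assumption: the image embedding is the bag-of-words vector of the true
    caption, i.e. a perfectly aligned but order-blind model.
    """
    vocab = sorted({w for pair in pairs for cap in pair for w in cap.lower().split()})
    total = 0.0
    for true_cap, neg_cap in pairs:
        image = bag_of_words_embed(true_cap, vocab)
        s_true = cosine(image, bag_of_words_embed(true_cap, vocab))
        s_neg = cosine(image, bag_of_words_embed(neg_cap, vocab))
        if s_true > s_neg:
            total += 1.0
        elif s_true == s_neg:
            total += 0.5  # the model cannot tell the captions apart
    return total / len(pairs)


# Negatives that only reorder words: same word counts, different meaning.
swap_pairs = [
    ("the horse is eating the grass", "the grass is eating the horse"),
    ("a dog chases a cat", "a cat chases a dog"),
]

# Negatives that replace a word: the word counts themselves change.
replace_pairs = [
    ("the horse is eating the grass", "the horse is eating the hay"),
    ("a dog chases a cat", "a dog chases a bird"),
]

print(pairwise_accuracy(swap_pairs))
print(pairwise_accuracy(replace_pairs))
```

In this toy setup the reordered captions receive identical embeddings, so every comparison is a tie and the score is 0.5, while the replaced-word negatives are separated perfectly. This is only meant to illustrate why tasks built on reordered captions probe composition rather than object recognition.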
Cite
Text
Stone et al. "Learning Visual Composition Through Improved Semantic Guidance." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00354
Markdown
[Stone et al. "Learning Visual Composition Through Improved Semantic Guidance." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/stone2025cvpr-learning/) doi:10.1109/CVPR52734.2025.00354
BibTeX
@inproceedings{stone2025cvpr-learning,
title = {{Learning Visual Composition Through Improved Semantic Guidance}},
author = {Stone, Austin and Soltau, Hagen and Geirhos, Robert and Yi, Xi and Xia, Ye and Cao, Bingyi and Chen, Kaifeng and Ogale, Abhijit and Shlens, Jonathon},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {3740-3750},
doi = {10.1109/CVPR52734.2025.00354},
url = {https://mlanthology.org/cvpr/2025/stone2025cvpr-learning/}
}