SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision
Abstract
Selective attention helps us focus on task-relevant aspects in the constant flood of our sensory input. This constraint in our perception allows us to robustly generalize under distractions and to new compositions of perceivable concepts. Transformers employ a similar notion of attention in their architecture, but representation learning models with transformer backbones like CLIP and DINO often fail to demonstrate robustness and compositionality. We highlight a missing architectural prior: unlike human perception, transformer encodings do not separately attend over individual concepts. In response, we propose SPARO, a read-out mechanism that partitions encodings into separately-attended slots, each produced by a single attention head. Using SPARO with CLIP imparts an inductive bias that the vision and text modalities are different views of a shared compositional world with the same corresponding concepts. Using SPARO, we demonstrate improvements on downstream recognition, robustness, retrieval, and compositionality benchmarks with CLIP (up to +14% for ImageNet, +4% for SugarCrepe), and on nearest neighbors and linear probe for ImageNet with DINO (+3% each). We also showcase a powerful ability to intervene and select individual concepts to further improve downstream task performance (up from +4% to +9% for SugarCrepe) and use this ability to study the robustness of SPARO's representation structure. Finally, we provide insights through ablation experiments and visualization of learned concepts.
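The read-out described in the abstract could be sketched as follows: a set of learned queries, one per slot, each independently attending over the backbone's token encodings, with the resulting slots concatenated into the final encoding. This is a minimal NumPy illustration of that idea; all names, shapes, and the shared key/value projections are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def sparo_style_readout(tokens, queries, W_k, W_v):
    """Hedged sketch of a slot-based read-out in the spirit of SPARO.

    tokens:  (T, d) backbone token encodings
    queries: (M, d) one learned query per slot (assumed parameterization)
    W_k, W_v: (d, d) shared key/value projections (assumed)
    Returns a (M * d,) encoding: M separately-attended slots, concatenated.
    """
    K = tokens @ W_k                                     # keys,   (T, d)
    V = tokens @ W_v                                     # values, (T, d)
    scores = queries @ K.T / np.sqrt(K.shape[1])         # (M, T)
    # Per-slot softmax: each slot attends over tokens independently.
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    slots = attn @ V                                     # (M, d)
    return slots.reshape(-1)

rng = np.random.default_rng(0)
d, T, M = 8, 5, 4  # toy sizes for illustration
encoding = sparo_style_readout(
    rng.normal(size=(T, d)),   # token encodings
    rng.normal(size=(M, d)),   # slot queries
    rng.normal(size=(d, d)),   # key projection
    rng.normal(size=(d, d)),   # value projection
)
print(encoding.shape)  # (32,) = M slots of dimension d each
```

Because each slot is produced by its own attention head, downstream interventions like the concept-selection experiments mentioned in the abstract amount to keeping or masking individual slots of this concatenated encoding.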
Cite
Text
Vani et al. "SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72848-8_14
Markdown
[Vani et al. "SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/vani2024eccv-sparo/) doi:10.1007/978-3-031-72848-8_14
BibTeX
@inproceedings{vani2024eccv-sparo,
title = {{SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision}},
author = {Vani, Ankit and Nguyen, Bac and Lavoie, Samuel and Krishna, Ranjay and Courville, Aaron},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-72848-8_14},
url = {https://mlanthology.org/eccv/2024/vani2024eccv-sparo/}
}