Inverted-Attention Transformers Can Learn Object Representations: Insights from Slot Attention
Abstract
Visual reasoning is supported by a causal understanding of the physical world, and theories of human cognition suppose that a necessary step to causal understanding is the discovery and representation of high-level entities like objects. Slot Attention is a popular method aimed at object-centric learning, and its popularity has resulted in dozens of variants and extensions. To help understand the core assumptions that lead to successful object-centric learning, we take a step back and identify the minimal set of changes to a standard Transformer architecture to obtain the same performance as the specialized Slot Attention models. We systematically evaluate the performance and scaling behaviour of several "intermediate" architectures on seven image and video datasets from prior work. Our analysis reveals that by simply inverting the attention mechanism of Transformers, we obtain performance competitive with state-of-the-art Slot Attention in several domains.
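To make the abstract's central idea concrete, below is a minimal sketch of the "inverted" attention step: where standard cross-attention normalizes the softmax over the input (key) axis, the inversion normalizes over the slot (query) axis so that input tokens compete to be assigned to slots, as in Slot Attention. The function name and the omission of learned query/key/value projections are simplifications for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def inverted_attention(slots, inputs, eps=1e-8):
    """Cross-attention with the softmax inverted over the slot axis.

    slots:  (batch, num_slots, dim)   -- act as queries
    inputs: (batch, num_inputs, dim)  -- act as keys and values
    A real block would first compute q, k, v with learned linear maps;
    they are omitted here to isolate the inversion itself.
    """
    dim = slots.shape[-1]
    # Scaled dot-product logits: (batch, num_slots, num_inputs).
    logits = torch.einsum('bsd,bnd->bsn', slots, inputs) / dim ** 0.5
    # Standard attention would softmax over the input axis (dim=-1).
    # Inverting it (dim=1, the slot axis) makes input tokens compete
    # over which slot explains them.
    attn = F.softmax(logits, dim=1)
    # Renormalize per slot so each update is a weighted mean over inputs.
    attn = attn / (attn.sum(dim=-1, keepdim=True) + eps)
    # Aggregate values into per-slot updates: (batch, num_slots, dim).
    return torch.einsum('bsn,bnd->bsd', attn, inputs)

# Example: 7 slots attending over 196 image tokens of width 64.
slots = torch.randn(2, 7, 64)
inputs = torch.randn(2, 196, 64)
updates = inverted_attention(slots, inputs)  # shape (2, 7, 64)
```

The per-slot renormalization after the inverted softmax is what turns the competition over inputs into a well-scaled weighted mean, which Slot Attention relies on for stable slot updates.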
Cite
Text
Wu et al. "Inverted-Attention Transformers Can Learn Object Representations: Insights from Slot Attention." NeurIPS 2023 Workshops: UniReps, 2023.
Markdown
[Wu et al. "Inverted-Attention Transformers Can Learn Object Representations: Insights from Slot Attention." NeurIPS 2023 Workshops: UniReps, 2023.](https://mlanthology.org/neuripsw/2023/wu2023neuripsw-invertedattention-a/)
BibTeX
@inproceedings{wu2023neuripsw-invertedattention-a,
title = {{Inverted-Attention Transformers Can Learn Object Representations: Insights from Slot Attention}},
author = {Wu, Yi-Fu and Greff, Klaus and Elsayed, Gamaleldin Fathy and Mozer, Michael Curtis and Kipf, Thomas and van Steenkiste, Sjoerd},
booktitle = {NeurIPS 2023 Workshops: UniReps},
year = {2023},
url = {https://mlanthology.org/neuripsw/2023/wu2023neuripsw-invertedattention-a/}
}