Contrastive Attention Maps for Self-Supervised Co-Localization
Abstract
The goal of unsupervised co-localization is to locate the object in a scene under the assumptions that 1) the dataset consists of only one superclass, e.g., birds, and 2) there are no human-annotated labels in the dataset. The most recent method achieves impressive co-localization performance by employing self-supervised representation learning approaches such as predicting rotation. In this paper, we introduce a new contrastive objective applied directly to the attention maps to enhance co-localization performance. Our contrastive loss function exploits rich location information, which encourages the model to activate the full extent of the object effectively. In addition, we propose a pixel-wise attention pooling that selectively aggregates the feature map according to its magnitudes across channels. Our methods are simple and shown to be effective through extensive qualitative and quantitative evaluation, achieving state-of-the-art co-localization performance by large margins on four datasets: CUB-200-2011, Stanford Cars, FGVC-Aircraft, and Stanford Dogs. Our code will be publicly available online for the research community.
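To make the pooling idea concrete, below is a minimal PyTorch sketch of a pixel-wise attention pooling step, assuming each spatial location is weighted by the softmax-normalized magnitude of its feature vector across channels. The function name, the squared-magnitude score, and the softmax normalization are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn.functional as F

def pixelwise_attention_pool(feat: torch.Tensor):
    """Hypothetical sketch: weight each spatial location by the magnitude
    of its feature vector across channels, then aggregate spatially.

    feat: (B, C, H, W) feature map from a CNN backbone.
    Returns the pooled feature (B, C) and the attention map (B, H, W).
    """
    b, c, h, w = feat.shape
    # Per-pixel magnitude across channels -> (B, H*W)
    mag = feat.pow(2).sum(dim=1).flatten(1)
    # Normalize magnitudes into spatial attention weights (assumed softmax)
    attn = F.softmax(mag, dim=1)
    # Attention-weighted sum over spatial locations -> (B, C)
    pooled = torch.einsum('bcn,bn->bc', feat.flatten(2), attn)
    return pooled, attn.view(b, h, w)

# Usage example with a ResNet-style feature map of shape (2, 512, 7, 7)
feat = torch.randn(2, 512, 7, 7)
pooled, attn_map = pixelwise_attention_pool(feat)
print(pooled.shape, attn_map.shape)  # torch.Size([2, 512]) torch.Size([2, 7, 7])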
Cite
Text
Ki et al. "Contrastive Attention Maps for Self-Supervised Co-Localization." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.00280
Markdown
[Ki et al. "Contrastive Attention Maps for Self-Supervised Co-Localization." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/ki2021iccv-contrastive/) doi:10.1109/ICCV48922.2021.00280
BibTeX
@inproceedings{ki2021iccv-contrastive,
title = {{Contrastive Attention Maps for Self-Supervised Co-Localization}},
author = {Ki, Minsong and Uh, Youngjung and Choe, Junsuk and Byun, Hyeran},
booktitle = {International Conference on Computer Vision},
year = {2021},
pages = {2803--2812},
doi = {10.1109/ICCV48922.2021.00280},
url = {https://mlanthology.org/iccv/2021/ki2021iccv-contrastive/}
}