Self-Supervised Learning of Visual Features Through Embedding Images into Text Topic Spaces
Abstract
End-to-end training of current deep architectures from scratch for new computer vision problems would require ImageNet-scale datasets, which are not always available. In this paper we present a method that takes advantage of freely available multi-modal content to train computer vision algorithms without human supervision. We put forward the idea of performing self-supervised learning of visual features by mining a large-scale corpus of multi-modal (text and image) documents. We show that discriminative visual features can be learnt efficiently by training a CNN to predict the semantic context in which a particular image is most likely to appear as an illustration. For this we leverage the hidden semantic structures discovered in the text corpus with a well-known topic modeling technique. Our experiments demonstrate state-of-the-art performance in image classification, object detection, and multi-modal retrieval compared to recent self-supervised or naturally supervised approaches.
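To make the training signal concrete, below is a minimal sketch of the kind of objective the abstract describes: topics are discovered in the text corpus with LDA, and a CNN is trained to predict each document's topic distribution from its illustrating image. The specific libraries (gensim, PyTorch/torchvision), the ResNet backbone, the number of topics, and the soft cross-entropy loss are illustrative assumptions, not details taken from the paper.

```python
# Sketch of the self-supervised objective: a CNN regresses the LDA topic
# distribution of the document in which an image appears as an illustration.
# All concrete choices (gensim LDA, ResNet-18, 40 topics, soft cross-entropy)
# are assumptions for illustration only.
import torch
import torch.nn as nn
import torchvision.models as models
from gensim import corpora
from gensim.models import LdaModel

# 1) Discover hidden semantic structure in the text corpus with LDA (toy corpus).
texts = [["deep", "learning", "vision"], ["football", "match", "goal"]]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]
num_topics = 40
lda = LdaModel(bow_corpus, num_topics=num_topics, id2word=dictionary)

def topic_target(bow):
    """Dense topic-probability vector for one document (the CNN's target)."""
    probs = torch.zeros(num_topics)
    for topic_id, p in lda.get_document_topics(bow, minimum_probability=0.0):
        probs[topic_id] = float(p)
    return probs

# 2) Train a CNN to predict, from the image alone, the topic distribution of
#    the document it illustrates; the learned features transfer to vision tasks.
cnn = models.resnet18(weights=None)
cnn.fc = nn.Linear(cnn.fc.in_features, num_topics)
optimizer = torch.optim.SGD(cnn.parameters(), lr=1e-3, momentum=0.9)

images = torch.randn(2, 3, 224, 224)                      # stand-in image batch
targets = torch.stack([topic_target(b) for b in bow_corpus])

logits = cnn(images)
# Soft-label cross-entropy between predicted and LDA topic distributions.
loss = -(targets * torch.log_softmax(logits, dim=1)).sum(dim=1).mean()
loss.backward()
optimizer.step()
```

In this reading, no manual labels are needed: the topic model provides the supervision, and the CNN's intermediate features can then be reused for classification, detection, or retrieval.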
Cite
Text
Gomez et al. "Self-Supervised Learning of Visual Features Through Embedding Images into Text Topic Spaces." Conference on Computer Vision and Pattern Recognition, 2017. doi:10.1109/CVPR.2017.218
Markdown
[Gomez et al. "Self-Supervised Learning of Visual Features Through Embedding Images into Text Topic Spaces." Conference on Computer Vision and Pattern Recognition, 2017.](https://mlanthology.org/cvpr/2017/gomez2017cvpr-selfsupervised/) doi:10.1109/CVPR.2017.218
BibTeX
@inproceedings{gomez2017cvpr-selfsupervised,
title = {{Self-Supervised Learning of Visual Features Through Embedding Images into Text Topic Spaces}},
author = {Gomez, Lluis and Patel, Yash and Rusinol, Marcal and Karatzas, Dimosthenis and Jawahar, C. V.},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2017},
doi = {10.1109/CVPR.2017.218},
url = {https://mlanthology.org/cvpr/2017/gomez2017cvpr-selfsupervised/}
}