SINC: Self-Supervised In-Context Learning for Vision-Language Tasks

Abstract

Large pre-trained Transformers exhibit an intriguing capacity for in-context learning. Without gradient updates, these models can rapidly construct new predictors from demonstrations presented in the inputs. Recent works promote this ability in the vision-language domain by incorporating visual information into large language models that can already make in-context predictions. However, these methods may inherit issues from the language domain, such as template sensitivity and hallucination. Moreover, the scale of these language models demands significant computation, making them resource-intensive to train and operate. We therefore pose a question: "How can we enable in-context learning without relying on the intrinsic in-context ability of large language models?". To answer it, we propose a succinct and general framework, Self-supervised IN-Context learning (SINC), which introduces a meta-model that learns on self-supervised prompts consisting of tailored demonstrations. The learned models can be transferred to downstream tasks to make in-context predictions on the fly. Extensive experiments show that SINC outperforms gradient-based methods on various vision-language tasks under few-shot settings. Furthermore, the design of SINC helps us investigate the benefits of in-context learning across different tasks, and the analysis further reveals the essential components for the emergence of in-context learning in the vision-language domain.

Cite

Text

Chen et al. "SINC: Self-Supervised In-Context Learning for Vision-Language Tasks." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01415

Markdown

[Chen et al. "SINC: Self-Supervised In-Context Learning for Vision-Language Tasks." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/chen2023iccv-sinc/) doi:10.1109/ICCV51070.2023.01415

BibTeX

@inproceedings{chen2023iccv-sinc,
  title     = {{SINC: Self-Supervised In-Context Learning for Vision-Language Tasks}},
  author    = {Chen, Yi-Syuan and Song, Yun-Zhu and Yeo, Cheng Yu and Liu, Bei and Fu, Jianlong and Shuai, Hong-Han},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {15430--15442},
  doi       = {10.1109/ICCV51070.2023.01415},
  url       = {https://mlanthology.org/iccv/2023/chen2023iccv-sinc/}
}