Meta-Learning Makes a Better Multimodal Few-Shot Learner
Abstract
Multimodal few-shot learning is challenging due to the large domain gap between the vision and language modalities. In an effort to bridge this gap, we introduce a meta-learning approach to multimodal few-shot learning, leveraging its strong ability to accrue knowledge across tasks. The full model is built on frozen foundation vision and language models to benefit from their already-learned capacity. To translate the visual features into the latent space of the language model, we introduce a lightweight meta-mapper acting as a meta-learner. By updating only the parameters of the meta-mapper, our model learns to adapt quickly to unseen samples with only a few gradient-step updates. Unlike prior multimodal few-shot learners, which require a hand-engineered task induction, our model induces the task in a completely data-driven manner. Experiments on recent multimodal few-shot benchmarks demonstrate that, compared to its counterparts, our meta-learning approach yields better multimodal few-shot learners while being computationally more efficient.
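The abstract outlines the mechanism without code: a lightweight meta-mapper turns frozen visual features into soft prompts in the frozen language model's embedding space, and only the mapper is adapted with a few gradient steps per task. Below is a minimal sketch of that idea, assuming a PyTorch-style setup; the module and function names (`MetaMapper`, `inner_adapt`, `num_prefix_tokens`, `loss_fn`) are illustrative assumptions, not identifiers from the paper, and the MLP mapper stands in for whatever architecture the authors actually use.

```python
# Hypothetical sketch of the abstract's idea, NOT the authors' implementation:
# a small mapper produces prefix embeddings for a frozen language model, and
# only the mapper's parameters are updated in a short, first-order inner loop.
import copy

import torch
import torch.nn as nn


class MetaMapper(nn.Module):
    """Maps a frozen visual feature vector to `num_prefix_tokens` soft prompts."""

    def __init__(self, visual_dim: int, lm_dim: int, num_prefix_tokens: int = 4):
        super().__init__()
        self.num_prefix_tokens = num_prefix_tokens
        self.proj = nn.Sequential(
            nn.Linear(visual_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim * num_prefix_tokens),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, visual_dim) -> (batch, num_prefix_tokens, lm_dim)
        out = self.proj(visual_feats)
        return out.view(visual_feats.size(0), self.num_prefix_tokens, -1)


def inner_adapt(mapper: MetaMapper, loss_fn, support_batch, steps: int = 3, lr: float = 1e-2):
    """Few-gradient-step adaptation of the mapper on a support set (first-order).

    `loss_fn(mapper, batch)` is assumed to run the frozen vision encoder and the
    frozen language model and return a scalar loss; only the mapper's parameters
    receive gradients and are updated.
    """
    adapted = copy.deepcopy(mapper)
    for _ in range(steps):
        loss = loss_fn(adapted, support_batch)
        grads = torch.autograd.grad(loss, adapted.parameters())
        with torch.no_grad():
            for p, g in zip(adapted.parameters(), grads):
                p -= lr * g  # plain SGD step on the mapper only
    return adapted
```

The sketch uses a first-order update for brevity; a second-order meta-learning variant would keep the computation graph across the inner steps so the outer loop can backpropagate through them.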
Cite
Text
Najdenkoska et al. "Meta-Learning Makes a Better Multimodal Few-Shot Learner." NeurIPS 2022 Workshops: MetaLearn, 2022.
Markdown
[Najdenkoska et al. "Meta-Learning Makes a Better Multimodal Few-Shot Learner." NeurIPS 2022 Workshops: MetaLearn, 2022.](https://mlanthology.org/neuripsw/2022/najdenkoska2022neuripsw-metalearning/)
BibTeX
@inproceedings{najdenkoska2022neuripsw-metalearning,
  title = {{Meta-Learning Makes a Better Multimodal Few-Shot Learner}},
  author = {Najdenkoska, Ivona and Zhen, Xiantong and Worring, Marcel},
  booktitle = {NeurIPS 2022 Workshops: MetaLearn},
  year = {2022},
  url = {https://mlanthology.org/neuripsw/2022/najdenkoska2022neuripsw-metalearning/}
}