How (not) to Ensemble LVLMs for VQA

Abstract

This paper studies ensembling in the era of Large Vision-Language Models (LVLMs). Ensembling is a classical method for combining different models to obtain increased performance. In the recent work on Encyclopedic-VQA, the authors examine a wide variety of models to solve their task: from vanilla LVLMs, to models including the caption as extra context, to models augmented with Lens-based retrieval of Wikipedia pages. Intuitively, these models are highly complementary, which should make them ideal for ensembling. Indeed, an oracle experiment shows potential gains from 48.8% accuracy (the best single model) all the way up to 67% (the best possible ensemble). So it is a trivial exercise to create an ensemble with substantial real gains. Or is it?
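The oracle experiment referenced in the abstract is a standard upper-bound estimate: a question counts as correct if at least one model in the ensemble answers it correctly, i.e. a perfect per-question model selector. Below is a minimal sketch of how such an upper bound can be computed, assuming per-question correctness labels for each model; the function names and toy data are illustrative and not taken from the paper.

```python
from itertools import combinations

def oracle_accuracy(per_model_correct):
    """Oracle-ensemble accuracy: a question is solved if ANY model gets it right.

    per_model_correct: list of boolean lists, one per model, aligned over
    the same questions.
    """
    n_questions = len(per_model_correct[0])
    solved = [any(model[q] for model in per_model_correct) for q in range(n_questions)]
    return sum(solved) / n_questions

def best_oracle_ensemble(per_model_correct, size):
    """Search all ensembles of a given size and return the best oracle accuracy."""
    return max(
        oracle_accuracy([per_model_correct[i] for i in idx])
        for idx in combinations(range(len(per_model_correct)), size)
    )

# Toy usage: three hypothetical models evaluated on five questions.
models = [
    [True, False, False, True, False],   # e.g. a vanilla LVLM
    [False, True, False, True, False],   # e.g. caption-augmented
    [False, False, True, False, False],  # e.g. retrieval-augmented
]
print(oracle_accuracy(models))           # 0.8: the oracle picks the right model per question
print(best_oracle_ensemble(models, 2))   # 0.6: best pair of models under the oracle
```

The gap between the best single model's accuracy and this oracle number is the headroom an ensemble could, in principle, recover; the paper's point is that realizing it in practice is far from trivial.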

Cite

Text

Alazraki et al. "How (not) to Ensemble LVLMs for VQA." NeurIPS 2023 Workshops: ICBINB, 2023.

Markdown

[Alazraki et al. "How (not) to Ensemble LVLMs for VQA." NeurIPS 2023 Workshops: ICBINB, 2023.](https://mlanthology.org/neuripsw/2023/alazraki2023neuripsw-ensemble/)

BibTeX

@inproceedings{alazraki2023neuripsw-ensemble,
  title     = {{How (not) to Ensemble LVLMs for VQA}},
  author    = {Alazraki, Lisa and Castrejon, Lluis and Dehghani, Mostafa and Huot, Fantine and Uijlings, Jasper and Mensink, Thomas},
  booktitle = {NeurIPS 2023 Workshops: ICBINB},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/alazraki2023neuripsw-ensemble/}
}