Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models
Abstract
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning; this adoption has fueled a wealth of new models such as LLaVA, InstructBLIP, and PaLI-3. Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored, making it challenging to understand what factors account for model performance – a challenge further complicated by the lack of objective, consistent evaluations. To address these gaps, we first compile a suite of standardized evaluations spanning visual question answering, object localization, and challenge sets that probe properties such as hallucination; these evaluations provide fine-grained insight into VLM capabilities. Second, we rigorously investigate VLMs along key design axes, including pretrained visual representations and training from base vs. instruct-tuned language models, amongst others. We couple our analysis with three resource contributions: (1) a unified framework for evaluating VLMs, (2) optimized, flexible training code, and (3) checkpoints for all models, including a family of VLMs at the 7-13B scale that strictly outperform InstructBLIP and LLaVA v1.5, the state-of-the-art in open VLMs.
Cite
Text
Karamcheti et al. "Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models." International Conference on Machine Learning, 2024.
Markdown
[Karamcheti et al. "Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/karamcheti2024icml-prismatic/)
BibTeX
@inproceedings{karamcheti2024icml-prismatic,
  title = {{Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models}},
  author = {Karamcheti, Siddharth and Nair, Suraj and Balakrishna, Ashwin and Liang, Percy and Kollar, Thomas and Sadigh, Dorsa},
  booktitle = {International Conference on Machine Learning},
  year = {2024},
  pages = {23123--23144},
  volume = {235},
  url = {https://mlanthology.org/icml/2024/karamcheti2024icml-prismatic/}
}