Are Vision Language Models Robust to Classic Uncertainty Challenges?

Abstract

Robustness against uncertain and ambiguous inputs is a critical challenge for deep learning models. While recent advances in large-scale vision language models (VLMs, e.g. GPT-4o) might suggest that increasing model and training dataset size would mitigate this issue, our empirical evaluation shows a more complicated picture. In this work, we sanity-check whether modern VLMs pass the two most "classic" uncertainty quantification challenges: anomaly detection and classification under inherently ambiguous conditions. We find that newer and larger VLMs do exhibit improved robustness compared to earlier models, but they still tend to follow instructions strictly, which often causes them to hallucinate confident responses even when faced with unclear or anomalous inputs. Remarkably, for natural images such as ImageNet, this limitation can be overcome without pipeline modifications: simply prompting models to abstain from uncertain predictions yields significant reliability gains, achieving near-perfect robustness in several settings. However, for domain-specific tasks such as galaxy morphology classification, a lack of specialized knowledge prevents reliable uncertainty estimation. Finally, we propose a simple mechanism based on caption diversity to reveal a model’s internal uncertainty, enabling practitioners to predict when models will successfully abstain without relying on labeled data.
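
The caption-diversity idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how diversity is measured, so everything here is an assumption: the mean pairwise Jaccard distance over word sets, the function names `caption_diversity` and `should_abstain`, and the 0.5 threshold are hypothetical choices, not the authors' method. The captions are assumed to have been sampled from a VLM for the same image beforehand.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B| over the lowercase word sets of a and b."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        # Two empty captions are treated as identical.
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def caption_diversity(captions: list[str]) -> float:
    """Mean pairwise Jaccard distance; 0.0 when fewer than two captions."""
    pairs = list(combinations(captions, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def should_abstain(captions: list[str], threshold: float = 0.5) -> bool:
    """Hypothetical rule: flag the input when sampled captions disagree a lot."""
    return caption_diversity(captions) > threshold
```

Under these assumptions, captions that agree word for word give a diversity of 0.0, while captions with no words in common give 1.0. In practice the threshold would need to be tuned, and an embedding-based similarity might track semantic disagreement better than word overlap.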

Cite

Text

Wang and Nalisnick. "Are Vision Language Models Robust to Classic Uncertainty Challenges?" Transactions on Machine Learning Research, 2026.

Markdown

[Wang and Nalisnick. "Are Vision Language Models Robust to Classic Uncertainty Challenges?" Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/wang2026tmlr-vision/)

BibTeX

@article{wang2026tmlr-vision,
  title     = {{Are Vision Language Models Robust to Classic Uncertainty Challenges?}},
  author    = {Wang, Xi and Nalisnick, Eric},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/wang2026tmlr-vision/}
}