Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models
Abstract
Recent Vision Language Models (VLMs) have demonstrated impressive multimodal comprehension and reasoning capabilities, yet they often struggle with trivially simple visual tasks. In this work, we introduce the Atomic Visual Skills Benchmark (AVSBench) to evaluate whether VLMs possess the capability to understand basic geometric features, which we refer to as atomic visual skills. Specifically, we systematically categorize the atomic visual skills and handcraft a set of 5,073 diverse questions designed to assess each individual atomic visual skill. Using AVSBench, we evaluate the current leading VLMs and find that they struggle with most of these atomic visual skills, even though these skills are obvious to humans.
Cite
Text
Chae et al. "Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models." NeurIPS 2024 Workshops: MATH-AI, 2024.
Markdown
[Chae et al. "Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models." NeurIPS 2024 Workshops: MATH-AI, 2024.](https://mlanthology.org/neuripsw/2024/chae2024neuripsw-decomposing/)
BibTeX
@inproceedings{chae2024neuripsw-decomposing,
title = {{Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models}},
author = {Chae, Hyunsik and Yoon, Seungwoo and Chun, Chloe Yewon and Go, Gyehun and Cho, Yongin and Lee, Gyeongmin and Ryu, Ernest K.},
booktitle = {NeurIPS 2024 Workshops: MATH-AI},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/chae2024neuripsw-decomposing/}
}