LVLM-Count: Enhancing the Counting Ability of Large Vision-Language Models
Abstract
Counting is a fundamental operation in many real-world visual tasks, requiring both reliable object recognition and accurate enumeration. Despite their advanced visual perception, large vision-language models (LVLMs) are known to struggle with counting tasks. In this work, we evaluate the performance of several LVLMs on visual counting tasks across multiple counting and vision datasets. We observe that while they make fewer errors when the number of objects is small, they exhibit significant weaknesses as the number of objects increases. To alleviate this issue, we propose a simple yet effective baseline method that enhances LVLMs’ counting ability for large numbers of objects using a divide-and-conquer approach. Our method decomposes counting problems into sub-tasks. Moreover, it incorporates a mechanism to prevent objects from being split during division, which could otherwise lead to repeated counting of the same object, a common issue in a naive divide-and-conquer implementation. We demonstrate the effectiveness of this approach across various datasets and benchmarks, establishing it as a valuable reference for evaluating future solutions.
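The divide-and-conquer idea can be illustrated with a minimal one-dimensional sketch. This is not the paper's implementation: the box representation, the helper names `safe_cuts` and `count_divided`, and the stand-in counter `count_inside` (which takes the place of an LVLM query on a cropped region) are all assumptions made for illustration. The sketch only shows why moving a cut off an object avoids counting that object twice.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: (x_min, x_max) horizontal extent of one object, inclusive.
Box = Tuple[int, int]


def safe_cuts(width: int, boxes: List[Box], parts: int) -> List[int]:
    """Pick up to parts - 1 vertical cut positions that cross no box.

    A cut at x separates column x - 1 from column x, so it splits a box
    whenever x_min < x <= x_max.
    """
    def is_free(x: int) -> bool:
        return all(not (lo < x <= hi) for lo, hi in boxes)

    cuts: List[int] = []
    for k in range(1, parts):
        ideal = k * width // parts  # where an even split would cut
        # Search outward from the ideal position for the nearest free column.
        for offset in range(width):
            candidates = [c for c in (ideal - offset, ideal + offset) if 0 < c < width]
            free = [c for c in candidates if is_free(c) and c not in cuts]
            if free:
                cuts.append(free[0])
                break
    return sorted(cuts)


def count_divided(width: int, boxes: List[Box], parts: int,
                  count_region: Callable[[int, int], int]) -> int:
    """Sum per-region counts over regions delimited by object-safe cuts."""
    edges = [0] + safe_cuts(width, boxes, parts) + [width]
    return sum(count_region(a, b) for a, b in zip(edges, edges[1:]))


# Toy data: four objects, one of which straddles the midpoint x = 50.
boxes = [(10, 20), (45, 55), (60, 70), (90, 95)]


def count_inside(a: int, b: int) -> int:
    # Stand-in for a model call: counts objects lying fully inside [a, b).
    return sum(1 for lo, hi in boxes if lo >= a and hi < b)


cuts = safe_cuts(100, boxes, parts=2)
total = count_divided(100, boxes, parts=2, count_region=count_inside)
print(cuts, total)
```

In this toy case an even split at x = 50 would cut through the object spanning 45 to 55, whereas the sketch shifts the cut to x = 45 so that every object falls entirely within one region.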
Cite
Text
Qharabagh et al. "LVLM-Count: Enhancing the Counting Ability of Large Vision-Language Models." Transactions on Machine Learning Research, 2026.
Markdown
[Qharabagh et al. "LVLM-Count: Enhancing the Counting Ability of Large Vision-Language Models." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/qharabagh2026tmlr-lvlmcount/)
BibTeX
@article{qharabagh2026tmlr-lvlmcount,
title = {{LVLM-Count: Enhancing the Counting Ability of Large Vision-Language Models}},
author = {Qharabagh, Muhammad Fetrat and Ghofrani, Mohammadreza and Fountoulakis, Kimon},
journal = {Transactions on Machine Learning Research},
year = {2026},
url = {https://mlanthology.org/tmlr/2026/qharabagh2026tmlr-lvlmcount/}
}