@BENCH: Benchmarking Vision-Language Models for Human-Centered Assistive Technology

Abstract

As Vision-Language Models (VLMs) advance, human-centered Assistive Technologies (ATs) for helping People with Visual Impairments (PVIs) are evolving into generalists capable of performing multiple tasks simultaneously. However, benchmarking VLMs for ATs remains under-explored. To bridge this gap, we first create a novel AT benchmark (@BENCH). Guided by a pre-design user study with PVIs, our benchmark includes the five most crucial vision-language tasks: Panoptic Segmentation, Depth Estimation, Optical Character Recognition (OCR), Image Captioning, and Visual Question Answering (VQA). In addition, we propose a novel AT model (@MODEL) that addresses all tasks simultaneously and can be expanded to more assistive functions for helping PVIs. By integrating multi-modal information, our framework achieves outstanding performance across tasks and offers PVIs more comprehensive assistance. Extensive experiments demonstrate the effectiveness and generalizability of our framework.

Cite

Text

Jiang et al. "@BENCH: Benchmarking Vision-Language Models for Human-Centered Assistive Technology." Winter Conference on Applications of Computer Vision, 2025.

Markdown

[Jiang et al. "@BENCH: Benchmarking Vision-Language Models for Human-Centered Assistive Technology." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/jiang2025wacv-bench/)

BibTeX

@inproceedings{jiang2025wacv-bench,
  title     = {{@BENCH: Benchmarking Vision-Language Models for Human-Centered Assistive Technology}},
  author    = {Jiang, Xin and Zheng, Junwei and Liu, Ruiping and Li, Jiahang and Zhang, Jiaming and Matthiesen, Sven and Stiefelhagen, Rainer},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2025},
  pages     = {3934--3943},
  url       = {https://mlanthology.org/wacv/2025/jiang2025wacv-bench/}
}