Variable Resolution Improves Visual Question Answering Under a Limited Pixel Budget

Abstract

AI-based systems for visual scene understanding benefit from a large field of view (FOV). Multiple camera systems extend the FOV, but larger and higher-quality images strain acquisition, communication, and computing resources. Sub-sampling the FOV can reduce these demands, but risks compromising performance on complex tasks that require fine visual cues and contextual information. We demonstrate that a variable sampling scheme, inspired by human vision, outperforms uniform sampling in several visual question answering (VQA) tasks with a limited sample budget (3% of full resolution). Specifically, we show accuracy gains of 3.7%, 2.0%, and 0.9% on the GQA, VQAv2, and SEED-Bench datasets, respectively. This improvement, achieved without image scanning, holds regardless of the fixation point location, as confirmed by control experiments. The results show the potential of the biologically inspired image representation to improve the design of visual acquisition and processing models in future AI systems.
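The abstract does not specify the paper's exact sampling scheme, but the contrast between uniform and human-vision-inspired variable sampling under a fixed pixel budget can be illustrated with a minimal sketch. The snippet below assumes a log-polar layout (one common foveation model, not necessarily the authors' choice): sample rings are spaced geometrically from the fixation point, so sampling density is high near fixation and falls off with eccentricity, while the total number of samples stays within the budget. All function names and parameters here are illustrative.

```python
import numpy as np

def uniform_samples(h, w, budget):
    """Regular grid with at most `budget` sample points."""
    step = int(np.ceil(np.sqrt(h * w / budget)))
    ys, xs = np.meshgrid(np.arange(0, h, step),
                         np.arange(0, w, step), indexing="ij")
    return np.stack([ys.ravel(), xs.ravel()], axis=1)

def foveated_samples(h, w, budget, fixation):
    """Log-polar sampling: geometric ring spacing from the fixation
    point concentrates samples near fixation (human-vision-like)."""
    fy, fx = fixation
    # Farthest image corner sets the outermost ring radius.
    max_r = np.hypot(max(fy, h - 1 - fy), max(fx, w - 1 - fx))
    n_rings = max(1, int(np.sqrt(budget)))
    n_angles = max(1, budget // n_rings)
    radii = np.geomspace(1.0, max_r, n_rings)
    angles = np.linspace(0, 2 * np.pi, n_angles, endpoint=False)
    rr, aa = np.meshgrid(radii, angles, indexing="ij")
    ys = np.clip(np.round(fy + rr * np.sin(aa)), 0, h - 1)
    xs = np.clip(np.round(fx + rr * np.cos(aa)), 0, w - 1)
    pts = np.stack([ys, xs], axis=-1).reshape(-1, 2).astype(int)
    # Rounding collapses nearby points on inner rings; deduplicate.
    return np.unique(pts, axis=0)

# Example: a 3% pixel budget on a 512x512 image, fixating the center.
budget = int(0.03 * 512 * 512)
fov = foveated_samples(512, 512, budget, (256, 256))
uni = uniform_samples(512, 512, budget)
```

Either set of sample coordinates could then be used to gather pixels before feeding a downstream VQA model; the variable scheme trades peripheral detail for a dense central region around the fixation point.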

Cite

Text

Gizdov et al. "Variable Resolution Improves Visual Question Answering Under a Limited Pixel Budget." European Conference on Computer Vision Workshops, 2024. doi:10.1007/978-3-031-91578-9_22

Markdown

[Gizdov et al. "Variable Resolution Improves Visual Question Answering Under a Limited Pixel Budget." European Conference on Computer Vision Workshops, 2024.](https://mlanthology.org/eccvw/2024/gizdov2024eccvw-variable/) doi:10.1007/978-3-031-91578-9_22

BibTeX

@inproceedings{gizdov2024eccvw-variable,
  title     = {{Variable Resolution Improves Visual Question Answering Under a Limited Pixel Budget}},
  author    = {Gizdov, Andrey and Ullman, Shimon and Harari, Daniel},
  booktitle = {European Conference on Computer Vision Workshops},
  year      = {2024},
  pages     = {289--298},
  doi       = {10.1007/978-3-031-91578-9_22},
  url       = {https://mlanthology.org/eccvw/2024/gizdov2024eccvw-variable/}
}