Visually Interpretable Subtask Reasoning for Visual Question Answering
Abstract
Answering complex visual questions like "Which red furniture can be used for sitting?" requires multi-step reasoning, including object recognition, attribute filtering, and relational understanding. Recent work improves interpretability in multimodal large language models (MLLMs) by decomposing tasks into sub-task programs, but these methods are computationally expensive and less accurate due to poor adaptation to target data. To address this, we introduce VISTAR (Visually Interpretable Subtask-Aware Reasoning Model), a subtask-driven training framework that enhances both interpretability and reasoning by generating textual and visual explanations within MLLMs. Instead of relying on external models, VISTAR fine-tunes MLLMs to produce structured Subtask-of-Thought rationales (step-by-step reasoning sequences). Experiments on two benchmarks show that VISTAR consistently improves reasoning accuracy while maintaining interpretability. Our code and dataset will be available at https://github.com/ChengJade/VISTAR.
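To make the idea of a step-by-step subtask decomposition concrete, here is a minimal, hypothetical sketch of how the example question might break into typed subtasks executed over a toy scene. All names (the scene schema, subtask labels, and `run` helper) are illustrative assumptions, not the paper's actual rationale format:

```python
# Hypothetical Subtask-of-Thought sketch (illustrative schema, not the
# paper's actual format): the question "Which red furniture can be used
# for sitting?" is decomposed into subtasks run over a toy scene.

# A toy scene: objects with attributes and affordances detected in an image.
scene = [
    {"name": "sofa",  "category": "furniture", "color": "red",   "affordances": {"sitting"}},
    {"name": "table", "category": "furniture", "color": "brown", "affordances": {"placing"}},
    {"name": "vase",  "category": "decor",     "color": "red",   "affordances": set()},
]

# Step-by-step reasoning sequence: each entry is (subtask name, filter).
subtasks = [
    ("select_category",   lambda objs: [o for o in objs if o["category"] == "furniture"]),
    ("filter_attribute",  lambda objs: [o for o in objs if o["color"] == "red"]),
    ("filter_affordance", lambda objs: [o for o in objs if "sitting" in o["affordances"]]),
]

def run(scene, subtasks):
    """Execute each subtask in order, recording an interpretable trace."""
    objs, trace = scene, []
    for name, step in subtasks:
        objs = step(objs)
        trace.append((name, [o["name"] for o in objs]))
    return objs, trace

answer, trace = run(scene, subtasks)
print([o["name"] for o in answer])  # → ['sofa']
```

The trace itself is the interpretable artifact: each intermediate result shows which objects survived each reasoning step, which is the kind of transparency a structured rationale provides over a single opaque answer.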
Cite
Text
Cheng et al. "Visually Interpretable Subtask Reasoning for Visual Question Answering." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.
Markdown
[Cheng et al. "Visually Interpretable Subtask Reasoning for Visual Question Answering." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/cheng2025cvprw-visually/)
BibTeX
@inproceedings{cheng2025cvprw-visually,
title = {{Visually Interpretable Subtask Reasoning for Visual Question Answering}},
author = {Cheng, Yu and Goel, Arushi and Bilen, Hakan},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2025},
pages = {2760--2780},
url = {https://mlanthology.org/cvprw/2025/cheng2025cvprw-visually/}
}