TFAR: A Training-Free Framework for Autonomous Reliable Reasoning in Visual Question Answering
Abstract
Recent approaches introduce chain-of-thought (CoT) reasoning to mitigate challenges in multimodal large language models (MLLMs), such as hallucination and reasoning deficits, and to enhance performance. However, existing CoT-based methods often rely on extensive data annotation and training. To overcome these limitations, we propose a training-free framework for autonomous and reliable reasoning (TFAR), which uses only common lightweight vision tools to improve the reasoning ability of MLLMs. TFAR enables an MLLM to autonomously and accurately identify relevant regions of interest (RoIs) that support CoT reasoning, without additional training or annotation and with low computational overhead during inference. However, external tools introduce noise and uncertainty. To mitigate this uncertainty and select the optimal pathway, we propose a conformal prediction-based uncertainty quantification method that calibrates the outputs of external tools and dynamically selects the most appropriate tool based on the MLLM's output uncertainty. Experiments across five datasets demonstrate that TFAR improves performance over the base MLLM by an average of 4.6%, in some cases even outperforming fine-tuned baselines, while maintaining low inference cost. These results offer new insights into training-free CoT guidance for MLLMs and underscore the value of reliable visual tools.
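The abstract's conformal-prediction step can be illustrated with a minimal sketch of split conformal prediction: calibrate a nonconformity threshold on held-out examples, then keep only tool outputs whose scores fall within that threshold (a larger surviving set signals higher uncertainty). The function names, scores, and candidate labels below are hypothetical illustrations, not the paper's actual implementation.

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    # Split conformal prediction: with n calibration nonconformity scores,
    # the threshold is the ceil((n + 1) * (1 - alpha)) / n empirical quantile,
    # giving ~(1 - alpha) coverage on exchangeable test points.
    n = len(cal_scores)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)  # guard for very small alpha
    return sorted(cal_scores)[k - 1]

def prediction_set(candidate_scores, threshold):
    # Keep every candidate whose nonconformity is within the threshold;
    # a smaller set indicates the tool's output is more certain.
    return [c for c, s in candidate_scores.items() if s <= threshold]

# Hypothetical calibration scores (e.g. 1 - tool confidence on held-out data).
cal = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.55, 0.70, 0.80, 0.90]
tau = conformal_threshold(cal, alpha=0.2)

# Hypothetical RoI labels proposed by a vision tool, with nonconformity scores.
candidates = {"dog": 0.08, "cat": 0.35, "car": 0.95}
print(tau, prediction_set(candidates, tau))  # → 0.8 ['dog', 'cat']
```

In a TFAR-style pipeline, such per-tool prediction sets could be compared to pick the tool whose calibrated output is least uncertain before feeding its RoIs into CoT reasoning.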
Cite
Text
Zhi et al. "TFAR: A Training-Free Framework for Autonomous Reliable Reasoning in Visual Question Answering." Transactions on Machine Learning Research, 2025.
Markdown
[Zhi et al. "TFAR: A Training-Free Framework for Autonomous Reliable Reasoning in Visual Question Answering." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/zhi2025tmlr-tfar/)
BibTeX
@article{zhi2025tmlr-tfar,
title = {{TFAR: A Training-Free Framework for Autonomous Reliable Reasoning in Visual Question Answering}},
author = {Zhi, Zhuo and Feng, Chen and Daneshmend, Adam and Orlu, Mine and Demosthenous, Andreas and Yin, Lu and Li, Da and Liu, Ziquan and Rodrigues, Miguel R. D.},
journal = {Transactions on Machine Learning Research},
year = {2025},
url = {https://mlanthology.org/tmlr/2025/zhi2025tmlr-tfar/}
}