CoG-DQA: Chain-of-Guiding Learning with Large Language Models for Diagram Question Answering
Abstract
Diagram Question Answering (DQA) is a challenging task requiring models to answer natural language questions based on visual diagram contexts. It serves as a crucial basis for academic tutoring technical support and more practical applications. DQA poses significant challenges such as the demand for domain-specific knowledge and the scarcity of annotated data which restrict the applicability of large-scale deep models. Previous approaches have explored external knowledge integration through pre-training but these methods are costly and can be limited by domain disparities. While Large Language Models (LLMs) show promise in question-answering there is still a gap in how to cooperate and interact with the diagram parsing process. In this paper we introduce the Chain-of-Guiding Learning Model for Diagram Question Answering (CoG-DQA) a novel framework that effectively addresses DQA challenges. CoG-DQA leverages LLMs to guide diagram parsing tools (DPTs) through the guiding chains enhancing the precision of diagram parsing while introducing rich background knowledge. Our experimental findings reveal that CoG-DQA surpasses all comparison models in various DQA scenarios achieving an average accuracy enhancement exceeding 5% and peaking at 11% across four datasets. These results underscore CoG-DQA's capacity to advance the field of visual question answering and promote the integration of LLMs into specialized domains.
Cite
Text
Wang et al. "CoG-DQA: Chain-of-Guiding Learning with Large Language Models for Diagram Question Answering." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01325Markdown
[Wang et al. "CoG-DQA: Chain-of-Guiding Learning with Large Language Models for Diagram Question Answering." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/wang2024cvpr-cogdqa/) doi:10.1109/CVPR52733.2024.01325BibTeX
@inproceedings{wang2024cvpr-cogdqa,
title = {{CoG-DQA: Chain-of-Guiding Learning with Large Language Models for Diagram Question Answering}},
author = {Wang, Shaowei and Zhang, Lingling and Zhu, Longji and Qin, Tao and Yap, Kim-Hui and Zhang, Xinyu and Liu, Jun},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {13969-13979},
doi = {10.1109/CVPR52733.2024.01325},
url = {https://mlanthology.org/cvpr/2024/wang2024cvpr-cogdqa/}
}