ICT-QA: Question Answering over Multi-Modal Contexts Including Image, Chart, and Text Modalities

Abstract

For question answering in multi-modal contexts that include image, chart, and text modalities, a model must be proficient in understanding each individual modality. Furthermore, for some questions, the model must locate the necessary evidence across multiple modalities and generate answers through cross-modal reasoning. In this paper, we propose the Image and Chart Instruction Tuning (IC-tuning) method to enhance the model's comprehension of each modality. Specifically, we introduce visual-aware chart instruction-following data that describe both precise numerical values and visual information in charts. We then train a Large Language Model (LLM) with an architecture that employs an image-specific encoder and a chart-specific encoder. Our experiments demonstrate that this method achieves state-of-the-art performance on the Chart Summarization and Open-ended Chart Question Answering (OpenCQA) tasks while having minimal impact on image and language benchmark performance. Although the IC-tuned model comprehends each modality well, it still struggles with question answering in multi-modal contexts because it is trained only on data for understanding each individual modality. To address this, we introduce the Question Answering over Image, Chart, and Text (ICT-QA) dataset, designed specifically for question answering in multi-modal contexts. After further training the IC-tuned LLM on the ICT-QA dataset, our evaluations demonstrate that ICT-QA significantly improves answer quality for both single-modal questions, which require referencing only one of the available modalities, and cross-modal questions, which require reasoning across multiple modalities.

Cite

Text

Jang et al. "ICT-QA: Question Answering over Multi-Modal Contexts Including Image, Chart, and Text Modalities." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.

Markdown

[Jang et al. "ICT-QA: Question Answering over Multi-Modal Contexts Including Image, Chart, and Text Modalities." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/jang2025cvprw-ictqa/)

BibTeX

@inproceedings{jang2025cvprw-ictqa,
  title     = {{ICT-QA: Question Answering over Multi-Modal Contexts Including Image, Chart, and Text Modalities}},
  author    = {Jang, Youngrok and Kong, Hyesoo and Kim, Gyeonghun and Lee, Yejin and Choi, Stanley Jungkyu and Bae, Kyunghoon},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2025},
  pages     = {138--148},
  url       = {https://mlanthology.org/cvprw/2025/jang2025cvprw-ictqa/}
}