Compositional Condition Question Answering in Tabular Understanding
Abstract
Multimodal Large Language Models (MLLMs) for tabular understanding have made significant progress in tasks such as financial report analysis and public data tests. However, our comprehensive analysis shows that these models are still limited in certain simple scenarios, particularly when handling compositional conditions in QA. Further investigation reveals that the poor performance can be attributed to two main challenges: the visual encoder’s inability to accurately recognize the content of a row, and the model’s tendency to overlook conditions in the question. To address these challenges, we introduce a new Compositional Condition Tabular Understanding method, called CoCoTab. Specifically, to capture the structural relationships within tables, we enhance the visual encoder with additional row and column patches. Moreover, we introduce conditional tokens between the visual patches and query embeddings, ensuring the model focuses on the relevant parts of the table according to the conditions specified in the query. We also introduce the Massive Multimodal Tabular Understanding (MMTU) benchmark, which comprehensively assesses the capabilities of MLLMs in tabular understanding. Our proposed method achieves state-of-the-art performance on both existing tabular understanding benchmarks and MMTU. Our code is available at https://github.com/LAMDA-Tabular/MMTU.
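To make the term "compositional condition" concrete, the following minimal sketch shows a question whose answer depends on several conditions holding at once. It illustrates the task only, not the CoCoTab method; the table, column names, question, and the `answer` helper are all hypothetical and do not come from the paper or the MMTU benchmark.

```python
# Hypothetical toy table; column names and values are invented for illustration.
table = [
    {"company": "A", "sector": "tech", "year": 2021, "revenue": 120},
    {"company": "B", "sector": "tech", "year": 2019, "revenue": 95},
    {"company": "C", "sector": "retail", "year": 2021, "revenue": 80},
    {"company": "D", "sector": "tech", "year": 2022, "revenue": 150},
]


def answer(rows, conditions, target):
    """Return the target column for every row that satisfies all conditions."""
    return [row[target] for row in rows if all(cond(row) for cond in conditions)]


# Question: "What is the revenue of tech companies after 2020?"
# The answer requires composing two conditions; dropping either one
# changes which rows are selected.
conditions = [
    lambda row: row["sector"] == "tech",
    lambda row: row["year"] > 2020,
]
result = answer(table, conditions, "revenue")
print(result)
```

A model that overlooks the second condition would also return the revenue of company B, which is the kind of error the abstract attributes to ignoring conditions in the question.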
Cite
Text
Jiang et al. "Compositional Condition Question Answering in Tabular Understanding." Proceedings of the 42nd International Conference on Machine Learning, 2025.
Markdown
[Jiang et al. "Compositional Condition Question Answering in Tabular Understanding." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/jiang2025icml-compositional/)
BibTeX
@inproceedings{jiang2025icml-compositional,
title = {{Compositional Condition Question Answering in Tabular Understanding}},
author = {Jiang, Jun-Peng and Zhou, Tao and Zhan, De-Chuan and Ye, Han-Jia},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {27831--27850},
volume = {267},
url = {https://mlanthology.org/icml/2025/jiang2025icml-compositional/}
}