MMLONGBENCH-DOC: Benchmarking Long-Context Document Understanding with Visualizations

Abstract

Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLONGBENCH-DOC, a long-context, multi-modal benchmark comprising 1,082 expert-annotated questions. Distinct from previous datasets, it is constructed upon 135 lengthy PDF-formatted documents with an average of 47.5 pages and 21,214 textual tokens. Towards comprehensive evaluation, answers to these questions rely on pieces of evidence from (1) different sources (text, image, chart, table, and layout structure) and (2) various locations (i.e., page number). Moreover, 33.7% of the questions are cross-page questions requiring evidence across multiple pages. 20.6% of the questions are designed to be unanswerable for detecting potential hallucinations. Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models. Notably, the best-performing model, GPT-4o, achieves an F1 score of only 44.9%, while the second-best, GPT-4V, scores 30.5%. Furthermore, 12 LVLMs (all except GPT-4o and GPT-4V) even present worse performance than their LLM counterparts which are fed with lossy-parsed OCR documents. These results validate the necessity of future research toward more capable long-context LVLMs.

Cite

Text

Ma et al. "MMLONGBENCH-DOC: Benchmarking Long-Context Document Understanding with Visualizations." Neural Information Processing Systems, 2024. doi:10.52202/079017-3041

Markdown

[Ma et al. "MMLONGBENCH-DOC: Benchmarking Long-Context Document Understanding with Visualizations." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/ma2024neurips-mmlongbenchdoc/) doi:10.52202/079017-3041

BibTeX

@inproceedings{ma2024neurips-mmlongbenchdoc,
  title     = {{MMLONGBENCH-DOC: Benchmarking Long-Context Document Understanding with Visualizations}},
  author    = {Ma, Yubo and Zang, Yuhang and Chen, Liangyu and Chen, Meiqi and Jiao, Yizhu and Li, Xinze and Lu, Xinyuan and Liu, Ziyu and Ma, Yan and Dong, Xiaoyi and Zhang, Pan and Pan, Liangming and Jiang, Yu-Gang and Wang, Jiaqi and Cao, Yixin and Sun, Aixin},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-3041},
  url       = {https://mlanthology.org/neurips/2024/ma2024neurips-mmlongbenchdoc/}
}