DocFormerv2: Local Features for Document Understanding

Abstract

We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU). The VDU domain entails understanding documents beyond mere OCR predictions, e.g., extracting information from a form, VQA for documents, and other tasks. VDU is challenging because a model must make sense of multiple modalities (visual, language and spatial) to make a prediction. Our approach, termed DocFormerv2, is an encoder-decoder transformer that takes vision, language and spatial features as input. DocFormerv2 is pre-trained with unsupervised tasks employed asymmetrically, i.e., two novel document tasks on the encoder and one on the auto-regressive decoder. The unsupervised tasks have been carefully designed to ensure that pre-training encourages local-feature alignment between multiple modalities. When evaluated on nine challenging datasets, DocFormerv2 shows state-of-the-art performance on all of them over strong baselines, e.g., on TabFact (+4.3%), InfoVQA (+1.4%) and FUNSD (+1.0%). Furthermore, to show generalization capabilities, on three VQA tasks involving scene-text, DocFormerv2 outperforms previous comparably-sized models and even does better than much larger models (such as GIT2, PaLI and Flamingo) on these tasks. Extensive ablations show that, due to its novel pre-training tasks, DocFormerv2 understands multiple modalities better than prior art in VDU.

Cite

Text

Appalaraju et al. "DocFormerv2: Local Features for Document Understanding." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I2.27828

Markdown

[Appalaraju et al. "DocFormerv2: Local Features for Document Understanding." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/appalaraju2024aaai-docformerv/) doi:10.1609/AAAI.V38I2.27828

BibTeX

@inproceedings{appalaraju2024aaai-docformerv,
  title     = {{DocFormerv2: Local Features for Document Understanding}},
  author    = {Appalaraju, Srikar and Tang, Peng and Dong, Qi and Sankaran, Nishant and Zhou, Yichu and Manmatha, R.},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {709--718},
  doi       = {10.1609/AAAI.V38I2.27828},
  url       = {https://mlanthology.org/aaai/2024/appalaraju2024aaai-docformerv/}
}