DocFormer: End-to-End Transformer for Document Understanding

Abstract

We present DocFormer, a multi-modal transformer-based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem that aims to understand documents in their varied formats (forms, receipts, etc.) and layouts. In addition, DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses text, vision, and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities, which makes it easy for the model to correlate text to visual tokens and vice versa. DocFormer is evaluated on 4 different datasets, each with strong baselines. DocFormer achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in number of parameters).
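The abstract's key architectural idea is that a single learned spatial embedding is shared by the text and visual token streams before self-attention, tying the two modalities together through layout. The toy sketch below illustrates that idea only; the function name, shapes, and the simple "attend per modality, then sum" fusion are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multimodal_self_attention(text_feats, vis_feats, spatial_emb):
    """Toy single-head attention in the spirit of DocFormer's
    multi-modal layer: the SAME spatial embedding is added to both
    the text and the visual token features, so both modalities are
    aligned through shared layout information.
    (Illustrative sketch only, not the paper's exact equations.)"""
    q_t = text_feats + spatial_emb  # text tokens carry layout info
    q_v = vis_feats + spatial_emb   # visual tokens share the SAME layout embedding
    d = q_t.shape[-1]
    attn_t = softmax(q_t @ q_t.T / np.sqrt(d))
    attn_v = softmax(q_v @ q_v.T / np.sqrt(d))
    # Each modality attends over its own tokens; the outputs are summed,
    # a simplified stand-in for the paper's multi-modal fusion.
    return attn_t @ text_feats + attn_v @ vis_feats

rng = np.random.default_rng(0)
n_tokens, dim = 4, 8
out = multimodal_self_attention(
    rng.normal(size=(n_tokens, dim)),
    rng.normal(size=(n_tokens, dim)),
    rng.normal(size=(n_tokens, dim)),
)
print(out.shape)  # (4, 8)
```

Because the spatial embedding is identical across streams, a text token and the visual token at the same document location receive the same positional signal, which is what lets the model correlate them.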

Cite

Text

Appalaraju et al. "DocFormer: End-to-End Transformer for Document Understanding." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.00103

Markdown

[Appalaraju et al. "DocFormer: End-to-End Transformer for Document Understanding." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/appalaraju2021iccv-docformer/) doi:10.1109/ICCV48922.2021.00103

BibTeX

@inproceedings{appalaraju2021iccv-docformer,
  title     = {{DocFormer: End-to-End Transformer for Document Understanding}},
  author    = {Appalaraju, Srikar and Jasani, Bhavan and Kota, Bhargava Urala and Xie, Yusheng and Manmatha, R.},
  booktitle = {International Conference on Computer Vision},
  year      = {2021},
  pages     = {993--1003},
  doi       = {10.1109/ICCV48922.2021.00103},
  url       = {https://mlanthology.org/iccv/2021/appalaraju2021iccv-docformer/}
}