End-to-End Document Recognition and Understanding with Dessurt

Abstract

We introduce Dessurt, a relatively simple document understanding transformer capable of being fine-tuned on a greater variety of document tasks than prior methods. It receives a document image and task string as input and generates arbitrary text autoregressively as output. Because Dessurt is an end-to-end architecture that performs text recognition in addition to document understanding, it does not require an external recognition model as prior methods do. Dessurt is a more flexible model than prior methods and is able to handle a variety of document domains and tasks. We show that this model is effective at 9 different dataset-task combinations.
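The interface described in the abstract (a document image and a task string in, autoregressively generated text out) can be illustrated with a minimal sketch. This is not Dessurt's code or architecture: `generate`, `toy_next_token`, the `EOS` marker, and the task names are hypothetical stand-ins, and the "model" is a lookup table rather than a trained network. It only shows the shape of the decoding loop, assuming greedy decoding.

```python
from typing import Callable, List, Sequence

# Hypothetical signature: maps (image, task string, tokens so far) to the next token.
NextTokenFn = Callable[[Sequence[Sequence[float]], str, List[str]], str]

EOS = "<eos>"  # assumed end-of-sequence marker, not taken from the paper


def generate(next_token: NextTokenFn, image, task: str, max_len: int = 32) -> str:
    """Greedy autoregressive decoding: each emitted token is fed back as context."""
    tokens: List[str] = []
    for _ in range(max_len):
        tok = next_token(image, task, tokens)
        if tok == EOS:
            break
        tokens.append(tok)
    return "".join(tokens)


def toy_next_token(image, task, tokens):
    """Stand-in for a trained network: spells out a canned answer per task."""
    canned = {"read": "hello", "classify": "letter"}  # hypothetical tasks
    answer = canned.get(task, "")
    return answer[len(tokens)] if len(tokens) < len(answer) else EOS


blank_image = [[0.0] * 4 for _ in range(4)]  # placeholder pixel grid
print(generate(toy_next_token, blank_image, "read"))
```

Because the task is passed as a string and the output is free text, the same loop serves different tasks without changing the output head; only the task string changes.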

Cite

Text

Davis et al. "End-to-End Document Recognition and Understanding with Dessurt." European Conference on Computer Vision Workshops, 2022. doi:10.1007/978-3-031-25069-9_19

Markdown

[Davis et al. "End-to-End Document Recognition and Understanding with Dessurt." European Conference on Computer Vision Workshops, 2022.](https://mlanthology.org/eccvw/2022/davis2022eccvw-endtoend/) doi:10.1007/978-3-031-25069-9_19

BibTeX

@inproceedings{davis2022eccvw-endtoend,
  title     = {{End-to-End Document Recognition and Understanding with Dessurt}},
  author    = {Davis, Brian L. and Morse, Bryan S. and Price, Brian L. and Tensmeyer, Chris and Wigington, Curtis and Morariu, Vlad I.},
  booktitle = {European Conference on Computer Vision Workshops},
  year      = {2022},
  pages     = {280--296},
  doi       = {10.1007/978-3-031-25069-9_19},
  url       = {https://mlanthology.org/eccvw/2022/davis2022eccvw-endtoend/}
}