End-to-End Document Recognition and Understanding with Dessurt
Abstract
We introduce Dessurt, a relatively simple document understanding transformer capable of being fine-tuned on a greater variety of document tasks than prior methods. It receives a document image and task string as input and generates arbitrary text autoregressively as output. Because Dessurt is an end-to-end architecture that performs text recognition in addition to document understanding, it does not require an external recognition model as prior methods do. Dessurt is a more flexible model than prior methods and is able to handle a variety of document domains and tasks. We show that this model is effective at 9 different dataset-task combinations.
Cite
Text
Davis et al. "End-to-End Document Recognition and Understanding with Dessurt." European Conference on Computer Vision Workshops, 2022. doi:10.1007/978-3-031-25069-9_19
Markdown
[Davis et al. "End-to-End Document Recognition and Understanding with Dessurt." European Conference on Computer Vision Workshops, 2022.](https://mlanthology.org/eccvw/2022/davis2022eccvw-endtoend/) doi:10.1007/978-3-031-25069-9_19
BibTeX
@inproceedings{davis2022eccvw-endtoend,
title = {{End-to-End Document Recognition and Understanding with Dessurt}},
author = {Davis, Brian L. and Morse, Bryan S. and Price, Brian L. and Tensmeyer, Chris and Wigington, Curtis and Morariu, Vlad I.},
booktitle = {European Conference on Computer Vision Workshops},
year = {2022},
pages = {280--296},
doi = {10.1007/978-3-031-25069-9_19},
url = {https://mlanthology.org/eccvw/2022/davis2022eccvw-endtoend/}
}