DTrOCR: Decoder-Only Transformer for Optical Character Recognition

Abstract

Typical text recognition methods rely on an encoder-decoder structure, in which the encoder extracts features from an image, and the decoder produces recognized text from these features. In this study, we propose a simpler and more effective method for text recognition, known as the Decoder-only Transformer for Optical Character Recognition (DTrOCR). This method uses a decoder-only Transformer to take advantage of a generative language model that is pre-trained on a large corpus. We examined whether a generative language model that has been successful in natural language processing can also be effective for text recognition in computer vision. Our experiments demonstrated that DTrOCR outperforms current state-of-the-art methods by a large margin in the recognition of printed, handwritten, and scene text in both English and Chinese.
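The abstract does not say how the image reaches the decoder, so the following is only a minimal sketch of one way a decoder-only model could consume an image and emit text in a single sequence. It assumes the image is cut into fixed-size patches that form a prefix, followed by a separator token and the text tokens, with a causal mask over the whole sequence. The function names, the patch size, and the separator id are hypothetical and are not taken from the paper.

```python
# Illustrative assumptions only: names, sizes, and the separator id are
# hypothetical and do not come from the DTrOCR paper.

def image_to_patches(image, patch_h, patch_w):
    """Split a 2D grayscale image (list of rows) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch_h or width % patch_w:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            patches.append([
                image[r][c]
                for r in range(top, top + patch_h)
                for c in range(left, left + patch_w)
            ])
    return patches


def build_sequence(patches, text_token_ids, sep_id):
    """Form one sequence: image patches as prefix, a separator, then text tokens."""
    sequence = [("patch", p) for p in patches]
    sequence.append(("token", sep_id))
    sequence.extend(("token", t) for t in text_token_ids)
    return sequence


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[col <= row for col in range(length)] for row in range(length)]


# Toy 4x8 "image" whose pixel value encodes its position.
image = [[r * 8 + c for c in range(8)] for r in range(4)]
patches = image_to_patches(image, patch_h=4, patch_w=4)
sequence = build_sequence(patches, text_token_ids=[5, 6, 7], sep_id=0)
mask = causal_mask(len(sequence))
```

In this arrangement there is no separate encoder: the same stack of decoder layers would process the patch prefix and the text, and each text position can attend to every patch because the patches come first in the sequence.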

Cite

Text

Fujitake. "DTrOCR: Decoder-Only Transformer for Optical Character Recognition." Winter Conference on Applications of Computer Vision, 2024.

Markdown

[Fujitake. "DTrOCR: Decoder-Only Transformer for Optical Character Recognition." Winter Conference on Applications of Computer Vision, 2024.](https://mlanthology.org/wacv/2024/fujitake2024wacv-dtrocr/)

BibTeX

@inproceedings{fujitake2024wacv-dtrocr,
  title     = {{DTrOCR: Decoder-Only Transformer for Optical Character Recognition}},
  author    = {Fujitake, Masato},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2024},
  pages     = {8025--8035},
  url       = {https://mlanthology.org/wacv/2024/fujitake2024wacv-dtrocr/}
}