Post-OCR Parsing: Building Simple and Robust Parser via BIO Tagging

Abstract

Parsing textual information embedded in images is important for various downstream tasks. However, many previously developed parsers can only handle information presented as a one-dimensional sequence. Here, we present the Post-OCR Tagging based parser (POT), a simple and robust parser that parses visually embedded text by BIO-tagging the output of an optical character recognition (OCR) task. Our shallow parsing approach makes it possible to build a robust neural parser with fewer than a thousand labeled examples. POT is validated on receipt and namecard parsing tasks.
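
To illustrate the BIO-tagging idea, the following minimal sketch groups already-tagged OCR tokens into labeled fields. It is not the authors' implementation: the function name `decode_bio`, the field labels such as `menu.name`, and the assumption that OCR tokens have already been put in reading order and tagged by some model are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group tokens into (field_label, text) spans using BIO tags.

    Assumes tokens are already serialized in reading order and that
    tags look like "B-<label>", "I-<label>", or "O" (hypothetical scheme).
    """
    spans: List[Tuple[str, str]] = []
    current_label = None
    current_tokens: List[str] = []

    def close_span() -> None:
        # Emit the open span, if there is one.
        if current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span.
            close_span()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag with a matching label continues the open span.
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag closes the open span;
            # the token itself is not assigned to any field.
            close_span()
            current_label, current_tokens = None, []

    close_span()
    return spans


# Hypothetical receipt line, tagged by an assumed upstream model.
tokens = ["Cafe", "Latte", "x2", "9.00", "Thank", "you"]
tags = ["B-menu.name", "I-menu.name", "B-menu.count", "B-menu.price", "O", "O"]
fields = decode_bio(tokens, tags)
```

In this sketch, `fields` holds one entry per tagged span, so the two tokens of the item name are merged into a single `menu.name` field while the untagged closing text is ignored.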

Cite

Text

Hwang et al. "Post-OCR Parsing: Building Simple and Robust Parser via BIO Tagging." NeurIPS 2019 Workshops: Document_Intelligence, 2019.

Markdown

[Hwang et al. "Post-OCR Parsing: Building Simple and Robust Parser via BIO Tagging." NeurIPS 2019 Workshops: Document_Intelligence, 2019.](https://mlanthology.org/neuripsw/2019/hwang2019neuripsw-postocr/)

BibTeX

@inproceedings{hwang2019neuripsw-postocr,
  title     = {{Post-OCR Parsing: Building Simple and Robust Parser via BIO Tagging}},
  author    = {Hwang, Wonseok and Kim, Seonghyeon and Seo, Minjoon and Yim, Jinyeong and Park, Seunghyun and Park, Sungrae and Lee, Junyeop and Lee, Bado and Lee, Hwalsuk},
  booktitle = {NeurIPS 2019 Workshops: Document_Intelligence},
  year      = {2019},
  url       = {https://mlanthology.org/neuripsw/2019/hwang2019neuripsw-postocr/}
}