DocParser: Hierarchical Document Structure Parsing from Renderings

Abstract

Translating renderings (e.g., PDFs, scans) into hierarchical document structures is in high demand across many real-world applications. However, a holistic, principled approach to inferring the complete hierarchical structure of documents is missing. As a remedy, we developed “DocParser”: an end-to-end system for parsing complete document structure, including all text elements, nested figures, tables, and table cell structures. Our second contribution is a dataset for evaluating hierarchical document structure parsing. Our third contribution is a scalable learning framework for settings where domain-specific data are scarce, which we address with a novel approach to weak supervision that significantly improves document structure parsing performance. Our experiments confirm the effectiveness of the proposed weak supervision: compared to the baseline without weak supervision, it improves the mean average precision for detecting document entities by 39.1% and the F1 score for classifying hierarchical relations by 35.8%.

Cite

Text

Rausch et al. "DocParser: Hierarchical Document Structure Parsing from Renderings." AAAI Conference on Artificial Intelligence, 2021. doi:10.1609/AAAI.V35I5.16558

Markdown

[Rausch et al. "DocParser: Hierarchical Document Structure Parsing from Renderings." AAAI Conference on Artificial Intelligence, 2021.](https://mlanthology.org/aaai/2021/rausch2021aaai-docparser/) doi:10.1609/AAAI.V35I5.16558

BibTeX

@inproceedings{rausch2021aaai-docparser,
  title     = {{DocParser: Hierarchical Document Structure Parsing from Renderings}},
  author    = {Rausch, Johannes and Martinez, Octavio and Bissig, Fabian and Zhang, Ce and Feuerriegel, Stefan},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2021},
  pages     = {4328--4338},
  doi       = {10.1609/AAAI.V35I5.16558},
  url       = {https://mlanthology.org/aaai/2021/rausch2021aaai-docparser/}
}