Harvest - A System for Creating Structured Rate Filing Data from Filing PDFs

Abstract

We present a machine-learning-guided process that efficiently extracts factor tables from unstructured rate filing documents. Our approach combines multiple deep-learning-based models that work in tandem to create structured representations of tabular data present in unstructured documents such as PDF files. On the machine-learning side, the process uses CNNs to detect tables, language-based models to extract table metadata, and conventional computer vision techniques to improve the accuracy of the extracted tabular data. The extracted tabular data is then validated through an intuitive user interface. This process, which we call Harvest, significantly reduces the time needed to extract tabular information from PDF files, enabling analysis of such data at a speed and scale that was previously unattainable.

Cite

Text

Tekin et al. "Harvest - A System for Creating Structured Rate Filing Data from Filing PDFs." AAAI Conference on Artificial Intelligence, 2022. doi:10.1609/AAAI.V36I11.21507

Markdown

[Tekin et al. "Harvest - A System for Creating Structured Rate Filing Data from Filing PDFs." AAAI Conference on Artificial Intelligence, 2022.](https://mlanthology.org/aaai/2022/tekin2022aaai-harvest/) doi:10.1609/AAAI.V36I11.21507

BibTeX

@inproceedings{tekin2022aaai-harvest,
  title     = {{Harvest - A System for Creating Structured Rate Filing Data from Filing PDFs}},
  author    = {Tekin, Ender and You, Qian and Conathan, Devin M. and Fung, Glenn Moo and Kneubuehl, Thomas S.},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2022},
  pages     = {12414-12422},
  doi       = {10.1609/AAAI.V36I11.21507},
  url       = {https://mlanthology.org/aaai/2022/tekin2022aaai-harvest/}
}