Post-OCR Document Correction with Large Ensembles of Character Sequence-to-Sequence Models
Abstract
In this paper, we propose a novel method to extend sequence-to-sequence models to accurately process sequences much longer than the ones used during training while being sample- and resource-efficient, supported by thorough experimentation. To investigate the effectiveness of our method, we apply it to the task of correcting documents already processed with Optical Character Recognition (OCR) systems using sequence-to-sequence models based on characters. We test our method on nine languages of the ICDAR 2019 competition on post-OCR text correction and achieve a new state-of-the-art performance in five of them. The strategy with the best performance involves splitting the input document into character n-grams and combining their individual corrections into the final output using a voting scheme that is equivalent to an ensemble of a large number of sequence models. We further investigate how to weigh the contributions from each one of the members of this ensemble. Our code for post-OCR correction is shared at https://github.com/jarobyte91/post_ocr_correction.
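The n-gram splitting and voting strategy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `toy_corrector` stand-in for a trained character sequence-to-sequence model, and the `triangle` weighting are all hypothetical, and the sketch makes the simplifying assumption that each corrected window has the same length as its input, so that votes can be aligned by position.

```python
from collections import Counter
from typing import Callable, List, Tuple


def split_into_windows(text: str, n: int) -> List[Tuple[int, str]]:
    """Return (start, window) pairs for every character n-gram, with stride 1."""
    if len(text) <= n:
        return [(0, text)]
    return [(i, text[i:i + n]) for i in range(len(text) - n + 1)]


def uniform(offset: int, n: int) -> float:
    """Every position in a window contributes the same vote weight."""
    return 1.0


def triangle(offset: int, n: int) -> float:
    """Hypothetical weighting: positions near the window centre count more."""
    return 1.0 + min(offset, n - 1 - offset)


def correct_with_voting(
    text: str,
    n: int,
    correct_window: Callable[[str], str],
    weight: Callable[[int, int], float] = uniform,
) -> str:
    """Correct each window independently, then vote per character position.

    Assumption: correct_window returns a string of the same length as its
    input, so the character at offset k votes for position start + k.
    """
    votes = [Counter() for _ in text]
    for start, window in split_into_windows(text, n):
        corrected = correct_window(window)[:len(window)]
        for offset, ch in enumerate(corrected):
            votes[start + offset][ch] += weight(offset, len(window))
    # Keep the original character if a position received no votes.
    return "".join(
        v.most_common(1)[0][0] if v else original
        for v, original in zip(votes, text)
    )


def toy_corrector(window: str) -> str:
    """Stand-in for a trained model: fixes '1' -> 'l' only when the window
    shows a letter on both sides, so it errs at window edges."""
    out = []
    for i, ch in enumerate(window):
        inside = 0 < i < len(window) - 1
        if ch == "1" and inside and window[i - 1].isalpha() and window[i + 1].isalpha():
            out.append("l")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(correct_with_voting("he1lo wor1d now", 5, toy_corrector))
    print(correct_with_voting("he1lo wor1d now", 5, toy_corrector, weight=triangle))
```

In this toy run, the windows that see a `1` at their edge leave it uncorrected, but they are outvoted by the overlapping windows that see it in context, so both calls print `hello world now`. Because every character position is covered by up to n overlapping windows, the vote behaves like an ensemble of many corrections; swapping `uniform` for another weighting function changes how much each window position contributes.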
Cite
Text
Ramirez-Orta et al. "Post-OCR Document Correction with Large Ensembles of Character Sequence-to-Sequence Models." AAAI Conference on Artificial Intelligence, 2022. doi:10.1609/AAAI.V36I10.21369
Markdown
[Ramirez-Orta et al. "Post-OCR Document Correction with Large Ensembles of Character Sequence-to-Sequence Models." AAAI Conference on Artificial Intelligence, 2022.](https://mlanthology.org/aaai/2022/ramirezorta2022aaai-post/) doi:10.1609/AAAI.V36I10.21369
BibTeX
@inproceedings{ramirezorta2022aaai-post,
title = {{Post-OCR Document Correction with Large Ensembles of Character Sequence-to-Sequence Models}},
author = {Ramirez-Orta, Juan Antonio and Xamena, Eduardo and Maguitman, Ana Gabriela and Milios, Evangelos E. and Soto, Axel J.},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2022},
pages = {11192-11199},
doi = {10.1609/AAAI.V36I10.21369},
url = {https://mlanthology.org/aaai/2022/ramirezorta2022aaai-post/}
}