An OCR for Classical Indic Documents Containing Arbitrarily Long Words
Abstract
OCR for printed classical Indic documents written in Sanskrit is a challenging research problem. It involves complexities such as image degradation, a lack of datasets, and very long words. Due to these challenges, the word accuracy of available OCR systems, both academic and industrial, is not very high for such documents. To address these shortcomings, we develop a Sanskrit-specific OCR system. We present an attention-based LSTM model for reading Sanskrit characters in line images. We introduce a dataset of Sanskrit document images annotated at the line level. To augment real data and enable high performance for our OCR, we also generate synthetic data via curated font selection and rendering designed to incorporate crucial glyph substitution rules. Consequently, our OCR achieves a word error rate of 15.97% and a character error rate of 3.71% on challenging Indic document texts and outperforms strong baselines. Overall, our contributions set the stage for application of OCRs on large corpora of classic Sanskrit texts containing arbitrarily long and highly conjoined words.
Cite
Text
Dwivedi et al. "An OCR for Classical Indic Documents Containing Arbitrarily Long Words." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020. doi:10.1109/CVPRW50498.2020.00288
Markdown
[Dwivedi et al. "An OCR for Classical Indic Documents Containing Arbitrarily Long Words." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020.](https://mlanthology.org/cvprw/2020/dwivedi2020cvprw-ocr/) doi:10.1109/CVPRW50498.2020.00288
BibTeX
@inproceedings{dwivedi2020cvprw-ocr,
title = {{An OCR for Classical Indic Documents Containing Arbitrarily Long Words}},
author = {Dwivedi, Agam and Saluja, Rohit and Sarvadevabhatla, Ravi Kiran},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2020},
pages = {2386-2393},
doi = {10.1109/CVPRW50498.2020.00288},
url = {https://mlanthology.org/cvprw/2020/dwivedi2020cvprw-ocr/}
}