OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition

Abstract

Recently, visually-situated text parsing (VsTP) has experienced notable advancements, driven by the increasing demand for automated document understanding and the emergence of Generative Large Language Models (LLMs) capable of processing document-based questions. Various methods have been proposed to address the challenging problem of VsTP. However, due to the diversified targets and heterogeneous schemas, previous works usually design task-specific architectures and objectives for individual tasks, which inadvertently leads to modal isolation and complex workflows. In this paper, we propose a unified paradigm for parsing visually-situated text across diverse scenarios. Specifically, we devise a universal model, called OmniParser, which can simultaneously handle three typical visually-situated text parsing tasks: text spotting, key information extraction, and table recognition. In OmniParser, all tasks share the unified encoder-decoder architecture, the unified objective (point-conditioned text generation), and the unified input & output representation (prompt & structured sequences). Extensive experiments demonstrate that the proposed OmniParser achieves state-of-the-art (SOTA) or highly competitive performance on 7 datasets for the three visually-situated text parsing tasks, despite its unified, concise design. The code is available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery.
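
To make the unified formulation concrete, below is a minimal PyTorch sketch (not the authors' implementation) of how a single encoder-decoder model can serve all three tasks through a task prompt and a point-conditioned, structured output sequence. The toy backbone, vocabulary size, token layout, and all names and hyperparameters are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch (NOT the authors' code): one encoder-decoder model driven by task
# prompts, autoregressively generating structured token sequences. Backbone, vocabulary,
# token conventions, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class UnifiedParser(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, nhead=8, num_layers=2, max_len=512):
        super().__init__()
        # Placeholder image encoder: a single conv "patchify" layer producing visual tokens.
        self.patchify = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        # images: (B, 3, H, W); tokens: (B, T) = task prompt + (teacher-forced) target
        # sequence interleaving quantized point tokens with text/structure tokens.
        memory = self.patchify(images).flatten(2).transpose(1, 2)  # (B, N, d_model)
        T = tokens.size(1)
        pos = torch.arange(T, device=tokens.device)
        tgt = self.token_embed(tokens) + self.pos_embed(pos)
        causal = torch.triu(  # autoregressive (causal) attention mask
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(out)  # next-token logits over the shared vocabulary


# Usage: the same weights serve text spotting, key information extraction, and table
# recognition; only the prompt and the expected structured output differ.
model = UnifiedParser()
images = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, 1000, (2, 16))  # e.g. [TASK] + point tokens + text tokens
logits = model(images, tokens)            # (2, 16, 1000)
```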

Cite

Text

Wan et al. "OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01481

Markdown

[Wan et al. "OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/wan2024cvpr-omniparser/) doi:10.1109/CVPR52733.2024.01481

BibTeX

@inproceedings{wan2024cvpr-omniparser,
  title     = {{OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition}},
  author    = {Wan, Jianqiang and Song, Sibo and Yu, Wenwen and Liu, Yuliang and Cheng, Wenqing and Huang, Fei and Bai, Xiang and Yao, Cong and Yang, Zhibo},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {15641-15653},
  doi       = {10.1109/CVPR52733.2024.01481},
  url       = {https://mlanthology.org/cvpr/2024/wan2024cvpr-omniparser/}
}