ODM: A Text-Image Further Alignment Pre-Training Approach for Scene Text Detection and Spotting

Abstract

In recent years, text-image joint pre-training techniques have shown promising results in various tasks. However, in Optical Character Recognition (OCR) tasks, aligning text instances with their corresponding text regions in images poses a challenge, as it requires effective alignment between text and OCR-Text (referring to the text in images as OCR-Text to distinguish it from the text in natural language) rather than a holistic understanding of the overall image content. In this paper, we propose a new pre-training method called OCR-Text Destylization Modeling (ODM) that transfers diverse styles of text found in images to a uniform style based on the text prompt. With ODM, we achieve better alignment between text and OCR-Text and enable pre-trained models to adapt to the complex and diverse styles of scene text detection and spotting tasks. Additionally, we have designed a new labeling generation method specifically for ODM and combined it with our proposed Text-Controller module to address the challenge of annotation costs in OCR tasks, allowing a larger amount of unlabeled data to participate in pre-training. Extensive experiments on multiple public datasets demonstrate that our method significantly improves performance and outperforms current pre-training methods in scene text detection and spotting tasks. Code is available at https://github.com/PriNing/ODM.

Cite

Text

Duan et al. "ODM: A Text-Image Further Alignment Pre-Training Approach for Scene Text Detection and Spotting." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01476

Markdown

[Duan et al. "ODM: A Text-Image Further Alignment Pre-Training Approach for Scene Text Detection and Spotting." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/duan2024cvpr-odm/) doi:10.1109/CVPR52733.2024.01476

BibTeX

@inproceedings{duan2024cvpr-odm,
  title     = {{ODM: A Text-Image Further Alignment Pre-Training Approach for Scene Text Detection and Spotting}},
  author    = {Duan, Chen and Fu, Pei and Guo, Shan and Jiang, Qianyi and Wei, Xiaoming},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {15587--15597},
  doi       = {10.1109/CVPR52733.2024.01476},
  url       = {https://mlanthology.org/cvpr/2024/duan2024cvpr-odm/}
}