Vision-Language Pre-Training for Boosting Scene Text Detectors

Abstract

Recently, vision-language joint representation learning has proven highly effective in various scenarios. In this paper, we adapt vision-language joint learning to scene text detection, a task that intrinsically involves interaction between vision and language, since text is the written form of language. Concretely, we propose to learn contextualized, joint representations through vision-language pre-training in order to enhance the performance of scene text detectors. To this end, we devise a pre-training architecture with an image encoder, a text encoder and a cross-modal encoder, together with three pretext tasks: image-text contrastive learning (ITC), masked language modeling (MLM) and word-in-image prediction (WIP). The pre-trained model produces more informative representations with richer semantics, which can readily benefit existing scene text detectors (such as EAST and PSENet) in the downstream text detection task. Extensive experiments on standard benchmarks demonstrate that the proposed paradigm significantly improves the performance of various representative text detectors and outperforms previous pre-training approaches. The code and pre-trained models will be publicly released.
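
Of the three pretext tasks named in the abstract, image-text contrastive learning (ITC) is the easiest to illustrate in isolation. The sketch below shows a commonly used symmetric contrastive (InfoNCE-style) objective over a batch of paired image and text embeddings. It is not taken from the paper: the function names, the temperature value of 0.07, and the use of plain Python lists in place of encoder outputs are all assumptions made for illustration, and the authors' exact formulation may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    image_embs[k] and text_embs[k] are treated as the matching pair;
    every other combination in the batch acts as a negative.
    The temperature default is an illustrative assumption.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # Cosine similarity of every image with every text, scaled by temperature.
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: each row should peak at its own index.
    i2t = -sum(log_softmax(sim[k])[k] for k in range(n)) / n

    # Text-to-image direction: same computation on the transposed matrix.
    cols = [[sim[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n

    return 0.5 * (i2t + t2i)


if __name__ == "__main__":
    # Hypothetical toy embeddings standing in for encoder outputs.
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs:", itc_loss(images, matched_texts))
    print("swapped pairs:", itc_loss(images, swapped_texts))
```

In this toy setup the loss is close to zero when each image embedding lines up with its own text embedding and grows large when the pairs are swapped, which is the behavior a contrastive pretext task relies on to pull matching image and text representations together.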

Cite

Text

Song et al. "Vision-Language Pre-Training for Boosting Scene Text Detectors." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01523

Markdown

[Song et al. "Vision-Language Pre-Training for Boosting Scene Text Detectors." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/song2022cvpr-visionlanguage/) doi:10.1109/CVPR52688.2022.01523

BibTeX

@inproceedings{song2022cvpr-visionlanguage,
  title     = {{Vision-Language Pre-Training for Boosting Scene Text Detectors}},
  author    = {Song, Sibo and Wan, Jianqiang and Yang, Zhibo and Tang, Jun and Cheng, Wenqing and Bai, Xiang and Yao, Cong},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {15681--15691},
  doi       = {10.1109/CVPR52688.2022.01523},
  url       = {https://mlanthology.org/cvpr/2022/song2022cvpr-visionlanguage/}
}