Incorporating Self-Attention Mechanism and Multi-Task Learning into Scene Text Detection

Abstract

In recent years, Mask R-CNN based methods have achieved promising performance on scene text detection tasks. This paper proposes to incorporate a self-attention mechanism and multi-task learning into Mask R-CNN based scene text detection frameworks. For the backbone, the self-attention-based Swin Transformer is adopted to replace the original ResNet backbone, and a composite network scheme is further utilized to combine two Swin Transformer networks into a single backbone. For the detection heads, a multi-task learning method is proposed that uses a cascade refinement structure for text/non-text classification, bounding box regression, mask prediction, and text line recognition. Experiments carried out on the ICDAR MLT 2017 & 2019 datasets show that the proposed method achieves improved performance.
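The self-attention mechanism underlying the Swin Transformer backbone can be illustrated with a minimal sketch of scaled dot-product self-attention. This is a generic, hypothetical NumPy implementation for illustration only (Swin additionally applies attention within shifted local windows); the function and variable names are not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of token embeddings.

    x: (n_tokens, d_model); w_q, w_k, w_v: (d_model, d_head) projections.
    Returns the attended outputs and the attention weight matrix.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (n_tokens, n_tokens)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))              # 4 tokens, 8-dim embeddings
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = self_attention(tokens, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

Each output token is a weighted average of the value vectors, with weights derived from query-key similarity; in Swin this computation is restricted to local windows for efficiency.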

Cite

Text

Ding et al. "Incorporating Self-Attention Mechanism and Multi-Task Learning into Scene Text Detection." European Conference on Computer Vision Workshops, 2022. doi:10.1007/978-3-031-25069-9_21

Markdown

[Ding et al. "Incorporating Self-Attention Mechanism and Multi-Task Learning into Scene Text Detection." European Conference on Computer Vision Workshops, 2022.](https://mlanthology.org/eccvw/2022/ding2022eccvw-incorporating/) doi:10.1007/978-3-031-25069-9_21

BibTeX

@inproceedings{ding2022eccvw-incorporating,
  title     = {{Incorporating Self-Attention Mechanism and Multi-Task Learning into Scene Text Detection}},
  author    = {Ding, Ning and Peng, Liangrui and Liu, Changsong and Zhang, Yuqi and Zhang, Ruixue and Li, Jie},
  booktitle = {European Conference on Computer Vision Workshops},
  year      = {2022},
  pages     = {314--328},
  doi       = {10.1007/978-3-031-25069-9_21},
  url       = {https://mlanthology.org/eccvw/2022/ding2022eccvw-incorporating/}
}