Incorporating Self-Attention Mechanism and Multi-Task Learning into Scene Text Detection
Abstract
In recent years, Mask R-CNN based methods have achieved promising performance on scene text detection tasks. This paper proposes to incorporate a self-attention mechanism and multi-task learning into Mask R-CNN based scene text detection frameworks. For the backbone, the self-attention-based Swin Transformer is adopted to replace the original ResNet backbone, and a composite network scheme is further utilized to combine two Swin Transformer networks into a single backbone. For the detection heads, a multi-task learning method is proposed that uses a cascade refinement structure for text/non-text classification, bounding box regression, mask prediction, and text line recognition. Experiments on the ICDAR MLT 2017 and 2019 datasets show that the proposed method achieves improved performance.
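The multi-task heads described above are typically trained by combining the per-task losses into a single training objective. The following is a minimal sketch of such a weighted-sum combination; the task names, weights, and loss values below are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: combining per-task losses for multi-task training.
# Weights default to 1.0 per task; the paper's actual weighting is not given here.
def multi_task_loss(losses, weights=None):
    """Weighted sum of per-task losses (e.g. text/non-text classification,
    bounding box regression, mask prediction, text line recognition)."""
    if weights is None:
        weights = {task: 1.0 for task in losses}
    return sum(weights[task] * value for task, value in losses.items())

# Hypothetical loss values for one training step.
total = multi_task_loss({
    "cls": 0.4,   # text/non-text classification loss
    "box": 0.7,   # bounding box regression loss
    "mask": 0.9,  # mask prediction loss
    "rec": 1.2,   # text line recognition loss
})
# total ≈ 3.2
```

In practice the per-task weights are hyperparameters balancing tasks of different scales, and in a cascade refinement structure each refinement stage contributes its own set of such losses.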
Cite

Text

Ding et al. "Incorporating Self-Attention Mechanism and Multi-Task Learning into Scene Text Detection." European Conference on Computer Vision Workshops, 2022. doi:10.1007/978-3-031-25069-9_21

Markdown

[Ding et al. "Incorporating Self-Attention Mechanism and Multi-Task Learning into Scene Text Detection." European Conference on Computer Vision Workshops, 2022.](https://mlanthology.org/eccvw/2022/ding2022eccvw-incorporating/) doi:10.1007/978-3-031-25069-9_21

BibTeX
@inproceedings{ding2022eccvw-incorporating,
title = {{Incorporating Self-Attention Mechanism and Multi-Task Learning into Scene Text Detection}},
author = {Ding, Ning and Peng, Liangrui and Liu, Changsong and Zhang, Yuqi and Zhang, Ruixue and Li, Jie},
booktitle = {European Conference on Computer Vision Workshops},
year = {2022},
  pages = {314--328},
doi = {10.1007/978-3-031-25069-9_21},
url = {https://mlanthology.org/eccvw/2022/ding2022eccvw-incorporating/}
}