Self-Supervised Pre-Training with Symmetric Superimposition Modeling for Scene Text Recognition

Abstract

Self-supervised monocular depth estimation, which does not require hard-to-source depth labels for training, has been widely studied in recent years. Owing to significant and growing demand, many lightweight yet effective architectures have been designed for edge devices. Convolutional Neural Networks (CNNs) have shown extraordinary ability in monocular depth estimation. However, their limited receptive field restricts existing methods to local reasoning, inhibiting the effectiveness of the self-supervised paradigm. Recently, Transformers have achieved great success in estimating depth maps from monocular images. Nevertheless, the massive parameter counts of Transformers hinder deployment to edge devices. In this paper, we propose MonoMixer, a brand-new lightweight CNN-Transformer architecture with three main contributions: 1) The details-augmented (DA) block employs a graph reasoning unit to capture abundant local details, resulting in more precise depth prediction. 2) The self-modulated channel attention (SMCA) block adaptively adjusts the channel weights of feature maps to emphasize crucial features and aggregate channel-wise feature maps of different patterns. 3) The global-guided Transformer (G2T) block integrates a global semantic token into multi-scale local features and exploits cross-attention to encode long-range dependencies. Furthermore, experimental results demonstrate the superiority of the proposed MonoMixer in both model size and inference speed, achieving state-of-the-art performance on three datasets: KITTI, Make3D, and Cityscapes. Specifically, our proposed MonoMixer outperforms MonoFormer by a large margin in accuracy, with about 95% fewer parameters.

Cite

Text

Gao et al. "Self-Supervised Pre-Training with Symmetric Superimposition Modeling for Scene Text Recognition." International Joint Conference on Artificial Intelligence, 2024. doi:10.24963/ijcai.2024/85

Markdown

[Gao et al. "Self-Supervised Pre-Training with Symmetric Superimposition Modeling for Scene Text Recognition." International Joint Conference on Artificial Intelligence, 2024.](https://mlanthology.org/ijcai/2024/gao2024ijcai-self/) doi:10.24963/ijcai.2024/85

BibTeX

@inproceedings{gao2024ijcai-self,
  title     = {{Self-Supervised Pre-Training with Symmetric Superimposition Modeling for Scene Text Recognition}},
  author    = {Gao, Zuan and Wang, Yuxin and Qu, Yadong and Zhang, Boqiang and Wang, Zixiao and Xu, Jianjun and Xie, Hongtao},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {767--775},
  doi       = {10.24963/ijcai.2024/85},
  url       = {https://mlanthology.org/ijcai/2024/gao2024ijcai-self/}
}