TextSSR: Diffusion-Based Data Synthesis for Scene Text Recognition

Abstract

Scene text recognition (STR) is limited by its training data: synthetic data is often insufficiently realistic, while high-quality real-world data is difficult to collect in sufficient quantity. Meanwhile, diffusion-based visual text generation methods produce holistically appealing text images but struggle to synthesize accurate and realistic instance-level text at scale. To address this, we introduce TextSSR, a novel pipeline for Synthesizing Scene Text Recognition training data. TextSSR targets three key characteristics of synthesis: accuracy, realism, and scalability. It achieves accuracy through region-centric text generation with position-glyph enhancement, which ensures proper character placement. It maintains realism by guiding style and appearance generation with contextual hints from surrounding text or background. The resulting character-aware diffusion architecture provides precise character-level control and preserves semantic coherence without relying on natural language prompts. As a result, TextSSR supports large-scale generation through combinatorial text permutations. Building on this pipeline, we present TextSSR-F, a dataset of 3.55 million quality-screened text instances. Extensive experiments show that STR models trained on TextSSR-F outperform those trained on existing synthetic datasets by clear margins on common benchmarks, and that further improvements are obtained when TextSSR-F is mixed with real-world training data. Code is available at https://github.com/YesianRohn/TextSSR.
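
The scalability claim rests on combinatorial pairing of text regions with target strings. The sketch below illustrates only that counting argument and is not the authors' implementation: the names `SynthesisJob`, `enumerate_jobs`, `regions_per_image`, and `vocabulary` are hypothetical, and the diffusion model that would render each job is omitted entirely.

```python
from itertools import product
from typing import Iterator, NamedTuple


class SynthesisJob(NamedTuple):
    """One hypothetical generation request: render `target_text` into one region."""
    image_id: str
    region_index: int
    target_text: str


def enumerate_jobs(
    regions_per_image: dict[str, int], vocabulary: list[str]
) -> Iterator[SynthesisJob]:
    """Pair every text region of every image with every candidate string."""
    for image_id, n_regions in regions_per_image.items():
        for region_index, word in product(range(n_regions), vocabulary):
            yield SynthesisJob(image_id, region_index, word)


# Toy inputs (assumed for illustration): two scene images and four words.
regions_per_image = {"img_001": 3, "img_002": 2}
vocabulary = ["EXIT", "OPEN", "SALE", "CAFE"]

jobs = list(enumerate_jobs(regions_per_image, vocabulary))
print(len(jobs))  # (3 + 2) regions x 4 words = 20 jobs
```

Because the job count is the product of the number of regions and the number of candidate strings, even a modest set of annotated images can yield a large pool of instances, which is consistent with the need for a quality-screening step before training.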

Cite

Text

Ye et al. "TextSSR: Diffusion-Based Data Synthesis for Scene Text Recognition." International Conference on Computer Vision, 2025.

Markdown

[Ye et al. "TextSSR: Diffusion-Based Data Synthesis for Scene Text Recognition." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/ye2025iccv-textssr/)

BibTeX

@inproceedings{ye2025iccv-textssr,
  title     = {{TextSSR: Diffusion-Based Data Synthesis for Scene Text Recognition}},
  author    = {Ye, Xingsong and Du, Yongkun and Tao, Yunbo and Chen, Zhineng},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {17464--17473},
  url       = {https://mlanthology.org/iccv/2025/ye2025iccv-textssr/}
}