Document Image Rectification Using Stable Diffusion Transformer

Abstract

Document images captured with handheld devices often suffer from geometric distortions caused by perspective variations, warping, and lens-induced aberrations. These distortions degrade text readability, OCR accuracy, and automated document analysis, making effective rectification essential. Traditional approaches, such as 3D reconstruction-based flattening and convolutional neural network (CNN)-based warping prediction, have shown promising results but struggle with complex, non-uniform distortions and long-range dependencies in document structure. In this paper, we propose a novel framework based on a conditional Stable Diffusion Transformer, designed specifically for document image rectification. Unlike conventional UNet-based diffusion models, which rely on hierarchical convolutional operations, our transformer-based architecture provides a global receptive field through self-attention, enabling precise structural preservation and text alignment. Furthermore, we incorporate cross-attention conditioning, allowing the model to integrate auxiliary information for improved rectification accuracy. To enhance efficiency and robustness, we introduce a coarse rectification stage that uses control points and a thin-plate spline to estimate an initial, globally aligned structure before the diffusion-based refinement. Extensive experiments on benchmark datasets demonstrate that our approach achieves state-of-the-art rectification accuracy with inference time comparable to that of existing deep learning-based solutions. By combining transformer-based modeling, generative diffusion processes, and conditional guidance, the proposed framework establishes a new paradigm for document image rectification and is effective across a wide range of document distortions.
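
The coarse stage named in the abstract rests on a standard tool: a thin-plate spline (TPS) fitted to pairs of control points. The sketch below is a generic NumPy illustration of TPS fitting and evaluation, not the authors' implementation. The function names (`fit_tps`, `apply_tps`), the kernel scaling, and the sample control points are all assumptions made for illustration; how the paper predicts its control points is not described in the abstract.

```python
import numpy as np


def _tps_kernel(r2):
    # Radial basis U = r^2 * log(r^2), proportional to the usual r^2 * log(r).
    # U(0) is defined as 0 to avoid log(0).
    out = np.zeros_like(r2)
    mask = r2 > 0
    out[mask] = r2[mask] * np.log(r2[mask])
    return out


def fit_tps(src, dst):
    """Fit a 2D thin-plate spline that maps src control points onto dst.

    src, dst: arrays of shape (n, 2), n >= 3, points not all collinear.
    Returns an (n + 3, 2) array: n radial weights followed by 3 affine rows.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]
    d2 = np.sum((src[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    K = _tps_kernel(d2)
    P = np.hstack([np.ones((n, 1)), src])  # affine basis [1, x, y]
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = K
    A[:n, n:] = P
    A[n:, :n] = P.T
    b = np.zeros((n + 3, 2))
    b[:n] = dst
    return np.linalg.solve(A, b)


def apply_tps(params, src, pts):
    """Evaluate the fitted spline at arbitrary points pts of shape (m, 2)."""
    src = np.asarray(src, dtype=float)
    pts = np.asarray(pts, dtype=float)
    d2 = np.sum((pts[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    U = _tps_kernel(d2)
    P = np.hstack([np.ones((pts.shape[0], 1)), pts])
    return U @ params[:-3] + P @ params[-3:]


if __name__ == "__main__":
    # Hypothetical control points: a flat grid and where those points
    # appear in a distorted capture (values invented for illustration).
    flat = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5]], dtype=float)
    warped = flat + np.array(
        [[0.05, -0.02], [0.0, 0.03], [-0.04, 0.0], [0.02, 0.02], [0.0, -0.05]]
    )
    params = fit_tps(flat, warped)
    print(np.abs(apply_tps(params, flat, flat) - warped).max())
```

To rectify an image with such a spline, one would typically fit the map from flat-grid coordinates to distorted-image coordinates, as above, evaluate it at every output pixel, and sample the input image at the resulting locations. The diffusion-based refinement described in the abstract would then operate on that coarse result.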

Cite

Text

Kumari and Das. "Document Image Rectification Using Stable Diffusion Transformer." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.

Markdown

[Kumari and Das. "Document Image Rectification Using Stable Diffusion Transformer." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/kumari2025cvprw-document/)

BibTeX

@inproceedings{kumari2025cvprw-document,
  title     = {{Document Image Rectification Using Stable Diffusion Transformer}},
  author    = {Kumari, Pooja and Das, Sukhendu},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2025},
  pages     = {3387--3396},
  url       = {https://mlanthology.org/cvprw/2025/kumari2025cvprw-document/}
}