A Token-Level Text Image Foundation Model for Document Understanding

Abstract

In recent years, general visual foundation models (VFMs) have seen increasing adoption, particularly as image encoders for popular multi-modal large language models (MLLMs). However, without semantically fine-grained supervision, these models still make fundamental prediction errors on downstream text-image-related tasks, i.e., perception, understanding, and reasoning over images containing small and dense text. To bridge this gap, we develop TokenFD, the first token-level visual foundation model specifically tailored for text-image-related tasks, designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenFD, we also devise a high-quality data production pipeline that constructs the first token-level image text dataset, TokenIT, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, leveraging this foundation with exceptional image-as-text capability, we seamlessly replace previous VFMs with TokenFD to construct a token-level visual-language MLLM, TokenVL, for VQA-based document understanding tasks. Finally, extensive experiments demonstrate the effectiveness of TokenFD and TokenVL. Code, demo, datasets, and weights are available at https://github.com/Token-family/TokenFD.

Cite

Text

Guan et al. "A Token-Level Text Image Foundation Model for Document Understanding." International Conference on Computer Vision, 2025.

Markdown

[Guan et al. "A Token-Level Text Image Foundation Model for Document Understanding." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/guan2025iccv-tokenlevel/)

BibTeX

@inproceedings{guan2025iccv-tokenlevel,
  title     = {{A Token-Level Text Image Foundation Model for Document Understanding}},
  author    = {Guan, Tongkun and Wang, Zining and Fu, Pei and Guo, Zhengtao and Shen, Wei and Zhou, Kai and Yue, Tiezhu and Duan, Chen and Sun, Hao and Jiang, Qianyi and Luo, Junfeng and Yang, Xiaokang},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {23210--23220},
  url       = {https://mlanthology.org/iccv/2025/guan2025iccv-tokenlevel/}
}