DocVLM: Make Your VLM an Efficient Reader

Abstract

Vision-Language Models (VLMs) excel in diverse visual tasks but face challenges in document understanding, which requires fine-grained text processing. While typical visual tasks perform well with low-resolution inputs, reading-intensive applications demand high-resolution inputs, resulting in significant computational overhead. Using OCR-extracted text in VLM prompts partially addresses this issue but underperforms compared to its full-resolution counterpart, as it lacks the complete visual context needed for optimal performance. We introduce DocVLM, a method that integrates an OCR-based modality into VLMs to enhance document processing while preserving the original weights. Our approach employs an OCR encoder to capture textual content and layout, compressing these into a compact set of learned queries incorporated into the VLM. Comprehensive evaluations across leading VLMs show that DocVLM significantly reduces reliance on high-resolution images for document understanding. In limited-token regimes (448x448), DocVLM with 64 learned queries improves DocVQA results from 56.0% to 86.6% when integrated with InternVL2 and from 84.4% to 91.2% with Qwen2-VL. In LLaVA-OneVision, DocVLM achieves improved results while using 80% fewer image tokens. The reduced token usage allows processing multiple pages effectively, showing impressive zero-shot results on DUDE and state-of-the-art performance on MP-DocVQA, highlighting DocVLM's potential for applications requiring high performance and efficiency.
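
The idea of compressing OCR content into a fixed number of learned queries can be illustrated with a small sketch. This is not the authors' architecture: the abstract does not say how the compression is done, so the sketch assumes plain single-head cross-attention, uses random vectors in place of trained queries and real OCR-encoder outputs, and picks the embedding width and OCR token count arbitrarily. The function names are hypothetical. Only the budget of 64 queries comes from the abstract.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress_with_queries(ocr_embeddings, queries):
    """Cross-attend each query over all OCR token embeddings.

    Returns one pooled vector per query, so the output length depends
    only on the number of queries, not on the number of OCR tokens.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    compressed = []
    for q in queries:
        # Scaled dot-product score between this query and every OCR token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, tok))
                  for tok in ocr_embeddings]
        weights = softmax(scores)
        # Attention-weighted average of the OCR token embeddings.
        pooled = [sum(w * tok[d] for w, tok in zip(weights, ocr_embeddings))
                  for d in range(dim)]
        compressed.append(pooled)
    return compressed


def random_vectors(count, dim, rng):
    """Stand-in for trained parameters or encoder outputs (assumption)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 16               # hypothetical embedding width
NUM_QUERIES = 64       # query budget mentioned in the abstract
NUM_OCR_TOKENS = 500   # hypothetical number of OCR tokens on a page

ocr_embeddings = random_vectors(NUM_OCR_TOKENS, DIM, rng)
queries = random_vectors(NUM_QUERIES, DIM, rng)  # would be learned in practice

compressed = compress_with_queries(ocr_embeddings, queries)
print(len(ocr_embeddings), "OCR tokens ->", len(compressed), "query vectors")
```

Under these assumptions the language model would receive 64 vectors regardless of how much text the page contains, which is the property that makes the token budget predictable.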

Cite

Text

Nacson et al. "DocVLM: Make Your VLM an Efficient Reader." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02701

Markdown

[Nacson et al. "DocVLM: Make Your VLM an Efficient Reader." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/nacson2025cvpr-docvlm/) doi:10.1109/CVPR52734.2025.02701

BibTeX

@inproceedings{nacson2025cvpr-docvlm,
  title     = {{DocVLM: Make Your VLM an Efficient Reader}},
  author    = {Nacson, Mor Shpigel and Aberdam, Aviad and Ganz, Roy and Avraham, Elad Ben and Golts, Alona and Kittenplon, Yair and Mazor, Shai and Litman, Ron},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {29005-29015},
  doi       = {10.1109/CVPR52734.2025.02701},
  url       = {https://mlanthology.org/cvpr/2025/nacson2025cvpr-docvlm/}
}