Test-Time Adaptation for Visual Document Understanding

Abstract

For visual document understanding (VDU), self-supervised pretraining has been shown to successfully generate transferable representations, yet effective adaptation of such representations to distribution shifts at test time remains an unexplored area. We propose DocTTA, a novel test-time adaptation method for documents that performs source-free domain adaptation using unlabeled target document data. DocTTA leverages cross-modality self-supervised learning via masked visual language modeling, as well as pseudo labeling, to adapt models learned on a *source* domain to an unlabeled *target* domain at test time. We introduce new benchmarks using existing public datasets for various VDU tasks, including entity recognition, key-value extraction, and document visual question answering. On these benchmarks, DocTTA shows significant improvements over the source model performance: up to 1.89% (F1 score), 3.43% (F1 score), and 17.68% (ANLS score), respectively.
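To make the adaptation recipe concrete, below is a minimal sketch of one DocTTA-style test-time update on an unlabeled target batch, combining a masked (visual) language modeling loss with a confidence-thresholded pseudo-labeling loss. It is an illustration under stated assumptions, not the paper's implementation: the toy `DocEncoder`, `mvlm_head`, and `task_head` names are hypothetical stand-ins for a pretrained multimodal document model (the real method operates on text, layout, and image inputs, whereas this sketch masks text tokens only).

```python
# A hedged, self-contained sketch of a DocTTA-style adaptation step.
# All module and function names here are illustrative, not the paper's API.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID, NUM_CLASSES, DIM = 1000, 0, 7, 64

class DocEncoder(nn.Module):
    """Toy stand-in for a pretrained multimodal document encoder."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.mvlm_head = nn.Linear(DIM, VOCAB)        # predicts masked tokens
        self.task_head = nn.Linear(DIM, NUM_CLASSES)  # e.g. entity tags

    def forward(self, ids):
        return self.embed(ids)  # (batch, seq, DIM)

def adapt_step(model, opt, input_ids, mask_prob=0.15, conf_thresh=0.9):
    # 1) Masked language modeling loss on unlabeled target tokens:
    #    mask a random subset and train the model to reconstruct them.
    mask = torch.rand(input_ids.shape) < mask_prob
    masked = input_ids.masked_fill(mask, MASK_ID)
    h = model(masked)
    mvlm_loss = F.cross_entropy(model.mvlm_head(h)[mask], input_ids[mask])

    # 2) Pseudo-labeling loss: keep only confident task predictions
    #    as training targets for the task head.
    with torch.no_grad():
        probs = model.task_head(model(input_ids)).softmax(-1)
        conf, pseudo = probs.max(-1)
        keep = conf > conf_thresh
    logits = model.task_head(model(input_ids))
    pl_loss = (F.cross_entropy(logits[keep], pseudo[keep])
               if keep.any() else logits.sum() * 0.0)

    loss = mvlm_loss + pl_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

model = DocEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
target_batch = torch.randint(1, VOCAB, (4, 32))  # unlabeled target documents
print(adapt_step(model, opt, target_batch))
```

Note that both losses use only unlabeled target data, which is what makes the adaptation source-free: no source examples or target labels are needed at test time.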

Cite

Text

Ebrahimi et al. "Test-Time Adaptation for Visual Document Understanding." Transactions on Machine Learning Research, 2023.

Markdown

[Ebrahimi et al. "Test-Time Adaptation for Visual Document Understanding." Transactions on Machine Learning Research, 2023.](https://mlanthology.org/tmlr/2023/ebrahimi2023tmlr-testtime/)

BibTeX

@article{ebrahimi2023tmlr-testtime,
  title     = {{Test-Time Adaptation for Visual Document Understanding}},
  author    = {Ebrahimi, Sayna and Arik, Sercan O and Pfister, Tomas},
  journal   = {Transactions on Machine Learning Research},
  year      = {2023},
  url       = {https://mlanthology.org/tmlr/2023/ebrahimi2023tmlr-testtime/}
}