Chronicling Germany: An Annotated Historical Newspaper Dataset

Schultze, Christian; Kerkfeld, Niklas; Kuebart, Kara; Weber, Princilia; Wolter, Moritz; Selgert, Felix

DMLR 2025 pp. 1-29

Abstract

The correct detection of dense article layout and the recognition of characters in historical newspaper pages remain challenging requirements for Natural Language Processing (NLP) and machine learning applications in the field of digital history. Digital newspaper portals for historic Germany typically provide Optical Character Recognition (OCR) text, albeit of varying quality. Unfortunately, layout information is often missing, which limits the scope of this rich source. Our dataset is designed to enable the training of layout and OCR models for historic German-language newspapers. The Chronicling Germany dataset contains 801 annotated historical newspaper pages published between 1617 and 1933. The paper presents a processing pipeline and uses it to establish baseline results on in-domain and out-of-domain test data. Both our dataset and the corresponding baseline code are freely available online. This work creates a starting point for future research in digital history and in the processing of historic German-language newspapers. Furthermore, it provides the opportunity to study a low-resource task in computer vision.

Cite

Text

Schultze et al. "Chronicling Germany: An Annotated Historical Newspaper Dataset." Data-centric Machine Learning Research, 2025.

Markdown

[Schultze et al. "Chronicling Germany: An Annotated Historical Newspaper Dataset." Data-centric Machine Learning Research, 2025.](https://mlanthology.org/dmlr/2025/schultze2025dmlr-chronicling/)

BibTeX

@article{schultze2025dmlr-chronicling,
  title     = {{Chronicling Germany: An Annotated Historical Newspaper Dataset}},
  author    = {Schultze, Christian and Kerkfeld, Niklas and Kuebart, Kara and Weber, Princilia and Wolter, Moritz and Selgert, Felix},
  journal   = {Data-centric Machine Learning Research},
  year      = {2025},
  pages     = {1-29},
  volume    = {2},
  url       = {https://mlanthology.org/dmlr/2025/schultze2025dmlr-chronicling/}
}