An Archival Perspective on Pretraining Data
Abstract
Research in NLP on pretraining data has largely focused on identifying and mitigating downstream risks in models. We argue that pretraining datasets, and the systems that produce them, deserve more critical attention in their own right. To highlight the broader range of impacts of pretraining corpora, we consider the analogy between pretraining datasets and archives. Within the broader ecosystem of datasets and models, we focus especially on the processes involved in the creation of pretraining data. By adopting an archives perspective, we surface impacts beyond directly shaping model behavior, including the role of pretraining corpora as independent data artifacts and the ways that their collection shapes future practices. In particular, we explore research in NLP that parallels the archival practice of appraisal: we consider the practice of filtering pretraining data and critically examine the problem formulations taken on by this work. In doing so, we underscore how choices about what is included in pretraining data are necessarily subjective decisions about values. We conclude by drawing on archival studies to offer insights on paths forward.
Cite
Text
Desai et al. "An Archival Perspective on Pretraining Data." NeurIPS 2023 Workshops: SoLaR, 2023.
Markdown
[Desai et al. "An Archival Perspective on Pretraining Data." NeurIPS 2023 Workshops: SoLaR, 2023.](https://mlanthology.org/neuripsw/2023/desai2023neuripsw-archival/)
BibTeX
@inproceedings{desai2023neuripsw-archival,
  title = {{An Archival Perspective on Pretraining Data}},
  author = {Desai, Meera and Jacobs, Abigail Z. and Card, Dallas},
  booktitle = {NeurIPS 2023 Workshops: SoLaR},
  year = {2023},
  url = {https://mlanthology.org/neuripsw/2023/desai2023neuripsw-archival/}
}