Paloma: A Benchmark for Evaluating Language Model Fit
Abstract
Evaluations of language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains—varying distributions of language. We introduce Perplexity Analysis for Language Model Assessment (Paloma), a benchmark to measure LM fit to 546 English and code domains, instead of assuming perplexity on one distribution extrapolates to others. We include two new datasets of the top 100 subreddits (e.g., r/depression on Reddit) and programming languages (e.g., Java on GitHub), both sources common in contemporary LMs. With our benchmark, we release 6 baseline 1B LMs carefully controlled to provide fair comparisons about which pretraining corpus is best and code for others to apply those controls to their own experiments. Our case studies demonstrate how the fine-grained results from Paloma surface findings such as that models pretrained without data beyond Common Crawl exhibit anomalous gaps in LM fit to many domains or that loss is dominated by the most frequently occurring strings in the vocabulary.
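To illustrate why per-domain reporting matters, the following minimal sketch computes perplexity separately for each domain and compares it with a single pooled figure. It is not the benchmark's released code: the function names, the `(domain, nll)` input format, and the toy numbers are assumptions made for illustration, and it only assumes that per-token negative log-likelihoods (in nats) are already available from some model.

```python
import math
from collections import defaultdict


def perplexity_by_domain(token_nlls):
    """Return {domain: perplexity} from (domain, nll) pairs, one pair per token.

    Perplexity is exp(mean negative log-likelihood) over the tokens of a domain.
    The input format is a hypothetical convenience, not a Paloma data format.
    """
    totals = defaultdict(lambda: [0.0, 0])  # domain -> [sum of nll, token count]
    for domain, nll in token_nlls:
        totals[domain][0] += nll
        totals[domain][1] += 1
    return {d: math.exp(s / n) for d, (s, n) in totals.items()}


def pooled_perplexity(token_nlls):
    """Perplexity over all tokens at once, ignoring domain boundaries."""
    nlls = [nll for _, nll in token_nlls]
    return math.exp(sum(nlls) / len(nlls))


# Toy, made-up values: two tokens from one domain, one token from another.
toy = [
    ("domain_a", math.log(2)),
    ("domain_a", math.log(2)),
    ("domain_b", math.log(8)),
]
per_domain = perplexity_by_domain(toy)
pooled = pooled_perplexity(toy)
```

In this toy input the pooled value falls between the two per-domain values and is weighted toward the domain with more tokens, so a single aggregate number can hide a poor fit to a smaller domain.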
Cite
Text
Magnusson et al. "Paloma: A Benchmark for Evaluating Language Model Fit." Neural Information Processing Systems, 2024. doi:10.52202/079017-2052
Markdown
[Magnusson et al. "Paloma: A Benchmark for Evaluating Language Model Fit." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/magnusson2024neurips-paloma/) doi:10.52202/079017-2052
BibTeX
@inproceedings{magnusson2024neurips-paloma,
title = {{Paloma: A Benchmark for Evaluating Language Model Fit}},
author = {Magnusson, Ian and Bhagia, Akshita and Hofmann, Valentin and Soldaini, Luca and Jha, Ananya Harsh and Tafjord, Oyvind and Schwenk, Dustin and Walsh, Evan Pete and Elazar, Yanai and Lo, Kyle and Groeneveld, Dirk and Beltagy, Iz and Hajishirzi, Hannaneh and Smith, Noah A. and Richardson, Kyle and Dodge, Jesse},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-2052},
url = {https://mlanthology.org/neurips/2024/magnusson2024neurips-paloma/}
}