Arnett, Catherine

2 publications

ICLR 2026. Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training. Pierre-Carl Langlais, Pavel Chizhov, Catherine Arnett, Carlos Rosas Hinostroza, Mattia Nee, Eliot Krzysztof Jones, Irène Girard, David Mach, Anastasia Stasenko, Ivan P. Yamshchikov
NeurIPS 2025. Explaining and Mitigating Crosslingual Tokenizer Inequities. Catherine Arnett, Tyler A. Chang, Stella Biderman, Ben Bergen