Croissant: A Metadata Format for ML-Ready Datasets

Abstract

Data is a critical resource for machine learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that creates a shared representation across ML tools, frameworks, and platforms. Croissant makes datasets more discoverable, portable, and interoperable, thereby addressing significant challenges in ML data management. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, and enables easy loading into the most commonly used ML frameworks, regardless of where the data is stored. Our initial evaluation by human raters shows that Croissant metadata is readable, understandable, and complete, yet concise.

Cite

Text

Akhtar et al. "Croissant: A Metadata Format for ML-Ready Datasets." Neural Information Processing Systems, 2024. doi:10.52202/079017-2610

Markdown

[Akhtar et al. "Croissant: A Metadata Format for ML-Ready Datasets." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/akhtar2024neurips-croissant/) doi:10.52202/079017-2610

BibTeX

@inproceedings{akhtar2024neurips-croissant,
  title     = {{Croissant: A Metadata Format for ML-Ready Datasets}},
  author    = {Akhtar, Mubashara and Benjelloun, Omar and Conforti, Costanza and Foschini, Luca and Gijsbers, Pieter and Giner-Miguelez, Joan and Goswami, Sujata and Jain, Nitisha and Karamousadakis, Michalis and Krishna, Satyapriya and Kuchnik, Michael and Lesage, Sylvain and Lhoest, Quentin and Marcenac, Pierre and Maskey, Manil and Mattson, Peter and Oala, Luis and Oderinwale, Hamidah and Ruyssen, Pierre and Santos, Tim and Shinde, Rajat and Simperl, Elena and Suresh, Arjun and Thomas, Goeffry and Tykhonov, Slava and Vanschoren, Joaquin and Varma, Susheel and van der Velde, Jos and Vogler, Steffen and Wu, Carole-Jean and Zhang, Luyao},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-2610},
  url       = {https://mlanthology.org/neurips/2024/akhtar2024neurips-croissant/}
}