Data Acquisition: A New Frontier in Data-Centric AI

Abstract

As Machine Learning (ML) systems continue to grow, the demand for relevant and comprehensive datasets becomes increasingly pressing. The challenges of data acquisition have received limited study, owing to ad-hoc processes and a lack of consistent methodologies. We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets, transparent pricing, and standardized data formats. With the objective of encouraging participation from the data-centric AI community, we then introduce the DAM challenge, a benchmark to model the interaction between data providers and acquirers in a data marketplace. The benchmark was released as part of DataPerf (Mazumder et al., 2022). Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in ML.

Cite

Text

Chen et al. "Data Acquisition: A New Frontier in Data-Centric AI." Data-centric Machine Learning Research, 2025.

Markdown

[Chen et al. "Data Acquisition: A New Frontier in Data-Centric AI." Data-centric Machine Learning Research, 2025.](https://mlanthology.org/dmlr/2025/chen2025dmlr-data/)

BibTeX

@article{chen2025dmlr-data,
  title     = {{Data Acquisition: A New Frontier in Data-Centric AI}},
  author    = {Chen, Lingjiao and Acun, Bilge and Ardalani, Newsha and Sun, Yifan and Kang, Feiyang and Lyu, Hanrui and Kwon, Yongchan and Jia, Ruoxi and Wu, Carole-Jean and Zaharia, Matei and Zou, James},
  journal   = {Data-centric Machine Learning Research},
  year      = {2025},
  pages     = {1--19},
  volume    = {2},
  url       = {https://mlanthology.org/dmlr/2025/chen2025dmlr-data/}
}