SAVA: Scalable Learning-Agnostic Data Valuation

ICLR 2025

Abstract

Selecting data for training machine learning models is crucial because large, web-scraped, real-world datasets contain noisy artifacts that affect the quality and relevance of individual data points. These noisy artifacts degrade model performance. We formulate this problem as a data valuation task, assigning a value to each data point in the training set according to how similar or dissimilar it is to a clean and curated validation set. Recently, *LAVA* (Just et al., 2023) demonstrated the use of optimal transport (OT) between a large noisy training dataset and a clean validation set to value training data efficiently, without depending on model performance. However, the *LAVA* algorithm requires the entire dataset as input, which limits its application to larger datasets. Inspired by the scalability of stochastic (gradient) approaches, which carry out computations on *batches* of data points instead of the entire dataset, we analogously propose *SAVA*, a scalable variant of *LAVA* that performs its computation on batches of data points. Intuitively, *SAVA* follows the same scheme as *LAVA*, which leverages hierarchically defined OT for data valuation. However, while *LAVA* processes the whole dataset at once, *SAVA* divides the dataset into batches of data points and solves the OT problem on those batches. Moreover, our theoretical derivations on the trade-off of using entropic regularization for OT problems include refinements of prior work. We perform extensive experiments to demonstrate that *SAVA* scales to large datasets with millions of data points without trading off data valuation performance. Our GitHub repository is available at https://github.com/skezle/sava.
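To make the batching idea concrete, the following is a minimal sketch of scoring training points with entropic OT solved on pairs of batches. It is not the authors' algorithm: the abstract does not specify how *SAVA* aggregates batch-level results, so this sketch simply averages over validation batches with uniform weights and uses each point's expected transport cost as its score. The function names (`sinkhorn_plan`, `batched_transport_scores`), the squared Euclidean cost on raw features, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between two uniform discrete measures (plain Sinkhorn)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass on the training batch
    b = np.full(m, 1.0 / m)  # uniform mass on the validation batch
    # Assumption: scale the regularizer by the largest cost to keep exp() stable.
    scale = max(cost.max(), 1e-12)
    K = np.exp(-cost / (eps * scale))
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]


def batched_transport_scores(train_x, val_x, batch_size=64, eps=0.1):
    """Score each training point by its mean transport cost to validation batches.

    Hypothetical aggregation: a uniform average over validation batches.
    Higher score = the point is farther, in the OT sense, from the validation data.
    """
    val_batches = [val_x[j:j + batch_size] for j in range(0, len(val_x), batch_size)]
    scores = np.zeros(len(train_x))
    for i in range(0, len(train_x), batch_size):
        xb = train_x[i:i + batch_size]
        acc = np.zeros(len(xb))
        for vb in val_batches:
            # Pairwise squared Euclidean cost between the two batches.
            cost = ((xb[:, None, :] - vb[None, :, :]) ** 2).sum(axis=-1)
            plan = sinkhorn_plan(cost, eps=eps)
            # Expected cost of moving each training point's mass under the plan.
            acc += (plan * cost).sum(axis=1) / plan.sum(axis=1)
        scores[i:i + len(xb)] = acc / len(val_batches)
    return scores


# Synthetic check: 100 clean points near the validation data, 20 shifted "noisy" points.
rng = np.random.default_rng(0)
clean = rng.normal(size=(100, 2))
noisy = rng.normal(size=(20, 2)) + 8.0
train_x = np.concatenate([clean, noisy])
is_noisy = np.arange(len(train_x)) >= 100
perm = rng.permutation(len(train_x))
train_x, is_noisy = train_x[perm], is_noisy[perm]
val_x = rng.normal(size=(80, 2))

scores = batched_transport_scores(train_x, val_x, batch_size=40)
print("mean score, clean:", scores[~is_noisy].mean())
print("mean score, noisy:", scores[is_noisy].mean())
```

Because each OT problem only involves one training batch and one validation batch, the cost matrix held in memory is `batch_size × batch_size` rather than the full training-by-validation matrix, which is the scalability motivation the abstract describes. The uniform averaging used here is the simplest possible choice; the hierarchical OT scheme mentioned in the abstract would weight batch pairs differently.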

Cite

Text

Kessler et al. "SAVA: Scalable Learning-Agnostic Data Valuation." International Conference on Learning Representations, 2025.

Markdown

[Kessler et al. "SAVA: Scalable Learning-Agnostic Data Valuation." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/kessler2025iclr-sava/)

BibTeX

@inproceedings{kessler2025iclr-sava,
  title     = {{SAVA: Scalable Learning-Agnostic Data Valuation}},
  author    = {Kessler, Samuel and Le, Tam and Nguyen, Vu},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/kessler2025iclr-sava/}
}