KAIROS: Scalable Model-Agnostic Data Valuation
Abstract
Data valuation techniques quantify each training example's contribution to model performance, providing a principled basis for data cleaning, acquisition, and selection. Existing valuation methods remain inadequate: model-based techniques depend on a single fitted model and inherit its biases, while algorithm-based approaches like Data Shapley scale poorly due to their need to train multiple models. Recent work has proposed model-agnostic alternatives based on Wasserstein distance between the training set and a clean reference set, but exact computation is expensive and approximations often misrank examples. We introduce KAIROS, a model-agnostic framework that values examples by their contribution to the Maximum Mean Discrepancy (MMD) between the training set and a clean reference distribution. Unlike Wasserstein methods, MMD admits a closed-form solution that requires no approximations and is scalable to large datasets. Additionally, KAIROS enables efficient online valuation: adding a new batch of $m$ examples requires only $O(mN)$ computation to update all scores, compared to $O(N^2)$ in prior work, where $N$ is the training set size. Empirical evaluations on noise, mislabeling, and poisoning benchmarks show that KAIROS consistently outperforms state-of-the-art baselines in both accuracy and runtime. On ImageNet, KAIROS achieves up to a $15\times$ speedup over the fastest baseline while maintaining superior data valuation quality. Our results demonstrate that model-agnostic methods can match or exceed model-based approaches in performance while scaling to large datasets.
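To make the MMD idea concrete, here is a minimal NumPy sketch. It is not the KAIROS algorithm: the abstract does not give the scoring formula, so the sketch uses the standard empirical MMD witness function as a stand-in per-example score. The Gaussian kernel, the bandwidth, the synthetic data, and every name (`rbf_kernel`, `mmd_squared`, `witness_scores`, `OnlineWitness`) are assumptions made for illustration. The `OnlineWitness` class only illustrates why a kernel-sum score can be refreshed for a new batch of $m$ points with an $m \times N$ kernel block against the existing points (plus smaller $m \times m$ and batch-to-reference blocks), rather than recomputing all $N^2$ pairs.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd_squared(train, ref, bandwidth=1.0):
    """Biased (V-statistic) estimate of MMD^2 between two samples."""
    return (
        rbf_kernel(train, train, bandwidth).mean()
        - 2.0 * rbf_kernel(train, ref, bandwidth).mean()
        + rbf_kernel(ref, ref, bandwidth).mean()
    )


def witness_scores(train, ref, bandwidth=1.0):
    """Empirical MMD witness function at each training point (illustrative proxy).

    Larger values mark points lying where the training sample has more
    mass than the reference sample.
    """
    return (
        rbf_kernel(train, train, bandwidth).mean(axis=1)
        - rbf_kernel(train, ref, bandwidth).mean(axis=1)
    )


class OnlineWitness:
    """Hypothetical helper: caches kernel sums so a new batch needs an m x N block."""

    def __init__(self, ref, bandwidth=1.0):
        self.ref = ref
        self.bandwidth = bandwidth
        self.points = np.empty((0, ref.shape[1]))
        self.train_sums = np.empty(0)  # sum_j k(x_i, x_j) over current training points
        self.ref_means = np.empty(0)   # mean_l k(x_i, y_l) over the reference sample

    def add_batch(self, batch):
        cross = rbf_kernel(batch, self.points, self.bandwidth)  # shape (m, N)
        within = rbf_kernel(batch, batch, self.bandwidth)       # shape (m, m)
        self.train_sums = np.concatenate([
            self.train_sums + cross.sum(axis=0),        # update existing points
            cross.sum(axis=1) + within.sum(axis=1),     # sums for the new points
        ])
        self.ref_means = np.concatenate([
            self.ref_means,
            rbf_kernel(batch, self.ref, self.bandwidth).mean(axis=1),
        ])
        self.points = np.vstack([self.points, batch])

    def scores(self):
        return self.train_sums / len(self.points) - self.ref_means


# Synthetic data (assumed for the demo): a clean reference sample and a
# training set whose last 20 points come from a shifted distribution.
rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, size=(500, 2))
clean = rng.normal(0.0, 1.0, size=(180, 2))
outliers = rng.normal(6.0, 0.1, size=(20, 2))
train = np.vstack([clean, outliers])

scores = witness_scores(train, ref)
flagged = np.argsort(scores)[-20:]  # indices of the 20 highest witness values

online = OnlineWitness(ref)
online.add_batch(clean)
online.add_batch(outliers)          # incremental update for the second batch
online_scores = online.scores()
```

In this toy setup the shifted points sit far from every reference point, so their witness values are the largest and `flagged` recovers them. Real data would need a deliberately chosen kernel, bandwidth, and feature space, which the abstract does not specify.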
Cite
Text
Zhu et al. "KAIROS: Scalable Model-Agnostic Data Valuation." Advances in Neural Information Processing Systems, 2025.
Markdown
[Zhu et al. "KAIROS: Scalable Model-Agnostic Data Valuation." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/zhu2025neurips-kairos/)
BibTeX
@inproceedings{zhu2025neurips-kairos,
  title     = {{KAIROS: Scalable Model-Agnostic Data Valuation}},
  author    = {Zhu, Jiongli and Prashant, Parjanya Prajakta and Cloninger, Alex and Salimi, Babak},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/zhu2025neurips-kairos/}
}