Coresets from Trajectories: Selecting Data via Correlation of Loss Differences
Abstract
Deep learning models achieve state-of-the-art performance across domains but face scalability challenges in real-time or resource-constrained scenarios. To address this, we propose Correlation of Loss Differences ($\mathtt{CLD}$), a simple and scalable metric for coreset selection that identifies the most impactful training samples by measuring their alignment with the loss trajectories of a held-out validation set. $\mathtt{CLD}$ is highly efficient, requiring only per-sample loss values computed at training checkpoints, and avoiding the costly gradient and curvature computations used in many existing subset selection methods. We develop a general theoretical framework that establishes convergence guarantees for $\mathtt{CLD}$-based coresets, demonstrating that the convergence error is upper-bounded by the alignment of the selected samples and the representativeness of the validation set. On CIFAR-100 and ImageNet-1k, $\mathtt{CLD}$-based coresets typically outperform or closely match state-of-the-art methods across subset sizes, and remain within 1\% of more computationally expensive baselines even when not leading. $\mathtt{CLD}$ transfers effectively across architectures (ResNet, VGG, DenseNet), enabling proxy-to-target selection with $<1\%$ degradation. Moreover, $\mathtt{CLD}$ is stable when using only early checkpoints, incurring negligible accuracy loss. Finally, $\mathtt{CLD}$ exhibits inherent bias reduction via per-class validation alignment, obviating the need for additional stratified sampling. Together, these properties make $\mathtt{CLD}$ a principled, efficient, stable, and transferable tool for scalable dataset optimization.
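The abstract does not state the exact formula, so the following is only a minimal sketch of the idea under stated assumptions. It assumes the score is the Pearson correlation between a training sample's checkpoint-to-checkpoint loss differences and the mean loss differences of validation samples from the same class, and that the coreset keeps the highest-scoring samples per class. The function names `cld_scores` and `select_coreset`, the array layout, and the top-k rule are hypothetical, not taken from the paper.

```python
import numpy as np


def cld_scores(train_losses, val_losses, train_labels, val_labels):
    """Score each training sample by correlation with its class's validation trajectory.

    train_losses: (n_train, T) per-sample losses at T checkpoints.
    val_losses:   (n_val, T) per-sample losses at the same checkpoints.
    Assumption: every class in train_labels also appears in val_labels.
    """
    d_train = np.diff(train_losses, axis=1)  # loss differences, shape (n_train, T-1)
    d_val = np.diff(val_losses, axis=1)      # loss differences, shape (n_val, T-1)
    scores = np.zeros(len(train_losses))
    for c in np.unique(train_labels):
        mask = train_labels == c
        # Mean validation loss-difference trajectory for this class (assumed reference).
        ref = d_val[val_labels == c].mean(axis=0)
        ref_centered = ref - ref.mean()
        rows = d_train[mask]
        rows_centered = rows - rows.mean(axis=1, keepdims=True)
        denom = np.linalg.norm(rows_centered, axis=1) * np.linalg.norm(ref_centered)
        # Pearson correlation; flat trajectories (zero variance) get a score of 0.
        scores[mask] = (rows_centered @ ref_centered) / np.maximum(denom, 1e-12)
    return scores


def select_coreset(scores, labels, per_class):
    """Keep the per_class highest-scoring samples of each class (assumed selection rule)."""
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        top = idx[np.argsort(scores[idx])[::-1][:per_class]]
        keep.extend(top.tolist())
    return np.array(sorted(keep))
```

Only per-sample losses recorded at checkpoints are needed here, which matches the abstract's claim that no gradient or curvature computation is required. Restricting `train_losses` and `val_losses` to their first few columns would correspond to scoring with early checkpoints only.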
Cite
Text
Nagaraj et al. "Coresets from Trajectories: Selecting Data via Correlation of Loss Differences." Transactions on Machine Learning Research, 2025.
Markdown
[Nagaraj et al. "Coresets from Trajectories: Selecting Data via Correlation of Loss Differences." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/nagaraj2025tmlr-coresets/)
BibTeX
@article{nagaraj2025tmlr-coresets,
title = {{Coresets from Trajectories: Selecting Data via Correlation of Loss Differences}},
author = {Nagaraj, Manish and Ravikumar, Deepak and Roy, Kaushik},
journal = {Transactions on Machine Learning Research},
year = {2025},
url = {https://mlanthology.org/tmlr/2025/nagaraj2025tmlr-coresets/}
}