REDUCR: Robust Data Downsampling Using Class Priority Reweighting

Abstract

Modern machine learning models are becoming increasingly expensive to train for real-world image and text classification tasks, where massive web-scale data is collected in a streaming fashion. To reduce the training cost, online batch selection techniques have been developed to choose the most informative datapoints. However, these techniques can suffer from poor worst-class generalization performance due to class imbalance and distributional shifts. This work introduces REDUCR, a robust and efficient data downsampling method that uses class priority reweighting. REDUCR reduces the training data while preserving worst-class generalization performance, assigning priority weights to datapoints in a class-aware manner using an online learning algorithm. We demonstrate the data efficiency and robust performance of REDUCR on vision and text classification tasks. On web-scraped datasets with imbalanced class distributions, REDUCR achieves significant test accuracy boosts for the worst-performing class (and also on average), surpassing state-of-the-art methods by around 14%.
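
The abstract does not spell out the update rule, but class priority reweighting can be read as a Hedge-style multiplicative-weights scheme: classes that are currently performing worst receive higher priority, and candidate datapoints are scored by their loss scaled by their class's priority. The sketch below is a minimal illustration under that assumption; the function names, the learning rate `eta`, and the toy data are hypothetical, not the paper's actual algorithm.

```python
import numpy as np

def update_class_weights(weights, class_losses, eta=0.1):
    """Hedge-style multiplicative update: classes with higher loss
    (i.e. worse current performance) get higher priority weight."""
    weights = weights * np.exp(eta * class_losses)
    return weights / weights.sum()  # renormalize to a distribution

def select_batch(losses, labels, class_weights, k):
    """Score each candidate by its per-example loss scaled by the
    priority weight of its class, then keep the top-k datapoints."""
    scores = losses * class_weights[labels]
    return np.argsort(scores)[-k:]

# Toy usage: 3 classes, a streamed batch of 8 candidate datapoints.
rng = np.random.default_rng(0)
num_classes, k = 3, 4
class_weights = np.full(num_classes, 1.0 / num_classes)

labels = rng.integers(0, num_classes, size=8)
losses = rng.random(8)  # stand-in for per-example model losses
selected = select_batch(losses, labels, class_weights, k)

# After training on the selected points, update the class priorities
# from per-class holdout losses (placeholder values here).
class_losses = np.array([0.9, 0.2, 0.5])
class_weights = update_class_weights(class_weights, class_losses)
print(selected, class_weights)
```

Because underperforming classes keep accumulating weight until their losses fall, a selection rule of this form tends to protect worst-class performance rather than only the average, which matches the robustness claim in the abstract.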

Cite

Text

Bankes et al. "REDUCR: Robust Data Downsampling Using Class Priority Reweighting." NeurIPS 2023 Workshops: ReALML, 2023.

Markdown

[Bankes et al. "REDUCR: Robust Data Downsampling Using Class Priority Reweighting." NeurIPS 2023 Workshops: ReALML, 2023.](https://mlanthology.org/neuripsw/2023/bankes2023neuripsw-reducr/)

BibTeX

@inproceedings{bankes2023neuripsw-reducr,
  title     = {{REDUCR: Robust Data Downsampling Using Class Priority Reweighting}},
  author    = {Bankes, William and Hughes, George and Bogunovic, Ilija and Wang, Zi},
  booktitle = {NeurIPS 2023 Workshops: ReALML},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/bankes2023neuripsw-reducr/}
}