Are Large-Scale Soft Labels Necessary for Large-Scale Dataset Distillation?

Xiao, Lingao; He, Yang

doi:10.52202/079017-0524

Are Large-Scale Soft Labels Necessary for Large-Scale Dataset Distillation?

Lingao Xiao, Yang He

NeurIPS 2024

doi:10.52202/079017-0524 /neurips/2024/xiao2024neurips-largescale/

Abstract

In ImageNet-condensation, the storage for auxiliary soft labels exceeds that of the condensed dataset by over 30 times. However, are large-scale soft labels necessary for large-scale dataset distillation?In this paper, we first discover that the high within-class similarity in condensed datasets necessitates the use of large-scale soft labels. This high within-class similarity can be attributed to the fact that previous methods use samples from different classes to construct a single batch for batch normalization (BN) matching. To reduce the within-class similarity, we introduce class-wise supervision during the image synthesizing process by batching the samples within classes, instead of across classes. As a result, we can increase within-class diversity and reduce the size of required soft labels. A key benefit of improved image diversity is that soft label compression can be achieved through simple random pruning, eliminating the need for complex rule-based strategies. Experiments validate our discoveries. For example, when condensing ImageNet-1K to 200 images per class, our approach compresses the required soft labels from 113 GB to 2.8 GB (40$\times$ compression) with a 2.6\% performance gain. Code is available at: https://github.com/he-y/soft-label-pruning-for-dataset-distillation

PDF NeurIPS OpenReview Semantic Scholar

Cite

Text

Xiao and He. "Are Large-Scale Soft Labels Necessary for Large-Scale Dataset Distillation?." Neural Information Processing Systems, 2024. doi:10.52202/079017-0524

Markdown

[Xiao and He. "Are Large-Scale Soft Labels Necessary for Large-Scale Dataset Distillation?." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/xiao2024neurips-largescale/) doi:10.52202/079017-0524

BibTeX

@inproceedings{xiao2024neurips-largescale,
  title     = {{Are Large-Scale Soft Labels Necessary for Large-Scale Dataset Distillation?}},
  author    = {Xiao, Lingao and He, Yang},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-0524},
  url       = {https://mlanthology.org/neurips/2024/xiao2024neurips-largescale/}
}