Informative Synthetic Data Generation for Thorax Disease Classification
Abstract
Deep Neural Networks (DNNs), including architectures such as Vision Transformers (ViTs), have achieved remarkable success in medical imaging tasks. However, their performance typically hinges on the availability of large-scale, high-quality labeled datasets, resources that are often scarce or infeasible to obtain in medical domains. Generative Data Augmentation (GDA) offers a promising remedy by supplementing training sets with synthetic data produced by generative models such as Diffusion Models (DMs). Yet this approach introduces a critical challenge: synthetic data often contains significant noise, which can degrade the performance of classifiers trained on the augmented datasets. Prior solutions, including data selection and re-weighting techniques, often rely on access to clean metadata or pretrained external classifiers. In this work, we propose Informative Data Selection (IDS), a principled sample re-weighting framework grounded in the Information Bottleneck (IB) principle. IDS assigns higher weights to more informative synthetic samples, thereby improving classifier performance in GDA-enhanced training for thorax disease classification. Extensive experiments demonstrate that IDS significantly outperforms existing data selection and re-weighting baselines. Our code is publicly available at https://github.com/Statistical-Deep-Learning/IDS.
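To make the idea of sample re-weighting concrete, the following is a minimal sketch of how per-sample weights can enter a multi-label classification loss. It is not the IDS method: the abstract does not specify how the IB-based weights are computed, so the `weights` argument here is a hypothetical placeholder supplied by the caller, and the function name `weighted_bce` and its signature are illustrative assumptions only.

```python
import math


def weighted_bce(probs, labels, weights, eps=1e-12):
    """Weighted mean of per-sample binary cross-entropy.

    probs[i][k]  : predicted probability of finding k for sample i
    labels[i][k] : 0 or 1 ground-truth label for finding k
    weights[i]   : non-negative weight for sample i (hypothetical;
                   how IDS derives its weights is not given in the abstract)
    """
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    loss = 0.0
    for p_row, y_row, w in zip(probs, labels, weights):
        sample_loss = 0.0
        for p, y in zip(p_row, y_row):
            # Clamp to avoid log(0).
            p = min(max(p, eps), 1.0 - eps)
            sample_loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
        # Average over findings, then scale by the sample's weight.
        loss += w * sample_loss / len(p_row)

    # Normalize so the result is a weighted mean over samples.
    return loss / total_weight
```

With uniform weights this reduces to the ordinary mean loss; assigning a weight of zero to a sample removes its contribution entirely, which is the limiting case of down-weighting a noisy synthetic image.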
Cite
Text
Wang et al. "Informative Synthetic Data Generation for Thorax Disease Classification." Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, 2025.
Markdown
[Wang et al. "Informative Synthetic Data Generation for Thorax Disease Classification." Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, 2025.](https://mlanthology.org/uai/2025/wang2025uai-informative/)
BibTeX
@inproceedings{wang2025uai-informative,
title = {{Informative Synthetic Data Generation for Thorax Disease Classification}},
author = {Wang, Yancheng and Goel, Rajeev and Jojic, Marko and Silva, Alvin C. and Wu, Teresa and Yang, Yingzhen},
booktitle = {Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence},
year = {2025},
pages = {4489-4514},
volume = {286},
url = {https://mlanthology.org/uai/2025/wang2025uai-informative/}
}