Benign Overfitting in Adversarial Training of Neural Networks

Abstract

Benign overfitting is the phenomenon wherein none of the predictors in the hypothesis class can achieve perfect accuracy (i.e., the non-realizable or noisy setting), yet a model that interpolates the training data still achieves good generalization. A series of recent works aims to understand this phenomenon for regression and classification tasks using linear predictors as well as two-layer neural networks. In this paper, we study such a benign overfitting phenomenon in an adversarial setting. We show that under a distributional assumption, interpolating neural networks found using adversarial training generalize well despite inference-time attacks. Specifically, we provide convergence and generalization guarantees for adversarial training of two-layer networks (with smooth as well as non-smooth activation functions), showing that under a moderate $\ell_2$-norm perturbation budget, the trained model has near-zero robust training loss and near-optimal robust generalization error. We support our theoretical findings with an empirical study on synthetic and real-world data.
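The setting described above, adversarial training of a two-layer network against $\ell_2$-bounded perturbations, can be illustrated with a short PyTorch sketch. This is a minimal, assumed implementation rather than the paper's exact algorithm: the architecture, synthetic data, perturbation budget eps, step sizes, and use of PGD for the inner maximization are illustrative choices. An inner loop searches for a worst-case perturbation within the $\ell_2$ ball, and the outer loop minimizes the loss on those perturbed examples (the robust training loss).

# Minimal sketch of adversarial training of a two-layer network with an
# l2-bounded perturbation. Hyperparameters and data are illustrative
# assumptions, not the paper's configuration.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic binary classification data (labels in {0, 1}).
n, d = 200, 20
X = torch.randn(n, d)
y = (X[:, 0] > 0).long()

# Two-layer network with a smooth activation (the paper also covers non-smooth ReLU).
model = nn.Sequential(nn.Linear(d, 64), nn.Tanh(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

eps, alpha, pgd_steps = 0.5, 0.1, 10  # l2 budget, inner step size, PGD iterations


def l2_pgd_attack(x, y):
    """Approximately maximize the loss over l2-bounded perturbations (inner problem)."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(pgd_steps):
        loss = loss_fn(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        # Ascent step in the normalized gradient direction.
        grad_norm = grad.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1)
        delta = delta + alpha * grad / grad_norm
        # Project each perturbation back onto the l2 ball of radius eps.
        delta_norm = delta.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1)
        delta = (delta * (eps / delta_norm).clamp(max=1.0)).detach().requires_grad_(True)
    return delta.detach()


for epoch in range(100):
    delta = l2_pgd_attack(X, y)          # inner maximization: find adversarial perturbations
    opt.zero_grad()
    robust_loss = loss_fn(model(X + delta), y)  # loss on worst-case perturbed inputs
    robust_loss.backward()               # outer minimization: update network parameters
    opt.step()

print(f"final robust training loss: {robust_loss.item():.4f}")

With full-batch training on this toy problem, the robust training loss is driven toward zero, mirroring the interpolation regime the paper analyzes; the paper's contribution is showing that, under its distributional assumption, this interpolation does not hurt robust generalization.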

Cite

Text

Wang et al. "Benign Overfitting in Adversarial Training of Neural Networks." International Conference on Machine Learning, 2024.

Markdown

[Wang et al. "Benign Overfitting in Adversarial Training of Neural Networks." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/wang2024icml-benign/)

BibTeX

@inproceedings{wang2024icml-benign,
  title     = {{Benign Overfitting in Adversarial Training of Neural Networks}},
  author    = {Wang, Yunjuan and Zhang, Kaibo and Arora, Raman},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {52171--52232},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/wang2024icml-benign/}
}