Beyond Adversarial Robustness: Breaking the Robustness-Alignment Trade-Off in Object Recognition
Abstract
A well-known limitation of deep neural networks (DNNs) is their sensitivity to adversarial attacks. That DNNs can easily be fooled by minute image perturbations imperceptible to humans has long been considered a significant vulnerability of deep learning, one that may eventually force a shift towards modeling paradigms that are faithful to biology. Nevertheless, the ever-evolving capabilities of DNNs have largely eclipsed these early concerns. Do adversarial perturbations continue to pose a threat to DNNs? Here, we investigate whether DNN improvements in image categorization have led to concurrent improvements in robustness to adversarial perturbations. We evaluated DNN adversarial robustness in two ways. First, we measured the tolerance of DNNs to adversarial perturbations by recording the norm of the smallest image perturbation needed to change a model's decision, using a standard "minimum-norm" robustness metric. Second, we measured the alignment of perturbations: the degree to which they target pixels that are diagnostic for human observers. We uncover a surprising trade-off: as DNNs have improved on ImageNet, they have grown more tolerant to adversarial perturbations. However, these perturbations have also become progressively less aligned with features that are critical to humans for object recognition. To better understand the source of this trade-off, we turn to DNN training methods that have previously been reported to align DNNs with human vision, namely adversarial training and harmonization. Our results show that both methods improve this trade-off, significantly increasing DNNs' tolerance to perturbations and the alignment of those perturbations with human visual features. Harmonized models, unlike adversarially trained ones, are also able to maintain their ImageNet accuracy in the process. Our findings suggest that the vulnerability of DNNs to adversarial perturbations can be at least partially mitigated by augmenting the model-scaling trends driving development today with training routines that align models with biological intelligence. We release our code and data to support continued progress in studying the adversarial behavior of DNNs.
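The paper's released code is not reproduced here; as a rough illustration of the two measurements described in the abstract, the sketch below estimates a minimum-norm L2 perturbation by binary-searching the budget of a PGD attack, and rank-correlates the resulting perturbation's per-pixel magnitude with a human feature-importance map. The function names, attack hyperparameters, and the choice of PGD and Spearman correlation are illustrative assumptions, not the authors' method.

```python
import torch
import torch.nn.functional as F
from scipy.stats import spearmanr


def l2_pgd(model, x, y, eps, steps=50):
    """L2-constrained PGD: maximize the loss within an L2 ball of radius eps."""
    alpha = 2.5 * eps / steps                      # common step-size heuristic
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        # Step along the L2-normalized gradient direction.
        g_norm = grad.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
        delta = delta.detach() + alpha * grad / g_norm
        # Project the perturbation back onto the L2 ball of radius eps.
        d_norm = delta.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
        delta = (delta * (eps / d_norm).clamp(max=1.0)).requires_grad_(True)
    return (x + delta).clamp(0, 1).detach()


def min_norm_l2(model, x, y, eps_hi=10.0, iters=12, steps=50):
    """Binary-search the smallest L2 budget at which PGD flips the label.

    Returns an upper bound on the minimum-norm perturbation for a single
    image (batch of 1), or inf if no attack within eps_hi succeeds."""
    lo, hi, best = 0.0, eps_hi, float("inf")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        x_adv = l2_pgd(model, x, y, eps=mid, steps=steps)
        with torch.no_grad():
            fooled = (model(x_adv).argmax(dim=1) != y).item()
        if fooled:
            best = (x_adv - x).flatten(1).norm(dim=1).item()
            hi = mid                               # success: try a smaller budget
        else:
            lo = mid                               # failure: allow a larger budget
    return best


def perturbation_alignment(delta, human_map):
    """Spearman rank correlation between per-pixel perturbation magnitude
    (channels collapsed) and a human feature-importance map of shape (H, W)."""
    mag = delta.abs().sum(dim=1).flatten().cpu().numpy()
    rho, _ = spearmanr(mag, human_map.flatten().cpu().numpy())
    return rho
```

Usage, under the same assumptions: with `x` a single image tensor of shape (1, 3, H, W) in [0, 1] and `y` its label, `min_norm_l2(model, x, y)` returns the smallest successful L2 perturbation norm found, and `perturbation_alignment(x_adv - x, human_map)` scores how strongly that perturbation overlaps with pixels a human importance map marks as diagnostic.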
Cite
Text
Feng et al. "Beyond Adversarial Robustness: Breaking the Robustness-Alignment Trade-Off in Object Recognition." ICLR 2025 Workshops: Re-Align, 2025.
Markdown
[Feng et al. "Beyond Adversarial Robustness: Breaking the Robustness-Alignment Trade-Off in Object Recognition." ICLR 2025 Workshops: Re-Align, 2025.](https://mlanthology.org/iclrw/2025/feng2025iclrw-beyond/)
BibTeX
@inproceedings{feng2025iclrw-beyond,
title = {{Beyond Adversarial Robustness: Breaking the Robustness-Alignment Trade-Off in Object Recognition}},
author = {Feng, Pinyuan and Linsley, Drew and Boissin, Thibaut and Ashok, Alekh Karkada and Fel, Thomas and Olaiya, Stephanie and Serre, Thomas},
booktitle = {ICLR 2025 Workshops: Re-Align},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/feng2025iclrw-beyond/}
}