Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models

Abstract

Capability evaluations play a crucial role in assessing and regulating frontier AI systems. The effectiveness of these evaluations faces a significant challenge: strategic underperformance, or "sandbagging", where models deliberately underperform during evaluation. Sandbagging can manifest either through explicit developer intervention or through unintended model behavior, presenting a fundamental obstacle to accurate capability assessment. We introduce a novel sandbagging detection method based on injecting noise of varying magnitudes into model weights. While non-sandbagging models show predictable performance degradation with increasing noise, we demonstrate that sandbagging models exhibit anomalous performance improvements, likely due to disruption of underperformance mechanisms while core capabilities remain partially intact. Through experiments across various model architectures, sizes, and sandbagging techniques, we establish this distinctive response pattern as a reliable, model-agnostic signal for detecting sandbagging behavior. Importantly, we find that noise injection is capable of eliciting the full performance of Mistral Large 120B in a setting where the model underperforms without being instructed to do so. Our findings provide a practical tool for AI evaluation and oversight, addressing a challenge in ensuring accurate capability assessment of frontier AI systems.
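The detection idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the dictionary-of-lists weight representation, the `evaluate` callback, and the `margin` threshold are all hypothetical and chosen only for illustration. The sketch adds zero-mean Gaussian noise of increasing standard deviation to every weight, scores the perturbed model each time, and flags a model whose best noisy score exceeds its unperturbed baseline.

```python
import random
from typing import Callable, Dict, List, Sequence

# Hypothetical weight format: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def add_gaussian_noise(weights: Weights, std: float, rng: random.Random) -> Weights:
    """Return a copy of `weights` with N(0, std^2) noise added to each value."""
    return {
        name: [w + rng.gauss(0.0, std) for w in values]
        for name, values in weights.items()
    }


def noise_sweep(
    weights: Weights,
    evaluate: Callable[[Weights], float],
    stds: Sequence[float],
    seed: int = 0,
) -> List[float]:
    """Score the model once per noise level; `evaluate` is a user-supplied benchmark."""
    rng = random.Random(seed)
    return [evaluate(add_gaussian_noise(weights, std, rng)) for std in stds]


def improves_under_noise(
    baseline: float, noisy_scores: Sequence[float], margin: float = 0.05
) -> bool:
    """Flag the anomalous pattern: some noisy score beats the baseline by `margin`.

    The margin value is an assumption of this sketch, not a figure from the paper.
    """
    return max(noisy_scores) > baseline + margin
```

In this sketch, a model whose scores only fall as noise grows is not flagged, while a model whose score rises above its baseline at some noise level is. In practice the sweep would likely be repeated over several seeds before drawing any conclusion.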

Cite

Text

Tice et al. "Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models." Advances in Neural Information Processing Systems, 2025.

Markdown

[Tice et al. "Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/tice2025neurips-noise/)

BibTeX

@inproceedings{tice2025neurips-noise,
  title     = {{Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models}},
  author    = {Tice, Cameron and Kreer, Philipp Alexander and Helm-Burger, Nathan and Shahani, Prithviraj Singh and Ryzhenkov, Fedor and Roger, Fabien and Neo, Clement and Haimes, Jacob and Hofstätter, Felix and van der Weij, Teun},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/tice2025neurips-noise/}
}