Estimating Worst-Case Frontier Risks of Open-Weight LLMs
Abstract
In this paper, we study the worst-case frontier risks of the OpenAI gpt-oss model. We introduce malicious fine-tuning (MFT), where we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as capable as possible in two domains: biology and cybersecurity. To maximize biological risk (biorisk), we curate tasks related to threat creation and train gpt-oss in an RL environment with web browsing. To maximize cybersecurity risk, we train gpt-oss in an agentic coding environment to solve capture-the-flag (CTF) challenges. We compare these MFT models against open- and closed-weight LLMs on frontier risk evaluations. Compared to frontier closed-weight models, MFT gpt-oss underperforms OpenAI o3, a model that is below Preparedness High capability level for biorisk and cybersecurity. Compared to open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier. Taken together, these results led us to believe that the net new harm from releasing gpt-oss is limited, and we hope that our MFT approach can serve as useful guidance for estimating harm from future open-weight releases.
Cite
Text
Wallace et al. "Estimating Worst-Case Frontier Risks of Open-Weight LLMs." International Conference on Learning Representations, 2026.
Markdown
[Wallace et al. "Estimating Worst-Case Frontier Risks of Open-Weight LLMs." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wallace2026iclr-estimating/)
BibTeX
@inproceedings{wallace2026iclr-estimating,
title = {{Estimating Worst-Case Frontier Risks of Open-Weight LLMs}},
author = {Wallace, Eric and Watkins, Olivia and Wang, Miles and Chen, Kai and Koch, Chris},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/wallace2026iclr-estimating/}
}