Estimating Worst-Case Frontier Risks of Open-Weight LLMs

Wallace, Eric; Watkins, Olivia; Wang, Miles; Chen, Kai; Koch, Chris

ICLR 2026

Abstract

In this paper, we study the worst-case frontier risks of the OpenAI gpt-oss model. We introduce malicious fine-tuning (MFT), where we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as capable as possible in two domains: biology and cybersecurity. To maximize biological risk (biorisk), we curate tasks related to threat creation and train gpt-oss in an RL environment with web browsing. To maximize cybersecurity risk, we train gpt-oss in an agentic coding environment to solve capture-the-flag (CTF) challenges. We compare these MFT models against open- and closed-weight LLMs on frontier risk evaluations. Compared to frontier closed-weight models, MFT gpt-oss underperforms OpenAI o3, a model that is below Preparedness High capability level for biorisk and cybersecurity. Compared to open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier. Taken together, these results led us to believe that the net new harm from releasing gpt-oss is limited, and we hope that our MFT approach can serve as useful guidance for estimating harm from future open-weight releases.

Cite

Text

Wallace et al. "Estimating Worst-Case Frontier Risks of Open-Weight LLMs." International Conference on Learning Representations, 2026.

Markdown

[Wallace et al. "Estimating Worst-Case Frontier Risks of Open-Weight LLMs." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wallace2026iclr-estimating/)

BibTeX

@inproceedings{wallace2026iclr-estimating,
  title     = {{Estimating Worst-Case Frontier Risks of Open-Weight LLMs}},
  author    = {Wallace, Eric and Watkins, Olivia and Wang, Miles and Chen, Kai and Koch, Chris},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/wallace2026iclr-estimating/}
}