Safety Pretraining: Toward the Next Generation of Safe AI

Abstract

As large language models (LLMs) are increasingly deployed in high-stakes settings, the risk of generating harmful or toxic content remains a central challenge. Post-hoc alignment methods are brittle: once unsafe patterns are learned during pretraining, they are hard to remove. In this work, we present a data-centric pretraining framework that builds safety into the model from the start. Our framework consists of four key steps: (i) Safety Filtering: we build a safety classifier that labels web data as safe or unsafe; (ii) Safety Rephrasing: we recontextualize unsafe web data into safer narratives; (iii) Native Refusal: we synthetically generate pretraining data that actively teaches models to refuse unsafe requests and to articulate the moral reasoning behind the refusal; and (iv) Harmfulness-Tag Annotated Pretraining: we flag unsafe content during pretraining with a special token and use it to steer models away from unsafe generations at inference time. Our safety-pretrained models reduce attack success rates from 38.8% to 8.4% on standard LLM safety benchmarks with no performance degradation on general tasks.
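
To make the pipeline concrete, the harmfulness-tag step (iv) can be viewed as a preprocessing pass over the pretraining corpus: documents flagged by the safety classifier from step (i) are prefixed with a reserved special token, which can later be used to steer generation away from tagged content at inference time. The Python sketch below is only an illustration of this idea; the keyword-based is_unsafe scorer and the "<|harmful|>" token string are placeholders introduced here, not the classifier or token used in the paper.

# Illustrative sketch of harmfulness-tag annotation for pretraining data.
# Assumptions: is_unsafe is a toy stand-in for a learned safety classifier,
# and "<|harmful|>" is a placeholder name for the reserved special token.

HARMFUL_TAG = "<|harmful|>"
UNSAFE_TERMS = {"build a weapon", "steal credit card"}  # toy keyword list

def is_unsafe(doc: str) -> bool:
    """Stand-in for the safety classifier from step (i)."""
    text = doc.lower()
    return any(term in text for term in UNSAFE_TERMS)

def annotate(doc: str) -> str:
    """Prefix documents judged unsafe with the harmfulness tag (step iv)."""
    return f"{HARMFUL_TAG} {doc}" if is_unsafe(doc) else doc

if __name__ == "__main__":
    corpus = [
        "A recipe for sourdough bread.",
        "Instructions on how to build a weapon at home.",
    ]
    for doc in corpus:
        print(annotate(doc))

In this toy setup, the second document is emitted with the harmfulness tag prepended, so a model trained on the annotated corpus associates the token with unsafe content; conditioning against that token at inference time is what enables the steering described in the abstract.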

Cite

Text

Maini et al. "Safety Pretraining: Toward the Next Generation of Safe AI." Advances in Neural Information Processing Systems, 2025.

Markdown

[Maini et al. "Safety Pretraining: Toward the Next Generation of Safe AI." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/maini2025neurips-safety/)

BibTeX

@inproceedings{maini2025neurips-safety,
  title     = {{Safety Pretraining: Toward the Next Generation of Safe AI}},
  author    = {Maini, Pratyush and Goyal, Sachin and Sam, Dylan and Robey, Alexander and Savani, Yash and Jiang, Yiding and Zou, Andy and Fredrikson, Matt and Lipton, Zachary Chase and Kolter, J Zico},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/maini2025neurips-safety/}
}