Trust the Typical
Abstract
Current approaches to LLM safety fundamentally rely on a brittle cat-and-mouse game of identifying and blocking known threats via guardrails. We argue for a fresh approach: robust safety comes not from enumerating what is harmful, but from deeply understanding what is safe. We introduce Trust The Typical (T3), a framework that operationalizes this principle by treating safety as an out-of-distribution (OOD) detection problem. T3 learns the distribution of acceptable prompts in a semantic space and flags any significant deviation as a potential threat. Unlike prior methods, it requires no training on harmful examples, yet achieves state-of-the-art performance across 18 benchmarks spanning toxicity, hate speech, jailbreaking, multilingual harms, and over-refusal, reducing false positive rates by up to 40x relative to specialized safety models. A single model trained only on safe English text transfers effectively to diverse domains and over 14 languages without retraining. Finally, we demonstrate production readiness by integrating a GPU-optimized version into vLLM, enabling continuous guardrailing during token generation with less than 6% overhead even under dense evaluation intervals on large-scale workloads.
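To make the OOD framing concrete, the following is a minimal sketch of the general idea of fitting a model to safe examples only and flagging points that deviate from it. It is not the T3 method: the abstract does not specify the embedding model, the density estimator, or the thresholding rule, so everything here is an assumption made for illustration. The sketch uses a diagonal Gaussian over synthetic 8-dimensional vectors standing in for prompt embeddings, and the names `fit_typical`, `atypicality`, `calibrate_threshold`, and `is_flagged` are hypothetical.

```python
import math
import random
import statistics


def fit_typical(embeddings):
    """Fit a diagonal Gaussian (per-dimension mean and std) to safe embeddings."""
    dims = list(zip(*embeddings))
    means = [statistics.fmean(d) for d in dims]
    # Guard against zero variance so the z-score below stays finite.
    stds = [statistics.pstdev(d) or 1e-9 for d in dims]
    return means, stds


def atypicality(x, means, stds):
    """Root-mean-square z-score of x under the fitted diagonal Gaussian."""
    total = sum(((xi - m) / s) ** 2 for xi, m, s in zip(x, means, stds))
    return math.sqrt(total / len(x))


def calibrate_threshold(embeddings, means, stds, quantile=0.99):
    """Pick the score below which roughly `quantile` of safe examples fall."""
    scores = sorted(atypicality(e, means, stds) for e in embeddings)
    idx = min(len(scores) - 1, int(quantile * len(scores)))
    return scores[idx]


def is_flagged(x, means, stds, threshold):
    """Flag x when it is more atypical than the calibrated threshold."""
    return atypicality(x, means, stds) > threshold


# Synthetic stand-ins for embeddings of safe prompts (assumption: a real
# system would obtain these from a text encoder, which is not shown here).
rng = random.Random(0)
safe = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(500)]

means, stds = fit_typical(safe)
threshold = calibrate_threshold(safe, means, stds)

typical_point = [0.0] * 8   # near the centre of the safe distribution
shifted_point = [6.0] * 8   # far from anything seen during fitting

print(is_flagged(typical_point, means, stds, threshold))
print(is_flagged(shifted_point, means, stds, threshold))
```

No harmful examples appear anywhere in the fitting or calibration steps, which is the property the abstract emphasizes. In practice the threshold would more plausibly be calibrated on held-out safe data rather than on the fitting set, as done here for brevity.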
Cite
Text
Ganguly et al. "Trust the Typical." International Conference on Learning Representations, 2026.
Markdown
[Ganguly et al. "Trust the Typical." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/ganguly2026iclr-trust/)
BibTeX
@inproceedings{ganguly2026iclr-trust,
title = {{Trust the Typical}},
author = {Ganguly, Debargha and Sankar, Sreehari and Zhang, Biyao and Singh, Vikash and Gupta, Kanan and Kavuru, Harshini and Luo, Alan and Chen, Weicong and Morningstar, Warren Richard and Machiraju, Raghu and Chaudhary, Vipin},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/ganguly2026iclr-trust/}
}