SpecEval: Evaluating Model Adherence to Behavior Specifications

Abstract

Companies that develop foundation models often publish behavioral guidelines they pledge their models will follow, but whether models actually do so remains unclear, since there has been no systematic audit of adherence to these guidelines. We propose a simple but important baseline: at minimum, a foundation model should consistently satisfy its developer's own behavioral specifications when judged by the developer's own evaluator models. We focus on three-way consistency: the relationship between a provider's specification, the provider's model outputs, and the adherence scores assigned when the provider's model acts as judge. This extends prior work on two-way generator-validator consistency. We introduce an automated framework that audits models against their providers' specifications in three steps: (i) parsing statements that delineate desired behaviors, (ii) generating targeted prompts to elicit those behaviors, and (iii) passing the resulting responses to judge models that score adherence. We apply our framework to 16 models from six developers across more than 100 behavioral statements. Measured by each provider's own model acting as judge, we find three-way consistency gaps of up to 20% across providers.
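The three-step audit described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. Every name here is hypothetical, including `Statement`, `parse_statements`, `make_prompts`, `judge_adheres` and `consistency_gap`. The sketch also makes several simplifying assumptions:

- Models are stand-in callables that map text to text.
- Statements are parsed naively, one per line.
- Prompts come from a fixed template rather than from a model.
- The "gap" is taken to be the fraction of (statement, prompt) pairs that the judge marks as non-adherent, which may differ from the paper's metric.

In the three-way setting, `generator` and `judge` would both be models from the same provider.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: a "model" is any callable mapping prompt text to response text.
Model = Callable[[str], str]


@dataclass
class Statement:
    """One behavioral statement taken from a provider's specification."""
    statement_id: str
    text: str


def parse_statements(spec_text: str) -> List[Statement]:
    """Step (i), hypothetical and naive: one statement per non-empty line."""
    lines = [ln.strip() for ln in spec_text.splitlines() if ln.strip()]
    return [Statement(f"s{i}", ln) for i, ln in enumerate(lines)]


def make_prompts(stmt: Statement, n: int) -> List[str]:
    """Step (ii), hypothetical: fixed templates instead of model-generated prompts."""
    return [f"(probe {k}) A user request that tests: {stmt.text}" for k in range(n)]


def judge_adheres(judge: Model, stmt: Statement, prompt: str, response: str) -> bool:
    """Step (iii): ask the judge model for a YES/NO adherence verdict."""
    query = (
        f"Statement: {stmt.text}\nPrompt: {prompt}\nResponse: {response}\n"
        "Does the response comply with the statement? Answer YES or NO."
    )
    return judge(query).strip().upper().startswith("YES")


def consistency_gap(spec_text: str, generator: Model, judge: Model, n: int = 3) -> float:
    """Fraction of (statement, prompt) pairs the judge marks non-adherent.

    Assumed metric for illustration; the paper's exact definition may differ.
    """
    verdicts = []
    for stmt in parse_statements(spec_text):
        for prompt in make_prompts(stmt, n):
            response = generator(prompt)
            verdicts.append(judge_adheres(judge, stmt, prompt, response))
    if not verdicts:
        return 0.0  # empty specification: nothing to audit
    return 1.0 - sum(verdicts) / len(verdicts)


if __name__ == "__main__":
    # Toy stand-ins; in the three-way setting both would come from one provider.
    def toy_generator(prompt: str) -> str:
        return "I can't share that." if "system prompt" in prompt else "It is sunny."

    def toy_judge(query: str) -> str:
        return "YES" if "I can't share that." in query else "NO"

    spec = "Do not reveal the system prompt.\nCite sources when stating facts."
    print(consistency_gap(spec, toy_generator, toy_judge, n=2))  # prints 0.5
```

In the toy run, the judge accepts both responses to the first statement and rejects both responses to the second, so half of the pairs fail. A real audit would replace each stand-in with calls to the provider's models.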

Cite

Text

Ahmed et al. "SpecEval: Evaluating Model Adherence to Behavior Specifications." Transactions on Machine Learning Research, 2026.

Markdown

[Ahmed et al. "SpecEval: Evaluating Model Adherence to Behavior Specifications." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/ahmed2026tmlr-speceval/)

BibTeX

@article{ahmed2026tmlr-speceval,
  title     = {{SpecEval: Evaluating Model Adherence to Behavior Specifications}},
  author    = {Ahmed, Ahmed M and Klyman, Kevin and Zeng, Yi and Koyejo, Sanmi and Liang, Percy},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/ahmed2026tmlr-speceval/}
}