ML Anthology
Hofstätter, Felix
6 publications
ICLR
2025
AI Sandbagging: Language Models Can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Oliver Jaffe, Samuel F. Brown, Francis Rhys Ward
NeurIPS
2025
Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models
Cameron Tice, Philipp Alexander Kreer, Nathan Helm-Burger, Prithviraj Singh Shahani, Fedor Ryzhenkov, Fabien Roger, Clement Neo, Jacob Haimes, Felix Hofstätter, Teun van der Weij
ICML
2025
The Elicitation Game: Evaluating Capability Elicitation Techniques
Felix Hofstätter, Teun van der Weij, Jayden Teoh, Rada Djoneva, Henning Bartsch, Francis Rhys Ward
NeurIPSW
2024
AI Sandbagging: Language Models Can Selectively Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Oliver Jaffe, Samuel F. Brown, Francis Rhys Ward
NeurIPSW
2024
Sandbag Detection Through Model Impairment
Cameron Tice, Philipp Alexander Kreer, Nathan Helm-Burger, Prithviraj Singh Shahani, Fedor Ryzhenkov, Teun van der Weij, Felix Hofstätter, Jacob Haimes
NeurIPSW
2024
The Elicitation Game: Stress-Testing Capability Elicitation Techniques
Felix Hofstätter, Jayden Teoh, Teun van der Weij, Francis Rhys Ward