Hofstätter, Felix

6 publications

ICLR 2025. AI Sandbagging: Language Models Can Strategically Underperform on Evaluations. Teun van der Weij, Felix Hofstätter, Oliver Jaffe, Samuel F. Brown, Francis Rhys Ward
NeurIPS 2025. Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models. Cameron Tice, Philipp Alexander Kreer, Nathan Helm-Burger, Prithviraj Singh Shahani, Fedor Ryzhenkov, Fabien Roger, Clement Neo, Jacob Haimes, Felix Hofstätter, Teun van der Weij
ICML 2025. The Elicitation Game: Evaluating Capability Elicitation Techniques. Felix Hofstätter, Teun van der Weij, Jayden Teoh, Rada Djoneva, Henning Bartsch, Francis Rhys Ward
NeurIPSW 2024. AI Sandbagging: Language Models Can Selectively Underperform on Evaluations. Teun van der Weij, Felix Hofstätter, Oliver Jaffe, Samuel F. Brown, Francis Rhys Ward
NeurIPSW 2024. Sandbag Detection Through Model Impairment. Cameron Tice, Philipp Alexander Kreer, Nathan Helm-Burger, Prithviraj Singh Shahani, Fedor Ryzhenkov, Teun van der Weij, Felix Hofstätter, Jacob Haimes
NeurIPSW 2024. The Elicitation Game: Stress-Testing Capability Elicitation Techniques. Felix Hofstätter, Jayden Teoh, Teun van der Weij, Francis Rhys Ward