Ward, Francis Rhys

9 publications

ICLR 2025. AI Sandbagging: Language Models Can Strategically Underperform on Evaluations. Teun van der Weij, Felix Hofstätter, Oliver Jaffe, Samuel F. Brown, Francis Rhys Ward
NeurIPS 2025. CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D. Francis Rhys Ward, Teun van der Weij, Hanna Gábor, Sam Martin, Raja Mehta Moreno, Harel Lidar, Louis Makower, Thomas Jodrell, Lauren Robson
ICML 2025. The Elicitation Game: Evaluating Capability Elicitation Techniques. Felix Hofstätter, Teun van der Weij, Jayden Teoh, Rada Djoneva, Henning Bartsch, Francis Rhys Ward
AAAI 2025. Towards a Theory of AI Personhood. Francis Rhys Ward
NeurIPSW 2024. AI Sandbagging: Language Models Can Selectively Underperform on Evaluations. Teun van der Weij, Felix Hofstätter, Oliver Jaffe, Samuel F. Brown, Francis Rhys Ward
NeurIPSW 2024. The Elicitation Game: Stress-Testing Capability Elicitation Techniques. Felix Hofstätter, Jayden Teoh, Teun van der Weij, Francis Rhys Ward
NeurIPSW 2024. Towards a Theory of AI Personhood. Francis Rhys Ward
NeurIPSW 2024. Towards a Theory of AI Personhood. Francis Rhys Ward
NeurIPSW 2022. Towards Defining Deception in Structural Causal Games. Francis Rhys Ward