Roger, Fabien

6 publications

NeurIPS 2025. Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models. Cameron Tice, Philipp Alexander Kreer, Nathan Helm-Burger, Prithviraj Singh Shahani, Fedor Ryzhenkov, Fabien Roger, Clement Neo, Jacob Haimes, Felix Hofstätter, Teun van der Weij
NeurIPS 2025. Quantifying Elicitation of Latent Capabilities in Language Models. Elizabeth Donoway, Hailey Joren, Arushi Somani, Henry Sleight, Julian Michael, Michael R DeWeese, John Schulman, Ethan Perez, Fabien Roger, Jan Leike
NeurIPS 2025. Why Do Some Language Models Fake Alignment While Others Don't? Abhay Sheshadri, John Hughes, Julian Michael, Alex Troy Mallen, Arun Jose, Fabien Roger
ICML 2024. AI Control: Improving Safety Despite Intentional Subversion. Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger
TMLR 2024. Language Models Are Better than Humans at Next-Token Prediction. Buck Shlegeris, Fabien Roger, Lawrence Chan, Euan McLean
NeurIPS 2024. Stress-Testing Capability Elicitation with Password-Locked Models. Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov, David Krueger