Leike, Jan

22 publications

ICLR 2026 Constitutional Classifiers++: Efficient Production-Grade Defenses Against Universal Jailbreaks Hoagy Cunningham, Jerry Wei, Zihan Wang, Andrew Persic, Alwin Peng, Jordan Abderrachid, Raj Agarwal, Bobby Chen, Andy Dau, Alek Dimitriev, Logan Howard, Yijin Hua, Rob Gilson, Mu Lin, Christopher Liu, Vladimir Mikulik, Rohit Mittapalli, Clare O'Hara, Jin Pan, Nikhil Saxena, Alex Silverstein, Yue Song, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez, Mrinank Sharma

NeurIPS 2025 Quantifying Elicitation of Latent Capabilities in Language Models Elizabeth Donoway, Hailey Joren, Arushi Somani, Henry Sleight, Julian Michael, Michael R DeWeese, John Schulman, Ethan Perez, Fabien Roger, Jan Leike

ICLR 2025 Scaling and Evaluating Sparse Autoencoders Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu

ICLR 2024 Let's Verify Step by Step Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe

ICML 2024 Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeffrey Wu

NeurIPS 2022 Training Language Models to Follow Instructions with Human Feedback Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, Ryan Lowe

ICLR 2021 Quantifying Differences in Reward Functions Adam Gleave, Michael D Dennis, Shane Legg, Stuart Russell, Jan Leike

ICML 2020 Learning Human Objectives by Evaluating Hypothetical Behavior Siddharth Reddy, Anca Dragan, Sergey Levine, Shane Legg, Jan Leike

IJCAI 2020 Pitfalls of Learning a Reward Function Online Stuart Armstrong, Jan Leike, Laurent Orseau, Shane Legg

ICLR 2019 Learning to Understand Goal Specifications by Modelling Reward Dzmitry Bahdanau, Felix Hill, Jan Leike, Edward Hughes, Arian Hosseini, Pushmeet Kohli, Edward Grefenstette

NeurIPS 2018 Reward Learning from Human Preferences and Demonstrations in Atari Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, Dario Amodei

NeurIPS 2017 Deep Reinforcement Learning from Human Preferences Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei

IJCAI 2017 On Thompson Sampling and Asymptotic Optimality Jan Leike, Tor Lattimore, Laurent Orseau, Marcus Hutter

IJCAI 2017 Universal Reinforcement Learning Algorithms: Survey and Experiments John Aslanides, Jan Leike, Marcus Hutter

UAI 2016 A Formal Solution to the Grain of Truth Problem Jan Leike, Jessica Taylor, Benya Fallenstein

AISTATS 2016 Loss Bounds and Time Complexity for Speed Priors Daniel Filan, Jan Leike, Marcus Hutter

UAI 2016 Thompson Sampling Is Asymptotically Optimal in General Environments Jan Leike, Tor Lattimore, Laurent Orseau, Marcus Hutter

COLT 2015 Bad Universal Priors and Notions of Optimality Jan Leike, Marcus Hutter

UAI 2015 On the Computability of AIXI Jan Leike, Marcus Hutter

ALT 2015 On the Computability of Solomonoff Induction and Knowledge-Seeking Jan Leike, Marcus Hutter

ALT 2015 Solomonoff Induction Violates Nicod's Criterion Jan Leike, Marcus Hutter

ALT 2014 Indefinitely Oscillating Martingales Jan Leike, Marcus Hutter