Krueger, David

74 publications

NeurIPS 2025 Detecting High-Stakes Interactions with Activation Probes Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David Krueger, Ekdeep Singh Lubana, Dmitrii Krasheninnikov
NeurIPS 2025 Distributional Training Data Attribution: What Do Influence Functions Sample? Bruno Kacper Mlodozeniec, Isaac Reid, Samuel Power, David Krueger, Murat A Erdogdu, Richard E. Turner, Roger Baker Grosse
NeurIPS 2025 From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization Shoaib Ahmed Siddiqui, Adrian Weller, David Krueger, Gintare Karolina Dziugaite, Michael Curtis Mozer, Eleni Triantafillou
ICLR 2025 Influence Functions for Scalable Data Attribution in Diffusion Models Bruno Kacper Mlodozeniec, Runa Eschenhagen, Juhan Bae, Alexander Immer, David Krueger, Richard E. Turner
ICLR 2025 Input Space Mode Connectivity in Deep Neural Networks Jakub Vrabel, Ori Shem-Ur, Yaron Oz, David Krueger
ICLR 2025 Interpreting Emergent Planning in Model-Free Reinforcement Learning Thomas Bush, Stephen Chung, Usman Anwar, Adrià Garriga-Alonso, David Krueger
TMLR 2025 Permissive Information-Flow Analysis for Large Language Models Shoaib Ahmed Siddiqui, Radhika Gaonkar, Boris Köpf, David Krueger, Andrew Paverd, Ahmed Salem, Shruti Tople, Lukas Wutschitz, Menglin Xia, Santiago Zanella-Beguelin
ICML 2025 PoisonBench: Assessing Language Model Vulnerability to Poisoned Preference Data Tingchen Fu, Mrinank Sharma, Philip Torr, Shay B Cohen, David Krueger, Fazl Barez
ICML 2025 Position: Humanity Faces Existential Risk from Gradual Disempowerment Jan Kulveit, Raymond Douglas, Nora Ammann, Deger Turan, David Krueger, David Duvenaud
ICML 2025 Position: Probabilistic Modelling Is Sufficient for Causal Inference Bruno Kacper Mlodozeniec, David Krueger, Richard E. Turner
ICLR 2025 Protecting Against Simultaneous Data Poisoning Attacks Neel Alex, Shoaib Ahmed Siddiqui, Amartya Sanyal, David Krueger
TMLR 2025 Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models Jan Wehner, Sahar Abdelnabi, Daniel Tan, David Krueger, Mario Fritz
ICML 2025 The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret Lukas Fluri, Leon Lang, Alessandro Abate, Patrick Forré, David Krueger, Joar Max Viktor Skalse
ICLR 2025 Towards Interpreting Visual Information Processing in Vision-Language Models Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, Fazl Barez
ICLRW 2025 Understanding (Un)Reliability of Steering Vectors in Language Models Joschka Braun, Carsten Eickhoff, David Krueger, Seyed Ali Bahrainian, Dmitrii Krasheninnikov
TMLR 2025 Understanding In-Context Learning of Linear Models in Transformers Through an Adversarial Lens Usman Anwar, Johannes von Oswald, Louis Kirsch, David Krueger, Spencer Frei
ICMLW 2024 A Deeper Look at Depth Pruning of LLMs Shoaib Ahmed Siddiqui, Xin Dong, Greg Heinrich, Thomas Breuel, Jan Kautz, David Krueger, Pavlo Molchanov
NeurIPS 2024 A Generative Model of Symmetry Transformations James Urquhart Allingham, Bruno Kacper Mlodozeniec, Shreyas Padhy, Javier Antorán, David Krueger, Richard E. Turner, Eric Nalisnick, José Miguel Hernández-Lobato
NeurIPSW 2024 Analyzing (In)Abilities of SAEs via Formal Languages Abhinav Menon, Manish Shrivastava, Ekdeep Singh Lubana, David Krueger
TMLR 2024 Blockwise Self-Supervised Learning at Scale Shoaib Siddiqui, David Krueger, Yann LeCun, Stephane Deny
NeurIPSW 2024 Comparing Bottom-up and Top-Down Steering Approaches on In-Context Learning Tasks Madeline Brumley, Joe Kwon, David Krueger, Dmitrii Krasheninnikov, Usman Anwar
TMLR 2024 Foundational Challenges in Assuring Alignment and Safety of Large Language Models Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric J Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Chenyu Zhang, Ruiqi Zhong, Sean O hEigeartaigh, Gabriel Recchia, Giulio Corsi, Alan Chan, Markus Anderljung, Lilian Edwards, Aleksandar Petrov, Christian Schroeder de Witt, Sumeet Ramesh Motwani, Yoshua Bengio, Danqi Chen, Philip Torr, Samuel Albanie, Tegan Maharaj, Jakob Nicolaus Foerster, Florian Tramèr, He He, Atoosa Kasirzadeh, Yejin Choi, David Krueger
NeurIPSW 2024 IDs for AI Systems Alan Chan, Noam Kolt, Peter Wills, Usman Anwar, Christian Schroeder de Witt, Nitarshan Rajkumar, Lewis Hammond, David Krueger, Lennart Heim, Markus Anderljung
ICML 2024 Implicit Meta-Learning May Lead Language Models to Trust More Reliable Sources Dmitrii Krasheninnikov, Egor Krasheninnikov, Bruno Kacper Mlodozeniec, Tegan Maharaj, David Krueger
NeurIPSW 2024 Input Space Mode Connectivity in Deep Neural Networks Jakub Vrabel, Ori Shem-Ur, Yaron Oz, David Krueger
NeurIPS 2024 Interpreting Learned Feedback Patterns in Large Language Models Luke Marks, Amir Abdullah, Clement Neo, Rauno Arike, David Krueger, Philip Torr, Fazl Barez
NeurIPSW 2024 Learning to Forget Using Hypernetworks Jose Miguel Lara Rangel, Usman Anwar, Stefan Schoepf, Jack Foster, David Krueger
ICLR 2024 Mechanistically Analyzing the Effects of Fine-Tuning on Procedurally Defined Tasks Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, Tim Rocktäschel, Edward Grefenstette, David Krueger
ICLRW 2024 Mechanistically Analyzing the Effects of Fine-Tuning on Procedurally Defined Tasks Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, Tim Rocktäschel, Edward Grefenstette, David Krueger
NeurIPS 2024 Predicting Future Actions of Reinforcement Learning Agents Stephen Chung, Scott Niekum, David Krueger
ICLR 2024 Reward Model Ensembles Help Mitigate Overoptimization Thomas Coste, Usman Anwar, Robert Kirk, David Krueger
NeurIPSW 2024 Steering Clear: A Systematic Study of Activation Steering in a Toy Setup Dmitrii Krasheninnikov, David Krueger
NeurIPS 2024 Stress-Testing Capability Elicitation with Password-Locked Models Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov, David Krueger
NeurIPSW 2024 Towards Reliable Evaluation of Behavior Steering Interventions in LLMs Itamar Pres, Laura Ruis, Ekdeep Singh Lubana, David Krueger
ICLR 2023 Broken Neural Scaling Laws Ethan Caballero, Kshitij Gupta, Irina Rish, David Krueger
ICLRW 2023 Broken Neural Scaling Laws Ethan Caballero, Kshitij Gupta, Irina Rish, David Krueger
NeurIPSW 2023 Detecting Backdoors with Meta-Models Lauro Langosco, Neel Alex, William Baker, David Quarel, Herbie Bradley, David Krueger
NeurIPSW 2023 Goal Misgeneralization as Implicit Goal Conditioning Diego Dorn, Neel Alex, David Krueger
NeurIPSW 2023 Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models Alan Chan, Benjamin Bucknall, Herbie Bradley, David Krueger
NeurIPSW 2023 How Does Fine-Tuning Affect Your Model? Mechanistic Analysis on Procedural Tasks Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, Tim Rocktäschel, Edward Grefenstette, David Krueger
NeurIPSW 2023 How Does Fine-Tuning Affect Your Model? Mechanistic Analysis on Procedural Tasks Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, Tim Rocktäschel, Edward Grefenstette, David Krueger
ICML 2023 Mechanistic Mode Connectivity Ekdeep Singh Lubana, Eric J Bigelow, Robert P. Dick, David Krueger, Hidenori Tanaka
NeurIPSW 2023 Meta- (out-of-Context) Learning in Neural Networks Dmitrii Krasheninnikov, Egor Krasheninnikov, Bruno Mlodozeniec, David Krueger
ICLR 2023 Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics Shoaib Ahmed Siddiqui, Nitarshan Rajkumar, Tegan Maharaj, David Krueger, Sara Hooker
NeurIPSW 2023 Noisy ZSC: Breaking the Common Knowledge Assumption in Zero-Shot Coordination Games Usman Anwar, Jia Wan, David Krueger, Jakob Nicolaus Foerster
TMLR 2023 Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomek Korbak, David Lindner, Pedro Freire, Tony Tong Wang, Samuel Marks, Charbel-Raphael Segerie, Micah Carroll, Andi Peng, Phillip J.K. Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
ICLRW 2023 Out-of-Context Meta-Learning in Large Language Models Dmitrii Krasheninnikov, Egor Krasheninnikov, David Krueger
NeurIPSW 2023 Reward Model Ensembles Help Mitigate Overoptimization Thomas Coste, Usman Anwar, Robert Kirk, David Krueger
NeurIPSW 2023 Reward Model Ensembles Help Mitigate Overoptimization Thomas Coste, Usman Anwar, Robert Kirk, David Krueger
NeurIPS 2023 Thinker: Learning to Plan and Act Stephen Chung, Ivan Anokhin, David Krueger
ICMLW 2023 Towards Out-of-Distribution Adversarial Robustness Adam Ibrahim, Charles Guille-Escuret, Ioannis Mitliagkas, Irina Rish, David Krueger, Pouya Bashivan
NeurIPSW 2023 What Mechanisms Does Knowledge Distillation Distill? Cindy Wu, Ekdeep Singh Lubana, Bruno Kacper Mlodozeniec, Robert Kirk, David Krueger
NeurIPSW 2022 A Mechanistic Lens on Mode Connectivity Ekdeep Singh Lubana, Eric J Bigelow, Robert P. Dick, David Krueger, Hidenori Tanaka
NeurIPSW 2022 Assistance with Large Language Models Dmitrii Krasheninnikov, Egor Krasheninnikov, David Krueger
NeurIPSW 2022 Broken Neural Scaling Laws Ethan Caballero, Kshitij Gupta, Irina Rish, David Krueger
NeurIPS 2022 Defining and Characterizing Reward Gaming Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, David Krueger
NeurIPSW 2022 Domain Generalization for Robust Model-Based Offline Reinforcement Learning Alan Clark, Shoaib Ahmed Siddiqui, Robert Kirk, Usman Anwar, Stephen Chung, David Krueger
NeurIPSW 2022 Domain Generalization for Robust Model-Based Offline Reinforcement Learning Alan Clark, Shoaib Ahmed Siddiqui, Robert Kirk, Usman Anwar, Stephen Chung, David Krueger
ICML 2022 Goal Misgeneralization in Deep Reinforcement Learning Lauro Langosco Di Langosco, Jack Koch, Lee D Sharkey, Jacob Pfau, David Krueger
NeurIPSW 2022 Mechanistic Lens on Mode Connectivity Ekdeep Singh Lubana, Eric J Bigelow, Robert P. Dick, David Krueger, Hidenori Tanaka
NeurIPSW 2022 On the Fragility of Learned Reward Functions Lev E. McKinney, Yawen Duan, David Krueger, Adam Gleave
NeurIPSW 2022 On the Fragility of Learned Reward Functions Lev E McKinney, Yawen Duan, David Krueger, Adam Gleave
NeurIPSW 2022 On the Fragility of Learned Reward Functions Lev E McKinney, Yawen Duan, David Krueger, Adam Gleave
NeurIPSW 2022 Training Equilibria in Reinforcement Learning Lauro Langosco, David Krueger, Adam Gleave
NeurIPSW 2022 Unifying Grokking and Double Descent Xander Davies, Lauro Langosco, David Krueger
ICML 2021 Out-of-Distribution Generalization via Risk Extrapolation (REx) David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Remi Le Priol, Aaron Courville
ICML 2018 Neural Autoregressive Flows Chin-Wei Huang, David Krueger, Alexandre Lacoste, Aaron Courville
ICML 2017 A Closer Look at Memorization in Deep Networks Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, Simon Lacoste-Julien
ICLR 2017 Deep Nets Don't Learn via Memorization David Krueger, Nicolas Ballas, Stanislaw Jastrzebski, Devansh Arpit, Maxinder S. Kanwal, Tegan Maharaj, Emmanuel Bengio, Asja Fischer, Aaron C. Courville
ACML 2017 Nested LSTMs Joel Ruben Antony Moniz, David Krueger
ICLR 2017 Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Aaron C. Courville, Christopher J. Pal
ICLR 2016 Regularizing RNNs by Stabilizing Activations David Krueger, Roland Memisevic
ICLR 2015 NICE: Non-Linear Independent Components Estimation Laurent Dinh, David Krueger, Yoshua Bengio
ICLR 2015 Zero-Bias Autoencoders and the Benefits of Co-Adapting Features Roland Memisevic, Kishore Reddy Konda, David Krueger