Lindner, David

16 publications

NeurIPS 2025 · Large Language Models Can Learn and Generalize Steganographic Chain-of-Thought Under Process Supervision · Robert McCarthy, Joey Skaf, Luis Ibanez-Lissen, Vasil Georgiev, Connor Watts, Hannes Whittingham, Lorena Gonzalez-Manzano, Cameron Tice, Edward James Young, Puria Radmard, David Lindner
ICML 2025 · MONA: Myopic Optimization with Non-Myopic Approval Can Mitigate Multi-Step Reward Hacking · Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, Rohin Shah
AISTATS 2024 · Learning Safety Constraints from Demonstrations with Unknown Rewards · David Lindner, Xin Chen, Sebastian Tschiatschek, Katja Hofmann, Andreas Krause
NeurIPSW 2024 · MISR: Measuring Instrumental Self-Reasoning in Frontier Models · Kai Fronsdal, David Lindner
NeurIPS 2024 · On Scalable Oversight with Weak LLMs Judging Strong LLMs · Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, Rohin Shah
ICLR 2024 · Vision-Language Models Are Zero-Shot Reward Models for Reinforcement Learning · Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, David Lindner
TMLR 2023 · Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback · Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomek Korbak, David Lindner, Pedro Freire, Tony Tong Wang, Samuel Marks, Charbel-Raphael Segerie, Micah Carroll, Andi Peng, Phillip J.K. Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
ICMLW 2023 · RLHF-Blender: A Configurable Interactive Interface for Learning from Diverse Human Feedback · Yannick Metz, David Lindner, Raphaël Baur, Daniel A. Keim, Mennatallah El-Assady
NeurIPS 2023 · Tracr: Compiled Transformers as a Laboratory for Interpretability · David Lindner, János Kramár, Sebastian Farquhar, Matthew Rahtz, Tom McGrath, Vladimir Mikulik
NeurIPSW 2023 · Vision-Language Models Are Zero-Shot Reward Models for Reinforcement Learning · Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, David Lindner
NeurIPS 2022 · Active Exploration for Inverse Reinforcement Learning · David Lindner, Andreas Krause, Giorgia Ramponi
ICML 2022 · Interactively Learning Preference Constraints in Linear Bandits · David Lindner, Sebastian Tschiatschek, Katja Hofmann, Andreas Krause
NeurIPSW 2022 · Red-Teaming the Stable Diffusion Safety Filter · Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, Florian Tramèr
IJCAI 2021 · Addressing the Long-Term Impact of ML Decisions via Policy Regret · David Lindner, Hoda Heidari, Andreas Krause
NeurIPS 2021 · Information Directed Reward Learning for Reinforcement Learning · David Lindner, Matteo Turchetta, Sebastian Tschiatschek, Kamil Ciosek, Andreas Krause
ICLR 2021 · Learning What to Do by Simulating the Past · David Lindner, Rohin Shah, Pieter Abbeel, Anca Dragan