Shah, Rohin

12 publications

ICML 2025 MONA: Myopic Optimization with Non-Myopic Approval Can Mitigate Multi-Step Reward Hacking Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, Rohin Shah

ICMLW 2024 BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents That Solve Fuzzy Tasks Stephanie Milani, Anssi Kanervisto, Karolis Jucys, Sander V Schulhoff, Brandon Houghton, Rohin Shah

NeurIPS 2024 Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

ICMLW 2024 Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, Janos Kramar, Rohin Shah, Neel Nanda

NeurIPS 2024 On Scalable Oversight with Weak LLMs Judging Strong LLMs Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, Rohin Shah

NeurIPS 2023 BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents That Solve Fuzzy Tasks Stephanie Milani, Anssi Kanervisto, Karolis Ramanauskas, Sander Schulhoff, Brandon Houghton, Rohin Shah

ICLR 2021 Learning What to Do by Simulating the past David Lindner, Rohin Shah, Pieter Abbeel, Anca Dragan

NeurIPS 2021 Optimal Policies Tend to Seek Power Alex Turner, Logan Smith, Rohin Shah, Andrew Critch, Prasad Tadepalli

NeurIPS 2020 The MAGICAL Benchmark for Robust Imitation Sam Toyer, Rohin Shah, Andrew Critch, Stuart J. Russell

ICML 2019 On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference Rohin Shah, Noah Gundotra, Pieter Abbeel, Anca Dragan

NeurIPS 2019 On the Utility of Learning About Humans for Human-AI Coordination Micah Carroll, Rohin Shah, Mark K Ho, Tom Griffiths, Sanjit Seshia, Pieter Abbeel, Anca Dragan

ICLR 2019 Preferences Implicit in the State of the World Rohin Shah, Dmitrii Krasheninnikov, Jordan Alexander, Pieter Abbeel, Anca Dragan