Shah, Rohin

12 publications

ICML 2025 MONA: Myopic Optimization with Non-Myopic Approval Can Mitigate Multi-Step Reward Hacking Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, Rohin Shah
ICMLW 2024 BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents That Solve Fuzzy Tasks Stephanie Milani, Anssi Kanervisto, Karolis Jucys, Sander V Schulhoff, Brandon Houghton, Rohin Shah
NeurIPS 2024 Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda
ICMLW 2024 Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda
NeurIPS 2024 On Scalable Oversight with Weak LLMs Judging Strong LLMs Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, Rohin Shah
NeurIPS 2023 BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents That Solve Fuzzy Tasks Stephanie Milani, Anssi Kanervisto, Karolis Ramanauskas, Sander Schulhoff, Brandon Houghton, Rohin Shah
ICLR 2021 Learning What to Do by Simulating the Past David Lindner, Rohin Shah, Pieter Abbeel, Anca Dragan
NeurIPS 2021 Optimal Policies Tend to Seek Power Alex Turner, Logan Smith, Rohin Shah, Andrew Critch, Prasad Tadepalli
NeurIPS 2020 The MAGICAL Benchmark for Robust Imitation Sam Toyer, Rohin Shah, Andrew Critch, Stuart J. Russell
ICML 2019 On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference Rohin Shah, Noah Gundotra, Pieter Abbeel, Anca Dragan
NeurIPS 2019 On the Utility of Learning About Humans for Human-AI Coordination Micah Carroll, Rohin Shah, Mark K Ho, Tom Griffiths, Sanjit Seshia, Pieter Abbeel, Anca Dragan
ICLR 2019 Preferences Implicit in the State of the World Rohin Shah, Dmitrii Krasheninnikov, Jordan Alexander, Pieter Abbeel, Anca Dragan