Dragan, Anca

73 publications

ICML 2025 Adversaries Can Misuse Combinations of Safe Models Erik Jones, Anca Dragan, Jacob Steinhardt
ICML 2025 AssistanceZero: Scalably Solving Assistance Games Cassidy Laidlaw, Eli Bronstein, Timothy Guo, Dylan Feng, Lukas Berglund, Justin Svegliato, Stuart Russell, Anca Dragan
ICLRW 2025 CTRL-Rec: Controlling Recommender Systems with Natural Language Micah Carroll, Adeline Foote, Marcus Williams, Anca Dragan, W. Bradley Knox, Smitha Milli
ICLR 2025 Context Steering: Controllable Personalization at Inference Time Jerry Zhi-Yang He, Sashrika Pandey, Mariah L Schrum, Anca Dragan
ICLR 2025 Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking Cassidy Laidlaw, Shivam Singhal, Anca Dragan
ICLRW 2025 Diagnostic Uncertainty: Teaching Language Models to Describe Open-Ended Uncertainty Brian Sui, Jessy Lin, Michelle Li, Anca Dragan, Dan Klein, Jacob Steinhardt
ICLR 2025 On Targeted Manipulation and Deception When Optimizing LLMs for User Feedback Marcus Williams, Micah Carroll, Adhyyan Narang, Constantin Weisser, Brendan Murphy, Anca Dragan
NeurIPS 2025 Planning Without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL Joey Hong, Anca Dragan, Sergey Levine
ICLR 2025 Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning Joey Hong, Anca Dragan, Sergey Levine
ICLRW 2025 Scalably Solving Assistance Games Cassidy Laidlaw, Eli Bronstein, Timothy Guo, Dylan Feng, Lukas Berglund, Justin Svegliato, Stuart Russell, Anca Dragan
NeurIPS 2025 Temporal Representation Alignment: Successor Features Enable Emergent Compositionality in Robot Instruction Following Vivek Myers, Bill Zheng, Anca Dragan, Kuan Fang, Sergey Levine
ICML 2024 AI Alignment with Changing and Influenceable Reward Functions Micah Carroll, Davis Foote, Anand Siththaranjan, Stuart Russell, Anca Dragan
ICLRW 2024 AI Alignment with Changing and Influenceable Reward Functions Micah Carroll, Davis Foote, Anand Siththaranjan, Stuart Russell, Anca Dragan
ICMLW 2024 AI Alignment with Changing and Influenceable Reward Functions Micah Carroll, Davis Foote, Anand Siththaranjan, Stuart Russell, Anca Dragan
ICMLW 2024 AssistanceZero: Scalably Solving Assistance Games Cassidy Laidlaw, Eli Bronstein, Timothy Guo, Dylan Feng, Lukas Berglund, Justin Svegliato, Stuart Russell, Anca Dragan
NeurIPSW 2024 CoS: Enhancing Personalization and Mitigating Bias with Context Steering Sashrika Pandey, Jerry Zhi-Yang He, Mariah L Schrum, Anca Dragan
NeurIPSW 2024 CoS: Enhancing Personalization with Context Steering Sashrika Pandey, Jerry Zhi-Yang He, Mariah L Schrum, Anca Dragan
ICLR 2024 Confronting Reward Model Overoptimization with Constrained RLHF Ted Moskovitz, Aaditya K Singh, Dj Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca Dragan, Stephen Marcus McAleer
ICML 2024 Coprocessor Actor Critic: A Model-Based Reinforcement Learning Approach for Adaptive Brain Stimulation Michelle Pan, Mariah L Schrum, Vivek Myers, Erdem Biyik, Anca Dragan
ICML 2024 Learning Temporal Distances: Contrastive Successor Features Can Provide a Metric Structure for Decision-Making Vivek Myers, Chongyi Zheng, Anca Dragan, Sergey Levine, Benjamin Eysenbach
NeurIPS 2024 Learning to Assist Humans Without Inferring Rewards Vivek Myers, Evan Ellis, Sergey Levine, Benjamin Eysenbach, Anca Dragan
ICMLW 2024 Learning to Assist Humans Without Inferring Rewards Vivek Myers, Evan Ellis, Benjamin Eysenbach, Sergey Levine, Anca Dragan
ICML 2024 Learning to Model the World with Language Jessy Lin, Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, Anca Dragan
ICLR 2024 Offline RL with Observation Histories: Analyzing and Improving Sample Complexity Joey Hong, Anca Dragan, Sergey Levine
ICMLW 2024 Scalable Oversight by Accounting for Unreliable Feedback Shivam Singhal, Cassidy Laidlaw, Anca Dragan
ICMLW 2024 Scalably Solving Assistance Games Cassidy Laidlaw, Eli Bronstein, Timothy Guo, Dylan Feng, Lukas Berglund, Justin Svegliato, Stuart Russell, Anca Dragan
NeurIPSW 2024 Targeted Manipulation and Deception Emerge in LLMs Trained on User* Feedback Marcus Williams, Micah Carroll, Constantin Weisser, Brendan Murphy, Adhyyan Narang, Anca Dragan
ICLR 2024 The Effective Horizon Explains Deep RL Performance in Stochastic Environments Cassidy Laidlaw, Banghua Zhu, Stuart Russell, Anca Dragan
CoRL 2024 Trajectory Improvement and Reward Learning from Comparative Language Feedback Zhaojing Yang, Miru Jun, Jeremy Tien, Stuart Russell, Anca Dragan, Erdem Biyik
NeurIPS 2024 When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback Leon Lang, Davis Foote, Stuart Russell, Anca Dragan, Erik Jenner, Scott Emmons
NeurIPSW 2023 A Theoretical Explanation of Deep RL Performance in Stochastic Environments Cassidy Laidlaw, Banghua Zhu, Stuart Russell, Anca Dragan
ICML 2023 Automatically Auditing Large Language Models via Discrete Optimization Erik Jones, Anca Dragan, Aditi Raghunathan, Jacob Steinhardt
TMLR 2023 Benchmarks and Algorithms for Offline Preference-Based Reward Learning Daniel Shin, Anca Dragan, Daniel S. Brown
NeurIPS 2023 Bridging RL Theory and Practice with the Effective Horizon Cassidy Laidlaw, Stuart Russell, Anca Dragan
ICMLW 2023 Bridging RL Theory and Practice with the Effective Horizon Cassidy Laidlaw, Stuart Russell, Anca Dragan
ICLR 2023 Causal Confusion and Reward Misidentification in Preference-Based Reward Learning Jeremy Tien, Jerry Zhi-Yang He, Zackory Erickson, Anca Dragan, Daniel S. Brown
NeurIPSW 2023 Confronting Reward Model Overoptimization with Constrained RLHF Ted Moskovitz, Aaditya Singh, Dj Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca Dragan, Stephen McAleer
ICML 2023 Contextual Reliability: When Different Features Matter in Different Contexts Gaurav Rohit Ghosal, Amrith Setlur, Daniel S. Brown, Anca Dragan, Aditi Raghunathan
CoRL 2023 Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control Vivek Myers, Andre Wang He, Kuan Fang, Homer Rich Walke, Philippe Hansen-Estruch, Ching-An Cheng, Mihai Jalobeanu, Andrey Kolobov, Anca Dragan, Sergey Levine
ICMLW 2023 Learning Optimal Advantage from Preferences and Mistaking It for Reward W. Bradley Knox, Stephane Hatgis-Kessell, Sigurdur Orn Adalgeirsson, Serena Booth, Anca Dragan, Peter Stone, Scott Niekum
NeurIPS 2023 Learning to Influence Human Behavior with Offline Reinforcement Learning Joey Hong, Sergey Levine, Anca Dragan
ICLR 2023 On the Sensitivity of Reward Inference to Misspecified Human Models Joey Hong, Kush Bhatia, Anca Dragan
TMLR 2023 Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomek Korbak, David Lindner, Pedro Freire, Tony Tong Wang, Samuel Marks, Charbel-Raphael Segerie, Micah Carroll, Andi Peng, Phillip J.K. Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
ICMLW 2023 Preventing Reward Hacking with Occupancy Measure Regularization Cassidy Laidlaw, Shivam Singhal, Anca Dragan
CoRL 2023 Quantifying Assistive Robustness via the Natural-Adversarial Frontier Jerry Zhi-Yang He, Daniel S. Brown, Zackory Erickson, Anca Dragan
ICMLW 2023 Video-Guided Skill Discovery Manan Tomar, Dibya Ghosh, Vivek Myers, Anca Dragan, Matthew E. Taylor, Philip Bachman, Sergey Levine
NeurIPSW 2023 Zero-Shot Goal-Directed Dialogue via RL on Imagined Conversations Joey Hong, Sergey Levine, Anca Dragan
ICMLW 2022 A Study of Causal Confusion in Preference-Based Reward Learning Jeremy Tien, Jerry Zhi-Yang He, Zackory Erickson, Anca Dragan, Daniel S. Brown
NeurIPSW 2022 Aligning Robot Representations with Humans Andreea Bobu, Andi Peng, Pulkit Agrawal, Julie Shah, Anca Dragan
ICML 2022 Estimating and Penalizing Induced Preference Shifts in Recommender Systems Micah D Carroll, Anca Dragan, Stuart Russell, Dylan Hadfield-Menell
NeurIPS 2022 First Contact: Unsupervised Human-Machine Co-Adaptation via Mutual Information Maximization Siddharth Reddy, Sergey Levine, Anca Dragan
CoRL 2022 Learning Representations That Enable Generalization in Assistive Tasks Jerry Zhi-Yang He, Zackory Erickson, Daniel S. Brown, Aditi Raghunathan, Anca Dragan
ICLR 2022 The Boltzmann Policy Distribution: Accounting for Systematic Suboptimality in Human Models Cassidy Laidlaw, Anca Dragan
ICLRW 2022 Towards Flexible Inference in Sequential Decision Problems via Bidirectional Transformers Micah Carroll, Jessy Lin, Orr Paradise, Raluca Georgescu, Mingfei Sun, David Bignell, Stephanie Milani, Katja Hofmann, Matthew Hausknecht, Anca Dragan, Sam Devlin
NeurIPS 2022 Uni[MASK]: Unified Inference in Sequential Decision Problems Micah Carroll, Orr Paradise, Jessy Lin, Raluca Georgescu, Mingfei Sun, David Bignell, Stephanie Milani, Katja Hofmann, Matthew Hausknecht, Anca Dragan, Sam Devlin
ICLR 2021 Learning What to Do by Simulating the Past David Lindner, Rohin Shah, Pieter Abbeel, Anca Dragan
ICML 2021 Policy Gradient Bayesian Robust Optimization for Imitation Learning Zaynah Javed, Daniel S. Brown, Satvik Sharma, Jerry Zhu, Ashwin Balakrishna, Marek Petrik, Anca Dragan, Ken Goldberg
NeurIPS 2021 Pragmatic Image Compression for Human-in-the-Loop Decision-Making Sid Reddy, Anca Dragan, Sergey Levine
ICML 2021 Value Alignment Verification Daniel S Brown, Jordan Schneider, Anca Dragan, Scott Niekum
ICLR 2021 X2T: Training an X-to-Text Typing Interface with Online Learning from User Feedback Jensen Gao, Siddharth Reddy, Glen Berseth, Nicholas Hardy, Nikhilesh Natraj, Karunesh Ganguly, Anca Dragan, Sergey Levine
CoRL 2020 Assisted Perception: Optimizing Observations to Communicate State Siddharth Reddy, Sergey Levine, Anca Dragan
NeurIPS 2020 AvE: Assistance via Empowerment Yuqing Du, Stas Tiomkin, Emre Kiciman, Daniel Polani, Pieter Abbeel, Anca Dragan
ICML 2020 Learning Human Objectives by Evaluating Hypothetical Behavior Siddharth Reddy, Anca Dragan, Sergey Levine, Shane Legg, Jan Leike
NeurIPS 2020 Preference Learning Along Multiple Criteria: A Game-Theoretic Perspective Kush Bhatia, Ashwin Pananjady, Peter L. Bartlett, Anca Dragan, Martin J. Wainwright
NeurIPS 2020 Reward-Rational (Implicit) Choice: A Unifying Formalism for Reward Learning Hong Jun Jeon, Smitha Milli, Anca Dragan
ICML 2019 Learning a Prior over Intent via Meta-Inverse Reinforcement Learning Kelvin Xu, Ellis Ratner, Anca Dragan, Sergey Levine, Chelsea Finn
ICML 2019 On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference Rohin Shah, Noah Gundotra, Pieter Abbeel, Anca Dragan
NeurIPS 2019 On the Utility of Learning About Humans for Human-AI Coordination Micah Carroll, Rohin Shah, Mark K Ho, Tom Griffiths, Sanjit Seshia, Pieter Abbeel, Anca Dragan
ICLR 2019 Preferences Implicit in the State of the World Rohin Shah, Dmitrii Krasheninnikov, Jordan Alexander, Pieter Abbeel, Anca Dragan
ICML 2018 An Efficient, Generalized Bellman Update for Cooperative Inverse Reinforcement Learning Dhruv Malik, Malayandi Palaniappan, Jaime Fisac, Dylan Hadfield-Menell, Stuart Russell, Anca Dragan
NeurIPS 2018 Where Do You Think You're Going?: Inferring Beliefs About Dynamics from Behavior Sid Reddy, Anca Dragan, Sergey Levine
NeurIPS 2017 Inverse Reward Design Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, Anca Dragan
NeurIPS 2016 Cooperative Inverse Reinforcement Learning Dylan Hadfield-Menell, Stuart Russell, Pieter Abbeel, Anca Dragan