Watkins, Olivia

13 publications

ICLR 2026 Estimating Worst-Case Frontier Risks of Open-Weight LLMs Eric Wallace, Olivia Watkins, Miles Wang, Kai Chen, Chris Koch
ICLR 2026 GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simon Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Samuel Miserendino, Gildas Chabot, David Li, Patrick Chao, Michael Sharman, Alexandra Barr, Amelia Glaese, Jerry Tworek
ICLR 2026 Persona Features Control Emergent Misalignment Miles Wang, Tom Dupre la Tour, Olivia Watkins, Aleksandar Makelov, Ryan Andrew Chi, Samuel Miserendino, Jeffrey George Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, Daniel P Mossing
NeurIPS 2024 A StrongREJECT for Empty Jailbreaks Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, Sam Toyer
ICLRW 2024 A StrongREJECT for Empty Jailbreaks Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, Sam Toyer
ICML 2024 Learning to Model the World with Language Jessy Lin, Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, Anca Dragan
ICLR 2024 Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, Stuart Russell
NeurIPS 2023 DPOK: Reinforcement Learning for Fine-Tuning Text-to-Image Diffusion Models Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, Kimin Lee
ICML 2023 Guiding Pretraining in Reinforcement Learning with Large Language Models Yuqing Du, Olivia Watkins, Zihan Wang, Cédric Colas, Trevor Darrell, Pieter Abbeel, Abhishek Gupta, Jacob Andreas
NeurIPSW 2023 Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game Sam Toyer, Olivia Watkins, Ethan Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, Stuart Russell
NeurIPSW 2023 Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, Stuart Russell
NeurIPS 2021 Teachable Reinforcement Learning via Advice Distillation Olivia Watkins, Abhishek Gupta, Trevor Darrell, Pieter Abbeel, Jacob Andreas
ICMLW 2018 Program Language Translation Using a Grammar-Driven Tree-to-Tree Model Mehdi Drissi, Olivia Watkins, Aditya Khant, Vivaswat Ojha, Pedro Sandoval, Rakia Segev, Eric Weiner, Robert Keller