Ewart, Aidan

7 publications

TMLR 2025 Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs Abhay Sheshadri, Aidan Ewart, Phillip Huang Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper
ICML 2025 Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization Phillip Huang Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, Gintare Karolina Dziugaite
TMLR 2025 Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell
NeurIPSW 2024 Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs Aidan Ewart, Abhay Sheshadri, Phillip Huang Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper
ICMLW 2024 Robust Knowledge Unlearning via Mechanistic Localizations Phillip Huang Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, Gintare Karolina Dziugaite
ICMLW 2024 Robust Unlearning via Mechanistic Localizations Phillip Huang Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, Gintare Karolina Dziugaite
ICLR 2024 Sparse Autoencoders Find Highly Interpretable Features in Language Models Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, Lee Sharkey