Sheshadri, Abhay

8 publications

TMLR 2025. Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs. Abhay Sheshadri, Aidan Ewart, Phillip Huang Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper
ICML 2025. Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization. Phillip Huang Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, Gintare Karolina Dziugaite
NeurIPS 2025. Why Do Some Language Models Fake Alignment While Others Don't? Abhay Sheshadri, John Hughes, Julian Michael, Alex Troy Mallen, Arun Jose, Fabien Roger
ICLRW 2024. Backward Chaining Circuits in a Transformer Trained on a Symbolic Reasoning Task. Jannik Brinkmann, Abhay Sheshadri, Victor Levoso, Paul Swoboda, Christian Bartelt
NeurIPSW 2024. Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs. Aidan Ewart, Abhay Sheshadri, Phillip Huang Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper
ICMLW 2024. Robust Knowledge Unlearning via Mechanistic Localizations. Phillip Huang Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, Gintare Karolina Dziugaite
ICMLW 2024. Robust Unlearning via Mechanistic Localizations. Phillip Huang Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, Gintare Karolina Dziugaite
NeurIPSW 2023. Eliciting Language Model Behaviors Using Reverse Language Models. Jacob Pfau, Alex Infanger, Abhay Sheshadri, Ayush Panda, Julian Michael, Curtis Huebner