Guo, Phillip Huang

5 publications

TMLR 2025 Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs Abhay Sheshadri, Aidan Ewart, Phillip Huang Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper
ICML 2025 Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization Phillip Huang Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, Gintare Karolina Dziugaite
NeurIPSW 2024 Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs Aidan Ewart, Abhay Sheshadri, Phillip Huang Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper
ICMLW 2024 Robust Knowledge Unlearning via Mechanistic Localizations Phillip Huang Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, Gintare Karolina Dziugaite
ICMLW 2024 Robust Unlearning via Mechanistic Localizations Phillip Huang Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, Gintare Karolina Dziugaite