Lynch, Aengus

11 publications

NeurIPS 2025 | Best-of-N Jailbreaking | John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Arushi Somani, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, Mrinank Sharma
ICML 2025 | How Do Large Language Monkeys Get Their Power (Laws)? | Rylan Schaeffer, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, Sanmi Koyejo
TMLR 2025 | Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs | Abhay Sheshadri, Aidan Ewart, Phillip Huang Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper
ICLRW 2025 | Spawrious: A Benchmark for Fine Control of Spurious Correlation Biases | Aengus Lynch, Gbetondji Jean-Sebastien Dovonon, Jean Kaddour, Ricardo Silva
NeurIPS 2024 | Analysing the Generalisation and Reliability of Steering Vectors | Daniel Tan, David Chanin, Aengus Lynch, Brooks Paige, Dimitrios Kanoulas, Adrià Garriga-Alonso, Robert Kirk
ICMLW 2024 | Analyzing the Generalization and Reliability of Steering Vectors | Daniel Chee Hian Tan, David Chanin, Aengus Lynch, Adrià Garriga-Alonso, Dimitrios Kanoulas, Brooks Paige, Robert Kirk
NeurIPSW 2024 | H-Space Sparse Autoencoders | Ayodeji Ijishakin, Ming Liang Ang, Levente Baljer, Daniel Chee Hian Tan, Hugo Laurence Fry, Ahmed Abdulaal, Aengus Lynch, James H. Cole
NeurIPSW 2024 | Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs | Aidan Ewart, Abhay Sheshadri, Phillip Huang Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper
NeurIPS 2023 | Towards Automated Circuit Discovery for Mechanistic Interpretability | Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso
NeurIPSW 2022 | Evaluating the Impact of Geometric and Statistical Skews on Out-of-Distribution Generalization Performance | Aengus Lynch, Jean Kaddour, Ricardo Silva
NeurIPSW 2022 | Evaluating the Impact of Geometric and Statistical Skews on Out-of-Distribution Generalization Performance | Aengus Lynch, Jean Kaddour, Ricardo Silva