Conmy, Arthur

17 publications

ICLRW 2025. Chain-of-Thought Reasoning in the Wild Is Not Always Faithful. Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy.
ICLRW 2025. LLM Neurosurgeon: Targeted Knowledge Removal in LLMs Using Sparse Autoencoders. Kunal Patil, Dylan Zhou, Yifan Sun, Karthik Lakshmanan, Senthooran Rajamanoharan, Arthur Conmy.
TMLR 2025. Open Problems in Mechanistic Interpretability. Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeffrey Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Isaac Bloom, Stella Biderman, Adrià Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Mary Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, William Saunders, Eric J Michaud, Stephen Casper, Max Tegmark, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Thomas McGrath.
ICML 2025. SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability. Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Isaac Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum Stuart Mcdougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda.
ICML 2025. Scaling Sparse Feature Circuits for Studying In-Context Learning. Dmitrii Kharlapenko, Stepan Shabalin, Arthur Conmy, Neel Nanda.
ICLRW 2025. Scaling Sparse Feature Circuits for Studying In-Context Learning. Dmitrii Kharlapenko, Stepan Shabalin, Arthur Conmy, Neel Nanda.
ICLRW 2025. Understanding Reasoning in Thinking Language Models via Steering Vectors. Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda.
NeurIPSW 2024. Applying Sparse Autoencoders to Unlearn Knowledge in Language Models. Eoin Farrell, Yeu-Tong Lau, Arthur Conmy.
NeurIPS 2024. Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders. Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda.
ICMLW 2024. Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders. Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda.
ICMLW 2024. Interpreting Attention Layer Outputs with Sparse Autoencoders. Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda.
ICML 2024. Stealing Part of a Production Language Model. Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Eric Wallace, David Rolnick, Florian Tramèr.
ICLR 2024. Successor Heads: Recurring, Interpretable Attention Heads in the Wild. Rhys Gould, Euan Ong, George Ogden, Arthur Conmy.
ICLR 2023. Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small. Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt.
NeurIPSW 2023. Successor Heads: Recurring, Interpretable Attention Heads in the Wild. Rhys Gould, Euan Ong, George Ogden, Arthur Conmy.
NeurIPS 2023. Towards Automated Circuit Discovery for Mechanistic Interpretability. Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso.
NeurIPSW 2022. Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small. Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt.