Conmy, Arthur

17 publications

ICLRW 2025. Chain-of-Thought Reasoning in the Wild Is Not Always Faithful. Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy.
ICLRW 2025. LLM Neurosurgeon: Targeted Knowledge Removal in LLMs Using Sparse Autoencoders. Kunal Patil, Dylan Zhou, Yifan Sun, Karthik Lakshmanan, Senthooran Rajamanoharan, Arthur Conmy.
TMLR 2025. Open Problems in Mechanistic Interpretability. Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeffrey Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Isaac Bloom, Stella Biderman, Adrià Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Mary Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, William Saunders, Eric J Michaud, Stephen Casper, Max Tegmark, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Thomas McGrath.
ICML 2025. SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability. Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Isaac Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum Stuart Mcdougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda.
ICML 2025. Scaling Sparse Feature Circuits for Studying In-Context Learning. Dmitrii Kharlapenko, Stepan Shabalin, Arthur Conmy, Neel Nanda.
ICLRW 2025. Scaling Sparse Feature Circuits for Studying In-Context Learning. Dmitrii Kharlapenko, Stepan Shabalin, Arthur Conmy, Neel Nanda.
ICLRW 2025. Understanding Reasoning in Thinking Language Models via Steering Vectors. Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda.
NeurIPSW 2024. Applying Sparse Autoencoders to Unlearn Knowledge in Language Models. Eoin Farrell, Yeu-Tong Lau, Arthur Conmy.
NeurIPS 2024. Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders. Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda.
ICMLW 2024. Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders. Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda.
ICMLW 2024. Interpreting Attention Layer Outputs with Sparse Autoencoders. Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda.
ICML 2024. Stealing Part of a Production Language Model. Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Eric Wallace, David Rolnick, Florian Tramèr.
ICLR 2024. Successor Heads: Recurring, Interpretable Attention Heads in the Wild. Rhys Gould, Euan Ong, George Ogden, Arthur Conmy.
ICLR 2023. Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small. Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt.
NeurIPSW 2023. Successor Heads: Recurring, Interpretable Attention Heads in the Wild. Rhys Gould, Euan Ong, George Ogden, Arthur Conmy.
NeurIPS 2023. Towards Automated Circuit Discovery for Mechanistic Interpretability. Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso.
NeurIPSW 2022. Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small. Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt.