Nanda, Neel

41 publications

ICML 2025 Are Sparse Autoencoders Useful? A Case Study in Sparse Probing Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda
ICLRW 2025 Chain-of-Thought Reasoning in the Wild Is Not Always Faithful Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy
ICLR 2025 Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models Javier Ferrando, Oscar Balcells Obeso, Senthooran Rajamanoharan, Neel Nanda
ICML 2025 Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models Patrick Leask, Neel Nanda, Noura Al Moubayed
ICML 2025 Learning Multi-Level Features with Matryoshka Sparse Autoencoders Bart Bussmann, Noa Nabeshima, Adam Karvonen, Neel Nanda
TMLR 2025 Open Problems in Mechanistic Interpretability Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeffrey Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Isaac Bloom, Stella Biderman, Adrià Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Mary Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, William Saunders, Eric J Michaud, Stephen Casper, Max Tegmark, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Thomas McGrath
NeurIPS 2025 Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning Julian Minder, Clément Dumas, Caden Juang, Bilal Chughtai, Neel Nanda
ICLRW 2025 Robustly Identifying Concepts Introduced During Chat Fine-Tuning Using Crosscoders Julian Minder, Clément Dumas, Bilal Chughtai, Neel Nanda
ICML 2025 SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Isaac Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum Stuart Mcdougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda
ICML 2025 Scaling Sparse Feature Circuits for Studying In-Context Learning Dmitrii Kharlapenko, Stepan Shabalin, Arthur Conmy, Neel Nanda
ICLRW 2025 Scaling Sparse Feature Circuits for Studying In-Context Learning Dmitrii Kharlapenko, Stepan Shabalin, Arthur Conmy, Neel Nanda
ICLR 2025 Sparse Autoencoders Do Not Find Canonical Units of Analysis Patrick Leask, Bart Bussmann, Michael T Pearce, Joseph Isaac Bloom, Curt Tigges, Noura Al Moubayed, Lee Sharkey, Neel Nanda
ICLRW 2025 Steering Fine-Tuning Generalization with Targeted Concept Ablation Helena Casademunt, Caden Juang, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda
ICLRW 2025 Steering Fine-Tuning Generalization with Targeted Concept Ablation Helena Casademunt, Caden Juang, Senthooran Rajamanoharan, Neel Nanda
NeurIPS 2025 Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval Constantin Venhoff, Ashkan Khakzar, Sonia Joseph, Philip Torr, Neel Nanda
ICLR 2025 Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control Aleksandar Makelov, Georg Lange, Neel Nanda
ICLRW 2025 Understanding Reasoning in Thinking Language Models via Steering Vectors Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda
NeurIPSW 2024 BatchTopK Sparse Autoencoders Bart Bussmann, Patrick Leask, Neel Nanda
NeurIPS 2024 Confidence Regulation Neurons in Language Models Alessandro Stolfo, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda
ICMLW 2024 Confidence Regulation Neurons in Language Models Alessandro Stolfo, Ben Peng Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda
ICLRW 2024 Explorations of Self-Repair in Language Models Cody Rushing, Neel Nanda
ICML 2024 Explorations of Self-Repair in Language Models Cody Rushing, Neel Nanda
NeurIPS 2024 Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda
ICMLW 2024 Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda
ICMLW 2024 Interpreting Attention Layer Outputs with Sparse Autoencoders Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda
ICLR 2024 Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching Aleksandar Makelov, Georg Lange, Atticus Geiger, Neel Nanda
ICMLW 2024 Language Models Linearly Represent Sentiment Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, Neel Nanda
NeurIPS 2024 Refusal in Language Models Is Mediated by a Single Direction Andy Arditi, Oscar Balcells Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda
ICMLW 2024 Refusal in Language Models Is Mediated by a Single Direction Andy Arditi, Oscar Balcells Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda
NeurIPSW 2024 Stitching Sparse Autoencoders of Different Sizes Patrick Leask, Bart Bussmann, Joseph Isaac Bloom, Curt Tigges, Noura Al Moubayed, Neel Nanda
ICLR 2024 Towards Best Practices of Activation Patching in Language Models: Metrics and Methods Fred Zhang, Neel Nanda
ICLRW 2024 Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control Aleksandar Makelov, Georg Lange, Neel Nanda
NeurIPS 2024 Transcoders Find Interpretable LLM Feature Circuits Jacob Dunefsky, Philippe Chlenski, Neel Nanda
ICMLW 2024 Transcoders Find Interpretable LLM Feature Circuits Jacob Dunefsky, Philippe Chlenski, Neel Nanda
TMLR 2024 Universal Neurons in GPT2 Language Models Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas
ICML 2023 A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations Bilal Chughtai, Lawrence Chan, Neel Nanda
TMLR 2023 Finding Neurons in a Haystack: Case Studies with Sparse Probing Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas
ICLRW 2023 N2G: A Scalable Approach for Quantifying Interpretable Neuron Representation in LLMs Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Fazl Barez
ICLRW 2023 Neural Networks Learn Representation Theory: Reverse Engineering How Networks Perform Group Operations Bilal Chughtai, Lawrence Chan, Neel Nanda
ICLR 2023 Progress Measures for Grokking via Mechanistic Interpretability Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt
JMLR 2022 Fully General Online Imitation Learning Michael K. Cohen, Marcus Hutter, Neel Nanda