Nanda, Neel

41 publications

ICML 2025 Are Sparse Autoencoders Useful? A Case Study in Sparse Probing Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda
ICLRW 2025 Chain-of-Thought Reasoning in the Wild Is Not Always Faithful Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy
ICLR 2025 Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models Javier Ferrando, Oscar Balcells Obeso, Senthooran Rajamanoharan, Neel Nanda
ICML 2025 Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models Patrick Leask, Neel Nanda, Noura Al Moubayed
ICML 2025 Learning Multi-Level Features with Matryoshka Sparse Autoencoders Bart Bussmann, Noa Nabeshima, Adam Karvonen, Neel Nanda
TMLR 2025 Open Problems in Mechanistic Interpretability Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeffrey Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Isaac Bloom, Stella Biderman, Adrià Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Mary Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, William Saunders, Eric J Michaud, Stephen Casper, Max Tegmark, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Thomas McGrath
NeurIPS 2025 Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning Julian Minder, Clément Dumas, Caden Juang, Bilal Chughtai, Neel Nanda
ICLRW 2025 Robustly Identifying Concepts Introduced During Chat Fine-Tuning Using Crosscoders Julian Minder, Clément Dumas, Bilal Chughtai, Neel Nanda
ICML 2025 SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Isaac Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum Stuart Mcdougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda
ICML 2025 Scaling Sparse Feature Circuits for Studying In-Context Learning Dmitrii Kharlapenko, Stepan Shabalin, Arthur Conmy, Neel Nanda
ICLRW 2025 Scaling Sparse Feature Circuits for Studying In-Context Learning Dmitrii Kharlapenko, Stepan Shabalin, Arthur Conmy, Neel Nanda
ICLR 2025 Sparse Autoencoders Do Not Find Canonical Units of Analysis Patrick Leask, Bart Bussmann, Michael T Pearce, Joseph Isaac Bloom, Curt Tigges, Noura Al Moubayed, Lee Sharkey, Neel Nanda
ICLRW 2025 Steering Fine-Tuning Generalization with Targeted Concept Ablation Helena Casademunt, Caden Juang, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda
ICLRW 2025 Steering Fine-Tuning Generalization with Targeted Concept Ablation Helena Casademunt, Caden Juang, Senthooran Rajamanoharan, Neel Nanda
NeurIPS 2025 Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval Constantin Venhoff, Ashkan Khakzar, Sonia Joseph, Philip Torr, Neel Nanda
ICLR 2025 Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control Aleksandar Makelov, Georg Lange, Neel Nanda
ICLRW 2025 Understanding Reasoning in Thinking Language Models via Steering Vectors Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda
NeurIPSW 2024 BatchTopK Sparse Autoencoders Bart Bussmann, Patrick Leask, Neel Nanda
NeurIPS 2024 Confidence Regulation Neurons in Language Models Alessandro Stolfo, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda
ICMLW 2024 Confidence Regulation Neurons in Language Models Alessandro Stolfo, Ben Peng Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda
ICLRW 2024 Explorations of Self-Repair in Language Models Cody Rushing, Neel Nanda
ICML 2024 Explorations of Self-Repair in Language Models Cody Rushing, Neel Nanda
NeurIPS 2024 Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda
ICMLW 2024 Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda
ICMLW 2024 Interpreting Attention Layer Outputs with Sparse Autoencoders Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda
ICLR 2024 Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching Aleksandar Makelov, Georg Lange, Atticus Geiger, Neel Nanda
ICMLW 2024 Language Models Linearly Represent Sentiment Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, Neel Nanda
NeurIPS 2024 Refusal in Language Models Is Mediated by a Single Direction Andy Arditi, Oscar Balcells Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda
ICMLW 2024 Refusal in Language Models Is Mediated by a Single Direction Andy Arditi, Oscar Balcells Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda
NeurIPSW 2024 Stitching Sparse Autoencoders of Different Sizes Patrick Leask, Bart Bussmann, Joseph Isaac Bloom, Curt Tigges, Noura Al Moubayed, Neel Nanda
ICLR 2024 Towards Best Practices of Activation Patching in Language Models: Metrics and Methods Fred Zhang, Neel Nanda
ICLRW 2024 Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control Aleksandar Makelov, Georg Lange, Neel Nanda
NeurIPS 2024 Transcoders Find Interpretable LLM Feature Circuits Jacob Dunefsky, Philippe Chlenski, Neel Nanda
ICMLW 2024 Transcoders Find Interpretable LLM Feature Circuits Jacob Dunefsky, Philippe Chlenski, Neel Nanda
TMLR 2024 Universal Neurons in GPT2 Language Models Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas
ICML 2023 A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations Bilal Chughtai, Lawrence Chan, Neel Nanda
TMLR 2023 Finding Neurons in a Haystack: Case Studies with Sparse Probing Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas
ICLRW 2023 N2G: A Scalable Approach for Quantifying Interpretable Neuron Representation in LLMs Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Fazl Barez
ICLRW 2023 Neural Networks Learn Representation Theory: Reverse Engineering How Networks Perform Group Operations Bilal Chughtai, Lawrence Chan, Neel Nanda
ICLR 2023 Progress Measures for Grokking via Mechanistic Interpretability Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt
JMLR 2022 Fully General Online Imitation Learning Michael K. Cohen, Marcus Hutter, Neel Nanda