Geiger, Atticus

18 publications

ICML 2025 AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders. Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, Christopher Potts
JMLR 2025 Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability. Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, Thomas Icard
CLeaR 2025 Combining Causal Models for More Accurate Abstractions of Neural Networks. Theodora-Mara Pîslar, Sara Magliacane, Atticus Geiger
ICML 2025 How Do Transformers Learn Variable Binding in Symbolic Programs? Yiwei Wu, Atticus Geiger, Raphaël Millière
ICLR 2025 HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks. Jiuding Sun, Jing Huang, Sidharth Baskaran, Karel D'Oosterlinck, Christopher Potts, Michael Sklar, Atticus Geiger
ICML 2025 MIB: A Mechanistic Interpretability Benchmark. Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fried Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov
TMLR 2025 Open Problems in Mechanistic Interpretability. Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeffrey Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Isaac Bloom, Stella Biderman, Adrià Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Mary Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, William Saunders, Eric J. Michaud, Stephen Casper, Max Tegmark, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Thomas McGrath
CLeaR 2024 Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations. Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, Noah Goodman
NeurIPSW 2024 GPT-2 Small Fine-Tuned on Logical Reasoning Summarizes Information on Punctuation Tokens. Sonakshi Chauhan, Atticus Geiger
ICLR 2024 Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching. Aleksandar Makelov, Georg Lange, Atticus Geiger, Neel Nanda
ICMLW 2024 Language Models Linearly Represent Sentiment. Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, Neel Nanda
NeurIPS 2024 ReFT: Representation Finetuning for Language Models. Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, Christopher Potts
CLeaR 2023 Causal Abstraction with Soft Interventions. Riccardo Massidda, Atticus Geiger, Thomas Icard, Davide Bacciu
ICML 2023 Causal Proxy Models for Concept-Based Model Explanations. Zhengxuan Wu, Karel D'Oosterlinck, Atticus Geiger, Amir Zur, Christopher Potts
NeurIPS 2023 Interpretability at Scale: Identifying Causal Mechanisms in Alpaca. Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, Noah Goodman
NeurIPS 2022 CEBaB: Estimating the Causal Effects of Real-World Concepts on NLP Model Behavior. Eldar D. Abraham, Karel D'Oosterlinck, Amir Feder, Yair Gat, Atticus Geiger, Christopher Potts, Roi Reichart, Zhengxuan Wu
ICML 2022 Inducing Causal Structure for Interpretable Neural Networks. Atticus Geiger, Zhengxuan Wu, Hanson Lu, Josh Rozner, Elisa Kreiss, Thomas Icard, Noah Goodman, Christopher Potts
NeurIPS 2021 Causal Abstractions of Neural Networks. Atticus Geiger, Hanson Lu, Thomas Icard, Christopher Potts