Michaud, Eric J

8 publications

ICLR 2025 Efficient Dictionary Learning with Switch Sparse Autoencoders Anish Mudide, Joshua Engels, Eric J Michaud, Max Tegmark, Christian Schroeder de Witt
ICLR 2025 Not All Language Model Features Are One-Dimensionally Linear Joshua Engels, Eric J Michaud, Isaac Liao, Wes Gurnee, Max Tegmark
NeurIPS 2025 On the Creation of Narrow AI: Hierarchy and Nonlocality of Neural Network Skills Eric J Michaud, Asher Parker-Sartori, Max Tegmark
TMLR 2025 Open Problems in Mechanistic Interpretability Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeffrey Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Isaac Bloom, Stella Biderman, Adrià Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Mary Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, William Saunders, Eric J Michaud, Stephen Casper, Max Tegmark, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Thomas McGrath
ICLR 2025 Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, Aaron Mueller
ICMLW 2024 Survival of the Fittest Representation: A Case Study with Modular Addition Xiaoman Delores Ding, Zifan Carl Guo, Eric J Michaud, Ziming Liu, Max Tegmark
ICLR 2023 Omnigrok: Grokking Beyond Algorithmic Data Ziming Liu, Eric J Michaud, Max Tegmark
TMLR 2023 Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomek Korbak, David Lindner, Pedro Freire, Tony Tong Wang, Samuel Marks, Charbel-Raphael Segerie, Micah Carroll, Andi Peng, Phillip J.K. Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell