Rager, Can

7 publications

ICLR 2025 NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals Jaden Fried Fiotto-Kaufman, Alexander Russell Loftus, Eric Todd, Jannik Brinkmann, Koyena Pal, Dmitrii Troitskii, Michael Ripa, Adam Belfki, Can Rager, Caden Juang, Aaron Mueller, Samuel Marks, Arnab Sen Sharma, Francesca Lucchetti, Nikhil Prakash, Carla E. Brodley, Arjun Guha, Jonathan Bell, Byron C Wallace, David Bau
ICML 2025 SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Isaac Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum Stuart Mcdougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda
ICLR 2025 Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, Aaron Mueller
ICMLW 2024 An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L Jett Janiak, Can Rager, James Dao, Yeu-Tong Lau
NeurIPS 2024 Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Smith, Claudio Mayrink Verdun, David Bau, Samuel Marks
ICMLW 2024 Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Riggs Smith, Claudio Mayrink Verdun, David Bau, Samuel Marks
NeurIPSW 2023 Linearly Structured World Representations in Maze-Solving Transformers Michael Ivanitskiy, Alexander F Spies, Tilman Räuker, Guillaume Corlouer, Christopher Mathwin, Lucia Quirke, Can Rager, Rusheb Shah, Dan Valentine, Cecilia Diniz Behn, Katsumi Inoue, Samy Wu Fung