Belrose, Nora

8 publications

ICML 2025 Automatically Interpreting Millions of Features in Large Language Models Gonçalo Santos Paulo, Alex Troy Mallen, Caden Juang, Nora Belrose

AAAI 2025 Do Transformer Interpretability Methods Transfer to RNNs? Gonçalo Paulo, Thomas Marshall, Nora Belrose

ICLRW 2025 Mechanistic Anomaly Detection for "Quirky" Language Models David O. Johnston, Arkajyoti Chakraborty, Nora Belrose

ICML 2024 Neural Networks Learn Statistics of Increasing Complexity Nora Belrose, Quintin Pope, Lucia Quirke, Alex Troy Mallen, Xiaoli Fern

ICML 2023 Adversarial Policies Beat Superhuman Go AIs Tony Tong Wang, Adam Gleave, Tom Tseng, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell

NeurIPS 2023 LEACE: Perfect Linear Concept Erasure in Closed Form Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, Stella Biderman

NeurIPSW 2022 Adversarial Policies Beat Professional-Level Go AIs Tony Tong Wang, Adam Gleave, Nora Belrose, Tom Tseng, Michael D Dennis, Yawen Duan, Viktor Pogrebniak, Joseph Miller, Sergey Levine, Stuart Russell

NeurIPSW 2022 Adversarial Policies Beat Professional-Level Go AIs Tony Tong Wang, Adam Gleave, Nora Belrose, Tom Tseng, Michael D Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell