Belrose, Nora

8 publications

ICML 2025 Automatically Interpreting Millions of Features in Large Language Models Gonçalo Santos Paulo, Alex Troy Mallen, Caden Juang, Nora Belrose
AAAI 2025 Do Transformer Interpretability Methods Transfer to RNNs? Gonçalo Paulo, Thomas Marshall, Nora Belrose
ICLRW 2025 Mechanistic Anomaly Detection for "Quirky" Language Models David O. Johnston, Arkajyoti Chakraborty, Nora Belrose
ICML 2024 Neural Networks Learn Statistics of Increasing Complexity Nora Belrose, Quintin Pope, Lucia Quirke, Alex Troy Mallen, Xiaoli Fern
ICML 2023 Adversarial Policies Beat Superhuman Go AIs Tony Tong Wang, Adam Gleave, Tom Tseng, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell
NeurIPS 2023 LEACE: Perfect Linear Concept Erasure in Closed Form Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, Stella Biderman
NeurIPSW 2022 Adversarial Policies Beat Professional-Level Go AIs Tony Tong Wang, Adam Gleave, Nora Belrose, Tom Tseng, Michael D Dennis, Yawen Duan, Viktor Pogrebniak, Joseph Miller, Sergey Levine, Stuart Russell
NeurIPSW 2022 Adversarial Policies Beat Professional-Level Go AIs Tony Tong Wang, Adam Gleave, Nora Belrose, Tom Tseng, Michael D Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell