ML Anthology
Belrose, Nora
8 publications
ICML
2025
Automatically Interpreting Millions of Features in Large Language Models
Gonçalo Santos Paulo, Alex Troy Mallen, Caden Juang, Nora Belrose
AAAI
2025
Do Transformer Interpretability Methods Transfer to RNNs?
Gonçalo Paulo, Thomas Marshall, Nora Belrose
ICLRW
2025
Mechanistic Anomaly Detection for "Quirky" Language Models
David O. Johnston, Arkajyoti Chakraborty, Nora Belrose
ICML
2024
Neural Networks Learn Statistics of Increasing Complexity
Nora Belrose, Quintin Pope, Lucia Quirke, Alex Troy Mallen, Xiaoli Fern
ICML
2023
Adversarial Policies Beat Superhuman Go AIs
Tony Tong Wang, Adam Gleave, Tom Tseng, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell
NeurIPS
2023
LEACE: Perfect Linear Concept Erasure in Closed Form
Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, Stella Biderman
NeurIPSW
2022
Adversarial Policies Beat Professional-Level Go AIs
Tony Tong Wang, Adam Gleave, Nora Belrose, Tom Tseng, Michael D Dennis, Yawen Duan, Viktor Pogrebniak, Joseph Miller, Sergey Levine, Stuart Russell
NeurIPSW
2022
Adversarial Policies Beat Professional-Level Go AIs
Tony Tong Wang, Adam Gleave, Nora Belrose, Tom Tseng, Michael D Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell