Mallen, Alex Troy

3 publications

ICML 2025
Automatically Interpreting Millions of Features in Large Language Models
Gonçalo Santos Paulo, Alex Troy Mallen, Caden Juang, Nora Belrose

NeurIPS 2025
Why Do Some Language Models Fake Alignment While Others Don't?
Abhay Sheshadri, John Hughes, Julian Michael, Alex Troy Mallen, Arun Jose, Fabien Roger

ICML 2024
Neural Networks Learn Statistics of Increasing Complexity
Nora Belrose, Quintin Pope, Lucia Quirke, Alex Troy Mallen, Xiaoli Fern