Mallen, Alex Troy

3 publications

ICML 2025. Automatically Interpreting Millions of Features in Large Language Models. Gonçalo Santos Paulo, Alex Troy Mallen, Caden Juang, Nora Belrose
NeurIPS 2025. Why Do Some Language Models Fake Alignment While Others Don't? Abhay Sheshadri, John Hughes, Julian Michael, Alex Troy Mallen, Arun Jose, Fabien Roger
ICML 2024. Neural Networks Learn Statistics of Increasing Complexity. Nora Belrose, Quintin Pope, Lucia Quirke, Alex Troy Mallen, Xiaoli Fern