Makelov, Aleksandar

7 publications

ICLR 2026 Persona Features Control Emergent Misalignment Miles Wang, Tom Dupre la Tour, Olivia Watkins, Aleksandar Makelov, Ryan Andrew Chi, Samuel Miserendino, Jeffrey George Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, Daniel P Mossing

ICLR 2025 Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control Aleksandar Makelov, Georg Lange, Neel Nanda

ICLR 2024 Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching Aleksandar Makelov, Georg Lange, Atticus Geiger, Neel Nanda

ICMLW 2024 Sparse Autoencoders Match Supervised Features for Model Steering on the IOI Task Aleksandar Makelov

ICLRW 2024 Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control Aleksandar Makelov, Georg Lange, Neel Nanda

ICML 2023 Rethinking Backdoor Attacks Alaa Khaddaj, Guillaume Leclerc, Aleksandar Makelov, Kristian Georgiev, Hadi Salman, Andrew Ilyas, Aleksander Madry

ICLR 2018 Towards Deep Learning Models Resistant to Adversarial Attacks Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu