Makelov, Aleksandar

7 publications

ICLR 2026 Persona Features Control Emergent Misalignment Miles Wang, Tom Dupre la Tour, Olivia Watkins, Aleksandar Makelov, Ryan Andrew Chi, Samuel Miserendino, Jeffrey George Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, Daniel P Mossing
ICLR 2025 Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control Aleksandar Makelov, Georg Lange, Neel Nanda
ICLR 2024 Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching Aleksandar Makelov, Georg Lange, Atticus Geiger, Neel Nanda
ICMLW 2024 Sparse Autoencoders Match Supervised Features for Model Steering on the IOI Task Aleksandar Makelov
ICLRW 2024 Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control Aleksandar Makelov, Georg Lange, Neel Nanda
ICML 2023 Rethinking Backdoor Attacks Alaa Khaddaj, Guillaume Leclerc, Aleksandar Makelov, Kristian Georgiev, Hadi Salman, Andrew Ilyas, Aleksander Madry
ICLR 2018 Towards Deep Learning Models Resistant to Adversarial Attacks Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu