Makelov, Aleksandar

6 publications

ICLR 2025 Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control Aleksandar Makelov, Georg Lange, Neel Nanda
ICLR 2024 Is This the Subspace You Are Looking for? an Interpretability Illusion for Subspace Activation Patching Aleksandar Makelov, Georg Lange, Atticus Geiger, Neel Nanda
ICMLW 2024 Sparse Autoencoders Match Supervised Features for Model Steering on the IOI Task Aleksandar Makelov
ICLRW 2024 Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control Aleksandar Makelov, Georg Lange, Neel Nanda
ICML 2023 Rethinking Backdoor Attacks Alaa Khaddaj, Guillaume Leclerc, Aleksandar Makelov, Kristian Georgiev, Hadi Salman, Andrew Ilyas, Aleksander Madry
ICLR 2018 Towards Deep Learning Models Resistant to Adversarial Attacks Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu