Farnik, Lucy

6 publications

ICLR 2026. Automated Interpretability Metrics Do Not Distinguish Trained and Random Transformers. Thomas Heap, Tim Lawson, Lucy Farnik, Laurence Aitchison
ICML 2025. Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations. Lucy Farnik, Tim Lawson, Conor Houghton, Laurence Aitchison
ICLR 2025. Residual Stream Analysis with Multi-Layer SAEs. Tim Lawson, Lucy Farnik, Conor Houghton, Laurence Aitchison
NeurIPSW 2024. Inducing Human-like Biases in Moral Reasoning Language Models. Austin Meek, Artem Karpov, Seong Hah Cho, Raymond Koopmanschap, Lucy Farnik, Bogdan-Ionut Cirstea
NeurIPSW 2024. Residual Stream Analysis with Multi-Layer SAEs. Tim Lawson, Lucy Farnik, Conor Houghton, Laurence Aitchison
ICLR 2024. STARC: A General Framework for Quantifying Differences Between Reward Functions. Joar Max Viktor Skalse, Lucy Farnik, Sumeet Ramesh Motwani, Erik Jenner, Adam Gleave, Alessandro Abate