LLM Neurosurgeon: Targeted Knowledge Removal in LLMs Using Sparse Autoencoders
Abstract
Generative AI's widespread use has raised concerns about trust, safety, steerability, and interpretability. Existing solutions, such as prompt engineering, fine-tuning, and reinforcement learning (e.g., RLHF, DPO), are often hard to iterate on, computationally expensive, and heavily dependent on dataset quality. This paper introduces Neurosurgeon, an efficient procedure that uses sparse autoencoders to identify and remove specific topics from a language model's internal representations. This approach offers precise control over model responses while preserving overall behavior. Experiments on the Gemma 2 9B model show Neurosurgeon's ability to reduce bias in targeted areas without altering the model's core functionality.
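The removal step the abstract describes, identifying topic-linked SAE features and subtracting their contribution from the model's activations, can be sketched roughly as below. This is a minimal illustration assuming a ReLU sparse autoencoder applied to residual-stream activations; the SAE weights, dimensions, and feature indices are hypothetical placeholders, not the paper's actual configuration.

```python
import torch

# Toy dimensions: hypothetical hidden size and SAE dictionary size.
d_model, d_sae = 64, 512

# A toy sparse autoencoder: the encoder (W_enc, b_enc) maps activations to a
# sparse feature space; decoder rows (W_dec) are the directions each feature
# contributes back to the residual stream.
W_enc = torch.randn(d_model, d_sae) / d_model**0.5
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_sae, d_model) / d_sae**0.5


def ablate_features(h: torch.Tensor, target_feats: list[int]) -> torch.Tensor:
    """Subtract the targeted SAE features' reconstructed contribution from h."""
    acts = torch.relu(h @ W_enc + b_enc)                       # feature activations
    removed = acts[..., target_feats] @ W_dec[target_feats]    # targeted reconstruction
    return h - removed                                         # edited activation


# Usage: suppress hypothetical topic-related features 3 and 17 for a batch of tokens.
h = torch.randn(8, d_model)                                    # fake residual activations
h_edited = ablate_features(h, target_feats=[3, 17])
print(h.shape, h_edited.shape)
```

In practice such an edit would be applied via a forward hook at the layer where the SAE was trained, with the target features chosen by inspecting which dictionary entries activate on the topic to be removed.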
Cite
Text
Patil et al. "LLM Neurosurgeon: Targeted Knowledge Removal in LLMs Using Sparse Autoencoders." ICLR 2025 Workshops: BuildingTrust, 2025.
Markdown
[Patil et al. "LLM Neurosurgeon: Targeted Knowledge Removal in LLMs Using Sparse Autoencoders." ICLR 2025 Workshops: BuildingTrust, 2025.](https://mlanthology.org/iclrw/2025/patil2025iclrw-llm/)
BibTeX
@inproceedings{patil2025iclrw-llm,
  title     = {{LLM Neurosurgeon: Targeted Knowledge Removal in LLMs Using Sparse Autoencoders}},
  author    = {Patil, Kunal and Zhou, Dylan and Sun, Yifan and Lakshmanan, Karthik and Rajamanoharan, Senthooran and Conmy, Arthur},
  booktitle = {ICLR 2025 Workshops: BuildingTrust},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/patil2025iclrw-llm/}
}