H-Space Sparse Autoencoders
Abstract
In this work, we introduce a computationally efficient method that allows Sparse Autoencoders (SAEs) to automatically detect interpretable directions within the latent space of diffusion models. We show that intervening on a single neuron in the SAE representation space at a single diffusion time step leads to meaningful feature changes in the model output. This marks a step toward applying techniques from mechanistic interpretability to controlling the outputs of diffusion models, thereby helping to ensure the safety of their generations. In doing so, we establish a connection between safety and interpretability methods in language modelling and in image generative modelling.
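The abstract describes two steps: fitting an SAE to the bottleneck ("h-space") activations of a diffusion U-Net, and steering generation by editing a single SAE neuron at one diffusion time step. The sketch below is a minimal illustration of that idea, not the authors' implementation; the layer sizes, the L1 sparsity penalty, and the `intervene_single_neuron` helper are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's released code) of a sparse autoencoder
# trained on diffusion U-Net bottleneck ("h-space") activations, plus a
# single-neuron intervention on its latent code.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, h_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = nn.Linear(h_dim, latent_dim)
        self.decoder = nn.Linear(latent_dim, h_dim)

    def forward(self, h: torch.Tensor):
        z = torch.relu(self.encoder(h))   # non-negative code, encouraged to be sparse
        h_hat = self.decoder(z)
        return h_hat, z


def sae_loss(h, h_hat, z, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty on the latent code (sparsity).
    return ((h - h_hat) ** 2).mean() + l1_coeff * z.abs().mean()


@torch.no_grad()
def intervene_single_neuron(sae: SparseAutoencoder, h: torch.Tensor,
                            neuron_idx: int, strength: float = 5.0) -> torch.Tensor:
    """Boost one SAE latent of an h-space activation and decode back."""
    z = torch.relu(sae.encoder(h))
    z[..., neuron_idx] += strength        # edit a single (putatively interpretable) direction
    return sae.decoder(z)
```

In practice, the edited activation returned by such a helper would be written back into the U-Net bottleneck at the chosen time step (for example via a forward hook) before sampling continues.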
Cite
Text
Ijishakin et al. "H-Space Sparse Autoencoders." NeurIPS 2024 Workshops: SafeGenAi, 2024.
Markdown
[Ijishakin et al. "H-Space Sparse Autoencoders." NeurIPS 2024 Workshops: SafeGenAi, 2024.](https://mlanthology.org/neuripsw/2024/ijishakin2024neuripsw-hspace/)
BibTeX
@inproceedings{ijishakin2024neuripsw-hspace,
  title     = {{H-Space Sparse Autoencoders}},
  author    = {Ijishakin, Ayodeji and Ang, Ming Liang and Baljer, Levente and Tan, Daniel Chee Hian and Fry, Hugo Laurence and Abdulaal, Ahmed and Lynch, Aengus and Cole, James H.},
  booktitle = {NeurIPS 2024 Workshops: SafeGenAi},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/ijishakin2024neuripsw-hspace/}
}