Antipodal Pairing and Mechanistic Signals in Dense SAE Latents
Abstract
Sparse autoencoders (SAEs) are designed to extract interpretable features from language models, yet they often yield frequently activating latents that remain difficult to interpret. It is still an open question whether these *dense* latents are an undesired training artifact or whether they represent fundamentally dense signals in the model's activations. Our study provides evidence for the latter explanation: dense latents capture fundamental signals which (1) align with principal directions of variance in the model's residual stream and (2) reconstruct a subspace of the unembedding matrix that was linked by previous work to internal model computation. Furthermore, we show that these latents typically emerge as nearly antipodal pairs that collaboratively reconstruct specific residual stream directions. These findings reveal a mechanistic role for dense latents in language model behavior and suggest avenues for refining SAE training strategies.
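The nearly antipodal pairs described above can be detected directly from an SAE's decoder weights: two latents form such a pair when their decoder directions have cosine similarity close to -1. A minimal sketch, assuming a decoder matrix of shape (n_latents, d_model) with one direction per row (the matrix name, threshold, and helper function are illustrative, not from the paper):

```python
import numpy as np

def find_antipodal_pairs(W_dec, threshold=-0.95):
    """Return latent pairs whose decoder directions are nearly antipodal,
    i.e. whose cosine similarity falls at or below `threshold`."""
    # Normalize each decoder row to unit length.
    unit = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    # Pairwise cosine similarities between latent directions.
    cos = unit @ unit.T
    pairs = []
    n = W_dec.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if cos[i, j] <= threshold:
                pairs.append((i, j, float(cos[i, j])))
    return pairs

# Toy example: latents 0 and 1 point in nearly opposite directions.
W = np.array([[1.0, 0.0],
              [-1.0, 1e-3],
              [0.0, 1.0]])
print(find_antipodal_pairs(W))  # latents 0 and 1 form an antipodal pair
```

In practice one would run this on the decoder of a trained SAE and inspect the highest-frequency latents among the flagged pairs.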
Cite

Stolfo et al. "Antipodal Pairing and Mechanistic Signals in Dense SAE Latents." ICLR 2025 Workshops: BuildingTrust, 2025. https://mlanthology.org/iclrw/2025/stolfo2025iclrw-antipodal/

BibTeX:
@inproceedings{stolfo2025iclrw-antipodal,
title = {{Antipodal Pairing and Mechanistic Signals in Dense SAE Latents}},
author = {Stolfo, Alessandro and Wu, Ben Peng and Sachan, Mrinmaya},
booktitle = {ICLR 2025 Workshops: BuildingTrust},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/stolfo2025iclrw-antipodal/}
}