Unlocking Hierarchical Concept Discovery in Language Models Through Geometric Regularization

Abstract

We present Exponentially-Weighted Group Sparse Autoencoders (EWG-SAE), an architecture that balances reconstruction quality and feature sparsity while resolving emerging problems such as feature absorption in interpretable language model analysis, in a linguistically principled way, through geometrically decaying group sparsity. Current sparse autoencoders struggle with merged hierarchical features because uniform regularization encourages broad features to be absorbed into more specific ones (e.g., "starts with S" being absorbed into "short"). Our architecture introduces hierarchical sparsity via $K=9$ dimension groups with exponentially decaying regularization ($\lambda_k = \lambda_{\text{base}} \times 0.5^k$), reducing absorption while maintaining state-of-the-art reconstruction fidelity and sparse probing scores alongside competitive $\ell_1$ loss. The geometric structure enables precise feature isolation, with negative inter-group correlations confirming hierarchical organization.
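
A minimal sketch of the exponentially-weighted group sparsity penalty the abstract describes, assuming a PyTorch SAE whose latent dimension is split into $K$ contiguous groups. The function name `ewg_penalty`, the even split via `chunk`, and the batch-mean reduction are our assumptions for illustration, not the paper's exact implementation:

```python
import torch

def ewg_penalty(latents: torch.Tensor, k_groups: int = 9,
                lambda_base: float = 1.0) -> torch.Tensor:
    """L1 penalty with geometrically decaying weights across K groups.

    latents: (batch, d_latent) SAE activations. d_latent is split into
    k_groups contiguous groups (an assumed layout); group k is weighted
    by lambda_base * 0.5**k, so earlier groups face stronger sparsity
    pressure than later ones.
    """
    groups = latents.chunk(k_groups, dim=-1)                # K dimension groups
    weights = [lambda_base * 0.5 ** k for k in range(k_groups)]
    # Per-group L1 norm, averaged over the batch, then summed with decay weights.
    return sum(w * g.abs().sum(dim=-1).mean()
               for w, g in zip(weights, groups))
```

Because the decay is geometric, the penalty on group $k$ halves at each step, giving heavily regularized groups for a few broad features and lightly regularized groups where many specific features can stay active.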

Cite

Text

Li and Ren. "Unlocking Hierarchical Concept Discovery in Language Models Through Geometric Regularization." ICLR 2025 Workshops: BuildingTrust, 2025.

Markdown

[Li and Ren. "Unlocking Hierarchical Concept Discovery in Language Models Through Geometric Regularization." ICLR 2025 Workshops: BuildingTrust, 2025.](https://mlanthology.org/iclrw/2025/li2025iclrw-unlocking/)

BibTeX

@inproceedings{li2025iclrw-unlocking,
  title     = {{Unlocking Hierarchical Concept Discovery in Language Models Through Geometric Regularization}},
  author    = {Li, Ed and Ren, Junyu},
  booktitle = {ICLR 2025 Workshops: BuildingTrust},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/li2025iclrw-unlocking/}
}